This document contains some simple examples for how to use the advanced filtering options in https://plae.nei.nih.gov and a brief introduction to using the Seurat file to do custom tests.
By “gene focused” we mean that:
For those who are most comfortable with stained in situ slices of the retina this visualization may be useful. The major cell types of the retina are laid on in rough anatomical positioning. The cell types are colored by intensity, with the brighter colors meaning that the gene is more highly expressed in that cell type. As an example we show expression of RHO (rhodopsin, a rod marker) and RPE65 (RPE marker).
If you are curious about a gene, then there are several ways you can learn about its retinal cell type expression patterning. We will use ATOH7, a transcription factor that regulates retinal ganglion development as our example gene.
The UMAP view is a two dimensional representation of the individual cells in the scEiaD. Cells that are closer together have more related gene expression profiles (and thus are likely to be similar cell types).
Let’s go the UMAP - Table viewer in plae:
Viz -> UMAP - Tables
Dark gray are cells which have no detectable ATOH7.
Yellow is the highest expression and dark purple is the lowest expression.
What if we want to see which cells have the highest expression? We can use the “Filter Gene Expression” slider to only show cells with expression above a log2(expression) value.
We see that the highest expressing cells are in the “center” before the branching happens.
By default plae shows data for all organisms in the database (human, mouse, macaque).
If we only want to see ATOH7 expression in human data, then that is very easy with the powerful “Scatter Filter Category” and “Gene Filter On” sections.
While the UMAP view is cool looking, it can’t show you everything….what if we want to know what kind of cells are expressing ATOH7?
We can have quantified information on where ATOH7 is expressed by Cell Type (predicted) (this is our machine learned cell type labels) and organism.
We see that about 50% of the mouse and human neurogenic cell type express ATOH7. In raw counts that is 16,449 of 28,811 total mouse neurogenic cells. They have an average expression of 1.86. You can sort or filter the table based on queries. If you wanted to see ATOH7 expression in the RGCs this is trivial to do by typing in the box below.
This shows us that ATOH7 expression seems to be dropping in the maturing/mature RGCs (and is much lower in the macaque) relative to the neurogenic population.
As scEiaD is constructed from publicly available datasets, you can also filter the data to only show results from a specific paper. This may be useful if you using the results from that paper and want to check or confirm a finding.
You can see information about the papers / studies in scEiaD by using the adjacent “Make Meta Table” section as follows:
We see that the Clark et al. 2019 study did Smart-seq2 and 10X across many developmental time points in mouse. They study_accession ID is SRP158081. We can use this ID to look at ATOH7 expression only in this study in both the UMAP view and the table view
As we have a huge number of studies and samples, we can use this (for single cell data) unusual view: a boxplot! We can see how ATOH7 expression changes across celltype and study.
We’ve entered ATOH7 as the gene to plot (1). We are faceting (splitting the plot into separate sub-plots) on Cell Type (predict) (2). We are coloring the data points by study_accession (each study’s average gene expression across the Cell Type (predict) is plotted separately) (3). But we see … nothing. Why?
That is because the Plot Height (400) is not high enough. The text is prioritized over the data, so they are hidden. As it is extremely difficult to “auto” pick the correct height, it was more straightforward to have the user pick it. Usually a value around 700 - 1200 will give a reasonable view.
So yes, now we can see the data.
So each point is an independent study. We see high ATOH7 expression in the neurogenic population, across many studies. But we also see one of studies with ATOH7 expression in cones. The legend shows which colors correspond to which study.
One of the studies is a bright pink…probably SRP200599. We can confirm that by replotting the data with a filter that only shows study_accession SRP200599.
Yep, that is it. We can jump to the UMAP - Table view to pull up the metadata we have extracted about the study.
This is from Buenaventura et al. and is a study that enriched early mouse cones. Some work suggests that loss of ATOH7 inhibits cone specification. These cones may be “early” cones or late neurogenic cells that are developing into cones.
If you want to get a sense about what is present in scEiaD, then there are several tools you can use. For our example, we will be starting with the rods.
These are published labels
Very high expression in the rods (and expressed in other cells too)
We provide the full data as seurat (v3) or anndata (scanpy) objects you can download for boutique analysis. Here we demonstrate how you can use the Seurat object to run a quick custom diff test on a sub-population of neurogenic cells. The Seurat object is 25 GB in size and likely will cause serious memory issues on a generic laptop.
library(Seurat)
## Registered S3 method overwritten by 'spatstat.geom':
## method from
## print.boxx cli
## Attaching SeuratObject
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(SingleCellExperiment)
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following object is masked from 'package:dplyr':
##
## count
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:tidyr':
##
## expand
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## The following object is masked from 'package:purrr':
##
## reduce
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
##
## Attaching package: 'SummarizedExperiment'
## The following object is masked from 'package:SeuratObject':
##
## Assays
## The following object is masked from 'package:Seurat':
##
## Assays
library(scran)
## Loading required package: scuttle
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
##
## select
##
#system('wget -O ~/data/scEiaD/scEiaD_all_seurat_v3.Rdata https://hpc.nih.gov/~mcgaugheyd/scEiaD/2021_11_11/scEiaD_all_seurat_v3.Rdata)
# the -O renames the file and puts it in the ~/data/scEiaD directory
load('/Volumes/McGaughey_S/scEiaD_v3/scEiaD_all_seurat_v3.Rdata')
We see that the neurogenic cells are a small minority of these clusters. But perhaps still we can find some interesting differences between them.
@meta.data %>%
scEiaD_dropletgroup_by(cluster, CellType_predict) %>%
summarise(Count = n()) %>%
mutate(Perc = (Count / sum(Count)) * 100) %>%
filter(Perc > 5) %>%
filter(CellType_predict == 'Neurogenic Cell') %>%
arrange(-Count)
## `summarise()` has grouped output by 'cluster'. You can override using the
## `.groups` argument.
## # A tibble: 3 × 4
## # Groups: cluster [3]
## cluster CellType_predict Count Perc
## <dbl> <chr> <int> <dbl>
## 1 46 Neurogenic Cell 5408 6.54
## 2 47 Neurogenic Cell 3577 7.70
## 3 49 Neurogenic Cell 1313 9.68
The most common is cluster 46 and 47
This will take a few minutes to run. This is why we cannot offer on-demand custom diff testing as while it runs the whole app shuts down for everyone while it computes.
# Create new column with a pasted together cluster ID and a CellType
<- AddMetaData(scEiaD_droplet,
scEiaD_droplet metadata =
paste0(
@meta.data$cluster,
scEiaD_droplet'_',
@meta.data$CellType_predict
scEiaD_droplet
), col.name = 'clusterCT' )
# tell Seurat this new column is the default identity
Idents(scEiaD_droplet) <- scEiaD_droplet@meta.data$clusterCT
<- FindMarkers(scEiaD_droplet, ident.1 = '46_Neurogenic Cell', ident.2 = '47_Neurogenic Cell')
diff_test %>% arrange(p_val) %>% filter(abs(avg_log2FC) > 2) %>% head(5) diff_test
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## ENSG00000164104 0.000000e+00 2.541205 0.780 0.598 0.000000e+00
## ENSG00000110092 6.339242e-218 2.704172 0.610 0.457 9.616630e-214
## ENSG00000142676 1.290179e-111 2.317366 0.203 0.039 1.957201e-107
## ENSG00000131469 1.297518e-111 2.074207 0.189 0.030 1.968335e-107
## ENSG00000137154 6.726508e-111 2.190992 0.189 0.030 1.020411e-106
We see the top hit (by p value and log2FC > 2) is HMGB2 (ENSG00000164104), which is associated with a transition from quiescent to proliferative neurogenesis (https://pubmed.ncbi.nlm.nih.gov/28771884/).
Let us now demonstrate how having a deeply collated set of data can be of value in asking more specific questions, like “how to macula/fova cones differ relative to peripheral cones.”
First, we need to see whether this is even a tractable question.
@meta.data %>%
scEiaD_dropletas_tibble() %>%
group_by(retina_region, organism, study_accession, CellType_predict) %>%
summarise(Count = n()) %>%
filter(CellType_predict %in% c("Cone")) %>%
filter(Count > 10, !is.na(retina_region))
## `summarise()` has grouped output by 'retina_region', 'organism',
## 'study_accession'. You can override using the `.groups` argument.
## # A tibble: 9 × 5
## # Groups: retina_region, organism, study_accession [9]
## retina_region organism study_accession CellType_predict Count
## <chr> <chr> <chr> <chr> <int>
## 1 Macula Homo sapiens EGAD00001006350 Cone 337
## 2 Macula Homo sapiens SRP151023 Cone 95
## 3 Macula Homo sapiens SRP194595 Cone 48
## 4 Macula Homo sapiens SRP222958 Cone 15
## 5 Macula Homo sapiens SRP255195 Cone 156
## 6 Peripheral Homo sapiens SRP151023 Cone 291
## 7 Peripheral Homo sapiens SRP222958 Cone 14
## 8 Peripheral Homo sapiens SRP255195 Cone 28
## 9 Peripheral Macaca fascicularis SRP158528 Cone 649
# Create new column with a pasted together cluster ID and a CellType
<- AddMetaData(scEiaD_droplet,
scEiaD_droplet metadata =
paste0(
@meta.data$retina_region,
scEiaD_droplet'_',
@meta.data$CellType_predict,
scEiaD_droplet'_',
@meta.data$organism
scEiaD_droplet
), col.name = 'regionCT')
# filter down scEiaD
Idents(scEiaD_droplet) <- scEiaD_droplet@meta.data$regionCT
<- subset(scEiaD_droplet, idents = c('Macula_Cone_Homo sapiens','Peripheral_Cone_Homo sapiens'))
scEiaD_droplet__subset
# create SCE object so we can use the scran test which balances covariates
<- as.SingleCellExperiment(scEiaD_droplet__subset)
sce
# run test, blocking (covariate) for study accession
<- findMarkers(sce,
sce_diff_test group = sce$regionCT,
block = sce$study_accession,
pval.type = 'all',
test="wilcox")
# while the default test is a wilcox which returns an AUC
# I like to also run a t test which returns a logFC which is useful in filtering
# as something with a high AUC (power to distinguish between groups) but
# a low logFC is less likely to be interesting
<- findMarkers(sce,
sce_diff_test_t group = sce$regionCT,
block = sce$study_accession,
pval.type = 'all',
test.type = 't')
# get human understandable gene name
library(org.Hs.eg.db)
<- mapIds(org.Hs.eg.db, keys = row.names(sce_diff_test$`Macula_Cone_Homo sapiens`), keytype = "ENSEMBL", column="SYMBOL") symbols
## 'select()' returned 1:many mapping between keys and columns
<- sce_diff_test$`Peripheral_Cone_Homo sapiens` %>%
auc_test as_tibble(rownames = 'name') %>%
left_join(symbols %>% enframe()) %>%
filter(FDR < 0.01) %>%
::rename(p.value.auc = p.value,
dplyrFDR.auc = FDR)
## Joining, by = "name"
<- sce_diff_test_t$`Peripheral_Cone_Homo sapiens` %>%
t_test as_tibble(rownames = 'name') %>%
filter(FDR < 0.01) %>%
::rename(p.value.t = p.value,
dplyrFDR.t = FDR)
%>%
auc_test left_join(t_test, by = 'name') %>%
filter(FDR.auc < 0.01, FDR.t < 0.01) %>%
filter(summary.AUC > 0.7 | summary.AUC < 0.3) %>%
filter(abs(summary.logFC) > 1) %>%
arrange(-`logFC.Macula_Cone_Homo.sapiens`)
## # A tibble: 21 × 10
## name p.value.auc FDR.auc summary.AUC AUC.Macula_Cone… value p.value.t
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 ENSG000001… 8.90e-27 1.09e-24 0.811 0.811 NRL 5.17e-40
## 2 ENSG000002… 1.54e-35 3.90e-33 0.147 0.147 CHCH… 4.43e-51
## 3 ENSG000001… 3.41e-42 1.85e-39 0.0896 0.0896 GUK1 2.82e-54
## 4 ENSG000001… 4.29e-36 1.16e-33 0.131 0.131 ATP5… 6.90e-60
## 5 ENSG000001… 1.31e-29 1.97e-27 0.163 0.163 IMPG1 1.07e-38
## 6 ENSG000001… 2.66e-47 3.04e-44 0.0489 0.0489 COX2 1.87e-50
## 7 ENSG000002… 2.24e-31 3.86e-29 0.158 0.158 ND4L 1.24e-37
## 8 ENSG000000… 1.81e-43 1.15e-40 0.134 0.134 MAP2 4.24e-32
## 9 ENSG000001… 1.11e-38 4.56e-36 0.133 0.133 NDUF… 2.53e-59
## 10 ENSG000001… 1.21e-41 5.39e-39 0.130 0.130 COTL1 5.08e-70
## # … with 11 more rows, and 3 more variables: FDR.t <dbl>, summary.logFC <dbl>,
## # logFC.Macula_Cone_Homo.sapiens <dbl>
Top gene more highly expressed in the macula compared to the periphery. But … seems to be driven by just one study (SRP151023). So this doesn’t look like a good candidate to me.
VlnPlot(scEiaD_droplet__subset, features = c('ENSG00000139053'))
VlnPlot(scEiaD_droplet__subset, c('ENSG00000139053'), split.by = 'regionCT', group.by='study_accession', ) + aes(color=scEiaD_droplet__subset$regionCT)
## The default behaviour of split.by has changed.
## Separate violin plots are now plotted side-by-side.
## To restore the old behaviour of a single split violin,
## set split.plot = TRUE.
##
## This message will be shown once per session.
## Warning: Groups with fewer than two data points have been dropped.
The effect is consistent across multiple studies, which increases confidence this is a real pattern.
PCP4 modulates calmodulin, which is involved in cone photoreceptor ribbon replenishment (https://pubmed.ncbi.nlm.nih.gov/25311636/). So maybe not crazy? Or maybe it is. I only know a dangerous amount about retinal function.
VlnPlot(scEiaD_droplet__subset, features = c('ENSG00000183036'))
VlnPlot(scEiaD_droplet__subset, c('ENSG00000183036'), split.by = 'regionCT', group.by='study_accession', ) + aes(color=scEiaD_droplet__subset$regionCT)
## Warning: Groups with fewer than two data points have been dropped.