Welcome

This document contains some simple examples for how to use the advanced filtering options in https://plae.nei.nih.gov and a brief introduction to using the Seurat file to do custom tests.

Gene Focused

By “gene focused” we mean that:

  1. You have a gene you are interested in
  2. You want to learn more about where / when / what it is expressed in

In silico In Situ

For those who are most comfortable with stained in situ slices of the retina this visualization may be useful. The major cell types of the retina are laid on in rough anatomical positioning. The cell types are colored by intensity, with the brighter colors meaning that the gene is more highly expressed in that cell type. As an example we show expression of RHO (rhodopsin, a rod marker) and RPE65 (RPE marker).

In silico in situ screenshot with Rho gene

In silico in situ screenshot with RPE65 gene

UMAP - Table

If you are curious about a gene, then there are several ways you can learn about its retinal cell type expression patterning. We will use ATOH7, a transcription factor that regulates retinal ganglion development as our example gene.

The UMAP view is a two dimensional representation of the individual cells in the scEiaD. Cells that are closer together have more related gene expression profiles (and thus are likely to be similar cell types).

Let’s go the UMAP - Table viewer in plae:

Viz -> UMAP - Tables

umap visualization with atoh7 gene

Dark gray are cells which have no detectable ATOH7.

Yellow is the highest expression and dark purple is the lowest expression.

umap visualization with atoh7 with relaxed expression filtering

Show highest expressing cells

What if we want to see which cells have the highest expression? We can use the “Filter Gene Expression” slider to only show cells with expression above a log2(expression) value.

We see that the highest expressing cells are in the “center” before the branching happens.

umap visualization with atoh7 with tightened expression filtering

Species Filtering

By default plae shows data for all organisms in the database (human, mouse, macaque).

If we only want to see ATOH7 expression in human data, then that is very easy with the powerful “Scatter Filter Category” and “Gene Filter On” sections.

umap visualization with atoh7 on only human data

Table Information

While the UMAP view is cool looking, it can’t show you everything….what if we want to know what kind of cells are expressing ATOH7?

We can have quantified information on where ATOH7 is expressed by Cell Type (predicted) (this is our machine learned cell type labels) and organism.

screenshot of atoh7 gene info table

We see that about 50% of the mouse and human neurogenic cell type express ATOH7. In raw counts that is 16,449 of 28,811 total mouse neurogenic cells. They have an average expression of 1.86. You can sort or filter the table based on queries. If you wanted to see ATOH7 expression in the RGCs this is trivial to do by typing in the box below.

screenshot of atoh7 gene info table with filtering shown for cell type

This shows us that ATOH7 expression seems to be dropping in the maturing/mature RGCs (and is much lower in the macaque) relative to the neurogenic population.

Study filtering

As scEiaD is constructed from publicly available datasets, you can also filter the data to only show results from a specific paper. This may be useful if you using the results from that paper and want to check or confirm a finding.

You can see information about the papers / studies in scEiaD by using the adjacent “Make Meta Table” section as follows:

meta table study screenshot

We see that the Clark et al. 2019 study did Smart-seq2 and 10X across many developmental time points in mouse. They study_accession ID is SRP158081. We can use this ID to look at ATOH7 expression only in this study in both the UMAP view and the table view

umap visualization of atoh7 expression in Clark et al 2019 study

table info for atoh7 only showing info from Clark et al

Expression Plot

As we have a huge number of studies and samples, we can use this (for single cell data) unusual view: a boxplot! We can see how ATOH7 expression changes across celltype and study.

How do we get here?

expression plot page screenshot

Make a plot….that shows nothing?

We’ve entered ATOH7 as the gene to plot (1). We are faceting (splitting the plot into separate sub-plots) on Cell Type (predict) (2). We are coloring the data points by study_accession (each study’s average gene expression across the Cell Type (predict) is plotted separately) (3). But we see … nothing. Why?

atoh7 expression plot intentional bad example screenshot

That is because the Plot Height (400) is not high enough. The text is prioritized over the data, so they are hidden. As it is extremely difficult to “auto” pick the correct height, it was more straightforward to have the user pick it. Usually a value around 700 - 1200 will give a reasonable view.

Useable plot

atoh7 expression plot with sufficient plot height given

So yes, now we can see the data.

Some cones have ATOH7 expression?

So each point is an independent study. We see high ATOH7 expression in the neurogenic population, across many studies. But we also see one of studies with ATOH7 expression in cones. The legend shows which colors correspond to which study.

screenshot of legend of expression plot One of the studies is a bright pink…probably SRP200599. We can confirm that by replotting the data with a filter that only shows study_accession SRP200599.

Yep, that is it. We can jump to the UMAP - Table view to pull up the metadata we have extracted about the study.

This is from Buenaventura et al.  and is a study that enriched early mouse cones. Some work suggests that loss of ATOH7 inhibits cone specification. These cones may be “early” cones or late neurogenic cells that are developing into cones.

Cell Type Focused

If you want to get a sense about what is present in scEiaD, then there are several tools you can use. For our example, we will be starting with the rods.

How many rods do we have?

These are published labels

How many rods do we have after the machine learning?

How many rods (predicted) do we have across organism?

How many rods (predicted) we we have across organism and study?

What genes are differentially expressed in the Rods?

And what does ROM1 expression look like in the UMAP?

Very high expression in the rods (and expressed in other cells too)

Advanced Stuff - Analysis in R

We provide the full data as seurat (v3) or anndata (scanpy) objects you can download for boutique analysis. Here we demonstrate how you can use the Seurat object to run a quick custom diff test on a sub-population of neurogenic cells. The Seurat object is 25 GB in size and likely will cause serious memory issues on a generic laptop.

library(Seurat)
## Registered S3 method overwritten by 'spatstat.geom':
##   method     from
##   print.boxx cli
## Attaching SeuratObject
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(SingleCellExperiment)
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following object is masked from 'package:dplyr':
## 
##     count
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## The following object is masked from 'package:tidyr':
## 
##     expand
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## The following object is masked from 'package:purrr':
## 
##     reduce
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
## 
## Attaching package: 'SummarizedExperiment'
## The following object is masked from 'package:SeuratObject':
## 
##     Assays
## The following object is masked from 'package:Seurat':
## 
##     Assays
library(scran)
## Loading required package: scuttle
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## 
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
#system('wget -O ~/data/scEiaD/scEiaD_all_seurat_v3.Rdata https://hpc.nih.gov/~mcgaugheyd/scEiaD/2021_11_11/scEiaD_all_seurat_v3.Rdata)
# the -O renames the file and puts it in the ~/data/scEiaD directory
load('/Volumes/McGaughey_S/scEiaD_v3/scEiaD_all_seurat_v3.Rdata')

ID neurogenic cells in different clusters

We see that the neurogenic cells are a small minority of these clusters. But perhaps still we can find some interesting differences between them.

scEiaD_droplet@meta.data %>% 
  group_by(cluster, CellType_predict) %>% 
  summarise(Count = n()) %>% 
  mutate(Perc = (Count / sum(Count)) * 100) %>% 
  filter(Perc > 5) %>% 
  filter(CellType_predict == 'Neurogenic Cell') %>% 
  arrange(-Count)
## `summarise()` has grouped output by 'cluster'. You can override using the
## `.groups` argument.
## # A tibble: 3 × 4
## # Groups:   cluster [3]
##   cluster CellType_predict Count  Perc
##     <dbl> <chr>            <int> <dbl>
## 1      46 Neurogenic Cell   5408  6.54
## 2      47 Neurogenic Cell   3577  7.70
## 3      49 Neurogenic Cell   1313  9.68

The most common is cluster 46 and 47

Compare cluster 46 - Neurogenic vs cluster 47 - Neurogenic

This will take a few minutes to run. This is why we cannot offer on-demand custom diff testing as while it runs the whole app shuts down for everyone while it computes.

# Create new column with a pasted together cluster ID and a CellType
scEiaD_droplet <- AddMetaData(scEiaD_droplet, 
                              metadata = 
                                paste0(
                                  scEiaD_droplet@meta.data$cluster, 
                                  '_', 
                                  scEiaD_droplet@meta.data$CellType_predict
                                  ), 
                              col.name = 'clusterCT' )
# tell Seurat this new column is the default identity
Idents(scEiaD_droplet) <- scEiaD_droplet@meta.data$clusterCT
diff_test <- FindMarkers(scEiaD_droplet, ident.1 = '46_Neurogenic Cell', ident.2 = '47_Neurogenic Cell')
diff_test %>% arrange(p_val) %>% filter(abs(avg_log2FC) > 2) %>% head(5)
##                         p_val avg_log2FC pct.1 pct.2     p_val_adj
## ENSG00000164104  0.000000e+00   2.541205 0.780 0.598  0.000000e+00
## ENSG00000110092 6.339242e-218   2.704172 0.610 0.457 9.616630e-214
## ENSG00000142676 1.290179e-111   2.317366 0.203 0.039 1.957201e-107
## ENSG00000131469 1.297518e-111   2.074207 0.189 0.030 1.968335e-107
## ENSG00000137154 6.726508e-111   2.190992 0.189 0.030 1.020411e-106

We see the top hit (by p value and log2FC > 2) is HMGB2 (ENSG00000164104), which is associated with a transition from quiescent to proliferative neurogenesis (https://pubmed.ncbi.nlm.nih.gov/28771884/).

Compare macula cones vs peripheral cones

Let us now demonstrate how having a deeply collated set of data can be of value in asking more specific questions, like “how to macula/fova cones differ relative to peripheral cones.”

First, we need to see whether this is even a tractable question.

scEiaD_droplet@meta.data %>% 
  as_tibble() %>% 
  group_by(retina_region, organism, study_accession, CellType_predict) %>% 
  summarise(Count = n()) %>% 
  filter(CellType_predict %in% c("Cone")) %>% 
  filter(Count > 10, !is.na(retina_region))
## `summarise()` has grouped output by 'retina_region', 'organism',
## 'study_accession'. You can override using the `.groups` argument.
## # A tibble: 9 × 5
## # Groups:   retina_region, organism, study_accession [9]
##   retina_region organism            study_accession CellType_predict Count
##   <chr>         <chr>               <chr>           <chr>            <int>
## 1 Macula        Homo sapiens        EGAD00001006350 Cone               337
## 2 Macula        Homo sapiens        SRP151023       Cone                95
## 3 Macula        Homo sapiens        SRP194595       Cone                48
## 4 Macula        Homo sapiens        SRP222958       Cone                15
## 5 Macula        Homo sapiens        SRP255195       Cone               156
## 6 Peripheral    Homo sapiens        SRP151023       Cone               291
## 7 Peripheral    Homo sapiens        SRP222958       Cone                14
## 8 Peripheral    Homo sapiens        SRP255195       Cone                28
## 9 Peripheral    Macaca fascicularis SRP158528       Cone               649
# Create new column with a pasted together cluster ID and a CellType
scEiaD_droplet <- AddMetaData(scEiaD_droplet, 
                              metadata = 
                                paste0(
                                  scEiaD_droplet@meta.data$retina_region, 
                                  '_', 
                                  scEiaD_droplet@meta.data$CellType_predict,
                                  '_',
                                  scEiaD_droplet@meta.data$organism
                                  ), 
                              col.name = 'regionCT')

# filter down scEiaD
Idents(scEiaD_droplet) <- scEiaD_droplet@meta.data$regionCT
scEiaD_droplet__subset <- subset(scEiaD_droplet, idents = c('Macula_Cone_Homo sapiens','Peripheral_Cone_Homo sapiens'))

# create SCE object so we can use the scran test which balances covariates
sce <- as.SingleCellExperiment(scEiaD_droplet__subset)

# run test, blocking (covariate) for study accession
sce_diff_test <- findMarkers(sce,
                group = sce$regionCT,
                block = sce$study_accession,
                pval.type = 'all',
                test="wilcox")
# while the default test is a wilcox which returns an AUC 
# I like to also run a t test which returns a logFC which is useful in filtering
# as something with a high AUC (power to distinguish between groups) but
# a low logFC is less likely to be interesting
sce_diff_test_t <- findMarkers(sce,
                group = sce$regionCT,
                block = sce$study_accession,
                pval.type = 'all',
                test.type = 't')

# get human understandable gene name
library(org.Hs.eg.db)
symbols <- mapIds(org.Hs.eg.db, keys = row.names(sce_diff_test$`Macula_Cone_Homo sapiens`), keytype = "ENSEMBL", column="SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
auc_test <- sce_diff_test$`Peripheral_Cone_Homo sapiens` %>% 
  as_tibble(rownames = 'name') %>% 
  left_join(symbols %>% enframe()) %>% 
  filter(FDR < 0.01) %>% 
  dplyr::rename(p.value.auc = p.value,
         FDR.auc = FDR)
## Joining, by = "name"
t_test <- sce_diff_test_t$`Peripheral_Cone_Homo sapiens` %>% 
  as_tibble(rownames = 'name') %>% 
  filter(FDR < 0.01) %>% 
    dplyr::rename(p.value.t = p.value,
         FDR.t = FDR)

auc_test %>% 
  left_join(t_test, by = 'name') %>% 
  filter(FDR.auc < 0.01, FDR.t < 0.01) %>% 
  filter(summary.AUC > 0.7 | summary.AUC < 0.3) %>% 
  filter(abs(summary.logFC) > 1) %>% 
  arrange(-`logFC.Macula_Cone_Homo.sapiens`)
## # A tibble: 21 × 10
##    name        p.value.auc  FDR.auc summary.AUC AUC.Macula_Cone… value p.value.t
##    <chr>             <dbl>    <dbl>       <dbl>            <dbl> <chr>     <dbl>
##  1 ENSG000001…    8.90e-27 1.09e-24      0.811            0.811  NRL    5.17e-40
##  2 ENSG000002…    1.54e-35 3.90e-33      0.147            0.147  CHCH…  4.43e-51
##  3 ENSG000001…    3.41e-42 1.85e-39      0.0896           0.0896 GUK1   2.82e-54
##  4 ENSG000001…    4.29e-36 1.16e-33      0.131            0.131  ATP5…  6.90e-60
##  5 ENSG000001…    1.31e-29 1.97e-27      0.163            0.163  IMPG1  1.07e-38
##  6 ENSG000001…    2.66e-47 3.04e-44      0.0489           0.0489 COX2   1.87e-50
##  7 ENSG000002…    2.24e-31 3.86e-29      0.158            0.158  ND4L   1.24e-37
##  8 ENSG000000…    1.81e-43 1.15e-40      0.134            0.134  MAP2   4.24e-32
##  9 ENSG000001…    1.11e-38 4.56e-36      0.133            0.133  NDUF…  2.53e-59
## 10 ENSG000001…    1.21e-41 5.39e-39      0.130            0.130  COTL1  5.08e-70
## # … with 11 more rows, and 3 more variables: FDR.t <dbl>, summary.logFC <dbl>,
## #   logFC.Macula_Cone_Homo.sapiens <dbl>

Always look at the plots

PDE6H

Top gene more highly expressed in the macula compared to the periphery. But … seems to be driven by just one study (SRP151023). So this doesn’t look like a good candidate to me.

VlnPlot(scEiaD_droplet__subset, features = c('ENSG00000139053'))

VlnPlot(scEiaD_droplet__subset, c('ENSG00000139053'), split.by = 'regionCT', group.by='study_accession', ) + aes(color=scEiaD_droplet__subset$regionCT)
## The default behaviour of split.by has changed.
## Separate violin plots are now plotted side-by-side.
## To restore the old behaviour of a single split violin,
## set split.plot = TRUE.
##       
## This message will be shown once per session.
## Warning: Groups with fewer than two data points have been dropped.

PCP4

The effect is consistent across multiple studies, which increases confidence this is a real pattern.

PCP4 modulates calmodulin, which is involved in cone photoreceptor ribbon replenishment (https://pubmed.ncbi.nlm.nih.gov/25311636/). So maybe not crazy? Or maybe it is. I only know a dangerous amount about retinal function.

VlnPlot(scEiaD_droplet__subset, features = c('ENSG00000183036'))

VlnPlot(scEiaD_droplet__subset, c('ENSG00000183036'), split.by = 'regionCT', group.by='study_accession', ) + aes(color=scEiaD_droplet__subset$regionCT)
## Warning: Groups with fewer than two data points have been dropped.