1 scRNA: Processing Basics

Introduction to Seurat for scRNA processing.

This is an R Markdown document that was designed to introduce and review Seurat and basic analysis of single cell RNA-seq (scRNA) data.

For this tutorial, we will be using single cells from our Human Breast Cell Atlas.

Samples were processed via 10x Genomics GEM-X Flex Gene Expression library generation.

This tutorial will review the following parts of standard scRNA-seq analysis using the Seurat package:

RNA Processing Basic workflow: scRNA Processing Basic workflow

What we're covering

Creating Seurat objects from cellranger output.
Examining basic quality control
- number of genes nFeature_RNA
- number of UMIs nCount_RNA
- number of cells
- percentage of mito/ribo genes
- find debris
- find doublets (DoubletFinder, not covered in this tutorial)
Data normalization
- Account for different sequencing effort per cell
Clustering the cells
- Dimensionality reduction and projection onto 2D spaces
- Find clusters with different resolutions (low, medium and high)
- How to choose an appropriate clustering resolution
Finding differentially expressed features
- Differential gene expression between clusters
- Cluster identification
Assigning cell type identities to clusters
- Manual assignment based on top genes
- Automated assignment using label transfer from libraries like SingleR check the detailed tutorial
- Find anchors between query and reference in Seurat using the function called TransferData

For this tutorial, we used R 4.3.3 for MacOS: R-4.3.3-arm64.pkg

Helpful R tip

If you want to know more about a function, you can use ?FUNCTION_NAME to examine the inputs, where FUNCTION_NAME is the name of the function using the correct function capitalization

R Library Basics

Note, this tutorial was prepared using Seurat 5

you might need to install the packages if you are running your analysis on a local computer

Most of R packages can be installed using:

## from CRAN repositories
install.packages('XXX')

##alternative ways 
BiocManager::install('XXX')
remotes::install_github('XXX')

## Install multiple packages
install.packages(c("P1","P2","P3",dependencies = TRUE)

## Install packages with a specific version using: 
remotes::install_version('XXX',version=1.2-3)

Let's get started with loading R packages!

to start in our preset course env

# This step only needs to be run once when setting up the course environment for the first time.
cp ~mirao/.Renviron ~/
echo module load EBModules R hdf5 EBH100 CMake JAGS >> ~/.bashrc
echo module load GCC/13.3.0 >> ~/.bashrc

# otherwise, install R and RStudio into your local device
[RStudio Desktop] (https://posit.co/download/rstudio-desktop/)

# if run everything at local device
# install all the packages required for the workshop

# library install
# pre-install
install.packages('BiocManager')
library('BiocManager')

package_list <- c('gypsum','scran','BiocNeighbors','glmGamPoi','SingleR','scRNAseq','celldex','scater','scrapper')

BiocManager::install(package_list)

{

knitr::opts_chunk$set(
  cache = TRUE,
  cache.lazy = FALSE,
  tidy = TRUE,
  autodep=TRUE)

suppressPackageStartupMessages({

## basic
library(dplyr)
library(ggplot2)
library(patchwork)
library(plyr)
library(readr)
library(knitr)
library(flexdashboard)
library(tidyverse)
library(remotes)
library(gypsum)

## Seurat
library(Seurat)
library(SeuratObject)
library(SeuratWrappers)
  
## cell annotation  
library(BiocNeighbors)
library(SingleR)  
library(scRNAseq) ## reference
library(celldex)
library(scater)
library(cowplot)

## set seed
set.seed(12345)
})

}

Set working directory

This will need to be changed to the specific location where you have placed the data to be analyzed. You should specify the directory where you are working and set the directory.

R Functions for this section:

<- Base R function - this is the assign function that is fairly R specific. You can also use =, which is a more common assignment syntax in other languages.
setwd Base R function - This will set your directory to the string you provided assuming you have permissions to that location and if you have provided a valid option.

Note slash direction is due to being run on either a Mac or Linux. If you are using a Windows machine, your slashes will be \ instead. This can be set using the . if are unsure.

Note Ideally, you would not need to set your working directory. Your files should be together in a folder. In general, I suggest a structure of functions, reports, and data under each project folder.

# option: run the analysis via RStudio on website
input_dir <- "/grid/singlecellcourse/data/tutorials/scRNA/sample_filtered_feature_bc_matrix/"
output_dir <- "/home/YOUR_NAME/R_BASIC_FOLDER/"
dir.create(output_dir)
setwd(output_dir)

# option: run the analysis at local RStudio
# transfer the data from sever to your local path
scp -r /INPUT_FOLDER/  /YOUR_LOCAL_PATH/ # terminal
output_dir <- "/home/YOUR_NAME/R_BASIC_FOLDER/"
dir.create(output_dir)
setwd(output_dir)

Additionally, for data organization, it would be helpful to have the folder structure established before beginning the analysis.

# create the folder for figures objects and scripts if they are not existing in your output path
  if (!dir.exists("figures")) dir.create("figures")
  if (!dir.exists("objects")) dir.create("objects")
  if (!dir.exists("scripts")) dir.create("scripts")

Data Downloads

Before starting this tutorial, please making sure the data was downloaded to your working path.

In most cases, the easiest way is to downloaded the cell matrix files (usually in HDF5 format for R or H5AD format for python).

Download the scRNA HBCA cells from this github repo. We will do this is R using a system call to download the github permalinks

#path of count matrix

#this should make three new files in your directory
list.files()

1. Create Seurat Objects from 10x Matrix output (.mtx)

You can read in data from Read10X (h5) / read_h5ad (h5ad) format

Or, from a more basic output of Read10X, which takes the input of a directory with three files matrix.mtx, barcodes.tsv, and genes.tsv.

Reading in the object

Functions for this section:

# load data from input directory
# input_dir <- "/grid/singlecellcourse/data/tutorials/scRNA/sample_filtered_feature_bc_matrix"
hbca <- Read10X(input_dir)

hbca[1:10, 1:10]

dim(hbca)

## [1] 18082  9751

Initialize object

Note We call these "features", not "genes" as a more generic term because we use a very similar object structure for other data modalities, such as when you have antibodies or ATAC peaks.

Here, we choose how much of the raw data we want to work with by setting specific limits.

Here, we state that we require a gene to be present in at least min_cells cells and each cell is required to have at least min_features features or genes.

Functions:

CreateSeuratObject SeuratObject function - this creates a Seurat object from the raw data. We also can set the min cells and min features/genes. For completely raw results you would set both of these arguments to zero.

Note The min cells and min features listed below are good starting values, but these should be set and determined experimentally or informed by pre-existing information you have about the data. For example, for an immune-rich data set, frozen, or nuclei samples, we would lower the threshold.

For min.cells, a good starting place is between 3-5 cells. However, sometimes it makes sense to not include this threshold at all for a single sample and only apply it when you have multiple samples.
For min.features (fresh single cells), a good starting place for fresh single cell samples is between 100 to 200. Lean towards lower bound for immune heavy samples where you have selected for immune cells in your experiment. If you have very viable, and therefore very good samples, having this set at the higher end makes sense. This number is also going to be impacted by sequencing coverage and depth. I would usually start at 200 for this threshold.
For min.features (frozen or single nuclei), a good starting place for frozen or nuclei samples is between 50 to 150. Lean towards lower bound for immune heavy samples where you have selected for immune cells in your experiment. I would usually start at 100 for these.
Alternative method of thresholding Other people may choose their ranges using the standard dev or median absolute deviation. However, you need to be careful with these. Usually they do not help set the lower bound of the threshold so much as the upper bound.

## Turn count matrix into a Seurat object 
hbca_raw <- CreateSeuratObject(counts = hbca, project = "hbca_p85", min.cells = 0, min.features = 0)  ## we load a processed matrix without any filtering
hbca_raw

# An object of class Seurat 
# 18082 features across 9751 samples within 1 assay 
# Active assay: RNA (18082 features, 0 variable features)
#  1 layer present: counts

Note

Examine the meta data where QC or cell specific information could be stored

Functions:

head function from the utils package (fairly basic R package). This results the first part of a data frame, table, matrix or vector.
tail function is similar to the head function, but starts from the bottom instead of the top

This specifically examines the meta data in the Seurat object and looks at top 5 items.

If you wanted to look at the whole thing you can use View, but it can crash R if it is too large.

# Explore merged metadata
head([email protected],10)
tail([email protected],10)
# View([email protected])

2. Basic quality control

Below, we add the percentage of a specific feature to our meta data. We are assigning the Seurat object a meta feature called percent.mt for percent of mitochondrial genes.

Functions: 1. PercentageFeatureSet - Seurat specific function which uses regular expression patterns to select specific features to calculate the percentage of those features making up a cell's gene expression.

R regular expressions guide

^ at the beginning of the regular expression notes that this should be the first character of the feature
RP[SL] In this expression, “RP” matches genes that encode for ribosomal proteins; “S” or “L” specifies the subunit of the ribosome
:digit: matches any digit
| this can be used for "or" expressions

Mitochondrial genes percentage

Here is the example with mitochondrial genes percentage, which is often used to assess sample quality with a high fraction of apoptotic or lysing cells. In this case, high mitochondrial gene percentage is considered poor quality.

# mitochondrial ratio
hbca_raw[["percent.mt"]] <- PercentageFeatureSet(hbca_raw, pattern = "^MT-")
# use pattern = "^mt-" for mouse

Other biological features

a) Ribosomal genes percentage

Ribosomal protein genes are used in the filtering of single-cell RNA sequencing data to assess the quality of the data and remove poor-quality cells. The proportion of ribosomal protein reads can be an indicator of RNA degradation. High ribosomal protein content can be associated with specific cell types, such as immune cells.

However, the demonstration data we are using has already filtered out all the Ribosomal and Mitochondrial Genes

hbca_raw[["percent.rb"]] <- PercentageFeatureSet(hbca_raw, pattern = "^RP[SL][[:digit:]]|^RPLP[[:digit:]]|^RPSA") 
# pattern = "^Rp[SL][[:digit:]]|^Rplp[[:digit:]]|^Rpsa") for mouse

Important

To break this pattern down in plain language:
^RP[SL]:digit:|^RPLP:digit:|^RPSA
means
IF a gene name starts with RP then has an S or an L (i.e. RPS... RPL...) then has a number (RPS1..RPS2..RPL1...RPL2..)
OR (|) IF a gene names starts with RPLP and has a digit (i.e. RPLP1..RPLP2..)
OR (|) IF a gene name starts with RPSA
count that gene as part of the percentage feature set

b) ratio of genes and umi

This produces a ratio that is related to how much UMI saturation there is per gene

hbca_raw[["umi_gene_ratio"]] <- log10(hbca_raw@meta.data$nCount_RNA/hbca_raw@meta.data$nFeature_RNA)

c) Mitochondrial and Ribosomal genes percentage

Add the Mitochondrial and Ribosomal genes percentage together

hbca_raw[["mt_rb"]] <- hbca_raw@meta.data$percent.mt + hbca_raw@meta.data$percent.rb

Example structure

It is also good to check the summary for the items before you move on.

summary(hbca_raw@meta.data)

Visualize the QC

One of the first items you want to do when looking at your data is to visualize the quality of the samples.

Violin Plots

Visualize QC metrics as a violin plot

Functions:

VlnPlot - Seurat specific function that shows violin plots of single cell data. This can be anything from QC metrics, meta data, scores, or single or combined gene expression values Note you can have multiple items, including different ncols.

When you examine the violin plots, you may consider using some the standard deviations or median absolute deviations as cut-offs. Alternatively, strict pre-defined cut-offs are possible, but they are often very arbitrary and likely need to vary by sample

As discussed previously:

nfeature: represents the number of unique genes detected in each cell
nCount: represents the total number of molecules (UMIs) detected within a cell
percent mt: percent of mitochondrial genes
percent rb: percent of ribosomal genes
mt rb: percent of mitochondrial genes and ribosomal genes together
umi gene ratio: represents the number of genes detected per UMI (unique molecular identifier) for each cell. A higher UMI gene ratio indicates that more genes are detected per UMI, which is generally considered a good indicator of data quality. However, it can also show duplicates.

# plot the QC VlnPlot
VlnPlot(hbca_raw, features = c("nFeature_RNA", "nCount_RNA", "umi_gene_ratio", "percent.mt"), ncol = 2)

QC_raw

create seurat with QC (min.cells = 3, min.features = 200)

Filtering criterion:

Genes detected in < 3 cells were excluded
Cells with < 200 total detected genes were excluded
Cells with >= 20% of mitochondria-expressed genes were excluded. High mtRNA content may indicate apoptosis.

#create seurat with QC 

hbca_filtered <- CreateSeuratObject(counts = hbca, project = "hbca_p85", min.cells = 3, min.features = 200)

hbca_filtered

# An object of class Seurat 
# 16074 features across 9725 samples within 1 assay 
# Active assay: RNA (16074 features, 0 variable features)
# 1 layer present: counts

# basic QC
# mitochondrial ratio
hbca_filtered[["percent.mt"]] <- PercentageFeatureSet(hbca_filtered, pattern = "^MT-")

# ration of gene ans umi
hbca_filtered[["umi_gene_ratio"]] <- log10(hbca_filtered@meta.data$nCount_RNA/hbca_filtered@meta.data$nFeature_RNA)

hbca_filtered <- subset(x = hbca_filtered, subset =percent.mt < 20)

# Addition way to filter low quality cells (for FLEX, NOT for snRNA, you could choose the lower nFeature_RNA as initial threshold, eg: 200-500)
# hbca_filtered<- subset(hbca_filtered, subset = nFeature_RNA > 200)

# plot the QC VlnvPlot

VlnPlot(hbca_filtered, features = c("nFeature_RNA", "nCount_RNA", "umi_gene_ratio", "percent.mt"), ncol = 2)

# visualize QC_ratio flexdashboard

QC_ratio <- round(ncol(hbca_filtered)/ncol(hbca_raw) * 100, 2)

flexdashboard::gauge(QC_ratio, min = 0, max = 100, symbol = "%", gaugeSectors(success = c(50, 100),
    danger = c(0, 50)))

Note Please note that there is no one-size-fits-all filtering criteria, as the normal ranges of these metrics can vary dramatically from one experiment to another, depending on sample origin as well as reagents and sequencing depths. It would be helpful to know about your sample and have the expectations before the QC filtering.

Save the raw and filtered rds data

Many of the steps we have completed have taken a long time to run. To save yourself time in the future, it is a good idea to save your Seurat object as an RDS, so you can read it into your R environment again in the future, if needed.

getwd()
#setwd(output_dir)
saveRDS(hbca_raw, './objects/hbca_p85_raw.rds')
saveRDS(hbca_filtered, './objects/hbca_p85_filtered.rds')

3. Data Normalization

Biological heterogeneity in single-cell RNA-seq data is often confounded by technical factors including sequencing depth, to remove this technical variability, data will be normalized.

SCTransform normalization and variance stabilization of molecular count data. Can be adjusted by molecular items such as percent.mt

This replaces a number of traditional normalization procedures and usually takes longer, but is preferred.
It is based on the use of analytic Pearson residuals for normalization of single-cell RNA-seq data and was covered in this paper
Seurat SCTransform was covered in this paper

Previous versions would use NormalizeData with global-scaling normalization method LogNormalize, as well as scaling methods ScaleData and FindVariableFeatures.

These are all normalization steps to help compare data between samples and to allow parametric methods to be used by shifting the data into a normal distribution, as opposed to a skewed distribution which requires non-parametric tests

Why SCTransform outbeats LogNormalize is explained in seurat SCT tutorial as well as the this paper Though this is disputed in edge cases, for instance SCTransform may over-normalize rare events. A further discussion is here.

SCTransform

This is one of the slower functions; I suggest that you run it on a cluster if processing a lot of cells.

However, for this tutorial we are sticking to a single sample and relatively low cell counts.

Seurat provides a fairly good walk through

SCTransform replaces the following three early processing commands, which were used in the paper: specifically NormalizeData(), ScaleData(), and FindVariableFeatures()

Function: 1. SCTransform - normalizes, scales, and regresses out specific variables.

Data is normalized via log1p(counts), changing the data shape to be more of a normal distribution.
The scale.data centers the data by calculating Pearson residuals and removes unwanted sources of variation from a single-cell dataset.
For example, we could regress out heterogeneity associated with cell cycle stage or mitochondrial contamination. In this case, we will be regressing out percentage mito.

This can be slow, and often people do not look at all features. You can change the “variable.features.n” to have more items, but currently it only examines the top features.

## this step will take a while, for our 6k dataset, it takes 2 mins
hbca_filtered <- SCTransform(hbca_filtered, vars.to.regress = 'percent.mt', do.scale = TRUE, verbose = FALSE)

DefaultAssay(hbca_filtered)  ## the default assay will automatically switch from 'RNA' to 'SCT'

# save object after SCT
saveRDS(hbca_filtered,'./objects/hbca_p85_filtered_SCT.rds' )

Variable genes

Here, we examine the most highly variable genes.

These genes provide information about the sample or clusters. If you are looking at multiple experiments, it might show you batch effects or effects between groups. House keeping genes would not be expected to show up as the most highly variable genes, but cell type or treatment specific-genes should show up here.

top10F <- head(VariableFeatures(hbca_filtered), 10)
head(VariableFeatures(hbca_filtered), 10)

# [1] "COL3A1" "COL1A1" "SELE"   "IL6"    "DCN"    "APOD"   "MUCL1"  "COL1A2" "CSF3"   "SRGN"

Here, we plot the genes to examine the different features with labels.

Function:

VariableFeaturePlot highlights all the variable features. This is a dot plot that shows residual variance and the generic mean.

# plot variable features with and without labels
var_plot <- VariableFeaturePlot(hbca_filtered)
labtop10_plot <- LabelPoints(plot = var_plot, points = top10F, repel = TRUE)

labtop10_plot

labtop10_plot

Linear dimensional reduction

We need to reduce this full matrix of cell x gene into something that is interpretable, with minimal loss of information. For this we use dimensionality reduction. This is to group the variance of many genes into a much smaller number of groups or principle components (PC). This reduces our dimension from cell x gene to cell x PC.

One of the most standard methods of dimensional reduction is Principal Component Analysis PCA

Function:

RunPCA This runs the PCA for dimension reduction.

### Linear dimensional reduction
hbca_filtered <- RunPCA(hbca_filtered, features = VariableFeatures(object = hbca_filtered))
# Here, we examine and visualize the PCA results by printing them out:
print(hbca_filtered[["pca"]], dims = 1:5, nfeatures = 5)

# PC_ 1 
# Positive:  DCN, COL6A2, TIMP2, SPARC, COL1A2 
# Negative:  SPINT2, CDH1, CD24, JUP, EPCAM 
# PC_ 2 
# Positive:  CD93, PECAM1, PLVAP, VWF, CD74 
# Negative:  COL6A2, DCN, COL1A2, LUM, COL6A1 
# PC_ 3 
# Positive:  TM4SF1, BCAM, SPARCL1, AQP1, GJA1 
# Negative:  LAPTM5, SAMSN1, RGS1, GPR183, TYROBP 
# PC_ 4 
# Positive:  MYL9, TAGLN, ACTA2, MYH11, TPM2 
# Negative:  ITGAV, ELOVL5, AREG, MLPH, SHB 
# PC_ 5 
# Positive:  CXCL16, CTSB, CSF2RA, SFRP1, MSR1 
# Negative:  CD96, TRBC2, CNOT6L, CCL5, SPOCK2

Here, we visualize the PCA results for the first 2 dimensions, looking at those associated with reduction components.

Function:

VizDimLoadings This examines the top genes associated with reduction components.

VizDimLoadings(hbca_filtered, dims = 1:2, reduction = "pca")

VizDim

This next visualization is the most standard method of looking at these items.

Function:

DimPlot Plots the PCA for the first two components.

DimPlot(hbca_filtered, reduction = "pca") + NoLegend()

dimplot

This next visualization examines the top differentially expressed genes in the first nine Principle components.

Function:

DimHeatmap this shows the different genes associated with PCs expressed in cells, allowing for easy exploration of the primary sources of heterogeneity in a datase. It can be useful when trying to decide which PCs to include for further downstream analyses

DimHeatmap(hbca_filtered, dims = 1:9, cells = 500, balanced = TRUE)

#?try DimHeatmap(hbca_filtered, dims = 1:20, cells = 500, balanced = TRUE)

Dimheatmap

Determining the best dims

ElbowPlot can also be used for selecting the best number of dims to be used. If you examine the curve for a single sample, you might see more of slow progression.

Function: 1. ElbowPlot this plots the standard dev of the PCA to try. You can change which type of reduction you use too.

In theory, you want to select where it starts to bend .

ElbowPlot(hbca_filtered, ndims = 50)

elbow

4. Clustering

Here, we find the neighbors using the selected number of dims.

Function:

1.FindNeighbors embed cells in SNN graph that calculated based on the euclidean distance in PCA space with previously defined dimensionality of the dataset (first 20 PCs)

# set up the dim number
dim <- 15
hbca_filtered <- FindNeighbors(hbca_filtered, dims = 1:dim)

## Computing nearest neighbor graph

## Computing SNN

We can also try multiple resolutions to find the desire cluster

low_res <- 0.1
mid_res <- 0.4
high_res <- 0.8

Find the clusters for the three different resolutions. Function:

FindClusters will serach the cell groups on KNN graph using Louvain algorithm (default) and output the number of clusters based on the resolutions defined above, with increased resolution values leading to a greater number of clusters.

hbca_filtered <- FindClusters(hbca_filtered, resolution = c(low_res, mid_res, high_res))

Create the Umap visualization based on the paper dims.

Note This function has a lot of arguments that can be adjusted, such as spread.

Note Changing the value of the "seed.use" argument can change the outcome.

Note Other parameters that can be used to optimize your UMAP include, but are not limited to, n.neighbors, n.epochs, min.dist and spread.

You can used functions like group.by to provide splitting.

hbca_filtered <- RunUMAP(hbca_filtered, dims=1:dim)

Resolution

DimPlot(hbca_filtered, reduction = "umap", group.by = paste0("SCT_snn_res.", c(low_res, mid_res, high_res)),label = T)

resolution

Differentially Expressed Genes

One of the major parts of single cell analysis is identifying the Differential Expression (DE) between clusters or groups.

In this case, we can examine the DE analysis between 2 group.

The default is examining a single cluster against all other clusters.

## Choose the resolution for DE
Idents(hbca_filtered) <- hbca_filtered$SCT_snn_res.0.1

## Find markers for each cluster. Will take long time 
hbca.markers <- FindAllMarkers(hbca_filtered, only.pos = TRUE)

hbca.markers %>%
  group_by(cluster) %>%
  dplyr::filter(avg_log2FC > 1) %>%
  slice_head(n = 5) %>%
  ungroup() -> top5

DoHeatmap(object = hbca_filtered, features = top5$gene) +theme(axis.text.y = element_text(size = 8, angle = 45, hjust = 1))

Doheatmap

Save the your rds object or save the environment

getwd()
# setwd(output_dir)
saveRDS(hbca_filtered,'./objects/hbca_p85_filtered_cluster.rds' )

## Load the rds object back
# hbca_filtered <- readRDS(file = './objects/hbca_p85_filtered_cluster.rds')

5. Cell Type Annotation

5.1. Manual assignment/refine based on top genes (Canonical Genes)

Often, it is good to have some genes that you expect to be high in specific clusters based on literature.

Good resources for this include:

Note Expected genes are likely to depend on your cell types, organ type, species, and whether you are looking at single cell or single nuclei.

However, it is good to have an idea of what you want to look at first. Not so much as a guide, but more as a sanity check.

For example, if you do FACs sorting to enrich for CD45+ cells, you expect that you will have enrichment for various types of white blood cells (WBCs), particularly T cells and B cells. Therefore, it is good to have both markers that you expect to be present and those you expected to be absent. Since we sorted for CD45+ cells, we would expect markers for WBCs to be enriched and those for endothelial and epithelial cells to be decreased or absent.

check the data quality for each cluster

VlnPlot(hbca_filtered, features = c("nCount_RNA","nFeature_RNA","percent.mt"), group.by= "SCT_snn_res.0.1",pt.size = 0)

check_QC

SCT01 <- DimPlot(hbca_filtered, reduction = "umap", group.by=paste0("SCT_snn_res.", c(low_res)) , label = TRUE, repel=TRUE) + theme(legend.position = "none")

SCT01

sct01

hbca_markers <- list(
  'Basal'=c('KRT14','KRT17','DST'),
  'LumHR'=c('AREG','ERBB4','AGR2'),
  'LumSec'=c('SCGB2A2','SLPI','KRT15'),
  'Endo'=c('SELE','FABP4','CSF3'),
  'Pericytes'=c('RGS5','MT1A','IGFBP5'),
  'B cells'=c('MZB1','CD79A','DERL3'),
  'Fibro'=c('DCN','APOD','COL1A1'),
  'T cells'=c('CD3D','CD3E','CD2'),
  'Myeloid'=c('CD68','C1QC','MS4A6A'),
  'Mast'=c('IL18R1','HPGD','HDC')
)

library(knitr)
df <- data.frame(hbca_markers)
kable(df)

Note

Markers are from Kumar et al, Nature, 2023

Visualizing marker genes

a) Feature Plot

To get a better idea of cell type identity, we can explore the markers by cluster using the FeaturePlot() function. This plots a umap, but colors the item based on the number. This function has many arguments that can be adjusted, such as the color scale (cols) or which features are plotted.

# visualize expression level of 3 cell type markers on umap
FeaturePlot(hbca_filtered,
            feature=c('SLPI','APOD','CD3D','CD68'),
            ncol = 2,
            label=TRUE)

featureplot

Note We could loop through all these markers to generate FeaturePlots and save them as PDF files for easy reference later.

pdf(paste0("./figures/hbca_p85_FeaturePlot_hbca_markers.pdf"), width = 10, height = 10)

# Loop through each cell type and its markers
for (cell_type in names(hbca_markers)) {
  markers <- hbca_markers[[cell_type]]
  
  # Create a plot for each marker
  plots <- lapply(markers, function(marker) {
    FeaturePlot(hbca_filtered, features = marker, cols = c("lightgrey", "blue"), reduction = "umap") +
      ggtitle(paste(cell_type, "-", marker))
  }
  )
  
  # Combine the plots into one grid
  combined_plot <- plot_grid(plotlist = plots, ncol = 2)
  
  # Print the combined plot to the PDF
  print(combined_plot)
}

# Close the PDF
dev.off()

b) Violin Plot

VlnPlot() function allow to visualize the distribution of a numeric variable for one or several groups.

# visualize the distribution of 3 cell type markers across clusters
VlnPlot(hbca_filtered,feature=c('SLPI','APOD','CD3D','CD68'), pt.size=0, ncol = 2)

vlnplot

Note We could loop through all these markers to generate VlnPlots and save them as PDF files for easy reference later.


pdf(paste0("./figures/hbca_p85_VlnPlot_hbca_markers.pdf"), width = 10, height = 10)
# Loop through each cell type and its markers
for (cell_type in names(hbca_markers)) {
  markers <- hbca_markers[[cell_type]]
  
  # Create a violin plot for each marker
  plots <- lapply(markers, function(marker) {
    VlnPlot(hbca_filtered, features = marker, pt.size = 0) +
      ggtitle(paste(cell_type, "-", marker)) +
      theme_minimal()
  }
  )
  
  # Combine the plots into one grid
  combined_plot <- plot_grid(plotlist = plots, ncol = 2)
  
  # Print the combined plot to the PDF
  print(combined_plot)
}

# Close the PDF
dev.off()

c) Dot Plot

Seurat also has a built in visualization function DotPlot() DotPlot is a intuitive way of visualizing how feature expression changes across different clusters. The size of the dot encodes the percentage of cells within a class, while the color encodes the AverageExpression level across all cells within a class (blue is high).

# Plot the hbca markers across clusters
DotPlot(hbca_filtered,hbca_markers,cluster.idents = TRUE, group.by= "SCT_snn_res.0.1") +  RotatedAxis()

dotplot

Naming Cluster

Now taking all of this information, we can surmise the cell types of the different clusters. You can manually assign cluster names using the features.

# Rename all identities
hbca_filtered <- RenameIdents(object = hbca_filtered, 
                              "0" = "LumSec",
                              "1" = "Fibro",
                              "2" = "LumHR",
                              "3" = "Endo",
                              "4" = "Pericytes",
                              "5" = "T cells",
                              "6" = "Basal",
                              "7" = "Mye",
                              "8" = "B cells",
                              '9'='Mast')


# Plot the UMAP
DimPlot(object = hbca_filtered, 
        reduction = "umap", 
        label = TRUE,
        label.size = 6) + NoLegend()

rename_cluster

# Add celltype annotation as a column in meta.data 
hbca_filtered$celltype <- Idents(hbca_filtered)

5.2. Automatic assignment based on top genes

We will use SingleR to automatically annotate our clusters. SingleR compares cells in a new dataset against curated reference profiles of known cell types, assigning each new cell to the reference type that its expression profile is most similar to.

The reference we are using is [human bulk RNA-seq data]: 259 RNA-seq samples of pure stroma and immune cells as generated and supplied by Blueprint and ENCODE Other ref resources:

Azimuth

remotes::install_github('satijalab/azimuth')

CELLxGENE

Based on the top genes information from human breast cancer bulk seq data.

library(SingleR)
ref <- celldex::BlueprintEncodeData()
# ?celldex::BlueprintEncodeData() ## check doc info

singleR_input <- GetAssayData(hbca_filtered, assay = "RNA", layer ="counts")

# Run SingleR annotation function
hbca_main_cluster <- SingleR(test = singleR_input,
                                              ref = ref,
                                              labels = ref$label.main,
                                              clusters = [email protected]$SCT_snn_res.0.1)

hbca_filtered$singleR_hbca_main <- hbca_main_cluster$labels[match(([email protected]$SCT_snn_res.0.1), rownames(hbca_main_cluster))]

plot_automated <- DimPlot(hbca_filtered, label = TRUE, repel = TRUE, reduction = "umap", group.by = "singleR_hbca_main")

plot_manual <- DimPlot(hbca_filtered, label = TRUE, repel = TRUE, reduction = "umap", group.by = "celltype")

plot_manual + plot_automated

singler

Finally you can save your data!

## save object
setwd(output_dir)
saveRDS(hbca_filtered, "./objects/hbca_p85_ann.rds")
## Alternatively, you can save the whole R environment(included the all objects and libraries).
save(list = ls(),file =  './objects/hbca_p85_flex.Rdata')

## Load your saved R environment 
# load('./objects/hbca_p85_flex.Rdata')

6. Commonly Used Markers in Single-Cell Analysis

Human Breast Cancer canonical Markers

# Non Immune
# Endothelial
Endo_genes <- c("CDH5", "VWF", "PECAM1", "ESAM", "THBD", "CD36", "CD34", "HEG1")
# Epithelial
Epi_genes <- c("EPCAM", "KRT5", "KRT17", "KRT18", "KRT19", "KRT8", "KRT14")  #'KRT6'
# Basal
Basal_genes <- c("KRT14", "KRT17", "KRT5")
# LumHR
LumHR_genes <- c("KRT18", "KRT8", "KRT19")
# LumHR
LumSec_genes <- c("KRT15", "KRT23", "KRT16", "KRT7")
keratins <- c(Basal_genes, LumHR_genes, LumHR_genes)
# Adipose
Adipo_genes <- c("APMAP", "ADIPOQ", "ADIPOQ-AS1", "TPRA1")
# Fibroblast Genes
Fibro_genes <- c("COL1A1", "LUM", "FBLN1", "DCN", "FBN1", "COL1A2", "PCOLCE")
non_immune_genes <- c(Endo_genes, Epi_genes, Adipo_genes, Fibro_genes)

# Immune and lymphocytes
# T-cells
Tcell_genes <- c("CD3D", "CD4", "CD8A", "TRAC", "TRBC1", "TRBC2", "IL7R", "CD96")
# B-cells
Bcell_genes <- c("CD19", "MS4A1", "CD79A", "CD79B", "BANK1", "SEL1L3", "IGHM", "IGLC3")
# NK killer cells
NKill_genes <- c("NKG7", "GNLY", "CTSW")
# General Immune genes
Immun_genes <- c("PTPRC", "SRGN", "TYROBP")  # double check
lymphocytes_cluster_genes <- c(Tcell_genes, Bcell_genes, NKill_genes, Immun_genes)

# Myeloid
myeloid_genes <- c("FCER1G", "ITGAM", "FCGR3A")  # the CD1c+ myeloid DCs, the CD141+ myeloid DCs and the CD303+ 
# Red Blood Cells
RedBld_genes <- c("HBA1", "HBA2", "HBB")
# Macrophages
Macro_genes <- c("CD14", "CD68", "CD163", "C1QA", "C1QB", "MRC1", "MSR1", "MS4A7")
# Monocytes
Monoc_genes <- c("LST1", "FCGR3A", "AIF1", "MAFB", "MS4A6A", "ANPEP", "LYZ")
# Dendrite genes
Dendr_genes <- c("LILRA4", "IRF7", "CLEC9A", "CD1C", "TCF4")
myeloid_cluster_genes <- c(myeloid_genes, RedBld_genes, Macro_genes, Monoc_genes,
    Dendr_genes)

Commonly Used Canonical Markers: Mouse Brainspan

brainspan_markers=list()
brainspan_markers[["GABAergic"]]=c("Gad1","Gad2","Slc32a1")  ## **
brainspan_markers[["Glutamatergic"]]=c("Slc17a7","Lamp5","Rorb","Fezf2")  ## **
brainspan_markers[["Granule"]]=c("Zic1","Neurod1","Etv1")
brainspan_markers[["Astrocytes"]]=c("Aqp4","Aldh1l1","Gfap")
brainspan_markers[["OPC"]]=c("Pdgfra","Cspg4")
brainspan_markers[["Oligodendrocyte"]]=c("Opalin","Prox1","Olig1","Mbp")
brainspan_markers[["Microglia"]]=c("Fyb","C1qa","C1qc")
brainspan_markers[["Endothelial"]]=c("Cldn5") 
brainspan_markers

Commonly Used Canonical Markers: Mouse Brain Canonical Marker

# Neural
Neural_genes <- c("Cadm2", "Pcdh9", "Syt1", "Dlg2", "Dlgap1", "Negr1", "Fgf14")
# Oligodendrocyte
OLigo_genes <- c("Mbp", "Plp1", "Enpp2", "Mag", "Mog", "Cldn11", "Mal")
# Astrocytes
Astro_genes <- c("Slc39a12", "Pla2g7", "Htra1", "Ptprz1", "Ttyh1", "Prdx6")
# Microglial
Microglial_genes <- c("Hexb", "C1qa", "C1qb", "C1qc", "Lgmn", "Csf1r", "Ctss")
# Purkinje
Purkinje_genes <- c("Auts2", "Foxp2", "Ebf1", "Grid2", "Dner", "Khdrsb2")
# Ependymal
Ependymal_genes <- c("Dynlrb2", "Tmem212", "Rsph1", "Foxj1", "Rarres2", "Tmem47")
# Hypendymal
Hypendymal_genes <- c("Rmst", "Epha7", "Zfhx4", "Efnb3", "Zic4", "Wls")
# Choroid
Choroid_genes <- c("Rbm47", "Htr2c", "Enpp2", "Ttr", "Fras1", "Glis3")
# Endothelial Mural
Mural_genes <- c("Adap2", "Pdgfrb", "Pde8b", "Myo1b", "Rbpms", "Dlc1")
# Endothelial Stalk
Stalk_genes <- c("Slco1a4", "Adgrl4", "Abcg2", "Nostrin", "Cldn5", "Ptprb")

# Granular neuron types - Location (Suggest to just use Cerbral cortex,
# thalamus and hypothalamus markers) Cerebellar granule cell - Cerebellum
CBgranular_genes <- c("Kcnd2", "Grik2", "Fgf14", "Chn2", "Tenm1", "Olfm3", "Astn2")  #Not present in this section
# Cerebellar golgi cell - Cerebellum
CBgolgi_genes <- c("Sgcd", "Lgi2", "Megf11", "Elavl2", "Cacna1d", "Fgf13", "Cdh4")  #Not present in this section
# Unipolar brush cell - Cerebral cortex and Cochlear nucleus
CTX_CN_UBC_genes <- c("Unc5c", "Tmem108", "Nnat", "Marchf1", "Ccser1", "Sorbs2")
# Cerebral cortex neuron
CTXneuron_genes <- c("Vamp2", "Selenow", "Grin2b", "Snap25", "Atp6v1g2", "Norad",
    "Syp")
# Hippocamal neuron
HIPneuron_genes <- c("Ncdn", "Ppfia2", "Rfx3", "Ahcyl2", "Epha7")
# Cortical Interneuron
CTXInterneuron_genes <- c("Galntl6", "Gad2", "Nap1l5", "Grip1", "Tspyl4", "Cplx1")
# Medium Spiny neuron
CTX_TH_MSN_genes <- c("Rarb", "Snhg14", "Rian", "Rgs9", "Ryr3", "Gnal")
Pallium_spiny_genes <- c("Ppp1r1b")
# Cajal-Retzius cells
CTX_HIP_CRC_genes <- c("Ndnf", "Reln", "Ramp1", "Thsd7b", "Fat3", "Kcnb2")
# Excitary projection neurons
EP_genes <- c("Tbr1", "Neurod6", "Satb2")
# Hypothalamus
HY_genes <- c("Nkx2-1", "Sim1", "Lhx6", "Lhx8")
# Thalamus
TH_genes <- c("Tcf7l2", "Six3", "Plekhg1")
# Midbrain and Pontine neurons
MBP_genes <- c("Otx2", "Gata3", "Pax5", "Pax7", "Sox14")


# Granular cell types - Neurotransmitters GABAergic neurons
GABAergic_genes <- c("Slc32a1", "Gabra1", "Gad2", "Galntl6", "Grip1", "Gbx1")
# Glutamergic neurons
Glutamate_genes <- c("Slc17a6", "Slc17a7", "Slc17a8", "Arpp21", "Car10", "Sv2b",
    "Camk4", "Pcsk2", "Brinp1")
# Cholinergic neurons
Choline_genes <- c("Chat", "Slc18a3", "Tbx20", "Chrnb3", "Nppb", "Ush1c", "Npy4r")
# Dopaminergic neurons
Dopa_genes <- c("Slc6a3", "Chrna6", "Gucy2c", "Aldh1a7", "Tdh")
# Serotonin neurons
Sero_genes <- c("Slc6a4", "Tph2")  #not in this section
# Noradrenergic neurons
Noradren_genes <- c("Slc6a2", "Dbh")  #not in this section
# Use oligo, astro, endo cells as non-neuronal cells

Session Information

To keep your analysis reproducible for yourself or others in the future, it is important to know which version of R you used, which packages you used, and the version of each package. Therefore, a good method to keep track of this is to use the sessionInfo() function. You should always output this at the end of your sessions.

sessionInfo()

Session Info

## R version 4.4.1 (2024-06-14) ## Platform: aarch64-apple-darwin20 ## Running under: macOS Ventura 13.6.7 ## ## Matrix products: default ## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib ## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0 ## ## locale: ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 ## ## time zone: America/Chicago ## tzcode source: internal ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] scater_1.32.0 scuttle_1.14.0 ## [3] scRNAseq_2.18.0 SingleCellExperiment_1.26.0 ## [5] SingleR_2.6.0 SummarizedExperiment_1.34.0 ## [7] Biobase_2.64.0 GenomicRanges_1.56.1 ## [9] GenomeInfoDb_1.40.1 IRanges_2.38.0 ## [11] S4Vectors_0.42.0 BiocGenerics_0.50.0 ## [13] MatrixGenerics_1.16.0 matrixStats_1.3.0 ## [15] SeuratWrappers_0.3.5 Seurat_5.1.0 ## [17] SeuratObject_5.0.2 sp_2.1-4 ## [19] plyr_1.8.9 patchwork_1.2.0 ## [21] ggplot2_3.5.1 dplyr_1.1.4 ## [23] flexdashboard_0.6.2 knitr_1.47 ## ## loaded via a namespace (and not attached): ## [1] ProtGenerics_1.36.0 spatstat.sparse_3.1-0 ## [3] bitops_1.0-7 webshot_0.5.5 ## [5] httr_1.4.7 RColorBrewer_1.1-3 ## [7] tools_4.4.1 sctransform_0.4.1 ## [9] alabaster.base_1.4.1 utf8_1.2.4 ## [11] R6_2.5.1 HDF5Array_1.32.0 ## [13] lazyeval_0.2.2 uwot_0.2.2 ## [15] rhdf5filters_1.16.0 withr_3.0.0 ## [17] gridExtra_2.3 progressr_0.14.0 ## [19] cli_3.6.3 formatR_1.14 ## [21] spatstat.explore_3.2-7 fastDummies_1.7.3 ## [23] labeling_0.4.3 alabaster.se_1.4.1 ## [25] sass_0.4.9 spatstat.data_3.1-2 ## [27] ggridges_0.5.6 pbapply_1.7-2 ## [29] Rsamtools_2.20.0 R.utils_2.12.3 ## [31] parallelly_1.37.1 limma_3.60.3 ## [33] rstudioapi_0.16.0 RSQLite_2.3.7 ## [35] generics_0.1.3 BiocIO_1.14.0 ## [37] ica_1.0-3 spatstat.random_3.2-3 ## [39] Matrix_1.7-0 ggbeeswarm_0.7.2 ## [41] fansi_1.0.6 abind_1.4-5 ## [43] R.methodsS3_1.8.2 lifecycle_1.0.4 ## [45] edgeR_4.2.0 yaml_2.3.8 ## [47] rhdf5_2.48.0 SparseArray_1.4.8 ## [49] BiocFileCache_2.12.0 Rtsne_0.17 ## [51] grid_4.4.1 blob_1.2.4 ## [53] dqrng_0.4.1 promises_1.3.0 ## [55] ExperimentHub_2.12.0 crayon_1.5.3 ## [57] miniUI_0.1.1.1 lattice_0.22-6 ## [59] beachmat_2.20.0 cowplot_1.1.3 ## [61] GenomicFeatures_1.56.0 KEGGREST_1.44.1 ## [63] metapod_1.12.0 pillar_1.9.0 ## [65] rjson_0.2.21 future.apply_1.11.2 ## [67] codetools_0.2-20 leiden_0.4.3.1 ## [69] glue_1.7.0 data.table_1.15.4 ## [71] remotes_2.5.0 vctrs_0.6.5 ## [73] png_0.1-8 gypsum_1.0.1 ## [75] spam_2.10-0 gtable_0.3.5 ## [77] cachem_1.1.0 xfun_0.45 ## [79] S4Arrays_1.4.1 mime_0.12 ## [81] survival_3.7-0 bluster_1.14.0 ## [83] statmod_1.5.0 fitdistrplus_1.1-11 ## [85] ROCR_1.0-11 nlme_3.1-165 ## [87] bit64_4.0.5 alabaster.ranges_1.4.1 ## [89] filelock_1.0.3 RcppAnnoy_0.0.22 ## [91] bslib_0.7.0 irlba_2.3.5.1 ## [93] vipor_0.4.7 KernSmooth_2.23-24 ## [95] colorspace_2.1-0 DBI_1.2.3 ## [97] ggrastr_1.0.2 tidyselect_1.2.1 ## [99] bit_4.0.5 compiler_4.4.1 ## [101] curl_5.2.1 httr2_1.0.1 ## [103] BiocNeighbors_1.22.0 hdf5r_1.3.10 ## [105] DelayedArray_0.30.1 plotly_4.10.4 ## [107] rtracklayer_1.64.0 scales_1.3.0 ## [109] lmtest_0.9-40 rappdirs_0.3.3 ## [111] stringr_1.5.1 digest_0.6.36 ## [113] goftest_1.2-3 spatstat.utils_3.0-5 ## [115] alabaster.matrix_1.4.1 rmarkdown_2.27 ## [117] XVector_0.44.0 htmltools_0.5.8.1 ## [119] pkgconfig_2.0.3 sparseMatrixStats_1.16.0 ## [121] highr_0.11 dbplyr_2.5.0 ## [123] fastmap_1.2.0 ensembldb_2.28.0 ## [125] rlang_1.1.4 htmlwidgets_1.6.4 ## [127] UCSC.utils_1.0.0 shiny_1.8.1.1 ## [129] DelayedMatrixStats_1.26.0 farver_2.1.2 ## [131] jquerylib_0.1.4 zoo_1.8-12 ## [133] jsonlite_1.8.8 BiocParallel_1.38.0 ## [135] R.oo_1.26.0 BiocSingular_1.20.0 ## [137] RCurl_1.98-1.14 magrittr_2.0.3 ## [139] GenomeInfoDbData_1.2.12 dotCall64_1.1-1 ## [141] Rhdf5lib_1.26.0 munsell_0.5.1 ## [143] Rcpp_1.0.12 viridis_0.6.5 ## [145] reticulate_1.38.0 stringi_1.8.4 ## [147] alabaster.schemas_1.4.0 zlibbioc_1.50.0 ## [149] MASS_7.3-61 AnnotationHub_3.12.0 ## [151] parallel_4.4.1 listenv_0.9.1 ## [153] ggrepel_0.9.5 deldir_2.0-4 ## [155] Biostrings_2.72.1 splines_4.4.1 ## [157] tensor_1.5 locfit_1.5-9.10 ## [159] igraph_2.0.3 spatstat.geom_3.2-9 ## [161] RcppHNSW_0.6.0 reshape2_1.4.4 ## [163] ScaledMatrix_1.12.0 BiocVersion_3.19.1 ## [165] XML_3.99-0.16.1 evaluate_0.24.0 ## [167] scran_1.32.0 BiocManager_1.30.23 ## [169] httpuv_1.6.15 RANN_2.6.1 ## [171] tidyr_1.3.1 purrr_1.0.2 ## [173] polyclip_1.10-6 alabaster.sce_1.4.0 ## [175] future_1.33.2 scattermore_1.2 ## [177] rsvd_1.0.5 xtable_1.8-4 ## [179] restfulr_0.0.15 AnnotationFilter_1.28.0 ## [181] RSpectra_0.16-1 later_1.3.2 ## [183] viridisLite_0.4.2 tibble_3.2.1 ## [185] beeswarm_0.4.0 memoise_2.0.1 ## [187] AnnotationDbi_1.66.0 GenomicAlignments_1.40.0 ## [189] cluster_2.1.6 globals_0.16.3

For more info about us click here.

1 scRNA: Processing Basics

Introduction to Seurat for scRNA processing.

What we're covering

Helpful R tip

R Library Basics

Let's get started with loading R packages!

Set working directory

Data Downloads

1. Create Seurat Objects from 10x Matrix output (.mtx)

Reading in the object

Initialize object

2. Basic quality control

Mitochondrial genes percentage

Other biological features

a) Ribosomal genes percentage

b) ratio of genes and umi

c) Mitochondrial and Ribosomal genes percentage

Example structure

Visualize the QC

Violin Plots

create seurat with QC (min.cells = 3, min.features = 200)

Save the raw and filtered rds data

3. Data Normalization

SCTransform

Variable genes

Linear dimensional reduction

Determining the best dims

4. Clustering

Resolution

Differentially Expressed Genes

Save the your rds object or save the environment

5. Cell Type Annotation

5.1. Manual assignment/refine based on top genes (Canonical Genes)

Visualizing marker genes

a) Feature Plot

b) Violin Plot

c) Dot Plot

Naming Cluster

5.2. Automatic assignment based on top genes

6. Commonly Used Markers in Single-Cell Analysis

Session Information

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally