-
Notifications
You must be signed in to change notification settings - Fork 3
1 scRNA: Processing Basics
This is an R Markdown document that was designed to introduce and review Seurat and basic analysis of single cell RNA-seq (scRNA) data.
For this tutorial, we will be using single cells from our Human Breast Cell Atlas.
Samples were processed via 10x Genomics GEM-X Flex Gene Expression library generation.
This tutorial will review the following parts of standard scRNA-seq analysis using the Seurat package:
RNA Processing Basic workflow:
-
Creating Seurat objects from cellranger output.
-
Examining basic quality control
- number of genes
nFeature_RNA
- number of UMIs
nCount_RNA
- number of cells
- percentage of mito/ribo genes
- find debris
- find doublets (DoubletFinder, not covered in this tutorial)
- number of genes
-
Data normalization
- Account for different sequencing effort per cell
-
Clustering the cells
- Dimensionality reduction and projection onto 2D spaces
- Find clusters with different resolutions (low, medium and high)
- How to choose an appropriate clustering resolution
-
Finding differentially expressed features
- Differential gene expression between clusters
- Cluster identification
-
Assigning cell type identities to clusters
- Manual assignment based on top genes
- Automated assignment using label transfer from libraries like
SingleR
check the detailed tutorial - Find anchors between query and reference in
Seurat
using the function called TransferData
For this tutorial, we used R 4.3.3
for MacOS: R-4.3.3-arm64.pkg
- If you want to know more about a function, you
can use
?FUNCTION_NAME
to examine the inputs, whereFUNCTION_NAME
is the name of the function using the correct function capitalization
Note, this tutorial was prepared using Seurat 5
you might need to install the packages if you are running your analysis on a local computer
Most of R packages can be installed using:
## from CRAN repositories
install.packages('XXX')
##alternative ways
BiocManager::install('XXX')
remotes::install_github('XXX')
## Install multiple packages
install.packages(c("P1","P2","P3",dependencies = TRUE)
## Install packages with a specific version using:
remotes::install_version('XXX',version=1.2-3)
to start in our preset course env
# This step only needs to be run once when setting up the course environment for the first time.
cp ~mirao/.Renviron ~/
echo module load EBModules R hdf5 EBH100 CMake JAGS >> ~/.bashrc
echo module load GCC/13.3.0 >> ~/.bashrc
# otherwise, install R and RStudio into your local device
[RStudio Desktop] (https://posit.co/download/rstudio-desktop/)
# if run everything at local device
# install all the packages required for the workshop
# library install
# pre-install
install.packages('BiocManager')
library('BiocManager')
package_list <- c('gypsum','scran','BiocNeighbors','glmGamPoi','SingleR','scRNAseq','celldex','scater','scrapper')
BiocManager::install(package_list)
{
knitr::opts_chunk$set(
cache = TRUE,
cache.lazy = FALSE,
tidy = TRUE,
autodep=TRUE)
suppressPackageStartupMessages({
## basic
library(dplyr)
library(ggplot2)
library(patchwork)
library(plyr)
library(readr)
library(knitr)
library(flexdashboard)
library(tidyverse)
library(remotes)
library(gypsum)
## Seurat
library(Seurat)
library(SeuratObject)
library(SeuratWrappers)
## cell annotation
library(BiocNeighbors)
library(SingleR)
library(scRNAseq) ## reference
library(celldex)
library(scater)
library(cowplot)
## set seed
set.seed(12345)
})
}
This will need to be changed to the specific location where you have placed the data to be analyzed. You should specify the directory where you are working and set the directory.
R Functions for this section:
-
<-
BaseR
function - this is the assign function that is fairly R specific. You can also use=
, which is a more common assignment syntax in other languages. -
setwd
BaseR
function - This will set your directory to the string you provided assuming you have permissions to that location and if you have provided a valid option.
Note slash direction is due to being run on either a Mac or
Linux. If you are using a Windows machine, your slashes will be \
instead. This can be set using the .
if are unsure.
Note Ideally, you would not need to set your working directory. Your files should be together in a folder. In general, I suggest a structure of functions, reports, and data under each project folder.
# option: run the analysis via RStudio on website
input_dir <- "/grid/singlecellcourse/data/tutorials/scRNA/sample_filtered_feature_bc_matrix/"
output_dir <- "/home/YOUR_NAME/R_BASIC_FOLDER/"
dir.create(output_dir)
setwd(output_dir)
# option: run the analysis at local RStudio
# transfer the data from sever to your local path
scp -r /INPUT_FOLDER/ /YOUR_LOCAL_PATH/ # terminal
output_dir <- "/home/YOUR_NAME/R_BASIC_FOLDER/"
dir.create(output_dir)
setwd(output_dir)
Additionally, for data organization, it would be helpful to have the folder structure established before beginning the analysis.
# create the folder for figures objects and scripts if they are not existing in your output path
if (!dir.exists("figures")) dir.create("figures")
if (!dir.exists("objects")) dir.create("objects")
if (!dir.exists("scripts")) dir.create("scripts")
Before starting this tutorial, please making sure the data was downloaded to your working path.
In most cases, the easiest way is to downloaded the cell matrix files (usually in HDF5 format for R or H5AD format for python).
Download the scRNA HBCA cells from this github repo. We will do this is R using a system call to download the github permalinks
#path of count matrix
#this should make three new files in your directory
list.files()
You can read in data from Read10X (h5) / read_h5ad (h5ad)
format
Or, from a more basic output of Read10X
, which takes the input of a directory with three files matrix.mtx
, barcodes.tsv
, and genes.tsv
.
Functions for this section:
# load data from input directory
# input_dir <- "/grid/singlecellcourse/data/tutorials/scRNA/sample_filtered_feature_bc_matrix"
hbca <- Read10X(input_dir)
hbca[1:10, 1:10]

dim(hbca)
## [1] 18082 9751
Note We call these "features", not "genes" as a more generic term because we use a very similar object structure for other data modalities, such as when you have antibodies or ATAC peaks.
Here, we choose how much of the raw data we want to work with by setting specific limits.
Here, we state that we require a gene to be present in at least min_cells
cells and each cell is required to have at least min_features
features or genes.
Functions:
CreateSeuratObject
SeuratObject function - this creates a Seurat object from the raw data. We also can set the min cells and min features/genes. For completely raw results you would set both of these arguments to zero.
Note The min cells and min features listed below are good starting values, but these should be set and determined experimentally or informed by pre-existing information you have about the data. For example, for an immune-rich data set, frozen, or nuclei samples, we would lower the threshold.
-
For
min.cells
, a good starting place is between 3-5 cells. However, sometimes it makes sense to not include this threshold at all for a single sample and only apply it when you have multiple samples. -
For
min.features
(fresh single cells), a good starting place for fresh single cell samples is between 100 to 200. Lean towards lower bound for immune heavy samples where you have selected for immune cells in your experiment. If you have very viable, and therefore very good samples, having this set at the higher end makes sense. This number is also going to be impacted by sequencing coverage and depth. I would usually start at 200 for this threshold. -
For
min.features
(frozen or single nuclei), a good starting place for frozen or nuclei samples is between 50 to 150. Lean towards lower bound for immune heavy samples where you have selected for immune cells in your experiment. I would usually start at 100 for these. -
Alternative method of thresholding Other people may choose their ranges using the standard dev or median absolute deviation. However, you need to be careful with these. Usually they do not help set the lower bound of the threshold so much as the upper bound.
## Turn count matrix into a Seurat object
hbca_raw <- CreateSeuratObject(counts = hbca, project = "hbca_p85", min.cells = 0, min.features = 0) ## we load a processed matrix without any filtering
hbca_raw
# An object of class Seurat
# 18082 features across 9751 samples within 1 assay
# Active assay: RNA (18082 features, 0 variable features)
# 1 layer present: counts
Note
Examine the meta data where QC or cell specific information could be stored
Functions:
-
head
function from the utils package (fairly basic R package). This results the first part of a data frame, table, matrix or vector. -
tail
function is similar to the head function, but starts from the bottom instead of the top
This specifically examines the meta data in the Seurat object and looks at top 5 items.
If you wanted to look at the whole thing you can use View
, but it can crash R if it is too large.
# Explore merged metadata
head([email protected],10)
tail([email protected],10)
# View([email protected])

Below, we add the percentage of a specific feature to our meta data. We are
assigning the Seurat object a meta feature called percent.mt
for
percent of mitochondrial genes.
Functions: 1. PercentageFeatureSet
- Seurat specific function which
uses regular expression patterns to select specific features to calculate the percentage of those features making up a cell's
gene expression.
R regular expressions guide
-
^
at the beginning of the regular expression notes that this should be the first character of the feature -
RP[SL]
In this expression, “RP” matches genes that encode for ribosomal proteins; “S” or “L” specifies the subunit of the ribosome -
:digit:
matches any digit -
|
this can be used for "or" expressions
Here is the example with mitochondrial genes percentage, which is often used to assess sample quality with a high fraction of apoptotic or lysing cells. In this case, high mitochondrial gene percentage is considered poor quality.
# mitochondrial ratio
hbca_raw[["percent.mt"]] <- PercentageFeatureSet(hbca_raw, pattern = "^MT-")
# use pattern = "^mt-" for mouse
Ribosomal protein genes are used in the filtering of single-cell RNA sequencing data to assess the quality of the data and remove poor-quality cells. The proportion of ribosomal protein reads can be an indicator of RNA degradation. High ribosomal protein content can be associated with specific cell types, such as immune cells.
However, the demonstration data we are using has already filtered out all the Ribosomal and Mitochondrial Genes
hbca_raw[["percent.rb"]] <- PercentageFeatureSet(hbca_raw, pattern = "^RP[SL][[:digit:]]|^RPLP[[:digit:]]|^RPSA")
# pattern = "^Rp[SL][[:digit:]]|^Rplp[[:digit:]]|^Rpsa") for mouse
Important
To break this pattern down in plain language:
^RP[SL]:digit:|^RPLP:digit:|^RPSA
means
IF a gene name starts with RP then has an S or an L (i.e. RPS... RPL...) then has a number (RPS1..RPS2..RPL1...RPL2..)
OR (|) IF a gene names starts with RPLP and has a digit (i.e. RPLP1..RPLP2..)
OR (|) IF a gene name starts with RPSA
count that gene as part of the percentage feature set
This produces a ratio that is related to how much UMI saturation there is per gene
hbca_raw[["umi_gene_ratio"]] <- log10(hbca_raw@meta.data$nCount_RNA/hbca_raw@meta.data$nFeature_RNA)
Add the Mitochondrial and Ribosomal genes percentage together
hbca_raw[["mt_rb"]] <- hbca_raw@meta.data$percent.mt + hbca_raw@meta.data$percent.rb
It is also good to check the summary for the items before you move on.
summary(hbca_raw@meta.data)

One of the first items you want to do when looking at your data is to visualize the quality of the samples.
Visualize QC metrics as a violin plot
Functions:
-
VlnPlot
- Seurat specific function that shows violin plots of single cell data. This can be anything from QC metrics, meta data, scores, or single or combined gene expression values Note you can have multiple items, including differentncols
.
When you examine the violin plots, you may consider using some the standard deviations or median absolute deviations as cut-offs. Alternatively, strict pre-defined cut-offs are possible, but they are often very arbitrary and likely need to vary by sample
As discussed previously:
-
nfeature: represents the number of unique genes detected in each cell
-
nCount: represents the total number of molecules (UMIs) detected within a cell
-
percent mt: percent of mitochondrial genes
-
percent rb: percent of ribosomal genes
-
mt rb: percent of mitochondrial genes and ribosomal genes together
-
umi gene ratio: represents the number of genes detected per UMI (unique molecular identifier) for each cell. A higher UMI gene ratio indicates that more genes are detected per UMI, which is generally considered a good indicator of data quality. However, it can also show duplicates.
# plot the QC VlnPlot
VlnPlot(hbca_raw, features = c("nFeature_RNA", "nCount_RNA", "umi_gene_ratio", "percent.mt"), ncol = 2)
Filtering criterion:
- Genes detected in < 3 cells were excluded
- Cells with < 200 total detected genes were excluded
- Cells with >= 20% of mitochondria-expressed genes were excluded. High mtRNA content may indicate apoptosis.
#create seurat with QC
hbca_filtered <- CreateSeuratObject(counts = hbca, project = "hbca_p85", min.cells = 3, min.features = 200)
hbca_filtered
# An object of class Seurat
# 16074 features across 9725 samples within 1 assay
# Active assay: RNA (16074 features, 0 variable features)
# 1 layer present: counts
# basic QC
# mitochondrial ratio
hbca_filtered[["percent.mt"]] <- PercentageFeatureSet(hbca_filtered, pattern = "^MT-")
# ration of gene ans umi
hbca_filtered[["umi_gene_ratio"]] <- log10(hbca_filtered@meta.data$nCount_RNA/hbca_filtered@meta.data$nFeature_RNA)
hbca_filtered <- subset(x = hbca_filtered, subset =percent.mt < 20)
# Addition way to filter low quality cells (for FLEX, NOT for snRNA, you could choose the lower nFeature_RNA as initial threshold, eg: 200-500)
# hbca_filtered<- subset(hbca_filtered, subset = nFeature_RNA > 200)
# plot the QC VlnvPlot
VlnPlot(hbca_filtered, features = c("nFeature_RNA", "nCount_RNA", "umi_gene_ratio", "percent.mt"), ncol = 2)
# visualize QC_ratio flexdashboard
QC_ratio <- round(ncol(hbca_filtered)/ncol(hbca_raw) * 100, 2)
flexdashboard::gauge(QC_ratio, min = 0, max = 100, symbol = "%", gaugeSectors(success = c(50, 100),
danger = c(0, 50)))

Note Please note that there is no one-size-fits-all filtering criteria, as the normal ranges of these metrics can vary dramatically from one experiment to another, depending on sample origin as well as reagents and sequencing depths. It would be helpful to know about your sample and have the expectations before the QC filtering.
Many of the steps we have completed have taken a long time to run. To save yourself time in the future, it is a good idea to save your Seurat object as an RDS, so you can read it into your R environment again in the future, if needed.
getwd()
#setwd(output_dir)
saveRDS(hbca_raw, './objects/hbca_p85_raw.rds')
saveRDS(hbca_filtered, './objects/hbca_p85_filtered.rds')
Biological heterogeneity in single-cell RNA-seq data is often confounded by technical factors including sequencing depth, to remove this technical variability, data will be normalized.
SCTransform
normalization and variance stabilization of molecular count data. Can be adjusted by molecular items such as percent.mt
- This replaces a number of traditional normalization procedures and usually takes longer, but is preferred.
- It is based on the use of analytic Pearson residuals for normalization of single-cell RNA-seq data and was covered in this paper
- Seurat
SCTransform
was covered in this paper
Previous versions would use NormalizeData
with global-scaling normalization method LogNormalize
, as well as scaling methods ScaleData
and FindVariableFeatures
.
- These are all normalization steps to help compare data between samples and to allow parametric methods to be used by shifting the data into a normal distribution, as opposed to a skewed distribution which requires non-parametric tests
Why SCTransform
outbeats LogNormalize
is explained in seurat SCT tutorial as well as the this paper Though this is disputed in edge cases, for instance SCTransform may over-normalize rare events. A further discussion is here.
This is one of the slower functions; I suggest that you run it on a cluster if processing a lot of cells.
However, for this tutorial we are sticking to a single sample and relatively low cell counts.
Seurat provides a fairly good walk through
SCTransform replaces the following three early processing commands, which were used in the paper: specifically NormalizeData()
, ScaleData()
, and FindVariableFeatures()
Function: 1. SCTransform
- normalizes, scales, and regresses out specific variables.
-
Data is normalized via log1p(counts), changing the data shape to be more of a normal distribution.
-
The scale.data centers the data by calculating Pearson residuals and removes unwanted sources of variation from a single-cell dataset.
-
For example, we could regress out heterogeneity associated with cell cycle stage or mitochondrial contamination. In this case, we will be regressing out percentage mito.
This can be slow, and often people do not look at all features. You can change the “variable.features.n” to have more items, but currently it only examines the top features.
## this step will take a while, for our 6k dataset, it takes 2 mins
hbca_filtered <- SCTransform(hbca_filtered, vars.to.regress = 'percent.mt', do.scale = TRUE, verbose = FALSE)
DefaultAssay(hbca_filtered) ## the default assay will automatically switch from 'RNA' to 'SCT'
# save object after SCT
saveRDS(hbca_filtered,'./objects/hbca_p85_filtered_SCT.rds' )
Here, we examine the most highly variable genes.
These genes provide information about the sample or clusters. If you are looking at multiple experiments, it might show you batch effects or effects between groups. House keeping genes would not be expected to show up as the most highly variable genes, but cell type or treatment specific-genes should show up here.
top10F <- head(VariableFeatures(hbca_filtered), 10)
head(VariableFeatures(hbca_filtered), 10)
# [1] "COL3A1" "COL1A1" "SELE" "IL6" "DCN" "APOD" "MUCL1" "COL1A2" "CSF3" "SRGN"
Here, we plot the genes to examine the different features with labels.
Function:
-
VariableFeaturePlot
highlights all the variable features. This is a dot plot that shows residual variance and the generic mean.
# plot variable features with and without labels
var_plot <- VariableFeaturePlot(hbca_filtered)
labtop10_plot <- LabelPoints(plot = var_plot, points = top10F, repel = TRUE)
labtop10_plot
We need to reduce this full matrix of cell x gene
into something that is interpretable, with minimal loss of information. For this we use dimensionality reduction.
This is to group the variance of many genes into a much smaller number of groups or principle components (PC). This reduces our dimension from cell x gene
to cell x PC
.
One of the most standard methods of dimensional reduction is Principal Component Analysis PCA
Function:
-
RunPCA
This runs the PCA for dimension reduction.
### Linear dimensional reduction
hbca_filtered <- RunPCA(hbca_filtered, features = VariableFeatures(object = hbca_filtered))
# Here, we examine and visualize the PCA results by printing them out:
print(hbca_filtered[["pca"]], dims = 1:5, nfeatures = 5)
# PC_ 1
# Positive: DCN, COL6A2, TIMP2, SPARC, COL1A2
# Negative: SPINT2, CDH1, CD24, JUP, EPCAM
# PC_ 2
# Positive: CD93, PECAM1, PLVAP, VWF, CD74
# Negative: COL6A2, DCN, COL1A2, LUM, COL6A1
# PC_ 3
# Positive: TM4SF1, BCAM, SPARCL1, AQP1, GJA1
# Negative: LAPTM5, SAMSN1, RGS1, GPR183, TYROBP
# PC_ 4
# Positive: MYL9, TAGLN, ACTA2, MYH11, TPM2
# Negative: ITGAV, ELOVL5, AREG, MLPH, SHB
# PC_ 5
# Positive: CXCL16, CTSB, CSF2RA, SFRP1, MSR1
# Negative: CD96, TRBC2, CNOT6L, CCL5, SPOCK2
Here, we visualize the PCA results for the first 2 dimensions, looking at those associated with reduction components.
Function:
-
VizDimLoadings
This examines the top genes associated with reduction components.
VizDimLoadings(hbca_filtered, dims = 1:2, reduction = "pca")
This next visualization is the most standard method of looking at these items.
Function:
-
DimPlot
Plots the PCA for the first two components.
DimPlot(hbca_filtered, reduction = "pca") + NoLegend()
This next visualization examines the top differentially expressed genes in the first nine Principle components.
Function:
-
DimHeatmap
this shows the different genes associated with PCs expressed in cells, allowing for easy exploration of the primary sources of heterogeneity in a datase. It can be useful when trying to decide which PCs to include for further downstream analyses
DimHeatmap(hbca_filtered, dims = 1:9, cells = 500, balanced = TRUE)
#?try DimHeatmap(hbca_filtered, dims = 1:20, cells = 500, balanced = TRUE)
ElbowPlot can also be used for selecting the best number of dims to be used. If you examine the curve for a single sample, you might see more of slow progression.
Function: 1. ElbowPlot
this plots the standard dev of the PCA to try.
You can change which type of reduction you use too.
In theory, you want to select where it starts to bend .
ElbowPlot(hbca_filtered, ndims = 50)
Here, we find the neighbors using the selected number of dims.
Function:
1.FindNeighbors
embed cells in SNN graph that calculated based on the euclidean distance in PCA space with previously defined dimensionality of the dataset (first 20 PCs)
# set up the dim number
dim <- 15
hbca_filtered <- FindNeighbors(hbca_filtered, dims = 1:dim)
## Computing nearest neighbor graph
## Computing SNN
We can also try multiple resolutions to find the desire cluster
low_res <- 0.1
mid_res <- 0.4
high_res <- 0.8
Find the clusters for the three different resolutions. Function:
-
FindClusters
will serach the cell groups on KNN graph using Louvain algorithm (default) and output the number of clusters based on the resolutions defined above, with increased resolution values leading to a greater number of clusters.
hbca_filtered <- FindClusters(hbca_filtered, resolution = c(low_res, mid_res, high_res))
Create the Umap visualization based on the paper dims.
Note This function has a lot of arguments that can be adjusted, such as spread.
Note Changing the value of the "seed.use" argument can change the outcome.
Note Other parameters that can be used to optimize your UMAP include, but are not limited to, n.neighbors
, n.epochs
,
min.dist
and spread
.
You can used functions like group.by
to provide splitting.
hbca_filtered <- RunUMAP(hbca_filtered, dims=1:dim)
DimPlot(hbca_filtered, reduction = "umap", group.by = paste0("SCT_snn_res.", c(low_res, mid_res, high_res)),label = T)
One of the major parts of single cell analysis is identifying the Differential Expression (DE) between clusters or groups.
In this case, we can examine the DE analysis between 2 group.
The default is examining a single cluster against all other clusters.
## Choose the resolution for DE
Idents(hbca_filtered) <- hbca_filtered$SCT_snn_res.0.1
## Find markers for each cluster. Will take long time
hbca.markers <- FindAllMarkers(hbca_filtered, only.pos = TRUE)
hbca.markers %>%
group_by(cluster) %>%
dplyr::filter(avg_log2FC > 1) %>%
slice_head(n = 5) %>%
ungroup() -> top5
DoHeatmap(object = hbca_filtered, features = top5$gene) +theme(axis.text.y = element_text(size = 8, angle = 45, hjust = 1))
getwd()
# setwd(output_dir)
saveRDS(hbca_filtered,'./objects/hbca_p85_filtered_cluster.rds' )
## Load the rds object back
# hbca_filtered <- readRDS(file = './objects/hbca_p85_filtered_cluster.rds')
Often, it is good to have some genes that you expect to be high in specific clusters based on literature.
Good resources for this include:
Note Expected genes are likely to depend on your cell types, organ type, species, and whether you are looking at single cell or single nuclei.
However, it is good to have an idea of what you want to look at first. Not so much as a guide, but more as a sanity check.
For example, if you do FACs sorting to enrich for CD45+ cells, you expect that you will have enrichment for various types of white blood cells (WBCs), particularly T cells and B cells. Therefore, it is good to have both markers that you expect to be present and those you expected to be absent. Since we sorted for CD45+ cells, we would expect markers for WBCs to be enriched and those for endothelial and epithelial cells to be decreased or absent.
check the data quality for each cluster
VlnPlot(hbca_filtered, features = c("nCount_RNA","nFeature_RNA","percent.mt"), group.by= "SCT_snn_res.0.1",pt.size = 0)
SCT01 <- DimPlot(hbca_filtered, reduction = "umap", group.by=paste0("SCT_snn_res.", c(low_res)) , label = TRUE, repel=TRUE) + theme(legend.position = "none")
SCT01
hbca_markers <- list(
'Basal'=c('KRT14','KRT17','DST'),
'LumHR'=c('AREG','ERBB4','AGR2'),
'LumSec'=c('SCGB2A2','SLPI','KRT15'),
'Endo'=c('SELE','FABP4','CSF3'),
'Pericytes'=c('RGS5','MT1A','IGFBP5'),
'B cells'=c('MZB1','CD79A','DERL3'),
'Fibro'=c('DCN','APOD','COL1A1'),
'T cells'=c('CD3D','CD3E','CD2'),
'Myeloid'=c('CD68','C1QC','MS4A6A'),
'Mast'=c('IL18R1','HPGD','HDC')
)
library(knitr)
df <- data.frame(hbca_markers)
kable(df)

Note
Markers are from Kumar et al, Nature, 2023
To get a better idea of cell type identity, we can explore the markers by cluster using the FeaturePlot()
function.
This plots a umap
, but colors the item based on the number.
This function has many arguments that can be adjusted, such as the color scale (cols) or which features are plotted.
# visualize expression level of 3 cell type markers on umap
FeaturePlot(hbca_filtered,
feature=c('SLPI','APOD','CD3D','CD68'),
ncol = 2,
label=TRUE)
Note We could loop through all these markers to generate FeaturePlots and save them as PDF files for easy reference later.
pdf(paste0("./figures/hbca_p85_FeaturePlot_hbca_markers.pdf"), width = 10, height = 10)
# Loop through each cell type and its markers
for (cell_type in names(hbca_markers)) {
markers <- hbca_markers[[cell_type]]
# Create a plot for each marker
plots <- lapply(markers, function(marker) {
FeaturePlot(hbca_filtered, features = marker, cols = c("lightgrey", "blue"), reduction = "umap") +
ggtitle(paste(cell_type, "-", marker))
}
)
# Combine the plots into one grid
combined_plot <- plot_grid(plotlist = plots, ncol = 2)
# Print the combined plot to the PDF
print(combined_plot)
}
# Close the PDF
dev.off()
VlnPlot()
function allow to visualize the distribution of a numeric variable for one or several groups.
# visualize the distribution of 3 cell type markers across clusters
VlnPlot(hbca_filtered,feature=c('SLPI','APOD','CD3D','CD68'), pt.size=0, ncol = 2)
Note We could loop through all these markers to generate VlnPlots and save them as PDF files for easy reference later.
pdf(paste0("./figures/hbca_p85_VlnPlot_hbca_markers.pdf"), width = 10, height = 10)
# Loop through each cell type and its markers
for (cell_type in names(hbca_markers)) {
markers <- hbca_markers[[cell_type]]
# Create a violin plot for each marker
plots <- lapply(markers, function(marker) {
VlnPlot(hbca_filtered, features = marker, pt.size = 0) +
ggtitle(paste(cell_type, "-", marker)) +
theme_minimal()
}
)
# Combine the plots into one grid
combined_plot <- plot_grid(plotlist = plots, ncol = 2)
# Print the combined plot to the PDF
print(combined_plot)
}
# Close the PDF
dev.off()
Seurat also has a built in visualization function DotPlot()
DotPlot is a intuitive way of visualizing how feature expression changes across different clusters. The size of the dot encodes the percentage of cells within a class, while the color encodes the AverageExpression level across all cells within a class (blue is high).
# Plot the hbca markers across clusters
DotPlot(hbca_filtered,hbca_markers,cluster.idents = TRUE, group.by= "SCT_snn_res.0.1") + RotatedAxis()
Now taking all of this information, we can surmise the cell types of the different clusters. You can manually assign cluster names using the features.
# Rename all identities
hbca_filtered <- RenameIdents(object = hbca_filtered,
"0" = "LumSec",
"1" = "Fibro",
"2" = "LumHR",
"3" = "Endo",
"4" = "Pericytes",
"5" = "T cells",
"6" = "Basal",
"7" = "Mye",
"8" = "B cells",
'9'='Mast')
# Plot the UMAP
DimPlot(object = hbca_filtered,
reduction = "umap",
label = TRUE,
label.size = 6) + NoLegend()
# Add celltype annotation as a column in meta.data
hbca_filtered$celltype <- Idents(hbca_filtered)
We will use SingleR to automatically annotate our clusters. SingleR compares cells in a new dataset against curated reference profiles of known cell types, assigning each new cell to the reference type that its expression profile is most similar to.
The reference we are using is [human bulk RNA-seq data]: 259 RNA-seq samples of pure stroma and immune cells as generated and supplied by Blueprint and ENCODE Other ref resources:
remotes::install_github('satijalab/azimuth')
Based on the top genes information from human breast cancer bulk seq data.
library(SingleR)
ref <- celldex::BlueprintEncodeData()
# ?celldex::BlueprintEncodeData() ## check doc info
singleR_input <- GetAssayData(hbca_filtered, assay = "RNA", layer ="counts")
# Run SingleR annotation function
hbca_main_cluster <- SingleR(test = singleR_input,
ref = ref,
labels = ref$label.main,
clusters = [email protected]$SCT_snn_res.0.1)
hbca_filtered$singleR_hbca_main <- hbca_main_cluster$labels[match(([email protected]$SCT_snn_res.0.1), rownames(hbca_main_cluster))]
plot_automated <- DimPlot(hbca_filtered, label = TRUE, repel = TRUE, reduction = "umap", group.by = "singleR_hbca_main")
plot_manual <- DimPlot(hbca_filtered, label = TRUE, repel = TRUE, reduction = "umap", group.by = "celltype")
plot_manual + plot_automated
Finally you can save your data!
## save object
setwd(output_dir)
saveRDS(hbca_filtered, "./objects/hbca_p85_ann.rds")
## Alternatively, you can save the whole R environment(included the all objects and libraries).
save(list = ls(),file = './objects/hbca_p85_flex.Rdata')
## Load your saved R environment
# load('./objects/hbca_p85_flex.Rdata')
Human Breast Cancer canonical Markers
# Non Immune
# Endothelial
Endo_genes <- c("CDH5", "VWF", "PECAM1", "ESAM", "THBD", "CD36", "CD34", "HEG1")
# Epithelial
Epi_genes <- c("EPCAM", "KRT5", "KRT17", "KRT18", "KRT19", "KRT8", "KRT14") #'KRT6'
# Basal
Basal_genes <- c("KRT14", "KRT17", "KRT5")
# LumHR
LumHR_genes <- c("KRT18", "KRT8", "KRT19")
# LumHR
LumSec_genes <- c("KRT15", "KRT23", "KRT16", "KRT7")
keratins <- c(Basal_genes, LumHR_genes, LumHR_genes)
# Adipose
Adipo_genes <- c("APMAP", "ADIPOQ", "ADIPOQ-AS1", "TPRA1")
# Fibroblast Genes
Fibro_genes <- c("COL1A1", "LUM", "FBLN1", "DCN", "FBN1", "COL1A2", "PCOLCE")
non_immune_genes <- c(Endo_genes, Epi_genes, Adipo_genes, Fibro_genes)
# Immune and lymphocytes
# T-cells
Tcell_genes <- c("CD3D", "CD4", "CD8A", "TRAC", "TRBC1", "TRBC2", "IL7R", "CD96")
# B-cells
Bcell_genes <- c("CD19", "MS4A1", "CD79A", "CD79B", "BANK1", "SEL1L3", "IGHM", "IGLC3")
# NK killer cells
NKill_genes <- c("NKG7", "GNLY", "CTSW")
# General Immune genes
Immun_genes <- c("PTPRC", "SRGN", "TYROBP") # double check
lymphocytes_cluster_genes <- c(Tcell_genes, Bcell_genes, NKill_genes, Immun_genes)
# Myeloid
myeloid_genes <- c("FCER1G", "ITGAM", "FCGR3A") # the CD1c+ myeloid DCs, the CD141+ myeloid DCs and the CD303+
# Red Blood Cells
RedBld_genes <- c("HBA1", "HBA2", "HBB")
# Macrophages
Macro_genes <- c("CD14", "CD68", "CD163", "C1QA", "C1QB", "MRC1", "MSR1", "MS4A7")
# Monocytes
Monoc_genes <- c("LST1", "FCGR3A", "AIF1", "MAFB", "MS4A6A", "ANPEP", "LYZ")
# Dendrite genes
Dendr_genes <- c("LILRA4", "IRF7", "CLEC9A", "CD1C", "TCF4")
myeloid_cluster_genes <- c(myeloid_genes, RedBld_genes, Macro_genes, Monoc_genes,
Dendr_genes)
Commonly Used Canonical Markers: Mouse Brainspan
brainspan_markers=list()
brainspan_markers[["GABAergic"]]=c("Gad1","Gad2","Slc32a1") ## **
brainspan_markers[["Glutamatergic"]]=c("Slc17a7","Lamp5","Rorb","Fezf2") ## **
brainspan_markers[["Granule"]]=c("Zic1","Neurod1","Etv1")
brainspan_markers[["Astrocytes"]]=c("Aqp4","Aldh1l1","Gfap")
brainspan_markers[["OPC"]]=c("Pdgfra","Cspg4")
brainspan_markers[["Oligodendrocyte"]]=c("Opalin","Prox1","Olig1","Mbp")
brainspan_markers[["Microglia"]]=c("Fyb","C1qa","C1qc")
brainspan_markers[["Endothelial"]]=c("Cldn5")
brainspan_markers
Commonly Used Canonical Markers: Mouse Brain Canonical Marker
# Neural
Neural_genes <- c("Cadm2", "Pcdh9", "Syt1", "Dlg2", "Dlgap1", "Negr1", "Fgf14")
# Oligodendrocyte
OLigo_genes <- c("Mbp", "Plp1", "Enpp2", "Mag", "Mog", "Cldn11", "Mal")
# Astrocytes
Astro_genes <- c("Slc39a12", "Pla2g7", "Htra1", "Ptprz1", "Ttyh1", "Prdx6")
# Microglial
Microglial_genes <- c("Hexb", "C1qa", "C1qb", "C1qc", "Lgmn", "Csf1r", "Ctss")
# Purkinje
Purkinje_genes <- c("Auts2", "Foxp2", "Ebf1", "Grid2", "Dner", "Khdrsb2")
# Ependymal
Ependymal_genes <- c("Dynlrb2", "Tmem212", "Rsph1", "Foxj1", "Rarres2", "Tmem47")
# Hypendymal
Hypendymal_genes <- c("Rmst", "Epha7", "Zfhx4", "Efnb3", "Zic4", "Wls")
# Choroid
Choroid_genes <- c("Rbm47", "Htr2c", "Enpp2", "Ttr", "Fras1", "Glis3")
# Endothelial Mural
Mural_genes <- c("Adap2", "Pdgfrb", "Pde8b", "Myo1b", "Rbpms", "Dlc1")
# Endothelial Stalk
Stalk_genes <- c("Slco1a4", "Adgrl4", "Abcg2", "Nostrin", "Cldn5", "Ptprb")
# Granular neuron types - Location (Suggest to just use Cerbral cortex,
# thalamus and hypothalamus markers) Cerebellar granule cell - Cerebellum
CBgranular_genes <- c("Kcnd2", "Grik2", "Fgf14", "Chn2", "Tenm1", "Olfm3", "Astn2") #Not present in this section
# Cerebellar golgi cell - Cerebellum
CBgolgi_genes <- c("Sgcd", "Lgi2", "Megf11", "Elavl2", "Cacna1d", "Fgf13", "Cdh4") #Not present in this section
# Unipolar brush cell - Cerebral cortex and Cochlear nucleus
CTX_CN_UBC_genes <- c("Unc5c", "Tmem108", "Nnat", "Marchf1", "Ccser1", "Sorbs2")
# Cerebral cortex neuron
CTXneuron_genes <- c("Vamp2", "Selenow", "Grin2b", "Snap25", "Atp6v1g2", "Norad",
"Syp")
# Hippocamal neuron
HIPneuron_genes <- c("Ncdn", "Ppfia2", "Rfx3", "Ahcyl2", "Epha7")
# Cortical Interneuron
CTXInterneuron_genes <- c("Galntl6", "Gad2", "Nap1l5", "Grip1", "Tspyl4", "Cplx1")
# Medium Spiny neuron
CTX_TH_MSN_genes <- c("Rarb", "Snhg14", "Rian", "Rgs9", "Ryr3", "Gnal")
Pallium_spiny_genes <- c("Ppp1r1b")
# Cajal-Retzius cells
CTX_HIP_CRC_genes <- c("Ndnf", "Reln", "Ramp1", "Thsd7b", "Fat3", "Kcnb2")
# Excitary projection neurons
EP_genes <- c("Tbr1", "Neurod6", "Satb2")
# Hypothalamus
HY_genes <- c("Nkx2-1", "Sim1", "Lhx6", "Lhx8")
# Thalamus
TH_genes <- c("Tcf7l2", "Six3", "Plekhg1")
# Midbrain and Pontine neurons
MBP_genes <- c("Otx2", "Gata3", "Pax5", "Pax7", "Sox14")
# Granular cell types - Neurotransmitters GABAergic neurons
GABAergic_genes <- c("Slc32a1", "Gabra1", "Gad2", "Galntl6", "Grip1", "Gbx1")
# Glutamergic neurons
Glutamate_genes <- c("Slc17a6", "Slc17a7", "Slc17a8", "Arpp21", "Car10", "Sv2b",
"Camk4", "Pcsk2", "Brinp1")
# Cholinergic neurons
Choline_genes <- c("Chat", "Slc18a3", "Tbx20", "Chrnb3", "Nppb", "Ush1c", "Npy4r")
# Dopaminergic neurons
Dopa_genes <- c("Slc6a3", "Chrna6", "Gucy2c", "Aldh1a7", "Tdh")
# Serotonin neurons
Sero_genes <- c("Slc6a4", "Tph2") #not in this section
# Noradrenergic neurons
Noradren_genes <- c("Slc6a2", "Dbh") #not in this section
# Use oligo, astro, endo cells as non-neuronal cells
To keep your analysis reproducible for yourself or others in the future, it is important to know which version of R you used, which packages you used, and the version of each package. Therefore, a good method to keep track of this is to use the sessionInfo() function. You should always output this at the end of your sessions.
sessionInfo()
Session Info
## R version 4.4.1 (2024-06-14) ## Platform: aarch64-apple-darwin20 ## Running under: macOS Ventura 13.6.7 ## ## Matrix products: default ## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib ## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0 ## ## locale: ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 ## ## time zone: America/Chicago ## tzcode source: internal ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] scater_1.32.0 scuttle_1.14.0 ## [3] scRNAseq_2.18.0 SingleCellExperiment_1.26.0 ## [5] SingleR_2.6.0 SummarizedExperiment_1.34.0 ## [7] Biobase_2.64.0 GenomicRanges_1.56.1 ## [9] GenomeInfoDb_1.40.1 IRanges_2.38.0 ## [11] S4Vectors_0.42.0 BiocGenerics_0.50.0 ## [13] MatrixGenerics_1.16.0 matrixStats_1.3.0 ## [15] SeuratWrappers_0.3.5 Seurat_5.1.0 ## [17] SeuratObject_5.0.2 sp_2.1-4 ## [19] plyr_1.8.9 patchwork_1.2.0 ## [21] ggplot2_3.5.1 dplyr_1.1.4 ## [23] flexdashboard_0.6.2 knitr_1.47 ## ## loaded via a namespace (and not attached): ## [1] ProtGenerics_1.36.0 spatstat.sparse_3.1-0 ## [3] bitops_1.0-7 webshot_0.5.5 ## [5] httr_1.4.7 RColorBrewer_1.1-3 ## [7] tools_4.4.1 sctransform_0.4.1 ## [9] alabaster.base_1.4.1 utf8_1.2.4 ## [11] R6_2.5.1 HDF5Array_1.32.0 ## [13] lazyeval_0.2.2 uwot_0.2.2 ## [15] rhdf5filters_1.16.0 withr_3.0.0 ## [17] gridExtra_2.3 progressr_0.14.0 ## [19] cli_3.6.3 formatR_1.14 ## [21] spatstat.explore_3.2-7 fastDummies_1.7.3 ## [23] labeling_0.4.3 alabaster.se_1.4.1 ## [25] sass_0.4.9 spatstat.data_3.1-2 ## [27] ggridges_0.5.6 pbapply_1.7-2 ## [29] Rsamtools_2.20.0 R.utils_2.12.3 ## [31] parallelly_1.37.1 limma_3.60.3 ## [33] rstudioapi_0.16.0 RSQLite_2.3.7 ## [35] generics_0.1.3 BiocIO_1.14.0 ## [37] ica_1.0-3 spatstat.random_3.2-3 ## [39] Matrix_1.7-0 ggbeeswarm_0.7.2 ## [41] fansi_1.0.6 abind_1.4-5 ## [43] R.methodsS3_1.8.2 lifecycle_1.0.4 ## [45] edgeR_4.2.0 yaml_2.3.8 ## [47] rhdf5_2.48.0 SparseArray_1.4.8 ## [49] BiocFileCache_2.12.0 Rtsne_0.17 ## [51] grid_4.4.1 blob_1.2.4 ## [53] dqrng_0.4.1 promises_1.3.0 ## [55] ExperimentHub_2.12.0 crayon_1.5.3 ## [57] miniUI_0.1.1.1 lattice_0.22-6 ## [59] beachmat_2.20.0 cowplot_1.1.3 ## [61] GenomicFeatures_1.56.0 KEGGREST_1.44.1 ## [63] metapod_1.12.0 pillar_1.9.0 ## [65] rjson_0.2.21 future.apply_1.11.2 ## [67] codetools_0.2-20 leiden_0.4.3.1 ## [69] glue_1.7.0 data.table_1.15.4 ## [71] remotes_2.5.0 vctrs_0.6.5 ## [73] png_0.1-8 gypsum_1.0.1 ## [75] spam_2.10-0 gtable_0.3.5 ## [77] cachem_1.1.0 xfun_0.45 ## [79] S4Arrays_1.4.1 mime_0.12 ## [81] survival_3.7-0 bluster_1.14.0 ## [83] statmod_1.5.0 fitdistrplus_1.1-11 ## [85] ROCR_1.0-11 nlme_3.1-165 ## [87] bit64_4.0.5 alabaster.ranges_1.4.1 ## [89] filelock_1.0.3 RcppAnnoy_0.0.22 ## [91] bslib_0.7.0 irlba_2.3.5.1 ## [93] vipor_0.4.7 KernSmooth_2.23-24 ## [95] colorspace_2.1-0 DBI_1.2.3 ## [97] ggrastr_1.0.2 tidyselect_1.2.1 ## [99] bit_4.0.5 compiler_4.4.1 ## [101] curl_5.2.1 httr2_1.0.1 ## [103] BiocNeighbors_1.22.0 hdf5r_1.3.10 ## [105] DelayedArray_0.30.1 plotly_4.10.4 ## [107] rtracklayer_1.64.0 scales_1.3.0 ## [109] lmtest_0.9-40 rappdirs_0.3.3 ## [111] stringr_1.5.1 digest_0.6.36 ## [113] goftest_1.2-3 spatstat.utils_3.0-5 ## [115] alabaster.matrix_1.4.1 rmarkdown_2.27 ## [117] XVector_0.44.0 htmltools_0.5.8.1 ## [119] pkgconfig_2.0.3 sparseMatrixStats_1.16.0 ## [121] highr_0.11 dbplyr_2.5.0 ## [123] fastmap_1.2.0 ensembldb_2.28.0 ## [125] rlang_1.1.4 htmlwidgets_1.6.4 ## [127] UCSC.utils_1.0.0 shiny_1.8.1.1 ## [129] DelayedMatrixStats_1.26.0 farver_2.1.2 ## [131] jquerylib_0.1.4 zoo_1.8-12 ## [133] jsonlite_1.8.8 BiocParallel_1.38.0 ## [135] R.oo_1.26.0 BiocSingular_1.20.0 ## [137] RCurl_1.98-1.14 magrittr_2.0.3 ## [139] GenomeInfoDbData_1.2.12 dotCall64_1.1-1 ## [141] Rhdf5lib_1.26.0 munsell_0.5.1 ## [143] Rcpp_1.0.12 viridis_0.6.5 ## [145] reticulate_1.38.0 stringi_1.8.4 ## [147] alabaster.schemas_1.4.0 zlibbioc_1.50.0 ## [149] MASS_7.3-61 AnnotationHub_3.12.0 ## [151] parallel_4.4.1 listenv_0.9.1 ## [153] ggrepel_0.9.5 deldir_2.0-4 ## [155] Biostrings_2.72.1 splines_4.4.1 ## [157] tensor_1.5 locfit_1.5-9.10 ## [159] igraph_2.0.3 spatstat.geom_3.2-9 ## [161] RcppHNSW_0.6.0 reshape2_1.4.4 ## [163] ScaledMatrix_1.12.0 BiocVersion_3.19.1 ## [165] XML_3.99-0.16.1 evaluate_0.24.0 ## [167] scran_1.32.0 BiocManager_1.30.23 ## [169] httpuv_1.6.15 RANN_2.6.1 ## [171] tidyr_1.3.1 purrr_1.0.2 ## [173] polyclip_1.10-6 alabaster.sce_1.4.0 ## [175] future_1.33.2 scattermore_1.2 ## [177] rsvd_1.0.5 xtable_1.8-4 ## [179] restfulr_0.0.15 AnnotationFilter_1.28.0 ## [181] RSpectra_0.16-1 later_1.3.2 ## [183] viridisLite_0.4.2 tibble_3.2.1 ## [185] beeswarm_0.4.0 memoise_2.0.1 ## [187] AnnotationDbi_1.66.0 GenomicAlignments_1.40.0 ## [189] cluster_2.1.6 globals_0.16.3
For more info about us click here.