Using MetaCell to define cell-cycle dependent and independent transcriptional heterogeneity

The problem

In this note we exemplify the application of MetaCell package to identify gene co-variation in scRNA-seq data of the HCT116 colorectal cancer cell-line, which is independent of the cell-cycle.

Cell-cycle associated expression is perhaps the most common feature to dominate transcriptional heterogeneity in scRNA databases (at least those that cover proliferative cell populations). Whereas the cell-cycle machinery is largely conserved, its transcrtiptional signature can vary greatly between different biological systems, mainly due to different replication time-scales and different fractions of proliferating cells.

Why not normalization?

So why not simply normalize it out? One common routine of exploring transcriptional heterogeneity in single cell RNA-seq data is finding gene correlation in normalized expression matrices. The main issue with crude normalization of such datasets is that raw UMI counts are sparse, and scaling them by any model which is predicting background trends introduces strong biases. The cycling of a cell is inherently linked to its mRNA content, and therefore is highly susceptible for various biases during normalization of raw UMI counts.

There are several ways proposed before to normalize cell cycle effects in scRNA-seq data - and these can work well in many scenarios. The key aspects of the approach we describe here is that it is compatible with very sparse data, as well as with richer datasets, and that it is not assuming any parameters of specific models (linear or other) for describing the linkage between genes and the cell cycle. Instead, it is building a non-parametric metacell model for the distribution of gene expression based on cell cycle alone, and then sample from this background model to compare correlations between genes given the observed data vs. correlations between genes given cell cycle effects only. From our experience this can reduce biases, and increase one’s confidence in the statistical robustness of inferred cell-cycle independent gene-gene correlations.

Our strategy

In this vignette we wished to avoid normalization of gene counts and then computing their correlation scores, and developed a strategy that is using the MetaCell analytic platform (Baran et al., Genomee bio. 2019) to estimate the degree of correlation between any gene pair and test how much of this correlation stems from cell-cycle trends.

We can summarize our strategy in two main steps:

Construct a balanced cell-to-cell KNN similarity graph (with MetaCell), based on expression of cell cycle genes only. This one will serve as the cleanest estimate we can get for similarity of cells based on their cell cycle transcriptional signatures.
Generate a randomized UMI vector for each cell, by sampling each gene’s expression from one of its K nearest cells, as defined by the clean cell-cycle model. Gene-to-gene correlations between such randomized vectors are expected to preserve cell cycle related gene modules, but to completely (or almost completely) eliminate correlations between genes that are co-expressed or co-repressed in a cell cycle independent fashion.

Setting up

But little steps for little feet. We will start by installing MetaCell as described here: https://github.com/tanaylab/metacell

Then, we would load the following packages:

library(tgstat)
library(tgconfig)
library(metacell)

and then load the UMI tables:

# 1. initialize a database. This links the package to directory that will store all our objects. In this case we will initialize the DB to the HCT116_cc directory:
if(!dir.exists("HCT116_cc")) dir.create("HCT116_cc/")
scdb_init(base_dir = "HCT116_cc/", force_reinit=T)
#> initializing scdb to HCT116_cc/
scfigs_init(base = "HCT116_cc/figs/")
# 2. download HCT116 10x data from our repositories
if(!dir.exists("HCT116_cc/10x_HCT116_WT/")) dir.create("HCT116_cc/10x_HCT116_WT/")
download.file(url = "http://www.wisdom.weizmann.ac.il/~zoharme/HCT116_umi_tables/WT/barcodes.tsv", destfile = "HCT116_cc/10x_HCT116_WT/barcodes.tsv")
download.file(url = "http://www.wisdom.weizmann.ac.il/~zoharme/HCT116_umi_tables/WT/genes.tsv", destfile = "HCT116_cc/10x_HCT116_WT/genes.tsv")
download.file("http://www.wisdom.weizmann.ac.il/~zoharme/HCT116_umi_tables/WT/matrix.mtx", destfile = "HCT116_cc/10x_HCT116_WT/matrix.mtx")
download.file("http://www.wisdom.weizmann.ac.il/~zoharme/HCT116_umi_tables/WT/barcodes.tsv", destfile = "HCT116_cc/10x_HCT116_WT/tidy_mat.csv")
# 3. load 10x UMI matrices
mcell_import_scmat_10x(mat_nm = "mat_wt",
                       base_dir = "HCT116_cc/10x_HCT116_WT/",
                       force=TRUE)
#> summing up total of 186 paralog genes into 91 unique genes
#> [1] TRUE
# 4. download some example figures
if(!dir.exists("HCT116_cc/example_figs/")) dir.create("HCT116_cc/example_figs/")
download.file(url = "http://www.wisdom.weizmann.ac.il/~zoharme/metacell_ccVignette_exampleFigs/HCT116_wt_shuffling_example_matrix.png", destfile = "HCT116_cc/example_figs/HCT116_wt_shuffling_example_matrix.png")
download.file(url = "http://www.wisdom.weizmann.ac.il/~zoharme/metacell_ccVignette_exampleFigs/HCT116_wt_shuffling_example_maximalCorrelations.png", destfile = "HCT116_cc/example_figs/HCT116_wt_shuffling_example_maximalCorrelations.png")
download.file(url = "http://www.wisdom.weizmann.ac.il/~zoharme/metacell_ccVignette_exampleFigs/HCT116_wt_2dproj_ccModel_MKI67expression.png", destfile = "HCT116_cc/example_figs/HCT116_wt_2dproj_ccModel_MKI67expression.png")
download.file(url = "http://www.wisdom.weizmann.ac.il/~zoharme/metacell_ccVignette_exampleFigs/HCT116_wt_total_UMI_per_metacells_MKI67expression.png", destfile = "HCT116_cc/example_figs/HCT116_wt_total_UMI_per_metacells_MKI67expression.png")

Find cell cycle associated genes

We can now begin work on our cell-cycle model. A good starting point will be to fish out genes that have significant correlation with at least one strong known cell-cycle “anchors” (i.e., MKI67, PTTG, GINS2, ACTB) in a downampled UMI matrix. We can optionally ask for genes to also be NOT correlate to any particular gene we assume is unrelated to cell-cycle (ANXA1). in order to do so, we will create a .txt table, summarizing genes that are correlated to our anchors:

mcell_mat_rpt_cor_anchors(mat_id="mat_wt", 
        gene_anchors = c("MKI67", "PTTG1","GINS2","ACTB"), 
        cor_thresh = 0.1, 
        gene_anti = c("ANXA1"), 
        tab_fn = "HCT116_cc/cors_anchors_hct116_cc.txt", 
        sz_cor_thresh=0.7)
# cor_thresh is an important parameter, that set the mininmal correlation that a gene needs to have to at least one of the anchors, in order to be included in the list.
# sz_cor_thresh is an optional parameter, setting the maximal allowed correlation of a gene to total cell UMI counts, in order to be included in the list.

For crude analysis, let’s take all genes that had a significant correlation with at least one of our anchors (so their names appear in the .txt table we have just saved), and create a MeatCell “geneSet” consisting the feature genes that will help us create the cell-cycle model:

# read table
anchors_gene_cor = read.table("HCT116_cc/cors_anchors_hct116_cc.txt", sep="\t", h=T, stringsAsFactors=F)
# get gene names and the specific anchor they are associated with
cc_genes = apply(anchors_gene_cor[,setdiff(colnames(anchors_gene_cor),c("szcor","max","negmax"))], 1, which.max)
# create a geneSet object 
cc_hct116 = tgGeneSets(cc_genes, "HCT116, Cell Cycle Associated")
scdb_add_gset("hct116_cellCycle", cc_hct116)
# create an expression matrix for cell-cycle geneSet only
mcell_mat_ignore_genes(new_mat_id = "mat_wt_cellCycle", mat_id = "mat_wt", 
                                ig_genes = names(scdb_gset("hct116_cellCycle")@gene_set),
                                reverse=T)

Optionally, we can be real Faynshmekers and narrow the list feature genes, by looking for additional structure within it. Specifically, we can use MetaCell to plot gene modules we find within it, and manually keep only those modules that contain known cell-cycle factors that display significant covariance:

# create 24 gene clusters of geneSet
mcell_gset_split_by_dsmat(gset_id = "hct116_cellCycle",
                          mat_id = "mat_wt_cellCycle", K=24)
#> will downsample the matrix, N= 750
# plot correlation within them
mcell_plot_gset_cor_mats(gset_id = "hct116_cellCycle", 
                         scmat_id = "mat_wt_cellCycle")
#> will downsample the matrix, N= 750

This will create a subfolder named hct116_cellCycle.gset_cors in our figures directory, where gene-gene correlations of the K module will be displayed, as well as each gene correlation to total UMI per cell. We will now go over and examine the modules, and annotate those that smell like cell-cycle.

# after examining the plots, decide which gene modules to keep (in this case, if you followed the exact lines so far: gene modules 6, 9, 11, 13, 15, 17 and 21)
mcell_gset_remove_clusts(gset_id = "hct116_cellCycle",
                                   filt_clusts = c(6,9,11,13,14,15,17,21),
                                    new_id = "hct116_cellCycle_narrow",
                                    reverse = T)
# create an expression matrix for the narrowed cell-cycle geneSet only
mcell_mat_ignore_genes(new_mat_id = "mat_wt_cellCycle_narrow", mat_id = "mat_wt", 
                                ig_genes = names(scdb_gset("hct116_cellCycle_narrow")@gene_set),
                                reverse=T) 
# here are four examples for correlated subgroup of genes that seem more (modules #13, #15) or less (modules #1, #18) related to the cell-cycle:

geneSet_module_13 geneSet_module_15 geneSet_module_18

Create a cell-cycle based KNN graph

Regardless of whether we decided to use all cell-cycle associated geneSet (“hct116_cellCycle”, which will be used in this example) or a purified subset of it (“hct116_cellCycle_narrow”), we can now create a balanced KNN similarity graph between cells that is based on expression coming from these genes only:

# create gene stats for our expression matrix:
mcell_add_gene_stat(mat_id = 'mat_wt_cellCycle', gstat_id = 'wt_cellCycle_geneStats')
# create a list of markers in our expression table (only filtering out very lowly expressed genes or genes that have minimal variance):
mcell_gset_filter_varmean(gset_id="wt_cellCycle_markers", gstat_id="wt_cellCycle_geneStats", T_vm=0.03, force_new=T)
mcell_gset_filter_cov(gset_id = "wt_cellCycle_markers", gstat_id="wt_cellCycle_geneStats", T_tot=20, T_top3=1)
#use marker genes and our expresion table to build a cell-cell similarity graph object:
mcell_add_cgraph_from_mat_bknn(mat_id="mat_wt_cellCycle", 
                               gset_id = "wt_cellCycle_markers", 
                               graph_id="HCT116_wt_cellCycle_graph",
                               K=100,
                               dsamp=T)
#> will downsample the matrix, N= 750 (and yes - this should have been chached
#> willl build balanced knn graph on 3285 cells and 295 genes, this can be a bit heavy for >20,000 cells
#> sim graph is missing 24 nodes, out of 3285
# K: a parameter that defines the initial attempt of the system to assign neighbors for each cell and ultimately can affect the size distribution of derived metacells.
# dsamp: a boolean parameter which indicates whether we want to downsample the expression matrix or not.

The last step added to our database a new cell-cell graph object, named HCT116_wt_cellCycle_graph. It describes the similarity between all cells in our dataset, based only on the expression of genes in the “wt_cellCycle_markers” geneSet.

Validate our cell-cycle model

However, and before we continue to the randomization step, it is a good practice to perform some sanity checks here - and try to validate that our model is indeed describing cell-cycle as we hoped.

There are many different ways to approach this, but since we are using the MetaCell platform, it makes sense to create metacells that are based on our cell graph, and test whether we can position metacells along a continuum of transcriptional cell-cycle dynamics. Remember that in most cases we may only succeed to model the fraction of cycling cells, which can vary between biological samples (the non-cycling cells might form a big “non-cycling” metacell). In addition, we can expect different distributions of total mRNA content per cell to be one hallmark of our cell-cycle based metacells (but not necessarily, especially if sequencing of libraries is not at saturation or close to it):

# Compute metacell using resampling iterations of graph cover k-means-like approach
mcell_coclust_from_graph_resamp(
                coc_id="HCT116_wt_cellCycle_coCluster100", 
                graph_id="HCT116_wt_cellCycle_graph",
                min_mc_size=50, 
                p_resamp=0.75, n_resamp=100)
#> running bootstrap to generate cocluster
#> done resampling

# build a metacell cover from co-clust data through filtering un-balanced edges and running graph cover
mcell_mc_from_coclust_balanced(
                coc_id="HCT116_wt_cellCycle_coCluster100", 
                mat_id= "mat_wt_cellCycle",
                mc_id= "metaCells_wt_cellCycle", 
                K=120, min_mc_size=50, alpha=2)
#> filtered 1612094 left with 802481 based on co-cluster imbalance
#> building metacell object, #mc 12
#> add batch counts
#> compute footprints
#> compute absolute ps
#> compute coverage ps
#> reordering metacells by hclust and most variable two markers
#> reorder on PTTG1 vs CLSPN

## We can then generate a 2D model of the formed metacells, and project expression of specific genes along them

For more functions and details on plotting metacells, checkout Metacell github repository: https://github.com/tanaylab/metacell

Some examples for plots to test cell cycle model:

coloring footprints of a known cell-cycle marker on the 2d metacell projection

showing distributions of total UMI-per-cell in each of the metacells

If you think that your model looks legit - well done, the longer step of the process is now behind us.

Find genes that are independent of cell-cycle

Provided with a good estimation for how cells are related to one another based on cell-cycle trends alone, we can now ask which genes are co-variating independently of it.

Specifically, we can take the original expression matrix and shuffle it, where the expression vector of each cell will be randomly sampled from the K cells that are most adjacent to it based on our cell-cycle graph:

gene_cors_cellCycle = mcell_cgraph_norm_gcor(cgraph_id = "HCT116_wt_cellCycle_graph",
                            mat_id = "mat_wt",
                            K=20,
                            min_gtot=100)
#min_gtot: a minimal threshold for total expression per gene in the downsampled matrix.

the list object gene_cors_cellCycle now holds two equally sized gene-gene correlation matrices - g_cor and r_cor, which will contain the correlation scores between and after the randomization process, respectively.

If the co-expression of two genes is related to cell-cycle, than we could expect that some degree of covariance between them will survive the randomization
However, if co-expression patterns are unrelated to cell-cycle, correlations between genes are expected to be completely eliminated after the randomization, (statistically, if K is large enough).

We can have a supervised look into the two matrices and see whether a specific gene-gene interaction that is interesting for us was eliminated or not.

Several ways of plotting may help us decide which gene pairs covariation is related to cell-cycle and which doesn’t. For example:

plot the maximal correlation score of each gene before and after the randomization:

#set diagonals of matrices to zero:
diag(gene_cors_cellCycle$gcor) = 0
diag(gene_cors_cellCycle$r_gcor) = 0
#plot
png("HCT116_cc/figs/wt_shuffled_max.png",width=280,height=280)
par(mar=c(5,5,1,1))
plot(
apply(gene_cors_cellCycle$gcor,1,max),
apply(gene_cors_cellCycle$r_gcor,1,max),
cex=.4, cex.lab=1.5,cex.axis=1.5,
xlab = "original max cor.",
ylab = "shuffled max cor.",
pch=19,
col="#00000030")
grid()
dev.off()

an example for how such plot can be interpreted:

It is apparent that some genes were able to maintain at least one significant correlation, and are therefore display dynamics that are, at least in part, associated with cell-cycle.

Still, it doesn’t mean that all their transcriptional covariation are fully dependent on the cell-cycle.

In order to have a more careful look at the cell-cycle based randomization on gene correlations, we can plot a “chimeric”" matrix (for selected genes, as in this example), comparing the original correlation value in its top triangle, and the shuffled correlation scores in the lower triangle:

an example for how it can look like:

lower triangle: shuffled gene-gene correlations, higher triangle: original gene-gene correlations

In principal, as cell-cycle is after all an evolutionary conserved process, we can try and use the same genes to implement our shuffling strategy in similar biological samples (not promising anything though).

In the following example, we have used the same list of cell-cycle associated genes we obtained from “wild-type HCT-116 cells (the geneSet hct116_cellCycle), in order to create a cell-cycle model and implement the randomization strategy on DKO HCT-116 cells, harboring two knockout mutations in DNA-methyltransferases (DNMT1 and DNMT3B):

#download 10x single cell RNA-seq data for DKO
if(!dir.exists("HCT116_cc/10x_HCT116_DKO/")) dir.create("HCT116_cc/10x_HCT116_DKO/")
download.file("http://www.wisdom.weizmann.ac.il/~zoharme/HCT116_umi_tables/DKO/barcodes.tsv","HCT116_cc/10x_HCT116_DKO/barcodes.tsv")
download.file("http://www.wisdom.weizmann.ac.il/~zoharme/HCT116_umi_tables/DKO/genes.tsv","HCT116_cc/10x_HCT116_DKO/genes.tsv")
download.file("http://www.wisdom.weizmann.ac.il/~zoharme/HCT116_umi_tables/DKO/matrix.mtx","HCT116_cc/10x_HCT116_DKO/matrix.mtx")
download.file("http://www.wisdom.weizmann.ac.il/~zoharme/HCT116_umi_tables/DKO/barcodes.tsv","HCT116_cc/10x_HCT116_DKO/tidy_mat.csv")
### 4. download some example figures
if(!dir.exists("HCT116_cc/example_figs/")) dir.create("HCT116_cc/example_figs/")
download.file(url = "http://www.wisdom.weizmann.ac.il/~zoharme/metacell_ccVignette_exampleFigs/HCT116_dko_shuffling_example_matrix.png", destfile = "HCT116_cc/example_figs/HCT116_dko_shuffling_example_matrix.png")
download.file(url = "http://www.wisdom.weizmann.ac.il/~zoharme/metacell_ccVignette_exampleFigs/HCT116_dko_shuffling_example_maximalCorrelations.png", destfile = "HCT116_cc/example_figs/HCT116_dko_shuffling_example_maximalCorrelations.png")
download.file(url = "http://www.wisdom.weizmann.ac.il/~zoharme/metacell_ccVignette_exampleFigs/HCT116_dko_2dproj_ccModel_CCNB1expression.png", destfile = "HCT116_cc/example_figs/HCT116_dko_2dproj_ccModel_CCNB1expression.png")
download.file(url = "http://www.wisdom.weizmann.ac.il/~zoharme/metacell_ccVignette_exampleFigs/HCT116_dko_total_UMI_per_metacells_CCNB1expression.png", destfile = "HCT116_cc/example_figs/HCT116_dko_total_UMI_per_metacells_CCNB1expression.png")
#load DKO UMI matrix
mcell_import_scmat_10x(mat_nm = "mat_dko",
                       base_dir = "HCT116_cc/10x_HCT116_DKO/",
                       force=TRUE)
#> summing up total of 186 paralog genes into 91 unique genes
#> [1] TRUE
#create a DKO expression matrix for cell-cycle geneSet only
mcell_mat_ignore_genes(new_mat_id = "mat_dko_cellCycle", mat_id = "mat_dko", 
                                ig_genes = names(scdb_gset("hct116_cellCycle")@gene_set),
                                reverse=T)

As an exercise - go ahead and try find cell-cycle independent covariation in HCT116 DKO cells, by using MetaCell package and the methodology described above.

You can check yourself by comparing the output of your analysis to the figures shown below.

GOOD LUCK! zohar.meir@weizmann.ac.il

DKO cell-cycle model:

DKO: max. gene-correlations after shuffling

DKO cell-cycle randomization identifies cell-cycle dependenedt and independent transcriptional variance between cells:

DKO: max. gene-correlations after shuffling DKO: gene-correlations after shuffling