Computing Metacells - One-Pass Process

In this vignette we will demonstrate how to compute metacells starting with brand-new data, but heavily relying on previous analysis of similar data. This is in contrast to the full iterative process, where we repeatedly revise our decisions. Here in the one-pass vignette we assume that magically all our decisions were correct, so we can immediately compute the results (in practice, we simply applied the final decisions from the iterative process vignette).

In the real world, the actual process falls somewhere between these two extremes. That is, by using prior knowledge that applies to the specific data set, some of the decisions we make below will be correct from the start; but most likely, some will still need to be revised, and will require a few iterations to finalize.

To remove doubt, the results presented here are not suitable for use in any serious analysis. This is an example of the process, nothing more.

1. Setup

We'll start by importing the Python libraries we'll be using and setting up some global configuration. Just importing takes a few seconds, mainly because Python's importing of C++ extensions with a large number of functions is inefficient.

Then, configure the packages we use. Feel free to tweak as necessary.

Even though the process here is "one-pass", it still has two phases - a preliminary phase for computing the metacells, and a final phase for manually assigning type annotations to them. We'll save files under output/one-pass/preliminary and output/one-pass/final to keep things organized.

2. Reading the data

Our input here is the "full" data. This data is already not the "raw" data; that is, we expect it to have gone through preliminary processing using basic cell-oriented tools, such as removing "doublet" cells, cells which are actually empty wells, etc. The specifics of such pre-processing may differ for each data set, and are outside the scope of this vignette.

Reading this data takes a ridiculously long time because the AnnData implementation, in its infinite wisdom, does not use memory-mapping. When we switch to using DAF things will be much better and faster.

3. Cleaning the data

Having the "full" data, we will demonstrate performing some additional "universal" steps to extract the "clean" data out of it. Importantly, all we will do here is select which subset of the data to use; we will not perform any corrections (such as batch corrections) here. Such corrections are better applied based on the resulting metacell model, as this allows cell-type-specific analysis to be done.

3.1 Excluding doublet cells

In general there may be some cells you wish to exclude from the analysis based on prior knowledge, e.g. based on cell metadata or on some form of analysis. In this example, we assume we have been given a list of cells to exclude as doublets.

In the iterative process, we typically start with no a-priori excluded cells, and then iteratively exclude more cells until we settle on the final list. Here in the one-pass vignette, we assume we have the right list from the start. Realistically, the process could go either way, or we may have an initial list to exclude and enhance it during the iterations.
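Mechanically, excluding an a-priori list of cells amounts to computing a boolean mask over the cells. A minimal sketch of this, with hypothetical barcodes (in the real pipeline this mask becomes a per-cell annotation on the AnnData object):

```python
# Sketch: build a per-cell "excluded" mask from a given list of doublet
# barcodes. The barcodes here are hypothetical toy data.
cell_barcodes = ["AAAC-1", "AAAG-1", "CCTG-1", "GGTA-1"]
doublet_barcodes = {"CCTG-1"}  # the list provided by prior analysis

excluded_cell = [barcode in doublet_barcodes for barcode in cell_barcodes]
print(excluded_cell)  # → [False, False, True, False]
```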

3.2 Decisions

Cleaning up the data (whether in pre-processing or using the steps below) requires us to make decisions, based on prior biological knowledge and experience with the pipeline. Also, none of the decisions is final. In real analysis, we often iterate the analysis, revisiting and refining prior decisions, until we obtain a "good" final result. Here in the one-pass vignette, we assume we get things right the first time around.

3.2.1 Excluding cells by UMIs count

The first decision we need to make is "how many UMIs can a clean cell contain". Cells with "too few" UMIs indicate very low-quality sampling and might be just empty droplets. Cells with "too many" UMIs might be doublets. The thresholds to use depend on the technology, the dataset, and the specifics of any pre-processing done. The method we use here is to look at the distribution of total UMIs per cell, and pick "reasonable" thresholds for exclusion based on it.

We can visualize the cell UMIs threshold decision and adjust it accordingly:
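Once picked, the thresholds translate into a simple range test on each cell's total UMIs. A minimal sketch with toy counts; the 800 and 20,000 thresholds are hypothetical values of the kind one would pick by eye from the distribution plot:

```python
# Sketch: mask cells whose total UMI count falls outside the chosen range.
# The data and the thresholds are illustrative assumptions.
total_umis_per_cell = [500, 1200, 3400, 9800, 25000, 7100]

MIN_UMIS, MAX_UMIS = 800, 20000
properly_sampled = [MIN_UMIS <= total <= MAX_UMIS for total in total_umis_per_cell]
print(properly_sampled)  # → [False, True, True, True, False, True]
```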

3.2.2 Excluding genes by name

The next decision we need to make is which genes to exclude from the data, by their names. The poster children for this are mitochondrial genes and strong sex-specific genes. Here we also exclude the non-coding NEAT1 gene.

In the iterative process, we may discover additional genes to exclude later on. Here in the one-pass vignette, we assume we have the right list from the start. Realistically, one would start with a list that was used for similar data in the past, and possibly tweak it later as needed.

We'll instruct the package to exclude the genes we have chosen. By default, the code will also automatically exclude a few additional genes, specifically genes with very high variance that are not correlated with any other gene (bursty_lonely_gene, see the documentation for details). Such genes are useless for grouping cells together; even worse, including them is actively harmful as they cause cells to appear to be deviants (have a gene in which they are very different from the rest of the cells in the same metacell), and thereby be marked as outliers.
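The name-based part of this selection boils down to matching explicit names and glob-style patterns (e.g. "MT-*" for mitochondrial genes). A minimal stdlib sketch, with illustrative gene names; the actual package call also handles the automatic bursty_lonely_gene detection, which is not sketched here:

```python
from fnmatch import fnmatch

# Sketch: mark genes for exclusion by explicit names and by glob patterns.
# Gene names and lists here are illustrative, not the vignette's exact lists.
gene_names = ["MT-CO1", "MT-ND2", "XIST", "NEAT1", "CD8A", "HBB"]
EXCLUDED_GENE_NAMES = ["XIST", "NEAT1"]
EXCLUDED_GENE_PATTERNS = ["MT-*"]

def is_excluded(name):
    return (name in EXCLUDED_GENE_NAMES
            or any(fnmatch(name, pattern) for pattern in EXCLUDED_GENE_PATTERNS))

excluded_gene = [is_excluded(name) for name in gene_names]
print(excluded_gene)  # → [True, True, True, True, False, False]
```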

3.2.3 Excluding cells by high excluded gene UMIs

The next decision we need to make is which cells to exclude due to containing too many UMIs in the excluded genes. If a cell contains "too many" excluded (mainly mitochondrial) gene UMIs, this may indicate a badly sampled cell, leading to very skewed results. Again, the exact threshold depends on both the technology and the dataset. Here we resort to looking at the distribution of the fraction of excluded genes in each cell, and manually picking the threshold.

We start by computing the total UMIs of excluded genes in each cell. We only need to do this once (as long as we don't change the list of excluded genes).

Next we'll pick a maximal fraction of excluded UMIs in each cell.

We can visualize the cell excluded genes fraction threshold decision and adjust it accordingly:

We can now instruct the package to exclude the cells we have chosen.
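The underlying computation is straightforward: divide each cell's excluded-gene UMIs by its total UMIs and compare to the threshold. A minimal sketch with toy numbers; the 25% threshold is a hypothetical value of the kind one would pick from the distribution:

```python
# Sketch: mask cells whose fraction of excluded-gene UMIs is too high.
# All numbers here are illustrative assumptions.
excluded_umis = [50, 400, 2600, 120]   # per cell, summed over excluded genes
total_umis = [1000, 2000, 5200, 3000]  # per cell, over all genes

MAX_EXCLUDED_FRACTION = 0.25
excluded_fraction = [e / t for e, t in zip(excluded_umis, total_umis)]
high_excluded = [f > MAX_EXCLUDED_FRACTION for f in excluded_fraction]
print(high_excluded)  # → [False, False, True, False]
```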

3.3 Extract the clean data

Having decided on which cells and genes to exclude, we can now extract the clean data out of the full dataset.

3.4 Save the data

It is good practice to keep the pristine input data in case we'll need to do something else with it. We'll therefore save a copy of the "full" data with the manual annotations we have collected, as well as the "clean" data we extracted from it. We do this prior to computing metacells, which will add further annotations, so we'll be able to easily go back to the unannotated versions of the data if (realistically, when) we revise our decisions.

Using AnnData, this is a full copy of the data; when we switch to DAF we'll be able to store just the additional annotations we made, which will take practically no space and be much faster.

4. Compute the metacells

We'll rename our data to "cells" to distinguish it from the metacells we are about to compute. We'll be adding annotations to this data (most importantly, the metacell each cell belongs to), and will later save it separately from the "clean" data we saved above.

4.1 Decisions

Even though we have the clean cells data, we can't compute metacells for it before making a few more decisions.

4.1.1 Lateral genes

A crucial decision when running metacells is the list of genes that are lateral, that is, that should not be used to group cells together. The poster children for this are the cell-cycle genes. These genes are strong, and any clustering algorithm will therefore prefer to group together cells in the same cell-cycle state, at the expense of mixing up other (reasonably close) cell states, which are what we are actually interested in. Note that lateral genes are still used in deviant cell detection, that is, each lateral gene should still have a consistent expression level in all the cells of each metacell.

In the iterative process, we start with a very short list of genes we have a strong a-priori opinion about, expand it, and then iteratively add more genes until we settle on the final list. Here in the one-pass vignette, we assume we have the right list from the start. Realistically, one could start with a list that was used for similar data in the past, and possibly tweak it later as needed.

4.1.2 Noisy genes

Another important list of genes is the genes with higher than usual variance, which cause many cells to be inconsistent with their metacells, and therefore be marked as deviants (outliers). Marking such genes as "noisy" allows for more variance for them in each metacell, reducing the number of outliers.

In the iterative process, we typically start with an empty list of genes, and then iteratively add more genes until we settle on the final list. Here in the one-pass vignette, we assume we have the right list from the start. Realistically, one could start with a list that was used for similar data in the past, and possibly tweak it later as needed.

4.1.3 Parallelization

Finally, we need to decide on how much parallelization to use. This is a purely technical decision - it does not affect the results, just the performance.

The more parallel piles we use, the faster the computation will be (up to the number of physical processors, but that is handled automatically for us).

However, having more parallel piles means using more memory. If we run out of memory, we'll need to reduce the number of parallel piles. You can track the memory usage by running top or htop during the computation.

We provide a guesstimator for the maximal number of parallel piles that will fit in memory. This is by no means perfect, but it is a starting point.
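The shape of such a guesstimate is simply "available memory divided by per-pile working set". A toy sketch of this arithmetic; the bytes-per-pile figure is a made-up assumption for illustration, and the package's own estimator is based on the actual data rather than a fixed constant:

```python
# Sketch: crude memory-based guess for the number of parallel piles.
# Both figures below are illustrative assumptions.
available_ram_bytes = 64 * 2**30  # e.g. a 64GB server
bytes_per_pile = 6 * 2**30        # hypothetical per-pile working set

max_parallel_piles = available_ram_bytes // bytes_per_pile
print(max_parallel_piles)  # → 10
```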

4.2 Computation

We are (finally) ready to actually group the cells into metacells.

4.2.1 Hyper-parameters

The metacells pipeline has a lot of hyper-parameters you can tweak. The defaults were chosen such that scRNA-seq data, especially 10x data, should work "well" out of the box. You should read the documentation and have a good understanding of the effect of any parameter you may want to tweak, keeping in mind the synergy between some of the parameters.

If we had to call out one hyper-parameter you might wish to tweak, it would be the target_metacell_size. This specifies the "ideal" number of cells in each metacell. The algorithm works hard to keep the actual metacell size close to this value - in particular, metacells larger than twice this size will be split, and metacells which are much smaller than this size will be merged, or dissolved (become outliers).

By default this value is set to 96. Setting it to a smaller value will create more metacells, which may allow capturing more subtle differences between cell states (e.g. along a gradient); this, however, would come at the cost of making less robust estimations of each gene's expression level in each metacell. This might be a worthwhile tradeoff if your cells are of higher quality.
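The split/merge rule described above can be sketched as a simple classification of metacell sizes. Note the "much smaller" cutoff below (a quarter of the target) is an illustrative assumption, not the package's exact criterion:

```python
# Sketch of the size rule: metacells larger than twice the target are split;
# much smaller ones are merged or dissolved. The quarter-of-target cutoff
# for "much smaller" is an assumption made for illustration only.
TARGET_METACELL_SIZE = 96

def size_action(n_cells):
    if n_cells > 2 * TARGET_METACELL_SIZE:
        return "split"
    if n_cells < TARGET_METACELL_SIZE // 4:
        return "merge-or-dissolve"
    return "keep"

print([size_action(n) for n in (300, 100, 10)])
# → ['split', 'keep', 'merge-or-dissolve']
```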

One technical "hyper-parameter" you must specify is the random_seed. A non-zero value will generate reproducible results. A zero value will generate non-reproducible results and will be slightly faster. We strongly urge you to use a non-zero value, as reproducible results are much easier to deal with.

4.2.2 Assigning cells to metacells

This is the core of the method. It can take a while. The dataset used in this example is far from trivial - it contains ~1/3M cells. This takes ~10 minutes to compute on a hefty (48 HT cores, 0.5TB RAM) server. You can see that progress is being made by running the computation with a progress bar. This progress is rather non-linear; for example, there's a pause at the start when computing rare gene modules, and it isn't smooth afterwards either. You can skip it altogether by getting rid of the with statement.

4.2.3 Collecting the metacells

The above merely computed a metacell name and index for each cell ("Outliers" and negative for outlier cells). We still need to collect all the cells of each metacell, to create a new AnnData where each profile is a metacell. Note that in this new metacells data, we no longer have UMIs per gene; instead, for each gene, we have an estimate of the fraction of its UMIs out of the total UMIs. Since AnnData can only hold a single 2D dataset, the result must be a separate object (with each "observation" being a metacell), so we copy all the per-gene annotations from the cells dataset to the result.
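The essence of this collection step is summing UMIs over the cells of each metacell and normalizing to fractions, skipping outliers. A minimal sketch with toy data (the real implementation works on the AnnData matrices, of course):

```python
# Sketch: collect per-metacell gene fractions from per-cell UMI counts.
# Cells with metacell index -1 are outliers and are skipped. Data are toy.
cell_umis = [[10, 0], [20, 10], [0, 30], [5, 5]]  # cells x genes
metacell_of_cell = [0, 0, 1, -1]                   # -1 marks an outlier

n_metacells = max(metacell_of_cell) + 1
n_genes = len(cell_umis[0])
totals = [[0] * n_genes for _ in range(n_metacells)]
for umis, mc_index in zip(cell_umis, metacell_of_cell):
    if mc_index >= 0:
        for gene, count in enumerate(umis):
            totals[mc_index][gene] += count

# Normalize each metacell's totals to fractions of its overall UMIs.
fractions = [[count / sum(row) for count in row] for row in totals]
print(fractions)  # → [[0.75, 0.25], [0.0, 1.0]]
```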

If the input data contains per-cell annotations of interest, then here would be a good place to convey that data to the metacells object. Since each metacell is a combination of multiple cells, this requires a decision on how exactly to do this for each per-cell annotation. We can either reduce the multiple per-cell values into a single one (such as the most frequent value or the mean value), or we can collect the fraction of cells that have each value in each metacell (for categorical data). Since AnnData is limited to simple 2D data, the latter is implemented as a series of per-metacell annotations (one for each possible value). This will be handled better in DAF.
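Both reduction strategies are easy to sketch for a categorical annotation. A toy example using a hypothetical per-cell "donor" annotation, computing the most frequent value and the per-value fractions for each metacell:

```python
from collections import Counter

# Sketch: convey a categorical per-cell annotation to metacells, both as the
# most frequent value and as per-value fractions. Data are toy.
donor_of_cell = ["d1", "d1", "d2", "d2", "d2"]
metacell_of_cell = [0, 0, 0, 1, 1]

cells_of_metacell = {}
for donor, mc_index in zip(donor_of_cell, metacell_of_cell):
    cells_of_metacell.setdefault(mc_index, []).append(donor)

most_frequent = {mc_index: Counter(values).most_common(1)[0][0]
                 for mc_index, values in cells_of_metacell.items()}
fractions = {mc_index: {value: count / len(values)
                        for value, count in Counter(values).items()}
             for mc_index, values in cells_of_metacell.items()}
print(most_frequent)  # → {0: 'd1', 1: 'd2'}
```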

4.3 Computing for MCView

So we have our metacells data. This is pretty opaque as of itself. MCView is our interactive tool for exploring the data, and for assigning type labels to the metacells.

However, in order to use MCView, we first need to compute all sorts of quality control data for MCView to display. This again may take a while (but much less than computing the metacells above).

Here's a preview of the 2D UMAP view of the data (without any color annotations, as we do not have type annotations yet):

4.4 Saving the data

We'll save the results on disk for future reference, and to allow them to be imported into MCView.

5. Importing into MCView

This vignette focuses on the metacells package, not MCView, which deserves a full vignette of its own. Still, here are some basics about how to use it.

5.1 Installing MCView

MCView is written in R but is not a standard CRAN package. To install it, you should type (in R):

install.packages("remotes")
remotes::install_github("tanaylab/MCView")

5.2 Importing data set

Since MCView is written in R, it isn't easy to run it inside a Python notebook. Instead we've provided a small R script that will load the data we saved above, and import it into an MCView application. Here is the code of the script for reference:

library("MCView")

args <- commandArgs(trailingOnly=TRUE)

if (length(args) == 6) {
    prefix <- args[1]
    name <- args[2]
    title <- args[3]
    type <- args[4]
    atlas_prefix <- args[5]
    atlas_name <- args[6]
    import_dataset(
        sprintf("../mcview/%s", name),                           # The directory to create
        sprintf("%s-%s", prefix, gsub("/", "-", name)),          # The name of the dataset
        sprintf("../output/%s/%s.metacells.h5ad", name, prefix), # The metacells h5ad file
        metadata_fields = "all",                                 # Ask to import all the metadata
        title = title,                                           # A title for the GUI
        cell_type_field = type,                                  # The name of the type field
        cell_type_colors_file = "../captured/type_colors.csv",   # The type colors CSV file
        projection_weights_file = sprintf("../output/%s/%s.atlas_weights.csv", name, prefix),
        atlas_project = sprintf("../mcview/%s", atlas_name),
        atlas_dataset = sprintf("%s-%s", atlas_prefix, gsub("/", "-", atlas_name))
    )

} else if (length(args) == 4) {
    prefix <- args[1]
    name <- args[2]
    title <- args[3]
    type <- args[4]
    import_dataset(
        sprintf("../mcview/%s", name),                           # The directory to create
        sprintf("%s-%s", prefix, gsub("/", "-", name)),          # The name of the dataset
        sprintf("../output/%s/%s.metacells.h5ad", name, prefix), # The metacells h5ad file
        metadata_fields = "all",                                 # Ask to import all the metadata
        title = title,                                           # A title for the GUI
        cell_type_field = type,                                  # The name of the type field
        cell_type_colors_file = "../captured/type_colors.csv"    # The type colors CSV file
    )

} else if (length(args) == 3) {
    prefix <- args[1]
    name <- args[2]
    title <- args[3]
    import_dataset(
        sprintf("../mcview/%s", name),                           # The directory to create
        sprintf("%s-%s", prefix, gsub("/", "-", name)),          # The name of the dataset
        sprintf("../output/%s/%s.metacells.h5ad", name, prefix), # The metacells h5ad file
        metadata_fields = "all",                                 # Ask to import all the metadata
        title = title                                            # A title for the GUI
    )

} else {
    stopifnot(FALSE)
}

We'll just run it as an external process using Rscript:
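A minimal sketch of such an invocation from Python; the "Preliminary" title is a hypothetical argument, and the actual call is left commented out so the sketch stays self-contained:

```python
import subprocess

# Sketch: invoke the import script as an external process. The title
# argument is hypothetical; adjust the paths and arguments to your layout.
prefix, name, title = "hca_bm", "one-pass/preliminary", "Preliminary"
command = ["Rscript", "scripts/import_dataset.r", prefix, name, title]
# subprocess.run(command, check=True)  # uncomment to actually run it
print(command)
```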

5.3 Running MCView

The simplest way to run MCView is, in R, to type:

library(MCView)
run_app("mcview/one-pass/preliminary") # The path we imported the dataset into

Since MCView is a shiny application, you have many other options, which are outside the scope of this vignette.

5.4 Annotating types in MCView

The basic procedure we use for annotating metacell types is to look at the markers heatmap to identify interesting genes, and then use the gene-gene plots to identify groups of metacells that present the right combination of markers to justify some type label. Here's an example gene-gene plot showing a gradient between HSC (Hematopoietic Stem Cell) and LMPP (Lymphoid-primed Multipotent Progenitors):

HLF-AVP

If some metacells express a combination of markers for two unrelated types, we will consider labeling them as doublets. All these decisions require prior biological knowledge of the genes and the system. Here's an example of a doublet metacell containing a mixture of Ery (erythrocytes) and T-cells:

HBB-TRBC1

To kickstart the process, MCView will automatically cluster the metacells into some groups, and give a color to each. This makes the markers heatmap at least somewhat informative even before manual type annotation has begun. Note these clusters are just a very crude starting point. Sometimes a cluster will map well to a single cell type, but most likely, a cell type will contain metacells from several clusters, or a cluster will need to be split between two cell types.

5.5 Updating types in MCView

The MCView application is stateless, that is, any type annotations created in it are not persistent. If you close your browser's window, any annotations you made will be lost, unless you first export the types as a CSV file. You can then start a new instance of the application in your browser and import this file back to continue where you left off. We saved a copy of this exported file in captured/one-pass.preliminary.types.csv which we'll use below.

You can also export a small convenience file which just contains the mapping between types and colors. In this vignette we'll assume this mapping does not change between the applications. We saved a copy of this file in captured/type_colors.csv - in fact, you will notice that we use this in the scripts/import_dataset.r script above when we create applications with type annotations already embedded in them.

These MCView mechanisms make it easy for a web server to deal with multiple simultaneous users who may be modifying things at the same time. For the web server, the application files are read-only, which is exactly what you want when publishing data for users to see. This even allows the users to work on alternative type annotations without impacting one another.

However, this makes things more difficult for a single user who actively edits the data. The way around it is to embed the exported file into the application's directory, that is, to modify the application's disk files. From that point on, all users that open the application will see the new type annotations.

We'll do this now, using the following simple script:

library("MCView")

args <- commandArgs(trailingOnly=TRUE)
stopifnot(length(args) == 2)
name <- args[1]
types <- args[2]

update_metacell_types(
    sprintf("../mcview/%s", name),               # The directory of the app
    sprintf("hca_bm-%s", gsub("/", "-", name)),  # The name of the dataset
    types                                        # The types CSV file
)

6. Applying type annotations from MCView

6.1 Reading the data

The MCView application is stateless, that is, any type annotations created in it are not persistent. You must export them (as a CSV file). What we'll do here is read this data, and embed it into our data set, so that it will be available for later analysis.

6.2 Conveying type annotations

We'll now apply these type annotations to both the metacells and the cells:
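The conveying step itself is a lookup: map each metacell name to its exported type, then broadcast through each cell's metacell name. A minimal sketch; the CSV column names ("metacell", "cell_type") and the data are illustrative assumptions, so match them to what MCView actually exports:

```python
import csv
import io

# Sketch: read an exported types CSV (metacell name -> type) and convey the
# type to each cell through its metacell name. Column names and data are
# hypothetical; adjust them to the real exported file.
types_csv = io.StringIO("metacell,cell_type\n0,HSC\n1,LMPP\n")
type_of_metacell = {row["metacell"]: row["cell_type"]
                    for row in csv.DictReader(types_csv)}

metacell_name_of_cell = ["0", "1", "1"]
type_of_cell = [type_of_metacell[name] for name in metacell_name_of_cell]
print(type_of_cell)  # → ['HSC', 'LMPP', 'LMPP']
```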

6.3 Removing doublet meta/cells

Some of the metacells were annotated as doublets. In the full iterative process, we would exclude the cells of these metacells, re-compute the metacells, and re-annotate them. Here in the one-pass vignette, we remove the cells, and also remove the metacells, and stop at that. This is "OK", but recomputing (and re-annotating) the metacells will give better results at the cost of additional effort.

For each cell we keep not only the metacell_name but also the metacell index. Since we've just deleted a few metacells, we'll need to patch these indices.
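Patching the indices amounts to remapping each surviving original index to its new position. A minimal sketch with toy data; since the cells of deleted (doublet) metacells are removed as well, mapping them to -1 here is just a placeholder:

```python
# Sketch: after deleting doublet metacells, remap the per-cell metacell
# indices so they match positions in the pruned metacells object.
# Outliers keep the index -1. Data are toy.
kept_metacells = [0, 2, 3]           # original indices that survived
new_index = {old: new for new, old in enumerate(kept_metacells)}

metacell_of_cell = [0, 2, 1, 3, -1]  # 1 was a deleted doublet metacell
patched = [new_index.get(old, -1) for old in metacell_of_cell]
print(patched)  # → [0, 1, -1, 2, -1]
```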

7. Finalizing the data

This is a good place to delete, rename, compute and in general ensure all per-gene and per-metacell properties exist, are properly named, etc. Again, how much effort (if any) is put into this varies greatly depending on the intended use of the final results.

7.1 Marking essential genes

If we are preparing a reference atlas, that is, we expect to project other data into it (as described in the projection vignette), then it is recommended to designate a few genes as "essential" for each cell type.

When we project data onto the atlas, we'll do a lot of quality assurance to ensure the projection makes biological sense. The (optional) essential gene lists help with this quality assurance. Specifically, if a gene is designated as essential for some cell type, and some projected data "seems like" it is of this cell type, we'll ensure it agrees with the atlas for the essential genes. For example, if a projected metacell does not express CD8, then we can't say it is a CD8 T-cell. It may be very similar to a CD8 T-cell, but we just can't reasonably say that it is one.

Selecting the (few) genes to use as essential for each cell type requires, as usual, the expertise of the biologist preparing the atlas. Note that all we require is a list of genes, not some quantitative threshold to use as a cell type classifier. The quantitative aspect is automatically provided by the projection algorithm itself (which is outside the scope of this vignette).

7.2 Recomputing for MCView

We've already computed all the MCView data. Yes, we have added the assignment of types to metacells since, but none of the computed data actually depends on these types, so there's no need to compute it again.

We also removed some doublet metacells. Again, this doesn't impact the quality control computations, which are per-metacell. However, it does impact the UMAP 2D projection, as it is very sensitive to doublets. Typically removing them allows UMAP to generate a better 2D layout, so we'll call it again.

In general, compute_for_mcview is just a collection of calls to (11!) lower-level quality control computation functions, which you can also call manually on your own. compute_umap_by_markers is just one of these functions.

Finally, note that a nice-looking 2D UMAP is of course useful, and serves for very rudimentary quality assurance. However, we do not consider this to be an analysis tool. Specifically, we do not worry when this (rather arbitrary) 2D projection changes during the iterative process (or, for that matter, when recomputed here). Instead we first and foremost focus on the markers heatmap (shown in MCView), which shows the true complexity of the data. In other words, just skipping the recomputation of the UMAP and keeping the original one above would have been "OK" as well.

We'll see the results in MCView, but here's a preview of the UMAP (this time, with colors based on the type annotations).

7.3 Save the data

We'll save the final results for whatever usage we intend for them (and for importing into MCView).

8. Importing into MCView

The difference between this MCView application and the preliminary one is that this one will know about the type annotations (and will reflect any spit-and-polish we have done in the final phase).

That's it. We now have our final results, both as h5ad files and as an MCView application, to use as we see fit.

To remove doubt, the results presented here are not suitable for use in any serious analysis. This is an example of the process, nothing more.