Computing Metacells - Projection Process

In this vignette we will demonstrate how to compute metacells starting with brand-new data, while heavily relying on a previously computed metacell atlas for similar data. This is very similar to the one-pass process, except that we have an "oracle" (the atlas) that helps us with the initial decisions (and with the type annotations). However, this oracle is not perfect, so it does not replace the iterative process, where we repeatedly revise our decisions; rather, it allows us to greatly reduce the number of iterations by starting from a much better initial point.

Here in the projection vignette we'll just show how to compute the first iteration to start this process. As time goes by, we hope to establish a library of atlases for various data sets, so that this becomes the most common procedure for analyzing new data.

We will use the final results of the one-pass process as our atlas. To remove doubt, this atlas and the results presented here are not suitable for use in any serious analysis. This is an example of the process, nothing more.

1. Setup

We'll start with importing the python libraries we'll be using and set up some global configuration. Just importing takes a few seconds, mainly because Python's importing of C++ extensions with a large number of functions is inefficient.

Then, configure the packages we use. Feel free to tweak as necessary.

Even though the process here is truly "one-pass", realistically it should be followed by additional iterations. We'll therefore save files under output/projection/preliminary to drive the point home - in fact, we'll need output/projection/corrected as well below.

2. Reading the data

Our input here is the "clean" data. That is, unlike in the one-pass and iterative process vignettes, here we assume we already excluded all the bad genes and cells. Naturally you will have to go through the steps described there to obtain the clean data, but there's no value in repeating them here in this vignette.

Note that the atlas isn't much help in cleaning your data. The thresholds used for the atlas do not apply to the new data set. The list of excluded genes might apply, but the atlas by definition does not include its own excluded genes. It is convenient to look at the documentation published with the atlas for the list of genes that were excluded and use it as a basis. It is good practice to include this list in the atlas documentation; for that matter, it is good practice to have atlas documentation in the first place. But even so, this isn't automated.

We also need to read our atlas. We'll just load the results of the one-pass process here, with the understanding that this is just an example. This means you have to run the one-pass vignette before you can run this one. Note that we only need to load the metacells data. We do not need the per-cell data to use an atlas, and typically it is not published as part of the atlas as it is very large and is not used in the atlas-based analysis.

Note that the new data set should use the "same" gene names as the atlas. This isn't a problem if both data sets were sequenced using the same (or "compatible") technologies. Note that it is acceptable to have some genes only in the atlas, or only in the new data set; what is important is that if the "same" gene appears in both, it has the same name.

Alas, while there definitely are common conventions for gene names, there are no strong standards for gene names - some (important!) genes have multiple names used in different systems. That's true even for well-studied organisms such as mice and men. If you are working with less studied, exotic organisms such as, say, jellyfish or fungi, the situation is even worse. We expect that as atlases accumulate this will become less of a problem as "de-facto" standards emerge. For now, you are on your own here. In this vignette we have the names already aligned.
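As a concrete (if trivial) sanity check, you can intersect the two name lists before projecting, and eyeball whatever is atlas-only or query-only. The gene names below are made up for illustration:

```python
# Hypothetical gene-name sanity check before projecting onto an atlas.
# These gene names are made up for illustration only.
atlas_genes = {"Cd8a", "Foxp3", "Mki67", "Hbb-bs"}
query_genes = {"Cd8a", "Foxp3", "Mki67", "Gata1"}

shared = atlas_genes & query_genes      # usable by the projection
atlas_only = atlas_genes - query_genes  # ignored for this query
query_only = query_genes - atlas_genes  # can't be projected at all

print(sorted(shared))      # the genes the projection can actually use
print(sorted(atlas_only))  # worth checking these aren't renamed genes
print(sorted(query_only))  # likewise
```

If a gene shows up as "atlas-only" and a suspiciously similar one as "query-only", that's a hint the two data sets use different names for the same gene.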

3. Compute the metacells

3.1 Decisions

Even though we have the clean cells data, we can't compute metacells for it before making a few more decisions.

3.1.1 Lateral genes

A crucial decision when running metacells is the list of genes that are lateral, that is, genes that should not be used to group cells together. The poster children for this are the cell-cycle genes. These genes are strongly expressed, and any clustering algorithm will therefore prefer to group together cells in the same cell-cycle state, at the expense of mixing up other (reasonably close) cell states, which are what we are actually interested in. Note that lateral genes are still used in deviant cell detection; that is, each lateral gene should still have a consistent expression level in all the cells of each metacell.

Since we have an atlas computed for similar data, we can leverage the work that went into computing it by simply starting with the list used for creating this atlas. Here in the projection vignette we'll use it as-is. Realistically, there's no guarantee this list would be perfect for our new data, and in theory it may contain genes that we may wish to keep, so it is a good idea to review it, and refine it in following iterations. Still, this is much easier than the start-from-scratch procedure described in the iterative process vignette, and the number of follow-up iterations "should" be much lower.

3.1.2 Noisy genes

The same considerations apply to noisy genes. We'll just start with the list from the atlas. Here in the projection vignette we'll use it as-is. Realistically it may be revised in follow-up iterations as usual.

3.1.3 Parallelization

Finally, we need to decide on how much parallelization to use. This is a purely technical decision - it does not affect the results, just the performance.

The more parallel piles we use, the faster the computation will be (up to the number of physical processors, but that is handled automatically for us).

However, having more parallel piles means using more memory. If we run out of memory, we'll need to reduce the number of parallel piles. You can track the memory usage by running top or htop during the computation.

We provide a guesstimator for the maximal number of parallel piles that will fit in memory. This is by no means perfect, but it is a starting point.
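As a back-of-the-envelope illustration of the tradeoff (this is not the package's guesstimator; the function name and the memory numbers below are made up):

```python
import os

def estimate_max_parallel_piles(total_ram_gb, gb_per_pile):
    """Illustrative only: cap the number of parallel piles by both
    available memory and the number of processors on the machine."""
    by_memory = int(total_ram_gb // gb_per_pile)
    by_cpus = os.cpu_count() or 1
    return max(1, min(by_memory, by_cpus))

# E.g., 64GB of RAM at a (made-up) ~4GB per pile allows at most 16
# parallel piles - fewer if the machine has fewer processors.
max_piles = estimate_max_parallel_piles(64, 4)
```

The real guesstimator measures the actual data rather than taking a fixed GB-per-pile constant, but the shape of the calculation is the same: memory bounds parallelism, and processors bound useful parallelism.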

3.2 Computation

We are ready to actually group the cells into metacells.

3.2.1 Hyper-parameters

The metacells pipeline has a lot of hyper-parameters you can tweak. The defaults were chosen so that scRNA-seq data, especially 10x data, should work "well" out of the box. You should read the documentation and have a good understanding of the effect of any parameter you may want to tweak, keeping in mind the synergy between some of the parameters.

If we had to call out one hyper-parameter you might wish to tweak, it would be the target_metacell_size. This specifies the "ideal" number of cells in each metacell. The algorithm works hard to keep the actual metacell size close to this value - in particular, metacells larger than twice this size will be split, and metacells which are much smaller than this size will be merged, or dissolved (become outliers).

By default this value is set to 96. Setting it to a smaller value will create more metacells, which may allow capturing more subtle differences between cell states (e.g. along a gradient); this, however, would come at the cost of making less robust estimations of each gene's expression level in each metacell. This might be a worthwhile tradeoff if your cells are of higher quality.
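The split/merge policy described above is simple enough to paraphrase in code. Note this is only a paraphrase of the stated behavior: the "much smaller" threshold below is a made-up placeholder, not the package's actual parameter:

```python
TARGET_METACELL_SIZE = 96  # the default mentioned above

def size_action(n_cells, target=TARGET_METACELL_SIZE, min_fraction=0.25):
    """Paraphrase of the policy described in the text: split metacells
    larger than twice the target; merge or dissolve metacells that are
    much smaller than it (min_fraction is a made-up placeholder)."""
    if n_cells > 2 * target:
        return "split"
    if n_cells < min_fraction * target:
        return "merge-or-dissolve"
    return "keep"
```

So with the default target of 96, a 200-cell metacell would be split, while a 10-cell one would be merged into a neighbor or dissolved into outliers.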

One technical "hyper-parameter" you must specify is the random_seed. A non-zero value will generate reproducible results. A zero value will generate non-reproducible results and will be slightly faster. We strongly urge you to use a non-zero value, as reproducible results are much easier to deal with.

3.2.2 Assigning cells to metacells

This is the core of the method. It can take a while. The dataset used in this example is trivial - it contains ~50K cells. This takes ~2 minutes to compute on our hefty (48 HT cores, 0.5TB RAM) server.

3.2.3 Collecting the metacells

The above merely computed a metacell name and index for each cell ("Outliers" and negative for outlier cells). We still need to collect all the cells of each metacell, to create a new AnnData where each profile is a metacell. Note that in this new metacells data, we no longer have UMIs per gene; instead, for each gene, we have an estimate of the fraction of its UMIs out of the total UMIs. Since AnnData can only hold a single 2D dataset, the result must be a separate object (with each "observation" being a metacell), so we copy all the per-gene annotations from the cells dataset to the result.
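To make the UMIs-to-fractions step concrete, here is a tiny self-contained NumPy sketch with synthetic counts (the real collection step is more careful, but the arithmetic per metacell is this):

```python
import numpy as np

# Synthetic UMI counts: 5 cells x 3 genes, plus a metacell index per cell
# (-1 marks an outlier cell, as described above).
umis = np.array([[10, 0, 5],
                 [8, 1, 6],
                 [0, 20, 1],
                 [1, 18, 0],
                 [9, 2, 4]])
metacell_of_cell = np.array([0, 0, 1, 1, -1])

# For each metacell, sum the UMIs of its cells per gene, then convert the
# totals to fractions of the metacell's total UMIs. Outliers are skipped.
n_metacells = metacell_of_cell.max() + 1
fractions = np.zeros((n_metacells, umis.shape[1]))
for mc_index in range(n_metacells):
    totals = umis[metacell_of_cell == mc_index].sum(axis=0)
    fractions[mc_index] = totals / totals.sum()

# Each row of `fractions` sums to 1; e.g. metacell 0 has fractions
# [18/30, 1/30, 11/30] for the three genes.
```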

If the input data contains per-cell annotations of interest, then here would be a good place to convey that data to the metacells object. Our example data here doesn't contain any annotations of interest so we'll skip this part.

3.3 Computing for MCView

So we have our metacells data. This is pretty opaque in and of itself. MCView is our interactive tool for exploring the data, and for assigning type labels to the metacells.

However, in order to use MCView, we first need to compute all sorts of quality control data for MCView to display. This again may take a while (but much less than computing the metacells above).

Here's a preview of the 2D UMAP view of the data (without any color annotations, as we do not have type annotations yet):

4. Projection

Normally, we would now use MCView to manually add type annotations to the metacells, review the gene lists, and in general continue with the iterative process. However, we have an atlas with type annotations, which has much more data, and can be used to give us immediate automatic insights about our data, including initial type annotations. Again, this does not excuse us from performing additional iterations to review and refine the analysis, but it significantly reduces our effort.

4.1 Computing the projection

Computing the projection requires no decisions from us - that's the whole point. The one required option is whether we want to use the reproducible (slightly slower) implementation or we are OK with a non-reproducible (slightly faster) implementation. We strongly recommend using the reproducible implementation as that makes life much easier when managing the analysis process.

(Actually, we told a lie: there is one important decision to make, described below. For now we'll pretend there isn't.)

The return value is a sparse weights matrix - for each metacell of our data, it gives its "best" representation as a weighted average of a few atlas metacells.

In addition to the weights, this also sets various annotations in the data. One of these is the projected_type, which answers: "if you forced me to pick a type annotation for each new data metacell, what would it be?". Note this does not necessarily mean this is the "right" type of the new data metacell. You should definitely review this (e.g. using MCView). For example, the new data metacell may be a doublet. Sometimes the projection algorithm will detect this, and indicate it by providing a non-empty projected_secondary_type for the metacell. Sometimes, the new data metacell may be of a new cell type that does not exist in the atlas. The projection algorithm will try to detect this, and indicate it by setting the similar mask to False for that metacell. And, of course, the projection algorithm is fallible. So you should not blindly accept the results.
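In spirit, the weights and the projected type relate like this (toy numbers; the actual algorithm chooses the weights per metacell, and decides types and secondary types much more carefully):

```python
import numpy as np

# Toy atlas: 3 atlas metacells x 4 genes (fractions), each with a type.
atlas_fractions = np.array([[0.50, 0.30, 0.10, 0.10],
                            [0.10, 0.10, 0.50, 0.30],
                            [0.25, 0.25, 0.25, 0.25]])
atlas_types = np.array(["T-cell", "B-cell", "Monocyte"])

# Sparse-in-spirit weights of 1 query metacell over the atlas metacells.
weights = np.array([[0.7, 0.3, 0.0]])

# The projected image is the weighted average of atlas metacells...
projected_image = weights @ atlas_fractions

# ...and a naive "projected type" is the type carrying the most weight.
projected_type = atlas_types[np.argmax(weights, axis=1)]
```

Here the projected image is `[0.38, 0.24, 0.22, 0.16]` and the naive projected type is "T-cell"; a large secondary weight on a different type is exactly the kind of signal that would suggest a doublet.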

At any rate, here's the same UMAP as above, but this time together with the projected_type annotations:

4.2 Saving the data

We'll save the results on disk for future reference, and to allow them to be imported into MCView.

This time, we also have the weights to consider. We'll want MCView to read them as well, so it will be able to provide us with nice features for dealing with the projection results.

5. Importing into MCView

This vignette focuses on the metacells package, not MCView, which deserves a full vignette of its own. Still, here are some basics about how to use it.

5.1 Installing MCView

MCView is written in R but is not a standard CRAN package. To install it, you should type (in R):

install.packages("remotes")
remotes::install_github("tanaylab/MCView")

5.2 Importing the data set

Since MCView is written in R, it isn't easy to run it inside a Python notebook. Instead we've provided a small R script that will load the data we saved above, and import it into an MCView application. Here is the code of the script for reference:

library("MCView")

args <- commandArgs(trailingOnly=TRUE)

if (length(args) == 6) {
    prefix <- args[1]
    name <- args[2]
    title <- args[3]
    type <- args[4]
    atlas_prefix <- args[5]
    atlas_name <- args[6]
    import_dataset(
        sprintf("../mcview/%s", name),                           # The directory to create
        sprintf("%s-%s", prefix, gsub("/", "-", name)),          # The name of the dataset
        sprintf("../output/%s/%s.metacells.h5ad", name, prefix), # The metacells h5ad file
        metadata_fields = "all",                                 # Ask to import all the metadata
        title = title,                                           # A title for the GUI
        cell_type_field = type,                                  # The name of the type field
        cell_type_colors_file = "../captured/type_colors.csv",   # The type colors CSV file
        projection_weights_file = sprintf("../output/%s/%s.atlas_weights.csv", name, prefix),
        atlas_project = sprintf("../mcview/%s", atlas_name),
        atlas_dataset = sprintf("%s-%s", atlas_prefix, gsub("/", "-", atlas_name))
    )

} else if (length(args) == 4) {
    prefix <- args[1]
    name <- args[2]
    title <- args[3]
    type <- args[4]
    import_dataset(
        sprintf("../mcview/%s", name),                           # The directory to create
        sprintf("%s-%s", prefix, gsub("/", "-", name)),          # The name of the dataset
        sprintf("../output/%s/%s.metacells.h5ad", name, prefix), # The metacells h5ad file
        metadata_fields = "all",                                 # Ask to import all the metadata
        title = title,                                           # A title for the GUI
        cell_type_field = type,                                  # The name of the type field
        cell_type_colors_file = "../captured/type_colors.csv"    # The type colors CSV file
    )

} else if (length(args) == 3) {
    prefix <- args[1]
    name <- args[2]
    title <- args[3]
    import_dataset(
        sprintf("../mcview/%s", name),                           # The directory to create
        sprintf("%s-%s", prefix, gsub("/", "-", name)),          # The name of the dataset
        sprintf("../output/%s/%s.metacells.h5ad", name, prefix), # The metacells h5ad file
        metadata_fields = "all",                                 # Ask to import all the metadata
        title = title                                            # A title for the GUI
    )

} else {
    stopifnot(FALSE)
}

We'll just run it as an external process using Rscript:
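For reference, invoking the script from Python might look like the sketch below. The script file name and the argument values are illustrative (they follow the four-argument branch shown above), and the actual invocation is left commented out:

```python
import subprocess

# Arguments matching the 4-argument branch of the R script above.
# All of these values are illustrative - adjust to your own layout.
command = [
    "Rscript", "import_dataset.r",  # illustrative script path
    "preliminary",                  # prefix
    "projection/preliminary",       # name
    "Projection Preliminary",       # title for the GUI
    "projected_type",               # the type field to color by
]

# subprocess.run(command, check=True)  # uncomment to actually run it
```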

5.3 Running MCView

The simplest way to run MCView is, in R, to type:

library(MCView)
run_app("mcview/projection/preliminary") # The path we imported the dataset into

Since MCView is a shiny application, you have many other options, which are outside the scope of this vignette.

6. Technology-Corrected Projection

We'll not go over all the MCView features for dealing with projection (a topic worthy of a vignette of its own). We'll note, however, that there's a new Projection QC tab which shows quality controls of the projection. We'll take a look at "Projected correlation per metacell". This shows us the distribution (across metacells) of the correlation between the new data set metacell and the projected image of the metacell on the atlas (that is, the weighted average of some atlas metacells). This is computed only for the genes the projection algorithm thinks it actually managed to project "well", so we expect it to be high. In fact, we expect it to be very high. Specifically, when we see a median of ~0.87 in the QC tab, we worry. That's an R^2 of ~0.75, that is, ~25% of the variance of the genes in the new data set is not captured by the projection on the atlas (we zoomed to the range 0.8 - 1.0):
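The arithmetic behind that worry is just squaring the median correlation:

```python
# Median projected correlation reported by the QC tab.
median_correlation = 0.87

# Squaring a correlation gives the fraction of variance explained.
r_squared = median_correlation ** 2  # ~0.757
unexplained = 1.0 - r_squared        # ~0.243, i.e. ~25% of the variance
```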

Uncorrected Correlation

So why the poor result? Well, it turns out that the new data set we are projecting was collected using 10X v3, while the atlas was computed from 10X v2 data. In general, any scRNA-seq technology has different sensitivity for different mRNA molecules, and this is equivalent to applying some technology-dependent multiplicative factor to the number of UMIs of each gene. For this reason, our projection mechanism will optionally try to estimate a multiplicative correction factor for each (relevant) gene, such that this will (ideally) cancel out the difference between the technologies.
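The idea of a per-gene multiplicative factor can be sketched with synthetic numbers. This is only the concept: the package's actual estimator restricts itself to relevant, significantly-expressed genes and is applied inside the projection, while this toy version just takes the ratio of means:

```python
import numpy as np

# Synthetic mean fractions of 4 genes in the atlas vs. the new data set.
# Pretend the new technology "sees" gene 2 about twice as strongly.
atlas_mean = np.array([0.40, 0.30, 0.10, 0.20])
query_mean = np.array([0.38, 0.28, 0.21, 0.13])

# A naive multiplicative correction: scale each query gene so that its
# mean matches the atlas. Gene 2 gets a factor of ~0.48 (scaled down).
correction = atlas_mean / query_mean

# By construction, applying the correction cancels the technology bias.
corrected = query_mean * correction
```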

This is not the default as applying it to data which is not expected to need such corrections may overfit the new data on the atlas. This overfitting is mild; for example, nucleus-only scRNA-seq data gives very non-linear biases for some mRNA molecules when compared to whole-cell data, so using this correction will not force-fit the data sets together (that is, you will still get lower R^2 and a lot of not-similar metacells). Still, overfitting is overfitting, and should never be the default. Use with care, YMMV, etc.

At any rate, let's re-run the projection with this correction. We'll start by renaming the data to distinguish it from the previous run. Note we do not need to re-compute the metacells:

And recompute all the projection annotations, this time with project_correction:

Here's the UMAP with these (improved) type annotations:

And we'll save the data again for MCView:

If we look at the projected QC tab now, we see much better results (again we zoomed to the range 0.8 - 1.0):

Corrected Correlation

Now the median is ~0.95, which is an R^2 of ~0.90; that is, only ~10% of the variance (of the genes we did manage to fit) is not explained by the projection. Much better.

7. Follow-up iterations

In theory, that's it. We now have our final results, both as h5ad files and as an MCView application, to use as we see fit.

In practice, not so fast.

We really need to manually review the gene lists and the type annotations. Most likely, we'll at least modify the type annotations. We'll need to consider the similar and projected_secondary_type annotations, possibly marking some metacells as doublets. Note that the projection algorithm doesn't merely give you these high-level indicators. It allows you to look at the differential expression between the new data and the atlas, even for a specific metacell, to allow making judgment calls about whether the new data metacell is "really" of some atlas type, or possibly requires a new cell type label which does not exist in the atlas.

Also, if we discover we need to modify the lateral or noisy genes lists, we'll have to recompute the new data metacells. If we do so, we may wish to project the improved metacells on the atlas for reference, but apply the previous iteration types, since they would be of higher quality.

In short, what we have done so far is effectively to merely skip to the N-th iteration of the iterative process, saving us a lot of effort, but we are by no means done.

To remove doubt, the results presented here are not suitable for use in any serious analysis. This is an example of the process, nothing more.