Computing Metacells - Iterative Process

1. Setup

In this vignette we will demonstrate how to compute metacells starting with brand-new data, with minimal or no reliance on previous analysis of similar data. This is in contrast to the one-pass vignette, where magically all our decisions were correct and we could immediately compute the final result (in practice, we simply applied the final decisions from the iterative process below).

In the real world, the actual process used will fall between these two extremes. That is, by using prior knowledge that applies to the specific data set, some of the decisions we make below will be correct from the start; most likely, though, some will still need to be revised, requiring a few iterations to finalize.

To remove doubt, the results presented here are not suitable for use in any serious analysis. This is an example of the process, nothing more.

We'll start with importing the python libraries we'll be using and set up some global configuration. Just importing takes a few seconds, mainly because Python's importing of C++ extensions with a large number of functions is inefficient.

Then, configure the packages we use. Feel free to tweak as necessary.

Since the process here is iterative, we'll use a separate directory to hold the results of each iteration, to keep things organized. In this example, we'll have 4 iterations in all, as well as the final results.
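
For reference, the setup might look something like the following sketch (the exact package list, configuration and directory layout are illustrative):

import os

import anndata as ad             # For reading and writing h5ad files.
import matplotlib.pyplot as plt  # For the threshold plots below.
import numpy as np
import seaborn as sb

import metacells as mc           # The metacells package itself.

# One output directory per iteration keeps the results organized.
for name in ["iteration-1", "iteration-2", "iteration-3", "iteration-4", "final"]:
    os.makedirs(f"../output/{name}", exist_ok=True)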

2. Reading the data

Our input here is the "full" data. This data already isn't the "raw" data; that is, we expect it to have gone through preliminary processing using basic cell-oriented tools, for example removing "doublet" cells, cells which are actually empty wells, etc. The specifics of such pre-processing may differ for each data set, and are outside the scope of this vignette.

Reading this data takes a ridiculously long time because the AnnData implementation, in its infinite wisdom, does not use memory-mapping. When we switch to using DAF things will be much better and faster.
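
A minimal sketch of this step (the file path is illustrative):

full = ad.read_h5ad("../full.h5ad")  # Slow; no memory-mapping is used.
mc.ut.set_name(full, "full")
print(full.shape)  # (number of cells, number of genes)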

3. Cleaning the data

Having this "full" data, we will demonstrate performing some additional "universal" steps to extract the "clean" data out of it. Importantly, all we will do here is select which subset of the data to use; we will not perform any corrections (such as batch corrections) here. Such corrections are better applies based on the resulting metacell model, as this will allow cell-type-specific analysis to be done.

3.1 Decisions

Cleaning up the data (whether in pre-processing or using the steps below) requires us to make decisions, based on prior biological knowledge and experience with the pipeline. Also, none of the decisions is final. As demonstrated here, we typically iterate the analysis, revisiting and refining prior decisions, until we obtain a "good" final result. This is in contrast to the one-pass vignette where we assume all decisions are "correct" from the start.

3.1.1 Excluding cells by UMIs count

The first decision we need to make is "how many UMIs does a clean cell contain". Cells with "too few" UMIs indicate very low-quality sampling and might be just empty droplets. Cells with "too many" UMIs might be doublets. The thresholds to use depend on the technology and the dataset, and the specifics of any pre-processing done, if any. The method we use here is to look at the distribution of total UMIs per cell, and pick "reasonable" thresholds for exclusions based on it.

This operation is identical to the matching one in the one-pass vignette.

We can visualize the cell UMIs threshold decision and adjust it accordingly:
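
A sketch of picking and visualizing the thresholds; the specific values here are illustrative and should be adjusted based on the plot:

# Illustrative thresholds - pick your own based on the distribution plot.
properly_sampled_min_cell_total = 800
properly_sampled_max_cell_total = 8000

# Total UMIs per cell, computed directly from the (sparse) expression matrix.
total_umis_per_cell = np.asarray(full.X.sum(axis=1)).flatten()

plot = sb.histplot(total_umis_per_cell, log_scale=True)
plot.set(xlabel="UMIs per cell", ylabel="Cells")
plot.axvline(x=properly_sampled_min_cell_total, color="darkgreen")
plot.axvline(x=properly_sampled_max_cell_total, color="crimson")
plt.show()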

3.1.2 Excluding genes by name

The next decision we need to make is which genes to exclude from the data, by their names. The poster children for this are mitochondrial genes and strong sex-chromosome genes.

Later on in the analysis we will revisit this, and exclude additional genes. This is in contrast to the one-pass vignette where we get the list right from the start. Realistically, one would start with a list that was used for similar data in the past, and possibly tweak it later as needed.

We'll instruct the package to exclude the genes we have chosen. By default, the code will also automatically exclude a few additional genes, specifically genes with very high variance that are not correlated with any other gene (bursty_lonely_gene, see the documentation for details). Such genes are useless for grouping cells together; even worse, including them is actively harmful as they cause cells to appear to be deviants (have a gene in which they are very different from the rest of the cells in the same metacell), and thereby be marked as outliers.
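
For example (the gene names and patterns here are illustrative, for human data):

mc.pl.exclude_genes(
    full,
    excluded_gene_names=["XIST", "MALAT1"],  # Sex-specific and non-coding genes.
    excluded_gene_patterns=["MT-.*"],        # Mitochondrial genes.
    random_seed=123456,  # Makes the bursty_lonely gene detection reproducible.
)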

3.1.3 Excluding cells by high excluded gene UMIs

The next decision we need to make is which cells to exclude due to containing too many UMIs in the excluded genes. If a cell contains "too many" excluded (mainly mitochondrial) gene UMIs, this may indicate a badly sampled cell, leading to very skewed results. Again, the exact threshold depends on both the technology and the dataset. Here we resort to looking at the distribution of the fraction of excluded genes in each cell, and manually picking the threshold.

We start by computing the total UMIs of excluded genes in each cell. We only need to do this once (as long as we don't change the list of excluded genes).

Next we'll pick a maximal fraction of excluded UMIs in each cell.

We can visualize the cell excluded genes fraction threshold decision and adjust it accordingly:
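
A sketch of these steps, assuming the excluded_umis per-cell annotation computed by compute_excluded_gene_umis (the threshold value is illustrative):

# Compute the total UMIs of excluded genes in each cell (only needed once
# per exclusion list).
mc.tl.compute_excluded_gene_umis(full)

# An illustrative threshold - pick your own based on the distribution plot.
properly_sampled_max_excluded_genes_fraction = 0.25

excluded_umis_per_cell = mc.ut.get_o_numpy(full, "excluded_umis")
excluded_fraction_per_cell = excluded_umis_per_cell / total_umis_per_cell

plot = sb.histplot(excluded_fraction_per_cell)
plot.set(xlabel="Fraction of excluded gene UMIs per cell", ylabel="Cells")
plot.axvline(x=properly_sampled_max_excluded_genes_fraction, color="crimson")
plt.show()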

We can now instruct the package to exclude the cells we have chosen.
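
A sketch, using the thresholds we picked above:

mc.pl.exclude_cells(
    full,
    properly_sampled_min_cell_total=properly_sampled_min_cell_total,
    properly_sampled_max_cell_total=properly_sampled_max_cell_total,
    properly_sampled_max_excluded_genes_fraction=properly_sampled_max_excluded_genes_fraction,
)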

3.2 Extract the clean data

Having decided on which cells and genes to exclude, we can now extract the clean data out of the full dataset.
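
For example (the name is illustrative):

clean = mc.pl.extract_clean_data(full, name="iterative.iteration-1.clean")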

3.3 Save the data

It is good practice to keep the pristine input data in case we'll need to do something else with it. We'll therefore save a copy of the "full" data with the manual annotations we have collected, as well as the "clean" data we extracted from it. We do this prior to computing metacells, which will add further annotations, so we'll be able to easily go back to the unannotated versions of the data if (realistically, when) we revise our decisions.

Using AnnData, this is a full copy of the data; when we'll switch to DAF we'll be able to just store the additional annotations we made, which would take practically no space and be much faster.
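
A sketch (the paths are illustrative):

full.write_h5ad("../output/iteration-1/full.h5ad")
clean.write_h5ad("../output/iteration-1/clean.h5ad")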

4. Compute the 1st iteration metacells

We'll rename our data to "cells" to distinguish it from the metacells we are about to compute. We'll be adding annotations to this data (most importantly, the metacell each cell belongs to), and will later save it separately from the "clean" data we saved above.
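
For example:

cells = clean
mc.ut.set_name(cells, "iterative.iteration-1.cells")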

4.1 Decisions

Even though we have the clean cells data, we can't compute metacells for it before making a few more decisions.

4.1.1 Lateral genes

A crucial decision when running metacells is the list of genes that are lateral, that is, that should not be used to group cells together. The poster children for this are cell-cycle genes. These genes are strong, so any clustering algorithm will prefer to group together cells in the same cell-cycle state, at the expense of mixing up (reasonably close) other cell states, which are what we are actually interested in. Note that lateral genes are still used in deviant cell detection; that is, each lateral gene should still have a consistent level in all the cells in each metacell.

Since our data is brand-new, we don't have a good comprehensive list of lateral genes for it, so we'll need to come up with an initial one, with the expectation we'll be adding more genes down the line during the iterative process.

To create this initial list, we'll start with a very short list of "base" lateral genes, using biological prior knowledge. The list below is not universal (it is specific to human cells, for one thing), but it is a start (cell cycle, stress, ribosomal). We'll then expand it with correlated genes.
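
A much-abbreviated, illustrative version of such a base list (a real list would be longer):

# A short, illustrative base list for human cells.
base_lateral_gene_names = [
    "MKI67", "PCNA", "TOP2A",  # Cell cycle.
    "FOS", "JUN", "HSPA1A",    # Stress / immediate-early response.
]
base_lateral_gene_patterns = ["RP[LS].*"]  # Ribosomal protein genes.

mc.pl.mark_lateral_genes(
    cells,
    lateral_gene_names=base_lateral_gene_names,
    lateral_gene_patterns=base_lateral_gene_patterns,
)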

Realistically, one would start with a list that was used for similar data in the past, skip the correlated genes expansion, and possibly tweak it later as needed. However, we show here the process starting with no similar data at all, so this (admittedly tedious) step is required.

Since our initial list was very partial, we would like to extend it to include any genes highly-correlated with them (as long as this makes sense, biologically speaking).

Manually looking at all 20,000-30,000 genes to see whether they are correlated with the base genes is obviously not practical. We therefore provide a function that tries to reduce the number of genes we need to look at, by computing correlations between genes and clustering them into "related gene modules". We can then examine only such modules that contain base lateral genes (or that are correlated with such modules).

When we do this, we conservatively include any genes that we might want to add to the lateral genes list (that is, lean heavily in favor of "false positive" errors). We expect the vast majority of the candidates we look at to not be added as lateral genes; what is more important is that we do not miss any gene that we will want to add.
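
A hedged sketch of this step, assuming the relate_to_lateral_genes pipeline function and the per-gene lateral_genes_module annotation it computes (see the package documentation for the exact names):

mc.pl.relate_to_lateral_genes(cells, random_seed=123456)

# Each gene is assigned to a related-genes module (-1 for ungrouped genes).
module_of_genes = cells.var["lateral_genes_module"]

# List the modules that contain at least one of our base lateral genes.
for gene_name in base_lateral_gene_names:
    module = module_of_genes[gene_name]
    if module >= 0:
        print(f"{gene_name}: module {module}")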

This correlates all the genes with each other, and clusters them. This is still a lot of data, so let's condense it to the correlation between the overall related gene modules (clusters).

We can visualize this to use as a coarse guide when looking into the specific modules:

We now visually inspect each of the potentially relevant gene modules, to decide on which additional genes to list as lateral. We'll generate a disk file for the figure of each module for offline inspection, but only show here in the notebook the gene modules that contain lateral genes, or are potentially correlated with other gene module(s) that contain lateral genes. These would still be "quite a few" modules to go through. For each one, we need to decide whether we want to ignore the module, add all its genes as lateral, or possibly add just a few of its genes as lateral. This depends on biological knowledge, and often requires looking at some resources (such as genecards or even published papers) to make the choice.

Having looked at all the above (which admittedly, takes some effort), we made the following decisions (which we hope you will find to be reasonable). Again, we don't expect the list to be perfect (yet); we'll revise it in future iterations. But the better we make it now, the fewer iterations we'll need.
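
We then update the lateral genes mask accordingly (the added names here are illustrative):

# Illustrative additions based on inspecting the gene modules.
additional_lateral_gene_names = ["TXN", "FOSB", "ID2"]

mc.pl.mark_lateral_genes(
    cells,
    lateral_gene_names=base_lateral_gene_names + additional_lateral_gene_names,
    lateral_gene_patterns=base_lateral_gene_patterns,
)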

4.1.2 Noisy genes

Another important list of genes are the genes with higher than usual variance, which will cause a lot of cells to be deviant (inconsistent with their metacells), and therefore be marked as outliers. Marking them as "noisy" allows for more variance for these genes in each metacell, reducing the number of outliers.

We currently don't provide a cell-based method for finding such genes; instead we wait until we have a metacells model and use it as the basis for this decision.

4.1.3 Parallelization

Finally, we need to decide on how much parallelization to use. This is a purely technical decision - it does not affect the results, just the performance.

The more parallel piles we use, the faster the computation will be (up to the number of physical processors, but that is handled automatically for us).

However, having more parallel piles means using more memory. If we run out of memory, we'll need to reduce the number of parallel piles. You can track the memory usage by running top or htop during the computation.

We provide a guesstimator for the maximal number of parallel piles that will fit in memory. This is by no means perfect, but it is a starting point.
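
For example:

max_parallel_piles = mc.pl.guess_max_parallel_piles(cells)
print(max_parallel_piles)  # Reduce this if the computation runs out of memory.
mc.pl.set_max_parallel_piles(max_parallel_piles)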

4.2 Computation

We are (finally) ready to actually group the cells into metacells. At least, for the first time. Since we don't trust the decisions we made so far too much, we don't expect these 1st iteration metacells to be worth much, either.

4.2.1 Hyper-parameters

The metacells pipeline has a lot of hyper-parameters you can tweak. The defaults were chosen such that scRNA-seq data, especially 10x data, should work "well" out of the box. You should read the documentation and have a good understanding of the effect of any parameter you may want to tweak, keeping in mind the synergy between some of the parameters.

If we had to call out one hyper-parameter you might wish to tweak, it would be the target_metacell_size. This specifies the "ideal" number of cells in each metacell. The algorithm works hard to keep the actual metacell size close to this value - in particular, metacells larger than twice this size will be split, and metacells which are much smaller than this size will be merged, or dissolved (become outliers).

By default this value is set to 96. Setting it to a smaller value will create more metacells, which may allow capturing more subtle differences between cell states (e.g. along a gradient); this, however, comes at the cost of less robust estimates of each gene's expression level in each metacell. This might be a worthwhile tradeoff if your cells are of higher quality.

In this vignette, we'll use the default across all iterations. Realistically, one may try different hyper parameters in different iterations.

One technical "hyper-paramater" you must specify is the random_seed. A non-zero value will genenrate reproducible results. A zero value will generate non-reproducible results and will be slightly faster. We "strongly urge" you to use a non-zero as reproducible results are much easier to deal with.

4.2.2 Assigning cells to metacells

This is the core of the method. It can take a while. The dataset used in this example is far from trivial - it contains ~1/3M cells. This takes ~10 minutes to compute on a hefty (48 HT cores, 0.5TB RAM) server. You can see progress is being made by running the computation with a progress bar. This progress is rather non-linear; for example, there's a pause at the start when computing rare gene modules, and it isn't smooth afterwards either. You can skip the progress bar altogether by getting rid of the with statement.
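
A sketch of the computation with a progress bar:

# Non-default hyper-parameters (e.g. target_metacell_size=...) could be
# passed here as well; we rely on the defaults.
with mc.ut.progress_bar():
    mc.pl.divide_and_conquer_pipeline(cells, random_seed=123456)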

4.2.3 Collecting the metacells

The above merely computed a metacell name and index for each cell ("Outliers" and negative for outlier cells). We still need to collect all the cells of each metacell, to create a new AnnData where each profile is a metacell. Note that in this new metacells data, we no longer have UMIs per gene; instead, for each gene, we have an estimate of the fraction of its UMIs out of the total UMIs. Since AnnData can only hold a single 2D dataset, the result must be a separate object (with each "observation" being a metacell), so we copy all the per-gene annotations from the cells dataset to the result.
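
For example (the name is illustrative):

metacells = mc.pl.collect_metacells(
    cells, name="iterative.iteration-1.metacells", random_seed=123456
)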

If the input data contains per-cell annotations of interest, then here would be a good place to convey that data to the metacells object. Since each metacell is a combination of multiple cells, this requires a decision on how exactly to do this for each per-cell annotation. We can either reduce the multiple per-cell values into a single one (such as the most frequent value or the mean value), or we can collect the fraction of cells that have each value in each metacell (for categorical data). Since AnnData is limited to simple 2D data, the latter is implemented as a series of per-metacell annotations (one for each possible value). This will be handled better in DAF.
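
For example, a sketch of reducing a hypothetical numeric per-cell "age" annotation to a per-metacell mean:

# "age" is a hypothetical per-cell annotation, used here for illustration only.
mc.pl.convey_obs_to_group(
    adata=cells, gdata=metacells,
    property_name="age", method=np.mean,  # Reduce to the mean value per metacell.
)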

4.3 Computing for MCView

So we have our metacells data (which we only trust as much as the decisions leading to it). This is pretty opaque in and of itself. MCView is our interactive tool for exploring the data, refining our decisions, and assigning type labels to the metacells.

However, in order to use MCView, we first need to compute all sorts of quality control data for MCView to display. This again may take a while (but much less than computing the metacells above).
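
A sketch:

mc.pl.compute_for_mcview(adata=cells, gdata=metacells, random_seed=123456)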

Here's a preview of the 2D UMAP view of the data (without any color annotations, as we do not have type annotations yet).

4.4 Saving the data

We'll now save the results, so we can import them into MCView.
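
A sketch (the paths are illustrative):

cells.write_h5ad("../output/iteration-1/cells.h5ad")
metacells.write_h5ad("../output/iteration-1/metacells.h5ad")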

5. Importing into MCView

This vignette focuses on the metacells package, not MCView, which deserves a full vignette of its own. Still, here are some basics about how to use it.

5.1 Installing MCView

MCView is written in R but is not a standard CRAN package. To install it, you should type (in R):

install.packages("remotes")
remotes::install_github("tanaylab/MCView")

5.2 Importing data set

Since MCView is written in R, it isn't easy to run it inside a Python notebook. Instead we've provided a small R script that will load the data we saved above, and import it into an MCView application. Here is the code of the script for reference:

library("MCView")

args <- commandArgs(trailingOnly=TRUE)

if (length(args) == 6) {
    prefix <- args[1]
    name <- args[2]
    title <- args[3]
    type <- args[4]
    atlas_prefix <- args[5]
    atlas_name <- args[6]
    import_dataset(
        sprintf("../mcview/%s", name),                           # The directory to create
        sprintf("%s-%s", prefix, gsub("/", "-", name)),          # The name of the dataset
        sprintf("../output/%s/%s.metacells.h5ad", name, prefix), # The metacells h5ad file
        metadata_fields = "all",                                 # Ask to import all the metadata
        title = title,                                           # A title for the GUI
        cell_type_field = type,                                  # The name of the type field
        cell_type_colors_file = "../captured/type_colors.csv",   # The type colors CSV file
        projection_weights_file = sprintf("../output/%s/%s.atlas_weights.csv", name, prefix),
        atlas_project = sprintf("../mcview/%s", atlas_name),
        atlas_dataset = sprintf("%s-%s", atlas_prefix, gsub("/", "-", atlas_name))
    )

} else if (length(args) == 4) {
    prefix <- args[1]
    name <- args[2]
    title <- args[3]
    type <- args[4]
    import_dataset(
        sprintf("../mcview/%s", name),                           # The directory to create
        sprintf("%s-%s", prefix, gsub("/", "-", name)),          # The name of the dataset
        sprintf("../output/%s/%s.metacells.h5ad", name, prefix), # The metacells h5ad file
        metadata_fields = "all",                                 # Ask to import all the metadata
        title = title,                                           # A title for the GUI
        cell_type_field = type,                                  # The name of the type field
        cell_type_colors_file = "../captured/type_colors.csv"    # The type colors CSV file
    )

} else if (length(args) == 3) {
    prefix <- args[1]
    name <- args[2]
    title <- args[3]
    import_dataset(
        sprintf("../mcview/%s", name),                           # The directory to create
        sprintf("%s-%s", prefix, gsub("/", "-", name)),          # The name of the dataset
        sprintf("../output/%s/%s.metacells.h5ad", name, prefix), # The metacells h5ad file
        metadata_fields = "all",                                 # Ask to import all the metadata
        title = title                                            # A title for the GUI
    )

} else {
    stopifnot(FALSE)
}

We'll just run it as an external process using Rscript:
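
A sketch of doing so from within the notebook (the script path and arguments are illustrative):

import subprocess

subprocess.run(
    ["Rscript", "../scripts/import_dataset.r",
     "iterative", "iteration-1", "Iteration 1"],  # prefix, name, title
    check=True,
)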

6. Compute the 2nd iteration metacells

Having looked at our data in MCView, we are ready to revise our original decisions and compute higher quality metacells. In general, we will have to iterate several times, recomputing the metacells in each iteration, until the result is "good enough". We'll rename the data to reflect this. Note that the cells contain the old assignment to metacells; this is harmless as it will be replaced by the new metacells we'll compute.

6.1 Decisions

The revised decisions below are based on the analysis we performed in MCView. When analyzing the first iteration results in MCView, we didn't bother to even try to add type annotations to the data, which makes this simpler. In the following iterations we'll add such annotations and convey them between the iterations.

6.1.1 Adding noisy genes

Looking in MCView, we have discovered genes which we should mark as noisy, because they have very high variance for cells with the "same" state (that is, within a metacell), and we think this makes biological sense.

The main tool we use for detecting noisy genes is the "# of metacells with significant inner-fold" QC graph. Assuming the algorithm works well, genes which have a high inner fold in many metacells must be extra noisy. These are also worth considering as lateral genes, but sometimes they are important for distinguishing between different cell types.

For example, HBB is noisy, but is a clear erythrocytes marker, so it should not be lateral:

HBB

IGKC is also noisy, but it is an immunoglobulin, so it should be lateral:

IGKC
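
We then mark the genes we chose (the names here are just the two examples above; a real list would typically be longer):

mc.pl.mark_noisy_genes(cells, noisy_gene_names=["HBB", "IGKC"])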

6.1.2 Adding lateral genes

Similarly, even given a cursory analysis of the data (without doing any manual type annotations), we already discovered more genes we need to mark as lateral (in addition to the lateral genes we noticed while looking for noisy genes).

The main tool for identifying additional lateral genes is the marker heatmap. In general, we should have a pretty good idea about the function of each gene in this heatmap. When looking at the one we got, we noticed several genes which do not differentiate between the cell types we are interested in isolating, for example, JUN and TUBA1A (cell death).

Markers Heatmap

Another tool is looking at the inner fold heatmap. This shows genes that the metacells algorithm did not manage to homogenize well, so it is a good place to look for genes with residual lateral behavior. For example, DONSON is related to cell cycle.

Inner Fold Heatmap

Finally, it is worth looking at the genes which are highly correlated with known lateral genes (including the ones we found above). For example, JUNB is correlated with JUN.

JUN-JUNB
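
We then extend the lateral genes mask with our findings (again, the added names are only the examples discussed above):

additional_lateral_gene_names += [
    "JUNB", "TUBA1A",  # Do not differentiate between the types of interest.
    "DONSON",          # Cell-cycle related.
    "IGKC",            # Immunoglobulin (also marked as noisy above).
]

mc.pl.mark_lateral_genes(
    cells,
    lateral_gene_names=base_lateral_gene_names + additional_lateral_gene_names,
    lateral_gene_patterns=base_lateral_gene_patterns,
)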

6.1.3 Parallelization

Since this is essentially the same data set as before, we'll just reuse the same amount of parallelization as in the first iteration for the rest of the process.

6.2 Computation

The steps here just repeat what we did in the first iteration.

7. Annotating types in MCView

This time, we expect the metacells to have sufficient quality that it is worthwhile to provide manual type annotations for them. As we iterate further, we'll use the previous iteration's annotations as the basis for the next; that is, the annotations are one more decision we refine during the iterations.

7.1 Type annotation decisions

The basic procedure we use for annotating metacell types is to look at the markers heatmap to identify interesting genes, and then use the gene-gene plots to identify groups of metacells that present the right combination of markers to justify some type label. Here's an example gene-gene plot showing a gradient between HSC (Hematopoietic Stem Cell) and LMPP (Lymphoid-primed Multipotent Progenitors):

HLF-AVP

If some metacells express a combination of markers for two unrelated types, we will consider labeling them as doublets. All these decisions require prior biological knowledge of the genes and the system. Here's an example of a doublet metacell containing a mixture of Ery (erythrocytes) and T-cells:

HBB-TRBC1

To kickstart the process, MCView will automatically cluster the metacells into some groups, and give a color to each. This makes the markers heatmap at least somewhat informative even before manual type annotation has begun. Note that these clusters are just a very crude starting point. Sometimes a cluster will map well to a single cell type, but most likely, a cell type will contain metacells from several clusters, or a cluster will need to be split between two cell types.

7.2 Updating types in MCView

The MCView application is stateless, that is, any type annotations created in it are not persistent. If you close your browser's window, any annotations you made will be lost, unless you first export the types as a CSV file. You can then start a new instance of the application in your browser and import this file back to continue where you left off. We saved a copy of this exported file in iterative.iteration-2.types.csv which we'll use below.

You can also export a small convenience file which just contains the mapping between types and colors. In this vignette we'll assume this mapping does not change between the applications. We saved a copy of this file in captured/type_colors.csv - in fact, you will notice that we use this in the scripts/import_dataset.r script above when we create applications with type annotations already embedded in them.

These MCView mechanisms make it easy for a web server to deal with multiple simultaneous users who may be modifying things at the same time. For the web server, the application files are read-only, which is exactly what you want when publishing data for users to see. This even allows the users to work on alternative type annotations without impacting each other.

However, this makes it more difficult for a single user who actively edits the data. The way around this is to embed the exported file into the application's directory, that is, modify the application's disk files. From that point on, all users that open the application will see the new type annotations.

We'll do this now, using the following simple script:

library("MCView")

args <- commandArgs(trailingOnly=TRUE)
stopifnot(length(args) == 2)
name <- args[1]
types <- args[2]

update_metacell_types(
    sprintf("../mcview/%s", name),                  # The directory of the app
    sprintf("../hca_bm-%s", gsub("/", "-", name)),  # The name of the dataset
    types                                           # The types CSV files
)
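
And run it, again as an external process (the script path is an assumption; the types CSV is the exported file mentioned above):

import subprocess

# The script path is an assumption; see the script's code above.
subprocess.run(
    ["Rscript", "../scripts/update_types.r",
     "iteration-2", "iterative.iteration-2.types.csv"],  # name, types CSV
    check=True,
)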

8. Compute the 3rd iteration metacells

We'll again rename the data.

8.1 Capturing previous type annotations

We are about to compute a new set of metacells, so these types we assigned to the old set of metacells can't be directly applied to them. Instead, we will apply the type annotations to the cells, that is, give each cell the type of its (old) metacell; then, after computing the new metacells, we'll give each one the most frequent type of the cells it contains. This is far from perfect, so we'll need to go to MCView again and verify and modify these types, but it is an improvement over starting the annotation from scratch.

A tricky point is that some cells are outliers (do not belong in any metacell), so we have to explicitly allow for that when applying the types to the cells.
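
A sketch, explicitly handling the outlier cells:

# Give each cell the type of its (old) metacell, or "Outliers" for cells
# that do not belong to any metacell (negative metacell index).
type_of_metacells = metacells.obs["type"].values
type_of_cells = np.array([
    type_of_metacells[metacell_index] if metacell_index >= 0 else "Outliers"
    for metacell_index in cells.obs["metacell"]
])
mc.ut.set_o_data(cells, "type", type_of_cells)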

8.2 Decisions

The revised decisions below are based on the analysis we performed in MCView.

8.2.1 Removing doublets

Ideally, all doublets should have been removed from even the "full" data we started with. However, in the real world, we often find a few metacells which we think are doublets. Our best course of action is to remove them from the data, so they will be excluded in the next iteration.

We'll just slice the data to remove the excluded doublet cells.
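
For example (the "Doublet" type label is illustrative - use whatever label was assigned in MCView):

cells = cells[type_of_cells != "Doublet", :].copy()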

8.2.2 Adding lateral genes

Even though we went through two iterations, we still discovered more lateral genes, using the same techniques as described above. Such is the nature of the beast.

8.2.3 Adding noisy genes

This time, we haven't found any new genes to mark as noisy. These genes are much easier to detect so they typically stabilize early in the iterations.

8.3 Computation

The steps here again repeat what we did in the previous iterations. However, this time we can also automatically assign a type to each new metacell, by picking the most frequent type of the cells in it. We do not expect this to be perfect, but to serve as a more convenient starting point when reviewing the data in MCView.
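
A sketch of this extra step:

# Assign each new metacell the most frequent type of the cells it contains.
mc.pl.convey_obs_to_group(
    adata=cells, gdata=metacells,
    property_name="type", method=mc.ut.most_frequent,
)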

9. Computing 4th iteration metacells

We'll do another iteration, similar to the previous one. Depending on the intended usage, it may take many iterations until the results are "good enough".

9.1 Decisions

The revised decisions below are based on the analysis we performed in MCView, similar to what we did in the previous iterations.

9.1.1 Removing excluded genes

This time we noticed NEAT1 is causing problems, and upon consideration, decided it is not sufficient to mark it as lateral. Instead we want to exclude it from the data, so it will not impact the denominator (total UMIs in each cell), as it is a very highly expressed gene.

In general we care about the fraction of the UMIs of each gene out of the total for each cell. A very highly expressed gene may therefore cause other genes to appear to be under-expressed. This is part of the reason why, for example, we excluded the MT-.* mitochondrial genes from the analysis. One should be careful, though, because this is a very blunt tool to apply to the data.

We'll just compute a mask of the additional (well, single) excluded gene and apply it below.
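
For example:

more_excluded_genes_mask = np.isin(cells.var_names, ["NEAT1"])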

9.1.2 Removing doublets

We discovered a few more doublets. Blood data is notorious for containing doublets because blood cells are "sticky".

We'll again simply slice the set of cells and genes we use next:
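
A sketch (again, the "Doublet" type label is illustrative):

type_of_cells = mc.ut.get_o_numpy(cells, "type")
cells = cells[type_of_cells != "Doublet", ~more_excluded_genes_mask].copy()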

9.1.3 Adding lateral genes

Even though we went through three iterations so far, we still discovered more lateral genes. This only appears to be an endless process; it does converge. Eventually.

9.2 Computation

The steps here again repeat what we did in the previous iterations.

10. Finalizing the data

At some point, one must declare the results to be "good enough" and wrap things up. This point "depends".

If one is working on a very large data set, with the intent of publishing it as an atlas for other researchers to use as a reference, one would (presumably) invest much more effort in ensuring the quality of the metacells and type annotations for each of the cell types.

In contrast, if one is working on a small data set, and is investigating a specific biological question (e.g. comparing a specific cell type between wild type and knockout organisms), one can focus the effort on specific cell types, and use coarse grained annotations for the rest (and worry much less about their quality in general).

Either way, the decision of when "enough is enough" will depend on the analyst's judgement, just like every other decision we have discussed so far.

In this specific example, the purpose of the vignette is merely to demonstrate the iterative process. We therefore decided to stop here. To remove doubt, the results presented here are not suitable for use in any serious analysis. This is an example of the process, nothing more.

All that said, wrapping things up requires taking a few extra steps we'll discuss below. We'll start by renaming the data to indicate it is the final result.

10.1 Conveying type annotations

Because we will not be recomputing the metacells this time, we can apply the final type annotations from our work in MCView to both the cells (as in previous iterations) and the metacells.
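
A sketch, assuming the exported CSV lists one row per metacell with a cell_type column (the file name and column name are assumptions - check the actual exported file):

import pandas as pd

# The file name and the "cell_type" column name are assumptions.
final_types = pd.read_csv("iterative.final.types.csv")
mc.ut.set_o_data(metacells, "type", final_types["cell_type"].values)

# As in previous iterations, convey the types to the cells as well.
type_of_metacells = metacells.obs["type"].values
type_of_cells = np.array([
    type_of_metacells[metacell_index] if metacell_index >= 0 else "Outliers"
    for metacell_index in cells.obs["metacell"]
])
mc.ut.set_o_data(cells, "type", type_of_cells)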

10.2 Marking essential genes

If we are preparing a reference atlas, that is, we expect to project other data into it (a process which is the subject of a separate vignette TODO), then it is recommended to designate a few genes as "essential" for each cell type.

When we project data onto the atlas, we'll do a lot of quality assurance to ensure the projection makes biological sense. The (optional) essential genes help with this quality assurance. Specifically, if a gene is designated as essential for some cell type, and some projected data "seems like" it is of this cell type, we'll ensure that this data agrees with the atlas for the essential genes. For example, if the projected metacell does not express CD8, then we can't say it is a CD8 T-cell. It may be very similar to a CD8 T-cell, but we just can't reasonably claim that it is one.

Selecting the (few) genes to use as essential for each cell type requires, as usual, the expertise of the biologist preparing the atlas. Note that all we require is a list of genes, not some quantitative threshold to use as a cell type classifier. The quantitative aspect is automatically provided by the projection algorithm itself (which is outside the scope of this vignette).
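
A hedged sketch; both the gene lists and the mark_essential_genes call (with this signature) are assumptions to be checked against the package documentation:

# Illustrative (and far from complete) essential gene lists per type.
essential_gene_names_of_types = {
    "CD8 T-cell": ["CD8A", "CD8B"],
    "Ery": ["HBB", "HBA1"],
}
mc.pl.mark_essential_genes(
    metacells,
    essential_gene_names_of_types=essential_gene_names_of_types,
)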

10.3 Removing doublet meta/cells

Some of the metacells were annotated as doublets. In a realistic iterative process, we would exclude the cells of these metacells, re-compute the metacells, and re-annotate them. Since we artificially stopped at this point in the process, we'll skip these steps, leaving the doublets in these so-called "final" results.

10.4 Spit and polish in general

This is a good place to delete, rename, and compute properties, and in general ensure all per-gene and per-metacell properties exist, are properly named, etc. Again, how much effort (if any) is put into this varies greatly depending on the specific scenario.

Just as an example, here we'll delete the type annotations from the previous iterations from the data, because they will only confuse anyone using the final result.
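
For example (the annotation names here are hypothetical):

# Delete stale per-metacell annotations left over from earlier iterations.
for stale_annotation in ["old_type", "most_frequent_type"]:
    if stale_annotation in metacells.obs:
        del metacells.obs[stale_annotation]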

10.5 Computing for MCView

We've already computed all the MCView data. Yes, we have changed the assignment of types to metacells since, but none of the computed data actually depends on these types, so there's no need to compute it again.

10.6 Saving the final results

We'll save the results in some location that makes it clear that this is the final data. In a real process, this would be what we'd publish as a "metacells atlas". In fact, typically such an atlas wouldn't even contain the cells data, but we store it for completeness, and in case "something comes up" and we have to do a few more iterations.
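
A sketch (the paths are illustrative):

cells.write_h5ad("../output/final/cells.h5ad")
metacells.write_h5ad("../output/final/metacells.h5ad")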

That's it. We now have our final results, both as h5ad files and as an MCView application, to use as we see fit.

To remove doubt, the results presented here are not suitable for use in any serious analysis. This is an example of the process, nothing more.