Read an anndata
file which is the output of python metacells
package,
and import the metacell dataset to MCView. Each project can have multiple datasets
which can be in the app using the right sidebar.
Usage
import_dataset(
project,
dataset,
anndata_file,
cell_type_field = NULL,
metacell_types_file = NULL,
cell_type_colors_file = NULL,
outliers_anndata_file = NULL,
cluster_metacells = TRUE,
cluster_k = NULL,
metadata_fields = NULL,
metadata = NULL,
metadata_colors = NULL,
cell_metadata = NULL,
cell_to_metacell = NULL,
gene_modules_file = NULL,
gene_modules_k = NULL,
calc_gg_cor = TRUE,
gene_names = NULL,
metacell_graphs = NULL,
atlas_project = NULL,
atlas_dataset = NULL,
projection_weights_file = NULL,
copy_atlas = TRUE,
minimal_max_log_fraction = -13,
minimal_relative_log_fraction = 2,
umap_anchors = NULL,
umap_config = NULL,
min_umap_log_expr = -14,
genes_per_anchor = 30,
layout = NULL,
default_graph = NULL,
overwrite = TRUE,
...
)
Arguments
- project
path to the project
- dataset
name for the dataset, e.g. "PBMC". The name of the dataset can only contain alphanumeric characters, dots, dashes and underscores.
- anndata_file
path to
h5ad
file which contains the output of metacell2 pipeline (metacells python package) or a loaded anndata object of the same format.- cell_type_field
name of a field in the anndata
object$obs
which contains a cell type (optional). If the field doesn't exist andmetacell_types_file
is missing, MCView would first look for a field named 'projected_type', 'type', 'cell_type' or 'cluster' atobject$obs
(in this order), and if it doesn't exists MCView would cluster the metacell matrix using kmeans++ algorithm (from thetglkmeans
package).- metacell_types_file
path to a tabular file (csv,tsv) with cell type assignement for each metacell. The file should have a column named "metacell" with the metacell ids and another column named "cell_type", or "cluster" with the cell type assignment. Metacell ids that do not exists in the data would be ignored.
If this parameter andcell_type_field
are missing andcluster_metacells=TRUE
, MCView would cluster the metacell matrix using kmeans++ algorithm (from thetglkmeans
package).
If the file has a field named 'color' andcell_type_colors_file=NULL
, the cell types colors would be used.- cell_type_colors_file
path to a tabular file (csv,tsv) with color assignement for each cell type. The file should have a column named "cell_type" or "cluster" with the cell types and another column named "color" with the color assignment.
In case the metacell types (given by a file or from theanndata_file
) contain types which are not present in the cell type colors file, MCView would use thechameleon
package to assign a color for them in addition to the cell type colors.
If this is missing, andmetacell_types_file
did not have a 'color' field, MCView would use thechameleon
package to assign a color for each cell type.
When an atlas is given (usingatlas_project
andatlas_dataset
), if the cell types are the same as the atlas, the atlas colors would be used.- outliers_anndata_file
path to anndata file with outliers (optional). This would enable, by default, the following tabs: ["Outliers", "Similar-fold", "Deviant-fold"]. See the metacells python package for more details.
- cluster_metacells
When TRUE and no metacell type is given (via
metacell_types_file
orcell_type_field
- implicit and explicit), MCView would cluster the metacell matrix using kmeans++ algorithm (from thetglkmeans
package).- cluster_k
number of clusters for initial metacell clustering. If NULL - the number of clusters would be determined such that a metacell would contain 16 cells on average.
- metadata_fields
names of fields in the anndata
object$obs
which contains metadata for each metacell.
The fields should can be either numeric or categorical.
You can usecell_metadata_to_metacell
to convert from categorical to a numeric score (e.g. by using fraction of the category). You can use 'all' in order to import all the fields of the anndata object.- metadata
can be either a data frame with a column named "metacell" with the metacell id and other metadata columns or a name of a delimited file which contains such data frame. See
metadata_fields
for other details.- metadata_colors
a named list with colors for each metadata column, or a name of a yaml file with such list. For numerical metadata columns, colors should be given as a list where the first element is a vector of colors and the second element is a vector of breaks.
Note that in case the breaks are out of the range of the metadata values, the colors would be used for the minimum and maximum values.
If only colors are given breaks would be implicitly determined from the minimum and maximum of the metadata field.
For categorical metadata columns, color can be given either as a named vector where names are the categories and the values are the colors, or as a named list where the first element named 'colors' holds the colors, and the second element called 'categories' holds the categories.- cell_metadata
data frame with a column named "cell_id" with the cell id and other metadata columns, or a name of a delimited file which contains such data frame. For activating the "Samples" tab, the data frame should have an additional column named "samp_id" with a sample identifier per cell (e.g., batch id, patient etc.). Optionally, a column named "metacell" can be added to the data frame, which will be used instead of the
cell_to_metacell
parameter.- cell_to_metacell
data frame with a column named "cell_id" with cell id and another column named "metacell" with the metacell the cell is part of, or a name of a delimited file which contains such data frame. If NULL, the metacell will be inferred from the 'metacell' column in
cell_metadata
.- gene_modules_file
path to a tabular file (csv,tsv) with assignment of genes to gene modules. Should have a field named "gene" with the gene name and a field named "module" with the name of the gene module.
- gene_modules_k
number of clusters for initial gene module calculation. If NULL - the number of clusters would be determined such that an gene module would contain 16 genes on average.
- calc_gg_cor
Calculate top 30 correlated and anti-correlated genes for each gene. This computation can be heavy for large datasets or weaker machines, so you can set
calc_gg_cor=FALSE
to skip it. Note that then this feature would be missing from the app.- gene_names
use alternative gene names (optional). A data frame with a column called 'gene_name' with the original gene name (as it appears at the 'h5ad' file) and another column called 'alt_name' with the gene name to use in MCView. Genes that do not appear at the table would not be changed.
- metacell_graphs
a named list of metacell graphs or files containing metacell graphs. Each graph should be a data frame columns named "from", "to" and "weight" with the ids of the metacells and the weight of the edge. If the list is not named, the names would be 'graph1', 'graph2' and so on. Note that the graph cannot be named "metacell" as this is reserved for the metacell graph.
- atlas_project
path to and
MCView
project which contains the atlas.- atlas_dataset
name of the atlas dataset
- projection_weights_file
Path to a tabular file (csv,tsv) with the following fields "query", "atlas" and "weight". The file is an output of
metacells
projection algorithm.- copy_atlas
copy atlas MCView to the current project. If FALSE - a symbolic link would be created instead.
- minimal_max_log_fraction
When choosing marker genes: take only genes with at least one value (in log fraction units - normalized egc) above this threshold
- minimal_relative_log_fraction
When choosing marker genes: take only genes with relative log fraction (mc_fp) above this this value
- umap_anchors
a vector of gene names to use for UMAP calculation. If NULL, the umap from the anndata object would be used.
- umap_config
a named list with UMAP configuration. See
umap::umap
for more details. When NULL, the default configuration would be used, except for: min_dist=0.96, n_neighbors=10, n_epoch=500.- min_umap_log_expr
minimal log2 expression for genes to use for UMAP calculation.
- genes_per_anchor
number of genes to use for each umap anchor.
- layout
a data frame with a column named "metacell" with the metacell id and other columns with the x and y coordinates of the metacell. If NULL, the layout would be taken from the anndata object.
- default_graph
a data frame with a column named "from", "to" and "weight" with the ids of the metacells and the weight of the edge. If NULL, the graph would be taken from the anndata object.
- overwrite
if a dataset with the same name already exists, overwrite it. Otherwise, an error would be thrown.
- ...
Arguments passed on to
create_project
title
The title of the app. This would be shown on the top left of the screen.
tabs
Controls which tabs to show in the left sidebar and their order. Options are: "QC", "Projection-QC", "Manifold", "Genes", "Query", "Atlas", "Markers", "Gene modules", "Projected-fold", "Diff. Expression", "Cell types", "Flow", "Annotate", "About". When NULL - default tabs would be set. For projects with atlas projections, please set
atlas
to TRUE.help
Controls wether to start the app with a help modal (from introjs). Help messages can be edited in help.yaml file (see 'Architecture' vignette).
selected_gene1,selected_gene2
The default genes that would be selected (in any screen with gene selection). If this parameter is missing, the 2 genes with highest max(expr)-min(expr) in the first dataset would be chosen.
selected_mc1,selected_mc2
The default metacells that would be selected in the Diff. Expression tab.
datasets
A named list with additional per-dataset parameters. Current parameters include default visualization properties of projection and scatter plots.
other_params
Named list of additional parameters such as projection_point_size, projection_point_stroke, scatters_point_size and scatters_stroke_size
edit_config
open file editor for config file editing
atlas
use default configuration for atlas projections (relevant only when
tabs
is NULL)
Details
The function would create a directory under project/cache/dataset
which
would contain objects used by MCView shiny app (such as the metacell matrix).
In addition, you can supply file with type assignment for each metacell
(metacell_types_file
) and a file with color assignment for each metacell type
(cell_type_colors_file
).
Examples
if (FALSE) { # \dontrun{
dir.create("raw")
download.file(
"http://www.wisdom.weizmann.ac.il/~atanay/metac_data/PBMC_processed.tar.gz",
"raw/PBMC_processed.tar.gz"
)
untar("raw/PBMC_processed.tar.gz", exdir = "raw")
create_project("PBMC")
import_dataset("PBMC", "PBMC163k", "raw/metacells.h5ad")
} # }