
Submodules

module

Manage storage of per-gene data.

class, names: tgutils.numpy.ArrayStr)

Bases: object

A collection of genes with some associated data.

This must be consistent between all profiles that are processed together.

__init__(full:, names: tgutils.numpy.ArrayStr) → None

Initialize the genes.

abstract any_array(name: str) → numpy.ndarray

Load a per-gene data array.

array(cls: Type[A], name: str) → A

Load a per-gene data array.

abstract available_data() → List[str]

Return a list of the available data.

count = None

Convenient access to the number of genes.

static created(path: str, organism: str) → None

Write the genes.yaml file after creating a genes directory.

At minimum, the directory should contain the gene names file.

full = None

The full gene set this is derived from.

abstract has_array(name: str) → bool

Whether there exists some per-gene data.

static load(path: str, organism: Optional[str] = None) →

Load genes from a directory.

names = None

The sorted upper-case names of genes in the set.

uuid = None

The unique identifier of this list of gene names.



Per-batch meta-data.

__init__(**kwargs) → None

Create metadata for some genes.

genes_count = None

The number of genes.

organism = None

The organism these are genes of.

required_keys = {'genes_count': <class 'int'>, 'organism': <class 'str'>, 'uuid': <class 'str'>}


A set of genes with some associated data.

This must be consistent between all batches that are processed together.

__init__(metadata: → None

Open a genes set directory for access.

any_array(name: str) → numpy.ndarray

Load a per-gene data array.

available_data() → List[str]

Return a list of the available data.

data_path(path: str) → str

Return the path of a data file in the genes directory.

has_array(name: str) → bool

Whether there exists some per-gene data.

metadata = None

The meta-data describing the genes.

class, included_indices: tgutils.numpy.ArrayInt32)


A subset of some genes.

__init__(superset:, included_indices: tgutils.numpy.ArrayInt32) → None

Create a subset of some genes.

any_array(name: str) → numpy.ndarray

Load a per-gene data array.

available_data() → List[str]

Return a list of the available data.

has_array(name: str) → bool

Whether there exists some per-gene data.

included_indices = None

The indices of the superset genes which are included in the subset.

superset = None

The genes this is a subset of.

module

Handle HCA (h5) files.

module

Helper functions.

List[uuid.UUID]) → uuid.UUID

Use md5sum to combine multiple UUIDs into a single UUID.

str) → uuid.UUID

Compute a checksum of a disk file.

int) → tgutils.application.Loop

Create a logged loop for summing profiles.

module

Handle multi-lane sparse format data.

class str, data: Dict[str, Any])

Bases: object

Describe the format of a CSV file.

__init__(path: str, data: Dict[str, Any]) → None

Construct from an entry in a format.yaml file.

field_names = None

When the file has no header line, assume this header line is used.

field_types = None

For each recognized field of the file, its expected data type.

separator = None

The separator between fields.


Bases: object

Data needed to delane the UMIs matrix.

__init__() → None

Initialize self. See help(type(self)) for accurate signature.

data_of_lanes = None

The per-lane data.

data_of_profiles = None

The per-profile data by the 0-based original index.

get_lane(lane_name: str) →

Get the data for a specific lane.

class str)

Bases: object

Describe the format of all CSV files in a standard format directory.

__init__(path: str) → None

Construct the format from a format.yaml file.

genes = None

The format of the genes CSV file.

profiles = None

The format of the profiles CSV file.

class str)

Bases: object

Per-lane data needed to delane the UMIs matrix.

__init__(name: str) → None

Initialize empty lane data.

add_entry(gene_index: int, profile_index: int, umis_count: int) → None

Add a scanned UMIs entry to the lane.

entry_lines = None

The lines of the entries of the lane.

last_profile_data_line_index = None

The last profile data line index of the lane.

name = None

The unique lane name.

profiles_count = None

The number of profiles in the lane.

write_umis_matrix(delaned_root: str, umis_header_line: str, genes_count: int) → None

Write the collected entries into the UMIs matrix file.


Bases: object

Per-profile data needed to delane the UMIs matrix.

__init__(lane_data: → None

Initialize scanned profile data.

index_in_lane = None

The 1-based index of the profile in the lane.

lane_data = None

The lane the profile belongs to.

module

Memory-mapped matrices for observations storage.

= b'METACELL SPARSE MEMORY MAPPED MATRIX VERSION 0\x00\x00'

The magic string for a sparse memory mapped matrix.

= 48

Size of the magic string identifying the file type.

class str)

Bases: object

A read-only memory mapped file containing a series of integer measurements per profile.

This is optimized to allow for accessing the data of arbitrary profiles. It does not allow for efficient extraction of the data of arbitrary genes.

The file format is as follows:

  • Magic string: containing the

  • Profiles count: 4-bytes little-endian.

  • Genes count: 4-bytes little-endian.

  • Profiles data: See below.

  • Padding to 4 bytes alignment using 0 bytes.

  • Index of offsets of profile data: See below.

The format of the index of the offsets is as follows:

  • For each profile, the 4-byte little-endian offset of the 1st byte of its data.

  • Finally, the offset of the 1st byte following the data of the last profile.

The format of the profile data is as follows:

  • A sequence of pairs of bytes, where the 1st byte is the unsigned delta to apply to the current gene index, and the 2nd byte is the unsigned value to add to the current gene of the profile.

This format is very compact when more than 1/256 of the genes have non-zero data, and when the measurement value is typically less than 256. When genes are sparser, we write additional byte pairs with a delta gene index of 255 and measurement of 0. When the measurement value is too high, we write additional byte pairs with a delta gene index of 0.
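The byte-pair scheme described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the names encode_profile and decode_profile are hypothetical, and the sketch assumes that the current gene index starts at 0 and that values above 255 are split into chunks of at most 255.

```python
from typing import List, Tuple


def encode_profile(entries: List[Tuple[int, int]]) -> bytes:
    """Encode sorted (gene_index, value) entries as (delta, value) byte pairs."""
    output = bytearray()
    current = 0  # Assumption: the current gene index starts at 0.
    for gene_index, value in entries:
        if value == 0:
            continue  # Zero observations have no effect on the output.
        delta = gene_index - current
        # Genes too far apart: emit filler pairs (delta 255, measurement 0).
        while delta > 255:
            output += bytes((255, 0))
            delta -= 255
        # Value too high: emit extra pairs for the same gene (delta 0).
        while value > 255:
            output += bytes((delta, 255))
            delta = 0
            value -= 255
        output += bytes((delta, value))
        current = gene_index
    return bytes(output)


def decode_profile(data: bytes, genes_count: int) -> List[int]:
    """Add the encoded byte pairs into a dense per-gene list."""
    values = [0] * genes_count
    current = 0
    for position in range(0, len(data), 2):
        current += data[position]              # Unsigned delta to the gene index.
        values[current] += data[position + 1]  # Unsigned value to add.
    return values


encoded = encode_profile([(3, 7), (600, 300)])
decoded = decode_profile(encoded, genes_count=1000)
```

Note that the decoding loop has no data-dependent branches, which matches the claim that decompression is reasonably fast.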

Decompression is reasonably fast as there are no branches involved. It is difficult to vectorize, but the saving in I/O (especially when accessing the data over a network) “should” justify this compression. However, it is trivial to parallelize the decompression of multiple profiles.

The compression rate is pretty good (around 1/50 compared to dense encoding). The optimal number of bits for the genes and measurement deltas is actually 8 and 4, respectively. However, this complicates the code and only reduces the file sizes by around 20%. Using an optimized Huffman tree or gzip, it is possible to further compress the result to around half the current size, but this would come at the cost of much slower decompression.

The profile data offsets index appears at the end of the file, but its location is known to be file_size - (profiles_count + 1) * 4. The profiles count is known to be at the offset

All 4-byte values are stored on a 4-byte aligned offset, and are in little-endian order. This allows a memory-mapped implementation on most processors to decode the value using an efficient aligned pointer dereference.
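The layout described above can be sketched with the standard struct module. This is a hedged illustration only: build_file and read_offsets are hypothetical helpers, and the sketch assumes that the two counts immediately follow the 48-byte magic string and that the offsets in the index are absolute file offsets.

```python
import struct
from typing import List

MAGIC = b'METACELL SPARSE MEMORY MAPPED MATRIX VERSION 0\x00\x00'


def build_file(genes_count: int, profiles: List[bytes]) -> bytes:
    """Assemble an in-memory image: header, profile data, padding, index."""
    image = bytearray(MAGIC + struct.pack('<II', len(profiles), genes_count))
    offsets = []
    for data in profiles:
        offsets.append(len(image))  # Offset of the 1st byte of this profile.
        image += data
    offsets.append(len(image))      # Offset following the last profile's data.
    image += b'\x00' * (-len(image) % 4)  # Pad to 4-byte alignment.
    image += struct.pack('<%dI' % len(offsets), *offsets)
    return bytes(image)


def read_offsets(image: bytes) -> List[int]:
    """Locate the index from the end of the file and decode it."""
    (profiles_count,) = struct.unpack_from('<I', image, len(MAGIC))
    index_start = len(image) - (profiles_count + 1) * 4
    return list(struct.unpack_from('<%dI' % (profiles_count + 1), image, index_start))


image = build_file(1000, [b'\x03\x07', b'', b'\x00\x01\x02\x05'])
offsets = read_offsets(image)
```

The data of profile i is then the slice between offsets[i] and offsets[i + 1], which is why the index holds one more entry than there are profiles.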

__init__(path: str) → None

Open a read-only memory-mapped matrix file.

add_profile_data(profile_index: int, array: tgutils.numpy.ArrayFloat32) → None

Add the profile data into a numpy array.

close() → None

Release all the resources held by the file.

entries_count() → int

Return the total number of non-zero entries in the matrix.

genes_count = None

The number of genes in the matrix.

mmap = None

The mapped memory.

profiles_count = None

The number of profiles in the matrix.

to_numpy_matrix() → tgutils.numpy.MatrixFloat32

Return the full matrix as a dense numpy array, whose rows are genes, columns are profiles, and entries are 4-byte integers containing observations.

static write(path: str, genes_count: int, data: List[List[Tuple[int, int]]]) → None

Write a sparse memory-mapped matrix.

  • path – The path of the disk file to write.

  • genes_count – The number of genes (size of gene set).

  • data – A list of profile data, where each profile data is a list of tuples, where each tuple has a gene index and an observation. The list of tuples must be sorted by the gene index. Entries with zero observations are allowed, and have no effect on the output. Multiple tuples with the same gene index are allowed, and are summed into the output.

module

Manage meta-data in YAML files.


Bases: types.SimpleNamespace

Meta-data stored in a YAML file.

__init__(**kwargs) → None

Create the metadata object.

as_dict(**kwargs) → Dict[str, Any]

Return the metadata as a clean dictionary for dumping to YAML.

static detect_cls(cls: Type[YM], _dictionary: Dict[str, Any]) → Type[YM]

Decide on the concrete metadata class to load from the dictionary.

dump(yaml_path: str, **kwargs) → None

Dump batch metadata into a file.

classmethod load(yaml_path: str) → YM

Load the batch meta-data from a file.

path = None

The path of the directory containing the data.

removed_keys = ['path', 'yaml_path']

The keys that should be removed from the YAML file.

required_keys = {'uuid': <class 'str'>}

The keys that must appear in the YAML file.
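A mapping such as required_keys lends itself to a simple validation pass over the loaded dictionary. The following is a sketch under assumptions, not the library's code: the function check_required_keys and its error messages are hypothetical.

```python
from typing import Any, Dict, List


def check_required_keys(dictionary: Dict[str, Any],
                        required_keys: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the data is valid."""
    problems = []
    for key, expected_type in sorted(required_keys.items()):
        if key not in dictionary:
            problems.append('missing key: %s' % key)
        elif not isinstance(dictionary[key], expected_type):
            problems.append('key: %s is not a %s' % (key, expected_type.__name__))
    return problems


problems = check_required_keys({'uuid': 'abc'}, {'uuid': str})
```

Collecting all the problems, rather than raising on the first one, makes it easier to report every issue in a hand-edited YAML file at once.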

uuid = None

The globally unique identifier of the data.

yaml_path = None

The path of the YAML meta-data file.

module

Maintain per-gene-per-profile UMIs data on the disk.

class str, data: Any)

Bases: types.SimpleNamespace

Per-combined profiles container metadata.

__init__(yaml_path: str, data: Any) → None

Load the meta-data from a YAML file.

as_dict(**kwargs) → Dict[str, Any]

Return the metadata as a clean dictionary for dumping to YAML.

name = None

The name of the combined container in the collection.

profiles_count = None

The number of profiles in the combined container.

uuid = None

The unique identifier of the combined container.

class*, metadata:, genes:, name: str = '')

Bases: object

Common base class for all profile containers.

__init__(*, metadata:, genes:, name: str = '') → None

Initialize self. See help(type(self)) for accurate signature.

add_into_gp_array(name: str, profile_index: int, profile_array: tgutils.numpy.ArrayFloat32) → tgutils.numpy.ArrayFloat32

Add the requested data of a profile into an array.

any_array(kind: str, name: str) → numpy.ndarray

Return a per-gene or per-profile array of any type.

any_frame(kind: str, name: str, *, profile_indices: Optional[Collection], gene_indices: Optional[Collection]) → pandas.core.frame.DataFrame

Return a per-gene/gene, per-gene/profile or per-profile/profile data of any type.


The returned frame index for genes (for gg and gp) is the gene names, even though the gene_indices parameter uses the integer index. Both the profile_indices and the returned frame index for profiles (for gp and pp) use the integer profile indices.

any_matrix(kind: str, name: str, profile_indices: Optional[Collection] = None, gene_indices: Optional[Collection] = None) → numpy.ndarray

Return a per-gene/gene, per-gene/profile or per-profile/profile data of any type.

any_series(kind: str, name: str) → pandas.core.series.Series

Return a series of per-gene or per-profile data of any type.

array(cls: Type[A], kind: str, name: str) → A

Return a per-gene or per-profile array of some type.

available_data(kind: str) → List[str]

Return a list of the available data of the specified kind (‘g’, ‘p’, ‘gp’, ‘gg’ or ‘pp’).

data_path(path: str) → str

Return the path of a data file in the container directory.

abstract dir_paths() → List[str]

Return the list of directories containing actual UMIs data for the container.

This is just a single path for a batch, and a list for a merged collection of batches. For a view it contains the view’s directory in addition to the base container’s.

frame(cls: Type[F], kind: str, name: str, *, profile_indices: Optional[Collection] = None, gene_indices: Optional[Collection] = None) → F

Return a per-gene/gene, per-gene/profile or per-profile/profile data of some type.


The returned frame index for genes (for gg and gp) is the gene names, even though the gene_indices parameter uses the integer index. Both the profile_indices and the returned frame index for profiles (for gp and pp) use the integer profile indices.

genes = None

The genes set stored for each profile.

has_array(kind: str, name: str) → bool

Test whether the container contains some per-gene or per-profile data file.

has_matrix(kind: str, name: str) → bool

Test whether the container contains some per-gene/gene, per-gene/profile or per-profile/profile data file.

static load(path: str, genes:, *, name: str = '', profiles_kind: Optional[str] = None) →

Load a profiles container from disk directory.

matrix(cls: Type[A], kind: str, name: str, *, profile_indices: Optional[Collection] = None, gene_indices: Optional[Collection] = None) → A

Return per-gene/gene, per-gene/profile or per-profile/profile data of some type.

metadata = None

The meta-data describing the container.

name = None

The human friendly name of the container.

profile_indices() → tgutils.numpy.ArrayInt32

Return an array of profile indices.

abstract profile_name(profile_index: int, *, base_index: int = 0) → str

Return a human-friendly profile name.

This uses the profile’s barcode (if a p.barcode.txt file exists), otherwise the numeric profile index. In both cases the name is prefixed with the specific batch name, if any.

profile_names(*, base_index: int = 0) → tgutils.numpy.ArrayStr

Return an array of all the profile names.

profiles_count = None

The number of profiles in the container (to be filled by the sub-class).

series(cls: Type[S], kind: str, name: str) → S

Return a series of per-gene or per-profile data of some type.

total_umis(kind: str, profile_indices: Optional[Collection] = None) → tgutils.numpy.ArrayFloat32

Return an array of the per-gene or per-profile total UMIs, for all or some of the profiles.

The first time this is called (only for all the profiles), the result is cached on disk.

write(cls: Type[A], array: A, kind: str, name: str) → None

Write a data array or matrix into the container.

class*, metadata:, genes:, name: str = '')


A batch of UMIs measurement profiles.

__init__(*, metadata:, genes:, name: str = '') → None

Open a batch of profiles for access.

static created(path: str, genes:, profiles_kind: str) → None

Write the profiles.yaml file after creating a profiles batch in some directory.

At minimum, the directory should include a raw UMIs matrix.

dir_paths() → List[str]

Return the list of directories containing actual UMIs data for the container.

This is just a single path for a batch, and a list for a merged collection of batches. For a view it contains the view’s directory in addition to the base container’s.

metadata = None

The meta-data describing the batch.

profile_name(profile_index: int, *, base_index: int = 0) → str

Return a human-friendly profile name.

This uses the profile’s barcode (if a p.barcode.txt file exists), otherwise the numeric profile index. In both cases the name is prefixed with the specific batch name, if any.



Per-batch meta-data.


alias of ProfilesBatch

class, genes:, name: str = '')


A collection of named containers.

Each combined container is a sub-directory which is, itself, a profiles container. Per-profile and per-gene-per-profile data is collected from the leaf (batch or view) containers.

__init__(metadata:, genes:, name: str = '') → None

Initialize self. See help(type(self)) for accurate signature.

add_into_gp_array(name: str, profile_index: int, profile_array: tgutils.numpy.ArrayFloat32) → tgutils.numpy.ArrayFloat32

Add the requested data of a profile into an array.

available_data(kind: str) → List[str]

Return a list of the available data of the specified kind (‘g’, ‘p’, ‘gp’, ‘gg’ or ‘pp’).

container_index_of_profile(profile_index: int) → int

Return the index of the container to which a specific profile belongs.

containers = None

The combined (leaf) containers.

static created(path: str, combined_dirs: List[str]) → None

Write the profiles.yaml file after creating a profiles collection in some directory.

The directory should include all the combined profile containers.

dir_paths() → List[str]

Return the list of directories containing actual UMIs data for the container.

This is just a single path for a batch, and a list for a merged collection of batches. For a view it contains the view’s directory in addition to the base container’s.

first_profile_indices = None

The first index of each container (for searching).
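One way a sorted list of first indices can support the lookup done by container_index_of_profile is a binary search. The sketch below is an assumption about how such a search might look, using the standard bisect module; the standalone function find_container_index is hypothetical.

```python
from bisect import bisect_right
from typing import List


def find_container_index(first_profile_indices: List[int], profile_index: int) -> int:
    """Return the position of the container holding the given profile.

    Assumes first_profile_indices is sorted and starts with 0.
    """
    # bisect_right counts the containers that start at or before the
    # profile; the last of these is the one that contains it.
    return bisect_right(first_profile_indices, profile_index) - 1


# Three containers holding profiles 0..9, 10..24 and 25 onwards.
container = find_container_index([0, 10, 25], 12)
```

The search takes logarithmic time in the number of containers, so it stays cheap even for collections of many batches.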

has_array(kind: str, name: str) → bool

Test whether the container contains some per-gene or per-profile data file.

has_matrix(kind: str, name: str) → bool

Test whether the container contains some per-gene/gene, per-gene/profile or per-profile/profile data file.

index_by_name = None

The index of a combined container by its name.

metadata = None

The meta-data describing the collection.

profile_name(profile_index: int, *, base_index: int = 0) → str

Return a human-friendly profile name.

This uses the profile’s barcode (if a p.barcode.txt file exists), otherwise the numeric profile index. In both cases the name is prefixed with the specific batch name, if any.



Per-collection meta-data.

__init__(**kwargs) → None

Create metadata for a collection of batches of profiles.

as_dict(**kwargs) → Dict[str, Any]

Return the metadata as a clean dictionary.

combined_profiles = None

The list of batches contained in the collection.


alias of ProfilesCollection

required_keys = {'combined_profiles': <class 'list'>, 'genes_count': <class 'int'>, 'genes_uuid': <class 'str'>, 'organism': <class 'str'>, 'profiles_count': <class 'int'>, 'profiles_kind': <class 'str'>, 'uuid': <class 'str'>}
class*, metadata:, genes:, name: str = '')


A batch of profiles where each one is the sum of a group of profiles from some other container.

__init__(*, metadata:, genes:, name: str = '') → None

Open a view of a profiles groups container for access.

In this case, the passed genes parameter is the full genes, which must match the full genes of the grouped container. The exact used genes are identical to the genes of the grouped container.

static created(path: str, grouped_profiles: → None

Write the profiles.yaml file after creating a profiles groups in some directory.

The directory should contain the p.grouped_count.npy file with the size of each group.

grouped_profiles = None

The open grouped container.

metadata = None

The meta-data describing the groups.



Per-groups meta-data.


alias of ProfilesGroups

required_keys = {'genes_count': <class 'int'>, 'genes_uuid': <class 'str'>, 'grouped_profiles': <class 'str'>, 'grouped_uuid': <class 'str'>, 'organism': <class 'str'>, 'profiles_count': <class 'int'>, 'profiles_kind': <class 'str'>, 'uuid': <class 'str'>}


Common meta-data for any profiles container.

__init__(**kwargs) → None

Create metadata for a batch of profiles.

container = None

The (concrete) container of the (derived) class.

static detect_cls(_cls: Type[YM], dictionary: Dict[str, Any]) → Type[YM]

Decide on the concrete metadata class to load from the dictionary.

genes_count = None

The number of genes.

genes_uuid = None

The set of genes that were measured for each profile.

open(genes:, *, name: str = '') →

Open the actual container for data access.

organism = None

The organism the data is for (‘human’, ‘mouse’, etc.).

profiles_count = None

The number of profiles this contains.

profiles_kind = None

The kind of profiles this contains (‘cells’, ‘metacells’, etc.).

required_keys = {'genes_count': <class 'int'>, 'genes_uuid': <class 'str'>, 'organism': <class 'str'>, 'profiles_count': <class 'int'>, 'profiles_kind': <class 'str'>, 'uuid': <class 'str'>}

The keys that must appear in the metadata YAML file.

class*, metadata:, genes:, name: str = '')


A restricted view of some profiles container.

The view is specified using the two optional files, p.base_index.npy and genes.base_index.npy. If missing, then the view is assumed to use the full profiles and/or genes of the base container.

Having a view which uses the full base container data is useful when creating a new collection. Such a “just a link” view allows the collection to refer to an arbitrary existing container, without having to physically copy it into the collection directory. This is different from a symbolic link in that it has different modification times (for make-like tools), and allows creating additional computed data inside the view without modifying the original base container.


Some data of the base container may not be meaningfully subset-able. For example, p.total_umis would be incorrect if the view uses a subset of the genes. By default, the code assumes all data is subset-able, as this allows for easy access to barcodes and other meta-data associated with each profile (or gene). To avoid subsetting computed data, add the name of the data to one of:

GENES_DEPENDENT_DATA = {'g': {'mean_fraction'}, 'gg': {}, 'gp': {}, 'p': {'total_umis'}, 'pp': {'balanced_ranks', 'correlation', 'edge_weights', 'pruned_ranks'}}

For each kind (gp, gg, pp, g, p), the names of data which depends on the set of genes.

PROFILES_DEPENDENT_DATA = {'g': {'mean_fraction', 'total_umis'}, 'gg': {}, 'gp': {'fold_in_group'}, 'p': {}, 'pp': {'balanced_ranks', 'correlation', 'edge_weights', 'pruned_ranks'}}

For each kind (gp, gg, pp, g, p), the names of data which depends on the set of profiles.

__init__(*, metadata:, genes:, name: str = '') → None

Open a view of a profiles container for access.

In this case, the passed genes parameter is the full genes, which must match the full genes of the base container. The exact used gene subset is computed based on the base container’s genes, and the content of the genes.base_index.npy file, if any.

add_into_gp_array(name: str, profile_index: int, profile_array: tgutils.numpy.ArrayFloat32) → tgutils.numpy.ArrayFloat32

Add the requested data of a profile into an array.

available_data(kind: str) → List[str]

Return a list of the available data of the specified kind (‘g’, ‘p’, ‘gp’, ‘gg’ or ‘pp’).

base_gene_index(gene_index: int) → int

Return the index of the gene in the base profiles container.

base_gene_indices = None

The indices of the included genes, or None if all genes are included.

base_profile_index(profile_index: int) → int

Return the index of the profile in the base profiles container.

base_profile_indices = None

The indices of the included profiles, or None if all profiles are included.
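The convention that None stands for "everything is included" can be illustrated with a small mapping helper. This is a sketch only: to_base_index is a hypothetical function, and plain sequences are used here in place of the numpy index arrays.

```python
from typing import Optional, Sequence


def to_base_index(view_index: int, base_indices: Optional[Sequence[int]]) -> int:
    """Map an index in the view to the matching index in the base container."""
    if base_indices is None:
        # The view includes all the base data, so indices are unchanged.
        return view_index
    return base_indices[view_index]


# A view keeping only base profiles 4, 7 and 9.
base_index = to_base_index(1, [4, 7, 9])
```

The same helper would apply to genes and to profiles, since both use the same "indices or None" representation.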

base_profiles = None

The open base container.

static created(path: str, base_profiles: → None

Write the profiles.yaml file after creating a profiles view in some directory.

The directory should contain the base_index files of the filtered profiles and/or genes.

dir_paths() → List[str]

Return the list of directories containing actual UMIs data for the container.

This is just a single path for a batch, and a list for a merged collection of batches. For a view it contains the view’s directory in addition to the base container’s.

has_array(kind: str, name: str) → bool

Test whether the container contains some per-gene or per-profile data file.

has_matrix(kind: str, name: str) → bool

Test whether the container contains some per-gene/gene, per-gene/profile or per-profile/profile data file.

is_slicing() → bool

Whether the view only includes some of the base container’s data.

is_slicing_genes() → bool

Whether the view only includes some of the base container’s genes.

is_slicing_profiles() → bool

Whether the view only includes some of the base container’s profiles.

metadata = None

The meta-data describing the batch.



Per-view meta-data.


alias of ProfilesView

required_keys = {'base_profiles': <class 'str'>, 'base_uuid': <class 'str'>, 'genes_count': <class 'int'>, 'genes_uuid': <class 'str'>, 'organism': <class 'str'>, 'profiles_count': <class 'int'>, 'profiles_kind': <class 'str'>, 'uuid': <class 'str'>}

= typing.Union[, numpy.ndarray]

A matrix of data per profile and gene.

str) → str

Return the profiles kind for the grouped profiles.

str) → str

Return the profiles kind for the profiles groups.

module

Create profile views.

Module contents

Manage disk storage of metacell data.