AnnData Format
DataAxesFormats.AnnDataFormat
—
Module
Import/export
Daf
data from/to
AnnData
. We use the
AnnData
Julia implementation from
Muon.jl
.
Due to the different data models, not all the content of
AnnData
can be represented as
Daf
, and vice-versa. However, "most" of the data can be automatically converted from one form to the other. In both directions, conversion is zero-copy; that is, we merely create a different view for the same vectors and matrices. We also use memory-mapping whenever possible for increased performance.
The following
Daf
data can't be naively stored in
AnnData
:
- AnnData is restricted to storing data for only two axes, which AnnData always calls "obs" and "var". In contrast, Daf can store data for an arbitrary set of meaningfully named axes.
- AnnData always contains a matrix property for these two axes called "X". Mercifully, the rest of the matrices are allowed to have meaningful names. In contrast, Daf allows storing an arbitrary set of meaningfully named matrices.
- AnnData can only hold row-major matrices, while Julia defaults to column-major layout; Daf allows storing both layouts, sacrificing disk storage for performance.
Therefore, when viewing
Daf
data as
AnnData
, we pick two specific axes and rename them to "obs" and "var", pick a specific matrix property of these axes and rename it to "X", and
relayout!
it if needed so
AnnData
would be happy. We store the discarded names of the axes and matrix in unstructured annotations called
obs_is
,
var_is
and
X_is
. This allows us to reconstruct the original names when re-viewing the
AnnData
as
Daf
data.
The following data in
AnnData
can't be naively stored in
Daf
:
- Non-scalars (e.g., mappings) inside uns unstructured annotations. The Daf equivalent is storing JSON string blobs, which is awkward to use. TODO: provide better API to deal with such data.
- Data using nullable entries (e.g., a matrix with nullable integer entries). In contrast, Daf supports the convention that zero values are special. This only works in some cases (e.g., it isn't a good solution for Boolean data). It is possible of course to explicitly store Boolean masks and apply them to the data, but this is inconvenient. TODO: Have Daf natively support nullable/masked arrays.
- Categorical data. Categorical vectors are therefore converted to simple strings. However, Daf doesn't support matrices of strings, so it doesn't support or convert categorical matrices.
- Matrix data that only uses one of the axes (that is, obsm and varm data). The problem here is, paradoxically, that Daf supports such data "too well", by allowing multiple axes to be defined, and storing matrices based on any pair of axes. However, this requires the other axes to be explicitly created, and their information just doesn't exist in the AnnData data set. TODO: Allow unstructured annotations to store the entries of the other axis.
When viewing
AnnData
as
Daf
, we either ignore, warn, or treat as an error any such unsupported data.
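The three possible reactions (ignore, warn, error) can be sketched as below. This is only an illustration of the policy, not the package's implementation: the enum `UnsupportedPolicy`, its values, and the function `report_unsupported` are hypothetical names invented for this sketch.

```julia
# Hypothetical stand-in for the "ignore, warn, or treat as an error" choice.
@enum UnsupportedPolicy IgnoreUnsupported WarnUnsupported ErrorUnsupported

# React to one piece of unsupported data according to the chosen policy.
# In every non-error case the data is simply skipped by the caller.
function report_unsupported(policy::UnsupportedPolicy, message::AbstractString)
    if policy == ErrorUnsupported
        error(message)      # abort the whole conversion
    elseif policy == WarnUnsupported
        @warn message       # report the problem and carry on
    end
    return nothing          # IgnoreUnsupported: stay silent
end

report_unsupported(WarnUnsupported, "unsupported obsm member: X_umap")
```

The member name in the warning message is made up for the example.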
Square matrices accessed via
Daf
APIs will be the (column-major)
transpose
of the original
AnnData
(row-major) matrix.
Due to limitations of the
Daf
data model, square matrices are stored only in column-major layout. In contrast,
AnnData
square matrices (
obsp
,
varp
) are stored in row-major layout. We have several bad options to address this:
- We can break the Daf invariant that all accessed data is column-major, at least for square matrices. This is bad because the invariant greatly simplifies Daf client code. Forcing clients to check the data layout and call relayout! would add a lot of error-prone boilerplate for our users.
- We can relayout! the data when copying it between AnnData and Daf. This is bad because it would force us to duplicate the data. More importantly, there is typically a good reason for the layout of the data. For example, assume a directed graph between cells. A common way to store it is to have a square matrix where each row contains the weights of the edges originating in one cell, connecting it to all other cells. This allows code to efficiently "loop on all cells; loop on all outgoing edges". If we relayout! the data, then such a loop would become extremely inefficient.
- We can return the transposed matrix from Daf. This is bad because Julia code and Python code processing the "same" data would need to flip the indices (e.g., outgoing_weight[from_cell, to_cell] in Python vs. outgoing_weight[to_cell, from_cell] in Julia).
Having to pick between these bad options, we chose the last one as the lesser evil. The assumption is that Julia code is written separately from the Python code anyway. If the same algorithm is implemented in both systems, it would work (efficiently!), as long as the developer has read this warning and flipped the order of the indices.
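The index flip can be illustrated with plain Julia arrays, without any Daf or AnnData calls. The sketch below assumes a tiny three-cell graph with made-up weights; `row_major_buffer` and `total_outgoing` are names invented for the illustration. Reading a row-major buffer as a column-major matrix yields the transpose without copying, and the "loop on all cells; loop on all outgoing edges" pattern still walks memory contiguously.

```julia
# A 3x3 row-major buffer as Python would lay it out: row = from_cell, column = to_cell.
n_cells = 3
row_major_buffer = Float32[0, 1, 2,   # edges out of cell 1
                           3, 0, 4,   # edges out of cell 2
                           5, 6, 0]   # edges out of cell 3

# Viewing the same memory as a column-major Julia matrix gives the transpose,
# so entries are addressed as [to_cell, from_cell]. No data is copied.
outgoing_weight = reshape(row_major_buffer, n_cells, n_cells)

from_cell, to_cell = 2, 3
println(outgoing_weight[to_cell, from_cell])  # weight of the edge 2 -> 3

# Sum the outgoing weights of every cell. The inner loop runs down one column,
# which is contiguous in memory for a column-major matrix.
function total_outgoing(weights::AbstractMatrix)
    totals = zeros(eltype(weights), size(weights, 2))
    for from_cell in axes(weights, 2)
        for to_cell in axes(weights, 1)
            totals[from_cell] += weights[to_cell, from_cell]
        end
    end
    return totals
end

println(total_outgoing(outgoing_weight))
```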
We do
not
have this problem with non-square matrices (e.g., the per-cell-per-gene
UMIs
matrix), since
Daf
allows for storing and accessing both layouts of the same data in this case. We simply populate
Daf
with the row-major data from
AnnData
and, if asked for the other layout, will
relayout!
it (and store/cache the result).
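The "relayout on demand and cache" behavior for non-square matrices can be sketched with a plain dictionary. This is not how Daf stores its data; `MatrixKey` and `get_layout!` are hypothetical names, and `permutedims` stands in for the relayout step (it copies the data into the other layout).

```julia
# Key: (rows axis, columns axis, property name). Hypothetical stand-in for Daf storage.
const MatrixKey = Tuple{String, String, String}

# Return the matrix in the requested layout. If only the flipped layout is
# stored, build the requested one once and keep it for later calls.
function get_layout!(
    stored::Dict{MatrixKey, Matrix{Float64}},
    rows_axis::String,
    columns_axis::String,
    name::String,
)
    return get!(stored, (rows_axis, columns_axis, name)) do
        flipped = stored[(columns_axis, rows_axis, name)]
        permutedims(flipped)  # copy into the other layout
    end
end

stored = Dict{MatrixKey, Matrix{Float64}}()
stored[("gene", "cell", "UMIs")] = [1.0 2.0; 3.0 4.0; 5.0 6.0]  # 3 genes x 2 cells

by_cell = get_layout!(stored, "cell", "gene", "UMIs")  # 2 cells x 3 genes, now cached
```

A second request for the same layout returns the cached matrix rather than converting again, at the cost of holding both layouts.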
DataAxesFormats.AnnDataFormat.anndata_as_daf
—
Function
anndata_as_daf(
[filter::Maybe{Function} = nothing,]
adata::Union{AnnData, AbstractString};
[name::Maybe{AbstractString} = nothing,
obs_is::Maybe{AbstractString} = nothing,
var_is::Maybe{AbstractString} = nothing,
X_is::Maybe{AbstractString} = nothing,
unsupported_handler::AbnormalHandler = WarnHandler]
)::MemoryDaf
View
AnnData
as a
Daf
data set, specifically using a
MemoryDaf
. This doesn't duplicate matrices or vectors, but acts as a view containing references to the same ones. Adding and/or deleting data in the view using the
Daf
API will not affect the original
adata
.
Any unsupported
AnnData
annotations will be handled using the
unsupported_handler
. By default, we'll warn about each and every such unsupported property.
If
adata
is a string, then it is the path of an
h5ad
file which is automatically loaded.
If not specified, the
name
will be the value of the "name"
uns
property, if it exists, otherwise, it will be "anndata".
If not specified,
obs_is
(the name of the "obs" axis) will be the value of the "obs_is"
uns
property, if it exists, otherwise, it will be "obs".
If not specified,
var_is
(the name of the "var" axis) will be the value of the "var_is"
uns
property, if it exists, otherwise, it will be "var".
If not specified,
X_is
(the name of the "X" matrix) will be the value of the "X_is"
uns
property, if it exists, otherwise, it will be "X".
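The fallback order described above (explicit parameter, then the uns property, then the fixed default) can be written as a small helper. This is a sketch of the rule only: `resolve_name` is a hypothetical function, and the uns annotations are modeled as a plain dictionary.

```julia
# Pick a name: the explicit argument wins, then the value stored in `uns`
# under `key`, then the fixed default.
function resolve_name(
    explicit::Union{AbstractString, Nothing},
    uns::AbstractDict,
    key::AbstractString,
    default::AbstractString,
)::String
    explicit !== nothing && return String(explicit)
    return String(get(uns, key, default))
end

uns = Dict("obs_is" => "cell", "X_is" => "UMIs")

obs_is = resolve_name(nothing, uns, "obs_is", "obs")    # taken from uns
var_is = resolve_name(nothing, uns, "var_is", "var")    # falls back to the default
X_is = resolve_name("counts", uns, "X_is", "X")         # explicit argument wins
```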
If
filter
is specified, it is a function that is given two parameters. The first is the name of the
AnnData
member (
X
,
obs
,
var
,
obsp
,
varp
,
layer
) and the second is the key (
X
for the
X
member). It should return
false
if the data is to be ignored. This allows skipping unwanted data (or data that can't be converted for any reason). This doesn't speed things up
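A filter might look like the sketch below. The predicate `keep_member`, the layer key "scaled", and the file name "cells.h5ad" are made up for the illustration; the commented-out call follows the signature shown above and needs the package and a real file to run.

```julia
# Hypothetical filter: skip all pairwise (square) matrices and one unwanted layer,
# keep everything else.
function keep_member(member::AbstractString, key::AbstractString)::Bool
    member in ("obsp", "varp") && return false
    member == "layer" && key == "scaled" && return false
    return true
end

# Assumed usage, with the filter as the first positional argument:
# daf = anndata_as_daf(keep_member, "cells.h5ad"; name = "cells")
```

Because the filter comes first, Julia's `do` block syntax should also work for writing it inline.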
DataAxesFormats.AnnDataFormat.daf_as_anndata
—
Function
daf_as_anndata(
daf::DafReader;
[obs_is::Maybe{AbstractString} = nothing,
var_is::Maybe{AbstractString} = nothing,
X_is::Maybe{AbstractString} = nothing,
h5ad::Maybe{AbstractString} = nothing]
)::AnnData
View the
daf
data set as
AnnData
. This doesn't duplicate matrices or vectors, but acts as a view containing references to the same ones. Adding and/or deleting data in the view using the
AnnData
API will not affect the original
daf
data set.
If
h5ad
is specified, the result is also written to an h5ad file at that path.
If not specified,
obs_is
(the name of the "obs" axis) will be the value of the "obs_is" scalar property, if it exists, otherwise, it will be "obs".
If not specified,
var_is
(the name of the "var" axis) will be the value of the "var_is" scalar property, if it exists, otherwise, it will be "var".
If not specified,
X_is
(the name of the "X" matrix) will be the value of the "X_is" scalar property, if it exists, otherwise, it will be "X".
Each of the final
obs_is
,
var_is
,
X_is
values is stored as an unstructured annotation, unless the default value ("obs", "var", "X") is used.
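The "store unless default" rule can be sketched as follows. The function `names_to_annotate` is hypothetical and models the unstructured annotations as a plain dictionary; it is meant only to show which names would be recorded.

```julia
# Collect the original names that must be remembered in the exported data.
# A name equal to the AnnData default carries no information and is omitted.
function names_to_annotate(;
    obs_is::AbstractString,
    var_is::AbstractString,
    X_is::AbstractString,
)::Dict{String, String}
    annotations = Dict{String, String}()
    obs_is != "obs" && (annotations["obs_is"] = obs_is)
    var_is != "var" && (annotations["var_is"] = var_is)
    X_is != "X" && (annotations["X_is"] = X_is)
    return annotations
end

annotations = names_to_annotate(; obs_is = "cell", var_is = "gene", X_is = "X")
```

Here only the two non-default axis names are recorded, which is enough to restore "cell" and "gene" when the result is later viewed as Daf again.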
All scalar properties, vector properties of the chosen "obs" and "var" axes, and matrix properties of these axes, are stored in the returned new
AnnData
object.