Readers
DataAxesFormats.Readers
—
Module
The
DafReader
interface specifies a high-level API for reading
Daf
data. This API is implemented here, on top of the low-level
FormatReader
API. The high-level API provides thread safety so the low-level API can (mostly) ignore this issue.
Each data set is given a name to use in error messages etc. You can explicitly set this name when creating a
Daf
object. Otherwise, when opening an existing data set, if it contains a scalar "name" property, it is used. Otherwise some reasonable default is used. In all cases, object names are passed through
unique_name
to avoid ambiguity.
Data properties are identified by a unique name given the axes they are based on. That is, there is a separate namespace for scalar properties, vector properties for each specific axis, and matrix properties for each unordered pair of axes.
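As a rough illustration of these namespaces, the sketch below builds lookup keys for each kind of property. The helper names (`scalar_key`, `vector_key`, `matrix_key`) and the `properties` dictionary are hypothetical and only model the naming rule; they are not how the package stores data.

```julia
# Hypothetical key builders for illustration; the package's internal layout may differ.
scalar_key(name) = (name,)
vector_key(axis, name) = (axis, name)

# Order the two axes so ("gene", "cell") and ("cell", "gene") share one namespace.
matrix_key(rows_axis, columns_axis, name) = (minmax(rows_axis, columns_axis)..., name)

properties = Dict{Tuple, String}()
properties[scalar_key("type")] = "a scalar"
properties[vector_key("metacell", "type")] = "a vector along the metacell axis"
properties[vector_key("cell", "type")] = "a vector along the cell axis"
properties[matrix_key("gene", "cell", "UMIs")] = "a matrix for the gene and cell axes"

# Three different properties are named "type" without clashing.
length(properties)

# The matrix is found regardless of the order in which the axes are given.
haskey(properties, matrix_key("cell", "gene", "UMIs"))
```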
For matrices, we keep careful track of their layout. Returned matrices are always in column-major layout, using
relayout!
if necessary. As this is an expensive operation, we'll cache the result in memory. Similarly, we cache the results of applying a query to the data. We allow clearing the cache to reduce memory usage, if necessary.
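The caching idea can be sketched in plain Julia as follows. This is only a model under stated assumptions: `stored`, `cached` and `fetch_matrix` are hypothetical names, and `permutedims` stands in for the package's `relayout!`.

```julia
# Hypothetical stand-ins: `stored` holds matrices as written, `cached` holds
# relayout results. Keys are (rows axis, columns axis, name).
stored = Dict{NTuple{3, String}, Matrix{Float32}}()
cached = Dict{NTuple{3, String}, Matrix{Float32}}()

function fetch_matrix(stored, cached, rows_axis, columns_axis, name)
    key = (rows_axis, columns_axis, name)
    haskey(stored, key) && return stored[key]
    # Pay for the physical transpose only the first time this layout is requested.
    return get!(cached, key) do
        permutedims(stored[(columns_axis, rows_axis, name)])
    end
end

stored[("cell", "gene", "UMIs")] = rand(Float32, 5, 3)

flipped = fetch_matrix(stored, cached, "gene", "cell", "UMIs")  # computed, then cached
again = fetch_matrix(stored, cached, "gene", "cell", "UMIs")    # served from the cache

# Dropping the cached copies trades memory for recomputation on the next request.
empty!(cached)
```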
The data API is the high-level API intended to be used from outside the package, and is therefore re-exported from the top-level
Daf
namespace. It provides additional functionality on top of the low-level
FormatReader
implementation, accepting more general data types, automatically dealing with
relayout!
when needed. In particular, it enforces single-writer multiple-readers for each data set, so the format code can ignore multi-threading and still be thread-safe.
In the APIs below, when getting a value, specifying a
default
of
undef
means that it is an
error
for the value not to exist. In contrast, specifying a
default
of
nothing
means it is OK for the value not to exist, returning
nothing
. Specifying an actual value for
default
means it is OK for the value not to exist, returning the
default
instead. This is in the spirit of, but not identical to,
undef
being used as a flag for array construction saying "there is no initializer". If you feel this is an abuse of the
undef
value, take some comfort in that it is the default value for the
default
, so you almost never have to write it explicitly in your code.
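The three cases can be modelled with a small stand-alone function. This is a minimal sketch of the convention only; `lookup_scalar` and the `scalars` dictionary are hypothetical and are not part of the package.

```julia
# Hypothetical stand-in for a data set's scalar properties.
scalars = Dict{String, Any}("organism" => "human", "reference" => "test")

# `undef` demands that the value exists; `nothing` or a concrete value is
# returned instead when the value is missing.
function lookup_scalar(scalars, name; default = undef)
    if haskey(scalars, name)
        return scalars[name]
    elseif default === undef
        error("missing scalar: $(name)")
    else
        return default
    end
end

lookup_scalar(scalars, "organism")                    # the stored value
lookup_scalar(scalars, "version"; default = nothing)  # missing, so `nothing`
lookup_scalar(scalars, "version"; default = "1.0")    # missing, so the default
# lookup_scalar(scalars, "version") would raise an error.
```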
DataAxesFormats.Readers.description
—
Function
description(
daf::DafReader[;
deep::Bool = false,
cache::Bool = false,
tensors::Bool = true]
)::AbstractString
Return a (multi-line) description of the contents of
daf
. This tries to hit a sweet spot between usefulness and terseness. If
cache
, also describes the content of the cache. If
deep
, also describes the data sets nested inside this one (if any).
If
tensors
is set, this will include a
tensors
section which will condense the long list of tensor matrices.
print(description(example_chain_daf(); deep = true))
# output
name: chain!
type: Write Chain
scalars:
  organism: "human"
  reference: "test"
axes:
  cell: 856 entries
  donor: 95 entries
  experiment: 23 entries
  gene: 683 entries
  metacell: 7 entries
  type: 4 entries
vectors:
  cell:
    donor: 856 x Str (Dense)
    experiment: 856 x Str (Dense)
    metacell: 856 x Str (Dense)
  donor:
    age: 95 x UInt32 (Dense)
    sex: 95 x Str (Dense)
  gene:
    is_lateral: 683 x Bool (Dense; 438 (64%) true)
    is_marker: 683 x Bool (Dense; 650 (95%) true)
  metacell:
    type: 7 x Str (Dense)
  type:
    color: 4 x Str (Dense)
matrices:
  cell,gene:
    UMIs: 856 x 683 x UInt8 in Columns (Dense)
  gene,cell:
    UMIs: 683 x 856 x UInt8 in Columns (Dense)
  gene,metacell:
    fraction: 683 x 7 x Float32 in Columns (Dense)
  metacell,metacell:
    edge_weight: 7 x 7 x Float32 in Columns (Dense)
chain:
- name: cells!
  type: MemoryDaf
  scalars:
    organism: "human"
    reference: "test"
  axes:
    cell: 856 entries
    donor: 95 entries
    experiment: 23 entries
    gene: 683 entries
  vectors:
    cell:
      donor: 856 x Str (Dense)
      experiment: 856 x Str (Dense)
    donor:
      age: 95 x UInt32 (Dense)
      sex: 95 x Str (Dense)
    gene:
      is_lateral: 683 x Bool (Dense; 438 (64%) true)
  matrices:
    cell,gene:
      UMIs: 856 x 683 x UInt8 in Columns (Dense)
    gene,cell:
      UMIs: 683 x 856 x UInt8 in Columns (Dense)
- name: metacells!
  type: MemoryDaf
  axes:
    cell: 856 entries
    gene: 683 entries
    metacell: 7 entries
    type: 4 entries
  vectors:
    cell:
      metacell: 856 x Str (Dense)
    gene:
      is_marker: 683 x Bool (Dense; 650 (95%) true)
    metacell:
      type: 7 x Str (Dense)
    type:
      color: 4 x Str (Dense)
  matrices:
    gene,metacell:
      fraction: 683 x 7 x Float32 in Columns (Dense)
    metacell,metacell:
      edge_weight: 7 x 7 x Float32 in Columns (Dense)
Scalar properties
DataAxesFormats.Readers.has_scalar
—
Function
has_scalar(daf::DafReader, name::AbstractString)::Bool
Check whether a scalar property with some
name
exists in
daf
.
has_scalar(example_cells_daf(), "organism")
# output
true
has_scalar(example_metacells_daf(), "organism")
# output
false
DataAxesFormats.Readers.scalars_set
—
Function
scalars_set(daf::DafReader)::AbstractSet{<:AbstractString}
The names of the scalar properties in
daf
.
There's no immutable set type in Julia for us to return. If you do modify the result set, bad things will happen.
String.(scalars_set(example_cells_daf()))
# output
2-element Vector{String}:
"organism"
"reference"
DataAxesFormats.Readers.get_scalar
—
Function
get_scalar(
daf::DafReader,
name::AbstractString;
[default::Union{StorageScalar, Nothing, UndefInitializer} = undef]
)::Maybe{StorageScalar}
Get the value of a scalar property with some
name
in
daf
.
If
default
is
undef
(the default), this first verifies the
name
scalar property exists in
daf
. Otherwise
default
will be returned if the property does not exist.
get_scalar(example_cells_daf(), "organism")
# output
"human"
println(get_scalar(example_metacells_daf(), "organism"; default = nothing))
# output
nothing
Readers axes
DataAxesFormats.Readers.has_axis
—
Function
has_axis(daf::DafReader, axis::AbstractString)::Bool
Check whether some
axis
exists in
daf
.
has_axis(example_cells_daf(), "metacell")
# output
false
has_axis(example_metacells_daf(), "metacell")
# output
true
DataAxesFormats.Readers.axes_set
—
Function
axes_set(daf::DafReader)::AbstractSet{<:AbstractString}
The names of the axes of
daf
.
There's no immutable set type in Julia for us to return. If you do modify the result set, bad things will happen.
sort!(String.(axes_set(example_cells_daf())))
# output
4-element Vector{String}:
"cell"
"donor"
"experiment"
"gene"
DataAxesFormats.Readers.axis_length
—
Function
axis_length(daf::DafReader, axis::AbstractString)::Int64
The number of entries along the
axis
in
daf
.
This first verifies the
axis
exists in
daf
.
axis_length(example_metacells_daf(), "type")
# output
4
DataAxesFormats.Readers.axis_entries
—
Function
axis_entries(
daf::DafReader,
axis::AbstractString,
indices::Maybe{AbstractVector{<:Integer}} = nothing;
allow_empty::Bool = false,
)::AbstractVector{<:AbstractString}
Return a vector of the names of the entries with
indices
in the
axis
. If
allow_empty
, any zero (or negative) index is converted to the empty string. Otherwise, all
indices
must be valid. If
indices
are not specified, returns the vector of all entry names. That is,
axis_entries(daf, axis)
is the same as
axis_vector(daf, axis)
.
axis_entries(example_metacells_daf(), "type", [3, 0]; allow_empty = true)
# output
2-element Vector{AbstractString}:
"MEBEMP-L"
""
DataAxesFormats.Readers.axis_vector
—
Function
axis_vector(
daf::DafReader,
axis::AbstractString;
[default::Union{Nothing, UndefInitializer} = undef]
)::Maybe{AbstractVector{<:AbstractString}}
The array of unique names of the entries of some
axis
of
daf
. This is similar to doing
get_vector
for the special
name
property, except that it returns a simple vector (array) of strings instead of a
NamedVector
.
If
default
is
undef
(the default), this verifies the
axis
exists in
daf
. Otherwise, the
default
is
nothing
, which is returned if the
axis
does not exist.
String.(axis_vector(example_metacells_daf(), "type"))
# output
4-element Vector{String}:
"memory-B"
"MEBEMP-E"
"MEBEMP-L"
"MPP"
DataAxesFormats.Readers.axis_indices
—
Function
axis_indices(
daf::DafReader,
axis::AbstractString,
entries::AbstractVector{<:AbstractString};
allow_empty::Bool = false,
allow_missing::Bool = false,
)::AbstractVector{<:Integer}
Return a vector of the indices of the
entries
in the
axis
. If
allow_empty
, the empty string is converted to a zero index. If
allow_missing
, any non-empty strings that do not exist are likewise converted to a zero index. Otherwise, all
entries
must exist in the
axis
.
axis_indices(example_metacells_daf(), "type", ["MPP", ""]; allow_empty = true)
# output
2-element Vector{Int64}:
4
0
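The two conversions above (indices to entry names, and entry names to indices) can be sketched with an ordinary vector and dictionary. The helpers `entry_indices` and `entry_names` are hypothetical and only mirror the documented `allow_empty` and `allow_missing` rules; they are not the package's implementation.

```julia
entries = ["memory-B", "MEBEMP-E", "MEBEMP-L", "MPP"]

# Name-to-index lookup built from the ordered entry names.
index_of_entry = Dict(name => index for (index, name) in enumerate(entries))

# Hypothetical helper: the empty string and (optionally) unknown names map to zero.
function entry_indices(index_of_entry, names; allow_empty = false, allow_missing = false)
    return map(names) do name
        if name == ""
            allow_empty || error("empty entry name")
            return 0
        end
        index = get(index_of_entry, name, 0)
        index == 0 && !allow_missing && error("missing entry: $(name)")
        return index
    end
end

# Hypothetical inverse: zero or negative indices become the empty string.
function entry_names(entries, indices; allow_empty = false)
    return map(indices) do index
        if index <= 0
            allow_empty || error("invalid index: $(index)")
            return ""
        end
        return entries[index]
    end
end

entry_indices(index_of_entry, ["MPP", ""]; allow_empty = true)
entry_names(entries, [3, 0]; allow_empty = true)
```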
DataAxesFormats.Readers.axis_dict
—
Function
axis_dict(daf::DafReader, axis::AbstractString)::AbstractDict{<:AbstractString, <:Integer}
Return a dictionary converting axis entry names to their integer index.
axis_dict(example_metacells_daf(), "type")
# output
OrderedCollections.OrderedDict{AbstractString, Int64} with 4 entries:
"memory-B" => 1
"MEBEMP-E" => 2
"MEBEMP-L" => 3
"MPP" => 4
Vector properties
DataAxesFormats.Readers.has_vector
—
Function
has_vector(daf::DafReader, axis::AbstractString, name::AbstractString)::Bool
Check whether a vector property with some
name
exists for the
axis
in
daf
. This is always true for the special
name
property.
This first verifies the
axis
exists in
daf
.
has_vector(example_cells_daf(), "cell", "type")
# output
false
has_vector(example_metacells_daf(), "metacell", "type")
# output
true
DataAxesFormats.Readers.vectors_set
—
Function
vectors_set(daf::DafReader, axis::AbstractString)::AbstractSet{<:AbstractString}
The names of the vector properties for the
axis
in
daf
,
not
including the special
name
property.
This first verifies the
axis
exists in
daf
.
There's no immutable set type in Julia for us to return. If you do modify the result set, bad things will happen.
sort!(String.(vectors_set(example_cells_daf(), "cell")))
# output
2-element Vector{String}:
"donor"
"experiment"
DataAxesFormats.Readers.get_vector
—
Function
get_vector(
daf::DafReader,
axis::AbstractString,
name::AbstractString;
[default::Union{StorageScalar, StorageVector, Nothing, UndefInitializer} = undef]
)::Maybe{NamedVector}
Get the vector property with some
name
for some
axis
in
daf
. The names of the result are the names of the vector entries (same as returned by
axis_vector
). The special property
name
returns an array whose values are also the (read-only) names of the entries of the axis.
This first verifies the
axis
exists in
daf
. If
default
is
undef
(the default), this first verifies the
name
vector exists in
daf
. Otherwise, if
default
is
nothing
, it will be returned. If it is a
StorageVector
, it has to be of the same size as the
axis
, and is returned. If it is a
StorageScalar
, a new
Vector
is created of the correct size containing the
default
, and is returned.
get_vector(example_metacells_daf(), "type", "color")
# output
4-element Named SparseArrays.ReadOnly{String, 1, Vector{String}}
type │
─────────┼────────────
memory-B │ "steelblue"
MEBEMP-E │ "#eebb6e"
MEBEMP-L │ "plum"
MPP │ "gold"
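How a `default` is turned into a result when the vector is missing can be sketched as below. The function `resolve_vector_default` is a hypothetical name used only to illustrate the rules described above, and it returns a plain vector rather than a `NamedVector`.

```julia
# Hypothetical helper: decide what to return for a missing vector property.
function resolve_vector_default(default, axis_length::Integer)
    # `nothing` is passed through as-is.
    default === nothing && return nothing
    if default isa AbstractVector
        # A vector default must already match the axis length.
        length(default) == axis_length || error("default vector has the wrong length")
        return default
    end
    # A scalar default is repeated for every entry of the axis.
    return fill(default, axis_length)
end

resolve_vector_default("gray", 4)                 # four copies of "gray"
resolve_vector_default(["a", "b", "c", "d"], 4)   # returned unchanged
resolve_vector_default(nothing, 4)                # nothing
```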
Matrix properties
DataAxesFormats.Readers.has_matrix
—
Function
has_matrix(
daf::DafReader,
rows_axis::AbstractString,
columns_axis::AbstractString,
name::AbstractString;
[relayout::Bool = true]
)::Bool
Check whether a matrix property with some
name
exists for the
rows_axis
and the
columns_axis
in
daf
. Since this is Julia, this means a column-major matrix. A daf may contain two copies of the same data, in which case it would report the matrix under both axis orders.
If
relayout
(the default), this will also check whether the data exists in the other layout (that is, with flipped axes).
This first verifies the
rows_axis
and
columns_axis
exist in
daf
.
has_matrix(example_cells_daf(), "gene", "cell", "UMIs")
# output
true
DataAxesFormats.Readers.matrices_set
—
Function
matrices_set(
daf::DafReader,
rows_axis::AbstractString,
columns_axis::AbstractString;
[relayout::Bool = true]
)::AbstractSet{<:AbstractString}
The names of the matrix properties for the
rows_axis
and
columns_axis
in
daf
.
If
tensors
, this will condense the list of tensor matrices (
<tensor axis entry>_name
) into a single entry called
/<tensor axis name>/_matrix
. This makes the reasonable assumption that names do not contain the
/
character (which wouldn't work in the files and H5df storage formats anyway).
If
relayout
(the default), then this will include the names of matrices that exist in the other layout (that is, with flipped axes).
This first verifies the
rows_axis
and
columns_axis
exist in
daf
.
There's no immutable set type in Julia for us to return. If you do modify the result set, bad things will happen.
String.(matrices_set(example_cells_daf(), "gene", "cell"))
# output
1-element Vector{String}:
"UMIs"
DataAxesFormats.Readers.get_matrix
—
Function
get_matrix(
daf::DafReader,
rows_axis::AbstractString,
columns_axis::AbstractString,
name::AbstractString;
[default::Union{StorageScalarBase, StorageMatrix, Nothing, UndefInitializer} = undef,
relayout::Bool = true]
)::Maybe{NamedMatrix}
Get the column-major matrix property with some
name
for some
rows_axis
and
columns_axis
in
daf
. The names of the result axes are the names of the relevant axes entries (same as returned by
axis_vector
).
If
relayout
(the default) and the matrix is only stored in the other memory layout (that is, with flipped axes), then automatically call
relayout!
to compute the result. If
daf
isa
DafWriter
, then store the result for future use; otherwise, just cache it as
MemoryData
. This may lock up very large amounts of memory; you can call
empty_cache!
to release it.
This first verifies the
rows_axis
and
columns_axis
exist in
daf
. If
default
is
undef
(the default), this first verifies the
name
matrix exists in
daf
. Otherwise, if
default
is
nothing
, it is returned. If
default
is a
StorageMatrix
, it has to be of the same size as the
rows_axis
and
columns_axis
, and is returned. Otherwise, a new
Matrix
is created of the correct size containing the
default
, and is returned.
get_matrix(example_metacells_daf(), "gene", "metacell", "fraction")
# output
683×7 Named Matrix{Float32}
gene ╲ metacell │ M1671.28 M2357.20 … M756.63 M412.08
────────────────┼──────────────────────────────────────────────────────
RPL22 │ 0.00447666 0.0041286 … 0.00434327 0.00373581
PARK7 │ 8.52301f-5 0.000154199 0.000108019 6.50531f-5
ENO1 │ 0.000464448 0.000482609 0.000248241 4.22228f-5
PRDM2 │ 2.053f-5 2.85439f-5 2.46575f-5 0.000151486
HP1BP3 │ 0.000107137 0.000110915 8.9043f-5 0.00012099
CDC42 │ 0.000153017 0.000207847 0.000152447 0.000176377
HNRNPR │ 0.000122974 6.7171f-5 7.09771f-5 6.7083f-5
RPL11 │ 0.010306 0.0110606 0.0109086 0.0124251
⋮ ⋮ ⋮ ⋱ ⋮ ⋮
NRIP1 │ 0.000155974 0.000361428 0.000197766 2.79487f-5
ATP5PF │ 8.62855f-5 0.000125912 0.000121949 8.22312f-5
CCT8 │ 0.000104152 7.55233f-5 0.00011572 4.13243f-5
SOD1 │ 0.000177344 0.000147838 0.000104723 0.000103708
SON │ 0.000280491 0.000262015 0.000170829 0.00032361
ATP5PO │ 0.000134007 0.000123143 0.00018833 9.73498f-5
TTC3 │ 0.000111978 0.00011131 0.000100166 0.000122469
HMGN1 │ 0.000345676 0.000287754 … 0.000264526 0.000160654
Utilities
DataAxesFormats.Readers.complete_path
—
Function
complete_path(daf::DafReader)::Maybe{AbstractString}
If the
daf
repository is persistent (resides on disk), the absolute path leading to it. This path can be given to
complete_daf
to access the repository after the current process is terminated. If
nothing
then the repository is (at least partially) in-memory and will disappear when the current process is terminated.
The
H5df
format may report a path that ends with
#...
to identify a specific group inside an
HDF5
file. That is, the reported path isn't necessarily the path of a disk file or directory.
DataAxesFormats.Readers.axis_version_counter
—
Function
axis_version_counter(daf::DafReader, axis::AbstractString)::UInt32
Return the version number of the axis. This is incremented every time
delete_axis!
is called. It is used by interfaces to other programming languages to safely cache per-axis data.
This is purely in-memory per-instance, and
not
a global persistent version counter. That is, the version counter starts at zero even if opening a persistent disk
daf
data set.
metacells = example_metacells_daf()
println(axis_version_counter(metacells, "type"))
delete_axis!(metacells, "type")
add_axis!(metacells, "type", ["Foo", "Bar", "Baz"])
println(axis_version_counter(metacells, "type"))
# output
0
1
DataAxesFormats.Readers.vector_version_counter
—
Function
vector_version_counter(daf::DafReader, axis::AbstractString, name::AbstractString)::UInt32
Return the version number of the vector. This is incremented every time
set_vector!
,
empty_dense_vector!
or
empty_sparse_vector!
is called. It is used by interfaces to other programming languages to safely cache per-vector data.
This is purely in-memory per-instance, and
not
a global persistent version counter. That is, the version counter starts at zero even if opening a persistent disk
daf
data set.
metacells = example_metacells_daf()
println(vector_version_counter(metacells, "type", "color"))
set_vector!(metacells, "type", "color", string.(collect(1:4)); overwrite = true)
println(vector_version_counter(metacells, "type", "color"))
# output
1
2
DataAxesFormats.Readers.matrix_version_counter
—
Function
matrix_version_counter(
daf::DafReader,
rows_axis::AbstractString,
columns_axis::AbstractString,
name::AbstractString
)::UInt32
Return the version number of the matrix. The order of the axes does not matter. This is incremented every time
set_matrix!
,
empty_dense_matrix!
or
empty_sparse_matrix!
is called. It is used by interfaces to other programming languages to safely cache per-matrix data.
This is purely in-memory per-instance, and
not
a global persistent version counter. That is, the version counter starts at zero even if opening a persistent disk
daf
data set.
metacells = example_metacells_daf()
println(matrix_version_counter(metacells, "gene", "metacell", "fraction"))
set_matrix!(metacells, "gene", "metacell", "fraction", rand(Float32, 683, 7); overwrite = true)
println(matrix_version_counter(metacells, "gene", "metacell", "fraction"))
# output
1
2
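One way an external interface might use these counters is to remember the version it saw alongside each copied value, and reload only when the version changes. The sketch below is an assumption about such a client, not code from the package: `cached_copy!`, `cache` and `load_colors` are hypothetical, and the version numbers are passed in by hand where a real client would call `vector_version_counter`.

```julia
# Hypothetical client-side cache: key => (version seen, copied data).
cache = Dict{String, Tuple{UInt32, Vector{String}}}()

function cached_copy!(load, cache, key, current_version)
    entry = get(cache, key, nothing)
    # Reload when nothing is cached yet or the cached version is stale.
    if entry === nothing || entry[1] != current_version
        entry = (UInt32(current_version), load())
        cache[key] = entry
    end
    return entry[2]
end

# Count how often the (expensive) load actually runs.
loads = Ref(0)
load_colors() = (loads[] += 1; ["steelblue", "#eebb6e", "plum", "gold"])

cached_copy!(load_colors, cache, "type/color", 1)  # first access: loads the data
cached_copy!(load_colors, cache, "type/color", 1)  # same version: reuses the copy
cached_copy!(load_colors, cache, "type/color", 2)  # version changed: reloads
```

Because the counters are per-instance and in-memory, such a cache is only valid for the lifetime of the `daf` object it was built against.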
Index
- DataAxesFormats.Readers
- DataAxesFormats.Readers.axes_set
- DataAxesFormats.Readers.axis_dict
- DataAxesFormats.Readers.axis_entries
- DataAxesFormats.Readers.axis_indices
- DataAxesFormats.Readers.axis_length
- DataAxesFormats.Readers.axis_vector
- DataAxesFormats.Readers.axis_version_counter
- DataAxesFormats.Readers.complete_path
- DataAxesFormats.Readers.description
- DataAxesFormats.Readers.get_matrix
- DataAxesFormats.Readers.get_scalar
- DataAxesFormats.Readers.get_vector
- DataAxesFormats.Readers.has_axis
- DataAxesFormats.Readers.has_matrix
- DataAxesFormats.Readers.has_scalar
- DataAxesFormats.Readers.has_vector
- DataAxesFormats.Readers.matrices_set
- DataAxesFormats.Readers.matrix_version_counter
- DataAxesFormats.Readers.scalars_set
- DataAxesFormats.Readers.vector_version_counter
- DataAxesFormats.Readers.vectors_set