Readers

DataAxesFormats.Readers Module

The DafReader interface specifies a high-level API for reading Daf data. This API is implemented here, on top of the low-level FormatReader API. The high-level API provides thread safety so the low-level API can (mostly) ignore this issue.

Each data set is given a name to use in error messages etc. You can explicitly set this name when creating a Daf object. Otherwise, when opening an existing data set, if it contains a scalar "name" property, it is used. Otherwise some reasonable default is used. In all cases, object names are passed through unique_name to avoid ambiguity.

Data properties are identified by a unique name given the axes they are based on. That is, there is a separate namespace for scalar properties, vector properties for each specific axis, and matrix properties for each unordered pair of axes.
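The separate namespaces can be pictured as separate lookup keys. This is a minimal sketch only; the `scalar_key`, `vector_key` and `matrix_key` helpers are hypothetical and do not reflect how the package stores data internally.

```julia
# Hypothetical key builders, one per namespace (not the package's internals).
scalar_key(name) = name
vector_key(axis, name) = (axis, name)
# `minmax` orders the two axes, so the key does not depend on which axis is given first.
matrix_key(rows_axis, columns_axis, name) = (minmax(rows_axis, columns_axis)..., name)

# The same name under two different axes gives two distinct vector keys.
vector_key("cell", "type") != vector_key("metacell", "type")

# The same name under the same pair of axes gives one matrix key, in either order.
matrix_key("gene", "cell", "UMIs") == matrix_key("cell", "gene", "UMIs")
```

Both comparisons evaluate to true, which is the sense in which vector names are unique per axis and matrix names are unique per unordered pair of axes.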

For matrices, we keep careful track of their layout. Returned matrices are always in column-major layout, using relayout! if necessary. As this is an expensive operation, we'll cache the result in memory. Similarly, we cache the results of applying a query to the data. We allow clearing the cache to reduce memory usage, if necessary.
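The idea of caching an expensive layout conversion can be sketched in plain Julia. This is an illustration under stated assumptions, not the package's code: `FLIPPED_CACHE` and `flipped` are hypothetical names, and Base's `permutedims` stands in for relayout! here.

```julia
# Hypothetical cache of flipped-axes copies, keyed by property name.
const FLIPPED_CACHE = Dict{String, Matrix{Float64}}()

function flipped(name::AbstractString, matrix::AbstractMatrix{Float64})
    # `permutedims` allocates a new column-major matrix with the axes swapped.
    # This is the expensive step, so it runs at most once per name.
    return get!(() -> permutedims(matrix), FLIPPED_CACHE, name)
end

umis = [1.0 2.0 3.0; 4.0 5.0 6.0]
first_call = flipped("UMIs", umis)
second_call = flipped("UMIs", umis)
reused = first_call === second_call  # true: the second call returned the cached copy

# Clearing the cache releases the memory it holds.
empty!(FLIPPED_CACHE)
```

The trade-off is the one described above: repeated requests are cheap, at the cost of holding the converted copy in memory until the cache is cleared.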

The data API is the high-level API intended to be used from outside the package, and is therefore re-exported from the top-level Daf namespace. It provides additional functionality on top of the low-level FormatReader implementation, accepting more general data types and automatically dealing with relayout! when needed. In particular, it enforces single-writer multiple-readers for each data set, so the format code can ignore multi-threading and still be thread-safe.

Note

In the APIs below, when getting a value, specifying a default of undef means that it is an error for the value not to exist. In contrast, specifying a default of nothing means it is OK for the value not to exist, returning nothing. Specifying an actual value for default means it is OK for the value not to exist, returning the default instead. This is in the spirit of, but not identical to, undef being used as a flag for array construction saying "there is no initializer". If you feel this is an abuse of the undef value, take some comfort in that it is the default value for the default, so you almost never have to write it explicitly in your code.
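The three modes of default can be illustrated without Daf at all. The following is a minimal sketch over a plain Dict; the `lookup` helper is hypothetical and is not part of the package.

```julia
# Hypothetical helper mimicking the `default` convention over a plain Dict.
function lookup(data::AbstractDict, name::AbstractString; default = undef)
    if haskey(data, name)
        return data[name]
    elseif default === undef
        # `undef` means the value is required, so a missing one is an error.
        error("missing value: $(name)")
    else
        # `nothing` or an actual value is returned as-is.
        return default
    end
end

scalars = Dict("organism" => "human")

lookup(scalars, "organism")                      # the stored value
lookup(scalars, "version"; default = nothing)    # nothing
lookup(scalars, "version"; default = "unknown")  # the given default
# lookup(scalars, "version") would throw, since the default is `undef`.
```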

DataAxesFormats.Readers.description Function
description(
    daf::DafReader;
    [deep::Bool = false,
    cache::Bool = false,
    tensors::Bool = true]
)::AbstractString

Return a (multi-line) description of the contents of daf. This tries to hit a sweet spot between usefulness and terseness. If cache, also describes the content of the cache. If deep, also describes the data sets nested inside this one (if any).

If tensors is set, this will include a tensors section which will condense the long list of tensor matrices.

print(description(example_chain_daf(); deep = true))

# output

name: chain!
type: Write Chain
scalars:
  organism: "human"
  reference: "test"
axes:
  cell: 856 entries
  donor: 95 entries
  experiment: 23 entries
  gene: 683 entries
  metacell: 7 entries
  type: 4 entries
vectors:
  cell:
    donor: 856 x Str (Dense)
    experiment: 856 x Str (Dense)
    metacell: 856 x Str (Dense)
  donor:
    age: 95 x UInt32 (Dense)
    sex: 95 x Str (Dense)
  gene:
    is_lateral: 683 x Bool (Dense; 438 (64%) true)
    is_marker: 683 x Bool (Dense; 650 (95%) true)
  metacell:
    type: 7 x Str (Dense)
  type:
    color: 4 x Str (Dense)
matrices:
  cell,gene:
    UMIs: 856 x 683 x UInt8 in Columns (Dense)
  gene,cell:
    UMIs: 683 x 856 x UInt8 in Columns (Dense)
  gene,metacell:
    fraction: 683 x 7 x Float32 in Columns (Dense)
  metacell,metacell:
    edge_weight: 7 x 7 x Float32 in Columns (Dense)
chain:
- name: cells!
  type: MemoryDaf
  scalars:
    organism: "human"
    reference: "test"
  axes:
    cell: 856 entries
    donor: 95 entries
    experiment: 23 entries
    gene: 683 entries
  vectors:
    cell:
      donor: 856 x Str (Dense)
      experiment: 856 x Str (Dense)
    donor:
      age: 95 x UInt32 (Dense)
      sex: 95 x Str (Dense)
    gene:
      is_lateral: 683 x Bool (Dense; 438 (64%) true)
  matrices:
    cell,gene:
      UMIs: 856 x 683 x UInt8 in Columns (Dense)
    gene,cell:
      UMIs: 683 x 856 x UInt8 in Columns (Dense)
- name: metacells!
  type: MemoryDaf
  axes:
    cell: 856 entries
    gene: 683 entries
    metacell: 7 entries
    type: 4 entries
  vectors:
    cell:
      metacell: 856 x Str (Dense)
    gene:
      is_marker: 683 x Bool (Dense; 650 (95%) true)
    metacell:
      type: 7 x Str (Dense)
    type:
      color: 4 x Str (Dense)
  matrices:
    gene,metacell:
      fraction: 683 x 7 x Float32 in Columns (Dense)
    metacell,metacell:
      edge_weight: 7 x 7 x Float32 in Columns (Dense)

Scalar properties

DataAxesFormats.Readers.has_scalar Function
has_scalar(daf::DafReader, name::AbstractString)::Bool

Check whether a scalar property with some name exists in daf.

has_scalar(example_cells_daf(), "organism")

# output

true

has_scalar(example_metacells_daf(), "organism")

# output

false

DataAxesFormats.Readers.scalars_set Function
scalars_set(daf::DafReader)::AbstractSet{<:AbstractString}

The names of the scalar properties in daf.

Note

There's no immutable set type in Julia for us to return. If you do modify the result set, bad things will happen.

String.(scalars_set(example_cells_daf()))

# output

2-element Vector{String}:
 "organism"
 "reference"

DataAxesFormats.Readers.get_scalar Function
get_scalar(
    daf::DafReader,
    name::AbstractString;
    [default::Union{StorageScalar, Nothing, UndefInitializer} = undef]
)::Maybe{StorageScalar}

Get the value of a scalar property with some name in daf.

If default is undef (the default), this first verifies the name scalar property exists in daf. Otherwise default will be returned if the property does not exist.

get_scalar(example_cells_daf(), "organism")

# output

"human"

println(get_scalar(example_metacells_daf(), "organism"; default = nothing))

# output

nothing

Readers axes

DataAxesFormats.Readers.has_axis Function
has_axis(daf::DafReader, axis::AbstractString)::Bool

Check whether some axis exists in daf.

has_axis(example_cells_daf(), "metacell")

# output

false

has_axis(example_metacells_daf(), "metacell")

# output

true

DataAxesFormats.Readers.axes_set Function
axes_set(daf::DafReader)::AbstractSet{<:AbstractString}

The names of the axes of daf.

Note

There's no immutable set type in Julia for us to return. If you do modify the result set, bad things will happen.

sort!(String.(axes_set(example_cells_daf())))

# output

4-element Vector{String}:
 "cell"
 "donor"
 "experiment"
 "gene"

DataAxesFormats.Readers.axis_length Function
axis_length(daf::DafReader, axis::AbstractString)::Int64

The number of entries along the axis in daf.

This first verifies the axis exists in daf.

axis_length(example_metacells_daf(), "type")

# output

4

DataAxesFormats.Readers.axis_entries Function
axis_entries(
    daf::DafReader,
    axis::AbstractString,
    indices::Maybe{AbstractVector{<:Integer}} = nothing;
    allow_empty::Bool = false,
)::AbstractVector{<:AbstractString}

Return a vector of the names of the entries with indices in the axis. If allow_empty, a zero (or negative) index is converted to the empty string. Otherwise, all indices must be valid. If indices is not specified, returns the vector of all entry names. That is, axis_entries(daf, axis) is the same as axis_vector(daf, axis).
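The index handling described above can be sketched over a plain vector of names. This is only an illustration; the `entries_of` helper is hypothetical, and the vector of names stands in for an axis.

```julia
# Hypothetical stand-in for an axis: just the vector of its entry names.
type_names = ["memory-B", "MEBEMP-E", "MEBEMP-L", "MPP"]

function entries_of(names::AbstractVector{<:AbstractString}, indices; allow_empty = false)
    return map(indices) do index
        if index >= 1
            names[index]  # an index beyond the end still raises a BoundsError
        elseif allow_empty
            ""            # a zero or negative index maps to the empty string
        else
            error("invalid index: $(index)")
        end
    end
end

entries_of(type_names, [3, 0]; allow_empty = true)
```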

axis_entries(example_metacells_daf(), "type", [3, 0]; allow_empty = true)

# output

2-element Vector{AbstractString}:
 "MEBEMP-L"
 ""

DataAxesFormats.Readers.axis_vector Function
axis_vector(
    daf::DafReader,
    axis::AbstractString;
    [default::Union{Nothing, UndefInitializer} = undef]
)::Maybe{AbstractVector{<:AbstractString}}

The array of unique names of the entries of some axis of daf. This is similar to doing get_vector for the special name property, except that it returns a simple vector (array) of strings instead of a NamedVector.

If default is undef (the default), this verifies the axis exists in daf. Otherwise, the default is nothing, which is returned if the axis does not exist.

String.(axis_vector(example_metacells_daf(), "type"))

# output

4-element Vector{String}:
 "memory-B"
 "MEBEMP-E"
 "MEBEMP-L"
 "MPP"

DataAxesFormats.Readers.axis_indices Function
axis_indices(
    daf::DafReader,
    axis::AbstractString,
    entries::AbstractVector{<:AbstractString};
    allow_empty::Bool = false,
    allow_missing::Bool = false,
)::AbstractVector{<:Integer}

Return a vector of the indices of the entries in the axis. If allow_empty, the empty string is converted to a zero index. If allow_missing, any non-empty strings that do not exist are likewise converted to a zero index. Otherwise, all entries must exist in the axis.

axis_indices(example_metacells_daf(), "type", ["MPP", ""]; allow_empty = true)

# output

2-element Vector{Int64}:
 4
 0
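The difference between allow_empty and allow_missing can be sketched with a plain dictionary from entry names to indices (similar in spirit to the result of axis_dict). The `indices_of` helper is hypothetical and only mirrors the rules described above.

```julia
type_names = ["memory-B", "MEBEMP-E", "MEBEMP-L", "MPP"]
# Entry name => index, built from the order of the names.
index_of_name = Dict(name => index for (index, name) in enumerate(type_names))

function indices_of(index_of_name, entries; allow_empty = false, allow_missing = false)
    return map(entries) do entry
        if haskey(index_of_name, entry)
            index_of_name[entry]
        elseif (entry == "" && allow_empty) || (entry != "" && allow_missing)
            0  # the zero index marks "no entry"
        else
            error("invalid entry: $(repr(entry))")
        end
    end
end

indices_of(index_of_name, ["MPP", ""]; allow_empty = true)       # empty string becomes 0
indices_of(index_of_name, ["MPP", "Foo"]; allow_missing = true)  # unknown name becomes 0
```

In this sketch, as in the description above, allow_missing alone does not accept the empty string; that requires allow_empty.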

DataAxesFormats.Readers.axis_dict Function
axis_dict(daf::DafReader, axis::AbstractString)::AbstractDict{<:AbstractString, <:Integer}

Return a dictionary converting axis entry names to their integer index.

axis_dict(example_metacells_daf(), "type")

# output

OrderedCollections.OrderedDict{AbstractString, Int64} with 4 entries:
  "memory-B" => 1
  "MEBEMP-E" => 2
  "MEBEMP-L" => 3
  "MPP"      => 4

Vector properties

DataAxesFormats.Readers.has_vector Function
has_vector(daf::DafReader, axis::AbstractString, name::AbstractString)::Bool

Check whether a vector property with some name exists for the axis in daf. This is always true for the special name property.

This first verifies the axis exists in daf.

has_vector(example_cells_daf(), "cell", "type")

# output

false

has_vector(example_metacells_daf(), "metacell", "type")

# output

true

DataAxesFormats.Readers.vectors_set Function
vectors_set(daf::DafReader, axis::AbstractString)::AbstractSet{<:AbstractString}

The names of the vector properties for the axis in daf, not including the special name property.

This first verifies the axis exists in daf.

Note

There's no immutable set type in Julia for us to return. If you do modify the result set, bad things will happen.

sort!(String.(vectors_set(example_cells_daf(), "cell")))

# output

2-element Vector{String}:
 "donor"
 "experiment"

DataAxesFormats.Readers.get_vector Function
get_vector(
    daf::DafReader,
    axis::AbstractString,
    name::AbstractString;
    [default::Union{StorageScalar, StorageVector, Nothing, UndefInitializer} = undef]
)::Maybe{NamedVector}

Get the vector property with some name for some axis in daf. The names of the result are the names of the vector entries (same as returned by axis_vector). The special property name returns an array whose values are also the (read-only) names of the entries of the axis.

This first verifies the axis exists in daf. If default is undef (the default), this first verifies the name vector exists in daf. Otherwise, if default is nothing, it will be returned. If it is a StorageVector, it has to be of the same size as the axis, and is returned. If it is a StorageScalar, a new Vector is created of the correct size containing the default, and is returned.

get_vector(example_metacells_daf(), "type", "color")

# output

4-element Named SparseArrays.ReadOnly{String, 1, Vector{String}}
type     │
─────────┼────────────
memory-B │ "steelblue"
MEBEMP-E │   "#eebb6e"
MEBEMP-L │      "plum"
MPP      │      "gold"
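How a default is turned into a result when the vector does not exist can be sketched as follows. The `resolve_default` helper is hypothetical, and `axis_size` stands in for the number of entries of the axis; the undef case (an error) is omitted.

```julia
# Hypothetical helper covering only the case where the property is missing.
function resolve_default(default, axis_size::Integer)
    if default === nothing
        return nothing
    elseif default isa AbstractVector
        # A vector default must match the size of the axis.
        @assert length(default) == axis_size "default vector has the wrong size"
        return default
    else
        # A scalar default is expanded into a full vector of the correct size.
        return fill(default, axis_size)
    end
end

resolve_default(nothing, 4)       # nothing
resolve_default("gray", 4)        # four copies of "gray"
resolve_default([1, 2, 3, 4], 4)  # the given vector itself
```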

Matrix properties

DataAxesFormats.Readers.has_matrix Function
has_matrix(
    daf::DafReader,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString;
    [relayout::Bool = true]
)::Bool

Check whether a matrix property with some name exists for the rows_axis and the columns_axis in daf. Since this is Julia, this means a column-major matrix. A daf may contain two copies of the same data, in which case it would report the matrix under both axis orders.

If relayout (the default), this will also check whether the data exists in the other layout (that is, with flipped axes).

This first verifies the rows_axis and columns_axis exist in daf.

has_matrix(example_cells_daf(), "gene", "cell", "UMIs")

# output

true
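The effect of relayout on the check can be sketched with a plain set of stored layouts. The `stored` set and the `is_stored` helper are hypothetical and only illustrate the rule; they are not how the package tracks its data.

```julia
# Hypothetical record of which (rows_axis, columns_axis, name) layouts are stored.
stored = Set([("gene", "metacell", "fraction")])

function is_stored(stored, rows_axis, columns_axis, name; relayout = true)
    # The requested layout itself always counts.
    (rows_axis, columns_axis, name) in stored && return true
    # With `relayout`, the flipped layout counts as well.
    return relayout && (columns_axis, rows_axis, name) in stored
end

is_stored(stored, "gene", "metacell", "fraction")                    # true
is_stored(stored, "metacell", "gene", "fraction")                    # true, via the flipped layout
is_stored(stored, "metacell", "gene", "fraction"; relayout = false)  # false
```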

DataAxesFormats.Readers.matrices_set Function
matrices_set(
    daf::DafReader,
    rows_axis::AbstractString,
    columns_axis::AbstractString;
    [relayout::Bool = true]
)::AbstractSet{<:AbstractString}

The names of the matrix properties for the rows_axis and columns_axis in daf.

If tensors, this will condense the list of tensor matrices (<tensor axis entry>_name) into a single entry called /<tensor axis name>/_matrix. This makes the reasonable assumption that names do not contain the / character (which wouldn't work in files and H5df storage formats, anyway).

If relayout (default), then this will include the names of matrices that exist in the other layout (that is, with flipped axes).

This first verifies the rows_axis and columns_axis exist in daf.

Note

There's no immutable set type in Julia for us to return. If you do modify the result set, bad things will happen.

String.(matrices_set(example_cells_daf(), "gene", "cell"))

# output

1-element Vector{String}:
 "UMIs"

DataAxesFormats.Readers.get_matrix Function
get_matrix(
    daf::DafReader,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString;
    [default::Union{StorageScalarBase, StorageMatrix, Nothing, UndefInitializer} = undef,
    relayout::Bool = true]
)::Maybe{NamedMatrix}

Get the column-major matrix property with some name for some rows_axis and columns_axis in daf. The names of the result axes are the names of the relevant axes entries (same as returned by axis_vector).

If relayout (the default), then if the matrix is only stored in the other memory layout (that is, with flipped axes), then automatically call relayout! to compute the result. If daf isa DafWriter, then store the result for future use; otherwise, just cache it as MemoryData. This may lock up very large amounts of memory; you can call empty_cache! to release it.

This first verifies the rows_axis and columns_axis exist in daf. If default is undef (the default), this first verifies the name matrix exists in daf. Otherwise, if default is nothing, it is returned. If default is a StorageMatrix, it has to be of the same size as the rows_axis and columns_axis, and is returned. Otherwise, a new Matrix is created of the correct size containing the default, and is returned.

get_matrix(example_metacells_daf(), "gene", "metacell", "fraction")

# output

683×7 Named Matrix{Float32}
gene ╲ metacell │    M1671.28     M2357.20  …      M756.63      M412.08
────────────────┼──────────────────────────────────────────────────────
RPL22           │  0.00447666    0.0041286  …   0.00434327   0.00373581
PARK7           │  8.52301f-5  0.000154199     0.000108019   6.50531f-5
ENO1            │ 0.000464448  0.000482609     0.000248241   4.22228f-5
PRDM2           │    2.053f-5   2.85439f-5      2.46575f-5  0.000151486
HP1BP3          │ 0.000107137  0.000110915       8.9043f-5   0.00012099
CDC42           │ 0.000153017  0.000207847     0.000152447  0.000176377
HNRNPR          │ 0.000122974    6.7171f-5      7.09771f-5    6.7083f-5
RPL11           │    0.010306    0.0110606       0.0109086    0.0124251
⋮                           ⋮            ⋮  ⋱            ⋮            ⋮
NRIP1           │ 0.000155974  0.000361428     0.000197766   2.79487f-5
ATP5PF          │  8.62855f-5  0.000125912     0.000121949   8.22312f-5
CCT8            │ 0.000104152   7.55233f-5      0.00011572   4.13243f-5
SOD1            │ 0.000177344  0.000147838     0.000104723  0.000103708
SON             │ 0.000280491  0.000262015     0.000170829   0.00032361
ATP5PO          │ 0.000134007  0.000123143      0.00018833   9.73498f-5
TTC3            │ 0.000111978   0.00011131     0.000100166  0.000122469
HMGN1           │ 0.000345676  0.000287754  …  0.000264526  0.000160654

Utilities

DataAxesFormats.Readers.complete_path Function
complete_path(daf::DafReader)::Maybe{AbstractString}

If the daf repository is persistent (resides on disk), the absolute path leading to it. This path can be given to complete_daf to access the repository after the current process is terminated. If nothing then the repository is (at least partially) in-memory and will disappear when the current process is terminated.

Note

The H5df format may report a path that ends with #... to identify a specific group inside an HDF5 file. That is, the reported path isn't necessarily the path of a disk file or directory.

DataAxesFormats.Readers.axis_version_counter Function
axis_version_counter(daf::DafReader, axis::AbstractString)::UInt32

Return the version number of the axis. This is incremented every time delete_axis! is called. It is used by interfaces to other programming languages to safely cache per-axis data.

Note

This is purely in-memory per-instance, and not a global persistent version counter. That is, the version counter starts at zero even if opening a persistent disk daf data set.

metacells = example_metacells_daf()
println(axis_version_counter(metacells, "type"))
delete_axis!(metacells, "type")
add_axis!(metacells, "type", ["Foo", "Bar", "Baz"])
println(axis_version_counter(metacells, "type"))

# output

0
1
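A version counter makes it possible to tell when cached derived data is stale. The following is a minimal sketch of that pattern; the `Cached` type and `cached_value` helper are hypothetical, and in real use the version would come from a call such as axis_version_counter.

```julia
# Hypothetical cache entry: the derived value plus the version it was computed at.
struct Cached
    version::UInt32
    value::Any
end

function cached_value(compute::Function, cache::Dict{String, Cached}, key::String, version::UInt32)
    entry = get(cache, key, nothing)
    if entry === nothing || entry.version != version
        # Missing or stale: recompute and remember the version that was seen.
        entry = Cached(version, compute())
        cache[key] = entry
    end
    return entry.value
end

cache = Dict{String, Cached}()
calls = Ref(0)  # counts how often the derived data is actually computed
compute() = (calls[] += 1; "derived data")

cached_value(compute, cache, "type", UInt32(0))  # computes
cached_value(compute, cache, "type", UInt32(0))  # same version, reuses the cached value
cached_value(compute, cache, "type", UInt32(1))  # version changed, recomputes
```

After these three calls the derived data was computed twice, once per distinct version.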

DataAxesFormats.Readers.vector_version_counter Function
vector_version_counter(daf::DafReader, axis::AbstractString, name::AbstractString)::UInt32

Return the version number of the vector. This is incremented every time set_vector!, empty_dense_vector! or empty_sparse_vector! is called. It is used by interfaces to other programming languages to safely cache per-vector data.

Note

This is purely in-memory per-instance, and not a global persistent version counter. That is, the version counter starts at zero even if opening a persistent disk daf data set.

metacells = example_metacells_daf()
println(vector_version_counter(metacells, "type", "color"))
set_vector!(metacells, "type", "color", string.(collect(1:4)); overwrite = true)
println(vector_version_counter(metacells, "type", "color"))

# output

1
2

DataAxesFormats.Readers.matrix_version_counter Function
matrix_version_counter(
    daf::DafReader,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString
)::UInt32

Return the version number of the matrix. The order of the axes does not matter. This is incremented every time set_matrix!, empty_dense_matrix! or empty_sparse_matrix! is called. It is used by interfaces to other programming languages to safely cache per-matrix data.

Note

This is purely in-memory per-instance, and not a global persistent version counter. That is, the version counter starts at zero even if opening a persistent disk daf data set.

metacells = example_metacells_daf()
println(matrix_version_counter(metacells, "gene", "metacell", "fraction"))
set_matrix!(metacells, "gene", "metacell", "fraction", rand(Float32, 683, 7); overwrite = true)
println(matrix_version_counter(metacells, "gene", "metacell", "fraction"))

# output

1
2
