H5DF Format
DataAxesFormats.H5dfFormat
—
Module
A `Daf` storage format in an HDF5 disk file. This is the "native" way to store `Daf` data in HDF5 files, which can be used to contain "anything", as HDF5 is essentially "a filesystem inside a file", with "groups" instead of directories and "datasets" instead of files. Therefore HDF5 is very generic, and there are various specific formats which use a specific internal structure to hold some data in it; for example, `h5ad` files have a specific internal structure for representing `AnnData` objects. To represent `Daf` data in HDF5 storage, we use the following internal structure (which is *not* compatible with `h5ad`):
- The HDF5 file may contain `Daf` data directly in the root group, in which case it is restricted to holding just a single `Daf` data set. When using such a file, you automatically access the single `Daf` data set contained in it. By convention such files are given a `.h5df` suffix.

- Alternatively, the HDF5 file may contain `Daf` data inside some arbitrary group, in which case there's no restriction on the content of other groups in the file. Such groups may contain other `Daf` data (allowing for multiple `Daf` data sets in a single file), and/or non-`Daf` data. When using such a file, you need to specify the name of the group that contains the `Daf` data set you are interested in. By convention, at least if such files contain "mostly" (or only) `Daf` data sets, they are given a `.h5dfs` suffix, and are accompanied by some documentation describing the top-level groups in the file.

- Under the `Daf` data group, there are 4 sub-groups: `scalars`, `axes`, `vectors` and `matrices`, and a `daf` dataset.

- To future-proof the format, the `daf` dataset will contain a vector of two integers, the first acting as the major version number and the second as the minor version number, using semantic versioning. This makes it easy to test whether some group in an HDF5 file does or does not contain `Daf` data, and which version of the internal structure it is using. Currently the only defined version is `[1,0]`.

- The `scalars` group contains scalar properties, each as its own "dataset". The only supported scalar data types are those included in `StorageScalar`. If you really need something else, serialize it to JSON and store the result as a string scalar. This should be extremely rare.

- The `axes` group contains a "dataset" per axis, which contains a vector of strings (the names of the axis entries).

- The `vectors` group contains a sub-group for each axis. Each such sub-group contains vector properties. If the vector is dense, it is stored directly as a "dataset". Otherwise, it is stored as a group containing two vector "datasets": `nzind` containing the indices of the non-zero values, and `nzval` containing the actual values. See Julia's `SparseVector` implementation for details. The only supported vector element types are those included in `StorageScalar`, same as `StorageVector`.

  If the data type is `Bool` then the data vector is typically all-`true` values; in this case we simply skip storing it.

  We switch to using this sparse format for sufficiently sparse string data (where the zero value is the empty string). This isn't supported by `SparseVector` because "reasons" so we load it into a dense vector. In this case we name the values vector `nztxt`.

- The `matrices` group contains a sub-group for each rows axis, which contains a sub-group for each columns axis. Each such sub-sub-group contains matrix properties. If the matrix is dense, it is stored directly as a "dataset" (in column-major layout). Otherwise, it is stored as a group containing three vector "datasets": `colptr` containing the indices of the rows of each column in `rowval`, `rowval` containing the indices of the non-zero rows of the columns, and `nzval` containing the non-zero matrix entry values. See Julia's `SparseMatrixCSC` implementation for details. The only supported matrix element types are those included in `StorageReal`; this explicitly excludes matrices of strings, same as `StorageMatrix`.

  If the data type is `Bool` then the data vector is typically all-`true` values; in this case we simply skip storing it.

  We switch to using this sparse format for sufficiently sparse string data (where the zero value is the empty string). This isn't supported by `SparseMatrixCSC` because "reasons" so we load it into a dense matrix. In this case we name the values vector `nztxt`.

- All vectors and matrices are stored in a contiguous way in the file, which allows us to efficiently memory-map them.
That's all there is to it. Due to the above restrictions on types and layout, the metadata provided by HDF5 for each "dataset" is sufficient to fully describe the data, and one should be able to directly access it using any HDF5 API in any programming language, if needed. Typically, however, it is easiest to simply use the Julia `Daf` package to access the data.
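The sparse layouts described above map directly onto Julia's standard `SparseArrays` constructors. The following is a minimal sketch, assuming the arrays have already been read from the file by some HDF5 API; the literal values and the variable names (`vector_length`, `flags`, `texts`, `matrix_nzval`) are made up for illustration and are not part of the format.

```julia
using SparseArrays

# Hypothetical stand-ins for the "datasets" of a sparse vector group.
nzind = [2, 5]         # 1-based indices of the non-zero entries
nzval = [1.5, 4.0]     # the matching non-zero values
vector_length = 6      # assumed to come from the length of the axis

vector = SparseVector(vector_length, nzind, nzval)

# For an all-true Bool vector, nzval is not stored, so it is filled back in.
flags = SparseVector(vector_length, nzind, fill(true, length(nzind)))

# Sparse string data (nztxt) is expanded into a dense vector,
# using the empty string as the "zero" value.
nztxt = ["B1", "B2"]
texts = fill("", vector_length)
texts[nzind] = nztxt

# A sparse matrix group holds colptr, rowval and nzval (CSC layout):
# column j occupies rowval[colptr[j]:(colptr[j + 1] - 1)].
colptr = [1, 2, 2, 4]
rowval = [1, 1, 3]
matrix_nzval = [10, 20, 30]
matrix = SparseMatrixCSC(3, 3, colptr, rowval, matrix_nzval)
```

Here the second column of `matrix` is empty because `colptr[2] == colptr[3]`.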
Example HDF5 structure:
example-daf-dataset-root-group/
├─ daf
├─ scalars/
│  └─ version
├─ axes/
│  ├─ cell
│  └─ gene
├─ vectors/
│  ├─ cell/
│  │  └─ batch
│  └─ gene/
│     └─ is_marker
└─ matrices/
   ├─ cell/
   │  ├─ cell/
   │  └─ gene/
   │     └─ UMIs/
   │        ├─ colptr
   │        ├─ rowval
   │        └─ nzval
   └─ gene/
      ├─ cell/
      └─ gene/
When creating an HDF5 file to contain `Daf` data, you should specify `;fapl=HDF5.FileAccessProperties(;alignment=(1,8))`. This ensures all the memory buffers are properly aligned for efficient access. Otherwise, memory mapping will be *much* less efficient. A warning is therefore generated whenever you try to access `Daf` data stored in an HDF5 file which does not enforce proper alignment.
Deleting data from an HDF5 file does not reuse the abandoned storage. In general, if you want to reclaim that storage, you will need to repack the file, which will invalidate any memory-mapped buffers created for it. Therefore, if you delete data (e.g. using `delete_vector!`), you should eventually abandon the `H5df` object, repack the HDF5 file, then create a new `H5df` object to access the repacked data.
The code here assumes the HDF5 data obeys all the above conventions and restrictions (that said, the code will be able to access vectors and matrices stored in unaligned, chunked and/or compressed formats, but this will be much less efficient). As long as you only create and access `Daf` data in HDF5 files using `H5df`, the code will work as expected (assuming no bugs). However, if you do this in some other way (e.g., directly using some HDF5 API in some arbitrary programming language), and the result is invalid, then the code here may fail with "less than friendly" error messages.
DataAxesFormats.H5dfFormat.MAJOR_VERSION
—
Constant
The specific major version of the `H5df` format that is supported by this code (`1`). The code will refuse to access data that is stored in a different major format.
DataAxesFormats.H5dfFormat.MINOR_VERSION
—
Constant
The maximal minor version of the `H5df` format that is supported by this code (`0`). The code will refuse to access data that is stored with the expected major version (`1`), but that uses a higher minor version.
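The two constants together define a simple acceptance rule for the two-integer `daf` dataset. A minimal sketch of that rule follows; the function name `is_supported_version` and its keyword arguments are hypothetical, and the package's own check may be structured differently.

```julia
# Hypothetical helper: decide whether a [major, minor] pair, as stored in the
# `daf` dataset, would be accepted. The defaults mirror the documented
# constants (major 1, minor 0).
function is_supported_version(version; major = 1, minor = 0)
    length(version) == 2 || return false
    stored_major, stored_minor = version
    # The major version must match exactly; the minor version may be lower or equal.
    return stored_major == major && stored_minor <= minor
end
```

With the defaults, `[1, 0]` is accepted, while `[1, 1]` (a higher minor version) and `[2, 0]` (a different major version) are rejected.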
DataAxesFormats.H5dfFormat.H5df
—
Type
H5df(
    root::Union{AbstractString, HDF5.File, HDF5.Group},
    mode::AbstractString = "r";
    [name::Maybe{AbstractString} = nothing]
)
Storage in an HDF5 file.
The `root` can be the path of an HDF5 file, which will be opened with the specified `mode`, or an opened HDF5 file, in which case the `Daf` data set will be stored directly in the root of the file (by convention, using a `.h5df` file name suffix). Alternatively, the `root` can be a group inside an HDF5 file, which allows storing multiple `Daf` data sets inside the same HDF5 file (by convention, using a `.h5dfs` file name suffix).
As a shorthand, you can also specify a `root` which is the path of an HDF5 file with a `.h5dfs` suffix, followed by `#` and the path of the group in the file.
If you create a directory whose name is `something.h5dfs#` and place `Daf` HDF5 files in it, this scheme will fail. So don't.
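To illustrate the shorthand, and why a directory named `something.h5dfs#` breaks it, here is a minimal sketch of one way such a `root` string could be split. The helper `split_h5dfs_root` is hypothetical and only assumes the `.h5dfs#` convention described above; the package's actual parsing may differ.

```julia
# Hypothetical helper: split "path/to/file.h5dfs#group/path" into the file
# path and the group path. Returns `nothing` for the group if there is no
# ".h5dfs#" marker in the string.
function split_h5dfs_root(root::AbstractString)
    marker = ".h5dfs#"
    found = findfirst(marker, root)
    found === nothing && return (root, nothing)
    file_path = root[1:(last(found) - 1)]    # keep ".h5dfs", drop the "#"
    group_path = root[(last(found) + 1):end]
    return (file_path, group_path)
end
```

Under this sketch, `split_h5dfs_root("data/cells.h5dfs#batch1")` yields the file `data/cells.h5dfs` and the group `batch1`, while a path such as `data/something.h5dfs#/cells.h5df` would be misread as the file `data/something.h5dfs` with the group `/cells.h5df`.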
When opening an existing data set, if `name` is not specified, and there exists a "name" scalar property, it is used as the name. Otherwise, the path of the HDF5 file will be used as the name, followed by `#` and the internal path of the group (if any).
The valid `mode` values are as follows (the default mode is `r`):
| Mode | Allow modifications? | Create if does not exist? | Truncate if exists? | Returned type |
|---|---|---|---|---|
| `r` | No | No | No | `DafReadOnly` |
| `r+` | Yes | No | No | `H5df` |
| `w+` | Yes | Yes | No | `H5df` |
| `w` | Yes | Yes | Yes | `H5df` |
If the `root` is a path followed by `#` and a group, then `w` mode will *not* truncate the whole file if it exists; instead, it will only truncate the group.
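The mode table can also be read as a small lookup from mode string to behavior. The sketch below encodes exactly the four documented rows; the function name `mode_flags` and the field names are hypothetical and are not part of the package.

```julia
# Hypothetical lookup mirroring the documented mode table.
function mode_flags(mode::AbstractString)
    table = Dict(
        "r"  => (modify = false, create = false, truncate = false),
        "r+" => (modify = true,  create = false, truncate = false),
        "w+" => (modify = true,  create = true,  truncate = false),
        "w"  => (modify = true,  create = true,  truncate = true),
    )
    haskey(table, mode) || throw(ArgumentError("invalid mode: $(mode)"))
    return table[mode]
end
```

For example, `mode_flags("w+")` reports that the data may be modified and created if missing, but that existing content is kept.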