Zarr Format
DataAxesFormats.ZarrFormat — Module
A `Daf` storage format in a Zarr directory tree or ZIP archive. Like `FilesDaf`, the data can live in a directory of files on the filesystem (so standard filesystem tools work, and deleting a property immediately frees its storage), but it offers a different trade-off compared to `FilesDaf` and `H5df`.
`FilesDaf` uses its own `Daf`-specific layout, but the individual files are in deliberately simple formats (JSON for metadata, one-line-per-entry text for axis entries, raw little-endian binary for numeric data), so they are easy to inspect or produce with standard command-line tools even without any `Daf`-aware library. `ZarrDaf` instead lays the files out according to the Zarr specification: the per-array `.zarray` metadata and the chunk files are more opaque than `FilesDaf`'s plain text/JSON, but in exchange the directory can be read directly by any Zarr library (e.g. the Python `zarr` package) without that library having to know anything about `Daf`.
A Zarr directory is still a directory rather than a single file, so for convenient publication or transport we also support storing a `Daf` data set inside a single ZIP archive via `MmapZipStore`. Archives written by this package hold every chunk uncompressed (method `0`), so the archive can be memory-mapped for direct access just like the directory backend. On the ZIP backend the archive is append-only: properties cannot be deleted and axes cannot be reordered. For read access, any Zarr v2 ZIP archive that matches the internal structure described below is accepted (including ones produced by foreign tools such as Python's `zarr` package, even if the arrays are chunked and/or compressed, subject to `Zarr.jl`'s support for data types, filters, and compressors). Remote object stores (S3, GCS, …) are not supported.
We use the following internal structure under some root Zarr group (which is *not* compatible with any specific existing Zarr-based convention such as OME-NGFF):
- The root group contains 4 sub-groups, `scalars`, `axes`, `vectors`, and `matrices`, and a `daf` array.
- The `daf` array signifies that the group contains `Daf` data. It contains two `UInt8` integers, the first being the major version number and the second the minor version number, using semantic versioning. This makes it easy to test whether some Zarr group does or doesn't contain `Daf` data, and which version of the internal structure it is using. Currently the only defined version is `[1, 0]`.
- The `scalars` group contains scalar properties, each as a single-element Zarr array. The only supported scalar data types are those included in `StorageScalar`. If you really need something else, serialize it to JSON and store the result as a string scalar. This should be extremely rare.
- The `axes` group contains a Zarr array per axis, which contains a vector of strings (the names of the axis entries).
- The `vectors` group contains a sub-group for each axis. Each such sub-group contains vector properties. If the vector is dense, it is stored directly as a Zarr array. Otherwise, it is stored as a sub-group containing two child Zarr arrays: `nzind` containing the indices of the non-zero values, and `nzval` containing the actual values. See Julia's `SparseVector` implementation for details. The only supported vector element types are those included in `StorageScalar`, same as `StorageVector`. If the data type is `Bool`, then the data vector typically holds all-`true` values; in this case we simply skip storing the `nzval` child array.
- The `matrices` group contains a sub-group for each rows axis, which contains a sub-group for each columns axis. Each such sub-sub-group contains matrix properties. If the matrix is dense, it is stored directly as a Zarr array (in column-major layout). Otherwise, it is stored as a sub-group containing three child Zarr arrays: `colptr` containing, for each column, the position of its first entry in `rowval`; `rowval` containing the row indices of the non-zero entries of each column; and `nzval` containing the non-zero matrix entry values. See Julia's `SparseMatrixCSC` implementation for details. The only supported matrix element types are those included in `StorageReal` - this explicitly excludes matrices of strings, same as `StorageMatrix`. If the data type is `Bool`, then the data matrix typically holds all-`true` values; in this case we simply skip storing the `nzval` child array.
- Every Zarr array is created without compression, using a single chunk covering the full array, so the chunk file on disk is a raw binary image that we can memory-map for direct access.
Example Zarr directory structure:

```
example-daf-dataset-root-directory.zarr/
├─ .zgroup
├─ daf/
│  ├─ .zarray
│  └─ 0
├─ scalars/
│  ├─ .zgroup
│  └─ version/
│     ├─ .zarray
│     └─ 0
├─ axes/
│  ├─ .zgroup
│  ├─ cell/
│  └─ gene/
├─ vectors/
│  ├─ .zgroup
│  ├─ cell/
│  │  ├─ .zgroup
│  │  └─ batch/
│  └─ gene/
│     ├─ .zgroup
│     └─ is_marker/
└─ matrices/
   ├─ .zgroup
   ├─ cell/
   │  ├─ .zgroup
   │  └─ gene/
   │     ├─ .zgroup
   │     └─ UMIs/
   │        ├─ .zgroup
   │        ├─ colptr/
   │        ├─ rowval/
   │        └─ nzval/
   └─ gene/
      ├─ .zgroup
      ├─ cell/
      └─ gene/
```
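The `UMIs` sub-group above illustrates the sparse layout. To expand such a matrix without any `Daf`-aware code, a reader walks `colptr`/`rowval`/`nzval` exactly as Julia's `SparseMatrixCSC` does. A pure-Python sketch, assuming Julia's 1-based indices are stored as-is:

```python
def csc_to_dense(n_rows, n_cols, colptr, rowval, nzval, fill=0):
    """Expand Julia-style CSC arrays (1-based `colptr`/`rowval`) into a
    dense list-of-rows, indexed as `dense[row][col]`."""
    dense = [[fill] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # Entries of column `col` occupy positions colptr[col] up to (but
        # not including) colptr[col + 1] in rowval/nzval, 1-based.
        for k in range(colptr[col] - 1, colptr[col + 1] - 1):
            dense[rowval[k] - 1][col] = nzval[k]
    return dense
```

For an all-`true` `Bool` matrix with no stored `nzval` child array, a reader would substitute `[True] * len(rowval)` for `nzval`.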
`Zarr.jl` writes matrices in C storage order (the only order it supports) with the `.zarray` `shape` listed in the reverse of the Julia matrix shape, so the raw chunk bytes match Julia's native column-major layout. A `Daf` matrix whose `(rows_axis, columns_axis)` are `(cell, gene)` (a Julia `(n_cells, n_genes)` matrix) is therefore written with `.zarray` containing `"shape": [n_genes, n_cells]` and `"order": "C"`. A client using a different Zarr implementation (most notably Python's `zarr` package) reads this as a C-contiguous NumPy array of shape `(n_genes, n_cells)`, which is the *transpose* of the Julia view. The bytes on disk are identical; only the shape labels are swapped. To obtain the `Daf`-canonical `(cell, gene)` orientation in Python, apply `.T` (a zero-copy view) to the loaded array. This affects only dense matrices (the `colptr`/`rowval`/`nzval` child arrays of sparse matrices are 1D vectors, unaffected); 1D axis-entry arrays and vector properties have the same shape in both languages.
The code here assumes the Zarr data obeys all the above conventions and restrictions. As long as you only create and access `Daf` data in Zarr directories using `ZarrDaf`, the code will work as expected (assuming no bugs). However, if you do this in some other way (e.g., a Zarr library in another language producing compressed or multi-chunk arrays), and the result is invalid, then the code here may fail with "less than friendly" error messages.
DataAxesFormats.ZarrFormat.MAJOR_VERSION — Constant

The major version of the `ZarrDaf` on-disk format supported by this code.
DataAxesFormats.ZarrFormat.MINOR_VERSION — Constant

The highest minor version of the `ZarrDaf` on-disk format supported by this code.
DataAxesFormats.ZarrFormat.ZarrDaf — Type

```julia
ZarrDaf(
    path::AbstractString,
    mode::AbstractString = "r";
    [name::Maybe{AbstractString} = nothing]
)
```
Storage in a Zarr directory tree, Zarr ZIP archive, or remote HTTP(S) Zarr group.
The `path` is a filesystem path or URL that follows one of these conventions:

- `something.daf.zarr` — a Zarr directory containing a single `Daf` data set at its root.
- `something.daf.zarr.zip` — a Zarr ZIP archive containing a single `Daf` data set at its root.
- `something.dafs.zarr.zip#/group` — a Zarr ZIP archive containing `Daf` data sets in sub-groups, addressed by `group`.
- `http://…` or `https://…` — a URL pointing at a remote Zarr directory that contains a `Daf` data set, served over HTTP (e.g. via a static file server, `HTTP.serve(store, path, …)`, or `xpublish`). Only `mode = "r"` is supported; the HTTP backend is strictly read-only and returns a `DafReadOnly`. The remote directory must contain a consolidated `.zmetadata` file, and the served content must be stable for the lifetime of the open handle: per-chunk GETs happen lazily, so if the underlying data set is rewritten or relocated while the handle is open, subsequent reads may see inconsistent bytes.
The backend (directory, ZIP, or HTTP) is selected from the path prefix / file-name suffix. The ZIP backend is append-only: properties cannot be deleted and axes cannot be reordered (attempts to do so raise an error).
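The dispatch on path shape can be sketched as follows (a hypothetical helper; the actual selection logic lives inside `ZarrDaf`):

```python
def select_backend(path):
    """Map a ZarrDaf path onto (backend, group-or-None) following the
    conventions above. Hypothetical helper, not part of the package."""
    if path.startswith(("http://", "https://")):
        return ("http", None)  # strictly read-only backend
    archive, _, group = path.partition("#")
    if archive.endswith(".zarr.zip"):
        return ("zip", group or None)
    if group:
        raise ValueError(f"'#' group only applies to .zarr.zip paths: {path}")
    if archive.endswith(".zarr"):
        return ("directory", None)
    raise ValueError(f"unrecognized ZarrDaf path: {path}")
```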
If you create a directory whose name is `something.dafs.zarr.zip#` and place `Daf` ZIP archives in it, this scheme will fail. So don't.
When opening an existing data set, if `name` is not specified and there exists a "name" scalar property, it is used as the name. Otherwise, the `path` (including any `#/group` suffix) will be used as the name.
The valid `mode` values are as follows (the default mode is `r`):

| Mode | Allow modifications? | Create if does not exist? | Truncate if exists? | Returned type |
|:----:|:--------------------:|:-------------------------:|:-------------------:|:-------------:|
| `r` | No | No | No | `DafReadOnly` |
| `r+` | Yes | No | No | `ZarrDaf` |
| `w+` | Yes | Yes | No | `ZarrDaf` |
| `w` | Yes | Yes | Yes | `ZarrDaf` |
Truncating a sub-daf inside a ZIP archive is not supported (because the ZIP backend is append-only) and raises an error; use `r+` or `w+` to open a sub-daf for writing without truncation.
When several `ZarrDaf` instances in the same process share a ZIP archive path (typically different `#/group` sub-dafs of the same `.dafs.zarr.zip` file, or repeated opens of the same single-daf `.daf.zarr.zip`), they share a single underlying `MmapZipStore` and a single `data_lock`, so that concurrent calls serialize correctly and the archive is never mmap-ed twice. The first such open determines the store's writability: a later open of the same archive that requests write access will raise an error if the first open was read-only. Release the read-only handle first, or open the writable instance first. The directory backend does not share a store; each open creates its own independent `DirectoryStore` over the same filesystem tree.
DataAxesFormats.ZarrFormat.DAF_ZARR_ZIP_MAX_FILE_SIZE — Constant
The virtual address reservation size used for writable `MmapZipStore` opens of a `ZarrDaf` (modes `r+`, `w+`, `w`). Each such open reserves this much virtual address space via a single anonymous `PROT_NONE` mapping and overlays the real file onto its first `filesize` bytes; subsequent `ftruncate` + re-overlay calls extend the accessible portion as the archive grows. The physical file stays at its real size; only virtual address space is reserved. Defaults to 128 GiB, leaving plenty of room for concurrent live stores on platforms with ~128 TiB of user VA (Apple Silicon). Set this to a larger value before opening a `ZarrDaf` whose ZIP archive might grow past this bound. An append that would cross the bound fails with an explicit error pointing back here.