Zarr Format

DataAxesFormats.ZarrFormat Module

A Daf storage format in a Zarr directory tree or ZIP archive. Like FilesDaf , the data can live in a directory of files on the filesystem (so standard filesystem tools work, and deleting a property immediately frees its storage), but the format offers a different trade-off compared to FilesDaf and H5df .

FilesDaf uses its own Daf -specific layout, but the individual files are in deliberately simple formats ( JSON for metadata, one-line-per-entry text for axis entries, raw little-endian binary for numeric data), so they are easy to inspect or produce with standard command-line tools even without any Daf -aware library. ZarrDaf instead lays the files out according to the Zarr specification: the per-array .zarray metadata and the chunk files are more opaque than FilesDaf 's plain text/JSON, but in exchange the directory can be read directly by any Zarr library (e.g. the Python zarr package) without that library having to know anything about Daf .
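
For example, the following minimal sketch reads such a directory using Zarr.jl alone, with no Daf -aware code involved. The example.daf.zarr path and the UMIs matrix are hypothetical, and we assume the matrix was stored dense:

using Zarr  # no DataAxesFormats dependency required

root = zopen("example.daf.zarr")                 # plain Zarr group
cells = root["axes"]["cell"][:]                  # names of the cell axis entries
umis = root["matrices"]["cell"]["gene"]["UMIs"]  # hypothetical dense matrix
counts = umis[:, :]                              # read as an ordinary Julia matrix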

A Zarr directory is still a directory rather than a single file, so for convenient publication or transport we also support storing a Daf data set inside a single ZIP archive via MmapZipStore . Archives written by this package hold every chunk uncompressed (method 0 ) so the archive can be memory-mapped for direct access, just like the directory backend. On the ZIP backend the archive is append-only: properties cannot be deleted and axes cannot be reordered. For read access, any Zarr v2 ZIP archive that matches the internal structure described below is accepted (including archives produced by foreign tools such as Python's zarr package, even if their arrays are chunked and/or compressed, subject to Zarr.jl 's support for data types, filters, and compressors). Remote object stores (S3, GCS, …) are not supported.

We use the following internal structure under some root Zarr group (which is not compatible with any specific existing Zarr-based convention such as OME-NGFF ):

  • The root group contains four sub-groups: scalars , axes , vectors , and matrices , plus a daf array.

  • The daf array signifies that the group contains Daf data. It contains two UInt8 integers, the first being the major version number and the second the minor version number, using semantic versioning . This makes it easy to test whether some Zarr group does or doesn't contain Daf data, and which version of the internal structure it is using (see the sketch after this list). Currently the only defined version is [1,0] .

  • The scalars group contains scalar properties, each as a single-element Zarr array. The only supported scalar data types are those included in StorageScalar . If you really need something else, serialize it to JSON and store the result as a string scalar. This should be extremely rare.

  • The axes group contains a Zarr array per axis, which contains a vector of strings (the names of the axis entries).

  • The vectors group contains a sub-group for each axis. Each such sub-group contains vector properties. If the vector is dense, it is stored directly as a Zarr array. Otherwise, it is stored as a sub-group containing two child Zarr arrays: nzind containing the indices of the non-zero values, and nzval containing the actual values. See Julia's SparseVector implementation for details. The only supported vector element types are those included in StorageScalar , same as StorageVector .

    If the data type is Bool then the data vector is typically all- true values; in this case we simply skip storing the nzval child array.

  • The matrices group contains a sub-group for each rows axis, which contains a sub-group for each columns axis. Each such sub-sub-group contains matrix properties. If the matrix is dense, it is stored directly as a Zarr array (in column-major layout). Otherwise, it is stored as a sub-group containing three child Zarr arrays: colptr containing, for each column, the range of its entries in rowval , rowval containing the indices of the non-zero rows in each column, and nzval containing the non-zero matrix entry values. See Julia's SparseMatrixCSC implementation for details (and the sketch after this list). The only supported matrix element types are those included in StorageReal - this explicitly excludes matrices of strings, same as StorageMatrix .

    If the data type is Bool then the data matrix is typically all- true values; in this case we simply skip storing the nzval child array.

  • Every Zarr array is created without compression, using a single chunk covering the full array, so the chunk file on disk is a raw binary image that we can memory-map for direct access.
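
As referenced above, here is a minimal sketch of how a foreign reader can follow these conventions using only Zarr.jl and SparseArrays . The example.daf.zarr path and the UMIs matrix name are hypothetical; only daf , colptr , rowval , and nzval are fixed by the format. We also assume colptr and rowval share the same integer type, as Julia's SparseMatrixCSC requires:

using Zarr, SparseArrays

root = zopen("example.daf.zarr")

# The daf array identifies Daf data and gives the format version:
major, minor = root["daf"][:]
@assert (major, minor) == (0x01, 0x00)

# Rebuild a sparse matrix from its three child arrays (CSC layout, 1-based indices):
n_rows = length(root["axes"]["cell"][:])
n_cols = length(root["axes"]["gene"][:])
group = root["matrices"]["cell"]["gene"]["UMIs"]
umis = SparseMatrixCSC(n_rows, n_cols,
                       group["colptr"][:], group["rowval"][:], group["nzval"][:])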

Example Zarr directory structure:

example-daf-dataset-root-directory.zarr/
├─ .zgroup
├─ daf/
│  ├─ .zarray
│  └─ 0
├─ scalars/
│  ├─ .zgroup
│  └─ version/
│     ├─ .zarray
│     └─ 0
├─ axes/
│  ├─ .zgroup
│  ├─ cell/
│  └─ gene/
├─ vectors/
│  ├─ .zgroup
│  ├─ cell/
│  │  ├─ .zgroup
│  │  └─ batch/
│  └─ gene/
│     ├─ .zgroup
│     └─ is_marker/
└─ matrices/
   ├─ .zgroup
   ├─ cell/
   │  ├─ .zgroup
   │  └─ gene/
   │     ├─ .zgroup
   │     └─ UMIs/
   │        ├─ .zgroup
   │        ├─ colptr/
   │        ├─ rowval/
   │        └─ nzval/
   └─ gene/
      ├─ .zgroup
      ├─ cell/
      └─ gene/

Note

Zarr.jl writes matrices in C storage order (the only order it supports) with the .zarray shape listed in the reverse of the Julia matrix shape, so the raw chunk bytes match Julia's native column-major layout. A Daf matrix whose (rows_axis, columns_axis) are (cell, gene) (a Julia (n_cells, n_genes) matrix) is therefore written with .zarray containing "shape": [n_genes, n_cells] and "order": "C" . A client using a different Zarr implementation — most notably Python's zarr package — reads this as a C-contiguous NumPy array of shape (n_genes, n_cells) , which is the transpose of the Julia view. The bytes on disk are identical; only the shape labels are swapped. To obtain the Daf -canonical (cell, gene) orientation in Python, apply .T (a zero-copy view) to the loaded array. This affects only dense matrices (the colptr / rowval / nzval child arrays of sparse matrices are 1D vectors, unaffected); 1D axis-entry arrays and vector properties have the same shape in both languages.
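
The following sketch demonstrates this from the Julia side, assuming the standard Daf writer API ( add_axis! , set_matrix! ) and the JSON package; the demo.daf.zarr path and all names are illustrative:

using DataAxesFormats
import JSON

daf = ZarrDaf("demo.daf.zarr", "w")
add_axis!(daf, "cell", ["C1", "C2", "C3"])
add_axis!(daf, "gene", ["G1", "G2"])
set_matrix!(daf, "cell", "gene", "UMIs", [1 4; 2 5; 3 6])  # Julia shape (3, 2)

meta = JSON.parsefile("demo.daf.zarr/matrices/cell/gene/UMIs/.zarray")
@assert meta["shape"] == [2, 3]  # the reverse of the Julia (n_cells, n_genes) shape
@assert meta["order"] == "C"

A Python reader loading this array therefore sees shape (2, 3) and should apply .T to recover the (cell, gene) orientation.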

Note

The code here assumes the Zarr data obeys all the above conventions and restrictions. As long as you only create and access Daf data in Zarr directories using ZarrDaf , then the code will work as expected (assuming no bugs). However, if you do this in some other way (e.g., a Zarr library in another language producing compressed or multi-chunk arrays), and the result is invalid, then the code here may fail with "less than friendly" error messages.

DataAxesFormats.ZarrFormat.ZarrDaf Type
ZarrDaf(
    path::AbstractString,
    mode::AbstractString = "r";
    [name::Maybe{AbstractString} = nothing]
)

Storage in a Zarr directory tree, Zarr ZIP archive, or remote HTTP(S) Zarr group.

The path follows one of these conventions:

  • something.daf.zarr — a Zarr directory containing a single Daf data set at its root.
  • something.daf.zarr.zip — a Zarr ZIP archive containing a single Daf data set at its root.
  • something.dafs.zarr.zip#/group — a Zarr ZIP archive containing Daf data sets in sub-groups, addressed by group .
  • http://… or https://… — a URL pointing at a remote Zarr directory that contains a Daf data set, served over HTTP (e.g. via a static file server, HTTP.serve(store, path, …) , or xpublish ). Only mode = "r" is supported; the HTTP backend is strictly read-only and returns a DafReadOnly . The remote directory must contain a consolidated .zmetadata file, and the served content must be stable for the lifetime of the open handle: per-chunk GETs happen lazily, so if the underlying data set is rewritten or relocated while the handle is open, subsequent reads may see inconsistent bytes.

The backend (directory, ZIP, or HTTP) is selected from the path prefix / file-name suffix. The ZIP backend is append-only: properties cannot be deleted and axes cannot be reordered (attempts to do so raise an error).
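
For example (a sketch; all paths are illustrative):

using DataAxesFormats

daf = ZarrDaf("data.daf.zarr", "r+")                   # directory backend, writable
sub = ZarrDaf("sets.dafs.zarr.zip#/batch1", "w+")      # one sub-daf inside a ZIP archive
remote = ZarrDaf("https://example.com/data.daf.zarr")  # HTTP backend, always read-only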

Note

If you create a directory whose name is something.dafs.zarr.zip# and place Daf ZIP archives in it, this scheme will fail. So don't.

When opening an existing data set, if name is not specified, and there exists a "name" scalar property, it is used as the name. Otherwise, the path (including any #/group suffix) will be used as the name.

The valid mode values are as follows (the default mode is r ):

Mode  Allow modifications?  Create if does not exist?  Truncate if exists?  Returned type
r     No                    No                         No                   DafReadOnly
r+    Yes                   No                         No                   ZarrDaf
w+    Yes                   Yes                        No                   ZarrDaf
w     Yes                   Yes                        Yes                  ZarrDaf

Truncating a sub-daf inside a ZIP archive is not supported (because the ZIP backend is append-only) and raises an error; use r+ or w+ to open a sub-daf for writing without truncation.
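
For example (a sketch; the archive path is illustrative):

sub = ZarrDaf("sets.dafs.zarr.zip#/batch1", "w+")  # OK: creates (or opens) without truncation
ZarrDaf("sets.dafs.zarr.zip#/batch1", "w")         # raises: the ZIP backend cannot truncate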

Note

When several ZarrDaf instances in the same process share a ZIP archive path (typically different #/group sub-dafs of the same .dafs.zarr.zip file, or repeated opens of the same single-daf .daf.zarr.zip ), they share a single underlying MmapZipStore and a single data_lock , so that concurrent calls serialize correctly and the archive is never mmap-ed twice. The first such open determines the store's writability: a later open of the same archive that requests write access will raise an error if the first open was read-only. Release the read-only handle first, or open the writable instance first. The directory backend does not share a store — each open creates its own independent DirectoryStore over the same filesystem tree.
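
For example, within one process, open the writable handle before any read-only handle of the same archive (a sketch; the paths are illustrative):

writer = ZarrDaf("sets.dafs.zarr.zip#/batch1", "r+")
reader = ZarrDaf("sets.dafs.zarr.zip#/batch2", "r")  # shares writer's store and data_lock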

DataAxesFormats.ZarrFormat.DAF_ZARR_ZIP_MAX_FILE_SIZE Constant

The virtual address reservation size used for writable MmapZipStore opens of a ZarrDaf (modes r+ , w+ , w ). Each such open reserves this much virtual address space via a single anonymous PROT_NONE mapping and overlays the real file onto its first filesize bytes; subsequent ftruncate + re-overlay calls extend the accessible portion as the archive grows. The physical file stays at its real size — only VA is reserved. Defaults to 128 GiB, leaving plenty of room for concurrent live stores on platforms with ~128 TiB of user VA (Apple Silicon). Set to a larger value before opening a ZarrDaf whose ZIP archive might grow past this bound. An append that would cross the bound fails with an explicit error pointing back here.
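
For example, to allow a larger archive (a sketch; whether the constant is exposed as a settable Ref or a plain global depends on the implementation, and the path is illustrative):

# Assumption: the constant is a Ref that can be reassigned before opening.
DataAxesFormats.ZarrFormat.DAF_ZARR_ZIP_MAX_FILE_SIZE[] = 512 * 2^30  # 512 GiB
daf = ZarrDaf("huge.dafs.zarr.zip#/main", "w+")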
