Files Format
DataAxesFormats.FilesFormat
—
Module
A `Daf` storage format in disk files. This is an efficient way to persist `Daf` data in a filesystem, and offers a different trade-off compared to storing the data in an HDF5 file.
On the downside, this being a directory, you need to create a `zip` or `tar` or some other form of archive file if you want to publish it. Also, accessing `FilesDaf` will consume multiple file descriptors as opposed to just one for HDF5, and, of course, HDF5 has libraries to support it in most systems.
On the upside, the format of the files is so simple that it is trivial to access them from any programming environment, without requiring a complex library like HDF5. In addition, since each scalar, vector or matrix property is stored in a separate file, deleting data automatically frees the storage (unlike in an HDF5 file, where you must manually repack the file to actually release the storage). Also, you can use standard tools to look at the data (e.g. use `ls` or the Windows file explorer to view the list of properties, how much space each one uses, when it was created, etc.). Most importantly, this allows using standard tools like `make` to create automatic repeatable processing workflows.
We use multiple files to store `Daf` data, under some root directory, as follows:
- The directory will contain 4 sub-directories: `scalars`, `axes`, `vectors`, and `matrices`, and a file called `daf.json`.

- The `daf.json` file signifies that the directory contains `Daf` data. In this file, there should be a mapping with a `version` key whose value is an array of two integers. The first is the major version number and the second is the minor version number, using semantic versioning. This makes it easy to test whether or not a directory contains `Daf` data, and which version of the internal structure it is using. Currently the only defined version is `[1,0]`.

- The `scalars` directory contains scalar properties, each in its own `name.json` file, containing a mapping with a `type` key whose value is the data type of the scalar (one of the `StorageScalar` types, with `String` for a string scalar) and a `value` key whose value is the actual scalar value.

- The `axes` directory contains a `name.txt` file per axis, where each line contains a name of an axis entry.

- The `vectors` directory contains a directory per axis, containing the vectors. For every vector, a `name.json` file will contain a mapping with an `eltype` key specifying the type of the vector element, and a `format` key specifying how the data is stored on disk, one of `dense` and `sparse`.

  If the `format` is `dense`, then there will be a file containing the vector entries, either `name.txt` for strings (with a value per line), or `name.data` for binary data (which we can memory-map for direct access).

  If the `format` is `sparse`, then there will also be an `indtype` key specifying the data type of the indices of the non-zero values, and two binary data files, `name.nzind` containing the indices of the non-zero entries, and `name.nzval` containing the values of the non-zero entries (which we can memory-map for direct access). See Julia's `SparseVector` implementation for details.

  If the data type is `Bool` then the vector of non-zero values is typically all-`true`; in this case we simply skip storing it.

  We switch to using this sparse format for sufficiently sparse string data (where the zero value is the empty string). This isn't supported by `SparseVector` because "reasons" so we load it into a dense vector. In this case we name the values file `name.nztxt`.

- The `matrices` directory contains a directory per rows axis, which contains a directory per columns axis, which contains the matrices. For each matrix, a `name.json` file will contain a mapping with an `eltype` key specifying the type of the matrix element, and a `format` key specifying how the data is stored on disk, one of `dense` and `sparse`.

  If the `format` is `dense`, then there will be a `name.data` binary file in column-major layout (which we can memory-map for direct access).

  If the `format` is `sparse`, then there will also be an `indtype` key specifying the data type of the indices of the non-zero values, and three binary data files, `name.colptr` and `name.rowval` containing the indices of the non-zero values, and `name.nzval` containing the values of the non-zero entries (which we can memory-map for direct access). See Julia's `SparseMatrixCSC` implementation for details.

  If the data type is `Bool` then the vector of non-zero values is typically all-`true`; in this case we simply skip storing it.

  We switch to using this sparse format for sufficiently sparse string data (where the zero value is the empty string). This isn't supported by `SparseMatrixCSC` because "reasons" so we load it into a dense matrix. In this case we name the values file `name.nztxt`.
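The sparse vector layout described above can be read back using only Julia's standard library. The following is a minimal sketch, not the package's implementation: `read_binary` and `load_sparse_vector` are hypothetical helper names, the element and index types are passed in by hand (in practice they would come from the `eltype` and `indtype` keys of `name.json`), and the stored indices are assumed to be 1-based and sorted, as in Julia's `SparseVector`.

```julia
using SparseArrays

# Read `count` elements of type `T` from a headerless binary file.
# Assumes a little-endian host, matching the on-disk byte order.
function read_binary(path::AbstractString, ::Type{T}, count::Integer) where {T}
    data = Vector{T}(undef, count)
    open(path, "r") do io
        read!(io, data)
    end
    return data
end

# Hypothetical helper: rebuild a sparse vector with `axis_length` entries
# from `<name>.nzind` and `<name>.nzval` inside `directory`.
function load_sparse_vector(
    directory::AbstractString,
    name::AbstractString,
    axis_length::Integer,
    ::Type{T},
    ::Type{I},
) where {T, I <: Integer}
    nzind_path = joinpath(directory, name * ".nzind")
    nzval_path = joinpath(directory, name * ".nzval")

    # There are no headers, so the file size gives the number of non-zeros.
    n_nonzeros = filesize(nzind_path) ÷ sizeof(I)
    nzind = read_binary(nzind_path, I, n_nonzeros)

    nzval = if isfile(nzval_path)
        read_binary(nzval_path, T, n_nonzeros)
    else
        # Only `Bool` data may skip the values file (meaning "all true").
        T === Bool || error("missing values file: $(nzval_path)")
        fill(true, n_nonzeros)
    end

    return SparseVector(axis_length, nzind, nzval)
end
```

For example, `load_sparse_vector("vectors/gene", "is_marker", n_genes, Bool, Int32)` would rebuild a sparse `Bool` vector, where `n_genes` is the number of lines in `axes/gene.txt` and the `Int32` index type is only an illustration.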
Since data is stored in files named after the properties, we are sadly susceptible to operating system vagaries when it comes to what counts as a valid property name (e.g., no `/` characters allowed) and whether property names are case sensitive. In theory, we could encode the property names somehow, but that would make the file names opaque, which would lose a lot of the benefit of using files. It *always* pays to have "sane", simple, unique property names, using only alphanumeric characters, that would be a valid variable name in most programming languages.
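One way to act on this advice is to screen property names before writing them. The sketch below is illustrative only: `is_sane_property_name` and `case_collisions` are hypothetical helpers, not part of the package, and the rule used (ASCII letters, digits and underscores, not starting with a digit) is an assumption about what "a valid variable name in most programming languages" means.

```julia
# Hypothetical helper: accept only names that are safe as file names on
# common filesystems and valid as identifiers in most languages.
# `\z` anchors at the very end, so a trailing newline is rejected too.
is_sane_property_name(name::AbstractString) =
    occursin(r"^[A-Za-z_][A-Za-z0-9_]*\z", name)

# Hypothetical helper: find pairs of names that differ only by case, which
# would map to the same file on a case-insensitive filesystem.
function case_collisions(names)
    seen = Dict{String, String}()
    collisions = Pair{String, String}[]
    for name in names
        key = lowercase(name)
        if haskey(seen, key) && seen[key] != name
            push!(collisions, seen[key] => name)
        else
            seen[key] = name
        end
    end
    return collisions
end
```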
Example directory structure:
example-daf-dataset-root-directory/
├─ daf.json
├─ scalars/
│  └─ version.json
├─ axes/
│  ├─ cell.txt
│  └─ gene.txt
├─ vectors/
│  ├─ cell/
│  │  ├─ batch.json
│  │  └─ batch.txt
│  └─ gene/
│     ├─ is_marker.json
│     └─ is_marker.data
└─ matrices/
   ├─ cell/
   │  ├─ cell/
   │  └─ gene/
   │     ├─ UMIs.json
   │     ├─ UMIs.colptr
   │     ├─ UMIs.rowval
   │     └─ UMIs.nzval
   └─ gene/
      ├─ cell/
      └─ gene/
All binary data is stored as a sequence of elements, in little endian byte order (which is the native order for modern CPUs), without any headers or padding. (Dense) matrices are stored in column-major layout (which matches Julia's native matrix layout).
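Because dense binary files have no headers or padding, a file can be memory-mapped directly into a Julia array. The following is a rough sketch using the `Mmap` standard library; `mmap_dense_matrix` is a hypothetical helper, and the element type and dimensions are supplied by the caller (in practice they would come from `name.json` and the lengths of the two axes).

```julia
using Mmap

# Hypothetical helper: memory-map a dense `<name>.data` matrix file.
function mmap_dense_matrix(
    path::AbstractString,
    ::Type{T},
    n_rows::Integer,
    n_columns::Integer,
) where {T}
    # The file is little-endian; on a big-endian host the bytes would need
    # to be swapped, which a plain memory map can't do.
    ENDIAN_BOM == 0x04030201 || error("big-endian hosts are not handled here")

    # With no headers or padding, the size must match exactly.
    expected_size = n_rows * n_columns * sizeof(T)
    filesize(path) == expected_size || error("unexpected file size: $(path)")

    # Julia matrices are column-major, so no transposition is needed.
    return Mmap.mmap(path, Matrix{T}, (Int(n_rows), Int(n_columns)))
end
```

Since the file is opened for reading only, the returned matrix should be treated as read-only.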
All string data is stored in lines, one entry per line, separated by a `\n` character (regardless of the OS used). Therefore, you can't have a line break inside an axis entry name or in a vector property value, at least not when storing it in `FilesDaf`.
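The line-based convention can be sketched as a pair of helpers. These are hypothetical names, not the package's API, and the sketch assumes that every entry, including the last one, is terminated by `\n`; the text above only says that entries are separated by it.

```julia
# Hypothetical helper: write one entry per line, always using "\n".
function write_entries(path::AbstractString, entries)
    for entry in entries
        occursin("\n", entry) && error("entry contains a line break: $(repr(entry))")
    end
    open(path, "w") do io
        for entry in entries
            print(io, entry, "\n")
        end
    end
    return nothing
end

# Hypothetical helper: read the entries back, splitting on "\n" only
# (so a "\r" inside an entry is preserved rather than stripped).
function read_entries(path::AbstractString)
    text = read(path, String)
    isempty(text) && return String[]
    if endswith(text, "\n")
        text = chop(text)  # drop the final terminator
    end
    return String.(split(text, "\n"))
end
```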
That's all there is to it. The format is intentionally simple and transparent to maximize its accessibility by other (standard) tools. Still, it is easiest to create the data using the Julia `Daf` package.
The code here assumes the data in the files obeys all the above conventions and restrictions. As long as you only create and access `Daf` data in files using `FilesDaf`, the code will work as expected (assuming no bugs). However, if you do this in some other way (e.g., directly using the filesystem and custom tools), and the result is invalid, then the code here may fail with "less than friendly" error messages.
DataAxesFormats.FilesFormat.MAJOR_VERSION
—
Constant
The specific major version of the `FilesDaf` format that is supported by this code (`1`). The code will refuse to access data that is stored using a different major version.
DataAxesFormats.FilesFormat.MINOR_VERSION
—
Constant
The maximal minor version of the `FilesDaf` format that is supported by this code (`0`). The code will refuse to access data that is stored with the expected major version (`1`), but that uses a higher minor version.
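Taken together, the two constants imply a simple acceptance rule for the `version` array found in `daf.json`. The sketch below restates that rule; `is_supported_version` is a hypothetical function, and its keyword defaults merely mirror the documented values of `MAJOR_VERSION` and `MINOR_VERSION`.

```julia
# Hypothetical helper: `version` is the two-integer array from `daf.json`.
function is_supported_version(version; major::Integer = 1, minor::Integer = 0)
    length(version) == 2 || return false
    data_major, data_minor = version
    # The major version must match exactly; the minor version of the data
    # must not be higher than the one the code supports.
    return data_major == major && data_minor <= minor
end
```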
DataAxesFormats.FilesFormat.FilesDaf
—
Type
FilesDaf(
path::AbstractString,
mode::AbstractString = "r";
[name::Maybe{AbstractString} = nothing]
)
Storage in disk files in some directory.
When opening an existing data set, if `name` is not specified, and there exists a "name" scalar property, it is used as the name. Otherwise, the `path` will be used as the name.
The valid `mode` values are as follows (the default mode is `r`):
| Mode | Allow modifications? | Create if does not exist? | Truncate if exists? | Returned type |
|---|---|---|---|---|
| `r` | No | No | No | `DafReadOnly` |
| `r+` | Yes | No | No | `FilesDaf` |
| `w+` | Yes | Yes | No | `FilesDaf` |
| `w` | Yes | Yes | Yes | `FilesDaf` |
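The table can be restated as a small lookup, which may help when wrapping `FilesDaf` in scripts. This is only an illustration of the documented semantics; `mode_flags` is a hypothetical function and says nothing about how the package parses the mode internally.

```julia
# Hypothetical helper: translate a mode string into the three flags
# listed in the table above.
function mode_flags(mode::AbstractString)
    if mode == "r"
        return (writable = false, create = false, truncate = false)
    elseif mode == "r+"
        return (writable = true, create = false, truncate = false)
    elseif mode == "w+"
        return (writable = true, create = true, truncate = false)
    elseif mode == "w"
        return (writable = true, create = true, truncate = true)
    else
        error("invalid mode: $(mode)")
    end
end
```

Note that only `w` discards existing data, so `w+` is the safer choice when a script may be re-run against the same directory.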