Storage types

DataAxesFormats.StorageTypes Module

Only a restricted set of scalar, matrix and vector types is stored by Daf.

The set of scalar types is restricted because we need to be able to store them in disk files. This rules out compound types such as Dict. This isn't an issue for vector and matrix elements but is sometimes bothersome for "scalar" data (not associated with any axis). If you find yourself needing to store such data, you'll have to serialize it to a string. By convention, we use JSON blobs for such data to maximize portability between different systems.

Julia supports a potentially infinite variety of ways to represent matrices and vectors. Daf is intentionally restricted to specific representations. This has several advantages:

  • Daf storage formats need only implement storing these restricted representations, which lend themselves to simple storage in consecutive bytes (in memory and/or on disk). These representations also allow for memory-mapping the data from disk files, which allows Daf to deal with data sets larger than the available memory. However, we also allow storing vectors and matrices of strings. We try to make this as efficient as possible (which isn't saying much).

  • Client code need only worry about dealing with these restricted representations, which limits the number of code paths required for efficient algorithm implementations. However, you (mostly) need not worry about this when invoking library functions, which have code paths covering all common matrix types. You do need to consider the layout of the data, though (see below).

This has the downside that Daf doesn't support efficient storage of specialized matrices (to pick a random example, upper triangular matrices). This isn't a great loss, since Daf targets storing arbitrary scientific data (especially biological data), which in general is not of any such special shape. The upside is that all matrices stored and returned by Daf have a clear layout (regardless of whether they are dense or sparse). This allows user code to ensure it is working "with the grain" of the data, which is much more efficient.
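As a rough illustration of working "with the grain", the sketch below uses only base Julia (no Daf calls) on a small dense matrix. A plain Julia Matrix is column-major, so each column is contiguous in memory; the variable names are made up for this example.

```julia
# A dense Julia `Matrix` is column-major: each column is contiguous in memory.
matrix = [1.0 2.0 3.0; 4.0 5.0 6.0]  # 2 rows, 3 columns

# The in-memory order walks down each column in turn.
memory_order = vec(matrix)

# "With the grain": visit one contiguous column at a time.
column_sums = [sum(@view matrix[:, column]) for column in axes(matrix, 2)]

# "Against the grain": each row is strided across all the columns.
row_sums = [sum(@view matrix[row, :]) for row in axes(matrix, 1)]
```

Both loops give correct results; on large matrices the column-wise loop is typically much faster because it reads memory sequentially.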

Note

Currently all Boolean vectors and matrices are stored internally using one byte per entry (that is, as Vector{Bool} and Matrix{Bool} rather than BitVector and BitMatrix). This is somewhat less efficient, but is simpler, and Boolean data is rarely a significant part of either storage or processing.
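To make the difference between the two representations concrete, here is a small base-Julia sketch (it does not call Daf, and it does not show whether Daf performs this conversion for you) that turns a packed BitVector into the one-byte-per-entry Vector{Bool} form:

```julia
# Packed representation: one bit per entry.
bits = BitVector([true, false, true, true])

# One-byte-per-entry representation, the form described in the note above.
bytes = Vector{Bool}(bits)

same_values = bytes == bits   # the values are unchanged by the conversion
byte_count = sizeof(bytes)    # one byte for each of the four entries
```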

DataAxesFormats.StorageTypes.StorageScalar Type
StorageScalar = Union{StorageReal, <:AbstractString}

Types that can be used as scalars, or elements in stored matrices or vectors.

This is restricted to StorageReal (including Booleans) and strings. It is arguably too restrictive, as in principle we could support any arbitrary isbitstype. However, in practice this would cause much trouble when accessing the data from other systems (specifically Python and R). Since Daf targets storing scientific data (especially biological data), as opposed to "anything at all", this restriction seems reasonable.

DataAxesFormats.StorageTypes.StorageScalarBase Type
StorageScalarBase = Union{StorageReal, AbstractString}

For use in where clauses when a type needs to be a StorageScalar. That is, write where {T <: StorageScalarBase} instead of where {T <: StorageScalar}, because of the limitations of Julia's type system.
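The pattern can be sketched with local stand-ins. DemoReal, DemoScalarBase and describe_elements below are hypothetical names invented for this example; they are not the Daf definitions, and DemoReal is deliberately a much smaller union than the real StorageReal. The point is only the shape of the where clause, which bounds T by a plain Union.

```julia
# Hypothetical local stand-ins for illustration; NOT the actual Daf definitions.
const DemoReal = Union{Bool, Int64, Float64}
const DemoScalarBase = Union{DemoReal, AbstractString}

# Accept any vector whose element type is one of the demo scalar types.
function describe_elements(vector::AbstractVector{T})::String where {T <: DemoScalarBase}
    return "$(length(vector)) x $(T)"
end

floats_description = describe_elements([1.5, 2.5])
strings_description = describe_elements(["a", "b", "c"])

# Element types outside the union (here, Symbol) have no matching method.
accepts_symbols = applicable(describe_elements, [:a, :b])
```

With the real package you would bound T by StorageScalarBase in the same position.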

DataAxesFormats.StorageTypes.StorageVector Type
StorageVector{T} = AbstractVector{T} where {T <: StorageScalar}

Vectors that can be directly stored in (and fetched from) Daf storage.

The element type must be a StorageScalar, to allow storing the data in disk files. Vectors of strings are supported but will be less efficient.

DataAxesFormats.StorageTypes.StorageMatrix Type
StorageMatrix{T} = AbstractMatrix{T} where {T <: StorageScalar}

Matrices that can be directly stored in (and fetched from) Daf storage.

Note

All matrices we store must have a clear layout, that is, must be in either row-major or column-major format. There's no support for sparse matrices of strings, because "reasons".
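For sparse data, a clear layout might look like the following sketch, which uses only Julia's standard SparseArrays library (no Daf calls). A SparseMatrixCSC keeps the non-zero entries of each column together, so it is column-major; the values and variable names are invented for this example.

```julia
using SparseArrays

# Entries: (1,1) = 10, (2,1) = 20, (2,3) = 30 in a 2 x 3 matrix.
sparse_matrix = sparse([1, 2, 2], [1, 1, 3], [10.0, 20.0, 30.0], 2, 3)

# Compressed-sparse-column storage groups the non-zeros by column, so a
# per-column traversal only touches the entries that are actually stored.
nonzeros_per_column =
    [length(nzrange(sparse_matrix, column)) for column in axes(sparse_matrix, 2)]

# Sum the stored values of each column by walking its range of non-zeros.
stored_values = nonzeros(sparse_matrix)
column_sums =
    [sum(stored_values[nzrange(sparse_matrix, column)]; init = 0.0)
     for column in axes(sparse_matrix, 2)]
```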
