Formats

DataAxesFormats.Formats Module

The FormatReader and FormatWriter interfaces specify a low-level API for storing Daf data. To extend Daf to support an additional format, create a new implementation of this API.

A storage format object contains some named scalar data, a set of axes (each with a unique name for each entry), and named vector and matrix data based on these axes.

Data properties are identified by a unique name given the axes they are based on. That is, there is a separate namespace for scalar properties, vector properties for each specific axis, and matrix properties for each (ordered) pair of axes.

For matrices, we keep careful track of their layout. Specifically, a storage format only deals with column-major matrices, listed under the rows axis first and the columns axis second. A storage format object may hold two copies of the same matrix, in both possible memory layouts, in which case it will be listed twice, under both axes orders.
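To make the layout rule concrete, here is a minimal sketch in plain Julia (no Daf functions are used, and the variable names are only illustrative). It shows a column-major matrix and the materialized flipped-axes copy that a format might list under the opposite axes order.

```julia
# A 3x2 matrix: the rows axis has 3 entries, the columns axis has 2 entries.
# Julia arrays are column-major, so each column is contiguous in memory.
by_columns = [1 2; 3 4; 5 6]

# Materialize the flipped-axes copy (2x3). It is again column-major, so each
# of its columns holds one row of `by_columns` contiguously.
by_rows = permutedims(by_columns)

strides(by_columns)  # (1, 3)
strides(by_rows)     # (1, 2)

# The same data is reachable through both layouts.
by_rows[:, 1] == by_columns[1, :]  # true
```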

In general, storage format objects are as "dumb" as possible, to make it easier to support new storage formats. The required functions implement a glorified key-value repository, with the absolutely minimal necessary logic to deal with the separate property namespaces listed above.
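The separate namespaces can be pictured as a toy in-memory repository. This is only a sketch under stated assumptions: `ToyStore` and its field names are hypothetical and are not the package's implementation; it uses nothing but plain Julia dictionaries.

```julia
# Hypothetical toy store (not part of DataAxesFormats): one namespace for
# scalars, one per axis for vectors, one per ordered pair of axes for matrices.
struct ToyStore
    scalars::Dict{String, Any}
    axes::Dict{String, Vector{String}}
    vectors::Dict{String, Dict{String, Vector}}
    matrices::Dict{Tuple{String, String}, Dict{String, Matrix}}
end

function ToyStore()
    return ToyStore(
        Dict{String, Any}(),
        Dict{String, Vector{String}}(),
        Dict{String, Dict{String, Vector}}(),
        Dict{Tuple{String, String}, Dict{String, Matrix}}(),
    )
end

store = ToyStore()
store.scalars["version"] = "1.0"
store.axes["cell"] = ["c1", "c2", "c3"]
store.axes["gene"] = ["g1", "g2"]

# The same property name can be used independently for each axis.
store.vectors["cell"] = Dict{String, Vector}("age" => [1.0, 2.0, 3.0])
store.vectors["gene"] = Dict{String, Vector}("age" => [10, 20])

# Matrices are keyed by the ordered (rows, columns) pair of axes.
store.matrices[("cell", "gene")] = Dict{String, Matrix}("UMIs" => [1 2; 3 4; 5 6])
```

Here the matrix is listed only under `("cell", "gene")`; a flipped copy would be a separate entry under `("gene", "cell")`.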

For clarity of documentation, we split the type hierarchy into DafWriter <: FormatWriter <: DafReader <: FormatReader.

The functions listed here use the FormatReader for read-only operations and FormatWriter for write operations into a Daf storage. This is a low-level API, not meant to be used from outside the package, and therefore is not re-exported from the top-level DataAxesFormats namespace.

In contrast, the functions using DafReader and DafWriter describe the high-level API meant to be used from outside the package, and are re-exported. These functions are listed in the DataAxesFormats.Readers and DataAxesFormats.Writers modules. They provide all the logic common to any storage format, allowing us to keep the format-specific functions as simple as possible.

That is, when implementing a new Daf storage format, you should write struct MyFormat <: DafWriter, and implement the functions listed here for both FormatReader and FormatWriter.
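The shape of such an implementation might look like the sketch below. It makes several assumptions: the abstract types are local stand-ins declared only so the snippet runs without the package, the `scalars` field is hypothetical, the required `.internal` member is omitted, and only two of the functions are shown.

```julia
# Local stand-ins for the package's abstract types, declared here only so the
# sketch runs on its own. They mirror the hierarchy described above.
abstract type FormatReader end
abstract type DafReader <: FormatReader end
abstract type FormatWriter <: DafReader end
abstract type DafWriter <: FormatWriter end

# Hypothetical format keeping scalars in a dictionary. A real format would
# also need the `.internal` member, which is omitted in this sketch.
struct MyFormat <: DafWriter
    name::String
    scalars::Dict{String, Any}
end

# Low-level read function: a plain key lookup with no extra validation.
format_has_scalar(format::MyFormat, name::AbstractString)::Bool = haskey(format.scalars, name)

# Low-level write function: trusts the caller that `name` does not exist yet.
# This toy version returns `nothing`.
function format_set_scalar!(format::MyFormat, name::AbstractString, value)
    format.scalars[name] = value
    return nothing
end

format = MyFormat("toy", Dict{String, Any}())
format_set_scalar!(format, "version", 1)
format_has_scalar(format, "version")  # true
```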

Read API

DataAxesFormats.Formats.FormatReader Type

A low-level abstract interface for reading from Daf storage formats.

We require each storage format to have a .name and an .internal::Internal property. This enables all the high-level DafReader functions.

Each storage format must implement the functions listed below for reading from the storage.

DataAxesFormats.Formats.Internal Type
struct Internal ... end

Internal data we need to keep in any concrete FormatReader. This has to be available as a .internal data member of the concrete format. This enables all the high-level DafReader and DafWriter functions.

The constructor will automatically call unique_name to try to make the names unique for improved error messages.

Caching

DataAxesFormats.Formats.CacheGroup Type

Types of cached data inside Daf.

  • MappedData - memory-mapped disk data. This is the cheapest data, as it doesn't put pressure on the garbage collector. It requires some OS resources to maintain the mapping, and physical memory for the subset of the data that is actually being accessed. That is, one can memory map larger data than the physical memory, and performance will be good, as long as the subset of the data that is actually accessed is small enough to fit in memory. If it isn't, the performance will drop (a lot!) because the OS will be continuously reading data pages from disk - but it will not crash due to an out of memory error. It is very important not to re-map the same data twice because that causes all sorts of inefficiencies and edge cases in the hardware and low-level software.
  • MemoryData - a copy of data (from disk, or computed). This does pressure the garbage collector and can cause out of memory errors. However, recomputing or re-fetching the data from disk is slow, so caching this data is crucial for performance.
  • QueryData - data that is computed by queries based on stored data (e.g., masked data, or results of a reduction or an element-wise operation). This again takes up application memory and may cause out of memory errors, but it is very useful to cache the results when the same query is executed multiple times (e.g., when using views). Manually executing a query therefore allows explicitly disabling the caching of its results, since some queries will not be repeated.

If too much data has been cached, call empty_cache! to release it.

DataAxesFormats.Formats.empty_cache! Function
empty_cache!(
    daf::DafReader;
    [clear::Maybe{CacheGroup} = nothing,
    keep::Maybe{CacheGroup} = nothing]
)::Nothing

Clear some cached data. By default, completely empties the caches. You can specify either clear, to only forget a specific CacheGroup (e.g., for clearing only QueryData), or keep, to forget everything except a specific CacheGroup (e.g., for keeping only MappedData). You can't specify both clear and keep.
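The clear/keep selection rule can be illustrated with a toy model. Everything in this sketch is hypothetical: `ToyCacheGroup` and `toy_empty_cache!` are stand-ins written for illustration, and the toy cache maps each key directly to its group rather than to cached data.

```julia
# Hypothetical stand-in for the cache groups; not the package's own type.
@enum ToyCacheGroup MappedData MemoryData QueryData

# Toy version of the clear/keep selection rule described above.
function toy_empty_cache!(
    cache::Dict{String, ToyCacheGroup};
    clear::Union{ToyCacheGroup, Nothing} = nothing,
    keep::Union{ToyCacheGroup, Nothing} = nothing,
)::Nothing
    @assert (clear === nothing || keep === nothing) "can't specify both clear and keep"
    filter!(cache) do (_, group)
        # Retain an entry only if it is outside `clear` or inside `keep`.
        # With neither specified, nothing is retained.
        return (clear !== nothing && group != clear) || (keep !== nothing && group == keep)
    end
    return nothing
end

cache = Dict("axis" => MappedData, "copy" => MemoryData, "query" => QueryData)
toy_empty_cache!(cache; clear = QueryData)  # forgets only "query"
toy_empty_cache!(cache; keep = MappedData)  # forgets everything except "axis"
```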

Note

If there are any slow cache update operations in flight (matrix relayout, queries) then this will wait until they are done to ensure that the cache is in a consistent state.

Warning

Some backends return vectors and matrices that alias buffers held alive by the cache entry (for zero-copy access). Emptying the cache releases those buffers, so any array reference previously returned by such a backend becomes dangling. Currently this affects HttpDaf; drop all such references before calling empty_cache! on one of these backends.

Description

DataAxesFormats.Formats.format_description_header Function
format_description_header(format::FormatReader, lines::Vector{String}, deep::Bool)::Nothing

Allow a format to emit additional description header lines.

This trusts that we have a read lock on the data set.

DataAxesFormats.Formats.format_description_footer Function
format_description_footer(format::FormatReader, lines::Vector{String}; cache::Bool, deep::Bool, tensors::Bool)::Nothing

Allow a format to emit additional description footer lines. If deep, this also emits the description of any data sets nested in this one.

This trusts that we have a read lock on the data set.

Scalar properties

DataAxesFormats.Formats.format_has_scalar Function
format_has_scalar(format::FormatReader, name::AbstractString)::Bool

Check whether a scalar property with some name exists in format.

This trusts that we have a read lock on the data set.

DataAxesFormats.Formats.format_scalars_set Function
format_scalars_set(format::FormatReader)::AbstractSet{<:AbstractString}

The names of the scalar properties in format.

This trusts that we have a read lock on the data set.

DataAxesFormats.Formats.format_get_scalar Function
format_get_scalar(
    format::FormatReader,
    name::AbstractString,
)::Tuple{StorageScalar, CacheGroup}

Implement fetching the value of a scalar property with some name in format, together with a per-item CacheGroup describing how the returned value should be cached. Formats that can tell whether a returned value is memory-mapped (e.g., H5df) should return MappedData when handing back an mmapped view and MemoryData when handing back an eagerly-read copy.

This trusts that we have a read lock on the data set, and that the name scalar property exists in format.

Data axes

DataAxesFormats.Formats.format_has_axis Function
format_has_axis(format::FormatReader, axis::AbstractString; for_change::Bool)::Bool

Check whether some axis exists in format. If for_change, this is done just prior to adding or deleting the axis.

This trusts that we have a read lock on the data set.

DataAxesFormats.Formats.format_axes_set Function
format_axes_set(format::FormatReader)::AbstractSet{<:AbstractString}

The names of the axes of format.

This trusts that we have a read lock on the data set.

DataAxesFormats.Formats.format_axis_vector Function
format_axis_vector(
    format::FormatReader,
    axis::AbstractString,
)::Tuple{AbstractVector{<:AbstractString}, CacheGroup}

Implement fetching the unique names of the entries of some axis of format, together with a per-item CacheGroup describing how the returned vector should be cached.

This trusts that we have a read lock on the data set, and that the axis exists in format.

DataAxesFormats.Formats.format_axis_length Function
format_axis_length(format::FormatReader, axis::AbstractString)::Int64

Implement fetching the number of entries along the axis.

This trusts that we have a read lock on the data set, and that the axis exists in format.

Vector properties

DataAxesFormats.Formats.format_has_vector Function
format_has_vector(format::FormatReader, axis::AbstractString, name::AbstractString)::Bool

Implement checking whether a vector property with some name exists for the axis in format.

This trusts that we have a read lock on the data set, that the axis exists in format, and that the property name isn't "name".

DataAxesFormats.Formats.format_vectors_set Function
format_vectors_set(format::FormatReader, axis::AbstractString)::AbstractSet{<:AbstractString}

Implement fetching the names of the vectors for the axis in format, not including the special "name" property.

This trusts that we have a read lock on the data set, and that the axis exists in format.

DataAxesFormats.Formats.format_get_vector Function
format_get_vector(
    format::FormatReader,
    axis::AbstractString,
    name::AbstractString,
)::Tuple{StorageVector, Any, CacheGroup}

Implement fetching the vector property with some name for some axis in format, together with an opaque backing object that must be kept alive for as long as the returned vector is used (pass nothing if the vector is self-contained), and a per-item CacheGroup describing how the returned vector should be cached.

The backing slot exists so formats like HttpDaf can surface zero-copy unsafe_wrap'd vectors alongside the Vector{UInt8} buffer they alias; the cache stores the pair as a Tuple{NamedArray, Any} so the backing stays reachable as long as the vector is cached.
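The reason a backing object is needed can be seen in a small standalone sketch using only Base Julia. The byte buffer below is an assumption standing in for data received from some transport; the variable names are illustrative and nothing here is the package's own code.

```julia
# Raw bytes, standing in for a buffer filled by some transport layer.
buffer = Vector{UInt8}(undef, 4 * sizeof(Float64))

# Zero-copy view of the same memory as a vector of 4 Float64 values.
# The wrapped array does not own the memory, so it does not keep `buffer` alive.
floats = unsafe_wrap(Array, convert(Ptr{Float64}, pointer(buffer)), 4)

# Storing the pair together keeps `buffer` reachable while `floats` is in use.
entry = (floats, buffer)

# Writes through the wrapped vector land in the underlying bytes.
floats .= [1.0, 2.0, 3.0, 4.0]
reinterpret(Float64, buffer) == floats  # true
```

If `buffer` became unreachable while `floats` was still in use, the wrapped vector would point at freed memory, which is the dangling-reference hazard the backing slot guards against.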

This trusts that we have a read lock on the data set, that the axis exists in format, and that the name vector property exists for the axis.

Matrix properties

DataAxesFormats.Formats.format_has_matrix Function
format_has_matrix(
    format::FormatReader,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString;
)::Bool

Implement checking whether a matrix property with some name exists for the rows_axis and the columns_axis in format. If cache, this also checks whether the matrix exists in the cache.

This trusts that we have a read lock on the data set, and that the rows_axis and the columns_axis exist in format.

DataAxesFormats.Formats.format_matrices_set Function
format_matrices_set(
    format::FormatReader,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
)::AbstractSet{<:AbstractString}

Implement fetching the names of the matrix properties for the rows_axis and columns_axis in format.

This trusts that we have a read lock on the data set, and that the rows_axis and columns_axis exist in format.

DataAxesFormats.Formats.format_get_matrix Function
format_get_matrix(
    format::FormatReader,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString
)::Tuple{StorageMatrix, Any, CacheGroup}

Implement fetching the matrix property with some name for some rows_axis and columns_axis in format, together with an opaque backing object kept alive for the returned matrix's lifetime (pass nothing if the matrix is self-contained), and a per-item CacheGroup describing how the returned matrix should be cached. See format_get_vector for the backing rationale.

This trusts that we have a read lock on the data set, that the rows_axis and columns_axis exist in format, and that the name matrix property exists for them.

Write API

DataAxesFormats.Formats.FormatWriter Type

An abstract interface for writing into Daf storage formats.

Each storage format must implement the functions listed below for writing into the storage.

Scalar properties

DataAxesFormats.Formats.format_set_scalar! Function
format_set_scalar!(
    format::FormatWriter,
    name::AbstractString,
    value::StorageScalar,
)::Maybe{CacheGroup}

Implement setting the value of a scalar property with some name in format. Return the CacheGroup to use when caching the just-written value (as would be returned by format_get_scalar), or nothing to skip caching.

This trusts that we have a write lock on the data set, and that the name scalar property does not exist in format.

DataAxesFormats.Formats.format_delete_scalar! Function
format_delete_scalar!(
    format::FormatWriter,
    name::AbstractString;
    for_set::Bool
)::Nothing

Implement deleting a scalar property with some name from format. If for_set, this is done just prior to setting the scalar with a different value.

This trusts that we have a write lock on the data set, and that the name scalar property exists in format.

Data axes

DataAxesFormats.Formats.format_add_axis! Function
format_add_axis!(
    format::FormatWriter,
    axis::AbstractString,
    entries::AbstractVector{<:AbstractString}
)::Nothing

Implement adding a new axis to format.

This trusts that we have a write lock on the data set, that the axis does not already exist in format, and that the names of the entries are unique.

DataAxesFormats.Formats.format_delete_axis! Function
format_delete_axis!(format::FormatWriter, axis::AbstractString)::Nothing

Implement deleting some axis from format.

This trusts that we have a write lock on the data set, that the axis exists in format, and that all properties that are based on this axis have already been deleted.

Vector properties

DataAxesFormats.Formats.format_set_vector! Function
format_set_vector!(
    format::FormatWriter,
    axis::AbstractString,
    name::AbstractString,
    vector::Union{StorageScalar, StorageVector},
)::Nothing

Implement setting a vector property with some name for some axis in format.

If the vector specified is actually a StorageScalar, the stored vector is filled with this value.

This trusts that we have a write lock on the data set, that the axis exists in format, that the vector property name isn't "name", that it does not exist for the axis, and that the vector has the appropriate length for it.

DataAxesFormats.Formats.format_delete_vector! Function
format_delete_vector!(
    format::FormatWriter,
    axis::AbstractString,
    name::AbstractString;
    for_set::Bool
)::Nothing

Implement deleting a vector property with some name for some axis from format. If for_set, this is done just prior to setting the vector with a different value.

This trusts that we have a write lock on the data set, that the axis exists in format, that the vector property name isn't "name", and that the name vector exists for the axis.

Matrix properties

DataAxesFormats.Formats.format_set_matrix! Function
format_set_matrix!(
    format::FormatWriter,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString,
    matrix::StorageMatrix,
)::Nothing

Implement setting the matrix property with some name for some rows_axis and columns_axis in format.

If the matrix specified is actually a StorageScalar, the stored matrix is filled with this value.

This trusts that we have a write lock on the data set, that the rows_axis and columns_axis exist in format, that the name matrix property does not exist for them, and that the matrix is column-major of the appropriate size for it.

DataAxesFormats.Formats.format_relayout_matrix! Function
format_relayout_matrix!(
    format::FormatWriter,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString,
    matrix::StorageMatrix,
)::StorageMatrix

Implement relayout! of the existing name column-major matrix property for the rows_axis and the columns_axis, storing the result as a row-major matrix property (that is, with flipped axes).

This trusts that we have a write lock on the data set, that the rows_axis and columns_axis are different from each other and exist in format, that the name matrix property exists for them, and that it does not exist for the flipped axes.

DataAxesFormats.Formats.format_delete_matrix! Function
format_delete_matrix!(
    format::FormatWriter,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString;
    for_set::Bool
)::StorageMatrix

Implement deleting a matrix property with some name for some rows_axis and columns_axis from format. If for_set, this is done just prior to setting the matrix with a different value.

This trusts that we have a write lock on the data set, that the rows_axis and columns_axis exist in format, and that the name matrix property exists for them.

Creating properties

DataAxesFormats.Formats.format_get_empty_dense_vector! Function
format_get_empty_dense_vector!(
    format::FormatWriter,
    axis::AbstractString,
    name::AbstractString,
    eltype::Type{T},
)::Tuple{AbstractVector{T}, Maybe{CacheGroup}} where {T <: StorageReal}

Implement creating an empty dense vector property with some name for some axis in format. Return the buffer to be filled by the caller, together with the CacheGroup to use when caching the filled buffer (as would be returned by format_get_vector), or nothing to skip caching.

This trusts that we have a write lock on the data set, that the axis exists in format, that the vector property name isn't "name", and that it does not exist for the axis.

Note

The return type of this function is always a functionally dense vector, that is, it will have strides of (1,), so that elements are consecutive in memory. However, it need not be an actual DenseVector because of limitations of Julia's type system.
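The distinction between "functionally dense" and DenseVector can be seen with plain Julia arrays. This is only an illustration using Base types; it does not show what any particular format actually returns.

```julia
matrix = zeros(Float32, 3, 2)

# A view of one column is a SubArray, not an Array.
column = view(matrix, :, 2)

strides(column) == (1,)   # true: elements are consecutive in memory
column isa DenseVector    # false: the type itself does not say so
column isa StridedVector  # true: the strided layout is still known
```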

DataAxesFormats.Formats.format_filled_empty_dense_vector! Function
format_filled_empty_dense_vector!(
    format::FormatWriter,
    axis::AbstractString,
    name::AbstractString,
    filled::AbstractVector{<:StorageReal},
)::Nothing

Allow the format to perform finalization once the empty dense vector has been filled (e.g. flushing metadata, patching chunk checksums). The default does nothing.

This trusts that we have a write lock on the data set, that the axis exists in format, that the name vector property exists for the axis, and that filled is the same buffer that was returned by format_get_empty_dense_vector!.

DataAxesFormats.Formats.format_get_empty_sparse_vector! Function
format_get_empty_sparse_vector!(
    format::FormatWriter,
    axis::AbstractString,
    name::AbstractString,
    eltype::Type{T},
    nnz::StorageInteger,
    indtype::Type{I},
)::Tuple{AbstractVector{I}, AbstractVector{T}, Maybe{CacheGroup}}
where {T <: StorageReal, I <: StorageInteger}

Implement creating an empty sparse vector property with some name for some axis in format. Return the nzind and nzval buffers to be filled by the caller, together with the CacheGroup to use when caching the assembled sparse vector (as would be returned by format_get_vector), or nothing to skip caching.

This trusts that we have a write lock on the data set, that the axis exists in format, that the vector property name isn't "name", and that it does not exist for the axis.
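How a caller might fill such buffers and assemble the result can be sketched with the SparseArrays standard library. The buffers below are ordinary vectors allocated locally as stand-ins for whatever a format would return, and the sizes are made up for the example.

```julia
using SparseArrays

axis_length = 5
nnz_count = 2

# Stand-ins for the buffers a format would hand back to the caller.
nzind = Vector{Int32}(undef, nnz_count)
nzval = Vector{Float32}(undef, nnz_count)

# The caller fills them: sorted indices of the non-zero entries, and their values.
nzind .= [2, 5]
nzval .= [1.5, 2.5]

# Assemble the sparse vector from the filled buffers.
filled = SparseVector(axis_length, nzind, nzval)
```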

DataAxesFormats.Formats.format_filled_empty_sparse_vector! Function
format_filled_empty_sparse_vector!(
    format::FormatWriter,
    axis::AbstractString,
    name::AbstractString,
    filled::SparseVector{<:StorageReal, <:StorageInteger},
)::Nothing

Allow the format to perform finalization once the empty sparse vector has been filled (e.g. flushing metadata, patching chunk checksums). The default does nothing.

DataAxesFormats.Formats.format_get_empty_dense_matrix! Function
format_get_empty_dense_matrix!(
    format::FormatWriter,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString,
    eltype::Type{T},
)::Tuple{AbstractMatrix{T}, Maybe{CacheGroup}} where {T <: StorageReal}

Implement creating an empty dense matrix property with some name for some rows_axis and columns_axis in format. Return the buffer to be filled by the caller, together with the CacheGroup to use when caching the filled buffer (as would be returned by format_get_matrix), or nothing to skip caching.

This trusts that we have a write lock on the data set, that the rows_axis and columns_axis exist in format, and that the name matrix property does not exist for them.

Note

The return type of this function is always a functionally dense matrix, that is, it will have strides of (1, nrows), so that elements are consecutive in memory. However, it need not be an actual DenseMatrix because of limitations of Julia's type system.

DataAxesFormats.Formats.format_filled_empty_dense_matrix! Function
format_filled_empty_dense_matrix!(
    format::FormatWriter,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString,
    filled::AbstractMatrix{<:StorageReal},
)::Nothing

Allow the format to perform finalization once the empty dense matrix has been filled (e.g. flushing metadata, patching chunk checksums). The default does nothing.

This trusts that we have a write lock on the data set, that the rows_axis and columns_axis exist in format, that the name matrix property exists for them, and that filled is the same buffer that was returned by format_get_empty_dense_matrix!.

DataAxesFormats.Formats.format_get_empty_sparse_matrix! Function
format_get_empty_sparse_matrix!(
    format::FormatWriter,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString,
    eltype::Type{T},
    indtype::Type{I},
    nnz::StorageInteger,
)::Tuple{AbstractVector{I}, AbstractVector{I}, AbstractVector{T}, Maybe{CacheGroup}}
where {T <: StorageReal, I <: StorageInteger}

Implement creating an empty sparse matrix property with some name for some rows_axis and columns_axis in format. Return the colptr, rowval and nzval buffers to be filled by the caller, together with the CacheGroup to use when caching the assembled sparse matrix (as would be returned by format_get_matrix), or nothing to skip caching.

This trusts that we have a write lock on the data set, that the rows_axis and columns_axis exist in format, and that the name matrix property does not exist for them.
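The roles of the three buffers can be illustrated with the SparseArrays standard library. As before, the buffers are allocated locally as stand-ins for what a format would return, and the sizes and values are made up for the example.

```julia
using SparseArrays

n_rows, n_columns, nnz_count = 3, 2, 3

# Stand-ins for the buffers a format would hand back to the caller.
colptr = Vector{Int32}(undef, n_columns + 1)
rowval = Vector{Int32}(undef, nnz_count)
nzval = Vector{Float64}(undef, nnz_count)

# Column 1 owns entries 1:2 and column 2 owns entry 3:3 of rowval/nzval.
colptr .= [1, 3, 4]
# Row indices of the stored entries, sorted within each column.
rowval .= [1, 3, 2]
nzval .= [10.0, 30.0, 20.0]

# Assemble the compressed-sparse-column matrix from the filled buffers.
filled = SparseMatrixCSC(n_rows, n_columns, colptr, rowval, nzval)

Matrix(filled)  # [10.0 0.0; 0.0 20.0; 30.0 0.0]
```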

DataAxesFormats.Formats.format_filled_empty_sparse_matrix! Function
format_filled_empty_sparse_matrix!(
    format::FormatWriter,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString,
    filled::SparseMatrixCSC{<:StorageReal, <:StorageInteger},
)::Nothing

Allow the format to perform finalization once the empty sparse matrix has been filled (e.g. flushing metadata, patching chunk checksums). The default does nothing.

Reordering axes

DataAxesFormats.Formats.invalidate_axis_data! Function
invalidate_axis_data!(writer::FormatWriter, axis::AbstractString)::Nothing

Invalidate all cached data that depends on the content of axis: the axis entries, the axis-to-index dict, every vector on the axis, and every matrix where axis is one of the two axes. Increments the version counter for each property. Must be called inside a write lock.
