Formats

DataAxesFormats.Formats Module

The FormatReader and FormatWriter interfaces specify a low-level API for storing Daf data. To extend Daf to support an additional format, create a new implementation of this API.

A storage format object contains some named scalar data, a set of axes (each with a unique name for each entry), and named vector and matrix data based on these axes.

Data properties are identified by a unique name given the axes they are based on. That is, there is a separate namespace for scalar properties, vector properties for each specific axis, and matrix properties for each (ordered) pair of axes.
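The separate namespaces can be pictured as dictionaries with differently shaped keys. This is only an illustrative sketch: the ToyNamespaces type and its fields are hypothetical and are not part of the package.

```julia
# Hypothetical stand-in for a storage format object: one dictionary per namespace.
struct ToyNamespaces
    scalars::Dict{String, Any}                              # key: name
    vectors::Dict{Tuple{String, String}, Vector}            # key: (axis, name)
    matrices::Dict{Tuple{String, String, String}, Matrix}   # key: (rows_axis, columns_axis, name)
end

ToyNamespaces() = ToyNamespaces(
    Dict{String, Any}(),
    Dict{Tuple{String, String}, Vector}(),
    Dict{Tuple{String, String, String}, Matrix}(),
)

toy = ToyNamespaces()

# The same property name can be used in every namespace without a clash.
toy.scalars["age"] = 1
toy.vectors[("cell", "age")] = [1.0, 2.0, 3.0]
toy.vectors[("gene", "age")] = [4.0, 5.0]
toy.matrices[("cell", "gene", "age")] = zeros(3, 2)
toy.matrices[("gene", "cell", "age")] = zeros(2, 3)
```

Because the pair of axes is ordered, ("cell", "gene") and ("gene", "cell") are distinct matrix namespaces.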

For matrices, we keep careful track of their layout. Specifically, a storage format only deals with column-major matrices, listed under the rows axis first and the columns axis second. A storage format object may hold two copies of the same matrix, in both possible memory layouts, in which case it will be listed twice, under both axis orders.
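As a rough illustration of the two listings, the sketch below uses only Base Julia. The dictionary, the axis names and the "UMIs" property name are made up for the example, and permutedims stands in for whatever relayout routine the package actually uses.

```julia
# A column-major matrix listed under ("cell", "gene"): rows are cells, columns are genes.
by_cell_gene = [1 2 3; 4 5 6]               # 2 cells x 3 genes

# A second, materialized copy in the other layout, listed under ("gene", "cell").
by_gene_cell = permutedims(by_cell_gene)    # 3 genes x 2 cells

# Hypothetical listing: the same data appears twice, under both axis orders.
matrices = Dict(
    ("cell", "gene", "UMIs") => by_cell_gene,
    ("gene", "cell", "UMIs") => by_gene_cell,
)

# Both copies are column-major: the first stride is always 1.
cell_gene_strides = strides(by_cell_gene)
gene_cell_strides = strides(by_gene_cell)
```

Each copy makes a different access pattern contiguous in memory: all cells of one gene in the first, all genes of one cell in the second.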

In general, storage format objects are as "dumb" as possible, to make it easier to support new storage formats. The required functions implement a glorified key-value repository, with the absolutely minimal necessary logic to deal with the separate property namespaces listed above.

For clarity of documentation, we split the type hierarchy into DafWriter <: FormatWriter <: DafReader <: FormatReader.

The functions listed here use the FormatReader for read-only operations and FormatWriter for write operations into a Daf storage. This is a low-level API, not meant to be used from outside the package, and therefore is not re-exported from the top-level DataAxesFormats namespace.

In contrast, the functions using DafReader and DafWriter describe the high-level API meant to be used from outside the package, and are re-exported. These functions are listed in the DataAxesFormats.Readers and DataAxesFormats.Writers modules. They provide all the logic common to any storage format, allowing us to keep the format-specific functions as simple as possible.

That is, when implementing a new Daf storage format, you should write struct MyFormat <: DafWriter, and implement the functions listed here for both FormatReader and FormatWriter.
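The effect of this subtyping can be sketched without the package. The abstract types below are local stand-ins that mirror the documented chain, and every Toy name is hypothetical. A real format would subtype the package's own DafWriter and would also carry the required .internal field, which is omitted here.

```julia
# Local stand-ins for the package's abstract types, so the sketch runs on its own.
abstract type ToyFormatReader end
abstract type ToyDafReader <: ToyFormatReader end
abstract type ToyFormatWriter <: ToyDafReader end
abstract type ToyDafWriter <: ToyFormatWriter end

# A concrete format subtypes the most derived type, so it is accepted everywhere.
struct MyToyFormat <: ToyDafWriter
    name::String
    scalars::Dict{String, Any}
end

# Read operations dispatch on the reader type...
toy_format_has_scalar(format::ToyFormatReader, name::AbstractString)::Bool =
    haskey(format.scalars, name)

# ...and write operations dispatch on the writer type.
function toy_format_set_scalar!(format::ToyFormatWriter, name::AbstractString, value)::Nothing
    format.scalars[name] = value
    return nothing
end

toy = MyToyFormat("toy!", Dict{String, Any}())
toy_format_set_scalar!(toy, "version", 1)
```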

Read API

DataAxesFormats.Formats.FormatReader Type

A low-level abstract interface for reading from Daf storage formats.

We require each storage format to have a .name and an .internal::Internal property. This enables all the high-level DafReader functions.

Each storage format must implement the functions listed below for reading from the storage.

DataAxesFormats.Formats.Internal Type
struct Internal ... end

Internal data we need to keep in any concrete FormatReader . This has to be available as a .internal data member of the concrete format. This enables all the high-level DafReader and DafWriter functions.

The constructor will automatically call unique_name to try and make the names unique for improved error messages.

Caching

DataAxesFormats.Formats.CacheGroup Type

Types of cached data inside Daf .

  • MappedData - memory-mapped disk data. This is the cheapest data, as it doesn't put pressure on the garbage collector. It requires some OS resources to maintain the mapping, and physical memory for the subset of the data that is actually being accessed. That is, one can memory map larger data than the physical memory, and performance will be good, as long as the subset of the data that is actually accessed is small enough to fit in memory. If it isn't, the performance will drop (a lot!) because the OS will be continuously reading data pages from disk - but it will not crash due to an out of memory error. It is very important not to re-map the same data twice because that causes all sorts of inefficiencies and edge cases in the hardware and low-level software.
  • MemoryData - a copy of data (from disk, or computed). This does pressure the garbage collector and can cause out of memory errors. However, recomputing or re-fetching the data from disk is slow, so caching this data is crucial for performance.
  • QueryData - data that is computed by queries based on stored data (e.g., masked data, or results of a reduction or an element-wise operation). This again takes up application memory and may cause out of memory errors, but it is very useful to cache the results when the same query is executed multiple times (e.g., when using views). When executing queries manually, you can therefore explicitly disable the caching of the query results, since some queries will not be repeated.

If too much data has been cached, call empty_cache! to release it.

DataAxesFormats.Formats.empty_cache! Function
empty_cache!(
    daf::DafReader;
    [clear::Maybe{CacheGroup} = nothing,
    keep::Maybe{CacheGroup} = nothing]
)::Nothing

Clear some cached data. By default, completely empties the caches. You can specify either clear, to only forget a specific CacheGroup (e.g., for clearing only QueryData), or keep, to forget everything except a specific CacheGroup (e.g., for keeping only MappedData). You can't specify both clear and keep.

Note

If there are any slow cache update operations in flight (matrix relayout, queries) then this will wait until they are done to ensure that the cache is in a consistent state.
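The clear and keep semantics described above can be modelled with a small stand-alone sketch. The enum, the cache dictionary and toy_empty_cache! are hypothetical stand-ins. They do not reproduce the package's cache structure or its waiting for in-flight updates.

```julia
# Hypothetical stand-in for the cache groups.
@enum ToyCacheGroup ToyMappedData ToyMemoryData ToyQueryData

# Each cached entry remembers which group it belongs to.
const ToyCache = Dict{String, Tuple{ToyCacheGroup, Any}}

function toy_empty_cache!(cache::ToyCache; clear = nothing, keep = nothing)::Nothing
    if clear !== nothing && keep !== nothing
        throw(ArgumentError("can't specify both clear and keep"))
    end
    filter!(cache) do entry
        group = first(entry.second)
        if clear !== nothing
            return group != clear   # forget only the cleared group
        elseif keep !== nothing
            return group == keep    # forget everything except the kept group
        else
            return false            # default: forget everything
        end
    end
    return nothing
end

cache = ToyCache(
    "mapped" => (ToyMappedData, 1),
    "copied" => (ToyMemoryData, 2),
    "query" => (ToyQueryData, 3),
)

toy_empty_cache!(cache; clear = ToyQueryData)
after_clear = sort(collect(keys(cache)))

toy_empty_cache!(cache; keep = ToyMappedData)
after_keep = collect(keys(cache))

toy_empty_cache!(cache)
```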

Description

DataAxesFormats.Formats.format_description_header Function
format_description_header(format::FormatReader, lines::Vector{String}, deep::Bool)::Nothing

Allow a format to emit additional description header lines.

This trusts that we have a read lock on the data set.

DataAxesFormats.Formats.format_description_footer Function
format_description_footer(format::FormatReader, lines::Vector{String}; cache::Bool, deep::Bool, tensors::Bool)::Nothing

Allow a format to emit additional description footer lines. If deep, this also emits the description of any data sets nested in this one, if any.

This trusts that we have a read lock on the data set.

Scalar properties

DataAxesFormats.Formats.format_has_scalar Function
format_has_scalar(format::FormatReader, name::AbstractString)::Bool

Check whether a scalar property with some name exists in format .

This trusts that we have a read lock on the data set.

DataAxesFormats.Formats.format_scalars_set Function
format_scalars_set(format::FormatReader)::AbstractSet{<:AbstractString}

The names of the scalar properties in format .

This trusts that we have a read lock on the data set.

DataAxesFormats.Formats.format_get_scalar Function
format_get_scalar(format::FormatReader, name::AbstractString)::StorageScalar

Implement fetching the value of a scalar property with some name in format .

This trusts that we have a read lock on the data set, and that the name scalar property exists in format .

Data axes

DataAxesFormats.Formats.format_has_axis Function
format_has_axis(format::FormatReader, axis::AbstractString; for_change::Bool)::Bool

Check whether some axis exists in format . If for_change , this is done just prior to adding or deleting the axis.

This trusts that we have a read lock on the data set.

DataAxesFormats.Formats.format_axes_set Function
format_axes_set(format::FormatReader)::AbstractSet{<:AbstractString}

The names of the axes of format .

This trusts that we have a read lock on the data set.

DataAxesFormats.Formats.format_axis_vector Function
format_axis_vector(format::FormatReader, axis::AbstractString)::AbstractVector{<:AbstractString}

Implement fetching the unique names of the entries of some axis of format .

This trusts that we have a read lock on the data set, and that the axis exists in format .

DataAxesFormats.Formats.format_axis_length Function
format_axis_length(format::FormatReader, axis::AbstractString)::Int64

Implement fetching the number of entries along the axis .

This trusts that we have a read lock on the data set, and that the axis exists in format .

Vector properties

DataAxesFormats.Formats.format_has_vector Function
format_has_vector(format::FormatReader, axis::AbstractString, name::AbstractString)::Bool

Implement checking whether a vector property with some name exists for the axis in format .

This trusts that we have a read lock on the data set, that the axis exists in format, and that the property name isn't "name".

DataAxesFormats.Formats.format_vectors_set Function
format_vectors_set(format::FormatReader, axis::AbstractString)::AbstractSet{<:AbstractString}

Implement fetching the names of the vectors for the axis in format , not including the special name property.

This trusts that we have a read lock on the data set, and that the axis exists in format .

DataAxesFormats.Formats.format_get_vector Function
format_get_vector(format::FormatReader, axis::AbstractString, name::AbstractString)::StorageVector

Implement fetching the vector property with some name for some axis in format .

This trusts that we have a read lock on the data set, that the axis exists in format , and the name vector property exists for the axis .

Matrix properties

DataAxesFormats.Formats.format_has_matrix Function
format_has_matrix(
    format::FormatReader,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString;
)::Bool

Implement checking whether a matrix property with some name exists for the rows_axis and the columns_axis in format. If cache, this also checks whether the matrix exists in the cache.

This trusts that we have a read lock on the data set, and that the rows_axis and the columns_axis exist in format .

DataAxesFormats.Formats.format_matrices_set Function
format_matrices_set(
    format::FormatReader,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
)::AbstractSet{<:AbstractString}

Implement fetching the names of the matrix properties for the rows_axis and columns_axis in format .

This trusts that we have a read lock on the data set, and that the rows_axis and columns_axis exist in format .

DataAxesFormats.Formats.format_get_matrix Function
format_get_matrix(
    format::FormatReader,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString
)::StorageMatrix

Implement fetching the matrix property with some name for some rows_axis and columns_axis in format .

This trusts that we have a read lock on the data set, and that the rows_axis and columns_axis exist in format , and the name matrix property exists for them.

Write API

DataAxesFormats.Formats.FormatWriter Type

An abstract interface for writing into Daf storage formats.

Each storage format must implement the functions listed below for writing into the storage.

Scalar properties

DataAxesFormats.Formats.format_set_scalar! Function
format_set_scalar!(
    format::FormatWriter,
    name::AbstractString,
    value::StorageScalar,
)::Nothing

Implement setting the value of a scalar property with some name in format .

This trusts that we have a write lock on the data set, and that the name scalar property does not exist in format .

DataAxesFormats.Formats.format_delete_scalar! Function
format_delete_scalar!(
    format::FormatWriter,
    name::AbstractString;
    for_set::Bool
)::Nothing

Implement deleting a scalar property with some name from format . If for_set , this is done just prior to setting the scalar with a different value.

This trusts that we have a write lock on the data set, and that the name scalar property exists in format .
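The "trusts" clauses mean that the format functions do no validation of their own. The sketch below shows one way a caller could uphold that contract. ToyScalars, the toy_format_* functions and the toy_set_scalar! wrapper with its overwrite flag are all hypothetical illustrations, not the package's implementation.

```julia
# Hypothetical minimal format holding only scalars.
struct ToyScalars
    scalars::Dict{String, Any}
end

# Low-level functions: plain key-value operations with no checks.
toy_format_has_scalar(format::ToyScalars, name::AbstractString)::Bool =
    haskey(format.scalars, name)

toy_format_get_scalar(format::ToyScalars, name::AbstractString) = format.scalars[name]

function toy_format_set_scalar!(format::ToyScalars, name::AbstractString, value)::Nothing
    format.scalars[name] = value
    return nothing
end

function toy_format_delete_scalar!(format::ToyScalars, name::AbstractString; for_set::Bool)::Nothing
    # A real format could use for_set to skip work that the following set would redo.
    delete!(format.scalars, name)
    return nothing
end

# Hypothetical high-level wrapper: it performs the checks that the low-level functions trust.
function toy_set_scalar!(format::ToyScalars, name::AbstractString, value; overwrite::Bool = false)::Nothing
    if toy_format_has_scalar(format, name)
        overwrite || error("existing scalar: $(name)")
        toy_format_delete_scalar!(format, name; for_set = true)
    end
    toy_format_set_scalar!(format, name, value)
    return nothing
end

toy = ToyScalars(Dict{String, Any}())
toy_set_scalar!(toy, "version", 1)
toy_set_scalar!(toy, "version", 2; overwrite = true)
```

Keeping the checks in one shared wrapper is what lets each format stay a simple key-value store.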

Data axes

DataAxesFormats.Formats.format_add_axis! Function
format_add_axis!(
    format::FormatWriter,
    axis::AbstractString,
    entries::AbstractVector{<:AbstractString}
)::Nothing

Implement adding a new axis to format .

This trusts we have a write lock on the data set, that the axis does not already exist in format , and that the names of the entries are unique.

DataAxesFormats.Formats.format_delete_axis! Function
format_delete_axis!(format::FormatWriter, axis::AbstractString)::Nothing

Implement deleting some axis from format .

This trusts we have a write lock on the data set, that the axis exists in format, and that all properties that are based on this axis have already been deleted.

Vector properties

DataAxesFormats.Formats.format_set_vector! Function
format_set_vector!(
    format::FormatWriter,
    axis::AbstractString,
    name::AbstractString,
    vector::Union{StorageScalar, StorageVector},
)::Nothing

Implement setting a vector property with some name for some axis in format .

If the vector specified is actually a StorageScalar , the stored vector is filled with this value.

This trusts we have a write lock on the data set, that the axis exists in format , that the vector property name isn't "name" , that it does not exist for the axis , and that the vector has the appropriate length for it.
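The scalar-fill behavior can be sketched as follows. The function name, the plain dictionary used as storage and the explicit axis_length argument are assumptions made to keep the example self-contained.

```julia
# Hypothetical sketch: store a vector, or a scalar expanded to the length of the axis.
function toy_format_set_vector!(
    vectors::Dict{Tuple{String, String}, AbstractVector},
    axis_length::Integer,
    axis::AbstractString,
    name::AbstractString,
    vector::Union{Number, AbstractString, AbstractVector},
)::Nothing
    # A scalar is turned into a vector with one copy of the value per axis entry.
    stored = vector isa AbstractVector ? vector : fill(vector, axis_length)
    vectors[(axis, name)] = stored
    return nothing
end

vectors = Dict{Tuple{String, String}, AbstractVector}()
toy_format_set_vector!(vectors, 3, "cell", "is_doublet", false)
toy_format_set_vector!(vectors, 3, "cell", "age", [1.0, 2.0, 3.0])
```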

DataAxesFormats.Formats.format_delete_vector! Function
format_delete_vector!(
    format::FormatWriter,
    axis::AbstractString,
    name::AbstractString;
    for_set::Bool
)::Nothing

Implement deleting a vector property with some name for some axis from format . If for_set , this is done just prior to setting the vector with a different value.

This trusts we have a write lock on the data set, that the axis exists in format, that the vector property name isn't "name", and that the name vector exists for the axis.

Matrix properties

DataAxesFormats.Formats.format_set_matrix! Function
format_set_matrix!(
    format::FormatWriter,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString,
    matrix::StorageMatrix,
)::Nothing

Implement setting the matrix property with some name for some rows_axis and columns_axis in format .

If the matrix specified is actually a StorageScalar , the stored matrix is filled with this value.

This trusts we have a write lock on the data set, that the rows_axis and columns_axis exist in format , that the name matrix property does not exist for them, and that the matrix is column-major of the appropriate size for it.

DataAxesFormats.Formats.format_relayout_matrix! Function
format_relayout_matrix!(
    format::FormatWriter,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString,
    matrix::StorageMatrix,
)::StorageMatrix

relayout! the existing name column-major matrix property for the rows_axis and the columns_axis and store the results as a row-major matrix property (that is, with flipped axes).

This trusts we have a write lock on the data set, that the rows_axis and columns_axis are different from each other, exist in format , that the name matrix property exists for them, and that it does not exist for the flipped axes.

DataAxesFormats.Formats.format_delete_matrix! Function
format_delete_matrix!(
    format::FormatWriter,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString;
    for_set::Bool
)::Nothing

Implement deleting a matrix property with some name for some rows_axis and columns_axis from format . If for_set , this is done just prior to setting the matrix with a different value.

This trusts we have a write lock on the data set, that the rows_axis and columns_axis exist in format , and that the name matrix property exists for them.

Creating properties

DataAxesFormats.Formats.format_get_empty_dense_vector! Function
format_get_empty_dense_vector!(
    format::FormatWriter,
    axis::AbstractString,
    name::AbstractString,
    eltype::Type{T},
)::Vector{T} where {T <: StorageReal}

Implement creating an empty dense vector property with some name for some axis in format.

This trusts we have a write lock on the data set, that the axis exists in format and that the vector property name isn't "name" , and that it does not exist for the axis .

Note

The return type of this function is always a functionally dense vector, that is, it will have strides of (1,) , so that elements are consecutive in memory. However it need not be an actual DenseVector because of Julia's type system's limitations.
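To see how a vector can be functionally dense without being a DenseVector, consider a view of one column of a matrix. This uses only Base Julia. It illustrates the distinction and makes no claim about what any particular format returns.

```julia
# Backing storage that some hypothetical format might own.
backing = zeros(Float32, 4, 2)

# A view of the first column: its elements are consecutive in memory.
empty_vector = view(backing, :, 1)

vector_strides = strides(empty_vector)
is_dense_type = empty_vector isa DenseVector

# The caller fills the vector in place, and the backing storage sees the values.
empty_vector .= 1:4
```

Here vector_strides is (1,) while is_dense_type is false, because a SubArray is not a subtype of DenseArray.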

DataAxesFormats.Formats.format_get_empty_sparse_vector! Function
format_get_empty_sparse_vector!(
    format::FormatWriter,
    axis::AbstractString,
    name::AbstractString,
    eltype::Type{T},
    nnz::StorageInteger,
    indtype::Type{I},
)::Tuple{AbstractVector{I}, AbstractVector{T}, Any}
where {T <: StorageReal, I <: StorageInteger}

Implement creating an empty sparse vector property with some name for some axis in format.

This trusts we have a write lock on the data set, that the axis exists in format and that the vector property name isn't "name" , and that it does not exist for the axis .

DataAxesFormats.Formats.format_filled_empty_sparse_vector! Function
format_filled_empty_sparse_vector!(
    format::FormatWriter,
    axis::AbstractString,
    name::AbstractString,
    filled::SparseVector{<:StorageReal, <:StorageInteger},
)::Nothing

Allow the format to perform caching once the empty sparse vector has been filled . By default this does nothing.
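The create-fill-notify sequence for sparse vectors might look roughly like the sketch below. It assumes that the first two elements of the returned tuple are the index and value buffers of a SparseVector. The third element is not described here and is left out. toy_get_empty_sparse_vector is a hypothetical stand-in.

```julia
using SparseArrays

# Hypothetical stand-in: allocate uninitialized buffers for the non-zero entries.
function toy_get_empty_sparse_vector(
    ::Type{T},
    n_nonzeros::Integer,
    ::Type{I},
) where {T <: Real, I <: Integer}
    nzind = Vector{I}(undef, n_nonzeros)   # indices of the non-zero entries
    nzval = Vector{T}(undef, n_nonzeros)   # values of the non-zero entries
    return (nzind, nzval)
end

nzind, nzval = toy_get_empty_sparse_vector(Float32, 2, Int32)

# The caller fills the buffers; the indices must be sorted and unique.
nzind .= [2, 5]
nzval .= [1.5, 2.5]

# Once filled, the buffers describe a sparse vector along an axis of 6 entries.
filled = SparseVector(6, nzind, nzval)
```

A format that wants to cache the result would do so when it is handed the filled vector.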

DataAxesFormats.Formats.format_get_empty_dense_matrix! Function
format_get_empty_dense_matrix!(
    format::FormatWriter,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString,
    eltype::Type{T},
)::AbstractMatrix{T} where {T <: StorageReal}

Implement creating an empty dense matrix property with some name for some rows_axis and columns_axis in format .

This trusts we have a write lock on the data set, that the rows_axis and columns_axis exist in format and that the name matrix property does not exist for them.

Note

The return type of this function is always a functionally dense matrix, that is, it will have strides of (1, nrows), so that elements are consecutive in memory. However, it need not be an actual DenseMatrix because of limitations of Julia's type system.
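The pattern of getting an empty matrix and then filling it can be sketched with an in-memory stand-in. The function name, the dictionaries and the axis names are hypothetical. A disk-based format would presumably return memory-mapped storage instead of a plain Matrix.

```julia
# Hypothetical sketch: allocate the matrix, register it, and hand it to the caller to fill.
function toy_get_empty_dense_matrix!(
    matrices::Dict{Tuple{String, String, String}, AbstractMatrix},
    axis_lengths::Dict{String, Int},
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString,
    ::Type{T},
) where {T <: Real}
    matrix = Matrix{T}(undef, axis_lengths[rows_axis], axis_lengths[columns_axis])
    matrices[(rows_axis, columns_axis, name)] = matrix
    return matrix
end

matrices = Dict{Tuple{String, String, String}, AbstractMatrix}()
axis_lengths = Dict("cell" => 3, "gene" => 2)

empty_matrix = toy_get_empty_dense_matrix!(matrices, axis_lengths, "cell", "gene", "UMIs", Int32)

# The caller fills the returned matrix in place, so the stored entry holds the same data.
empty_matrix .= [1 2; 3 4; 5 6]
matrix_strides = strides(empty_matrix)
```

With 3 rows, matrix_strides is (1, 3): moving down a column advances one element, and moving across a row advances by nrows.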

DataAxesFormats.Formats.format_get_empty_sparse_matrix! Function
format_get_empty_sparse_matrix!(
    format::FormatWriter,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString,
    eltype::Type{T},
    indtype::Type{I},
    nnz::StorageInteger,
)::Tuple{AbstractVector{I}, AbstractVector{I}, AbstractVector{T}, Any}
where {T <: StorageReal, I <: StorageInteger}

Implement creating an empty sparse matrix property with some name for some rows_axis and columns_axis in format .

This trusts we have a write lock on the data set, that the rows_axis and columns_axis exist in format and that the name matrix property does not exist for them.

DataAxesFormats.Formats.format_filled_empty_sparse_matrix! Function
format_filled_empty_sparse_matrix!(
    format::FormatWriter,
    rows_axis::AbstractString,
    columns_axis::AbstractString,
    name::AbstractString,
    filled::SparseMatrixCSC{<:StorageReal, <:StorageInteger},
)::Nothing

Allow the format to perform caching once the empty sparse matrix has been filled . By default this does nothing.
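For sparse matrices, the buffers plausibly correspond to the colptr, rowval and nzval fields of a SparseMatrixCSC. That mapping is an assumption based on the filled argument's type, and the fourth element of the returned tuple is left out. The sizes and values below are made up for illustration.

```julia
using SparseArrays

nrows, ncols, n_nonzeros = 3, 2, 3

# Buffers that a hypothetical format would allocate and the caller would fill.
colptr = Vector{Int32}(undef, ncols + 1)     # where each column starts in rowval and nzval
rowval = Vector{Int32}(undef, n_nonzeros)    # row index of each non-zero entry
nzval = Vector{Float32}(undef, n_nonzeros)   # value of each non-zero entry

# Column 1 holds rows 1 and 3, and column 2 holds row 2.
colptr .= [1, 3, 4]
rowval .= [1, 3, 2]
nzval .= [10, 20, 30]

# Once filled, the buffers describe a column-major compressed sparse matrix.
filled = SparseMatrixCSC(nrows, ncols, colptr, rowval, nzval)
```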
