Formats
DataAxesFormats.Formats
—
Module
The
FormatReader
and
FormatWriter
interfaces specify a low-level API for storing
Daf
data. To extend
Daf
to support an additional format, create a new implementation of this API.
A storage format object contains some named scalar data, a set of axes (each with a unique name for each entry), and named vector and matrix data based on these axes.
Data properties are identified by a unique name given the axes they are based on. That is, there is a separate namespace for scalar properties, vector properties for each specific axis, and matrix properties for each (ordered) pair of axes.
For matrices, we keep careful track of their layout. Specifically, a storage format only deals with column-major matrices, listed under the rows axis first and the columns axis second. A storage format object may hold two copies of the same matrix, in both possible memory layouts, in which case it will be listed twice, under both axis orders.
In general, storage format objects are as "dumb" as possible, to make it easier to support new storage formats. The required functions implement a glorified key-value repository, with the absolutely minimal necessary logic to deal with the separate property namespaces listed above.
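The separate namespaces described above can be pictured as dictionaries with different key shapes. The following is a minimal sketch under that assumption, not the package's implementation; `ToyStore` and its field names are hypothetical.

```julia
# Hypothetical toy store; not part of DataAxesFormats.
struct ToyStore
    scalars::Dict{String, Any}                             # key: name
    axes::Dict{String, Vector{String}}                     # key: axis name, value: unique entry names
    vectors::Dict{Tuple{String, String}, Vector}           # key: (axis, name)
    matrices::Dict{Tuple{String, String, String}, Matrix}  # key: (rows_axis, columns_axis, name)
end

ToyStore() = ToyStore(
    Dict{String, Any}(),
    Dict{String, Vector{String}}(),
    Dict{Tuple{String, String}, Vector}(),
    Dict{Tuple{String, String, String}, Matrix}(),
)

store = ToyStore()
store.axes["cell"] = ["c1", "c2"]
store.axes["gene"] = ["g1", "g2", "g3"]

# The same property name can be reused in every namespace without a clash.
store.scalars["age"] = 1
store.vectors[("cell", "age")] = [3.0, 4.0]
store.vectors[("gene", "age")] = [1.0, 2.0, 3.0]

# Matrices are keyed by the ordered pair of axes: rows axis first, columns axis second.
store.matrices[("gene", "cell", "UMIs")] = zeros(3, 2)
```

Because the pair of axes is ordered, looking up `("cell", "gene", "UMIs")` in this toy store finds nothing until a second copy is stored under the flipped axes.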
For clarity of documentation, we split the type hierarchy into
DafWriter
<:
FormatWriter
<:
DafReader
<:
FormatReader
.
The functions listed here use the
FormatReader
for read-only operations and
FormatWriter
for write operations into a
Daf
storage. This is a low-level API, not meant to be used from outside the package, and therefore is not re-exported from the top-level
DataAxesFormats
namespace.
In contrast, the functions using
DafReader
and
DafWriter
describe the high-level API meant to be used from outside the package, and are re-exported. These functions are listed in the
DataAxesFormats.Readers
and
DataAxesFormats.Writers
modules. They provide all the logic common to any storage format, allowing us to keep the format-specific functions as simple as possible.
That is, when implementing a new
Daf
storage format, you should write
struct MyFormat <: DafWriter
, and implement the functions listed here for both
FormatReader
and
FormatWriter
.
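As a rough illustration of this recipe, the sketch below declares stand-in copies of the abstract types (so that it runs without the package) and a hypothetical scalar-only `MyFormat`. The field name `scalars` and the choice of `MemoryData` are assumptions made for the example; a real format would import the package's types and would also provide the `.internal` member and the axis, vector and matrix functions.

```julia
# Stand-ins for the package's abstract types and enum, declared here only so
# that the sketch runs on its own; a real format would import them instead.
abstract type FormatReader end
abstract type DafReader <: FormatReader end
abstract type FormatWriter <: DafReader end
abstract type DafWriter <: FormatWriter end
@enum CacheGroup MappedData MemoryData QueryData

# Hypothetical scalar-only format.
struct MyFormat <: DafWriter
    scalars::Dict{String, Any}
end

# Read side: existence check, and fetch together with a cache group.
format_has_scalar(format::MyFormat, name::AbstractString)::Bool = haskey(format.scalars, name)
format_get_scalar(format::MyFormat, name::AbstractString) = (format.scalars[name], MemoryData)

# Write side: returning `nothing` asks the caller to skip caching.
function format_set_scalar!(format::MyFormat, name::AbstractString, value)
    format.scalars[name] = value
    return nothing
end

format = MyFormat(Dict{String, Any}())
format_set_scalar!(format, "version", "1.0")
```

The functions contain no validation at all, matching the "trusts" clauses in the descriptions below: the checks are expected to happen in the high-level layer before these functions are called.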
Read API
DataAxesFormats.Formats.DafReader
—
Type
A high-level abstract interface for read-only access to
Daf
data.
All the functions for this type are provided based on the functions required for
FormatReader
. See the
Readers
module for their description.
DataAxesFormats.Formats.Internal
—
Type
struct Internal ... end
Internal data we need to keep in any concrete
FormatReader
. This has to be available as a
.internal
data member of the concrete format. This enables all the high-level
DafReader
and
DafWriter
functions.
The constructor will automatically call
unique_name
to try and make the names unique for improved error messages.
Caching
DataAxesFormats.Formats.CacheGroup
—
Type
Types of cached data inside
Daf
.
- MappedData - memory-mapped disk data. This is the cheapest data, as it doesn't put pressure on the garbage collector. It requires some OS resources to maintain the mapping, and physical memory for the subset of the data that is actually being accessed. That is, one can memory map larger data than the physical memory, and performance will be good, as long as the subset of the data that is actually accessed is small enough to fit in memory. If it isn't, the performance will drop (a lot!) because the OS will be continuously reading data pages from disk, but it will not crash due to an out of memory error. It is very important not to re-map the same data twice because that causes all sorts of inefficiencies and edge cases in the hardware and low-level software.
- MemoryData - a copy of data (from disk, or computed). This does pressure the garbage collector and can cause out of memory errors. However, recomputing or re-fetching the data from disk is slow, so caching this data is crucial for performance.
- QueryData - data that is computed by queries based on stored data (e.g., masked data, or results of a reduction or an element-wise operation). This again takes up application memory and may cause out of memory errors, but it is very useful to cache the results when the same query is executed multiple times (e.g., when using views). Manually executing a query therefore allows explicitly disabling the caching of its results, since some queries will not be repeated.
If too much data has been cached, call
empty_cache!
to release it.
DataAxesFormats.Formats.empty_cache!
—
Function
empty_cache!(
daf::DafReader;
[clear::Maybe{CacheGroup} = nothing,
keep::Maybe{CacheGroup} = nothing]
)::Nothing
Clear some cached data. By default, completely empties the caches. You can specify either
clear
, to only forget a specific
CacheGroup
(e.g., for clearing only
QueryData
), or
keep
, to forget everything except a specific
CacheGroup
(e.g., for keeping only
MappedData
). You can't specify both
clear
and
keep
.
If there are any slow cache update operations in flight (matrix relayout, queries) then this will wait until they are done to ensure that the cache is in a consistent state.
Some backends return vectors and matrices that alias buffers held alive by the cache entry (for zero-copy access). Emptying the cache releases those buffers, so any array reference previously returned by such a backend becomes dangling. Currently this affects
HttpDaf
; drop all such references before calling
empty_cache!
on one of these backends.
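The `clear` and `keep` semantics can be sketched on a plain dictionary. This is only an illustration under assumed data structures: `toy_empty_cache!` is hypothetical, the cache maps a key to a `(group, value)` pair, and symbols stand in for the `CacheGroup` values. The real function also waits for in-flight cache updates, which is not modeled here.

```julia
# Hypothetical helper; symbols such as :QueryData stand in for CacheGroup values.
function toy_empty_cache!(cache::Dict; clear = nothing, keep = nothing)
    if clear !== nothing && keep !== nothing
        error("can't specify both clear and keep")
    end
    filter!(cache) do entry
        group = first(entry.second)
        if clear !== nothing
            return group != clear  # forget only the `clear` group
        elseif keep !== nothing
            return group == keep   # forget everything except the `keep` group
        else
            return false           # default: forget everything
        end
    end
    return nothing
end

cache = Dict(
    "mapped" => (:MappedData, [1]),
    "copied" => (:MemoryData, [2]),
    "queried" => (:QueryData, [3]),
)

toy_empty_cache!(cache; clear = :QueryData)
```

After the call above only the query results are gone; calling again with `keep = :MappedData` would leave just the memory-mapped entry, and calling with no keyword arguments would empty the dictionary.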
Description
DataAxesFormats.Formats.format_description_header
—
Function
format_description_header(format::FormatReader, lines::Vector{String}, deep::Bool)::Nothing
Allow a
format
to emit additional description header lines.
This trusts that we have a read lock on the data set.
DataAxesFormats.Formats.format_description_footer
—
Function
format_description_footer(format::FormatReader, lines::Vector{String}; cache::Bool, deep::Bool, tensors::Bool)::Nothing
Allow a
format
to emit additional description footer lines. If
deep
, this also emits the description of any data sets nested in this one, if any.
This trusts that we have a read lock on the data set.
Scalar properties
DataAxesFormats.Formats.format_has_scalar
—
Function
format_has_scalar(format::FormatReader, name::AbstractString)::Bool
Check whether a scalar property with some
name
exists in
format
.
This trusts that we have a read lock on the data set.
DataAxesFormats.Formats.format_scalars_set
—
Function
format_scalars_set(format::FormatReader)::AbstractSet{<:AbstractString}
The names of the scalar properties in
format
.
This trusts that we have a read lock on the data set.
DataAxesFormats.Formats.format_get_scalar
—
Function
format_get_scalar(
format::FormatReader,
name::AbstractString,
)::Tuple{StorageScalar, CacheGroup}
Implement fetching the value of a scalar property with some
name
in
format
, together with a per-item
CacheGroup
describing how the returned value should be cached. Formats that can classify a returned value's mmappability (e.g.
H5df
) should return
MappedData
when handing back an mmapped view and
MemoryData
when handing back an eagerly-read copy.
This trusts that we have a read lock on the data set, and that the
name
scalar property exists in
format
.
Data axes
DataAxesFormats.Formats.format_has_axis
—
Function
format_has_axis(format::FormatReader, axis::AbstractString; for_change::Bool)::Bool
Check whether some
axis
exists in
format
. If
for_change
, this is done just prior to adding or deleting the axis.
This trusts that we have a read lock on the data set.
DataAxesFormats.Formats.format_axes_set
—
Function
format_axes_set(format::FormatReader)::AbstractSet{<:AbstractString}
The names of the axes of
format
.
This trusts that we have a read lock on the data set.
DataAxesFormats.Formats.format_axis_vector
—
Function
format_axis_vector(
format::FormatReader,
axis::AbstractString,
)::Tuple{AbstractVector{<:AbstractString}, CacheGroup}
Implement fetching the unique names of the entries of some
axis
of
format
, together with a per-item
CacheGroup
describing how the returned vector should be cached.
This trusts that we have a read lock on the data set, and that the
axis
exists in
format
.
DataAxesFormats.Formats.format_axis_length
—
Function
format_axis_length(format::FormatReader, axis::AbstractString)::Int64
Implement fetching the number of entries along the
axis
.
This trusts that we have a read lock on the data set, and that the
axis
exists in
format
.
Vector properties
DataAxesFormats.Formats.format_has_vector
—
Function
format_has_vector(format::FormatReader, axis::AbstractString, name::AbstractString)::Bool
Implement checking whether a vector property with some
name
exists for the
axis
in
format
.
This trusts that we have a read lock on the data set, that the
axis
exists in
format
and that the property name isn't
"name"
.
DataAxesFormats.Formats.format_vectors_set
—
Function
format_vectors_set(format::FormatReader, axis::AbstractString)::AbstractSet{<:AbstractString}
Implement fetching the names of the vectors for the
axis
in
format
,
not
including the special
name
property.
This trusts that we have a read lock on the data set, and that the
axis
exists in
format
.
DataAxesFormats.Formats.format_get_vector
—
Function
format_get_vector(
format::FormatReader,
axis::AbstractString,
name::AbstractString,
)::Tuple{StorageVector, Any, CacheGroup}
Implement fetching the vector property with some
name
for some
axis
in
format
, together with an opaque backing object that must be kept alive for as long as the returned vector is used (pass
nothing
if the vector is self-contained), and a per-item
CacheGroup
describing how the returned vector should be cached.
The backing slot exists so formats like
HttpDaf
can surface zero-copy
unsafe_wrap
'd vectors alongside the
Vector{UInt8}
buffer they alias; the cache stores the pair as a
Tuple{NamedArray, Any}
so the backing stays reachable as long as the vector is cached.
This trusts that we have a read lock on the data set, that the
axis
exists in
format
, and the
name
vector property exists for the
axis
.
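To see why the backing slot matters, consider a sketch of the zero-copy pattern described above. The helper `toy_wrap_floats` is hypothetical and only mimics the idea: it wraps a byte buffer as a `Float32` vector with `unsafe_wrap` and returns the buffer alongside it, since Julia does not track the aliasing on its own.

```julia
# Hypothetical helper: view a byte buffer as Float32 values without copying.
function toy_wrap_floats(bytes::Vector{UInt8})
    count = length(bytes) ÷ sizeof(Float32)
    floats_pointer = convert(Ptr{Float32}, pointer(bytes))
    # The wrapped vector aliases `bytes`. The garbage collector does not know
    # this, so `bytes` must stay reachable for as long as the vector is used.
    vector = unsafe_wrap(Array, floats_pointer, count)
    return (vector, bytes)
end

# Build some bytes to wrap (here from known values, instead of a download).
bytes = collect(reinterpret(UInt8, Float32[1.5, 2.5, 3.5]))
vector, backing = toy_wrap_floats(bytes)
```

Keeping `backing` next to `vector` (as the cache does with its pair) is what keeps the memory valid; dropping the buffer while still using the vector would leave a dangling reference.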
Matrix properties
DataAxesFormats.Formats.format_has_matrix
—
Function
format_has_matrix(
format::FormatReader,
rows_axis::AbstractString,
columns_axis::AbstractString,
name::AbstractString;
)::Bool
Implement checking whether a matrix property with some
name
exists for the
rows_axis
and the
columns_axis
in
format
. If
cache
, this also checks whether the matrix exists in the cache.
This trusts that we have a read lock on the data set, and that the
rows_axis
and the
columns_axis
exist in
format
.
DataAxesFormats.Formats.format_matrices_set
—
Function
format_matrices_set(
format::FormatReader,
rows_axis::AbstractString,
columns_axis::AbstractString,
)::AbstractSet{<:AbstractString}
Implement fetching the names of the matrix properties for the
rows_axis
and
columns_axis
in
format
.
This trusts that we have a read lock on the data set, and that the
rows_axis
and
columns_axis
exist in
format
.
DataAxesFormats.Formats.format_get_matrix
—
Function
format_get_matrix(
format::FormatReader,
rows_axis::AbstractString,
columns_axis::AbstractString,
name::AbstractString
)::Tuple{StorageMatrix, Any, CacheGroup}
Implement fetching the matrix property with some
name
for some
rows_axis
and
columns_axis
in
format
, together with an opaque backing object kept alive for the returned matrix's lifetime (pass
nothing
if the matrix is self-contained), and a per-item
CacheGroup
describing how the returned matrix should be cached. See
format_get_vector
for the backing rationale.
This trusts that we have a read lock on the data set, and that the
rows_axis
and
columns_axis
exist in
format
, and the
name
matrix property exists for them.
Write API
DataAxesFormats.Formats.DafWriter
—
Type
A high-level abstract interface for write access to
Daf
data.
All the functions for this type are provided based on the functions required for
FormatWriter
. See the
Writers
module for their description.
DataAxesFormats.Formats.FormatWriter
—
Type
An abstract interface for writing into
Daf
storage formats.
Each storage format must implement the functions listed below for writing into the storage.
Scalar properties
DataAxesFormats.Formats.format_set_scalar!
—
Function
format_set_scalar!(
format::FormatWriter,
name::AbstractString,
value::StorageScalar,
)::Maybe{CacheGroup}
Implement setting the
value
of a scalar property with some
name
in
format
. Return the
CacheGroup
to use when caching the just-written value (as would be returned by
format_get_scalar
), or
nothing
to skip caching.
This trusts that we have a write lock on the data set, and that the
name
scalar property does not exist in
format
.
DataAxesFormats.Formats.format_delete_scalar!
—
Function
format_delete_scalar!(
format::FormatWriter,
name::AbstractString;
for_set::Bool
)::Nothing
Implement deleting a scalar property with some
name
from
format
. If
for_set
, this is done just prior to setting the scalar with a different value.
This trusts that we have a write lock on the data set, and that the
name
scalar property exists in
format
.
Data axes
DataAxesFormats.Formats.format_add_axis!
—
Function
format_add_axis!(
format::FormatWriter,
axis::AbstractString,
entries::AbstractVector{<:AbstractString}
)::Nothing
Implement adding a new
axis
to
format
.
This trusts we have a write lock on the data set, that the
axis
does not already exist in
format
, and that the names of the
entries
are unique.
DataAxesFormats.Formats.format_delete_axis!
—
Function
format_delete_axis!(format::FormatWriter, axis::AbstractString)::Nothing
Implement deleting some
axis
from
format
.
This trusts we have a write lock on the data set, that the
axis
exists in
format
, and that all properties that are based on this axis have already been deleted.
Vector properties
DataAxesFormats.Formats.format_set_vector!
—
Function
format_set_vector!(
format::FormatWriter,
axis::AbstractString,
name::AbstractString,
vector::Union{StorageScalar, StorageVector},
)::Nothing
Implement setting a vector property with some
name
for some
axis
in
format
.
If the
vector
specified is actually a
StorageScalar
, the stored vector is filled with this value.
This trusts we have a write lock on the data set, that the
axis
exists in
format
, that the vector property
name
isn't
"name"
, that it does not exist for the
axis
, and that the
vector
has the appropriate length for it.
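The scalar-fill rule can be sketched in a few lines. The helper `toy_stored_vector` is hypothetical and only shows what ends up stored; how a real format writes the data is up to that format.

```julia
# Hypothetical helper showing what is stored for each kind of input.
function toy_stored_vector(value, axis_length::Integer)
    if value isa AbstractVector
        return value                      # trusted to already have the right length
    else
        return fill(value, axis_length)   # a scalar fills the whole vector
    end
end

from_scalar = toy_stored_vector(0.5, 3)
from_vector = toy_stored_vector([1, 2, 3], 3)
```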
DataAxesFormats.Formats.format_delete_vector!
—
Function
format_delete_vector!(
format::FormatWriter,
axis::AbstractString,
name::AbstractString;
for_set::Bool
)::Nothing
Implement deleting a vector property with some
name
for some
axis
from
format
. If
for_set
, this is done just prior to setting the vector with a different value.
This trusts we have a write lock on the data set, that the
axis
exists in
format
, that the vector property name isn't
"name"
, and that the
name
vector exists for the
axis
.
Matrix properties
DataAxesFormats.Formats.format_set_matrix!
—
Function
format_set_matrix!(
format::FormatWriter,
rows_axis::AbstractString,
columns_axis::AbstractString,
name::AbstractString,
matrix::StorageMatrix,
)::Nothing
Implement setting the matrix property with some
name
for some
rows_axis
and
columns_axis
in
format
.
If the
matrix
specified is actually a
StorageScalar
, the stored matrix is filled with this value.
This trusts we have a write lock on the data set, that the
rows_axis
and
columns_axis
exist in
format
, that the
name
matrix property does not exist for them, and that the
matrix
is column-major of the appropriate size for it.
DataAxesFormats.Formats.format_relayout_matrix!
—
Function
format_relayout_matrix!(
format::FormatWriter,
rows_axis::AbstractString,
columns_axis::AbstractString,
name::AbstractString,
matrix::StorageMatrix,
)::StorageMatrix
relayout!
the existing
name
column-major
matrix
property for the
rows_axis
and the
columns_axis
and store the results as a row-major matrix property (that is, with flipped axes).
This trusts we have a write lock on the data set, that the
rows_axis
and
columns_axis
are different from each other, exist in
format
, that the
name
matrix property exists for them, and that it does not exist for the flipped axes.
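The effect of a relayout can be sketched on an in-memory dictionary keyed by `(rows_axis, columns_axis, name)`. This is an assumption-laden illustration: `toy_relayout!` is hypothetical, and `permutedims` stands in for the package's `relayout!`.

```julia
# Hypothetical in-memory store keyed by (rows_axis, columns_axis, name).
function toy_relayout!(matrices::Dict, rows_axis, columns_axis, name)
    matrix = matrices[(rows_axis, columns_axis, name)]
    # A materialized transposed copy: column-major under the flipped axes,
    # which is the same as row-major under the original axes.
    flipped = permutedims(matrix)
    matrices[(columns_axis, rows_axis, name)] = flipped
    return flipped
end

matrices = Dict{Tuple{String, String, String}, Matrix{Float64}}()
matrices[("gene", "cell", "UMIs")] = [1.0 2.0; 3.0 4.0; 5.0 6.0]

flipped = toy_relayout!(matrices, "gene", "cell", "UMIs")
```

After the call the store lists the matrix twice, once under each axis order, which is the situation described at the top of this page.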
DataAxesFormats.Formats.format_delete_matrix!
—
Function
format_delete_matrix!(
format::FormatWriter,
rows_axis::AbstractString,
columns_axis::AbstractString,
name::AbstractString;
for_set::Bool
)::Nothing
Implement deleting a matrix property with some
name
for some
rows_axis
and
columns_axis
from
format
. If
for_set
, this is done just prior to setting the matrix with a different value.
This trusts we have a write lock on the data set, that the
rows_axis
and
columns_axis
exist in
format
, and that the
name
matrix property exists for them.
Creating properties
DataAxesFormats.Formats.format_get_empty_dense_vector!
—
Function
format_get_empty_dense_vector!(
format::FormatWriter,
axis::AbstractString,
name::AbstractString,
eltype::Type{T},
)::Tuple{AbstractVector{T}, Maybe{CacheGroup}} where {T <: StorageReal}
Implement creating an empty dense vector property with some
name
for some
axis
in
format
. Return the buffer to be filled by the caller, together with the
CacheGroup
to use when caching the filled buffer (as would be returned by
format_get_vector
), or
nothing
to skip caching.
This trusts we have a write lock on the data set, that the
axis
exists in
format
and that the vector property
name
isn't
"name"
, and that it does not exist for the
axis
.
DataAxesFormats.Formats.format_filled_empty_dense_vector!
—
Function
format_filled_empty_dense_vector!(
format::FormatWriter,
axis::AbstractString,
name::AbstractString,
filled::AbstractVector{<:StorageReal},
)::Nothing
Allow the
format
to perform finalization once the empty dense vector has been
filled
(e.g. flushing metadata, patching chunk checksums). The default does nothing.
This trusts we have a write lock on the data set, that the
axis
exists in
format
, that the
name
vector property exists for the
axis
, and that
filled
is the same buffer that was returned by
format_get_empty_dense_vector!
.
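The get-empty, fill, then finalize sequence might look as follows for a toy in-memory format. `ToyDenseFormat` and its fields are hypothetical, and the `finalized` set merely records that the second function was called; a disk-based format might flush or patch metadata at that point instead.

```julia
# Hypothetical format used only to illustrate the two-step protocol.
struct ToyDenseFormat
    axis_lengths::Dict{String, Int}
    vectors::Dict{Tuple{String, String}, Vector}
    finalized::Set{Tuple{String, String}}
end

function format_get_empty_dense_vector!(format::ToyDenseFormat, axis, name, ::Type{T}) where {T <: Real}
    buffer = Vector{T}(undef, format.axis_lengths[axis])  # uninitialized; the caller fills it
    format.vectors[(axis, name)] = buffer
    return (buffer, nothing)  # `nothing`: skip caching in this toy
end

function format_filled_empty_dense_vector!(format::ToyDenseFormat, axis, name, filled)
    @assert filled === format.vectors[(axis, name)]  # same buffer that was handed out
    push!(format.finalized, (axis, name))
    return nothing
end

format = ToyDenseFormat(
    Dict("cell" => 3),
    Dict{Tuple{String, String}, Vector}(),
    Set{Tuple{String, String}}(),
)

buffer, cache_group = format_get_empty_dense_vector!(format, "cell", "age", Float32)
buffer .= [1, 2, 3]  # the caller fills the buffer in place
format_filled_empty_dense_vector!(format, "cell", "age", buffer)
```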
DataAxesFormats.Formats.format_get_empty_sparse_vector!
—
Function
format_get_empty_sparse_vector!(
format::FormatWriter,
axis::AbstractString,
name::AbstractString,
eltype::Type{T},
nnz::StorageInteger,
indtype::Type{I},
)::Tuple{AbstractVector{I}, AbstractVector{T}, Maybe{CacheGroup}}
where {T <: StorageReal, I <: StorageInteger}
Implement creating an empty sparse vector property with some
name
for some
axis
in
format
. Return the
nzind
and
nzval
buffers to be filled by the caller, together with the
CacheGroup
to use when caching the assembled sparse vector (as would be returned by
format_get_vector
), or
nothing
to skip caching.
This trusts we have a write lock on the data set, that the
axis
exists in
format
and that the vector property
name
isn't
"name"
, and that it does not exist for the
axis
.
DataAxesFormats.Formats.format_filled_empty_sparse_vector!
—
Function
format_filled_empty_sparse_vector!(
format::FormatWriter,
axis::AbstractString,
name::AbstractString,
filled::SparseVector{<:StorageReal, <:StorageInteger},
)::Nothing
Allow the
format
to perform finalization once the empty sparse vector has been
filled
(e.g. flushing metadata, patching chunk checksums). The default does nothing.
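For the sparse case, the caller fills the `nzind` and `nzval` buffers and the result is assembled into a `SparseVector`. The sketch below allocates the buffers itself as stand-ins for the ones a format would return; the sizes and values are made up for illustration.

```julia
using SparseArrays

axis_length = 5

# Stand-ins for the buffers a format would hand back (nnz = 2).
nzind = Vector{Int32}(undef, 2)
nzval = Vector{Float32}(undef, 2)

# The caller fills both buffers; the indices must be sorted and unique.
nzind .= [2, 4]
nzval .= [1.5, 2.5]

# Assemble the sparse vector that the "filled" function would receive.
filled = SparseVector(axis_length, nzind, nzval)
```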
DataAxesFormats.Formats.format_get_empty_dense_matrix!
—
Function
format_get_empty_dense_matrix!(
format::FormatWriter,
rows_axis::AbstractString,
columns_axis::AbstractString,
name::AbstractString,
eltype::Type{T},
)::Tuple{AbstractMatrix{T}, Maybe{CacheGroup}} where {T <: StorageReal}
Implement creating an empty dense matrix property with some
name
for some
rows_axis
and
columns_axis
in
format
. Return the buffer to be filled by the caller, together with the
CacheGroup
to use when caching the filled buffer (as would be returned by
format_get_matrix
), or
nothing
to skip caching.
This trusts we have a write lock on the data set, that the
rows_axis
and
columns_axis
exist in
format
and that the
name
matrix property does not exist for them.
DataAxesFormats.Formats.format_filled_empty_dense_matrix!
—
Function
format_filled_empty_dense_matrix!(
format::FormatWriter,
rows_axis::AbstractString,
columns_axis::AbstractString,
name::AbstractString,
filled::AbstractMatrix{<:StorageReal},
)::Nothing
Allow the
format
to perform finalization once the empty dense matrix has been
filled
(e.g. flushing metadata, patching chunk checksums). The default does nothing.
This trusts we have a write lock on the data set, that the
rows_axis
and
columns_axis
exist in
format
, that the
name
matrix property exists for them, and that
filled
is the same buffer that was returned by
format_get_empty_dense_matrix!
.
DataAxesFormats.Formats.format_get_empty_sparse_matrix!
—
Function
format_get_empty_sparse_matrix!(
format::FormatWriter,
rows_axis::AbstractString,
columns_axis::AbstractString,
name::AbstractString,
eltype::Type{T},
indtype::Type{I},
nnz::StorageInteger,
)::Tuple{AbstractVector{I}, AbstractVector{I}, AbstractVector{T}, Maybe{CacheGroup}}
where {T <: StorageReal, I <: StorageInteger}
Implement creating an empty sparse matrix property with some
name
for some
rows_axis
and
columns_axis
in
format
. Return the
colptr
,
rowval
and
nzval
buffers to be filled by the caller, together with the
CacheGroup
to use when caching the assembled sparse matrix (as would be returned by
format_get_matrix
), or
nothing
to skip caching.
This trusts we have a write lock on the data set, that the
rows_axis
and
columns_axis
exist in
format
and that the
name
matrix property does not exist for them.
DataAxesFormats.Formats.format_filled_empty_sparse_matrix!
—
Function
format_filled_empty_sparse_matrix!(
format::FormatWriter,
rows_axis::AbstractString,
columns_axis::AbstractString,
name::AbstractString,
filled::SparseMatrixCSC{<:StorageReal, <:StorageInteger},
)::Nothing
Allow the
format
to perform finalization once the empty sparse matrix has been
filled
(e.g. flushing metadata, patching chunk checksums). The default does nothing.
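The three buffers of a sparse matrix follow the compressed sparse column convention of `SparseMatrixCSC`. As a small worked example (with buffers allocated locally as stand-ins for the ones a format would return, and made-up sizes and values):

```julia
using SparseArrays

n_rows, n_columns, n_stored = 3, 2, 3

# Stand-ins for the buffers a format would hand back.
colptr = Vector{Int32}(undef, n_columns + 1)
rowval = Vector{Int32}(undef, n_stored)
nzval = Vector{Float64}(undef, n_stored)

# Column 1 owns stored entries 1:2 and column 2 owns stored entry 3.
colptr .= [1, 3, 4]
rowval .= [1, 3, 2]            # row of each stored entry, sorted within each column
nzval .= [10.0, 30.0, 20.0]

# Assemble the sparse matrix that the "filled" function would receive.
filled = SparseMatrixCSC(n_rows, n_columns, colptr, rowval, nzval)
```

Note that `colptr` always has one more element than the number of columns, and its last element is the number of stored entries plus one.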
Reordering axes
DataAxesFormats.Formats.invalidate_axis_data!
—
Function
invalidate_axis_data!(writer::FormatWriter, axis::AbstractString)::Nothing
Invalidate all cached data that depends on the content of
axis
: the axis entries, the axis-to-index dict, every vector on the axis, and every matrix where
axis
is one of the two axes. Increments the version counter for each property. Must be called inside a write lock.
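The dependency rule for vectors and matrices can be sketched as follows. Everything here is hypothetical: `toy_invalidate_axis!`, the tuple-shaped cache keys, and the `versions` dictionary are assumptions for illustration, and the axis entries and axis-to-index dict are not modeled.

```julia
# Hypothetical cache keyed by (axis, name) for vectors and
# (rows_axis, columns_axis, name) for matrices.
function toy_invalidate_axis!(cache::Dict, versions::Dict, axis::AbstractString)
    for key in collect(keys(cache))
        # Look only at the axis positions, not at the trailing property name.
        if axis in key[1:(end - 1)]
            delete!(cache, key)
            versions[key] = get(versions, key, 0) + 1
        end
    end
    return nothing
end

cache = Dict{Tuple, Any}(
    ("cell", "age") => [1, 2],
    ("gene", "cell", "UMIs") => zeros(3, 2),
    ("gene", "cell") => ["a", "b", "c"],  # a vector on "gene" that happens to be named "cell"
)
versions = Dict{Tuple, Int}()

toy_invalidate_axis!(cache, versions, "cell")
```

The vector on the `gene` axis survives even though its name is `cell`, because only the axes a property is based on determine whether it depends on the reordered axis.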
Index
- DataAxesFormats.Formats
- DataAxesFormats.Formats.CacheGroup
- DataAxesFormats.Formats.DafReader
- DataAxesFormats.Formats.DafWriter
- DataAxesFormats.Formats.FormatReader
- DataAxesFormats.Formats.FormatWriter
- DataAxesFormats.Formats.Internal
- DataAxesFormats.Formats.empty_cache!
- DataAxesFormats.Formats.format_add_axis!
- DataAxesFormats.Formats.format_axes_set
- DataAxesFormats.Formats.format_axis_length
- DataAxesFormats.Formats.format_axis_vector
- DataAxesFormats.Formats.format_delete_axis!
- DataAxesFormats.Formats.format_delete_matrix!
- DataAxesFormats.Formats.format_delete_scalar!
- DataAxesFormats.Formats.format_delete_vector!
- DataAxesFormats.Formats.format_description_footer
- DataAxesFormats.Formats.format_description_header
- DataAxesFormats.Formats.format_filled_empty_dense_matrix!
- DataAxesFormats.Formats.format_filled_empty_dense_vector!
- DataAxesFormats.Formats.format_filled_empty_sparse_matrix!
- DataAxesFormats.Formats.format_filled_empty_sparse_vector!
- DataAxesFormats.Formats.format_get_empty_dense_matrix!
- DataAxesFormats.Formats.format_get_empty_dense_vector!
- DataAxesFormats.Formats.format_get_empty_sparse_matrix!
- DataAxesFormats.Formats.format_get_empty_sparse_vector!
- DataAxesFormats.Formats.format_get_matrix
- DataAxesFormats.Formats.format_get_scalar
- DataAxesFormats.Formats.format_get_vector
- DataAxesFormats.Formats.format_has_axis
- DataAxesFormats.Formats.format_has_matrix
- DataAxesFormats.Formats.format_has_scalar
- DataAxesFormats.Formats.format_has_vector
- DataAxesFormats.Formats.format_matrices_set
- DataAxesFormats.Formats.format_relayout_matrix!
- DataAxesFormats.Formats.format_scalars_set
- DataAxesFormats.Formats.format_set_matrix!
- DataAxesFormats.Formats.format_set_scalar!
- DataAxesFormats.Formats.format_set_vector!
- DataAxesFormats.Formats.format_vectors_set
- DataAxesFormats.Formats.invalidate_axis_data!