Concat
DataAxesFormats.Concat
—
Module
Concatenate multiple `Daf` data sets along some axis. This copies the data from the concatenated data sets into some target data set.
The exact behavior of concatenation is surprisingly complex when accounting for sparse vs. dense matrices, different matrix layouts, and properties which are not along the concatenation axis. The implementation is further complicated by minimizing the allocation of intermediate memory buffers for the data; that is, in principle, concatenating from and into memory-mapped data sets should not allocate "any" memory buffers - the data should be copied directly from one memory-mapped region to another.
DataAxesFormats.Concat.concatenate!
—
Function
concatenate!(
    destination::DafWriter,
    axis::Union{AbstractString, AbstractVector{<:AbstractString}},
    sources::AbstractVector{<:DafReader};
    [names::Maybe{AbstractVector{<:AbstractString}} = nothing,
    dataset_axis::Maybe{AbstractString} = "dataset",
    dataset_property::Bool = true,
    prefix::Union{Bool, AbstractVector{Bool}} = false,
    prefixed::Maybe{Union{AbstractSet{<:AbstractString}, AbstractVector{<:AbstractSet{<:AbstractString}}}} = nothing,
    empty::Maybe{EmptyData} = nothing,
    sparse_if_saves_storage_fraction::AbstractFloat = 0.25,
    merge::Maybe{MergeData} = nothing,
    overwrite::Bool = false]
)::Nothing
Concatenate data from a `sources` sequence of `Daf` data sets into a single `destination` data set along one or more concatenation `axis`. To concatenate along several axes at once, specify a vector of `axis` names.
We need a unique name for each of the concatenated data sets. By default, we use the `DafReader.name`. You can override this by specifying an explicit `names` vector with one name per data set.
By default, a new axis named by `dataset_axis` is created with one entry per concatenated data set, using these unique names. You can disable this by setting `dataset_axis` to `nothing`.
If an axis is created, and `dataset_property` is set (the default), a property with the same name is created for the concatenated `axis`, containing the name of the data set each entry was collected from.
The entries of each concatenated axis must be unique. By default, we require that no entry name is used in more than one data set. If this isn't the case, set `prefix` to prepend the unique data set name (and a `.` separator) to each entry name (either a single setting for all the axes, or a vector with one setting per axis).
If a prefix is added to the axis entry names, then it must also be added to all the vector properties whose values are entries of the axis. By default, we assume that any property name that is identical to the axis name is such a property (e.g., given a `cluster` axis, a `cluster` property of each `cell` is assumed to contain the names of clusters from that axis). We also allow for property names to just start with the axis name, followed by `.` and some suffix (e.g., `cluster.manual` will also be assumed to contain the names of clusters). We'll automatically add the unique prefix to all such properties.
If, however, this heuristic fails, you can specify a set of properties to be `prefixed` (or a vector of such sets, one per concatenated axis). In this case only the listed properties will be prefixed with the unique data set names.
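The default naming heuristic described above can be sketched in plain Julia. This is only an illustration of the rule as stated in the text; `looks_like_axis_reference` and `add_prefix` are hypothetical helpers, not functions of `DataAxesFormats`, and the data set name `batch1` is made up.

```julia
# Hypothetical helper (not part of DataAxesFormats): does `property` look like it
# holds entry names of `axis`? True if the name equals the axis name, or starts
# with the axis name followed by a `.` and some suffix.
function looks_like_axis_reference(property::AbstractString, axis::AbstractString)::Bool
    return property == axis || startswith(property, axis * ".")
end

# Hypothetical helper: prepend the unique data set name and a `.` separator.
function add_prefix(dataset::AbstractString, entry::AbstractString)::String
    return dataset * "." * entry
end

# Which vector properties of the cells would be treated as referring to `cluster`?
properties = ["cluster", "cluster.manual", "clusters", "age"]
prefixed_properties = filter(name -> looks_like_axis_reference(name, "cluster"), properties)

# The same prefix is applied to the axis entries and to the values that refer to them.
cluster_entries = [add_prefix("batch1", entry) for entry in ["C1", "C2"]]
cell_clusters = [add_prefix("batch1", value) for value in ["C1", "C2", "C1"]]
```

Here only `cluster` and `cluster.manual` are selected; a property such as `clusters` would have to be listed explicitly in `prefixed` if it holds cluster names.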
Vector and matrix properties for the `axis` will be concatenated. If some of the concatenated data sets do not contain some property, then an `empty` value must be specified for it, and will be used for the missing data.
Concatenated matrices are always stored in column-major layout where the concatenation axis is the column axis. No matrix may have both of its axes concatenated (e.g., square matrices of the concatenated axis).
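One plausible reading of why this layout is convenient, shown with plain Julia arrays rather than with the library itself: when the concatenation axis is the column axis of a column-major matrix, the result's memory is simply each part's memory laid end to end. The matrices below are made-up examples.

```julia
# Two dense parts sharing the same rows (say, 3 genes), with 2 and 3 columns (cells).
part_a = [1 2; 3 4; 5 6]
part_b = [7 8 9; 10 11 12; 13 14 15]

# Concatenate along the column axis.
combined = hcat(part_a, part_b)

# Julia stores matrices in column-major order, so `vec` exposes the memory order.
# The combined buffer is the first part's buffer followed by the second part's.
flat_combined = vec(combined)
flat_parts = vcat(vec(part_a), vec(part_b))
```

Because `flat_combined` equals `flat_parts`, each source can in principle be copied into its own contiguous region of the target, which fits the stated goal of avoiding intermediate buffers.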
The concatenated properties will be sparse if the storage for the sparse data is smaller than naive dense storage by at least `sparse_if_saves_storage_fraction` (by default, if using sparse storage saves at least 25% of the space, that is, takes at most 75% of the dense storage space). When estimating this fraction, we assume dense data is 100% non-zero; we only take into account data already stored as sparse, as well as any missing data whose `empty` value is zero.
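The decision rule above can be approximated by a back-of-the-envelope estimate. The sketch below is not the library's code: `counted_nonzeros` and `prefer_sparse` are hypothetical helpers, the byte counts assume compressed-sparse-column storage with `Int64` indices, and a data set missing the property with a zero `empty` value would simply contribute no non-zeros.

```julia
using SparseArrays

# Assumption from the text: sparse sources contribute their stored non-zeros,
# while dense sources are counted as 100% non-zero.
counted_nonzeros(matrix::SparseMatrixCSC) = nnz(matrix)
counted_nonzeros(matrix::AbstractMatrix) = length(matrix)

# Hypothetical estimate: prefer sparse storage if it takes at most
# (1 - fraction) of the naive dense storage.
function prefer_sparse(parts::AbstractVector; sparse_if_saves_storage_fraction::AbstractFloat = 0.25)
    n_rows = size(parts[1], 1)
    n_columns = sum(size(part, 2) for part in parts)
    n_nonzeros = sum(counted_nonzeros(part) for part in parts)
    value_bytes = sizeof(eltype(parts[1]))
    index_bytes = sizeof(Int64)  # assumed index type
    dense_bytes = n_rows * n_columns * value_bytes
    # One value and one row index per non-zero, plus one offset per column (and one more).
    sparse_bytes = n_nonzeros * (value_bytes + index_bytes) + (n_columns + 1) * index_bytes
    return sparse_bytes <= dense_bytes * (1 - sparse_if_saves_storage_fraction)
end

mostly_empty = sparse([1, 2], [1, 2], [1.0, 2.0], 100, 10)  # 2 stored values
dense_part = ones(100, 5)                                    # counted as 500 non-zeros

mixed_is_sparse = prefer_sparse([mostly_empty, dense_part])
dense_is_sparse = prefer_sparse([dense_part, dense_part])
strict_is_sparse = prefer_sparse([mostly_empty, dense_part]; sparse_if_saves_storage_fraction = 0.5)
```

With these inputs the mixed case needs 8160 bytes as sparse against 12000 bytes as dense, so it passes the default 25% threshold but not a 50% one.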
By default, properties that do not apply to any of the concatenation `axis` will be ignored. If `merge` is specified, then such properties will be processed according to it. Using `CollectAxis` for a property requires that the `dataset_axis` is not `nothing`.
By default, concatenation will fail rather than `overwrite` existing properties in the target.
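A minimal usage sketch of the call described above. The `concatenate!` arguments follow the signature given here; `MemoryDaf`, `add_axis!`, `set_vector!`, `get_vector`, and `axis_length` are assumed to be available from the wider `DataAxesFormats` package with the shapes used below, and the data set names and values are made up.

```julia
using DataAxesFormats

# Two small in-memory sources sharing a "cell" axis, with distinct entry names
# (so no `prefix` is needed).
first_source = MemoryDaf(; name = "batch1")
add_axis!(first_source, "cell", ["A", "B"])
set_vector!(first_source, "cell", "age", [1, 2])

second_source = MemoryDaf(; name = "batch2")
add_axis!(second_source, "cell", ["C", "D", "E"])
set_vector!(second_source, "cell", "age", [3, 4, 5])

# Concatenate along the "cell" axis into an empty destination, using the defaults:
# a "dataset" axis is created, and each cell gets a "dataset" property.
destination = MemoryDaf(; name = "combined")
concatenate!(destination, "cell", [first_source, second_source])

n_cells = axis_length(destination, "cell")
n_datasets = axis_length(destination, "dataset")
ages = get_vector(destination, "cell", "age")
origins = get_vector(destination, "cell", "dataset")
```

If the sources reused entry names (say, both had a cell `A`), the same call would need `prefix = true`, per the description above.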
DataAxesFormats.Concat.MergeDatum
—
Type
A pair where the key is a `PropertyKey` and the value is a `MergeAction`. We also allow specifying tuples instead of pairs to make it easy to invoke the API from other languages such as Python which do not have the concept of a `Pair`.
Similarly to `ViewData`, the order of the entries matters (last one wins), and a key containing `"*"` is expanded to all the relevant properties. For matrices, merge is done separately for each layout. That is, the order of the axes in the `(rows_axis, columns_axis, matrix_name)` key does matter in the `MergeData`, which is different from how `ViewData` works.
DataAxesFormats.Concat.MergeData
—
Type
Specify all the data to merge. We would have liked to specify this as `AbstractVector{<:MergeDatum}` but Julia in its infinite wisdom considers `["a" => "b", ("c", "d") => "e"]` to be a `Vector{Any}`, which would require literals to be annotated with the type.
DataAxesFormats.Concat.MergeAction
—
Type
The action for merging the values of a property from the concatenated data sets into the result data set. This is used for properties that do not apply to the concatenation axis (that is, scalar properties, and vector and matrix properties of other axes). Valid values are:
- `SkipProperty` - do not create the property in the result. This is the default.
- `LastValue` - use the value from the last concatenated data set (that has a value for the property). This is useful for properties that have the same value for all concatenated data sets.
- `CollectAxis` - collect the values from all the data sets, adding a dimension to the data (that is, convert a scalar property to a vector, and a vector property to a matrix). This can't be applied to matrix properties, because we can't directly store 3D data inside `Daf`. In addition, this requires that a dataset axis is created in the target, and that an empty value is specified for the property if it is missing from any of the concatenated data sets.
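The effect of the last two actions on a scalar property can be sketched in plain Julia. The helpers `last_value` and `collect_axis` are hypothetical and only mimic the semantics described above; here `nothing` marks a data set that lacks the property, and the values are made up.

```julia
# Hypothetical helper mimicking `LastValue`: take the value from the last
# data set that actually has one.
function last_value(values::AbstractVector)
    index = findlast(!isnothing, values)
    return index === nothing ? nothing : values[index]
end

# Hypothetical helper mimicking `CollectAxis` for a scalar: one entry per data set
# (that is, a vector along the dataset axis), using the empty value where missing.
function collect_axis(values::AbstractVector, empty_value)
    return [value === nothing ? empty_value : value for value in values]
end

# A scalar "version" property as found in four concatenated data sets.
versions = ["1.0", nothing, "1.2", nothing]

merged_last = last_value(versions)
merged_collected = collect_axis(versions, "unknown")
```

The same idea extends one dimension up: collecting a vector property of some other axis from every data set yields a matrix, which is why a matrix property cannot be collected.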