Reorder
DataAxesFormats.Reorder
—
Module
Reordering axis entries across one or more
Daf
writers with on-disk backup for crash recovery.
Why reorder axes? In a word, performance. Take, for example, the gene and cell axes of single-cell RNA sequencing data. Most genes are not interesting: a small subset will typically be accessed repeatedly, and the rest possibly not at all. If you order the gene axis such that the "interesting" genes come first, then when accessing column-major memory-mapped data where genes are rows, you will bring fewer pages into memory than if the interesting genes were scattered randomly along their axis. The same logic applies to the cells axis: if the cells of the same metacell are adjacent in this axis, then accessing them when cells are rows touches fewer pages.
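To make the paging argument concrete, here is a minimal sketch that counts how many pages are touched when reading the interesting genes of a single column. It assumes a 4096-byte page, `Float32` elements, and a column that starts on a page boundary; `pages_touched` is a hypothetical helper written for this illustration, not part of the package.

```julia
# Count the distinct pages touched when reading the given rows of one column
# of a column-major matrix (the column is assumed to start on a page boundary).
function pages_touched(
    rows::AbstractVector{<:Integer};
    page_bytes::Integer = 4096,               # assumed page size
    element_bytes::Integer = sizeof(Float32), # assumed element type
)::Int
    # The byte offset of row `row` within the column is (row - 1) * element_bytes.
    pages = Set((row - 1) * element_bytes ÷ page_bytes for row in rows)
    return length(pages)
end

n_genes = 32_768
n_interesting = 1_024

clustered = 1:n_interesting                      # interesting genes placed first
scattered = 1:(n_genes ÷ n_interesting):n_genes  # same count, spread evenly

println("clustered: ", pages_touched(clustered), " page(s) per column")
println("scattered: ", pages_touched(scattered), " page(s) per column")
```

Under these assumptions the clustered order touches a single page per column, while the scattered order touches every page of the column.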
Another use case is training, where random subsets of the data are accessed. Randomizing the order of the axes in advance on disk, and then accessing contiguous ranges of entries, greatly reduces the number of pages that must be brought in from disk. This works even better when using a 2D chunked format. TODO: Support such formats well in
Daf
.
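The training use case can be sketched as follows, using only the `Random` standard library. The sizes and the fixed seed are arbitrary choices for the illustration. The point is that once a random permutation has been baked into the stored order, every contiguous range of the reordered axis is a random subset of the original entries.

```julia
using Random

rng = MersenneTwister(1)   # fixed seed, only to make the sketch reproducible
n_cells = 1_000
batch_size = 100

# Done once, up front: the random permutation to apply to the stored axis.
permutation = randperm(rng, n_cells)

# Afterwards, each training batch is a contiguous range of the reordered axis.
batches = [start:min(start + batch_size - 1, n_cells) for start in 1:batch_size:n_cells]

# The first contiguous batch corresponds to these (random) original cells.
original_cells_of_first_batch = permutation[batches[1]]
```

Every original cell lands in exactly one batch, so a pass over all the batches still visits the whole data set.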
For large data sets, this can make a dramatic difference in performance. This dovetails with choosing the correct layout of the data used in analysis code. Together, these two considerations can save (or cost) 2-3 orders of magnitude(!) in performance. Of course, other considerations also apply: parallelism, avoiding memory allocation and garbage collection in inner loops, vectorization (SIMD), minimizing the number of passes over the same data, and so on. But these don't hold a candle to getting the layout of the data right, both on disk and in memory.
DataAxesFormats.Reorder.reorder_axes!
—
Function
reorder_axes!(
daf::DafWriter,
axes_permutations::AbstractDict{<:AbstractString, <:AbstractVector{<:Integer}},
)::Nothing
reorder_axes!(
dafs::AbstractVector{<:DafWriter},
axes_permutations::AbstractDict{<:AbstractString, <:AbstractVector{<:Integer}},
)::Nothing
Reorder the entries of one or more axes in one or more leaf
DafWriter
s. Each value in
axes_permutations
is a permutation vector:
new_entries[i] = old_entries[permutation[i]]
. All writers that contain a permuted axis must have identical entry lists for that axis.
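The permutation convention can be illustrated with plain Julia arrays. The entry names and values below are made up. The sketch shows only the indexing rule stated above, not how the package applies it to stored data.

```julia
old_entries = ["A", "B", "C", "D"]
old_vector = [10, 20, 30, 40]   # one value per entry of the axis
permutation = [3, 1, 4, 2]

# A valid permutation mentions each index 1:n exactly once.
is_valid = isperm(permutation)

# new_entries[i] = old_entries[permutation[i]]
new_entries = old_entries[permutation]   # ["C", "A", "D", "B"]
new_vector = old_vector[permutation]     # [30, 10, 40, 20]

# A matrix whose rows use this axis is permuted along its rows only.
old_matrix = [1 2; 3 4; 5 6; 7 8]
new_matrix = old_matrix[permutation, :]  # rows 3, 1, 4, 2 of the original
```

Each value stays attached to its entry name: `"C"` had the value `30` before the reorder and still has it afterwards.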
The operation is crash-safe: a backup is created before any data is modified, and a lock marker is written so that a subsequent call can detect and roll back a partially-applied reorder.
When multiple writers share the same backing store (e.g. several H5df groups in one HDF5 file), pass them all in a single call so that the backup and lock are coordinated correctly.
DataAxesFormats.Reorder.reset_reorder_axes!
—
Function
reset_reorder_axes!(
daf::DafWriter,
)::Bool
reset_reorder_axes!(
dafs::AbstractVector{<:DafWriter},
)::Bool
Roll back a partially-applied reorder on one or more leaf
DafWriter
s. This is the
only
way to recover from a crash during
reorder_axes!
. Returns
true
if any writer had a pending reorder that was rolled back.
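The general lock, backup, replace, cleanup pattern, together with the matching rollback, can be sketched with ordinary files. Everything in this sketch is hypothetical: the file names, the helper functions, and the `crash` flag are invented for illustration and say nothing about how any actual storage format records its lock or backup.

```julia
# Hypothetical file-based illustration of a crash-safe reorder of text lines.
function guarded_reorder!(dir::AbstractString, permutation::Vector{Int}; crash::Bool = false)::Nothing
    data_path = joinpath(dir, "data.txt")
    backup_path = joinpath(dir, "data.backup")
    lock_path = joinpath(dir, "reorder.lock")

    touch(lock_path)                                  # 1. mark the reorder as pending
    cp(data_path, backup_path; force = true)          # 2. back up before modifying
    lines = readlines(data_path)
    write(data_path, join(lines[permutation], "\n"))  # 3. replace the live data
    crash && error("simulated crash")                 # marker and backup stay behind
    rm(backup_path)                                   # 4. clean up on success
    rm(lock_path)
    return nothing
end

# Returns true if a pending reorder was found and rolled back.
function reset_guarded_reorder!(dir::AbstractString)::Bool
    lock_path = joinpath(dir, "reorder.lock")
    backup_path = joinpath(dir, "data.backup")
    isfile(lock_path) || return false                 # nothing pending
    if isfile(backup_path)
        mv(backup_path, joinpath(dir, "data.txt"); force = true)  # restore
    end
    rm(lock_path)
    return true
end

dir = mktempdir()
write(joinpath(dir, "data.txt"), "a\nb\nc")
try
    guarded_reorder!(dir, [3, 1, 2]; crash = true)
catch
end
rolled_back = reset_guarded_reorder!(dir)
restored = readlines(joinpath(dir, "data.txt"))
```

After the simulated crash, the reset restores the original order and reports that it did work. A second reset finds nothing pending and returns `false`.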
DataAxesFormats.Reorder.FormatReorderPlan
—
Type
struct FormatReorderPlan
planned_axes::AbstractDict{<:AbstractString, PlannedAxis}
planned_vectors::Vector{PlannedVector}
planned_matrices::Vector{PlannedMatrix}
end
Enumerates every property that will be rewritten when the given axes are permuted in a single
FormatWriter
. Produced by
build_reorder_plan
and consumed by
format_replace_reorder!
,
format_cleanup_reorder!
, and
format_reset_reorder!
. The orchestrator derives replacement progress totals by summing
n_replacement_elements
across
planned_vectors
and
planned_matrices
.
DataAxesFormats.Reorder.PlannedAxis
—
Type
struct PlannedAxis
permutation::AbstractVector{<:Integer}
inverse_permutation::AbstractVector{<:Integer}
new_entries::AbstractVector{<:AbstractString}
end
Per-axis data carried by a
FormatReorderPlan
: the forward and inverse index permutations, and the already-permuted entry list. The orchestrator computes
new_entries
once per axis and shares it across all writers, so no format has to re-permute the axis entry names.
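The relationship between the forward and inverse permutations can be checked with Base's `invperm`. The sparse-index remapping at the end is one plausible use of the inverse permutation, shown as an assumption rather than as a description of what any format does.

```julia
permutation = [3, 1, 4, 2]
inverse_permutation = invperm(permutation)   # [2, 4, 1, 3]

old_entries = ["A", "B", "C", "D"]
new_entries = old_entries[permutation]       # ["C", "A", "D", "B"]

# permutation[i] is the old position of the entry now at position i;
# inverse_permutation[j] is the new position of the entry that was at position j.
round_trip = new_entries[inverse_permutation]

# Assumed use: remap the stored indices of a sparse vector's nonzeros.
old_nonzero_indices = [1, 4]                                      # "A" and "D"
new_nonzero_indices = inverse_permutation[old_nonzero_indices]    # [2, 3]
```

Applying the inverse permutation to the reordered entries recovers the original order, and the remapped indices still point at `"A"` and `"D"`.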
DataAxesFormats.Reorder.PlannedVector
—
Type
struct PlannedVector
axis::AbstractString
name::AbstractString
n_replacement_elements::Int
end
One entry in a
FormatReorderPlan
identifying a vector that will be rewritten when its axis is permuted.
n_replacement_elements
is the number of progress ticks the replacement phase will produce for this vector (the dense length for dense and sparse vectors alike).
DataAxesFormats.Reorder.PlannedMatrix
—
Type
struct PlannedMatrix
rows_axis::AbstractString
columns_axis::AbstractString
name::AbstractString
n_replacement_elements::Int
end
One entry in a
FormatReorderPlan
identifying a matrix that will be rewritten when one or both of its axes are permuted.
n_replacement_elements
is the number of progress ticks the replacement phase will produce for this matrix (
n_rows * n_columns
for dense matrices,
nnz
for sparse matrices).
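The tick-counting rules for vectors and matrices can be sketched with the `SparseArrays` standard library. `replacement_ticks` is a hypothetical helper that mirrors the documented rules; it is not a function of the package.

```julia
using SparseArrays

# Hypothetical helper mirroring the documented counting rules.
replacement_ticks(matrix::AbstractMatrix) = length(matrix)   # n_rows * n_columns
replacement_ticks(matrix::SparseMatrixCSC) = nnz(matrix)     # stored entries only
replacement_ticks(vector::AbstractVector) = length(vector)   # dense length, even if sparse

dense_matrix = zeros(Float32, 4, 3)
sparse_matrix = sparse([1, 3], [1, 2], [1.0, 2.0], 4, 3)
sparse_vector = sparsevec([2], [5.0], 4)

# A progress total is the sum over all the planned vectors and matrices.
total_ticks =
    replacement_ticks(dense_matrix) + replacement_ticks(sparse_matrix) + replacement_ticks(sparse_vector)
```

Here the dense matrix contributes 12 ticks, the sparse matrix 2, and the sparse vector 4, for a total of 18.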
DataAxesFormats.Reorder.build_reorder_plan
—
Function
build_reorder_plan(
writer::FormatWriter,
planned_axes::AbstractDict{<:AbstractString, PlannedAxis},
)::FormatReorderPlan
Walk
writer
's axes, vectors, and matrices and return a
FormatReorderPlan
enumerating every property that will be rewritten when the given axes are permuted. Must be called inside a write lock. Permuted axes that don't exist in
writer
are ignored; matrices whose rows and columns axes are both unpermuted are skipped.
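The selection rule for matrices can be sketched with plain sets and tuples. The axis and matrix names are invented, and the real plan holds richer objects than these tuples. The sketch shows only the filtering logic described above.

```julia
permuted_axes = Set(["gene", "cell", "batch"])   # "batch" does not exist in this writer
writer_axes = Set(["gene", "cell", "metacell"])

# (rows_axis, columns_axis, name) for each matrix in the hypothetical writer.
writer_matrices = [
    ("cell", "gene", "UMIs"),
    ("metacell", "metacell", "distances"),
    ("metacell", "gene", "fraction"),
]

# Permuted axes that the writer does not contain are ignored.
relevant_axes = intersect(permuted_axes, writer_axes)

# A matrix is planned if at least one of its axes is permuted.
planned_matrices = [
    matrix for matrix in writer_matrices if matrix[1] in relevant_axes || matrix[2] in relevant_axes
]
```

Only `UMIs` and `fraction` are planned. `distances` is skipped because neither of its axes is permuted.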
DataAxesFormats.Reorder.format_lock_reorder!
—
Function
format_lock_reorder!(writer::FormatWriter, operation_id::AbstractString)::Nothing
Fast, atomic: claim the reorder lock on
writer
so subsequent
build_reorder_plan
,
format_replace_reorder!
, and
format_cleanup_reorder!
calls have exclusive rights to the reorder backup state. Must be called inside a write lock, and only when
format_has_reorder_lock
returns
false
.
operation_id
is an opaque token (typically a UUID) generated once per reorder batch by the orchestrator. For formats where multiple writers can share the same backing store (e.g. H5df), the implementation verifies that any pre-existing lock entries carry the
same
operation_id
; if a foreign operation is detected, an error is raised. This prevents two independent reorders from silently interfering with each other's backup state.
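The operation-id check can be sketched with a dictionary standing in for a shared backing store. `claim_lock!`, the `locks` dictionary, and the group names are hypothetical. Only `uuid4` (from the `UUIDs` standard library) is a real API here.

```julia
using UUIDs

# Hypothetical stand-in for a shared store: group name => operation id of its lock.
function claim_lock!(
    locks::Dict{String, String},
    group::AbstractString,
    operation_id::AbstractString,
)::Nothing
    for (other_group, other_id) in locks
        if other_id != operation_id
            error("group $(other_group) is locked by a different reorder: $(other_id)")
        end
    end
    locks[group] = operation_id
    return nothing
end

locks = Dict{String, String}()
operation_id = string(uuid4())   # generated once per reorder batch

claim_lock!(locks, "first", operation_id)
claim_lock!(locks, "second", operation_id)        # same batch: allowed

foreign_rejected = try
    claim_lock!(locks, "third", string(uuid4()))  # different batch: rejected
    false
catch
    true
end
```

Both groups of the same batch acquire the lock, while a claim carrying a different operation id raises an error and leaves the existing locks untouched.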
DataAxesFormats.Reorder.format_backup_reorder!
—
Function
format_backup_reorder!(writer::FormatWriter, plan::FormatReorderPlan)::Nothing
Fast: save a backup of every property listed in
plan
so that
format_reset_reorder!
can restore the pre-reorder state. Must be called after
format_lock_reorder!
and
build_reorder_plan
but before
format_replace_reorder!
.
DataAxesFormats.Reorder.format_replace_reorder!
—
Function
format_replace_reorder!(
writer::FormatWriter,
plan::FormatReorderPlan,
replacement_progress::Maybe{Progress},
crash_counter::Maybe{Ref{Int}},
)::Nothing
Replace the live data in
writer
with the reordered versions described by
plan
, ticking
replacement_progress
as work completes. Must be called after
format_backup_reorder!
. If
crash_counter
is not
nothing
, it is decremented after each property replacement and a
SimulatedCrash
is thrown when it reaches zero (used for testing crash recovery).
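The crash-counter mechanism can be sketched as follows. The helper names are hypothetical, and a plain `ErrorException` stands in for `SimulatedCrash` so the sketch does not depend on the package.

```julia
# Decrement the counter (if any) and simulate a crash when it reaches zero.
function tick_crash_counter!(crash_counter::Union{Nothing, Ref{Int}})::Nothing
    crash_counter === nothing && return nothing
    crash_counter[] -= 1
    crash_counter[] == 0 && error("simulated crash")   # stand-in for SimulatedCrash
    return nothing
end

function replace_all!(
    properties::Vector{String},
    replaced::Vector{String},
    crash_counter::Union{Nothing, Ref{Int}},
)::Nothing
    for property in properties
        push!(replaced, property)            # stands in for the real replacement
        tick_crash_counter!(crash_counter)   # decremented after each replacement
    end
    return nothing
end

replaced = String[]
crashed = try
    replace_all!(["a", "b", "c"], replaced, Ref(2))
    false
catch exception
    exception isa ErrorException
end
```

With a counter of 2, the first two properties are replaced and the simulated crash fires before the third, leaving a partially-applied state for the recovery tests to roll back.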
DataAxesFormats.Reorder.format_cleanup_reorder!
—
Function
format_cleanup_reorder!(writer::FormatWriter)::Nothing
Remove the backup state created by previous phases. Must be called after
format_replace_reorder!
.
DataAxesFormats.Reorder.format_has_reorder_lock
—
Function
format_has_reorder_lock(writer::FormatWriter)::Bool
Return
true
if and only if a recovery marker from a previously-crashed reorder exists on
writer
.
DataAxesFormats.Reorder.format_reset_reorder!
—
Function
format_reset_reorder!(writer::FormatWriter)::Bool
Roll any pending reorder back to the pre-reorder state. Returns
true
if and only if work was done.
Index
- DataAxesFormats.Reorder
- DataAxesFormats.Reorder.FormatReorderPlan
- DataAxesFormats.Reorder.PlannedAxis
- DataAxesFormats.Reorder.PlannedMatrix
- DataAxesFormats.Reorder.PlannedVector
- DataAxesFormats.Reorder.build_reorder_plan
- DataAxesFormats.Reorder.format_backup_reorder!
- DataAxesFormats.Reorder.format_cleanup_reorder!
- DataAxesFormats.Reorder.format_has_reorder_lock
- DataAxesFormats.Reorder.format_lock_reorder!
- DataAxesFormats.Reorder.format_replace_reorder!
- DataAxesFormats.Reorder.format_reset_reorder!
- DataAxesFormats.Reorder.reorder_axes!
- DataAxesFormats.Reorder.reset_reorder_axes!