Reorder

DataAxesFormats.Reorder Module

Reordering axis entries across one or more Daf writers with on-disk backup for crash recovery.

Why reorder axes? In a word, performance. Take, for example, the gene and cell axes of single-cell RNA sequencing data. Most genes are not interesting: a small subset will typically be accessed repeatedly, and the rest possibly not at all. If you order the gene axis such that the "interesting" genes come first, then accessing column-major memory-mapped data where genes are rows brings fewer pages into memory than if the interesting genes were scattered randomly along the axis. The same logic applies to the cell axis: if the cells of the same metacell are adjacent, then accessing data where cells are rows touches fewer pages.

Another use case is training, where random subsets of the data are accessed. Randomizing the axis order in advance on disk, and then accessing contiguous ranges of entries, greatly reduces the number of pages one brings in from disk. This works even better with a 2D chunked format. TODO: Support such formats well in Daf.

For large data sets, this can make a dramatic difference in performance. This dovetails with choosing the correct layout of the data used in analysis code. Together, these two considerations can save (or cost) 2-3 orders of magnitude(!) in performance. Of course, other considerations also apply - parallelism, avoiding memory allocation and garbage collection in inner loops, vectorization (SIMD), minimizing the number of passes over the same data... But these don't hold a candle to getting the layout of the data right - on disk and in memory.
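The page-count argument can be made concrete with a small back-of-the-envelope model. The sketch below is not part of the package; the helper name pages_touched, the 4096-byte page size, the 4-byte element size, and the assumption that the data starts on a page boundary are all illustrative choices.

```julia
# Count the distinct pages touched when reading the given rows of every
# column of a column-major matrix (hypothetical helper, not a Daf function).
# Assumes the data starts on a page boundary.
function pages_touched(
    rows::AbstractVector{<:Integer},
    n_rows::Integer,
    n_columns::Integer;
    element_bytes::Integer = 4,
    page_bytes::Integer = 4096,
)
    pages = Set{Int}()
    for column in 1:n_columns, row in rows
        byte_offset = ((column - 1) * n_rows + (row - 1)) * element_bytes
        push!(pages, byte_offset ÷ page_bytes)
    end
    return length(pages)
end

n_genes, n_cells = 32_768, 100
first_genes = 1:1024                  # "interesting" genes ordered first
scattered_genes = 1:32:n_genes        # the same number of genes, spread out

contiguous_pages = pages_touched(first_genes, n_genes, n_cells)
scattered_pages = pages_touched(scattered_genes, n_genes, n_cells)
```

In this toy setting each column spans 32 pages. The contiguous selection touches one page per column, while the scattered selection touches all 32, even though both read the same number of values.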

DataAxesFormats.Reorder.reorder_axes! Function
reorder_axes!(
    daf::DafWriter,
    axes_permutations::AbstractDict{<:AbstractString, <:AbstractVector{<:Integer}},
)::Nothing

reorder_axes!(
    dafs::AbstractVector{<:DafWriter},
    axes_permutations::AbstractDict{<:AbstractString, <:AbstractVector{<:Integer}},
)::Nothing

Reorder the entries of one or more axes in one or more leaf DafWriters. Each value in axes_permutations is a permutation vector: new_entries[i] = old_entries[permutation[i]]. All writers that contain a permuted axis must have identical entry lists for that axis.
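As an illustration of the permutation convention, the following sketch uses only Base Julia to build a permutation that moves flagged entries to the front. The gene names and the is_interesting flag are made up for the example; only the relation new_entries[i] = old_entries[permutation[i]] and the dictionary shape come from the signature above.

```julia
old_entries = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"]
is_interesting = [false, true, false, false, true]   # hypothetical per-gene flag

# Sorting the flags in reverse puts `true` first; sortperm breaks ties by
# ascending index, so the original order is kept within each group.
permutation = sortperm(is_interesting; rev = true)
@assert isperm(permutation)

# The documented convention: new_entries[i] = old_entries[permutation[i]].
new_entries = old_entries[permutation]

# The shape of the second argument of reorder_axes! (axis name => permutation).
axes_permutations = Dict("gene" => permutation)
```

Here new_entries starts with the two flagged genes, followed by the rest in their original relative order.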

The operation is crash-safe: a backup is created before any data is modified, and a lock marker is written so that a subsequent call can detect and roll back a partially-applied reorder.

When multiple writers share the same backing store (e.g. several H5df groups in one HDF5 file), pass them all in a single call so that the backup and lock are coordinated correctly.

DataAxesFormats.Reorder.reset_reorder_axes! Function
reset_reorder_axes!(
    daf::DafWriter,
)::Bool

reset_reorder_axes!(
    dafs::AbstractVector{<:DafWriter},
)::Bool

Roll back a partially-applied reorder on one or more leaf DafWriters. This is the only way to recover from a crash during reorder_axes!. Returns true if any writer had a pending reorder that was rolled back.

DataAxesFormats.Reorder.FormatReorderPlan Type
struct FormatReorderPlan
    planned_axes::AbstractDict{<:AbstractString, PlannedAxis}
    planned_vectors::Vector{PlannedVector}
    planned_matrices::Vector{PlannedMatrix}
end

Enumerates every property that will be rewritten when the given axes are permuted in a single FormatWriter. Produced by build_reorder_plan and consumed by format_replace_reorder!, format_cleanup_reorder!, and format_reset_reorder!. The orchestrator derives replacement progress totals by summing n_replacement_elements across planned_vectors and planned_matrices.

DataAxesFormats.Reorder.PlannedAxis Type
struct PlannedAxis
    permutation::AbstractVector{<:Integer}
    inverse_permutation::AbstractVector{<:Integer}
    new_entries::AbstractVector{<:AbstractString}
end

Per-axis data carried by a FormatReorderPlan: the forward and inverse index permutations, and the already-permuted entry list. The orchestrator computes new_entries once per axis and shares it across all writers, so no format has to re-permute the axis entry names.
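The relationship between the three fields can be sketched with Base Julia alone. This is an illustration of the forward and inverse permutations, not the package's code; the entry names and per-entry values are invented.

```julia
old_entries = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"]
permutation = [2, 5, 1, 3, 4]

# Forward direction: position i of the new order holds old entry permutation[i].
new_entries = old_entries[permutation]

# Inverse direction: old entry j ends up at position inverse_permutation[j].
inverse_permutation = invperm(permutation)

# Per-entry data is reordered with the forward permutation, and the inverse
# permutation maps it back to the original order.
old_values = [10, 20, 30, 40, 50]
new_values = old_values[permutation]
restored_values = new_values[inverse_permutation]
```

Precomputing both directions once per axis avoids recomputing invperm for every property that needs to map old positions to new ones.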

DataAxesFormats.Reorder.PlannedVector Type
struct PlannedVector
    axis::AbstractString
    name::AbstractString
    n_replacement_elements::Int
end

One entry in a FormatReorderPlan identifying a vector that will be rewritten when its axis is permuted. n_replacement_elements is the number of progress ticks the replacement phase will produce for this vector (the dense length for dense and sparse vectors alike).

DataAxesFormats.Reorder.PlannedMatrix Type
struct PlannedMatrix
    rows_axis::AbstractString
    columns_axis::AbstractString
    name::AbstractString
    n_replacement_elements::Int
end

One entry in a FormatReorderPlan identifying a matrix that will be rewritten when one or both of its axes are permuted. n_replacement_elements is the number of progress ticks the replacement phase will produce for this matrix (n_rows * n_columns for dense matrices, nnz for sparse matrices).
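The tick counts described for vectors and matrices can be sketched as follows. The function name replacement_ticks is hypothetical; the rules it encodes (dense length for all vectors, n_rows * n_columns for dense matrices, nnz for sparse matrices) are the ones stated in this section, and the final sum mirrors how a progress total would be derived from them.

```julia
using SparseArrays

# Hypothetical helper mirroring the documented n_replacement_elements rules.
replacement_ticks(vector::AbstractVector) = length(vector)     # dense length, even if sparse
replacement_ticks(matrix::AbstractMatrix) = length(matrix)     # n_rows * n_columns
replacement_ticks(matrix::SparseMatrixCSC) = nnz(matrix)       # stored entries only

dense_vector = zeros(Float32, 4)
sparse_vector = sparsevec([2], Float32[7], 4)
dense_matrix = zeros(Float32, 4, 3)
sparse_matrix = sparse([1, 3, 4], [1, 2, 3], Float32[1, 2, 3], 4, 3)

# A progress total is the sum over all planned vectors and matrices.
total_ticks = sum(replacement_ticks, (dense_vector, sparse_vector, dense_matrix, sparse_matrix))
```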

DataAxesFormats.Reorder.build_reorder_plan Function
build_reorder_plan(
    writer::FormatWriter,
    planned_axes::AbstractDict{<:AbstractString, PlannedAxis},
)::FormatReorderPlan

Walk writer's axes, vectors, and matrices and return a FormatReorderPlan enumerating every property that will be rewritten when the given axes are permuted. Must be called inside a write lock. Permuted axes that don't exist in writer are ignored; matrices whose rows and columns axes are both unpermuted are skipped.

DataAxesFormats.Reorder.format_lock_reorder! Function
format_lock_reorder!(writer::FormatWriter, operation_id::AbstractString)::Nothing

Fast, atomic: claim the reorder lock on writer so subsequent build_reorder_plan, format_replace_reorder!, and format_cleanup_reorder! calls have exclusive rights to the reorder backup state. Must be called inside a write lock, and only when format_has_reorder_lock returns false.

operation_id is an opaque token (typically a UUID) generated once per reorder batch by the orchestrator. For formats where multiple writers can share the same backing store (e.g. H5df), the implementation verifies that any pre-existing lock entries carry the same operation_id; if a foreign operation is detected, an error is raised. This prevents two independent reorders from silently interfering with each other's backup state.
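The operation_id check can be pictured with a toy model in which a shared backing store is a dictionary from writer name to the operation that locked it. Everything here (claim_lock!, the dictionary, the group names) is a hypothetical stand-in for illustration; the real formats keep their lock state on disk, and their internals may differ.

```julia
using UUIDs

# Hypothetical stand-in for claiming a lock in a shared backing store:
# refuse if any existing lock entry belongs to a different operation.
function claim_lock!(
    lock_entries::Dict{String, String},
    writer_name::AbstractString,
    operation_id::AbstractString,
)
    for (other_writer, other_id) in lock_entries
        if other_id != operation_id
            error("foreign reorder operation $(other_id) holds a lock for $(other_writer)")
        end
    end
    lock_entries[writer_name] = operation_id
    return nothing
end

lock_entries = Dict{String, String}()
operation_id = string(uuid4())       # generated once per reorder batch

# Two writers in the same batch share the token, so both claims succeed.
claim_lock!(lock_entries, "group_a", operation_id)
claim_lock!(lock_entries, "group_b", operation_id)

# An independent reorder carries a different token and is rejected.
foreign_rejected = try
    claim_lock!(lock_entries, "group_c", string(uuid4()))
    false
catch exception
    exception isa ErrorException
end
```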

DataAxesFormats.Reorder.format_replace_reorder! Function
format_replace_reorder!(
    writer::FormatWriter,
    plan::FormatReorderPlan,
    replacement_progress::Maybe{Progress},
    crash_counter::Maybe{Ref{Int}},
)::Nothing

Replace the live data in writer with the reordered versions described by plan, ticking replacement_progress as work completes. Must be called after format_backup_reorder!. If crash_counter is not nothing, it is decremented after each property replacement and a SimulatedCrash is thrown when it reaches zero (used for testing crash recovery).
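The crash_counter mechanism can be sketched in isolation. The exception type SimulatedCrashSketch, the helper tick_crash_counter!, and the property names are stand-ins invented for this example; only the behavior (decrement after each replacement, throw on reaching zero, do nothing when the counter is nothing) follows the description above.

```julia
# Stand-in for the package's SimulatedCrash exception.
struct SimulatedCrashSketch <: Exception end

# Decrement the counter after one property replacement; throw when it hits zero.
function tick_crash_counter!(crash_counter::Union{Nothing, Ref{Int}})
    crash_counter === nothing && return nothing
    crash_counter[] -= 1
    crash_counter[] == 0 && throw(SimulatedCrashSketch())
    return nothing
end

properties = ["age", "batch", "UMIs", "type", "donor"]   # hypothetical property names
replaced = String[]
crash_counter = Ref(3)

crashed = try
    for name in properties
        push!(replaced, name)                # stands in for replacing one property
        tick_crash_counter!(crash_counter)
    end
    false
catch exception
    exception isa SimulatedCrashSketch
end
```

With a counter of 3, the loop stops after the third replacement, leaving the remaining properties untouched. That is exactly the partially-applied state a recovery test needs to exercise.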
