Why native?

dafr is a pure R + C++ port of the Julia DataAxesFormats.jl package. Compared to a Julia-facade wrapper, the native implementation offers:

  • No JuliaCall copy tax on cross-language boundaries.
  • mmap-backed reads for vectors and sparse matrices (no double-buffer).
  • OpenMP-parallel query kernels (Sum, Mean, Var, Mode, Quantile, …).
  • User-extensible op registry (register_eltwise, register_reduction).
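
For illustration, registering a custom operation might look roughly like this (the exact register_reduction() / register_eltwise() signatures are assumptions here; check their help pages for the real interface):

register_reduction("L2Norm", function(x) sqrt(sum(x ^ 2)))    # hypothetical signature
register_eltwise("Clip01", function(x) pmin(pmax(x, 0), 1))   # hypothetical signature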

Mmap readers

When a FilesDaf is opened read-only, dense vectors and the three payload arrays of a sparse matrix (colptr, rowval, nzval) are mmap’d — the OS maps the on-disk files directly into process memory with no copy.

fd <- files_daf("/path/to/daf", mode = "r")
x  <- get_vector(fd, "cell", "donor")     # mmap'd — no allocation
m  <- get_matrix(fd, "cell", "gene", "UMIs")  # mmap'd dgCMatrix

Low-level mmap constructors are also exported for advanced uses: mmap_dgCMatrix, mmap_int, mmap_lgl, mmap_real.
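
For orientation, those three payload names correspond to the slots of a Matrix dgCMatrix (whatever index base the files use on disk, the R-side object is the usual 0-based dgCMatrix). A small in-memory illustration of that layout, not using mmap at all:

library(Matrix)

# The sparse payload triple maps onto dgCMatrix slots:
#   colptr -> @p, rowval -> @i (stored 0-based in dgCMatrix), nzval -> @x
m <- new("dgCMatrix",
         p   = c(0L, 2L, 3L),        # column pointers (ncol + 1 entries)
         i   = c(0L, 2L, 1L),        # row indices of the nonzeros
         x   = c(1.5, 2.0, 3.25),    # nonzero values
         Dim = c(3L, 2L))
m                                    # 3 x 2 sparse matrix with 3 nonzeros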

Parallel kernels

Reductions above a size threshold dispatch to OpenMP. The threshold is controlled via options(dafr.kernel_threshold = N) where N is the minimum element count for parallel execution.

options(dafr.kernel_threshold = 1e5)   # default 1e6

Thread count honours set_num_threads() (or the dafr.num_threads option). CRAN-like environments are auto-capped to 2 cores on package load.
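
For example, to cap the kernels at four threads for a session (set_num_threads() is assumed here to take the desired core count; the option form is the documented equivalent):

set_num_threads(4)              # assumed to take the desired core count
options(dafr.num_threads = 4)   # equivalent option form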

Performance vs DataAxesFormats.jl

The package is benchmarked against the reference Julia implementation (DataAxesFormats.jl) — native Julia in-process, with no R or wrapper layer — using a shared fixture set. Fixtures hash-match across runners, queries are driven from the same catalog, and each iteration invalidates the per-query cache so timings reflect real compute rather than a cache-lookup ceiling.

Setup (2026-04-22): dafr commit d6d9a14, R 4.4.1 with MKL 2024.1 (ILP64), OMP_NUM_THREADS=1; Julia 1.12.5, OpenBLAS-64, threads = default. Linux x86_64. 79 queries across four fixtures: big_sparse (dense kernel sweeps on a large CSC matrix), cells_daf (51 queries covering every phrase in the string DSL), chain_triple (read paths through a three-layer chain), mmap_reopen (filesystem open + single read).

Reading the tables. In every table Ratio = dafr / julia (lower is better for dafr) and Speedup = julia / dafr (higher is better for dafr). A speedup of 10× means dafr is ten times faster than Julia.

Headline. Of the 79 benchmarked queries, dafr is faster than Julia on 56, at parity on 12, and slower on 11 (of which 4 exceed a 1.5× threshold — all on the filesystem-open-then-immediately-read micro-test). On the heavy workloads that dominate real pipelines (sparse-matrix reductions, grouped reductions), dafr wins by 2×–37×.

Heavy kernels on a large sparse matrix

DSL-driven reductions over a CSC matrix (big_sparse, 10 000 × 10 000, 5 % density, ~5 M nonzeros). Both runners measure the full query path — parse + axis lookup + kernel + result wrap — with the per-query cache invalidated each iteration. OMP_NUM_THREADS=1, so threading is not in play.

Query                 dafr median   Julia median   Ratio (R/J)   Speedup (J/R)
kernel_var_row        158 ms        5.86 s         0.03×         37×
kernel_std_row        159 ms        5.85 s         0.03×         37×
kernel_mean_row       153 ms        2.83 s         0.05×         18×
kernel_max_row        158 ms        2.90 s         0.05×         18×
kernel_median_row     190 ms        3.50 s         0.05×         18×
kernel_quantile_row   194 ms        3.16 s         0.06×         16×
kernel_geomean_row    226 ms        3.45 s         0.07×         15×
kernel_sum_row        184 ms        2.73 s         0.07×         15×
kernel_mode_row       872 ms        6.31 s         0.14×         7.2×
kernel_max_col        154 ms        683 ms         0.23×         4.4×
kernel_sum_col        162 ms        713 ms         0.23×         4.4×
kernel_mean_col       161 ms        643 ms         0.25×         4.0×

What’s actually being compared. These are wall-time numbers for the full DSL query, not for the underlying linear algebra. Both packages wrap a query parser, an axis-resolution layer, and a result-wrapping layer around a numeric kernel. Profiling shows that most of the visible time on this fixture is spent in those wrappers, not in the kernel itself — e.g. get_query(daf, "@ row @ col :: value >- Sum") in DataAxesFormats.jl spends ~99.9 % of its 652 ms in framework code; the underlying sum(::SparseMatrixCSC; dims=1) returns in ~0.5 ms. dafr’s analogous path is ~4× leaner per query, but its absolute per-query time is still far above what a tight C++ kernel needs — both runners pay a substantial per-query tax.

The wins above therefore come from two distinct sources, framework overhead and kernel quality, and which one dominates depends on the row of the table:

  • Column reductions (*_col, ~4×). Pure framework win. Raw Julia sum(A; dims=1) on the underlying SparseMatrixCSC is actually much faster than dafr’s column-sum kernel; DataAxesFormats.jl forfeits that fast path by routing every reduction through a generic per-column mapfoldl for uniformity across operations. The 4× gap is the cost of that generic path, not a kernel-quality difference.

  • “Cheap” row reductions on CSC (sum_row, mean_row, max_row, var_row, std_row; 15–37×). Same story, amplified. SparseArrays.jl has a specialized single-pass dims=2 path that outperforms dafr’s C++ kernel head-to-head; again, DataAxesFormats.jl’s generic reduction wrapper does not use it, and the wrapper happens to be especially expensive on this direction. Most of the apparent dafr advantage here is DataAxesFormats.jl framework cost, not a dafr kernel advantage.

  • “Hard” row reductions on CSC (median_row, mode_row, quantile_row, geomean_row; 7–18×). Genuine kernel win. SparseArrays.jl has no specialized dims=2 path for these, and the obvious idiomatic Julia approach (mapslices(median, A; dims=2)) densifies every row — that alone takes ~3.5 s and allocates ~1.5 GiB on this fixture. dafr ships hand-rolled CSC-row kernels (kernel_quantile_csc, kernel_mode_csc, kernel_geomean_csc) that scan nonzeros once without densifying. This is the only part of the table where dafr’s kernel itself, rather than its framework, is the source of the speedup.

The takeaway: dafr is a faster DSL runtime than DataAxesFormats.jl on every row of this table, but on most of them (*_col plus the cheap row reductions) the underlying linear algebra in either package is already very fast. dafr’s specialized C++ reduction kernels matter mostly for the harder row-wise statistics on sparse data (median/mode/quantile/geomean).
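
To make the no-densify point concrete, here is a small R sketch of the single-pass idea behind those CSC row kernels. It is illustrative only: the shipped kernels are C++ (names like kernel_quantile_csc refer to those, not to this function). It computes row medians of a dgCMatrix by walking the nonzeros once and accounting for the implicit zeros arithmetically:

library(Matrix)

row_medians_csc <- function(A) {
  stopifnot(inherits(A, "dgCMatrix"))
  n_col  <- ncol(A)
  by_row <- split(A@x, A@i)        # nonzero values grouped by 0-based row index
  out    <- numeric(nrow(A))       # all-zero rows keep the correct median, 0

  # k-th smallest of the multiset {negatives, implicit zeros, positives}
  kth <- function(neg, pos, zeros, k) {
    if (k <= length(neg)) return(neg[k])
    if (k <= length(neg) + zeros) return(0)
    pos[k - length(neg) - zeros]
  }

  for (r in names(by_row)) {
    v     <- by_row[[r]]
    neg   <- sort(v[v < 0])
    pos   <- sort(v[v > 0])
    zeros <- n_col - length(neg) - length(pos)   # implicit (and any stored) zeros
    med <- if (n_col %% 2 == 1) {
      kth(neg, pos, zeros, (n_col + 1L) %/% 2L)
    } else {
      (kth(neg, pos, zeros, n_col %/% 2L) +
       kth(neg, pos, zeros, n_col %/% 2L + 1L)) / 2
    }
    out[as.integer(r) + 1L] <- med
  }
  out
}

# Sanity check against the dense result on a small random matrix
A <- rsparsematrix(200, 300, density = 0.05)
stopifnot(all.equal(row_medians_csc(A), apply(as.matrix(A), 1, median)))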

Grouped reductions

Same fixture, grouping a vector into bins before reducing:

Query                  dafr median   Julia median   Ratio (R/J)   Speedup (J/R)
grouped_g2_max_100     113 ms        2.94 s         0.04×         26×
grouped_g2_mean_100    215 ms        3.16 s         0.07×         15×
grouped_g2_mean_1000   287 ms        3.87 s         0.07×         13×
grouped_g2_sum_100     215 ms        2.88 s         0.07×         13×
grouped_g3_mean_100    169 ms        2.11 s         0.08×         12×
grouped_g3_sum_100     172 ms        1.99 s         0.09×         12×
grouped_g3_max_100     192 ms        2.07 s         0.09×         11×
grouped_g3_mean_1000   1.07 s        2.04 s         0.52×         1.9×
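
Stripped of the DSL, a grouped reduction is an ordinary group-then-reduce. A base-R restatement of the semantics (the grouping vector and sizes are made up, and this dense sketch is not the dafr execution path, which stays sparse):

set.seed(1)
m <- matrix(rnorm(1000 * 50), nrow = 1000)           # stand-in dense matrix
g <- sample(seq_len(100), nrow(m), replace = TRUE)   # hypothetical 100-bin grouping of rows
bin_sums  <- rowsum(m, group = g)                    # 100 x 50: row sums within each bin
bin_means <- bin_sums / as.vector(table(g))          # divide each bin by its size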

String-DSL queries on a metacell-scale store

51 queries covering every phrase in the query language (cells_daf, a subset of example_cells_daf()). Times are sub-millisecond to low-millisecond for both runners; dafr is at parity or faster on most, and modestly slower on the rest, chiefly axis-mask / lookup-composition phrases:

Summary                       Count
dafr faster (ratio < 0.9×)    23
parity (0.9×–1.1×)            10
dafr slower (ratio > 1.1×)    18
worst ratio                   1.73×

No cells_daf query exceeded its 2× threshold.

Chain queries

Reading through a three-daf chain (chain_triple):

Query               dafr median   Julia median   Ratio (R/J)   Speedup (J/R)
chain_read_vector   694 µs        1.15 ms        0.60×         1.7×
chain_read_matrix   931 µs        1.09 ms        0.85×         1.2×
chain_read_scalar   656 µs        398 µs         1.65×         0.6×
chain_reduce        2.80 ms       1.67 ms        1.68×         0.6×
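
For reference, the fixture shape is three stores read through a single chain handle. A hypothetical sketch (the argument form of chain_reader() is an assumption; files_daf(), memory_daf() and get_vector() are as documented above):

base  <- files_daf("/path/to/base",  mode = "r")
delta <- files_daf("/path/to/delta", mode = "r")
top   <- memory_daf("overrides")
ch    <- chain_reader(list(base, delta, top))   # hypothetical argument form
v     <- get_vector(ch, "cell", "donor")        # resolved through all three layers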

Open + first-read latency (only place dafr loses)

On the mmap_reopen fixture — open a FilesDaf and immediately read one property — dafr is consistently slower than Julia. All four are within a few milliseconds absolute, but the per-call overhead is higher:

Query                   dafr median   Julia median   Ratio (R/J)   Speedup (J/R)
mmap_open_read_axis     2.11 ms       794 µs         2.66×         0.38×
mmap_open_read_matrix   4.09 ms       1.66 ms        2.46×         0.41×
mmap_open_read_vector   2.96 ms       1.54 ms        1.93×         0.52×
mmap_open_read_scalar   1.00 ms       642 µs         1.56×         0.64×

The gap is per-call: JSON descriptor parsing, file stat calls, and the one-shot mmap setup cost more per file in R than in Julia. For workloads that open a store once and then stream many reads (the common case), this cost amortises to zero — see the kernel tables above. For open-heavy workflows, keep the FilesDaf handle alive across queries.
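
In code, that just means hoisting the open out of the loop:

fd <- files_daf("/path/to/daf", mode = "r")   # open (and mmap) once
for (name in c("donor", "batch", "age")) {    # hypothetical property names
  v <- get_vector(fd, "cell", name)           # each read reuses the live handle
  # ... work with v ...
}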

Reproducing the numbers

The bake-off harness lives under benchmarks/ (R runner) and benchmarks/julia/ (Julia runner). Both runners consume the same query catalog and fixture set and emit CSVs with identical schemas so the comparison step is a simple join. See benchmarks/README.md for the full workflow.

reorder_axes() — in-place axis permutation

reorder_axes(daf, axis = perm, ...) permutes the entries of one or more axes in place, rewriting every vector and matrix that depends on those axes. Each permutation is passed as a named integer argument whose name is the axis (new_entries[i] = old_entries[perm[i]]):

d <- memory_daf("demo")
add_axis(d, "cell", c("c1", "c2", "c3"))
set_vector(d, "cell", "score", c(10, 20, 30))
reorder_axes(d, cell = c(3L, 1L, 2L))
axis_vector(d, "cell")     # c("c3", "c1", "c2")
#> [1] "c3" "c1" "c2"
get_vector(d, "cell", "score")  # 30, 10, 20
#> c3 c1 c2 
#> 30 10 20
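
A common follow-on use is sorting an axis by one of its own properties; order() produces exactly the permutation shape reorder_axes() expects. Continuing the example above:

perm <- order(get_vector(d, "cell", "score"), decreasing = TRUE)  # 1L 3L 2L here
reorder_axes(d, cell = perm)
get_vector(d, "cell", "score")   # 30, 20, 10 (sorted by decreasing score)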

The cost is bounded by the data: each dependent vector / matrix is rewritten exactly once via vectorised R indexing (memory-bound). For a 1k-cell, 1k-gene store with ~10 dense vectors and 2 dense matrices, a full cell-axis reorder runs in tens of milliseconds on a single core.

On files_daf, the operation is crash-recoverable. Before any in-place writes, the original tree is hardlink-snapshotted into .reorder.backup/. If the process is killed mid-reorder, the next writable open (mode = "r+" or "w+") automatically rolls back from the backup before returning the daf to the caller. Manual recovery is also exposed via reset_reorder_axes(daf) but is rarely needed.
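
Explicit recovery, if you ever want it, uses the documented entry point directly:

d <- files_daf("/path/to/daf", mode = "r+")   # writable open: rolls back automatically if needed
reset_reorder_axes(d)                         # or roll back explicitly from .reorder.backup/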

reorder_axes() only operates on leaf storage formats (memory_daf, files_daf, zarr_daf, http_daf — though HTTP is read-only). Wrappers (chain_reader, chain_writer, view, contractor) are rejected up front; reorder them via the underlying writer instead.