Why native?
dafr is a pure R + C++ port of the Julia
DataAxesFormats.jl package. Compared to a Julia-facade
wrapper, the native port offers:
- No JuliaCall copy tax on cross-language boundaries.
- mmap-backed reads for vectors and sparse matrices (no double-buffer).
- OpenMP-parallel query kernels (Sum, Mean, Var, Mode, Quantile, …).
- User-extensible op registry (register_eltwise, register_reduction); a sketch follows this list.
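As a flavour of the registry, the sketch below registers a custom "Range" reduction with a plain-R kernel. The argument names are illustrative assumptions, not the package's documented signature; check the register_reduction help for the real one.

```r
# Hypothetical sketch — argument names are assumptions, not dafr's actual API.
# Register a "Range" reduction (max - min) so it can be used from the query DSL.
register_reduction(
  name = "Range",
  fn   = function(x) max(x) - min(x)
)
```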
Mmap readers
When a FilesDaf is opened read-only, dense vectors and the three
payload arrays of a sparse matrix (colptr,
rowval, nzval) are mmap’d — the
OS maps the on-disk files directly into process memory with no copy.
```r
fd <- files_daf("/path/to/daf", mode = "r")
x <- get_vector(fd, "cell", "donor")        # mmap'd — no allocation
m <- get_matrix(fd, "cell", "gene", "UMIs") # mmap'd dgCMatrix
```

Low-level mmap constructors are also exported for advanced uses:
mmap_dgCMatrix, mmap_int, mmap_lgl, mmap_real.
Parallel kernels
Reductions above a size threshold dispatch to OpenMP. The threshold
is controlled via options(dafr.kernel_threshold = N) where
N is the minimum element count for parallel execution.
```r
options(dafr.kernel_threshold = 1e5)  # default 1e6
```

Thread count honours set_num_threads() (or the
dafr.num_threads option). CRAN-like environments are
auto-capped to 2 cores on package load.
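Putting the two knobs together in a session (a usage sketch based on the option and set_num_threads() described above):

```r
# Parallelise reductions once they touch at least 5e5 elements...
options(dafr.kernel_threshold = 5e5)

# ...and cap the OpenMP worker count for this session.
set_num_threads(4L)   # alternatively: options(dafr.num_threads = 4)
```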
Performance vs DataAxesFormats.jl
The package is benchmarked against the reference Julia implementation
(DataAxesFormats.jl) — native Julia in-process, with no R
or wrapper layer — using a shared fixture set. Fixtures hash-match
across runners, queries are driven from the same catalog, and each
iteration invalidates the per-query cache so timings reflect real
compute rather than a cache-lookup ceiling.
Setup (2026-04-22): dafr commit
d6d9a14, R 4.4.1 with MKL 2024.1 (ILP64),
OMP_NUM_THREADS=1; Julia 1.12.5, OpenBLAS-64, threads =
default. Linux x86_64. 79 queries across four fixtures:
big_sparse (dense kernel sweeps on a large CSC matrix),
cells_daf (51 queries covering every phrase in the string
DSL), chain_triple (read paths through a three-layer
chain), mmap_reopen (filesystem open + single read).
Reading the tables. In every table
Ratio = dafr / julia (lower is better for dafr) and
Speedup = julia / dafr (higher is better for dafr). A
speedup of 10× means dafr is ten times faster than
Julia.
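In code, the two derived columns are just reciprocal ratios of the per-runner medians; using the kernel_var_row row from the first table below:

```r
# Per-runner medians for one query, in seconds (kernel_var_row).
dafr_median  <- 0.158
julia_median <- 5.86

ratio   <- dafr_median  / julia_median   # ~0.03× — lower is better for dafr
speedup <- julia_median / dafr_median    # ~37×  — higher is better for dafr
```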
Headline. Of the 79 benchmarked queries, dafr is faster than Julia on 56, at parity on 12, and slower on 11 (of which 4 exceed a 1.5× threshold — all on the filesystem-open-then-immediately-read micro-test). On the heavy workloads that dominate real pipelines (sparse-matrix reductions, grouped reductions), dafr wins by 2×–37×.
Heavy kernels on a large sparse matrix
DSL-driven reductions over a CSC matrix (big_sparse, 10
000 × 10 000, 5 % density, ~5 M nonzeros). Both runners measure the full
query path — parse + axis lookup + kernel + result wrap — with the
per-query cache invalidated each iteration.
OMP_NUM_THREADS=1, so threading is not in play.
| Query | dafr median | Julia median | Ratio (R/J) | Speedup (J/R) |
|---|---|---|---|---|
| kernel_var_row | 158 ms | 5.86 s | 0.03× | 37× |
| kernel_std_row | 159 ms | 5.85 s | 0.03× | 37× |
| kernel_mean_row | 153 ms | 2.83 s | 0.05× | 18× |
| kernel_max_row | 158 ms | 2.90 s | 0.05× | 18× |
| kernel_median_row | 190 ms | 3.50 s | 0.05× | 18× |
| kernel_quantile_row | 194 ms | 3.16 s | 0.06× | 16× |
| kernel_geomean_row | 226 ms | 3.45 s | 0.07× | 15× |
| kernel_sum_row | 184 ms | 2.73 s | 0.07× | 15× |
| kernel_mode_row | 872 ms | 6.31 s | 0.14× | 7× |
| kernel_max_col | 154 ms | 683 ms | 0.23× | 4× |
| kernel_sum_col | 162 ms | 713 ms | 0.23× | 4× |
| kernel_mean_col | 161 ms | 643 ms | 0.25× | 4× |
What’s actually being compared. These are wall-time
numbers for the full DSL query, not for the underlying linear algebra.
Both packages wrap a query parser, an axis-resolution layer, and a
result-wrapping layer around a numeric kernel. Profiling shows that most
of the visible time on this fixture is in those wrappers, not in the
kernel itself —
e.g. get_query(daf, "@ row @ col :: value >- Sum") in
DataAxesFormats.jl spends ~99.9 % of its 652 ms in
framework code; the underlying
sum(::SparseMatrixCSC; dims=1) returns in ~0.5 ms. dafr’s
analogous path is ~4× leaner per query, but dafr is also far above what
a tight C++ kernel needs in absolute terms — both runners pay a
substantial per-query tax.
The wins above therefore have two distinct sources, depending on the row of the table:
- Column reductions (*_col, ~4×). Pure framework win. Raw Julia sum(A; dims=1) on the underlying SparseMatrixCSC is actually much faster than dafr’s column-sum kernel; DataAxesFormats.jl forfeits that fast path by routing every reduction through a generic per-column mapfoldl for uniformity across operations. The 4× gap is the cost of that generic path, not a kernel-quality difference.
- “Cheap” row reductions on CSC (sum_row, mean_row, max_row, var_row, std_row; 15–37×). Same story, amplified. SparseArrays.jl has a specialized single-pass dims=2 path that outperforms dafr’s C++ kernel head-to-head; again, DataAxesFormats.jl’s generic reduction wrapper does not use it, and the wrapper happens to be especially expensive in this direction. Most of the apparent dafr advantage here is DataAxesFormats.jl framework cost, not a dafr kernel advantage.
- “Hard” row reductions on CSC (median_row, mode_row, quantile_row, geomean_row; 7–18×). Genuine kernel win. SparseArrays.jl has no specialized dims=2 path for these, and the obvious idiomatic Julia approach (mapslices(median, A; dims=2)) densifies every row — that alone takes ~3.5 s and allocates ~1.5 GiB on this fixture. dafr ships hand-rolled CSC-row kernels (kernel_quantile_csc, kernel_mode_csc, kernel_geomean_csc) that scan nonzeros once without densifying; a rough R sketch of the idea follows this list. This is the only part of the table where dafr’s kernel itself, rather than its framework, is the source of the speedup.
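For intuition about the no-densify trick, here is a rough pure-R analogue (illustrative only; dafr's actual kernels are C++, and the helper name row_median_csc is invented for this sketch). Per row it combines the sorted nonzeros with the count of implicit zeros instead of materialising the dense row:

```r
library(Matrix)

# Illustrative only: row-wise median of a dgCMatrix without densifying each row.
row_median_csc <- function(m) {
  stopifnot(inherits(m, "dgCMatrix"))
  n_col  <- ncol(m)
  # Group the nonzero values by (1-based) row index in one pass over @x / @i.
  by_row <- split(m@x, factor(m@i + 1L, levels = seq_len(nrow(m))))
  vapply(by_row, function(nz) {
    sorted <- sort(nz)
    n_neg  <- sum(sorted < 0)          # nonzeros that sort before the implicit zeros
    n_zero <- n_col - length(sorted)   # implicit zeros in this row
    # Value at position p of the fully sorted row: negatives, then zeros, then positives.
    at <- function(p) {
      if (p <= n_neg) sorted[p]
      else if (p <= n_neg + n_zero) 0
      else sorted[p - n_zero]
    }
    lo <- (n_col + 1L) %/% 2L
    hi <- (n_col + 2L) %/% 2L
    (at(lo) + at(hi)) / 2
  }, numeric(1))
}

# Example: row medians of a 1,000 x 1,000 sparse matrix at 5 % density.
med <- row_median_csc(rsparsematrix(1000, 1000, density = 0.05))
```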
The takeaway: dafr is a faster DSL runtime than
DataAxesFormats.jl on every row of this table, but on most
of them (*_col plus the cheap row reductions) the
underlying linear algebra in either package is already very fast. dafr’s
specialized C++ reduction kernels matter mostly for the harder row-wise
statistics on sparse data
(median/mode/quantile/geomean).
Grouped reductions
Same fixture, grouping a vector into bins before reducing:
| Query | dafr median | Julia median | Ratio (R/J) | Speedup (J/R) |
|---|---|---|---|---|
| grouped_g2_max_100 | 113 ms | 2.94 s | 0.04× | 26× |
| grouped_g2_mean_100 | 215 ms | 3.16 s | 0.07× | 15× |
| grouped_g2_mean_1000 | 287 ms | 3.87 s | 0.07× | 13× |
| grouped_g2_sum_100 | 215 ms | 2.88 s | 0.07× | 13× |
| grouped_g3_mean_100 | 169 ms | 2.11 s | 0.08× | 12× |
| grouped_g3_sum_100 | 172 ms | 1.99 s | 0.09× | 12× |
| grouped_g3_max_100 | 192 ms | 2.07 s | 0.09× | 11× |
| grouped_g3_mean_1000 | 1.07 s | 2.04 s | 0.52× | 2× |
String-DSL queries on a metacell-scale store
51 queries covering every phrase in the query language
(cells_daf, a subset of example_cells_daf()).
Times are sub-millisecond to low-millisecond for both runners; dafr is
at parity or faster on most, modestly slower on a handful of axis-mask /
lookup-composition phrases:
| Summary | Count |
|---|---|
| dafr faster (ratio < 0.9×) | 23 |
| parity (0.9× – 1.1×) | 10 |
| dafr slower (ratio > 1.1×) | 18 |
| worst ratio | 1.73× |
No cells_daf query exceeded its 2× threshold.
Chain queries
Reading through a three-daf chain (chain_triple):
| Query | dafr median | Julia median | Ratio (R/J) | Speedup (J/R) |
|---|---|---|---|---|
| chain_read_vector | 694 µs | 1.15 ms | 0.60× | 1.7× |
| chain_read_matrix | 931 µs | 1.09 ms | 0.85× | 1.2× |
| chain_read_scalar | 656 µs | 398 µs | 1.65× | 0.6× |
| chain_reduce | 2.80 ms | 1.67 ms | 1.68× | 0.6× |
Open + first-read latency (only place dafr loses)
On the mmap_reopen fixture — open a FilesDaf and
immediately read one property — dafr is consistently slower than Julia.
All four are within a few milliseconds absolute, but the per-call
overhead is higher:
| Query | dafr median | Julia median | Ratio (R/J) | Speedup (J/R) |
|---|---|---|---|---|
| mmap_open_read_axis | 2.11 ms | 794 µs | 2.66× | 0.38× |
| mmap_open_read_matrix | 4.09 ms | 1.66 ms | 2.46× | 0.41× |
| mmap_open_read_vector | 2.96 ms | 1.54 ms | 1.93× | 0.52× |
| mmap_open_read_scalar | 1.00 ms | 642 µs | 1.56× | 0.64× |
The gap is per-call: JSON descriptor parsing, stats, and
the one-shot mmap setup cost more per file in R than in Julia. For
workloads that open a store once and then stream many reads (the common
case), this cost amortises to zero — see the kernel tables above. For
open-heavy workflows, keep the FilesDaf handle alive across
queries.
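Concretely, the recommended pattern (using only the API shown in the mmap section above) is to open once and reuse the handle:

```r
# Pay the open + JSON-descriptor + mmap-setup cost once...
fd <- files_daf("/path/to/daf", mode = "r")

# ...then stream as many reads as needed against the same handle.
donors <- get_vector(fd, "cell", "donor")
umis   <- get_matrix(fd, "cell", "gene", "UMIs")
```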
Reproducing the numbers
The bake-off harness lives under benchmarks/ (R runner)
and benchmarks/julia/ (Julia runner). Both runners consume
the same query catalog and fixture set and emit CSVs with identical
schemas so the comparison step is a simple join. See
benchmarks/README.md for the full workflow.
reorder_axes() — in-place axis permutation
reorder_axes(daf, axis = perm, ...) permutes the entries
of one or more axes in place, rewriting every vector and matrix that
depends on the axis. It accepts named integer permutations
(new_entries[i] = old_entries[perm[i]]):
```r
d <- memory_daf("demo")
add_axis(d, "cell", c("c1", "c2", "c3"))
set_vector(d, "cell", "score", c(10, 20, 30))
reorder_axes(d, cell = c(3L, 1L, 2L))

axis_vector(d, "cell")          # c("c3", "c1", "c2")
#> [1] "c3" "c1" "c2"

get_vector(d, "cell", "score")  # 30, 10, 20
#> c3 c1 c2
#> 30 10 20
```

The cost is bounded by the data: each dependent vector / matrix is
rewritten exactly once via vectorised R indexing (memory-bound). For a
1k-cell, 1k-gene store with ~10 dense vectors and 2 dense matrices, a
full cell-axis reorder runs in tens of milliseconds on a
single core.
On files_daf, the operation is
crash-recoverable. Before any in-place writes, the
original tree is hardlink-snapshotted into
.reorder.backup/. If the process is killed mid-reorder, the
next writable open (mode = "r+" or "w+")
automatically rolls back from the backup before returning the daf to the
caller. Manual recovery is also exposed via
reset_reorder_axes(daf) but is rarely needed.
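A minimal sketch of the recovery path (paths illustrative; the rollback itself happens automatically on a writable open):

```r
# Reopening writable after an interrupted reorder rolls back from
# .reorder.backup/ before the handle is returned.
d <- files_daf("/path/to/daf", mode = "r+")

# The manual entry point, rarely needed:
reset_reorder_axes(d)
```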
reorder_axes() only operates on leaf storage formats
(memory_daf, files_daf, zarr_daf,
http_daf — though HTTP is read-only). Wrappers
(chain_reader, chain_writer,
view, contractor) are rejected up front;
reorder them via the underlying writer instead.