dafr is a native R + C++ implementation of the DataAxesFormats (DAF) data model: a uniform, self-describing container for 1D and 2D data arranged along a set of named axes. Think of it as a generalisation of AnnData with scalars, per-axis vectors, per-axis-pair matrices, disk-backed persistence, and a composable query DSL.
This package is a pure R + C++ port of the Julia reference implementation - no Julia install is required.
Installation
# Install from GitHub:
remotes::install_github("tanaylab/dafr")
# Or from a local checkout:
# R CMD INSTALL /path/to/dafr
This package has no Julia dependency. It is pure R + C++.
Native advantages
- No Julia install required - pure R + C++ (no JuliaCall copy tax).
- Mmap-backed reads via mmap_dgCMatrix / mmap_int / mmap_lgl / mmap_real.
- OpenMP-parallel kernels for Sum / Mean / Var / Mode / Quantile / GeoMean.
- register_eltwise / register_reduction for user-defined query ops.
- Pipe-chain builders: daf[Axis("cell") |> LookupVector("age") |> IsGreater(2)].
- AnnData interop via DafAnnData + h5ad_as_daf / daf_as_h5ad.
Usage
library(dafr)
#>
#> Attaching package: 'dafr'
#> The following object is masked from 'package:graphics':
#>
#> Axis
# Create a DAF object in memory
daf <- memory_daf("example")
# Add axes
add_axis(daf, "obs", c("cell1", "cell2", "cell3"))
add_axis(daf, "var", c("gene1", "gene2"))
# Add vector data
set_vector(daf, "obs", "score", c(0.1, 0.5, 0.9))
# Add matrix data
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
set_matrix(daf, "obs", "var", "counts", mat)
# Access data
get_vector(daf, "obs", "score")
#> cell1 cell2 cell3
#> 0.1 0.5 0.9
get_matrix(daf, "obs", "var", "counts")
#> gene1 gene2
#> cell1 1 4
#> cell2 2 5
#> cell3 3 6
# Get a dataframe view
get_dataframe(daf, "obs")
#> score
#> cell1 0.1
#> cell2 0.5
#> cell3 0.9
Querying Data
dafr provides two equivalent query interfaces: a compact string DSL and pipe-composable builders. The [ operator on any DafReader accepts either form.
# Create a sample dataset
daf <- memory_daf("query_examples")
add_axis(daf, "cell", c("A", "B", "C"))
add_axis(daf, "gene", c("X", "Y", "Z"))
set_vector(daf, "cell", "age", c(10, 50, 70))
set_vector(daf, "cell", "type", c("T", "B", "NK"))
set_scalar(daf, "version", "1.0")
set_matrix(daf, "cell", "gene", "counts", matrix(1:9, nrow = 3, ncol = 3))
# Scalar
daf[". version"]
#> [1] "1.0"
# Axes
daf["@ ?"]
#> [1] "cell" "gene"
daf["@ cell : ?"]
#> [1] "age" "type"
# Vector data
daf["@ cell : age"]
#> A B C
#> 10 50 70
# Matrix data
daf["@ cell @ gene :: counts"]
#> X Y Z
#> A 1 4 7
#> B 2 5 8
#> C 3 6 9
# Filter axis by property
daf["@ cell [ age > 20 ]"]
#> B C
#> "B" "C"
# Transform with eltwise op
daf["@ cell : age % Abs"]
#> A B C
#> 10 50 70
# Pipe-chain builder form
daf[Axis("cell") |> LookupVector("age")]
#> A B C
#> 10 50 70
# Composed builder: mask + vector lookup
daf[
  Axis("cell") |>
    BeginMask("age") |>
    IsGreater(20) |>
    EndMask()
]
#> B C
#> "B" "C"
The query syntax follows a simple pattern:
- @ axis selects an axis
- . property retrieves a scalar
- : property retrieves a vector property
- :: property retrieves a matrix property
- % operation applies an element-wise operation
- >> reduction applies a reduction operation
- [ property comparison value ] filters by a property
See vignette("queries", package = "dafr") for the practical tour and vignette("query-dsl-reference", package = "dafr") for the full operator grammar.
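As a minimal sketch of how these pieces combine, the snippet below rebuilds a cut-down version of the query_examples dataset, runs the same vector lookup through both interfaces, and then appends a reduction. The lookup forms are taken from the examples above. The reduction query is an assumption: it supposes that a reduction is named after `>>` in the same way an element-wise op is named after `%`, and that `Sum` (one of the kernels listed under Native advantages) is accepted there. No output is shown for it because it has not been checked.

```r
library(dafr)

# Rebuild a cut-down version of the dataset used in the examples above.
daf <- memory_daf("query_examples")
add_axis(daf, "cell", c("A", "B", "C"))
set_vector(daf, "cell", "age", c(10, 50, 70))

# The same vector lookup, once as a string and once as a builder chain.
by_string  <- daf["@ cell : age"]
by_builder <- daf[Axis("cell") |> LookupVector("age")]
all.equal(by_string, by_builder)

# ASSUMPTION: `>> Sum` reduces the looked-up vector, mirroring the
# `% Abs` element-wise form shown earlier. Treat this as illustrative.
daf["@ cell : age >> Sum"]
```

The vignettes named above remain the authoritative source for which operators may follow which.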
AnnData interop
A Daf object can be exposed as an AnnData-shaped facade with the familiar X / obs / var / layers / uns / obs_names / var_names / shape properties:
d <- example_cells_daf()
ann <- as_anndata(d)
ann$n_obs
#> [1] 856
ann$n_vars
#> [1] 683
dim(ann$X)
#> [1] 856 683
head(ann$obs_names)
#> [1] "demux_07_12_20_1_AACAAGATCCATTTCA-1" "demux_07_12_20_1_AACGAAAGTCCAATCA-1"
#> [3] "demux_07_12_20_1_AAGACAAAGTTCCGTA-1" "demux_07_12_20_1_AGACTCATCTATTGTC-1"
#> [5] "demux_07_12_20_1_AGATAGACATTCCTCG-1" "demux_07_12_20_1_ATCGTAGTCCAGTGCG-1"
head(ann$obs)
#> donor experiment
#> demux_07_12_20_1_AACAAGATCCATTTCA-1 N89 demux_07_12_20_1
#> demux_07_12_20_1_AACGAAAGTCCAATCA-1 N84 demux_07_12_20_1
#> demux_07_12_20_1_AAGACAAAGTTCCGTA-1 N86 demux_07_12_20_1
#> demux_07_12_20_1_AGACTCATCTATTGTC-1 N84 demux_07_12_20_1
#> demux_07_12_20_1_AGATAGACATTCCTCG-1 N89 demux_07_12_20_1
#> demux_07_12_20_1_ATCGTAGTCCAGTGCG-1 N89 demux_07_12_20_1
The facade is read-only; writes (ann$X <- ...) raise an error. Modify the underlying Daf directly if you need to change data.
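To see the read-only behaviour without aborting a script, the write can be wrapped in tryCatch(). This is a sketch only: the replacement matrix is arbitrary, and the wording of the error message is not documented here, so it is captured and printed rather than matched.

```r
library(dafr)

d <- example_cells_daf()
ann <- as_anndata(d)

# Attempt a write through the facade and capture the error message
# instead of stopping. The value assigned is arbitrary.
msg <- tryCatch(
  {
    ann$X <- matrix(0, nrow = 1, ncol = 1)
    "no error raised"
  },
  error = function(e) conditionMessage(e)
)
msg
```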
Round-trip h5ad I/O is supported (requires the hdf5r Suggests dep). Sparse X, categorical obs / var columns, nested uns, and obsm / varm all round-trip:
d <- h5ad_as_daf("path/to/file.h5ad")
daf_as_h5ad(d, "out.h5ad", overwrite = TRUE)
dplyr backend
tbl(daf, axis) returns a lazy daf_axis_tbl where rows are axis entries and columns are per-axis vectors. Most dplyr verbs work natively:
library(dplyr)
d <- example_cells_daf()
tbl(d, "cell") |>
  filter(donor == "N89") |>
  select(name, donor, experiment) |>
  collect() |>
  head()
#> # A tibble: 6 x 3
#> name donor experiment
#> 1 demux_07_12_20_1_AACAAGATCCATTTCA-1 N89 demux_07_12_20_1
#> 2 demux_07_12_20_1_AGATAGACATTCCTCG-1 N89 demux_07_12_20_1
#> 3 demux_07_12_20_1_ATCGTAGTCCAGTGCG-1 N89 demux_07_12_20_1
#> 4 demux_07_12_20_1_CACAGGCGTCCTACAA-1 N89 demux_07_12_20_1
#> 5 demux_07_12_20_1_CCTACGTAGCCAACCC-1 N89 demux_07_12_20_1
#> 6 demux_07_12_20_1_GAAGGGTGTCCCTGAG-1 N89 demux_07_12_20_1
tbl(d, "cell") |>
  count(experiment, sort = TRUE) |>
  collect() |>
  head()
#> # A tibble: 6 x 2
#> name n
#> 1 demux_28_12_20_2 72
#> 2 demux_28_12_20_1 58
#> 3 demux_21_02_21_2 51
#> 4 demux_22_02_21_2 47
#> 5 demux_01_03_21_1 45
#> 6 demux_07_12_20_2 42
Supported verbs: filter, select, mutate, arrange, summarise, group_by / ungroup, distinct, pull, collect, as_tibble, rename / relocate, transmute, reframe, slice and the slice_head / slice_tail / slice_min / slice_max / slice_sample family, count / tally / add_count / add_tally. Inside mutate() / summarise(): window functions (lag, lead, cumsum, row_number, rank family), scalar helpers (if_else, case_when, coalesce, n_distinct, first / last / nth), across(where(...), ...), and tidyselect helpers in select(). .by = (dplyr 1.1+) works on filter / mutate / summarise. See vignette("dplyr", package = "dafr") for details.
When a grouping variable names another axis already in the daf, summarise() and count() auto-tie-back to a lazy daf_axis_tbl on that axis - so the pipeline can continue. Axis entries then appear in the name column (the convention for the axis-identity column). Example:
# `donor` is an axis of example_cells_daf(); count() re-anchors onto it.
tbl(d, "cell") |>
  count(donor, sort = TRUE) |>
  class() # daf_axis_tbl + tbl_df
#> [1] "daf_axis_tbl"
Mutations are accumulated lazily and can be written back with dplyr::compute(tbl, vectors = c(...)):
# Lazy pipeline...
updated <- tbl(d, "cell") |>
  mutate(donor_upper = toupper(donor))
# ...materialise the new vector on the underlying daf:
dplyr::compute(updated, vectors = "donor_upper")
Matrices are out of scope - use the native query DSL for those.
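Once compute() has written a vector back, it should be reachable through the ordinary accessors as well as through the tibble view. The sketch below repeats the pipeline above and then reads the result with get_vector() and with the string DSL. It assumes the vector is stored on the cell axis under exactly the name passed to compute(); that follows from the description above but has not been checked independently.

```r
library(dafr)
library(dplyr)

d <- example_cells_daf()

# Same lazy pipeline as above, then write the new column back.
updated <- tbl(d, "cell") |>
  mutate(donor_upper = toupper(donor))
dplyr::compute(updated, vectors = "donor_upper")

# ASSUMPTION: the new vector lives on the "cell" axis under the name
# given to compute(), so both access paths below can see it.
head(get_vector(d, "cell", "donor_upper"))
head(d["@ cell : donor_upper"])
```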
Key Features
- Full Daf data model: scalars, per-axis vectors, per-axis-pair matrices.
- Storage backends: memory_daf (in-memory), files_daf (one-file-per-property + mmap reads), zarr_daf (directory or single-file .daf.zarr.zip, Zarr v2 layout readable by zarr-python), http_daf (read-only over HTTP).
- Zero-copy mmap-backed reads for vectors and sparse matrices on read-only FilesDaf and MmapZipStore-backed ZarrDaf.
- OpenMP-parallel reduction kernels (Sum, Mean, Var, Mode, Quantile, GeoMean).
- Views and adapters for zero-copy subsetting / renaming.
- Chaining repositories for efficient overlay / reuse of data.
- Contracts for explicit computation input / output validation.
- Explicit support for grouped axes and group helpers.
- Query DSL (string form + pipe-chain builders) with user-extensible ops (register_eltwise, register_reduction).
- AnnData (h5ad) import / export, plus a read-only AnnData-shaped facade (as_anndata).
- dplyr backend for per-axis tibbles.
- Control over data layout and support for both dense and sparse matrices.
Status
Pre-CRAN, lifecycle: experimental. The data model and on-disk formats are stable and cross-checked against DataAxesFormats.jl and zarr-python; the R API may still see breaking refinements before the first CRAN release. See NEWS.md for release notes. Comments, bug reports, and PRs are welcome.
License (MIT)
Copyright (c) 2025-2026 Weizmann Institute of Science
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.