R/intervals-annotation.R
gintervals.annotate.Rd
Annotates one-dimensional intervals by finding their nearest neighbors in another set of intervals and adding selected columns from those neighbors to the original intervals.
gintervals.annotate(
  intervals,
  annotation_intervals,
  annotation_columns = NULL,
  column_names = NULL,
  dist_column = "dist",
  max_dist = Inf,
  na_value = NA,
  maxneighbors = 1,
  tie_method = c("first", "min.start", "min.end"),
  overwrite = FALSE,
  keep_order = TRUE,
  intervals.set.out = NULL,
  ...
)
intervals: Intervals to annotate (1D).
annotation_intervals: Source intervals containing annotation data (1D).
annotation_columns: Character vector of column names to copy from
annotation_intervals. If NULL (default), all non-basic
columns are used, i.e. every column other than the coordinate/strand columns
chrom, start, end, chrom1, start1, end1, chrom2, start2, end2, strand.
column_names: Optional custom names for the annotation columns. If
provided, must have the same length as annotation_columns. Defaults
to the original names.
dist_column: Name of the distance column to include. Use NULL to
omit the distance column. Defaults to "dist".
max_dist: Maximum absolute distance. When finite, rows whose neighbor
has |dist| > max_dist are retained, but their annotation columns are
set to na_value.
na_value: Value(s) to use for annotations when the neighbor is beyond
max_dist or when no neighbor is found. Can be a single scalar recycled for all
columns, or a named list/vector supplying per-column values matching
column_names.
maxneighbors: Maximum number of neighbors per interval (input intervals are
duplicated as needed). Defaults to 1.
tie_method: Tie-breaking when distances are equal: one of
"first" (arbitrary but stable), "min.start" (smaller neighbor start first),
or "min.end" (smaller neighbor end first). Applies when
maxneighbors > 1.
overwrite: When FALSE (default), an error is raised if the selected annotation
columns would overwrite existing columns in intervals. When
TRUE, conflicting base columns are replaced by the annotation
columns.
keep_order: If TRUE (default), preserves the original order of
intervals rows in the output.
intervals.set.out: Name of the intervals set in which the result is optionally saved.
...: Additional arguments forwarded to gintervals.neighbors
(e.g., mindist, maxdist).
A data frame containing the original intervals plus the requested
annotation columns (and optional distance column). If
maxneighbors > 1, rows may be duplicated per input interval to
accommodate multiple neighbors.
The function wraps and extends gintervals.neighbors to provide
convenient column selection/renaming, optional distance inclusion, distance
thresholding with custom NA values, multiple neighbors per interval, and
deterministic tie-breaking. Currently supports 1D intervals only.
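The tie-breaking rule can be illustrated with a small base-R sketch. It is not misha's implementation: `pick_neighbors` and the `cands` data frame are hypothetical names introduced here, and treating "first" as "keep input order" is an assumption about what "arbitrary but stable" means.

```r
# Hypothetical helper, not part of misha: choose up to `k` neighbors for one
# query interval from a data frame of candidates (columns: start, end, dist).
pick_neighbors <- function(cands, k = 1,
                           tie_method = c("first", "min.start", "min.end")) {
  tie_method <- match.arg(tie_method)
  # Secondary sort key, used only to break ties in |dist|.
  tie_key <- switch(tie_method,
    first     = seq_len(nrow(cands)),  # assumption: keep input order
    min.start = cands$start,           # smaller neighbor start first
    min.end   = cands$end              # smaller neighbor end first
  )
  ord <- order(abs(cands$dist), tie_key)
  cands[head(ord, k), , drop = FALSE]
}

# Two candidates at the same distance from the query interval.
cands <- data.frame(
  start = c(1200, 800),
  end   = c(1300, 900),
  dist  = c(100, 100),
  label = c("right", "left"),
  stringsAsFactors = FALSE
)

pick_neighbors(cands, k = 2, tie_method = "first")$label      # "right", "left"
pick_neighbors(cands, k = 2, tie_method = "min.start")$label  # "left", "right"
```

With k = 2 both candidates are returned, so the query interval would appear twice in the annotated output, once per neighbor.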
- When annotation_columns = NULL, all non-basic columns present in
annotation_intervals are included.
- Setting dist_column = NULL omits the distance column.
- If no neighbor is found for an interval, annotation columns are filled with
na_value and the distance (when present) is NA_real_.
- Column name collisions are handled as follows: when overwrite=FALSE
a clear error is emitted; when overwrite=TRUE, base columns with the
same names are replaced by annotation columns.
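The rules in the notes above can be sketched in base R. This is an illustration of the documented behavior only, not misha's code: `mask_annotations`, `check_collisions`, and the `res` data frame are hypothetical names, and `res` stands in for an already-joined neighbor table.

```r
# Hypothetical helpers, not part of misha.

# Replace annotation values in rows with no neighbor or |dist| > max_dist.
mask_annotations <- function(df, ann_cols, max_dist = Inf, na_value = NA,
                             dist_column = "dist") {
  d <- df[[dist_column]]
  too_far <- is.na(d) | abs(d) > max_dist
  for (col in ann_cols) {
    # Named list/vector: per-column value; unnamed scalar: recycled.
    fill <- if (!is.null(names(na_value))) na_value[[col]] else na_value
    df[[col]][too_far] <- fill
  }
  df  # rows are kept; only annotation columns change
}

# Error on name collisions unless overwrite = TRUE.
check_collisions <- function(base_names, new_names, overwrite = FALSE) {
  clash <- intersect(base_names, new_names)
  if (length(clash) > 0 && !overwrite) {
    stop("annotation columns would overwrite existing columns: ",
         paste(clash, collapse = ", "))
  }
  clash  # columns that would be replaced when overwrite = TRUE
}

res <- data.frame(
  start  = c(1000, 5000, 9000),
  remark = c("a", "b", NA),
  score  = c(10, 20, NA),
  dist   = c(50, 350, NA),  # third row: no neighbor found
  stringsAsFactors = FALSE
)
masked <- mask_annotations(res, c("remark", "score"), max_dist = 200,
                           na_value = list(remark = "no_ann", score = -1))
masked$remark  # "a", "no_ann", "no_ann"
masked$score   # 10, -1, -1
masked$dist    # 50, 350, NA (the distance column is left untouched)
```

Note that the second row keeps its distance of 350 even though its annotations are masked, which matches the thresholding example in this page.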
# Prepare toy data (assumes a genome database is already initialized,
# e.g. with gdb.init_examples())
intervs <- gintervals(1, c(1000, 5000), c(1100, 5050))
ann <- gintervals(1, c(900, 5400), c(950, 5500))
ann$remark <- c("a", "b")
ann$score <- c(10, 20)
# Basic usage with default columns (all non-basic columns)
gintervals.annotate(intervs, ann)
#>   chrom start  end remark score dist
#> 1  chr1  1000 1100      a    10   50
#> 2  chr1  5000 5050      b    20  350
# Select specific columns, with custom names and distance column name
gintervals.annotate(
  intervs, ann,
  annotation_columns = c("remark"),
  column_names = c("ann_remark"),
  dist_column = "ann_dist"
)
#>   chrom start  end ann_remark ann_dist
#> 1  chr1  1000 1100          a       50
#> 2  chr1  5000 5050          b      350
# Distance threshold with scalar NA replacement
gintervals.annotate(
  intervs, ann,
  annotation_columns = c("remark"),
  max_dist = 200,
  na_value = "no_ann"
)
#>   chrom start  end remark dist
#> 1  chr1  1000 1100      a   50
#> 2  chr1  5000 5050 no_ann  350
# Multiple neighbors with deterministic tie-breaking
nbrs <- gintervals.annotate(
  gintervals(1, 1000, 1100),
  {
    x <- gintervals(1, c(800, 1200), c(900, 1300))
    x$label <- c("left", "right")
    x
  },
  annotation_columns = "label",
  maxneighbors = 2,
  tie_method = "min.start"
)
nbrs
#>   chrom start  end label dist
#> 1  chr1  1000 1100  left  100
#> 2  chr1  1000 1100 right  100
# Overwrite existing columns in the base intervals
intervs2 <- intervs
intervs2$remark <- c("orig1", "orig2")
gintervals.annotate(intervs2, ann, annotation_columns = "remark", overwrite = TRUE)
#>   chrom start  end remark dist
#> 1  chr1  1000 1100      a   50
#> 2  chr1  5000 5050      b  350