
This function processes ATAC-seq data by importing signal tracks, normalizing peaks to a fixed size, and applying three normalization steps: regional normalization, constitutive peak normalization, and probability-based normalization.

Usage

preprocess_data(
  project_name,
  files = NULL,
  cell_types = NULL,
  peak_intervals = NULL,
  peaks = NULL,
  anchor_cell_type = NULL,
  figures_dir = NULL,
  peaks_size = 500,
  binsize = 20,
  overwrite_tracks = FALSE,
  overwrite_marginal = FALSE,
  window_size = 20000,
  minimal_quantile = 0.1,
  const_threshold = -16,
  const_norm_quant = 1,
  const_scaling_quant = 1,
  const_quantile = 0.9,
  prob1_thresh = NULL,
  add_tss_dist = TRUE,
  tss_intervals = "intervs.global.tss",
  proximal_atac_window_size = 20000
)

Arguments

project_name

Character string. The prefix used for track names and project identification.

files

Optional character vector. Paths to input ATAC-seq signal files in bigWig or tsv format; see misha::gtrack.import for details. Required if the tracks do not already exist.

cell_types

Optional character vector. Names of cell types to process. If NULL, derived from track names.

peak_intervals

Data frame or file path. Peak intervals with required columns 'chrom', 'start', and 'end'.
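
As an illustration of the expected shape, here is a minimal base R sketch of a peak table with the three required columns. The coordinates and the `has_required_cols` check are made up for this example and are not part of the package.

```r
# Hypothetical peak table; coordinates are illustrative only
peak_df <- data.frame(
    chrom = c("chr1", "chr1", "chr2"),
    start = c(1000, 5200, 300),
    end = c(1500, 5700, 800),
    stringsAsFactors = FALSE
)

# Hypothetical sanity check to run before calling preprocess_data()
required_cols <- c("chrom", "start", "end")
has_required_cols <- all(required_cols %in% colnames(peak_df))
```

A data frame like `peak_df` could then be supplied as `peak_intervals` (or its alias `peaks`).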

peaks

An alias for peak_intervals.

anchor_cell_type

Optional character. Cell type to use as reference for normalization. If NULL, the mean of all cell types is used.

figures_dir

Optional character. Directory path to save normalization plots.

peaks_size

Numeric. Size to normalize peaks to in base pairs. Default: 500
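
To make the idea of a fixed peak size concrete, the sketch below resizes intervals to a common width in base R. It assumes peaks are resized around their midpoint and clamped at coordinate 0; the page does not state how the function does this internally, and `resize_peaks` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: resize each interval to `size` bp around its midpoint
# (assumption: midpoint-centered resizing, clamped at the chromosome start)
resize_peaks <- function(peaks, size = 500) {
    mid <- floor((peaks$start + peaks$end) / 2)
    peaks$start <- pmax(mid - size %/% 2, 0)
    peaks$end <- peaks$start + size
    peaks
}

peaks <- data.frame(
    chrom = c("chr1", "chr2"),
    start = c(1000, 40),
    end = c(1800, 120)
)
resized <- resize_peaks(peaks, size = 500)
```

After resizing, every interval in `resized` is exactly 500 bp wide, regardless of its original width.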

binsize

Numeric. Bin size for signal track import in base pairs. Default: 20

overwrite_tracks

Logical. Whether to overwrite existing individual cell type tracks. Default: FALSE

overwrite_marginal

Logical. Whether to overwrite existing marginal track. Default: FALSE

window_size

Numeric. Window size for regional normalization in base pairs. Default: 2e4

minimal_quantile

Numeric. Minimum quantile for regional normalization. Default: 0.1

const_threshold

Numeric. Log2 threshold for identifying constitutive peaks. Default: -16

const_norm_quant

Numeric. Quantile for constitutive peak normalization. Default: 1

const_scaling_quant

Numeric. Scaling quantile for constitutive normalization. Default: 1

const_quantile

Numeric. Quantile for probability normalization threshold. Default: 0.9

prob1_thresh

Optional numeric. Threshold for probability=1 in normalization. If NULL, calculated from const_quantile.
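
The relationship between const_quantile and prob1_thresh can be pictured with a toy rescaling. This is only a sketch under stated assumptions: it assumes scores are mapped linearly from their minimum to the threshold and then clipped to [0, 1], which may differ from what the package does internally. The helper `to_prob` is hypothetical.

```r
# Toy sketch (not the package's implementation): rescale scores to [0, 1],
# saturating at prob1_thresh, which falls back to a quantile when NULL
to_prob <- function(x, const_quantile = 0.9, prob1_thresh = NULL) {
    if (is.null(prob1_thresh)) {
        prob1_thresh <- quantile(x, const_quantile, names = FALSE)
    }
    lo <- min(x)
    p <- (x - lo) / (prob1_thresh - lo)
    pmin(pmax(p, 0), 1) # clip to the [0, 1] range
}

scores <- c(-4, -3, -2, -1, 0) # illustrative log2-scale values
probs <- to_prob(scores, prob1_thresh = -1)
```

In this toy version, every score at or above the threshold receives a value of 1, and the lowest score receives 0.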

add_tss_dist

Logical. Whether to add TSS distance to peaks. Default: TRUE

tss_intervals

Character. Name of TSS intervals track. Default: "intervs.global.tss"

proximal_atac_window_size

Numeric. Window size for proximal ATAC signal computation. For each peak, a feature of the (punctured) window signal is computed. Default: 2e4
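
A "punctured" window can be sketched as the signal in a window around a peak with the peak itself left out. The example below assumes a window centered on the peak midpoint and a mean as the summary statistic; both are assumptions for illustration, and `punctured_mean` is a hypothetical helper rather than a package function.

```r
# Hypothetical helper: mean signal in a window around a peak, excluding the peak
punctured_mean <- function(signal, binsize, peak_start, peak_end, window_size) {
    pos <- (seq_along(signal) - 1) * binsize # start coordinate of each bin
    mid <- (peak_start + peak_end) / 2
    in_window <- pos >= mid - window_size / 2 & pos < mid + window_size / 2
    in_peak <- pos >= peak_start & pos < peak_end
    mean(signal[in_window & !in_peak])
}

signal <- c(1, 1, 1, 10, 10, 1, 1, 1) # toy signal in 20 bp bins
bg <- punctured_mean(
    signal,
    binsize = 20,
    peak_start = 60,
    peak_end = 100,
    window_size = 160
)
```

Here the two high bins belong to the peak and are excluded, so `bg` reflects only the flanking signal.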

Value

A list containing:

  • atac: Raw ATAC signal matrix

  • atac_norm: Region-normalized signal matrix

  • atac_norm_const: Constitutive peak-normalized signal matrix

  • atac_norm_prob: Probability-normalized signal matrix. This is the recommended input for the atac_scores parameter in iq_regression and regress_trajectory_motifs.

  • peaks: Data frame of peak information

  • additional_features: Data frame of additional features (dinucleotide distribution and punctured regional ATAC signal)

  • params: List of parameters used for normalization

Details

The function performs several normalization steps:

  1. Regional normalization using punctured windows around peaks

  2. Identification and normalization of constitutive peaks

  3. Conversion to probability scores

If visualization is enabled (figures_dir is provided), the function generates scatter plots showing the effects of each normalization step.

Examples

if (FALSE) { # \dontrun{
# Basic usage with existing tracks
result <- preprocess_data(
    project_name = "my_project",
    peaks = "peaks.bed",
    figures_dir = "figures"
)

# Full preprocessing with new data
result <- preprocess_data(
    project_name = "my_project",
    files = c("celltype1.bw", "celltype2.bw"),
    peaks = "peaks.bed",
    anchor_cell_type = "celltype1",
    figures_dir = "figures",
    overwrite_tracks = TRUE
)
} # }