
This function processes ATAC-seq data by importing signal tracks, normalizing peaks to a fixed size, and applying three normalization steps: regional normalization, constitutive peak normalization, and probability-based normalization.

Usage

preprocess_data(
  project_name,
  files = NULL,
  cell_types = NULL,
  peaks = NULL,
  anchor_cell_type = NULL,
  figures_dir = NULL,
  peaks_size = 500,
  binsize = 20,
  overwrite_tracks = FALSE,
  overwrite_marginal = FALSE,
  window_size = 20000,
  minimal_quantile = 0.1,
  const_threshold = -16,
  const_norm_quant = 1,
  const_scaling_quant = 1,
  const_quantile = 0.9,
  prob1_thresh = NULL,
  add_tss_dist = TRUE,
  tss_intervals = "intervs.global.tss",
  proximal_atac_window_size = 20000
)

Arguments

project_name

Character string. The prefix used for track names and project identification.

files

Optional character vector. Paths to input ATAC-seq signal files in bigWig or tsv format; see misha::gtrack.import for details. Required if the tracks do not already exist.

cell_types

Optional character vector. Names of cell types to process. If NULL, derived from track names.

peaks

Data frame or file path. Peak intervals with required columns 'chrom', 'start', and 'end'.

anchor_cell_type

Optional character. Cell type to use as reference for normalization. If NULL, the mean of all cell types is used.

figures_dir

Optional character. Directory path to save normalization plots.

peaks_size

Numeric. Size to normalize peaks to in base pairs. Default: 500

binsize

Numeric. Bin size for signal track import in base pairs. Default: 20

overwrite_tracks

Logical. Whether to overwrite existing individual cell type tracks. Default: FALSE

overwrite_marginal

Logical. Whether to overwrite existing marginal track. Default: FALSE

window_size

Numeric. Window size for regional normalization in base pairs. Default: 2e4

minimal_quantile

Numeric. Minimum quantile for regional normalization. Default: 0.1

const_threshold

Numeric. Log2 threshold for identifying constitutive peaks. Default: -16

const_norm_quant

Numeric. Quantile for constitutive peak normalization. Default: 1

const_scaling_quant

Numeric. Scaling quantile for constitutive normalization. Default: 1

const_quantile

Numeric. Quantile for probability normalization threshold. Default: 0.9

prob1_thresh

Optional numeric. Threshold for probability=1 in normalization. If NULL, calculated from const_quantile.

add_tss_dist

Logical. Whether to add TSS distance to peaks. Default: TRUE

tss_intervals

Character. Name of TSS intervals track. Default: "intervs.global.tss"

proximal_atac_window_size

Numeric. Window size for proximal ATAC signal computation. For each peak, a feature of the (punctured) window signal is computed. Default: 2e4

Value

A list containing:

  • atac: Raw ATAC signal matrix

  • atac_norm: Region-normalized signal matrix

  • atac_norm_const: Constitutive peak-normalized signal matrix

  • atac_norm_prob: Probability-normalized signal matrix

  • peaks: Data frame of peak information

  • additional_features: Data frame of additional features (dinucleotide distribution and punctured regional ATAC signal)

  • params: List of parameters used for normalization

Details

The function performs several normalization steps:

  1. Regional normalization using punctured windows around peaks

  2. Identification and normalization of constitutive peaks

  3. Conversion to probability scores

If visualization is enabled (figures_dir is provided), the function generates scatter plots showing the effects of each normalization step.
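
To make the three steps concrete, here is a toy sketch in base R on a small hand-made matrix. It is not the package's implementation: the formulas (dividing by a floored background, rescaling so constitutive peaks share a common mean, clipping at a quantile threshold) are assumptions chosen only to illustrate the idea, and `background`, `const_threshold_toy`, and the other local names are hypothetical.

```r
# Toy illustration only: the formulas below are assumptions, not the package's code.

# Hypothetical raw signal: 6 peaks (rows) x 2 cell types (columns)
atac <- matrix(
    c(10, 40, 5, 80, 20, 60,
      12, 35, 9, 70, 25, 90),
    ncol = 2,
    dimnames = list(paste0("peak", 1:6), c("celltype1", "celltype2"))
)

# Hypothetical background signal in a window around each peak that
# excludes the peak itself (the "punctured" window)
background <- matrix(
    c(2, 4, 1, 8, 2, 5,
      3, 5, 1, 7, 2, 9),
    ncol = 2,
    dimnames = dimnames(atac)
)

# Step 1 (assumed form): divide by the regional background, flooring the
# background at its minimal_quantile so near-empty regions do not inflate peaks
minimal_quantile <- 0.1
floor_bg <- apply(background, 2, quantile, probs = minimal_quantile)
bg <- pmax(background, matrix(floor_bg, nrow(atac), ncol(atac), byrow = TRUE))
atac_norm <- atac / bg

# Step 2 (assumed form): call a peak constitutive when its log2 signal passes
# a threshold in every cell type, then rescale each cell type so constitutive
# peaks share a common mean (here the mean over all cell types, mirroring
# anchor_cell_type = NULL)
const_threshold_toy <- log2(8) # hypothetical value for this toy scale
is_const <- apply(log2(atac_norm) >= const_threshold_toy, 1, all)
const_means <- colMeans(atac_norm[is_const, , drop = FALSE])
atac_norm_const <- sweep(atac_norm, 2, mean(const_means) / const_means, `*`)

# Step 3 (assumed form): map to [0, 1], with values at or above the
# const_quantile threshold set to probability 1
const_quantile <- 0.9
prob1_thresh <- quantile(atac_norm_const, probs = const_quantile)
atac_norm_prob <- pmin(atac_norm_const / prob1_thresh, 1)
```

In this sketch each intermediate matrix corresponds by name to an element of the returned list (atac, atac_norm, atac_norm_const, atac_norm_prob), which may help when reading the scatter plots written to figures_dir.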

Examples

if (FALSE) { # \dontrun{
# Basic usage with existing tracks
result <- preprocess_data(
    project_name = "my_project",
    peaks = "peaks.bed",
    figures_dir = "figures"
)

# Full preprocessing with new data
result <- preprocess_data(
    project_name = "my_project",
    files = c("celltype1.bw", "celltype2.bw"),
    peaks = "peaks.bed",
    anchor_cell_type = "celltype1",
    figures_dir = "figures",
    overwrite_tracks = TRUE
)
} # }
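
Since peaks may be given as a data frame rather than a file path, the following sketch builds a minimal one with the required 'chrom', 'start', and 'end' columns and checks it before use. The coordinates are made up for illustration, and the final call is left commented out because it assumes a configured misha database with existing tracks.

```r
# Hypothetical peak intervals; coordinates are illustrative only
peaks <- data.frame(
    chrom = c("chr1", "chr1", "chr2"),
    start = c(10000, 52300, 7800),
    end = c(10500, 52800, 8300),
    stringsAsFactors = FALSE
)

# Check the columns the peaks argument is documented to require
required_cols <- c("chrom", "start", "end")
stopifnot(all(required_cols %in% colnames(peaks)))

# Sanity check: every interval has positive width
stopifnot(all(peaks$end > peaks$start))

# Not run: assumes an initialized misha database and existing tracks
# result <- preprocess_data(project_name = "my_project", peaks = peaks)
```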