
Preprocess and normalize ATAC-seq data
This function processes ATAC-seq data by importing signal tracks, resizing peaks to a fixed width, and applying successive normalization steps: regional normalization, constitutive peak normalization, and probability-based normalization.
Usage
preprocess_data(
project_name,
files = NULL,
cell_types = NULL,
peaks = NULL,
anchor_cell_type = NULL,
figures_dir = NULL,
peaks_size = 500,
binsize = 20,
overwrite_tracks = FALSE,
overwrite_marginal = FALSE,
window_size = 20000,
minimal_quantile = 0.1,
const_threshold = -16,
const_norm_quant = 1,
const_scaling_quant = 1,
const_quantile = 0.9,
prob1_thresh = NULL,
add_tss_dist = TRUE,
tss_intervals = "intervs.global.tss",
proximal_atac_window_size = 20000
)
Arguments
- project_name
Character string. The prefix used for track names and project identification.
- files
Optional character vector. Paths to input ATAC-seq signal files in bigWig or tsv format; see misha::gtrack.import for more details. Required if the tracks do not already exist.
- cell_types
Optional character vector. Names of cell types to process. If NULL, derived from track names.
- peaks
Data frame or file path. Peak intervals with required columns 'chrom', 'start', and 'end'.
- anchor_cell_type
Optional character. Cell type to use as reference for normalization. If NULL, the mean of all cell types is used.
- figures_dir
Optional character. Directory path to save normalization plots.
- peaks_size
Numeric. Size to normalize peaks to in base pairs. Default: 500
- binsize
Numeric. Bin size for signal track import in base pairs. Default: 20
- overwrite_tracks
Logical. Whether to overwrite existing individual cell type tracks. Default: FALSE
- overwrite_marginal
Logical. Whether to overwrite existing marginal track. Default: FALSE
- window_size
Numeric. Window size for regional normalization in base pairs. Default: 2e4
- minimal_quantile
Numeric. Minimum quantile for regional normalization. Default: 0.1
- const_threshold
Numeric. Log2 threshold for identifying constitutive peaks. Default: -16
- const_norm_quant
Numeric. Quantile for constitutive peak normalization. Default: 1
- const_scaling_quant
Numeric. Scaling quantile for constitutive normalization. Default: 1
- const_quantile
Numeric. Quantile for probability normalization threshold. Default: 0.9
- prob1_thresh
Optional numeric. Threshold for probability=1 in normalization. If NULL, calculated from const_quantile.
- add_tss_dist
Logical. Whether to add TSS distance to peaks. Default: TRUE
- tss_intervals
Character. Name of TSS intervals track. Default: "intervs.global.tss"
- proximal_atac_window_size
Numeric. Window size for proximal ATAC signal computation. For each peak, a feature of the (punctured) window signal is computed. Default: 2e4
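As an illustration of the `peaks` and `peaks_size` arguments, the following is a minimal sketch of a peak table with the required columns and of resizing each peak to a fixed width. The coordinates are made up, and `resize_peaks` is a hypothetical helper written for this example, not a function of the package; the package's actual centering and rounding rules may differ.

```r
# Toy peak table with the three required columns (coordinates are made up).
peaks <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  start = c(1000, 5200, 300),
  end   = c(1800, 5400, 1300)
)

# Hypothetical helper: resize every peak to `size` bp around its midpoint.
# Assumption: peaks are centered on the midpoint of the original interval.
resize_peaks <- function(peaks, size = 500) {
  mid <- floor((peaks$start + peaks$end) / 2)
  peaks$start <- pmax(mid - size / 2, 0)  # do not run past the chromosome start
  peaks$end <- peaks$start + size
  peaks
}

resized <- resize_peaks(peaks, size = 500)
```

A data frame of this shape (or a path to a file with the same columns) is what the `peaks` argument expects.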
Value
A list containing:
- atac: Raw ATAC signal matrix
- atac_norm: Region-normalized signal matrix
- atac_norm_const: Constitutive peak-normalized signal matrix
- atac_norm_prob: Probability-normalized signal matrix
- peaks: Data frame of peak information
- additional_features: Data frame of additional features (dinucleotide distribution and punctured regional ATAC signal)
- params: List of parameters used for normalization
Details
The function performs several normalization steps:
- Regional normalization using punctured windows around peaks
- Identification and normalization of constitutive peaks
- Conversion to probability scores
If visualization is enabled (figures_dir is provided), the function generates scatter plots showing the effects of each normalization step.
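Two of the ideas above can be sketched on toy data in base R: a punctured window (the window around a peak with the peak itself excluded) and a conversion of normalized signal to scores in [0, 1]. Both `punctured_sum` and `to_prob` are hypothetical helpers written for this illustration. In particular, the min-max scaling in `to_prob` is an assumption and is not taken from the package's implementation; only the roles of `const_quantile` and `prob1_thresh` follow the argument descriptions.

```r
# Toy binned signal along one chromosome, using 20 bp bins (binsize = 20).
binsize <- 20
signal <- rep(1, 2000)      # flat background of 1 per bin
peak_bins <- 1000:1024      # a 500 bp peak spans 25 bins
signal[peak_bins] <- 10     # stronger signal inside the peak

# Hypothetical helper: sum the signal in a window centered on the peak,
# leaving out the bins that belong to the peak itself.
punctured_sum <- function(signal, peak_bins, window_bins) {
  center <- round(mean(range(peak_bins)))
  half <- window_bins %/% 2
  window <- seq(max(center - half, 1), min(center + half, length(signal)))
  sum(signal[setdiff(window, peak_bins)])
}

window_bins <- 20000 / binsize  # window_size = 20000 bp corresponds to 1000 bins
regional <- punctured_sum(signal, peak_bins, window_bins)

# Hypothetical helper: map values to [0, 1] so that anything at or above
# prob1_thresh becomes 1. If prob1_thresh is NULL, it is taken as the
# const_quantile quantile of the values.
# Assumption: linear scaling between the minimum and the threshold.
to_prob <- function(x, const_quantile = 0.9, prob1_thresh = NULL) {
  if (is.null(prob1_thresh)) {
    prob1_thresh <- unname(quantile(x, const_quantile))
  }
  lo <- min(x)
  p <- (x - lo) / (prob1_thresh - lo)
  pmin(pmax(p, 0), 1)  # clip to the [0, 1] range
}

x <- 0:10
prob <- to_prob(x)                          # threshold derived from const_quantile
prob_fixed <- to_prob(x, prob1_thresh = 5)  # threshold supplied explicitly
```

With an explicit `prob1_thresh`, every value at or above the threshold maps to 1, which is why lowering it saturates more peaks.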
Examples
if (FALSE) { # \dontrun{
# Basic usage with existing tracks
result <- preprocess_data(
project_name = "my_project",
peaks = "peaks.bed",
figures_dir = "figures"
)
# Full preprocessing with new data
result <- preprocess_data(
project_name = "my_project",
files = c("celltype1.bw", "celltype2.bw"),
peaks = "peaks.bed",
anchor_cell_type = "celltype1",
figures_dir = "figures",
overwrite_tracks = TRUE
)
} # }