
Perform IQ regression on peak intervals using the provided ATAC-seq scores, ATAC-seq score differences, normalized intervals, motif energies, and additional features, after dividing the intervals into training and testing sets.

Usage

iq_regression(
  peak_intervals,
  atac_scores = NULL,
  atac_diff = NULL,
  normalize_bins = TRUE,
  norm_intervals = NULL,
  motif_energies = NULL,
  additional_features = NULL,
  max_motif_num = 30,
  traj_prego = NULL,
  peaks_size = 300,
  bin_start = 1,
  bin_end = NULL,
  seed = 60427,
  frac_train = 0.8,
  filter_model = TRUE,
  ...
)

Arguments

peak_intervals

A data frame, indicating the genomic positions ('chrom', 'start', 'end') of each peak, with an additional column named "const" indicating whether the peak is constitutive. Optionally, a column named "cluster" can be added with indication of the cluster of each peak.

atac_scores

Optional. A numeric matrix representing the mean ATAC score per bin per peak. Rows: peaks, columns: bins. By default, iceqream regresses the last column minus the first column. To regress something else, set bin_start or bin_end, or provide atac_diff instead. If normalize_bins is TRUE, the scores are normalized to the range [0, 1].
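
To make the default response concrete, the sketch below builds a toy score matrix and computes the last bin minus the first bin. It assumes plain per-bin min-max scaling to [0, 1]; the normalization iceqream actually applies when normalize_bins is TRUE may differ, and the toy data and the helper `min_max` are hypothetical.

```r
# Toy ATAC score matrix: 4 peaks (rows) x 3 trajectory bins (columns).
atac_scores <- matrix(
  c(2, 4, 10,
    5, 5, 5,
    8, 6, 0,
    0, 3, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("peak", 1:4), paste0("bin", 1:3))
)

# ASSUMPTION: min-max scaling of each bin (column) to [0, 1].
# This is an illustration, not necessarily the package's own normalization.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
norm_scores <- apply(atac_scores, 2, min_max)

# Default trajectory: first bin to last bin.
bin_start <- 1
bin_end <- ncol(norm_scores)

# Response to regress: score at bin_end minus score at bin_start, per peak.
atac_diff <- norm_scores[, bin_end] - norm_scores[, bin_start]
```

A vector computed this way (or by any other rule) could be supplied through atac_diff when the default last-minus-first difference is not the quantity of interest.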

atac_diff

Optional. A numeric vector representing the differential accessibility between the start and end of the trajectory. Either this or atac_scores must be provided.

normalize_bins

Whether to normalize the ATAC scores to the range [0, 1]. Default: TRUE

norm_intervals

A data frame, indicating the genomic positions ('chrom', 'start', 'end') of peaks used for energy normalization. If NULL, the function will use peak_intervals for normalization.

motif_energies

A numeric matrix, representing the energy of each motif in each peak. If NULL, the function will use pssm_db to calculate the motif energies. Note that this might take a while.

additional_features

A data frame representing additional genomic features (e.g. CpG content, distance to TSS, etc.) for each peak. Note that NA values are replaced with 0.

max_motif_num

maximum number of motifs to consider. Default: 30

traj_prego

output of learn_traj_prego. If provided, no additional prego models are inferred.

peaks_size

size of the peaks to extract sequences from. Default: 300bp

bin_start

the start of the trajectory. Default: 1

bin_end

the end of the trajectory. Default: the last bin (only used when atac_scores is provided)

seed

random seed for reproducibility.

frac_train

A numeric value indicating the fraction of intervals to use for training (default is 0.8).
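
The interplay of seed and frac_train can be pictured with a small base-R sketch. It assumes a simple uniform random split of peak indices; how iq_regression actually partitions the intervals (for example, whether it stratifies) is not stated here, and `n_peaks`, `train_idx` and `test_idx` are hypothetical names.

```r
# Hypothetical number of peaks; in practice this would be nrow(peak_intervals).
n_peaks <- 10
seed <- 60427
frac_train <- 0.8

# Fixing the seed makes the random split reproducible across runs.
set.seed(seed)

# ASSUMPTION: uniform random sampling of row indices for the training set.
train_idx <- sort(sample(n_peaks, size = round(frac_train * n_peaks)))

# Everything not sampled for training is held out for testing.
test_idx <- setdiff(seq_len(n_peaks), train_idx)
```

Re-running the block with the same seed yields the same split, which is the point of exposing seed as an argument.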

filter_model

A logical value indicating whether to filter the model (default is TRUE).

...

Arguments passed on to regress_trajectory_motifs

n_clust_factor

factor by which to divide the number of motifs to keep after clustering. For example, if n_clust_factor > 1, the number of motifs to keep is reduced by a factor of n_clust_factor. Default: 1

norm_motif_energies

A numeric matrix, representing the normalized energy of each motif in each interval of norm_intervals. If NULL, the function will use pssm_db to calculate the motif energies. Note that this might take a while.

pssm_db

a data frame with PSSMs ('A', 'C', 'G' and 'T' columns), with an additional column 'motif' containing the motif name. All the motifs in motif_energies (column names) should be present in the 'motif' column. Default: all motifs in the prego package.

min_tss_distance

distance from Transcription Start Site (TSS) to classify a peak as an enhancer. Default: 5000. If NULL, no filtering will be performed - use this option if your peaks are already filtered.
Note that in order to filter peaks that are too close to TSS, the current misha genome must have an intervals set called intervs.global.tss.

normalize_energies

whether to normalize the motif energies. Set this to FALSE if the motif energies are already normalized.

min_initial_energy_cor

minimal correlation between the motif normalized energy and the ATAC difference.

energy_norm_quantile

quantile of the energy used for normalization. Default: 1

norm_energy_max

maximum value of the normalized energy. Default: 10

n_prego_motifs

number of prego motifs (de-novo motifs) to consider.

min_diff

minimal ATAC difference for a peak to participate in the initial prego motif inference and in the distillation step (if distill_on_diff is TRUE).

distill_on_diff

whether to distill motifs based on differential accessibility. If FALSE, all peaks will be used for distillation, if TRUE - only peaks with differential accessibility >= min_diff will be used.

prego_sample_fraction

Fraction of peaks to sample for prego motif inference. A smaller number would be faster but might lead to over-fitting. Default: 0.1

feature_selection_beta

beta parameter used for feature selection.

filter_using_r2

whether to filter features using R^2.

r2_threshold

minimal R^2 for a feature to be included in the model.

parallel

whether to use parallel processing on glmnet.

spat_num_bins

number of spatial bins to use.

spat_bin_size

size of each spatial bin.

kmer_sequence_length

length of the kmer sequence to use for kmer screening. By default the full sequence is used.

alpha

The elasticnet mixing parameter, with \(0\le\alpha\le 1\). The penalty is defined as $$(1-\alpha)/2||\beta||_2^2+\alpha||\beta||_1.$$ alpha=1 is the lasso penalty, and alpha=0 the ridge penalty.
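
As a quick check of the formula above, the penalty can be evaluated directly in base R. This is an illustrative sketch only: `elastic_net_penalty` is a hypothetical helper, not a function of glmnet or iceqream, and it omits the overall lambda multiplier that scales the penalty during fitting.

```r
# Elastic net penalty for a coefficient vector `beta` and mixing parameter
# `alpha`: (1 - alpha) / 2 * ||beta||_2^2 + alpha * ||beta||_1
elastic_net_penalty <- function(beta, alpha) {
  stopifnot(alpha >= 0, alpha <= 1)
  ridge_part <- (1 - alpha) / 2 * sum(beta^2)  # squared L2 norm term
  lasso_part <- alpha * sum(abs(beta))         # L1 norm term
  ridge_part + lasso_part
}

beta <- c(3, -4)
lasso_only <- elastic_net_penalty(beta, alpha = 1)    # pure lasso: L1 norm
ridge_only <- elastic_net_penalty(beta, alpha = 0)    # pure ridge: half squared L2 norm
mixed      <- elastic_net_penalty(beta, alpha = 0.5)  # equal mix of both terms
```

Values of alpha between 0 and 1 trade off the sparsity induced by the L1 term against the shrinkage of the squared L2 term.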

lambda

A user-supplied lambda sequence. Typical usage is to have the program compute its own lambda sequence based on nlambda and lambda.min.ratio. Supplying a value of lambda overrides this. WARNING: use with care. Avoid supplying a single value for lambda (for predictions after CV use predict() instead). Supply instead a decreasing sequence of lambda values. glmnet relies on its warm starts for speed, and it's often faster to fit a whole path than compute a single fit.
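
If a custom path is needed, one common way to build a decreasing sequence is to space the values evenly on the log scale. The endpoints (1 and 0.001) and the length (20) below are arbitrary placeholders chosen for illustration, not recommended settings.

```r
# ASSUMPTION: the endpoints and length are illustrative values only.
lambda_max <- 1
lambda_min <- 1e-3
n_lambda <- 20

# Log-spaced sequence running from the largest to the smallest penalty,
# so that each fit can warm-start from the previous, more regularized one.
lambda_seq <- exp(seq(log(lambda_max), log(lambda_min), length.out = n_lambda))

# Sanity check: the sequence must be strictly decreasing.
stopifnot(all(diff(lambda_seq) < 0))
```

Since lambda is listed among the arguments passed on through `...`, such a vector would be supplied as `lambda = lambda_seq` in the call.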

Value

An instance of TrajectoryModel containing model information and results:

  • model: The final General Linear Model (GLM) object.

  • motif_models: Named list, PSSM and spatial models for each motif cluster.

  • normalized_energies: Numeric vector, normalized energies of each motif in each peak.

  • additional_features: Data frame of the additional features.

  • diff_score: Numeric, normalized score of differential accessibility between 'bin_start' and 'bin_end'.

  • predicted_diff_score: Numeric, predicted differential accessibility score between 'bin_start' and 'bin_end'.

  • initial_prego_models: List, inferred prego models at the initial step of the algorithm.

  • peak_intervals: Data frame, indicating the genomic positions ('chrom', 'start', 'end') of each peak used for training.

Examples

if (FALSE) {

}