
Perform IQ regression on peak intervals
iq_regression.RdPerform IQ regression on peak intervals using the provided ATAC-seq scores, ATAC-seq score differences, normalized intervals, motif energies, and additional features, after dividing the intervals into training and testing sets.
Usage
iq_regression(
peak_intervals,
atac_scores = NULL,
atac_diff = NULL,
normalize_bins = TRUE,
norm_intervals = NULL,
motif_energies = NULL,
additional_features = NULL,
add_sequences_features = TRUE,
max_motif_num = 30,
traj_prego = NULL,
peaks_size = 500,
bin_start = 1,
bin_end = NULL,
seed = 60427,
frac_train = 0.8,
train_idxs = NULL,
test_idxs = NULL,
filter_model = TRUE,
min_diff = 0.1,
prego_min_diff = min_diff,
prego_sample_for_kmers = TRUE,
prego_sample_fraction = 0.1,
prego_energy_norm_quantile = 1,
prego_spat_bin_size = NULL,
prego_spat_num_bins = NULL,
r2_threshold = 0.0005,
bits_threshold = 1.75,
filter_sample_frac = 0.1,
include_interactions = FALSE,
interaction_threshold = 0.001,
max_motif_interaction_n = NULL,
max_add_interaction_n = NULL,
n_prego_motifs = 0,
n_cores = NULL,
output_dir = NULL,
plot_report = TRUE,
...
)Arguments
- peak_intervals
A data frame, indicating the genomic positions ('chrom', 'start', 'end') of each peak.
- atac_scores
Optional. A numeric matrix, representing mean ATAC score per bin per peak. Rows: peaks, columns: bins. By default iceqream would regress the last column minus the first column. If you want to regress something else, please either change bin_start or bin_end, or provide
atac_diffinstead. Ifnormalize_binsis TRUE, the scores will be normalized to be between 0 and 1.- atac_diff
Optional. A numeric vector representing the differential accessibility between the start and end of the trajectory. Either this or
atac_scoresmust be provided.- normalize_bins
whether to normalize the ATAC scores to be between 0 and 1. Default: TRUE
- norm_intervals
A data frame, indicating the genomic positions ('chrom', 'start', 'end') of peaks used for energy normalization. If NULL, the function will use
peak_intervalsfor normalization.- motif_energies
A numeric matrix, representing the energy of each motif in each peak. If NULL, the function will use
pssm_dbto calculate the motif energies. Note that this might take a while.- additional_features
A data frame, representing additional genomic features (e.g. CpG content, distance to TSS, etc.) for each peak. Note that NA values would be replaced with 0.
- add_sequences_features
Add CG content and dinuceotide content to the additional features.
- max_motif_num
maximum number of motifs to consider. Default: 50
- traj_prego
output of
learn_traj_prego. If provided, no additional prego models would be inferred.- peaks_size
size of the peaks to extract sequences from. Default: 500bp
- bin_start
the start of the trajectory. Default: 1
- bin_end
the end of the trajectory. Default: the last bin (only used when atac_scores is provided)
- seed
random seed for reproducibility.
- frac_train
A numeric value indicating the fraction of intervals to use for training (default is 0.8).
- train_idxs
A vector of indices to use for training. If NULL, the training set is randomly selected.
- test_idxs
A vector of indices to use for testing. If NULL, the testing set is the complement of the training set.
- filter_model
A logical value indicating whether to filter the model (default is TRUE).
- min_diff
minimal ATAC difference for a peak to participate in the initial prego motif inference and in the distillation step (if
distill_on_diffis TRUE).- prego_sample_for_kmers
whether to use a sample of the peaks for kmer screening. Default: TRUE
- prego_sample_fraction
Fraction of peaks to sample for prego motif inference. A smaller number would be faster but might lead to over-fitting. Default: 0.1
- r2_threshold
minimal R^2 for a feature to be included in the model.
- bits_threshold
minimal sum of bits for a feature to be included in the model.
- filter_sample_frac
The fraction of samples to use for computing the r2 without each model at the filtering step. When NULL, all samples are used.
- include_interactions
whether to include interactions between motifs / additional fetures as model features. IQ will create interactions between significant additional features and all motifs, and between significant motifs. Default: FALSE
- interaction_threshold
threshold for the selecting features to create interactions. IQ learns a linear model on the features and selects the features with coefficients above this threshold. Default: 0.001
- max_motif_interaction_n
maximum number of motifs to consider for interactions. If NULL, all motifs above the interaction_threshold will be considered. Default: NULL
- max_add_interaction_n
maximum number of additional features to consider for interactions. If NULL, all additional features above the interaction_threshold will be considered. Default: NULL
- n_prego_motifs
number of prego motifs (de-novo motifs) to consider.
- n_cores
The number of cores to use for parallel processing. When NULL, the number of threads is automatically determined as 80% of the available cores. See
prego::set_parallelfor more details.- output_dir
A directory to save intermediate results. If not NULL, the train and test indices are saved to a CSV file, together with the models at each step (before filtering, after filtering, and after adding interactions). The models are saved as RDS files. If the directory exists, the files are overwritten.
- plot_report
A logical value indicating whether to plot the model report. Default is TRUE.
- ...
Arguments passed on to
regress_trajectory_motifsn_clust_factorfactor to divide the number of to keep after clustering. e.g. if n_clust_factor > 1 the number of motifs to keep will be reduced by a factor of n_clust_factor. Default: 1
norm_motif_energiesA numeric matrix, representing the normalized energy of each motif in each interval of
norm_intervals. If NULL, the function will usepssm_dbto calculate the motif energies. Note that this might take a while.pssm_dba data frame with PSSMs ('A', 'C', 'G' and 'T' columns), with an additional column 'motif' containing the motif name. All the motifs in
motif_energies(column names) should be present in the 'motif' column. Default: all motifs.min_tss_distancedistance from Transcription Start Site (TSS) to classify a peak as an enhancer. Default: 5000. If NULL, no filtering will be performed - use this option if your peaks are already filtered.
Note that in order to filter peaks that are too close to TSS, the currentmishagenome must have an intervals set calledintervs.global.tss.normalize_energieswhether to normalize the motif energies. Set this to FALSE if the motif energies are already normalized.
min_initial_energy_corminimal correlation between the motif normalized energy and the ATAC difference.
energy_norm_quantilequantile of the energy used for normalization. Default: 1
norm_energy_maxmaximum value of the normalized energy. Default: 10
distill_on_diffwhether to distill motifs based on differential accessibility. If FALSE, all peaks will be used for distillation, if TRUE - only peaks with differential accessibility >= min_diff will be used.
feature_selection_betabeta parameter used for feature selection.
filter_using_r2whether to filter features using R^2.
parallelwhether to use parallel processing on glmnet.
spat_num_binsnumber of spatial bins to use.
spat_bin_sizesize of each spatial bin.
kmer_sequence_lengthlength of the kmer sequence to use for kmer screening. By default the full sequence is used.
max_interaction_nmaximum number of interactions to consider.
alphaThe elasticnet mixing parameter, with \(0\le\alpha\le 1\). The penalty is defined as $$(1-\alpha)/2||\beta||_2^2+\alpha||\beta||_1.$$
alpha=1is the lasso penalty, andalpha=0the ridge penalty.lambdaA user supplied
lambdasequence. Typical usage is to have the program compute its ownlambdasequence based onnlambdaandlambda.min.ratio. Supplying a value oflambdaoverrides this. WARNING: use with care. Avoid supplying a single value forlambda(for predictions after CV usepredict()instead). Supply instead a decreasing sequence oflambdavalues.glmnetrelies on its warms starts for speed, and its often faster to fit a whole path than compute a single fit.symmetrize_spatif TRUE, the spatial model would be symmetrized around the center bin. Default: TRUE.