
Regress trajectory motifs of a full manifold (multiple trajectories)
Usage
regress_trajectory_motifs_manifold(
  peak_intervals,
  atac_diff_add_mat,
  norm_intervals = NULL,
  max_motif_num = 120,
  target_traj_motif_num = 30,
  n_clust_factor = 1,
  motif_energies = NULL,
  norm_motif_energies = NULL,
  pssm_db = iceqream::motif_db,
  additional_features = NULL,
  min_tss_distance = 5000,
  bin_start = 1,
  bin_end = NULL,
  min_initial_energy_cor = 0.05,
  normalize_energies = TRUE,
  energy_norm_quantile = 1,
  norm_energy_max = 10,
  n_prego_motifs = 0,
  traj_prego = NULL,
  initial_min_diff = 0.1,
  min_diff = initial_min_diff,
  prego_sample_for_kmers = TRUE,
  prego_sample_fraction = 0.1,
  seed = 60427,
  feature_selection_beta = 0.003,
  lambda = 0.00001,
  alpha = 1,
  filter_using_r2 = FALSE,
  r2_threshold = 0.0005,
  parallel = TRUE,
  peaks_size = 500,
  spat_num_bins = NULL,
  spat_bin_size = 2,
  kmer_sequence_length = 300,
  include_interactions = FALSE,
  interaction_threshold = 0.001,
  max_motif_interaction_n = NULL,
  max_add_interaction_n = NULL,
  ...
)
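In the signature above, `min_diff` defaults to `initial_min_diff`. R evaluates default arguments lazily, so changing `initial_min_diff` alone also changes `min_diff` unless `min_diff` is set explicitly. The sketch below illustrates this with a toy function; `toy_defaults` is a hypothetical stand-in that mirrors only these two defaults and is not part of the package.

```r
# Toy stand-in (hypothetical, not package code): mirrors only the two
# linked defaults from the usage signature.
toy_defaults <- function(initial_min_diff = 0.1, min_diff = initial_min_diff) {
  c(initial_min_diff = initial_min_diff, min_diff = min_diff)
}

# Neither argument supplied: both take the value 0.1.
res_default <- toy_defaults()

# Only initial_min_diff supplied: min_diff follows it.
res_linked <- toy_defaults(initial_min_diff = 0.2)

# Both supplied: the two thresholds are independent.
res_split <- toy_defaults(initial_min_diff = 0.2, min_diff = 0.05)

print(res_default)
print(res_linked)
print(res_split)
```

To use different thresholds for the two steps, pass both arguments explicitly.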
Arguments
- peak_intervals
A data frame, indicating the genomic positions ('chrom', 'start', 'end') of each peak.
- norm_intervals
A data frame indicating the genomic positions ('chrom', 'start', 'end') of the peaks used for energy normalization. If NULL, the function will use peak_intervals for normalization.
- max_motif_num
maximum number of motifs to consider. Default: 120
- target_traj_motif_num
Number of motifs to select for each trajectory. Note that the actual number of motifs may be lower or higher than this number, depending on the data from other trajectories.
- n_clust_factor
factor by which to divide the number of motifs to keep after clustering, e.g. if n_clust_factor > 1 the number of motifs to keep will be reduced by a factor of n_clust_factor. Default: 1
- motif_energies
A numeric matrix representing the energy of each motif in each peak. If NULL, the function will use pssm_db to calculate the motif energies. Note that this might take a while.
- norm_motif_energies
A numeric matrix representing the normalized energy of each motif in each interval of norm_intervals. If NULL, the function will use pssm_db to calculate the motif energies. Note that this might take a while.
- pssm_db
a data frame with PSSMs ('A', 'C', 'G' and 'T' columns), with an additional column 'motif' containing the motif name. All the motifs in motif_energies (column names) should be present in the 'motif' column. Default: all motifs.
- additional_features
A data frame, representing additional genomic features (e.g. CpG content, distance to TSS, etc.) for each peak. Note that NA values would be replaced with 0.
- min_tss_distance
distance from the Transcription Start Site (TSS) used to classify a peak as an enhancer. Default: 5000. If NULL, no filtering will be performed; use this option if your peaks are already filtered.
Note that in order to filter out peaks that are too close to a TSS, the current misha genome must have an intervals set called intervs.global.tss.
- bin_start
the start of the trajectory. Default: 1
- bin_end
the end of the trajectory. Default: the last bin (only used when atac_scores is provided)
- min_initial_energy_cor
minimal correlation between the motif normalized energy and the ATAC difference.
- normalize_energies
whether to normalize the motif energies. Set this to FALSE if the motif energies are already normalized.
- energy_norm_quantile
quantile of the energy used for normalization. Default: 1
- norm_energy_max
maximum value of the normalized energy. Default: 10
- n_prego_motifs
number of prego motifs (de-novo motifs) to consider.
- traj_prego
output of learn_traj_prego. If provided, no additional prego models would be inferred.
- initial_min_diff
minimal ATAC difference for a peak to participate in the initial prego motif inference.
- min_diff
minimal ATAC difference for a peak to participate in the initial prego motif inference and in the distillation step (if distill_on_diff is TRUE).
- prego_sample_for_kmers
whether to use a sample of the peaks for kmer screening. Default: TRUE
- prego_sample_fraction
Fraction of peaks to sample for prego motif inference. A smaller number would be faster but might lead to over-fitting. Default: 0.1
- seed
random seed for reproducibility.
- feature_selection_beta
beta parameter used for feature selection.
- lambda
A user-supplied lambda sequence. Typical usage is to have the program compute its own lambda sequence based on nlambda and lambda.min.ratio. Supplying a value of lambda overrides this. WARNING: use with care. Avoid supplying a single value for lambda (for predictions after CV use predict() instead). Supply instead a decreasing sequence of lambda values. glmnet relies on its warm starts for speed, and it is often faster to fit a whole path than to compute a single fit.
- alpha
The elasticnet mixing parameter, with \(0\le\alpha\le 1\). The penalty is defined as $$(1-\alpha)/2||\beta||_2^2+\alpha||\beta||_1.$$ alpha=1 is the lasso penalty, and alpha=0 the ridge penalty.
- filter_using_r2
whether to filter features using R^2.
- r2_threshold
minimal R^2 for a feature to be included in the model.
- parallel
whether to use parallel processing on glmnet.
- peaks_size
size of the peaks to extract sequences from. Default: 500bp
- spat_num_bins
number of spatial bins to use.
- spat_bin_size
size of each spatial bin.
- kmer_sequence_length
length of the kmer sequence to use for kmer screening. By default the full sequence is used.
- include_interactions
whether to include interactions between motifs / additional features as model features. IQ will create interactions between significant additional features and all motifs, and between significant motifs. Default: FALSE
- interaction_threshold
threshold for selecting the features used to create interactions. IQ learns a linear model on the features and selects the features with coefficients above this threshold. Default: 0.001
- max_motif_interaction_n
maximum number of motifs to consider for interactions. If NULL, all motifs above the interaction_threshold will be considered. Default: NULL
- max_add_interaction_n
maximum number of additional features to consider for interactions. If NULL, all additional features above the interaction_threshold will be considered. Default: NULL
- ...
Arguments passed on to distill_traj_model_multi:
traj_models
A list of trajectory models.
distill_single
Logical indicating whether to distill clusters with a single motif.
learn_single_spatial
Logical indicating whether to learn only spatial features for clusters with a single motif, or skip them.
use_all_motifs
Logical indicating whether to use all motifs in the resulting models. If FALSE, only motifs from clusters which had a motif from the original model are used.
cluster_report_dir
The directory to store cluster reports. If not NULL, a png would be created for each cluster.
filter_models
Logical indicating whether to filter the models before distillation. Defaults to TRUE.
unique_motifs
Logical indicating whether to keep only unique motifs. Defaults to FALSE.
intra_cor_thresh
The threshold for the average intra-cluster correlation to split clusters. If NULL, no splitting is done.
bits_threshold
minimal sum of bits for a feature to be included in the model.
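
To make the penalty formula given for the alpha argument concrete, the sketch below evaluates \((1-\alpha)/2||\beta||_2^2+\alpha||\beta||_1\) by hand in base R for a small coefficient vector. The helper `elastic_net_penalty` and the example `beta` values are hypothetical and serve only as illustration; this is not how the package or glmnet computes the fit.

```r
# Hypothetical helper: evaluates the elastic net penalty term from the
# alpha argument's formula for a given coefficient vector.
elastic_net_penalty <- function(beta, alpha) {
  stopifnot(alpha >= 0, alpha <= 1)
  ridge_part <- (1 - alpha) / 2 * sum(beta^2)  # (1 - alpha)/2 * ||beta||_2^2
  lasso_part <- alpha * sum(abs(beta))         # alpha * ||beta||_1
  ridge_part + lasso_part
}

# Illustrative coefficients (made up for this example).
beta <- c(0.5, -1, 0)

lasso <- elastic_net_penalty(beta, alpha = 1)    # pure L1 term: 1.5
ridge <- elastic_net_penalty(beta, alpha = 0)    # pure L2 term: 0.625
mixed <- elastic_net_penalty(beta, alpha = 0.5)  # blend of both: 1.0625

print(c(lasso = lasso, ridge = ridge, mixed = mixed))
```

With the default alpha = 1, only the absolute-value term remains, which is what drives coefficients of uninformative features to exactly zero.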