Cross-validate a PWM regression model
regress_pwm.cv.RdPerform cross-validation on a PWM regression model. You can either provide explicit folds, or use the nfolds argument to set the number of folds. If the response is binary (0 and 1) or a categories vector is given, the folds would be stratified by the response/categories.
Arguments
- sequences
A vector of DNA sequences ('A', 'T', 'C' or 'G'. Will go through
toupper). Please make sure that the sequences are long enough to coverspat_num_bins*spat_bin_sizebp, and that they are centered around the motif/signal.- response
A matrix of response variables - number of rows should equal the number of sequences
- nfolds
number of folds for cross-validation. Can be NULL if
foldsare provided.- metric
metric to use for cross-validation. One of 'ks' or 'r2'. If NULL - 'ks' would be set for binary response and 'r2' for continuous response.
- folds
vector of fold numbers for each sequence (optional)
- categories
vector of categories for each sequence (optional)
- use_sample
whether to use sampled optimization or not.
- seed
random seed
- parallel
whether to run the cross-validation in parallel.
- fold_parallel
whether to run the optimization in each fold in parallel. It is recommended to set this to FALSE if
parallelis TRUE.- add_full_model
whether to add the full model (without cross-validation) to the results.
- alternative
alternative hypothesis for the p-value calculation when using
ks.test. One of "two.sided", "less" or "greater".- ...
Arguments passed on to
regress_pwm,regress_pwm.samplemotifInitial motif to start the regression from. Can be either a string with a kmer where the character "*" indicates a wildcard or a data frame with a pre-computed PSSM (see the slot
pssmin the return value of this function). If NULL - a K-mer screen would be performed in order to find the best kmer for initialization. Ifinit_from_datasetis TRUE, the regression would be initialized from the PSSM of the best motif in the dataset.init_from_datasetinitialize the regression from the PSSM of the best motif in
motif_dataset, usingfinal_metricas the metric. If TRUE, themotifparameter would be ignored. Seescreen_pwmfor more details.motif_lengthLength of the seed motif. If the motif is shorter than this, it will be extended by wildcards (stars). Note that If the motif is longer than this, it will not be truncated.
score_metricmetric to use for optimizing the PWM. One of "r2" or "ks". When using "ks" the response variable should be a single vector of 0 and 1.
bidirectis the motif bi-directional. If TRUE, the reverse-complement of the motif will be used as well.
spat_bin_sizesize of the spatial bin (in bp).
spat_num_binsnumber of spatial bins. Please make sure that the sequences are long enough to cover this number of bins. bp outside of spat_bin_size * spat_num_bins would be ignored. If
bidirectis TRUE, the number of bins should be odd as 'prego' symmetrizes the motif around the center bin.spat_modela previously computed spatial model (see
spat) in the return value of this function.improve_epsilonminimum improve in the objective function to continue the optimization
min_nuc_probminimum nucleotide probability in every iteration
unif_prioruniform prior for nucleotide probabilities
include_responseinclude the response in the resulting list (default: TRUE)
verboseshow verbose messages.
consensus_single_thresh,consensus_double_threshthresholds for the consensus sequence calculation (single and double nucleotides)
match_with_dbmatch the resulting PWMs with motif databases using
pssm_match. Note that the closest match is returned, even if it is not similar enough in absolute terms.screen_dbScreen
motif_datasetusingscreen_pwmand use the best motif as the initial motif. If TRUE, the following fields would be added to the return value: "db_motif", "db_motif_pred", "db_motif_pssm" and "db_motif_score".motif_dataseta data frame with PSSMs ('A', 'C', 'G' and 'T' columns), with an additional column 'motif' containing the motif name, for example
HOMER_motifs,JASPAR_motifsor all_motif_datasets(). By default all_motif_datasets() would be used.multi_kmersif TRUE, different candidates of kmers would be regressed in order to find the best seed according to
final_metric.final_metricmetric to use in order to choose the best motif. One of 'ks' or 'r2'. Note that unlike
score_metricwhich is used in the regression itself, this metric is used only for choosing the best motif out of all the runs on the sampled dataset. If NULL - 'ks' would be used for binary response and 'r2' for continuous response.kmer_lengtha vector of kmer lengths to screen in order to find the best seed motif.
max_candsmaximum number of kmer candidates to try.
motif_numNumber of motifs to infer. When
motif_num> 1, the function would runmotif_numtimes, each time on the residuals of a linear model of all the previous runs (seesmooth_kparameter). The best motif is then returned, while all the others are stored at 'models' in the return value.smooth_kk for smoothing the predictions of each model in order to compute the residuals when
motif_num> 1. The residuals are computed asresponse- running mean of size 'k' of the current model.min_kmer_corminimal correlation between the kmer and the response in order to use it as a seed.
internal_num_foldsnumber of folds to use in the internal cross-validation.
sample_for_kmersUse a random sample of the dataset in order to find the best kmer. This is useful when the dataset is very large and the kmer screen would take a long time. Note that the final regression would be performed on the entire dataset. Only relevant when
multi_kmersis TRUE.sample_fracfraction of the dataset to use for the kmer screen. Default: 0.1.
sample_idxsindices of the sequences to use for the kmer screen. If NULL, a random sample would be used.
sample_ratioratio between the '1' category and the '0' category in the sampled dataset (for binary response). Relevant only when
sample_fracis NULL.log_energytransform the energy to log scale on each iteration.
energy_funca function to transform the energy at each iteration. Should accept a numeric vector and return a numeric vector. e.g.
logorfunction(x) x^2. Note that the range of the input energies is between 0 and 1 (the probability of the motif in the sequence), so if you inferred the function using the the returned energies (which are in log scale) you should make sure that the function first log transforms usinglog_energy=TRUE.xmin,xmax,nptsrange for the energy function and the number of points to use for its interpolation.
energy_func_generatora function to generate the energy function when regressing multiple motifs. Should accept the result of the previous iteration + the original response and return a function similar to
energy_func. e.g.function(prev_reg, resp) { df <- data.frame(x = prev_reg$pred, y = resp) fn_gam <- as.formula("y ~ s(x, k=3, bs='cr')") model <- mgcv::gam(fn_gam, family = binomial(link = "logit"), data = df, method="REML") function(z){ mgcv::predict.gam(object = model, newdata = data.frame(x = z)) }}. When this parameter is not NULL, energy_func_generator would create an energy function and then run another step of regression initialized with the previous motif withenergy_funcas the energy function. This is useful when the energy function is not monotonic, for example - one might want to use a gam model to fit the energy function like in the example above.optimize_pwmoptimize the PWM model (Default: TRUE). If FALSE, the PWM model would be used as the initial model for the spatial model.
optimize_spatoptimize the spatial model (Default: TRUE). If FALSE, the spatial model would be used as the initial model for the PWM model.
kmer_sequence_lengththe length of the sequence to use for the kmer screen. If NULL, the entire sequence would be used.
symmetrize_spatif TRUE, the spatial model would be symmetrized around the center bin. Default: TRUE.
min_gap,max_gapthe length of a gap to be considered in the pattern. Only one gap, of length min_gap:max_gap, is being used, and is located anywhere in the motif. Note that this greatly expand the search space (and increase multiple testing severely).
Value
a list with the following elements:
- cv_models:
a list of models, one for each fold.
- cv_pred:
a vector of predictions for each sequence.
- score:
score of the model on the cross-validated predictions.
- cv_scores:
a vector of scores for each fold.
- folds:
a vector with the fold number for each sequence.
- full_model:
The full model (without cross-validation), if
add_full_modelis TRUE.
Examples
if (FALSE) { # \dontrun{
res <- regress_pwm.cv(
cluster_sequences_example, cluster_mat_example[, 1],
nfolds = 5, use_sample = TRUE, sample_frac = c(0.1, 1)
)
res$score
res$cv_scores
plot(
res$cv_pred,
res$full_model$pred,
xlab = "CV predictions", ylab = "Full model predictions", cex = 0.1
)
plot_regression_prediction_binary(res$cv_pred, cluster_mat_example[, 1])
plot_regression_prediction_binary(res$full_model$pred, cluster_mat_example[, 1])
# without sampling
res <- regress_pwm.cv(
cluster_sequences_example, cluster_mat_example[, 1],
nfolds = 5, use_sample = FALSE
)
res$score
res$cv_scores
plot(res$cv_pred,
res$full_model$pred,
xlab = "CV predictions", ylab = "Full model predictions", cex = 0.1
)
} # }