Cross-validate a PWM regression model
regress_pwm.cv.RdPerform cross-validation on a PWM regression model. You can either provide explicit folds, or use the nfolds argument to set the number of folds. If the response is binary (0 and 1) or a categories vector is given, the folds would be stratified by the response/categories.
Arguments
- sequences
A vector of DNA sequences ('A', 'T', 'C' or 'G'. Will go through
toupper). Sequences must be long enough to coverspat_num_bins*spat_bin_sizebp, and should be centered around the motif/signal.- response
A matrix (or vector) of response variables. The number of rows must equal the number of sequences. A single binary vector (0/1) is required when
score_metric = "ks".- nfolds
number of folds for cross-validation. Can be NULL if
foldsare provided.- metric
metric to use for cross-validation. One of 'ks' or 'r2'. If NULL - 'ks' would be set for binary response and 'r2' for continuous response.
- folds
vector of fold numbers for each sequence (optional)
- categories
vector of categories for each sequence (optional)
- use_sample
whether to use sampled optimization or not.
- seed
Random seed.
- parallel
whether to run the cross-validation in parallel.
- fold_parallel
whether to run the optimization in each fold in parallel. It is recommended to set this to FALSE if
parallelis TRUE.- add_full_model
whether to add the full model (without cross-validation) to the results.
- alternative
Alternative hypothesis for
ks.test. One of"two.sided","less","greater".- ...
Arguments passed on to
regress_pwm,regress_pwm.samplemotifInitial seed for the regression. Either a kmer string ("*" denotes a wildcard) or a data frame with a pre-computed PSSM (same shape as the
pssmslot of the return value). IfNULL(default), a kmer screen finds a seed automatically. Ignored wheninit_from_dataset = TRUE.init_from_datasetIf TRUE, initialize from the best-matching PSSM in
motif_dataset(chosen byfinal_metric; seescreen_pwm). Overridesmotif.motif_lengthLength of the seed motif. Shorter seeds are padded with wildcards; longer seeds are not truncated.
score_metricMetric optimized during PWM fitting. One of
"r2"or"ks"."ks"requires a single binary response.bidirectWhether the motif is bi-directional. If TRUE, the reverse-complement is also scored.
spat_bin_sizeSize of each spatial bin (bp).
spat_num_binsNumber of spatial bins. Sequence length must be at least
spat_bin_size * spat_num_bins; positions outside this window are ignored. Ifbidirect = TRUE, this should be odd (prego symmetrizes the spatial model around the center bin).spat_modelA previously computed spatial model (the
spatslot of a prior result) to reuse.improve_epsilonMinimum improvement in the objective for the optimizer to continue.
min_nuc_probFloor on per-position nucleotide probabilities at every iteration.
unif_priorUniform prior added to nucleotide probabilities.
include_responseIf TRUE (default), the response matrix is stored in the result.
verboseShow verbose messages from the optimizer.
consensus_single_thresh,consensus_double_threshProbability thresholds for calling a single-nucleotide or two-nucleotide IUPAC consensus, respectively.
match_with_dbIf TRUE, match the resulting PSSM to the closest motif in
motif_datasetusingpssm_match. The closest match is always returned, even if similarity is low.screen_dbIf TRUE, screen
motif_datasetviascreen_pwmand add the best database motif to the result (db_motif,db_motif_pred,db_motif_pssm,db_motif_score).motif_datasetA data frame with PSSMs (columns
A,C,G,T,motif), e.g.HOMER_motifs,JASPAR_motifs, orall_motif_datasets()(default).multi_kmersIf TRUE (default) and
motif = NULL, generate multiple kmer candidates and pick the best byfinal_metric. Ifmotifis provided, this is ignored. See alsoreturn_all.return_allIf TRUE, return all candidate-kmer regressions instead of only the best one. Requires
multi_kmers = TRUE,motif = NULL, andmotif_num = 1. Use this when you want N independent kmer-seeded motifs without the residual-rounds approach ofmotif_num > 1. Default: FALSE. See Value for the returned structure.final_metricMetric used to pick the best motif across candidates (and to score the final fit). One of
"ks"or"r2". Distinct fromscore_metric, which is used inside the optimizer. If NULL:"ks"for binary response,"r2"otherwise.kmer_lengthVector of kmer lengths to try in the seed screen.
max_candsMaximum number of candidate kmers to regress.
motif_numNumber of motifs to infer. If > 1, dispatches to
regress_multiple_motifs(residual rounds; see Details).smooth_kWindow size (in observations) for the running-mean residual computed between rounds when
motif_num > 1.min_kmer_corMinimum kmer-response correlation for a kmer to qualify as a seed candidate.
internal_num_foldsNumber of internal cross-validation folds during optimization.
sample_for_kmersIf TRUE, run the kmer screen on a random sample of sequences (final regression is still on the full data). Useful for large datasets. Only relevant when
multi_kmers = TRUE.sample_fracFraction of sequences sampled for the kmer screen. Default: 0.1.
sample_idxsExplicit sequence indices to use for the kmer screen. If NULL, a random sample is drawn.
sample_ratioFor binary response, ratio of '1' to '0' in the sampled subset. Relevant only when
sample_fracis NULL.val_fracFraction of sequences held out for internal validation in the multi-kmer mode. Default: 0.1.
log_energyIf TRUE, the energy is log-transformed at every iteration.
energy_funcOptional function applied to the energy at every iteration; takes and returns a numeric vector (e.g.
logorfunction(x) x^2). Input energies are in the range[0, 1]; if your function was fit on log energies, uselog_energy = TRUE.xmin,xmax,nptsRange and resolution for interpolating
energy_func.energy_func_generatorA function that, given the previous-iteration result and the original response, returns an
energy_func. Used inmotif_num > 1: each round, the generator builds an energy function (e.g. a GAM-based one for non-monotonic relationships) and a second pass is run with it. Example:function(prev_reg, resp) { df <- data.frame(x = prev_reg$pred, y = resp) model <- mgcv::gam(y ~ s(x, k = 3, bs = "cr"), family = binomial("logit"), data = df, method = "REML") function(z) mgcv::predict.gam(model, newdata = data.frame(x = z)) }optimize_pwmIf FALSE, the PWM is held fixed and only the spatial model is optimized.
optimize_spatIf FALSE, the spatial model is held fixed and only the PWM is optimized.
kmer_sequence_lengthLength of the central window used for the kmer screen. If NULL, the full sequence is used.
symmetrize_spatIf TRUE (default), the spatial model is symmetrized around the center bin.
min_gap,max_gapthe length of a gap to be considered in the pattern. Only one gap, of length min_gap:max_gap, is being used, and is located anywhere in the motif. Note that this greatly expand the search space (and increase multiple testing severely).
Value
a list with the following elements:
- cv_models:
a list of models, one for each fold.
- cv_pred:
a vector of predictions for each sequence.
- score:
score of the model on the cross-validated predictions.
- cv_scores:
a vector of scores for each fold.
- folds:
a vector with the fold number for each sequence.
- full_model:
The full model (without cross-validation), if
add_full_modelis TRUE.
Examples
if (FALSE) { # \dontrun{
res <- regress_pwm.cv(
cluster_sequences_example, cluster_mat_example[, 1],
nfolds = 5, use_sample = TRUE, sample_frac = c(0.1, 1)
)
res$score
res$cv_scores
plot(
res$cv_pred,
res$full_model$pred,
xlab = "CV predictions", ylab = "Full model predictions", cex = 0.1
)
plot_regression_prediction_binary(res$cv_pred, cluster_mat_example[, 1])
plot_regression_prediction_binary(res$full_model$pred, cluster_mat_example[, 1])
# without sampling
res <- regress_pwm.cv(
cluster_sequences_example, cluster_mat_example[, 1],
nfolds = 5, use_sample = FALSE
)
res$score
res$cv_scores
plot(res$cv_pred,
res$full_model$pred,
xlab = "CV predictions", ylab = "Full model predictions", cex = 0.1
)
} # }