Skip to contents

Perform cross-validation on a PWM regression model. You can either provide explicit folds, or use the nfolds argument to set the number of folds. If the response is binary (0 and 1) or a categories vector is given, the folds would be stratified by the response/categories.

Usage

regress_pwm.cv(
  sequences,
  response,
  nfolds = NULL,
  metric = NULL,
  folds = NULL,
  categories = NULL,
  use_sample = FALSE,
  seed = 60427,
  parallel = getOption("prego.parallel", FALSE),
  fold_parallel = !parallel && getOption("prego.parallel", FALSE),
  add_full_model = TRUE,
  alternative = "less",
  ...
)

Arguments

sequences

A vector of DNA sequences ('A', 'T', 'C' or 'G'. Will go through toupper). Sequences must be long enough to cover spat_num_bins * spat_bin_size bp, and should be centered around the motif/signal.

response

A matrix (or vector) of response variables. The number of rows must equal the number of sequences. A single binary vector (0/1) is required when score_metric = "ks".

nfolds

number of folds for cross-validation. Can be NULL if folds are provided.

metric

metric to use for cross-validation. One of 'ks' or 'r2'. If NULL - 'ks' would be set for binary response and 'r2' for continuous response.

folds

vector of fold numbers for each sequence (optional)

categories

vector of categories for each sequence (optional)

use_sample

whether to use sampled optimization or not.

seed

Random seed.

parallel

whether to run the cross-validation in parallel.

fold_parallel

whether to run the optimization in each fold in parallel. It is recommended to set this to FALSE if parallel is TRUE.

add_full_model

whether to add the full model (without cross-validation) to the results.

alternative

Alternative hypothesis for ks.test. One of "two.sided", "less", "greater".

...

Arguments passed on to regress_pwm, regress_pwm.sample

motif

Initial seed for the regression. Either a kmer string ("*" denotes a wildcard) or a data frame with a pre-computed PSSM (same shape as the pssm slot of the return value). If NULL (default), a kmer screen finds a seed automatically. Ignored when init_from_dataset = TRUE.

init_from_dataset

If TRUE, initialize from the best-matching PSSM in motif_dataset (chosen by final_metric; see screen_pwm). Overrides motif.

motif_length

Length of the seed motif. Shorter seeds are padded with wildcards; longer seeds are not truncated.

score_metric

Metric optimized during PWM fitting. One of "r2" or "ks". "ks" requires a single binary response.

bidirect

Whether the motif is bi-directional. If TRUE, the reverse-complement is also scored.

spat_bin_size

Size of each spatial bin (bp).

spat_num_bins

Number of spatial bins. Sequence length must be at least spat_bin_size * spat_num_bins; positions outside this window are ignored. If bidirect = TRUE, this should be odd (prego symmetrizes the spatial model around the center bin).

spat_model

A previously computed spatial model (the spat slot of a prior result) to reuse.

improve_epsilon

Minimum improvement in the objective for the optimizer to continue.

min_nuc_prob

Floor on per-position nucleotide probabilities at every iteration.

unif_prior

Uniform prior added to nucleotide probabilities.

include_response

If TRUE (default), the response matrix is stored in the result.

verbose

Show verbose messages from the optimizer.

consensus_single_thresh,consensus_double_thresh

Probability thresholds for calling a single-nucleotide or two-nucleotide IUPAC consensus, respectively.

match_with_db

If TRUE, match the resulting PSSM to the closest motif in motif_dataset using pssm_match. The closest match is always returned, even if similarity is low.

screen_db

If TRUE, screen motif_dataset via screen_pwm and add the best database motif to the result (db_motif, db_motif_pred, db_motif_pssm, db_motif_score).

motif_dataset

A data frame with PSSMs (columns A, C, G, T, motif), e.g. HOMER_motifs, JASPAR_motifs, or all_motif_datasets() (default).

multi_kmers

If TRUE (default) and motif = NULL, generate multiple kmer candidates and pick the best by final_metric. If motif is provided, this is ignored. See also return_all.

return_all

If TRUE, return all candidate-kmer regressions instead of only the best one. Requires multi_kmers = TRUE, motif = NULL, and motif_num = 1. Use this when you want N independent kmer-seeded motifs without the residual-rounds approach of motif_num > 1. Default: FALSE. See Value for the returned structure.

final_metric

Metric used to pick the best motif across candidates (and to score the final fit). One of "ks" or "r2". Distinct from score_metric, which is used inside the optimizer. If NULL: "ks" for binary response, "r2" otherwise.

kmer_length

Vector of kmer lengths to try in the seed screen.

max_cands

Maximum number of candidate kmers to regress.

motif_num

Number of motifs to infer. If > 1, dispatches to regress_multiple_motifs (residual rounds; see Details).

smooth_k

Window size (in observations) for the running-mean residual computed between rounds when motif_num > 1.

min_kmer_cor

Minimum kmer-response correlation for a kmer to qualify as a seed candidate.

internal_num_folds

Number of internal cross-validation folds during optimization.

sample_for_kmers

If TRUE, run the kmer screen on a random sample of sequences (final regression is still on the full data). Useful for large datasets. Only relevant when multi_kmers = TRUE.

sample_frac

Fraction of sequences sampled for the kmer screen. Default: 0.1.

sample_idxs

Explicit sequence indices to use for the kmer screen. If NULL, a random sample is drawn.

sample_ratio

For binary response, ratio of '1' to '0' in the sampled subset. Relevant only when sample_frac is NULL.

val_frac

Fraction of sequences held out for internal validation in the multi-kmer mode. Default: 0.1.

log_energy

If TRUE, the energy is log-transformed at every iteration.

energy_func

Optional function applied to the energy at every iteration; takes and returns a numeric vector (e.g. log or function(x) x^2). Input energies are in the range [0, 1]; if your function was fit on log energies, use log_energy = TRUE.

xmin,xmax,npts

Range and resolution for interpolating energy_func.

energy_func_generator

A function that, given the previous-iteration result and the original response, returns an energy_func. Used in motif_num > 1: each round, the generator builds an energy function (e.g. a GAM-based one for non-monotonic relationships) and a second pass is run with it. Example: function(prev_reg, resp) { df <- data.frame(x = prev_reg$pred, y = resp) model <- mgcv::gam(y ~ s(x, k = 3, bs = "cr"), family = binomial("logit"), data = df, method = "REML") function(z) mgcv::predict.gam(model, newdata = data.frame(x = z)) }

optimize_pwm

If FALSE, the PWM is held fixed and only the spatial model is optimized.

optimize_spat

If FALSE, the spatial model is held fixed and only the PWM is optimized.

kmer_sequence_length

Length of the central window used for the kmer screen. If NULL, the full sequence is used.

symmetrize_spat

If TRUE (default), the spatial model is symmetrized around the center bin.

min_gap,max_gap

the length of a gap to be considered in the pattern. Only one gap, of length min_gap:max_gap, is being used, and is located anywhere in the motif. Note that this greatly expand the search space (and increase multiple testing severely).

Value

a list with the following elements:

cv_models:

a list of models, one for each fold.

cv_pred:

a vector of predictions for each sequence.

score:

score of the model on the cross-validated predictions.

cv_scores:

a vector of scores for each fold.

folds:

a vector with the fold number for each sequence.

full_model:

The full model (without cross-validation), if add_full_model is TRUE.

Examples

if (FALSE) { # \dontrun{
res <- regress_pwm.cv(
    cluster_sequences_example, cluster_mat_example[, 1],
    nfolds = 5, use_sample = TRUE, sample_frac = c(0.1, 1)
)
res$score
res$cv_scores

plot(
    res$cv_pred,
    res$full_model$pred,
    xlab = "CV predictions", ylab = "Full model predictions", cex = 0.1
)
plot_regression_prediction_binary(res$cv_pred, cluster_mat_example[, 1])
plot_regression_prediction_binary(res$full_model$pred, cluster_mat_example[, 1])

# without sampling
res <- regress_pwm.cv(
    cluster_sequences_example, cluster_mat_example[, 1],
    nfolds = 5, use_sample = FALSE
)
res$score
res$cv_scores
plot(res$cv_pred,
    res$full_model$pred,
    xlab = "CV predictions", ylab = "Full model predictions", cex = 0.1
)
} # }