Perform a PWM regression

Usage

regress_multiple_motifs(
  sequences,
  response,
  motif = NULL,
  motif_length = 15,
  score_metric = "r2",
  bidirect = TRUE,
  spat_bin_size = NULL,
  spat_num_bins = NULL,
  spat_model = NULL,
  improve_epsilon = 0.0001,
  min_nuc_prob = 0.001,
  unif_prior = 0.05,
  include_response = TRUE,
  seed = 60427,
  verbose = FALSE,
  kmer_length = 8,
  multi_kmers = FALSE,
  final_metric = NULL,
  max_cands = 10,
  min_gap = 0,
  max_gap = 1,
  min_kmer_cor = 0.08,
  motif_num = 2,
  smooth_k = 100,
  consensus_single_thresh = 0.6,
  consensus_double_thresh = 0.85,
  internal_num_folds = 1,
  match_with_db = TRUE,
  motif_dataset = all_motif_datasets(),
  parallel = getOption("prego.parallel", FALSE),
  alternative = "less",
  sample_for_kmers = FALSE,
  sample_frac = NULL,
  sample_idxs = NULL,
  sample_ratio = 1,
  log_energy = FALSE,
  energy_func_generator = NULL,
  energy_func = NULL,
  optimize_pwm = TRUE,
  optimize_spat = TRUE,
  ...
)

regress_pwm(
  sequences,
  response,
  motif = NULL,
  motif_length = 15,
  init_from_dataset = FALSE,
  score_metric = "r2",
  bidirect = TRUE,
  spat_bin_size = NULL,
  spat_num_bins = NULL,
  spat_model = NULL,
  improve_epsilon = 0.0001,
  min_nuc_prob = 0.001,
  unif_prior = 0.05,
  include_response = TRUE,
  seed = 60427,
  verbose = FALSE,
  kmer_length = 8,
  multi_kmers = TRUE,
  final_metric = NULL,
  max_cands = 10,
  min_gap = 0,
  max_gap = 1,
  min_kmer_cor = 0.08,
  motif_num = 1,
  smooth_k = 100,
  consensus_single_thresh = 0.6,
  consensus_double_thresh = 0.85,
  internal_num_folds = 1,
  match_with_db = TRUE,
  screen_db = FALSE,
  motif_dataset = all_motif_datasets(),
  parallel = getOption("prego.parallel", FALSE),
  alternative = "less",
  sample_for_kmers = FALSE,
  sample_frac = NULL,
  sample_idxs = NULL,
  sample_ratio = 1,
  log_energy = FALSE,
  energy_func = NULL,
  xmin = -100,
  xmax = 100,
  npts = 10000,
  energy_func_generator = NULL,
  optimize_pwm = TRUE,
  optimize_spat = TRUE,
  kmer_sequence_length = NULL,
  symmetrize_spat = TRUE,
  ...
)

Arguments

sequences

A vector of DNA sequences ('A', 'T', 'C' or 'G'. Will go through toupper). Please make sure that the sequences are long enough to cover spat_num_bins * spat_bin_size bp, and that they are centered around the motif/signal.

response

A matrix of response variables - number of rows should equal the number of sequences

motif

Initial motif to start the regression from. Can be either a string with a kmer where the character "*" indicates a wildcard or a data frame with a pre-computed PSSM (see the slot pssm in the return value of this function). If NULL - a K-mer screen would be performed in order to find the best kmer for initialization. If init_from_dataset is TRUE, the regression would be initialized from the PSSM of the best motif in the dataset.

motif_length

Length of the seed motif. If the motif is shorter than this, it will be extended by wildcards (stars). Note that If the motif is longer than this, it will not be truncated.

score_metric

metric to use for optimizing the PWM. One of "r2" or "ks". When using "ks" the response variable should be a single vector of 0 and 1.

bidirect

is the motif bi-directional. If TRUE, the reverse-complement of the motif will be used as well.

spat_bin_size

size of the spatial bin (in bp).

spat_num_bins

number of spatial bins. Please make sure that the sequences are long enough to cover this number of bins. bp outside of spat_bin_size * spat_num_bins would be ignored. If bidirect is TRUE, the number of bins should be odd as 'prego' symmetrizes the motif around the center bin.

spat_model

a previously computed spatial model (see spat) in the return value of this function.

improve_epsilon

minimum improve in the objective function to continue the optimization

min_nuc_prob

minimum nucleotide probability in every iteration

unif_prior

uniform prior for nucleotide probabilities

include_response

include the response in the resulting list (default: TRUE)

seed

random seed

verbose

show verbose messages.

kmer_length

a vector of kmer lengths to screen in order to find the best seed motif.

multi_kmers

if TRUE, different candidates of kmers would be regressed in order to find the best seed according to final_metric.

final_metric

metric to use in order to choose the best motif. One of 'ks' or 'r2'. Note that unlike score_metric which is used in the regression itself, this metric is used only for choosing the best motif out of all the runs on the sampled dataset. If NULL - 'ks' would be used for binary response and 'r2' for continuous response.

max_cands

maximum number of kmer candidates to try.

min_gap, max_gap

the length of a gap to be considered in the pattern. Only one gap, of length min_gap:max_gap, is being used, and is located anywhere in the motif. Note that this greatly expand the search space (and increase multiple testing severely).

min_kmer_cor

minimal correlation between the kmer and the response in order to use it as a seed.

motif_num

Number of motifs to infer. When motif_num > 1, the function would run motif_num times, each time on the residuals of a linear model of all the previous runs (see smooth_k parameter). The best motif is then returned, while all the others are stored at 'models' in the return value.

smooth_k

k for smoothing the predictions of each model in order to compute the residuals when motif_num > 1. The residuals are computed as response - running mean of size 'k' of the current model.

consensus_single_thresh, consensus_double_thresh

thresholds for the consensus sequence calculation (single and double nucleotides)

internal_num_folds

number of folds to use in the internal cross-validation.

match_with_db

match the resulting PWMs with motif databases using pssm_match. Note that the closest match is returned, even if it is not similar enough in absolute terms.

motif_dataset

a data frame with PSSMs ('A', 'C', 'G' and 'T' columns), with an additional column 'motif' containing the motif name, for example HOMER_motifs, JASPAR_motifs or all_motif_datasets(). By default all_motif_datasets() would be used.

parallel

whether to run optimization in parallel. use set_parallel to set the number of cores to use.

alternative

alternative hypothesis for the p-value calculation when using ks.test. One of "two.sided", "less" or "greater".

sample_for_kmers

Use a random sample of the dataset in order to find the best kmer. This is useful when the dataset is very large and the kmer screen would take a long time. Note that the final regression would be performed on the entire dataset. Only relevant when multi_kmers is TRUE.

sample_frac

fraction of the dataset to use for the kmer screen. Default: 0.1.

sample_idxs

indices of the sequences to use for the kmer screen. If NULL, a random sample would be used.

sample_ratio

ratio between the '1' category and the '0' category in the sampled dataset (for binary response). Relevant only when sample_frac is NULL.

log_energy

transform the energy to log scale on each iteration.

energy_func_generator

a function to generate the energy function when regressing multiple motifs. Should accept the result of the previous iteration + the original response and return a function similar to energy_func. e.g. function(prev_reg, resp) { df <- data.frame(x = prev_reg$pred, y = resp) fn_gam <- as.formula("y ~ s(x, k=3, bs='cr')") model <- mgcv::gam(fn_gam, family = binomial(link = "logit"), data = df, method="REML") function(z){ mgcv::predict.gam(object = model, newdata = data.frame(x = z)) }}. When this parameter is not NULL, energy_func_generator would create an energy function and then run another step of regression initialized with the previous motif with energy_func as the energy function. This is useful when the energy function is not monotonic, for example - one might want to use a gam model to fit the energy function like in the example above.

energy_func

a function to transform the energy at each iteration. Should accept a numeric vector and return a numeric vector. e.g. log or function(x) x^2. Note that the range of the input energies is between 0 and 1 (the probability of the motif in the sequence), so if you inferred the function using the the returned energies (which are in log scale) you should make sure that the function first log transforms using log_energy=TRUE.

optimize_pwm

optimize the PWM model (Default: TRUE). If FALSE, the PWM model would be used as the initial model for the spatial model.

optimize_spat

optimize the spatial model (Default: TRUE). If FALSE, the spatial model would be used as the initial model for the PWM model.

...

Arguments passed on to screen_kmers

min_cor: Only patterns for which the maximum correlation to one of the response variable is larger than min_cor will be reported
is_train: a boolean vector that determine which subset of sequences to use when screening
min_gap,max_gap: the length of a gap to be considered in the pattern. Only one gap, of length min_gap:max_gap, is being used, and is located anywhere in the motif. Note that this greatly expand the search space (and increase multiple testing severely).
from_range: Sequences will be considered only from position from_range (default 0)
to_range: Sequences will be considered only up to position to_range (default NULL - using the length of the sequences)
return_mat: Return a matrix of patterns and their correlation to the response variables instead of a data frame. (default: FALSE)

init_from_dataset

initialize the regression from the PSSM of the best motif in motif_dataset, using final_metric as the metric. If TRUE, the motif parameter would be ignored. See screen_pwm for more details.

screen_db

Screen motif_dataset using screen_pwm and use the best motif as the initial motif. If TRUE, the following fields would be added to the return value: "db_motif", "db_motif_pred", "db_motif_pssm" and "db_motif_score".

xmin, xmax, npts

range for the energy function and the number of points to use for its interpolation.

kmer_sequence_length

the length of the sequence to use for the kmer screen. If NULL, the entire sequence would be used.

symmetrize_spat

if TRUE, the spatial model would be symmetrized around the center bin. Default: TRUE.

Value

a list with the following elements:

pssm:: data frame with the pssm matrix with the inferred motif, where rows are positions and columns are nucleotides.
spat:: a data frame with the inferred spatial model, with the spatial factor for each bin.
pred:: a vector with the predicted pwm for each sequence.
consensus:: Consensus sequence based on the PSSM.
response:: The response matrix. If include_response is FALSE, the response matrix is not included in the list.
r2:: $r^2$ of the prediction with respect to the each response variable.
ks:: If response is binary, Kolmogorov-Smirnov test results of the predictions where the response was 1 vs the predictions where the response was 0.
seed_motif:: The seed motif that started the regression.
kmers:: The k-mers that were screened in order to find the best seed motif (if motif was NULL).
sample_idxs:: The indices of the sequences that were used for the regression (only for regress_pwm.sample).
predict:: a function that can be used to predict the PWM for a new sequence.

When match_with_db is TRUE, the following additional elements are returned:

motif_db:: The motif database that the most similar to the resulting PSSM.
db_match_cor:: The correlation between the resulting PSSM and the closest match in the motif database.
db_match_pssm:: The PSSM of the closest match in the motif database.
db_match_pred:: The predicted PWM of the closest match in the motif database.
db_match_r2:: The $r^2$ of the predicted PWM of the closest match in the motif database and the response
db_match_ks:: If response is binary, the Kolmogorov-Smirnov test results of the predicted PWM of the closest match in the motif database where the response was 1 vs the predictions where the response was 0.

When screen_db is TRUE, the following additional elements are returned:

db_motif:: The best motif from the motif database.
db_motif_pred:: The predicted PWM of the best motif from the motif database.
db_motif_pssm:: The PSSM of the best motif from the motif database.
db_motif_score:: The score of the best motif from the motif database.

When n_motifs is greater than 1, a list with the following elements is returned:

models:: A list (as above) of each inferred model
multi_stats:: A data frame with the following columns: model, score (KS for binary, r^2 otherwise), comb_score (score for the combined linear model for models 1:i) and additional statistics per model
pred:: a vector with the predicted pwm for using a linear model of the combined scores.
comb_modle:: a linear model of the combined scores.
predict:: a function that can be used to predict the PWM for a new sequence.
predict_multi:: a function that can be used to predict the PWM for the different models for a new sequence

Examples

if (FALSE) { # \dontrun{
res <- regress_pwm(sequences_example, response_mat_example)
res$pssm
res$spat
head(res$pred)

plot_regression_qc(res)

# intialize with a pre-computed PSSM
res1 <- regress_pwm(sequences_example, response_mat_example, motif = res$pssm)

# intialize with a pre-computed PSSM and spatial model
res2 <- regress_pwm(
    sequences_example,
    response_mat_example,
    motif = res$pssm,
    spat_model = res$spat
)

# binary response
res_binary <- regress_pwm(cluster_sequences_example, cluster_mat_example[, 1])
plot_regression_qc(res_binary)

# match with db
res_binary <- regress_pwm(cluster_sequences_example, cluster_mat_example[, 1], match_with_db = TRUE)
plot_regression_qc(res_binary)

# Screen for best db motif
res_binary <- regress_pwm(cluster_sequences_example, cluster_mat_example[, 1], screen_db = TRUE)
plot_regression_qc(res_binary)

# initialize with a motif from the database
res_binary <- regress_pwm(
    cluster_sequences_example,
    cluster_mat_example[, 1],
    init_from_dataset = TRUE
)

# use multiple kmer seeds
res_multi <- regress_pwm(
    cluster_sequences_example,
    cluster_mat_example[, 1],
    multi_kmers = TRUE,
    kmer_length = 6:8,
    final_metric = "ks"
)
plot_regression_qc(res_multi)

# Screen for multiple motifs
res_multi <- regress_pwm(
    cluster_sequences_example,
    cluster_mat_example[, 1],
    motif_num = 3,
    match_with_db = TRUE
)
res_multi$multi_stats
plot_regression_qc_multi(res_multi)
} # }

# regress_multiple_motifs is an alias for regress_pwm with motif_num > 1
res_multi2 <- regress_multiple_motifs(
    cluster_sequences_example,
    cluster_mat_example[, 1],
    motif_num = 5,
    match_with_db = TRUE
)
#> ℹ Using 7 bins of size 40 bp
#> ℹ Running regression for 5 motifs
#> 
#> ── Running first regression ──
#> 
#> ℹ Using 7 bins of size 40 bp
#> ℹ Using "ks" as the final metric
#> ℹ Number of response variables: 1
#> ℹ Screening for kmers in order to initialize regression
#> ℹ Number of response variables: 1
#> ℹ Screening kmers of length 8, from position 0 to position 300
#> ℹ Gaps of length 0:1 are allowed
#> ℹ minimal correlation: 0.08
#> ✔ Found 1032 kmers in 2359 sequences.
#> ℹ Motif is shorter than 15, extending with wildcards
#> ℹ Initializing regression with the following motif: "***TTAAT*ATT***"
#> ℹ Running regression
#> • Motif length: 15
#> • Bidirectional: TRUE
#> • Spat min: 10
#> • Spat max: 290
#> • Spat bin size: 40
#> • Number of bins: 7
#> • Improve epsilon: 0.0001
#> • Min nuc prob: 0.001
#> • Uniform prior: 0.05
#> • Score metric: "r2"
#> • Seed: 60427
#> ✔ Finished running regression. Consensus: "T*AM**W*Y"
#> ℹ Best match in the database: "JOLMA.HNF1B_di_full_1", cor: 0.784
#> ✔ "JOLMA.HNF1B_di_full_1" KS test D: 0.8487, p-value: 0
#> ✔ KS test D: 0.8472, p-value: 0
#> 
#> ── Running regression #2 ──
#> 
#> ℹ Using 7 bins of size 40 bp
#> ℹ Using "r2" as the final metric
#> ℹ Number of response variables: 1
#> ℹ Screening for kmers in order to initialize regression
#> ℹ Number of response variables: 1
#> ℹ Screening kmers of length 8, from position 0 to position 300
#> ℹ Gaps of length 0:1 are allowed
#> ℹ minimal correlation: 0.08
#> ✔ Found 2525 kmers in 2359 sequences.
#> ℹ Motif is shorter than 15, extending with wildcards
#> ℹ Initializing regression with the following motif: "***TAATGC*TA***"
#> ℹ Running regression
#> • Motif length: 15
#> • Bidirectional: TRUE
#> • Spat min: 10
#> • Spat max: 290
#> • Spat bin size: 40
#> • Number of bins: 7
#> • Improve epsilon: 0.0001
#> • Min nuc prob: 0.001
#> • Uniform prior: 0.05
#> • Score metric: "r2"
#> • Seed: 60427
#> ✔ Finished running regression. Consensus: "TTAATGS*TA*T"
#> ℹ Best match in the database: "JOLMA.SOX9_mono_DBD", cor: 0.685
#> ✔ R^2: 0.0055
#> ℹ KS statistic: 0.395846125038647
#> ℹ KS test statistic for models 1 and 2: 0.85079972652967
#> ℹ Improvement in KS test statistic: 0.00363937624378918
#> 
#> ── Running regression #3 ──
#> 
#> ℹ Using 7 bins of size 40 bp
#> ℹ Using "r2" as the final metric
#> ℹ Number of response variables: 1
#> ℹ Screening for kmers in order to initialize regression
#> ℹ Number of response variables: 1
#> ℹ Screening kmers of length 8, from position 0 to position 300
#> ℹ Gaps of length 0:1 are allowed
#> ℹ minimal correlation: 0.08
#> ✔ Found 2720 kmers in 2359 sequences.
#> ℹ Motif is shorter than 15, extending with wildcards
#> ℹ Initializing regression with the following motif: "***AAG*GGTTG***"
#> ℹ Running regression
#> • Motif length: 15
#> • Bidirectional: TRUE
#> • Spat min: 10
#> • Spat max: 290
#> • Spat bin size: 40
#> • Number of bins: 7
#> • Improve epsilon: 0.0001
#> • Min nuc prob: 0.001
#> • Uniform prior: 0.05
#> • Score metric: "r2"
#> • Seed: 60427
#> ✔ Finished running regression. Consensus: "T**AAA*GG*TG*A"
#> ℹ Best match in the database: "JASPAR.MSN4", cor: 0.676
#> ✔ R^2: 0.0354
#> ℹ KS statistic: 0.0200617485553538
#> ℹ KS test statistic for models 1, 2, and 3: 0.865892850144356
#> ℹ Improvement in KS test statistic: 0.0150931236146855
#> 
#> ── Running regression #4 ──
#> 
#> ℹ Using 7 bins of size 40 bp
#> ℹ Using "r2" as the final metric
#> ℹ Number of response variables: 1
#> ℹ Screening for kmers in order to initialize regression
#> ℹ Number of response variables: 1
#> ℹ Screening kmers of length 8, from position 0 to position 300
#> ℹ Gaps of length 0:1 are allowed
#> ℹ minimal correlation: 0.08
#> ✔ Found 1679 kmers in 2359 sequences.
#> ℹ Motif is shorter than 15, extending with wildcards
#> ℹ Initializing regression with the following motif: "***AGCG*AACC***"
#> ℹ Running regression
#> • Motif length: 15
#> • Bidirectional: TRUE
#> • Spat min: 10
#> • Spat max: 290
#> • Spat bin size: 40
#> • Number of bins: 7
#> • Improve epsilon: 0.0001
#> • Min nuc prob: 0.001
#> • Uniform prior: 0.05
#> • Score metric: "r2"
#> • Seed: 60427
#> ✔ Finished running regression. Consensus: "GCG*AACCTAC"
#> ℹ Best match in the database: "JASPAR.NAC025", cor: 0.721
#> ✔ R^2: 0.0204
#> ℹ KS statistic: 0.112921908544206
#> ℹ KS test statistic for models 1, 2, 3, and 4: 0.874352794143903
#> ℹ Improvement in KS test statistic: 0.00845994399954719
#> 
#> ── Running regression #5 ──
#> 
#> ℹ Using 7 bins of size 40 bp
#> ℹ Using "r2" as the final metric
#> ℹ Number of response variables: 1
#> ℹ Screening for kmers in order to initialize regression
#> ℹ Number of response variables: 1
#> ℹ Screening kmers of length 8, from position 0 to position 300
#> ℹ Gaps of length 0:1 are allowed
#> ℹ minimal correlation: 0.08
#> ✔ Found 1864 kmers in 2359 sequences.
#> ℹ Motif is shorter than 15, extending with wildcards
#> ℹ Initializing regression with the following motif: "***AGCG*AACC***"
#> ℹ Running regression
#> • Motif length: 15
#> • Bidirectional: TRUE
#> • Spat min: 10
#> • Spat max: 290
#> • Spat bin size: 40
#> • Number of bins: 7
#> • Improve epsilon: 0.0001
#> • Min nuc prob: 0.001
#> • Uniform prior: 0.05
#> • Score metric: "r2"
#> • Seed: 60427
#> ✔ Finished running regression. Consensus: "AGCG*AACCTA"
#> ℹ Best match in the database: "JASPAR.ACE2", cor: 0.642
#> ✔ R^2: 0.0101
#> ℹ KS statistic: 0.106887489603337
#> ℹ KS test statistic for models 1, 2, 3, 4, and 5: 0.880010538187867
#> ℹ Improvement in KS test statistic: 0.00565774404396424
#> ✔ Best model: model #1 (score of 0.847160350285881)