Screen for motifs in a database given a response variable

Usage

screen_pwm(
  sequences,
  response,
  metric = NULL,
  dataset = all_motif_datasets(),
  motifs = NULL,
  parallel = getOption("prego.parallel", TRUE),
  only_best = FALSE,
  prior = 0.01,
  alternative = "two.sided",
  ...
)

Arguments

sequences

a vector with the sequences

response

a vector with the response variable for each sequence. If the response is a matrix, the average will be used.

metric

metric used to choose the best motif. One of 'ks' or 'r2'. If NULL, the default is 'ks' for binary variables and 'r2' for continuous variables.
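To make the two metrics concrete, here is a minimal base-R sketch of how a KS-based score and an r2 score could be computed from a vector of per-sequence motif scores. It is an illustration only and not necessarily how screen_pwm computes its score column: pwm_score, binary_response and continuous_response are made-up data, and using the KS D statistic and the squared Pearson correlation as the scores is an assumption.

```r
# Hypothetical per-sequence motif scores (simulated, not produced by prego)
set.seed(1)
pwm_score <- c(rnorm(50, mean = 0), rnorm(50, mean = 1))

# Binary response: compare the score distributions of the two groups.
# Assumption: the 'ks' score is the D statistic of a two-sample KS test.
binary_response <- rep(c(0, 1), each = 50)
ks <- ks.test(
  pwm_score[binary_response == 1],
  pwm_score[binary_response == 0],
  alternative = "two.sided"
)
ks_score <- unname(ks$statistic)

# Continuous response: squared Pearson correlation with the motif scores.
# Assumption: this is what 'r2' refers to.
continuous_response <- 0.5 * pwm_score + rnorm(100)
r2_score <- cor(pwm_score, continuous_response)^2
```

Both quantities lie between 0 and 1, and a higher value means the motif score separates or tracks the response better.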

dataset

a data frame with PSSMs ('A', 'C', 'G' and 'T' columns), with an additional column 'motif' containing the motif name, for example HOMER_motifs or JASPAR_motifs, or all_motif_datasets(), or a MotifDB object.

motifs

names of specific motifs to extract from the dataset

parallel

logical, whether to use parallel processing

only_best

return only the best motif (the one with the highest score). If FALSE, all the motifs will be returned.

prior

a prior probability for each nucleotide.

alternative

alternative hypothesis for the KS test. One of 'two.sided', 'less' or 'greater'.

...

Arguments passed on to compute_pwm

pssm: a PSSM matrix or data frame. The columns of the matrix or data frame should be named with the nucleotides ('A', 'C', 'G' and 'T').
spat: a data frame with the spatial model (as returned from the $spat slot from the regression). Should contain a column called 'bin' and a column called 'spat_factor'.
spat_min: the minimum position to use from the sequences. The default is 1.
spat_max: the maximum position to use from the sequences. The default is the length of the sequences.
bidirect: whether the motif is bi-directional. If TRUE, the reverse-complement of the motif will be used as well.
func: the function to use to combine the PWMs for each sequence. Either 'logSumExp' or 'max'. The default is 'logSumExp'.
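The following base-R sketch illustrates what the prior, bidirect and func arguments mean when a PSSM is scored against a single sequence. It is a toy illustration under stated assumptions, not the compute_pwm implementation: the helper names (window_scores, log_sum_exp, rev_comp) are hypothetical, the prior is assumed to be added as a pseudocount before renormalizing, and bidirect is imitated by scoring the reverse-complemented sequence.

```r
# Toy 3-position PSSM: rows are positions, columns are nucleotides
pssm <- matrix(
  c(0.7, 0.1, 0.1, 0.1,
    0.1, 0.7, 0.1, 0.1,
    0.1, 0.1, 0.7, 0.1),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("A", "C", "G", "T"))
)

# Assumption: the prior is a pseudocount added to every cell, then rows are renormalized
prior <- 0.01
log_pssm <- log((pssm + prior) / rowSums(pssm + prior))

# Log-likelihood of the motif at every start position of one sequence
window_scores <- function(sequence_string, log_pssm) {
  bases <- strsplit(sequence_string, "")[[1]]
  w <- nrow(log_pssm)
  starts <- seq_len(length(bases) - w + 1)
  vapply(starts, function(s) {
    cols <- match(bases[s:(s + w - 1)], colnames(log_pssm))
    sum(log_pssm[cbind(seq_len(w), cols)])
  }, numeric(1))
}

# Numerically stable log(sum(exp(x)))
log_sum_exp <- function(x) max(x) + log(sum(exp(x - max(x))))

# Reverse complement of a DNA string
rev_comp <- function(s) {
  paste(rev(strsplit(chartr("ACGT", "TGCA", s), "")[[1]]), collapse = "")
}

scores <- window_scores("TTACGTT", log_pssm)

# Imitating bidirect = TRUE: also score the reverse-complement strand
scores_bidirect <- c(scores, window_scores(rev_comp("TTACGTT"), log_pssm))

combined_lse <- log_sum_exp(scores_bidirect) # like func = 'logSumExp'
combined_max <- max(scores_bidirect)         # like func = 'max'
```

In this sketch 'max' keeps only the single best-matching position, while 'logSumExp' lets every position contribute, so several weaker matches can add up.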

Value

a data frame with the following columns:

motif: the motif name.
score: the score of the motif (depending on metric).

If only_best is TRUE, only the best motif is returned (a data frame with a single row).

Examples

res_screen <- screen_pwm(cluster_sequences_example, cluster_mat_example[, 1])
#> ℹ Performing PWM screening
head(res_screen)
#> # A tibble: 6 x 2
#>                            motif     score
#> 1 HOCOMOCO.HNF1B_HUMAN.H11MO.0.A 0.8606183
#> 2 HOCOMOCO.HNF1B_MOUSE.H11MO.0.A 0.8510730
#> 3                   JASPAR.HNF1A 0.8510730
#> 4            JOLMA.HNF1A_di_full 0.8505374
#> 5          JOLMA.HNF1B_di_full_1 0.8484232
#> 6          JOLMA.HNF1B_di_full_2 0.8484090

# full result: all screened motifs (only_best = FALSE is the default)
screen_pwm(cluster_sequences_example, cluster_mat_example[, 1])
#> ℹ Performing PWM screening
#> # A tibble: 3,867 x 2
#>                            motif     score
#> 1 HOCOMOCO.HNF1B_HUMAN.H11MO.0.A 0.8606183
#> 2 HOCOMOCO.HNF1B_MOUSE.H11MO.0.A 0.8510730
#> 3                   JASPAR.HNF1A 0.8510730
#> 4            JOLMA.HNF1A_di_full 0.8505374
#> 5          JOLMA.HNF1B_di_full_1 0.8484232
#> 6          JOLMA.HNF1B_di_full_2 0.8484090
#> # ... with 3,861 more rows

# with r^2 metric
res_screen <- screen_pwm(sequences_example, response_mat_example[, 1], metric = "r2")
#> ℹ Performing PWM screening
head(res_screen)
#> # A tibble: 6 x 2
#>               motif      score
#> 1       JASPAR.SOX2 0.04355104
#> 2       JASPAR.SUT1 0.04011911
#> 3      JASPAR.SOX13 0.03979196
#> 4 JOLMA.IRX3_di_DBD 0.03947399
#> 5        JASPAR.dsx 0.03903627
#> 6       JASPAR.Sox3 0.03876807