Screen for motifs in a database given a response variable
Usage
screen_pwm(
sequences,
response,
metric = NULL,
dataset = all_motif_datasets(),
motifs = NULL,
parallel = getOption("prego.parallel", TRUE),
only_best = FALSE,
prior = 0.01,
alternative = "two.sided",
...
)
Arguments
- sequences
a vector with the sequences
- response
a vector with the response variable for each sequence. If the response is a matrix, the average will be used.
- metric
metric used to choose the best motif. One of 'ks' or 'r2'. If NULL, the default is 'ks' for binary variables and 'r2' for continuous variables.
- dataset
a data frame with PSSMs ('A', 'C', 'G' and 'T' columns), with an additional column 'motif' containing the motif name, for example HOMER_motifs, JASPAR_motifs or all_motif_datasets(), or a MotifDB object.
- motifs
names of specific motifs to extract from the dataset
- parallel
logical, whether to use parallel processing
- only_best
return only the best motif (the one with the highest score). If FALSE, all the motifs will be returned.
- prior
a prior probability for each nucleotide.
- alternative
alternative hypothesis for the KS test. One of 'two.sided', 'less' or 'greater'.
- ...
Arguments passed on to compute_pwm
pssm
a PSSM matrix or data frame. The columns of the matrix or data frame should be named with the nucleotides ('A', 'C', 'G' and 'T').
spat
a data frame with the spatial model (as returned from the $spat slot from the regression). Should contain a column called 'bin' and a column called 'spat_factor'.
spat_min
the minimum position to use from the sequences. The default is 1.
spat_max
the maximum position to use from the sequences. The default is the length of the sequences.
bidirect
whether the motif is bi-directional. If TRUE, the reverse-complement of the motif will be used as well.
func
the function to use to combine the PWMs for each sequence. Either 'logSumExp' or 'max'. The default is 'logSumExp'.
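The difference between the two func options can be sketched in base R. This is only an illustration of the general technique, not the package's internal code: the per-position scores below are made-up numbers, and log_sum_exp is a hypothetical helper written for this example.

```r
# Hypothetical per-position log-scale scores of one motif along one sequence
# (invented values, for illustration only).
pos_scores <- c(-12.3, -4.1, -9.8, -3.9, -15.0)

# Numerically stable log(sum(exp(x))): subtract the maximum before exponentiating.
log_sum_exp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}

combined_lse <- log_sum_exp(pos_scores) # pools evidence from every position
combined_max <- max(pos_scores)         # keeps only the single best position

# The pooled score is never below the best single position,
# and exceeds it by at most log(number of positions).
gap <- combined_lse - combined_max
```

Under this reading, 'logSumExp' rewards sequences with several moderate matches, while 'max' depends only on the strongest single match.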
Value
a data frame with the following columns:
- motif:
the motif name.
- score:
the score of the motif (depending on metric).
If only_best is TRUE, only the best motif is returned (a data frame with a single row).
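As a rough illustration of how the score might depend on metric, the base R sketch below computes both kinds of score on simulated data. It assumes that 'ks' corresponds to the Kolmogorov-Smirnov statistic between the motif scores of the two response groups and that 'r2' corresponds to the squared Pearson correlation; the package's actual computation may differ. The simulated pwm_score vector and the pick_metric helper are hypothetical.

```r
set.seed(1)

# Hypothetical per-sequence motif scores (stand-ins for real PWM scores).
pwm_score <- rnorm(200)

# Binary response: compare the score distributions of the two groups.
binary_response <- as.integer(pwm_score + rnorm(200) > 0)
ks <- ks.test(
  pwm_score[binary_response == 1],
  pwm_score[binary_response == 0],
  alternative = "two.sided"
)
ks_score <- unname(ks$statistic)

# Continuous response: squared Pearson correlation with the scores.
cont_response <- 0.5 * pwm_score + rnorm(200)
r2_score <- cor(pwm_score, cont_response)^2

# Assumed default rule: 'ks' for 0/1 responses, 'r2' otherwise.
pick_metric <- function(response) {
  if (all(response %in% c(0, 1))) "ks" else "r2"
}
```

Both scores lie between 0 and 1, so under these assumptions a higher value means a stronger association between the motif and the response.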
Examples
res_screen <- screen_pwm(cluster_sequences_example, cluster_mat_example[, 1])
#> ℹ Performing PWM screening
head(res_screen)
#> # A tibble: 6 x 2
#> motif score
#> 1 HOCOMOCO.HNF1B_HUMAN.H11MO.0.A 0.8606183
#> 2 HOCOMOCO.HNF1B_MOUSE.H11MO.0.A 0.8510730
#> 3 JASPAR.HNF1A 0.8510730
#> 4 JOLMA.HNF1A_di_full 0.8505374
#> 5 JOLMA.HNF1B_di_full_1 0.8484232
#> 6 JOLMA.HNF1B_di_full_2 0.8484090
# all motifs: only_best defaults to FALSE (pass only_best = TRUE to return only the best match)
screen_pwm(cluster_sequences_example, cluster_mat_example[, 1])
#> ℹ Performing PWM screening
#> # A tibble: 3,867 x 2
#> motif score
#> 1 HOCOMOCO.HNF1B_HUMAN.H11MO.0.A 0.8606183
#> 2 HOCOMOCO.HNF1B_MOUSE.H11MO.0.A 0.8510730
#> 3 JASPAR.HNF1A 0.8510730
#> 4 JOLMA.HNF1A_di_full 0.8505374
#> 5 JOLMA.HNF1B_di_full_1 0.8484232
#> 6 JOLMA.HNF1B_di_full_2 0.8484090
#> # ... with 3,861 more rows
# with r^2 metric
res_screen <- screen_pwm(sequences_example, response_mat_example[, 1], metric = "r2")
#> ℹ Performing PWM screening
head(res_screen)
#> # A tibble: 6 x 2
#> motif score
#> 1 JASPAR.SOX2 0.04355104
#> 2 JASPAR.SUT1 0.04011911
#> 3 JASPAR.SOX13 0.03979196
#> 4 JOLMA.IRX3_di_DBD 0.03947399
#> 5 JASPAR.dsx 0.03903627
#> 6 JASPAR.Sox3 0.03876807
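A screening result can be post-processed like any other data frame. The sketch below rebuilds the six rows printed by the first example and keeps the top motif per source database. It assumes that the text before the first dot in a motif name identifies the database, which is a reading of the names shown above rather than documented behavior; the db column is introduced only for this example.

```r
# The six rows shown by head(res_screen) in the first example above.
res <- data.frame(
  motif = c(
    "HOCOMOCO.HNF1B_HUMAN.H11MO.0.A", "HOCOMOCO.HNF1B_MOUSE.H11MO.0.A",
    "JASPAR.HNF1A", "JOLMA.HNF1A_di_full",
    "JOLMA.HNF1B_di_full_1", "JOLMA.HNF1B_di_full_2"
  ),
  score = c(0.8606183, 0.8510730, 0.8510730, 0.8505374, 0.8484232, 0.8484090),
  stringsAsFactors = FALSE
)

# Assumption: the prefix before the first dot names the motif database.
res$db <- sub("\\..*$", "", res$motif)

# Sort by decreasing score, then keep the first (best) row of each database.
res <- res[order(-res$score), ]
best_per_db <- res[!duplicated(res$db), ]
```

The same pattern applies to the full result, for example to check whether different databases agree on the top-scoring factor.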