Motif Database
Motif database management, querying, and enrichment analysis. Supports bundled JASPAR and HOMER databases.
pyprego.motif_db
Motif database management.
Mirrors MotifDB.R and motif-dbs.R from the R prego package. Provides functionality for loading, storing, and querying collections of motif PSSMs (e.g. JASPAR, HOMER databases).
The `MotifDB` class stores stacked log-scale PWM matrices for
efficient batch scoring across many motifs.
MotifDB
A collection of motif PSSMs stored as stacked log-scale matrices.
This is the Python port of the R S4 MotifDB class. All PWM data
is stored in two dense matrices (forward and reverse-complement), each
of shape (max_len * 4, n_motifs) where positions beyond a motif's
actual length are zeroed out.
| PARAMETER | DESCRIPTION |
|---|---|
| `mat` | Stacked log-scale PWM matrix, shape (max_len * 4, n_motifs). |
| `rc_mat` | Reverse-complement log-scale PWM matrix, same shape as `mat`. |
| `motif_lengths` | Mapping from motif name to its length in positions. |
| `prior` | The PSSM prior probability used when creating the matrices. |
| `spat_factors` | Spatial factor matrix, shape |
| `spat_bin_size` | Size of spatial bins. |
| `spat_min` | Starting position of the sequence, or |
| `spat_max` | Ending position of the sequence, or |
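The stacked layout described above can be sketched with plain NumPy. This is a minimal illustration, not the library's implementation: the helper names `stack_pwms` and `reverse_complement` are hypothetical, and the row order (all four bases of position 0, then position 1, and so on, with columns ordered A, C, G, T) is an assumption.

```python
import numpy as np

def stack_pwms(pwms):
    """Stack per-motif log-scale PWMs, each of shape (length, 4), into one matrix.

    Returns (mat, lengths) where mat has shape (max_len * 4, n_motifs).
    Hypothetical helper; the row ordering is an assumption.
    """
    names = list(pwms)
    max_len = max(np.asarray(p).shape[0] for p in pwms.values())
    mat = np.zeros((max_len * 4, len(names)))
    lengths = {}
    for j, name in enumerate(names):
        pwm = np.asarray(pwms[name], dtype=float)
        n = pwm.shape[0]
        # Flatten position-major; rows past the motif's length stay zero.
        mat[: n * 4, j] = pwm.reshape(-1)
        lengths[name] = n
    return mat, lengths

def reverse_complement(pwm):
    """Reverse the positions and swap A<->T, C<->G (columns ordered A, C, G, T)."""
    return np.asarray(pwm)[::-1, ::-1]

# Two toy motifs of different lengths with uniform base probabilities.
toy = {
    "m1": np.log(np.full((3, 4), 0.25)),
    "m2": np.log(np.full((5, 4), 0.25)),
}
mat, lengths = stack_pwms(toy)
rc_mat, _ = stack_pwms({k: reverse_complement(v) for k, v in toy.items()})
```

Keeping every motif in one dense matrix means a batch of motifs can be scored with a single matrix product, at the cost of zero padding for the shorter motifs.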
__getitem__
Subset the MotifDB by motif name(s) or integer index/indices.
Supports exact matching for strings and integer indexing.
| PARAMETER | DESCRIPTION |
|---|---|
| `key` | Motif name(s) or integer index/indices. |

| RETURNS | DESCRIPTION |
|---|---|
| `MotifDB` | A new MotifDB containing only the selected motifs. |
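Subsetting by name or by integer index amounts to selecting columns of the stacked matrix. The sketch below uses a hypothetical stand-in class, `TinyMotifDB`, rather than the real `MotifDB`; raising `KeyError` for an unknown name is an assumption about reasonable behaviour, not documented behaviour.

```python
import numpy as np

class TinyMotifDB:
    """Hypothetical stand-in for MotifDB, used only to illustrate subsetting."""

    def __init__(self, mat, names):
        self.mat = np.asarray(mat)
        self.names = list(names)

    def __getitem__(self, key):
        # Accept a single name/index or a list of them.
        keys = [key] if isinstance(key, (str, int, np.integer)) else list(key)
        idx = []
        for k in keys:
            if isinstance(k, str):
                if k not in self.names:  # exact matching only
                    raise KeyError(k)
                idx.append(self.names.index(k))
            else:
                idx.append(int(k))
        # Select the matching columns and return a new object.
        return TinyMotifDB(self.mat[:, idx], [self.names[i] for i in idx])

db = TinyMotifDB(np.arange(12).reshape(4, 3), ["a", "b", "c"])
sub = db[["c", "a"]]   # subset by names, in the requested order
one = db[1]            # subset by integer index
```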
create_motif_db
create_motif_db(pssm_df: DataFrame, prior: float = 0.01, spat_factors: ndarray | None = None, spat_bin_size: float = 1.0, spat_min: float | None = None, spat_max: float | None = None) -> MotifDB
Create a `MotifDB` from a tidy DataFrame of PSSMs.
Mirrors the R create_motif_db() function.
| PARAMETER | DESCRIPTION |
|---|---|
| `pssm_df` | DataFrame with columns |
| `prior` | Pseudocount prior to add to probabilities (must be in (0, 1)). |
| `spat_factors` | Spatial factor matrix of shape |
| `spat_bin_size` | Size of spatial bins. |
| `spat_min` | Starting position of the sequence, or |
| `spat_max` | Ending position of the sequence, or |

| RETURNS | DESCRIPTION |
|---|---|
| `MotifDB` | A validated MotifDB object. |
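To make the role of `prior` concrete, here is a rough sketch of turning a tidy PSSM table into per-motif log-probability arrays. It is not the library's code: the function name `tidy_to_log_pwms` is hypothetical, the column names `motif`, `pos`, `A`, `C`, `G`, `T` are assumptions, and adding the prior before renormalising each position is one plausible reading of "pseudocount prior".

```python
import numpy as np
import pandas as pd

def tidy_to_log_pwms(pssm_df, prior=0.01):
    """Convert a tidy PSSM table into {motif: log-probability array (length, 4)}.

    Assumes columns 'motif', 'pos', 'A', 'C', 'G', 'T' (hypothetical names).
    """
    if not 0 < prior < 1:
        raise ValueError("prior must be in (0, 1)")
    out = {}
    for name, grp in pssm_df.groupby("motif", sort=False):
        probs = grp.sort_values("pos")[["A", "C", "G", "T"]].to_numpy(dtype=float)
        probs = probs + prior                               # pseudocount avoids log(0)
        probs = probs / probs.sum(axis=1, keepdims=True)    # renormalise each position
        out[name] = np.log(probs)
    return out

toy_df = pd.DataFrame({
    "motif": ["m", "m"],
    "pos": [0, 1],
    "A": [1.0, 0.25],
    "C": [0.0, 0.25],
    "G": [0.0, 0.25],
    "T": [0.0, 0.25],
})
log_pwms = tidy_to_log_pwms(toy_df, prior=0.01)
```

Without the prior, the zero probabilities at position 0 would become negative infinity on the log scale and dominate every score.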
motif_db_to_dataframe
Convert a `MotifDB` back to a tidy DataFrame.
Mirrors the R motif_db_to_dataframe() function.
| PARAMETER | DESCRIPTION |
|---|---|
| `db` | A MotifDB object. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame with columns |
set_prior
extract_pwm
extract_pwm(sequences: list[str], motif_db: MotifDB | DataFrame, *, motifs: list[str] | None = None, bidirect: bool = True, prior: float = 0.01, func: str = 'logSumExp', spat_min: int | None = None, spat_max: int | None = None) -> pd.DataFrame
Compute PWM scores for all motifs in a database.
For each motif in `motif_db`, extracts its PSSM and calls
`pyprego.compute.compute_pwm`.
| PARAMETER | DESCRIPTION |
|---|---|
| `sequences` | DNA sequences (should all be the same length). |
| `motif_db` | A MotifDB object or a tidy DataFrame with |
| `motifs` | Subset of motif names to extract. If |
| `bidirect` | Score both orientations. |
| `prior` | PSSM prior. |
| `func` | Aggregation function ( |
| `spat_min` | Minimum spatial position (1-based). |
| `spat_max` | Maximum spatial position. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame with one column per motif and one row per sequence. |
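The default `func='logSumExp'` aggregation can be illustrated on a single sequence and motif. This is a sketch under stated assumptions, not the behaviour of `compute_pwm`: the helpers `window_scores`, `logsumexp` and `score_sequence` are hypothetical, sequences are assumed to contain only A, C, G and T, the bidirectional score is assumed to pool both strands into one log-sum-exp, and `spat_min`/`spat_max` are not modelled.

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def window_scores(seq, log_pwm):
    """Log-likelihood of the motif at every start position of seq."""
    codes = np.array([BASE_INDEX[b] for b in seq.upper()])
    n = log_pwm.shape[0]
    pos = np.arange(n)
    return np.array([
        log_pwm[pos, codes[s:s + n]].sum()
        for s in range(len(codes) - n + 1)
    ])

def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def score_sequence(seq, log_pwm, bidirect=True):
    """Aggregate per-window scores (optionally on both strands) with log-sum-exp."""
    scores = window_scores(seq, log_pwm)
    if bidirect:
        rc = log_pwm[::-1, ::-1]  # reverse positions, swap A<->T and C<->G
        scores = np.concatenate([scores, window_scores(seq, rc)])
    return logsumexp(scores)

# Toy two-position motif that strongly prefers "AT".
toy_pwm = np.log(np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]))
hit = score_sequence("CCATCC", toy_pwm)
miss = score_sequence("CCCCCC", toy_pwm)
```

Unlike taking the maximum, log-sum-exp lets several weak matches in one sequence add up, so the score reflects total binding energy rather than only the best site.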
screen_pwm
screen_pwm(sequences: list[str], response: ndarray, motif_db: MotifDB | DataFrame, *, motifs: list[str] | None = None, metric: str | None = None, bidirect: bool = True, prior: float = 0.01, only_best: bool = False) -> pd.DataFrame
Screen all motifs in a database against a response variable.
For each motif, computes the PWM score for all sequences and then correlates (or KS-tests) with the response.
| PARAMETER | DESCRIPTION |
|---|---|
| `sequences` | DNA sequences. |
| `response` | Response variable (1-D array, same length as `sequences`). For binary responses the KS metric is used by default; for continuous responses R-squared is used. |
| `motif_db` | Motif database. |
| `motifs` | Subset of motifs to screen. |
| `metric` | |
| `bidirect` | Score both orientations. |
| `prior` | PSSM prior. |
| `only_best` | If |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | DataFrame with columns |
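The two default metrics mentioned for `response` can be sketched with NumPy alone. The function names below are hypothetical, and treating a response as binary when it contains only 0 and 1 is an assumption about how the default is chosen, not documented behaviour.

```python
import numpy as np

def r_squared(scores, response):
    """Squared Pearson correlation between motif scores and a continuous response."""
    return float(np.corrcoef(scores, response)[0, 1] ** 2)

def ks_statistic(scores, labels):
    """Two-sample KS statistic: largest gap between the empirical CDFs
    of the scores with label 1 and the scores with label 0."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos = np.sort(scores[labels])
    neg = np.sort(scores[~labels])
    grid = np.concatenate([pos, neg])
    cdf_pos = np.searchsorted(pos, grid, side="right") / len(pos)
    cdf_neg = np.searchsorted(neg, grid, side="right") / len(neg)
    return float(np.max(np.abs(cdf_pos - cdf_neg)))

def screen_metric(scores, response):
    """Pick KS for a 0/1 response and R-squared otherwise (assumed rule)."""
    response = np.asarray(response)
    is_binary = bool(np.isin(response, [0, 1]).all())
    if is_binary:
        return ks_statistic(scores, response)
    return r_squared(scores, response)

separated = screen_metric([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
linear = screen_metric([1.0, 2.0, 3.0], [0.5, 1.0, 1.5])
```

Screening a database then reduces to computing one such number per motif column and ranking the motifs by it.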
all_motif_datasets
all_motif_datasets(datasets: list[str] | None = None, data_dir: str | Path | None = None) -> pd.DataFrame
Load built-in motif datasets.
Looks for CSV files in the bundled data/ directory (or the R
package's exported CSVs).
| PARAMETER | DESCRIPTION |
|---|---|
| `datasets` | Which datasets to load (e.g. |
| `data_dir` | Directory containing |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Combined motif DataFrame with columns |
get_motif_pssm
get_motif_pssm(motif_name: str, dataset: DataFrame | None = None, data_dir: str | Path | None = None) -> pd.DataFrame
Get the PSSM for a specific motif by name.
| PARAMETER | DESCRIPTION |
|---|---|
| `motif_name` | Name of the motif (e.g. |
| `dataset` | Motif dataset DataFrame. If |
| `data_dir` | Passed to |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | PSSM DataFrame with columns |

| RAISES | DESCRIPTION |
|---|---|
| `KeyError` | If `motif_name` is not found in the dataset. |
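The lookup-and-raise contract can be mimicked on a toy table. The helper `lookup_pssm`, the `motif` column name and the motif names used here are all hypothetical; the real function may select and order columns differently.

```python
import pandas as pd

def lookup_pssm(motif_name, dataset):
    """Return the rows of `dataset` for one motif, or raise KeyError.

    Hypothetical helper; assumes the dataset has a 'motif' column.
    """
    rows = dataset[dataset["motif"] == motif_name]
    if rows.empty:
        raise KeyError(f"motif {motif_name!r} not found in dataset")
    return rows.reset_index(drop=True)

toy_dataset = pd.DataFrame({
    "motif": ["toy.M1", "toy.M1", "toy.M2"],
    "pos": [0, 1, 0],
    "A": [0.7, 0.1, 0.25],
    "C": [0.1, 0.1, 0.25],
    "G": [0.1, 0.1, 0.25],
    "T": [0.1, 0.7, 0.25],
})
pssm = lookup_pssm("toy.M1", toy_dataset)
```

Callers that screen many names may prefer to catch the `KeyError` and report all missing motifs at once.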
motif_enrichment
motif_enrichment(pwm_q: ndarray, groups: ndarray | list[str], threshold: float = 0.99, type: str = 'relative') -> pd.DataFrame
Calculate motif enrichment for groups of loci.
Mirrors the R motif_enrichment() function. Given a matrix of PWM
quantile values and group assignments, computes enrichment of motifs
in each group.
| PARAMETER | DESCRIPTION |
|---|---|
| `pwm_q` | Matrix of shape |
| `groups` | Group labels (length |
| `threshold` | Quantile threshold for considering a motif as present (default 0.99). |
| `type` | |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | Enrichment matrix with groups as rows and motifs as columns. |
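One way to read "enrichment of motifs in each group" is as a ratio of hit frequencies. The sketch below is an assumption-laden illustration, not the library's formula: the function name `enrichment_ratio` is hypothetical, `pwm_q` is assumed to have one row per locus and one column per motif, a motif counts as present when its quantile is at or above `threshold`, and enrichment is taken as the hit fraction inside a group divided by the hit fraction in all other loci.

```python
import numpy as np

def enrichment_ratio(pwm_q, groups, threshold=0.99):
    """Hit fraction inside each group divided by the hit fraction outside it.

    Returns (labels, matrix) with groups as rows and motifs as columns.
    Hypothetical helper; the formula is an assumption.
    """
    pwm_q = np.asarray(pwm_q, dtype=float)
    groups = np.asarray(groups)
    hits = pwm_q >= threshold            # motif considered present at a locus
    labels = np.unique(groups)
    out = np.empty((len(labels), pwm_q.shape[1]))
    for i, g in enumerate(labels):
        in_g = groups == g
        frac_in = hits[in_g].mean(axis=0)
        frac_out = hits[~in_g].mean(axis=0)
        # A zero outside fraction yields inf or nan rather than an exception.
        with np.errstate(divide="ignore", invalid="ignore"):
            out[i] = frac_in / frac_out
    return labels, out

toy_q = np.array([
    [0.995, 0.500],
    [0.995, 0.995],
    [0.200, 0.995],
    [0.999, 0.100],
])
labels, enr = enrichment_ratio(toy_q, ["a", "a", "b", "b"])
```

Values above 1 indicate motifs that pass the threshold more often inside the group than elsewhere.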