
Motif Database

Motif database management, querying, and enrichment analysis. Supports bundled JASPAR and HOMER databases.

pyprego.motif_db

Motif database management.

Mirrors MotifDB.R and motif-dbs.R from the R prego package. Provides functionality for loading, storing, and querying collections of motif PSSMs (e.g. JASPAR, HOMER databases).

The MotifDB class stores stacked log-scale PWM matrices for efficient batch scoring across many motifs.

MotifDB

A collection of motif PSSMs stored as stacked log-scale matrices.

This is the Python port of the R S4 MotifDB class. All PWM data is stored in two dense matrices (forward and reverse-complement), each of shape (max_len * 4, n_motifs) where positions beyond a motif's actual length are zeroed out.
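The stacked layout described above can be sketched with plain NumPy. This is a minimal illustration, not the library's implementation: the toy PSSMs, the position-major row order (rows 4*i to 4*i+3 holding A, C, G, T for position i), and the omission of the prior are all assumptions made for the example.

```python
import numpy as np

# Two toy PSSMs as (length, 4) arrays with columns A, C, G, T (hypothetical data).
pssms = {
    "m1": np.array([[0.7, 0.1, 0.1, 0.1],
                    [0.1, 0.1, 0.7, 0.1]]),
    "m2": np.array([[0.25, 0.25, 0.25, 0.25],
                    [0.1, 0.7, 0.1, 0.1],
                    [0.1, 0.1, 0.1, 0.7]]),
}

max_len = max(p.shape[0] for p in pssms.values())
mat = np.zeros((max_len * 4, len(pssms)))   # forward strand
rc_mat = np.zeros_like(mat)                 # reverse complement

for j, p in enumerate(pssms.values()):
    log_p = np.log(p)
    # Assumed row order: position-major, so each block of 4 rows is one position.
    mat[: log_p.size, j] = log_p.ravel()
    # Reverse complement: reverse the positions and swap A<->T, C<->G,
    # which for A, C, G, T column order is a reversal of the columns.
    rc_mat[: log_p.size, j] = log_p[::-1, ::-1].ravel()

motif_lengths = {name: p.shape[0] for name, p in pssms.items()}
```

Rows past a motif's own length stay at zero. Since a log value of zero corresponds to a probability of one, padded positions add nothing to a summed log score.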

PARAMETER DESCRIPTION
mat

Stacked log-scale PWM matrix, shape (D*4, n_motifs), where D is the maximum motif length (max_len above).

TYPE: ndarray

rc_mat

Reverse-complement log-scale PWM matrix, same shape as mat.

TYPE: ndarray

motif_lengths

Mapping from motif name to its length in positions.

TYPE: dict[str, int]

prior

The PSSM prior probability used when creating the matrices.

TYPE: float

spat_factors

Spatial factor matrix, shape (n_motifs, n_bins).

TYPE: ndarray

spat_bin_size

Size of spatial bins.

TYPE: float DEFAULT: 1.0

spat_min

Starting position of the sequence, or None.

TYPE: float | None DEFAULT: None

spat_max

Ending position of the sequence, or None.

TYPE: float | None DEFAULT: None

names

names() -> list[str]

Return the names of all motifs in the database.

__getitem__

__getitem__(key: str | list[str] | int | list[int]) -> MotifDB

Subset the MotifDB by motif name(s) or integer index/indices.

Supports exact matching for strings and integer indexing.

PARAMETER DESCRIPTION
key

Motif name(s) or integer index/indices.

TYPE: str | list[str] | int | list[int]

RETURNS DESCRIPTION
MotifDB

A new MotifDB containing only the selected motifs.

grep

grep(pattern: str | list[str]) -> MotifDB

Subset the MotifDB by regex pattern matching on motif names.

PARAMETER DESCRIPTION
pattern

Regex pattern(s) to match against motif names (case-insensitive).

TYPE: str | list[str]

RETURNS DESCRIPTION
MotifDB

A new MotifDB containing matched motifs.
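The name matching that grep describes can be sketched with the standard re module. This is a hypothetical helper, not the library's code: using re.search rather than a full match, and treating a list of patterns as a union, are assumptions.

```python
import re

def grep_names(names, pattern):
    """Return the names that match any of the given regex patterns, ignoring case."""
    patterns = [pattern] if isinstance(pattern, str) else list(pattern)
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    # Keep the original order of names; a name is kept if any pattern matches.
    return [n for n in names if any(c.search(n) for c in compiled)]

# Hypothetical motif names for illustration.
names = ["JASPAR.HNF1A", "JASPAR.GATA1", "HOMER.GATA3", "HOMER.CTCF"]
hits = grep_names(names, "gata")
```

With the real class, the equivalent call would presumably be db.grep("gata"), returning a MotifDB rather than a list of names.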

create_motif_db

create_motif_db(pssm_df: DataFrame, prior: float = 0.01, spat_factors: ndarray | None = None, spat_bin_size: float = 1.0, spat_min: float | None = None, spat_max: float | None = None) -> MotifDB

Create a MotifDB from a tidy DataFrame of PSSMs.

Mirrors the R create_motif_db() function.

PARAMETER DESCRIPTION
pssm_df

DataFrame with columns motif, A, C, G, T. Each group of rows with the same motif value defines one PSSM.

TYPE: DataFrame

prior

Pseudocount prior to add to probabilities (must be in (0, 1)).

TYPE: float DEFAULT: 0.01

spat_factors

Spatial factor matrix of shape (n_motifs, n_bins), with rows ordered the same as the unique motifs in pssm_df. If None, a default all-ones vector is used.

TYPE: ndarray | None DEFAULT: None

spat_bin_size

Size of spatial bins.

TYPE: float DEFAULT: 1.0

spat_min

Starting position of the sequence, or None.

TYPE: float | None DEFAULT: None

spat_max

Ending position of the sequence, or None.

TYPE: float | None DEFAULT: None

RETURNS DESCRIPTION
MotifDB

A validated MotifDB object.
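The tidy input format and the role of the prior can be illustrated as follows. This is a sketch under stated assumptions: the toy motifs are made up, and adding the prior to each probability and then renormalising each row is one common way to apply a pseudocount, which may differ from what create_motif_db does internally.

```python
import numpy as np
import pandas as pd

# Tidy PSSM table: one row per position, grouped by the motif column.
pssm_df = pd.DataFrame({
    "motif": ["m1", "m1", "m2", "m2", "m2"],
    "A": [0.7, 0.1, 0.25, 0.1, 0.1],
    "C": [0.1, 0.1, 0.25, 0.7, 0.1],
    "G": [0.1, 0.7, 0.25, 0.1, 0.1],
    "T": [0.1, 0.1, 0.25, 0.1, 0.7],
})

prior = 0.01

def log_pssm_with_prior(group, prior):
    """Assumed pseudocount scheme: add prior, renormalise each row, take logs."""
    probs = group[["A", "C", "G", "T"]].to_numpy(dtype=float)
    probs = probs + prior
    probs = probs / probs.sum(axis=1, keepdims=True)
    return np.log(probs)

# sort=False keeps motifs in order of first appearance in pssm_df.
log_pssms = {
    name: log_pssm_with_prior(group, prior)
    for name, group in pssm_df.groupby("motif", sort=False)
}
```

A positive prior keeps every entry strictly above zero, so the log transform never produces negative infinity.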

motif_db_to_dataframe

motif_db_to_dataframe(db: MotifDB) -> pd.DataFrame

Convert a MotifDB back to a tidy DataFrame.

Mirrors the R motif_db_to_dataframe() function.

PARAMETER DESCRIPTION
db

A MotifDB object.

TYPE: MotifDB

RETURNS DESCRIPTION
DataFrame

DataFrame with columns motif, pos, A, C, G, T.

set_prior

set_prior(db: MotifDB, new_prior: float) -> MotifDB

Create a new MotifDB with a different prior.

Mirrors the R prior<- replacement method.

PARAMETER DESCRIPTION
db

Original MotifDB.

TYPE: MotifDB

new_prior

New prior value (must be in (0, 1)).

TYPE: float

RETURNS DESCRIPTION
MotifDB

New MotifDB with the updated prior.

extract_pwm

extract_pwm(sequences: list[str], motif_db: MotifDB | DataFrame, *, motifs: list[str] | None = None, bidirect: bool = True, prior: float = 0.01, func: str = 'logSumExp', spat_min: int | None = None, spat_max: int | None = None) -> pd.DataFrame

Compute PWM scores for all motifs in a database.

For each motif in motif_db, extracts its PSSM and calls pyprego.compute.compute_pwm.

PARAMETER DESCRIPTION
sequences

DNA sequences (should all be the same length).

TYPE: list[str]

motif_db

A MotifDB object or a tidy DataFrame with motif column.

TYPE: MotifDB | DataFrame

motifs

Subset of motif names to extract. If None, all motifs are used.

TYPE: list[str] | None DEFAULT: None

bidirect

Score both orientations.

TYPE: bool DEFAULT: True

prior

PSSM prior.

TYPE: float DEFAULT: 0.01

func

Aggregation function ("logSumExp" or "max").

TYPE: str DEFAULT: 'logSumExp'

spat_min

Minimum spatial position (1-based).

TYPE: int | None DEFAULT: None

spat_max

Maximum spatial position.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
DataFrame

DataFrame with one column per motif and one row per sequence.
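The two values accepted by func differ in how per-position scores are combined into one score per sequence. The sketch below shows the arithmetic only; the per-position log scores are made-up numbers, and how extract_pwm produces them is not shown here.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for finite values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Hypothetical log scores of one motif at each start position of one sequence.
position_scores = [-12.0, -3.5, -9.0, -4.0]

score_max = max(position_scores)          # "max": best single site only
score_lse = log_sum_exp(position_scores)  # "logSumExp": all sites contribute
```

The logSumExp value is never below the maximum and exceeds it by at most log(n) for n positions, so it behaves like a soft maximum that also rewards several moderately good sites.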

screen_pwm

screen_pwm(sequences: list[str], response: ndarray, motif_db: MotifDB | DataFrame, *, motifs: list[str] | None = None, metric: str | None = None, bidirect: bool = True, prior: float = 0.01, only_best: bool = False) -> pd.DataFrame

Screen all motifs in a database against a response variable.

For each motif, computes the PWM score for all sequences and then scores it against the response, using either correlation (R-squared) or a Kolmogorov-Smirnov test.

PARAMETER DESCRIPTION
sequences

DNA sequences.

TYPE: list[str]

response

Response variable (1-D array, same length as sequences). For binary responses the KS metric is used by default; for continuous responses R-squared is used.

TYPE: ndarray

motif_db

Motif database.

TYPE: MotifDB | DataFrame

motifs

Subset of motifs to screen.

TYPE: list[str] | None DEFAULT: None

metric

"r2" or "ks". If None, auto-detected from response.

TYPE: str | None DEFAULT: None

bidirect

Score both orientations.

TYPE: bool DEFAULT: True

prior

PSSM prior.

TYPE: float DEFAULT: 0.01

only_best

If True, return only the top-scoring motif.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
DataFrame

DataFrame with columns motif and score, sorted descending.
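The two metrics and the automatic choice between them can be sketched in pure Python. These helpers are hypothetical and only illustrate the idea: treating a response made up solely of 0 and 1 as binary is an assumption, and the library may compute both statistics differently.

```python
from statistics import mean

def pick_metric(response):
    # Assumption: a response containing only 0 and 1 counts as binary.
    return "ks" if set(response) <= {0, 1} else "r2"

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def ks_statistic(scores, labels):
    """Largest gap between the empirical CDFs of scores with label 1 and label 0."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    d = 0.0
    for t in sorted(set(scores)):
        f_pos = sum(s <= t for s in pos) / len(pos)
        f_neg = sum(s <= t for s in neg) / len(neg)
        d = max(d, abs(f_pos - f_neg))
    return d
```

A screen then amounts to computing one such statistic per motif and sorting the motifs by it in descending order.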

all_motif_datasets

all_motif_datasets(datasets: list[str] | None = None, data_dir: str | Path | None = None) -> pd.DataFrame

Load built-in motif datasets.

Looks for CSV files in the bundled data/ directory (or the R package's exported CSVs).

PARAMETER DESCRIPTION
datasets

Which datasets to load (e.g. ["HOMER", "JASPAR"]). If None, loads all available datasets.

TYPE: list[str] | None DEFAULT: None

data_dir

Directory containing <NAME>_motifs.csv files. If None, the bundled data directory is used, falling back to /tmp when the files were exported from R.

TYPE: str | Path | None DEFAULT: None

RETURNS DESCRIPTION
DataFrame

Combined motif DataFrame with columns motif, pos, A, C, G, T, dataset, motif_orig.
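The <NAME>_motifs.csv naming convention can be illustrated with pathlib. The helper below is hypothetical and only resolves file paths. It does not read the CSVs or reproduce the fallback logic.

```python
from pathlib import Path
import tempfile

SUFFIX = "_motifs.csv"

def dataset_paths(data_dir, datasets=None):
    """Map dataset names to their expected CSV paths inside data_dir."""
    data_dir = Path(data_dir)
    if datasets is None:
        # Discover every file that follows the naming convention.
        return {p.name[: -len(SUFFIX)]: p for p in sorted(data_dir.glob("*" + SUFFIX))}
    return {name: data_dir / f"{name}{SUFFIX}" for name in datasets}

# Demonstration on a temporary directory with two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("HOMER", "JASPAR"):
        (Path(tmp) / f"{name}{SUFFIX}").touch()
    found = sorted(dataset_paths(tmp))
```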

get_motif_pssm

get_motif_pssm(motif_name: str, dataset: DataFrame | None = None, data_dir: str | Path | None = None) -> pd.DataFrame

Get the PSSM for a specific motif by name.

PARAMETER DESCRIPTION
motif_name

Name of the motif (e.g. "JASPAR.HNF1A").

TYPE: str

dataset

Motif dataset DataFrame. If None, loads via all_motif_datasets.

TYPE: DataFrame | None DEFAULT: None

data_dir

Passed to all_motif_datasets if dataset is None.

TYPE: str | Path | None DEFAULT: None

RETURNS DESCRIPTION
DataFrame

PSSM DataFrame with columns pos, A, C, G, T.

RAISES DESCRIPTION
KeyError

If motif_name is not found in the dataset.
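The lookup and its error behaviour can be sketched on a small in-memory dataset. The helper name, the motif names, and the position values are made up for illustration, and the real function may differ in details such as index handling.

```python
import pandas as pd

def get_pssm_sketch(motif_name, dataset):
    """Return the pos, A, C, G, T rows for one motif, or raise KeyError."""
    rows = dataset[dataset["motif"] == motif_name]
    if rows.empty:
        raise KeyError(motif_name)
    return rows[["pos", "A", "C", "G", "T"]].reset_index(drop=True)

# Toy dataset with hypothetical motif names and positions.
dataset = pd.DataFrame({
    "motif": ["DB1.X", "DB1.X", "DB2.Y"],
    "pos": [1, 2, 1],
    "A": [0.7, 0.1, 0.25],
    "C": [0.1, 0.1, 0.25],
    "G": [0.1, 0.7, 0.25],
    "T": [0.1, 0.1, 0.25],
})

pssm = get_pssm_sketch("DB1.X", dataset)
```

Passing a preloaded dataset avoids reloading the bundled files on every lookup.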

motif_enrichment

motif_enrichment(pwm_q: ndarray, groups: ndarray | list[str], threshold: float = 0.99, type: str = 'relative') -> pd.DataFrame

Calculate motif enrichment for groups of loci.

Mirrors the R motif_enrichment() function. Given a matrix of PWM quantile values and group assignments, computes enrichment of motifs in each group.

PARAMETER DESCRIPTION
pwm_q

Matrix of shape (n_loci, n_motifs) with quantile values.

TYPE: ndarray

groups

Group labels (length n_loci).

TYPE: ndarray | list[str]

threshold

Quantile threshold for considering a motif as present (default 0.99).

TYPE: float DEFAULT: 0.99

type

"relative" (enrichment vs other groups) or "absolute" (enrichment vs random).

TYPE: str DEFAULT: 'relative'

RETURNS DESCRIPTION
DataFrame

Enrichment matrix with groups as rows and motifs as columns.
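One plausible reading of the two enrichment types is sketched below with NumPy. The definitions are assumptions, not taken from the library: a motif is "present" in a locus when its quantile is at or above threshold, "relative" divides the in-group hit fraction by the hit fraction in all other groups, and "absolute" divides it by the fraction expected at random, 1 - threshold. The parameter is named kind here to avoid shadowing Python's built-in type.

```python
import numpy as np

def enrichment_sketch(pwm_q, groups, threshold=0.99, kind="relative"):
    """Hypothetical enrichment: in-group hit fraction over a background fraction."""
    pwm_q = np.asarray(pwm_q, dtype=float)
    groups = np.asarray(groups)
    hits = pwm_q >= threshold                      # (n_loci, n_motifs) booleans
    labels = list(dict.fromkeys(groups.tolist()))  # unique labels, first-seen order
    out = np.empty((len(labels), pwm_q.shape[1]))
    for i, g in enumerate(labels):
        in_group = groups == g
        frac_in = hits[in_group].mean(axis=0)
        if kind == "relative":
            background = hits[~in_group].mean(axis=0)
        else:
            background = np.full(pwm_q.shape[1], 1.0 - threshold)
        out[i] = frac_in / background
    return labels, out

# Toy quantile matrix: 4 loci, 2 motifs, 2 groups.
pwm_q = [[0.995, 0.5], [0.999, 0.995], [0.2, 0.995], [0.995, 0.1]]
groups = ["a", "a", "b", "b"]
labels, rel = enrichment_sketch(pwm_q, groups)
_, absolute = enrichment_sketch(pwm_q, groups, kind="absolute")
```

In this sketch a relative background with no hits yields an infinite ratio, so a real implementation would need a guard or pseudocount there.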