Motif Database¶

Motif database management, querying, and enrichment analysis. Supports bundled JASPAR and HOMER databases.

pyprego.motif_db ¶

Motif database management.

Mirrors MotifDB.R and motif-dbs.R from the R prego package. Provides functionality for loading, storing, and querying collections of motif PSSMs (e.g. JASPAR, HOMER databases).

The `MotifDB` class stores stacked log-scale PWM matrices for efficient batch scoring across many motifs.

MotifDB ¶

A collection of motif PSSMs stored as stacked log-scale matrices.

This is the Python port of the R S4 MotifDB class. All PWM data is stored in two dense matrices (forward and reverse-complement), each of shape (max_len * 4, n_motifs) where positions beyond a motif's actual length are zeroed out.

PARAMETER	DESCRIPTION
`mat`	Stacked log-scale PWM matrix, shape `(max_len * 4, n_motifs)`. TYPE: `ndarray`
`rc_mat`	Reverse-complement log-scale PWM matrix, same shape as `mat`. TYPE: `ndarray`
`motif_lengths`	Mapping from motif name to its length in positions. TYPE: `dict[str, int]`
`prior`	The PSSM prior probability used when creating the matrices. TYPE: `float`
`spat_factors`	Spatial factor matrix, shape `(n_motifs, n_bins)`. TYPE: `ndarray`
`spat_bin_size`	Size of spatial bins. TYPE: `float` DEFAULT: `1.0`
`spat_min`	Starting position of the sequence, or `None`. TYPE: `float \| None` DEFAULT: `None`
`spat_max`	Ending position of the sequence, or `None`. TYPE: `float \| None` DEFAULT: `None`
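
The stacked layout described above can be illustrated with a minimal, standard-library sketch. It is not the library's implementation: the function name `stack_log_pssms`, the position-major row order (`pos * 4 + base`), and the way the prior is added and renormalised are all assumptions made for illustration.

```python
import math


def stack_log_pssms(pssms, prior=0.01):
    """Stack per-motif probability matrices into one column per motif.

    pssms: dict mapping motif name -> list of [A, C, G, T] probability rows.
    Returns (names, mat, rc_mat, motif_lengths); mat and rc_mat are lists of
    rows with shape (max_len * 4, n_motifs).
    """
    names = list(pssms)
    max_len = max(len(rows) for rows in pssms.values())
    # Rows beyond a motif's own length stay at zero (the padding).
    mat = [[0.0] * len(names) for _ in range(max_len * 4)]
    rc_mat = [[0.0] * len(names) for _ in range(max_len * 4)]
    for j, name in enumerate(names):
        # Assumed smoothing: add the prior to each entry, then renormalise.
        smoothed = []
        for row in pssms[name]:
            total = sum(row) + 4 * prior
            smoothed.append([(p + prior) / total for p in row])
        # Reverse complement: reverse the positions and swap A<->T, C<->G.
        # With A, C, G, T column order, the swap is a reversal of each row.
        rc = [list(reversed(row)) for row in reversed(smoothed)]
        for pos in range(len(smoothed)):
            for base in range(4):
                mat[pos * 4 + base][j] = math.log(smoothed[pos][base])
                rc_mat[pos * 4 + base][j] = math.log(rc[pos][base])
    motif_lengths = {name: len(pssms[name]) for name in names}
    return names, mat, rc_mat, motif_lengths
```

Keeping every motif in a fixed-height column is what makes batch scoring possible: one matrix product can score a window against all motifs at once, and the zero padding contributes nothing to the sum.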

names ¶

names() -> list[str]

Return the names of all motifs in the database.

__getitem__ ¶

__getitem__(key: str | list[str] | int | list[int]) -> MotifDB

Subset the MotifDB by motif name(s) or integer index/indices.

Supports exact matching for strings and integer indexing.

PARAMETER	DESCRIPTION
`key`	Motif name(s) or integer index/indices. TYPE: `str \| list[str] \| int \| list[int]`

RETURNS	DESCRIPTION
`MotifDB`	A new MotifDB containing only the selected motifs.

grep ¶

grep(pattern: str | list[str]) -> MotifDB

Subset the MotifDB by regex pattern matching on motif names.

PARAMETER	DESCRIPTION
`pattern`	Regex pattern(s) to match against motif names (case-insensitive). TYPE: `str \| list[str]`

RETURNS	DESCRIPTION
`MotifDB`	A new MotifDB containing matched motifs.
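
As an illustration of the name matching described above, the following sketch filters a plain list of names with Python's `re` module. The helper `grep_names` is hypothetical, and two behaviours are assumptions rather than documented facts: that a pattern may match anywhere in the name (`re.search`), and that a list of patterns means "match any of them". Apart from `JASPAR.HNF1A`, the motif names are made up.

```python
import re


def grep_names(names, pattern):
    """Return the names matched by one or more case-insensitive regexes."""
    patterns = [pattern] if isinstance(pattern, str) else list(pattern)
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    # Keep the original order of names; a name is kept if any pattern hits.
    return [n for n in names if any(c.search(n) for c in compiled)]


names = ["JASPAR.HNF1A", "JASPAR.GATA1", "HOMER.GATA3", "HOMER.CTCF"]
print(grep_names(names, "gata"))
print(grep_names(names, ["^homer", "hnf1"]))
```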

create_motif_db ¶

create_motif_db(pssm_df: DataFrame, prior: float = 0.01, spat_factors: ndarray | None = None, spat_bin_size: float = 1.0, spat_min: float | None = None, spat_max: float | None = None) -> MotifDB

Create a `MotifDB` from a tidy DataFrame of PSSMs.

Mirrors the R create_motif_db() function.

PARAMETER	DESCRIPTION
`pssm_df`	DataFrame with columns `motif`, `A`, `C`, `G`, `T`. Each group of rows with the same `motif` value defines one PSSM. TYPE: `DataFrame`
`prior`	Pseudocount prior to add to probabilities (must be in (0, 1)). TYPE: `float` DEFAULT: `0.01`
`spat_factors`	Spatial factor matrix of shape `(n_motifs, n_bins)`, with rows ordered the same as the unique motifs in pssm_df. If `None`, a default all-ones vector is used. TYPE: `ndarray \| None` DEFAULT: `None`
`spat_bin_size`	Size of spatial bins. TYPE: `float` DEFAULT: `1.0`
`spat_min`	Starting position of the sequence, or `None`. TYPE: `float \| None` DEFAULT: `None`
`spat_max`	Ending position of the sequence, or `None`. TYPE: `float \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`MotifDB`	A validated MotifDB object.
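
The tidy input format and the row-ordering rule for `spat_factors` can be sketched without pandas. This is only an illustration under stated assumptions: the helper `group_tidy_pssms` is hypothetical, "unique motifs" is taken to mean first-appearance order, and the default spatial factors are assumed to be a single all-ones bin per motif.

```python
def group_tidy_pssms(rows, prior=0.01):
    """Group tidy PSSM rows by motif, preserving first-appearance order.

    rows: iterable of dicts with keys "motif", "A", "C", "G", "T".
    Returns (motif_names, pssms, spat_factors).
    """
    if not 0 < prior < 1:
        raise ValueError("prior must be in (0, 1)")
    pssms = {}
    for row in rows:
        # dicts keep insertion order, so motifs stay in first-appearance order.
        pssms.setdefault(row["motif"], []).append(
            [row["A"], row["C"], row["G"], row["T"]]
        )
    motif_names = list(pssms)
    # Assumed default: one all-ones spatial bin per motif, in the same order.
    spat_factors = [[1.0] for _ in motif_names]
    return motif_names, pssms, spat_factors
```

A custom spatial factor matrix would need its rows in the order of `motif_names`, which is why the grouping step must not sort the motifs.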

motif_db_to_dataframe ¶

motif_db_to_dataframe(db: MotifDB) -> pd.DataFrame

Convert a `MotifDB` back to a tidy DataFrame.

Mirrors the R motif_db_to_dataframe() function.

PARAMETER	DESCRIPTION
`db`	A MotifDB object. TYPE: `MotifDB`

RETURNS	DESCRIPTION
`DataFrame`	DataFrame with columns `motif`, `pos`, `A`, `C`, `G`, `T`.

set_prior ¶

set_prior(db: MotifDB, new_prior: float) -> MotifDB

Create a new MotifDB with a different prior.

Mirrors the R `prior<-` replacement method.

PARAMETER	DESCRIPTION
`db`	Original MotifDB. TYPE: `MotifDB`
`new_prior`	New prior value (must be in (0, 1)). TYPE: `float`

RETURNS	DESCRIPTION
`MotifDB`	New MotifDB with the updated prior.

extract_pwm ¶

extract_pwm(sequences: list[str], motif_db: MotifDB | DataFrame, *, motifs: list[str] | None = None, bidirect: bool = True, prior: float = 0.01, func: str = 'logSumExp', spat_min: int | None = None, spat_max: int | None = None) -> pd.DataFrame

Compute PWM scores for all motifs in a database.

For each motif in `motif_db`, extracts its PSSM and calls `pyprego.compute.compute_pwm`.

PARAMETER	DESCRIPTION
`sequences`	DNA sequences (should all be the same length). TYPE: `list[str]`
`motif_db`	A MotifDB object or a tidy DataFrame with `motif` column. TYPE: `MotifDB \| DataFrame`
`motifs`	Subset of motif names to extract. If `None`, all motifs are used. TYPE: `list[str] \| None` DEFAULT: `None`
`bidirect`	Score both orientations. TYPE: `bool` DEFAULT: `True`
`prior`	PSSM prior. TYPE: `float` DEFAULT: `0.01`
`func`	Aggregation function (`"logSumExp"` or `"max"`). TYPE: `str` DEFAULT: `'logSumExp'`
`spat_min`	Minimum spatial position (1-based). TYPE: `int \| None` DEFAULT: `None`
`spat_max`	Maximum spatial position. TYPE: `int \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame`	DataFrame with one column per motif and one row per sequence.
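
The difference between the two `func` options can be seen in a small sketch that scores one motif over one sequence. It is not how `compute_pwm` is implemented: the helper names are hypothetical, and pooling forward and reverse-complement window scores into a single aggregation is an assumption about what `bidirect` does.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def score_sequence(seq, log_pwm, log_pwm_rc=None, func="logSumExp"):
    """Aggregate per-window log scores of one motif over one sequence.

    log_pwm: list of [A, C, G, T] log-probability rows, one per position.
    log_pwm_rc: reverse-complement matrix; if given, both strands are scored.
    """
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    width = len(log_pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    scores = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        for pwm in (log_pwm, log_pwm_rc):
            if pwm is None:
                continue
            scores.append(sum(pwm[i][idx[ch]] for i, ch in enumerate(window)))
    if func == "max":
        return max(scores)  # best single binding site
    if func == "logSumExp":
        return log_sum_exp(scores)  # total binding energy over all sites
    raise ValueError(f"unknown func: {func!r}")
```

`"max"` reports only the strongest site, whereas `"logSumExp"` also rewards sequences that carry several weaker sites.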

screen_pwm ¶

screen_pwm(sequences: list[str], response: ndarray, motif_db: MotifDB | DataFrame, *, motifs: list[str] | None = None, metric: str | None = None, bidirect: bool = True, prior: float = 0.01, only_best: bool = False) -> pd.DataFrame

Screen all motifs in a database against a response variable.

For each motif, computes the PWM score of every sequence and then scores it against the response, using either R-squared or the Kolmogorov-Smirnov (KS) statistic.

PARAMETER	DESCRIPTION
`sequences`	DNA sequences. TYPE: `list[str]`
`response`	Response variable (1-D array, same length as sequences). For binary responses the KS metric is used by default; for continuous responses R-squared is used. TYPE: `ndarray`
`motif_db`	Motif database. TYPE: `MotifDB \| DataFrame`
`motifs`	Subset of motifs to screen. TYPE: `list[str] \| None` DEFAULT: `None`
`metric`	`"r2"` or `"ks"`. If `None`, auto-detected from response. TYPE: `str \| None` DEFAULT: `None`
`bidirect`	Score both orientations. TYPE: `bool` DEFAULT: `True`
`prior`	PSSM prior. TYPE: `float` DEFAULT: `0.01`
`only_best`	If `True`, return only the top-scoring motif. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`DataFrame`	DataFrame with columns `motif` and `score`, sorted descending.
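
The screening step can be sketched in plain Python, given a precomputed table of per-sequence scores. All function names here are hypothetical, and the rule used to detect a binary response (only the values 0 and 1 occur) is an assumption; the source only states that binary responses default to KS and continuous ones to R-squared.

```python
def is_binary(response):
    """Assumed rule: a response is binary if it contains only 0 and 1."""
    return set(response) <= {0, 1}


def r_squared(scores, response):
    """Squared Pearson correlation between scores and response."""
    n = len(scores)
    mx, my = sum(scores) / n, sum(response) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(scores, response))
    sxx = sum((x - mx) ** 2 for x in scores)
    syy = sum((y - my) ** 2 for y in response)
    return (sxy * sxy) / (sxx * syy)


def ks_statistic(scores, response):
    """Two-sample KS statistic between scores of class 1 and class 0."""
    pos = [s for s, r in zip(scores, response) if r == 1]
    neg = [s for s, r in zip(scores, response) if r == 0]
    d = 0.0
    for t in sorted(set(scores)):
        # Largest gap between the two empirical CDFs, checked at each score.
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        d = max(d, abs(cdf_pos - cdf_neg))
    return d


def screen(score_table, response, metric=None):
    """Rank motifs (name -> per-sequence scores) by the chosen metric."""
    if metric is None:
        metric = "ks" if is_binary(response) else "r2"
    fn = {"ks": ks_statistic, "r2": r_squared}[metric]
    ranked = [(name, fn(s, response)) for name, s in score_table.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Taking the first element of the ranked list corresponds to the `only_best=True` behaviour described in the parameter table.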

all_motif_datasets ¶

all_motif_datasets(datasets: list[str] | None = None, data_dir: str | Path | None = None) -> pd.DataFrame

Load built-in motif datasets.

Looks for CSV files in the bundled data/ directory (or the R package's exported CSVs).

PARAMETER	DESCRIPTION
`datasets`	Which datasets to load (e.g. `["HOMER", "JASPAR"]`). If `None`, loads all available datasets. TYPE: `list[str] \| None` DEFAULT: `None`
`data_dir`	Directory containing `<NAME>_motifs.csv` files. If `None`, uses the bundled data directory and falls back to `/tmp` if files were exported from R. TYPE: `str \| Path \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame`	Combined motif DataFrame with columns `motif`, `pos`, `A`, `C`, `G`, `T`, `dataset`, `motif_orig`.
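
The `<NAME>_motifs.csv` naming convention lends itself to a simple directory scan. The sketch below shows one way to resolve dataset names to files; the helper `find_dataset_files` is hypothetical, and raising `FileNotFoundError` for an unknown dataset is an assumption, not documented behaviour.

```python
from pathlib import Path


def find_dataset_files(data_dir, datasets=None):
    """Map dataset name -> CSV path for files named `<NAME>_motifs.csv`."""
    suffix = "_motifs.csv"
    found = {
        p.name[: -len(suffix)]: p
        for p in sorted(Path(data_dir).glob(f"*{suffix}"))
    }
    if datasets is None:
        return found  # all available datasets
    missing = [name for name in datasets if name not in found]
    if missing:
        raise FileNotFoundError(f"no CSV found for: {missing}")
    return {name: found[name] for name in datasets}
```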

get_motif_pssm ¶

get_motif_pssm(motif_name: str, dataset: DataFrame | None = None, data_dir: str | Path | None = None) -> pd.DataFrame

Get the PSSM for a specific motif by name.

PARAMETER	DESCRIPTION
`motif_name`	Name of the motif (e.g. `"JASPAR.HNF1A"`). TYPE: `str`
`dataset`	Motif dataset DataFrame. If `None`, loads via `all_motif_datasets`. TYPE: `DataFrame \| None` DEFAULT: `None`
`data_dir`	Passed to `all_motif_datasets` if `dataset` is `None`. TYPE: `str \| Path \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame`	PSSM DataFrame with columns `pos`, `A`, `C`, `G`, `T`.

RAISES	DESCRIPTION
`KeyError`	If motif_name is not found in the dataset.

motif_enrichment ¶

motif_enrichment(pwm_q: ndarray, groups: ndarray | list[str], threshold: float = 0.99, type: str = 'relative') -> pd.DataFrame

Calculate motif enrichment for groups of loci.

Mirrors the R motif_enrichment() function. Given a matrix of PWM quantile values and group assignments, computes enrichment of motifs in each group.

PARAMETER	DESCRIPTION
`pwm_q`	Matrix of shape `(n_loci, n_motifs)` with quantile values. TYPE: `ndarray`
`groups`	Group labels (length `n_loci`). TYPE: `ndarray \| list[str]`
`threshold`	Quantile at or above which a motif is considered present at a locus. TYPE: `float` DEFAULT: `0.99`
`type`	`"relative"` (enrichment vs other groups) or `"absolute"` (enrichment vs random). TYPE: `str` DEFAULT: `'relative'`

RETURNS	DESCRIPTION
`DataFrame`	Enrichment matrix with groups as rows and motifs as columns.
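
One plausible reading of the two enrichment types is sketched below. The exact formula is not given in this reference, so everything here is an assumption: enrichment is taken as a plain ratio of hit fractions, the "random" background for `"absolute"` is taken as `1 - threshold`, and a zero background yields `nan`. The function name is hypothetical, and the argument is called `kind` rather than `type` to avoid shadowing the Python builtin.

```python
def motif_enrichment_sketch(pwm_q, groups, threshold=0.99, kind="relative"):
    """Per-group enrichment of motif hits: {group: [value per motif]}.

    pwm_q: list of rows, one per locus, each with one quantile per motif.
    groups: list of group labels, one per locus.
    """
    if kind not in ("relative", "absolute"):
        raise ValueError(f"unknown kind: {kind!r}")
    n_motifs = len(pwm_q[0])
    hits = [[q >= threshold for q in row] for row in pwm_q]
    result = {}
    for g in dict.fromkeys(groups):  # unique labels, first-appearance order
        inside = [h for h, lab in zip(hits, groups) if lab == g]
        outside = [h for h, lab in zip(hits, groups) if lab != g]
        row = []
        for j in range(n_motifs):
            frac_in = sum(h[j] for h in inside) / len(inside)
            if kind == "relative":
                # Background: hit fraction across all other groups.
                n_out = len(outside)
                background = sum(h[j] for h in outside) / n_out if n_out else 0.0
            else:
                # Background: hit fraction expected by chance at this quantile.
                background = 1.0 - threshold
            row.append(frac_in / background if background > 0 else float("nan"))
        result[g] = row
    return result
```

The result has one row per group and one value per motif, matching the orientation of the returned enrichment matrix.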