Regression

Motif discovery via iterative PWM regression. This module contains the core regress_pwm function and its variants for multiple motifs, cluster-specific regression, and cross-validation.

pyprego.regression

PWM regression optimiser.

Mirrors regression.R / PWMLRegression.cpp from the R prego package. This module contains the core regress_pwm function that iteratively optimises a PSSM and spatial model to best explain a response variable given a set of DNA sequences.

The implementation is intentionally NumPy-based (no GPU / PyTorch) so that it closely mirrors the R behaviour and can run on any machine.

MultiRegressionResult dataclass

Container for the output of regress_multiple_motifs.

ATTRIBUTE DESCRIPTION
models

Individual regression results for each motif.

TYPE: list[RegressionResult]

multi_stats

Statistics DataFrame with columns: model, score, comb_score, diff, consensus, seed_motif.

TYPE: DataFrame

pred

Combined prediction from the linear model over the per-motif scores.

TYPE: ndarray

coef

Linear model coefficients (one per motif + intercept).

TYPE: ndarray

predict

predict(sequences: list[str] | ndarray) -> np.ndarray

Predict combined scores for new sequences.

PARAMETER DESCRIPTION
sequences

DNA sequences.

TYPE: list[str] | ndarray

RETURNS DESCRIPTION
ndarray

Combined predicted scores.

predict_multi

predict_multi(sequences: list[str] | ndarray) -> pd.DataFrame

Predict per-motif scores for new sequences.

PARAMETER DESCRIPTION
sequences

DNA sequences.

TYPE: list[str] | ndarray

RETURNS DESCRIPTION
DataFrame

DataFrame with columns e1, e2, etc.
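As a rough illustration of how coef could relate the per-motif scores (columns e1, e2, ...) to the combined prediction, the sketch below fits an ordinary least-squares model with NumPy only. The helper names and the intercept-first ordering of the coefficients are assumptions made for this example; this page does not specify the layout of coef.

```python
import numpy as np


def fit_combined(per_motif, response):
    """Least-squares fit of the response on per-motif scores plus an intercept."""
    per_motif = np.asarray(per_motif, dtype=float)
    design = np.column_stack([np.ones(len(per_motif)), per_motif])
    coef, *_ = np.linalg.lstsq(design, np.asarray(response, dtype=float), rcond=None)
    return coef  # assumed layout: [intercept, w_e1, w_e2, ...]


def predict_combined(per_motif, coef):
    """Combine per-motif scores into a single prediction."""
    return coef[0] + np.asarray(per_motif, dtype=float) @ coef[1:]


rng = np.random.default_rng(0)
scores = rng.normal(size=(50, 2))  # stand-in for the e1, e2 columns
response = 1.0 + 2.0 * scores[:, 0] - 3.0 * scores[:, 1]
coef = fit_combined(scores, response)
pred = predict_combined(scores, coef)
```

Because the toy response is an exact linear function of the two score columns, the fitted coefficients recover the generating weights.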

ClusterRegressionResult dataclass

Container for the output of regress_pwm_clusters.

ATTRIBUTE DESCRIPTION
models

Per-cluster regression models.

TYPE: dict[str, RegressionResult]

cluster_mat

Binary indicator matrix (n_sequences, n_clusters).

TYPE: ndarray

pred_mat

Prediction matrix (n_sequences, n_clusters).

TYPE: ndarray

stats

Per-cluster statistics.

TYPE: DataFrame

cluster_names

Cluster names.

TYPE: list[str]

CVRegressionResult dataclass

Container for the output of regress_pwm_cv.

ATTRIBUTE DESCRIPTION
cv_models

Per-fold regression models.

TYPE: list[RegressionResult]

cv_pred

Cross-validated predictions for each sequence.

TYPE: ndarray

score

Overall score on cross-validated predictions.

TYPE: float

cv_scores

Per-fold test scores.

TYPE: list[float]

folds

Fold assignment per sequence.

TYPE: ndarray

full_model

Full model (trained on all data), if requested.

TYPE: RegressionResult | None

regress_pwm

regress_pwm(sequences: list[str] | ndarray, response: ndarray, *, motif: str | DataFrame | None = None, motif_length: int = 15, score_metric: str = 'r2', bidirect: bool = True, spat_bin_size: int | None = None, spat_num_bins: int | None = None, spat_model: DataFrame | None = None, improve_epsilon: float = 0.0001, min_nuc_prob: float = 0.001, unif_prior: float = 0.05, num_folds: int = 1, resolutions: list[float] | None = None, spat_resolutions: list[float] | None = None, log_energy: bool = False, energy_epsilon: float = 1e-05, optimize_pwm: bool = True, optimize_spat: bool = True, symmetrize_spat: bool = True, seed: int | None = 60427, consensus_single_thresh: float = 0.5, consensus_double_thresh: float = 0.75, verbose: bool = False, multi_kmers: bool = False, kmer_length: int | list[int] = 8, max_cands: int = 10, min_gap: int = 0, max_gap: int = 1, min_kmer_cor: float = 0.08, final_metric: str | None = None, sample_for_kmers: bool = False, sample_frac: float | None = None, sample_idxs: ndarray | None = None, sample_ratio: float = 1.0, val_frac: float = 0.1, match_with_db: bool = False, motif_db: dict[str, DataFrame] | DataFrame | None = None, alternative: str = 'less') -> RegressionResult

Perform PWM regression to discover a motif in DNA sequences.

This is the main entry point for motif regression. It wraps the core optimizer (regress_pwm_core) with higher-level logic:

  • K-mer screening: When motif=None, screen k-mers to find the best seed (using screen_kmers).
  • Multi-kmer mode: When multi_kmers=True, try multiple k-mer seeds and pick the best one based on final_metric.
  • Sampling: sample_for_kmers=True uses a subset of the sequences for k-mer screening.
  • Database matching: match_with_db=True matches the result against a motif database (using pssm_match).
  • Automatic metric selection: If final_metric is None, it auto-picks "ks" for binary responses and "r2" for continuous.
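The k-mer screening step in the list above can be pictured with a small NumPy sketch: count each k-mer per sequence, correlate the counts with the response, and keep the strongest candidates. The function screen_kmers_sketch is hypothetical and is not the library's screen_kmers; it ignores gaps (min_gap, max_gap) and only borrows the meaning of kmer_length, min_kmer_cor and max_cands from the parameter list.

```python
import numpy as np
from itertools import product


def screen_kmers_sketch(sequences, response, kmer_length=3,
                        min_kmer_cor=0.08, max_cands=10):
    """Rank ungapped k-mers by |Pearson correlation| of their counts with the response."""
    response = np.asarray(response, dtype=float)
    hits = []
    for kmer in map("".join, product("ACGT", repeat=kmer_length)):
        # Overlapping occurrences of the k-mer in each sequence.
        counts = np.array(
            [sum(seq.startswith(kmer, i) for i in range(len(seq) - kmer_length + 1))
             for seq in sequences],
            dtype=float,
        )
        if counts.std() == 0:
            continue  # k-mer is uninformative (same count everywhere)
        cor = np.corrcoef(counts, response)[0, 1]
        if abs(cor) >= min_kmer_cor:
            hits.append((kmer, cor))
    hits.sort(key=lambda item: -abs(item[1]))
    return hits[:max_cands]


seqs = ["ACGACGTT", "TTTTACGT", "TTTTTTTT", "GGACGACG", "CCCCCCCC", "ACGTTTTT"]
resp = np.array([2.0, 1.0, 0.0, 2.0, 0.0, 1.0])  # number of ACG occurrences
candidates = screen_kmers_sketch(seqs, resp)
best_kmer, best_cor = candidates[0]
```

In this toy data the response is exactly the ACG count, so ACG comes out as the top candidate and would serve as the seed.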

PARAMETER DESCRIPTION
sequences

DNA sequences (equal length, characters A/C/G/T/N).

TYPE: list[str] | ndarray

response

Response variable(s). Shape (n_sequences,) for a single response or (n_sequences, n_responses) for multiple.

TYPE: ndarray

motif

Initial motif. If None and multi_kmers=False, a single k-mer screen is used to find the best seed. If None and multi_kmers=True, multiple candidate k-mers are tried.

TYPE: str | DataFrame | None DEFAULT: None

motif_length

Length of the seed motif (short kmers are padded with wildcards).

TYPE: int DEFAULT: 15

score_metric

"r2" or "ks" (metric used during the optimization).

TYPE: str DEFAULT: 'r2'

bidirect

Use both orientations of the motif.

TYPE: bool DEFAULT: True

spat_bin_size

Spatial bin size in bp. None auto-computes.

TYPE: int | None DEFAULT: None

spat_num_bins

Number of spatial bins. None auto-computes.

TYPE: int | None DEFAULT: None

spat_model

Pre-computed spatial model (bin, spat_factor).

TYPE: DataFrame | None DEFAULT: None

improve_epsilon

Convergence threshold.

TYPE: float DEFAULT: 0.0001

min_nuc_prob

Minimum nucleotide probability per iteration.

TYPE: float DEFAULT: 0.001

unif_prior

Uniform prior for nucleotide probabilities.

TYPE: float DEFAULT: 0.05

num_folds

Number of internal cross-validation folds (1 = no CV).

TYPE: int DEFAULT: 1

resolutions

Step sizes for each phase. None uses C++ defaults.

TYPE: list[float] | None DEFAULT: None

spat_resolutions

Spatial step sizes for each phase.

TYPE: list[float] | None DEFAULT: None

log_energy

Apply log transform to energies.

TYPE: bool DEFAULT: False

energy_epsilon

Epsilon for log transform.

TYPE: float DEFAULT: 1e-05

optimize_pwm

Whether to optimize PWM probabilities.

TYPE: bool DEFAULT: True

optimize_spat

Whether to optimize spatial factors.

TYPE: bool DEFAULT: True

symmetrize_spat

Symmetrize spatial factors for bidirectional models.

TYPE: bool DEFAULT: True

seed

Random seed for reproducibility.

TYPE: int | None DEFAULT: 60427

consensus_single_thresh

Threshold for single-nucleotide consensus calls.

TYPE: float DEFAULT: 0.5

consensus_double_thresh

Threshold for double-nucleotide consensus calls.

TYPE: float DEFAULT: 0.75

verbose

Print progress.

TYPE: bool DEFAULT: False

multi_kmers

Try multiple k-mer seeds and pick the best.

TYPE: bool DEFAULT: False

kmer_length

K-mer length(s) to screen.

TYPE: int | list[int] DEFAULT: 8

max_cands

Maximum number of k-mer candidates.

TYPE: int DEFAULT: 10

min_gap

Minimum gap length considered when generating k-mers.

TYPE: int DEFAULT: 0

max_gap

Maximum gap length considered when generating k-mers.

TYPE: int DEFAULT: 1

min_kmer_cor

Minimum correlation to include a k-mer.

TYPE: float DEFAULT: 0.08

final_metric

Metric for picking the best model. None auto-selects.

TYPE: str | None DEFAULT: None

sample_for_kmers

Sample the dataset for k-mer screening.

TYPE: bool DEFAULT: False

sample_frac

Fraction of the dataset to sample for k-mer screening.

TYPE: float | None DEFAULT: None

sample_idxs

Explicit sample indices.

TYPE: ndarray | None DEFAULT: None

sample_ratio

Ratio of classes in sampling.

TYPE: float DEFAULT: 1.0

val_frac

Fraction for internal validation when using multi-kmer mode.

TYPE: float DEFAULT: 0.1

match_with_db

Match result against motif database.

TYPE: bool DEFAULT: False

motif_db

Motif database for matching.

TYPE: dict[str, DataFrame] | DataFrame | None DEFAULT: None

alternative

Alternative hypothesis for the KS test.

TYPE: str DEFAULT: 'less'

RETURNS DESCRIPTION
RegressionResult

Fitted regression result.
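The automatic choice between "ks" and "r2" described for final_metric can be sketched as follows. The binary test used here (all values are 0 or 1) is an assumption about what counts as a binary response, and ks_statistic computes the plain two-sided Kolmogorov-Smirnov distance, so the one-sided behaviour selected by alternative is not modelled. All function names are hypothetical.

```python
import numpy as np


def is_binary(response):
    """Assumed criterion: every value is 0 or 1."""
    values = np.unique(np.asarray(response))
    return values.size <= 2 and bool(np.isin(values, (0, 1)).all())


def pick_metric(response, final_metric=None):
    """Return the explicit metric if given, else 'ks' for binary and 'r2' otherwise."""
    if final_metric is not None:
        return final_metric
    return "ks" if is_binary(response) else "r2"


def ks_statistic(scores, labels):
    """Two-sided KS distance between scores of label-1 and label-0 sequences."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = np.sort(scores[labels]), np.sort(scores[~labels])
    grid = np.concatenate([pos, neg])
    cdf_pos = np.searchsorted(pos, grid, side="right") / pos.size
    cdf_neg = np.searchsorted(neg, grid, side="right") / neg.size
    return float(np.abs(cdf_pos - cdf_neg).max())


def r2_score(pred, response):
    """Squared Pearson correlation between predictions and response."""
    return float(np.corrcoef(pred, response)[0, 1] ** 2)


metric_binary = pick_metric(np.array([0, 1, 1, 0]))
metric_continuous = pick_metric(np.array([0.1, 2.3, 1.7]))
separation = ks_statistic([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```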

regress_pwm_core

regress_pwm_core(sequences: list[str] | ndarray, response: ndarray, *, motif: str | DataFrame | None = None, motif_length: int = 15, score_metric: str = 'r2', bidirect: bool = True, spat_bin_size: int | None = None, spat_num_bins: int | None = None, spat_model: DataFrame | None = None, improve_epsilon: float = 0.0001, min_nuc_prob: float = 0.001, unif_prior: float = 0.05, num_folds: int = 1, resolutions: list[float] | None = None, spat_resolutions: list[float] | None = None, log_energy: bool = False, energy_epsilon: float = 1e-05, optimize_pwm: bool = True, optimize_spat: bool = True, symmetrize_spat: bool = True, seed: int | None = 60427, consensus_single_thresh: float = 0.5, consensus_double_thresh: float = 0.75, verbose: bool = False) -> RegressionResult

Core PWM regression optimizer (low-level).

This is the faithful port of the C++ PWMLRegression class, using coordinate descent to iteratively optimise PSSM probabilities and spatial factors. It does not perform k-mer screening, multi-kmer tries, or database matching. For the high-level wrapper, see regress_pwm.
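To give a feel for coordinate descent over PSSM probabilities, here is a deliberately simplified toy version. It is not the PWMLRegression algorithm: it scans the forward strand only, has no spatial model, and uses a made-up update rule (add a step to one probability, floor at min_nuc_prob, renormalise, keep the change if r2 improves by more than improve_epsilon). All helper names, the step sizes and the max_sweeps cap are assumptions for illustration.

```python
import numpy as np

NUCS = "ACGT"


def one_hot(seqs):
    """Encode equal-length DNA strings as (n, length, 4); N stays all-zero."""
    index = {c: i for i, c in enumerate(NUCS)}
    out = np.zeros((len(seqs), len(seqs[0]), 4))
    for s, seq in enumerate(seqs):
        for p, ch in enumerate(seq):
            if ch in index:
                out[s, p, index[ch]] = 1.0
    return out


def energy(pssm, x):
    """Sum over start positions of the window likelihood under the PSSM."""
    width = pssm.shape[0]
    total = np.zeros(x.shape[0])
    for start in range(x.shape[1] - width + 1):
        probs = (x[:, start:start + width, :] * pssm).sum(axis=2)
        total += probs.prod(axis=1)
    return total


def r2(pred, y):
    if pred.std() == 0 or y.std() == 0:
        return 0.0
    return float(np.corrcoef(pred, y)[0, 1] ** 2)


def coordinate_descent(pssm, x, y, steps=(0.2, 0.05), improve_epsilon=1e-4,
                       min_nuc_prob=1e-3, max_sweeps=50):
    pssm = pssm.copy()
    best = r2(energy(pssm, x), y)
    for step in steps:  # coarse phase first, then a finer one
        for _ in range(max_sweeps):
            improved = False
            for pos in range(pssm.shape[0]):
                for nuc in range(4):
                    cand = pssm.copy()
                    cand[pos, nuc] += step
                    cand[pos] = np.maximum(cand[pos], min_nuc_prob)
                    cand[pos] /= cand[pos].sum()
                    score = r2(energy(cand, x), y)
                    if score > best + improve_epsilon:
                        pssm, best, improved = cand, score, True
            if not improved:
                break
    return pssm, best


rng = np.random.default_rng(1)
seqs = ["".join(rng.choice(list(NUCS), size=12)) for _ in range(30)]
y = np.array([float(s.count("ACG")) for s in seqs])
x = one_hot(seqs)
start = np.full((3, 4), 0.25)  # all-wildcard seed
start_score = r2(energy(start, x), y)
fitted, best_score = coordinate_descent(start, x, y)
```

Each accepted move strictly increases the score, so the returned score is never below the starting one, and every PSSM row remains a probability distribution.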

PARAMETER DESCRIPTION
sequences

DNA sequences (equal length, characters A/C/G/T/N).

TYPE: list[str] | ndarray

response

Response variable(s). Shape (n_sequences,) for a single response or (n_sequences, n_responses) for multiple.

TYPE: ndarray

motif

Initial motif. A kmer string (* = wildcard), a PSSM DataFrame, or None (defaults to all-wildcards).

TYPE: str | DataFrame | None DEFAULT: None

motif_length

Length of the seed motif (short kmers are padded with wildcards).

TYPE: int DEFAULT: 15

score_metric

"r2" or "ks".

TYPE: str DEFAULT: 'r2'

bidirect

Use both orientations of the motif.

TYPE: bool DEFAULT: True

spat_bin_size

Spatial bin size in bp. None auto-computes.

TYPE: int | None DEFAULT: None

spat_num_bins

Number of spatial bins. None auto-computes.

TYPE: int | None DEFAULT: None

spat_model

Pre-computed spatial model (bin, spat_factor).

TYPE: DataFrame | None DEFAULT: None

improve_epsilon

Convergence threshold.

TYPE: float DEFAULT: 0.0001

min_nuc_prob

Minimum nucleotide probability per iteration.

TYPE: float DEFAULT: 0.001

unif_prior

Uniform prior for nucleotide probabilities.

TYPE: float DEFAULT: 0.05

num_folds

Number of cross-validation folds (1 = no CV).

TYPE: int DEFAULT: 1

resolutions

Step sizes for each phase. None uses C++ defaults.

TYPE: list[float] | None DEFAULT: None

spat_resolutions

Spatial step sizes for each phase.

TYPE: list[float] | None DEFAULT: None

log_energy

Apply log transform to energies.

TYPE: bool DEFAULT: False

energy_epsilon

Small constant for log(energy + epsilon).

TYPE: float DEFAULT: 1e-05

optimize_pwm

Whether to optimize PWM probabilities.

TYPE: bool DEFAULT: True

optimize_spat

Whether to optimize spatial factors.

TYPE: bool DEFAULT: True

symmetrize_spat

Symmetrize spatial factors for bidirectional models.

TYPE: bool DEFAULT: True

seed

Random seed for reproducibility.

TYPE: int | None DEFAULT: 60427

consensus_single_thresh

Threshold for single-nucleotide consensus calls.

TYPE: float DEFAULT: 0.5

consensus_double_thresh

Threshold for double-nucleotide consensus calls.

TYPE: float DEFAULT: 0.75

verbose

Print progress messages.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
RegressionResult

Fitted regression result with PSSM, spatial model, predictions, etc.
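The seed and consensus parameters above can be illustrated with a small round trip: turn a k-mer into a seed PSSM padded with wildcards, then read a consensus string back from it. This is a sketch under several assumptions that this page does not confirm: padding is split evenly on both sides, unif_prior is mixed in as a uniform share of each specified position, two-nucleotide calls use IUPAC codes, and unresolved positions are written as "*".

```python
import numpy as np

NUCS = "ACGT"
# IUPAC codes for two-nucleotide ambiguity.
IUPAC2 = {"AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K"}


def seed_pssm(kmer, motif_length=15, unif_prior=0.05):
    """Pad the k-mer with '*' to motif_length and convert it to a probability matrix."""
    pad = max(motif_length - len(kmer), 0)
    padded = "*" * (pad // 2) + kmer + "*" * (pad - pad // 2)  # assumed: centred padding
    pssm = np.full((len(padded), 4), 0.25)  # wildcards stay uniform
    for pos, ch in enumerate(padded):
        if ch in NUCS:
            row = np.full(4, unif_prior / 4)
            row[NUCS.index(ch)] += 1.0 - unif_prior
            pssm[pos] = row
    return pssm


def consensus(pssm, single_thresh=0.5, double_thresh=0.75):
    """Call one base, a two-base IUPAC code, or '*' for each position."""
    out = []
    for row in pssm:
        order = np.argsort(row)[::-1]  # most probable nucleotide first
        if row[order[0]] >= single_thresh:
            out.append(NUCS[order[0]])
        elif row[order[0]] + row[order[1]] >= double_thresh:
            pair = "".join(sorted(NUCS[order[0]] + NUCS[order[1]]))
            out.append(IUPAC2[pair])
        else:
            out.append("*")
    return "".join(out)


pssm = seed_pssm("ACG", motif_length=5)
motif_string = consensus(pssm)  # "*ACG*"
```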

regress_multiple_motifs

regress_multiple_motifs(sequences: list[str] | ndarray, response: ndarray, motif_num: int = 2, smooth_k: int = 100, alternative: str = 'less', verbose: bool = False, **kwargs) -> MultiRegressionResult

Iteratively regress multiple motifs.

Finds the first motif via regress_pwm, then for each subsequent motif computes residuals (response - smoothed predictions) and regresses on those. A combined linear model is fit at each step.
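One way to picture the residual step is shown below: smooth the current model's predictions with a running mean of window smooth_k and subtract the result from the response. The ordering used for smoothing (sequences sorted by predicted value) and the truncated edge windows are assumptions of this sketch, and both helper names are hypothetical.

```python
import numpy as np


def running_mean(values, k):
    """Centred moving average; windows are truncated at the edges."""
    values = np.asarray(values, dtype=float)
    k = max(1, min(int(k), values.size))
    kernel = np.ones(k)
    sums = np.convolve(values, kernel, mode="same")
    counts = np.convolve(np.ones_like(values), kernel, mode="same")
    return sums / counts


def smoothed_residuals(response, pred, k=100):
    """response minus the running mean of pred, smoothed in order of predicted value."""
    response = np.asarray(response, dtype=float)
    pred = np.asarray(pred, dtype=float)
    order = np.argsort(pred, kind="stable")
    smooth = np.empty_like(pred)
    smooth[order] = running_mean(pred[order], k)
    return response - smooth


rng = np.random.default_rng(0)
pred = rng.normal(size=200)
response = pred + rng.normal(scale=0.1, size=200)
resid = smoothed_residuals(response, pred, k=20)
```

The next motif would then be regressed against resid instead of the original response.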

PARAMETER DESCRIPTION
sequences

DNA sequences.

TYPE: list[str] | ndarray

response

Response variable(s).

TYPE: ndarray

motif_num

Number of motifs to discover (must be >= 2).

TYPE: int DEFAULT: 2

smooth_k

Window size for smoothing predictions when computing residuals.

TYPE: int DEFAULT: 100

alternative

Alternative hypothesis for the KS test.

TYPE: str DEFAULT: 'less'

verbose

Print progress.

TYPE: bool DEFAULT: False

**kwargs

Additional arguments passed to regress_pwm.

DEFAULT: {}

RETURNS DESCRIPTION
MultiRegressionResult

Combined multi-motif result.

regress_pwm_clusters

regress_pwm_clusters(sequences: list[str] | ndarray, clusters: ndarray | list, alternative: str = 'less', verbose: bool = False, **kwargs) -> ClusterRegressionResult

Run PWM regression for each sequence cluster.

Creates a binary response (in-cluster vs. not) for each cluster and runs regress_pwm on each.
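The one-vs-rest encoding described here can be sketched with NumPy: each column of the indicator matrix is the binary response for one cluster. The helper name cluster_indicator and the sorted ordering of cluster names are assumptions for this example.

```python
import numpy as np


def cluster_indicator(clusters):
    """Return (cluster_names, matrix) with matrix[i, j] = 1 if sequence i is in cluster j."""
    clusters = np.asarray(clusters)
    names = np.unique(clusters)  # sorted unique labels
    matrix = (clusters[:, None] == names[None, :]).astype(int)
    return [str(name) for name in names], matrix


cluster_names, cluster_mat = cluster_indicator(["a", "b", "a", "c"])
# Column j is the binary response that would be regressed for cluster_names[j].
response_a = cluster_mat[:, cluster_names.index("a")]
```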

PARAMETER DESCRIPTION
sequences

DNA sequences.

TYPE: list[str] | ndarray

clusters

Cluster assignment for each sequence.

TYPE: ndarray | list

alternative

Alternative hypothesis for KS test.

TYPE: str DEFAULT: 'less'

verbose

Print progress.

TYPE: bool DEFAULT: False

**kwargs

Additional arguments passed to regress_pwm.

DEFAULT: {}

RETURNS DESCRIPTION
ClusterRegressionResult

Per-cluster models, predictions, and statistics.

regress_pwm_cv

regress_pwm_cv(sequences: list[str] | ndarray, response: ndarray, nfolds: int | None = None, metric: str | None = None, folds: ndarray | None = None, add_full_model: bool = True, seed: int | None = 60427, alternative: str = 'less', verbose: bool = False, **kwargs) -> CVRegressionResult

Cross-validate a PWM regression model.

PARAMETER DESCRIPTION
sequences

DNA sequences.

TYPE: list[str] | ndarray

response

Response variable(s).

TYPE: ndarray

nfolds

Number of folds. Required if folds is not provided.

TYPE: int | None DEFAULT: None

metric

Evaluation metric. Auto-selects "ks" for binary, "r2" for continuous.

TYPE: str | None DEFAULT: None

folds

Explicit fold assignments. Overrides nfolds.

TYPE: ndarray | None DEFAULT: None

add_full_model

Also train a model on all data.

TYPE: bool DEFAULT: True

seed

Random seed.

TYPE: int | None DEFAULT: 60427

alternative

Alternative hypothesis for KS test.

TYPE: str DEFAULT: 'less'

verbose

Print progress.

TYPE: bool DEFAULT: False

**kwargs

Additional arguments passed to regress_pwm.

DEFAULT: {}

RETURNS DESCRIPTION
CVRegressionResult

Cross-validation results.
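The bookkeeping behind cv_pred, cv_scores and folds can be sketched with a stand-in model. In the example below a least-squares line replaces the PWM regression, folds are assigned by shuffling a balanced label vector with a seeded generator, and r2 is the squared Pearson correlation; all of these choices, and the helper names, are assumptions made to keep the sketch self-contained.

```python
import numpy as np


def r2(pred, y):
    return float(np.corrcoef(pred, y)[0, 1] ** 2)


def assign_folds(n, nfolds, seed=60427):
    """Balanced random fold labels 0..nfolds-1, one per sequence."""
    rng = np.random.default_rng(seed)
    folds = np.arange(n) % nfolds
    rng.shuffle(folds)
    return folds


def cross_validate(x, y, nfolds=None, folds=None, seed=60427):
    """Out-of-fold predictions and scores; explicit folds override nfolds."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if folds is None:
        if nfolds is None:
            raise ValueError("nfolds is required when folds is not provided")
        folds = assign_folds(len(y), nfolds, seed)
    cv_pred = np.empty_like(y)
    cv_scores = []
    for fold in np.unique(folds):
        test = folds == fold
        # Stand-in for fitting a regression model on the training part.
        slope, intercept = np.polyfit(x[~test], y[~test], 1)
        cv_pred[test] = slope * x[test] + intercept
        cv_scores.append(r2(cv_pred[test], y[test]))
    return cv_pred, cv_scores, r2(cv_pred, y), folds


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.01, size=100)
cv_pred, cv_scores, score, folds = cross_validate(x, y, nfolds=5)
```

Every sequence receives exactly one prediction, made by the model that did not see it during training, and score is computed on those pooled out-of-fold predictions.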