Regression

Motif discovery via iterative PWM regression. This module contains the core regress_pwm function and its variants for multiple motifs, cluster-specific regression, and cross-validation.

pyprego.regression

PWM regression optimiser.

Mirrors regression.R / PWMLRegression.cpp from the R prego package. This module contains the core regress_pwm function that iteratively optimises a PSSM and spatial model to best explain a response variable given a set of DNA sequences.

The implementation is intentionally NumPy-based (no GPU / PyTorch) so that it closely mirrors the R behaviour and can run on any machine.

MultiRegressionResult `dataclass`

Container for the output of `regress_multiple_motifs`.

ATTRIBUTE	DESCRIPTION
`models`	Individual regression results for each motif. TYPE: `list[RegressionResult]`
`multi_stats`	Statistics DataFrame with columns: model, score, comb_score, diff, consensus, seed_motif. TYPE: `DataFrame`
`pred`	Combined prediction using linear model. TYPE: `ndarray`
`coef`	Linear model coefficients (one per motif + intercept). TYPE: `ndarray`

predict

predict(sequences: list[str] | ndarray) -> np.ndarray

Predict combined scores for new sequences.

PARAMETER	DESCRIPTION
`sequences`	DNA sequences. TYPE: `list[str] \| ndarray`

RETURNS	DESCRIPTION
`ndarray`	Combined predicted scores.
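
As a rough illustration of what "combined prediction using linear model" means, the sketch below applies one weight per motif plus an intercept to a table of per-motif scores. The helper name `combine_scores` is hypothetical, and placing the intercept last in `coef` is an assumption made for this example; the source does not state the ordering.

```python
def combine_scores(per_motif, coef):
    """Linear combination of per-motif scores (illustrative sketch).

    per_motif: one row per sequence, one value per motif.
    coef: one weight per motif followed by the intercept. The ordering
          is an assumption of this sketch, not documented behaviour.
    """
    *weights, intercept = coef
    return [
        intercept + sum(w * s for w, s in zip(weights, row))
        for row in per_motif
    ]


# Two sequences scored by two motifs.
scores = [[1.0, 2.0], [0.5, 0.0]]
combined = combine_scores(scores, [2.0, -1.0, 0.5])
```

Each entry of `combined` is the intercept plus the weighted sum of that sequence's motif scores.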

predict_multi

predict_multi(sequences: list[str] | ndarray) -> pd.DataFrame

Predict per-motif scores for new sequences.

PARAMETER	DESCRIPTION
`sequences`	DNA sequences. TYPE: `list[str] \| ndarray`

RETURNS	DESCRIPTION
`DataFrame`	DataFrame with columns `e1`, `e2`, etc.

ClusterRegressionResult `dataclass`

Container for the output of `regress_pwm_clusters`.

ATTRIBUTE	DESCRIPTION
`models`	Per-cluster regression models. TYPE: `dict[str, RegressionResult]`
`cluster_mat`	Binary indicator matrix (n_sequences, n_clusters). TYPE: `ndarray`
`pred_mat`	Prediction matrix (n_sequences, n_clusters). TYPE: `ndarray`
`stats`	Per-cluster statistics. TYPE: `DataFrame`
`cluster_names`	Cluster names. TYPE: `list[str]`

CVRegressionResult `dataclass`

Container for the output of `regress_pwm_cv`.

ATTRIBUTE	DESCRIPTION
`cv_models`	Per-fold regression models. TYPE: `list[RegressionResult]`
`cv_pred`	Cross-validated predictions for each sequence. TYPE: `ndarray`
`score`	Overall score on cross-validated predictions. TYPE: `float`
`cv_scores`	Per-fold test scores. TYPE: `list[float]`
`folds`	Fold assignment per sequence. TYPE: `ndarray`
`full_model`	Full model (trained on all data), if requested. TYPE: `RegressionResult \| None`

regress_pwm

regress_pwm(sequences: list[str] | ndarray, response: ndarray, *, motif: str | DataFrame | None = None, motif_length: int = 15, score_metric: str = 'r2', bidirect: bool = True, spat_bin_size: int | None = None, spat_num_bins: int | None = None, spat_model: DataFrame | None = None, improve_epsilon: float = 0.0001, min_nuc_prob: float = 0.001, unif_prior: float = 0.05, num_folds: int = 1, resolutions: list[float] | None = None, spat_resolutions: list[float] | None = None, log_energy: bool = False, energy_epsilon: float = 1e-05, optimize_pwm: bool = True, optimize_spat: bool = True, symmetrize_spat: bool = True, seed: int | None = 60427, consensus_single_thresh: float = 0.5, consensus_double_thresh: float = 0.75, verbose: bool = False, multi_kmers: bool = False, kmer_length: int | list[int] = 8, max_cands: int = 10, min_gap: int = 0, max_gap: int = 1, min_kmer_cor: float = 0.08, final_metric: str | None = None, sample_for_kmers: bool = False, sample_frac: float | None = None, sample_idxs: ndarray | None = None, sample_ratio: float = 1.0, val_frac: float = 0.1, match_with_db: bool = False, motif_db: dict[str, DataFrame] | DataFrame | None = None, alternative: str = 'less') -> RegressionResult

Perform PWM regression to discover a motif in DNA sequences.

This is the main entry point for motif regression. It wraps the core optimizer (`regress_pwm_core`) with higher-level logic:

K-mer screening: When `motif=None`, screen k-mers to find the best seed (using `screen_kmers`).
Multi-kmer mode: When `multi_kmers=True`, try multiple k-mer seeds and pick the best one based on `final_metric`.
Sampling: `sample_for_kmers=True` uses a subset of the sequences for screening.
Database matching: `match_with_db=True` matches the result against a motif database (using `pssm_match`).
Automatic metric selection: If `final_metric` is `None`, `"ks"` is selected for binary responses and `"r2"` for continuous ones.
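
The automatic metric selection described above can be sketched as follows. The helper `pick_final_metric` is hypothetical, and treating a response as binary when it contains only the values 0 and 1 is an assumption; the library may use a different test.

```python
def pick_final_metric(response, final_metric=None):
    """Return the metric used to pick the best model (illustrative sketch)."""
    if final_metric is not None:
        # An explicit choice always wins.
        return final_metric
    # Assumption: "binary" means every value is 0 or 1.
    is_binary = set(response) <= {0, 1}
    return "ks" if is_binary else "r2"


binary_choice = pick_final_metric([0, 1, 1, 0])
continuous_choice = pick_final_metric([0.2, 1.7, -0.4])
```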

PARAMETER	DESCRIPTION
`sequences`	DNA sequences (equal length, characters A/C/G/T/N). TYPE: `list[str] \| ndarray`
`response`	Response variable(s). TYPE: `ndarray`
`motif`	Initial motif. If `None` and `multi_kmers=False`, a single k-mer screen is used to find the best seed. If `None` and `multi_kmers=True`, multiple candidate k-mers are tried. TYPE: `str \| DataFrame \| None` DEFAULT: `None`
`motif_length`	Length of the seed motif. TYPE: `int` DEFAULT: `15`
`score_metric`	`"r2"` or `"ks"` (metric used during the optimization). TYPE: `str` DEFAULT: `'r2'`
`bidirect`	Use both orientations. TYPE: `bool` DEFAULT: `True`
`spat_bin_size`	Spatial bin size in bp. `None` auto-computes. TYPE: `int \| None` DEFAULT: `None`
`spat_num_bins`	Number of spatial bins. `None` auto-computes. TYPE: `int \| None` DEFAULT: `None`
`spat_model`	Pre-computed spatial model. TYPE: `DataFrame \| None` DEFAULT: `None`
`improve_epsilon`	Convergence threshold. TYPE: `float` DEFAULT: `0.0001`
`min_nuc_prob`	Minimum nucleotide probability per iteration. TYPE: `float` DEFAULT: `0.001`
`unif_prior`	Uniform prior for nucleotide probabilities. TYPE: `float` DEFAULT: `0.05`
`num_folds`	Internal cross-validation folds. TYPE: `int` DEFAULT: `1`
`resolutions`	Step sizes for each phase. `None` uses C++ defaults. TYPE: `list[float] \| None` DEFAULT: `None`
`spat_resolutions`	Spatial step sizes for each phase. TYPE: `list[float] \| None` DEFAULT: `None`
`log_energy`	Apply log transform to energies. TYPE: `bool` DEFAULT: `False`
`energy_epsilon`	Epsilon for log transform. TYPE: `float` DEFAULT: `1e-05`
`optimize_pwm`	Whether to optimize PWM probabilities. TYPE: `bool` DEFAULT: `True`
`optimize_spat`	Whether to optimize spatial factors. TYPE: `bool` DEFAULT: `True`
`symmetrize_spat`	Symmetrize spatial factors. TYPE: `bool` DEFAULT: `True`
`seed`	Random seed. TYPE: `int \| None` DEFAULT: `60427`
`consensus_single_thresh`	Threshold for single-nucleotide consensus calls. TYPE: `float` DEFAULT: `0.5`
`consensus_double_thresh`	Threshold for double-nucleotide consensus calls. TYPE: `float` DEFAULT: `0.75`
`verbose`	Print progress. TYPE: `bool` DEFAULT: `False`
`multi_kmers`	Try multiple k-mer seeds and pick the best. TYPE: `bool` DEFAULT: `False`
`kmer_length`	K-mer length(s) to screen. TYPE: `int \| list[int]` DEFAULT: `8`
`max_cands`	Maximum number of k-mer candidates. TYPE: `int` DEFAULT: `10`
`min_gap`	Minimum gap for k-mer generation. TYPE: `int` DEFAULT: `0`
`max_gap`	Maximum gap for k-mer generation. TYPE: `int` DEFAULT: `1`
`min_kmer_cor`	Minimum correlation to include a k-mer. TYPE: `float` DEFAULT: `0.08`
`final_metric`	Metric for picking the best model. `None` auto-selects. TYPE: `str \| None` DEFAULT: `None`
`sample_for_kmers`	Sample the dataset for k-mer screening. TYPE: `bool` DEFAULT: `False`
`sample_frac`	Fraction to sample. TYPE: `float \| None` DEFAULT: `None`
`sample_idxs`	Explicit sample indices. TYPE: `ndarray \| None` DEFAULT: `None`
`sample_ratio`	Ratio of classes in sampling. TYPE: `float` DEFAULT: `1.0`
`val_frac`	Fraction for internal validation when using multi-kmer mode. TYPE: `float` DEFAULT: `0.1`
`match_with_db`	Match result against motif database. TYPE: `bool` DEFAULT: `False`
`motif_db`	Motif database for matching. TYPE: `dict[str, DataFrame] \| DataFrame \| None` DEFAULT: `None`
`alternative`	Alternative for KS test. TYPE: `str` DEFAULT: `'less'`

RETURNS	DESCRIPTION
`RegressionResult`	Fitted regression result.
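
A minimal calling sketch, using only keyword arguments that appear in the signature above. The wrapper name `discover_motif` is hypothetical, and the import path assumes the function is exposed as `pyprego.regression.regress_pwm`, following the module name on this page. The import is done lazily so the snippet can be loaded on a machine without the package installed.

```python
def discover_motif(sequences, response):
    """Run regress_pwm with multi-kmer seeding (illustrative sketch)."""
    # Lazy import: assumes the module path shown on this page.
    from pyprego.regression import regress_pwm

    return regress_pwm(
        sequences,
        response,
        motif=None,         # no seed given, so k-mers are screened
        motif_length=15,
        multi_kmers=True,   # try several candidate seeds
        kmer_length=8,
        max_cands=10,
        final_metric=None,  # auto: "ks" for binary, "r2" for continuous
        seed=60427,
    )
```

Here `sequences` would be equal-length DNA strings and `response` an array with one row per sequence, as described in the parameter table.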

regress_pwm_core

regress_pwm_core(sequences: list[str] | ndarray, response: ndarray, *, motif: str | DataFrame | None = None, motif_length: int = 15, score_metric: str = 'r2', bidirect: bool = True, spat_bin_size: int | None = None, spat_num_bins: int | None = None, spat_model: DataFrame | None = None, improve_epsilon: float = 0.0001, min_nuc_prob: float = 0.001, unif_prior: float = 0.05, num_folds: int = 1, resolutions: list[float] | None = None, spat_resolutions: list[float] | None = None, log_energy: bool = False, energy_epsilon: float = 1e-05, optimize_pwm: bool = True, optimize_spat: bool = True, symmetrize_spat: bool = True, seed: int | None = 60427, consensus_single_thresh: float = 0.5, consensus_double_thresh: float = 0.75, verbose: bool = False) -> RegressionResult

Core PWM regression optimizer (low-level).

This is a faithful port of the C++ `PWMLRegression` class that uses coordinate descent to iteratively optimise PSSM probabilities and spatial factors. It does not perform k-mer screening, multi-kmer tries, or database matching. For the high-level wrapper, see `regress_pwm`.

PARAMETER	DESCRIPTION
`sequences`	DNA sequences (equal length, characters A/C/G/T/N). TYPE: `list[str] \| ndarray`
`response`	Response variable(s). Shape `(n_sequences,)` for a single response or `(n_sequences, n_responses)` for multiple. TYPE: `ndarray`
`motif`	Initial motif. A k-mer string (`*` = wildcard), a PSSM DataFrame, or `None` (defaults to all wildcards). TYPE: `str \| DataFrame \| None` DEFAULT: `None`
`motif_length`	Length of the seed motif (short kmers are padded with wildcards). TYPE: `int` DEFAULT: `15`
`score_metric`	`"r2"` or `"ks"`. TYPE: `str` DEFAULT: `'r2'`
`bidirect`	Use both orientations of the motif. TYPE: `bool` DEFAULT: `True`
`spat_bin_size`	Spatial bin size in bp. `None` auto-computes. TYPE: `int \| None` DEFAULT: `None`
`spat_num_bins`	Number of spatial bins. `None` auto-computes. TYPE: `int \| None` DEFAULT: `None`
`spat_model`	Pre-computed spatial model (bin, spat_factor). TYPE: `DataFrame \| None` DEFAULT: `None`
`improve_epsilon`	Convergence threshold. TYPE: `float` DEFAULT: `0.0001`
`min_nuc_prob`	Minimum nucleotide probability per iteration. TYPE: `float` DEFAULT: `0.001`
`unif_prior`	Uniform prior for nucleotide probabilities. TYPE: `float` DEFAULT: `0.05`
`num_folds`	Number of cross-validation folds (1 = no CV). TYPE: `int` DEFAULT: `1`
`resolutions`	Step sizes for each phase. `None` uses C++ defaults. TYPE: `list[float] \| None` DEFAULT: `None`
`spat_resolutions`	Spatial step sizes for each phase. TYPE: `list[float] \| None` DEFAULT: `None`
`log_energy`	Apply log transform to energies. TYPE: `bool` DEFAULT: `False`
`energy_epsilon`	Small constant for log(energy + epsilon). TYPE: `float` DEFAULT: `1e-05`
`optimize_pwm`	Whether to optimize PWM probabilities. TYPE: `bool` DEFAULT: `True`
`optimize_spat`	Whether to optimize spatial factors. TYPE: `bool` DEFAULT: `True`
`symmetrize_spat`	Symmetrize spatial factors for bidirectional models. TYPE: `bool` DEFAULT: `True`
`seed`	Random seed for reproducibility. TYPE: `int \| None` DEFAULT: `60427`
`consensus_single_thresh`	Threshold for single-nucleotide consensus calls. TYPE: `float` DEFAULT: `0.5`
`consensus_double_thresh`	Threshold for double-nucleotide consensus calls. TYPE: `float` DEFAULT: `0.75`
`verbose`	Print progress messages. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`RegressionResult`	Fitted regression result with PSSM, spatial model, predictions, etc.
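
Two of the parameters above are easy to picture with a small sketch: padding a short seed k-mer up to `motif_length`, and the `log(energy + epsilon)` transform. Both helpers are hypothetical. Using `*` as the wildcard character and splitting the padding across both ends are assumptions; the source only says that short k-mers are padded with wildcards.

```python
import math


def pad_seed(kmer, motif_length=15, wildcard="*"):
    """Pad a seed k-mer with wildcards up to motif_length (sketch)."""
    if len(kmer) > motif_length:
        raise ValueError("seed k-mer is longer than motif_length")
    extra = motif_length - len(kmer)
    left = extra // 2  # assumption: padding is split across both ends
    return wildcard * left + kmer + wildcard * (extra - left)


def log_energy(energy, energy_epsilon=1e-05):
    """Log transform with a small constant so that zero energy stays finite."""
    return math.log(energy + energy_epsilon)


seed = pad_seed("TGATAAGG")
floor = log_energy(0.0)
```

With the default `energy_epsilon`, a zero energy maps to `log(1e-05)` rather than negative infinity.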

regress_multiple_motifs

regress_multiple_motifs(sequences: list[str] | ndarray, response: ndarray, motif_num: int = 2, smooth_k: int = 100, alternative: str = 'less', verbose: bool = False, **kwargs) -> MultiRegressionResult

Iteratively regress multiple motifs.

Finds the first motif via `regress_pwm`, then for each subsequent motif computes residuals (response - smoothed predictions) and regresses on those. A combined linear model is fit at each step.

PARAMETER	DESCRIPTION
`sequences`	DNA sequences. TYPE: `list[str] \| ndarray`
`response`	Response variable(s). TYPE: `ndarray`
`motif_num`	Number of motifs to discover (must be >= 2). TYPE: `int` DEFAULT: `2`
`smooth_k`	Window size for smoothing predictions when computing residuals. TYPE: `int` DEFAULT: `100`
`alternative`	Alternative hypothesis for the KS test. TYPE: `str` DEFAULT: `'less'`
`verbose`	Print progress. TYPE: `bool` DEFAULT: `False`
`**kwargs`	Additional arguments passed to `regress_pwm`. DEFAULT: `{}`

RETURNS	DESCRIPTION
`MultiRegressionResult`	Combined multi-motif result.
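
The residual step can be pictured with the sketch below. It is only one plausible reading of "response - smoothed predictions": sequences are ranked by the current prediction, the response is averaged in a window of roughly `smooth_k` neighbours along that ranking, and the average is subtracted from the response. The helper name and the exact smoothing scheme are assumptions, not the library's documented algorithm.

```python
def smoothed_residuals(response, pred, smooth_k=100):
    """Residuals of response around a rank-smoothed trend (sketch)."""
    n = len(response)
    # Rank sequences by the current model's prediction.
    order = sorted(range(n), key=lambda i: pred[i])
    half = smooth_k // 2
    residuals = [0.0] * n
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(n, rank + half + 1)
        # Mean response among neighbours with similar predictions.
        window = [response[j] for j in order[lo:hi]]
        residuals[i] = response[i] - sum(window) / len(window)
    return residuals


resid = smoothed_residuals([1.0, 2.0, 3.0], [0.1, 0.2, 0.3], smooth_k=3)
```

The next motif would then be regressed on `resid` instead of the original response.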

regress_pwm_clusters

regress_pwm_clusters(sequences: list[str] | ndarray, clusters: ndarray | list, alternative: str = 'less', verbose: bool = False, **kwargs) -> ClusterRegressionResult

Run PWM regression for each sequence cluster.

Creates a binary response (in-cluster vs. not) for each cluster and runs `regress_pwm` on each.

PARAMETER	DESCRIPTION
`sequences`	DNA sequences. TYPE: `list[str] \| ndarray`
`clusters`	Cluster assignment for each sequence. TYPE: `ndarray \| list`
`alternative`	Alternative hypothesis for KS test. TYPE: `str` DEFAULT: `'less'`
`verbose`	Print progress. TYPE: `bool` DEFAULT: `False`
`**kwargs`	Additional arguments passed to `regress_pwm`. DEFAULT: `{}`

RETURNS	DESCRIPTION
`ClusterRegressionResult`	Per-cluster models, predictions, and statistics.
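
The in-cluster vs. not encoding can be sketched as a binary indicator matrix with one column per cluster. The helper `cluster_indicator` is hypothetical, and sorting the cluster names to fix the column order is an assumption of this example.

```python
def cluster_indicator(clusters):
    """Build a (n_sequences, n_clusters) 0/1 matrix from labels (sketch)."""
    # Assumption: columns follow the sorted cluster names.
    names = sorted(set(str(c) for c in clusters))
    mat = [[1 if str(c) == name else 0 for name in names] for c in clusters]
    return names, mat


names, cluster_mat = cluster_indicator(["a", "b", "a", "c"])
# Column j of cluster_mat is the binary response for cluster names[j].
response_for_a = [row[0] for row in cluster_mat]
```

Each column would then serve as the response for a separate regression run.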

regress_pwm_cv

regress_pwm_cv(sequences: list[str] | ndarray, response: ndarray, nfolds: int | None = None, metric: str | None = None, folds: ndarray | None = None, add_full_model: bool = True, seed: int | None = 60427, alternative: str = 'less', verbose: bool = False, **kwargs) -> CVRegressionResult

Cross-validate a PWM regression model.

PARAMETER	DESCRIPTION
`sequences`	DNA sequences. TYPE: `list[str] \| ndarray`
`response`	Response variable(s). TYPE: `ndarray`
`nfolds`	Number of folds. Required if `folds` is not provided. TYPE: `int \| None` DEFAULT: `None`
`metric`	Evaluation metric. Auto-selects `"ks"` for binary, `"r2"` for continuous. TYPE: `str \| None` DEFAULT: `None`
`folds`	Explicit fold assignments. Overrides `nfolds`. TYPE: `ndarray \| None` DEFAULT: `None`
`add_full_model`	Also train a model on all data. TYPE: `bool` DEFAULT: `True`
`seed`	Random seed. TYPE: `int \| None` DEFAULT: `60427`
`alternative`	Alternative hypothesis for KS test. TYPE: `str` DEFAULT: `'less'`
`verbose`	Print progress. TYPE: `bool` DEFAULT: `False`
`**kwargs`	Additional arguments passed to `regress_pwm`. DEFAULT: `{}`

RETURNS	DESCRIPTION
`CVRegressionResult`	Cross-validation results.
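
The relationship between `folds` and `cv_pred` can be sketched generically: each sequence is predicted by a model trained on all the other folds. The helpers below are hypothetical, the fold assignment scheme is an assumption, and a training-set mean stands in for the actual motif regression.

```python
import random


def assign_folds(n, nfolds, seed=60427):
    """Balanced random fold labels 0..nfolds-1 (assumed scheme)."""
    folds = [i % nfolds for i in range(n)]
    random.Random(seed).shuffle(folds)
    return folds


def cross_val_predict(n, folds, fit, predict):
    """Predict every item with a model that never saw its fold."""
    cv_pred = [None] * n
    for k in sorted(set(folds)):
        train = [i for i, f in enumerate(folds) if f != k]
        test = [i for i, f in enumerate(folds) if f == k]
        model = fit(train)
        for i in test:
            cv_pred[i] = predict(model, i)
    return cv_pred


response = [1.0, 2.0, 3.0, 4.0]
folds = [0, 1, 0, 1]  # explicit folds, as with the `folds` argument
cv_pred = cross_val_predict(
    len(response),
    folds,
    fit=lambda train: sum(response[i] for i in train) / len(train),
    predict=lambda model, i: model,  # stand-in model: training mean
)
```

An overall score would then be computed by comparing `cv_pred` with `response`, and per-fold scores by restricting that comparison to each fold.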