Regression
Motif discovery via iterative PWM regression. This module contains the core regress_pwm function and its variants for multiple motifs, cluster-specific regression, and cross-validation.
pyprego.regression
PWM regression optimiser.
Mirrors regression.R / PWMLRegression.cpp from the R prego package. This
module contains the core regress_pwm function that iteratively optimises
a PSSM and spatial model to best explain a response variable given a set of
DNA sequences.
The implementation is intentionally NumPy-based (no GPU / PyTorch) so that it closely mirrors the R behaviour and can run on any machine.
MultiRegressionResult (dataclass)
Container for the output of `regress_multiple_motifs`.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `models` | Individual regression results for each motif. |
| `multi_stats` | Statistics DataFrame with columns: model, score, comb_score, diff, consensus, seed_motif. |
| `pred` | Combined prediction using linear model. |
| `coef` | Linear model coefficients (one per motif + intercept). |
predict
Predict combined scores for new sequences.
| PARAMETER | DESCRIPTION |
|---|---|
| `sequences` | DNA sequences. |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray | Combined predicted scores. |
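As an illustration of how a combined prediction can be produced from per-motif scores and the `coef` attribute, here is a minimal NumPy sketch. The helper name `combine_motif_scores` and the coefficient layout (intercept first, then one weight per motif) are assumptions made for the example, not pyprego's documented behaviour.

```python
import numpy as np

def combine_motif_scores(per_motif_scores, coef):
    """Apply a linear model to per-motif scores.

    Assumption (hypothetical layout): `coef` holds the intercept first,
    followed by one weight per motif.
    """
    scores = np.asarray(per_motif_scores, dtype=float)  # (n_sequences, n_motifs)
    coef = np.asarray(coef, dtype=float)                # (n_motifs + 1,)
    if coef.shape[0] != scores.shape[1] + 1:
        raise ValueError("expected one coefficient per motif plus an intercept")
    return coef[0] + scores @ coef[1:]

# Three sequences scored against two motifs.
scores = np.array([[0.2, 1.0], [0.5, 0.0], [1.0, 2.0]])
coef = np.array([0.1, 2.0, -1.0])
combined = combine_motif_scores(scores, coef)
```

The per-motif score matrix in this sketch plays the role of what `predict_multi` returns, and `combined` the role of what `predict` returns.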
predict_multi
Predict per-motif scores for new sequences.
| PARAMETER | DESCRIPTION |
|---|---|
| `sequences` | DNA sequences. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with columns |
ClusterRegressionResult (dataclass)
Container for the output of `regress_pwm_clusters`.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `models` | Per-cluster regression models. |
| `cluster_mat` | Binary indicator matrix (n_sequences, n_clusters). |
| `pred_mat` | Prediction matrix (n_sequences, n_clusters). |
| `stats` | Per-cluster statistics. |
| `cluster_names` | Cluster names. |
CVRegressionResult (dataclass)
Container for the output of `regress_pwm_cv`.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `cv_models` | Per-fold regression models. |
| `cv_pred` | Cross-validated predictions for each sequence. |
| `score` | Overall score on cross-validated predictions. |
| `cv_scores` | Per-fold test scores. |
| `folds` | Fold assignment per sequence. |
| `full_model` | Full model (trained on all data), if requested. |
regress_pwm
regress_pwm(sequences: list[str] | ndarray, response: ndarray, *, motif: str | DataFrame | None = None, motif_length: int = 15, score_metric: str = 'r2', bidirect: bool = True, spat_bin_size: int | None = None, spat_num_bins: int | None = None, spat_model: DataFrame | None = None, improve_epsilon: float = 0.0001, min_nuc_prob: float = 0.001, unif_prior: float = 0.05, num_folds: int = 1, resolutions: list[float] | None = None, spat_resolutions: list[float] | None = None, log_energy: bool = False, energy_epsilon: float = 1e-05, optimize_pwm: bool = True, optimize_spat: bool = True, symmetrize_spat: bool = True, seed: int | None = 60427, consensus_single_thresh: float = 0.5, consensus_double_thresh: float = 0.75, verbose: bool = False, multi_kmers: bool = False, kmer_length: int | list[int] = 8, max_cands: int = 10, min_gap: int = 0, max_gap: int = 1, min_kmer_cor: float = 0.08, final_metric: str | None = None, sample_for_kmers: bool = False, sample_frac: float | None = None, sample_idxs: ndarray | None = None, sample_ratio: float = 1.0, val_frac: float = 0.1, match_with_db: bool = False, motif_db: dict[str, DataFrame] | DataFrame | None = None, alternative: str = 'less') -> RegressionResult
Perform PWM regression to discover a motif in DNA sequences.
This is the main entry point for motif regression. It wraps the core
optimizer (`regress_pwm_core`) with higher-level logic:
- K-mer screening: when `motif=None`, screen k-mers to find the best seed (using `screen_kmers`).
- Multi-kmer mode: when `multi_kmers=True`, try multiple k-mer seeds and pick the best one based on `final_metric`.
- Sampling: `sample_for_kmers=True` uses a subset for screening.
- Database matching: `match_with_db=True` matches the result against a motif database (using `pssm_match`).
- Automatic metric selection: if `final_metric` is `None`, it auto-picks `"ks"` for binary responses and `"r2"` for continuous.
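The automatic metric selection in the last item can be sketched as follows. The helper `pick_final_metric` is hypothetical, and the rule for what counts as a binary response (only the values 0 and 1, ignoring NaN) is an assumption; the library may apply a different test.

```python
import numpy as np

def pick_final_metric(response):
    """Return "ks" for a binary response and "r2" otherwise (hypothetical helper)."""
    values = np.unique(np.asarray(response, dtype=float))
    values = values[~np.isnan(values)]  # ignore missing values
    # Assumption: "binary" means every observed value is 0 or 1.
    is_binary = values.size <= 2 and bool(np.all(np.isin(values, (0.0, 1.0))))
    return "ks" if is_binary else "r2"

metric_for_labels = pick_final_metric([0, 1, 1, 0])
metric_for_signal = pick_final_metric([0.1, 2.3, 1.0])
```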
| PARAMETER | DESCRIPTION |
|---|---|
| `sequences` | DNA sequences (equal length, characters A/C/G/T/N). |
| `response` | Response variable(s). |
| `motif` | Initial motif. If |
| `motif_length` | Length of the seed motif. |
| `score_metric` | |
| `bidirect` | Use both orientations. |
| `spat_bin_size` | Spatial bin parameters. |
| `spat_num_bins` | Spatial bin parameters. |
| `spat_model` | Pre-computed spatial model. |
| `improve_epsilon` | Optimizer parameters. |
| `min_nuc_prob` | Optimizer parameters. |
| `unif_prior` | Optimizer parameters. |
| `num_folds` | Internal cross-validation folds. |
| `resolutions` | Phase step sizes. |
| `spat_resolutions` | Phase step sizes. |
| `log_energy` | Apply log transform to energies. |
| `energy_epsilon` | Epsilon for log transform. |
| `optimize_pwm` | What to optimize. |
| `optimize_spat` | What to optimize. |
| `symmetrize_spat` | Symmetrize spatial factors. |
| `seed` | Random seed. |
| `consensus_single_thresh` | Consensus thresholds. |
| `consensus_double_thresh` | Consensus thresholds. |
| `verbose` | Print progress. |
| `multi_kmers` | Try multiple k-mer seeds and pick the best. |
| `kmer_length` | K-mer length(s) to screen. |
| `max_cands` | Maximum number of k-mer candidates. |
| `min_gap` | Gap parameters for k-mer generation. |
| `max_gap` | Gap parameters for k-mer generation. |
| `min_kmer_cor` | Minimum correlation to include a k-mer. |
| `final_metric` | Metric for picking the best model. |
| `sample_for_kmers` | Sample the dataset for k-mer screening. |
| `sample_frac` | Fraction to sample. |
| `sample_idxs` | Explicit sample indices. |
| `sample_ratio` | Ratio of classes in sampling. |
| `val_frac` | Fraction for internal validation when using multi-kmer mode. |
| `match_with_db` | Match result against motif database. |
| `motif_db` | Motif database for matching. |
| `alternative` | Alternative for KS test. |

| RETURNS | DESCRIPTION |
|---|---|
| RegressionResult | Fitted regression result. |
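To make the two metric names concrete, the sketch below computes an `"r2"`-style score as the squared Pearson correlation and a `"ks"`-style score as the two-sided Kolmogorov-Smirnov D statistic between the predictions of the two response classes. Both definitions are assumptions for illustration; the exact formulas used by the library, and the one-sided variants selected by `alternative`, are not shown here.

```python
import numpy as np

def r2_score(pred, response):
    """Squared Pearson correlation between predictions and response."""
    r = np.corrcoef(pred, response)[0, 1]
    return r ** 2

def ks_statistic(pred, response):
    """Two-sided KS D: the largest gap between the empirical CDFs of the
    predictions for response == 1 and for response == 0."""
    pred = np.asarray(pred, dtype=float)
    response = np.asarray(response)
    pos = np.sort(pred[response == 1])
    neg = np.sort(pred[response == 0])
    grid = np.concatenate([pos, neg])  # evaluate both CDFs at every observed value
    cdf_pos = np.searchsorted(pos, grid, side="right") / pos.size
    cdf_neg = np.searchsorted(neg, grid, side="right") / neg.size
    return float(np.max(np.abs(cdf_pos - cdf_neg)))

r2_example = r2_score([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
ks_example = ks_statistic([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
```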
regress_pwm_core
regress_pwm_core(sequences: list[str] | ndarray, response: ndarray, *, motif: str | DataFrame | None = None, motif_length: int = 15, score_metric: str = 'r2', bidirect: bool = True, spat_bin_size: int | None = None, spat_num_bins: int | None = None, spat_model: DataFrame | None = None, improve_epsilon: float = 0.0001, min_nuc_prob: float = 0.001, unif_prior: float = 0.05, num_folds: int = 1, resolutions: list[float] | None = None, spat_resolutions: list[float] | None = None, log_energy: bool = False, energy_epsilon: float = 1e-05, optimize_pwm: bool = True, optimize_spat: bool = True, symmetrize_spat: bool = True, seed: int | None = 60427, consensus_single_thresh: float = 0.5, consensus_double_thresh: float = 0.75, verbose: bool = False) -> RegressionResult
Core PWM regression optimizer (low-level).
This is the faithful port of the C++ PWMLRegression class, using
coordinate descent to iteratively optimise PSSM probabilities and spatial
factors. It does not perform k-mer screening, multi-kmer tries, or
database matching. For the high-level wrapper, see `regress_pwm`.
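The general shape of a coordinate-descent search over PSSM probabilities can be sketched as below. This is a generic greedy variant written for illustration, not the update rule of `PWMLRegression`: the toy `score_fn`, the move set (add or subtract one step, then renormalise the row), and the helper name `coordinate_descent` are all assumptions. Only the parameter names `resolutions`, `improve_epsilon`, and `min_nuc_prob` are taken from the signature above.

```python
import numpy as np

def coordinate_descent(pssm, score_fn, resolutions=(0.1, 0.02),
                       improve_epsilon=1e-4, min_nuc_prob=1e-3):
    """Greedy coordinate descent over the entries of a probability matrix."""
    pssm = pssm.copy()
    best = score_fn(pssm)
    for step in resolutions:              # coarse-to-fine phases
        improved = True
        while improved:                   # repeat sweeps until no move helps
            improved = False
            for pos in range(pssm.shape[0]):
                for nuc in range(pssm.shape[1]):
                    for delta in (step, -step):
                        cand = pssm.copy()
                        cand[pos, nuc] = max(cand[pos, nuc] + delta, min_nuc_prob)
                        cand[pos] /= cand[pos].sum()  # keep the row a distribution
                        score = score_fn(cand)
                        if score > best + improve_epsilon:
                            pssm, best, improved = cand, score, True
    return pssm, best

# Toy objective (assumption): closeness to a known target matrix.
target = np.array([[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]])
score_fn = lambda p: -float(np.sum((p - target) ** 2))
start = np.full((2, 4), 0.25)
initial_score = score_fn(start)
fitted, final_score = coordinate_descent(start, score_fn)
```

Because a move is accepted only when it beats the current score by more than `improve_epsilon`, the loop terminates whenever the score is bounded above.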
| PARAMETER | DESCRIPTION |
|---|---|
| `sequences` | DNA sequences (equal length, characters A/C/G/T/N). |
| `response` | Response variable(s). Shape |
| `motif` | Initial motif. A kmer string ( |
| `motif_length` | Length of the seed motif (short kmers are padded with wildcards). |
| `score_metric` | |
| `bidirect` | Use both orientations of the motif. |
| `spat_bin_size` | Spatial bin size in bp. |
| `spat_num_bins` | Number of spatial bins. |
| `spat_model` | Pre-computed spatial model (bin, spat_factor). |
| `improve_epsilon` | Convergence threshold. |
| `min_nuc_prob` | Minimum nucleotide probability per iteration. |
| `unif_prior` | Uniform prior for nucleotide probabilities. |
| `num_folds` | Number of cross-validation folds (1 = no CV). |
| `resolutions` | Step sizes for each phase. |
| `spat_resolutions` | Spatial step sizes for each phase. |
| `log_energy` | Apply log transform to energies. |
| `energy_epsilon` | Small constant for log(energy + epsilon). |
| `optimize_pwm` | Whether to optimize PWM probabilities. |
| `optimize_spat` | Whether to optimize spatial factors. |
| `symmetrize_spat` | Symmetrize spatial factors for bidirectional models. |
| `seed` | Random seed for reproducibility. |
| `consensus_single_thresh` | Threshold for single-nucleotide consensus calls. |
| `consensus_double_thresh` | Threshold for double-nucleotide consensus calls. |
| `verbose` | Print progress messages. |

| RETURNS | DESCRIPTION |
|---|---|
| RegressionResult | Fitted regression result with PSSM, spatial model, predictions, etc. |
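The interplay of `bidirect`, `log_energy`, and `energy_epsilon` can be illustrated with a simplified energy calculation. This sketch sums the PSSM probability of every window on one or both strands and optionally applies log(energy + epsilon). It omits spatial factors entirely, treats `N` as a uniform 0.25, and uses hypothetical helper names, so it should be read as an approximation of the idea and not as the library's computation.

```python
import numpy as np

NUC_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def strand_energy(seq, pssm):
    """Sum, over all start positions, of the product of per-position probabilities."""
    width = pssm.shape[0]
    total = 0.0
    for start in range(len(seq) - width + 1):
        prob = 1.0
        for offset, nuc in enumerate(seq[start:start + width]):
            # Assumption: an N (or any non-ACGT character) contributes 0.25.
            prob *= pssm[offset, NUC_INDEX[nuc]] if nuc in NUC_INDEX else 0.25
        total += prob
    return total

def pwm_energy(seq, pssm, bidirect=True, log_energy=False, energy_epsilon=1e-5):
    """Energy of one sequence; the reverse complement is added when bidirect."""
    energy = strand_energy(seq, pssm)
    if bidirect:
        revcomp = seq.translate(COMPLEMENT)[::-1]
        energy += strand_energy(revcomp, pssm)
    return float(np.log(energy + energy_epsilon)) if log_energy else energy

# A deterministic two-position PSSM that matches only "AC" (rows: positions; columns: A, C, G, T).
pssm = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
forward_only = pwm_energy("ACGT", pssm, bidirect=False)
both_strands = pwm_energy("ACGT", pssm)  # "ACGT" is its own reverse complement
```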
regress_multiple_motifs
regress_multiple_motifs(sequences: list[str] | ndarray, response: ndarray, motif_num: int = 2, smooth_k: int = 100, alternative: str = 'less', verbose: bool = False, **kwargs) -> MultiRegressionResult
Iteratively regress multiple motifs.
Finds the first motif via `regress_pwm`, then for each subsequent
motif computes residuals (response - smoothed predictions) and regresses
on those. A combined linear model is fit at each step.
| PARAMETER | DESCRIPTION |
|---|---|
| `sequences` | DNA sequences. |
| `response` | Response variable(s). |
| `motif_num` | Number of motifs to discover (must be >= 2). |
| `smooth_k` | Window size for smoothing predictions when computing residuals. |
| `alternative` | Alternative hypothesis for the KS test. |
| `verbose` | Print progress. |
| `**kwargs` | Additional arguments passed to :func: |

| RETURNS | DESCRIPTION |
|---|---|
| MultiRegressionResult | Combined multi-motif result. |
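The residual step and the combined linear fit described for `regress_multiple_motifs` can be sketched in NumPy. The smoothing shown here is one possible reading of "smoothed predictions": sequences are ordered by the current prediction and a running mean of the response over `smooth_k` neighbours is subtracted. That reading, the edge padding, and the helper names are assumptions, not the library's confirmed procedure.

```python
import numpy as np

def smoothed_residuals(response, pred, smooth_k=100):
    """Residuals of `response` around a running mean taken in prediction order."""
    response = np.asarray(response, dtype=float)
    order = np.argsort(pred, kind="stable")
    k = max(1, min(smooth_k, response.size))
    kernel = np.ones(k) / k
    # Assumption: pad with edge values so the running mean keeps its length.
    padded = np.pad(response[order], (k // 2, k - 1 - k // 2), mode="edge")
    smooth_sorted = np.convolve(padded, kernel, mode="valid")
    smooth = np.empty_like(response)
    smooth[order] = smooth_sorted  # map back to the original sequence order
    return response - smooth

def fit_combined(per_motif_pred, response):
    """Least-squares fit of response ~ intercept + per-motif predictions."""
    X = np.column_stack([np.ones(len(response)), per_motif_pred])
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    return coef, X @ coef

x = np.arange(5.0)
coef, combined_pred = fit_combined(x, 1.0 + 2.0 * x)
flat_residuals = smoothed_residuals(np.ones(10), np.arange(10), smooth_k=3)
```

In an iterative scheme of this kind, the residuals from one motif would serve as the response for the next regression, and `fit_combined` would be refit on all per-motif predictions collected so far.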
regress_pwm_clusters
regress_pwm_clusters(sequences: list[str] | ndarray, clusters: ndarray | list, alternative: str = 'less', verbose: bool = False, **kwargs) -> ClusterRegressionResult
Run PWM regression for each sequence cluster.
Creates a binary response (in-cluster vs. not) for each cluster and runs
`regress_pwm` on each.
| PARAMETER | DESCRIPTION |
|---|---|
| `sequences` | DNA sequences. |
| `clusters` | Cluster assignment for each sequence. |
| `alternative` | Alternative hypothesis for KS test. |
| `verbose` | Print progress. |
| `**kwargs` | Additional arguments passed to :func: |

| RETURNS | DESCRIPTION |
|---|---|
| ClusterRegressionResult | Per-cluster models, predictions, and statistics. |
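The binary in-cluster response described above corresponds to a one-hot indicator matrix with one column per cluster. A minimal sketch, assuming clusters are ordered by sorted unique label (the helper name and the ordering are illustrative choices):

```python
import numpy as np

def cluster_indicator_matrix(clusters):
    """Return an (n_sequences, n_clusters) 0/1 matrix and the cluster names."""
    clusters = np.asarray(clusters)
    names = np.unique(clusters)  # sorted unique labels
    mat = (clusters[:, None] == names[None, :]).astype(int)
    return mat, names

cluster_mat, cluster_names = cluster_indicator_matrix(["a", "b", "a", "c"])
# Each column can then serve as the binary response for one regression run.
```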
regress_pwm_cv
regress_pwm_cv(sequences: list[str] | ndarray, response: ndarray, nfolds: int | None = None, metric: str | None = None, folds: ndarray | None = None, add_full_model: bool = True, seed: int | None = 60427, alternative: str = 'less', verbose: bool = False, **kwargs) -> CVRegressionResult
Cross-validate a PWM regression model.
| PARAMETER | DESCRIPTION |
|---|---|
| `sequences` | DNA sequences. |
| `response` | Response variable(s). |
| `nfolds` | Number of folds. Required if |
| `metric` | Evaluation metric. Auto-selects |
| `folds` | Explicit fold assignments. Overrides |
| `add_full_model` | Also train a model on all data. |
| `seed` | Random seed. |
| `alternative` | Alternative hypothesis for KS test. |
| `verbose` | Print progress. |
| `**kwargs` | Additional arguments passed to :func: |

| RETURNS | DESCRIPTION |
|---|---|
| CVRegressionResult | Cross-validation results. |
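The relationship between `folds`, `cv_pred`, `cv_scores`, and `score` can be illustrated with a generic k-fold loop. In this sketch a straight-line fit via `np.polyfit` stands in for the motif regression, the score is a squared Pearson correlation, and fold assignment uses `np.random.default_rng`; all three are assumptions for the example and may differ from what the library does.

```python
import numpy as np

def assign_folds(n, nfolds, seed=60427):
    """Balanced random fold labels in 0..nfolds-1 (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    folds = np.arange(n) % nfolds
    rng.shuffle(folds)
    return folds

def cross_validate(x, response, folds, fit, predict):
    """Fill held-out predictions fold by fold, then score them."""
    cv_pred = np.full(len(response), np.nan)
    cv_scores = {}
    for fold in np.unique(folds):
        test = folds == fold
        model = fit(x[~test], response[~test])   # train on the other folds
        cv_pred[test] = predict(model, x[test])  # predict the held-out fold
        cv_scores[int(fold)] = np.corrcoef(cv_pred[test], response[test])[0, 1] ** 2
    score = np.corrcoef(cv_pred, response)[0, 1] ** 2  # overall held-out score
    return cv_pred, score, cv_scores

x = np.linspace(0.0, 1.0, 20)
response = 3.0 * x + 1.0
folds = assign_folds(len(x), nfolds=4)
cv_pred, score, cv_scores = cross_validate(
    x, response, folds,
    fit=lambda xs, ys: np.polyfit(xs, ys, 1),     # stand-in model
    predict=lambda model, xs: np.polyval(model, xs),
)
```

Every sequence receives exactly one held-out prediction, which is why the overall score can be computed on `cv_pred` as a whole.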