PWM Scoring

Score sequences against a known PSSM with optional spatial weighting.

pyprego.compute

PWM scoring / energy computation.

Mirrors the core compute_pwm / compute_local_pwm functions from the R prego package. Given a PSSM and an optional spatial model, these functions compute the predicted PWM energy for each input sequence.

All computation uses NumPy arrays; the interfaces accept and return arrays so that a torch backend could be swapped in later with minimal changes.

compute_pwm

compute_pwm(sequences: list[str] | ndarray, pssm: DataFrame, spat: DataFrame | None = None, *, spat_min: int = 1, spat_max: int | None = None, bidirect: bool = True, prior: float = 0.01, func: str = 'logSumExp') -> np.ndarray

Compute PWM energy scores for sequences given a PSSM and an optional spatial model.

Mirrors the R compute_pwm() function. For each sequence, slides the PSSM across all valid positions, computes the log-likelihood at each window, applies spatial weighting, and aggregates via logSumExp or max.
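The sliding-window procedure described above can be sketched in plain Python. This is a minimal illustration, not the pyprego implementation: the helper names (log_sum_exp, window_log_likelihoods, aggregate) are hypothetical, only the forward strand is scored, spatial weighting is omitted, and the way the prior is applied (added to each probability, then renormalized) is an assumption.

```python
import math

BASES = "ACGT"


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def window_log_likelihoods(seq, pssm_rows, prior=0.01):
    """Forward-strand log-likelihood of the motif at every full window.

    pssm_rows holds one (A, C, G, T) probability tuple per motif position.
    Assumption: the prior is added to each probability and the row is
    renormalized; the library may apply it differently.
    Only the bases A, C, G, T are handled in this sketch.
    """
    log_rows = []
    for row in pssm_rows:
        total = sum(row) + 4 * prior
        log_rows.append([math.log((p + prior) / total) for p in row])

    k = len(log_rows)
    scores = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        scores.append(
            sum(log_rows[i][BASES.index(base)] for i, base in enumerate(window))
        )
    return scores


def aggregate(scores, func="logSumExp"):
    """Combine per-window scores into one score per sequence."""
    return log_sum_exp(scores) if func == "logSumExp" else max(scores)


# Toy two-position motif that prefers "AT".
pssm_rows = [(0.7, 0.1, 0.1, 0.1), (0.1, 0.1, 0.1, 0.7)]
scores = window_log_likelihoods("GATC", pssm_rows)  # windows: GA, AT, TC
total = aggregate(scores, "logSumExp")
best = aggregate(scores, "max")
```

With func="max" the score reflects only the single best window, whereas logSumExp also accumulates contributions from weaker windows, so it is never smaller than the maximum.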

PARAMETERS

sequences (list[str] | ndarray): DNA sequences.

pssm (DataFrame): PSSM DataFrame with columns pos, A, C, G, T.

spat (DataFrame | None, default None): Spatial model DataFrame with columns bin, spat_factor. If None, uniform spatial weighting is used.

spat_min (int, default 1): Minimum position in the sequence to consider (1-based, as in R).

spat_max (int | None, default None): Maximum position in the sequence to consider. None means the full sequence length is used.

bidirect (bool, default True): Score both the forward strand and the reverse complement, and combine the two.

prior (float, default 0.01): Uniform prior added to PSSM probabilities.

func (str, default 'logSumExp'): Function used to combine per-position scores: "logSumExp" or "max".

RETURNS

ndarray: 1-D array of scores, one per sequence.
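The spat_min, spat_max, bidirect and spat parameters can be illustrated with a few small helpers. This is a hedged sketch under stated assumptions, not pyprego code: the function names are hypothetical, spat_max is assumed to be inclusive, and the spatial factor is assumed to multiply each window's likelihood before aggregation.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T string (used when scoring both strands)."""
    return seq.translate(COMPLEMENT)[::-1]


def scoring_region(seq, spat_min=1, spat_max=None):
    """Convert a 1-based (spat_min, spat_max) range to a Python slice.

    Assumption: spat_max is inclusive; None means the end of the sequence.
    """
    end = len(seq) if spat_max is None else spat_max
    return seq[spat_min - 1:end]


def weighted_log_sum_exp(log_scores, weights):
    """Stable log(sum(w * exp(s))); assumes at least one weight is positive.

    Assumption: a spatial factor w scales the likelihood of its window,
    which is the same as adding log(w) to the window's log score.
    """
    terms = [s + math.log(w) for s, w in zip(log_scores, weights) if w > 0]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


region = scoring_region("ACGTACGT", spat_min=3, spat_max=6)
rc = reverse_complement("AACG")
uniform = weighted_log_sum_exp([-1.0, -2.0], [1.0, 1.0])
skewed = weighted_log_sum_exp([-1.0, -2.0], [1.0, 0.0])
```

With uniform weights of 1 the weighted form reduces to the ordinary logSumExp, which matches the documented behaviour of spat=None.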

compute_local_pwm

compute_local_pwm(sequences: list[str] | ndarray, pssm: DataFrame, *, spat: DataFrame | None = None, bidirect: bool = True, prior: float = 0.01) -> np.ndarray

Compute per-position PWM scores across each sequence.

Mirrors the R compute_local_pwm() function. At each valid position, computes the log-likelihood of the PSSM alignment. Positions where the PSSM does not fit are set to NaN.

In the R implementation, compute_local_pwm_cpp extracts a substring of motif_len at each position and calls integrate_energy on it. With a single-bin uniform spatial factor, this is equivalent to computing logSumExp(forward_score, rc_score) at each position when bidirect=True, or just the forward score when bidirect=False.
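The per-position rule described above, logSumExp(forward_score, rc_score) when bidirect=True and the forward score alone otherwise, can be sketched as follows. The helper names and the toy log-PSSM are hypothetical, and the prior is left out to keep the arithmetic easy to follow.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def window_score(window, log_pssm):
    """Sum of per-position log probabilities for one motif-length window."""
    return sum(log_pssm[i][BASES.index(base)] for i, base in enumerate(window))


def local_score(window, log_pssm, bidirect=True):
    """Score one window on the forward strand, optionally combined with its reverse complement."""
    fwd = window_score(window, log_pssm)
    if not bidirect:
        return fwd
    rc = window_score(window.translate(COMPLEMENT)[::-1], log_pssm)
    m = max(fwd, rc)
    return m + math.log(math.exp(fwd - m) + math.exp(rc - m))


# Toy two-position motif that prefers "AC" (no prior applied in this sketch).
probs = [(0.7, 0.1, 0.1, 0.1), (0.1, 0.7, 0.1, 0.1)]
log_pssm = [[math.log(p) for p in row] for row in probs]

fwd_only = local_score("AC", log_pssm, bidirect=False)
both = local_score("AC", log_pssm)
rc_hit = local_score("GT", log_pssm)  # "GT" is the reverse complement of "AC"
```

Because the two strands are combined symmetrically, a window and its reverse complement receive the same bidirectional score.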

PARAMETERS

sequences (list[str] | ndarray): DNA sequences.

pssm (DataFrame): PSSM DataFrame with columns pos, A, C, G, T.

spat (DataFrame | None, default None): Spatial model DataFrame. If provided, spatial weighting is applied. If None, uniform weighting (factor=1) is used.

bidirect (bool, default True): Score both the forward strand and the reverse complement.

prior (float, default 0.01): Uniform prior added to PSSM probabilities.

RETURNS

ndarray: 2-D array of shape (n_sequences, seq_length) with per-position scores. Positions where the PSSM window does not fit contain NaN.
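The NaN convention for the returned array can be illustrated with one row. This sketch assumes that each score is stored at the start position of its window, so the trailing motif_len - 1 positions are the ones left as NaN; the actual placement in pyprego may differ, and pad_local_scores is a hypothetical helper.

```python
import math


def pad_local_scores(window_scores, seq_length):
    """Place window scores at their start positions and fill the rest with NaN."""
    row = [math.nan] * seq_length
    row[:len(window_scores)] = window_scores
    return row


# A length-5 sequence scored with a 3-position motif has 3 valid windows.
row = pad_local_scores([-1.2, -0.4, -3.0], seq_length=5)

# NaN entries must be skipped explicitly before reducing over a row.
valid = [score for score in row if not math.isnan(score)]
best = max(valid)
```

When working with the NumPy array itself, NaN-aware reductions such as numpy.nanmax serve the same purpose as the explicit filter shown here.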