PSSM Utilities
Functions for creating, manipulating, comparing, and analysing Position-Specific Scoring Matrices.
pyprego.pssm
PSSM (Position-Specific Scoring Matrix) operations.
Functions for creating, manipulating, comparing, and analysing PSSMs, mirroring the pssm-utils.R and pssm-cor.R modules from the R prego package.
bits_per_pos
Compute information content (bits) at each position.
Matches the R bits_per_pos exactly:
bits = log2(4) + sum(p * log2(p)) per position, floored at 0.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| pssm | DataFrame | PSSM DataFrame with columns |
| prior | | Prior probability added to each value before normalisation. |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray | 1-D array of bits per position (max 2 for DNA). |
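The formula above can be checked by hand with a minimal pure-Python sketch. It is an illustration, not the pyprego implementation: the function name `bits_per_position`, the list-of-rows input (one `[A, C, G, T]` list per position), and the default prior of 0.01 are assumptions made for this example.

```python
import math


def bits_per_position(rows, prior=0.01):
    """Information content per position for rows given in A, C, G, T order.

    Hypothetical helper: the name, input layout, and default prior are
    assumptions of this sketch.
    """
    bits = []
    for row in rows:
        # Add the prior to every cell, then re-normalise the row to sum to 1.
        shifted = [value + prior for value in row]
        total = sum(shifted)
        probs = [value / total for value in shifted]
        # sum(p * log2(p)) is the negative entropy; skip p == 0 terms.
        neg_entropy = sum(p * math.log2(p) for p in probs if p > 0)
        # log2(4) = 2 bits is the maximum for a 4-letter alphabet; floor at 0.
        bits.append(max(0.0, math.log2(4) + neg_entropy))
    return bits
```

With `prior=0`, a uniform row yields 0 bits and a fully conserved row yields 2 bits, which matches the stated maximum for DNA.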
consensus_from_pssm
consensus_from_pssm(pssm: DataFrame, single_thresh: float = 0.5, double_thresh: float = 0.75) -> str
Derive a consensus sequence from a PSSM.
At each position the dominant nucleotide is used if its probability exceeds single_thresh. If two nucleotides together exceed double_thresh (but neither alone exceeds single_thresh), an IUPAC ambiguity code is used. Otherwise N is emitted.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| pssm | DataFrame | PSSM DataFrame with columns |
| single_thresh | float | Threshold for a single-nucleotide call. |
| double_thresh | float | Threshold for a two-nucleotide ambiguity call. |

| RETURNS | DESCRIPTION |
|---|---|
| str | Consensus string. |
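The decision rule described above can be sketched in pure Python. This is an illustrative reading of the rule, not the library's code: the helper name `consensus`, the list-of-rows input in A, C, G, T order, and the tie-breaking behaviour are assumptions. The two-letter codes are the standard IUPAC ones.

```python
# Standard IUPAC codes for two-nucleotide ambiguity (keys sorted alphabetically).
IUPAC_PAIRS = {"AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K"}


def consensus(rows, single_thresh=0.5, double_thresh=0.75):
    """Hypothetical helper: rows are [A, C, G, T] probabilities per position."""
    letters = []
    for row in rows:
        # Rank nucleotides by probability, highest first.
        ranked = sorted(zip(row, "ACGT"), reverse=True)
        (p1, n1), (p2, n2) = ranked[0], ranked[1]
        if p1 > single_thresh:
            letters.append(n1)  # one dominant nucleotide
        elif p1 + p2 > double_thresh:
            # Two nucleotides jointly dominate: emit the ambiguity code.
            letters.append(IUPAC_PAIRS["".join(sorted(n1 + n2))])
        else:
            letters.append("N")
    return "".join(letters)
```

For example, a position with A = 0.40 and G = 0.45 has no single nucleotide above 0.5, but the pair sums to 0.85, so the sketch emits R (A or G).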
pssm_rc
Return the reverse-complement of a PSSM.
Rows are reversed and columns swapped (A<->T, C<->G).
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| pssm | DataFrame | PSSM DataFrame. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Reverse-complemented PSSM. |
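A minimal sketch of the row reversal and column swap, assuming each position is a plain list in A, C, G, T order (the helper name `reverse_complement` is hypothetical). With that column order, reversing a row is the same as swapping A with T and C with G.

```python
def reverse_complement(rows):
    """Hypothetical helper: rows are [A, C, G, T] values per position."""
    # reversed(rows) flips the position order; reversed(row) maps
    # [A, C, G, T] -> [T, G, C, A], i.e. swaps A<->T and C<->G.
    return [list(reversed(row)) for row in reversed(rows)]
```

Applying the function twice returns the original matrix, which is a quick sanity check for any reverse-complement routine.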
pssm_trim
Trim low-information positions from the edges of a PSSM.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| pssm | DataFrame | PSSM DataFrame. |
| bits_thresh | | Minimum bits per position to keep at the edges. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Trimmed PSSM with |
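Edge trimming can be sketched as follows. This is one plausible reading, not the library's code: the helper names are hypothetical, bits are computed without a prior, and a position is kept when its bits are greater than or equal to the threshold (whether the real comparison is strict is not stated here). Only the edges are trimmed, so low-information positions inside the motif survive.

```python
import math


def _bits(row):
    # Information content of one [A, C, G, T] row, floored at 0 (no prior).
    total = sum(row)
    probs = [value / total for value in row]
    return max(0.0, 2.0 + sum(p * math.log2(p) for p in probs if p > 0))


def trim_edges(rows, bits_thresh):
    """Hypothetical helper: drop leading/trailing rows below bits_thresh."""
    informative = [i for i, row in enumerate(rows) if _bits(row) >= bits_thresh]
    if not informative:
        return []  # nothing reaches the threshold
    # Keep everything between the first and last informative position.
    return rows[informative[0]: informative[-1] + 1]
```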
pssm_add_prior
Add a uniform prior to a PSSM and re-normalise.
new = (pssm + prior) / rowSums(pssm + prior)
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| pssm | DataFrame | PSSM DataFrame. |
| prior | | Prior value added to each cell. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | New PSSM DataFrame with the prior applied. |
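The re-normalisation formula can be written out in a few lines of pure Python. The helper name `add_prior`, the list-of-rows input, and the default of 0.01 are assumptions of this sketch.

```python
def add_prior(rows, prior=0.01):
    """Hypothetical helper: (row + prior) / sum(row + prior), row by row."""
    result = []
    for row in rows:
        shifted = [value + prior for value in row]
        total = sum(shifted)
        result.append([value / total for value in shifted])
    return result
```

For a fully conserved row `[1, 0, 0, 0]` and a prior of 0.01, the result is `[1.01/1.04, 0.01/1.04, 0.01/1.04, 0.01/1.04]`, so no cell is exactly zero and every row still sums to 1.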
pssm_theoretical_max
Theoretical maximum log-likelihood score for a PSSM.
Matches the R implementation: sum(log(regularization + rowMax(normalized_pssm))).
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| pssm | DataFrame | PSSM DataFrame. |
| prior | | Prior added before normalisation. |
| regularization | | Value added inside the log to prevent |

| RETURNS | DESCRIPTION |
|---|---|
| float | Maximum possible score. |
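The expression sum(log(regularization + rowMax(normalized_pssm))) can be sketched directly. The helper name `theoretical_max`, the list-of-rows input, and the defaults (taken from the pssm_quantile signature on this page) are assumptions, and the natural logarithm is assumed because the formula uses R's log.

```python
import math


def theoretical_max(rows, prior=0.01, regularization=0.01):
    """Hypothetical helper: best achievable log score, one max per position."""
    score = 0.0
    for row in rows:
        # Normalise the row after adding the prior to each cell.
        shifted = [value + prior for value in row]
        total = sum(shifted)
        best_prob = max(shifted) / total
        # The highest-probability nucleotide gives the largest log term.
        score += math.log(regularization + best_prob)
    return score
```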
pssm_theoretical_min
Theoretical minimum log-likelihood score for a PSSM.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| pssm | DataFrame | PSSM DataFrame. |
| prior | | Prior added before normalisation. |
| regularization | | Value added inside the log to prevent |

| RETURNS | DESCRIPTION |
|---|---|
| float | Minimum possible score. |
pssm_quantile
pssm_quantile(pssm: DataFrame, q: float, prior: float = 0.01, regularization: float = 0.01) -> float
Quantile of the theoretical score distribution.
Linearly interpolates between pssm_theoretical_min and pssm_theoretical_max.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| pssm | DataFrame | PSSM DataFrame. |
| q | float | Quantile (0 to 1). |
| prior | float | Passed through to the min/max functions. |
| regularization | float | Passed through to the min/max functions. |

| RETURNS | DESCRIPTION |
|---|---|
| float | Score at the requested quantile. |
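Linear interpolation between the two theoretical bounds amounts to one line of arithmetic. The sketch below takes the bounds as plain numbers; the helper name `interpolate_score` is hypothetical, and how the library handles `q` outside the 0 to 1 range is not covered here.

```python
def interpolate_score(score_min, score_max, q):
    """Hypothetical helper: score at quantile q between two bounds."""
    # q = 0 returns the minimum, q = 1 the maximum, q = 0.5 the midpoint.
    return score_min + q * (score_max - score_min)
```

With a theoretical minimum of -10 and a maximum of -2, `q = 0.5` gives -6 and `q = 0.9` gives -2.8.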
pssm_concat
Concatenate multiple PSSMs vertically, optionally with a gap.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| *pssms | DataFrame | PSSM DataFrames to concatenate. |
| gap | | Number of uniform (0.25 each) positions to insert between successive PSSMs. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Combined PSSM with |
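Vertical concatenation with a uniform spacer can be sketched as below. The helper name `concat_with_gap`, the list-of-rows representation, and the default `gap=0` are assumptions of this example.

```python
def concat_with_gap(*pssms, gap=0):
    """Hypothetical helper: stack PSSMs, inserting `gap` uniform rows between them."""
    combined = []
    for index, rows in enumerate(pssms):
        if index > 0:
            # Spacer positions carry no information: 0.25 for each nucleotide.
            combined.extend([0.25] * 4 for _ in range(gap))
        combined.extend(list(row) for row in rows)
    return combined
```

Two motifs of lengths 2 and 1 joined with `gap=2` produce a 5-position matrix whose third and fourth rows are uniform.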
pssm_cor
pssm_cor(pssm1: DataFrame, pssm2: DataFrame, method: str = 'spearman', prior: float = 0.01) -> float
Compute correlation between two PSSMs at the best alignment.
The shorter PSSM is slid along the longer one, and the maximum correlation (Spearman or Pearson) across all alignments is returned.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| pssm1 | DataFrame | PSSM DataFrames with columns |
| pssm2 | DataFrame | PSSM DataFrames with columns |
| method | str | Correlation method (Spearman or Pearson). |
| prior | float | Prior added before normalisation. |

| RETURNS | DESCRIPTION |
|---|---|
| float | Maximum correlation across all alignments. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If either PSSM is empty. |
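The sliding-alignment idea can be illustrated with a small pure-Python sketch. It makes several simplifying assumptions that may differ from the library: only Pearson correlation is implemented, only offsets where the shorter matrix fully overlaps the longer one are tried, no prior is applied, and a window with zero variance is scored as 0. The helper names are hypothetical.

```python
import math


def pearson(x, y):
    # Plain Pearson correlation of two equal-length sequences.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined; this sketch reports 0
    return cov / math.sqrt(var_x * var_y)


def best_alignment_cor(rows1, rows2):
    """Hypothetical helper: max Pearson correlation over ungapped offsets."""
    if not rows1 or not rows2:
        raise ValueError("both PSSMs must be non-empty")
    short, long_ = sorted((rows1, rows2), key=len)
    flat_short = [value for row in short for value in row]
    best = float("-inf")
    # Slide the shorter matrix along the longer one, one position at a time.
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset: offset + len(short)]
        flat_window = [value for row in window for value in row]
        best = max(best, pearson(flat_short, flat_window))
    return best
```

A motif embedded inside a longer matrix correlates perfectly with itself at the matching offset, so the function returns 1 (up to floating-point error) in that case.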
pssm_diff
Compute symmetric KL divergence between two PSSMs at the best alignment.
The shorter PSSM is slid along the longer one, and the minimum (symmetric) KL divergence across all alignments is returned.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| pssm1 | DataFrame | PSSM DataFrames. |
| pssm2 | DataFrame | PSSM DataFrames. |
| prior | | Prior added before normalisation. |

| RETURNS | DESCRIPTION |
|---|---|
| float | Minimum symmetric KL divergence (lower = more similar). |
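A sketch of the minimum symmetric KL divergence over alignments follows. The exact definition used by the library is not given here, so this example assumes KL(P||Q) + KL(Q||P) summed over positions (not halved or averaged), natural logarithms, full-overlap offsets only, and a strictly positive prior so that no probability is zero. All helper names are hypothetical.

```python
import math


def normalise(rows, prior):
    # Add the prior to each cell and rescale every row to sum to 1.
    result = []
    for row in rows:
        shifted = [value + prior for value in row]
        total = sum(shifted)
        result.append([value / total for value in shifted])
    return result


def symmetric_kl(rows_p, rows_q):
    # p*log(p/q) + q*log(q/p) simplifies to (p - q) * log(p / q).
    total = 0.0
    for row_p, row_q in zip(rows_p, rows_q):
        for p, q in zip(row_p, row_q):
            total += (p - q) * math.log(p / q)
    return total


def best_alignment_diff(rows1, rows2, prior=0.01):
    """Hypothetical helper: min symmetric KL over ungapped offsets (prior > 0)."""
    short, long_ = sorted((normalise(rows1, prior), normalise(rows2, prior)), key=len)
    return min(
        symmetric_kl(short, long_[offset: offset + len(short)])
        for offset in range(len(long_) - len(short) + 1)
    )
```

Identical motifs give a divergence of 0 at the matching offset, and any mismatch gives a positive value, which is why lower means more similar.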
pssm_dataset_cor
pssm_dataset_cor(dataset1: DataFrame, dataset2: DataFrame | None = None, method: str = 'spearman', prior: float = 0.01) -> pd.DataFrame
Compute pairwise correlation matrix for PSSM datasets.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| dataset1 | DataFrame | DataFrame with columns |
| dataset2 | DataFrame \| None | Optional second dataset. If |
| method | str | Correlation method. |
| prior | float | Prior for normalisation. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Correlation matrix as a DataFrame with motif-name row and column labels. |
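The shape of a pairwise matrix can be illustrated independently of the scoring function. In this sketch the datasets are plain dicts mapping motif names to matrices, the score is any two-argument callable, and a missing second dataset means the first is compared with itself. All three choices are assumptions of the example, as is the helper name `pairwise_matrix`; the library works on DataFrames.

```python
def pairwise_matrix(motifs1, similarity, motifs2=None):
    """Hypothetical helper: nested dict of similarity(m1, m2) for every pair."""
    if motifs2 is None:
        motifs2 = motifs1  # assumed behaviour: compare the dataset with itself
    return {
        name1: {name2: similarity(m1, m2) for name2, m2 in motifs2.items()}
        for name1, m1 in motifs1.items()
    }
```

The outer keys play the role of row labels and the inner keys the role of column labels, mirroring the motif-name labels of the returned correlation matrix.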
pssm_dataset_diff
pssm_dataset_diff(dataset1: DataFrame, dataset2: DataFrame | None = None, prior: float = 0.01) -> pd.DataFrame
Compute pairwise KL divergence matrix for PSSM datasets.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| dataset1 | DataFrame | DataFrame with columns |
| dataset2 | DataFrame \| None | Optional second dataset. |
| prior | float | Prior for normalisation. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | KL divergence matrix as a DataFrame. |
pssm_match
pssm_match(pssm: DataFrame, motifs: dict[str, DataFrame] | DataFrame, best: bool = False, method: str = 'spearman', prior: float = 0.01) -> pd.DataFrame | str
Match a PSSM against a motif database.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| pssm | DataFrame | Query PSSM DataFrame. |
| motifs | dict[str, DataFrame] \| DataFrame | Either a dict mapping motif names to PSSM DataFrames, or a DataFrame with a |
| best | bool | If |
| method | str | Correlation method. |
| prior | float | Prior for normalisation. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame \| str | If best is |