PSSM Utilities

Functions for creating, manipulating, comparing, and analysing Position-Specific Scoring Matrices.

pyprego.pssm

PSSM (Position-Specific Scoring Matrix) operations.

Functions for creating, manipulating, comparing, and analysing PSSMs, mirroring the pssm-utils.R and pssm-cor.R modules from the R prego package.

concat_pssm module-attribute

concat_pssm = pssm_concat

trim_pssm module-attribute

trim_pssm = pssm_trim

bits_per_pos

bits_per_pos(pssm: DataFrame, prior: float = 0.01) -> np.ndarray

Compute information content (bits) at each position.

Matches the R bits_per_pos exactly: bits = log2(4) + sum(p * log2(p)) per position, floored at 0, where p is each nucleotide's probability after the prior has been added and the row re-normalised.

PARAMETER DESCRIPTION
pssm

PSSM DataFrame with columns A, C, G, T.

TYPE: DataFrame

prior

Prior probability added to each value before normalisation.

TYPE: float DEFAULT: 0.01

RETURNS DESCRIPTION
ndarray

1-D array of bits per position (max 2 for DNA).
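
As an illustration of the formula above, the following minimal sketch recomputes per-position bits with NumPy. It is not pyprego's implementation: the name bits_per_pos_sketch is hypothetical, and the sketch assumes the prior is added to every cell before each row is re-normalised to sum to 1.

```python
import numpy as np
import pandas as pd


def bits_per_pos_sketch(pssm: pd.DataFrame, prior: float = 0.01) -> np.ndarray:
    """Hypothetical helper: information content per position, in bits."""
    # Assumption: the prior is added to every cell, then rows are re-normalised.
    probs = pssm[["A", "C", "G", "T"]].to_numpy(dtype=float) + prior
    probs = probs / probs.sum(axis=1, keepdims=True)
    # bits = log2(4) + sum(p * log2(p)); a positive prior keeps log2 finite.
    bits = np.log2(4) + np.sum(probs * np.log2(probs), axis=1)
    return np.maximum(bits, 0.0)  # floor at 0


example = pd.DataFrame(
    {"A": [1.0, 0.25], "C": [0.0, 0.25], "G": [0.0, 0.25], "T": [0.0, 0.25]}
)
bits = bits_per_pos_sketch(example)
```

The first row is almost deterministic, so its value approaches the 2-bit ceiling without reaching it (the prior spreads a little mass over the other bases). The second row is uniform and carries essentially no information.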

consensus_from_pssm

consensus_from_pssm(pssm: DataFrame, single_thresh: float = 0.5, double_thresh: float = 0.75) -> str

Derive a consensus sequence from a PSSM.

At each position the dominant nucleotide is used if its probability exceeds single_thresh. If two nucleotides together exceed double_thresh (but neither alone exceeds single_thresh), an IUPAC ambiguity code is used. Otherwise N is emitted.

PARAMETER DESCRIPTION
pssm

PSSM DataFrame with columns A, C, G, T.

TYPE: DataFrame

single_thresh

Threshold for a single-nucleotide call.

TYPE: float DEFAULT: 0.5

double_thresh

Threshold for a two-nucleotide ambiguity call.

TYPE: float DEFAULT: 0.75

RETURNS DESCRIPTION
str

Consensus string.

pssm_rc

pssm_rc(pssm: DataFrame) -> pd.DataFrame

Return the reverse-complement of a PSSM.

Rows are reversed and columns swapped (A<->T, C<->G).

PARAMETER DESCRIPTION
pssm

PSSM DataFrame.

TYPE: DataFrame

RETURNS DESCRIPTION
DataFrame

Reverse-complemented PSSM.
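
A minimal sketch of the row reversal and column swap, assuming the PSSM has columns A, C, G, T. The name pssm_rc_sketch is hypothetical, and this version drops any other columns (such as pos) rather than guessing how the library treats them.

```python
import pandas as pd


def pssm_rc_sketch(pssm: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: reverse rows, then swap A<->T and C<->G."""
    rev = pssm[["A", "C", "G", "T"]].iloc[::-1].reset_index(drop=True)
    return pd.DataFrame(
        {"A": rev["T"], "C": rev["G"], "G": rev["C"], "T": rev["A"]}
    )


# One-hot PSSM for the sequence "AC"; its reverse complement is "GT".
example = pd.DataFrame(
    {"A": [1.0, 0.0], "C": [0.0, 1.0], "G": [0.0, 0.0], "T": [0.0, 0.0]}
)
rc = pssm_rc_sketch(example)
```

Applying the operation twice returns the original matrix, which is a convenient sanity check.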

pssm_trim

pssm_trim(pssm: DataFrame, bits_thresh: float = 0.1) -> pd.DataFrame

Trim low-information positions from the edges of a PSSM.

PARAMETER DESCRIPTION
pssm

PSSM DataFrame.

TYPE: DataFrame

bits_thresh

Minimum bits per position to keep at the edges.

TYPE: float DEFAULT: 0.1

RETURNS DESCRIPTION
DataFrame

Trimmed PSSM with pos column reset.
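
Edge trimming can be sketched as below. The details are assumptions rather than confirmed library behaviour: pssm_trim_sketch is a hypothetical name, bits are computed with a prior of 0.01, a position is kept when its bits are greater than or equal to bits_thresh, low-information positions in the interior are left alone, and pos is renumbered from 0.

```python
import numpy as np
import pandas as pd


def pssm_trim_sketch(pssm, bits_thresh=0.1, prior=0.01):
    """Hypothetical helper: drop low-information rows from both ends only."""
    probs = pssm[["A", "C", "G", "T"]].to_numpy(dtype=float) + prior
    probs = probs / probs.sum(axis=1, keepdims=True)
    bits = np.maximum(2.0 + (probs * np.log2(probs)).sum(axis=1), 0.0)
    keep = np.flatnonzero(bits >= bits_thresh)  # informative positions
    if keep.size == 0:
        return pssm.iloc[0:0].copy()  # nothing informative: empty result
    # Slice from the first to the last informative position, inclusive.
    trimmed = pssm.iloc[keep[0]: keep[-1] + 1].reset_index(drop=True)
    trimmed["pos"] = np.arange(len(trimmed))  # assumption: 0-based pos
    return trimmed


# Rows: uniform, A, uniform, C, uniform.
example = pd.DataFrame(
    {
        "pos": [0, 1, 2, 3, 4],
        "A": [0.25, 1.0, 0.25, 0.0, 0.25],
        "C": [0.25, 0.0, 0.25, 1.0, 0.25],
        "G": [0.25, 0.0, 0.25, 0.0, 0.25],
        "T": [0.25, 0.0, 0.25, 0.0, 0.25],
    }
)
trimmed = pssm_trim_sketch(example)
```

The two uniform flanks are removed, while the uniform row between the A and C positions survives because it is not at an edge.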

pssm_add_prior

pssm_add_prior(pssm: DataFrame, prior: float) -> pd.DataFrame

Add a uniform prior to a PSSM and re-normalise.

new = (pssm + prior) / rowSums(pssm + prior)

PARAMETER DESCRIPTION
pssm

PSSM DataFrame.

TYPE: DataFrame

prior

Prior value added to each cell.

TYPE: float

RETURNS DESCRIPTION
DataFrame

New PSSM DataFrame with the prior applied.
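
The re-normalisation formula above translates directly to pandas. The following is a sketch with a hypothetical name (add_prior_sketch); it assumes the nucleotide columns are A, C, G, T and leaves any other columns untouched.

```python
import pandas as pd


def add_prior_sketch(pssm: pd.DataFrame, prior: float) -> pd.DataFrame:
    """Hypothetical helper: (pssm + prior) / rowSums(pssm + prior)."""
    cols = ["A", "C", "G", "T"]
    out = pssm.copy()  # do not modify the caller's DataFrame
    shifted = pssm[cols] + prior
    out[cols] = shifted.div(shifted.sum(axis=1), axis=0)
    return out


example = pd.DataFrame(
    {"A": [1.0, 0.5], "C": [0.0, 0.5], "G": [0.0, 0.0], "T": [0.0, 0.0]}
)
smoothed = add_prior_sketch(example, prior=0.01)
```

With prior = 0.01, a cell that was 1.0 becomes 1.01 / 1.04 and a cell that was 0.0 becomes 0.01 / 1.04, so every row still sums to 1 and no cell is exactly zero.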

pssm_theoretical_max

pssm_theoretical_max(pssm: DataFrame, prior: float = 0.01, regularization: float = 0.01) -> float

Theoretical maximum log-likelihood score for a PSSM.

Matches the R implementation: sum(log(regularization + rowMax(normalized_pssm))).

PARAMETER DESCRIPTION
pssm

PSSM DataFrame.

TYPE: DataFrame

prior

Prior added before normalisation.

TYPE: float DEFAULT: 0.01

regularization

Value added inside the log to prevent -Inf.

TYPE: float DEFAULT: 0.01

RETURNS DESCRIPTION
float

Maximum possible score.
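
The formula above can be sketched with NumPy as follows. The name theoretical_max_sketch is hypothetical, log is taken to be the natural logarithm (as in R), and the prior is assumed to be applied by adding it to each cell and re-normalising rows.

```python
import numpy as np
import pandas as pd


def theoretical_max_sketch(pssm, prior=0.01, regularization=0.01):
    """Hypothetical helper: sum(log(regularization + rowMax(normalised)))."""
    probs = pssm[["A", "C", "G", "T"]].to_numpy(dtype=float) + prior
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Best achievable score: pick the most probable base at every position.
    return float(np.sum(np.log(regularization + probs.max(axis=1))))


uniform = pd.DataFrame(
    {"A": [0.25, 0.25], "C": [0.25, 0.25], "G": [0.25, 0.25], "T": [0.25, 0.25]}
)
score = theoretical_max_sketch(uniform)
```

For this two-position uniform PSSM every row maximum is 0.25, so the score is 2 * log(0.26).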

pssm_theoretical_min

pssm_theoretical_min(pssm: DataFrame, prior: float = 0.01, regularization: float = 0.01) -> float

Theoretical minimum log-likelihood score for a PSSM.

PARAMETER DESCRIPTION
pssm

PSSM DataFrame.

TYPE: DataFrame

prior

Prior added before normalisation.

TYPE: float DEFAULT: 0.01

regularization

Value added inside the log to prevent -Inf.

TYPE: float DEFAULT: 0.01

RETURNS DESCRIPTION
float

Minimum possible score.

pssm_quantile

pssm_quantile(pssm: DataFrame, q: float, prior: float = 0.01, regularization: float = 0.01) -> float

Quantile of the theoretical score distribution.

Linearly interpolates between pssm_theoretical_min and pssm_theoretical_max.

PARAMETER DESCRIPTION
pssm

PSSM DataFrame.

TYPE: DataFrame

q

Quantile (0 to 1).

TYPE: float

prior

Passed through to the min/max functions.

TYPE: float DEFAULT: 0.01

regularization

Passed through to the min/max functions.

TYPE: float DEFAULT: 0.01

RETURNS DESCRIPTION
float

Score at the requested quantile.
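
The interpolation can be sketched as min + q * (max - min). In the sketch below, both helper names are hypothetical. The maximum follows the formula documented for pssm_theoretical_max; the minimum is an assumption (the same formula with the row minimum), since this page does not state how pssm_theoretical_min is computed.

```python
import numpy as np
import pandas as pd


def score_extremes_sketch(pssm, prior=0.01, regularization=0.01):
    """Hypothetical helper returning (min_score, max_score)."""
    probs = pssm[["A", "C", "G", "T"]].to_numpy(dtype=float) + prior
    probs = probs / probs.sum(axis=1, keepdims=True)
    score_max = float(np.sum(np.log(regularization + probs.max(axis=1))))
    # ASSUMPTION: the minimum mirrors the maximum, using the row minimum.
    score_min = float(np.sum(np.log(regularization + probs.min(axis=1))))
    return score_min, score_max


def pssm_quantile_sketch(pssm, q, prior=0.01, regularization=0.01):
    """Hypothetical helper: linear interpolation between the two extremes."""
    lo, hi = score_extremes_sketch(pssm, prior, regularization)
    return lo + q * (hi - lo)


example = pd.DataFrame(
    {"A": [0.7, 0.25], "C": [0.1, 0.25], "G": [0.1, 0.25], "T": [0.1, 0.25]}
)
lo, hi = score_extremes_sketch(example)
median_score = pssm_quantile_sketch(example, 0.5)
```

Here q = 0 gives the minimum, q = 1 gives the maximum, and q = 0.5 gives their midpoint. Because the interpolation is linear in the score range, q is a fraction of that range rather than an empirical quantile over sequences.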

pssm_concat

pssm_concat(*pssms: DataFrame, gap: int = 0) -> pd.DataFrame

Concatenate multiple PSSMs vertically, optionally with a gap.

PARAMETER DESCRIPTION
*pssms

PSSM DataFrames to concatenate.

TYPE: DataFrame DEFAULT: ()

gap

Number of uniform (0.25 each) positions to insert between successive PSSMs.

TYPE: int DEFAULT: 0

RETURNS DESCRIPTION
DataFrame

Combined PSSM with pos column renumbered from 0.
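
A minimal sketch of concatenation with uniform spacer rows, assuming each input has columns A, C, G, T. The name pssm_concat_sketch is hypothetical, and this version discards any extra input columns before rebuilding pos.

```python
import pandas as pd


def pssm_concat_sketch(*pssms: pd.DataFrame, gap: int = 0) -> pd.DataFrame:
    """Hypothetical helper: stack PSSMs, inserting `gap` uniform rows between them."""
    cols = ["A", "C", "G", "T"]
    spacer = pd.DataFrame(0.25, index=range(gap), columns=cols)
    parts = []
    for i, pssm in enumerate(pssms):
        if i > 0 and gap > 0:
            parts.append(spacer)  # uniform positions between successive PSSMs
        parts.append(pssm[cols])
    out = pd.concat(parts, ignore_index=True)
    out.insert(0, "pos", list(range(len(out))))  # renumber from 0
    return out


left = pd.DataFrame(
    {"A": [1.0, 0.0], "C": [0.0, 1.0], "G": [0.0, 0.0], "T": [0.0, 0.0]}
)
right = pd.DataFrame({"A": [0.0], "C": [0.0], "G": [0.0], "T": [1.0]})
combined = pssm_concat_sketch(left, right, gap=2)
```

The result has 2 + 2 + 1 = 5 positions: the two rows of left, two uniform spacer rows, then the single row of right.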

pssm_cor

pssm_cor(pssm1: DataFrame, pssm2: DataFrame, method: str = 'spearman', prior: float = 0.01) -> float

Compute correlation between two PSSMs at the best alignment.

The shorter PSSM is slid along the longer one, and the maximum correlation (Spearman or Pearson) across all alignments is returned.

PARAMETER DESCRIPTION
pssm1

First PSSM DataFrame, with columns A, C, G, T.

TYPE: DataFrame

pssm2

Second PSSM DataFrame, with columns A, C, G, T.

TYPE: DataFrame

method

"spearman" (default) or "pearson".

TYPE: str DEFAULT: 'spearman'

prior

Prior added before normalisation.

TYPE: float DEFAULT: 0.01

RETURNS DESCRIPTION
float

Maximum correlation across all alignments.

RAISES DESCRIPTION
ValueError

If either PSSM is empty.
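
The sliding comparison can be sketched as below. This is an illustration under assumptions, not the library's algorithm: pssm_cor_sketch and one_hot are hypothetical names, only fully overlapping offsets on the given strand are tried, each window is flattened to a single vector, and Spearman is computed as Pearson on average ranks.

```python
import numpy as np
import pandas as pd

COLS = ["A", "C", "G", "T"]


def one_hot(seq: str) -> pd.DataFrame:
    """Hypothetical helper: build a one-hot PSSM from a DNA string."""
    return pd.DataFrame([[float(b == c) for c in COLS] for b in seq], columns=COLS)


def _normalise(pssm, prior):
    probs = pssm[COLS].to_numpy(dtype=float) + prior
    return probs / probs.sum(axis=1, keepdims=True)


def pssm_cor_sketch(pssm1, pssm2, method="spearman", prior=0.01):
    if len(pssm1) == 0 or len(pssm2) == 0:
        raise ValueError("PSSMs must not be empty")
    a, b = _normalise(pssm1, prior), _normalise(pssm2, prior)
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    best = -np.inf
    # Slide the shorter matrix along the longer one, one position at a time.
    for offset in range(len(long_) - len(short) + 1):
        x = pd.Series(short.ravel())
        y = pd.Series(long_[offset: offset + len(short)].ravel())
        if method == "spearman":
            x, y = x.rank(), y.rank()  # Spearman = Pearson on ranks
        best = max(best, np.corrcoef(x, y)[0, 1])
    return float(best)


motif = one_hot("ACG")
longer = one_hot("TACGT")  # contains the motif at offset 1
best_cor = pssm_cor_sketch(motif, longer)
```

Because the motif appears unchanged inside the longer matrix, the best alignment gives a correlation of 1 regardless of argument order.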

pssm_diff

pssm_diff(pssm1: DataFrame, pssm2: DataFrame, prior: float = 0.01) -> float

Compute symmetric KL divergence between two PSSMs at the best alignment.

The shorter PSSM is slid along the longer one, and the minimum (symmetric) KL divergence across all alignments is returned.

PARAMETER DESCRIPTION
pssm1

First PSSM DataFrame.

TYPE: DataFrame

pssm2

Second PSSM DataFrame.

TYPE: DataFrame

prior

Prior added before normalisation.

TYPE: float DEFAULT: 0.01

RETURNS DESCRIPTION
float

Minimum symmetric KL divergence (lower = more similar).
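
A sketch of the minimum-divergence search follows. The exact definition used by the library is not given on this page, so the sketch makes explicit assumptions: symmetric KL is taken as KL(p||q) + KL(q||p) summed over all positions of the window, natural logarithms are used, and only fully overlapping offsets are tried. The names pssm_diff_sketch and one_hot are hypothetical.

```python
import numpy as np
import pandas as pd

COLS = ["A", "C", "G", "T"]


def one_hot(seq: str) -> pd.DataFrame:
    """Hypothetical helper: build a one-hot PSSM from a DNA string."""
    return pd.DataFrame([[float(b == c) for c in COLS] for b in seq], columns=COLS)


def _normalise(pssm, prior):
    probs = pssm[COLS].to_numpy(dtype=float) + prior
    return probs / probs.sum(axis=1, keepdims=True)


def pssm_diff_sketch(pssm1, pssm2, prior=0.01):
    a, b = _normalise(pssm1, prior), _normalise(pssm2, prior)
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    best = np.inf
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset: offset + len(short)]
        # ASSUMPTION: symmetric KL = KL(p||q) + KL(q||p), summed over the window.
        kl = np.sum(short * np.log(short / window) + window * np.log(window / short))
        best = min(best, float(kl))
    return best


motif = one_hot("ACG")
same = pssm_diff_sketch(motif, one_hot("TACGT"))  # motif embedded at offset 1
different = pssm_diff_sketch(motif, one_hot("TTT"))
```

The positive prior matters here: without it, any zero cell would make the log ratio infinite. An embedded copy of the motif yields a divergence of 0, and unrelated matrices yield a positive value.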

pssm_dataset_cor

pssm_dataset_cor(dataset1: DataFrame, dataset2: DataFrame | None = None, method: str = 'spearman', prior: float = 0.01) -> pd.DataFrame

Compute pairwise correlation matrix for PSSM datasets.

PARAMETER DESCRIPTION
dataset1

DataFrame with columns motif, A, C, G, T.

TYPE: DataFrame

dataset2

Optional second dataset. If None, computes within-set correlations.

TYPE: DataFrame | None DEFAULT: None

method

"spearman" or "pearson".

TYPE: str DEFAULT: 'spearman'

prior

Prior added before normalisation.

TYPE: float DEFAULT: 0.01

RETURNS DESCRIPTION
DataFrame

Correlation matrix as a DataFrame with motif-name row and column labels.
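
To illustrate the long-format input and the labelled matrix that comes back, the sketch below splits a dataset by its motif column and fills a motif-by-motif table. All names are hypothetical, the numbers are made up for illustration, and the scoring function is a simplified stand-in (Pearson on flattened, equal-length matrices) rather than the sliding alignment described for pssm_cor.

```python
import numpy as np
import pandas as pd

COLS = ["A", "C", "G", "T"]


def equal_length_pearson(m1: np.ndarray, m2: np.ndarray) -> float:
    """Stand-in score; assumes both matrices have the same number of rows."""
    return float(np.corrcoef(m1.ravel(), m2.ravel())[0, 1])


def pairwise_matrix_sketch(dataset1, dataset2=None, score=equal_length_pearson):
    if dataset2 is None:
        dataset2 = dataset1  # within-set comparison
    groups1 = {n: g[COLS].to_numpy(dtype=float)
               for n, g in dataset1.groupby("motif", sort=False)}
    groups2 = {n: g[COLS].to_numpy(dtype=float)
               for n, g in dataset2.groupby("motif", sort=False)}
    out = pd.DataFrame(index=list(groups1), columns=list(groups2), dtype=float)
    for name1, mat1 in groups1.items():
        for name2, mat2 in groups2.items():
            out.loc[name1, name2] = score(mat1, mat2)
    return out


# Long format: one row per motif position (illustrative values only).
dataset = pd.DataFrame(
    {
        "motif": ["m1", "m1", "m2", "m2"],
        "A": [0.7, 0.1, 0.1, 0.25],
        "C": [0.1, 0.7, 0.1, 0.25],
        "G": [0.1, 0.1, 0.7, 0.25],
        "T": [0.1, 0.1, 0.1, 0.25],
    }
)
matrix = pairwise_matrix_sketch(dataset)
```

With a single dataset the result is square and symmetric, with 1 on the diagonal. Passing a second dataset would give one row per motif in the first and one column per motif in the second.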

pssm_dataset_diff

pssm_dataset_diff(dataset1: DataFrame, dataset2: DataFrame | None = None, prior: float = 0.01) -> pd.DataFrame

Compute pairwise KL divergence matrix for PSSM datasets.

PARAMETER DESCRIPTION
dataset1

DataFrame with columns motif, A, C, G, T.

TYPE: DataFrame

dataset2

Optional second dataset. If None, computes within-set divergences.

TYPE: DataFrame | None DEFAULT: None

prior

Prior added before normalisation.

TYPE: float DEFAULT: 0.01

RETURNS DESCRIPTION
DataFrame

KL divergence matrix as a DataFrame.

pssm_match

pssm_match(pssm: DataFrame, motifs: dict[str, DataFrame] | DataFrame, best: bool = False, method: str = 'spearman', prior: float = 0.01) -> pd.DataFrame | str

Match a PSSM against a motif database.

PARAMETER DESCRIPTION
pssm

Query PSSM DataFrame.

TYPE: DataFrame

motifs

Either a dict mapping motif names to PSSM DataFrames, or a DataFrame with a motif column (long format).

TYPE: dict[str, DataFrame] | DataFrame

best

If True, return only the best-matching motif name.

TYPE: bool DEFAULT: False

method

"spearman", "pearson", or "kl".

TYPE: str DEFAULT: 'spearman'

prior

Prior added before normalisation.

TYPE: float DEFAULT: 0.01

RETURNS DESCRIPTION
DataFrame | str

If best is True, a string with the best matching motif name. Otherwise a DataFrame with columns motif and either cor or kl, sorted by decreasing similarity (correlations) or increasing divergence (KL).
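
The two return shapes can be illustrated with a small sketch that ranks already-computed scores. It does not call pyprego: rank_matches_sketch is a hypothetical name, the motif names and scores are invented for illustration, and the sketch only shows the sorting direction and the effect of best.

```python
import pandas as pd


def rank_matches_sketch(scores: dict, method: str = "spearman", best: bool = False):
    """Hypothetical helper: order per-motif scores the way the return value is described."""
    column = "kl" if method == "kl" else "cor"
    table = pd.DataFrame({"motif": list(scores), column: list(scores.values())})
    # Correlations: highest first. KL divergence: lowest first.
    table = table.sort_values(column, ascending=(method == "kl"))
    table = table.reset_index(drop=True)
    return table["motif"].iloc[0] if best else table


# Invented scores, for illustration only.
scores = {"m1": 0.12, "m2": 0.91, "m3": 0.55}
by_cor = rank_matches_sketch(scores, method="spearman")
top_cor = rank_matches_sketch(scores, method="spearman", best=True)
top_kl = rank_matches_sketch(scores, method="kl", best=True)
```

Read as correlations, the same numbers rank m2 first; read as divergences, they rank m1 first. This is why the sort direction depends on method.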