PSSM Utilities

Functions for creating, manipulating, comparing, and analysing Position-Specific Scoring Matrices.

pyprego.pssm

PSSM (Position-Specific Scoring Matrix) operations.

Functions for creating, manipulating, comparing, and analysing PSSMs, mirroring the pssm-utils.R and pssm-cor.R modules from the R prego package.

concat_pssm `module-attribute`

concat_pssm = pssm_concat

trim_pssm `module-attribute`

trim_pssm = pssm_trim

bits_per_pos

bits_per_pos(pssm: DataFrame, prior: float = 0.01) -> np.ndarray

Compute information content (bits) at each position.

Matches the R `bits_per_pos` exactly: `bits = log2(4) + sum(p * log2(p))` per position, floored at 0.

PARAMETER	DESCRIPTION
`pssm`	PSSM DataFrame with columns `A`, `C`, `G`, `T`. TYPE: `DataFrame`
`prior`	Prior probability added to each value before normalisation. TYPE: `float` DEFAULT: `0.01`

RETURNS	DESCRIPTION
`ndarray`	1-D array of bits per position (max 2 for DNA).
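The formula above can be sketched in a few lines of NumPy. This is an illustrative re-implementation under stated assumptions, not the library's code: `bits_per_pos_sketch` is a hypothetical name, and the prior is assumed to be added to every cell before each row is re-normalised.

```python
import numpy as np
import pandas as pd


def bits_per_pos_sketch(pssm: pd.DataFrame, prior: float = 0.01) -> np.ndarray:
    """Hypothetical sketch of the documented formula (not the pyprego source)."""
    probs = pssm[["A", "C", "G", "T"]].to_numpy(dtype=float) + prior
    probs = probs / probs.sum(axis=1, keepdims=True)  # re-normalise each position
    # bits = log2(4) + sum(p * log2(p)); the prior keeps every p > 0
    bits = np.log2(4) + np.sum(probs * np.log2(probs), axis=1)
    return np.maximum(bits, 0.0)  # floor at 0


# Toy PSSM: one fully specific position, one uniform position.
pssm = pd.DataFrame({
    "A": [1.0, 0.25],
    "C": [0.0, 0.25],
    "G": [0.0, 0.25],
    "T": [0.0, 0.25],
})
bits = bits_per_pos_sketch(pssm)
```

The specific position comes out somewhat below 2 bits because the prior spreads a little mass over the other three nucleotides; the uniform position carries no information.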

consensus_from_pssm

consensus_from_pssm(pssm: DataFrame, single_thresh: float = 0.5, double_thresh: float = 0.75) -> str

Derive a consensus sequence from a PSSM.

At each position the dominant nucleotide is used if its probability exceeds `single_thresh`. If two nucleotides together exceed `double_thresh` (but neither alone exceeds `single_thresh`), an IUPAC ambiguity code is used. Otherwise `N` is emitted.

PARAMETER	DESCRIPTION
`pssm`	PSSM DataFrame with columns `A`, `C`, `G`, `T`. TYPE: `DataFrame`
`single_thresh`	Threshold for a single-nucleotide call. TYPE: `float` DEFAULT: `0.5`
`double_thresh`	Threshold for a two-nucleotide ambiguity call. TYPE: `float` DEFAULT: `0.75`

RETURNS	DESCRIPTION
`str`	Consensus string.
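The thresholding rule described above might look roughly like the sketch below. `consensus_sketch` and `IUPAC_PAIRS` are hypothetical names, rows are assumed to be already normalised, and comparisons are assumed to be strict (`>`); the library may differ on any of these points.

```python
import pandas as pd

# Standard IUPAC two-nucleotide ambiguity codes, keyed by the sorted pair.
IUPAC_PAIRS = {"AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M"}


def consensus_sketch(pssm: pd.DataFrame,
                     single_thresh: float = 0.5,
                     double_thresh: float = 0.75) -> str:
    """Hypothetical sketch of the documented rule (not the pyprego source)."""
    out = []
    for _, row in pssm[["A", "C", "G", "T"]].iterrows():
        ranked = row.sort_values(ascending=False)
        if ranked.iloc[0] > single_thresh:
            out.append(ranked.index[0])            # single-nucleotide call
        elif ranked.iloc[0] + ranked.iloc[1] > double_thresh:
            pair = "".join(sorted(ranked.index[:2]))
            out.append(IUPAC_PAIRS[pair])          # two-nucleotide ambiguity code
        else:
            out.append("N")                        # no confident call
    return "".join(out)


# Toy PSSM: a clear A, an A/G split, and a uniform position.
pssm = pd.DataFrame({
    "A": [0.90, 0.45, 0.25],
    "C": [0.05, 0.05, 0.25],
    "G": [0.03, 0.45, 0.25],
    "T": [0.02, 0.05, 0.25],
})
consensus = consensus_sketch(pssm)  # "ARN"
```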

pssm_rc

pssm_rc(pssm: DataFrame) -> pd.DataFrame

Return the reverse-complement of a PSSM.

Rows are reversed and columns swapped (A<->T, C<->G).

PARAMETER	DESCRIPTION
`pssm`	PSSM DataFrame. TYPE: `DataFrame`

RETURNS	DESCRIPTION
`DataFrame`	Reverse-complemented PSSM.
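As a rough illustration of the row reversal and column swap, the sketch below uses plain pandas. `pssm_rc_sketch` is a hypothetical name, and the sketch ignores any `pos` column the real function may maintain.

```python
import pandas as pd


def pssm_rc_sketch(pssm: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: reverse the rows, then swap A<->T and C<->G."""
    rc = pssm[["A", "C", "G", "T"]].iloc[::-1].reset_index(drop=True)
    rc = rc.rename(columns={"A": "T", "T": "A", "C": "G", "G": "C"})
    return rc[["A", "C", "G", "T"]]  # restore the canonical column order


pssm = pd.DataFrame({
    "A": [0.7, 0.1],
    "C": [0.1, 0.2],
    "G": [0.1, 0.3],
    "T": [0.1, 0.4],
})
rc = pssm_rc_sketch(pssm)
# Applying the operation twice should give back the original matrix.
round_trip = pssm_rc_sketch(rc)
```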

pssm_trim

pssm_trim(pssm: DataFrame, bits_thresh: float = 0.1) -> pd.DataFrame

Trim low-information positions from the edges of a PSSM.

PARAMETER	DESCRIPTION
`pssm`	PSSM DataFrame. TYPE: `DataFrame`
`bits_thresh`	Minimum bits per position to keep at the edges. TYPE: `float` DEFAULT: `0.1`

RETURNS	DESCRIPTION
`DataFrame`	Trimmed PSSM with `pos` column reset.
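Edge trimming can be sketched as follows. Several details here are assumptions rather than documented behaviour: the name `trim_sketch`, the use of a small prior when computing bits, keeping positions with bits greater than or equal to the threshold, and renumbering `pos` from 0. Low-information positions in the interior are kept, since only the edges are trimmed.

```python
import numpy as np
import pandas as pd


def trim_sketch(pssm: pd.DataFrame, bits_thresh: float = 0.1,
                prior: float = 0.01) -> pd.DataFrame:
    """Hypothetical sketch of edge trimming (not the pyprego source)."""
    probs = pssm[["A", "C", "G", "T"]].to_numpy(dtype=float) + prior
    probs /= probs.sum(axis=1, keepdims=True)
    bits = np.maximum(2.0 + (probs * np.log2(probs)).sum(axis=1), 0.0)

    keep = np.flatnonzero(bits >= bits_thresh)   # informative positions
    if keep.size == 0:
        return pssm.iloc[0:0].copy()             # nothing informative left

    # Keep everything between the first and last informative position.
    trimmed = pssm.iloc[keep[0]:keep[-1] + 1].reset_index(drop=True)
    trimmed["pos"] = np.arange(len(trimmed))     # assumed: renumber from 0
    return trimmed


# Toy PSSM: uniform flanks around an informative core with a uniform gap.
pssm = pd.DataFrame({
    "A": [0.25, 0.97, 0.25, 0.01, 0.25],
    "C": [0.25, 0.01, 0.25, 0.97, 0.25],
    "G": [0.25, 0.01, 0.25, 0.01, 0.25],
    "T": [0.25, 0.01, 0.25, 0.01, 0.25],
})
trimmed = trim_sketch(pssm)
```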

pssm_add_prior

pssm_add_prior(pssm: DataFrame, prior: float) -> pd.DataFrame

Add a uniform prior to a PSSM and re-normalise.

`new = (pssm + prior) / rowSums(pssm + prior)`

PARAMETER	DESCRIPTION
`pssm`	PSSM DataFrame. TYPE: `DataFrame`
`prior`	Prior value added to each cell. TYPE: `float`

RETURNS	DESCRIPTION
`DataFrame`	New PSSM DataFrame with the prior applied.
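The formula above translates directly to pandas and NumPy. The sketch below is illustrative only; `add_prior_sketch` is a hypothetical name, and the input columns are assumed to hold floats.

```python
import pandas as pd


def add_prior_sketch(pssm: pd.DataFrame, prior: float) -> pd.DataFrame:
    """Hypothetical sketch of (pssm + prior) / rowSums(pssm + prior)."""
    cols = ["A", "C", "G", "T"]
    out = pssm.copy()                               # leave the input untouched
    vals = out[cols].to_numpy(dtype=float) + prior
    out[cols] = vals / vals.sum(axis=1, keepdims=True)
    return out


pssm = pd.DataFrame({
    "A": [1.0, 0.5],
    "C": [0.0, 0.5],
    "G": [0.0, 0.0],
    "T": [0.0, 0.0],
})
smoothed = add_prior_sketch(pssm, prior=0.01)
row_sums = smoothed[["A", "C", "G", "T"]].sum(axis=1)
```

After smoothing, zero cells become small positive values, so later log-based scores stay finite, and each row still sums to 1.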

pssm_theoretical_max

pssm_theoretical_max(pssm: DataFrame, prior: float = 0.01, regularization: float = 0.01) -> float

Theoretical maximum log-likelihood score for a PSSM.

Matches the R implementation: `sum(log(regularization + rowMax(normalized_pssm)))`.

PARAMETER	DESCRIPTION
`pssm`	PSSM DataFrame. TYPE: `DataFrame`
`prior`	Prior added before normalisation. TYPE: `float` DEFAULT: `0.01`
`regularization`	Value added inside the log to prevent `-Inf`. TYPE: `float` DEFAULT: `0.01`

RETURNS	DESCRIPTION
`float`	Maximum possible score.
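A minimal sketch of the expression above, assuming that "normalized" means the prior is added and each row is rescaled to sum to 1, and that `log` is the natural logarithm as in R. `theoretical_max_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def theoretical_max_sketch(pssm: pd.DataFrame, prior: float = 0.01,
                           regularization: float = 0.01) -> float:
    """Hypothetical sketch of sum(log(regularization + rowMax(normalized)))."""
    probs = pssm[["A", "C", "G", "T"]].to_numpy(dtype=float) + prior
    probs /= probs.sum(axis=1, keepdims=True)
    # The best-scoring sequence picks the most likely nucleotide at each position.
    return float(np.log(regularization + probs.max(axis=1)).sum())


# Toy PSSM with two fully specific positions.
pssm = pd.DataFrame({
    "A": [1.0, 0.0],
    "C": [0.0, 1.0],
    "G": [0.0, 0.0],
    "T": [0.0, 0.0],
})
max_score = theoretical_max_sketch(pssm)
```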

pssm_theoretical_min

pssm_theoretical_min(pssm: DataFrame, prior: float = 0.01, regularization: float = 0.01) -> float

Theoretical minimum log-likelihood score for a PSSM.

PARAMETER	DESCRIPTION
`pssm`	PSSM DataFrame. TYPE: `DataFrame`
`prior`	Prior added before normalisation. TYPE: `float` DEFAULT: `0.01`
`regularization`	Value added inside the log to prevent `-Inf`. TYPE: `float` DEFAULT: `0.01`

RETURNS	DESCRIPTION
`float`	Minimum possible score.

pssm_quantile

pssm_quantile(pssm: DataFrame, q: float, prior: float = 0.01, regularization: float = 0.01) -> float

Quantile of the theoretical score distribution.

Linearly interpolates between `pssm_theoretical_min` and `pssm_theoretical_max`.

PARAMETER	DESCRIPTION
`pssm`	PSSM DataFrame. TYPE: `DataFrame`
`q`	Quantile (0 to 1). TYPE: `float`
`prior`	Passed through to the min/max functions. TYPE: `float` DEFAULT: `0.01`
`regularization`	Passed through to the min/max functions. TYPE: `float` DEFAULT: `0.01`

RETURNS	DESCRIPTION
`float`	Score at the requested quantile.
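The interpolation itself is a one-liner. The sketch below takes the two bounds as plain numbers so that it makes no assumption about how they are computed; `interpolate_score` is a hypothetical helper, the range check on `q` is an assumption, and the example bounds are made-up values.

```python
def interpolate_score(score_min: float, score_max: float, q: float) -> float:
    """Hypothetical sketch: linear interpolation between the two score bounds."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie between 0 and 1")  # assumed validation
    return score_min + q * (score_max - score_min)


# Made-up bounds for illustration only.
lo, hi = -20.0, -2.0
median_like = interpolate_score(lo, hi, 0.5)   # halfway between the bounds
```

Under this reading, `q = 0` returns the theoretical minimum and `q = 1` the theoretical maximum.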

pssm_concat

pssm_concat(*pssms: DataFrame, gap: int = 0) -> pd.DataFrame

Concatenate multiple PSSMs vertically, optionally with a gap.

PARAMETER	DESCRIPTION
`*pssms`	PSSM DataFrames to concatenate. TYPE: `DataFrame` DEFAULT: `()`
`gap`	Number of uniform (0.25 each) positions to insert between successive PSSMs. TYPE: `int` DEFAULT: `0`

RETURNS	DESCRIPTION
`DataFrame`	Combined PSSM with `pos` column renumbered from 0.
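Concatenation with a uniform spacer could be sketched like this. `concat_sketch` is a hypothetical name, and the sketch keeps only the four nucleotide columns plus a fresh `pos` column; the real function may carry other columns along.

```python
import pandas as pd


def concat_sketch(*pssms: pd.DataFrame, gap: int = 0) -> pd.DataFrame:
    """Hypothetical sketch of vertical concatenation with uniform gap rows."""
    cols = ["A", "C", "G", "T"]
    spacer = pd.DataFrame(0.25, index=range(gap), columns=cols)
    parts = []
    for i, pssm in enumerate(pssms):
        if i > 0 and gap > 0:
            parts.append(spacer)                 # uniform positions between motifs
        parts.append(pssm[cols])
    out = pd.concat(parts, ignore_index=True)
    out.insert(0, "pos", list(range(len(out))))  # renumber from 0
    return out


left = pd.DataFrame({"A": [0.7, 0.1], "C": [0.1, 0.7],
                     "G": [0.1, 0.1], "T": [0.1, 0.1]})
right = pd.DataFrame({"A": [0.1, 0.1], "C": [0.1, 0.1],
                      "G": [0.7, 0.1], "T": [0.1, 0.7]})
combined = concat_sketch(left, right, gap=1)
```

Here two 2-position motifs joined with `gap=1` give a 5-position PSSM whose middle row is uniform.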

pssm_cor

pssm_cor(pssm1: DataFrame, pssm2: DataFrame, method: str = 'spearman', prior: float = 0.01) -> float

Compute correlation between two PSSMs at the best alignment.

The shorter PSSM is slid along the longer one, and the maximum correlation (Spearman or Pearson) across all alignments is returned.

PARAMETER	DESCRIPTION
`pssm1`	First PSSM DataFrame with columns `A`, `C`, `G`, `T`. TYPE: `DataFrame`
`pssm2`	Second PSSM DataFrame with columns `A`, `C`, `G`, `T`. TYPE: `DataFrame`
`method`	`"spearman"` (default) or `"pearson"`. TYPE: `str` DEFAULT: `'spearman'`
`prior`	Prior added before normalisation. TYPE: `float` DEFAULT: `0.01`

RETURNS	DESCRIPTION
`float`	Maximum correlation across all alignments.

RAISES	DESCRIPTION
`ValueError`	If either PSSM is empty.
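The sliding comparison can be sketched as below. This is a simplified reading of the description: only full overlaps of the shorter PSSM within the longer one are considered, each window is flattened to a vector before correlating, and Spearman is computed as Pearson on ranks. `sliding_cor_sketch` and `_normalise` are hypothetical names; the library may handle partial overlaps, strands, or ties differently.

```python
import numpy as np
import pandas as pd

COLS = ["A", "C", "G", "T"]


def _normalise(pssm: pd.DataFrame, prior: float) -> np.ndarray:
    m = pssm[COLS].to_numpy(dtype=float) + prior
    return m / m.sum(axis=1, keepdims=True)


def sliding_cor_sketch(pssm1, pssm2, method="spearman", prior=0.01) -> float:
    """Hypothetical sketch: best correlation over all full-overlap offsets."""
    if len(pssm1) == 0 or len(pssm2) == 0:
        raise ValueError("PSSMs must not be empty")
    a, b = _normalise(pssm1, prior), _normalise(pssm2, prior)
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    x = short.ravel()
    best = -np.inf
    for offset in range(len(long_) - len(short) + 1):
        y = long_[offset:offset + len(short)].ravel()
        if method == "spearman":
            # Spearman correlation is Pearson correlation of the ranks.
            r = np.corrcoef(pd.Series(x).rank(), pd.Series(y).rank())[0, 1]
        else:
            r = np.corrcoef(x, y)[0, 1]
        if r > best:   # NaN (constant window) never wins this comparison
            best = r
    return float(best)


long_motif = pd.DataFrame({
    "A": [0.7, 0.1, 0.1, 0.1],
    "C": [0.1, 0.7, 0.1, 0.1],
    "G": [0.1, 0.1, 0.7, 0.1],
    "T": [0.1, 0.1, 0.1, 0.7],
})
core = long_motif.iloc[1:3].reset_index(drop=True)  # exact sub-motif
r_spearman = sliding_cor_sketch(long_motif, core)
r_pearson = sliding_cor_sketch(core, long_motif, method="pearson")
```

Because `core` is an exact slice of `long_motif`, one offset aligns them perfectly and both correlations reach 1 (up to floating-point error), regardless of argument order.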

pssm_diff

pssm_diff(pssm1: DataFrame, pssm2: DataFrame, prior: float = 0.01) -> float

Compute symmetric KL divergence between two PSSMs at the best alignment.

The shorter PSSM is slid along the longer one, and the minimum (symmetric) KL divergence across all alignments is returned.

PARAMETER	DESCRIPTION
`pssm1`	First PSSM DataFrame. TYPE: `DataFrame`
`pssm2`	Second PSSM DataFrame. TYPE: `DataFrame`
`prior`	Prior added before normalisation. TYPE: `float` DEFAULT: `0.01`

RETURNS	DESCRIPTION
`float`	Minimum symmetric KL divergence (lower = more similar).
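The same sliding scheme works with a divergence instead of a correlation. The sketch below assumes the symmetric KL divergence is `KL(P||Q) + KL(Q||P)` summed over all positions of the window; the library may scale or average it differently. `sliding_kl_sketch` is a hypothetical name, and only full overlaps are considered.

```python
import numpy as np
import pandas as pd

COLS = ["A", "C", "G", "T"]


def sliding_kl_sketch(pssm1: pd.DataFrame, pssm2: pd.DataFrame,
                      prior: float = 0.01) -> float:
    """Hypothetical sketch: minimum symmetric KL over full-overlap offsets."""
    def normalise(pssm):
        m = pssm[COLS].to_numpy(dtype=float) + prior
        return m / m.sum(axis=1, keepdims=True)

    a, b = normalise(pssm1), normalise(pssm2)
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    best = np.inf
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        # The prior keeps every probability > 0, so the logs are finite.
        kl = (np.sum(short * np.log(short / window))
              + np.sum(window * np.log(window / short)))
        best = min(best, kl)
    return float(best)


long_motif = pd.DataFrame({
    "A": [0.7, 0.1, 0.1],
    "C": [0.1, 0.7, 0.1],
    "G": [0.1, 0.1, 0.7],
    "T": [0.1, 0.1, 0.1],
})
core = long_motif.iloc[1:3].reset_index(drop=True)
shifted = long_motif.iloc[0:2].reset_index(drop=True)
d_same = sliding_kl_sketch(long_motif, core)   # an exact sub-motif
d_other = sliding_kl_sketch(core, shifted)     # equal length, different content
```

An exact sub-motif gives a divergence of 0 at its matching offset, while two different motifs of equal length give a strictly positive value.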

pssm_dataset_cor

pssm_dataset_cor(dataset1: DataFrame, dataset2: DataFrame | None = None, method: str = 'spearman', prior: float = 0.01) -> pd.DataFrame

Compute pairwise correlation matrix for PSSM datasets.

PARAMETER	DESCRIPTION
`dataset1`	DataFrame with columns `motif`, `A`, `C`, `G`, `T`. TYPE: `DataFrame`
`dataset2`	Optional second dataset. If `None`, computes within-set correlations. TYPE: `DataFrame \| None` DEFAULT: `None`
`method`	`"spearman"` or `"pearson"`. TYPE: `str` DEFAULT: `'spearman'`
`prior`	Prior for normalisation. TYPE: `float` DEFAULT: `0.01`

RETURNS	DESCRIPTION
`DataFrame`	Correlation matrix as a DataFrame with motif-name row and column labels.
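To illustrate the long-format input and the labelled output, the sketch below splits a dataset by its `motif` column and fills a square matrix with a pluggable scoring function. `pairwise_matrix_sketch` and `flat_pearson` are hypothetical names, and `flat_pearson` is a deliberately simple stand-in that only handles motifs of equal length; it is not the alignment-based correlation described for `pssm_cor`.

```python
import numpy as np
import pandas as pd

COLS = ["A", "C", "G", "T"]


def pairwise_matrix_sketch(dataset: pd.DataFrame, score) -> pd.DataFrame:
    """Hypothetical sketch: within-set score matrix labelled by motif name."""
    motifs = {
        name: grp[COLS].reset_index(drop=True)
        for name, grp in dataset.groupby("motif", sort=False)
    }
    names = list(motifs)
    mat = pd.DataFrame(np.nan, index=names, columns=names)
    for i in names:
        for j in names:
            mat.loc[i, j] = score(motifs[i], motifs[j])
    return mat


def flat_pearson(p1: pd.DataFrame, p2: pd.DataFrame) -> float:
    # Toy scorer (assumption): Pearson on flattened, equal-length matrices.
    return float(np.corrcoef(p1.to_numpy().ravel(), p2.to_numpy().ravel())[0, 1])


# Long-format toy dataset: one row per motif position.
dataset = pd.DataFrame({
    "motif": ["m1", "m1", "m2", "m2"],
    "A": [0.7, 0.1, 0.1, 0.1],
    "C": [0.1, 0.7, 0.1, 0.1],
    "G": [0.1, 0.1, 0.7, 0.1],
    "T": [0.1, 0.1, 0.1, 0.7],
})
cor_matrix = pairwise_matrix_sketch(dataset, flat_pearson)
```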

pssm_dataset_diff

pssm_dataset_diff(dataset1: DataFrame, dataset2: DataFrame | None = None, prior: float = 0.01) -> pd.DataFrame

Compute pairwise KL divergence matrix for PSSM datasets.

PARAMETER	DESCRIPTION
`dataset1`	DataFrame with columns `motif`, `A`, `C`, `G`, `T`. TYPE: `DataFrame`
`dataset2`	Optional second dataset. TYPE: `DataFrame \| None` DEFAULT: `None`
`prior`	Prior for normalisation. TYPE: `float` DEFAULT: `0.01`

RETURNS	DESCRIPTION
`DataFrame`	KL divergence matrix as a DataFrame.

pssm_match

pssm_match(pssm: DataFrame, motifs: dict[str, DataFrame] | DataFrame, best: bool = False, method: str = 'spearman', prior: float = 0.01) -> pd.DataFrame | str

Match a PSSM against a motif database.

PARAMETER	DESCRIPTION
`pssm`	Query PSSM DataFrame. TYPE: `DataFrame`
`motifs`	Either a dict mapping motif names to PSSM DataFrames, or a DataFrame with a `motif` column (long format). TYPE: `dict[str, DataFrame] \| DataFrame`
`best`	If `True`, return only the best-matching motif name. TYPE: `bool` DEFAULT: `False`
`method`	`"spearman"`, `"pearson"`, or `"kl"`. TYPE: `str` DEFAULT: `'spearman'`
`prior`	Prior for normalisation. TYPE: `float` DEFAULT: `0.01`

RETURNS	DESCRIPTION
`DataFrame \| str`	If `best` is `True`, a string with the best-matching motif name. Otherwise a DataFrame with columns `motif` and either `cor` or `kl`, sorted by decreasing similarity (correlations) or increasing divergence (KL).
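The return shape described above can be illustrated with a small ranking helper that starts from already computed scores. `rank_matches_sketch` is a hypothetical name, and the motif names and scores are made-up toy values, not results from any database.

```python
import pandas as pd


def rank_matches_sketch(scores: dict[str, float], method: str = "spearman",
                        best: bool = False):
    """Hypothetical sketch of the documented output format and ordering."""
    col = "kl" if method == "kl" else "cor"
    table = pd.DataFrame({"motif": list(scores), col: list(scores.values())})
    # Correlations: highest first. KL divergence: lowest first.
    table = table.sort_values(col, ascending=(method == "kl"))
    table = table.reset_index(drop=True)
    return table["motif"].iloc[0] if best else table


toy_scores = {"motif_a": 0.91, "motif_b": 0.35, "motif_c": 0.62}
ranked = rank_matches_sketch(toy_scores)                         # full table
top_by_cor = rank_matches_sketch(toy_scores, best=True)          # "motif_a"
top_by_kl = rank_matches_sketch(toy_scores, method="kl", best=True)  # "motif_b"
```

The same numbers rank in opposite directions depending on `method`, because a high correlation and a low divergence both mean "more similar".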