K-mer Operations

K-mer generation, counting, screening, and conversion to/from PSSMs.

pyprego.kmers

K-mer generation, scoring, and PSSM conversion.

Mirrors kmers.R and kmer-regression.R from the R prego package.

generate_kmers

generate_kmers(k: int, alphabet: str = 'ACGT', max_gap: int = 0, min_gap: int = 0) -> list[str]

Generate all possible DNA k-mers of length k, optionally with gaps.

Gaps are represented by 'N' at certain positions in the k-mer. When max_gap > 0, the function generates k-mers where 1..max_gap contiguous positions are replaced with 'N', at every possible offset within the k-mer.

This mirrors the R generate_kmers() function in kmers.R.

PARAMETER DESCRIPTION
k

K-mer length (number of positions, including gap positions). Must be >= 1.

TYPE: int

alphabet

Nucleotide alphabet (default "ACGT").

TYPE: str DEFAULT: 'ACGT'

max_gap

Maximum number of contiguous gap (wildcard N) positions. Default 0 means no gaps.

TYPE: int DEFAULT: 0

min_gap

Minimum gap length. Default 0.

TYPE: int DEFAULT: 0

RETURNS DESCRIPTION
list[str]

List of k-mers. With max_gap=0 this is all len(alphabet)^k standard k-mers (4^k for the default alphabet). With max_gap > 0 the gapped variants are included as well.

RAISES DESCRIPTION
ValueError

If parameters are invalid.
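
The gap scheme described above can be sketched in plain Python. This is an illustrative reconstruction, not pyprego's code: the name sketch_generate_kmers is hypothetical, and it assumes that ungapped k-mers are always returned, that a gap may start at any offset (including the ends), and that a gap must be shorter than k.

```python
from itertools import product


def sketch_generate_kmers(k, alphabet="ACGT", max_gap=0, min_gap=0):
    """Hypothetical re-implementation for illustration; not pyprego's code."""
    if k < 1:
        raise ValueError("k must be >= 1")
    if min_gap < 0 or max_gap < min_gap or max_gap >= k:
        # Assumed validity rules; the documented function may differ.
        raise ValueError("need 0 <= min_gap <= max_gap < k")

    # All ungapped k-mers: len(alphabet) ** k strings.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]

    # Gapped variants: a run of `gap` N's at each offset, with the
    # remaining k - gap positions filled from the alphabet.
    for gap in range(max(min_gap, 1), max_gap + 1):
        for offset in range(k - gap + 1):
            for p in product(alphabet, repeat=k - gap):
                letters = "".join(p)
                kmers.append(letters[:offset] + "N" * gap + letters[offset:])
    return kmers
```

Under these assumptions, k=3 with max_gap=1 gives the 64 ungapped 3-mers plus 3 offsets times 16 fillings, i.e. 48 gapped ones.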

kmer_matrix

kmer_matrix(sequences: list[str] | ndarray, kmers: list[str] | int, max_gap: int = 0) -> pd.DataFrame

Count k-mer occurrences in each sequence.

If kmers is an integer, it is treated as the k-mer length and all standard k-mers (plus gapped variants if max_gap > 0) are generated. If kmers is a list of strings, those exact k-mers are counted.

For k-mers containing 'N' (wildcard), any nucleotide at that position is considered a match.

PARAMETER DESCRIPTION
sequences

DNA sequences.

TYPE: list[str] | ndarray

kmers

Either a k-mer length (int) or an explicit list of k-mer strings.

TYPE: list[str] | int

max_gap

Maximum gap length for auto-generated k-mers (only used when kmers is an int). Default 0.

TYPE: int DEFAULT: 0

RETURNS DESCRIPTION
DataFrame

DataFrame of shape (n_sequences, n_kmers) with occurrence counts. Columns are the k-mer strings.
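
A minimal sketch of wildcard-aware counting, using only the standard library. The helper sketch_kmer_counts is hypothetical and returns a list of rows rather than a DataFrame. It also assumes that overlapping occurrences are counted separately, which the documentation above does not state.

```python
import re


def sketch_kmer_counts(sequences, kmers):
    """Hypothetical helper: rows are sequences, columns follow `kmers`."""
    # 'N' becomes a character class that accepts any nucleotide. Wrapping
    # the pattern in a lookahead makes it match at every start position,
    # so overlapping occurrences are each counted (an assumption).
    patterns = [
        re.compile("(?=" + kmer.replace("N", "[ACGT]") + ")") for kmer in kmers
    ]
    return [[len(p.findall(seq)) for p in patterns] for seq in sequences]


counts = sketch_kmer_counts(["ACGACG", "AAAA"], ["ACG", "ANG", "AA"])
# counts == [[2, 2, 0], [0, 0, 3]]
```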

screen_kmers

screen_kmers(sequences: list[str] | ndarray, response: ndarray, kmer_len: int | None = None, kmers: list[str] | None = None, max_gap: int = 0, min_gap: int = 0, seed: int | None = None, min_cor: float = 0.0) -> pd.DataFrame

Screen k-mers for correlation with response variable(s).

For each k-mer, compute its frequency across sequences and correlate with response. This mirrors the R screen_kmers() function.

PARAMETER DESCRIPTION
sequences

DNA sequences.

TYPE: list[str] | ndarray

response

Response variable(s). Shape (n_sequences,) for a single response, or (n_sequences, n_responses) for multiple response columns.

TYPE: ndarray

kmer_len

K-mer length. Either this or kmers must be provided.

TYPE: int | None DEFAULT: None

kmers

Explicit list of k-mers to screen. If given, overrides kmer_len.

TYPE: list[str] | None DEFAULT: None

max_gap

Maximum gap length. Default 0.

TYPE: int DEFAULT: 0

min_gap

Minimum gap length. Default 0.

TYPE: int DEFAULT: 0

seed

Random seed for reproducibility. Currently this only sets the NumPy seed.

TYPE: int | None DEFAULT: None

min_cor

Minimum absolute correlation to include in the result. Default 0.0 (include all).

TYPE: float DEFAULT: 0.0

RETURNS DESCRIPTION
DataFrame

DataFrame with columns:

  • kmer: the k-mer string
  • max_r2: maximum R^2 across response columns
  • avg_n: average count of the k-mer per sequence
  • avg_var: variance of the count across sequences
  • One column per response variable with the Pearson correlation

Sorted by max_r2 descending.

RAISES DESCRIPTION
ValueError

If neither kmer_len nor kmers is provided, or if dimensions mismatch.
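
The screening step can be illustrated for a single response column with a hand-rolled Pearson correlation. Everything here is a sketch under stated assumptions: sketch_screen and pearson are hypothetical names, the input is a precomputed mapping from k-mer to per-sequence counts, and k-mers with constant counts are dropped because their correlation is undefined (the library may handle that case differently).

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def sketch_screen(counts_by_kmer, response, min_cor=0.0):
    """Hypothetical single-response screen; not pyprego's implementation."""
    rows = []
    for kmer, counts in counts_by_kmer.items():
        if len(counts) != len(response):
            raise ValueError("counts and response differ in length")
        r = pearson(counts, response)
        # Skip undefined correlations and those weaker than min_cor.
        if r is None or abs(r) < min_cor:
            continue
        rows.append({"kmer": kmer, "r2": r * r, "r": r})
    # Strongest association first, as in the documented sort order.
    return sorted(rows, key=lambda row: row["r2"], reverse=True)
```

Sorting on the squared correlation keeps strongly negative k-mers near the top alongside strongly positive ones.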

kmers_to_pssm

kmers_to_pssm(kmer: str | list[str], prior: float = 0.01) -> pd.DataFrame

Convert k-mer string(s) to a PSSM DataFrame.

Each position assigns a high probability to the matching nucleotide and a low probability (prior) to the others. Gap positions (N) get a uniform 0.25 for each nucleotide.

This mirrors the R kmers_to_pssm() function, which accepts a vector of k-mers and returns a combined DataFrame with a kmer column.

PARAMETER DESCRIPTION
kmer

K-mer string or list of k-mers. May contain N for wildcard positions.

TYPE: str | list[str]

prior

Prior probability for non-matching nucleotides. Default 0.01.

TYPE: float DEFAULT: 0.01

RETURNS DESCRIPTION
DataFrame

DataFrame with columns kmer, pos, A, C, G, T. Each row sums to 1 across the four nucleotides.

RAISES DESCRIPTION
ValueError

If k-mer contains invalid characters.
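
As a rough illustration of the row layout, the sketch below builds one row per position as a plain dict. The name sketch_kmer_to_pssm is hypothetical, and two details are assumptions rather than documented behavior: the matching nucleotide receives 1 - 3 * prior so that each row sums to 1, and pos is 0-based.

```python
def sketch_kmer_to_pssm(kmer, prior=0.01):
    """Hypothetical illustration; the library's normalization may differ."""
    bases = "ACGT"
    rows = []
    for pos, base in enumerate(kmer):
        if base == "N":
            # Wildcard position: uniform over the four nucleotides.
            probs = {b: 0.25 for b in bases}
        elif base in bases:
            # Assumed split: prior for each mismatch, the rest for the match.
            probs = {b: prior for b in bases}
            probs[base] = 1 - 3 * prior
        else:
            raise ValueError(f"invalid character {base!r} in k-mer {kmer!r}")
        rows.append({"kmer": kmer, "pos": pos, **probs})
    return rows
```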

pssm_to_kmer

pssm_to_kmer(pssm: DataFrame, kmer_length: int | None = None, pos_bits_thresh: float | None = 0.5, prior: float = 0.01) -> str

Convert PSSM back to a k-mer string.

Finds the window of kmer_length positions with the highest total information content, then at each position uses the dominant nucleotide. If pos_bits_thresh is set, positions whose information content falls below the threshold are replaced with N (wildcard).

This mirrors the R pssm_to_kmer() function.

PARAMETER DESCRIPTION
pssm

PSSM DataFrame with columns A, C, G, T.

TYPE: DataFrame

kmer_length

Length of the returned k-mer. If None, uses the full PSSM length.

TYPE: int | None DEFAULT: None

pos_bits_thresh

Minimum information content (bits) per position. Positions below this threshold are set to N. If None, all positions use the dominant nucleotide.

TYPE: float | None DEFAULT: 0.5

prior

Prior added before computing bits. Default 0.01.

TYPE: float DEFAULT: 0.01

RETURNS DESCRIPTION
str

K-mer string, possibly containing N for low-information positions.

RAISES DESCRIPTION
ValueError

If PSSM has fewer rows than kmer_length.
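
The window search can be sketched as follows. The helper sketch_pssm_to_kmer is hypothetical and takes a list of dicts with keys A, C, G, T instead of a DataFrame. It assumes information content is measured as 2 + sum(p * log2(p)) bits, computed after adding the prior and renormalizing each row. The documentation does not spell out the exact formula, so treat this as one plausible reading.

```python
import math


def sketch_pssm_to_kmer(rows, kmer_length=None, pos_bits_thresh=0.5, prior=0.01):
    """Hypothetical illustration; not pyprego's implementation."""
    bases = "ACGT"
    n = len(rows)
    k = n if kmer_length is None else kmer_length
    if n < k:
        raise ValueError("PSSM has fewer rows than kmer_length")

    # Add the prior to every cell and renormalize each row to sum to 1.
    norm = []
    for row in rows:
        vals = [row[b] + prior for b in bases]
        total = sum(vals)
        norm.append([v / total for v in vals])

    # Assumed bits per position: 2 minus the Shannon entropy of the row.
    bits = [2.0 + sum(p * math.log2(p) for p in probs if p > 0) for probs in norm]

    # Start of the length-k window with the largest summed information.
    start = max(range(n - k + 1), key=lambda i: sum(bits[i:i + k]))

    letters = []
    for i in range(start, start + k):
        if pos_bits_thresh is not None and bits[i] < pos_bits_thresh:
            letters.append("N")  # low-information position
        else:
            letters.append(bases[norm[i].index(max(norm[i]))])
    return "".join(letters)
```

A uniform row carries about 0 bits under this definition, so it falls below the default threshold of 0.5 and is emitted as N.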