K-mer Operations

K-mer generation, counting, screening, and conversion to/from PSSMs.

pyprego.kmers

K-mer generation, scoring, and PSSM conversion.

Mirrors kmers.R and kmer-regression.R from the R prego package.

generate_kmers

generate_kmers(k: int, alphabet: str = 'ACGT', max_gap: int = 0, min_gap: int = 0) -> list[str]

Generate all possible DNA k-mers of length k, optionally with gaps.

Gaps are represented by 'N' at certain positions in the k-mer. When max_gap > 0, the function generates k-mers where 1..max_gap contiguous positions are replaced with 'N', at every possible offset within the k-mer.

This mirrors the R generate_kmers() function in kmers.R.

PARAMETER	DESCRIPTION
`k`	K-mer length (number of positions, including gap positions). Must be >= 1. TYPE: `int`
`alphabet`	Nucleotide alphabet (default `"ACGT"`). TYPE: `str` DEFAULT: `'ACGT'`
`max_gap`	Maximum number of contiguous gap (wildcard `N`) positions. Default 0 means no gaps. TYPE: `int` DEFAULT: `0`
`min_gap`	Minimum gap length. Default 0. TYPE: `int` DEFAULT: `0`

RETURNS	DESCRIPTION
`list[str]`	List of k-mers. With `max_gap=0` this is all `len(alphabet)^k` standard k-mers (`4^k` for the default alphabet). With gaps, the gapped variants are included as well.

RAISES	DESCRIPTION
`ValueError`	If parameters are invalid.
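
The gap expansion described above can be sketched in plain Python. This is an illustrative re-implementation, not pyprego's code: the name `gapped_kmers_sketch`, the validation rules, the output order, and the choice to allow a gap at any offset (including the ends of the k-mer) are all assumptions.

```python
from itertools import product


def gapped_kmers_sketch(k, alphabet="ACGT", max_gap=0, min_gap=0):
    """Hypothetical stand-in for generate_kmers(); not the library code."""
    # Assumed validation: the real function may accept or reject other inputs.
    if k < 1:
        raise ValueError("k must be >= 1")
    if not 0 <= min_gap <= max_gap < k:
        raise ValueError("need 0 <= min_gap <= max_gap < k")

    # All ungapped k-mers over the alphabet.
    base = ["".join(p) for p in product(alphabet, repeat=k)]

    # A dict keeps insertion order and removes duplicate gapped variants.
    seen = dict.fromkeys(base)
    for gap in range(max(min_gap, 1), max_gap + 1):
        # Assumption: the gap may start at any offset, ends included.
        for start in range(k - gap + 1):
            for kmer in base:
                gapped = kmer[:start] + "N" * gap + kmer[start + gap:]
                seen.setdefault(gapped)
    return list(seen)


plain = gapped_kmers_sketch(2)
with_gaps = gapped_kmers_sketch(2, max_gap=1)
print(len(plain), len(with_gaps))
```

Under these assumptions, `k=2` with `max_gap=1` yields the 16 ungapped 2-mers plus 8 variants such as `AN` and `NA`; the library's actual counts may differ if it restricts where gaps can sit.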

kmer_matrix

kmer_matrix(sequences: list[str] | ndarray, kmers: list[str] | int, max_gap: int = 0) -> pd.DataFrame

Count k-mer occurrences in each sequence.

If kmers is an integer, it is treated as the k-mer length and all standard k-mers (plus gapped variants if max_gap > 0) are generated. If kmers is a list of strings, those exact k-mers are counted.

For k-mers containing 'N' (wildcard), any nucleotide at that position is considered a match.

PARAMETER	DESCRIPTION
`sequences`	DNA sequences. TYPE: `list[str] \| ndarray`
`kmers`	Either a k-mer length (int) or an explicit list of k-mer strings. TYPE: `list[str] \| int`
`max_gap`	Maximum gap length for auto-generated k-mers (only used when kmers is an int). Default 0. TYPE: `int` DEFAULT: `0`

RETURNS	DESCRIPTION
`DataFrame`	DataFrame of shape `(n_sequences, n_kmers)` with occurrence counts. Columns are the k-mer strings.
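
Wildcard matching is easiest to see on a small example. The sketch below counts k-mer occurrences with `N` matching any character, using only the standard library. The helper names `count_kmer` and `kmer_counts_sketch` are hypothetical, overlapping occurrences are assumed to count separately, and the result is a list of dicts rather than the DataFrame that `kmer_matrix` returns.

```python
def count_kmer(seq, kmer):
    """Count windows of seq that match kmer; 'N' in kmer matches anything."""
    k = len(kmer)
    return sum(
        all(p == "N" or p == s for p, s in zip(kmer, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )


def kmer_counts_sketch(sequences, kmers):
    """One dict per sequence, keyed by k-mer (rows of the count matrix)."""
    return [{kmer: count_kmer(seq, kmer) for kmer in kmers} for seq in sequences]


rows = kmer_counts_sketch(["ACGTACGT", "ACGTAGGT"], ["ACG", "ANG"])
for row in rows:
    print(row)
```

In the second sequence, `ANG` matches both `ACG` and `AGG`, so its count is 2 while the exact k-mer `ACG` is counted once.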

screen_kmers

screen_kmers(sequences: list[str] | ndarray, response: ndarray, kmer_len: int | None = None, kmers: list[str] | None = None, max_gap: int = 0, min_gap: int = 0, seed: int | None = None, min_cor: float = 0.0) -> pd.DataFrame

Screen k-mers for correlation with response variable(s).

For each k-mer, count its occurrences in each sequence and compute the Pearson correlation of those counts with each response column. This mirrors the R screen_kmers() function.

PARAMETER	DESCRIPTION
`sequences`	DNA sequences. TYPE: `list[str] \| ndarray`
`response`	Response variable(s). Shape `(n_sequences,)` for a single response, or `(n_sequences, n_responses)` for multiple response columns. TYPE: `ndarray`
`kmer_len`	K-mer length. Either this or kmers must be provided. TYPE: `int \| None` DEFAULT: `None`
`kmers`	Explicit list of k-mers to screen. If given, overrides kmer_len. TYPE: `list[str] \| None` DEFAULT: `None`
`max_gap`	Maximum gap length. Default 0. TYPE: `int` DEFAULT: `0`
`min_gap`	Minimum gap length. Default 0. TYPE: `int` DEFAULT: `0`
`seed`	Random seed (for reproducibility; currently only sets numpy seed). TYPE: `int \| None` DEFAULT: `None`
`min_cor`	Minimum absolute correlation to include in the result. Default 0.0 (include all). TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`DataFrame`	DataFrame with columns `kmer` (the k-mer string), `max_r2` (maximum R^2 across response columns), `avg_n` (average count of the k-mer per sequence), and `avg_var` (variance of the count across sequences), plus one column per response variable holding the Pearson correlation. Rows are sorted by `max_r2` in descending order.

RAISES	DESCRIPTION
`ValueError`	If neither kmer_len nor kmers is provided, or if dimensions mismatch.
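
The screening step can be sketched for a single response vector as follows. This is a simplified illustration, not the library's implementation: `pearson`, `count_kmer`, and `screen_sketch` are hypothetical names, only one response column is handled (so `max_r2` is just `r` squared), `avg_var` is omitted, and dropping k-mers with constant counts is an assumption.

```python
import math


def count_kmer(seq, kmer):
    """Count (overlapping) matches of kmer in seq; 'N' matches anything."""
    k = len(kmer)
    return sum(
        all(p == "N" or p == s for p, s in zip(kmer, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )


def pearson(x, y):
    """Pearson correlation; NaN when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def screen_sketch(sequences, response, kmers, min_cor=0.0):
    if len(sequences) != len(response):
        raise ValueError("sequences and response must have the same length")
    rows = []
    for kmer in kmers:
        counts = [count_kmer(s, kmer) for s in sequences]
        r = pearson(counts, response)
        # Assumption: undefined or weak correlations are dropped.
        if math.isnan(r) or abs(r) < min_cor:
            continue
        rows.append({"kmer": kmer, "max_r2": r * r,
                     "avg_n": sum(counts) / len(counts), "r": r})
    rows.sort(key=lambda row: row["max_r2"], reverse=True)
    return rows


seqs = ["ACGACG", "ACGTTT", "TTTTTT", "GGGGGG"]
resp = [2.0, 1.0, 0.0, 0.0]
for row in screen_sketch(seqs, resp, ["TTT", "ACG"]):
    print(row)
```

Here the counts of `ACG` track the response exactly, so it sorts first; `TTT` is negatively correlated and is filtered out once `min_cor` is raised high enough.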

kmers_to_pssm

kmers_to_pssm(kmer: str | list[str], prior: float = 0.01) -> pd.DataFrame

Convert k-mer string(s) to PSSM DataFrame(s).

Each position gets a high probability for the matching nucleotide and a low probability (prior) for the others. Gap positions (N) get a uniform probability of 0.25 for every nucleotide.

This mirrors the R kmers_to_pssm() function, which accepts a vector of k-mers and returns a combined DataFrame with a kmer column.

PARAMETER	DESCRIPTION
`kmer`	K-mer string or list of k-mers. May contain `N` for wildcard positions. TYPE: `str \| list[str]`
`prior`	Prior probability for non-matching nucleotides. Default 0.01. TYPE: `float` DEFAULT: `0.01`

RETURNS	DESCRIPTION
`DataFrame`	DataFrame with columns `kmer`, `pos`, `A`, `C`, `G`, `T`. Each row sums to 1 across the four nucleotides.

RAISES	DESCRIPTION
`ValueError`	If k-mer contains invalid characters.
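
A minimal sketch of the conversion for a single k-mer is shown below. The exact normalization is not stated above, so this assumes the matching nucleotide gets `1 - 3 * prior` and each other nucleotide gets `prior`, which makes every row sum to 1. The function name `kmer_to_pssm_sketch`, the 0-based `pos` values, and the list-of-dicts output are also assumptions; the real function returns a DataFrame.

```python
def kmer_to_pssm_sketch(kmer, prior=0.01):
    """Hypothetical stand-in for kmers_to_pssm() on one k-mer."""
    rows = []
    for pos, base in enumerate(kmer):
        if base == "N":
            # Wildcard: uniform over the four nucleotides.
            probs = dict.fromkeys("ACGT", 0.25)
        elif base in "ACGT":
            # Assumed normalization so that the row sums to 1.
            probs = dict.fromkeys("ACGT", prior)
            probs[base] = 1.0 - 3.0 * prior
        else:
            raise ValueError(f"invalid character {base!r} in k-mer {kmer!r}")
        rows.append({"kmer": kmer, "pos": pos, **probs})
    return rows


for row in kmer_to_pssm_sketch("ANG"):
    print(row)
```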

pssm_to_kmer

pssm_to_kmer(pssm: DataFrame, kmer_length: int | None = None, pos_bits_thresh: float | None = 0.5, prior: float = 0.01) -> str

Convert PSSM back to a k-mer string.

Finds the window of kmer_length positions with the highest total information content, then uses the dominant nucleotide at each position. If pos_bits_thresh is set, positions whose information content falls below the threshold are replaced with N (wildcard).

This mirrors the R pssm_to_kmer() function.

PARAMETER	DESCRIPTION
`pssm`	PSSM DataFrame with columns `A`, `C`, `G`, `T`. TYPE: `DataFrame`
`kmer_length`	Length of the returned k-mer. If `None`, uses the full PSSM length. TYPE: `int \| None` DEFAULT: `None`
`pos_bits_thresh`	Minimum information content (bits) per position. Positions below this threshold are set to `N`. If `None`, all positions use the dominant nucleotide. TYPE: `float \| None` DEFAULT: `0.5`
`prior`	Prior added before computing bits. Default 0.01. TYPE: `float` DEFAULT: `0.01`

RETURNS	DESCRIPTION
`str`	K-mer string, possibly containing `N` for low-information positions.

RAISES	DESCRIPTION
`ValueError`	If PSSM has fewer rows than kmer_length.
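
The window selection and thresholding can be sketched as follows. This is an illustration under stated assumptions rather than the library's algorithm: information content is taken as `2 + sum(p * log2(p))` bits for a four-letter alphabet, the prior is added to every probability and the row renormalized, and ties between windows go to the leftmost one. The names `position_bits` and `pssm_to_kmer_sketch` are hypothetical, and the PSSM is a list of dicts instead of a DataFrame.

```python
import math


def position_bits(row, prior=0.01):
    """Information content of one PSSM row, in bits (assumed formula)."""
    probs = [row[b] + prior for b in "ACGT"]
    total = sum(probs)
    probs = [p / total for p in probs]
    return 2.0 + sum(p * math.log2(p) for p in probs if p > 0)


def pssm_to_kmer_sketch(pssm_rows, kmer_length=None, pos_bits_thresh=0.5,
                        prior=0.01):
    n = len(pssm_rows)
    k = n if kmer_length is None else kmer_length
    if k > n:
        raise ValueError("PSSM has fewer rows than kmer_length")
    bits = [position_bits(r, prior) for r in pssm_rows]
    # Leftmost window with the highest total information content.
    start = max(range(n - k + 1), key=lambda i: sum(bits[i:i + k]))
    out = []
    for row, b in zip(pssm_rows[start:start + k], bits[start:start + k]):
        if pos_bits_thresh is not None and b < pos_bits_thresh:
            out.append("N")  # low-information position
        else:
            out.append(max("ACGT", key=lambda base: row[base]))
    return "".join(out)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
strong_a = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
strong_g = {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01}
pssm = [uniform, strong_a, uniform, strong_g, uniform]

print(pssm_to_kmer_sketch(pssm))
print(pssm_to_kmer_sketch(pssm, kmer_length=3))
```

With this toy PSSM the full-length call gives `NANGN`, and asking for three positions selects the window covering both informative rows, giving `ANG`.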