K-mer Operations
K-mer generation, counting, screening, and conversion to/from PSSMs.
pyprego.kmers
K-mer generation, scoring, and PSSM conversion.
Mirrors kmers.R and kmer-regression.R from the R prego package.
generate_kmers
Generate all possible DNA k-mers of length k, optionally with gaps.
Gaps are represented by 'N' at certain positions in the k-mer. When
max_gap > 0, the function generates k-mers where 1..max_gap contiguous
positions are replaced with 'N', at every possible offset within the
k-mer.
This mirrors the R generate_kmers() function in kmers.R.
| PARAMETER | DESCRIPTION |
|---|---|
| k | K-mer length (number of positions, including gap positions). Must be >= 1. |
| alphabet | Nucleotide alphabet (default |
| max_gap | Maximum number of contiguous gap (wildcard |
| min_gap | Minimum gap length. Default 0. |

| RETURNS | DESCRIPTION |
|---|---|
| list[str] | List of k-mers. With |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If parameters are invalid. |
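The gap rule described above can be illustrated with a minimal sketch. This is not the pyprego implementation: the function name `sketch_generate_kmers`, the default alphabet `"ACGT"`, the inclusion of gaps at the k-mer edges, the output order, and the choice to keep ungapped k-mers only when `min_gap` is 0 are all assumptions made for illustration.

```python
from itertools import product


def sketch_generate_kmers(k, alphabet="ACGT", max_gap=0, min_gap=0):
    """Hypothetical sketch: enumerate k-mers, then mask contiguous runs with 'N'."""
    if k < 1:
        raise ValueError("k must be >= 1")
    if min_gap < 0 or max_gap < min_gap or max_gap >= k:
        raise ValueError("need 0 <= min_gap <= max_gap < k")

    # All ungapped k-mers over the alphabet, in product order.
    base = ["".join(p) for p in product(alphabet, repeat=k)]

    # Assumption: ungapped k-mers are part of the output only when min_gap == 0.
    result = list(base) if min_gap == 0 else []
    seen = set(result)

    # Replace gap_len contiguous positions with 'N' at every offset.
    for gap_len in range(max(min_gap, 1), max_gap + 1):
        for offset in range(k - gap_len + 1):
            for kmer in base:
                gapped = kmer[:offset] + "N" * gap_len + kmer[offset + gap_len:]
                if gapped not in seen:  # many base k-mers collapse to one gapped k-mer
                    seen.add(gapped)
                    result.append(gapped)
    return result
```

Under these assumptions, `sketch_generate_kmers(3, max_gap=1)` yields the 64 ungapped 3-mers plus 16 gapped 3-mers for each of the three offsets. The real function may differ, for example by excluding gaps at the first or last position.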
kmer_matrix
kmer_matrix(sequences: list[str] | ndarray, kmers: list[str] | int, max_gap: int = 0) -> pd.DataFrame
Count k-mer occurrences in each sequence.
If kmers is an integer, it is treated as the k-mer length and all standard k-mers (plus gapped variants if max_gap > 0) are generated. If kmers is a list of strings, those exact k-mers are counted.
For k-mers containing 'N' (wildcard), any nucleotide at that position
is considered a match.
| PARAMETER | DESCRIPTION |
|---|---|
| sequences | DNA sequences. |
| kmers | Either a k-mer length (int) or an explicit list of k-mer strings. |
| max_gap | Maximum gap length for auto-generated k-mers (only used when kmers is an int). Default 0. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame of shape |
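The wildcard matching rule can be sketched in plain Python. This is an illustration rather than the library's code: the helper names `count_matches` and `sketch_kmer_counts` are hypothetical, overlapping occurrences are assumed to count separately, and a dict of lists stands in for the DataFrame that `kmer_matrix` returns.

```python
def count_matches(seq, kmer):
    """Count windows of seq that match kmer, treating 'N' in kmer as any base."""
    k = len(kmer)
    count = 0
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        if all(m == "N" or m == base for m, base in zip(kmer, window)):
            count += 1  # assumption: overlapping matches are counted individually
    return count


def sketch_kmer_counts(sequences, kmers):
    """Map each k-mer to its per-sequence counts (one entry per sequence)."""
    return {kmer: [count_matches(seq, kmer) for seq in sequences] for kmer in kmers}


counts = sketch_kmer_counts(["ACGTACGT", "AAAA"], ["AC", "ANG", "AA"])
```

Here `"ANG"` matches `"ACG"` twice in the first sequence, because the middle position accepts any nucleotide.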
screen_kmers
screen_kmers(sequences: list[str] | ndarray, response: ndarray, kmer_len: int | None = None, kmers: list[str] | None = None, max_gap: int = 0, min_gap: int = 0, seed: int | None = None, min_cor: float = 0.0) -> pd.DataFrame
Screen k-mers for correlation with response variable(s).
For each k-mer, compute its frequency across sequences and correlate with
response. This mirrors the R screen_kmers() function.
| PARAMETER | DESCRIPTION |
|---|---|
| sequences | DNA sequences. |
| response | Response variable(s). Shape |
| kmer_len | K-mer length. Either this or kmers must be provided. |
| kmers | Explicit list of k-mers to screen. If given, overrides kmer_len. |
| max_gap | Maximum gap length. Default 0. |
| min_gap | Minimum gap length. Default 0. |
| seed | Random seed (for reproducibility; currently only sets numpy seed). |
| min_cor | Minimum absolute correlation to include in the result. Default 0.0 (include all). |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with columns: Sorted by |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If neither kmer_len nor kmers is provided, or if dimensions mismatch. |
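The screening idea (count each k-mer per sequence, then correlate the counts with the response) can be sketched as follows. Several points are assumptions, not documented behavior: Pearson correlation is used, only a single response vector is handled, k-mers with constant counts are skipped, and results are sorted by absolute correlation. The names `sketch_screen_kmers`, `pearson`, and `count_matches` are hypothetical, and a list of tuples stands in for the returned DataFrame.

```python
import math


def count_matches(seq, kmer):
    """Count windows of seq matching kmer; 'N' in kmer matches any base."""
    k = len(kmer)
    return sum(
        all(m == "N" or m == b for m, b in zip(kmer, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def sketch_screen_kmers(sequences, response, kmers, min_cor=0.0):
    if len(sequences) != len(response):
        raise ValueError("sequences and response must have the same length")
    rows = []
    for kmer in kmers:
        counts = [count_matches(seq, kmer) for seq in sequences]
        r = pearson(counts, response)
        # Keep k-mers whose absolute correlation reaches the threshold.
        if not math.isnan(r) and abs(r) >= min_cor:
            rows.append((kmer, r))
    # Assumption: strongest absolute correlation first.
    rows.sort(key=lambda row: abs(row[1]), reverse=True)
    return rows


hits = sketch_screen_kmers(
    ["ACAC", "ACGG", "GGGG", "TTTT"], [2.0, 1.0, 0.0, 0.0], ["AC", "GG", "CC"]
)
```

In this toy input the `"AC"` counts track the response exactly, `"GG"` is negatively correlated, and `"CC"` never occurs, so it is dropped.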
kmers_to_pssm
Convert k-mer string(s) to PSSM DataFrame(s).
Each position gets high probability for the matching nucleotide and low
(prior) for others. For gap positions (N), use uniform 0.25.
This mirrors the R kmers_to_pssm() function, which accepts a vector of
k-mers and returns a combined DataFrame with a kmer column.
| PARAMETER | DESCRIPTION |
|---|---|
| kmer | K-mer string or list of k-mers. May contain |
| prior | Prior probability for non-matching nucleotides. Default 0.01. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with columns |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If k-mer contains invalid characters. |
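A rough sketch of the conversion for a single k-mer follows. It assumes the matching nucleotide receives `1 - 3 * prior` so that each row sums to 1; the source only says "high probability", so this value is a guess. The function name `sketch_kmer_to_pssm`, the 0-based `pos` field, and the list-of-dicts output (standing in for a DataFrame) are likewise hypothetical.

```python
BASES = "ACGT"


def sketch_kmer_to_pssm(kmer, prior=0.01):
    """Return one row per position with probabilities for A, C, G and T."""
    rows = []
    for pos, ch in enumerate(kmer):
        if ch == "N":
            # Gap position: uniform distribution over the four bases.
            probs = {b: 0.25 for b in BASES}
        elif ch in BASES:
            # Assumption: the matching base gets whatever mass the priors leave.
            probs = {b: (1.0 - 3 * prior if b == ch else prior) for b in BASES}
        else:
            raise ValueError(f"invalid character {ch!r} in k-mer {kmer!r}")
        rows.append({"kmer": kmer, "pos": pos, **probs})
    return rows


pssm_rows = sketch_kmer_to_pssm("ANG")
```

For a list of k-mers, the same function could be applied to each k-mer and the rows concatenated, with the `kmer` field identifying which rows belong together.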
pssm_to_kmer
pssm_to_kmer(pssm: DataFrame, kmer_length: int | None = None, pos_bits_thresh: float | None = 0.5, prior: float = 0.01) -> str
Convert PSSM back to a k-mer string.
Finds the window of kmer_length positions with the highest total
information content, then at each position uses the dominant nucleotide.
If pos_bits_thresh is set, positions below the threshold are replaced
with N (wildcard).
This mirrors the R pssm_to_kmer() function.
| PARAMETER | DESCRIPTION |
|---|---|
| pssm | PSSM DataFrame with columns |
| kmer_length | Length of the returned k-mer. If |
| pos_bits_thresh | Minimum information content (bits) per position. Positions below this threshold are set to |
| prior | Prior added before computing bits. Default 0.01. |

| RETURNS | DESCRIPTION |
|---|---|
| str | K-mer string, possibly containing |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If PSSM has fewer rows than kmer_length. |
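The window search described above can be sketched like this. The sketch assumes information content is measured as 2 bits minus the Shannon entropy of each position (after adding the prior and renormalizing), and that `kmer_length=None` means "use every position"; neither detail is confirmed by the source. The names `position_bits` and `sketch_pssm_to_kmer` are hypothetical, and a list of dicts stands in for the PSSM DataFrame.

```python
import math

BASES = "ACGT"


def position_bits(row, prior=0.01):
    """Information content in bits: 2 minus the entropy of the smoothed row."""
    vals = [row[b] + prior for b in BASES]
    total = sum(vals)
    probs = [v / total for v in vals]
    return 2.0 + sum(p * math.log2(p) for p in probs if p > 0)


def sketch_pssm_to_kmer(pssm_rows, kmer_length=None, pos_bits_thresh=0.5, prior=0.01):
    n = len(pssm_rows)
    if kmer_length is None:
        kmer_length = n  # assumption: no length means the whole PSSM
    if n < kmer_length:
        raise ValueError("PSSM has fewer rows than kmer_length")

    bits = [position_bits(row, prior) for row in pssm_rows]

    # Pick the start of the window with the highest total information content.
    starts = range(n - kmer_length + 1)
    best = max(starts, key=lambda s: sum(bits[s:s + kmer_length]))

    letters = []
    for i in range(best, best + kmer_length):
        if pos_bits_thresh is not None and bits[i] < pos_bits_thresh:
            letters.append("N")  # low-information position becomes a wildcard
        else:
            letters.append(max(BASES, key=lambda b: pssm_rows[i][b]))
    return "".join(letters)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
strong_a = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
strong_g = {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01}
example = sketch_pssm_to_kmer([uniform, strong_a, uniform, strong_g, uniform], kmer_length=3)
```

In the example, the window covering the two informative positions wins, and the uniform position between them falls below the threshold and is reported as a wildcard.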