K-mer Analysis¶
This tutorial covers k-mer screening and the k-mer utilities in pyprego, which are useful both as a standalone analysis tool and as a precursor to full PWM regression.
K-mer Screening Basics¶
screen_kmers computes the Pearson correlation between each k-mer's occurrence count (across sequences) and a response variable. This provides a quick, interpretable view of which short sequence features drive the response.
import numpy as np
import pyprego
# Example data
rng = np.random.default_rng(42)
seqs = ["".join(rng.choice(list("ACGT"), 200)) for _ in range(500)]
response = rng.normal(size=500)
# Screen all 8-mers
kmers_df = pyprego.screen_kmers(seqs, response, kmer_len=8)
print(kmers_df.head(10))
The result is a DataFrame with columns:
| Column | Description |
|---|---|
kmer |
The k-mer string |
max_r2 |
Maximum R-squared across response columns |
avg_n |
Mean count of the k-mer per sequence |
avg_var |
Variance of the count across sequences |
r0 |
Pearson correlation with the response (one column per response dimension) |
Rows are sorted by max_r2 in descending order.
Filtering by correlation¶
Use min_cor to keep only k-mers above a correlation threshold:
kmers_df = pyprego.screen_kmers(
seqs, response,
kmer_len=8,
min_cor=0.05, # only keep |r| >= 0.05
)
Interpretation
min_cor filters on the absolute Pearson correlation coefficient, not R-squared. A k-mer passes if any of its response-column correlations has |r| >= min_cor.
Multi-dimensional response¶
When the response has multiple columns, screen_kmers computes correlations against each column independently:
response_2d = rng.normal(size=(500, 3))
kmers_df = pyprego.screen_kmers(seqs, response_2d, kmer_len=6)
print(kmers_df.columns)
# ['kmer', 'max_r2', 'avg_n', 'avg_var', 'r0', 'r1', 'r2']
Gapped K-mers¶
Gapped k-mers contain wildcard positions (N) that match any nucleotide. They capture motifs with variable spacers -- common in transcription factor binding sites (e.g. nuclear receptor half-sites separated by a variable gap).
kmers_df = pyprego.screen_kmers(
seqs, response,
kmer_len=8,
max_gap=2, # allow up to 2 contiguous N positions
min_gap=1, # require at least 1 gap position
)
# Gapped k-mers look like: "ACNNNGTC", "GATNCCAT", etc.
print(kmers_df[kmers_df["kmer"].str.contains("N")].head())
How gaps work¶
The generate_kmers function produces all k-mers where a contiguous block of 1 to max_gap positions is replaced with N:
# Standard 4-mers: 256 total
standard = pyprego.generate_kmers(4)
print(len(standard)) # 256
# With gaps: adds gapped variants
gapped = pyprego.generate_kmers(4, max_gap=1)
print(len(gapped)) # 256 + gapped variants
# See some gapped k-mers
print([k for k in gapped if "N" in k][:10])
When counting occurrences in kmer_matrix, N positions match any nucleotide, so "ACNGT" matches "ACAGT", "ACCGT", "ACGGT", and "ACTGT".
K-mer Count Matrix¶
For custom analysis, build the full k-mer count matrix:
kmer_list = pyprego.generate_kmers(6)
km_matrix = pyprego.kmer_matrix(seqs, kmer_list)
print(km_matrix.shape) # (500, 4096) for 6-mers
print(km_matrix.iloc[:5, :5])
The returned DataFrame has one row per sequence and one column per k-mer, with integer counts.
Memory
For long k-mers, the number of columns grows as 4^k. An 8-mer matrix has 65,536 columns. Consider using gapped k-mers or shorter k-mers if memory is a concern.
Custom screening with the count matrix¶
from scipy.stats import spearmanr
# Build count matrix once
km = pyprego.kmer_matrix(seqs, pyprego.generate_kmers(6))
# Spearman correlation instead of Pearson
correlations = {}
for col in km.columns:
r, p = spearmanr(km[col].values, response)
correlations[col] = (r, p)
# Find top k-mers
top = sorted(correlations.items(), key=lambda x: abs(x[1][0]), reverse=True)[:10]
for kmer, (r, p) in top:
print(f"{kmer}: rho={r:.4f}, p={p:.2e}")
Converting K-mers to PSSMs¶
A k-mer can be converted to a PSSM for use with compute_pwm or as a seed for regression:
pssm = pyprego.kmers_to_pssm("GATAAG")
print(pssm)
# pos A C G T
# 0 0 0.01 0.01 0.97 0.01
# 1 1 0.97 0.01 0.01 0.01
# 2 2 0.01 0.01 0.01 0.97
# 3 3 0.97 0.01 0.01 0.01
# 4 4 0.97 0.01 0.01 0.01
# 5 5 0.01 0.01 0.97 0.01
For gapped k-mers, N positions get uniform weights (0.25 each):
pssm_gapped = pyprego.kmers_to_pssm("GATNAG")
print(pssm_gapped.iloc[3])
# pos 3.00
# A 0.25
# C 0.25
# G 0.25
# T 0.25
Multiple k-mers¶
pssm_multi = pyprego.kmers_to_pssm(["GATAAG", "CACGTG"])
print(pssm_multi) # Combined DataFrame with a 'kmer' column
Converting PSSMs to K-mers¶
The reverse operation extracts the top-scoring k-mer from a PSSM:
Integration with Regression¶
K-mer screening is used internally by regress_pwm when no seed motif is provided. You can replicate this workflow explicitly:
# Step 1: Screen k-mers
kmers_df = pyprego.screen_kmers(seqs, response, kmer_len=8, max_gap=1)
top_kmer = kmers_df.iloc[0]["kmer"]
print(f"Best seed: {top_kmer}")
# Step 2: Use top k-mer as seed for regression
result = pyprego.regress_pwm(seqs, response, motif=top_kmer)
print(f"Consensus: {result.consensus}, R2: {result.r2:.4f}")
Trying multiple seeds manually¶
results = []
for _, row in kmers_df.head(5).iterrows():
res = pyprego.regress_pwm(seqs, response, motif=row["kmer"])
results.append((row["kmer"], res.r2, res.consensus))
print(f"Seed: {row['kmer']} -> {res.consensus} (R2={res.r2:.4f})")
# Pick the best
best = max(results, key=lambda x: x[1])
print(f"\nBest: seed={best[0]}, consensus={best[2]}, R2={best[1]:.4f}")
This is essentially what regress_pwm(multi_kmers=True) does automatically, with the addition of a validation holdout for model selection.