Sequence Analysis
Functions for extracting and analyzing DNA sequences, including reverse complement operations, k-mer counting, and position weight matrix scoring.
pymisha.gseq_extract
Extract DNA sequences for given intervals.
Returns one sequence string per interval in 'intervals'. If the intervals include a 'strand' column, rows whose strand is -1 return the reverse complement of the sequence.
| PARAMETER | DESCRIPTION |
|---|---|
| intervals | Intervals for which DNA sequence is returned. Must have 'chrom', 'start', and 'end' columns. Optional 'strand' column (-1 for reverse complement). |

| RETURNS | DESCRIPTION |
|---|---|
| list of str | Array of character strings representing DNA sequence. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervs = pm.gintervals(["1", "2"], [10000, 10000], [10020, 10020])
>>> pm.gseq_extract(intervs)
[...]
See Also
gseq_rev, gseq_comp, gseq_revcomp, gseq_kmer
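The strand rule described above can be illustrated without a genome database. The following is a minimal sketch on a toy in-memory genome; the `extract` helper, the dict-based intervals, and the 0-based half-open coordinates are assumptions of this sketch, not pymisha's API or storage model.

```python
# Illustrative only: a toy in-memory "genome", not pymisha's storage or API.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def extract(genome, intervals):
    """Return one sequence per interval; strand -1 yields the reverse complement.

    Assumes 0-based, half-open [start, end) coordinates (an assumption of this sketch).
    """
    seqs = []
    for iv in intervals:
        seq = genome[iv["chrom"]][iv["start"]:iv["end"]]
        if iv.get("strand", 1) == -1:
            # Complement each base, then reverse the string.
            seq = seq.translate(COMPLEMENT)[::-1]
        seqs.append(seq)
    return seqs

genome = {"1": "AACGTTGCA", "2": "GGGCCCTTT"}
intervals = [
    {"chrom": "1", "start": 0, "end": 4},
    {"chrom": "1", "start": 0, "end": 4, "strand": -1},
    {"chrom": "2", "start": 3, "end": 9},
]
result = extract(genome, intervals)
```

Here the first two intervals cover the same bases, so the second result is the reverse complement of the first.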
pymisha.gseq_rev
Reverse DNA sequence(s).
| PARAMETER | DESCRIPTION |
|---|---|
| seq | DNA sequence(s) to reverse. |

| RETURNS | DESCRIPTION |
|---|---|
| str or list of str | Reversed sequence(s). |
Examples:
>>> import pymisha as pm
>>> pm.gseq_rev("ACGT")
'TGCA'
>>> pm.gseq_rev(["ACGT", "TATA"])
['TGCA', 'ATAT']
See Also
gseq_comp, gseq_revcomp, gseq_extract
pymisha.gseq_comp
Return complement of DNA sequence(s).
| PARAMETER | DESCRIPTION |
|---|---|
| seq | DNA sequence(s) to complement. |

| RETURNS | DESCRIPTION |
|---|---|
| str or list of str | Complemented sequence(s). |
Examples:
>>> import pymisha as pm
>>> pm.gseq_comp("ACGT")
'TGCA'
>>> pm.gseq_comp(["ACGT", "AAAA"])
['TGCA', 'TTTT']
See Also
gseq_rev, gseq_revcomp, gseq_extract
pymisha.gseq_revcomp
Return reverse complement of DNA sequence(s).
| PARAMETER | DESCRIPTION |
|---|---|
| seq | DNA sequence(s) to reverse complement. |

| RETURNS | DESCRIPTION |
|---|---|
| str or list of str | Reverse complemented sequence(s). |
Examples:
>>> import pymisha as pm
>>> pm.gseq_revcomp("AACG")
'CGTT'
>>> pm.gseq_revcomp(["AACG", "AAAA"])
['CGTT', 'TTTT']
See Also
gseq_rev, gseq_comp, gseq_extract
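The three string operations are related: the reverse complement is the reverse of the complement (or equally the complement of the reverse). A plain-Python sketch of that relationship follows; the helper names `rev`, `comp`, and `revcomp` are hypothetical stand-ins and handle single uppercase strings only.

```python
# Plain-Python stand-ins for the three operations (single strings, uppercase ACGT only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def rev(seq):
    """Reverse the sequence without changing any base."""
    return seq[::-1]

def comp(seq):
    """Swap each base for its complement, keeping the order."""
    return seq.translate(COMPLEMENT)

def revcomp(seq):
    """Reverse complement: complement every base, then reverse."""
    return rev(comp(seq))

example = revcomp("AACG")           # same input as the doctest above
round_trip = revcomp(revcomp("AACG"))  # applying it twice restores the input
```

Because the two steps commute, `comp(rev(seq))` gives the same result as `rev(comp(seq))`.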
pymisha.gseq_kmer
gseq_kmer(seqs, kmer, mode='count', strand=0, start_pos=None, end_pos=None, extend=False, skip_gaps=True, gap_chars=None)
Count k-mer occurrences in DNA sequences.
| PARAMETER | DESCRIPTION |
|---|---|
| seqs | DNA sequence(s) to search. |
| kmer | K-mer pattern to search for (only A, C, G, T characters). |
| mode | "count" returns raw counts, "frac" returns fraction of possible positions. |
| strand | 0 = both strands, 1 = forward only, -1 = reverse complement only. |
| start_pos | 1-based start position of region of interest. |
| end_pos | 1-based end position of region of interest (inclusive). |
| extend | If True, allow k-mer to extend beyond ROI boundaries. |
| skip_gaps | If True, skip gap characters when scanning. |
| gap_chars | Characters to treat as gaps. Default: ["-", "."]. |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray | Array of counts or fractions, one per input sequence. |
Examples:
Count CG dinucleotides on both strands:
Get fraction instead of count:
Forward strand only:
Count in a specific region:
See Also
gseq_kmer_dist, gseq_pwm, gseq_extract
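To make the `strand` and `mode` semantics concrete, here is a minimal pure-Python sketch of k-mer counting. It is not pymisha's implementation: the helpers `count_forward`, `kmer_count`, and `kmer_frac` are hypothetical, and both the way the two strands are combined and the denominator used for the fraction are assumptions of this sketch.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def count_forward(seq, kmer):
    """Count overlapping forward-strand occurrences of kmer in seq."""
    k = len(kmer)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer)

def kmer_count(seq, kmer, strand=0):
    """strand: 1 = forward only, -1 = reverse complement only, 0 = both.

    Assumption of this sketch: the both-strand count is the sum of the
    two single-strand counts, so palindromic k-mers are counted twice.
    """
    fwd = count_forward(seq, kmer)
    # A reverse-strand hit is a forward hit of the reverse-complemented k-mer.
    rev = count_forward(seq, kmer.translate(COMPLEMENT)[::-1])
    return {1: fwd, -1: rev, 0: fwd + rev}[strand]

def kmer_frac(seq, kmer, strand=0):
    """Count divided by the number of windows scanned (assumed denominator)."""
    windows = len(seq) - len(kmer) + 1
    if windows <= 0:
        return 0.0
    strands = 2 if strand == 0 else 1
    return kmer_count(seq, kmer, strand) / (windows * strands)

seq = "ACGCGT"
both = kmer_count(seq, "AC")                    # forward "AC" plus reverse-strand "GT"
forward_only = kmer_count(seq, "AC", strand=1)
cg_frac = kmer_frac(seq, "CG", strand=1)        # 2 hits over 5 forward windows
```

The sketch ignores regions of interest and gap characters; those options would restrict or filter the windows before counting.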
pymisha.gseq_kmer_dist
Compute k-mer distribution in genomic intervals.
Counts all k-mers of size k within the specified genomic intervals, optionally excluding masked regions.
| PARAMETER | DESCRIPTION |
|---|---|
| intervals | Genomic intervals to analyze. |
| k | K-mer size (1-10). |
| mask | Intervals to exclude from counting. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with columns 'kmer' (str) and 'count' (int). Only k-mers with count > 0 are included. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervs = pm.gintervals(1, 0, 10000)
>>> result = pm.gseq_kmer_dist(intervs, k=2)
>>> list(result.columns)
['kmer', 'count']
See Also
gseq_kmer, gseq_pwm, gseq_extract
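The idea of a k-mer distribution with masking can be sketched on a single string with the standard library. The `kmer_dist` helper below is hypothetical and returns a sorted list of pairs rather than a DataFrame; dropping every window that overlaps a masked range is an assumption of this sketch, and pymisha's handling of mask boundaries may differ.

```python
from collections import Counter

def kmer_dist(seq, k, masked=()):
    """Count every k-mer in seq, skipping windows that overlap a masked range.

    masked holds 0-based, half-open (start, end) pairs (an assumption of this sketch).
    """
    counts = Counter()
    for i in range(len(seq) - k + 1):
        # Window [i, i + k) overlaps a mask if it starts before the mask ends
        # and ends after the mask starts.
        if any(i < m_end and i + k > m_start for m_start, m_end in masked):
            continue
        counts[seq[i:i + k]] += 1
    # Only k-mers that were actually seen appear, mirroring "count > 0".
    return sorted(counts.items())

unmasked = kmer_dist("ACGTAC", 2)
with_mask = kmer_dist("ACGTAC", 2, masked=[(2, 4)])
```

With the mask covering positions 2 and 3, only the two "AC" windows at the ends of the string survive.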
pymisha.gseq_pwm
gseq_pwm(seqs, pssm, mode='lse', bidirect=True, strand=0, score_thresh=0, start_pos=None, end_pos=None, extend=False, spat_factor=None, spat_bin=1, spat_min=None, spat_max=None, return_strand=False, skip_gaps=True, gap_chars=None, neutral_chars=None, neutral_chars_policy='average', prior=0.01)
Score DNA sequences with a position weight matrix (PWM/PSSM).
Scans each input sequence with a sliding window of PSSM width and computes per-position log-probability scores, then aggregates them according to mode.
| PARAMETER | DESCRIPTION |
|---|---|
| seqs | DNA sequence(s) to score. |
| pssm | Position-specific scoring matrix. If ndarray, shape |
| mode | Scoring mode: |
| bidirect | If True, score both forward and reverse complement strands. Overrides strand. |
| strand | When |
| score_thresh | Threshold for |
| start_pos | 1-based inclusive start of region of interest. |
| end_pos | 1-based inclusive end of region of interest. |
| extend | Allow motif window to start before ROI. |
| spat_factor | Spatial weighting factors, one per spatial bin. When provided, the score at each window position is shifted in log-space by |
| spat_bin | Number of consecutive positions that share the same spatial weight. Position offset maps to bin |
| spat_min | Reserved for virtual-track context (not used in string scoring). |
| spat_max | Reserved for virtual-track context (not used in string scoring). |
| return_strand | For |
| skip_gaps | Skip gap characters when scanning. |
| gap_chars | Characters treated as gaps. Default |
| neutral_chars | Characters treated as unknown/ambiguous bases. Default |
| neutral_chars_policy | How to score neutral characters: |
| prior | Pseudocount added to PSSM before normalization. Must be >= 0. |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray or DataFrame | Array of scores (one per input sequence). For |
Examples:
Create a simple PSSM (frequency matrix):
>>> import numpy as np
>>> pssm = np.array([
...     [0.7, 0.1, 0.1, 0.1],
...     [0.1, 0.7, 0.1, 0.1],
...     [0.1, 0.1, 0.7, 0.1],
...     [0.1, 0.1, 0.1, 0.7],
... ])
Score sequences using log-sum-exp (default mode):
Find position of best match:
Count matches above a threshold:
See Also
gseq_kmer, gseq_kmer_dist, gseq_extract
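The sliding-window scoring described above can be sketched in plain Python: add the pseudocount, normalize each row, take logs, sum per window, then aggregate. This is an illustration, not pymisha's implementation. The helper names, the A, C, G, T column order, the natural logarithm, and forward-strand-only scanning are all assumptions of this sketch.

```python
import math

BASES = "ACGT"  # assumed column order of the matrix

def log_pssm(pssm, prior=0.01):
    """Add a pseudocount to each entry, renormalize each row, take natural logs."""
    out = []
    for row in pssm:
        total = sum(p + prior for p in row)
        out.append([math.log((p + prior) / total) for p in row])
    return out

def window_scores(seq, logp):
    """Log-probability of the motif at every forward-strand window of seq."""
    width = len(logp)
    return [
        sum(logp[j][BASES.index(seq[i + j])] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]

def log_sum_exp(values):
    """Numerically stable log of the summed exponentials."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

pssm = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
scores = window_scores("TTACGTTT", log_pssm(pssm))
lse_score = log_sum_exp(scores)              # aggregate over all windows
best_score = max(scores)                     # score of the single best window
best_pos = scores.index(best_score)          # 0-based start of the best window
n_hits = sum(s > -3.0 for s in scores)       # windows above an arbitrary threshold
```

The log-sum-exp aggregate is never smaller than the best single-window score, since it adds the contributions of every other window on top of the maximum.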
Motif Import
Functions to import motifs from standard motif format files (MEME, JASPAR, HOMER).
pymisha.gseq_read_meme
Read motifs from a MEME minimal motif format file.
Parses a MEME minimal motif format file and returns a dict of position
probability matrices (PPM). Each DataFrame has columns A, C,
G, T with one row per motif position. Metadata is stored in the
DataFrame's .attrs dict.
| PARAMETER | DESCRIPTION |
|---|---|
| file | Path to a MEME format file ( |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, DataFrame] | Keys are motif identifiers, values are DataFrames. Each DataFrame has |

| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If file does not exist. |
| ValueError | If the file is malformed or contains unsupported content. |
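For orientation, the structure of a MEME minimal motif file can be read with a few lines of standard-library Python. The sketch below is not pymisha's parser: `parse_meme` is a hypothetical helper that works on text rather than a path, handles only `MOTIF` and `letter-probability matrix:` lines, and returns plain dicts instead of DataFrames with `.attrs`.

```python
import re

def parse_meme(text):
    """Map each motif id to its probability rows and header key/value pairs."""
    motifs = {}
    motif_id = None
    lines = iter(text.splitlines())
    for line in lines:
        line = line.strip()
        if line.startswith("MOTIF"):
            motif_id = line.split()[1]
        elif line.startswith("letter-probability matrix:") and motif_id is not None:
            # Header fields look like "alength= 4 w= 2 nsites= 10 E= 0".
            attrs = dict(re.findall(r"(\w+)=\s*(\S+)", line))
            width = int(attrs["w"])
            rows = []
            while len(rows) < width:
                raw = next(lines, None)
                if raw is None:
                    raise ValueError(f"motif {motif_id!r} is truncated")
                values = raw.split()
                if not values:
                    continue  # tolerate blank lines inside the matrix
                if len(values) != 4:
                    raise ValueError(f"expected 4 columns in motif {motif_id!r}")
                rows.append([float(v) for v in values])
            motifs[motif_id] = {"matrix": rows, "attrs": attrs}
    return motifs

# Made-up motif for illustration only.
toy_meme = """MEME version 4

ALPHABET= ACGT

MOTIF toy_motif
letter-probability matrix: alength= 4 w= 2 nsites= 10 E= 0
 0.7 0.1 0.1 0.1
 0.1 0.7 0.1 0.1
"""
meme_motifs = parse_meme(toy_meme)
```

Each row of `matrix` corresponds to one motif position with columns in A, C, G, T order, matching the layout described above.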
pymisha.gseq_read_jaspar
Read motifs from a JASPAR PFM format file.
Parses a JASPAR Position Frequency Matrix (PFM) file and returns a dict
of position probability matrices (PPM). Supports both the standard JASPAR
header format (>ID NAME followed by labeled A/C/G/T rows) and the
simple 4-row PFM format. Counts are converted to probabilities.
| PARAMETER | DESCRIPTION |
|---|---|
| file | Path to a JASPAR format file ( |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, DataFrame] | Keys are motif identifiers, values are DataFrames with columns |

| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If file does not exist. |
| ValueError | If the file is malformed. |
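The count-to-probability conversion mentioned above amounts to dividing each position's counts by their column total. The sketch below shows this for the `>ID NAME` header format with labeled rows; `parse_jaspar` is a hypothetical helper operating on text, it returns one dict per position rather than a DataFrame, and it does not cover the simple 4-row format.

```python
import re

def parse_jaspar(text):
    """Parse '>ID NAME' blocks with labeled A/C/G/T rows into per-position probabilities."""
    blocks = {}      # motif id -> {base: [counts]}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            blocks[current] = {}
        elif current is not None:
            base = line[0].upper()
            blocks[current][base] = [
                float(x) for x in re.findall(r"\d+(?:\.\d+)?", line[1:])
            ]

    motifs = {}
    for motif_id, rows in blocks.items():
        same_length = len({len(v) for v in rows.values()}) == 1
        if sorted(rows) != list("ACGT") or not same_length:
            raise ValueError(f"malformed motif {motif_id!r}")
        ppm = []
        # zip(*) walks the four rows column by column, one column per motif position.
        for column in zip(*(rows[b] for b in "ACGT")):
            total = sum(column)
            ppm.append({b: c / total for b, c in zip("ACGT", column)})
        motifs[motif_id] = ppm
    return motifs

# Made-up motif for illustration only.
toy_jaspar = """>M1 toy
A [ 4 19  0 ]
C [16  0 20 ]
G [ 0  1  0 ]
T [ 0  0  0 ]
"""
jaspar_motifs = parse_jaspar(toy_jaspar)
```

A production parser would also need to guard against positions whose counts sum to zero, which this sketch does not handle.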
pymisha.gseq_read_homer
Read motifs from a HOMER motif format file.
Parses a HOMER .motif format file and returns a dict of position
probability matrices (PPM). Each DataFrame has columns A, C,
G, T.
| PARAMETER | DESCRIPTION |
|---|---|
| file | Path to a HOMER motif file ( |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, DataFrame] | Keys are derived from the consensus sequence. Each DataFrame has |

| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If file does not exist. |
| ValueError | If the file is malformed. |
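A HOMER motif file consists of a tab-separated header line starting with `>` (consensus first) followed by one row of four probabilities per position. The sketch below reads that layout from text. `parse_homer` is a hypothetical helper, not pymisha's parser, and using the consensus string directly as the dict key is an assumption of this sketch.

```python
def parse_homer(text):
    """Map each motif's consensus string to its list of per-position probabilities."""
    motifs = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header fields are tab-separated; the first one is the consensus.
            current = line[1:].split("\t")[0]
            motifs[current] = []
        elif current is None:
            raise ValueError("matrix row found before any '>' header")
        else:
            values = [float(x) for x in line.split()]
            if len(values) != 4:
                raise ValueError(f"expected 4 columns, got {len(values)}")
            motifs[current].append(dict(zip("ACGT", values)))
    return motifs

# Made-up motif for illustration only.
toy_homer = ">ACGT\ttoy_motif\t6.0\n0.7\t0.1\t0.1\t0.1\n0.1\t0.7\t0.1\t0.1\n0.1\t0.1\t0.7\t0.1\n0.1\t0.1\t0.1\t0.7\n"
homer_motifs = parse_homer(toy_homer)
```

Two motifs sharing the same consensus would collide under this keying, so a real reader needs some way to disambiguate them.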