Sequence Analysis¶

Functions for extracting and analyzing DNA sequences, including reverse complement operations, k-mer counting, and position weight matrix scoring.

pymisha.gseq_extract ¶

gseq_extract(intervals)

Extract DNA sequences for given intervals.

Returns a list of sequence strings, one per interval in 'intervals'. If 'intervals' contains an additional 'strand' column, any interval whose strand is -1 is returned as the reverse complement of the extracted sequence.

PARAMETER	DESCRIPTION
`intervals`	Intervals for which DNA sequence is returned. Must have 'chrom', 'start', and 'end' columns. Optional 'strand' column (-1 for reverse complement). TYPE: `DataFrame`

RETURNS	DESCRIPTION
`list of str`	DNA sequences, one string per input interval.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervs = pm.gintervals(["1", "2"], [10000, 10000], [10020, 10020])
>>> pm.gseq_extract(intervs)
[...]
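
The strand rule can be illustrated without a genome database. The sketch below is not pymisha's implementation: `toy_genome` and `extract` are hypothetical names, intervals are plain dicts rather than a DataFrame, and coordinates are assumed to be 0-based with an exclusive end.

```python
# Hypothetical stand-in for a genome database: chromosome name -> sequence.
toy_genome = {"1": "ACGTTGCAAC", "2": "GGGATCCCTA"}

# Complement table for upper- and lower-case bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract(intervals):
    """Return one sequence per interval; strand == -1 gives the reverse complement.

    Each interval is a dict with 'chrom', 'start', 'end' and an optional 'strand'.
    Coordinates are assumed (not confirmed) to be 0-based with an exclusive end.
    """
    seqs = []
    for iv in intervals:
        seq = toy_genome[iv["chrom"]][iv["start"]:iv["end"]]
        if iv.get("strand") == -1:
            seq = seq.translate(COMPLEMENT)[::-1]
        seqs.append(seq)
    return seqs


intervals = [
    {"chrom": "1", "start": 0, "end": 4},                # forward strand
    {"chrom": "2", "start": 3, "end": 7, "strand": -1},  # reverse complement
]
print(extract(intervals))
```

The second interval covers `ATCC` on the forward strand, so the sketch reports `GGAT` for it.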

See Also

gseq_rev, gseq_comp, gseq_revcomp, gseq_kmer

pymisha.gseq_rev ¶

gseq_rev(seq)

Reverse DNA sequence(s).

PARAMETER	DESCRIPTION
`seq`	DNA sequence(s) to reverse. TYPE: `str or list of str`

RETURNS	DESCRIPTION
`str or list of str`	Reversed sequence(s).

Examples:

>>> import pymisha as pm
>>> pm.gseq_rev("ACGT")
'TGCA'
>>> pm.gseq_rev(["ACGT", "TATA"])
['TGCA', 'ATAT']

See Also

gseq_comp, gseq_revcomp, gseq_extract

pymisha.gseq_comp ¶

gseq_comp(seq)

Return complement of DNA sequence(s).

PARAMETER	DESCRIPTION
`seq`	DNA sequence(s) to complement. TYPE: `str or list of str`

RETURNS	DESCRIPTION
`str or list of str`	Complemented sequence(s).

Examples:

>>> import pymisha as pm
>>> pm.gseq_comp("ACGT")
'TGCA'
>>> pm.gseq_comp(["ACGT", "AAAA"])
['TGCA', 'TTTT']

See Also

gseq_rev, gseq_revcomp, gseq_extract

pymisha.gseq_revcomp ¶

gseq_revcomp(seq)

Return reverse complement of DNA sequence(s).

PARAMETER	DESCRIPTION
`seq`	DNA sequence(s) to reverse complement. TYPE: `str or list of str`

RETURNS	DESCRIPTION
`str or list of str`	Reverse complemented sequence(s).

Examples:

>>> import pymisha as pm
>>> pm.gseq_revcomp("AACG")
'CGTT'
>>> pm.gseq_revcomp(["AACG", "AAAA"])
['CGTT', 'TTTT']
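
The three string helpers are related: the reverse complement is the composition of reversing and complementing. A minimal pure-Python sketch of that relationship follows. The names `rev`, `comp` and `revcomp` are hypothetical stand-ins, not pymisha's implementation, and only A/C/G/T are translated.

```python
# Complement table for the four canonical bases (upper and lower case).
_COMP = str.maketrans("ACGTacgt", "TGCAtgca")


def rev(seq):
    """Reverse a sequence without complementing it."""
    return seq[::-1]


def comp(seq):
    """Complement each base, keeping the original order."""
    return seq.translate(_COMP)


def revcomp(seq):
    """Reverse complement: complement, then reverse (either order works)."""
    return rev(comp(seq))


print(rev("AACG"), comp("AACG"), revcomp("AACG"))
```

Note that a palindromic sequence such as `ACGT` is its own reverse complement, which is why reversing and complementing it give the same string in the examples above.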

See Also

gseq_rev, gseq_comp, gseq_extract

pymisha.gseq_kmer ¶

gseq_kmer(seqs, kmer, mode='count', strand=0, start_pos=None, end_pos=None, extend=False, skip_gaps=True, gap_chars=None)

Count k-mer occurrences in DNA sequences.

PARAMETER	DESCRIPTION
`seqs`	DNA sequence(s) to search. TYPE: `str or list of str`
`kmer`	K-mer pattern to search for (only A, C, G, T characters). TYPE: `str`
`mode`	"count" returns raw counts, "frac" returns fraction of possible positions. TYPE: `str` DEFAULT: `"count"`
`strand`	0 = both strands, 1 = forward only, -1 = reverse complement only. TYPE: `int` DEFAULT: `0`
`start_pos`	1-based start position of region of interest. TYPE: `int` DEFAULT: `None`
`end_pos`	1-based end position of region of interest (inclusive). TYPE: `int` DEFAULT: `None`
`extend`	If True, allow k-mer to extend beyond ROI boundaries. TYPE: `bool` DEFAULT: `False`
`skip_gaps`	If True, skip gap characters when scanning. TYPE: `bool` DEFAULT: `True`
`gap_chars`	Characters to treat as gaps. When `None`, defaults to `["-", "."]`. TYPE: `list of str` DEFAULT: `None`

RETURNS	DESCRIPTION
`ndarray`	Array of counts or fractions, one per input sequence.

Examples:

>>> import pymisha as pm

Count CG dinucleotides on both strands:

>>> pm.gseq_kmer(["CGCGCGCGCG", "ATATATATAT"], "CG")
array([10.,  0.])

Get fraction instead of count:

>>> pm.gseq_kmer(["CGCGCGCGCG"], "CG", mode="frac")
array([0.555...])

Forward strand only:

>>> pm.gseq_kmer(["CGCGCGCGCG"], "CG", strand=1)
array([5.])

Count in a specific region:

>>> pm.gseq_kmer("ATCGATCG", "AT", start_pos=1, end_pos=4)
array([2.])
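
The counting rules behind these examples can be sketched in plain Python. This is an illustration under stated assumptions, not pymisha's implementation: `count_kmer` is a hypothetical name, it handles a single sequence, and it ignores `extend`, `skip_gaps` and `gap_chars`. The `"frac"` denominator (windows times strands scanned) is inferred from the `0.555...` example above rather than taken from the library.

```python
_COMP = str.maketrans("ACGT", "TGCA")


def count_kmer(seq, kmer, mode="count", strand=0, start_pos=None, end_pos=None):
    """Count overlapping k-mer matches whose window lies fully inside the ROI."""
    # Convert the 1-based inclusive ROI to a 0-based half-open slice.
    lo = 0 if start_pos is None else start_pos - 1
    hi = len(seq) if end_pos is None else end_pos
    roi = seq[lo:hi].upper()
    k = len(kmer)
    n_windows = max(len(roi) - k + 1, 0)

    # Patterns to look for: the k-mer itself and/or its reverse complement.
    patterns = []
    if strand in (0, 1):
        patterns.append(kmer.upper())
    if strand in (0, -1):
        patterns.append(kmer.upper().translate(_COMP)[::-1])

    count = sum(
        roi[i:i + k] == pattern
        for pattern in patterns
        for i in range(n_windows)
    )
    if mode == "count":
        return float(count)
    # "frac": matches divided by the number of (window, strand) slots scanned.
    possible = n_windows * len(patterns)
    return count / possible if possible else float("nan")


print(count_kmer("CGCGCGCGCG", "CG"))               # both strands
print(count_kmer("CGCGCGCGCG", "CG", mode="frac"))  # 10 matches over 18 slots
print(count_kmer("ATCGATCG", "AT", start_pos=1, end_pos=4))
```

Because `CG` and `AT` are their own reverse complements, each forward match is counted once more on the reverse strand when `strand=0`, which is why the both-strand count for `CGCGCGCGCG` is twice the forward-only count.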

See Also

gseq_kmer_dist, gseq_pwm, gseq_extract

pymisha.gseq_kmer_dist ¶

gseq_kmer_dist(intervals, k=6, mask=None)

Compute k-mer distribution in genomic intervals.

Counts all k-mers of size k within the specified genomic intervals, optionally excluding masked regions.

PARAMETER	DESCRIPTION
`intervals`	Genomic intervals to analyze. TYPE: `DataFrame`
`k`	K-mer size (1-10). TYPE: `int` DEFAULT: `6`
`mask`	Intervals to exclude from counting. TYPE: `DataFrame` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame`	DataFrame with columns 'kmer' (str) and 'count' (int). Only k-mers with count > 0 are included.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervs = pm.gintervals(1, 0, 10000)
>>> result = pm.gseq_kmer_dist(intervs, k=2)
>>> list(result.columns)
['kmer', 'count']
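
To make the shape of the result concrete, here is a small sketch that tallies k-mers in a single string. It is not pymisha's implementation: `kmer_distribution` is a hypothetical helper, it scans the forward strand only, it skips windows containing anything other than A/C/G/T, and it does not model `mask`. All of these are assumptions made for illustration.

```python
from collections import Counter


def kmer_distribution(seq, k=6):
    """Count every k-mer made only of A/C/G/T on the forward strand of seq."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[window] += 1
    # Mirror the documented output: (kmer, count) rows, only for count > 0.
    return sorted(counts.items())


print(kmer_distribution("ACGTACGN", k=2))
```

K-mers that never occur are simply absent from the result, matching the documented rule that only k-mers with count > 0 are included.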

See Also

gseq_kmer, gseq_pwm, gseq_extract

pymisha.gseq_pwm ¶

gseq_pwm(seqs, pssm, mode='lse', bidirect=True, strand=0, score_thresh=0, start_pos=None, end_pos=None, extend=False, spat_factor=None, spat_bin=1, spat_min=None, spat_max=None, return_strand=False, skip_gaps=True, gap_chars=None, neutral_chars=None, neutral_chars_policy='average', prior=0.01)

Score DNA sequences with a position weight matrix (PWM/PSSM).

Scans each input sequence with a sliding window of PSSM width and computes per-position log-probability scores, then aggregates them according to mode.

PARAMETER	DESCRIPTION
`seqs`	DNA sequence(s) to score. TYPE: `str or list of str`
`pssm`	Position-specific scoring matrix. If ndarray, shape `(w, 4)` with columns ordered [A, C, G, T]. If DataFrame, must contain columns `A`, `C`, `G`, `T` (extra columns are ignored). TYPE: `ndarray or DataFrame`
`mode`	Scoring mode. `"lse"`: log-sum-exp of all per-position scores; `"max"`: maximum per-position score; `"pos"`: 1-based position of the best match; `"count"`: number of positions with score >= `score_thresh`. TYPE: `str` DEFAULT: `"lse"`
`bidirect`	If True, score both the forward and reverse complement strands. Overrides `strand`. TYPE: `bool` DEFAULT: `True`
`strand`	Strand to score when `bidirect=False`: `1` or `0` (the default) = forward only, `-1` = reverse complement only. TYPE: `int` DEFAULT: `0`
`score_thresh`	Threshold for `"count"` mode. TYPE: `float` DEFAULT: `0`
`start_pos`	1-based inclusive start of region of interest. TYPE: `int or None` DEFAULT: `None`
`end_pos`	1-based inclusive end of region of interest. TYPE: `int or None` DEFAULT: `None`
`extend`	Allow the motif window to start before the ROI. `True` extends by `w-1` positions (where `w` is the PSSM width), `False` by 0; an integer gives an explicit extension. TYPE: `bool or int` DEFAULT: `False`
`spat_factor`	Spatial weighting factors, one per spatial bin. When provided, the score at each window position is shifted in log-space by `log(spat_factor[bin])`, where `bin = offset // spat_bin` and `offset` is the 0-based position of the window start relative to the ROI start. Values must be non-negative. TYPE: `array-like or None` DEFAULT: `None`
`spat_bin`	Number of consecutive positions that share the same spatial weight. Position offset maps to bin `offset // spat_bin`. TYPE: `int` DEFAULT: `1`
`spat_min`	Reserved for virtual-track context (not used in string scoring). TYPE: `float or None` DEFAULT: `None`
`spat_max`	Reserved for virtual-track context (not used in string scoring). TYPE: `float or None` DEFAULT: `None`
`return_strand`	For `mode="pos"` with bidirectional scoring, return a DataFrame with `pos` and `strand` columns instead of a plain array. TYPE: `bool` DEFAULT: `False`
`skip_gaps`	Skip gap characters when scanning. TYPE: `bool` DEFAULT: `True`
`gap_chars`	Characters treated as gaps. Default `["-", "."]`. TYPE: `list of str or None` DEFAULT: `None`
`neutral_chars`	Characters treated as unknown/ambiguous bases. Default `["N", "n", "*"]`. TYPE: `list of str or None` DEFAULT: `None`
`neutral_chars_policy`	How to score neutral characters. `"average"`: mean log-probability of the PSSM column; `"log_quarter"`: `log(0.25)`; `"na"`: the window is invalid (NaN). TYPE: `str` DEFAULT: `"average"`
`prior`	Pseudocount added to PSSM before normalization. Must be >= 0. TYPE: `float` DEFAULT: `0.01`
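
The spatial weighting described for `spat_factor` and `spat_bin` amounts to a per-window shift in log-space. A minimal sketch follows. `apply_spatial_weights` is a hypothetical helper operating on a list of already computed window scores, it assumes `spat_factor` has enough entries to cover every bin, and treating a zero factor as `-inf` is an assumption about how zero weights behave.

```python
import math


def apply_spatial_weights(window_scores, spat_factor, spat_bin=1):
    """Shift each window score by log(spat_factor[bin]), bin = offset // spat_bin."""
    weighted = []
    for offset, s in enumerate(window_scores):
        factor = spat_factor[offset // spat_bin]
        # Assumption: a zero factor removes the window entirely (log(0) = -inf).
        shift = math.log(factor) if factor > 0 else -math.inf
        weighted.append(s + shift)
    return weighted


# Six window scores, two positions per bin, three bins.
scores = [-2.0, -2.0, -2.0, -2.0, -2.0, -2.0]
print(apply_spatial_weights(scores, spat_factor=[1.0, 0.5, 0.0], spat_bin=2))
```

With `spat_bin=2`, offsets 0 and 1 share the first factor, offsets 2 and 3 the second, and so on. A factor of 1 leaves the score unchanged.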

RETURNS	DESCRIPTION
`ndarray or DataFrame`	Array of scores (one per input sequence). For `mode="pos"` with `return_strand=True`, returns a DataFrame with columns `pos` and `strand`.

Examples:

>>> import pymisha as pm
>>> import numpy as np

Create a simple PSSM (frequency matrix):

>>> pssm = np.array([
...     [0.7, 0.1, 0.1, 0.1],
...     [0.1, 0.7, 0.1, 0.1],
...     [0.1, 0.1, 0.7, 0.1],
...     [0.1, 0.1, 0.1, 0.7],
... ])

Score sequences using log-sum-exp (default mode):

>>> pm.gseq_pwm(["ACGTACGT", "GGGGGGGG"], pssm, mode="lse")
array([...])

Find position of best match:

>>> pm.gseq_pwm(["ACGTACGT"], pssm, mode="pos")
array([...])

Count matches above a threshold:

>>> pm.gseq_pwm(["ACGTACGT"], pssm, mode="count", score_thresh=-5.0)
array([...])
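
The four aggregation modes can be sketched in plain Python to show how they relate. This is an illustration, not pymisha's implementation: `normalize`, `window_scores` and `score` are hypothetical names, only the forward strand is scanned (so values will differ from the default `bidirect=True`), gaps and neutral characters are not handled, and the way the pseudocount is applied (added to every cell, then each row rescaled to sum to 1) is an assumption based on the description of `prior`.

```python
import math

BASES = "ACGT"


def normalize(pssm, prior=0.01):
    """Add the pseudocount to every cell, then rescale each row to sum to 1."""
    rows = []
    for row in pssm:
        padded = [x + prior for x in row]
        total = sum(padded)
        rows.append([x / total for x in padded])
    return rows


def window_scores(seq, pssm, prior=0.01):
    """Log-probability of each forward-strand window of PSSM width."""
    probs = normalize(pssm, prior)
    w = len(probs)
    return [
        sum(math.log(probs[j][BASES.index(seq[i + j])]) for j in range(w))
        for i in range(len(seq) - w + 1)
    ]


def score(seq, pssm, mode="lse", score_thresh=0.0, prior=0.01):
    """Aggregate the window scores according to mode."""
    scores = window_scores(seq, pssm, prior)
    if mode == "lse":
        top = max(scores)  # subtract the max for numerical stability
        return top + math.log(sum(math.exp(s - top) for s in scores))
    if mode == "max":
        return max(scores)
    if mode == "pos":
        return scores.index(max(scores)) + 1  # 1-based; ties go to the first window
    if mode == "count":
        return sum(s >= score_thresh for s in scores)
    raise ValueError(f"unknown mode: {mode}")


pssm = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
for mode in ("lse", "max", "pos", "count"):
    print(mode, score("ACGTACGT", pssm, mode=mode, score_thresh=-5.0))
```

In this sketch `"lse"` is always at least as large as `"max"`, since it sums the contribution of every window rather than keeping only the best one.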

See Also

gseq_kmer, gseq_kmer_dist, gseq_extract

Motif Import¶

Functions to import motifs from standard motif format files (MEME, JASPAR, HOMER).

pymisha.gseq_read_meme ¶

gseq_read_meme(file: str) -> dict[str, pandas.DataFrame]

Read motifs from a MEME minimal motif format file.

Parses a MEME minimal motif format file and returns a dict of position probability matrices (PPM). Each DataFrame has columns A, C, G, T with one row per motif position. Metadata is stored in the DataFrame's .attrs dict.

PARAMETER	DESCRIPTION
`file`	Path to a MEME format file (`.meme`, `.txt`). TYPE: `str`

RETURNS	DESCRIPTION
`dict[str, DataFrame]`	Keys are motif identifiers, values are DataFrames. Each DataFrame has `.attrs` with keys: `name`, `alength`, `w`, `nsites`, `E`, `url`, `strand`, `background`.

RAISES	DESCRIPTION
`FileNotFoundError`	If file does not exist.
`ValueError`	If the file is malformed or contains unsupported content.
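
For orientation, the sketch below parses a tiny MEME minimal-format snippet into the kind of structure described above. It is not pymisha's parser: `parse_meme` and `MEME_TEXT` are hypothetical, plain dicts and lists stand in for DataFrames and `.attrs`, and optional fields such as `url`, `strand` and `background` are ignored.

```python
import re

# A toy motif in MEME minimal format (illustrative values only).
MEME_TEXT = """\
MEME version 4

ALPHABET= ACGT

MOTIF M1 toy_motif
letter-probability matrix: alength= 4 w= 2 nsites= 20 E= 0
0.7 0.1 0.1 0.1
0.1 0.1 0.1 0.7
"""


def parse_meme(text):
    """Return {motif_id: {"attrs": {...}, "rows": [[A, C, G, T], ...]}}."""
    motifs = {}
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith("MOTIF"):
            continue
        fields = line.split()
        motif_id = fields[1]
        name = fields[2] if len(fields) > 2 else motif_id
        # The next non-blank line is the matrix header with "key= value" pairs.
        header = next(ln for ln in lines if ln.strip())
        pairs = dict(re.findall(r"(\w+)=\s*(\S+)", header))
        w = int(pairs["w"])
        # Read exactly w probability rows, one per motif position.
        rows = [[float(x) for x in next(lines).split()] for _ in range(w)]
        attrs = {"name": name, "alength": int(pairs["alength"]), "w": w,
                 "nsites": int(pairs["nsites"]), "E": float(pairs["E"])}
        motifs[motif_id] = {"attrs": attrs, "rows": rows}
    return motifs


motifs = parse_meme(MEME_TEXT)
print(motifs["M1"]["attrs"])
print(motifs["M1"]["rows"])
```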

pymisha.gseq_read_jaspar ¶

gseq_read_jaspar(file: str) -> dict[str, pandas.DataFrame]

Read motifs from a JASPAR PFM format file.

Parses a JASPAR Position Frequency Matrix (PFM) file and returns a dict of position probability matrices (PPM). Supports both the standard JASPAR header format (>ID NAME followed by labeled A/C/G/T rows) and the simple 4-row PFM format. Counts are converted to probabilities.

PARAMETER	DESCRIPTION
`file`	Path to a JASPAR format file (`.jaspar`, `.pfm`, `.txt`). TYPE: `str`

RETURNS	DESCRIPTION
`dict[str, DataFrame]`	Keys are motif identifiers, values are DataFrames with columns `A`, `C`, `G`, `T`. Each DataFrame has `.attrs` with keys: `name`, `w`, `nsites`, `format`.

RAISES	DESCRIPTION
`FileNotFoundError`	If file does not exist.
`ValueError`	If the file is malformed.
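
The count-to-probability conversion can be illustrated with a small sketch. It is not pymisha's parser: `parse_jaspar` and `JASPAR_TEXT` are hypothetical, only the labeled-row header format is handled, and the result is a plain dict of row lists rather than DataFrames. Each position's counts are assumed to be divided by their per-position total.

```python
# A toy motif in the JASPAR header format (illustrative counts only).
JASPAR_TEXT = """\
>MA0000.1 TOY
A [ 8 0 2 ]
C [ 0 10 2 ]
G [ 2 0 4 ]
T [ 0 0 2 ]
"""


def parse_jaspar(text):
    """Return {motif_id: rows}, one [A, C, G, T] probability row per position."""
    motifs = {}
    motif_id, counts = None, {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            motif_id, counts = line[1:].split()[0], {}
        elif line and line[0] in "ACGT":
            values = line[1:].replace("[", " ").replace("]", " ").split()
            counts[line[0]] = [float(x) for x in values]
        if motif_id and len(counts) == 4:
            columns = zip(counts["A"], counts["C"], counts["G"], counts["T"])
            # Divide each position's counts by their total to get probabilities.
            motifs[motif_id] = [[c / sum(col) for c in col] for col in columns]
            motif_id, counts = None, {}
    return motifs


print(parse_jaspar(JASPAR_TEXT))
```

The file stores one row per base, so the sketch transposes the counts to get one row per motif position, matching the column layout `A`, `C`, `G`, `T` described above.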

pymisha.gseq_read_homer ¶

gseq_read_homer(file: str) -> dict[str, pandas.DataFrame]

Read motifs from a HOMER motif format file.

Parses a HOMER .motif format file and returns a dict of position probability matrices (PPM). Each DataFrame has columns A, C, G, T.

PARAMETER	DESCRIPTION
`file`	Path to a HOMER motif file (`.motif`). TYPE: `str`

RETURNS	DESCRIPTION
`dict[str, DataFrame]`	Keys are derived from the consensus sequence. Each DataFrame has `.attrs` with keys: `name`, `consensus`, `log_odds_threshold`, `log_p_value`, `w`, `source`.

RAISES	DESCRIPTION
`FileNotFoundError`	If file does not exist.
`ValueError`	If the file is malformed.
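
As a rough picture of the input, a HOMER motif file has a tab-separated header line starting with `>` (consensus, name, log-odds threshold, log p-value, then optional fields) followed by one row of four probabilities per position. The sketch below parses that layout. It is not pymisha's parser: `parse_homer` and `HOMER_TEXT` are hypothetical, the consensus is used directly as the key (a simplification of "derived from the consensus"), and plain dicts stand in for DataFrames and `.attrs`.

```python
# A toy motif in HOMER .motif layout (illustrative values only).
HOMER_TEXT = (
    ">ACGT\ttoy_motif\t6.5\t-20.0\n"
    "0.7\t0.1\t0.1\t0.1\n"
    "0.1\t0.7\t0.1\t0.1\n"
    "0.1\t0.1\t0.7\t0.1\n"
    "0.1\t0.1\t0.1\t0.7\n"
)


def parse_homer(text):
    """Return {consensus: {"attrs": {...}, "rows": [[A, C, G, T], ...]}}."""
    motifs = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">"):
            fields = line[1:].split("\t")
            current = {
                "attrs": {
                    "consensus": fields[0],
                    "name": fields[1],
                    "log_odds_threshold": float(fields[2]),
                    "log_p_value": float(fields[3]),
                },
                "rows": [],
            }
            motifs[fields[0]] = current
        elif current is not None:
            current["rows"].append([float(x) for x in line.split()])
    # The motif width is the number of probability rows that followed the header.
    for motif in motifs.values():
        motif["attrs"]["w"] = len(motif["rows"])
    return motifs


homer_motifs = parse_homer(HOMER_TEXT)
print(homer_motifs["ACGT"]["attrs"])
```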