Sequence Analysis

Functions for extracting and analyzing DNA sequences, including reverse complement operations, k-mer counting, and position weight matrix scoring.

pymisha.gseq_extract

gseq_extract(intervals)

Extract DNA sequences for given intervals.

Returns one sequence string per interval in 'intervals'. If 'intervals' contains an additional 'strand' column, any interval whose strand is -1 is returned as the reverse complement.

PARAMETER DESCRIPTION
intervals

Intervals for which DNA sequence is returned. Must have 'chrom', 'start', and 'end' columns. Optional 'strand' column (-1 for reverse complement).

TYPE: DataFrame

RETURNS DESCRIPTION
list of str

DNA sequence strings, one per interval.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervs = pm.gintervals(["1", "2"], [10000, 10000], [10020, 10020])
>>> pm.gseq_extract(intervs)
[...]
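
The strand handling described above can be illustrated without a genome database. The following is a minimal sketch, not pymisha's implementation: `TOY_GENOME` and `extract_sketch` are hypothetical names, intervals are plain dicts rather than a DataFrame, and coordinates are assumed to be 0-based with an exclusive end.

```python
# Toy stand-in for a genome database (hypothetical data, not the pymisha example database).
TOY_GENOME = {"1": "ACGTTGCAAC", "2": "GGGCCCATAT"}

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_sketch(intervals):
    """Return one sequence per interval; strand -1 yields the reverse complement."""
    sequences = []
    for iv in intervals:
        # Assumption: 0-based start, exclusive end.
        seq = TOY_GENOME[iv["chrom"]][iv["start"]:iv["end"]]
        if iv.get("strand") == -1:
            seq = seq.translate(_COMPLEMENT)[::-1]
        sequences.append(seq)
    return sequences


toy_intervals = [
    {"chrom": "1", "start": 0, "end": 4},
    {"chrom": "1", "start": 2, "end": 6, "strand": -1},
    {"chrom": "2", "start": 3, "end": 7, "strand": 1},
]
print(extract_sketch(toy_intervals))  # ['ACGT', 'CAAC', 'CCCA']
```

The second interval covers "GTTG" on the forward strand, so its reverse complement "CAAC" is returned.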

See Also

gseq_rev, gseq_comp, gseq_revcomp, gseq_kmer

pymisha.gseq_rev

gseq_rev(seq)

Reverse DNA sequence(s).

PARAMETER DESCRIPTION
seq

DNA sequence(s) to reverse.

TYPE: str or list of str

RETURNS DESCRIPTION
str or list of str

Reversed sequence(s).

Examples:

>>> import pymisha as pm
>>> pm.gseq_rev("ACGT")
'TGCA'
>>> pm.gseq_rev(["ACGT", "TATA"])
['TGCA', 'ATAT']

See Also

gseq_comp, gseq_revcomp, gseq_extract

pymisha.gseq_comp

gseq_comp(seq)

Return complement of DNA sequence(s).

PARAMETER DESCRIPTION
seq

DNA sequence(s) to complement.

TYPE: str or list of str

RETURNS DESCRIPTION
str or list of str

Complemented sequence(s).

Examples:

>>> import pymisha as pm
>>> pm.gseq_comp("ACGT")
'TGCA'
>>> pm.gseq_comp(["ACGT", "AAAA"])
['TGCA', 'TTTT']

See Also

gseq_rev, gseq_revcomp, gseq_extract

pymisha.gseq_revcomp

gseq_revcomp(seq)

Return reverse complement of DNA sequence(s).

PARAMETER DESCRIPTION
seq

DNA sequence(s) to reverse complement.

TYPE: str or list of str

RETURNS DESCRIPTION
str or list of str

Reverse complemented sequence(s).

Examples:

>>> import pymisha as pm
>>> pm.gseq_revcomp("AACG")
'CGTT'
>>> pm.gseq_revcomp(["AACG", "AAAA"])
['CGTT', 'TTTT']
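
The relationship between gseq_rev, gseq_comp and gseq_revcomp can be sketched in plain Python. The helper names below (`rev`, `comp`, `revcomp`, `apply_to`) are illustrative only and are not part of pymisha; the sketch assumes only A, C, G, T (upper or lower case) need complementing.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def rev(seq):
    """Reverse a sequence without complementing it."""
    return seq[::-1]


def comp(seq):
    """Complement each base, keeping the original order."""
    return seq.translate(_COMPLEMENT)


def revcomp(seq):
    """Reverse complement: complement every base, then reverse."""
    return rev(comp(seq))


def apply_to(fn, seqs):
    """Accept a single string or a list of strings, like the functions above."""
    return fn(seqs) if isinstance(seqs, str) else [fn(s) for s in seqs]


print(apply_to(rev, "ACGT"))                # TGCA
print(apply_to(comp, ["ACGT", "AAAA"]))     # ['TGCA', 'TTTT']
print(apply_to(revcomp, ["AACG", "AAAA"]))  # ['CGTT', 'TTTT']
```

Reversal and complementation commute, so applying them in either order gives the same reverse complement.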

See Also

gseq_rev, gseq_comp, gseq_extract

pymisha.gseq_kmer

gseq_kmer(seqs, kmer, mode='count', strand=0, start_pos=None, end_pos=None, extend=False, skip_gaps=True, gap_chars=None)

Count k-mer occurrences in DNA sequences.

PARAMETER DESCRIPTION
seqs

DNA sequence(s) to search.

TYPE: str or list of str

kmer

K-mer pattern to search for (only A, C, G, T characters).

TYPE: str

mode

"count" returns raw counts, "frac" returns fraction of possible positions.

TYPE: str DEFAULT: "count"

strand

0 = both strands, 1 = forward only, -1 = reverse complement only.

TYPE: int DEFAULT: 0

start_pos

1-based start position of region of interest.

TYPE: int DEFAULT: None

end_pos

1-based end position of region of interest (inclusive).

TYPE: int DEFAULT: None

extend

If True, allow k-mer to extend beyond ROI boundaries.

TYPE: bool DEFAULT: False

skip_gaps

If True, skip gap characters when scanning.

TYPE: bool DEFAULT: True

gap_chars

Characters to treat as gaps. If None, ["-", "."] is used.

TYPE: list of str DEFAULT: None

RETURNS DESCRIPTION
ndarray

Array of counts or fractions, one per input sequence.

Examples:

>>> import pymisha as pm

Count CG dinucleotides on both strands:

>>> pm.gseq_kmer(["CGCGCGCGCG", "ATATATATAT"], "CG")
array([10.,  0.])

Get fraction instead of count:

>>> pm.gseq_kmer(["CGCGCGCGCG"], "CG", mode="frac")
array([0.555...])

Forward strand only:

>>> pm.gseq_kmer(["CGCGCGCGCG"], "CG", strand=1)
array([5.])

Count in a specific region:

>>> pm.gseq_kmer("ATCGATCG", "AT", start_pos=1, end_pos=4)
array([2.])
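
The counts in the examples above can be reproduced with a short reference sketch. It is an interpretation of the documented behaviour, not pymisha's code: `kmer_sketch` is a hypothetical name, overlapping occurrences are assumed to count, gap handling and `extend` are omitted, and the "frac" denominator is assumed to be the number of possible start positions summed over the strands scanned.

```python
def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_forward(seq, kmer):
    """Count (possibly overlapping) occurrences of kmer in seq."""
    k = len(kmer)
    return sum(seq[i:i + k] == kmer for i in range(len(seq) - k + 1))


def kmer_sketch(seq, kmer, mode="count", strand=0, start_pos=None, end_pos=None):
    # 1-based inclusive ROI converted to a Python slice.
    lo = (start_pos or 1) - 1
    hi = end_pos if end_pos is not None else len(seq)
    roi = seq[lo:hi]
    # Searching for the reverse-complemented k-mer on the forward sequence
    # is equivalent to searching for the k-mer on the reverse strand.
    patterns = {1: [kmer], -1: [revcomp(kmer)], 0: [kmer, revcomp(kmer)]}[strand]
    count = sum(count_forward(roi, p) for p in patterns)
    if mode == "count":
        return float(count)
    positions = max(len(roi) - len(kmer) + 1, 0) * len(patterns)
    return count / positions if positions else 0.0


print(kmer_sketch("CGCGCGCGCG", "CG"))                       # 10.0
print(kmer_sketch("CGCGCGCGCG", "CG", strand=1))             # 5.0
print(kmer_sketch("ATCGATCG", "AT", start_pos=1, end_pos=4)) # 2.0
```

Because "CG" and "AT" are their own reverse complements, every forward match is counted again on the reverse strand when strand=0.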

See Also

gseq_kmer_dist, gseq_pwm, gseq_extract

pymisha.gseq_kmer_dist

gseq_kmer_dist(intervals, k=6, mask=None)

Compute k-mer distribution in genomic intervals.

Counts all k-mers of size k within the specified genomic intervals, optionally excluding masked regions.

PARAMETER DESCRIPTION
intervals

Genomic intervals to analyze.

TYPE: DataFrame

k

K-mer size (1-10).

TYPE: int DEFAULT: 6

mask

Intervals to exclude from counting.

TYPE: DataFrame DEFAULT: None

RETURNS DESCRIPTION
DataFrame

DataFrame with columns 'kmer' (str) and 'count' (int). Only k-mers with count > 0 are included.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervs = pm.gintervals(1, 0, 10000)
>>> result = pm.gseq_kmer_dist(intervs, k=2)
>>> list(result.columns)
['kmer', 'count']
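
The shape of the result can be illustrated on a plain string. This is a sketch under stated assumptions rather than pymisha's implementation: `kmer_dist_sketch` is a hypothetical helper, it works on a string instead of genomic intervals, it ignores the mask argument, and it assumes that windows containing characters other than A, C, G, T are skipped.

```python
from collections import Counter


def kmer_dist_sketch(seq, k=6):
    """Return sorted (kmer, count) pairs for every k-mer observed in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Assumption: k-mers with non-ACGT characters (e.g. N) are not reported.
    valid = {kmer: n for kmer, n in counts.items() if set(kmer) <= set("ACGT")}
    # Only observed k-mers appear, mirroring the "count > 0" rule above.
    return sorted(valid.items())


print(kmer_dist_sketch("ACGTNACG", k=2))  # [('AC', 2), ('CG', 2), ('GT', 1)]
```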

See Also

gseq_kmer, gseq_pwm, gseq_extract

pymisha.gseq_pwm

gseq_pwm(seqs, pssm, mode='lse', bidirect=True, strand=0, score_thresh=0, start_pos=None, end_pos=None, extend=False, spat_factor=None, spat_bin=1, spat_min=None, spat_max=None, return_strand=False, skip_gaps=True, gap_chars=None, neutral_chars=None, neutral_chars_policy='average', prior=0.01)

Score DNA sequences with a position weight matrix (PWM/PSSM).

Scans each input sequence with a sliding window of PSSM width and computes per-position log-probability scores, then aggregates them according to mode.

PARAMETER DESCRIPTION
seqs

DNA sequence(s) to score.

TYPE: str or list of str

pssm

Position-specific scoring matrix. If ndarray, shape (w, 4) with columns ordered [A, C, G, T]. If DataFrame, must contain columns A, C, G, T (extra columns are ignored).

TYPE: ndarray or DataFrame

mode

Scoring mode:

  • "lse": log-sum-exp of all per-position scores.
  • "max": maximum per-position score.
  • "pos": 1-based position of best match.
  • "count": number of positions with score >= score_thresh.

TYPE: str DEFAULT: "lse"

bidirect

If True, score both forward and reverse complement strands. Overrides strand.

TYPE: bool DEFAULT: True

strand

Used only when bidirect=False: 1 or 0 (default) = forward strand only, -1 = reverse complement only.

TYPE: int DEFAULT: 0

score_thresh

Threshold for "count" mode.

TYPE: float DEFAULT: 0

start_pos

1-based inclusive start of region of interest.

TYPE: int or None DEFAULT: None

end_pos

1-based inclusive end of region of interest.

TYPE: int or None DEFAULT: None

extend

Allow the motif window to start before the ROI. True extends by w-1 positions (w is the motif width), False by 0; an integer sets an explicit extension.

TYPE: bool or int DEFAULT: False

spat_factor

Spatial weighting factors, one per spatial bin. When provided, the score at each window position is shifted in log-space by log(spat_factor[bin]) where bin = offset // spat_bin and offset is the 0-based position of the window start relative to the ROI start. Values must be non-negative.

TYPE: array-like or None DEFAULT: None

spat_bin

Number of consecutive positions that share the same spatial weight. A window whose offset from the ROI start is offset maps to bin offset // spat_bin.

TYPE: int DEFAULT: 1
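
The bin mapping and log-space shift described for spat_factor and spat_bin can be sketched as follows. `spatial_shift` is a hypothetical helper that takes already-computed per-window scores; it assumes enough factors are supplied to cover every bin, and it treats a factor of 0 as a shift of negative infinity, which is an assumption rather than documented behaviour.

```python
import math


def spatial_shift(scores, spat_factor, spat_bin=1):
    """Shift each window score by log(spat_factor[offset // spat_bin])."""
    shifted = []
    for offset, score in enumerate(scores):  # offset is 0-based from the ROI start
        factor = spat_factor[offset // spat_bin]
        # Assumption: a zero weight suppresses the window entirely.
        log_factor = math.log(factor) if factor > 0 else -math.inf
        shifted.append(score + log_factor)
    return shifted


# Four windows, two per bin: the first bin is unweighted, the second is halved.
shifted = spatial_shift([-4.0, -4.0, -4.0, -4.0], spat_factor=[1.0, 0.5], spat_bin=2)
print(shifted)
```

With spat_bin=2, offsets 0 and 1 use the first factor and offsets 2 and 3 use the second, so only the last two scores change.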

spat_min

Reserved for virtual-track context (not used in string scoring).

TYPE: float or None DEFAULT: None

spat_max

Reserved for virtual-track context (not used in string scoring).

TYPE: float or None DEFAULT: None

return_strand

For mode="pos" with bidirectional scoring, return a DataFrame with pos and strand columns instead of a plain array.

TYPE: bool DEFAULT: False

skip_gaps

Skip gap characters when scanning.

TYPE: bool DEFAULT: True

gap_chars

Characters treated as gaps. If None, ["-", "."] is used.

TYPE: list of str or None DEFAULT: None

neutral_chars

Characters treated as unknown/ambiguous bases. If None, ["N", "n", "*"] is used.

TYPE: list of str or None DEFAULT: None

neutral_chars_policy

How to score neutral characters:

  • "average": mean log-probability of the PSSM column.
  • "log_quarter": log(0.25).
  • "na": window is invalid (NaN).

TYPE: str DEFAULT: "average"
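
The three policies can be compared on a single motif position. The sketch below is illustrative only: `neutral_score` is a hypothetical helper, and "average" is read here as the mean of the four log-probabilities at that position.

```python
import math


def neutral_score(log_probs, policy="average"):
    """Score contributed by a neutral character at one motif position.

    log_probs: the four log-probabilities [A, C, G, T] for that position.
    """
    if policy == "average":
        return sum(log_probs) / len(log_probs)
    if policy == "log_quarter":
        return math.log(0.25)
    if policy == "na":
        return math.nan  # the whole window becomes invalid
    raise ValueError(f"unknown policy: {policy}")


position = [math.log(p) for p in (0.7, 0.1, 0.1, 0.1)]
for policy in ("average", "log_quarter", "na"):
    print(policy, neutral_score(position, policy))
```

For a skewed position such as this one, "average" penalises an unknown base more strongly than "log_quarter", because the three low-probability bases dominate the mean.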

prior

Pseudocount added to PSSM before normalization. Must be >= 0.

TYPE: float DEFAULT: 0.01

RETURNS DESCRIPTION
ndarray or DataFrame

Array of scores (one per input sequence). For mode="pos" with return_strand=True, returns a DataFrame with columns pos and strand.

Examples:

>>> import pymisha as pm
>>> import numpy as np

Create a simple PSSM (frequency matrix):

>>> pssm = np.array([
...     [0.7, 0.1, 0.1, 0.1],
...     [0.1, 0.7, 0.1, 0.1],
...     [0.1, 0.1, 0.7, 0.1],
...     [0.1, 0.1, 0.1, 0.7],
... ])

Score sequences using log-sum-exp (default mode):

>>> pm.gseq_pwm(["ACGTACGT", "GGGGGGGG"], pssm, mode="lse")
array([...])

Find position of best match:

>>> pm.gseq_pwm(["ACGTACGT"], pssm, mode="pos")
array([...])

Count matches above a threshold:

>>> pm.gseq_pwm(["ACGTACGT"], pssm, mode="count", score_thresh=-5.0)
array([...])
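
The sliding-window scoring and the four aggregation modes can be sketched in pure Python. This is an interpretation of the description above, not pymisha's implementation, and its numbers are not claimed to match gseq_pwm output. The helper names are hypothetical, only the forward strand is scored, the prior is assumed to be added to every cell before each position is renormalised, and ties in "pos" mode are assumed to resolve to the first window.

```python
import math

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

TOY_PSSM = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]


def log_pssm(pssm, prior=0.01):
    """Add the pseudocount to each cell, renormalise each position, take logs."""
    out = []
    for row in pssm:
        total = sum(row) + len(row) * prior
        out.append([math.log((p + prior) / total) for p in row])
    return out


def window_scores(seq, pssm, prior=0.01):
    """Log-probability score for every window of motif width (forward strand)."""
    logp = log_pssm(pssm, prior)
    w = len(logp)
    return [
        sum(logp[j][BASE_INDEX[seq[i + j]]] for j in range(w))
        for i in range(len(seq) - w + 1)
    ]


def aggregate(scores, mode="lse", score_thresh=0.0):
    if mode == "lse":
        m = max(scores)  # subtract the maximum for numerical stability
        return m + math.log(sum(math.exp(s - m) for s in scores))
    if mode == "max":
        return max(scores)
    if mode == "pos":
        return scores.index(max(scores)) + 1  # 1-based position
    if mode == "count":
        return sum(s >= score_thresh for s in scores)
    raise ValueError(f"unknown mode: {mode}")


scores = window_scores("ACGTACGT", TOY_PSSM)
for mode in ("lse", "max", "pos"):
    print(mode, aggregate(scores, mode))
print("count", aggregate(scores, "count", score_thresh=-5.0))
```

In this sketch the sequence "ACGTACGT" contains two perfect matches (windows 1 and 5), so the log-sum-exp is higher than the maximum single-window score.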

See Also

gseq_kmer, gseq_kmer_dist, gseq_extract

Motif Import

Functions to import motifs from standard motif format files (MEME, JASPAR, HOMER).

pymisha.gseq_read_meme

gseq_read_meme(file: str) -> dict[str, _pandas.DataFrame]

Read motifs from a MEME minimal motif format file.

Parses a MEME minimal motif format file and returns a dict of position probability matrices (PPM). Each DataFrame has columns A, C, G, T with one row per motif position. Metadata is stored in the DataFrame's .attrs dict.

PARAMETER DESCRIPTION
file

Path to a MEME format file (.meme, .txt).

TYPE: str

RETURNS DESCRIPTION
dict[str, DataFrame]

Keys are motif identifiers, values are DataFrames. Each DataFrame has .attrs with keys: name, alength, w, nsites, E, url, strand, background.

RAISES DESCRIPTION
FileNotFoundError

If file does not exist.

ValueError

If the file is malformed or contains unsupported content.
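
To show what the parser has to extract, here is a minimal sketch that reads a MEME minimal-format string. It is not pymisha's parser: `read_meme_sketch` and `TOY_MEME` are hypothetical, plain dicts stand in for DataFrames with .attrs, and validation, background frequencies and URLs are ignored.

```python
import re

TOY_MEME = """MEME version 4

ALPHABET= ACGT

strands: + -

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25

MOTIF M1 toy
letter-probability matrix: alength= 4 w= 2 nsites= 20 E= 0
0.7 0.1 0.1 0.1
0.1 0.1 0.1 0.7
"""


def read_meme_sketch(text):
    """Return {motif_id: {'name', 'alength', 'w', 'nsites', 'E', 'rows'}}."""
    motifs, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("MOTIF"):
            parts = line.split()
            # The alternate name is optional; fall back to the identifier.
            current = {"name": parts[2] if len(parts) > 2 else parts[1], "rows": []}
            motifs[parts[1]] = current
        elif current is not None and line.startswith("letter-probability matrix:"):
            # Header fields look like "w= 2", so allow whitespace after "=".
            for key, value in re.findall(r"(\w+)=\s*(\S+)", line):
                current[key] = float(value)
        elif current is not None and line and len(current["rows"]) < current.get("w", 0):
            current["rows"].append([float(x) for x in line.split()])
    return motifs


meme_motifs = read_meme_sketch(TOY_MEME)
print(meme_motifs["M1"]["rows"])
```

Each row of the matrix is one motif position with probabilities in A, C, G, T order, which is the same layout gseq_pwm expects for its pssm argument.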

pymisha.gseq_read_jaspar

gseq_read_jaspar(file: str) -> dict[str, _pandas.DataFrame]

Read motifs from a JASPAR PFM format file.

Parses a JASPAR Position Frequency Matrix (PFM) file and returns a dict of position probability matrices (PPM). Supports both the standard JASPAR header format (>ID NAME followed by labeled A/C/G/T rows) and the simple 4-row PFM format. Counts are converted to probabilities.

PARAMETER DESCRIPTION
file

Path to a JASPAR format file (.jaspar, .pfm, .txt).

TYPE: str

RETURNS DESCRIPTION
dict[str, DataFrame]

Keys are motif identifiers, values are DataFrames with columns A, C, G, T. Each DataFrame has .attrs with keys: name, w, nsites, format.

RAISES DESCRIPTION
FileNotFoundError

If file does not exist.

ValueError

If the file is malformed.
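
The count-to-probability conversion can be sketched for the header format with labeled rows. This is an illustration only, not pymisha's parser: `read_jaspar_sketch` and `TOY_JASPAR` are hypothetical, plain dicts stand in for DataFrames, and positions whose counts sum to zero are not handled.

```python
import re

TOY_JASPAR = """>MA0000.1 TOY
A  [ 8 1 0 ]
C  [ 1 1 0 ]
G  [ 1 7 0 ]
T  [ 0 1 10 ]
"""


def read_jaspar_sketch(text):
    """Return {motif_id: {'name', 'w', 'ppm'}} with ppm rows ordered [A, C, G, T]."""
    counts_by_id, names, motif_id = {}, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            motif_id = fields[0]
            names[motif_id] = fields[1] if len(fields) > 1 else fields[0]
            counts_by_id[motif_id] = {}
        elif motif_id is not None:
            base = line[0].upper()
            numbers = re.findall(r"\d+(?:\.\d+)?", line[1:])
            counts_by_id[motif_id][base] = [float(x) for x in numbers]

    motifs = {}
    for motif_id, counts in counts_by_id.items():
        # Transpose the four per-base rows into one tuple per motif position.
        positions = list(zip(*(counts[b] for b in "ACGT")))
        # Divide each count by the total at its position to get probabilities.
        ppm = [[c / sum(pos) for c in pos] for pos in positions]
        motifs[motif_id] = {"name": names[motif_id], "w": len(ppm), "ppm": ppm}
    return motifs


jaspar_motifs = read_jaspar_sketch(TOY_JASPAR)
print(jaspar_motifs["MA0000.1"]["ppm"])
```

JASPAR files store one row per base, so the matrix is transposed on the way in to give one row per motif position.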

pymisha.gseq_read_homer

gseq_read_homer(file: str) -> dict[str, _pandas.DataFrame]

Read motifs from a HOMER motif format file.

Parses a HOMER .motif format file and returns a dict of position probability matrices (PPM). Each DataFrame has columns A, C, G, T.

PARAMETER DESCRIPTION
file

Path to a HOMER motif file (.motif).

TYPE: str

RETURNS DESCRIPTION
dict[str, DataFrame]

Keys are derived from the consensus sequence. Each DataFrame has .attrs with keys: name, consensus, log_odds_threshold, log_p_value, w, source.

RAISES DESCRIPTION
FileNotFoundError

If file does not exist.

ValueError

If the file is malformed.
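
A HOMER motif file has a tab-separated header line starting with ">" followed by one row of four probabilities per position. The sketch below shows one way to read that layout. It is not pymisha's parser: `read_homer_sketch` and `TOY_HOMER` are hypothetical, plain dicts stand in for DataFrames, and the consensus is used directly as the key, whereas the real key derivation is not specified here.

```python
TOY_HOMER = "\n".join([
    ">ACGT\ttoy_motif\t6.5\t-20.0",
    "0.7\t0.1\t0.1\t0.1",
    "0.1\t0.7\t0.1\t0.1",
    "0.1\t0.1\t0.7\t0.1",
    "0.1\t0.1\t0.1\t0.7",
])


def read_homer_sketch(text):
    """Return {consensus: {'consensus', 'name', 'log_odds_threshold', 'log_p_value', 'rows'}}."""
    motifs, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split("\t")
            current = {
                "consensus": fields[0],
                "name": fields[1] if len(fields) > 1 else fields[0],
                "log_odds_threshold": float(fields[2]) if len(fields) > 2 else None,
                "log_p_value": float(fields[3]) if len(fields) > 3 else None,
                "rows": [],
            }
            # Assumption: key the motif by its consensus sequence as-is.
            motifs[fields[0]] = current
        elif current is not None:
            # Each data row holds the A, C, G, T probabilities for one position.
            current["rows"].append([float(x) for x in line.split()])
    return motifs


homer_motifs = read_homer_sketch(TOY_HOMER)
print(homer_motifs["ACGT"]["name"], len(homer_motifs["ACGT"]["rows"]))
```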