Genomic Integration

Functions for working with genomic intervals. All functions in this module require the pymisha package.

Optional dependency

Install pymisha with pip install pymisha. Functions will raise ImportError with a clear message if pymisha is not available.

pyprego.genomic

Genomic integration layer (optional pymisha dependency).

Mirrors the misha.R module from R prego. Provides functions that operate on genomic intervals and tracks, extracting sequences and computing PWM scores over genomic regions.

All functions in this module require the pymisha package to be installed. Import will succeed regardless, but functions will raise ImportError with a clear message at call time if pymisha is not available.
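The call-time failure described above is commonly implemented with a lazy import. The following is a minimal sketch of that pattern, not pyprego's actual code; the helper name `require_module` and the wording of the error message are assumptions made for illustration.

```python
import importlib


def require_module(name: str):
    """Import `name` lazily and raise a descriptive ImportError if it is missing.

    Hypothetical helper: pyprego's real guard may be named and worded differently.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"{name} is required for this function. "
            f"Install it with: pip install {name}"
        ) from exc
```

Because the import happens inside the helper rather than at module level, importing the enclosing module always succeeds, and the error appears only when a function that needs the dependency is called.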

intervals_to_seq

intervals_to_seq(intervals: DataFrame, size: int | None = None) -> list[str]

Extract DNA sequences for genomic intervals.

Mirrors the R intervals_to_seq() function. Uses pymisha.gseq_extract to retrieve sequences and optionally normalizes interval sizes around their centers first.

PARAMETER DESCRIPTION
intervals

Genomic intervals with columns chrom, start, end.

TYPE: DataFrame

size

If provided, normalize intervals to this size (bp) around their center before extraction. None keeps original intervals.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
list[str]

List of uppercase DNA sequences, one per interval.
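To make the `size` normalization concrete, here is a small sketch of resizing intervals around their centers with pandas. It does not call pymisha, and the rounding convention (integer midpoint, `size // 2` bases to the left) is an assumption; the library may round differently or clip at chromosome boundaries. The function name `normalize_intervals` is hypothetical.

```python
import pandas as pd


def normalize_intervals(intervals: pd.DataFrame, size: int) -> pd.DataFrame:
    """Resize each interval to `size` bp around its midpoint (assumed rounding)."""
    out = intervals.copy()
    centers = (out["start"] + out["end"]) // 2  # integer midpoint
    out["start"] = centers - size // 2
    out["end"] = out["start"] + size
    return out


intervals = pd.DataFrame(
    {"chrom": ["chr1", "chr2"], "start": [1000, 5000], "end": [1200, 5051]}
)
resized = normalize_intervals(intervals, size=100)
```

Every row of `resized` spans exactly 100 bp. The sketch does not guard against negative start coordinates near chromosome ends.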

gextract_pwm

gextract_pwm(intervals: DataFrame, pssm: DataFrame, *, spat: DataFrame | None = None, bidirect: bool = True, prior: float = 0.01, func: str = 'logSumExp', size: int | None = None) -> np.ndarray

Extract PWM scores for genomic intervals.

Extracts sequences from the genome for the given intervals, then scores each with compute_pwm. Mirrors the R gextract_pwm() function.

PARAMETER DESCRIPTION
intervals

Genomic intervals with chrom, start, end columns.

TYPE: DataFrame

pssm

PSSM DataFrame with columns A, C, G, T.

TYPE: DataFrame

spat

Spatial model DataFrame (bin, spat_factor columns).

TYPE: DataFrame | None DEFAULT: None

bidirect

Score both orientations.

TYPE: bool DEFAULT: True

prior

Uniform prior added to PSSM probabilities.

TYPE: float DEFAULT: 0.01

func

Aggregation function: "logSumExp" or "max".

TYPE: str DEFAULT: 'logSumExp'

size

Normalize intervals to this size before extraction.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
ndarray

1-D array of PWM scores, one per interval.
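The roles of `prior`, `bidirect`, and `func` can be illustrated with a plain NumPy sketch that scores a single sequence. This is not the `compute_pwm` implementation: it assumes the prior is added to each probability and the rows are renormalized, that both strands are pooled before aggregation, and that sequences contain only A, C, G, T. The names `window_log_probs` and `score_sequence` are hypothetical.

```python
import numpy as np

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def window_log_probs(seq: str, probs: np.ndarray) -> np.ndarray:
    """Log-likelihood of the motif at every start position where it fits."""
    idx = np.array([BASES.index(b) for b in seq])
    width = probs.shape[0]
    log_p = np.log(probs)
    n_windows = len(seq) - width + 1
    return np.array(
        [log_p[np.arange(width), idx[i : i + width]].sum() for i in range(n_windows)]
    )


def score_sequence(seq, pssm, prior=0.01, bidirect=True, func="logSumExp"):
    """Aggregate window scores over one sequence (assumed scoring scheme)."""
    probs = np.asarray(pssm, dtype=float) + prior
    probs = probs / probs.sum(axis=1, keepdims=True)  # renormalize each position
    scores = window_log_probs(seq, probs)
    if bidirect:
        # Also score the reverse complement and pool both strands.
        rc = seq.translate(COMPLEMENT)[::-1]
        scores = np.concatenate([scores, window_log_probs(rc, probs)])
    if func == "max":
        return float(scores.max())
    if func == "logSumExp":
        return float(np.logaddexp.reduce(scores))
    raise ValueError(f"unknown func: {func!r}")


pssm = [[1, 0, 0, 0], [0, 1, 0, 0]]  # toy two-column motif "AC" (columns A, C, G, T)
best = score_sequence("TTACTT", pssm, func="max")
total = score_sequence("TTACTT", pssm, func="logSumExp")
```

With `func="max"` the score reflects only the single best match, while `"logSumExp"` accumulates evidence from all positions, so `total` is never smaller than `best`.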

gextract_pwm_quantile

gextract_pwm_quantile(intervals: DataFrame, pssm: DataFrame, quantiles: ndarray | list[float], *, spat: DataFrame | None = None, bidirect: bool = True, prior: float = 0.01, func: str = 'logSumExp', size: int | None = None, bg_intervals: DataFrame | None = None, n_sequences: int = 10000) -> np.ndarray

Extract quantiles of PWM scores for genomic intervals.

Computes PWM scores for the input intervals and maps them to quantiles estimated from background intervals. Mirrors the R gextract_pwm.quantile() function.

PARAMETER DESCRIPTION
intervals

Genomic intervals with chrom, start, end columns.

TYPE: DataFrame

pssm

PSSM DataFrame (A, C, G, T columns).

TYPE: DataFrame

quantiles

Quantile breakpoints for the background CDF (e.g. np.arange(0, 1.01, 0.01)).

TYPE: array-like

spat

Spatial model DataFrame (bin, spat_factor columns).

TYPE: DataFrame | None DEFAULT: None

bidirect

Score both orientations.

TYPE: bool DEFAULT: True

prior

Uniform prior added to PSSM probabilities.

TYPE: float DEFAULT: 0.01

func

Aggregation function: "logSumExp" or "max".

TYPE: str DEFAULT: 'logSumExp'

size

Normalize intervals to this size (bp) before extraction.

TYPE: int | None DEFAULT: None

bg_intervals

Background intervals for quantile estimation. If None, pymisha.gintervals_random is used to sample n_sequences random intervals of the same size as the input intervals.

TYPE: DataFrame | None DEFAULT: None

n_sequences

Number of background sequences to sample when bg_intervals is None.

TYPE: int DEFAULT: 10000

RETURNS DESCRIPTION
ndarray

1-D array of quantile values (between 0 and 1), one per interval.
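The mapping from scores to background quantiles can be sketched with NumPy alone. This is an illustration under assumptions, not the library's code: it takes background scores as already computed, derives breakpoints with `np.quantile`, and interpolates linearly between them. The function name `scores_to_quantiles` is hypothetical.

```python
import numpy as np


def scores_to_quantiles(scores, bg_scores, quantiles):
    """Map scores onto the empirical background CDF defined by `quantiles`."""
    quantiles = np.asarray(quantiles, dtype=float)
    # Score value at each requested background quantile.
    breaks = np.quantile(np.asarray(bg_scores, dtype=float), quantiles)
    # np.interp clamps scores outside the background range to the end quantiles.
    return np.interp(scores, breaks, quantiles)


bg_scores = np.arange(101.0)          # toy background scores 0, 1, ..., 100
quantiles = np.linspace(0.0, 1.0, 101)
q = scores_to_quantiles([-5.0, 0.0, 50.0, 100.0, 200.0], bg_scores, quantiles)
```

Scores below the lowest background value map to the first quantile and scores above the highest map to the last. When many background scores are tied, the breakpoints repeat and the interpolated quantile for those values is only approximate.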

gextract_local_pwm

gextract_local_pwm(intervals: DataFrame, pssm: DataFrame, *, spat: DataFrame | None = None, bidirect: bool = True, prior: float = 0.01, size: int | None = None) -> np.ndarray

Extract per-position PWM scores for genomic intervals.

Extracts sequences from the genome and computes per-position PWM scores using compute_local_pwm. Mirrors the R gextract.local_pwm() function.

PARAMETER DESCRIPTION
intervals

Genomic intervals with chrom, start, end columns.

TYPE: DataFrame

pssm

PSSM DataFrame (A, C, G, T columns).

TYPE: DataFrame

spat

Spatial model DataFrame (bin, spat_factor columns).

TYPE: DataFrame | None DEFAULT: None

bidirect

Score both orientations.

TYPE: bool DEFAULT: True

prior

Uniform prior added to PSSM probabilities.

TYPE: float DEFAULT: 0.01

size

Normalize intervals to this size (bp) before extraction.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
ndarray

2-D array of shape (n_intervals, seq_length) with per-position scores. Positions where the PSSM does not fit contain NaN.
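The shape of the returned array, including the NaN padding, can be illustrated with a forward-strand-only NumPy sketch. It is not the `compute_local_pwm` implementation: it assumes equal-length sequences, ignores `spat` and `bidirect`, and assumes that scores are indexed by motif start so the NaN values fall on the trailing positions. The name `local_scores` is hypothetical.

```python
import numpy as np

BASES = "ACGT"


def local_scores(seqs, pssm, prior=0.01):
    """Per-position forward-strand log-likelihoods, NaN where the motif overruns."""
    probs = np.asarray(pssm, dtype=float) + prior
    log_p = np.log(probs / probs.sum(axis=1, keepdims=True))
    width = log_p.shape[0]
    # Start from an all-NaN matrix; only positions where the motif fits are filled.
    out = np.full((len(seqs), len(seqs[0])), np.nan)
    for row, seq in enumerate(seqs):
        idx = np.array([BASES.index(b) for b in seq])
        for start in range(len(seq) - width + 1):
            out[row, start] = log_p[np.arange(width), idx[start : start + width]].sum()
    return out


pssm = [[1, 0, 0, 0], [0, 1, 0, 0]]  # toy two-column motif "AC" (columns A, C, G, T)
scores = local_scores(["TTACTT", "ACGTAC"], pssm)
```

For a motif of width `w`, the last `w - 1` columns of each row stay NaN in this sketch, so downstream reductions should use NaN-aware functions such as `np.nanmax` or `np.nanargmax`.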

gintervals_center_by_pssm

gintervals_center_by_pssm(intervals: DataFrame, pssm: DataFrame, size: int, *, spat: DataFrame | None = None, bidirect: bool = True, prior: float = 0.01) -> pd.DataFrame

Center intervals by the position with maximum PSSM score.

For each interval, computes per-position PWM scores and finds the position with the highest score. The interval is then re-centered on that position and normalized to size bp. Mirrors the R gintervals.center_by_pssm() function.

PARAMETER DESCRIPTION
intervals

Genomic intervals with chrom, start, end columns.

TYPE: DataFrame

pssm

PSSM DataFrame (A, C, G, T columns).

TYPE: DataFrame

size

Target interval size after re-centering.

TYPE: int

spat

Spatial model DataFrame (bin, spat_factor columns).

TYPE: DataFrame | None DEFAULT: None

bidirect

Score both orientations.

TYPE: bool DEFAULT: True

prior

Uniform prior added to PSSM probabilities.

TYPE: float DEFAULT: 0.01

RETURNS DESCRIPTION
DataFrame

Re-centered intervals with chrom, start, end (and any extra columns from the input).
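The re-centering step can be sketched as follows, given a matrix of per-position scores such as the one `gextract_local_pwm` returns. This is an illustration under assumptions: the best position is taken as the offset of the maximum score from the interval start, and the new interval extends `size // 2` bases to its left. The library may instead center on the middle of the motif or break ties differently. The name `recenter_on_best` is hypothetical.

```python
import numpy as np
import pandas as pd


def recenter_on_best(intervals: pd.DataFrame, scores: np.ndarray, size: int) -> pd.DataFrame:
    """Re-center each interval on its best-scoring position and resize to `size`."""
    best = np.nanargmax(scores, axis=1)  # offset of the top score within each interval
    out = intervals.copy()               # extra columns are carried along unchanged
    centers = out["start"].to_numpy() + best
    out["start"] = centers - size // 2
    out["end"] = out["start"] + size
    return out


intervals = pd.DataFrame(
    {"chrom": ["chr1", "chr1"], "start": [1000, 2000], "end": [1006, 2006],
     "name": ["peak_a", "peak_b"]}
)
scores = np.array(
    [[-5.0, -4.0, -1.0, -3.0, -6.0, np.nan],
     [-1.0, -2.0, -3.0, -4.0, -5.0, np.nan]]
)
centered = recenter_on_best(intervals, scores, size=4)
```

Note that `np.nanargmax` raises a `ValueError` for a row that is entirely NaN, so intervals shorter than the motif would need separate handling.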