Genomic Integration

Functions for working with genomic intervals. All functions in this module require the pymisha package.

Optional dependency

Install pymisha with `pip install pymisha`. Functions raise `ImportError` with a clear message if pymisha is not available.

pyprego.genomic

Genomic integration layer (optional pymisha dependency).

Mirrors the misha.R module from R prego. Provides functions that operate on genomic intervals and tracks, extracting sequences and computing PWM scores over genomic regions.

All functions in this module require the pymisha package to be installed. Importing the module always succeeds, but each function raises `ImportError` with a clear message at call time if pymisha is not available.
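The call-time failure described above is commonly achieved by deferring the import into the function body. The following is a minimal sketch of that pattern, not the module's actual code: `require_module`, `needs_optional_dependency`, and the package name `surely_not_installed_pkg_xyz` are hypothetical names chosen for illustration.

```python
import importlib


def require_module(name: str, install_hint: str):
    """Import `name` on demand; raise a descriptive ImportError if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"{name} is required for this function. Install it with: {install_hint}"
        ) from exc


def needs_optional_dependency():
    # The import happens here, at call time, so importing the enclosing
    # module never fails even when the optional package is absent.
    # The package name below is a hypothetical placeholder.
    return require_module(
        "surely_not_installed_pkg_xyz",
        "pip install surely_not_installed_pkg_xyz",
    )


# Calling the function (not importing the module) is what triggers the error.
try:
    needs_optional_dependency()
except ImportError as exc:
    message = str(exc)
```

With this layout, code paths that never touch the genomic functions do not need the optional package at all.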

intervals_to_seq

intervals_to_seq(intervals: DataFrame, size: int | None = None) -> list[str]

Extract DNA sequences for genomic intervals.

Mirrors the R `intervals_to_seq()` function. If `size` is given, the intervals are first normalized to that size around their centers; the sequences are then retrieved with `pymisha.gseq_extract`.

PARAMETER	DESCRIPTION
`intervals`	Genomic intervals with columns `chrom`, `start`, `end`. TYPE: `DataFrame`
`size`	If provided, normalize intervals to this size (bp) around their center before extraction. `None` keeps original intervals. TYPE: `int | None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[str]`	List of uppercase DNA sequences, one per interval.
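To illustrate what normalizing an interval to a fixed `size` around its center means, here is a small sketch in plain Python. Tuples stand in for DataFrame rows so that it runs without pandas or pymisha. The midpoint rounding and the clamp at position 0 are assumptions; the library's exact conventions may differ, and clamping at chromosome ends is not handled here.

```python
def normalize_interval(start: int, end: int, size: int) -> tuple[int, int]:
    """Return a window of exactly `size` bp centered on the interval midpoint."""
    center = (start + end) // 2  # assumed rounding: floor of the midpoint
    new_start = max(center - size // 2, 0)  # assumed clamp at chromosome start
    return new_start, new_start + size


# (chrom, start, end) tuples stand in for rows of the intervals DataFrame.
intervals = [("chr1", 1000, 1200), ("chr2", 50, 90)]
normalized = [(chrom, *normalize_interval(s, e, 100)) for chrom, s, e in intervals]
```

After normalization every interval has the same width, so the extracted sequences all have the same length.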

gextract_pwm

gextract_pwm(intervals: DataFrame, pssm: DataFrame, *, spat: DataFrame | None = None, bidirect: bool = True, prior: float = 0.01, func: str = 'logSumExp', size: int | None = None) -> np.ndarray

Extract PWM scores for genomic intervals.

Extracts sequences from the genome for the given intervals, then scores each with `compute_pwm`. Mirrors the R `gextract_pwm()` function.

PARAMETER	DESCRIPTION
`intervals`	Genomic intervals with `chrom`, `start`, `end` columns. TYPE: `DataFrame`
`pssm`	PSSM DataFrame with columns A, C, G, T. TYPE: `DataFrame`
`spat`	Spatial model DataFrame (`bin`, `spat_factor` columns). TYPE: `DataFrame | None` DEFAULT: `None`
`bidirect`	Score both orientations. TYPE: `bool` DEFAULT: `True`
`prior`	Uniform prior added to PSSM probabilities. TYPE: `float` DEFAULT: `0.01`
`func`	Aggregation function: `"logSumExp"` or `"max"`. TYPE: `str` DEFAULT: `'logSumExp'`
`size`	Normalize intervals to this size before extraction. TYPE: `int | None` DEFAULT: `None`

RETURNS	DESCRIPTION
`ndarray`	1-D array of PWM scores, one per interval.
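The roles of `prior` and `func` can be illustrated with a simplified, forward-strand-only scorer. This is a sketch under stated assumptions, not the implementation of `compute_pwm`: the helper names are hypothetical, the PSSM is a list of dicts rather than a DataFrame, the renormalization after adding the prior is an assumption, and `spat` and `bidirect` are left out.

```python
import math


def add_prior(row: dict, prior: float) -> dict:
    """Add a uniform prior to one PSSM row and renormalize (assumed behavior)."""
    total = sum(p + prior for p in row.values())
    return {base: (p + prior) / total for base, p in row.items()}


def window_log_scores(seq: str, pssm: list, prior: float = 0.01) -> list:
    """Log-likelihood of the motif at every start position where it fits."""
    smoothed = [add_prior(row, prior) for row in pssm]
    width = len(smoothed)
    return [
        sum(math.log(smoothed[j][seq[i + j]]) for j in range(width))
        for i in range(len(seq) - width + 1)
    ]


def aggregate(scores: list, func: str = "logSumExp") -> float:
    """Collapse per-position scores into one score per sequence."""
    if func == "max":
        return max(scores)
    if func == "logSumExp":
        m = max(scores)  # subtract the maximum for numerical stability
        return m + math.log(sum(math.exp(s - m) for s in scores))
    raise ValueError(f"unknown func: {func!r}")


pssm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
scores = window_log_scores("TTACGG", pssm)
best = aggregate(scores, "max")
total = aggregate(scores, "logSumExp")
```

`"max"` reports only the single best match, while `"logSumExp"` also accumulates weaker matches, so it is never smaller than the maximum.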

gextract_pwm_quantile

gextract_pwm_quantile(intervals: DataFrame, pssm: DataFrame, quantiles: ndarray | list[float], *, spat: DataFrame | None = None, bidirect: bool = True, prior: float = 0.01, func: str = 'logSumExp', size: int | None = None, bg_intervals: DataFrame | None = None, n_sequences: int = 10000) -> np.ndarray

Extract quantiles of PWM scores for genomic intervals.

Computes PWM scores for the input intervals and maps them to quantiles estimated from background intervals. Mirrors the R gextract_pwm.quantile() function.

PARAMETER	DESCRIPTION
`intervals`	Genomic intervals. TYPE: `DataFrame`
`pssm`	PSSM DataFrame (A, C, G, T columns). TYPE: `DataFrame`
`quantiles`	Quantile breakpoints for the background CDF (e.g. `np.arange(0, 1.01, 0.01)`). TYPE: `array-like`
`spat`	Spatial model. TYPE: `DataFrame | None` DEFAULT: `None`
`bidirect`	Score both orientations. TYPE: `bool` DEFAULT: `True`
`prior`	Uniform prior. TYPE: `float` DEFAULT: `0.01`
`func`	Aggregation function. TYPE: `str` DEFAULT: `'logSumExp'`
`size`	Normalize intervals to this size. TYPE: `int | None` DEFAULT: `None`
`bg_intervals`	Background intervals for quantile estimation. If `None`, `pymisha.gintervals_random` is used to sample `n_sequences` random intervals of the same size as the input intervals. TYPE: `DataFrame | None` DEFAULT: `None`
`n_sequences`	Number of background sequences to sample when `bg_intervals` is `None`. TYPE: `int` DEFAULT: `10000`

RETURNS	DESCRIPTION
`ndarray`	1-D array of quantile values (between 0 and 1), one per interval.
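Mapping a score to a background quantile can be sketched as a lookup against breakpoints computed from background scores. The example below is illustrative only: the helper names are hypothetical, it uses a nearest-rank breakpoint and a step lookup, and the library may instead interpolate between breakpoints.

```python
from bisect import bisect_right


def quantile_breakpoints(bg_scores: list, quantiles: list) -> list:
    """Background score at each requested quantile (nearest-rank style)."""
    ordered = sorted(bg_scores)
    last = len(ordered) - 1
    return [ordered[round(q * last)] for q in quantiles]


def score_to_quantile(score: float, breakpoints: list, quantiles: list) -> float:
    """Largest quantile whose breakpoint does not exceed `score`."""
    idx = bisect_right(breakpoints, score) - 1
    return quantiles[max(idx, 0)]  # scores below every breakpoint map to the lowest quantile


quantiles = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
bg_scores = list(range(101))  # toy background scores 0..100
breaks = quantile_breakpoints(bg_scores, quantiles)
mapped = [score_to_quantile(s, breaks, quantiles) for s in (-5, 35, 1000)]
```

A finer `quantiles` grid gives a finer resolution in the returned values, at the cost of needing more background sequences for stable estimates.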

gextract_local_pwm

gextract_local_pwm(intervals: DataFrame, pssm: DataFrame, *, spat: DataFrame | None = None, bidirect: bool = True, prior: float = 0.01, size: int | None = None) -> np.ndarray

Extract per-position PWM scores for genomic intervals.

Extracts sequences from the genome and computes per-position PWM scores using `compute_local_pwm`. Mirrors the R `gextract.local_pwm()` function.

PARAMETER	DESCRIPTION
`intervals`	Genomic intervals. TYPE: `DataFrame`
`pssm`	PSSM DataFrame (A, C, G, T columns). TYPE: `DataFrame`
`spat`	Spatial model. TYPE: `DataFrame | None` DEFAULT: `None`
`bidirect`	Score both orientations. TYPE: `bool` DEFAULT: `True`
`prior`	Uniform prior. TYPE: `float` DEFAULT: `0.01`
`size`	Normalize intervals to this size. TYPE: `int | None` DEFAULT: `None`

RETURNS	DESCRIPTION
`ndarray`	2-D array of shape `(n_intervals, seq_length)` with per-position scores. Positions where the full PSSM does not fit within the sequence contain NaN.
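The shape of the result and the NaN padding can be pictured with a small sketch. Nested lists stand in for the 2-D array, the helper name is hypothetical, and placing the NaN values at the trailing positions (where the motif would run past the end of the sequence) is an assumption about the layout rather than documented behavior.

```python
import math


def local_scores(seq: str, pssm: list) -> list:
    """One score per sequence position; NaN where the motif would overhang."""
    width = len(pssm)
    row = []
    for i in range(len(seq)):
        if i + width > len(seq):
            row.append(float("nan"))  # motif does not fit at this start position
        else:
            row.append(sum(math.log(pssm[j][seq[i + j]]) for j in range(width)))
    return row


pssm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
# Equal-length sequences give a rectangular (n_intervals, seq_length) layout.
matrix = [local_scores(seq, pssm) for seq in ("TTACGG", "ACTTTT")]
```

Because every row has the same length, passing `size` so that all intervals share one width is what makes a rectangular array possible.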

gintervals_center_by_pssm

gintervals_center_by_pssm(intervals: DataFrame, pssm: DataFrame, size: int, *, spat: DataFrame | None = None, bidirect: bool = True, prior: float = 0.01) -> pd.DataFrame

Center intervals by the position with maximum PSSM score.

For each interval, computes per-position PWM scores and finds the position with the highest score. The interval is then re-centered on that position and normalized to `size` bp. Mirrors the R `gintervals.center_by_pssm()` function.

PARAMETER	DESCRIPTION
`intervals`	Genomic intervals. TYPE: `DataFrame`
`pssm`	PSSM DataFrame (A, C, G, T columns). TYPE: `DataFrame`
`size`	Target interval size after re-centering. TYPE: `int`
`spat`	Spatial model. TYPE: `DataFrame | None` DEFAULT: `None`
`bidirect`	Score both orientations. TYPE: `bool` DEFAULT: `True`
`prior`	Uniform prior. TYPE: `float` DEFAULT: `0.01`

RETURNS	DESCRIPTION
`DataFrame`	Re-centered intervals with `chrom`, `start`, `end` (and any extra columns from the input).
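The re-centering step can be sketched as follows, with a dict standing in for one DataFrame row. The helper names are hypothetical. Centering on the start of the best match (rather than the middle of the motif), the tie-breaking rule, and the clamp at position 0 are all assumptions made for illustration.

```python
import math


def best_offset(scores: list) -> int:
    """Index of the highest non-NaN score (the first one wins on ties)."""
    return max(
        range(len(scores)),
        key=lambda i: -math.inf if math.isnan(scores[i]) else scores[i],
    )


def recenter(row: dict, scores: list, size: int) -> dict:
    """Return a copy of `row` re-centered on its best-scoring position."""
    center = row["start"] + best_offset(scores)
    new_start = max(center - size // 2, 0)
    # Keys other than start/end (extra columns) are carried over unchanged.
    return {**row, "start": new_start, "end": new_start + size}


row = {"chrom": "chr1", "start": 1000, "end": 1004, "name": "peak_1"}
scores = [-5.0, -1.0, -3.0, float("nan")]  # per-position scores for this interval
centered = recenter(row, scores, size=20)
```

Here the best score sits at offset 1, so the new 20 bp window is centered on genomic position 1001 and the extra `name` column is preserved.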