Genomic Integration
Functions for working with genomic intervals. All functions in this module require the pymisha package.
Optional dependency
Install pymisha with pip install pymisha. Functions will raise
ImportError with a clear message if pymisha is not available.
pyprego.genomic
Genomic integration layer (optional pymisha dependency).
Mirrors the misha.R module from R prego. Provides functions that operate on genomic intervals and tracks, extracting sequences and computing PWM scores over genomic regions.
All functions in this module require the pymisha package to be installed.
Import will succeed regardless, but functions will raise ImportError
with a clear message at call time if pymisha is not available.
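The call-time failure described above can be sketched as a lazy import. This is a minimal illustration of the pattern, not pyprego's actual code: `require_module` and `needs_optional_dependency` are hypothetical names introduced here.

```python
import importlib


def require_module(name: str, install_hint: str):
    """Import `name` lazily; raise a descriptive ImportError if it is missing.

    Hypothetical helper, shown only to illustrate the optional-dependency pattern.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"{name} is required for this function. Install it with: {install_hint}"
        ) from exc


def needs_optional_dependency():
    # The import is attempted only when the function is called,
    # so importing the enclosing module never fails.
    return require_module("pymisha", "pip install pymisha")
```

Because the import happens inside the function body, a user without pymisha can still import the module and use everything that does not touch the genome.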
intervals_to_seq
Extract DNA sequences for genomic intervals.
Mirrors the R intervals_to_seq() function. Optionally normalizes
interval sizes around their centers, then retrieves the sequences with
pymisha.gseq_extract.
| PARAMETER | DESCRIPTION |
|---|---|
| intervals | Genomic intervals with columns |
| size | If provided, normalize intervals to this size (bp) around their center before extraction. |

| RETURNS | DESCRIPTION |
|---|---|
| list[str] | List of uppercase DNA sequences, one per interval. |
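To make the size normalization concrete, here is a small pandas sketch. It assumes the intervals carry `chrom`, `start` and `end` columns (an assumption, since the column list is cut off above), and `normalize_intervals` is a hypothetical helper rather than part of pyprego or pymisha. The rounding convention of the real function may differ.

```python
import pandas as pd


def normalize_intervals(intervals: pd.DataFrame, size: int) -> pd.DataFrame:
    """Resize each interval to `size` bp around its center (hypothetical helper).

    Assumes columns named `start` and `end`; no clipping to chromosome
    boundaries is performed in this sketch.
    """
    out = intervals.copy()
    center = (out["start"] + out["end"]) // 2
    out["start"] = center - size // 2
    out["end"] = out["start"] + size
    return out


intervals = pd.DataFrame(
    {"chrom": ["chr1", "chr2"], "start": [100, 1000], "end": [200, 1500]}
)
resized = normalize_intervals(intervals, size=50)
```

After this step every interval has the same width, which is what allows the extracted sequences to be scored and compared position by position.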
gextract_pwm
gextract_pwm(intervals: DataFrame, pssm: DataFrame, *, spat: DataFrame | None = None, bidirect: bool = True, prior: float = 0.01, func: str = 'logSumExp', size: int | None = None) -> np.ndarray
Extract PWM scores for genomic intervals.
Extracts sequences from the genome for the given intervals, then scores
each with compute_pwm. Mirrors the R gextract_pwm() function.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| intervals | DataFrame | Genomic intervals with |
| pssm | DataFrame | PSSM DataFrame with columns A, C, G, T. |
| spat | DataFrame \| None | Spatial model DataFrame ( |
| bidirect | bool | Score both orientations. |
| prior | float | Uniform prior added to PSSM probabilities. |
| func | str | Aggregation function: |
| size | int \| None | Normalize intervals to this size before extraction. |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray | 1-D array of PWM scores, one per interval. |
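The scoring step can be illustrated with plain NumPy. The sketch below is not pyprego's compute_pwm. It assumes the prior is added to every PSSM entry and each row is then renormalized, that sequences contain only A, C, G and T, and that the two orientations are pooled before a log-sum-exp over all window positions. `revcomp`, `window_log_probs` and `score_sequence` are hypothetical names.

```python
import numpy as np
import pandas as pd

BASES = "ACGT"


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def window_log_probs(seq: str, pssm: pd.DataFrame, prior: float) -> np.ndarray:
    """Log-likelihood of the motif at every start position of `seq`."""
    probs = pssm[list(BASES)].to_numpy(dtype=float) + prior
    probs /= probs.sum(axis=1, keepdims=True)  # assumption: renormalize after the prior
    log_probs = np.log(probs)
    idx = np.array([BASES.index(b) for b in seq])  # raises on non-ACGT characters
    k = len(pssm)
    n_windows = len(seq) - k + 1
    return np.array(
        [log_probs[np.arange(k), idx[i : i + k]].sum() for i in range(n_windows)]
    )


def score_sequence(seq: str, pssm: pd.DataFrame, prior: float = 0.01,
                   bidirect: bool = True) -> float:
    """Aggregate per-window log-likelihoods with log-sum-exp."""
    scores = window_log_probs(seq, pssm, prior)
    if bidirect:
        # Pool forward-strand and reverse-complement windows.
        scores = np.concatenate([scores, window_log_probs(revcomp(seq), pssm, prior)])
    return float(np.logaddexp.reduce(scores))


pssm = pd.DataFrame([[1, 0, 0, 0], [0, 1, 0, 0]], columns=list(BASES))  # toy motif "AC"
hit = score_sequence("TTACTT", pssm)
miss = score_sequence("TTTTTT", pssm)
```

Log-sum-exp behaves like a soft maximum: a single strong match dominates the score, while several weak matches still contribute.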
gextract_pwm_quantile
gextract_pwm_quantile(intervals: DataFrame, pssm: DataFrame, quantiles: ndarray | list[float], *, spat: DataFrame | None = None, bidirect: bool = True, prior: float = 0.01, func: str = 'logSumExp', size: int | None = None, bg_intervals: DataFrame | None = None, n_sequences: int = 10000) -> np.ndarray
Extract quantiles of PWM scores for genomic intervals.
Computes PWM scores for the input intervals and maps them to quantiles
estimated from background intervals. Mirrors the R
gextract_pwm.quantile() function.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| intervals | DataFrame | Genomic intervals. |
| pssm | DataFrame | PSSM DataFrame (A, C, G, T columns). |
| quantiles | ndarray \| list[float] | Quantile breakpoints for the background CDF (e.g. |
| spat | DataFrame \| None | Spatial model. |
| bidirect | bool | Score both orientations. |
| prior | float | Uniform prior. |
| func | str | Aggregation function. |
| size | int \| None | Normalize intervals to this size. |
| bg_intervals | DataFrame \| None | Background intervals for quantile estimation. If |
| n_sequences | int | Number of background sequences to sample when bg_intervals is |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray | 1-D array of quantile values (0 to 1) per interval. |
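One way to map scores onto background quantiles is to evaluate the background scores at the requested breakpoints and interpolate between them. The sketch below shows that idea with NumPy only. It is an assumption about the general approach, not a description of how pyprego implements it, and `scores_to_quantiles` is a hypothetical name.

```python
import numpy as np


def scores_to_quantiles(scores, bg_scores, quantiles) -> np.ndarray:
    """Map scores onto the 0-1 quantile scale of a background distribution."""
    quantiles = np.asarray(quantiles, dtype=float)
    # Score values at which the background reaches each requested quantile.
    breaks = np.quantile(np.asarray(bg_scores, dtype=float), quantiles)
    # np.interp clamps scores outside the background range to the
    # first/last quantile; `quantiles` is assumed to be sorted ascending.
    return np.interp(scores, breaks, quantiles)


bg_scores = np.arange(101.0)        # toy background scores 0..100
quantiles = np.linspace(0, 1, 11)   # breakpoints at 0, 0.1, ..., 1
q = scores_to_quantiles([-5.0, 25.0, 50.0, 200.0], bg_scores, quantiles)
```

A denser set of breakpoints gives a finer approximation of the background CDF at the cost of a larger lookup table.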
gextract_local_pwm
gextract_local_pwm(intervals: DataFrame, pssm: DataFrame, *, spat: DataFrame | None = None, bidirect: bool = True, prior: float = 0.01, size: int | None = None) -> np.ndarray
Extract per-position PWM scores for genomic intervals.
Extracts sequences from the genome and computes per-position PWM scores
using compute_local_pwm. Mirrors the R gextract.local_pwm()
function.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| intervals | DataFrame | Genomic intervals. |
| pssm | DataFrame | PSSM DataFrame (A, C, G, T columns). |
| spat | DataFrame \| None | Spatial model. |
| bidirect | bool | Score both orientations. |
| prior | float | Uniform prior. |
| size | int \| None | Normalize intervals to this size. |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray | 2-D array of shape |
gintervals_center_by_pssm
gintervals_center_by_pssm(intervals: DataFrame, pssm: DataFrame, size: int, *, spat: DataFrame | None = None, bidirect: bool = True, prior: float = 0.01) -> pd.DataFrame
Center intervals by the position with maximum PSSM score.
For each interval, computes per-position PWM scores and finds the
position with the highest score. The interval is then re-centered on
that position and normalized to size bp. Mirrors the R
gintervals.center_by_pssm() function.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| intervals | DataFrame | Genomic intervals. |
| pssm | DataFrame | PSSM DataFrame (A, C, G, T columns). |
| size | int | Target interval size after re-centering. |
| spat | DataFrame \| None | Spatial model. |
| bidirect | bool | Score both orientations. |
| prior | float | Uniform prior. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Re-centered intervals with |
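The re-centering logic can be sketched independently of the genome. The example assumes the per-position scores are already available as a 2-D array with one row per interval, where column j corresponds to genomic position `start + j`, and that intervals carry `start` and `end` columns. Both are assumptions made for illustration, and `recenter_on_best_position` is a hypothetical helper. Whether the real function centers on the motif start or on the motif midpoint is not stated here.

```python
import numpy as np
import pandas as pd


def recenter_on_best_position(intervals: pd.DataFrame, local_scores: np.ndarray,
                              size: int) -> pd.DataFrame:
    """Re-center each interval on its highest-scoring position, then resize to `size` bp."""
    out = intervals.copy()
    best_offset = np.argmax(local_scores, axis=1)    # offset of the best score per interval
    center = out["start"].to_numpy() + best_offset   # genomic coordinate of that position
    out["start"] = center - size // 2
    out["end"] = out["start"] + size
    return out


intervals = pd.DataFrame(
    {"chrom": ["chr1", "chr1"], "start": [100, 1000], "end": [110, 1010]}
)
local_scores = np.full((2, 10), -10.0)
local_scores[0, 3] = -1.0   # best position of the first interval
local_scores[1, 7] = -2.0   # best position of the second interval
centered = recenter_on_best_position(intervals, local_scores, size=20)
```

Centering on the best match aligns the putative binding sites across intervals, which is useful before aggregating signal around a motif.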