Data Operations
Functions for extracting, summarizing, and analyzing track data across genomic intervals, including statistical summaries, distributions, correlations, and segmentation.
pymisha.gextract
Return evaluated track expression values for each iterator interval.
For each interval in the iterator, evaluates one or more track expressions
and returns the results as a DataFrame with interval coordinates and
expression values. An intervalID column maps each output row back to
the input interval.
If input intervals overlap, overlapped coordinates appear multiple times.
The order of results may differ from input interval order; use
intervalID to match rows to original intervals.
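Since result order is not guaranteed, a common follow-up step is to regroup rows by intervalID. The sketch below uses plain dictionaries in place of a real gextract result; the row values, the `dense_track` column name, and the helper `group_by_interval_id` are hypothetical and only illustrate the matching step.

```python
from collections import defaultdict

# Hypothetical rows shaped like a gextract result (values are made up).
rows = [
    {"chrom": "1", "start": 100, "end": 200, "dense_track": 0.4, "intervalID": 2},
    {"chrom": "1", "start": 0, "end": 100, "dense_track": 0.1, "intervalID": 1},
    {"chrom": "1", "start": 200, "end": 300, "dense_track": 0.6, "intervalID": 2},
]


def group_by_interval_id(rows, value_col):
    """Collect expression values per intervalID, ordered by position within each group."""
    groups = defaultdict(list)
    ordered = sorted(rows, key=lambda r: (r["intervalID"], r["chrom"], r["start"]))
    for row in ordered:
        groups[row["intervalID"]].append(row[value_col])
    return dict(groups)


values_by_interval = group_by_interval_id(rows, "dense_track")
```

With a real DataFrame result the same regrouping would normally be a group-by on the intervalID column.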
| PARAMETER | DESCRIPTION |
|---|---|
| expr | One or more track expressions to evaluate. |
| intervals | Genomic scope (chrom/start/end DataFrame or intervals set name). If None, uses ALLGENOME. For 2D tracks, pass 2D intervals (with chrom1/start1/end1/chrom2/start2/end2 columns). |
| colnames | Column names for expression values. Must match the number of expressions. If None, uses expression strings. |
| iterator | Track expression iterator. If None, determined from expressions. For multi-expression 2D extraction, pass an explicit iterator. |
| band | Diagonal band for 2D track extraction as |
| vars | Explicit variable bindings for the expression. When provided, these are used instead of auto-capturing the caller's namespace. |
| **kwargs | Additional keyword arguments: |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame or None | DataFrame with columns: chrom, start, end, |
See Also
gsummary : Summarize track expression over intervals. gquantiles : Compute quantiles of track expression over intervals. gdist : Compute distribution of track expression over intervals. glookup : Look up track values at specific positions. gscreen : Find intervals where a logical expression is True.
Examples:
pymisha.gscreen
Find intervals where a logical track expression is True.
Evaluates a logical track expression and returns all intervals where the expression value is True (non-zero). Adjacent True intervals on the same chromosome are merged into a single interval.
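The merging step can be pictured with a small pure-Python sketch. It is not pymisha's implementation: `merge_true_bins` and the tuple layout `(chrom, start, end, flag)` are hypothetical, and the input is assumed to be sorted by chromosome and start.

```python
def merge_true_bins(bins):
    """Merge adjacent (chrom, start, end, flag) bins whose flag is truthy.

    Assumes bins are sorted by chromosome and start coordinate.
    """
    merged = []
    for chrom, start, end, flag in bins:
        if not flag:
            continue
        # Extend the previous interval only if it touches this one on the same chromosome.
        if merged and merged[-1][0] == chrom and merged[-1][2] == start:
            merged[-1] = (chrom, merged[-1][1], end)
        else:
            merged.append((chrom, start, end))
    return merged


bins = [
    ("1", 0, 50, True),
    ("1", 50, 100, True),
    ("1", 100, 150, False),
    ("1", 150, 200, True),
    ("2", 0, 50, True),
]
screened = merge_true_bins(bins)
```

Here the first two bins collapse into one interval, while the False bin at 100-150 keeps the interval at 150-200 separate.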
| PARAMETER | DESCRIPTION |
|---|---|
| expr | Logical track expression. |
| intervals | Genomic scope (chrom/start/end DataFrame or intervals set name). If None, uses ALLGENOME. |
| vars | Explicit variable bindings for the expression. |
| **kwargs | Additional keyword arguments: |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame or None | DataFrame with columns: chrom, start, end. Returns None if no intervals match the expression. |
See Also
gextract : Extract track expression values for each interval. gsegment : Segment genome by track expression values.
Examples:
pymisha.gsummary
Calculate summary statistics of a track expression.
Returns summary statistics: total bins, NaN count, min, max, sum, mean, and standard deviation of the values.
| PARAMETER | DESCRIPTION |
|---|---|
| expr | Track expression. |
| intervals | Genomic scope. If None, uses ALLGENOME. |
| iterator | Track expression iterator. If None, determined from expression. |
| vars | Explicit variable bindings for the expression. |
| band | Diagonal band for 2D tracks as |

| RETURNS | DESCRIPTION |
|---|---|
| Series | Series with index: |
See Also
gintervals_summary, gbins_summary, gquantiles
Examples:
pymisha.gquantiles
Calculate quantiles of a track expression.
Computes quantiles for the given percentiles. If data size exceeds the configured limit, data is randomly sampled to fit.
| PARAMETER | DESCRIPTION |
|---|---|
| expr | Track expression. |
| percentiles | Percentiles in [0, 1] range. |
| intervals | Genomic scope. If None, uses ALLGENOME. |
| iterator | Track expression iterator. If None, determined from expression. |
| vars | Explicit variable bindings for the expression. |
| band | Diagonal band for 2D tracks. |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray | Array of quantile values corresponding to the given percentiles. |
See Also
gintervals_quantiles, gbins_quantiles, gdist
Examples:
pymisha.gdist
gdist(*args, intervals=None, include_lowest=False, iterator=None, band=None, dataframe=False, names=None, vars=None, **kwargs)
Calculate distribution of track expressions over bins.
This function calculates the distribution of values of numeric track expressions over the given set of bins using a memory-efficient C++ streaming implementation.
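To make the shape of the result concrete, here is a minimal pure-Python sketch of a two-expression distribution over right-closed bins. It is an illustration, not the C++ implementation: `bin_index` and `dist2d` are hypothetical helpers, and dropping NaN and out-of-range values from the counts is an assumption.

```python
from bisect import bisect_left


def bin_index(value, breaks, include_lowest=False):
    """0-based index of the bin (breaks[i], breaks[i+1]] holding value, or None."""
    if value != value:  # NaN never falls in a bin
        return None
    if include_lowest and value == breaks[0]:
        return 0
    if value <= breaks[0] or value > breaks[-1]:
        return None  # assumption: out-of-range values are not counted
    return bisect_left(breaks, value) - 1


def dist2d(pairs, breaks1, breaks2, include_lowest=False):
    """Count (v1, v2) pairs into a table of shape (n_bins_1, n_bins_2)."""
    counts = [[0] * (len(breaks2) - 1) for _ in range(len(breaks1) - 1)]
    for v1, v2 in pairs:
        i = bin_index(v1, breaks1, include_lowest)
        j = bin_index(v2, breaks2, include_lowest)
        if i is not None and j is not None:
            counts[i][j] += 1
    return counts


# Made-up value pairs standing in for two track expressions.
pairs = [(0.3, 0.5), (0.5, 1.5), (0.7, 2.0), (0.0, 1.0), (1.2, 0.5)]
counts = dist2d(pairs, [0, 0.5, 1], [0, 1, 2])
```

Three breaks per expression give two bins each, so `counts` is a 2 x 2 table, matching the (n_bins_1, n_bins_2) shape described for the array result.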
| PARAMETER | DESCRIPTION |
|---|---|
| *args | Alternating track expressions and their bin breaks. Example: gdist("track1", [0, 0.5, 1], "track2", [0, 1, 2]) |
| intervals | Genomic scope for which the function is applied. If None, uses all genomic intervals. |
| include_lowest | If True, the lowest value will be included in the first bin. Example: breaks=[0, 0.2, 0.5] creates (0, 0.2], (0.2, 0.5]. With include_lowest=True: [0, 0.2], (0.2, 0.5]. |
| iterator | Track expression iterator. If None, determined implicitly. |
| band | Track expression band (not yet supported). |
| dataframe | If True, return a DataFrame instead of an N-dimensional array. |
| names | Column names for the expressions in the returned DataFrame (only relevant when dataframe=True). |
| vars | Explicit variable bindings for the expression. |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray or DataFrame | If dataframe=False: N-dimensional array where N is the number of expr-breaks pairs. The shape is (n_bins_1, n_bins_2, ..., n_bins_N). If dataframe=True: DataFrame with columns for each track expression (bin labels) and an 'n' column with counts. |
Examples:
Calculate distribution of dense_track for bins (0, 0.2], (0.2, 0.5], (0.5, 1]:
Calculate 2D distribution - dense_track vs sparse_track:
Get result as DataFrame:
See Also
gsummary, gquantiles, gpartition
pymisha.gpartition
Partition track expression values into bins and return corresponding intervals.
Converts track expression values to 1-based bin indices according to 'breaks', then returns the intervals with their corresponding bin index. Adjacent intervals with the same bin value are merged.
The range of bins is determined by 'breaks' argument. For example: breaks=[x1, x2, x3, x4] represents three bins: (x1, x2], (x2, x3], (x3, x4].
If 'include_lowest' is True, the lowest value is included in the first bin: [x1, x2], (x2, x3], (x3, x4].
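The binning and merging rules can be sketched in pure Python as follows. This is an illustration under stated assumptions, not pymisha's code: `partition_bin` and `partition` are hypothetical names, and the input is assumed to be sorted `(chrom, start, end, value)` tuples.

```python
from bisect import bisect_left


def partition_bin(value, breaks, include_lowest=False):
    """1-based index of the right-closed bin holding value, or None if excluded."""
    if value != value:  # NaN values are excluded
        return None
    if include_lowest and value == breaks[0]:
        return 1
    if value <= breaks[0] or value > breaks[-1]:
        return None  # outside the break range
    return bisect_left(breaks, value)


def partition(bins, breaks, include_lowest=False):
    """Assign bins to sorted (chrom, start, end, value) tuples and merge equal neighbours."""
    out = []
    for chrom, start, end, value in bins:
        idx = partition_bin(value, breaks, include_lowest)
        if idx is None:
            continue
        prev = out[-1] if out else None
        if prev and prev[0] == chrom and prev[2] == start and prev[3] == idx:
            out[-1] = (chrom, prev[1], end, idx)  # extend the previous interval
        else:
            out.append((chrom, start, end, idx))
    return out


breaks = [0, 0.05, 0.1, 0.15, 0.2]
track_bins = [
    ("1", 0, 50, 0.02),
    ("1", 50, 100, 0.05),
    ("1", 100, 150, 0.07),
    ("1", 150, 200, float("nan")),
    ("1", 200, 250, 0.0),
    ("1", 250, 300, 0.3),
]
default_result = partition(track_bins, breaks)
lowest_result = partition(track_bins, breaks, include_lowest=True)
```

The value 0.0 at 200-250 is dropped by default and lands in bin 1 only when include_lowest is True; the NaN bin and the value 0.3 are excluded in both cases.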
| PARAMETER | DESCRIPTION |
|---|---|
| expr | Track expression to evaluate. |
| breaks | Break points that determine the bins. Must have at least 2 elements and be strictly increasing. |
| intervals | Genomic scope for which the function is applied. If None, uses all genomic intervals. |
| include_lowest | If True, the lowest value of the range is included in the first bin. |
| iterator | Track expression iterator. If None, determined implicitly. |
| band | Track expression band (not yet supported). |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame or None | DataFrame with columns 'chrom', 'start', 'end', 'bin' where 'bin' is the 1-based bin index. Returns None if no values fall within the breaks. Adjacent intervals with the same bin are merged. |
Examples:
Partition dense_track values into 4 bins:
>>> breaks = [0, 0.05, 0.1, 0.15, 0.2]
>>> pm.gpartition("dense_track", breaks, pm.gintervals("1", 0, 5000))
See Also
gdist, gsummary, gextract
Notes
Values outside the break range are excluded from the result. NaN values are also excluded.
pymisha.gsample
Sample values from a track expression using reservoir sampling.
Randomly samples n values from the track expression over the given intervals. The sampling is performed in a single streaming pass using a reservoir sampler, so it is memory-efficient regardless of the number of genomic positions scanned.
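For readers unfamiliar with reservoir sampling, the sketch below shows the classic single-pass scheme (Algorithm R) in pure Python. It is a generic illustration and may differ from pymisha's internal sampler; `reservoir_sample` and its `seed` argument are hypothetical.

```python
import random


def reservoir_sample(stream, n, seed=None):
    """Uniformly sample up to n non-NaN values from an iterable in one pass."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of non-NaN values observed so far
    for value in stream:
        if value != value:  # skip NaN
            continue
        seen += 1
        if len(reservoir) < n:
            reservoir.append(value)  # fill the reservoir first
        else:
            j = rng.randrange(seen)  # uniform in [0, seen)
            if j < n:
                reservoir[j] = value  # replace with probability n / seen
    return reservoir


sample = reservoir_sample((i / 1000 for i in range(1000)), 10, seed=0)
short_sample = reservoir_sample([1.0, float("nan"), 2.0], 5)
```

Memory use is bounded by n regardless of stream length, and the result is shorter than n when fewer non-NaN values exist, which mirrors the behaviour described for the return value.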
| PARAMETER | DESCRIPTION |
|---|---|
| expr | Track expression. |
| n | Number of samples to draw. |
| intervals | Genomic scope. If |
| iterator | Iterator policy for binning the intervals. |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray | 1-D array of sampled values (float64). Length may be less than n if fewer non-NaN data points exist. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If n < 1. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> samples = pm.gsample("dense_track", 100)
>>> len(samples)
100
>>> samples = pm.gsample("dense_track", 50,
... pm.gintervals(1, 0, 10000))
>>> len(samples)
50
See Also
gextract, gsummary, gquantiles
pymisha.gcor
Compute correlation between pairs of track expressions.
Calculates correlation in a single streaming pass over the data, making it memory-efficient for genome-wide computations. Supports multitasking via chromosome partitioning when enabled.
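One way to compute a Pearson correlation in a single pass is to keep running means and co-moments. The pure-Python sketch below shows that idea; it is not pymisha's implementation, `streaming_pearson` is a hypothetical helper, and skipping pairs that contain NaN is an assumption.

```python
import math


def streaming_pearson(pairs):
    """One-pass Pearson correlation using Welford-style running updates."""
    n = 0
    mean_x = mean_y = 0.0
    m2_x = m2_y = co = 0.0  # running sums of squared deviations and co-deviations
    for x, y in pairs:
        if x != x or y != y:  # assumption: pairs containing NaN are skipped
            continue
        n += 1
        dx = x - mean_x  # deviation from the old mean of x
        mean_x += dx / n
        dy = y - mean_y  # deviation from the old mean of y
        mean_y += dy / n
        m2_x += dx * (x - mean_x)
        m2_y += dy * (y - mean_y)
        co += dx * (y - mean_y)
    if n < 2 or m2_x == 0.0 or m2_y == 0.0:
        return float("nan")
    return co / math.sqrt(m2_x * m2_y)


r_pos = streaming_pearson([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
r_neg = streaming_pearson([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (float("nan"), 5.0)])
```

Only a handful of scalars are held in memory at any time, which is what makes a genome-wide pass feasible. Partial states computed per chromosome could also be combined, although the merge step is not shown here.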
| PARAMETER | DESCRIPTION |
|---|---|
| *exprs | An even number of track expressions. Each consecutive pair (expr1, expr2) defines one correlation to compute. |
| intervals | Genomic scope. If |
| iterator | Iterator policy. |
| method | Correlation method. |
| details | If |
| names | Names for each correlation pair. If |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray or DataFrame | If |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If an odd number of expressions is provided. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gcor("dense_track", "sparse_track",
... intervals=pm.gintervals(1, 0, 10000), iterator=1000)
array([...])
>>> pm.gcor("dense_track", "sparse_track",
... intervals=pm.gintervals(1, 0, 10000), iterator=1000,
... details=True)
cor cov mean1 ...
0 ...
>>> pm.gcor("dense_track", "sparse_track",
... intervals=pm.gintervals(1, 0, 10000), iterator=1000,
... method="spearman")
array([...])
>>> pm.gcor("dense_track", "sparse_track",
... intervals=pm.gintervals(1, 0, 10000), iterator=1000,
... method="spearman.exact", details=True)
n n.na cor
0 ...
See Also
gsummary, gextract
pymisha.gbins_summary
gbins_summary(*args, expr=None, intervals=None, include_lowest=False, iterator=None, band=None, **kwargs)
Compute summary statistics per bin.
| PARAMETER | DESCRIPTION |
|---|---|
| *args | Alternating track expressions and bin break vectors. |
| expr | Track expression to summarize. If None the first bin expression is used. |
| intervals | Genomic scope. Defaults to all intervals. |
| include_lowest | Include the left edge of the first bin. |
| iterator | Track expression iterator. |
| band | Diagonal band |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray | Shape |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gbins_summary("dense_track", [0, 0.2, 0.4, 2], expr="sparse_track",
... intervals=pm.gintervals(1), iterator="dense_track")
See Also
gsummary, gintervals_summary, gdist
pymisha.gbins_quantiles
gbins_quantiles(*args, expr=None, percentiles=0.5, intervals=None, include_lowest=False, iterator=None, band=None, **kwargs)
Compute quantiles per bin.
| PARAMETER | DESCRIPTION |
|---|---|
| *args | Alternating track expressions and bin break vectors. |
| expr | Track expression to compute quantiles for. If None the first bin expression is used. |
| percentiles | Percentile(s) in [0, 1]. |
| intervals | Genomic scope. Defaults to all intervals. |
| include_lowest | Include the left edge of the first bin. |
| iterator | Track expression iterator. |
| band | Diagonal band |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray | Shape |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gbins_quantiles("dense_track", [0, 0.2, 0.4, 2],
... expr="sparse_track", percentiles=[0.2, 0.5],
... intervals=pm.gintervals(1), iterator="dense_track")
See Also
gquantiles, gintervals_quantiles, gdist
pymisha.gintervals_summary
Compute summary statistics for each interval.
| PARAMETER | DESCRIPTION |
|---|---|
| intervals_set_out | When provided, saves the resulting intervals set via |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame or None | DataFrame with the original interval columns (chrom, start, end) plus summary statistic columns: Total intervals, NaN intervals, Min, Max, Sum, Mean, Std dev. Returns |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervs = pm.gintervals([1, 2], 0, 5000)
>>> pm.gintervals_summary("dense_track", intervs)
See Also
gsummary, gbins_summary
pymisha.gintervals_quantiles
Compute quantiles for each interval.
| PARAMETER | DESCRIPTION |
|---|---|
| intervals_set_out | When provided, saves the resulting intervals set via |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame or None | DataFrame with the original interval columns (chrom, start, end) plus one column per requested percentile, named by the percentile value (e.g., |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervs = pm.gintervals([1, 2], 0, 5000)
>>> pm.gintervals_quantiles("dense_track", percentiles=[0.5, 0.3, 0.9],
... intervals=intervs)
See Also
gquantiles, gbins_quantiles
pymisha.gcis_decay
gcis_decay(expr, breaks, src, domain, intervals=None, include_lowest=False, iterator=None, band=None)
Calculate distribution of cis contact distances.
For contacts where chrom1 equals chrom2 and the first interval
(I1) is fully within src intervals, this function bins the distance
between I1 and I2 separately for intra-domain and inter-domain contacts.
A contact is intra-domain when both I1 and I2 are fully contained within the same domain interval. Otherwise it is inter-domain.
The distance is abs((start1 + end1 - start2 - end2) / 2) (integer
division), i.e. the absolute difference of the interval midpoints.
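The distance and the intra/inter-domain rule can be written out as a short pure-Python sketch. The helper names are hypothetical, intervals are `(start, end)` tuples on a single chromosome, and the integer division is assumed to truncate toward zero, which `abs(x) // 2` reproduces for any sign of `x`.

```python
def contact_distance(start1, end1, start2, end2):
    """Absolute midpoint difference between I1 and I2 using integer division.

    Assumption: division truncates toward zero, so abs(x) // 2 is used rather
    than abs(x // 2), which would floor negative odd values differently.
    """
    return abs(start1 + end1 - start2 - end2) // 2


def is_intra_domain(i1, i2, domains):
    """True when both (start, end) intervals lie fully inside the same domain."""
    return any(
        d_start <= i1[0] and i1[1] <= d_end and d_start <= i2[0] and i2[1] <= d_end
        for d_start, d_end in domains
    )


dist = contact_distance(100, 200, 1000, 1100)
same_domain = is_intra_domain((100, 200), (1000, 1100), [(0, 5000)])
split_domain = is_intra_domain((100, 200), (1000, 1100), [(0, 500), (500, 5000)])
```

In the example the midpoints are 150 and 1050, so the distance is 900; the contact is intra-domain only when a single domain interval covers both rectangles' sides.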
| PARAMETER | DESCRIPTION |
|---|---|
| expr | A 2D track expression (must be a simple 2D track name). |
| breaks | Sorted break points defining distance bins. Example: |
| src | Source intervals (chrom, start, end). Only contacts whose I1 is fully within the unified source intervals are counted. Overlapping source intervals are allowed and will be merged. |
| domain | Domain intervals (chrom, start, end). Must be non-overlapping. Used to classify contacts as intra- or inter-domain. |
| intervals | Genomic scope (1D intervals). Defaults to all genome intervals. Only cis contacts (chrom1 == chrom2) within these chromosomes are considered. |
| include_lowest | If True, the lowest break value is included in the first bin: |
| iterator | 2D iterator specification. Currently unused (extraction uses the track's native resolution). |
| band | Diagonal band filter |

| RETURNS | DESCRIPTION |
|---|---|
| ndarray | 2D array of shape |
See Also
gdist : General distribution of track expressions. gextract : Extract track values over intervals.
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> import pandas as pd
>>> src = pd.DataFrame({"chrom": ["1", "1"], "start": [0, 200000], "end": [100000, 400000]})
>>> domain = pd.DataFrame({"chrom": ["1"], "start": [0], "end": [500000]})
>>> breaks = [0, 100000, 200000, 300000, 400000, 500000]
>>> result = pm.gcis_decay("rects_track", breaks, src, domain)
>>> result.shape[1]
2
pymisha.gsegment
gsegment(expr, minsegment, maxpval=0.05, onetailed=True, intervals=None, iterator=None, intervals_set_out=None)
Divide track expression into segments using Wilcoxon test.
Divides the values of a track expression into segments, where each
segment size is at least minsegment and the P-value of comparing
the segment with the first minsegment values from the next segment
is at most maxpval. Comparison is done using the Wilcoxon
(Mann-Whitney) test.
| PARAMETER | DESCRIPTION |
|---|---|
| expr | Track expression. |
| minsegment | Minimal segment size in base pairs. |
| maxpval | Maximal P-value that separates two adjacent segments. Default 0.05. |
| onetailed | If True, Wilcoxon test is one-tailed. Default True. |
| intervals | Genomic scope. Defaults to all genome intervals. |
| iterator | Fixed bin iterator size. If None, determined from track expression. |
| intervals_set_out | If provided, save result as an intervals set and return None. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame or None | Intervals where each row represents a segment (chrom, start, end). Returns None if intervals_set_out is provided, or if input is empty. |
See Also
gwilcox : Sliding-window Wilcoxon test.
Examples:
pymisha.gwilcox
gwilcox(expr, winsize1, winsize2, maxpval=0.05, onetailed=True, what2find=1, intervals=None, iterator=None, intervals_set_out=None)
Sliding-window Wilcoxon test over track expression values.
Runs a Wilcoxon test (Mann-Whitney) over the values of a track expression
in two sliding windows with an identical center. Returns intervals where
the smaller window tested against the larger window gives a P-value below
maxpval.
| PARAMETER | DESCRIPTION |
|---|---|
| expr | Track expression. |
| winsize1 | Size of the first sliding window in base pairs. |
| winsize2 | Size of the second sliding window in base pairs. |
| maxpval | Maximal P-value threshold. Default 0.05. |
| onetailed | If True, Wilcoxon test is one-tailed. Default True. |
| what2find | -1 for lows, 1 for peaks, 0 for both. Default 1. |
| intervals | Genomic scope. Defaults to all genome intervals. |
| iterator | Fixed bin iterator size. If None, determined from track expression. |
| intervals_set_out | If provided, save result as an intervals set and return None. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame or None | Intervals with |
See Also
gsegment : Divide track expression into segments using Wilcoxon test.
Examples:
pymisha.glookup
glookup(lookup_table, *args, intervals=None, include_lowest=False, force_binning=True, iterator=None, band=None, **kwargs)
Look up values in an N-dimensional lookup table indexed by track expressions.
For each iterator interval, evaluates one or more track expressions and uses the resulting values to index into a lookup table. Returns the table value for each interval.
Uses a memory-efficient C++ streaming implementation when expressions do not contain virtual tracks. Falls back to Python (memory-resident) when virtual tracks, a band filter, or 2D intervals are present.
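The one-dimensional case of the lookup can be sketched in pure Python. This is an illustration, not pymisha's code path: `lookup_1d` is a hypothetical helper, and clamping a value equal to the lowest break into the first bin when force_binning is True (and include_lowest is False) is an assumption.

```python
from bisect import bisect_left

NAN = float("nan")


def lookup_1d(value, breaks, table, include_lowest=False, force_binning=True):
    """Map value to table[i] for the right-closed bin (breaks[i], breaks[i+1]]."""
    if len(table) != len(breaks) - 1:
        raise ValueError("table length must equal the number of bins")
    if value != value:  # NaN in, NaN out
        return NAN
    if include_lowest and value == breaks[0]:
        return table[0]
    if value <= breaks[0]:
        # Below the range: clamp to the first bin or report NaN.
        return table[0] if force_binning else NAN
    if value > breaks[-1]:
        # Above the range: clamp to the last bin or report NaN.
        return table[-1] if force_binning else NAN
    return table[bisect_left(breaks, value) - 1]


breaks = [0, 0.2, 0.5]
table = [10, 20]
inside = [lookup_1d(v, breaks, table) for v in (0.1, 0.2, 0.3)]
clamped = lookup_1d(0.9, breaks, table)
unclamped = lookup_1d(0.9, breaks, table, force_binning=False)
```

For N expressions the same per-dimension index would be computed for each expression and used together to index the N-dimensional table.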
| PARAMETER | DESCRIPTION |
|---|---|
| lookup_table | N-dimensional lookup table. The shape must match the number of bins in each dimension. For 1D lookup, shape is |
| *args | Alternating track expressions and break arrays defining bins. Same format as |
| intervals | Genomic scope for which the function is applied. Required. |
| include_lowest | If True, the lowest break value is included in the first bin. Example: |
| force_binning | If True, clamp out-of-range values to the nearest bin instead of NaN. If False, out-of-range values produce NaN. |
| iterator | Track expression iterator. If None, determined implicitly. |
| band | Diagonal band for 2D tracks. Triggers Python fallback path. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame or None | Intervals with columns: |
See Also
gdist : Compute distribution over binned track expressions. gtrack_lookup : Create a track from an N-dimensional lookup table.
Examples:
One-dimensional lookup:
pymisha.gtrack_lookup
gtrack_lookup(track, description, lookup_table, *args, iterator=None, include_lowest=False, force_binning=True, band=None)
Create a track from an N-dimensional lookup table.
Evaluates track expressions genome-wide, looks up values in the table, and creates a new track from the results. Dense or sparse output is determined by the iterator type.
This is the track-creation counterpart of glookup, which returns
values in-memory without creating a track.
| PARAMETER | DESCRIPTION |
|---|---|
| track | Name for the new track. |
| description | Track description. |
| lookup_table | N-dimensional lookup table. Shape must match the number of bins defined by each |
| *args | Alternating track expressions and break arrays defining bins. |
| iterator | Track expression iterator. Integer values create dense tracks with that bin size. Intervals-based iterators create sparse tracks. |
| include_lowest | If True, the lowest break value is included in the first bin. |
| force_binning | If True, clamp out-of-range values to the nearest bin. If False, out-of-range values produce NaN in the track. |
| band | Diagonal band for 2D tracks. Passed through to :func: |

| RETURNS | DESCRIPTION |
|---|---|
| None | |
See Also
glookup : Look up values without creating a track. gtrack_create : Create a track from an expression. gdist : Compute distribution over binned track expressions.
Examples:
Create a dense track from 1D lookup: