Data Operations¶

Functions for extracting, summarizing, and analyzing track data across genomic intervals, including statistical summaries, distributions, correlations, and segmentation.

pymisha.gextract ¶

gextract(expr, intervals=None, iterator=None, colnames=None, band=None, vars=None, **kwargs)

Return evaluated track expression values for each iterator interval.

For each interval in the iterator, evaluates one or more track expressions and returns the results as a DataFrame with interval coordinates and expression values. An intervalID column maps each output row back to the input interval.

If input intervals overlap, the overlapping coordinates appear once for each input interval that covers them. The order of results may differ from the input interval order; use intervalID to match rows to their original intervals.

PARAMETER	DESCRIPTION
`expr`	One or more track expressions to evaluate. TYPE: `str or list of str`
`intervals`	Genomic scope (chrom/start/end DataFrame or intervals set name). If None, uses ALLGENOME. For 2D tracks, pass 2D intervals (with chrom1/start1/end1/chrom2/start2/end2 columns). TYPE: `DataFrame or str` DEFAULT: `None`
`colnames`	Column names for expression values. Must match the number of expressions. If None, uses expression strings. TYPE: `list of str` DEFAULT: `None`
`iterator`	Track expression iterator. If None, determined from expressions. For multi-expression 2D extraction, pass an explicit iterator. TYPE: `int or str` DEFAULT: `None`
`band`	Diagonal band for 2D track extraction as `(d1, d2)`. Only applicable with 2D intervals. TYPE: `tuple of (int, int)` DEFAULT: `None`
`vars`	Explicit variable bindings for the expression. When provided, these are used instead of auto-capturing the caller's namespace. TYPE: `dict` DEFAULT: `None`
`**kwargs`	Additional keyword arguments. `file` (str, optional): path of a tab-separated file to write the extraction results to; when provided, `None` is returned instead of a DataFrame. `intervals_set_out` (str, optional): name of an interval set to save the result coordinate columns to, created via `gintervals_save`. `progress` (bool or str, optional): whether to show a progress bar. `progress_desc` (str, optional): description for the progress bar (default `'gextract'`). DEFAULT: `{}`

RETURNS	DESCRIPTION
`DataFrame or None`	DataFrame with columns: chrom, start, end, one column per expression (named by `colnames`, or by the expression strings if `colnames` is None), and intervalID. Returns None if the iterator produces no intervals, or if file is specified.

See Also

gsummary : Summarize track expression over intervals.
gquantiles : Compute quantiles of track expression over intervals.
gdist : Compute distribution of track expression over intervals.
glookup : Look up track values at specific positions.
gscreen : Find intervals where a logical expression is True.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> result = pm.gextract("dense_track", intervals=pm.gintervals("1", 0, 1000),
...                      iterator=200, progress=False)
>>> result.columns.tolist()
['chrom', 'start', 'end', 'dense_track', 'intervalID']
>>> len(result)
5
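
The row count above follows from how a fixed-bin iterator tiles the scope. The sketch below is plain Python, not pymisha code: `tile_intervals` is a hypothetical helper, and it assumes bins are aligned to multiples of the bin size and clipped to each interval. It also illustrates why overlapping input intervals yield repeated coordinates and how a per-row interval index lets rows be matched to their source interval. The index is 0-based here purely for illustration; this section does not state how pymisha numbers intervalID.

```python
def tile_intervals(intervals, binsize):
    """Return (chrom, start, end, interval_idx) rows for fixed-size bins.

    Assumption: bins are aligned to multiples of `binsize` and clipped to
    the interval boundaries. `interval_idx` is a hypothetical 0-based
    position of the source interval in the input list.
    """
    rows = []
    for idx, (chrom, start, end) in enumerate(intervals):
        pos = start
        while pos < end:
            # End of the aligned bin that contains `pos`, clipped to `end`.
            bin_end = min((pos // binsize + 1) * binsize, end)
            rows.append((chrom, pos, bin_end, idx))
            pos = bin_end
    return rows


rows = tile_intervals([("1", 0, 1000)], 200)
print(len(rows))  # 5

# Two overlapping inputs: the bin (200, 400) is emitted once per input.
overlapping = tile_intervals([("1", 0, 400), ("1", 200, 600)], 200)
for row in overlapping:
    print(row)
```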

pymisha.gscreen ¶

gscreen(expr, intervals=None, vars=None, **kwargs)

Find intervals where a logical track expression is True.

Evaluates a logical track expression and returns all intervals where the expression value is True (non-zero). Adjacent True intervals on the same chromosome are merged into a single interval.
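
The merging step can be pictured with a small plain-Python sketch. It is an illustration only, not the library's implementation: `merge_true_bins` is a hypothetical helper, and it assumes the bins arrive sorted by chromosome and start and that "adjacent" means one bin ends exactly where the next begins.

```python
def merge_true_bins(bins):
    """Merge adjacent bins whose flag is true.

    `bins` is a list of (chrom, start, end, flag) tuples sorted by
    chrom and start. Two bins are adjacent when they share a chromosome
    and the first ends exactly where the second starts.
    """
    merged = []
    for chrom, start, end, flag in bins:
        if not flag:
            continue
        if merged and merged[-1][0] == chrom and merged[-1][2] == start:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (chrom, merged[-1][1], end)
        else:
            merged.append((chrom, start, end))
    return merged


bins = [
    ("1", 0, 100, True),
    ("1", 100, 200, True),
    ("1", 200, 300, False),
    ("1", 300, 400, True),
    ("2", 0, 100, True),
]
print(merge_true_bins(bins))
# [('1', 0, 200), ('1', 300, 400), ('2', 0, 100)]
```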

PARAMETER	DESCRIPTION
`expr`	Logical track expression. TYPE: `str`
`intervals`	Genomic scope (chrom/start/end DataFrame or intervals set name). If None, uses ALLGENOME. TYPE: `DataFrame or str` DEFAULT: `None`
`vars`	Explicit variable bindings for the expression. TYPE: `dict` DEFAULT: `None`
`**kwargs`	Additional keyword arguments. `iterator` (int or str, optional): track expression iterator; if None, determined from the expression. `progress` (bool or str, optional): whether to show a progress bar. `progress_desc` (str, optional): description for the progress bar (default `'gscreen'`). DEFAULT: `{}`

RETURNS	DESCRIPTION
`DataFrame or None`	DataFrame with columns: chrom, start, end. Returns None if no intervals match the expression.

See Also

gextract : Extract track expression values for each interval.
gsegment : Segment genome by track expression values.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> result = pm.gscreen("dense_track > 0.2", intervals=pm.gintervals("1", 0, 10000),
...                     progress=False)
>>> "chrom" in result.columns
True

pymisha.gsummary ¶

gsummary(expr, intervals=None, iterator=None, vars=None, **kwargs)

Calculate summary statistics of a track expression.

Returns summary statistics: total bins, NaN count, min, max, sum, mean, and standard deviation of the values.

PARAMETER	DESCRIPTION
`expr`	Track expression. TYPE: `str`
`intervals`	Genomic scope. If None, uses ALLGENOME. TYPE: `DataFrame or str` DEFAULT: `None`
`iterator`	Track expression iterator. If None, determined from expression. TYPE: `int or str` DEFAULT: `None`
`vars`	Explicit variable bindings for the expression. TYPE: `dict` DEFAULT: `None`
`band`	Diagonal band for 2D tracks as `(d1, d2)`. TYPE: `tuple of (int, int)`

RETURNS	DESCRIPTION
`Series`	Series with index: `["Total intervals", "NaN intervals", "Min", "Max", "Sum", "Mean", "Std dev"]`.

See Also

gintervals_summary, gbins_summary, gquantiles

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gsummary("dense_track")

pymisha.gquantiles ¶

gquantiles(expr, percentiles=0.5, intervals=None, iterator=None, vars=None, **kwargs)

Calculate quantiles of a track expression.

Computes quantiles for the given percentiles. If data size exceeds the configured limit, data is randomly sampled to fit.

PARAMETER	DESCRIPTION
`expr`	Track expression. TYPE: `str`
`percentiles`	Percentiles in [0, 1] range. TYPE: `array-like` DEFAULT: `0.5`
`intervals`	Genomic scope. If None, uses ALLGENOME. TYPE: `DataFrame or str` DEFAULT: `None`
`iterator`	Track expression iterator. If None, determined from expression. TYPE: `int or str` DEFAULT: `None`
`vars`	Explicit variable bindings for the expression. TYPE: `dict` DEFAULT: `None`
`band`	Diagonal band for 2D tracks. TYPE: `tuple of (int, int)`

RETURNS	DESCRIPTION
`ndarray`	Array of quantile values corresponding to the given percentiles.

See Also

gintervals_quantiles, gbins_quantiles, gdist

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gquantiles("dense_track", [0.25, 0.5, 0.75])

pymisha.gdist ¶

gdist(*args, intervals=None, include_lowest=False, iterator=None, band=None, dataframe=False, names=None, vars=None, **kwargs)

Calculate distribution of track expressions over bins.

Counts how many values of one or more numeric track expressions fall into each of the given bins, using a memory-efficient C++ streaming implementation.

PARAMETER	DESCRIPTION
`*args`	Alternating track expressions and their bin breaks. Example: gdist("track1", [0, 0.5, 1], "track2", [0, 1, 2]) TYPE: `pairs of (expr, breaks)` DEFAULT: `()`
`intervals`	Genomic scope for which the function is applied. If None, uses all genomic intervals. TYPE: `DataFrame` DEFAULT: `None`
`include_lowest`	If True, the lowest value will be included in the first bin. Example: breaks=[0, 0.2, 0.5] creates (0, 0.2], (0.2, 0.5]. With include_lowest=True: [0, 0.2], (0.2, 0.5]. TYPE: `bool` DEFAULT: `False`
`iterator`	Track expression iterator. If None, determined implicitly. TYPE: `int or str` DEFAULT: `None`
`band`	Track expression band (not yet supported). TYPE: `optional` DEFAULT: `None`
`dataframe`	If True, return a DataFrame instead of an N-dimensional array. TYPE: `bool` DEFAULT: `False`
`names`	Column names for the expressions in the returned DataFrame (only relevant when dataframe=True). TYPE: `list of str` DEFAULT: `None`
`vars`	Explicit variable bindings for the expression. TYPE: `dict` DEFAULT: `None`

RETURNS	DESCRIPTION
`ndarray or DataFrame`	If dataframe=False: N-dimensional array where N is the number of expr-breaks pairs. The shape is (n_bins_1, n_bins_2, ..., n_bins_N). If dataframe=True: DataFrame with columns for each track expression (bin labels) and an 'n' column with counts.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()

Calculate distribution of dense_track for bins (0, 0.2], (0.2, 0.5], (0.5, 1]:

>>> pm.gdist("dense_track", [0, 0.2, 0.5, 1])

Calculate 2D distribution - dense_track vs sparse_track:

>>> pm.gdist("dense_track", [0, 0.5, 1], "sparse_track", [0, 1, 2],
...          iterator=100)

Get result as DataFrame:

>>> pm.gdist("dense_track", [0, 0.2, 0.5, 1], dataframe=True)

See Also

gsummary, gquantiles, gpartition
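
To make the bin semantics and the output shape concrete, here is a plain-Python sketch of a two-expression count table. It is not the C++ implementation: `bin_index` and `dist2d` are hypothetical helpers, and the sketch assumes that a pair is skipped when either value is NaN or falls outside its breaks.

```python
from bisect import bisect_left
import math


def bin_index(value, breaks, include_lowest=False):
    """Return the 0-based bin of `value`, or None if it falls outside."""
    if math.isnan(value) or value > breaks[-1] or value < breaks[0]:
        return None
    if value == breaks[0]:
        return 0 if include_lowest else None
    # Bins are (breaks[i], breaks[i + 1]]: left-open, right-closed.
    return bisect_left(breaks, value) - 1


def dist2d(pairs, breaks1, breaks2, include_lowest=False):
    """Count (v1, v2) pairs into an (n_bins_1, n_bins_2) table."""
    counts = [[0] * (len(breaks2) - 1) for _ in range(len(breaks1) - 1)]
    for v1, v2 in pairs:
        i = bin_index(v1, breaks1, include_lowest)
        j = bin_index(v2, breaks2, include_lowest)
        if i is not None and j is not None:  # assumption: skip otherwise
            counts[i][j] += 1
    return counts


pairs = [(0.2, 0.5), (0.3, 1.5), (0.0, 0.5), (0.7, float("nan")), (1.0, 2.0)]
print(dist2d(pairs, [0, 0.5, 1], [0, 1, 2]))
# [[1, 1], [0, 1]]
print(dist2d(pairs, [0, 0.5, 1], [0, 1, 2], include_lowest=True))
# [[2, 1], [0, 1]]
```

The pair `(0.0, 0.5)` is counted only with `include_lowest=True`, because 0.0 sits on the lowest break of the first expression.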

pymisha.gpartition ¶

gpartition(expr, breaks, intervals=None, include_lowest=False, iterator=None, **kwargs)

Partition track expression values into bins and return corresponding intervals.

Converts track expression values to 1-based bin indices according to 'breaks', then returns the intervals with their corresponding bin index. Adjacent intervals with the same bin value are merged.

The range of bins is determined by 'breaks' argument. For example: breaks=[x1, x2, x3, x4] represents three bins: (x1, x2], (x2, x3], (x3, x4].

If 'include_lowest' is True, the lowest value is included in the first bin: [x1, x2], (x2, x3], (x3, x4].
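
The binning and merging rules can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than pymisha's code: `partition` is a hypothetical helper operating on already-extracted `(chrom, start, end, value)` rows sorted by position, and it drops NaN and out-of-range values as described in the Notes for this function.

```python
from bisect import bisect_left
import math


def partition(rows, breaks, include_lowest=False):
    """Return merged (chrom, start, end, bin) rows with 1-based bins.

    `rows` is a list of (chrom, start, end, value) tuples sorted by
    chrom and start.
    """
    out = []
    for chrom, start, end, value in rows:
        if math.isnan(value) or value < breaks[0] or value > breaks[-1]:
            continue  # NaN and out-of-range values are dropped
        if value == breaks[0]:
            if not include_lowest:
                continue
            b = 1
        else:
            # (breaks[i - 1], breaks[i]] maps to 1-based bin i.
            b = bisect_left(breaks, value)
        prev = out[-1] if out else None
        if prev and prev[0] == chrom and prev[2] == start and prev[3] == b:
            out[-1] = (chrom, prev[1], end, b)  # extend the previous run
        else:
            out.append((chrom, start, end, b))
    return out


rows = [
    ("1", 0, 100, 0.03),
    ("1", 100, 200, 0.05),
    ("1", 200, 300, 0.12),
    ("1", 300, 400, float("nan")),
    ("1", 400, 500, 0.12),
    ("1", 500, 600, 0.25),
]
print(partition(rows, [0, 0.05, 0.1, 0.15, 0.2]))
# [('1', 0, 200, 1), ('1', 200, 300, 3), ('1', 400, 500, 3)]
```

The two bin-3 rows stay separate because the NaN row between them breaks adjacency.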

PARAMETER	DESCRIPTION
`expr`	Track expression to evaluate. TYPE: `str`
`breaks`	Break points that determine the bins. Must have at least 2 elements and be strictly increasing. TYPE: `array-like`
`intervals`	Genomic scope for which the function is applied. If None, uses all genomic intervals. TYPE: `DataFrame` DEFAULT: `None`
`include_lowest`	If True, the lowest value of the range is included in the first bin. TYPE: `bool` DEFAULT: `False`
`iterator`	Track expression iterator. If None, determined implicitly. TYPE: `int or str` DEFAULT: `None`
`band`	Track expression band (not yet supported). TYPE: `optional`

RETURNS	DESCRIPTION
`DataFrame or None`	DataFrame with columns 'chrom', 'start', 'end', 'bin' where 'bin' is the 1-based bin index. Returns None if no values fall within the breaks. Adjacent intervals with the same bin are merged.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()

Partition dense_track values into 4 bins:

>>> breaks = [0, 0.05, 0.1, 0.15, 0.2]
>>> pm.gpartition("dense_track", breaks, pm.gintervals("1", 0, 5000))

See Also

gdist, gsummary, gextract

Notes

Values outside the break range are excluded from the result. NaN values are also excluded.

pymisha.gsample ¶

gsample(expr, n, intervals=None, iterator=None)

Sample values from a track expression using reservoir sampling.

Randomly samples n values from the track expression over the given intervals. The sampling is performed in a single streaming pass using a reservoir sampler, so it is memory-efficient regardless of the number of genomic positions scanned.
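
For readers unfamiliar with reservoir sampling, the following plain-Python sketch shows the classic "Algorithm R" variant. Which variant pymisha uses is not specified here, so treat this as an illustration of the idea only; `reservoir_sample` is a hypothetical helper, and skipping NaN values is an assumption based on the return description for this function.

```python
import math
import random


def reservoir_sample(stream, n, rng=None):
    """Uniformly sample up to `n` non-NaN values from an iterable in one pass."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = rng or random.Random()
    reservoir = []
    seen = 0  # number of non-NaN values consumed so far
    for value in stream:
        if math.isnan(value):
            continue
        seen += 1
        if len(reservoir) < n:
            reservoir.append(value)
        else:
            # Keep the new value with probability n / seen.
            j = rng.randrange(seen)
            if j < n:
                reservoir[j] = value
    return reservoir


sample = reservoir_sample(iter(range(1000)), 100, random.Random(0))
print(len(sample))  # 100

# Fewer non-NaN values than requested: the result is shorter than n.
short = reservoir_sample([1.0, float("nan"), 2.0], 5)
print(short)  # [1.0, 2.0]
```

Memory use is bounded by `n` regardless of how long the stream is, which is the property the description above relies on.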

PARAMETER	DESCRIPTION
`expr`	Track expression. TYPE: `str`
`n`	Number of samples to draw. TYPE: `int`
`intervals`	Genomic scope. If `None`, uses `gintervals_all()`. TYPE: `DataFrame` DEFAULT: `None`
`iterator`	Iterator policy for binning the intervals. TYPE: `int or str` DEFAULT: `None`

RETURNS	DESCRIPTION
`ndarray`	1-D array of sampled values (float64). Length may be less than n if fewer non-NaN data points exist.

RAISES	DESCRIPTION
`ValueError`	If n < 1.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> samples = pm.gsample("dense_track", 100)
>>> len(samples)
100
>>> samples = pm.gsample("dense_track", 50,
...                      pm.gintervals(1, 0, 10000))
>>> len(samples)
50

See Also

gextract, gsummary, gquantiles

pymisha.gcor ¶

gcor(*exprs, intervals=None, iterator=None, method='pearson', details=False, names=None)

Compute correlation between pairs of track expressions.

Calculates correlation in a single streaming pass over the data, making it memory-efficient for genome-wide computations. Supports multitasking via chromosome partitioning when enabled.
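
A single-pass Pearson correlation can be computed with running means and co-moments. The sketch below uses Welford-style updates in plain Python; it illustrates the streaming idea and is not taken from pymisha's source. `streaming_pearson` is a hypothetical helper, and skipping pairs where either value is NaN is an assumption.

```python
import math


def streaming_pearson(pairs):
    """Pearson correlation of (x, y) pairs in one pass and O(1) memory."""
    n = 0
    mean_x = mean_y = 0.0
    m2_x = m2_y = c_xy = 0.0  # running sums of squared and cross deviations
    for x, y in pairs:
        if math.isnan(x) or math.isnan(y):
            continue  # assumption: skip bins where either value is missing
        n += 1
        dx = x - mean_x
        dy = y - mean_y
        mean_x += dx / n
        mean_y += dy / n
        # Welford-style updates: old deviation times new deviation.
        m2_x += dx * (x - mean_x)
        m2_y += dy * (y - mean_y)
        c_xy += dx * (y - mean_y)
    if n < 2 or m2_x == 0.0 or m2_y == 0.0:
        return float("nan")
    return c_xy / math.sqrt(m2_x * m2_y)


r_pos = streaming_pearson([(1, 2), (2, 4), (float("nan"), 1), (3, 6), (4, 8)])
r_neg = streaming_pearson([(1, 3), (2, 2), (3, 1)])
print(round(r_pos, 6), round(r_neg, 6))
```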

PARAMETER	DESCRIPTION
`*exprs`	An even number of track expressions. Each consecutive pair (expr1, expr2) defines one correlation to compute. TYPE: `str` DEFAULT: `()`
`intervals`	Genomic scope. If `None`, uses `gintervals_all()`. TYPE: `DataFrame` DEFAULT: `None`
`iterator`	Iterator policy. TYPE: `int or str` DEFAULT: `None`
`method`	Correlation method. `"pearson"` computes Pearson correlation in a streaming pass. `"spearman"` computes an approximate, memory-bounded Spearman correlation using reservoir sampling. `"spearman.exact"` computes exact Spearman correlation with average-rank ties (requires O(n) memory in number of non-NaN bins). TYPE: `{"pearson", "spearman", "spearman.exact"}` DEFAULT: `"pearson"`
`details`	If `True`, return a DataFrame with full statistics (cor, cov, mean1, mean2, sd1, sd2, n, n.na) for Pearson, or (n, n.na, cor) for Spearman methods, instead of just correlation values. TYPE: `bool` DEFAULT: `False`
`names`	Names for each correlation pair. If `None`, names are generated as `"expr1~expr2"`. TYPE: `list of str` DEFAULT: `None`

RETURNS	DESCRIPTION
`ndarray or DataFrame`	If `details=False`: 1-D array of correlation values, one per pair. If `details=True`: DataFrame with rows per pair and columns for all statistics.

RAISES	DESCRIPTION
`ValueError`	If an odd number of expressions is provided.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gcor("dense_track", "sparse_track",
...         intervals=pm.gintervals(1, 0, 10000), iterator=1000)
array([...])
>>> pm.gcor("dense_track", "sparse_track",
...         intervals=pm.gintervals(1, 0, 10000), iterator=1000,
...         details=True)
          cor       cov     mean1  ...
0  ...
>>> pm.gcor("dense_track", "sparse_track",
...         intervals=pm.gintervals(1, 0, 10000), iterator=1000,
...         method="spearman")
array([...])
>>> pm.gcor("dense_track", "sparse_track",
...         intervals=pm.gintervals(1, 0, 10000), iterator=1000,
...         method="spearman.exact", details=True)
          n  n.na       cor
0  ...

See Also

gsummary, gextract

pymisha.gbins_summary ¶

gbins_summary(*args, expr=None, intervals=None, include_lowest=False, iterator=None, band=None, **kwargs)

Compute summary statistics per bin.

PARAMETER	DESCRIPTION
`*args`	Alternating track expressions and bin break vectors. TYPE: `pairs of (bin_expr, breaks)` DEFAULT: `()`
`expr`	Track expression to summarize. If None the first bin expression is used. TYPE: `str` DEFAULT: `None`
`intervals`	Genomic scope. Defaults to all intervals. TYPE: `DataFrame` DEFAULT: `None`
`include_lowest`	Include the left edge of the first bin. TYPE: `bool` DEFAULT: `False`
`iterator`	Track expression iterator. TYPE: `int or str` DEFAULT: `None`
`band`	Diagonal band `(d1, d2)` for 2D interval filtering. TYPE: `tuple of (float, float)` DEFAULT: `None`

RETURNS	DESCRIPTION
`ndarray`	Shape `(*n_bins, 7)` where the last axis holds: [Total intervals, NaN intervals, Min, Max, Sum, Mean, Std dev].

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gbins_summary("dense_track", [0, 0.2, 0.4, 2], expr="sparse_track",
...                  intervals=pm.gintervals(1), iterator="dense_track")

See Also

gsummary, gintervals_summary, gdist

pymisha.gbins_quantiles ¶

gbins_quantiles(*args, expr=None, percentiles=0.5, intervals=None, include_lowest=False, iterator=None, band=None, **kwargs)

Compute quantiles per bin.

PARAMETER	DESCRIPTION
`*args`	Alternating track expressions and bin break vectors. TYPE: `pairs of (bin_expr, breaks)` DEFAULT: `()`
`expr`	Track expression to compute quantiles for. If None the first bin expression is used. TYPE: `str` DEFAULT: `None`
`percentiles`	Percentile(s) in [0, 1]. TYPE: `float or array-like` DEFAULT: `0.5`
`intervals`	Genomic scope. Defaults to all intervals. TYPE: `DataFrame` DEFAULT: `None`
`include_lowest`	Include the left edge of the first bin. TYPE: `bool` DEFAULT: `False`
`iterator`	Track expression iterator. TYPE: `int or str` DEFAULT: `None`
`band`	Diagonal band `(d1, d2)` for 2D interval filtering. TYPE: `tuple of (float, float)` DEFAULT: `None`

RETURNS	DESCRIPTION
`ndarray`	Shape `(*n_bins, n_percentiles)`.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gbins_quantiles("dense_track", [0, 0.2, 0.4, 2],
...                    expr="sparse_track", percentiles=[0.2, 0.5],
...                    intervals=pm.gintervals(1), iterator="dense_track")

See Also

gquantiles, gintervals_quantiles, gdist

pymisha.gintervals_summary ¶

gintervals_summary(expr, intervals, iterator=None, **kwargs)

Compute summary statistics for each interval.

PARAMETER	DESCRIPTION
`intervals_set_out`	When provided, saves the resulting intervals set via `gintervals_save` and returns `None`. TYPE: `str`

RETURNS	DESCRIPTION
`DataFrame or None`	DataFrame with the original interval columns (chrom, start, end) plus summary statistic columns: Total intervals, NaN intervals, Min, Max, Sum, Mean, Std dev. Returns `None` if `intervals_set_out` is provided (result is saved to disk instead).

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervs = pm.gintervals([1, 2], 0, 5000)
>>> pm.gintervals_summary("dense_track", intervs)

See Also

gsummary, gbins_summary

pymisha.gintervals_quantiles ¶

gintervals_quantiles(expr, percentiles=0.5, intervals=None, iterator=None, **kwargs)

Compute quantiles for each interval.

PARAMETER	DESCRIPTION
`intervals_set_out`	When provided, saves the resulting intervals set via `gintervals_save` and returns `None`. TYPE: `str`

RETURNS	DESCRIPTION
`DataFrame or None`	DataFrame with the original interval columns (chrom, start, end) plus one column per requested percentile, named by the percentile value (e.g., `"0.5"`). Returns `None` if `intervals_set_out` is provided (result is saved to disk instead).

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervs = pm.gintervals([1, 2], 0, 5000)
>>> pm.gintervals_quantiles("dense_track", percentiles=[0.5, 0.3, 0.9],
...                         intervals=intervs)

See Also

gquantiles, gbins_quantiles

pymisha.gcis_decay ¶

gcis_decay(expr, breaks, src, domain, intervals=None, include_lowest=False, iterator=None, band=None)

Calculate distribution of cis contact distances.

For contacts where chrom1 equals chrom2 and the first interval (I1) is fully within src intervals, this function bins the distance between I1 and I2 separately for intra-domain and inter-domain contacts.

A contact is intra-domain when both I1 and I2 are fully contained within the same domain interval. Otherwise it is inter-domain.

The distance is abs((start1 + end1 - start2 - end2) / 2) (integer division), i.e. the absolute difference of the interval midpoints.
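
The distance formula and the intra/inter-domain rule can be written out in plain Python. This is a sketch under stated assumptions, not the library's code: `midpoint_distance` and `is_intra_domain` are hypothetical helpers, and the integer division is taken to truncate toward zero (C-style), which `abs(...) // 2` reproduces.

```python
def midpoint_distance(start1, end1, start2, end2):
    """Absolute difference of interval midpoints, using integer division.

    abs(...) // 2 truncates toward zero, mirroring C-style integer
    division; that choice is an assumption of this sketch.
    """
    return abs(start1 + end1 - start2 - end2) // 2


def is_intra_domain(i1, i2, domains):
    """True when both intervals lie fully inside the same domain."""
    (s1, e1), (s2, e2) = i1, i2
    return any(
        ds <= s1 and e1 <= de and ds <= s2 and e2 <= de
        for ds, de in domains
    )


domains = [(0, 500), (500, 2000)]
print(midpoint_distance(0, 100, 1000, 1200))              # 1050
print(is_intra_domain((0, 100), (1000, 1200), domains))    # False
print(is_intra_domain((600, 700), (1000, 1200), domains))  # True
```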

PARAMETER	DESCRIPTION
`expr`	A 2D track expression (must be a simple 2D track name). TYPE: `str`
`breaks`	Sorted break points defining distance bins. Example: `breaks=[x1, x2, x3]` creates bins `(x1, x2]` and `(x2, x3]`. TYPE: `array_like`
`src`	Source intervals (chrom, start, end). Only contacts whose I1 is fully within the unified source intervals are counted. Overlapping source intervals are allowed and will be merged. TYPE: `DataFrame`
`domain`	Domain intervals (chrom, start, end). Must be non-overlapping. Used to classify contacts as intra- or inter-domain. TYPE: `DataFrame`
`intervals`	Genomic scope (1D intervals). Defaults to all genome intervals. Only cis contacts (chrom1 == chrom2) within these chromosomes are considered. TYPE: `DataFrame` DEFAULT: `None`
`include_lowest`	If True, the lowest break value is included in the first bin: `[x1, x2]` instead of `(x1, x2]`. TYPE: `bool` DEFAULT: `False`
`iterator`	2D iterator specification. Currently unused (extraction uses the track's native resolution). TYPE: `str` DEFAULT: `None`
`band`	Diagonal band filter `(d1, d2)`. Only contacts where the diagonal offset falls within the band are considered. TYPE: `tuple of (int, int)` DEFAULT: `None`

RETURNS	DESCRIPTION
`ndarray`	2D array of shape `(n_bins, 2)` where column 0 is intra-domain counts and column 1 is inter-domain counts. Row and column labels are stored as a `breaks` attribute on the array.

See Also

gdist : General distribution of track expressions.
gextract : Extract track values over intervals.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> import pandas as pd
>>> src = pd.DataFrame({"chrom": ["1", "1"], "start": [0, 200000], "end": [100000, 400000]})
>>> domain = pd.DataFrame({"chrom": ["1"], "start": [0], "end": [500000]})
>>> breaks = [0, 100000, 200000, 300000, 400000, 500000]
>>> result = pm.gcis_decay("rects_track", breaks, src, domain)
>>> result.shape[1]
2

pymisha.gsegment ¶

gsegment(expr, minsegment, maxpval=0.05, onetailed=True, intervals=None, iterator=None, intervals_set_out=None)

Divide track expression into segments using Wilcoxon test.

Divides the values of a track expression into segments, where each segment size is at least minsegment and the P-value of comparing the segment with the first minsegment values from the next segment is at most maxpval. Comparison is done using the Wilcoxon (Mann-Whitney) test.

PARAMETER	DESCRIPTION
`expr`	Track expression. TYPE: `str`
`minsegment`	Minimal segment size in base pairs. TYPE: `int`
`maxpval`	Maximal P-value that separates two adjacent segments. TYPE: `float` DEFAULT: `0.05`
`onetailed`	If True, the Wilcoxon test is one-tailed. TYPE: `bool` DEFAULT: `True`
`intervals`	Genomic scope. Defaults to all genome intervals. TYPE: `DataFrame` DEFAULT: `None`
`iterator`	Fixed bin iterator size. If None, determined from track expression. TYPE: `int` DEFAULT: `None`
`intervals_set_out`	If provided, save result as an intervals set and return None. TYPE: `str` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame or None`	Intervals where each row represents a segment (chrom, start, end). Returns None if intervals_set_out is provided, or if input is empty.

See Also

gwilcox : Sliding-window Wilcoxon test.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> result = pm.gsegment("dense_track", 5000, maxpval=0.0001)
>>> result.columns.tolist()
['chrom', 'start', 'end']

pymisha.gwilcox ¶

gwilcox(expr, winsize1, winsize2, maxpval=0.05, onetailed=True, what2find=1, intervals=None, iterator=None, intervals_set_out=None)

Sliding-window Wilcoxon test over track expression values.

Runs a Wilcoxon test (Mann-Whitney) over the values of a track expression in two sliding windows with an identical center. Returns intervals where the smaller window tested against the larger window gives a P-value below maxpval.

PARAMETER	DESCRIPTION
`expr`	Track expression. TYPE: `str`
`winsize1`	Size of the first sliding window in base pairs. TYPE: `int`
`winsize2`	Size of the second sliding window in base pairs. TYPE: `int`
`maxpval`	Maximal P-value threshold. TYPE: `float` DEFAULT: `0.05`
`onetailed`	If True, the Wilcoxon test is one-tailed. TYPE: `bool` DEFAULT: `True`
`what2find`	-1 for lows, 1 for peaks, 0 for both. TYPE: `int` DEFAULT: `1`
`intervals`	Genomic scope. Defaults to all genome intervals. TYPE: `DataFrame` DEFAULT: `None`
`iterator`	Fixed bin iterator size. If None, determined from track expression. TYPE: `int` DEFAULT: `None`
`intervals_set_out`	If provided, save result as an intervals set and return None. TYPE: `str` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame or None`	Intervals with `pval` column where P-value is below `maxpval`. Returns None if no significant regions found, input is empty, or intervals_set_out is provided.

See Also

gsegment : Divide track expression into segments using Wilcoxon test.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> result = pm.gwilcox("dense_track", 100000, 1000, maxpval=0.01, what2find=1)
>>> result is None or "chrom" in result.columns
True

pymisha.glookup ¶

glookup(lookup_table, *args, intervals=None, include_lowest=False, force_binning=True, iterator=None, band=None, **kwargs)

Look up values in an N-dimensional lookup table indexed by track expressions.

For each iterator interval, evaluates one or more track expressions and uses the resulting values to index into a lookup table. Returns the table value for each interval.

Uses a memory-efficient C++ streaming implementation when expressions do not contain virtual tracks. Falls back to Python (memory-resident) when virtual tracks, a band filter, or 2D intervals are present.

PARAMETER	DESCRIPTION
`lookup_table`	N-dimensional lookup table. The shape must match the number of bins in each dimension. For 1D lookup, shape is `(n_bins,)`. For multi-dimensional lookup, shape is `(n_bins_1, n_bins_2, ...)`. TYPE: `ndarray`
`*args`	Alternating track expressions and break arrays defining bins. Same format as `gdist`. Example: `glookup(table, "track1", breaks1, "track2", breaks2, ...)`. TYPE: `pairs of (str, array-like)` DEFAULT: `()`
`intervals`	Genomic scope for which the function is applied. Required. TYPE: `DataFrame or str` DEFAULT: `None`
`include_lowest`	If True, the lowest break value is included in the first bin. Example: `breaks=[0, 0.2, 0.5]` creates `(0, 0.2], (0.2, 0.5]`. With `include_lowest=True`: `[0, 0.2], (0.2, 0.5]`. TYPE: `bool` DEFAULT: `False`
`force_binning`	If True, clamp out-of-range values to the nearest bin instead of NaN. If False, out-of-range values produce NaN. TYPE: `bool` DEFAULT: `True`
`iterator`	Track expression iterator. If None, determined implicitly. TYPE: `int or str` DEFAULT: `None`
`band`	Diagonal band for 2D tracks. Triggers Python fallback path. TYPE: `tuple of (int, int)` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame or None`	Intervals with columns: `chrom`, `start`, `end`, `value`, `intervalID`. Returns None if intervals is empty.

See Also

gdist : Compute distribution over binned track expressions.
gtrack_lookup : Create a track from an N-dimensional lookup table.

Examples:

>>> import pymisha as pm
>>> import numpy as np
>>> _ = pm.gdb_init_examples()

One-dimensional lookup:

>>> breaks = [0.1, 0.12, 0.14, 0.16, 0.18, 0.2]
>>> result = pm.glookup([10, 20, 30, 40, 50], "dense_track", breaks,
...                     intervals=pm.gintervals("1", 0, 500))
>>> "value" in result.columns
True
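
The effect of `force_binning` is easiest to see on a single value. The plain-Python sketch below mimics a one-dimensional lookup; it is an illustration, not pymisha's implementation. `lookup_1d` is a hypothetical helper, and returning NaN for NaN input is an assumption.

```python
from bisect import bisect_left
import math


def lookup_1d(table, value, breaks, include_lowest=False, force_binning=True):
    """Map `value` to a bin of `breaks` and return the matching table entry."""
    if math.isnan(value):
        return float("nan")  # assumption: missing input gives missing output
    n_bins = len(breaks) - 1
    if value == breaks[0] and include_lowest:
        idx = 0
    elif breaks[0] < value <= breaks[-1]:
        idx = bisect_left(breaks, value) - 1  # bins are (lo, hi]
    elif force_binning:
        # Clamp out-of-range values to the first or last bin.
        idx = 0 if value <= breaks[0] else n_bins - 1
    else:
        return float("nan")
    return table[idx]


breaks = [0.1, 0.12, 0.14, 0.16, 0.18, 0.2]
table = [10, 20, 30, 40, 50]
print(lookup_1d(table, 0.13, breaks))                       # 20
print(lookup_1d(table, 0.05, breaks))                       # 10
print(lookup_1d(table, 0.25, breaks))                       # 50
print(lookup_1d(table, 0.05, breaks, force_binning=False))  # nan
```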

pymisha.gtrack_lookup ¶

gtrack_lookup(track, description, lookup_table, *args, iterator=None, include_lowest=False, force_binning=True, band=None)

Create a track from an N-dimensional lookup table.

Evaluates track expressions genome-wide, looks up values in the table, and creates a new track from the results. Dense or sparse output is determined by the iterator type.

This is the track-creation counterpart of `glookup`, which returns values in-memory without creating a track.

PARAMETER	DESCRIPTION
`track`	Name for the new track. TYPE: `str`
`description`	Track description. TYPE: `str`
`lookup_table`	N-dimensional lookup table. Shape must match the number of bins defined by each `(expr, breaks)` pair. For 1D: `(n_bins,)`. For multi-dimensional: `(n_bins_1, n_bins_2, ...)`. TYPE: `ndarray`
`*args`	Alternating track expressions and break arrays defining bins. TYPE: `pairs of (str, array-like)` DEFAULT: `()`
`iterator`	Track expression iterator. Integer values create dense tracks with that bin size. Intervals-based iterators create sparse tracks. TYPE: `int or str` DEFAULT: `None`
`include_lowest`	If True, the lowest break value is included in the first bin. TYPE: `bool` DEFAULT: `False`
`force_binning`	If True, clamp out-of-range values to the nearest bin. If False, out-of-range values produce NaN in the track. TYPE: `bool` DEFAULT: `True`
`band`	Diagonal band for 2D tracks. Passed through to `glookup`. TYPE: `tuple of (int, int)` DEFAULT: `None`

RETURNS	DESCRIPTION
`None`

See Also

glookup : Look up values without creating a track.
gtrack_create : Create a track from an expression.
gdist : Compute distribution over binned track expressions.

Examples:

>>> import pymisha as pm
>>> import numpy as np
>>> _ = pm.gdb_init_examples()

Create a dense track from 1D lookup:

>>> pm.gtrack_lookup("my_track", "lookup track",
...     np.array([10.0, 20.0, 30.0, 40.0]),
...     "dense_track", [0, 0.05, 0.1, 0.15, 0.2],
...     iterator=100)