Intervals
Functions for creating, manipulating, and querying genomic intervals, including set operations, annotation, normalization, and I/O.
pymisha.gintervals
Create a 1D intervals DataFrame.
Constructs an intervals DataFrame from parallel arrays of chromosome names, start coordinates, and end coordinates. Scalar arguments are broadcast to match the longest array.
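The broadcasting rule can be pictured with a small pure-Python sketch. It does not call pymisha: `broadcast_columns` is a hypothetical helper, and rejecting lists of unequal length is an assumption, since the behavior for mismatched lengths is not described here.

```python
def broadcast_columns(chroms, starts, ends):
    """Repeat scalar arguments to the length of the longest list (illustration only)."""
    cols = [chroms, starts, ends]
    # Length of the longest list-like argument; scalars count as length 1.
    n = max(len(c) if isinstance(c, (list, tuple)) else 1 for c in cols)
    out = []
    for c in cols:
        if not isinstance(c, (list, tuple)):
            c = [c] * n  # scalar: repeat it n times
        elif len(c) != n:
            # Assumption: lists of different lengths are treated as an error.
            raise ValueError("list arguments must share one length")
        out.append(list(c))
    return list(zip(*out))


rows = broadcast_columns("1", [0, 200], [300, 600])
print(rows)  # [('1', 0, 300), ('1', 200, 600)]
```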
| PARAMETER | DESCRIPTION |
|---|---|
chroms
|
Chromosome names. Can be strings like
TYPE:
|
starts
|
Start coordinates (0-based, inclusive).
TYPE:
|
ends
|
End coordinates (0-based, exclusive).
TYPE:
|
strand
|
Strand information (
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
Sorted intervals with columns: chrom, start, end (and optionally strand). |
See Also
gintervals_all : Return full-chromosome intervals for every chromosome.
gintervals_2d : Create 2D intervals.
gintervals_from_tuples : Create intervals from a list of tuples.
gintervals_from_strings : Create intervals from region strings.
gintervals_from_bed : Create intervals from a BED file.
Examples:
The following calls produce equivalent results:
Specify start coordinates:
Multiple intervals with broadcast:
pymisha.gintervals_all
Return all chromosome intervals (ALLGENOME).
Returns a DataFrame with one row per chromosome, covering the full
extent of each chromosome in the current genome database as defined
by chrom_sizes.txt.
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
Intervals with columns: chrom, start, end. |
See Also
gintervals : Create a custom set of 1D intervals.
gintervals_2d_all : Return 2D intervals covering the whole genome.
gintervals_from_tuples : Create intervals from a list of tuples.
Examples:
pymisha.gintervals_2d
Create a set of 2D genomic intervals.
| PARAMETER | DESCRIPTION |
|---|---|
chroms1
|
Chromosome name(s) for first dimension.
TYPE:
|
starts1
|
Start coordinate(s) for first dimension.
TYPE:
|
ends1
|
End coordinate(s) for first dimension. -1 means full chromosome length.
TYPE:
|
chroms2
|
Chromosome name(s) for second dimension. Defaults to chroms1.
TYPE:
|
starts2
|
Start coordinate(s) for second dimension.
TYPE:
|
ends2
|
End coordinate(s) for second dimension. -1 means full chromosome length.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
Sorted 2D intervals with columns: chrom1, start1, end1, chrom2, start2, end2. |
See Also
gintervals : Create 1D intervals.
gintervals_2d_all : Return 2D intervals covering the whole genome.
gintervals_2d_band_intersect : Intersect 2D intervals with a diagonal band.
gintervals_force_range : Clamp intervals to chromosome boundaries.
Examples:
The following calls produce equivalent results:
Explicit coordinates on both dimensions:
Multiple intervals with broadcast:
pymisha.gintervals_2d_all
Return 2D intervals covering the whole genome.
| PARAMETER | DESCRIPTION |
|---|---|
mode
|
"diagonal" returns only intra-chromosomal pairs (chrom1 == chrom2). "full" returns all NxN chromosome pairs.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
2D intervals with columns: chrom1, start1, end1, chrom2, start2, end2. |
See Also
gintervals_2d : Create a custom set of 2D intervals.
gintervals_all : Return 1D intervals covering the whole genome.
gintervals_2d_band_intersect : Intersect 2D intervals with a diagonal band.
Examples:
Diagonal mode (intra-chromosomal pairs only):
Full NxN chromosome pairs:
pymisha.gintervals_2d_band_intersect
Intersect 2D intervals with a diagonal band.
Each 2D interval is intersected with the band defined by two distances d1 and d2 from the main diagonal (where x == y). The band captures the region where d1 <= (start1 - start2) < d2. If the intersection is non-empty, the interval is shrunk to the minimal bounding rectangle of the intersection.
Only cis (same-chromosome) intervals can intersect a band; trans intervals are removed.
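To make the shrinking step concrete, here is a brute-force sketch for a single cis rectangle. It is not pymisha's implementation: `band_intersect_bruteforce` is a hypothetical helper, and it assumes the band condition is applied to every integer point (x, y) of the rectangle as d1 <= x - y < d2. It enumerates all points, so it only suits tiny examples.

```python
def band_intersect_bruteforce(start1, end1, start2, end2, d1, d2):
    """Bounding rectangle of the points (x, y) with d1 <= x - y < d2.

    Returns (start1, end1, start2, end2) with half-open ends,
    or None when no point of the rectangle falls inside the band.
    """
    if d1 >= d2:
        raise ValueError("d1 must be < d2")
    hits = [
        (x, y)
        for x in range(start1, end1)
        for y in range(start2, end2)
        if d1 <= x - y < d2
    ]
    if not hits:
        return None
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    # +1 converts the largest covered coordinate back to an exclusive end.
    return (min(xs), max(xs) + 1, min(ys), max(ys) + 1)


shrunk = band_intersect_bruteforce(0, 10, 0, 10, 2, 5)
outside = band_intersect_bruteforce(0, 3, 10, 13, 0, 5)
print(shrunk, outside)  # (2, 10, 0, 8) None
```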
| PARAMETER | DESCRIPTION |
|---|---|
intervals
|
2D intervals with columns chrom1, start1, end1, chrom2, start2, end2.
TYPE:
|
band
|
Pair (d1, d2) defining the diagonal band. d1 must be < d2.
TYPE:
|
intervals_set_out
|
If provided, save result as intervals set and return None.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame or None
|
Intersected 2D intervals, or None if intervals_set_out is specified. |
See Also
gintervals_2d : Create 2D intervals.
gintervals_2d_all : Return 2D intervals covering the whole genome.
gintervals_intersect : Intersect two 1D interval sets.
Examples:
pymisha.gintervals_union
Calculate the union of two sets of intervals.
Returns intervals representing the genomic space covered by either
intervals1 or intervals2. Overlapping and adjacent regions
are merged in the result.
| PARAMETER | DESCRIPTION |
|---|---|
intervals1
|
First set of 1D intervals (chrom, start, end).
TYPE:
|
intervals2
|
Second set of 1D intervals (chrom, start, end).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame or None
|
Union intervals sorted by chrom and start, or |
See Also
gintervals_intersect : Intersection of two interval sets.
gintervals_diff : Difference of two interval sets.
gintervals_canonic : Merge overlapping intervals within one set.
Examples:
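A pure-Python sketch of the union semantics described above, operating on plain (chrom, start, end) tuples rather than DataFrames. `union` is a hypothetical stand-in, not the pymisha function, and sorting chromosome names lexicographically is an assumption.

```python
def union(intervals1, intervals2):
    """Merge two lists of (chrom, start, end) tuples into non-overlapping intervals."""
    merged = []
    # Assumption: chromosomes are ordered lexicographically by name.
    for chrom, start, end in sorted(intervals1 + intervals2):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlapping or adjacent (start == previous end): extend the last interval.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


a = [("1", 0, 100), ("1", 300, 400)]
b = [("1", 100, 200), ("1", 350, 500), ("2", 0, 10)]
print(union(a, b))  # [('1', 0, 200), ('1', 300, 500), ('2', 0, 10)]
```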
pymisha.gintervals_intersect
Calculate the intersection of two sets of intervals.
Returns intervals representing the genomic space covered by both
intervals1 and intervals2.
| PARAMETER | DESCRIPTION |
|---|---|
intervals1
|
First set of 1D intervals (chrom, start, end).
TYPE:
|
intervals2
|
Second set of 1D intervals (chrom, start, end).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame or None
|
Intersection intervals sorted by chrom and start, or |
See Also
gintervals_union : Union of two interval sets.
gintervals_diff : Difference of two interval sets.
gintervals_2d_band_intersect : Intersect 2D intervals with a diagonal band.
Examples:
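The intersection semantics can be sketched in pure Python on (chrom, start, end) tuples. `intersect` here is a hypothetical helper, not the pymisha call, and it assumes each input set is already free of internal overlaps.

```python
def intersect(intervals1, intervals2):
    """Return the regions covered by both lists of (chrom, start, end) tuples."""
    pieces = []
    for chrom1, s1, e1 in intervals1:
        for chrom2, s2, e2 in intervals2:
            if chrom1 != chrom2:
                continue
            start, end = max(s1, s2), min(e1, e2)
            if start < end:  # non-empty overlap under half-open coordinates
                pieces.append((chrom1, start, end))
    # Assumption: chromosomes are ordered lexicographically by name.
    return sorted(pieces)


a = [("1", 0, 100), ("1", 200, 300)]
b = [("1", 50, 250), ("2", 0, 10)]
print(intersect(a, b))  # [('1', 50, 100), ('1', 200, 250)]
```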
pymisha.gintervals_diff
Calculate the difference of two interval sets.
Returns genomic space covered by intervals1 but not by
intervals2.
| PARAMETER | DESCRIPTION |
|---|---|
intervals1
|
First set of 1D intervals (chrom, start, end).
TYPE:
|
intervals2
|
Second set of 1D intervals (chrom, start, end).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame or None
|
Difference intervals sorted by chrom and start, or |
See Also
gintervals_union : Union of two interval sets.
gintervals_intersect : Intersection of two interval sets.
Examples:
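A pure-Python sketch of the difference semantics on (chrom, start, end) tuples. `diff` is a hypothetical helper rather than the pymisha function, and it assumes the intervals in the first set do not overlap each other.

```python
def diff(intervals1, intervals2):
    """Return the parts of intervals1 that are not covered by intervals2."""
    result = []
    subtract = sorted(intervals2)
    for chrom, start, end in sorted(intervals1):
        cursor = start  # left edge of the part not yet examined
        for chrom2, s2, e2 in subtract:
            if chrom2 != chrom or e2 <= cursor or s2 >= end:
                continue  # other chromosome, or no overlap with the remainder
            if s2 > cursor:
                result.append((chrom, cursor, s2))  # uncovered gap before this piece
            cursor = max(cursor, e2)
        if cursor < end:
            result.append((chrom, cursor, end))  # uncovered tail
    return result


a = [("1", 0, 100)]
b = [("1", 20, 30), ("1", 50, 120)]
print(diff(a, b))  # [('1', 0, 20), ('1', 30, 50)]
```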
pymisha.gintervals_canonic
Convert intervals to canonical form.
Sorts intervals and merges overlapping ones. If
unify_touching_intervals is True, adjacent intervals (where one's
end equals another's start) are also merged. The result has no overlaps
and is properly sorted.
The result DataFrame carries a mapping from each original interval index
to the index of the canonical interval that contains it, available as
result.attrs['mapping'].
| PARAMETER | DESCRIPTION |
|---|---|
intervals
|
Intervals to canonicalize (chrom, start, end).
TYPE:
|
unify_touching_intervals
|
Whether to merge touching (end == start) intervals.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame or None
|
Canonical intervals with |
See Also
gintervals_union : Union of two interval sets (implicitly canonicalizes).
gintervals_intersect : Intersection of two interval sets.
Examples:
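The merge-and-map behavior can be sketched in pure Python. `canonic` below is a hypothetical helper on (chrom, start, end) tuples, not the pymisha function; the 0-based mapping it returns is an assumption about indexing, and it returns the mapping as a second value instead of attaching it to a DataFrame.

```python
def canonic(intervals, unify_touching_intervals):
    """Sort and merge intervals; also map each input index to its merged interval."""
    # Visit the inputs in sorted order while remembering their original positions.
    order = sorted(range(len(intervals)), key=lambda i: intervals[i])
    merged = []
    mapping = [None] * len(intervals)
    for i in order:
        chrom, start, end = intervals[i]
        joins = False
        if merged and merged[-1][0] == chrom:
            last_end = merged[-1][2]
            # Touching intervals (start == last_end) join only when requested.
            joins = start <= last_end if unify_touching_intervals else start < last_end
        if joins:
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
        mapping[i] = len(merged) - 1  # assumption: 0-based canonical index
    return merged, mapping


ivs = [("1", 100, 200), ("1", 0, 50), ("1", 50, 80), ("1", 150, 300)]
print(canonic(ivs, True))   # ([('1', 0, 80), ('1', 100, 300)], [1, 0, 0, 1])
print(canonic(ivs, False))  # ([('1', 0, 50), ('1', 50, 80), ('1', 100, 300)], [2, 0, 1, 2])
```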
pymisha.gintervals_force_range
Force intervals into valid chromosome ranges.
Enforces intervals to lie within [0, chrom_length) by clamping their boundaries. Intervals that fall entirely outside chromosome ranges are removed.
| PARAMETER | DESCRIPTION |
|---|---|
intervals
|
1D intervals with columns: chrom, start, end.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame or None
|
Clamped intervals, or |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If intervals is |
See Also
gintervals : Create a set of 1D intervals.
gintervals_2d : Create a set of 2D intervals.
gintervals_canonic : Merge overlapping intervals.
Examples:
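A minimal pure-Python sketch of the clamping rule. `force_range` and the `chrom_sizes` dictionary are hypothetical stand-ins: pymisha reads chromosome lengths from the genome database, whereas this sketch takes them as an argument.

```python
def force_range(intervals, chrom_sizes):
    """Clamp (chrom, start, end) tuples to [0, chrom_length) and drop empty results."""
    clamped = []
    for chrom, start, end in intervals:
        size = chrom_sizes[chrom]
        start, end = max(start, 0), min(end, size)
        if start < end:  # intervals entirely outside the chromosome vanish
            clamped.append((chrom, start, end))
    return clamped


sizes = {"1": 1000}  # hypothetical chromosome length
ivs = [("1", -50, 100), ("1", 900, 1200), ("1", 1500, 1600), ("1", -20, -5)]
print(force_range(ivs, sizes))  # [('1', 0, 100), ('1', 900, 1000)]
```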
pymisha.gintervals_covered_bp
Compute total basepairs covered by intervals.
Overlapping intervals are merged before counting to avoid double-counting. When src is provided, only the portion of intervals that overlaps src is counted.
| PARAMETER | DESCRIPTION |
|---|---|
intervals
|
Interval set with columns: chrom, start, end. A string is interpreted as a saved interval-set name.
TYPE:
|
src
|
If provided, restrict counting to the intersection of intervals with src.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
int
|
Total number of basepairs covered |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervs = pm.gintervals("1", [0, 200], [300, 600])
>>> pm.gintervals_covered_bp(intervs) # [0, 300) and [200, 600) merge into [0, 600): 600 bp
600
See Also
gintervals_coverage_fraction : Fraction of genomic space covered.
gintervals_canonic : Merge overlapping intervals.
gintervals : Create a set of 1D intervals.
pymisha.gintervals_coverage_fraction
Calculate the fraction of genomic space covered by intervals.
Returns the fraction of intervals2 (or the entire genome when
intervals2 is None) that is covered by intervals1. Overlapping
intervals in either set are unified before calculation.
| PARAMETER | DESCRIPTION |
|---|---|
intervals1
|
The covering set of 1D intervals (chrom, start, end).
TYPE:
|
intervals2
|
The reference space to measure against.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
float
|
A value between 0.0 and 1.0 representing the fraction of intervals2 (or the genome) covered by intervals1. |
See Also
gintervals_covered_bp : Total base pairs covered by intervals.
gintervals_intersect : Intersection of two interval sets.
gintervals_all : Return full-genome intervals.
Examples:
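The ratio being computed can be illustrated with a deliberately naive pure-Python sketch that enumerates covered positions. Both helpers are hypothetical and only practical for small coordinates; the whole-genome case (intervals2 is None) is left out because it needs chromosome sizes from the database.

```python
def covered_positions(intervals):
    """Set of (chrom, position) pairs covered by (chrom, start, end) tuples."""
    # A set removes double counting of overlapping intervals automatically.
    return {(chrom, pos) for chrom, start, end in intervals for pos in range(start, end)}


def coverage_fraction(intervals1, intervals2):
    """Fraction of the positions in intervals2 that intervals1 also covers."""
    covering = covered_positions(intervals1)
    reference = covered_positions(intervals2)
    return len(covering & reference) / len(reference)


i1 = [("1", 0, 300), ("1", 200, 600)]
i2 = [("1", 0, 1000)]
print(len(covered_positions(i1)))  # 600
print(coverage_fraction(i1, i2))   # 0.6
```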
pymisha.gintervals_mark_overlaps
Mark groups of overlapping intervals with a shared group ID.
Each interval in the input is assigned an integer group identifier.
Intervals that overlap (or touch, when unify_touching_intervals is
True) share the same group ID.
| PARAMETER | DESCRIPTION |
|---|---|
intervals
|
1D intervals with columns
TYPE:
|
group_col
|
Name of the column to store group IDs.
TYPE:
|
unify_touching_intervals
|
Whether touching intervals (
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
The original intervals with an added group_col column. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> import pandas as pd
>>> intervs = pd.DataFrame({
... "chrom": ["1", "1", "1", "1"],
... "start": [11000, 100, 10000, 10500],
... "end": [12000, 200, 13000, 10600],
... "data": [10, 20, 30, 40],
... })
>>> pm.gintervals_mark_overlaps(intervs)
See Also
gintervals_canonic : Merge overlapping intervals.
gintervals_intersect : Intersection of two interval sets.
gintervals_annotate : Annotate intervals with nearest-neighbor columns.
pymisha.gintervals_annotate
gintervals_annotate(intervals, annotation_intervals, annotation_columns=None, column_names=None, dist_column='dist', max_dist=float('inf'), na_value=_numpy.nan, maxneighbors=1, tie_method='first', overwrite=False, keep_order=True, **kwargs)
Annotate intervals with columns from the nearest annotation intervals.
For each interval in intervals, the nearest neighbor in
annotation_intervals is found (via gintervals_neighbors),
and the specified annotation columns are copied over.
| PARAMETER | DESCRIPTION |
|---|---|
intervals
|
1D query intervals.
TYPE:
|
annotation_intervals
|
Source intervals containing annotation data.
TYPE:
|
annotation_columns
|
Columns to copy from annotation_intervals.
TYPE:
|
column_names
|
Output names for the annotation columns (must match length of annotation_columns).
TYPE:
|
dist_column
|
Name for the distance column.
TYPE:
|
max_dist
|
Maximum absolute distance. Annotations farther away are replaced with na_value.
TYPE:
|
na_value
|
Fill value for annotations beyond max_dist or when no neighbor is found. Can be a dict mapping column names to individual fill values.
TYPE:
|
maxneighbors
|
Number of nearest neighbors to consider.
TYPE:
|
tie_method
|
Tie-breaking strategy when multiple neighbors are equidistant.
Only applies when
TYPE:
|
overwrite
|
If
TYPE:
|
keep_order
|
Preserve original row order.
TYPE:
|
**kwargs
|
Additional keyword arguments passed to
:func:
DEFAULT:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
The input intervals with added annotation and distance columns. |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If annotation columns conflict with existing columns and
overwrite is |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervs = pm.gintervals("1", [1000, 5000], [1100, 5050])
>>> ann = pm.gintervals("1", [900, 5400], [950, 5500])
>>> ann["remark"] = ["a", "b"]
>>> ann["score"] = [10.0, 20.0]
>>> pm.gintervals_annotate(intervs, ann)
>>> pm.gintervals_annotate(intervs, ann,
... annotation_columns=["remark"],
... column_names=["ann_remark"],
... dist_column="ann_dist")
>>> pm.gintervals_annotate(intervs, ann,
... annotation_columns=["remark"],
... max_dist=200, na_value="no_ann")
>>> pm.gintervals_annotate(intervs, ann,
... annotation_columns=["remark"],
... maxneighbors=2,
... tie_method="min.start")
See Also
gintervals_neighbors : Find nearest neighbors between interval sets.
gintervals_mark_overlaps : Mark groups of overlapping intervals.
pymisha.gintervals_normalize
Normalize intervals to a specified size by centering.
Each interval is resized to the target size while keeping its center position. Results are clamped to chromosome boundaries.
| PARAMETER | DESCRIPTION |
|---|---|
intervals
|
1D intervals with columns
TYPE:
|
size
|
Target interval size(s) in basepairs. Can be:
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
Normalized intervals. |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If size contains non-positive values or if vector length does not match the number of intervals. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervs = pm.gintervals("1", [1000, 5000], [2000, 6000])
>>> pm.gintervals_normalize(intervs, 500)
>>> pm.gintervals_normalize(intervs, [500, 1000])
>>> pm.gintervals_normalize(pm.gintervals("1", 1000, 2000), [500, 1000, 1500])
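The centering arithmetic can be sketched in pure Python. `normalize` and `chrom_sizes` are hypothetical, and the rounding rule (integer midpoint, then half the size to the left) is an assumption, since the exact rule is not stated here. The sketch covers only a scalar size or one size per interval; the single-interval call with several sizes shown in the last example is not handled.

```python
def normalize(intervals, size, chrom_sizes):
    """Resize (chrom, start, end) tuples around their midpoints, then clamp."""
    sizes = list(size) if isinstance(size, (list, tuple)) else [size] * len(intervals)
    if len(sizes) != len(intervals):
        raise ValueError("size vector length must match the number of intervals")
    if any(s <= 0 for s in sizes):
        raise ValueError("size must be positive")
    out = []
    for (chrom, start, end), s in zip(intervals, sizes):
        center = (start + end) // 2   # assumption: integer midpoint
        new_start = center - s // 2   # assumption: left half is floor(size / 2)
        new_end = new_start + s
        # Clamp to the chromosome, which can shorten intervals near the edges.
        out.append((chrom, max(new_start, 0), min(new_end, chrom_sizes[chrom])))
    return out


sizes = {"1": 5500}  # hypothetical chromosome length
ivs = [("1", 1000, 2000), ("1", 5000, 6000)]
print(normalize(ivs, 500, sizes))          # [('1', 1250, 1750), ('1', 5250, 5500)]
print(normalize(ivs, [500, 1000], sizes))  # [('1', 1250, 1750), ('1', 5000, 5500)]
```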
See Also
gintervals_force_range : Clamp intervals to chromosome boundaries.
gintervals_window : Create intervals centered on positions.
pymisha.gintervals_neighbors
gintervals_neighbors(intervals1, intervals2, maxneighbors=1, mindist=-1000000000.0, maxdist=1000000000.0, na_if_notfound=False, use_intervals1_strand=False)
Find nearest neighbors between two sets of intervals.
For each interval in intervals1, finds the closest intervals from intervals2. Distance directionality can be determined by either the strand of the target intervals (intervals2, default) or the query intervals (intervals1).
| PARAMETER | DESCRIPTION |
|---|---|
intervals1
|
Query intervals with columns 'chrom', 'start', 'end' (and optionally 'strand').
TYPE:
|
intervals2
|
Target intervals to search for neighbors.
TYPE:
|
maxneighbors
|
Maximum number of neighbors to return per query interval.
TYPE:
|
mindist
|
Minimum distance (negative means target is upstream/left of query).
TYPE:
|
maxdist
|
Maximum distance (positive means target is downstream/right of query).
TYPE:
|
na_if_notfound
|
If True, include queries with no neighbors (with NA values).
TYPE:
|
use_intervals1_strand
|
If True, use the intervals1 strand column for distance directionality instead of the intervals2 strand. This is useful for TSS analysis where you want upstream/downstream distances relative to gene direction. When True:
- + strand queries: negative distance = upstream, positive = downstream
- - strand queries: negative distance = downstream, positive = upstream
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame or None
|
DataFrame with query and neighbor coordinates plus distance column. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> query = pm.gintervals("1", [5000], [5100])
>>> targets = pm.gintervals("1", [3000, 7000], [3100, 7100])
>>> pm.gintervals_neighbors(query, targets)
See Also
gintervals_neighbors_upstream : Find upstream neighbors only.
gintervals_neighbors_downstream : Find downstream neighbors only.
gintervals_neighbors_directional : Find both upstream and downstream.
gintervals_annotate : Annotate intervals with nearest-neighbor columns.
pymisha.gintervals_neighbors_upstream
gintervals_neighbors_upstream(intervals1, intervals2, maxneighbors=1, maxdist=1000000000.0, na_if_notfound=False)
Find upstream neighbors of query intervals using strand directionality.
Upstream neighbors are those located in the 5' direction relative to the query strand: left (negative distance) for + strand queries, right (positive distance) for - strand queries.
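One way to picture the strand rule is a strand-relative signed distance, sketched below in pure Python. `strand_relative_distance` is a hypothetical helper, not part of pymisha, and measuring distance as the gap between the nearest edges is an assumption. The sign is flipped for - strand queries so that upstream targets come out non-positive on either strand.

```python
def strand_relative_distance(query, target, strand=1):
    """Signed gap between two (start, end) intervals on one chromosome.

    Negative values mean the target is upstream of the query, positive
    values mean downstream, both relative to the query strand.
    """
    q_start, q_end = query
    t_start, t_end = target
    if t_end <= q_start:
        genomic = t_end - q_start  # target on the left: negative (0 if touching)
    elif t_start >= q_end:
        genomic = t_start - q_end  # target on the right: positive (0 if touching)
    else:
        genomic = 0                # overlapping intervals
    # Strand 0 is treated as "+", matching the rule given for intervals1.
    return genomic if strand >= 0 else -genomic


query = (5000, 5100)
left, right = (3000, 3100), (7000, 7100)
print(strand_relative_distance(query, left, strand=1))    # -1900
print(strand_relative_distance(query, right, strand=-1))  # -1900
```

With this convention, keeping only targets whose value is <= 0 selects upstream neighbors, and >= 0 selects downstream neighbors.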
| PARAMETER | DESCRIPTION |
|---|---|
intervals1
|
Query intervals. If 'strand' column is present, it determines direction. Missing or strand=0 is treated as + strand.
TYPE:
|
intervals2
|
Target intervals to search for neighbors.
TYPE:
|
maxneighbors
|
Maximum number of upstream neighbors to return per query.
TYPE:
|
maxdist
|
Maximum distance to search for neighbors (in bp).
TYPE:
|
na_if_notfound
|
If True, include queries with no neighbors (with NA values).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame or None
|
DataFrame with query and neighbor coordinates plus distance column. Distance values are always <= 0 (upstream direction). |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> query = pm.gintervals("1", [5000], [5100])
>>> query["strand"] = 1 # + strand
>>> targets = pm.gintervals("1", [3000, 7000], [3100, 7100])
>>> pm.gintervals_neighbors_upstream(query, targets)
See Also
gintervals_neighbors : General neighbor finding.
gintervals_neighbors_downstream : Find downstream neighbors.
gintervals_neighbors_directional : Find both upstream and downstream.
pymisha.gintervals_neighbors_downstream
gintervals_neighbors_downstream(intervals1, intervals2, maxneighbors=1, maxdist=1000000000.0, na_if_notfound=False)
Find downstream neighbors of query intervals using strand directionality.
Downstream neighbors are those located in the 3' direction relative to the query strand: right (positive distance) for + strand queries, left (negative distance) for - strand queries.
| PARAMETER | DESCRIPTION |
|---|---|
intervals1
|
Query intervals. If 'strand' column is present, it determines direction. Missing or strand=0 is treated as + strand.
TYPE:
|
intervals2
|
Target intervals to search for neighbors.
TYPE:
|
maxneighbors
|
Maximum number of downstream neighbors to return per query.
TYPE:
|
maxdist
|
Maximum distance to search for neighbors (in bp).
TYPE:
|
na_if_notfound
|
If True, include queries with no neighbors (with NA values).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame or None
|
DataFrame with query and neighbor coordinates plus distance column. Distance values are always >= 0 (downstream direction). |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> query = pm.gintervals("1", [5000], [5100])
>>> query["strand"] = 1 # + strand
>>> targets = pm.gintervals("1", [3000, 7000], [3100, 7100])
>>> pm.gintervals_neighbors_downstream(query, targets)
See Also
gintervals_neighbors : General neighbor finding.
gintervals_neighbors_upstream : Find upstream neighbors.
gintervals_neighbors_directional : Find both upstream and downstream.
pymisha.gintervals_neighbors_directional
gintervals_neighbors_directional(intervals1, intervals2, maxneighbors_upstream=1, maxneighbors_downstream=1, maxdist=1000000000.0, na_if_notfound=False)
Find both upstream and downstream neighbors of query intervals.
Convenience function that returns both upstream and downstream neighbors in a single call.
| PARAMETER | DESCRIPTION |
|---|---|
intervals1
|
Query intervals. If 'strand' column is present, it determines direction. Missing or strand=0 is treated as + strand.
TYPE:
|
intervals2
|
Target intervals to search for neighbors.
TYPE:
|
maxneighbors_upstream
|
Maximum number of upstream neighbors to return per query.
TYPE:
|
maxneighbors_downstream
|
Maximum number of downstream neighbors to return per query.
TYPE:
|
maxdist
|
Maximum distance to search for neighbors (in bp).
TYPE:
|
na_if_notfound
|
If True, include queries with no neighbors (with NA values).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
dict
|
Dictionary with keys 'upstream' and 'downstream', each containing a DataFrame (or None) with neighbor results. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> query = pm.gintervals("1", [5000], [5100])
>>> query["strand"] = 1
>>> targets = pm.gintervals("1", [3000, 7000], [3100, 7100])
>>> result = pm.gintervals_neighbors_directional(query, targets)
>>> result["upstream"]
>>> result["downstream"]
See Also
gintervals_neighbors : General neighbor finding.
gintervals_neighbors_upstream : Find upstream neighbors only.
gintervals_neighbors_downstream : Find downstream neighbors only.
pymisha.gintervals_random
Generate random genomic intervals.
Intervals are sampled uniformly from the genome (after excluding chromosome edges and optional filter regions). Each interval is exactly size basepairs.
| PARAMETER | DESCRIPTION |
|---|---|
size
|
Interval size in basepairs (must be positive).
TYPE:
|
n
|
Number of intervals to generate (must be positive).
TYPE:
|
dist_from_edge
|
Minimum distance from chromosome boundaries.
TYPE:
|
chromosomes
|
Restrict sampling to these chromosomes.
TYPE:
|
mask
|
Intervals to exclude from sampling (columns
TYPE:
|
filter
|
Backward-compatible alias for
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
DataFrame with columns |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gintervals_random(100, 1000)
>>> pm.gintervals_random(100, 1000, chromosomes=["1"])
>>> import numpy as np; np.random.seed(42)
>>> pm.gintervals_random(100, 50)
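For intuition about the sampling scheme, here is a pure-Python sketch that draws fixed-size intervals uniformly over all valid start positions. `random_intervals`, its `chrom_sizes` argument, the `seed` argument, and the default of 0 for `dist_from_edge` are all choices made for this sketch rather than pymisha behavior. Masking is not implemented.

```python
import random


def random_intervals(size, n, chrom_sizes, dist_from_edge=0, seed=None):
    """Sample n intervals of exactly `size` bp, uniformly over valid start positions."""
    rng = random.Random(seed)
    # Number of valid start positions on each chromosome.
    slots = {
        chrom: length - 2 * dist_from_edge - size + 1
        for chrom, length in chrom_sizes.items()
    }
    slots = {chrom: k for chrom, k in slots.items() if k > 0}
    if not slots:
        raise ValueError("no chromosome can hold an interval of this size")
    chroms = list(slots)
    weights = [slots[c] for c in chroms]
    out = []
    for _ in range(n):
        # Weighting chromosomes by slot count makes every start equally likely.
        chrom = rng.choices(chroms, weights=weights)[0]
        start = dist_from_edge + rng.randrange(slots[chrom])
        out.append((chrom, start, start + size))
    return out


sizes = {"1": 10000, "2": 5000}  # hypothetical chromosome lengths
sample = random_intervals(100, 50, sizes, dist_from_edge=500, seed=42)
print(len(sample), sample[0])
```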
See Also
gintervals : Create intervals manually.
gintervals_all : Return full-genome intervals.
pymisha.gintervals_from_tuples
Create intervals from a list of tuples or dicts.
Each tuple should be (chrom, start, end) or
(chrom, start, end, strand). Alternatively, each element can be a
dict with the corresponding keys.
| PARAMETER | DESCRIPTION |
|---|---|
rows
|
Interval specifications. Tuples must have 3 or 4 elements.
TYPE:
|
strand
|
Strand values to assign when the tuples do not include strand.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame or None
|
Sorted intervals with columns: chrom, start, end (and optionally
strand). Returns |
See Also
gintervals : Create intervals from parallel arrays.
gintervals_from_strings : Create intervals from region strings.
gintervals_from_bed : Create intervals from a BED file.
gintervals_all : Return full-chromosome intervals.
Examples:
pymisha.gintervals_from_strings
Create intervals from region strings.
Parses strings of the form "chr1:100-200" or "chr1:100-200:+"
into an intervals DataFrame. If only a chromosome name is given
(e.g. "chr1"), the full chromosome extent is used.
| PARAMETER | DESCRIPTION |
|---|---|
regions
|
One or more region strings. Accepted formats:
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
Sorted intervals with columns: chrom, start, end (and optionally strand). |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If a region string cannot be parsed. |
See Also
gintervals : Create intervals from parallel arrays.
gintervals_from_tuples : Create intervals from a list of tuples.
gintervals_from_bed : Create intervals from a BED file.
Examples:
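The accepted string forms can be illustrated with a small pure-Python parser. `parse_region` and `chrom_sizes` are hypothetical, and encoding the strand as 1 or -1 is an assumption borrowed from the convention described for gintervals_import_genes.

```python
def parse_region(region, chrom_sizes):
    """Parse 'chrom', 'chrom:start-end' or 'chrom:start-end:strand' into a tuple."""
    parts = region.split(":")
    chrom = parts[0]
    if len(parts) == 1:
        # Bare chromosome name: use the full chromosome extent.
        return (chrom, 0, chrom_sizes[chrom], None)
    if len(parts) > 3:
        raise ValueError(f"cannot parse region {region!r}")
    try:
        start_text, end_text = parts[1].split("-")
        start, end = int(start_text), int(end_text)
    except ValueError:
        raise ValueError(f"cannot parse region {region!r}") from None
    strand = None
    if len(parts) == 3:
        if parts[2] not in ("+", "-"):
            raise ValueError(f"cannot parse strand in {region!r}")
        strand = 1 if parts[2] == "+" else -1  # assumption: 1 / -1 encoding
    return (chrom, start, end, strand)


sizes = {"chr1": 1000}  # hypothetical chromosome length
print(parse_region("chr1:100-200", sizes))    # ('chr1', 100, 200, None)
print(parse_region("chr1:100-200:+", sizes))  # ('chr1', 100, 200, 1)
print(parse_region("chr1", sizes))            # ('chr1', 0, 1000, None)
```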
pymisha.gintervals_from_bed
Create intervals from a BED-like file.
Reads a tab- or space-delimited file with at least three columns (chrom, start, end) and returns a sorted intervals DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
path
|
Path to BED file (chrom, start, end[, ...]).
TYPE:
|
has_strand
|
If True, use column 6 for strand when present.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame or None
|
Sorted intervals with columns: chrom, start, end (and optionally
strand). Returns |
| RAISES | DESCRIPTION |
|---|---|
FileNotFoundError
|
If path does not exist. |
See Also
gintervals : Create intervals from parallel arrays.
gintervals_from_tuples : Create intervals from a list of tuples.
gintervals_from_strings : Create intervals from region strings.
Examples:
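What reading such a file involves can be sketched in pure Python. `read_bed` is a hypothetical helper that takes an open text handle instead of a path; skipping comment and header lines, and mapping an unrecognized strand symbol to 0, are assumptions of the sketch.

```python
import io


def read_bed(handle, has_strand=False):
    """Read (chrom, start, end[, strand]) tuples from BED-like text, sorted."""
    rows = []
    for line in handle:
        line = line.strip()
        # Assumption: blank, comment, and track/browser header lines are skipped.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()  # splits on tabs or spaces
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if has_strand and len(fields) >= 6:
            # Column 6 holds the strand; unknown symbols map to 0 (assumption).
            strand = {"+": 1, "-": -1}.get(fields[5], 0)
            rows.append((chrom, start, end, strand))
        else:
            rows.append((chrom, start, end))
    return sorted(rows)


bed_text = "chr2\t10\t20\tb\t0\t-\nchr1\t5\t15\ta\t0\t+\n"
print(read_bed(io.StringIO(bed_text)))
# [('chr1', 5, 15), ('chr2', 10, 20)]
print(read_bed(io.StringIO(bed_text), has_strand=True))
# [('chr1', 5, 15, 1), ('chr2', 10, 20, -1)]
```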
pymisha.gintervals_import_genes
Import gene annotations from a UCSC knownGene-format file.
Reads gene definitions from genes_file and produces four sets of
intervals: TSS, exons, 3'UTR, and 5'UTR. A strand column is
included (1 for "+", -1 for "-").
If annots_file is provided, annotations are attached to the
intervals. annots_names must be supplied when annots_file is
given.
Both genes_file and annots_file may be local file paths or URLs
(http, https, ftp). Gzipped files (.gz) are handled automatically.
Overlapping intervals within each set are unified (merged). When two
overlapping intervals have different strands, the merged strand is set
to 0. Annotations from overlapping intervals are concatenated with
semicolons; duplicate annotation values are removed.
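The unification rules in the last paragraph can be sketched in pure Python. `unify` is a hypothetical helper working on (chrom, start, end, strand, annotation) tuples; it is not pymisha's implementation, and whether merely touching intervals are also merged is not stated, so the sketch merges only strict overlaps.

```python
def unify(records):
    """Merge overlapping records, resolving strand conflicts and joining annotations."""
    merged = []
    for chrom, start, end, strand, annot in sorted(records):
        last = merged[-1] if merged else None
        if last and last["chrom"] == chrom and start < last["end"]:
            last["end"] = max(last["end"], end)
            if last["strand"] != strand:
                last["strand"] = 0            # conflicting strands collapse to 0
            if annot not in last["annots"]:
                last["annots"].append(annot)  # duplicates are dropped
        else:
            merged.append({"chrom": chrom, "start": start, "end": end,
                           "strand": strand, "annots": [annot]})
    return [
        (m["chrom"], m["start"], m["end"], m["strand"], ";".join(m["annots"]))
        for m in merged
    ]


records = [
    ("1", 100, 200, 1, "geneA"),
    ("1", 150, 300, -1, "geneB"),
    ("1", 250, 280, 1, "geneA"),
    ("1", 500, 600, -1, "geneC"),
]
print(unify(records))
# [('1', 100, 300, 0, 'geneA;geneB'), ('1', 500, 600, -1, 'geneC')]
```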
| PARAMETER | DESCRIPTION |
|---|---|
genes_file
|
Path or URL to a knownGene-format file (12 tab-separated columns).
TYPE:
|
annots_file
|
Path or URL to an annotation file. The first column is the gene ID
(matching
TYPE:
|
annots_names
|
Names for the annotation columns. Required when
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
dict
|
Dictionary with keys |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If |
See Also
gintervals : Create a custom set of 1D intervals.
gintervals_save : Save intervals to the database.
Examples:
pymisha.gintervals_window
Create intervals centered on positions with fixed half-width.
Constructs intervals of width 2 * half_width centered on each
position in centers.
| PARAMETER | DESCRIPTION |
|---|---|
chroms
|
Chromosome name(s). Scalar is broadcast to match centers.
TYPE:
|
centers
|
Center positions. Scalar is broadcast to match chroms.
TYPE:
|
half_width
|
Half the desired interval width.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
Sorted intervals with columns: chrom, start, end. |
See Also
gintervals : Create intervals from explicit start/end coordinates.
gintervals_normalize : Resize intervals by centering.
Examples:
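The arithmetic behind the windows is simple enough to sketch in pure Python. `window` is a hypothetical helper on plain values, not the pymisha function; it does not clamp to chromosome boundaries, since no clamping is described for this function.

```python
def window(chroms, centers, half_width):
    """Build (chrom, center - half_width, center + half_width) tuples, sorted."""
    chroms = list(chroms) if isinstance(chroms, (list, tuple)) else [chroms]
    centers = list(centers) if isinstance(centers, (list, tuple)) else [centers]
    n = max(len(chroms), len(centers))
    # Broadcast a single chromosome or a single center to the longer argument.
    if len(chroms) == 1:
        chroms = chroms * n
    if len(centers) == 1:
        centers = centers * n
    if len(chroms) != len(centers):
        raise ValueError("chroms and centers must have matching lengths")
    return sorted((c, x - half_width, x + half_width) for c, x in zip(chroms, centers))


print(window("1", [1000, 5000], 250))  # [('1', 750, 1250), ('1', 4750, 5250)]
```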
pymisha.gintervals_ls
List named interval sets in the database.
| PARAMETER | DESCRIPTION |
|---|---|
pattern
|
Regular expression pattern to filter interval set names. Empty string matches all sets.
TYPE:
|
ignore_case
|
If True, pattern matching is case-insensitive.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list of str
|
Names of interval sets matching the pattern. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gintervals_ls()
>>> pm.gintervals_ls("annot.*")
See Also
gintervals_exists : Check if a named interval set exists.
gintervals_load : Load a named interval set.
gintervals_save : Save intervals as a named set.
gintervals_rm : Remove a named interval set.
pymisha.gintervals_exists
Check if a named interval set exists.
| PARAMETER | DESCRIPTION |
|---|---|
name
|
Name of the interval set to check.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if the interval set exists, False otherwise. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gintervals_exists("annotations")
True
See Also
gintervals_ls : List named interval sets.
gintervals_load : Load a named interval set.
gintervals_save : Save intervals as a named set.
gintervals_rm : Remove a named interval set.
pymisha.gintervals_dataset
Return the database/dataset root path for a named interval set.
Searches the user root, genome root, and all linked datasets for the given interval set name.
| PARAMETER | DESCRIPTION |
|---|---|
intervals
|
Name of the interval set (e.g.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str or None
|
The root path of the database/dataset containing the interval
set, or |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If intervals is |
See Also
gintervals_exists : Check if a named interval set exists.
gintervals_ls : List named interval sets.
gintervals_load : Load a named interval set.
Examples:
pymisha.gintervals_chrom_sizes
Get chromosome sizes for intervals.
| PARAMETER | DESCRIPTION |
|---|---|
intervals
|
Intervals with 'chrom' column.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
DataFrame with 'chrom' column containing unique chromosomes present in the input intervals. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervals = pm.gintervals(["1", "2"], [0, 0], [10000, 20000])
>>> pm.gintervals_chrom_sizes(intervals)
See Also
gintervals_load : Load a named interval set.
gintervals_exists : Check if a named interval set exists.
gintervals_ls : List named interval sets.
pymisha.gintervals_load
Load a named interval set from the database.
| PARAMETER | DESCRIPTION |
|---|---|
intervals_set
|
Name of the interval set to load (e.g., "annotations", "genes.coding").
TYPE:
|
chrom
|
If specified, only load intervals from this chromosome.
TYPE:
|
chrom1
|
If specified, load only intervals whose chrom1 matches this chromosome (2D only).
TYPE:
|
chrom2
|
If specified, load only intervals whose chrom2 matches this chromosome (2D only).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame or None
|
DataFrame with columns 'chrom', 'start', 'end' plus any additional columns stored in the interval set. Returns None if no intervals match. |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If the interval set does not exist. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervals = pm.gintervals_load("annotations")
>>> intervals = pm.gintervals_load("annotations", chrom="1")
See Also
gintervals_save : Save intervals as a named set.
gintervals_update : Update a chromosome in an existing set.
gintervals_exists : Check if a named interval set exists.
gintervals_ls : List named interval sets.
gintervals_rm : Remove a named interval set.
pymisha.gintervals_save
Save intervals to the database as a named interval set.
| PARAMETER | DESCRIPTION |
|---|---|
intervals
|
Intervals to save. Must have either 'chrom', 'start', 'end' columns (1D) or 'chrom1', 'start1', 'end1', 'chrom2', 'start2', 'end2' columns (2D).
TYPE:
|
intervals_set
|
Name for the interval set. Must start with a letter and contain only alphanumeric characters, underscores, and dots.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
None
|
|
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If the interval set name is invalid or already exists. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> intervals = pm.gintervals(["1", "2"], [100, 200], [1000, 2000])
>>> pm.gintervals_save(intervals, "my_intervals")
See Also
gintervals_load : Load a named interval set.
gintervals_update : Update a chromosome in an existing set.
gintervals_exists : Check if a named interval set exists.
gintervals_ls : List named interval sets.
gintervals_rm : Remove a named interval set.
pymisha.gintervals_update
Update intervals for a specific chromosome in an existing intervals set.
Replaces all intervals for the given chromosome with the new intervals. Pass intervals=None to delete all intervals for that chromosome.
| PARAMETER | DESCRIPTION |
|---|---|
intervals_set
|
Name of the existing intervals set.
TYPE:
|
intervals
|
New intervals for the chromosome, or None to delete.
TYPE:
|
chrom
|
Chromosome to update. Required.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
None
|
|
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If intervals set does not exist or chrom is not specified. |
See Also
gintervals_save : Save a new interval set.
gintervals_load : Load a named interval set.
gintervals_exists : Check if a named interval set exists.
gintervals_ls : List named interval sets.
Examples:
pymisha.gintervals_rm
Remove a named interval set from the database.
| PARAMETER | DESCRIPTION |
|---|---|
intervals_set
|
Name of the interval set to remove.
TYPE:
|
force
|
If True, do not raise an error if the interval set does not exist.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
None
|
|
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If the interval set does not exist and force is False. |
Examples:
See Also
gintervals_save : Save intervals as a named set. gintervals_load : Load a named interval set. gintervals_exists : Check if a named interval set exists. gintervals_ls : List named interval sets.
pymisha.gintervals_rbind ¶
Concatenate interval sets (DataFrames and/or named interval-set strings).
| PARAMETER | DESCRIPTION |
|---|---|
| *intervals | One or more interval sets. Each argument can be a DataFrame or the name of a saved interval set, which is loaded before concatenation. |
| intervals_set_out | If provided, the concatenated intervals are saved as a named interval set under this name. |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame or None | Concatenated intervals when intervals_set_out is not provided; otherwise None. |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If no interval arguments are provided, if an interval set does not exist, or if columns do not match exactly. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> i1 = pm.gextract("sparse_track", pm.gintervals(["1", "2"], 1000, 4000))
>>> i2 = pm.gextract("sparse_track", pm.gintervals(["2", "X"], 2000, 5000))
>>> pm.gintervals_save(i2, "tmp_intervs")
>>> pm.gintervals_rbind(i1, "tmp_intervs")
>>> pm.gintervals_rm("tmp_intervs", force=True)
See Also
gintervals_load : Load a named interval set.
gintervals_save : Save intervals as a named set.
gintervals_canonic : Merge overlapping intervals within one set.
pymisha.gintervals_mapply ¶
gintervals_mapply(func, *exprs, intervals=None, iterator=None, intervals_set_out=None, colnames='value')
Apply a function to track expression values for each interval.
Evaluates track expressions for each interval and passes the resulting value arrays to func. The return value of func becomes a new column in the output.
| PARAMETER | DESCRIPTION |
|---|---|
| func | Function to apply. Receives one numpy array per track expression. |
| *exprs | Track expressions to evaluate. |
| intervals | Intervals to process. |
| iterator | Track expression iterator. |
| intervals_set_out | If given, save result as an intervals set and return None. |
| colnames | Name of the result column. |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame or None | Intervals with an additional column containing func results, or None if intervals_set_out is specified. |
See Also
giterator_intervals : Inspect iterator bin boundaries.
Examples:
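The per-interval calling convention can be illustrated with a standalone model that does not touch a genome database. This is a sketch under stated assumptions: `mapply_model` is a hypothetical stand-in, the values are made up, and plain lists take the place of the numpy arrays that func receives in pymisha.

```python
# Stand-in data (assumption): the values of one track expression inside each
# interval. In pymisha these would be numpy arrays produced by the library.
intervals = [("1", 0, 200), ("1", 200, 400), ("2", 0, 300)]
values_per_interval = [[1.0, 3.0, 5.0], [2.0, 2.0], [4.0, 8.0, 6.0, 2.0]]


def mapply_model(func, intervals, *value_lists, colname="value"):
    """Call `func` once per interval and attach its result as a new column."""
    rows = []
    for i, (chrom, start, end) in enumerate(intervals):
        # One positional argument per track expression.
        args = [values[i] for values in value_lists]
        rows.append({"chrom": chrom, "start": start, "end": end,
                     colname: func(*args)})
    return rows


result = mapply_model(max, intervals, values_per_interval, colname="max_val")
for row in result:
    print(row)
```

Each output row keeps the interval coordinates and gains one column holding whatever func returned for that interval.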
pymisha.gintervals_convert_to_indexed ¶
Convert a 1D big interval set to indexed format.
Converts per-chromosome interval files into a single
intervals.dat + intervals.idx pair, reducing file-descriptor
usage from N files to 2. The indexed format is backward-compatible
with all misha interval functions.
| PARAMETER | DESCRIPTION |
|---|---|
| set_name | Name of the 1D interval set to convert. |
| remove_old | If True, remove the old per-chromosome files after conversion. |
| force | If True, re-convert even if the set is already indexed. |
| RETURNS | DESCRIPTION |
|---|---|
| None | |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If set_name is empty or the interval set does not exist. |
See Also
gintervals_2d_convert_to_indexed : Convert a 2D interval set to indexed format.
gintervals_is_indexed : Check if a set is already indexed.
gintervals_save : Save intervals as a named set.
gintervals_load : Load a named interval set.
Examples:
pymisha.gintervals_2d_convert_to_indexed ¶
Convert a 2D big interval set to indexed format.
Converts per-chromosome-pair interval files into a single
intervals2d.dat + intervals2d.idx pair. This dramatically
reduces file-descriptor usage, especially for genomes with many
chromosomes (from N*(N-1)/2 files to 2).
| PARAMETER | DESCRIPTION |
|---|---|
| set_name | Name of the 2D interval set to convert. |
| remove_old | If True, remove the old per-pair files after conversion. |
| force | If True, re-convert even if the set is already indexed. |
| RETURNS | DESCRIPTION |
|---|---|
| None | |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If set_name is empty or the interval set does not exist. |
See Also
gintervals_convert_to_indexed : Convert a 1D interval set to indexed format.
gintervals_is_indexed : Check if a set is already indexed.
gintervals_save : Save intervals as a named set.
gintervals_load : Load a named interval set.
Examples:
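To get a feel for the file-descriptor savings, the N*(N-1)/2 figure quoted in the description can be evaluated for a few chromosome counts. This is plain arithmetic on that formula; `pair_file_count` is a hypothetical helper and the chromosome counts are arbitrary.

```python
def pair_file_count(n_chroms: int) -> int:
    """Per-pair file count N*(N-1)/2, the formula quoted in the description."""
    return n_chroms * (n_chroms - 1) // 2


for n in (24, 100, 1000):
    print(f"{n} chromosomes: {pair_file_count(n)} per-pair files vs 2 indexed files")
# 24 chromosomes: 276 per-pair files vs 2 indexed files
# 100 chromosomes: 4950 per-pair files vs 2 indexed files
# 1000 chromosomes: 499500 per-pair files vs 2 indexed files
```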
pymisha.gintervals_is_indexed ¶
Check whether a big interval set is stored in indexed format.
Indexed format means the set uses intervals.idx/intervals.dat
(1D) or intervals2d.idx/intervals2d.dat (2D) files instead
of per-chromosome files.
| PARAMETER | DESCRIPTION |
|---|---|
| intervals_set | Name of the interval set to check. |
| RETURNS | DESCRIPTION |
|---|---|
| bool | |
See Also
gintervals_convert_to_indexed : Convert a 1D set to indexed format.
gintervals_2d_convert_to_indexed : Convert a 2D set to indexed format.
gintervals_exists : Check if a named interval set exists.
Examples:
pymisha.giterator_cartesian_grid ¶
giterator_cartesian_grid(intervals1, expansion1, intervals2=None, expansion2=None, min_band_idx=None, max_band_idx=None)
Create a 2D cartesian-grid iterator as 2D intervals.
The grid is built from 1D interval centers and expansion breakpoints.
For each center C and consecutive expansion pair (E[i], E[i+1]),
one 1D window [C + E[i], C + E[i+1]) is created (clipped to chromosome
bounds). The final result is the cartesian product of windows from
intervals1 and intervals2.
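The construction above can be sketched in plain Python. This is an illustrative model, not pymisha's implementation: `expand_windows` and `cartesian_grid` are hypothetical names, the center is assumed to be the integer midpoint of each interval, windows that clip to nothing are assumed to be dropped, and band-index filtering is left out. Both interval lists are passed explicitly.

```python
from itertools import product


def expand_windows(intervals, expansion, chrom_sizes):
    """Return 1D windows [C + E[i], C + E[i+1]) around each interval center C."""
    windows = []
    for chrom, start, end in intervals:
        center = (start + end) // 2  # assumption: integer midpoint
        for lo, hi in zip(expansion, expansion[1:]):
            # Clip the window to the chromosome bounds.
            w_start = max(center + lo, 0)
            w_end = min(center + hi, chrom_sizes[chrom])
            if w_start < w_end:  # assumption: empty windows are dropped
                windows.append((chrom, w_start, w_end))
    return windows


def cartesian_grid(intervals1, expansion1, intervals2, expansion2, chrom_sizes):
    """Pair every window from intervals1 with every window from intervals2."""
    w1 = expand_windows(intervals1, expansion1, chrom_sizes)
    w2 = expand_windows(intervals2, expansion2, chrom_sizes)
    return [a + b for a, b in product(w1, w2)]


chrom_sizes = {"1": 1000}  # made-up chromosome length
grid = cartesian_grid(
    [("1", 100, 200)], [-200, 0, 200],  # center 150
    [("1", 900, 1000)], [-100, 100],    # center 950
    chrom_sizes,
)
for cell in grid:
    print(cell)
# ('1', 0, 150, '1', 850, 1000)
# ('1', 150, 350, '1', 850, 1000)
```

Here the first window of the first axis would start at -50 and the window on the second axis would end at 1050, so both are clipped to the chromosome bounds.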
| PARAMETER | DESCRIPTION |
|---|---|
intervals1
|
1D intervals with columns
TYPE:
|
expansion1
|
Expansion breakpoints around centers of
TYPE:
|
intervals2
|
Second 1D interval source. If
TYPE:
|
expansion2
|
Expansion breakpoints for
TYPE:
|
min_band_idx
|
Lower bound for center-index delta filtering (
TYPE:
|
max_band_idx
|
Upper bound for center-index delta filtering. Can be used only when
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
2D intervals with columns:
|
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If inputs are invalid. |
pymisha.giterator_intervals ¶
giterator_intervals(expr=None, intervals=None, iterator=None, interval_relative=False, partial_bins='clip')
Return the iterator intervals grid without evaluating track expressions.
This is useful for inspecting the bin boundaries that would be produced by a given iterator/interval combination before running a full extraction.
| PARAMETER | DESCRIPTION |
|---|---|
expr
|
Track expression (used to determine the implicit iterator when
iterator is
TYPE:
|
intervals
|
Genomic scope. Defaults to :func:
TYPE:
|
iterator
|
Numeric bin size or track name that defines the iterator.
TYPE:
|
interval_relative
|
When
TYPE:
|
partial_bins
|
How to handle bins that do not fit entirely within an interval.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame
|
DataFrame with columns |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If neither expr nor iterator is provided. |
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.giterator_intervals(intervals=pm.gintervals("1", 0, 200), iterator=50)
>>> pm.giterator_intervals("dense_track", pm.gintervals("1", 0, 1000))
See Also
gintervals_mapply : Apply a function to track values per interval.