Liftover

Functions for converting genomic coordinates between assemblies using chain files.

pymisha.gintervals_load_chain

gintervals_load_chain(file, src_overlap_policy='error', tgt_overlap_policy='auto', src_groot=None, min_score=None)

Load an assembly conversion table from a UCSC chain file.

Reads a UCSC-format chain file and returns an assembly conversion table (DataFrame) that maps coordinates between a source genome and the current target genome. The resulting table can be used with gintervals_liftover and gtrack_liftover to convert intervals or tracks from the source assembly to the current one.

Source overlaps occur when the same source genome position maps to multiple target positions. Target overlaps occur when multiple source positions map to overlapping regions in the target genome. Both types of overlaps are handled according to the specified policies.
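
To make the two overlap types concrete, here is a minimal pure-Python sketch that flags them on a handful of chain blocks. The block tuples, their coordinates, and the helper names `overlaps` and `find_overlaps` are hypothetical and are not part of pymisha; the sketch only illustrates the definitions above.

```python
from itertools import combinations

# Hypothetical chain blocks: (chrom, start, end, chromsrc, startsrc, endsrc, chain_id).
# Coordinates are half-open [start, end); the values are made up for illustration.
blocks = [
    ("1", 100, 200, "chr25", 0, 100, 1),
    ("1", 150, 250, "chr25", 500, 600, 2),  # target range overlaps block 1
    ("2", 0, 100, "chr25", 50, 150, 3),     # source range overlaps block 1
]


def overlaps(chrom_a, start_a, end_a, chrom_b, start_b, end_b):
    """True when two half-open intervals on the same chromosome intersect."""
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a


def find_overlaps(blocks):
    """Return (source_pairs, target_pairs) as lists of chain_id pairs."""
    src_pairs, tgt_pairs = [], []
    for a, b in combinations(blocks, 2):
        # Source overlap: the same source positions appear in two blocks.
        if overlaps(a[3], a[4], a[5], b[3], b[4], b[5]):
            src_pairs.append((a[6], b[6]))
        # Target overlap: two blocks land on intersecting target regions.
        if overlaps(a[0], a[1], a[2], b[0], b[1], b[2]):
            tgt_pairs.append((a[6], b[6]))
    return src_pairs, tgt_pairs


src_pairs, tgt_pairs = find_overlaps(blocks)
print("source overlaps:", src_pairs)
print("target overlaps:", tgt_pairs)
```

The pairwise scan is quadratic and only suitable for small illustrations; it says nothing about how pymisha detects overlaps internally.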

PARAMETER	DESCRIPTION
`file`	Path to the UCSC chain file. The file must follow the standard UCSC chain format specification. Chains whose target chromosomes are not present in the current database are silently skipped. TYPE: `str`
`src_overlap_policy`	Policy for handling source-side overlaps. One of: `"error"` (default) -- raise an error if source overlaps are detected. `"keep"` -- allow one source interval to map to multiple target intervals. `"discard"` -- remove all chain intervals involved in source overlaps. TYPE: `str` DEFAULT: `'error'`
`tgt_overlap_policy`	Policy for handling target-side overlaps. One of: `"error"` -- raise an error if target overlaps are detected. `"auto"` (default) -- alias for `"auto_score"`. `"auto_score"` -- segment overlapping target regions and select the chain with the highest alignment score per segment. Tie-breakers: longest span, then lowest chain_id. `"auto_longer"` -- segment and select the chain with the longest span per segment. Tie-breakers: highest score, then lowest chain_id. `"auto_first"` -- segment and select the chain with the lowest chain_id per segment. `"keep"` -- preserve all overlapping intervals. `"discard"` -- remove all chain intervals involved in target overlaps. `"agg"` -- segment overlaps into disjoint sub-regions, retaining all contributing chains per region for downstream aggregation. `"best_source_cluster"` -- cluster chains by source overlap and keep the cluster with the largest total target length. `"best_cluster_union"` -- best cluster union strategy. `"best_cluster_sum"` -- best cluster sum strategy. `"best_cluster_max"` -- best cluster max strategy. TYPE: `str` DEFAULT: `'auto'`
`src_groot`	Path to the source genome database root for validating source chromosomes and coordinates. Not yet implemented. TYPE: `str` DEFAULT: `None`
`min_score`	Minimum alignment score threshold. Chains with scores below this value are filtered out before overlap resolution. TYPE: `float` DEFAULT: `None`
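
The `"auto_score"` description above (segment the overlapping target regions, then pick one chain per segment) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not pymisha's implementation: the block tuples and the function `resolve_auto_score` are hypothetical, "span" is taken to mean the block's target length, and adjacent segments won by the same chain are left unmerged.

```python
# Hypothetical overlapping blocks on one target chromosome:
# (start, end, score, chain_id), half-open target coordinates.
blocks = [
    (100, 300, 5000.0, 1),
    (200, 400, 9000.0, 2),
    (350, 500, 9000.0, 3),
]


def resolve_auto_score(blocks, min_score=None):
    """Split the target axis at every block boundary and keep one chain per segment."""
    if min_score is not None:
        # Score filtering is applied before overlap resolution.
        blocks = [b for b in blocks if b[2] >= min_score]
    # Every block start and end becomes a cut point.
    cuts = sorted({pos for start, end, _, _ in blocks for pos in (start, end)})
    resolved = []
    for seg_start, seg_end in zip(cuts, cuts[1:]):
        # Because all boundaries are cut points, a block either fully
        # covers a segment or does not touch it.
        covering = [b for b in blocks if b[0] <= seg_start and seg_end <= b[1]]
        if not covering:
            continue  # gap between blocks
        # Highest score wins; ties go to the longest span, then the lowest chain_id.
        best = min(covering, key=lambda b: (-b[2], -(b[1] - b[0]), b[3]))
        resolved.append((seg_start, seg_end, best[3]))
    return resolved


segments = resolve_auto_score(blocks)
for segment in segments:
    print(segment)
```

Swapping the sort key gives the other `auto_*` variants described above: span first for `"auto_longer"`, `chain_id` alone for `"auto_first"`.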

RETURNS	DESCRIPTION
`DataFrame`	Assembly conversion table with the following columns:

chrom (str) -- target chromosome name (normalized).
start (int) -- target interval start (0-based, inclusive).
end (int) -- target interval end (0-based, exclusive).
strand (int) -- target strand (0 = +, 1 = -).
chromsrc (str) -- source chromosome name.
startsrc (int) -- source interval start.
endsrc (int) -- source interval end.
strandsrc (int) -- source strand.
chain_id (int) -- chain identifier from the chain file.
score (float) -- chain alignment score.

The overlap policies are stored in DataFrame.attrs as "src_overlap_policy" and "tgt_overlap_policy".

RAISES	DESCRIPTION
`ValueError`	If the chain file is malformed, contains inconsistent chromosome sizes, has coordinates out of range, or if overlap policies are invalid. Also raised when `src_overlap_policy="error"` and source overlaps are detected, or `tgt_overlap_policy="error"` and target overlaps are detected.

See Also

gintervals_as_chain : Convert an existing DataFrame to chain format.
gintervals_liftover : Lift intervals from one assembly to another.
gtrack_liftover : Import a track from another assembly via liftover.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> import os
>>> chainfile = os.path.join(pm._GROOT, "data", "test.chain")
>>> chain = pm.gintervals_load_chain(
...     chainfile, src_overlap_policy="keep"
... )
>>> list(chain.columns)
['chrom', 'start', 'end', 'strand', 'chromsrc', 'startsrc', 'endsrc', 'strandsrc', 'chain_id', 'score']

pymisha.gintervals_as_chain

gintervals_as_chain(intervals, src_overlap_policy='error', tgt_overlap_policy='auto', min_score=None)

Convert a DataFrame to chain format by validating columns and setting attributes.

Validates that the input DataFrame has all required chain columns and attaches overlap-policy metadata as DataFrame attributes. This is useful when you have manually constructed or modified chain data and need to mark it as a valid chain table for use with gintervals_liftover or gtrack_liftover.

PARAMETER	DESCRIPTION
`intervals`	A DataFrame that must contain all of the required chain columns: `chrom`, `start`, `end`, `strand`, `chromsrc`, `startsrc`, `endsrc`, `strandsrc`, `chain_id`, `score`. TYPE: `DataFrame`
`src_overlap_policy`	Policy for handling source-side overlaps. One of `"error"` (default), `"keep"`, or `"discard"`. This value is stored as a DataFrame attribute but does not trigger overlap resolution. TYPE: `str` DEFAULT: `'error'`
`tgt_overlap_policy`	Policy for handling target-side overlaps. One of `"error"`, `"auto"` (default, alias for `"auto_score"`), `"auto_score"`, `"auto_longer"`, `"auto_first"`, `"keep"`, `"discard"`, `"agg"`, `"best_source_cluster"`, `"best_cluster_union"`, `"best_cluster_sum"`, `"best_cluster_max"`. Stored as a DataFrame attribute. TYPE: `str` DEFAULT: `'auto'`
`min_score`	Minimum alignment score threshold to record as a DataFrame attribute. Does not filter the data; the value is stored for informational use by downstream functions. TYPE: `float` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame`	A copy of the input DataFrame with overlap-policy attributes set in `DataFrame.attrs`: `"src_overlap_policy"` -- the source overlap policy. `"tgt_overlap_policy"` -- the target overlap policy (`"auto"` is normalized to `"auto_score"`). `"min_score"` -- present only if min_score was provided.

RAISES	DESCRIPTION
`TypeError`	If intervals is not a `pandas.DataFrame`.
`ValueError`	If required columns are missing, or if either overlap policy string is not a recognized value.
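
The validation and attribute rules described above can be summarized in a small stand-alone sketch. The helper `chain_attrs` is hypothetical and works on a plain sequence of column names rather than a DataFrame; it mirrors the documented behavior (required columns, recognized policy strings, `"auto"` normalized to `"auto_score"`, `"min_score"` stored only when given) and is not the library's code.

```python
REQUIRED_COLUMNS = (
    "chrom", "start", "end", "strand",
    "chromsrc", "startsrc", "endsrc", "strandsrc",
    "chain_id", "score",
)
SRC_POLICIES = {"error", "keep", "discard"}
TGT_POLICIES = {
    "error", "auto", "auto_score", "auto_longer", "auto_first", "keep",
    "discard", "agg", "best_source_cluster", "best_cluster_union",
    "best_cluster_sum", "best_cluster_max",
}


def chain_attrs(columns, src_overlap_policy="error",
                tgt_overlap_policy="auto", min_score=None):
    """Validate column names and policies; return the attrs that would be attached."""
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    if missing:
        raise ValueError(f"missing chain columns: {missing}")
    if src_overlap_policy not in SRC_POLICIES:
        raise ValueError(f"unknown src_overlap_policy: {src_overlap_policy!r}")
    if tgt_overlap_policy not in TGT_POLICIES:
        raise ValueError(f"unknown tgt_overlap_policy: {tgt_overlap_policy!r}")
    attrs = {
        "src_overlap_policy": src_overlap_policy,
        # "auto" is an alias and is stored in its normalized form.
        "tgt_overlap_policy": (
            "auto_score" if tgt_overlap_policy == "auto" else tgt_overlap_policy
        ),
    }
    if min_score is not None:
        # Recorded for downstream use only; no rows are filtered here.
        attrs["min_score"] = min_score
    return attrs


print(chain_attrs(REQUIRED_COLUMNS))
print(chain_attrs(REQUIRED_COLUMNS, tgt_overlap_policy="keep", min_score=500.0))
```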

See Also

gintervals_load_chain : Load a chain from a UCSC chain file.
gintervals_liftover : Lift intervals from one assembly to another.
gtrack_liftover : Import a track from another assembly via liftover.

Examples:

>>> import pandas as pd
>>> import pymisha as pm
>>> chain_data = pd.DataFrame({
...     "chrom": ["1"], "start": [1000], "end": [2000], "strand": [0],
...     "chromsrc": ["chr25"], "startsrc": [5000], "endsrc": [6000],
...     "strandsrc": [0], "chain_id": [1], "score": [1000.0],
... })
>>> chain = pm.gintervals_as_chain(chain_data)
>>> chain.attrs["tgt_overlap_policy"]
'auto_score'

pymisha.gintervals_liftover

gintervals_liftover(intervals, chain, src_overlap_policy='error', tgt_overlap_policy='auto', min_score=None, include_metadata=False, canonic=False, value_col=None, multi_target_agg='mean', params=None, na_rm=True, min_n=None)

Convert intervals from another assembly to the current one using a chain.

Maps each source interval through the chain's alignment blocks to produce the corresponding target-genome coordinates. A single source interval may produce multiple target intervals when it spans chain gaps or maps through multiple chains. The intervalID column in the output links each result row back to the originating source interval (0-based positional index).
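
The block-by-block mapping described above can be illustrated with a short pure-Python sketch. It assumes plus-strand blocks only, half-open coordinates, and made-up chain data; the function `lift` is hypothetical and is not how pymisha is implemented. In this sketch, a source interval that overlaps no block simply produces no output rows.

```python
# Hypothetical plus-strand chain blocks:
# (chrom, start, end, chromsrc, startsrc, endsrc, chain_id)
chain = [
    ("1", 1000, 1500, "chr25", 0, 500, 7),
    ("1", 1600, 2100, "chr25", 700, 1200, 7),  # preceded by a gap in both assemblies
]


def lift(intervals, chain):
    """Map (chrom, start, end) source intervals through the chain blocks."""
    lifted = []
    for interval_id, (chrom, start, end) in enumerate(intervals):
        for t_chrom, t_start, t_end, s_chrom, s_start, s_end, chain_id in chain:
            if s_chrom != chrom:
                continue
            # Clip the source interval to the block's source range.
            lo, hi = max(start, s_start), min(end, s_end)
            if lo >= hi:
                continue  # no overlap with this block
            # Plus strand: source and target differ by a constant offset.
            offset = t_start - s_start
            lifted.append((t_chrom, lo + offset, hi + offset, interval_id, chain_id))
    # Rows are ordered by target chromosome name and coordinates.
    return sorted(lifted)


intervals = [("chr25", 400, 900), ("chr25", 5000, 6000)]
lifted = lift(intervals, chain)
for row in lifted:
    print(row)  # (chrom, start, end, intervalID, chain_id)
```

The first interval spans the chain gap and yields two rows that share `intervalID` 0; the second interval lies outside every block and yields none.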

When chain is a file path, it is loaded with the specified overlap policies. When it is a pre-loaded DataFrame (from gintervals_load_chain or gintervals_as_chain), the policies stored in its attributes are used and the policy arguments here are ignored.

PARAMETER	DESCRIPTION
`intervals`	Source-assembly intervals. Must contain at least the columns `chrom`, `start`, and `end`. Chromosome names should match the source side of the chain (`chromsrc`). TYPE: `DataFrame`
`chain`	Either a path to a UCSC chain file (loaded via `gintervals_load_chain`) or a pre-loaded chain DataFrame. TYPE: `str or DataFrame`
`src_overlap_policy`	Source overlap policy, used only when chain is a file path. One of `"error"` (default), `"keep"`, or `"discard"`. TYPE: `str` DEFAULT: `'error'`
`tgt_overlap_policy`	Target overlap policy, used only when chain is a file path. One of `"error"`, `"auto"` (default), `"auto_score"`, `"auto_longer"`, `"auto_first"`, `"keep"`, `"discard"`, `"agg"`, `"best_source_cluster"`, `"best_cluster_union"`, `"best_cluster_sum"`, `"best_cluster_max"`. TYPE: `str` DEFAULT: `'auto'`
`min_score`	Minimum chain alignment score, used only when chain is a file path. Chains scoring below this threshold are excluded. TYPE: `float` DEFAULT: `None`
`include_metadata`	If `True`, a `score` column is added to the output containing the alignment score of the chain that produced each mapping. Default is `False`. TYPE: `bool` DEFAULT: `False`
`canonic`	If `True`, adjacent target intervals originating from the same source interval (same `intervalID`) and the same chain (same `chain_id`) are merged into a single interval. Useful when a source interval maps to multiple adjacent target blocks separated by chain alignment gaps. Default is `False`. TYPE: `bool` DEFAULT: `False`
`value_col`	Name of a numeric column in intervals whose values should be carried through the liftover. When specified, the output includes this column with its original name. Ignored if `None`. TYPE: `str` DEFAULT: `None`
`multi_target_agg`	Aggregation method applied to value_col when multiple source intervals map to the same target region. One of `"mean"` (default), `"median"`, `"sum"`, `"min"`, `"max"`, `"count"`, `"first"`, `"last"`. Ignored when value_col is `None`. TYPE: `str` DEFAULT: `'mean'`
`params`	Additional parameters for specific aggregation methods (e.g., `n` for `"nth"` aggregation). TYPE: `dict or int` DEFAULT: `None`
`na_rm`	If `True` (default), `NaN` values are removed before aggregation. If `False`, any `NaN` in the group causes the aggregated result to be `NaN`. Only used when value_col is specified. TYPE: `bool` DEFAULT: `True`
`min_n`	Minimum number of non-`NaN` values required for aggregation. If fewer values are available, the result is `NaN`. `None` (default) means no minimum. Only used when value_col is specified. TYPE: `int` DEFAULT: `None`
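
As a rough illustration of what `canonic=True` is described as doing, the sketch below merges lifted rows that touch end-to-start and share both `intervalID` and `chain_id`. Interpreting "adjacent" as `end == start` is an assumption, and `merge_canonic` and the row data are hypothetical.

```python
def merge_canonic(rows):
    """Merge touching rows (chrom, start, end, intervalID, chain_id) per source interval and chain."""
    merged = []
    # Group candidates together: same intervalID, chain_id and chrom, ordered by start.
    for row in sorted(rows, key=lambda r: (r[3], r[4], r[0], r[1])):
        chrom, start, end, interval_id, chain_id = row
        if merged:
            p_chrom, p_start, p_end, p_id, p_chain = merged[-1]
            same_group = (p_chrom, p_id, p_chain) == (chrom, interval_id, chain_id)
            if same_group and p_end == start:
                # Extend the previous row instead of emitting a new one.
                merged[-1] = (p_chrom, p_start, end, p_id, p_chain)
                continue
        merged.append(row)
    # Return rows ordered by target chromosome name and coordinates.
    return sorted(merged)


rows = [
    ("1", 1000, 1500, 0, 7),
    ("1", 1500, 1800, 0, 7),  # touches the previous row, same interval and chain
    ("1", 1800, 2000, 0, 8),  # touches too, but comes from a different chain
    ("1", 2500, 2600, 1, 7),
]
canonical = merge_canonic(rows)
for row in canonical:
    print(row)
```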

RETURNS	DESCRIPTION
`DataFrame`	Lifted intervals sorted by target coordinates with the columns:

chrom (str) -- target chromosome.
start (int) -- target start (0-based, inclusive).
end (int) -- target end (0-based, exclusive).
intervalID (int) -- 0-based index of the source interval in the input intervals DataFrame.
chain_id (int) -- identifier of the chain that produced the mapping.
score (float) -- chain alignment score (only when include_metadata is True).
<value_col> (float) -- carried-through values, stored under the original column name given in value_col (only when value_col is specified).

RAISES	DESCRIPTION
`ValueError`	If intervals or chain is `None`, or if a file-path chain cannot be loaded.

See Also

gintervals_load_chain : Load a chain from a UCSC chain file.
gintervals_as_chain : Convert a DataFrame to chain format.
gtrack_liftover : Import a full track from another assembly.

Examples:

>>> import pandas as pd
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> import os
>>> chainfile = os.path.join(pm._GROOT, "data", "test.chain")
>>> intervs = pd.DataFrame({
...     "chrom": ["chr25", "chr25"],
...     "start": [0, 7000],
...     "end": [6000, 20000],
... })
>>> lifted = pm.gintervals_liftover(
...     intervs, chainfile, src_overlap_policy="keep"
... )
>>> list(lifted.columns)
['chrom', 'start', 'end', 'intervalID', 'chain_id']

pymisha.gtrack_liftover

gtrack_liftover(track, description, src_track_dir, chain, src_overlap_policy='error', tgt_overlap_policy='auto', multi_target_agg='mean', params=None, na_rm=True, min_n=None, min_score=None)

Import a track from another assembly via coordinate liftover.

Reads a source track from src_track_dir (a directory containing per-chromosome binary track files or an indexed track.idx/track.dat pair), maps its intervals through chain to the current target genome, aggregates values when multiple source intervals land on the same target region, and creates a new sparse track in the current database.

When chain is a file path it is loaded with the specified overlap policies. When it is a pre-loaded DataFrame the policies stored in its attributes are used and the policy arguments here are ignored.

PARAMETER	DESCRIPTION
`track`	Name of the new track to create in the current database. The track must not already exist. TYPE: `str`
`description`	Human-readable description stored as a track attribute. TYPE: `str`
`src_track_dir`	Path to the source track directory. The directory may contain per-chromosome binary files (dense or sparse) or an indexed pair of `track.idx` and `track.dat` files. TYPE: `str`
`chain`	Either a path to a UCSC chain file or a pre-loaded chain DataFrame as returned by `gintervals_load_chain`. TYPE: `str or DataFrame`
`src_overlap_policy`	Source overlap policy, used only when chain is a file path. One of `"error"` (default), `"keep"`, or `"discard"`. TYPE: `str` DEFAULT: `'error'`
`tgt_overlap_policy`	Target overlap policy, used only when chain is a file path. One of `"error"`, `"auto"` (default), `"auto_score"`, `"auto_longer"`, `"auto_first"`, `"keep"`, `"discard"`, `"agg"`, `"best_source_cluster"`, `"best_cluster_union"`, `"best_cluster_sum"`, `"best_cluster_max"`. TYPE: `str` DEFAULT: `'auto'`
`multi_target_agg`	Aggregation function applied when multiple source values map to the same target locus. One of `"mean"` (default), `"median"`, `"sum"`, `"min"`, `"max"`, `"count"`, `"first"`, `"last"`. TYPE: `str` DEFAULT: `'mean'`
`params`	Extra parameters for specific aggregation methods (e.g., `n` for `"nth"` aggregation). TYPE: `dict` DEFAULT: `None`
`na_rm`	If `True` (default), `NaN` values are removed before aggregation. If `False`, any `NaN` in the group causes the aggregated result to be `NaN`. TYPE: `bool` DEFAULT: `True`
`min_n`	Minimum number of non-`NaN` values required for aggregation. If fewer values are available the result is `NaN`. `None` (default) means no minimum. TYPE: `int` DEFAULT: `None`
`min_score`	Minimum chain alignment score. Chains scoring below this value are excluded during loading. Only used when chain is a file path. TYPE: `float` DEFAULT: `None`
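
The interaction between `multi_target_agg`, `na_rm`, and `min_n` described above can be sketched with the standard library alone. The function `aggregate` is a hypothetical stand-in covering only the eight listed methods. Returning `NaN` for an empty group is a choice made for this sketch, not documented pymisha behavior.

```python
import math
import statistics

AGGREGATORS = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "sum": math.fsum,
    "min": min,
    "max": max,
    "count": len,
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
}


def aggregate(values, how="mean", na_rm=True, min_n=None):
    """Combine the source values that land on one target region."""
    if how not in AGGREGATORS:
        raise ValueError(f"unsupported aggregation: {how!r}")
    present = [v for v in values if not math.isnan(v)]
    if not na_rm and len(present) < len(values):
        return math.nan  # any NaN poisons the group when na_rm is False
    if min_n is not None and len(present) < min_n:
        return math.nan  # too few non-NaN values
    if not present:
        return math.nan  # sketch assumption: an empty group yields NaN
    return float(AGGREGATORS[how](present))


print(aggregate([1.0, 3.0, math.nan]))          # NaN dropped before averaging
print(aggregate([1.0, math.nan], na_rm=False))  # NaN propagates
print(aggregate([1.0, math.nan], min_n=2))      # only one usable value
```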

RETURNS	DESCRIPTION
`None`	The function creates a new sparse track in the current database as a side effect and does not return a value.

RAISES	DESCRIPTION
`ValueError`	If track already exists, if src_track_dir does not exist, if the aggregation function is unsupported, or if the chain file is invalid.
`TypeError`	If chain is neither a file path string nor a `pandas.DataFrame`.

See Also

gintervals_load_chain : Load a chain from a UCSC chain file.
gintervals_liftover : Lift intervals (without creating a track).
gtrack_create_sparse : Create a sparse track from intervals and values.

Notes

UCSC chain format terminology is reversed from misha convention: UCSC "target" (tName, tStart, tEnd) corresponds to misha "source" (chromsrc, startsrc, endsrc), and UCSC "query" corresponds to misha "target" (chrom, start, end).
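
The naming reversal is easiest to see on a single chain header line, whose field order in the UCSC format is `chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id`. The sketch below relabels those fields with the misha column names from the note above. The header values and the function `parse_chain_header` are hypothetical. The sketch does not normalize chromosome names or encode strands, and it does not convert minus-strand coordinates, which UCSC expresses relative to the reverse-complemented sequence.

```python
def parse_chain_header(line):
    """Relabel one UCSC chain header line using misha column names."""
    fields = line.split()
    if len(fields) != 13 or fields[0] != "chain":
        raise ValueError(f"not a chain header line: {line!r}")
    (_, score, t_name, _t_size, _t_strand, t_start, t_end,
     q_name, _q_size, _q_strand, q_start, q_end, chain_id) = fields
    return {
        # UCSC "query" side -> misha target columns
        "chrom": q_name,
        "start": int(q_start),
        "end": int(q_end),
        # UCSC "target" side -> misha source columns
        "chromsrc": t_name,
        "startsrc": int(t_start),
        "endsrc": int(t_end),
        "chain_id": int(chain_id),
        "score": float(score),
    }


# Made-up header, both sides on the plus strand.
header = parse_chain_header(
    "chain 1000 chr25 20000 + 5000 6000 chr1 500000 + 1000 2000 1"
)
print(header)
```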

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> import os
>>> chainfile = os.path.join(pm._GROOT, "data", "test.chain")
>>> pm.gtrack_liftover(
...     "lifted_track", "Track lifted from other assembly",
...     "/path/to/source/tracks/my_track.track", chainfile,
... )