Liftover

Functions for converting genomic coordinates between assemblies using chain files.

pymisha.gintervals_load_chain

gintervals_load_chain(file: str, src_overlap_policy: str = 'error', tgt_overlap_policy: str = 'auto', src_groot: str | None = None, min_score: float | None = None) -> pd.DataFrame

Load an assembly conversion table from a UCSC chain file.

Reads a UCSC-format chain file and returns an assembly conversion table (DataFrame) that maps coordinates between a source genome and the current target genome. The resulting table can be used with gintervals_liftover and gtrack_liftover to convert intervals or tracks from the source assembly to the current one.
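As a rough illustration of the input format, the sketch below parses one header line and the alignment data lines of a UCSC chain record using only the standard library. The field names follow the UCSC chain specification (`tName`, `qName`, and so on); the helper names `parse_chain_header` and `expand_blocks` and the sample record are hypothetical, and this is not pymisha's parser. How pymisha assigns the `t*` and `q*` fields to its `chrom`/`chromsrc` columns is not shown here.

```python
from typing import NamedTuple


class ChainHeader(NamedTuple):
    score: float
    t_name: str
    t_size: int
    t_strand: str
    t_start: int
    t_end: int
    q_name: str
    q_size: int
    q_strand: str
    q_start: int
    q_end: int
    chain_id: int


def parse_chain_header(line: str) -> ChainHeader:
    """Parse a 'chain score tName tSize tStrand tStart tEnd qName ... id' line."""
    f = line.split()
    if len(f) != 13 or f[0] != "chain":
        raise ValueError(f"not a chain header line: {line!r}")
    return ChainHeader(
        float(f[1]),
        f[2], int(f[3]), f[4], int(f[5]), int(f[6]),
        f[7], int(f[8]), f[9], int(f[10]), int(f[11]),
        int(f[12]),
    )


def expand_blocks(header: ChainHeader, data_lines):
    """Turn 'size dt dq' rows (and the final 'size' row) into ungapped blocks.

    Returns (t_start, t_end, q_start, q_end) tuples in the coordinates written
    in the file; no strand conversion is attempted in this sketch.
    """
    t_pos, q_pos = header.t_start, header.q_start
    blocks = []
    for line in data_lines:
        parts = [int(p) for p in line.split()]
        size = parts[0]
        blocks.append((t_pos, t_pos + size, q_pos, q_pos + size))
        if len(parts) == 3:  # skip the gaps that precede the next block
            t_pos += size + parts[1]
            q_pos += size + parts[2]
    return blocks


# Made-up record, for illustration only.
header = parse_chain_header("chain 1000 chr1 1000 + 100 300 chr1 1200 + 150 350 7")
blocks = expand_blocks(header, ["50 10 10", "140"])
```

In this made-up record the last block ends at `tEnd` and `qEnd` of the header, which is the kind of consistency a loader can check when it reports malformed files.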

Source overlaps occur when the same source genome position maps to multiple target positions. Target overlaps occur when multiple source positions map to overlapping regions in the target genome. Both types of overlaps are handled according to the specified policies.
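The two overlap types can be illustrated with a small pure-Python check over chain rows. This is a sketch, not pymisha's detection code: the rows are plain dicts using the chain table's column names, the helper `find_overlaps` is hypothetical, and both sides are assumed to use half-open intervals.

```python
from itertools import combinations


def _overlap(chrom_a, start_a, end_a, chrom_b, start_b, end_b):
    """True if two half-open intervals on the same chromosome intersect."""
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a


def find_overlaps(rows):
    """Return (source_overlap_pairs, target_overlap_pairs) as row-index pairs."""
    src_pairs, tgt_pairs = [], []
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        if _overlap(a["chromsrc"], a["startsrc"], a["endsrc"],
                    b["chromsrc"], b["startsrc"], b["endsrc"]):
            src_pairs.append((i, j))  # same source position, several targets
        if _overlap(a["chrom"], a["start"], a["end"],
                    b["chrom"], b["start"], b["end"]):
            tgt_pairs.append((i, j))  # several sources land on the same target
    return src_pairs, tgt_pairs


# Made-up rows, for illustration only.
rows = [
    {"chromsrc": "chrA", "startsrc": 0, "endsrc": 100, "chrom": "chr1", "start": 0, "end": 100},
    {"chromsrc": "chrA", "startsrc": 50, "endsrc": 150, "chrom": "chr2", "start": 500, "end": 600},
    {"chromsrc": "chrB", "startsrc": 0, "endsrc": 100, "chrom": "chr1", "start": 80, "end": 180},
]
src_pairs, tgt_pairs = find_overlaps(rows)
```

Here rows 0 and 1 form a source overlap and rows 0 and 2 form a target overlap. The pairwise loop is quadratic and only meant to make the definitions concrete.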

PARAMETER	DESCRIPTION
`file`	Path to the UCSC chain file. The file must follow the standard UCSC chain format specification. Chains whose target chromosomes are not present in the current database are silently skipped. TYPE: `str`
`src_overlap_policy`	Policy for handling source-side overlaps. One of: `"error"` (default) -- raise an error if source overlaps are detected. `"keep"` -- allow one source interval to map to multiple target intervals. `"discard"` -- remove all chain intervals involved in source overlaps. TYPE: `str` DEFAULT: `'error'`
`tgt_overlap_policy`	Policy for handling target-side overlaps. One of: `"error"` -- raise an error if target overlaps are detected. `"auto"` (default) -- alias for `"auto_score"`. `"auto_score"` -- segment overlapping target regions and select the chain with the highest alignment score per segment. Tie-breakers: longest span, then lowest chain_id. `"auto_longer"` -- segment and select the chain with the longest span per segment. Tie-breakers: highest score, then lowest chain_id. `"auto_first"` -- segment and select the chain with the lowest chain_id per segment. `"keep"` -- preserve all overlapping intervals. `"discard"` -- remove all chain intervals involved in target overlaps. `"agg"` -- segment overlaps into disjoint sub-regions, retaining all contributing chains per region for downstream aggregation. `"best_source_cluster"` -- cluster chains by source overlap and keep the cluster with the largest total target length. `"best_cluster_union"` -- best cluster union strategy. `"best_cluster_sum"` -- best cluster sum strategy. `"best_cluster_max"` -- best cluster max strategy. TYPE: `str` DEFAULT: `'auto'`
`src_groot`	Path to the source genome database root for validating source chromosomes and coordinates. Not yet implemented. TYPE: `str` DEFAULT: `None`
`min_score`	Minimum alignment score threshold. Chains with scores below this value are filtered out before overlap resolution. TYPE: `float` DEFAULT: `None`
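The `"auto_score"` policy and the `min_score` filter described in the table above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: the hypothetical helper `resolve_auto_score` works on one target chromosome, treats "span" as the target length of each chain interval, and does not merge neighbouring segments won by the same chain.

```python
def resolve_auto_score(chains, min_score=None):
    """Segment overlapping target intervals and pick one chain per segment.

    chains: dicts with 'start', 'end', 'score', 'chain_id' on one chromosome.
    Returns (seg_start, seg_end, winning_chain_id) tuples.
    """
    if min_score is not None:
        # Filtering happens before overlap resolution.
        chains = [c for c in chains if c["score"] >= min_score]

    # Every interval boundary is a potential segment boundary.
    cuts = sorted({c["start"] for c in chains} | {c["end"] for c in chains})

    segments = []
    for seg_start, seg_end in zip(cuts, cuts[1:]):
        covering = [c for c in chains
                    if c["start"] <= seg_start and seg_end <= c["end"]]
        if not covering:
            continue  # gap between chains
        # Highest score, then longest span, then lowest chain_id.
        winner = min(
            covering,
            key=lambda c: (-c["score"], -(c["end"] - c["start"]), c["chain_id"]),
        )
        segments.append((seg_start, seg_end, winner["chain_id"]))
    return segments


# Made-up chains, for illustration only.
chains = [
    {"chain_id": 1, "start": 0, "end": 100, "score": 500},
    {"chain_id": 2, "start": 50, "end": 150, "score": 900},
    {"chain_id": 3, "start": 120, "end": 200, "score": 900},
]
segments = resolve_auto_score(chains)
filtered = resolve_auto_score(chains, min_score=600)
```

In the made-up data, chains 2 and 3 tie on score over 120-150, so the longer chain 2 wins that segment. Swapping the order of the first two key components would give the `"auto_longer"` ordering described in the table.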

RETURNS	DESCRIPTION
`DataFrame`	Assembly conversion table with the following columns:

chrom (str) -- target chromosome name (normalized).
start (int) -- target interval start (0-based, inclusive).
end (int) -- target interval end (0-based, exclusive).
strand (int) -- target strand (0 = +, 1 = -).
chromsrc (str) -- source chromosome name.
startsrc (int) -- source interval start.
endsrc (int) -- source interval end.
strandsrc (int) -- source strand.
chain_id (int) -- chain identifier from the chain file.
score (float) -- chain alignment score.

The overlap policies are stored in DataFrame.attrs as "src_overlap_policy" and "tgt_overlap_policy".

RAISES	DESCRIPTION
`ValueError`	If the chain file is malformed, contains inconsistent chromosome sizes, has coordinates out of range, or if overlap policies are invalid. Also raised when `src_overlap_policy="error"` and source overlaps are detected, or `tgt_overlap_policy="error"` and target overlaps are detected.

pymisha.gintervals_as_chain

gintervals_as_chain(intervals: DataFrame, src_overlap_policy: str = 'error', tgt_overlap_policy: str = 'auto', min_score: float | None = None) -> pd.DataFrame

Convert a DataFrame to chain format by validating columns and setting attributes.

Validates that the input DataFrame has all required chain columns and attaches overlap-policy metadata as DataFrame attributes. This is useful when you have manually constructed or modified chain data and need to mark it as a valid chain table for use with gintervals_liftover or gtrack_liftover.

PARAMETER	DESCRIPTION
`intervals`	A DataFrame that must contain all of the required chain columns: `chrom`, `start`, `end`, `strand`, `chromsrc`, `startsrc`, `endsrc`, `strandsrc`, `chain_id`, `score`. TYPE: `DataFrame`
`src_overlap_policy`	Policy for handling source-side overlaps. One of `"error"` (default), `"keep"`, or `"discard"`. This value is stored as a DataFrame attribute but does not trigger overlap resolution. TYPE: `str` DEFAULT: `'error'`
`tgt_overlap_policy`	Policy for handling target-side overlaps. One of `"error"`, `"auto"` (default, alias for `"auto_score"`), `"auto_score"`, `"auto_longer"`, `"auto_first"`, `"keep"`, `"discard"`, `"agg"`, `"best_source_cluster"`, `"best_cluster_union"`, `"best_cluster_sum"`, `"best_cluster_max"`. Stored as a DataFrame attribute. TYPE: `str` DEFAULT: `'auto'`
`min_score`	Minimum alignment score threshold to record as a DataFrame attribute. Does not filter the data; the value is stored for informational use by downstream functions. TYPE: `float` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame`	A copy of the input DataFrame with overlap-policy attributes set in `DataFrame.attrs`: `"src_overlap_policy"` -- the source overlap policy. `"tgt_overlap_policy"` -- the target overlap policy (`"auto"` is normalized to `"auto_score"`). `"min_score"` -- present only if min_score was provided.

RAISES	DESCRIPTION
`TypeError`	If intervals is not a `pandas.DataFrame`.
`ValueError`	If required columns are missing, or if either overlap policy string is not a recognized value.
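The validation and attribute rules described for gintervals_as_chain can be summarized in a few lines. The sketch below mirrors the documented behaviour on a plain list of column names rather than a DataFrame; the helper `chain_attrs` is hypothetical and is not part of pymisha.

```python
REQUIRED_CHAIN_COLUMNS = (
    "chrom", "start", "end", "strand",
    "chromsrc", "startsrc", "endsrc", "strandsrc",
    "chain_id", "score",
)
SRC_POLICIES = {"error", "keep", "discard"}
TGT_POLICIES = {
    "error", "auto", "auto_score", "auto_longer", "auto_first", "keep",
    "discard", "agg", "best_source_cluster", "best_cluster_union",
    "best_cluster_sum", "best_cluster_max",
}


def chain_attrs(columns, src_overlap_policy="error",
                tgt_overlap_policy="auto", min_score=None):
    """Validate column names and policies; return the attrs that would be set."""
    missing = [c for c in REQUIRED_CHAIN_COLUMNS if c not in columns]
    if missing:
        raise ValueError(f"missing chain columns: {missing}")
    if src_overlap_policy not in SRC_POLICIES:
        raise ValueError(f"unknown src_overlap_policy: {src_overlap_policy!r}")
    if tgt_overlap_policy not in TGT_POLICIES:
        raise ValueError(f"unknown tgt_overlap_policy: {tgt_overlap_policy!r}")

    attrs = {
        "src_overlap_policy": src_overlap_policy,
        # "auto" is an alias and is stored in its normalized form.
        "tgt_overlap_policy": "auto_score" if tgt_overlap_policy == "auto"
        else tgt_overlap_policy,
    }
    if min_score is not None:  # recorded only when provided
        attrs["min_score"] = min_score
    return attrs


attrs = chain_attrs(list(REQUIRED_CHAIN_COLUMNS), min_score=1000.0)
```

Note that, as documented, nothing here filters or resolves overlaps; the policies are metadata only.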

pymisha.gintervals_liftover

gintervals_liftover(intervals: DataFrame, chain: str | DataFrame, src_overlap_policy: str = 'error', tgt_overlap_policy: str = 'auto', min_score: float | None = None, include_metadata: bool = False, canonic: bool = False, value_col: str | None = None, multi_target_agg: str = 'mean', params: dict[str, Any] | int | None = None, na_rm: bool = True, min_n: int | None = None) -> pd.DataFrame

Convert intervals from another assembly to the current one using a chain.

Maps each source interval through the chain's alignment blocks to produce the corresponding target-genome coordinates. A single source interval may produce multiple target intervals when it spans chain gaps or maps through multiple chains. The intervalID column in the output links each result row back to the originating source interval (0-based positional index).
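The mapping step can be pictured with a minimal pure-Python sketch. It is not pymisha's algorithm: the hypothetical helper `lift_intervals` handles plus-strand blocks only, represents chain rows as dicts with the chain table's column names, and sorts results by chromosome name as a stand-in for the library's target ordering.

```python
def lift_intervals(intervals, blocks):
    """Map (chrom, start, end) source intervals through ungapped chain blocks.

    Returns (chrom, start, end, intervalID, chain_id) tuples, where intervalID
    is the 0-based position of the source interval in `intervals`.
    """
    lifted = []
    for interval_id, (chrom, start, end) in enumerate(intervals):
        for b in blocks:
            if b["chromsrc"] != chrom:
                continue
            # Clip the source interval to the part covered by this block.
            lo = max(start, b["startsrc"])
            hi = min(end, b["endsrc"])
            if lo >= hi:
                continue  # no overlap with this block
            # Within an ungapped block, source and target differ by a shift.
            shift = b["start"] - b["startsrc"]
            lifted.append((b["chrom"], lo + shift, hi + shift,
                           interval_id, b["chain_id"]))
    lifted.sort(key=lambda row: (row[0], row[1], row[2]))
    return lifted


# Made-up chain blocks with a 10 bp gap between them, for illustration only.
blocks = [
    {"chromsrc": "chrA", "startsrc": 150, "endsrc": 200,
     "chrom": "chr1", "start": 100, "end": 150, "chain_id": 7},
    {"chromsrc": "chrA", "startsrc": 210, "endsrc": 350,
     "chrom": "chr1", "start": 160, "end": 300, "chain_id": 7},
]
intervals = [("chrA", 180, 230), ("chrA", 0, 50)]
lifted = lift_intervals(intervals, blocks)
```

The first source interval spans the gap and therefore yields two target rows sharing intervalID 0; the second lies outside every block and produces no output.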

When chain is a file path, it is loaded with the specified overlap policies. When it is a pre-loaded DataFrame (from gintervals_load_chain or gintervals_as_chain), the policies stored in its attributes are used and the policy arguments here are ignored.

PARAMETER	DESCRIPTION
`intervals`	Source-assembly intervals. Must contain at least the columns `chrom`, `start`, and `end`. Chromosome names should match the source side of the chain (`chromsrc`). TYPE: `DataFrame`
`chain`	Either a path to a UCSC chain file (loaded via `gintervals_load_chain`) or a pre-loaded chain DataFrame. TYPE: `str or DataFrame`
`src_overlap_policy`	Source overlap policy, used only when chain is a file path. One of `"error"` (default), `"keep"`, or `"discard"`. TYPE: `str` DEFAULT: `'error'`
`tgt_overlap_policy`	Target overlap policy, used only when chain is a file path. One of `"error"`, `"auto"` (default), `"auto_score"`, `"auto_longer"`, `"auto_first"`, `"keep"`, `"discard"`, `"agg"`, `"best_source_cluster"`, `"best_cluster_union"`, `"best_cluster_sum"`, `"best_cluster_max"`. TYPE: `str` DEFAULT: `'auto'`
`min_score`	Minimum chain alignment score, used only when chain is a file path. Chains scoring below this threshold are excluded. TYPE: `float` DEFAULT: `None`
`include_metadata`	If `True`, a `score` column is added to the output containing the alignment score of the chain that produced each mapping. Default is `False`. TYPE: `bool` DEFAULT: `False`
`canonic`	If `True`, adjacent target intervals originating from the same source interval (same `intervalID`) and the same chain (same `chain_id`) are merged into a single interval. Useful when a source interval maps to multiple adjacent target blocks separated by chain alignment gaps. Default is `False`. TYPE: `bool` DEFAULT: `False`
`value_col`	Name of a numeric column in intervals whose values should be carried through the liftover. When specified, the output includes this column with its original name. Ignored if `None`. TYPE: `str` DEFAULT: `None`
`multi_target_agg`	Aggregation method applied to value_col when multiple source intervals map to the same target region. One of `"mean"` (default), `"median"`, `"sum"`, `"min"`, `"max"`, `"count"`, `"first"`, `"last"`. Ignored when value_col is `None`. TYPE: `str` DEFAULT: `'mean'`
`params`	Additional parameters for specific aggregation methods (e.g., `n` for `"nth"` aggregation). TYPE: `dict or int` DEFAULT: `None`
`na_rm`	If `True` (default), `NaN` values are removed before aggregation. If `False`, any `NaN` in the group causes the aggregated result to be `NaN`. Only used when value_col is specified. TYPE: `bool` DEFAULT: `True`
`min_n`	Minimum number of non-`NaN` values required for aggregation. If fewer values are available, the result is `NaN`. `None` (default) means no minimum. Only used when value_col is specified. TYPE: `int` DEFAULT: `None`
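The interaction of `multi_target_agg`, `na_rm`, and `min_n` described in the table above can be sketched with the standard library. The helper `aggregate` is hypothetical and only approximates the documented rules; in particular, returning `NaN` for a group with no usable values is an assumption of this sketch.

```python
import math
import statistics

AGGREGATORS = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "sum": math.fsum,
    "min": min,
    "max": max,
    "count": len,
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
}


def aggregate(values, how="mean", na_rm=True, min_n=None):
    """Aggregate the values that landed on one target region."""
    non_nan = [v for v in values if not math.isnan(v)]
    if not na_rm and len(non_nan) != len(values):
        return math.nan  # any NaN poisons the group when na_rm is False
    if min_n is not None and len(non_nan) < min_n:
        return math.nan  # too few usable values
    if not non_nan:
        return math.nan  # assumption: empty groups yield NaN
    return AGGREGATORS[how](non_nan)


values = [1.0, math.nan, 3.0]
mean_default = aggregate(values)                 # NaN dropped first
mean_strict = aggregate(values, na_rm=False)     # NaN propagates
mean_min_n = aggregate(values, min_n=3)          # only 2 usable values
```

With the defaults the `NaN` is dropped and the mean of the remaining two values is returned; the other two calls return `NaN` for the reasons given in the comments.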

RETURNS	DESCRIPTION
`DataFrame`	Lifted intervals sorted by target coordinates with the columns:

chrom (str) -- target chromosome.
start (int) -- target start (0-based, inclusive).
end (int) -- target end (0-based, exclusive).
intervalID (int) -- 0-based index of the source interval in the input intervals DataFrame.
chain_id (int) -- identifier of the chain that produced the mapping.
score (float) -- chain alignment score (only when include_metadata is True).
<value_col> (float) -- carried-through values, in a column that keeps the name given in value_col (only when value_col is specified).
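The effect of `canonic=True` on rows like these can be sketched as a single pass over the sorted output. This is an illustration, not the library's code: the helper `merge_canonic` is hypothetical, rows are plain tuples in the column order listed above, and "adjacent" is assumed to mean that one interval ends exactly where the next begins.

```python
def merge_canonic(rows):
    """Merge touching rows that share chrom, intervalID and chain_id.

    rows: (chrom, start, end, intervalID, chain_id) tuples sorted by target
    coordinates. Returns a new list; the input is not modified.
    """
    merged = []
    last_index = {}  # (chrom, intervalID, chain_id) -> position in `merged`
    for chrom, start, end, interval_id, chain_id in rows:
        key = (chrom, interval_id, chain_id)
        idx = last_index.get(key)
        if idx is not None and merged[idx][2] == start:
            # Extend the previous row for this key instead of adding a new one.
            merged[idx] = (chrom, merged[idx][1], end, interval_id, chain_id)
        else:
            last_index[key] = len(merged)
            merged.append((chrom, start, end, interval_id, chain_id))
    return merged


# Made-up liftover output, for illustration only.
rows = [
    ("chr1", 100, 150, 0, 7),
    ("chr1", 150, 200, 0, 7),   # touches the previous row, same IDs
    ("chr1", 200, 260, 1, 7),   # different source interval
    ("chr1", 300, 320, 0, 7),   # same IDs but not adjacent
]
merged = merge_canonic(rows)
```

Only the first two rows are merged; rows from a different source interval, or rows separated by a target-side gap, are left as they are.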

RAISES	DESCRIPTION
`ValueError`	If intervals or chain is `None`, or if a file-path chain cannot be loaded.

pymisha.gtrack_liftover

gtrack_liftover(track: str, description: str, src_track_dir: str, chain: str | DataFrame, src_overlap_policy: str = 'error', tgt_overlap_policy: str = 'auto', multi_target_agg: str = 'mean', params: dict[str, Any] | None = None, na_rm: bool = True, min_n: int | None = None, min_score: float | None = None, *, _force_pure_python: bool = False) -> None

Import a track from another assembly via coordinate liftover.

Dispatches to the C++ fast path by default. Falls back to the pure-Python implementation when _force_pure_python=True or when the environment variable PYMISHA_FORCE_PY_LIFTOVER_TRACK is set to 1.
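The dispatch rule can be expressed as a small predicate. The sketch below is an assumption about how such a check might look, not the library's code; the helper `use_pure_python` is hypothetical, and only the keyword argument and the environment variable name come from the description above.

```python
import os


def use_pure_python(force_pure_python=False, environ=None):
    """Return True when the pure-Python path should be taken."""
    env = os.environ if environ is None else environ
    return bool(force_pure_python) or env.get("PYMISHA_FORCE_PY_LIFTOVER_TRACK") == "1"


default_choice = use_pure_python(environ={})
forced_by_arg = use_pure_python(True, environ={})
forced_by_env = use_pure_python(environ={"PYMISHA_FORCE_PY_LIFTOVER_TRACK": "1"})
```

Passing a mapping through the `environ` argument keeps the check easy to exercise without modifying the real process environment.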

See _gtrack_liftover_python for the full parameter documentation.