Liftover

Functions for converting genomic coordinates between assemblies using chain files.

pymisha.gintervals_load_chain

gintervals_load_chain(file, src_overlap_policy='error', tgt_overlap_policy='auto', src_groot=None, min_score=None)

Load an assembly conversion table from a UCSC chain file.

Reads a UCSC-format chain file and returns an assembly conversion table (DataFrame) that maps coordinates between a source genome and the current target genome. The resulting table can be used with gintervals_liftover and gtrack_liftover to convert intervals or tracks from the source assembly to the current one.

Source overlaps occur when the same source genome position maps to multiple target positions. Target overlaps occur when multiple source positions map to overlapping regions in the target genome. Both types of overlaps are handled according to the specified policies.
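
The two kinds of overlap can be illustrated with a small standalone sketch. The helper `overlapping_rows` is hypothetical and is not part of pymisha; it only flags rows whose half-open intervals intersect another row on the same chromosome, which is the condition the overlap policies react to. Running it once on the source columns and once on the target columns shows the distinction.

```python
from collections import defaultdict


def overlapping_rows(intervals):
    """Return sorted row indices whose interval overlaps another one.

    `intervals` is a list of (chrom, start, end) tuples with half-open
    coordinates [start, end), so touching intervals do not overlap.
    Hypothetical helper for illustration only.
    """
    by_chrom = defaultdict(list)
    for idx, (chrom, start, end) in enumerate(intervals):
        by_chrom[chrom].append((start, end, idx))

    flagged = set()
    for rows in by_chrom.values():
        rows.sort()
        # Track the interval that reaches furthest right among those seen.
        max_end, max_idx = rows[0][1], rows[0][2]
        for start, end, idx in rows[1:]:
            if start < max_end:  # intersects an earlier interval
                flagged.update((idx, max_idx))
            if end > max_end:
                max_end, max_idx = end, idx
    return sorted(flagged)


# Toy chain blocks: source side (chromsrc, startsrc, endsrc)
# and target side (chrom, start, end) of the same three rows.
source_side = [("chr25", 0, 100), ("chr25", 50, 150), ("chr25", 200, 300)]
target_side = [("1", 1000, 1100), ("1", 5000, 5100), ("1", 1050, 1150)]

src_overlaps = overlapping_rows(source_side)  # rows 0 and 1 share chr25:50-100
tgt_overlaps = overlapping_rows(target_side)  # rows 0 and 2 share 1:1050-1100
print(src_overlaps, tgt_overlaps)
```

In this toy table, rows 0 and 1 form a source overlap and rows 0 and 2 form a target overlap, so a row can be involved in both kinds at once.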

PARAMETER DESCRIPTION
file

Path to the UCSC chain file. The file must follow the standard UCSC chain format specification. Chains whose target chromosomes are not present in the current database are silently skipped.

TYPE: str

src_overlap_policy

Policy for handling source-side overlaps. One of:

  • "error" (default) -- raise an error if source overlaps are detected.
  • "keep" -- allow one source interval to map to multiple target intervals.
  • "discard" -- remove all chain intervals involved in source overlaps.

TYPE: str DEFAULT: 'error'

tgt_overlap_policy

Policy for handling target-side overlaps. One of:

  • "error" -- raise an error if target overlaps are detected.
  • "auto" (default) -- alias for "auto_score".
  • "auto_score" -- segment overlapping target regions and select the chain with the highest alignment score per segment. Tie-breakers: longest span, then lowest chain_id.
  • "auto_longer" -- segment and select the chain with the longest span per segment. Tie-breakers: highest score, then lowest chain_id.
  • "auto_first" -- segment and select the chain with the lowest chain_id per segment.
  • "keep" -- preserve all overlapping intervals.
  • "discard" -- remove all chain intervals involved in target overlaps.
  • "agg" -- segment overlaps into disjoint sub-regions, retaining all contributing chains per region for downstream aggregation.
  • "best_source_cluster" -- cluster chains by source overlap and keep the cluster with the largest total target length.
  • "best_cluster_union" -- best cluster union strategy.
  • "best_cluster_sum" -- best cluster sum strategy.
  • "best_cluster_max" -- best cluster max strategy.

TYPE: str DEFAULT: 'auto'
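
The tie-breaking order of the three "auto_*" policies can be written as sort keys. This is a minimal sketch of the documented ordering only, not the library's implementation; `pick_chain` and the `span` field are assumptions made for illustration, and the segmentation step that produces the candidates is not shown.

```python
def pick_chain(candidates, policy="auto_score"):
    """Pick one chain among those covering the same target segment.

    Each candidate is a dict with "chain_id", "score" and "span".
    "span" (assumed here to be a precomputed chain length) and this
    helper are hypothetical; only the ordering follows the documentation.
    """
    if policy == "auto":  # documented alias
        policy = "auto_score"
    keys = {
        # highest score, then longest span, then lowest chain_id
        "auto_score": lambda c: (-c["score"], -c["span"], c["chain_id"]),
        # longest span, then highest score, then lowest chain_id
        "auto_longer": lambda c: (-c["span"], -c["score"], c["chain_id"]),
        # lowest chain_id
        "auto_first": lambda c: c["chain_id"],
    }
    if policy not in keys:
        raise ValueError(f"not an auto_* policy: {policy!r}")
    return min(candidates, key=keys[policy])


candidates = [
    {"chain_id": 3, "score": 900.0, "span": 5000},
    {"chain_id": 1, "score": 900.0, "span": 4000},
    {"chain_id": 2, "score": 500.0, "span": 9000},
]
for policy in ("auto_score", "auto_longer", "auto_first"):
    print(policy, pick_chain(candidates, policy)["chain_id"])
```

With these candidates, chains 3 and 1 tie on score, so "auto_score" falls through to the span tie-breaker, while the other two policies each select a different chain.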

src_groot

Path to the source genome database root for validating source chromosomes and coordinates. Not yet implemented.

TYPE: str DEFAULT: None

min_score

Minimum alignment score threshold. Chains with scores below this value are filtered out before overlap resolution.

TYPE: float DEFAULT: None

RETURNS DESCRIPTION
DataFrame

Assembly conversion table with the following columns:

  • chrom (str) -- target chromosome name (normalized).
  • start (int) -- target interval start (0-based, inclusive).
  • end (int) -- target interval end (0-based, exclusive).
  • strand (int) -- target strand (0 = +, 1 = -).
  • chromsrc (str) -- source chromosome name.
  • startsrc (int) -- source interval start.
  • endsrc (int) -- source interval end.
  • strandsrc (int) -- source strand.
  • chain_id (int) -- chain identifier from the chain file.
  • score (float) -- chain alignment score.

The overlap policies are stored in DataFrame.attrs as "src_overlap_policy" and "tgt_overlap_policy".

RAISES DESCRIPTION
ValueError

If the chain file is malformed, contains inconsistent chromosome sizes, has coordinates out of range, or if overlap policies are invalid. Also raised when src_overlap_policy="error" and source overlaps are detected, or tgt_overlap_policy="error" and target overlaps are detected.

See Also

gintervals_as_chain : Convert an existing DataFrame to chain format.
gintervals_liftover : Lift intervals from one assembly to another.
gtrack_liftover : Import a track from another assembly via liftover.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> import os
>>> chainfile = os.path.join(pm._GROOT, "data", "test.chain")
>>> chain = pm.gintervals_load_chain(
...     chainfile, src_overlap_policy="keep"
... )
>>> list(chain.columns)
['chrom', 'start', 'end', 'strand', 'chromsrc', 'startsrc', 'endsrc', 'strandsrc', 'chain_id', 'score']

pymisha.gintervals_as_chain

gintervals_as_chain(intervals, src_overlap_policy='error', tgt_overlap_policy='auto', min_score=None)

Convert a DataFrame to chain format by validating columns and setting attributes.

Validates that the input DataFrame has all required chain columns and attaches overlap-policy metadata as DataFrame attributes. This is useful when you have manually constructed or modified chain data and need to mark it as a valid chain table for use with gintervals_liftover or gtrack_liftover.

PARAMETER DESCRIPTION
intervals

A DataFrame that must contain all of the required chain columns: chrom, start, end, strand, chromsrc, startsrc, endsrc, strandsrc, chain_id, score.

TYPE: DataFrame

src_overlap_policy

Policy for handling source-side overlaps. One of "error" (default), "keep", or "discard". This value is stored as a DataFrame attribute but does not trigger overlap resolution.

TYPE: str DEFAULT: 'error'

tgt_overlap_policy

Policy for handling target-side overlaps. One of "error", "auto" (default, alias for "auto_score"), "auto_score", "auto_longer", "auto_first", "keep", "discard", "agg", "best_source_cluster", "best_cluster_union", "best_cluster_sum", "best_cluster_max". Stored as a DataFrame attribute.

TYPE: str DEFAULT: 'auto'

min_score

Minimum alignment score threshold to record as a DataFrame attribute. Does not filter the data; the value is stored for informational use by downstream functions.

TYPE: float DEFAULT: None

RETURNS DESCRIPTION
DataFrame

A copy of the input DataFrame with overlap-policy attributes set in DataFrame.attrs:

  • "src_overlap_policy" -- the source overlap policy.
  • "tgt_overlap_policy" -- the target overlap policy ("auto" is normalized to "auto_score").
  • "min_score" -- present only if min_score was provided.

RAISES DESCRIPTION
TypeError

If intervals is not a pandas.DataFrame.

ValueError

If required columns are missing, or if either overlap policy string is not a recognized value.
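
The checks described above can be sketched without pandas. The function `check_chain_table` is hypothetical and works on a plain list of column names; it mirrors the documented behavior (required columns, recognized policy strings, "auto" normalized to "auto_score") and is not the library's code.

```python
REQUIRED_CHAIN_COLUMNS = (
    "chrom", "start", "end", "strand",
    "chromsrc", "startsrc", "endsrc", "strandsrc",
    "chain_id", "score",
)
SRC_POLICIES = {"error", "keep", "discard"}
TGT_POLICIES = {
    "error", "auto", "auto_score", "auto_longer", "auto_first",
    "keep", "discard", "agg", "best_source_cluster",
    "best_cluster_union", "best_cluster_sum", "best_cluster_max",
}


def check_chain_table(columns, src_overlap_policy="error",
                      tgt_overlap_policy="auto"):
    """Validate column names and policies; return the attrs to attach.

    Hypothetical stand-in for the validation step, for illustration only.
    """
    missing = [c for c in REQUIRED_CHAIN_COLUMNS if c not in columns]
    if missing:
        raise ValueError(f"missing chain columns: {missing}")
    if src_overlap_policy not in SRC_POLICIES:
        raise ValueError(f"bad src_overlap_policy: {src_overlap_policy!r}")
    if tgt_overlap_policy not in TGT_POLICIES:
        raise ValueError(f"bad tgt_overlap_policy: {tgt_overlap_policy!r}")
    if tgt_overlap_policy == "auto":
        tgt_overlap_policy = "auto_score"  # documented normalization
    return {
        "src_overlap_policy": src_overlap_policy,
        "tgt_overlap_policy": tgt_overlap_policy,
    }


attrs = check_chain_table(list(REQUIRED_CHAIN_COLUMNS))
print(attrs)
```

Running the same kind of check on a hand-built table before calling gintervals_as_chain gives an early, readable error when a column has been dropped or misspelled.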

See Also

gintervals_load_chain : Load a chain from a UCSC chain file.
gintervals_liftover : Lift intervals from one assembly to another.
gtrack_liftover : Import a track from another assembly via liftover.

Examples:

>>> import pandas as pd
>>> import pymisha as pm
>>> chain_data = pd.DataFrame({
...     "chrom": ["1"], "start": [1000], "end": [2000], "strand": [0],
...     "chromsrc": ["chr25"], "startsrc": [5000], "endsrc": [6000],
...     "strandsrc": [0], "chain_id": [1], "score": [1000.0],
... })
>>> chain = pm.gintervals_as_chain(chain_data)
>>> chain.attrs["tgt_overlap_policy"]
'auto_score'

pymisha.gintervals_liftover

gintervals_liftover(intervals, chain, src_overlap_policy='error', tgt_overlap_policy='auto', min_score=None, include_metadata=False, canonic=False, value_col=None, multi_target_agg='mean', params=None, na_rm=True, min_n=None)

Convert intervals from another assembly to the current one using a chain.

Maps each source interval through the chain's alignment blocks to produce the corresponding target-genome coordinates. A single source interval may produce multiple target intervals when it spans chain gaps or maps through multiple chains. The intervalID column in the output links each result row back to the originating source interval (0-based positional index).
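
The mapping step can be sketched in plain Python under simplifying assumptions: every block is ungapped (equal length on both sides), both strands are "+", and no overlap policy is applied. The function `lift_intervals` is hypothetical and only illustrates how one source interval can yield several target rows that share an intervalID.

```python
def lift_intervals(intervals, blocks):
    """Map source intervals through ungapped, plus-strand chain blocks.

    `intervals` is a list of (chrom, start, end) in source coordinates.
    `blocks` is a list of dicts using the chain table column names.
    Hypothetical sketch; strands and overlap policies are ignored.
    """
    lifted = []
    for interval_id, (chrom, start, end) in enumerate(intervals):
        for block in blocks:
            if block["chromsrc"] != chrom:
                continue
            # Clip the interval to the part covered by this block.
            lo = max(start, block["startsrc"])
            hi = min(end, block["endsrc"])
            if lo >= hi:
                continue  # no intersection: falls in a gap or elsewhere
            offset = block["start"] - block["startsrc"]
            lifted.append({
                "chrom": block["chrom"],
                "start": lo + offset,
                "end": hi + offset,
                "intervalID": interval_id,
                "chain_id": block["chain_id"],
            })
    lifted.sort(key=lambda r: (r["chrom"], r["start"], r["end"]))
    return lifted


blocks = [
    {"chromsrc": "chr25", "startsrc": 0, "endsrc": 1000,
     "chrom": "1", "start": 5000, "end": 6000, "chain_id": 1},
    {"chromsrc": "chr25", "startsrc": 2000, "endsrc": 3000,
     "chrom": "1", "start": 6000, "end": 7000, "chain_id": 1},
]
# Interval 0 spans the gap between the blocks; interval 1 lies inside it.
lifted = lift_intervals([("chr25", 500, 2500), ("chr25", 1200, 1800)], blocks)
for row in lifted:
    print(row)
```

Interval 0 produces two target rows, one per block, and interval 1 produces none because it lies entirely in the unaligned gap.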

When chain is a file path, it is loaded with the specified overlap policies. When it is a pre-loaded DataFrame (from gintervals_load_chain or gintervals_as_chain), the policies stored in its attributes are used and the policy arguments here are ignored.
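
The precedence rule above is easy to overlook, so here is a small sketch of it. The function `effective_policies` is hypothetical, and a `SimpleNamespace` with an `attrs` dict stands in for a pre-loaded chain DataFrame so the example runs without pandas or a genome database.

```python
from types import SimpleNamespace


def effective_policies(chain, src_overlap_policy="error",
                       tgt_overlap_policy="auto"):
    """Return the (source, target) policies that would apply.

    A file path uses the arguments; a pre-loaded table uses the policies
    stored in its attrs and ignores the arguments. Hypothetical helper.
    """
    if isinstance(chain, str):
        return src_overlap_policy, tgt_overlap_policy
    return (chain.attrs["src_overlap_policy"],
            chain.attrs["tgt_overlap_policy"])


# Stand-in for a chain table that was loaded with src_overlap_policy="keep".
preloaded = SimpleNamespace(attrs={
    "src_overlap_policy": "keep",
    "tgt_overlap_policy": "auto_score",
})

from_path = effective_policies("test.chain", src_overlap_policy="discard")
from_table = effective_policies(preloaded, src_overlap_policy="discard")
print(from_path)
print(from_table)
```

In the second call the "discard" argument has no effect, so a different policy requires reloading the chain or re-tagging it with gintervals_as_chain.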

PARAMETER DESCRIPTION
intervals

Source-assembly intervals. Must contain at least the columns chrom, start, and end. Chromosome names should match the source side of the chain (chromsrc).

TYPE: DataFrame

chain

Either a path to a UCSC chain file (loaded via gintervals_load_chain) or a pre-loaded chain DataFrame.

TYPE: str or DataFrame

src_overlap_policy

Source overlap policy, used only when chain is a file path. One of "error" (default), "keep", or "discard".

TYPE: str DEFAULT: 'error'

tgt_overlap_policy

Target overlap policy, used only when chain is a file path. One of "error", "auto" (default), "auto_score", "auto_longer", "auto_first", "keep", "discard", "agg", "best_source_cluster", "best_cluster_union", "best_cluster_sum", "best_cluster_max".

TYPE: str DEFAULT: 'auto'

min_score

Minimum chain alignment score, used only when chain is a file path. Chains scoring below this threshold are excluded.

TYPE: float DEFAULT: None

include_metadata

If True, a score column is added to the output containing the alignment score of the chain that produced each mapping. Default is False.

TYPE: bool DEFAULT: False

canonic

If True, adjacent target intervals originating from the same source interval (same intervalID) and the same chain (same chain_id) are merged into a single interval. Useful when a source interval maps to multiple adjacent target blocks separated by chain alignment gaps. Default is False.

TYPE: bool DEFAULT: False
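
A minimal sketch of the merge that canonic=True describes, assuming "adjacent" means that one target interval ends exactly where the next begins. The function `merge_canonic` is hypothetical and the exact adjacency rule used by the library may differ.

```python
def merge_canonic(rows):
    """Merge touching target intervals with equal intervalID and chain_id.

    `rows` is a list of dicts with chrom, start, end, intervalID, chain_id.
    Assumes "adjacent" means end == next start. Hypothetical sketch.
    """
    group_keys = ("intervalID", "chain_id", "chrom")
    ordered = sorted(
        rows,
        key=lambda r: (r["intervalID"], r["chain_id"], r["chrom"], r["start"]),
    )
    merged = []
    for row in ordered:
        last = merged[-1] if merged else None
        same_group = last is not None and all(
            last[k] == row[k] for k in group_keys
        )
        if same_group and last["end"] == row["start"]:
            last["end"] = row["end"]  # extend the previous interval
        else:
            merged.append(dict(row))  # copy so the input is not modified
    merged.sort(key=lambda r: (r["chrom"], r["start"], r["end"]))
    return merged


rows = [
    {"chrom": "1", "start": 5500, "end": 6000, "intervalID": 0, "chain_id": 1},
    {"chrom": "1", "start": 6000, "end": 6500, "intervalID": 0, "chain_id": 1},
    {"chrom": "1", "start": 6500, "end": 7000, "intervalID": 1, "chain_id": 1},
]
merged = merge_canonic(rows)
for row in merged:
    print(row)
```

The first two rows collapse into one interval, while the third stays separate because it comes from a different source interval even though it touches the merged one.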

value_col

Name of a numeric column in intervals whose values should be carried through the liftover. When specified, the output includes this column with its original name. Ignored if None.

TYPE: str DEFAULT: None

multi_target_agg

Aggregation method applied to value_col when multiple source intervals map to the same target region. One of "mean" (default), "median", "sum", "min", "max", "count", "first", "last". Ignored when value_col is None.

TYPE: str DEFAULT: 'mean'

params

Additional parameters for specific aggregation methods (e.g., n for "nth" aggregation).

TYPE: dict or int DEFAULT: None

na_rm

If True (default), NaN values are removed before aggregation. If False, any NaN in the group causes the aggregated result to be NaN. Only used when value_col is specified.

TYPE: bool DEFAULT: True

min_n

Minimum number of non-NaN values required for aggregation. If fewer values are available, the result is NaN. None (default) means no minimum. Only used when value_col is specified.

TYPE: int DEFAULT: None
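
How multi_target_agg, na_rm and min_n interact can be sketched for a single group of values that land on the same target region. The function `aggregate` is hypothetical; it follows the descriptions above, and the order in which the library applies the na_rm and min_n checks is an assumption.

```python
import math
import statistics


def aggregate(values, how="mean", na_rm=True, min_n=None):
    """Aggregate one group of values landing on the same target region.

    Hypothetical sketch of the documented na_rm / min_n semantics.
    """
    funcs = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "sum": math.fsum,
        "min": min,
        "max": max,
        "count": len,
        "first": lambda v: v[0],
        "last": lambda v: v[-1],
    }
    if how not in funcs:
        raise ValueError(f"unsupported aggregation: {how!r}")
    present = [v for v in values if not math.isnan(v)]
    if not na_rm and len(present) != len(values):
        return math.nan  # any NaN poisons the group when na_rm is False
    if min_n is not None and len(present) < min_n:
        return math.nan  # too few non-NaN values
    if not present and how != "count":
        return math.nan  # nothing left to aggregate
    return float(funcs[how](present))


group = [1.0, math.nan, 3.0]
print(aggregate(group))               # NaN dropped, mean of 1.0 and 3.0
print(aggregate(group, na_rm=False))  # NaN propagates
print(aggregate(group, min_n=3))      # only two non-NaN values
```

Setting min_n is a way to avoid reporting a value for target regions supported by only one or two source measurements.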

RETURNS DESCRIPTION
DataFrame

Lifted intervals sorted by target coordinates with the columns:

  • chrom (str) -- target chromosome.
  • start (int) -- target start (0-based, inclusive).
  • end (int) -- target end (0-based, exclusive).
  • intervalID (int) -- 0-based index of the source interval in the input intervals DataFrame.
  • chain_id (int) -- identifier of the chain that produced the mapping.
  • score (float) -- chain alignment score (only when include_metadata is True).
  • <value_col> (float) -- carried-through values, stored under the column's original name (only when value_col is specified).

RAISES DESCRIPTION
ValueError

If intervals or chain is None, or if a file-path chain cannot be loaded.

See Also

gintervals_load_chain : Load a chain from a UCSC chain file.
gintervals_as_chain : Convert a DataFrame to chain format.
gtrack_liftover : Import a full track from another assembly.

Examples:

>>> import pandas as pd
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> import os
>>> chainfile = os.path.join(pm._GROOT, "data", "test.chain")
>>> intervs = pd.DataFrame({
...     "chrom": ["chr25", "chr25"],
...     "start": [0, 7000],
...     "end": [6000, 20000],
... })
>>> lifted = pm.gintervals_liftover(
...     intervs, chainfile, src_overlap_policy="keep"
... )
>>> list(lifted.columns)
['chrom', 'start', 'end', 'intervalID', 'chain_id']

pymisha.gtrack_liftover

gtrack_liftover(track, description, src_track_dir, chain, src_overlap_policy='error', tgt_overlap_policy='auto', multi_target_agg='mean', params=None, na_rm=True, min_n=None, min_score=None)

Import a track from another assembly via coordinate liftover.

Reads a source track from src_track_dir (a directory containing per-chromosome binary track files or an indexed track.idx/track.dat pair), maps its intervals through chain to the current target genome, aggregates values when multiple source intervals land on the same target region, and creates a new sparse track in the current database.

When chain is a file path it is loaded with the specified overlap policies. When it is a pre-loaded DataFrame the policies stored in its attributes are used and the policy arguments here are ignored.

PARAMETER DESCRIPTION
track

Name of the new track to create in the current database. The track must not already exist.

TYPE: str

description

Human-readable description stored as a track attribute.

TYPE: str

src_track_dir

Path to the source track directory. The directory may contain per-chromosome binary files (dense or sparse) or an indexed pair of track.idx and track.dat files.

TYPE: str

chain

Either a path to a UCSC chain file or a pre-loaded chain DataFrame as returned by gintervals_load_chain.

TYPE: str or DataFrame

src_overlap_policy

Source overlap policy, used only when chain is a file path. One of "error" (default), "keep", or "discard".

TYPE: str DEFAULT: 'error'

tgt_overlap_policy

Target overlap policy, used only when chain is a file path. One of "error", "auto" (default), "auto_score", "auto_longer", "auto_first", "keep", "discard", "agg", "best_source_cluster", "best_cluster_union", "best_cluster_sum", "best_cluster_max".

TYPE: str DEFAULT: 'auto'

multi_target_agg

Aggregation function applied when multiple source values map to the same target locus. One of "mean" (default), "median", "sum", "min", "max", "count", "first", "last".

TYPE: str DEFAULT: 'mean'

params

Extra parameters for specific aggregation methods (e.g., n for "nth" aggregation).

TYPE: dict DEFAULT: None

na_rm

If True (default), NaN values are removed before aggregation. If False, any NaN in the group causes the aggregated result to be NaN.

TYPE: bool DEFAULT: True

min_n

Minimum number of non-NaN values required for aggregation. If fewer values are available the result is NaN. None (default) means no minimum.

TYPE: int DEFAULT: None

min_score

Minimum chain alignment score. Chains scoring below this value are excluded during loading. Only used when chain is a file path.

TYPE: float DEFAULT: None

RETURNS DESCRIPTION
None

The function creates a new sparse track in the current database as a side effect and does not return a value.

RAISES DESCRIPTION
ValueError

If track already exists, if src_track_dir does not exist, if the aggregation function is unsupported, or if the chain file is invalid.

TypeError

If chain is neither a file path string nor a pandas.DataFrame.

See Also

gintervals_load_chain : Load a chain from a UCSC chain file.
gintervals_liftover : Lift intervals (without creating a track).
gtrack_create_sparse : Create a sparse track from intervals and values.

Notes

UCSC chain format terminology is the reverse of the misha convention. The UCSC "target" fields (tName, tStart, tEnd) correspond to the misha "source" columns (chromsrc, startsrc, endsrc), and the UCSC "query" fields (qName, qStart, qEnd) correspond to the misha "target" columns (chrom, start, end).
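
The renaming can be made concrete by parsing one UCSC chain header line. The function `parse_chain_header` is hypothetical and not part of pymisha. It uses the 0/1 strand encoding listed for the strand column and assumes the same encoding for strandsrc. It does not convert minus-strand query coordinates, which UCSC reports relative to the reverse-complemented sequence, and it skips the normalization and validation a real loader performs.

```python
def parse_chain_header(line):
    """Map a UCSC chain header line onto misha-style column names.

    UCSC header layout:
    chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
    Hypothetical sketch; strand encoding of strandsrc is an assumption.
    """
    fields = line.split()
    if len(fields) != 13 or fields[0] != "chain":
        raise ValueError(f"not a chain header line: {line!r}")
    (_, score, t_name, _t_size, t_strand, t_start, t_end,
     q_name, _q_size, q_strand, q_start, q_end, chain_id) = fields
    strand_code = {"+": 0, "-": 1}
    return {
        # UCSC "query" -> misha "target"
        "chrom": q_name,
        "start": int(q_start),
        "end": int(q_end),
        "strand": strand_code[q_strand],
        # UCSC "target" -> misha "source"
        "chromsrc": t_name,
        "startsrc": int(t_start),
        "endsrc": int(t_end),
        "strandsrc": strand_code[t_strand],
        "chain_id": int(chain_id),
        "score": float(score),
    }


header = "chain 1000 chr25 50000 + 5000 6000 chr1 248956422 + 1000 2000 7"
row = parse_chain_header(header)
print(row["chromsrc"], row["startsrc"], "->", row["chrom"], row["start"])
```

The t* fields end up in the *src columns, which is why intervals passed to the liftover functions must use the chromosome names that appear as tName in the chain file.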

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> import os
>>> chainfile = os.path.join(pm._GROOT, "data", "test.chain")
>>> pm.gtrack_liftover(
...     "lifted_track", "Track lifted from other assembly",
...     "/path/to/source/tracks/my_track.track", chainfile,
... )