Liftover
Functions for converting genomic coordinates between assemblies using chain files.
pymisha.gintervals_load_chain
gintervals_load_chain(file: str, src_overlap_policy: str = 'error', tgt_overlap_policy: str = 'auto', src_groot: str | None = None, min_score: float | None = None) -> pd.DataFrame
Load an assembly conversion table from a UCSC chain file.
Reads a UCSC-format chain file and returns an assembly conversion table
(DataFrame) that maps coordinates between a source genome and the current
target genome. The resulting table can be used with
gintervals_liftover and gtrack_liftover to convert intervals or
tracks from the source assembly to the current one.
Source overlaps occur when the same source genome position maps to multiple target positions. Target overlaps occur when multiple source positions map to overlapping regions in the target genome. Both types of overlaps are handled according to the specified policies.
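The two kinds of overlap can be illustrated without pymisha. The following is a minimal sketch, assuming half-open intervals and a simplified block layout (`chrom, start, end, chromsrc, startsrc, endsrc`); the helper names `overlaps` and `find_overlaps` are hypothetical and are not part of the library.

```python
from itertools import combinations

# Simplified alignment blocks (an assumption for illustration):
# (chrom, start, end, chromsrc, startsrc, endsrc), half-open coordinates.
blocks = [
    ("chr1", 100, 200, "chrA", 0, 100),
    ("chr1", 500, 600, "chrA", 50, 150),  # source range overlaps block 0
    ("chr1", 150, 250, "chrB", 0, 100),   # target range overlaps block 0
]


def overlaps(chrom_a, start_a, end_a, chrom_b, start_b, end_b):
    """True when two half-open intervals on the same chromosome intersect."""
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a


def find_overlaps(blocks):
    """Return (source_overlap_pairs, target_overlap_pairs) as index pairs."""
    src_pairs, tgt_pairs = [], []
    for (i, a), (j, b) in combinations(enumerate(blocks), 2):
        # Source overlap: the same source positions appear in two blocks.
        if overlaps(a[3], a[4], a[5], b[3], b[4], b[5]):
            src_pairs.append((i, j))
        # Target overlap: two blocks land on intersecting target regions.
        if overlaps(a[0], a[1], a[2], b[0], b[1], b[2]):
            tgt_pairs.append((i, j))
    return src_pairs, tgt_pairs


src_pairs, tgt_pairs = find_overlaps(blocks)
print("source overlaps:", src_pairs)
print("target overlaps:", tgt_pairs)
```

The pairwise scan is quadratic and is meant only to make the definitions concrete; how the library itself detects overlaps is not shown in this section.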
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| file | str | Path to the UCSC chain file. The file must follow the standard UCSC chain format specification. Chains whose target chromosomes are not present in the current database are silently skipped. |
| src_overlap_policy | str | Policy for handling source-side overlaps. One of: |
| tgt_overlap_policy | str | Policy for handling target-side overlaps. One of: |
| src_groot | str \| None | Path to the source genome database root for validating source chromosomes and coordinates. Not yet implemented. |
| min_score | float \| None | Minimum alignment score threshold. Chains with scores below this value are filtered out before overlap resolution. |
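| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Assembly conversion table with the following columns: chrom, start, end, strand, chromsrc, startsrc, endsrc, strandsrc, chain_id, score. The overlap policies are stored in the DataFrame's attrs. |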
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the chain file is malformed, contains inconsistent chromosome sizes, has coordinates out of range, or if overlap policies are invalid. Also raised when |
See Also
gintervals_as_chain : Convert an existing DataFrame to chain format.
gintervals_liftover : Lift intervals from one assembly to another.
gtrack_liftover : Import a track from another assembly via liftover.
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> import os
>>> chainfile = os.path.join(pm._GROOT, "data", "test.chain")
>>> chain = pm.gintervals_load_chain(
... chainfile, src_overlap_policy="keep"
... )
>>> list(chain.columns)
['chrom', 'start', 'end', 'strand', 'chromsrc', 'startsrc', 'endsrc', 'strandsrc', 'chain_id', 'score']
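Because min_score is applied before overlap resolution, a low-scoring chain can be dropped before it ever triggers an overlap policy. The following is a minimal sketch of that ordering in plain Python, assuming a simplified chain record with only a score and a source range; `filter_by_score` and `has_source_overlap` are hypothetical helpers, not pymisha functions.

```python
# Simplified chain records (an assumption for illustration), half-open ranges.
chains = [
    {"chain_id": 1, "score": 5000.0, "chromsrc": "chrA", "startsrc": 0, "endsrc": 100},
    {"chain_id": 2, "score": 300.0, "chromsrc": "chrA", "startsrc": 50, "endsrc": 150},
]


def filter_by_score(chains, min_score=None):
    """Keep chains whose score is not below min_score (None disables filtering)."""
    if min_score is None:
        return list(chains)
    return [c for c in chains if c["score"] >= min_score]


def has_source_overlap(chains):
    """True when any two chains cover intersecting source ranges."""
    ordered = sorted(chains, key=lambda c: (c["chromsrc"], c["startsrc"]))
    # After sorting, any overlap implies an overlap between neighbours.
    return any(
        a["chromsrc"] == b["chromsrc"] and b["startsrc"] < a["endsrc"]
        for a, b in zip(ordered, ordered[1:])
    )


unfiltered_overlap = has_source_overlap(filter_by_score(chains))
filtered_overlap = has_source_overlap(filter_by_score(chains, min_score=1000.0))
print(unfiltered_overlap, filtered_overlap)
```

In this toy data the overlap exists only while the low-scoring chain is present, which is why filtering first can change whether an overlap policy is exercised at all.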
pymisha.gintervals_as_chain
gintervals_as_chain(intervals: DataFrame, src_overlap_policy: str = 'error', tgt_overlap_policy: str = 'auto', min_score: float | None = None) -> pd.DataFrame
Convert a DataFrame to chain format by validating columns and setting attributes.
Validates that the input DataFrame has all required chain columns and
attaches overlap-policy metadata as DataFrame attributes. This is useful
when you have manually constructed or modified chain data and need to
mark it as a valid chain table for use with gintervals_liftover or
gtrack_liftover.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| intervals | DataFrame | A DataFrame that must contain all of the required chain columns: |
| src_overlap_policy | str | Policy for handling source-side overlaps. One of |
| tgt_overlap_policy | str | Policy for handling target-side overlaps. One of |
| min_score | float \| None | Minimum alignment score threshold to record as a DataFrame attribute. Does not filter the data; the value is stored for informational use by downstream functions. |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | A copy of the input DataFrame with overlap-policy attributes set in its attrs. |
| RAISES | DESCRIPTION |
|---|---|
| TypeError | If intervals is not a DataFrame. |
| ValueError | If required columns are missing, or if either overlap policy string is not a recognized value. |
See Also
gintervals_load_chain : Load a chain from a UCSC chain file.
gintervals_liftover : Lift intervals from one assembly to another.
gtrack_liftover : Import a track from another assembly via liftover.
Examples:
>>> import pandas as pd
>>> import pymisha as pm
>>> chain_data = pd.DataFrame({
... "chrom": ["1"], "start": [1000], "end": [2000], "strand": [0],
... "chromsrc": ["chr25"], "startsrc": [5000], "endsrc": [6000],
... "strandsrc": [0], "chain_id": [1], "score": [1000.0],
... })
>>> chain = pm.gintervals_as_chain(chain_data)
>>> chain.attrs["tgt_overlap_policy"]
'auto_score'
pymisha.gintervals_liftover
gintervals_liftover(intervals: DataFrame, chain: str | DataFrame, src_overlap_policy: str = 'error', tgt_overlap_policy: str = 'auto', min_score: float | None = None, include_metadata: bool = False, canonic: bool = False, value_col: str | None = None, multi_target_agg: str = 'mean', params: dict[str, Any] | int | None = None, na_rm: bool = True, min_n: int | None = None) -> pd.DataFrame
Convert intervals from another assembly to the current one using a chain.
Maps each source interval through the chain's alignment blocks to produce
the corresponding target-genome coordinates. A single source interval may
produce multiple target intervals when it spans chain gaps or maps through
multiple chains. The intervalID column in the output links each result
row back to the originating source interval (0-based positional index).
When chain is a file path, it is loaded with the specified overlap
policies. When it is a pre-loaded DataFrame (from gintervals_load_chain
or gintervals_as_chain), the policies stored in its attributes are
used and the policy arguments here are ignored.
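The mapping through alignment blocks can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: blocks are gap-free, plus-strand only, half-open, and laid out as `(chrom, start, chromsrc, startsrc, endsrc, chain_id)`; `lift_interval` and `lift_all` are hypothetical names.

```python
# Two gap-free blocks from one chain; source positions 100-150 fall in a chain gap.
blocks = [
    ("chr1", 1000, "chrS", 0, 100, 1),    # chrS:0-100   -> chr1:1000-1100
    ("chr1", 1200, "chrS", 150, 300, 1),  # chrS:150-300 -> chr1:1200-1350
]


def lift_interval(interval, blocks, interval_id):
    """Map one source interval through every block it intersects."""
    chrom, start, end = interval
    out = []
    for tgt_chrom, tgt_start, src_chrom, src_start, src_end, chain_id in blocks:
        if src_chrom != chrom:
            continue
        # Clip the interval to the part covered by this block.
        lo, hi = max(start, src_start), min(end, src_end)
        if lo >= hi:
            continue
        offset = tgt_start - src_start  # plus-strand shift only
        out.append((tgt_chrom, lo + offset, hi + offset, interval_id, chain_id))
    return out


def lift_all(intervals, blocks):
    """Lift every interval; interval_id is the 0-based position in the input."""
    rows = []
    for interval_id, interval in enumerate(intervals):
        rows.extend(lift_interval(interval, blocks, interval_id))
    return sorted(rows)  # sorted by target chrom, start, end


lifted = lift_all([("chrS", 50, 200), ("chrS", 400, 500)], blocks)
for row in lifted:
    print(row)
```

The first interval spans the gap and therefore yields two target rows sharing the same interval ID, while the second interval lies outside every block and produces no rows.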
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| intervals | DataFrame | Source-assembly intervals. Must contain at least the columns |
| chain | str \| DataFrame | Either a path to a UCSC chain file (loaded via |
| src_overlap_policy | str | Source overlap policy, used only when chain is a file path. One of |
| tgt_overlap_policy | str | Target overlap policy, used only when chain is a file path. One of |
| min_score | float \| None | Minimum chain alignment score, used only when chain is a file path. Chains scoring below this threshold are excluded. |
| include_metadata | bool | If |
| canonic | bool | If |
| value_col | str \| None | Name of a numeric column in intervals whose values should be carried through the liftover. When specified, the output includes this column with its original name. Ignored if |
| multi_target_agg | str | Aggregation method applied to value_col when multiple source intervals map to the same target region. One of |
| params | dict[str, Any] \| int \| None | Additional parameters for specific aggregation methods (e.g., |
| na_rm | bool | If |
| min_n | int \| None | Minimum number of non- |
| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Lifted intervals sorted by target coordinates with the columns: |
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If intervals or chain is |
See Also
gintervals_load_chain : Load a chain from a UCSC chain file.
gintervals_as_chain : Convert a DataFrame to chain format.
gtrack_liftover : Import a full track from another assembly.
Examples:
>>> import pandas as pd
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> import os
>>> chainfile = os.path.join(pm._GROOT, "data", "test.chain")
>>> intervs = pd.DataFrame({
... "chrom": ["chr25", "chr25"],
... "start": [0, 7000],
... "end": [6000, 20000],
... })
>>> lifted = pm.gintervals_liftover(
... intervs, chainfile, src_overlap_policy="keep"
... )
>>> list(lifted.columns)
['chrom', 'start', 'end', 'intervalID', 'chain_id']
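The multi_target_agg, na_rm, and min_n parameters of gintervals_liftover govern how values are combined when several source intervals land on the same target region. The sketch below shows one plausible reading of a "mean" aggregation in plain Python. The exact semantics are assumptions, since this section does not spell them out: missing values are NaN, na_rm=True drops them before averaging, na_rm=False makes any NaN propagate, and min_n is the minimum count of non-missing values needed for a result. `aggregate_mean` is a hypothetical helper.

```python
import math
from collections import defaultdict


def aggregate_mean(rows, na_rm=True, min_n=None):
    """Average values per target region; rows are (chrom, start, end, value)."""
    groups = defaultdict(list)
    for chrom, start, end, value in rows:
        groups[(chrom, start, end)].append(value)

    result = {}
    for region, values in sorted(groups.items()):
        present = [v for v in values if not math.isnan(v)]
        if not na_rm and len(present) < len(values):
            # Assumed behaviour: without na_rm, a missing value poisons the group.
            result[region] = math.nan
        elif not present or (min_n is not None and len(present) < min_n):
            # Assumed behaviour: too few usable values yields a missing result.
            result[region] = math.nan
        else:
            result[region] = sum(present) / len(present)
    return result


rows = [
    ("chr1", 100, 200, 2.0),
    ("chr1", 100, 200, 4.0),
    ("chr1", 100, 200, math.nan),
    ("chr1", 300, 400, 5.0),
]
means = aggregate_mean(rows)
strict = aggregate_mean(rows, min_n=2)
print(means)
print(strict)
```

With the defaults, the first region averages its two usable values and the second keeps its single value; with min_n=2 the second region no longer has enough values and becomes missing.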
pymisha.gtrack_liftover
gtrack_liftover(track: str, description: str, src_track_dir: str, chain: str | DataFrame, src_overlap_policy: str = 'error', tgt_overlap_policy: str = 'auto', multi_target_agg: str = 'mean', params: dict[str, Any] | None = None, na_rm: bool = True, min_n: int | None = None, min_score: float | None = None, *, _force_pure_python: bool = False) -> None
Import a track from another assembly via coordinate liftover.
Dispatches to the C++ fast path (G1.P3.C) by default. Falls back to the
pure-Python implementation when _force_pure_python=True or when the
environment variable PYMISHA_FORCE_PY_LIFTOVER_TRACK is set to 1.
See _gtrack_liftover_python for the full parameter docstring.
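The dispatch rule described above can be sketched as a small wrapper. This is an illustration of the pattern only: `_liftover_cpp`, `_liftover_python`, and `liftover_dispatch` are hypothetical stand-ins, and only the keyword argument name and the environment variable name come from the text above.

```python
import os

ENV_FLAG = "PYMISHA_FORCE_PY_LIFTOVER_TRACK"


def _liftover_cpp(*args, **kwargs):
    """Hypothetical stand-in for the compiled fast path."""
    return "cpp"


def _liftover_python(*args, **kwargs):
    """Hypothetical stand-in for the pure-Python fallback."""
    return "python"


def liftover_dispatch(*args, _force_pure_python=False, **kwargs):
    """Pick the implementation from the keyword flag or the environment."""
    use_python = _force_pure_python or os.environ.get(ENV_FLAG) == "1"
    impl = _liftover_python if use_python else _liftover_cpp
    return impl(*args, **kwargs)


print(liftover_dispatch(_force_pure_python=True))
```

Reading the environment variable at call time rather than at import time lets a test suite toggle the fallback per test without reloading the module.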