Liftover
Functions for converting genomic coordinates between assemblies using chain files.
pymisha.gintervals_load_chain
gintervals_load_chain(file, src_overlap_policy='error', tgt_overlap_policy='auto', src_groot=None, min_score=None)
Load an assembly conversion table from a UCSC chain file.
Reads a UCSC-format chain file and returns an assembly conversion table
(DataFrame) that maps coordinates between a source genome and the current
target genome. The resulting table can be used with
gintervals_liftover and gtrack_liftover to convert intervals or
tracks from the source assembly to the current one.
Source overlaps occur when the same source genome position maps to multiple target positions. Target overlaps occur when multiple source positions map to overlapping regions in the target genome. Both types of overlaps are handled according to the specified policies.
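The two kinds of overlap described above can be illustrated with a minimal, library-independent sketch. The `Block` record and `find_overlaps` helper are hypothetical names introduced here for illustration; they are not part of pymisha, and the sketch assumes half-open `[start, end)` coordinates. It only detects overlaps and does not apply any policy.

```python
from collections import namedtuple
from itertools import combinations

# Hypothetical, simplified chain block (not a pymisha type).
# Coordinates are assumed to be half-open: [start, end).
Block = namedtuple("Block", "chrom start end chromsrc startsrc endsrc chain_id")


def _intersects(chrom_a, start_a, end_a, chrom_b, start_b, end_b):
    """True when two half-open intervals on the same chromosome intersect."""
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a


def find_overlaps(blocks):
    """Return (source_overlaps, target_overlaps) as lists of chain_id pairs."""
    source_overlaps, target_overlaps = [], []
    for a, b in combinations(blocks, 2):
        # Source overlap: the same source position maps to several targets.
        if _intersects(a.chromsrc, a.startsrc, a.endsrc,
                       b.chromsrc, b.startsrc, b.endsrc):
            source_overlaps.append((a.chain_id, b.chain_id))
        # Target overlap: several source regions land on the same target region.
        if _intersects(a.chrom, a.start, a.end, b.chrom, b.start, b.end):
            target_overlaps.append((a.chain_id, b.chain_id))
    return source_overlaps, target_overlaps


blocks = [
    Block("chr1", 1000, 2000, "chr25", 5000, 6000, 1),
    Block("chr2", 300, 800, "chr25", 5500, 6000, 2),  # shares source with chain 1
    Block("chr1", 1500, 1800, "chr30", 0, 300, 3),    # shares target with chain 1
]
source_overlaps, target_overlaps = find_overlaps(blocks)
```

Here chains 1 and 2 form a source overlap and chains 1 and 3 form a target overlap. The pairwise comparison is quadratic and is meant only to make the definitions concrete.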
| PARAMETER | DESCRIPTION |
|---|---|
| file | Path to the UCSC chain file. The file must follow the standard UCSC chain format specification. Chains whose target chromosomes are not present in the current database are silently skipped. |
| src_overlap_policy | Policy for handling source-side overlaps. One of: |
| tgt_overlap_policy | Policy for handling target-side overlaps. One of: |
| src_groot | Path to the source genome database root for validating source chromosomes and coordinates. Not yet implemented. |
| min_score | Minimum alignment score threshold. Chains with scores below this value are filtered out before overlap resolution. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Assembly conversion table with the following columns: The overlap policies are stored in |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the chain file is malformed, contains inconsistent chromosome sizes, has coordinates out of range, or if overlap policies are invalid. Also raised when |
See Also
gintervals_as_chain : Convert an existing DataFrame to chain format.
gintervals_liftover : Lift intervals from one assembly to another.
gtrack_liftover : Import a track from another assembly via liftover.
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> import os
>>> chainfile = os.path.join(pm._GROOT, "data", "test.chain")
>>> chain = pm.gintervals_load_chain(
... chainfile, src_overlap_policy="keep"
... )
>>> list(chain.columns)
['chrom', 'start', 'end', 'strand', 'chromsrc', 'startsrc', 'endsrc', 'strandsrc', 'chain_id', 'score']
pymisha.gintervals_as_chain
gintervals_as_chain(intervals, src_overlap_policy='error', tgt_overlap_policy='auto', min_score=None)
Convert a DataFrame to chain format by validating columns and setting attributes.
Validates that the input DataFrame has all required chain columns and
attaches overlap-policy metadata as DataFrame attributes. This is useful
when you have manually constructed or modified chain data and need to
mark it as a valid chain table for use with gintervals_liftover or
gtrack_liftover.
| PARAMETER | DESCRIPTION |
|---|---|
| intervals | A DataFrame that must contain all of the required chain columns: |
| src_overlap_policy | Policy for handling source-side overlaps. One of |
| tgt_overlap_policy | Policy for handling target-side overlaps. One of |
| min_score | Minimum alignment score threshold to record as a DataFrame attribute. Does not filter the data; the value is stored for informational use by downstream functions. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | A copy of the input DataFrame with overlap-policy attributes set in |

| RAISES | DESCRIPTION |
|---|---|
| TypeError | If intervals is not a |
| ValueError | If required columns are missing, or if either overlap policy string is not a recognized value. |
See Also
gintervals_load_chain : Load a chain from a UCSC chain file.
gintervals_liftover : Lift intervals from one assembly to another.
gtrack_liftover : Import a track from another assembly via liftover.
Examples:
>>> import pandas as pd
>>> import pymisha as pm
>>> chain_data = pd.DataFrame({
... "chrom": ["1"], "start": [1000], "end": [2000], "strand": [0],
... "chromsrc": ["chr25"], "startsrc": [5000], "endsrc": [6000],
... "strandsrc": [0], "chain_id": [1], "score": [1000.0],
... })
>>> chain = pm.gintervals_as_chain(chain_data)
>>> chain.attrs["tgt_overlap_policy"]
'auto_score'
pymisha.gintervals_liftover
gintervals_liftover(intervals, chain, src_overlap_policy='error', tgt_overlap_policy='auto', min_score=None, include_metadata=False, canonic=False, value_col=None, multi_target_agg='mean', params=None, na_rm=True, min_n=None)
Convert intervals from another assembly to the current one using a chain.
Maps each source interval through the chain's alignment blocks to produce
the corresponding target-genome coordinates. A single source interval may
produce multiple target intervals when it spans chain gaps or maps through
multiple chains. The intervalID column in the output links each result
row back to the originating source interval (0-based positional index).
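The mapping described above can be sketched in plain Python. This is a simplified illustration, not pymisha's implementation: `lift_intervals` is a hypothetical helper, blocks are assumed to be ungapped and same-strand, coordinates are assumed half-open, and no overlap policy is applied.

```python
def lift_intervals(intervals, blocks):
    """Map source intervals through ungapped, same-strand chain blocks.

    intervals: list of (chromsrc, start, end) tuples, half-open.
    blocks: list of dicts with keys chrom, start, end,
            chromsrc, startsrc, endsrc, chain_id.
    """
    lifted = []
    # intervalID is the 0-based position of the source interval in the input.
    for interval_id, (chrom, start, end) in enumerate(intervals):
        for block in blocks:
            if block["chromsrc"] != chrom:
                continue
            # Clip the source interval to the part covered by this block.
            lo = max(start, block["startsrc"])
            hi = min(end, block["endsrc"])
            if lo >= hi:
                continue  # no overlap: this part falls in a gap
            offset = block["start"] - block["startsrc"]
            lifted.append({
                "chrom": block["chrom"],
                "start": lo + offset,
                "end": hi + offset,
                "intervalID": interval_id,
                "chain_id": block["chain_id"],
            })
    # Sort by target coordinates, as the documented output is.
    lifted.sort(key=lambda row: (row["chrom"], row["start"], row["end"]))
    return lifted


# One chain with two aligned blocks separated by a gap in both assemblies.
blocks = [
    {"chrom": "chr1", "start": 1000, "end": 2000,
     "chromsrc": "chr25", "startsrc": 5000, "endsrc": 6000, "chain_id": 1},
    {"chrom": "chr1", "start": 2600, "end": 3100,
     "chromsrc": "chr25", "startsrc": 6500, "endsrc": 7000, "chain_id": 1},
]
intervals = [("chr25", 5500, 6800), ("chr25", 0, 100), ("chr25", 6900, 7500)]
lifted = lift_intervals(intervals, blocks)
```

The first source interval spans the gap and yields two target rows with `intervalID` 0, the second maps nowhere and yields no rows, and the third is clipped to the end of the second block.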
When chain is a file path, it is loaded with the specified overlap
policies. When it is a pre-loaded DataFrame (from gintervals_load_chain
or gintervals_as_chain), the policies stored in its attributes are
used and the policy arguments here are ignored.
| PARAMETER | DESCRIPTION |
|---|---|
| intervals | Source-assembly intervals. Must contain at least the columns |
| chain | Either a path to a UCSC chain file (loaded via |
| src_overlap_policy | Source overlap policy, used only when chain is a file path. One of |
| tgt_overlap_policy | Target overlap policy, used only when chain is a file path. One of |
| min_score | Minimum chain alignment score, used only when chain is a file path. Chains scoring below this threshold are excluded. |
| include_metadata | If |
| canonic | If |
| value_col | Name of a numeric column in intervals whose values should be carried through the liftover. When specified, the output includes this column with its original name. Ignored if |
| multi_target_agg | Aggregation method applied to value_col when multiple source intervals map to the same target region. One of |
| params | Additional parameters for specific aggregation methods (e.g., |
| na_rm | If |
| min_n | Minimum number of non- |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Lifted intervals sorted by target coordinates with the columns: |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If intervals or chain is |
See Also
gintervals_load_chain : Load a chain from a UCSC chain file.
gintervals_as_chain : Convert a DataFrame to chain format.
gtrack_liftover : Import a full track from another assembly.
Examples:
>>> import pandas as pd
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> import os
>>> chainfile = os.path.join(pm._GROOT, "data", "test.chain")
>>> intervs = pd.DataFrame({
... "chrom": ["chr25", "chr25"],
... "start": [0, 7000],
... "end": [6000, 20000],
... })
>>> lifted = pm.gintervals_liftover(
... intervs, chainfile, src_overlap_policy="keep"
... )
>>> list(lifted.columns)
['chrom', 'start', 'end', 'intervalID', 'chain_id']
pymisha.gtrack_liftover
gtrack_liftover(track, description, src_track_dir, chain, src_overlap_policy='error', tgt_overlap_policy='auto', multi_target_agg='mean', params=None, na_rm=True, min_n=None, min_score=None)
Import a track from another assembly via coordinate liftover.
Reads a source track from src_track_dir (a directory containing
per-chromosome binary track files or an indexed track.idx/track.dat
pair), maps its intervals through chain to the current target genome,
aggregates values when multiple source intervals land on the same target
region, and creates a new sparse track in the current database.
When chain is a file path it is loaded with the specified overlap policies. When it is a pre-loaded DataFrame the policies stored in its attributes are used and the policy arguments here are ignored.
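The aggregation step can be pictured with the following sketch, which uses the default `'mean'` aggregation. It is an illustration under stated assumptions, not the library's code: `aggregate_by_target` is a hypothetical helper, "same target region" is simplified to identical coordinates, and the behavior attached to `na_rm` and `min_n` here is an assumed reading of those parameter names, since this page does not spell out their semantics.

```python
import math
from collections import defaultdict
from statistics import fmean


def aggregate_by_target(rows, na_rm=True, min_n=None):
    """Average values that land on the same target region.

    rows: iterable of (chrom, start, end, value) in target coordinates.
    Returns a dict mapping (chrom, start, end) to the mean value or NaN.
    """
    groups = defaultdict(list)
    for chrom, start, end, value in rows:
        groups[(chrom, start, end)].append(value)

    result = {}
    for region, values in sorted(groups.items()):
        present = [v for v in values if not math.isnan(v)]
        if not na_rm and len(present) < len(values):
            # Assumption: without NaN removal, any NaN makes the result NaN.
            result[region] = math.nan
        elif not present or (min_n is not None and len(present) < min_n):
            # Assumption: too few non-NaN values yields NaN.
            result[region] = math.nan
        else:
            result[region] = fmean(present)
    return result


rows = [
    ("chr1", 1000, 2000, 2.0),
    ("chr1", 1000, 2000, 4.0),       # second source value on the same region
    ("chr1", 3000, 3500, math.nan),
    ("chr1", 3000, 3500, 6.0),
]
result = aggregate_by_target(rows)
strict = aggregate_by_target(rows, min_n=2)
```

With the defaults, the first region averages to 3.0 and the second to 6.0 after the NaN is dropped. With `min_n=2`, the second region has only one usable value and becomes NaN under the assumed semantics.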
| PARAMETER | DESCRIPTION |
|---|---|
| track | Name of the new track to create in the current database. The track must not already exist. |
| description | Human-readable description stored as a track attribute. |
| src_track_dir | Path to the source track directory. The directory may contain per-chromosome binary files (dense or sparse) or an indexed pair of |
| chain | Either a path to a UCSC chain file or a pre-loaded chain DataFrame as returned by |
| src_overlap_policy | Source overlap policy, used only when chain is a file path. One of |
| tgt_overlap_policy | Target overlap policy, used only when chain is a file path. One of |
| multi_target_agg | Aggregation function applied when multiple source values map to the same target locus. One of |
| params | Extra parameters for specific aggregation methods (e.g., |
| na_rm | If |
| min_n | Minimum number of non- |
| min_score | Minimum chain alignment score. Chains scoring below this value are excluded during loading. Only used when chain is a file path. |

| RETURNS | DESCRIPTION |
|---|---|
| None | The function creates a new sparse track in the current database as a side effect and does not return a value. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If track already exists, if src_track_dir does not exist, if the aggregation function is unsupported, or if the chain file is invalid. |
| TypeError | If chain is neither a file path string nor a |
See Also
gintervals_load_chain : Load a chain from a UCSC chain file.
gintervals_liftover : Lift intervals (without creating a track).
gtrack_create_sparse : Create a sparse track from intervals and values.
Notes
UCSC chain format terminology is reversed from misha convention: UCSC
"target" (tName, tStart, tEnd) corresponds to misha "source"
(chromsrc, startsrc, endsrc), and UCSC "query" corresponds to
misha "target" (chrom, start, end).
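The reversed terminology is easiest to see on a single UCSC chain header line, whose fields are `chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id`. The sketch below is illustrative only: `header_to_misha` is a hypothetical helper, not a pymisha function. It keeps strands as the raw `+`/`-` characters and does not convert reverse-strand coordinates, both of which a real loader would need to handle.

```python
# Field names of a UCSC chain header line, after the leading "chain" token.
UCSC_FIELDS = ("score", "tName", "tSize", "tStrand", "tStart", "tEnd",
               "qName", "qSize", "qStrand", "qStart", "qEnd", "id")


def header_to_misha(line):
    """Rename UCSC header fields to the misha-style column names."""
    tokens = line.split()
    if len(tokens) != 13 or tokens[0] != "chain":
        raise ValueError(f"not a chain header line: {line!r}")
    h = dict(zip(UCSC_FIELDS, tokens[1:]))
    return {
        # UCSC "query" is the misha target (the current genome).
        "chrom": h["qName"],
        "start": int(h["qStart"]),
        "end": int(h["qEnd"]),
        "strand": h["qStrand"],
        # UCSC "target" is the misha source assembly.
        "chromsrc": h["tName"],
        "startsrc": int(h["tStart"]),
        "endsrc": int(h["tEnd"]),
        "strandsrc": h["tStrand"],
        "chain_id": int(h["id"]),
        "score": float(h["score"]),
    }


row = header_to_misha(
    "chain 1000 chr25 20000 + 5000 6000 chr1 500000 + 1000 2000 1"
)
```

In this made-up header, `chr25:5000-6000` is the UCSC target and therefore ends up in the `chromsrc`, `startsrc`, and `endsrc` columns, while the UCSC query `chr1:1000-2000` fills `chrom`, `start`, and `end`.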
Examples: