
Datasets

Functions for loading, saving, and managing named datasets within the genomic database.

pymisha.gdataset_load

gdataset_load(path: str, force: bool = False, verbose: bool = False) -> dict

Load a dataset into the namespace.

Loads tracks and intervals from a dataset directory, making them available for analysis alongside the working database. If the dataset contains tracks or intervals whose names collide with objects in the working database or previously loaded datasets, an error is raised unless force=True. When collisions are forced, the working database always wins; for dataset-to-dataset collisions, the later-loaded dataset overrides earlier ones.
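The precedence rule above can be sketched in plain Python. This is an illustrative model, not pymisha internals: each namespace is modeled as a dict, and lookup follows the documented order (working database first, then datasets from latest-loaded to earliest).

```python
def resolve(name, working_db, datasets):
    """Resolve a track/interval name under forced collisions.

    `datasets` is ordered from earliest-loaded to latest-loaded.
    """
    if name in working_db:
        return working_db[name]        # the working database always wins
    for ds in reversed(datasets):      # later-loaded datasets override earlier
        if name in ds:
            return ds[name]
    raise KeyError(name)

working = {"dense_track": "db"}
ds_a = {"dense_track": "a", "shared": "a"}
ds_b = {"shared": "b"}

print(resolve("dense_track", working, [ds_a, ds_b]))  # db (working db wins)
print(resolve("shared", working, [ds_a, ds_b]))       # b (later dataset wins)
```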

PARAMETER DESCRIPTION
path

Path to a dataset or misha database directory.

TYPE: str

force

If True, ignore name collisions (working db wins; later datasets override earlier).

TYPE: bool DEFAULT: False

verbose

Print loaded track/interval counts.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
dict

Dictionary with keys "tracks", "intervals", "shadowed_tracks", and "shadowed_intervals" indicating how many objects were loaded and how many were shadowed by collisions.

RAISES DESCRIPTION
ValueError

If the dataset path does not exist, lacks a tracks/ directory, has a mismatched genome, or has collisions without force=True.

See Also

gdataset_unload : Unload a dataset from the namespace.
gdataset_save : Save tracks/intervals as a dataset.
gdataset_ls : List loaded datasets.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> result = pm.gdataset_load("/path/to/dataset")
>>> result["tracks"]
5

pymisha.gdataset_unload

gdataset_unload(path: str, validate: bool = False) -> None

Unload a dataset from the namespace.

Removes all tracks and intervals from a previously loaded dataset. If a dataset track was shadowing another, the shadowed track becomes visible again.
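The shadow-restore behavior can be modeled with ordinary dicts. This is an illustrative sketch, not pymisha internals: removing the latest dataset from the load order makes the name it shadowed resolve to the earlier dataset again.

```python
def visible(datasets):
    """Merge loaded datasets; later-loaded entries shadow earlier ones."""
    merged = {}
    for _name, contents in datasets:   # earliest-loaded to latest-loaded
        merged.update(contents)
    return merged

loaded = [("ds_a", {"t": "a"}), ("ds_b", {"t": "b"})]
print(visible(loaded)["t"])            # b: ds_b shadows ds_a's track

# Equivalent of gdataset_unload("ds_b"): drop it from the load order.
loaded = [d for d in loaded if d[0] != "ds_b"]
print(visible(loaded)["t"])            # a: the shadowed track is visible again
```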

PARAMETER DESCRIPTION
path

Path to a previously loaded dataset.

TYPE: str

validate

If True, raise an error if the path is not currently loaded; otherwise the call is a silent no-op.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
None
RAISES DESCRIPTION
ValueError

If validate is True and the dataset is not currently loaded.

See Also

gdataset_load : Load a dataset into the namespace.
gdataset_ls : List loaded datasets.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> _ = pm.gdataset_load("/path/to/dataset")
>>> pm.gdataset_unload("/path/to/dataset", validate=True)

pymisha.gdataset_ls

gdataset_ls() -> list[str]

List currently loaded datasets.

Returns the normalized absolute paths of all datasets that have been loaded into the current session via gdataset_load.

RETURNS DESCRIPTION
list[str]

Normalized absolute paths of loaded datasets. Empty list if no datasets are loaded.

See Also

gdataset_load : Load a dataset into the namespace.
gdataset_info : Return metadata for a dataset.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdataset_ls()
[]
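The docs above say paths are returned in normalized absolute form but do not pin down the exact rule. A plausible stdlib sketch of that kind of normalization (the real implementation may differ):

```python
import os.path

def normalize_dataset_path(path: str) -> str:
    # Expand "~", make absolute, and resolve symlinks/".." components.
    return os.path.realpath(os.path.abspath(os.path.expanduser(path)))

# "datasets/../datasets/my_ds" and "datasets/my_ds" normalize identically.
a = normalize_dataset_path("datasets/../datasets/my_ds")
b = normalize_dataset_path("datasets/my_ds")
print(a == b)  # True
```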

pymisha.gdataset_save

gdataset_save(path: str, description: str, tracks: str | Iterable[str] | None = None, intervals: str | Iterable[str] | None = None, symlinks: bool = False, copy_seq: bool = False) -> str

Save selected tracks/intervals into a standalone dataset directory.

PARAMETER DESCRIPTION
path

Destination directory. Must not exist.

TYPE: str

description

Dataset description stored in misha.yaml.

TYPE: str

tracks

Track names to include.

TYPE: str | Iterable[str] | None DEFAULT: None

intervals

Interval set names to include.

TYPE: str | Iterable[str] | None DEFAULT: None

symlinks

If True, link track/interval resources instead of copying.

TYPE: bool DEFAULT: False

copy_seq

If True, copy seq/ into the dataset. Otherwise create a symlink to the working database's seq/ directory.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
str

Absolute path of the created dataset directory.

RAISES DESCRIPTION
ValueError

If neither tracks nor intervals is specified, the path already exists, or a requested track/interval does not exist.

See Also

gdataset_load : Load a dataset into the namespace.
gdataset_info : Return metadata for a dataset.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> _ = pm.gdataset_save(
...     "/tmp/my_dataset",
...     description="Example dataset",
...     tracks=["dense_track"],
...     intervals=["my_intervals"],
... )
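The symlinks/copy_seq choice amounts to linking versus copying on disk. A stdlib sketch of that trade-off (directory names are made up; this is not what gdataset_save literally does): a symlink shares the working database's seq/ and costs nothing, while a copy makes the dataset self-contained.

```python
import os
import shutil
import tempfile

def materialize_seq(src: str, dst: str, copy_seq: bool = False) -> None:
    if copy_seq:
        shutil.copytree(src, dst)   # independent, self-contained copy
    else:
        os.symlink(src, dst)        # shares the working DB's seq/ (POSIX)

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "seq")
os.makedirs(src)
open(os.path.join(src, "chr1.seq"), "w").close()

dst = os.path.join(tmp, "dataset_seq")
materialize_seq(src, dst, copy_seq=False)
print(os.path.islink(dst))  # True
```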

pymisha.gdataset_info

gdataset_info(path: str) -> dict[str, Any]

Return metadata and contents summary for a dataset path.

Reads the misha.yaml metadata file and scans the dataset for tracks and intervals. The dataset does not need to be loaded.
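A minimal sketch of pulling key/value metadata out of a misha.yaml-style file. The field names follow the return keys documented below, but the real file format may be richer and pymisha presumably uses a proper YAML parser:

```python
def read_metadata(text: str) -> dict:
    """Parse flat 'key: value' lines, skipping comments."""
    meta = {}
    for line in text.splitlines():
        if ":" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

sample = """description: Example dataset
author: jane
misha_version: 4.2"""
print(read_metadata(sample)["description"])  # Example dataset
```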

PARAMETER DESCRIPTION
path

Path to a dataset directory (loaded or not).

TYPE: str

RETURNS DESCRIPTION
dict[str, Any]

Dictionary with keys: "description", "author", "created", "original_db", "misha_version", "track_count", "interval_count", "genome", and "is_loaded".

See Also

gdataset_ls : List loaded datasets.
gdataset_load : Load a dataset into the namespace.

Examples:

>>> import pymisha as pm
>>> info = pm.gdataset_info("/path/to/dataset")
>>> info["track_count"]
3