
Datasets

Functions for loading, saving, and managing named datasets within the genomic database.

pymisha.gdataset_load

gdataset_load(path: str, force: bool = False, verbose: bool = False) -> dict

Load a dataset into the namespace.

Loads tracks and intervals from a dataset directory, making them available for analysis alongside the working database. If the dataset contains tracks or intervals whose names collide with objects in the working database or previously loaded datasets, an error is raised unless force=True. When collisions are forced, the working database always wins; for dataset-to-dataset collisions, the later-loaded dataset overrides earlier ones.
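The precedence rule above can be sketched in plain Python. This is an illustrative model, not pymisha internals: each namespace is modeled as a dict, and lookup follows the documented order (working database first, then datasets from latest-loaded to earliest).

```python
def resolve(name, working_db, datasets):
    """Resolve a track/interval name under forced collisions.

    `datasets` is ordered from earliest-loaded to latest-loaded.
    """
    if name in working_db:
        return working_db[name]        # the working database always wins
    for ds in reversed(datasets):      # later-loaded datasets override earlier
        if name in ds:
            return ds[name]
    raise KeyError(name)

working = {"dense_track": "db"}
ds_a = {"dense_track": "a", "shared": "a"}
ds_b = {"shared": "b"}

print(resolve("dense_track", working, [ds_a, ds_b]))  # db (working db wins)
print(resolve("shared", working, [ds_a, ds_b]))       # b (later dataset wins)
```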

PARAMETER DESCRIPTION
path

Path to a dataset or misha database directory.

TYPE: str

force

If True, ignore name collisions (working db wins; later datasets override earlier).

TYPE: bool DEFAULT: False

verbose

Print loaded track/interval counts.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
dict

Dictionary with keys "tracks", "intervals", "shadowed_tracks", and "shadowed_intervals" indicating how many objects were loaded and how many were shadowed by collisions.

RAISES DESCRIPTION
ValueError

If the dataset path does not exist, lacks a tracks/ directory, has a mismatched genome, or has collisions without force=True.

See Also

gdataset_unload : Unload a dataset from the namespace.
gdataset_save : Save tracks/intervals as a dataset.
gdataset_ls : List loaded datasets.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> result = pm.gdataset_load("/path/to/dataset")
>>> result["tracks"]
5

pymisha.gdataset_unload

gdataset_unload(path: str, validate: bool = False) -> None

Unload a dataset from the namespace.

Removes all tracks and intervals from a previously loaded dataset. If a dataset track was shadowing another, the shadowed track becomes visible again.
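The shadow-restore behavior can be modeled with ordinary dicts. This is an illustrative sketch, not pymisha internals: removing the latest dataset from the load order makes the name it shadowed resolve to the earlier dataset again.

```python
def visible(datasets):
    """Merge loaded datasets; later-loaded entries shadow earlier ones."""
    merged = {}
    for _name, contents in datasets:   # earliest-loaded to latest-loaded
        merged.update(contents)
    return merged

loaded = [("ds_a", {"t": "a"}), ("ds_b", {"t": "b"})]
print(visible(loaded)["t"])            # b: ds_b shadows ds_a's track

# Equivalent of gdataset_unload("ds_b"): drop it from the load order.
loaded = [d for d in loaded if d[0] != "ds_b"]
print(visible(loaded)["t"])            # a: the shadowed track is visible again
```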

PARAMETER DESCRIPTION
path

Path to a previously loaded dataset.

TYPE: str

validate

If True, raise an error if the path is not currently loaded; otherwise the call is a silent no-op.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
None
RAISES DESCRIPTION
ValueError

If validate is True and the dataset is not currently loaded.

See Also

gdataset_load : Load a dataset into the namespace.
gdataset_ls : List loaded datasets.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> _ = pm.gdataset_load("/path/to/dataset")
>>> pm.gdataset_unload("/path/to/dataset", validate=True)

pymisha.gdataset_ls

gdataset_ls() -> list[str]

List currently loaded datasets.

Returns the normalized absolute paths of all datasets that have been loaded into the current session via gdataset_load.

RETURNS DESCRIPTION
list[str]

Normalized absolute paths of loaded datasets. Empty list if no datasets are loaded.

See Also

gdataset_load : Load a dataset into the namespace.
gdataset_info : Return metadata for a dataset.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdataset_ls()
[]
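The docs above say paths are returned in normalized absolute form but do not pin down the exact rule. A plausible stdlib sketch of that kind of normalization (the real implementation may differ):

```python
import os.path

def normalize_dataset_path(path: str) -> str:
    # Expand "~", make absolute, and resolve symlinks/".." components.
    return os.path.realpath(os.path.abspath(os.path.expanduser(path)))

# "datasets/../datasets/my_ds" and "datasets/my_ds" normalize identically.
a = normalize_dataset_path("datasets/../datasets/my_ds")
b = normalize_dataset_path("datasets/my_ds")
print(a == b)  # True
```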

pymisha.gdataset_save

gdataset_save(path: str, description: str, tracks: str | Iterable[str] | None = None, intervals: str | Iterable[str] | None = None, symlinks: bool = False, copy_seq: bool = False) -> str

Save selected tracks/intervals into a standalone dataset directory.

PARAMETER DESCRIPTION
path

Destination directory. Must not exist.

TYPE: str

description

Dataset description stored in misha.yaml.

TYPE: str

tracks

Track names to include.

TYPE: str | Iterable[str] | None DEFAULT: None

intervals

Interval set names to include.

TYPE: str | Iterable[str] | None DEFAULT: None

symlinks

If True, link track/interval resources instead of copying.

TYPE: bool DEFAULT: False

copy_seq

If True, copy seq/ into the dataset. Otherwise create a symlink to the working database's seq/ directory.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
str

Absolute path of the created dataset directory.

RAISES DESCRIPTION
ValueError

If neither tracks nor intervals is specified, the path already exists, or a requested track/interval does not exist.

See Also

gdataset_load : Load a dataset into the namespace.
gdataset_info : Return metadata for a dataset.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> _ = pm.gdataset_save(
...     "/tmp/my_dataset",
...     description="Example dataset",
...     tracks=["dense_track"],
...     intervals=["my_intervals"],
... )
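The symlinks/copy_seq choice amounts to linking versus copying on disk. A stdlib sketch of that trade-off (directory names are made up; this is not what gdataset_save literally does): a symlink shares the working database's seq/ and costs nothing, while a copy makes the dataset self-contained.

```python
import os
import shutil
import tempfile

def materialize_seq(src: str, dst: str, copy_seq: bool = False) -> None:
    if copy_seq:
        shutil.copytree(src, dst)   # independent, self-contained copy
    else:
        os.symlink(src, dst)        # shares the working DB's seq/ (POSIX)

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "seq")
os.makedirs(src)
open(os.path.join(src, "chr1.seq"), "w").close()

dst = os.path.join(tmp, "dataset_seq")
materialize_seq(src, dst, copy_seq=False)
print(os.path.islink(dst))  # True
```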

pymisha.gdataset_info

gdataset_info(path: str) -> dict[str, Any]

Return metadata and contents summary for a dataset path.

Reads the misha.yaml metadata file and scans the dataset for tracks and intervals. The dataset does not need to be loaded.
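A minimal sketch of pulling key/value metadata out of a misha.yaml-style file. The field names follow the return keys documented below, but the real file format may be richer and pymisha presumably uses a proper YAML parser:

```python
def read_metadata(text: str) -> dict:
    """Parse flat 'key: value' lines, skipping comments."""
    meta = {}
    for line in text.splitlines():
        if ":" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

sample = """description: Example dataset
author: jane
misha_version: 4.2"""
print(read_metadata(sample)["description"])  # Example dataset
```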

PARAMETER DESCRIPTION
path

Path to a dataset directory (loaded or not).

TYPE: str

RETURNS DESCRIPTION
dict[str, Any]

Dictionary with keys: "description", "author", "created", "original_db", "misha_version", "track_count", "interval_count", "genome", and "is_loaded".

See Also

gdataset_ls : List loaded datasets.
gdataset_load : Load a dataset into the namespace.

Examples:

>>> import pymisha as pm
>>> info = pm.gdataset_info("/path/to/dataset")
>>> info["track_count"]
3