Database¶

Functions for initializing, configuring, and managing genomic databases, including directory operations and genome creation.

pymisha.gdb_init ¶

gdb_init(path: str, userpath: str = None)

Initialize connection to a misha genomic database.

Loads the genome database at the given path and makes it available for all subsequent genomic operations. Must be called before any other pymisha function that accesses track data.

PARAMETER	DESCRIPTION
`path`	Path to the root directory of the genome database. TYPE: `str`
`userpath`	Path to a user-writable database root. New tracks and interval sets will be created here. If None, defaults to `path`. TYPE: `str` DEFAULT: `None`

RETURNS	DESCRIPTION
`None`

See Also

gdb_reload : Refresh track lists after external changes. gdb_unload : Disconnect from the database and clear all state. gdb_info : Return metadata about the database. gsetroot : Alternative entry point with directory validation.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()  # initializes a real test DB

pymisha.gsetroot ¶

gsetroot(groot, subdir=None, rescan=False, **kwargs)

Set the database root directory with validation.

Connects to a genome database after verifying that the directory exists and contains the required tracks/ and seq/ subdirectories. This matches the R gsetroot() interface and is the recommended entry point when working interactively, since it provides clear error messages for invalid database paths.

PARAMETER	DESCRIPTION
`groot`	Path to the genome database root directory. TYPE: `str`
`subdir`	Sub-directory within `tracks/` to use as working directory after initialization. TYPE: `str` DEFAULT: `None`
`dir`	Backward-compatible alias for `subdir`. TYPE: `str`
`rescan`	If True, force a rescan of the database after initialization. Equivalent to calling `gdb_reload` after `gdb_init`. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`None`

RAISES	DESCRIPTION
`FileNotFoundError`	If `groot` does not exist, or is missing the required `tracks/` or `seq/` subdirectories.

See Also

gdb_init : Lower-level initializer without directory validation. gdb_reload : Refresh track lists without re-initializing.

Examples:

>>> import pymisha as pm
>>> pm.gsetroot(pm.gdb_examples_path())
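
The directory validation described above can be pictured with a standard-library sketch. `check_db_root` is a hypothetical helper, not part of pymisha; it only illustrates the documented rule that the root must exist and contain `tracks/` and `seq/`.

```python
import os
import tempfile


def check_db_root(groot):
    """Hypothetical helper mirroring the checks documented for gsetroot."""
    if not os.path.isdir(groot):
        raise FileNotFoundError(f"Database root does not exist: {groot}")
    for required in ("tracks", "seq"):
        if not os.path.isdir(os.path.join(groot, required)):
            raise FileNotFoundError(f"Missing required subdirectory: {required}/")
    return os.path.abspath(groot)


# Demonstrate on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "tracks"))
    try:
        check_db_root(tmp)              # seq/ is still missing
        missing_seq_detected = False
    except FileNotFoundError:
        missing_seq_detected = True

    os.mkdir(os.path.join(tmp, "seq"))
    validated = check_db_root(tmp) == os.path.abspath(tmp)
```

Failing early with a specific message is what makes this entry point convenient for interactive use.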

pymisha.gdb_reload ¶

gdb_reload()

Reload the database, refreshing track lists and metadata.

Re-scans the database root directories for newly created or removed tracks and interval sets. Call this after external modifications to the database on disk (e.g., tracks created by R misha or another process).

RETURNS	DESCRIPTION
`None`

RAISES	DESCRIPTION
`ValueError`	If no database is currently initialized.

See Also

gdb_init : Initialize a database connection. gdb_unload : Disconnect from the database entirely.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdb_reload()

pymisha.gdb_unload ¶

gdb_unload()

Unload the database, clearing all state.

Disconnects from the currently active genome database and resets all internal state, including the database root paths, working directory, datasets, and virtual tracks. After calling this function, a new `gdb_init` call is required before any genomic operations.

RETURNS	DESCRIPTION
`None`

See Also

gdb_init : Initialize a new database connection. gdb_reload : Refresh without disconnecting.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdb_unload()

pymisha.gdb_info ¶

gdb_info(groot: str = None)

Return high-level information about a misha database.

Inspects a genome database directory and returns metadata including the storage format, number of chromosomes, total genome size, and a table of per-chromosome sizes. Can be used to validate a database path without fully initializing a connection.

PARAMETER	DESCRIPTION
`groot`	Path to a database root directory. If `None`, uses the currently initialized database. TYPE: `str` DEFAULT: `None`

RETURNS	DESCRIPTION
`dict`	Dictionary with the keys listed below.

- `path` (str) -- Resolved absolute path to the database.
- `is_db` (bool) -- Whether the path is a valid misha database.
- `error` (str) -- Present only when `is_db` is False; describes why validation failed.
- `format` (str) -- `"indexed"` or `"per-chromosome"`. Present only when `is_db` is True.
- `num_chromosomes` (int) -- Number of chromosomes. Present only when `is_db` is True.
- `genome_size` (int) -- Sum of all chromosome sizes. Present only when `is_db` is True.
- `chromosomes` (pandas.DataFrame) -- Two-column table with `chrom` and `size`. Present only when `is_db` is True.

RAISES	DESCRIPTION
`ValueError`	If `groot` is None and no database is currently initialized.

See Also

gdb_init : Initialize a database connection.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> info = pm.gdb_info()
>>> info["num_chromosomes"]
3
>>> info["genome_size"]
1000000
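
Because several keys are present only for valid databases, callers typically branch on `is_db` first. The sketch below shows that pattern on hand-built dictionaries standing in for real `gdb_info()` results; `describe_db` and the sample values (including the error text) are hypothetical.

```python
def describe_db(info):
    """Summarize a gdb_info-style dict (hypothetical helper)."""
    if not info["is_db"]:
        # `error` is documented as present only when is_db is False.
        return f"{info['path']}: not a misha database ({info['error']})"
    return (
        f"{info['path']}: {info['format']} format, "
        f"{info['num_chromosomes']} chromosomes, "
        f"{info['genome_size']:,} bases"
    )


# Hand-built stand-ins; a real call would be pm.gdb_info(some_path).
ok = {"path": "/data/db", "is_db": True, "format": "indexed",
      "num_chromosomes": 3, "genome_size": 1000000}
bad = {"path": "/data/nope", "is_db": False, "error": "missing seq/"}

print(describe_db(ok))
print(describe_db(bad))
```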

pymisha.gdb_examples_path ¶

gdb_examples_path()

Return the path to the example database if available.

Checks the following locations in order:

1. `PYMISHA_EXAMPLES_DB` environment variable
2. `pymisha/examples/trackdb/test` (if packaged)
3. `tests/testdb/trackdb/test` (repo checkout)

RETURNS	DESCRIPTION
`str`	Absolute path to the example database root directory.

RAISES	DESCRIPTION
`FileNotFoundError`	If the example database cannot be located in any of the searched locations.
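
The lookup order can be sketched with the standard library alone. `find_examples_db` is a hypothetical helper, and the candidate directories are created in a temporary tree for the demonstration; the real function resolves them relative to the installed package or repository.

```python
import os
import tempfile


def find_examples_db(candidates, env=None):
    """Return the first existing location, checking the env override first."""
    env = os.environ if env is None else env
    override = env.get("PYMISHA_EXAMPLES_DB")
    search = ([override] if override else []) + list(candidates)
    for location in search:
        if os.path.isdir(location):
            return os.path.abspath(location)
    raise FileNotFoundError(f"Example database not found in: {search}")


with tempfile.TemporaryDirectory() as tmp:
    packaged = os.path.join(tmp, "pymisha", "examples", "trackdb", "test")
    checkout = os.path.join(tmp, "tests", "testdb", "trackdb", "test")

    os.makedirs(checkout)               # only the repo checkout exists
    found_checkout = find_examples_db([packaged, checkout], env={}) == checkout

    os.makedirs(packaged)               # the packaged copy now takes priority
    found_packaged = find_examples_db([packaged, checkout], env={}) == packaged

    # The environment variable is consulted before either directory.
    found_override = find_examples_db(
        [packaged], env={"PYMISHA_EXAMPLES_DB": checkout}) == checkout
```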

See Also

gdb_init_examples : Initialize the example database. gdb_init : Initialize a custom database.

Examples:

>>> import pymisha as pm
>>> path = pm.gdb_examples_path()
>>> isinstance(path, str)
True

pymisha.gdb_init_examples ¶

gdb_init_examples(copy=True)

Initialize the example database (mirrors R's gdb.init_examples).

PARAMETER	DESCRIPTION
`copy`	If True, copy the example DB into a temp dir before initializing. This avoids mutating the repo data when running examples. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`str`	Path to the initialized example DB.

See Also

gdb_examples_path : Get the path to the example database. gdb_init : Initialize a custom database.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gtrack_ls()
['array_track', 'dense_track', 'rects_track', 'sparse_track', 'subdir.dense_track2']

pymisha.gdb_create ¶

gdb_create(groot, fasta, genes_file=None, annots_file=None, annots_names=None, db_format='indexed', verbose=False, **kwargs)

Create a new Genomic Database from FASTA file(s).

Creates the directory structure, imports sequences, and writes the chromosome sizes file. Two formats are supported:

"indexed" (default): Single genome.seq + genome.idx. Recommended for genomes with many contigs.
"per-chromosome": Separate .seq file per contig in the seq/ directory.

PARAMETER	DESCRIPTION
`groot`	Path for the new database root directory. TYPE: `str`
`fasta`	Path(s) to FASTA file(s). Gzipped files (.fa.gz) are supported. TYPE: `str or list of str`
`genes_file`	Path to genes annotation file. Not yet implemented. TYPE: `str` DEFAULT: `None`
`annots_file`	Path to annotations file. Not yet implemented. TYPE: `str` DEFAULT: `None`
`annots_names`	Names for annotations. Not yet implemented. TYPE: `list of str` DEFAULT: `None`
`db_format`	Database format: `"indexed"` or `"per-chromosome"`. TYPE: `str` DEFAULT: `"indexed"`
`format`	Backward-compatible alias for `db_format`. TYPE: `str`
`verbose`	If True, print progress messages. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`DataFrame`	DataFrame with columns `name` (contig name) and `size` (contig length in bases).

RAISES	DESCRIPTION
`FileExistsError`	If the target directory already exists.
`FileNotFoundError`	If a FASTA file does not exist.
`ValueError`	If no contigs are found, duplicate contig names are detected, or an unsupported format is specified.
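
The returned contig table and the `ValueError` conditions can be illustrated with a small FASTA scan. This is a sketch, not pymisha's importer: `fasta_contig_sizes` is hypothetical, and taking the first whitespace-delimited token of each header as the contig name is an assumption.

```python
import gzip
import os
import tempfile


def fasta_contig_sizes(path):
    """Return [(name, size), ...] for a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    sizes = {}
    name = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:].split()
                if not header:
                    raise ValueError("Empty contig name in FASTA header")
                name = header[0]        # assumption: first token is the name
                if name in sizes:
                    raise ValueError(f"Duplicate contig name: {name}")
                sizes[name] = 0
            elif line and name is not None:
                sizes[name] += len(line)
    if not sizes:
        raise ValueError(f"No contigs found in {path}")
    return list(sizes.items())


with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "genome.fa.gz")
    with gzip.open(fasta, "wt") as out:
        out.write(">chr1 test contig\nACGT\nAC\n>chr2\nGGG\n")
    contigs = fasta_contig_sizes(fasta)

    dup = os.path.join(tmp, "dup.fa")
    with open(dup, "w") as out:
        out.write(">chr1\nAC\n>chr1\nGT\n")
    try:
        fasta_contig_sizes(dup)
        duplicate_rejected = False
    except ValueError:
        duplicate_rejected = True
```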

See Also

gdb_init : Initialize a database connection. gdb_reload : Reload the current database. gdb_create_genome : Download and initialize a prebuilt genome. gdb_convert_to_indexed : Convert per-chromosome format to indexed.

Examples:

Create a database from a single FASTA file:

>>> import pymisha as pm
>>> contigs = pm.gdb_create("/tmp/mydb", "genome.fa.gz")

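Create from multiple FASTA files (a fresh target path is used, because an existing directory raises `FileExistsError`):

>>> contigs = pm.gdb_create("/tmp/mydb_multi", ["chr1.fa", "chr2.fa"], verbose=True)

Create a per-chromosome database:

>>> contigs = pm.gdb_create("/tmp/mydb_perchrom", "genome.fa", db_format="per-chromosome")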

pymisha.gdb_create_genome ¶

gdb_create_genome(genome, path=None, tmpdir=None, verify_checksum=True)

Download and initialize a prebuilt genome database.

PARAMETER	DESCRIPTION
`genome`	Genome identifier. Supported values: `mm9`, `mm10`, `mm39`, `hg19`, `hg38`. TYPE: `str`
`path`	Directory to extract into. Defaults to current working directory. TYPE: `str` DEFAULT: `None`
`tmpdir`	Directory to store the temporary downloaded archive. Defaults to `tempfile.gettempdir()`. TYPE: `str` DEFAULT: `None`
`verify_checksum`	If True, download and verify the archive SHA256 checksum from `<archive_url>.sha256` before extraction. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`None`

RAISES	DESCRIPTION
`ValueError`	If the genome identifier is not supported.
`FileNotFoundError`	If the downloaded archive does not contain the expected directory.

See Also

gdb_create : Create a database from local FASTA files. gdb_init : Initialize a database connection.

Examples:

>>> import pymisha as pm
>>> pm.gdb_create_genome("hg38", path="/tmp")
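
The `verify_checksum` step amounts to comparing a SHA256 digest of the archive with a published value. The sketch below shows that comparison with `hashlib`; `sha256_matches` is a hypothetical helper, the archive is a dummy file, and the `<digest>  <filename>` sidecar layout is an assumption about the `.sha256` file rather than a documented format.

```python
import hashlib
import os
import tempfile


def sha256_matches(archive_path, expected_hex, chunk_size=1 << 20):
    """Hash the file in chunks and compare with the expected hex digest."""
    digest = hashlib.sha256()
    with open(archive_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "genome.tar.gz")
    payload = b"not a real archive"
    with open(archive, "wb") as out:
        out.write(payload)

    # Assumed sidecar layout: "<hex digest>  <filename>"; keep the first field.
    sidecar_text = hashlib.sha256(payload).hexdigest() + "  genome.tar.gz\n"
    expected = sidecar_text.split()[0]

    checksum_ok = sha256_matches(archive, expected)
    tampered_ok = sha256_matches(archive, "0" * 64)
```

Verifying before extraction means a truncated or corrupted download is rejected before any files are written.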

pymisha.gdb_create_linked ¶

gdb_create_linked(path, parent)

Create a linked database that reuses sequence data from a parent DB.

Creates a new DB root with a writable tracks/ directory and symlinks to the parent's seq/ directory and chrom_sizes.txt file.

PARAMETER	DESCRIPTION
`path`	Path for the new linked DB. TYPE: `str`
`parent`	Path to parent DB root. TYPE: `str`

RETURNS	DESCRIPTION
`bool`	`True` on success.

RAISES	DESCRIPTION
`FileNotFoundError`	If the parent database directory does not exist or is missing required files (`chrom_sizes.txt`, `seq/`).
`FileExistsError`	If the target path already exists.

See Also

gdb_create : Create a new database from FASTA files. gdataset_load : Load a dataset into the namespace. gdataset_ls : List loaded datasets.

Examples:

>>> import pymisha as pm
>>> pm.gdb_create_linked("~/my_tracks", parent="/shared/genomics/hg38")
True
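
The resulting on-disk layout can be reproduced with plain filesystem calls. `link_db` below is a hypothetical stand-in that follows the description above (a real `tracks/` directory plus symlinks to the parent's `seq/` and `chrom_sizes.txt`); the file contents are placeholders.

```python
import os
import tempfile


def link_db(path, parent):
    """Build the documented linked layout (illustrative only)."""
    parent = os.path.abspath(parent)
    for required in ("chrom_sizes.txt", "seq"):
        if not os.path.exists(os.path.join(parent, required)):
            raise FileNotFoundError(f"Parent is missing {required}")
    if os.path.exists(path):
        raise FileExistsError(path)
    os.makedirs(os.path.join(path, "tracks"))   # writable, owned by the child
    for shared in ("seq", "chrom_sizes.txt"):
        os.symlink(os.path.join(parent, shared), os.path.join(path, shared))
    return True


with tempfile.TemporaryDirectory() as tmp:
    parent = os.path.join(tmp, "hg38")
    os.makedirs(os.path.join(parent, "seq"))
    with open(os.path.join(parent, "chrom_sizes.txt"), "w") as out:
        out.write("chr1\t100\n")                # placeholder content

    child = os.path.join(tmp, "my_tracks")
    created = link_db(child, parent)
    seq_is_link = os.path.islink(os.path.join(child, "seq"))
    tracks_is_real_dir = (os.path.isdir(os.path.join(child, "tracks"))
                          and not os.path.islink(os.path.join(child, "tracks")))
    with open(os.path.join(child, "chrom_sizes.txt")) as handle:
        shared_sizes = handle.read()
```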

pymisha.gdb_convert_to_indexed ¶

gdb_convert_to_indexed(groot=None, remove_old_files=False, force=False, validate=True, convert_tracks=False, convert_intervals=False, verbose=False, chunk_size=104857600)

Convert a per-chromosome database to indexed genome format.

PARAMETER	DESCRIPTION
`groot`	Database root. If None, uses currently active DB. TYPE: `str` DEFAULT: `None`
`remove_old_files`	If True, remove old per-chromosome `.seq` files after conversion. TYPE: `bool` DEFAULT: `False`
`force`	Kept for parity with R API. Ignored in non-interactive Python flow. TYPE: `bool` DEFAULT: `False`
`validate`	If True, validates converted `genome.seq` against source files. TYPE: `bool` DEFAULT: `True`
`convert_tracks`	If True, converts all eligible tracks to indexed format. TYPE: `bool` DEFAULT: `False`
`convert_intervals`	If True, converts all eligible interval sets to indexed format. TYPE: `bool` DEFAULT: `False`
`verbose`	If True, prints conversion progress. TYPE: `bool` DEFAULT: `False`
`chunk_size`	I/O chunk size for reading sequence files. TYPE: `int` DEFAULT: `104857600`

RETURNS	DESCRIPTION
`None`

RAISES	DESCRIPTION
`ValueError`	If `chunk_size` is not positive, or no database is active and `groot` is not specified.
`FileNotFoundError`	If the database directory, `seq/` directory, or `chrom_sizes.txt` does not exist.

See Also

gdb_create : Create a new database from FASTA files. gdb_init : Initialize a database connection.

Examples:

Convert the currently active database to indexed format:

>>> import pymisha as pm
>>> pm.gdb_convert_to_indexed()

Convert a specific database with full options:

>>> pm.gdb_convert_to_indexed(
...     groot="/path/to/mydb",
...     convert_tracks=True,
...     convert_intervals=True,
...     remove_old_files=True,
...     verbose=True,
... )
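
The default `chunk_size` of 104857600 bytes is 100 MiB. To show what chunked conversion means in practice, the sketch below concatenates per-contig files into one file while recording byte offsets. It is not pymisha's converter: `concat_with_offsets` is hypothetical, the file names are placeholders, and the in-memory dictionary merely stands in for `genome.idx`, whose actual layout is not described here.

```python
import os
import tempfile


def concat_with_offsets(seq_files, out_path, chunk_size=104857600):
    """Copy each source file into out_path, returning {name: (offset, length)}."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    index = {}
    with open(out_path, "wb") as out:
        for name, path in seq_files:
            start = out.tell()
            with open(path, "rb") as src:
                while True:
                    chunk = src.read(chunk_size)   # bounded memory per read
                    if not chunk:
                        break
                    out.write(chunk)
            index[name] = (start, out.tell() - start)
    return index


with tempfile.TemporaryDirectory() as tmp:
    sources = []
    for name, bases in (("chr1", b"ACGT"), ("chr2", b"GGC")):
        path = os.path.join(tmp, name + ".seq")
        with open(path, "wb") as handle:
            handle.write(bases)
        sources.append((name, path))

    combined = os.path.join(tmp, "genome.seq")
    index = concat_with_offsets(sources, combined, chunk_size=2)

    # Validation step: read a contig back through the recorded offsets.
    with open(combined, "rb") as handle:
        data = handle.read()
    offset, length = index["chr2"]
    chr2_roundtrip = data[offset:offset + length]
```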

pymisha.gdb_get_readonly_attrs ¶

gdb_get_readonly_attrs()

Return read-only track attributes for the current database.

Returns the list of track attribute names that are protected from modification or deletion. If no attributes are marked as read-only, None is returned.

RETURNS	DESCRIPTION
`list[str] \| None`	List of read-only attribute names, or `None` when no read-only attributes are configured.

See Also

gdb_set_readonly_attrs : Set the list of read-only attributes.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> result = pm.gdb_get_readonly_attrs()
>>> result is None or isinstance(result, list)
True

pymisha.gdb_set_readonly_attrs ¶

gdb_set_readonly_attrs(attrs)

Set the list of read-only track attributes for the current database.

PARAMETER	DESCRIPTION
`attrs`	Attribute names to protect. Pass `None` to clear all read-only attributes. TYPE: `list[str] \| tuple[str] \| str \| None`

RETURNS	DESCRIPTION
`None`

RAISES	DESCRIPTION
`ValueError`	If an attribute name is empty or appears more than once.

See Also

gdb_get_readonly_attrs : Return the current read-only attributes.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdb_set_readonly_attrs(["created_by", "creation_date"])
>>> pm.gdb_set_readonly_attrs(None)
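
The accepted input types and the documented `ValueError` conditions suggest a normalization step along these lines. `normalize_readonly_attrs` is a hypothetical helper written for illustration; the library's own checks may differ in detail.

```python
def normalize_readonly_attrs(attrs):
    """Coerce the input to a list of unique, non-empty names (or None)."""
    if attrs is None:
        return None                     # clears the read-only list
    if isinstance(attrs, str):
        attrs = [attrs]                 # a single name becomes a one-item list
    names = list(attrs)
    for name in names:
        if not isinstance(name, str) or not name:
            raise ValueError("Attribute names must be non-empty strings")
    if len(set(names)) != len(names):
        raise ValueError("Attribute names must be unique")
    return names


single = normalize_readonly_attrs("created_by")
pair = normalize_readonly_attrs(("created_by", "creation_date"))
cleared = normalize_readonly_attrs(None)

try:
    normalize_readonly_attrs(["created_by", "created_by"])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

try:
    normalize_readonly_attrs(["created_by", ""])
    empty_rejected = False
except ValueError:
    empty_rejected = True
```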

pymisha.gdir_cwd ¶

gdir_cwd()

Return the current working directory in the genomic database.

The returned path is absolute. Unlike the shell's current working directory, it refers to the directory within the misha tracks tree that is used to resolve track and interval set names.

RETURNS	DESCRIPTION
`str`	Absolute path of the current working directory within the database.

See Also

gdir_cd : Change the current working directory. gdir_create : Create a new directory in the database. gdir_rm : Delete a directory from the database.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdir_cwd()
'...tracks'

pymisha.gdir_cd ¶

gdir_cd(dir)

Change the current working directory in the genomic database.

Changes the directory used for resolving track and interval set names. The list of database objects (tracks, intervals) is rescanned recursively under the new directory. Object names are updated relative to the new working directory. For example, a track named subdir.dense becomes dense once the working directory is set to subdir. All virtual tracks are cleared.

PARAMETER	DESCRIPTION
`dir`	Directory path (relative to current working directory, or ".."). TYPE: `str`

RETURNS	DESCRIPTION
`None`

See Also

gdir_cwd : Return the current working directory. gdir_create : Create a new directory in the database. gdir_rm : Delete a directory from the database.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdir_cd("subdir")
>>> pm.gdir_cd("..")
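
The renaming rule (a track named `subdir.dense` becomes `dense` inside `subdir`) is plain prefix stripping on dot-separated names. The sketch applies it to the example database's track list; `names_relative_to` is a hypothetical helper, not a pymisha function.

```python
def names_relative_to(track_names, subdir):
    """Return the track names visible from `subdir`, with the prefix removed."""
    prefix = subdir.strip("/").replace("/", ".") + "."
    return [name[len(prefix):] for name in track_names
            if name.startswith(prefix)]


# Track names as listed from the root of the example database.
root_view = ["array_track", "dense_track", "rects_track",
             "sparse_track", "subdir.dense_track2"]

subdir_view = names_relative_to(root_view, "subdir")
print(subdir_view)  # ['dense_track2']
```

Tracks outside the new working directory drop out of the list, which is why names should be re-queried after changing directory.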

pymisha.gdir_create ¶

gdir_create(dir, show_warnings=True)

Create a new directory in the genomic database.

Creates a single directory level under the current working directory. Only the last element in the specified path is created; recursive directory creation is not supported. A new directory cannot be created within an existing .track directory.

PARAMETER	DESCRIPTION
`dir`	Directory path relative to the current working directory. TYPE: `str`
`show_warnings`	If True, show warnings (currently unused; kept for R parity). TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`None`

RAISES	DESCRIPTION
`FileNotFoundError`	If the parent directory does not exist.
`ValueError`	If the target is inside a `.track` directory or the name ends with `.track`.

See Also

gdir_rm : Delete a directory from the database. gdir_cd : Change the current working directory. gdir_cwd : Return the current working directory.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdir_create("my_subdir")
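
The single-level rule matches the difference between `os.mkdir` and `os.makedirs`. The sketch below reproduces the documented behavior on a temporary directory; `create_single_dir` is a hypothetical stand-in, not the library's implementation.

```python
import os
import tempfile


def create_single_dir(cwd, rel_path):
    """Create only the last path element, rejecting .track components."""
    parts = os.path.normpath(rel_path).split(os.sep)
    if any(part.endswith(".track") for part in parts):
        raise ValueError("Cannot create a directory inside or named like a .track directory")
    target = os.path.join(cwd, rel_path)
    # os.mkdir creates one level and raises FileNotFoundError when the
    # parent is missing, unlike os.makedirs.
    os.mkdir(target)
    return target


with tempfile.TemporaryDirectory() as cwd:
    create_single_dir(cwd, "my_subdir")
    single_level_created = os.path.isdir(os.path.join(cwd, "my_subdir"))

    try:
        create_single_dir(cwd, "a/b")           # parent "a" does not exist
        nested_rejected = False
    except FileNotFoundError:
        nested_rejected = True

    try:
        create_single_dir(cwd, "x.track")
        track_rejected = False
    except ValueError:
        track_rejected = True
```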

pymisha.gdir_rm ¶

gdir_rm(dir, recursive=False, force=False)

Delete a directory from the genomic database.

If recursive is True, the directory is deleted with all files and subdirectories it contains. Cannot delete .track directories directly; use track-removal functions instead.

PARAMETER	DESCRIPTION
`dir`	Directory path relative to the current working directory. TYPE: `str`
`recursive`	If True, delete the directory and all its contents. TYPE: `bool` DEFAULT: `False`
`force`	If True, suppress errors for non-existent directories. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`None`

RAISES	DESCRIPTION
`FileNotFoundError`	If the directory does not exist and `force` is False.
`ValueError`	If the target is a `.track` directory.
`OSError`	If the directory is not empty and `recursive` is False.

See Also

gdir_create : Create a new directory in the database. gdir_cd : Change the current working directory. gdir_cwd : Return the current working directory.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdir_create("temp_dir")
>>> pm.gdir_rm("temp_dir")
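
The interaction of `recursive` and `force` with the documented exceptions can be sketched as follows. `remove_dir` is a hypothetical stand-in built on `os.rmdir` and `shutil.rmtree`; it operates on an ordinary temporary directory rather than a database.

```python
import os
import shutil
import tempfile


def remove_dir(target, recursive=False, force=False):
    """Delete a directory following the documented flag semantics."""
    if target.rstrip(os.sep).endswith(".track"):
        raise ValueError("Refusing to delete a .track directory")
    if not os.path.isdir(target):
        if force:
            return                      # missing directory is ignored
        raise FileNotFoundError(target)
    if recursive:
        shutil.rmtree(target)
    else:
        os.rmdir(target)                # raises OSError when not empty


with tempfile.TemporaryDirectory() as cwd:
    target = os.path.join(cwd, "temp_dir")
    os.mkdir(target)
    with open(os.path.join(target, "note.txt"), "w") as handle:
        handle.write("x")

    try:
        remove_dir(target)              # not empty, recursive=False
        non_empty_rejected = False
    except OSError:
        non_empty_rejected = True

    remove_dir(target, recursive=True)
    removed = not os.path.exists(target)

    remove_dir(target, force=True)      # already gone, silently ignored
    try:
        remove_dir(target)
        missing_reported = False
    except FileNotFoundError:
        missing_reported = True
```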

pymisha.gtrack_create_dirs ¶

gtrack_create_dirs(track, mode='0777')

Create the directory hierarchy needed for a dotted track name.

For example, gtrack_create_dirs("proj.sample.my_track") creates the directories proj and proj/sample under the current working directory. Use this function with caution -- a long track name may create a deep directory structure.

PARAMETER	DESCRIPTION
`track`	Track name with dot-separated namespace. TYPE: `str`
`mode`	Directory permissions (currently passed to os.mkdir). TYPE: `str` DEFAULT: `"0777"`

RETURNS	DESCRIPTION
`None`

See Also

gdir_create : Create a single directory in the database. gdir_cwd : Return the current working directory.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gtrack_create_dirs("proj.sample.my_track")
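
The mapping from a dotted track name to directories can be sketched with the standard library. `create_track_dirs` is a hypothetical stand-in; interpreting `mode` as an octal string is an assumption based on the `'0777'` default, and the effective permissions still depend on the process umask.

```python
import os
import tempfile


def create_track_dirs(cwd, track, mode="0777"):
    """Create one directory per namespace component, skipping the track leaf."""
    *namespace, _leaf = track.split(".")
    current = cwd
    created = []
    for part in namespace:
        current = os.path.join(current, part)
        if not os.path.isdir(current):
            os.mkdir(current, int(mode, 8))     # assumption: octal string
            created.append(os.path.relpath(current, cwd))
    return created


with tempfile.TemporaryDirectory() as cwd:
    created = create_track_dirs(cwd, "proj.sample.my_track")
    leaf_exists = os.path.exists(os.path.join(cwd, "proj", "sample", "my_track"))
    # A second call finds the directories already in place.
    created_again = create_track_dirs(cwd, "proj.sample.other_track")
```

Only the namespace components become directories; the final component is the track itself and is left for the track-creation function to write.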