Database

Functions for initializing, configuring, and managing genomic databases, including directory operations and genome creation.

pymisha.gdb_init

gdb_init(path: str, userpath: str = None)

Initialize connection to a misha genomic database.

Loads the genome database at the given path and makes it available for all subsequent genomic operations. Must be called before any other pymisha function that accesses track data.

PARAMETER DESCRIPTION
path

Path to the root directory of the genome database.

TYPE: str

userpath

Path to a user-writable database root. New tracks and interval sets will be created here. If None, defaults to path.

TYPE: str DEFAULT: None

RETURNS DESCRIPTION
None
See Also

  • gdb_reload : Refresh track lists after external changes.
  • gdb_unload : Disconnect from the database and clear all state.
  • gdb_info : Return metadata about the database.
  • gsetroot : Alternative entry point with directory validation.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()  # initializes a real test DB

pymisha.gsetroot

gsetroot(groot, subdir=None, rescan=False, **kwargs)

Set the database root directory with validation.

Connects to a genome database after verifying that the directory exists and contains the required tracks/ and seq/ subdirectories. This matches the R gsetroot() interface and is the recommended entry point when working interactively, since it provides clear error messages for invalid database paths.

PARAMETER DESCRIPTION
groot

Path to the genome database root directory.

TYPE: str

subdir

Sub-directory within tracks/ to use as working directory after initialization.

TYPE: str DEFAULT: None

dir

Backward-compatible alias for subdir.

TYPE: str

rescan

If True, force a rescan of the database after initialization. Equivalent to calling gdb_reload after gdb_init.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
None
RAISES DESCRIPTION
FileNotFoundError

If groot does not exist, or is missing the required tracks/ or seq/ subdirectories.
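
The directory check described above can be pictured with a small stand-alone sketch. It is not the pymisha implementation: `validate_db_root` is a hypothetical helper, and the error messages are illustrative only.

```python
import os
import tempfile


def validate_db_root(groot):
    """Hypothetical helper: raise FileNotFoundError unless groot has tracks/ and seq/."""
    if not os.path.isdir(groot):
        raise FileNotFoundError(f"Database root does not exist: {groot}")
    for required in ("tracks", "seq"):
        if not os.path.isdir(os.path.join(groot, required)):
            raise FileNotFoundError(f"Missing required subdirectory: {required}/")
    return os.path.abspath(groot)


# Exercise the check on a throwaway directory that starts without seq/.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "tracks"))
    try:
        validate_db_root(tmp)
    except FileNotFoundError as exc:
        first_error = str(exc)  # reports the missing seq/ directory
    os.mkdir(os.path.join(tmp, "seq"))
    validated = validate_db_root(tmp)  # now passes
```

Failing early with a specific message is what makes this style of entry point convenient for interactive use.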

See Also

  • gdb_init : Lower-level initializer without directory validation.
  • gdb_reload : Refresh track lists without re-initializing.

Examples:

>>> import pymisha as pm
>>> pm.gsetroot(pm.gdb_examples_path())

pymisha.gdb_reload

gdb_reload()

Reload the database, refreshing track lists and metadata.

Re-scans the database root directories for newly created or removed tracks and interval sets. Call this after external modifications to the database on disk (e.g., tracks created by R misha or another process).

RETURNS DESCRIPTION
None
RAISES DESCRIPTION
ValueError

If no database is currently initialized.

See Also

  • gdb_init : Initialize a database connection.
  • gdb_unload : Disconnect from the database entirely.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdb_reload()

pymisha.gdb_unload

gdb_unload()

Unload the database, clearing all state.

Disconnects from the currently active genome database and resets all internal state including the database root paths, working directory, datasets, and virtual tracks. After calling this function, a new gdb_init call is required before any genomic operations.

RETURNS DESCRIPTION
None
See Also

  • gdb_init : Initialize a new database connection.
  • gdb_reload : Refresh without disconnecting.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdb_unload()

pymisha.gdb_info

gdb_info(groot: str = None)

Return high-level information about a misha database.

Inspects a genome database directory and returns metadata including the storage format, number of chromosomes, total genome size, and a table of per-chromosome sizes. Can be used to validate a database path without fully initializing a connection.

PARAMETER DESCRIPTION
groot

Path to a database root directory. If None, uses the currently initialized database.

TYPE: str DEFAULT: None

RETURNS DESCRIPTION
dict

Dictionary with keys:

  • path (str) -- Resolved absolute path to the database.
  • is_db (bool) -- Whether the path is a valid misha database.
  • error (str) -- Present only when is_db is False; describes why validation failed.
  • format (str) -- "indexed" or "per-chromosome". Present only when is_db is True.
  • num_chromosomes (int) -- Number of chromosomes. Present only when is_db is True.
  • genome_size (int) -- Sum of all chromosome sizes. Present only when is_db is True.
  • chromosomes (pandas.DataFrame) -- Two-column table with chrom and size. Present only when is_db is True.
RAISES DESCRIPTION
ValueError

If groot is None and no database is currently initialized.

See Also

gdb_init : Initialize a database connection.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> info = pm.gdb_info()
>>> info["num_chromosomes"]
3
>>> info["genome_size"]
1000000
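
Because several keys are present only when is_db is True, code that consumes the returned dictionary should branch on is_db first. The sketch below shows one way to do that. `summarize_db_info` is a hypothetical helper, the two dictionaries are hand-written stand-ins for real gdb_info() results, and the chromosomes table is omitted to keep the example free of dependencies.

```python
def summarize_db_info(info):
    """Hypothetical helper: one-line summary of a gdb_info-style dictionary."""
    if not info["is_db"]:
        # "error" is present only when is_db is False.
        return f"{info['path']}: not a misha database ({info['error']})"
    # "format", "num_chromosomes" and "genome_size" are present only when is_db is True.
    return (
        f"{info['path']}: {info['format']} format, "
        f"{info['num_chromosomes']} chromosomes, {info['genome_size']} bp"
    )


# Hand-written stand-ins; the paths and error text are made up for illustration.
valid = {"path": "/data/db", "is_db": True, "format": "indexed",
         "num_chromosomes": 3, "genome_size": 1000000}
invalid = {"path": "/data/nope", "is_db": False, "error": "missing seq/ directory"}

valid_summary = summarize_db_info(valid)
invalid_summary = summarize_db_info(invalid)
```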

pymisha.gdb_examples_path

gdb_examples_path()

Return the path to the example database if available.

Checks the following locations in order:

  1. PYMISHA_EXAMPLES_DB environment variable
  2. pymisha/examples/trackdb/test (if packaged)
  3. tests/testdb/trackdb/test (repo checkout)

RETURNS DESCRIPTION
str

Absolute path to the example database root directory.

RAISES DESCRIPTION
FileNotFoundError

If the example database cannot be located in any of the searched locations.
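
The ordered lookup can be sketched as a first-match search. This is an illustration under assumptions, not the library code: `find_first_db` is a hypothetical helper, and falling through to the next candidate when the environment variable points at a missing directory is a choice made for this sketch; the documentation does not say how pymisha handles that case.

```python
import os
import tempfile


def find_first_db(candidates, env_var="PYMISHA_EXAMPLES_DB", environ=None):
    """Hypothetical helper: return the first existing directory, env override first."""
    environ = os.environ if environ is None else environ
    override = environ.get(env_var)
    ordered = ([override] if override else []) + list(candidates)
    for path in ordered:
        if os.path.isdir(path):
            return os.path.abspath(path)
    raise FileNotFoundError(f"Example database not found; searched: {ordered}")


# The first candidate is missing, so the second one is returned.
with tempfile.TemporaryDirectory() as tmp:
    found = find_first_db([os.path.join(tmp, "missing"), tmp], environ={})
    found_matches = os.path.samefile(found, tmp)
```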

See Also

  • gdb_init_examples : Initialize the example database.
  • gdb_init : Initialize a custom database.

Examples:

>>> import pymisha as pm
>>> path = pm.gdb_examples_path()
>>> path

pymisha.gdb_init_examples

gdb_init_examples(copy=True)

Initialize the example database (mirrors R's gdb.init_examples).

PARAMETER DESCRIPTION
copy

If True, copy the example DB into a temp dir before initializing. This avoids mutating the repo data when running examples.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
str

Path to the initialized example DB.

See Also

  • gdb_examples_path : Get the path to the example database.
  • gdb_init : Initialize a custom database.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gtrack_ls()
['array_track', 'dense_track', 'rects_track', 'sparse_track', 'subdir.dense_track2']

pymisha.gdb_create

gdb_create(groot, fasta, genes_file=None, annots_file=None, annots_names=None, db_format='indexed', verbose=False, **kwargs)

Create a new Genomic Database from FASTA file(s).

Creates the directory structure, imports sequences, and writes the chromosome sizes file. Two formats are supported:

  • "indexed" (default): Single genome.seq + genome.idx. Recommended for genomes with many contigs.
  • "per-chromosome": Separate .seq file per contig in the seq/ directory.
PARAMETER DESCRIPTION
groot

Path for the new database root directory.

TYPE: str

fasta

Path(s) to FASTA file(s). Gzipped files (.fa.gz) are supported.

TYPE: str or list of str

genes_file

Path to genes annotation file. Not yet implemented.

TYPE: str DEFAULT: None

annots_file

Path to annotations file. Not yet implemented.

TYPE: str DEFAULT: None

annots_names

Names for annotations. Not yet implemented.

TYPE: list of str DEFAULT: None

db_format

Database format: "indexed" or "per-chromosome".

TYPE: str DEFAULT: "indexed"

format

Backward-compatible alias for db_format.

TYPE: str

verbose

If True, print progress messages.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
DataFrame

DataFrame with columns name (contig name) and size (contig length in bases).

RAISES DESCRIPTION
FileExistsError

If the target directory already exists.

FileNotFoundError

If a FASTA file does not exist.

ValueError

If no contigs are found, duplicate contig names are detected, or an unsupported format is specified.
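
To make the returned name/size table and the ValueError conditions concrete, here is a minimal sketch of collecting contig lengths from FASTA text. It is not the pymisha importer. `contig_sizes` and `open_fasta` are hypothetical helpers, and taking the first whitespace-delimited token of the header as the contig name is an assumption of this sketch.

```python
import gzip
import io


def open_fasta(path):
    """Hypothetical helper: open plain or gzipped FASTA in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def contig_sizes(handle):
    """Hypothetical helper: return [(name, size), ...] in file order."""
    sizes = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("Empty contig name in FASTA header")
            name = header[0]  # assumption: name is the first header token
            if name in sizes:
                raise ValueError(f"Duplicate contig name: {name}")
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line)
    if not sizes:
        raise ValueError("No contigs found")
    return list(sizes.items())


records = contig_sizes(io.StringIO(">chr1 test\nACGT\nAC\n>chr2\nGGG\n"))
# records == [("chr1", 6), ("chr2", 3)]
```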

See Also

  • gdb_init : Initialize a database connection.
  • gdb_reload : Reload the current database.
  • gdb_create_genome : Download and initialize a prebuilt genome.
  • gdb_convert_to_indexed : Convert per-chromosome format to indexed.

Examples:

Create a database from a single FASTA file:

>>> import pymisha as pm
>>> contigs = pm.gdb_create("/tmp/mydb", "genome.fa.gz")

Create from multiple FASTA files:

>>> pm.gdb_create("/tmp/mydb", ["chr1.fa", "chr2.fa"], verbose=True)

Create a per-chromosome database:

>>> pm.gdb_create("/tmp/mydb", "genome.fa", db_format="per-chromosome")

pymisha.gdb_create_genome

gdb_create_genome(genome, path=None, tmpdir=None, verify_checksum=True)

Download and initialize a prebuilt genome database.

PARAMETER DESCRIPTION
genome

Genome identifier. Supported values: mm9, mm10, mm39, hg19, hg38.

TYPE: str

path

Directory to extract into. Defaults to current working directory.

TYPE: str DEFAULT: None

tmpdir

Directory to store the temporary downloaded archive. Defaults to tempfile.gettempdir().

TYPE: str DEFAULT: None

verify_checksum

If True, download and verify the archive SHA256 checksum from <archive_url>.sha256 before extraction.

TYPE: bool DEFAULT: True
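
The checksum step can be illustrated with the standard library. This is a sketch under assumptions: `sha256_of_file` and `verify_archive` are hypothetical helpers, the .sha256 file is assumed to follow the common "hex digest, then file name" layout, and raising ValueError on mismatch is a choice made here, since the documentation does not state which error pymisha raises.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large archives are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, checksum_text):
    """Hypothetical helper: compare a file against the first token of checksum_text."""
    expected = checksum_text.split()[0].lower()
    actual = sha256_of_file(path)
    if actual != expected:
        raise ValueError(f"Checksum mismatch: expected {expected}, got {actual}")
    return actual


with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "genome.tar.gz")  # placeholder file, not a real archive
    with open(archive, "wb") as fh:
        fh.write(b"not a real archive")
    good = hashlib.sha256(b"not a real archive").hexdigest()
    verified = verify_archive(archive, f"{good}  genome.tar.gz\n")
    try:
        verify_archive(archive, "0" * 64)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```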

RETURNS DESCRIPTION
None
RAISES DESCRIPTION
ValueError

If the genome identifier is not supported.

FileNotFoundError

If the downloaded archive does not contain the expected directory.

See Also

  • gdb_create : Create a database from local FASTA files.
  • gdb_init : Initialize a database connection.

Examples:

>>> import pymisha as pm
>>> pm.gdb_create_genome("hg38", path="/tmp")

pymisha.gdb_create_linked

gdb_create_linked(path, parent)

Create a linked database that reuses sequence data from a parent DB.

Creates a new DB root with a writable tracks/ directory and symlinks to the parent's seq/ directory and chrom_sizes.txt file.

PARAMETER DESCRIPTION
path

Path for the new linked DB.

TYPE: str

parent

Path to parent DB root.

TYPE: str

RETURNS DESCRIPTION
bool

True on success.

RAISES DESCRIPTION
FileNotFoundError

If the parent database directory does not exist or is missing required files (chrom_sizes.txt, seq/).

FileExistsError

If the target path already exists.

See Also

  • gdb_create : Create a new database from FASTA files.
  • gdataset_load : Load a dataset into the namespace.
  • gdataset_ls : List loaded datasets.

Examples:

>>> import pymisha as pm
>>> pm.gdb_create_linked("~/my_tracks", parent="/shared/genomics/hg38")
True
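
The resulting layout (a real tracks/ directory plus symlinks to the parent's seq/ and chrom_sizes.txt) can be reproduced with the standard library. `create_linked_layout` is a hypothetical helper written for this sketch on a POSIX filesystem; it is not the pymisha implementation.

```python
import os
import tempfile


def create_linked_layout(path, parent):
    """Hypothetical helper: build tracks/ and symlink seq/ and chrom_sizes.txt."""
    parent = os.path.abspath(parent)
    for required in ("chrom_sizes.txt", "seq"):
        if not os.path.exists(os.path.join(parent, required)):
            raise FileNotFoundError(f"Parent is missing {required}")
    if os.path.exists(path):
        raise FileExistsError(f"Target already exists: {path}")
    os.makedirs(os.path.join(path, "tracks"))  # writable, owned by the new DB
    os.symlink(os.path.join(parent, "seq"), os.path.join(path, "seq"),
               target_is_directory=True)
    os.symlink(os.path.join(parent, "chrom_sizes.txt"),
               os.path.join(path, "chrom_sizes.txt"))
    return True


with tempfile.TemporaryDirectory() as tmp:
    parent = os.path.join(tmp, "parent")
    os.makedirs(os.path.join(parent, "seq"))
    with open(os.path.join(parent, "chrom_sizes.txt"), "w") as fh:
        fh.write("chr1\t1000\n")
    linked = os.path.join(tmp, "linked")
    created = create_linked_layout(linked, parent)
    seq_is_link = os.path.islink(os.path.join(linked, "seq"))
    tracks_path = os.path.join(linked, "tracks")
    tracks_is_real_dir = os.path.isdir(tracks_path) and not os.path.islink(tracks_path)
    with open(os.path.join(linked, "chrom_sizes.txt")) as fh:
        sizes_text = fh.read()  # read through the symlink
```

Sharing sequence data through symlinks keeps the linked database small while new tracks stay separate from the parent.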

pymisha.gdb_convert_to_indexed

gdb_convert_to_indexed(groot=None, remove_old_files=False, force=False, validate=True, convert_tracks=False, convert_intervals=False, verbose=False, chunk_size=104857600)

Convert a per-chromosome database to indexed genome format.

PARAMETER DESCRIPTION
groot

Database root. If None, uses currently active DB.

TYPE: str DEFAULT: None

remove_old_files

If True, remove old per-chromosome *.seq files after conversion.

TYPE: bool DEFAULT: False

force

Kept for parity with R API. Ignored in non-interactive Python flow.

TYPE: bool DEFAULT: False

validate

If True, validates converted genome.seq against source files.

TYPE: bool DEFAULT: True

convert_tracks

If True, converts all eligible tracks to indexed format.

TYPE: bool DEFAULT: False

convert_intervals

If True, converts all eligible interval sets to indexed format.

TYPE: bool DEFAULT: False

verbose

If True, prints conversion progress.

TYPE: bool DEFAULT: False

chunk_size

I/O chunk size for reading sequence files.

TYPE: int DEFAULT: 104857600

RETURNS DESCRIPTION
None
RAISES DESCRIPTION
ValueError

If chunk_size is not positive, or if groot is None and no database is currently active.

FileNotFoundError

If the database directory, seq/ directory, or chrom_sizes.txt does not exist.

See Also

  • gdb_create : Create a new database from FASTA files.
  • gdb_init : Initialize a database connection.

Examples:

Convert a database at an explicit path to indexed format (omit groot to convert the currently active database):

>>> import pymisha as pm
>>> pm.gdb_convert_to_indexed(groot="/path/to/mydb")

Convert a specific database with full options:

>>> pm.gdb_convert_to_indexed(
...     groot="/path/to/mydb",
...     convert_tracks=True,
...     convert_intervals=True,
...     remove_old_files=True,
...     verbose=True,
... )
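
The role of chunk_size (the default of 104857600 bytes is 100 MiB) can be seen in a simplified sketch of merging per-contig sequence streams into one output while recording offsets. `concat_with_offsets` is a hypothetical helper, and the in-memory index it returns is only illustrative: the actual genome.idx layout is not described here.

```python
import io


def concat_with_offsets(sources, out, chunk_size=104857600):
    """Hypothetical helper: append each (name, stream) to out in chunks.

    Returns {name: (offset, length)} describing where each contig landed.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    index = {}
    offset = 0
    for name, stream in sources:
        length = 0
        while True:
            chunk = stream.read(chunk_size)  # at most chunk_size bytes in memory
            if not chunk:
                break
            out.write(chunk)
            length += len(chunk)
        index[name] = (offset, length)
        offset += length
    return index


out = io.BytesIO()
index = concat_with_offsets(
    [("chr1", io.BytesIO(b"ACGTAC")), ("chr2", io.BytesIO(b"GGG"))],
    out,
    chunk_size=4,  # tiny value so chr1 needs more than one read
)
# A validation pass can slice the merged data back out and compare with the source.
start, length = index["chr2"]
chr2_bytes = out.getvalue()[start:start + length]
```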

pymisha.gdb_get_readonly_attrs

gdb_get_readonly_attrs()

Return read-only track attributes for the current database.

Returns the list of track attribute names that are protected from modification or deletion. If no attributes are marked as read-only, None is returned.

RETURNS DESCRIPTION
list[str] | None

List of read-only attribute names, or None when no read-only attributes are configured.

See Also

gdb_set_readonly_attrs : Set the list of read-only attributes.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> result = pm.gdb_get_readonly_attrs()
>>> result is None or isinstance(result, list)
True

pymisha.gdb_set_readonly_attrs

gdb_set_readonly_attrs(attrs)

Set the list of read-only track attributes for the current database.

PARAMETER DESCRIPTION
attrs

Attribute names to protect. Pass None to clear all read-only attributes.

TYPE: list[str] | tuple[str] | str | None

RETURNS DESCRIPTION
None
RAISES DESCRIPTION
ValueError

If an attribute name is empty or appears more than once.
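
The accepted input types and the two error conditions can be sketched as a normalization step. `normalize_readonly_attrs` is a hypothetical helper, not pymisha code; treating an empty list the same as None is an assumption made here to match gdb_get_readonly_attrs returning None when nothing is configured.

```python
def normalize_readonly_attrs(attrs):
    """Hypothetical helper: coerce str/list/tuple/None to a validated list or None."""
    if attrs is None:
        return None
    names = [attrs] if isinstance(attrs, str) else list(attrs)
    seen = set()
    for name in names:
        if name == "":
            raise ValueError("Attribute name must not be empty")
        if name in seen:
            raise ValueError(f"Attribute appears more than once: {name}")
        seen.add(name)
    return names or None  # assumption: an empty list clears the setting


single = normalize_readonly_attrs("created_by")
several = normalize_readonly_attrs(("created_by", "creation_date"))
cleared = normalize_readonly_attrs(None)
```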

See Also

gdb_get_readonly_attrs : Return the current read-only attributes.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdb_set_readonly_attrs(["created_by", "creation_date"])
>>> pm.gdb_set_readonly_attrs(None)

pymisha.gdir_cwd

gdir_cwd()

Return the current working directory in the genomic database.

Returns the absolute path of the current working directory in the genomic database. This is not the shell's current working directory but the directory within the misha tracks tree used for resolving track and interval set names.

RETURNS DESCRIPTION
str

Absolute path of the current working directory within the database.

See Also

  • gdir_cd : Change the current working directory.
  • gdir_create : Create a new directory in the database.
  • gdir_rm : Delete a directory from the database.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdir_cwd()
'...tracks'

pymisha.gdir_cd

gdir_cd(dir)

Change the current working directory in the genomic database.

Changes the directory used for resolving track and interval set names. The list of database objects (tracks, intervals) is rescanned recursively under the new directory. Object names are updated relative to the new working directory. For example, a track named subdir.dense becomes dense once the working directory is set to subdir. All virtual tracks are cleared.
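
The renaming rule can be illustrated without a database. The sketch below assumes, based on the rescan being limited to the new directory, that tracks outside it drop out of the listing. `rescope_track_names` is a hypothetical helper and the track names are those of the example database.

```python
def rescope_track_names(names, subdir):
    """Hypothetical helper: names as seen after changing into subdir."""
    prefix = subdir.strip("/").replace("/", ".") + "."
    # Keep only tracks under subdir and drop the now-redundant prefix.
    return [name[len(prefix):] for name in names if name.startswith(prefix)]


all_tracks = ["array_track", "dense_track", "rects_track",
              "sparse_track", "subdir.dense_track2"]
in_subdir = rescope_track_names(all_tracks, "subdir")
# in_subdir == ["dense_track2"]
```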

PARAMETER DESCRIPTION
dir

Directory path (relative to current working directory, or "..").

TYPE: str

RETURNS DESCRIPTION
None
See Also

  • gdir_cwd : Return the current working directory.
  • gdir_create : Create a new directory in the database.
  • gdir_rm : Delete a directory from the database.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdir_cd("subdir")
>>> pm.gdir_cd("..")

pymisha.gdir_create

gdir_create(dir, show_warnings=True)

Create a new directory in the genomic database.

Creates a single directory level under the current working directory. Only the last element in the specified path is created; recursive directory creation is not supported. A new directory cannot be created within an existing .track directory.

PARAMETER DESCRIPTION
dir

Directory path relative to the current working directory.

TYPE: str

show_warnings

If True, show warnings (currently unused; kept for R parity).

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
None
RAISES DESCRIPTION
FileNotFoundError

If the parent directory does not exist.

ValueError

If the target is inside a .track directory or the name ends with .track.
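
The single-level rule and the .track restrictions can be sketched with os.mkdir, which likewise refuses to create missing parents. `create_single_dir` is a hypothetical helper operating on an ordinary directory; it only inspects the path components below the working directory and is not the pymisha implementation.

```python
import os
import tempfile


def create_single_dir(cwd, rel_dir):
    """Hypothetical helper: create exactly one directory level under cwd."""
    target = os.path.normpath(os.path.join(cwd, rel_dir))
    parts = os.path.relpath(target, cwd).split(os.sep)
    # Rejects both a new name ending in .track and a target inside a .track directory.
    if any(part.endswith(".track") for part in parts):
        raise ValueError(f"Cannot create a directory at or inside a .track path: {rel_dir}")
    if not os.path.isdir(os.path.dirname(target)):
        raise FileNotFoundError(f"Parent directory does not exist: {rel_dir}")
    os.mkdir(target)  # non-recursive by design
    return target


with tempfile.TemporaryDirectory() as cwd:
    made = os.path.isdir(create_single_dir(cwd, "my_subdir"))
    outcomes = {}
    for rel in ("missing_parent/child", "sample.track", "my_subdir/nested"):
        try:
            create_single_dir(cwd, rel)
            outcomes[rel] = "created"
        except (FileNotFoundError, ValueError) as exc:
            outcomes[rel] = type(exc).__name__
```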

See Also

  • gdir_rm : Delete a directory from the database.
  • gdir_cd : Change the current working directory.
  • gdir_cwd : Return the current working directory.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdir_create("my_subdir")

pymisha.gdir_rm

gdir_rm(dir, recursive=False, force=False)

Delete a directory from the genomic database.

If recursive is True, the directory is deleted with all files and subdirectories it contains. Cannot delete .track directories directly; use track-removal functions instead.

PARAMETER DESCRIPTION
dir

Directory path relative to the current working directory.

TYPE: str

recursive

If True, delete the directory and all its contents.

TYPE: bool DEFAULT: False

force

If True, suppress errors for non-existent directories.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
None
RAISES DESCRIPTION
FileNotFoundError

If the directory does not exist and force is False.

ValueError

If the target is a .track directory.

OSError

If the directory is not empty and recursive is False.
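
How recursive and force interact with the three error cases can be sketched on an ordinary directory tree. `remove_dir` is a hypothetical helper built on os.rmdir and shutil.rmtree; the order in which pymisha performs these checks is not documented, so the order below is an assumption.

```python
import os
import shutil
import tempfile


def remove_dir(cwd, rel_dir, recursive=False, force=False):
    """Hypothetical helper mirroring the documented error conditions."""
    target = os.path.normpath(os.path.join(cwd, rel_dir))
    if target.endswith(".track"):
        raise ValueError("Use track-removal functions for .track directories")
    if not os.path.isdir(target):
        if force:
            return  # silently ignore a missing directory
        raise FileNotFoundError(f"Directory does not exist: {rel_dir}")
    if recursive:
        shutil.rmtree(target)
    else:
        os.rmdir(target)  # raises OSError when the directory is not empty


with tempfile.TemporaryDirectory() as cwd:
    os.makedirs(os.path.join(cwd, "temp_dir", "inner"))
    try:
        remove_dir(cwd, "temp_dir")
        non_empty_error = None
    except OSError as exc:
        non_empty_error = exc
    remove_dir(cwd, "temp_dir", recursive=True)
    gone = not os.path.exists(os.path.join(cwd, "temp_dir"))
    remove_dir(cwd, "temp_dir", force=True)  # already gone; force suppresses the error
    try:
        remove_dir(cwd, "temp_dir")
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True
```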

See Also

  • gdir_create : Create a new directory in the database.
  • gdir_cd : Change the current working directory.
  • gdir_cwd : Return the current working directory.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gdir_create("temp_dir")
>>> pm.gdir_rm("temp_dir")

pymisha.gtrack_create_dirs

gtrack_create_dirs(track, mode='0777')

Create the directory hierarchy needed for a dotted track name.

For example, gtrack_create_dirs("proj.sample.my_track") creates the directories proj and proj/sample under the current working directory. Use this function with caution -- a long track name may create a deep directory structure.
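
The mapping from a dotted track name to directories can be sketched as follows. `dirs_for_track` and `create_track_dirs` are hypothetical helpers, and converting the mode string with int(mode, 8) is an assumption about how an octal string such as "0777" would be handed to os.mkdir. The process umask still applies to the resulting permissions.

```python
import os
import tempfile


def dirs_for_track(track):
    """Hypothetical helper: relative directories implied by a dotted track name."""
    parts = track.split(".")[:-1]  # the last component is the track itself
    return [os.path.join(*parts[: i + 1]) for i in range(len(parts))]


def create_track_dirs(cwd, track, mode="0777"):
    """Hypothetical helper: create each missing level, shallowest first."""
    for rel in dirs_for_track(track):
        path = os.path.join(cwd, rel)
        if not os.path.isdir(path):
            os.mkdir(path, int(mode, 8))  # assumption: octal string to int


with tempfile.TemporaryDirectory() as cwd:
    create_track_dirs(cwd, "proj.sample.my_track")
    created_ok = os.path.isdir(os.path.join(cwd, "proj", "sample"))
    # Only the namespace directories are created, never the track itself.
    leaf_absent = not os.path.exists(os.path.join(cwd, "proj", "sample", "my_track"))
```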

PARAMETER DESCRIPTION
track

Track name with dot-separated namespace.

TYPE: str

mode

Directory permissions (currently passed to os.mkdir).

TYPE: str DEFAULT: "0777"

RETURNS DESCRIPTION
None
See Also

  • gdir_create : Create a single directory in the database.
  • gdir_cwd : Return the current working directory.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> pm.gtrack_create_dirs("proj.sample.my_track")