Creating Genome Databases¶
This tutorial covers how to create a PyMisha genomic database from UCSC genome assemblies. PyMisha supports five commonly used genomes out of the box: hg19, hg38, mm9, mm10, and mm39.
Quick Start: Prebuilt Genomes¶
The easiest way to create a database is with pm.gdb_create_genome(). This downloads a prebuilt, ready-to-use database archive from a hosted repository and initializes it automatically.
import pymisha as pm
pm.gdb_create_genome("hg19") # Human, GRCh37
pm.gdb_create_genome("hg38") # Human, GRCh38
pm.gdb_create_genome("mm9") # Mouse, NCBI37
pm.gdb_create_genome("mm10") # Mouse, GRCm38
pm.gdb_create_genome("mm39") # Mouse, GRCm39
By default the database directory is created under the current working directory with the genome name (e.g., ./hg38/). You can control the location with the path parameter:
Checksum verification
By default, gdb_create_genome verifies a SHA-256 checksum after downloading the archive. You can disable this with verify_checksum=False if needed, but it is recommended to keep it enabled.
After gdb_create_genome completes, the database is already initialized -- you can start querying immediately:
pm.gdb_create_genome("hg38", path="/data/genomes")
pm.gchrom_sizes() # Shows chromosome names and sizes
Manual Database Creation from UCSC¶
If you need to create a database for a genome not covered by gdb_create_genome, or if you want to build from the latest UCSC files directly, you can use pm.gdb_create() with local FASTA files.
The general workflow is:
- Download chromosome FASTA files from UCSC
- Call
pm.gdb_create()with the paths to those files - Initialize the database with
pm.gdb_init()
FTP vs HTTPS
UCSC provides genome data via both FTP and HTTPS. The examples below use HTTPS URLs (https://hgdownload.soe.ucsc.edu/...), which work in most environments without requiring an FTP client.
Downloading Chromosome Files¶
You can download chromosome FASTA files using Python directly, or with command-line tools like wget or curl.
import urllib.request
from pathlib import Path
genome = "hg38"
base_url = f"https://hgdownload.soe.ucsc.edu/goldenPath/{genome}/chromosomes"
chroms = [f"chr{c}" for c in list(range(1, 23)) + ["X", "Y", "M"]]
download_dir = Path(f"{genome}_fasta")
download_dir.mkdir(exist_ok=True)
for chrom in chroms:
url = f"{base_url}/{chrom}.fa.gz"
dest = download_dir / f"{chrom}.fa.gz"
if not dest.exists():
print(f"Downloading {chrom}...")
urllib.request.urlretrieve(url, dest)
Building the Database¶
Once the FASTA files are downloaded, create the database:
import pymisha as pm
from pathlib import Path
genome = "hg38"
download_dir = Path(f"{genome}_fasta")
# Collect all chromosome FASTA files
chroms = [f"chr{c}" for c in list(range(1, 23)) + ["X", "Y", "M"]]
fasta_files = [str(download_dir / f"{chrom}.fa.gz") for chrom in chroms]
# Create the database
pm.gdb_create(genome, fasta_files, verbose=True)
# Initialize
pm.gdb_init(genome)
Database format
By default, gdb_create uses the "indexed" format, which stores all sequences in a single genome.seq file with an accompanying genome.idx index. This is the recommended format. You can also use db_format="per-chromosome" to store one .seq file per contig, which can then be converted later with pm.gdb_convert_to_indexed().
Genome-Specific Examples¶
Below are complete examples for each supported genome. The chromosome lists differ between human (1-22, X, Y, M) and mouse (1-19, X, Y, M).
hg19 (Human, GRCh37)¶
import pymisha as pm
from pathlib import Path
base_url = "https://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes"
chroms = [f"chr{c}" for c in list(range(1, 23)) + ["X", "Y", "M"]]
fasta_files = [f"{base_url}/{chrom}.fa.gz" for chrom in chroms]
# Download locally first, then create
# (see download examples above)
download_dir = Path("hg19_fasta")
local_files = [str(download_dir / f"{chrom}.fa.gz") for chrom in chroms]
pm.gdb_create("hg19", local_files)
pm.gdb_init("hg19")
UCSC data URLs:
- Chromosomes:
https://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/ - Gene annotations:
https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz - Gene cross-references:
https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/kgXref.txt.gz
hg38 (Human, GRCh38)¶
import pymisha as pm
from pathlib import Path
base_url = "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes"
chroms = [f"chr{c}" for c in list(range(1, 23)) + ["X", "Y", "M"]]
download_dir = Path("hg38_fasta")
local_files = [str(download_dir / f"{chrom}.fa.gz") for chrom in chroms]
pm.gdb_create("hg38", local_files)
pm.gdb_init("hg38")
UCSC data URLs:
- Chromosomes:
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/ - Gene annotations:
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/knownGene.txt.gz - Gene cross-references:
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/kgXref.txt.gz
mm9 (Mouse, NCBI37)¶
import pymisha as pm
from pathlib import Path
base_url = "https://hgdownload.soe.ucsc.edu/goldenPath/mm9/chromosomes"
chroms = [f"chr{c}" for c in list(range(1, 20)) + ["X", "Y", "M"]]
download_dir = Path("mm9_fasta")
local_files = [str(download_dir / f"{chrom}.fa.gz") for chrom in chroms]
pm.gdb_create("mm9", local_files)
pm.gdb_init("mm9")
UCSC data URLs:
- Chromosomes:
https://hgdownload.soe.ucsc.edu/goldenPath/mm9/chromosomes/ - Gene annotations:
https://hgdownload.soe.ucsc.edu/goldenPath/mm9/database/knownGene.txt.gz - Gene cross-references:
https://hgdownload.soe.ucsc.edu/goldenPath/mm9/database/kgXref.txt.gz
mm10 (Mouse, GRCm38)¶
import pymisha as pm
from pathlib import Path
base_url = "https://hgdownload.soe.ucsc.edu/goldenPath/mm10/chromosomes"
chroms = [f"chr{c}" for c in list(range(1, 20)) + ["X", "Y", "M"]]
download_dir = Path("mm10_fasta")
local_files = [str(download_dir / f"{chrom}.fa.gz") for chrom in chroms]
pm.gdb_create("mm10", local_files)
pm.gdb_init("mm10")
UCSC data URLs:
- Chromosomes:
https://hgdownload.soe.ucsc.edu/goldenPath/mm10/chromosomes/ - Gene annotations:
https://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/knownGene.txt.gz - Gene cross-references:
https://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/kgXref.txt.gz
mm39 (Mouse, GRCm39)¶
import pymisha as pm
from pathlib import Path
base_url = "https://hgdownload.soe.ucsc.edu/goldenPath/mm39/chromosomes"
chroms = [f"chr{c}" for c in list(range(1, 20)) + ["X", "Y", "M"]]
download_dir = Path("mm39_fasta")
local_files = [str(download_dir / f"{chrom}.fa.gz") for chrom in chroms]
pm.gdb_create("mm39", local_files)
pm.gdb_init("mm39")
UCSC data URLs:
- Chromosomes:
https://hgdownload.soe.ucsc.edu/goldenPath/mm39/chromosomes/ - Gene annotations:
https://hgdownload.soe.ucsc.edu/goldenPath/mm39/database/knownGene.txt.gz - Gene cross-references:
https://hgdownload.soe.ucsc.edu/goldenPath/mm39/database/kgXref.txt.gz
Gene Annotations¶
Installing gene annotations
The genes_file / annots_file / annots_names parameters of
pm.gdb_create() are accepted for API compatibility but are not acted on
(no gene import happens at gdb_create time). Install gene annotations
instead with pm.gdb_install_intervals() (into an existing genome), or
build a genome together with its gene sets in one step via
pm.gdb_build_genome(). Both populate the tss / exons / utr3 /
utr5 interval sets, each carrying name (transcript accession) and
geneName (gene symbol) columns.
Linked Databases¶
If you already have a genome database and want to create a separate workspace that shares the same sequence data, use pm.gdb_create_linked():
This creates a new database at ~/my_project with a writable tracks/ directory and symlinks to the parent's seq/ and chrom_sizes.txt. This is useful for maintaining separate track collections without duplicating large sequence files.
Converting to Indexed Format¶
If you have an older per-chromosome database (one .seq file per contig), you can convert it to the more efficient indexed format:
pm.gdb_convert_to_indexed(
groot="/path/to/mydb",
convert_tracks=True,
convert_intervals=True,
remove_old_files=True,
verbose=True,
)
API Reference¶
| Function | Description |
|---|---|
pm.gdb_create_genome() |
Download and initialize a prebuilt genome |
pm.gdb_create() |
Create a database from local FASTA files |
pm.gdb_create_linked() |
Create a linked database sharing sequence data |
pm.gdb_convert_to_indexed() |
Convert per-chromosome format to indexed |
pm.gdb_init() |
Initialize a database connection |