Starting with misha 5.3.0, databases can be stored in two formats: indexed and per-chromosome (legacy).
The indexed format provides better performance and scalability, especially for genomes with many contigs (>50 chromosomes).
The indexed format uses unified files:
Sequence data:
- seq/genome.seq: all chromosome sequences concatenated
- seq/genome.idx: index mapping chromosome names to positions
Track data:
- tracks/mytrack.track/track.dat: all chromosome data concatenated
- tracks/mytrack.track/track.idx: index with offset/length per chromosome
Advantages:
- Fewer file descriptors (important for genomes with 100+ contigs)
- Better performance for large workloads (14% faster)
- Smaller disk footprint
- Faster track creation and conversion
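To illustrate why one data file plus an index needs only a single open file handle, here is a minimal sketch in base R. It does not reproduce misha's actual on-disk format: the byte layout, the `write_indexed()` and `read_chrom()` helpers, and the toy sequences are all assumptions made for illustration.

```r
# Illustrative only: a toy "indexed" layout, not misha's real binary format.
# All helper names here are hypothetical.

# Concatenate per-chromosome byte vectors into one file and return an index
# holding each chromosome's byte offset and length.
write_indexed <- function(chroms, data_path) {
  lens <- lengths(chroms, use.names = FALSE)
  offsets <- cumsum(c(0, head(lens, -1)))
  con <- file(data_path, "wb")
  on.exit(close(con))
  for (bytes in chroms) writeBin(bytes, con)
  data.frame(chrom = names(chroms), offset = offsets, length = lens,
             stringsAsFactors = FALSE)
}

# Read one chromosome by seeking to its offset in the single data file.
read_chrom <- function(chrom, index, data_path) {
  row <- index[index$chrom == chrom, ]
  stopifnot(nrow(row) == 1)
  con <- file(data_path, "rb")
  on.exit(close(con))
  seek(con, where = row$offset, origin = "start")
  readBin(con, what = "raw", n = row$length)
}

chroms <- list(chr1 = charToRaw("ACGTACGT"),
               chr2 = charToRaw("GGCC"),
               chr3 = charToRaw("TTTAAA"))
data_path <- tempfile(fileext = ".dat")
index <- write_indexed(chroms, data_path)
chr2_seq <- rawToChar(read_chrom("chr2", index, data_path))
```

However many chromosomes the list holds, each read opens the same single file, which is the property the file-descriptor advantage relies on.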
The per-chromosome format uses separate files:
Sequence data:
- seq/chr1.seq, seq/chr2.seq, …: one file per chromosome
Track data:
- tracks/mytrack.track/chr1.track, chr2.track, …: one file per chromosome
When to use:
- Compatibility with older misha versions (<5.3.0)
- Small genomes (<25 chromosomes) where the performance difference is negligible
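The two layouts can be told apart by looking at the seq directory. The sketch below is a hypothetical base R helper, not part of misha; it assumes only the file names listed above, and the labels it returns are its own. For a real database, gdb.info() is the supported way to check.

```r
# Hypothetical helper: infer the format from the files described above.
# Assumes an indexed database has seq/genome.idx and a per-chromosome
# database has one or more seq/*.seq files other than genome.seq.
detect_db_format <- function(db_root) {
  seq_dir <- file.path(db_root, "seq")
  if (file.exists(file.path(seq_dir, "genome.idx"))) {
    return("indexed")
  }
  seq_files <- list.files(seq_dir, pattern = "\\.seq$")
  if (length(setdiff(seq_files, "genome.seq")) > 0) {
    return("per-chromosome")
  }
  NA_character_
}

# Try it on two throwaway directory trees that mimic each layout.
make_db <- function(files) {
  root <- tempfile("db_")
  dir.create(file.path(root, "seq"), recursive = TRUE)
  file.create(file.path(root, "seq", files))
  root
}
indexed_db <- make_db(c("genome.seq", "genome.idx"))
legacy_db <- make_db(c("chr1.seq", "chr2.seq"))
indexed_label <- detect_db_format(indexed_db)
legacy_label <- detect_db_format(legacy_db)
```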
By default, new databases use the indexed format:
# Create database from FASTA file
gdb.create("mydb", "/path/to/genome.fa")
# Or download pre-built genome
gdb.create_genome("hg38", path = "/path/to/install")
To create a database in legacy format (for compatibility with older misha):
# Set option before creating database
options(gmulticontig.indexed_format = FALSE)
gdb.create("mydb", "/path/to/genome.fa")
Use gdb.info() to check your database format:
Example output:
info <- gdb.info()
# $path
# [1] "/path/to/mydb"
#
# $is_db
# [1] TRUE
#
# $format
# [1] "indexed"
#
# $num_chromosomes
# [1] 24
#
# $genome_size
# [1] 3095693983
Convert all tracks and sequences to indexed format:
gsetroot("/path/to/mydb")
gdb.convert_to_indexed()
This will:
1. Convert sequence files (chr*.seq → genome.seq + genome.idx)
2. Convert all tracks to indexed format
3. Validate conversions
4. Remove old files after successful conversion
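The ordering of these steps matters: the old files are removed only after the new ones have been validated. The base R sketch below illustrates that pattern for the sequence files only. It is not misha's implementation; the `convert_dir_to_indexed()` helper, the tab-separated index file, and the byte-for-byte check are assumptions chosen for illustration.

```r
# Illustrative only: concatenate per-chromosome files, validate, then delete.
# File names follow the text; the index layout and helper are hypothetical.
convert_dir_to_indexed <- function(seq_dir) {
  old_files <- list.files(seq_dir, pattern = "^chr.*\\.seq$", full.names = TRUE)
  stopifnot(length(old_files) > 0)
  sizes <- file.size(old_files)
  index <- data.frame(
    chrom = sub("\\.seq$", "", basename(old_files)),
    offset = cumsum(c(0, head(sizes, -1))),
    length = sizes,
    stringsAsFactors = FALSE
  )
  data_path <- file.path(seq_dir, "genome.seq")
  idx_path <- file.path(seq_dir, "genome.idx")

  # Write the concatenated data file and the index.
  out <- file(data_path, "wb")
  for (f in old_files) writeBin(readBin(f, "raw", n = file.size(f)), out)
  close(out)
  write.table(index, idx_path, sep = "\t", quote = FALSE, row.names = FALSE)

  # Validate before touching the originals.
  combined <- readBin(data_path, "raw", n = file.size(data_path))
  for (i in seq_along(old_files)) {
    original <- readBin(old_files[i], "raw", n = sizes[i])
    slice <- combined[index$offset[i] + seq_len(index$length[i])]
    if (!identical(original, slice)) {
      unlink(c(data_path, idx_path))
      stop("Validation failed for ", index$chrom[i], "; originals kept")
    }
  }

  # Remove old files only after validation succeeded.
  unlink(old_files)
  index
}

# Toy run on a temporary directory with two small "chromosomes".
seq_dir <- tempfile("seq_")
dir.create(seq_dir)
writeBin(charToRaw("ACGT"), file.path(seq_dir, "chr1.seq"))
writeBin(charToRaw("GG"), file.path(seq_dir, "chr2.seq"))
index <- convert_dir_to_indexed(seq_dir)
remaining <- list.files(seq_dir)
```

If validation fails, the sketch deletes the half-written output and leaves the originals in place, so the directory is never left without a readable copy of the data.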
Convert specific tracks while keeping others in legacy format:
gtrack.convert_to_indexed("mytrack")
Note that 2D tracks cannot be converted to indexed format yet.
Convert interval sets to indexed format:
# 1D intervals
gintervals.convert_to_indexed("myintervals")
# 2D intervals
gintervals.2d.convert_to_indexed("my2dintervals")
High priority (significant benefits):
- Genomes with many contigs (>50 chromosomes)
- Large-scale analyses (frequent work on regions of 10M+ bp)
- 2D track workflows
- File descriptor limit issues
Medium priority (moderate benefits):
- Repeated extraction workflows
- Regular analyses on medium-sized regions (1-10M bp)
Low priority (minimal benefits):
- Small genomes (<25 chromosomes)
- One-off analyses
- Simple queries on small regions
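These rules of thumb can be written down as a small decision function. The helper below is hypothetical and encodes only the numeric thresholds and the two yes/no criteria from the lists above. How to rank genomes with 25 to 50 contigs is not stated in the text, so the sketch falls back to "medium" for them, which is an assumption.

```r
# Hypothetical helper encoding the rules of thumb above.
# The fallback to "medium" for 25 to 50 contigs is an assumption.
migration_priority <- function(n_contigs, typical_region_bp = 0,
                               uses_2d_tracks = FALSE,
                               hits_fd_limit = FALSE) {
  # Any single high-priority criterion is enough.
  if (n_contigs > 50 || typical_region_bp >= 10e6 ||
      uses_2d_tracks || hits_fd_limit) {
    return("high")
  }
  # Medium-sized regions (1-10M bp) benefit moderately.
  if (typical_region_bp >= 1e6) {
    return("medium")
  }
  # Small genomes with small queries see little benefit.
  if (n_contigs < 25) {
    return("low")
  }
  "medium"
}

small_genome <- migration_priority(24)
many_contigs <- migration_priority(2000)
medium_regions <- migration_priority(24, typical_region_bp = 5e6)
```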
Step 1: Backup (optional but recommended)
# Create backup of important database
system("cp -r /path/to/mydb /path/to/mydb.backup")
Step 2: Check current format
Step 3: Convert
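Steps 2 and 3 can be combined into one guarded call that converts only when needed. This is a minimal sketch: the `convert_if_legacy()` wrapper and its arguments are hypothetical, and it assumes only what this document shows, namely that gdb.info() returns a list whose format element is "indexed" for an indexed database and that gdb.convert_to_indexed() converts the current database.

```r
# Sketch only: the wrapper and its arguments are hypothetical. The defaults
# refer to misha's gdb.info() and gdb.convert_to_indexed(); they are looked
# up only when used, so misha must be loaded and gsetroot() called first.
convert_if_legacy <- function(info_fn = gdb.info,
                              convert_fn = gdb.convert_to_indexed) {
  before <- info_fn()$format
  if (identical(before, "indexed")) {
    message("Database is already indexed; nothing to do")
    return(invisible(FALSE))
  }
  message("Current format: ", before, "; converting")
  convert_fn()
  # Re-check so a silent failure does not go unnoticed.
  after <- info_fn()$format
  if (!identical(after, "indexed")) {
    stop("Conversion finished but format is still: ", after)
  }
  invisible(TRUE)
}

# Typical use (requires misha and a real database path):
# library(misha)
# gsetroot("/path/to/mydb")
# convert_if_legacy()
```

Passing the two functions as arguments keeps the wrapper easy to dry-run with stand-ins before pointing it at a real database.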
Step 4: Verify
# Check format changed
info <- gdb.info()
print(paste("New format:", info$format))
# Test a few operations
result <- gextract("mytrack", gintervals(1, 0, 1000))
print(head(result))
Step 5: Remove backup (after validation)
# After thorough testing
system("rm -rf /path/to/mydb.backup")
You can freely copy tracks between databases with different formats.
# Export from source database
gsetroot("/path/to/source_db")
gextract("mytrack", gintervals.all(),
  iterator = "mytrack",
  file = "/tmp/mytrack.txt"
)
# Import to target database (format auto-detected)
gsetroot("/path/to/target_db")
gtrack.import("mytrack", "Copied track", "/tmp/mytrack.txt", binsize = 0)
# Automatically converted to target database format!
# Copy multiple tracks
tracks <- c("track1", "track2", "track3")
for (track in tracks) {
  # Export from the source database
  gsetroot("/path/to/source_db")
  file_path <- sprintf("/tmp/%s.txt", track)
  # Read the description here, while the source database is still the root;
  # the track does not exist in the target database yet
  info <- gtrack.info(track)
  gextract(track, gintervals.all(), iterator = track, file = file_path)
  # Import into the target database
  gsetroot("/path/to/target_db")
  gtrack.import(track, info$description, file_path, binsize = 0)
  unlink(file_path)
}
Based on comprehensive benchmarks comparing indexed vs legacy formats:
# Work with both formats in same session
gsetroot("/path/to/legacy_db")
data1 <- gextract("track1", gintervals(1, 0, 1000))
gsetroot("/path/to/indexed_db")
data2 <- gextract("track2", gintervals(1, 0, 1000))
This occurs with many-contig genomes in legacy format:
Solution: Convert to indexed format
After manually copying track directories:
Solution: Reload database
Conversion needs 2x track size temporarily:
Solution: Free disk space or convert tracks individually
# Convert one track at a time
gtrack.convert_to_indexed("track1")
gtrack.convert_to_indexed("track2")
# etc.
- gdb.create_genome() for standard genomes
- gdb.create() with multi-FASTA for custom genomes
- gdb.info()
- gdb.convert_to_indexed()