Database Formats and Multi-Contig Support • misha

library(misha)

Overview

Starting with misha 5.3.0, databases can be stored in two formats:

Indexed format (default, recommended): Single unified files for sequences and tracks
Per-chromosome format (legacy): Separate files for each chromosome

The indexed format provides better performance and scalability, especially for genomes with many contigs (>50 chromosomes).

Key Features

Automatic format detection - misha automatically detects which format your database uses
Fully backward compatible - existing databases continue to work without modification
Transparent to users - same API for both formats
Migration tools - convert databases when convenient
Performance benefits - 4-14% faster for large-scale analyses

Database Formats

Indexed Format (Recommended)

The indexed format uses unified files:

Sequence data: - seq/genome.seq - All chromosome sequences concatenated - seq/genome.idx - Index mapping chromosome names to positions

Track data: - tracks/mytrack.track/track.dat - All chromosome data concatenated - tracks/mytrack.track/track.idx - Index with offset/length per chromosome

Advantages: - Fewer file descriptors (important for genomes with 100+ contigs) - Better performance for large workloads (14% faster) - Smaller disk footprint - Faster track creation and conversion

Per-Chromosome Format (Legacy)

The per-chromosome format uses separate files:

Sequence data: - seq/chr1.seq, seq/chr2.seq, … - One file per chromosome

Track data: - tracks/mytrack.track/chr1.track, chr2.track, … - One file per chromosome

When to use: - Compatibility with older misha versions (<5.3.0) - Small genomes (<25 chromosomes) where performance difference is negligible

Creating Databases

New Databases (Indexed Format)

By default, new databases use the indexed format:

# Create database from FASTA file
gdb.create("mydb", "/path/to/genome.fa")

# Or download pre-built genome
gdb.create_genome("hg38", path = "/path/to/install")

Force Legacy Format

To create a database in legacy format (for compatibility with older misha):

# Set option before creating database
options(gmulticontig.indexed_format = FALSE)
gdb.create("mydb", "/path/to/genome.fa")

Checking Database Format

Use gdb.info() to check your database format:

gsetroot("/path/to/mydb")
info <- gdb.info()
print(info$format) # "indexed" or "per-chromosome"

Example output:

info <- gdb.info()
# $path
# [1] "/path/to/mydb"
#
# $is_db
# [1] TRUE
#
# $format
# [1] "indexed"
#
# $num_chromosomes
# [1] 24
#
# $genome_size
# [1] 3095693983

Converting Databases

Convert Entire Database

Convert all tracks and sequences to indexed format:

gsetroot("/path/to/mydb")
gdb.convert_to_indexed()

This will: 1. Convert sequence files (chr*.seq → genome.seq + genome.idx) 2. Convert all tracks to indexed format 3. Validate conversions 4. Remove old files after successful conversion

Convert Individual Tracks

Convert specific tracks while keeping others in legacy format:

gtrack.convert_to_indexed("mytrack")

Note that 2D tracks cannot be converted to indexed format yet.

Convert Intervals

Convert interval sets to indexed format:

# 1D intervals
gintervals.convert_to_indexed("myintervals")

# 2D intervals
gintervals.2d.convert_to_indexed("my2dintervals")

Migration Guide

When to Migrate

High priority (significant benefits): - Genomes with many contigs (>50 chromosomes) - Large-scale analyses (10M+ bp regions frequently) - 2D track workflows - File descriptor limit issues

Medium priority (moderate benefits): - Repeated extraction workflows - Regular analyses on medium-sized regions (1-10M bp)

Low priority (minimal benefits): - Small genomes (<25 chromosomes) - One-off analyses - Simple queries on small regions

Migration Workflow

Step 1: Backup (optional but recommended)

# Create backup of important database
system("cp -r /path/to/mydb /path/to/mydb.backup")

Step 2: Check current format

gsetroot("/path/to/mydb")
info <- gdb.info()
print(paste("Current format:", info$format))

Step 3: Convert

gdb.convert_to_indexed()

Step 4: Verify

# Check format changed
info <- gdb.info()
print(paste("New format:", info$format))

# Test a few operations
result <- gextract("mytrack", gintervals(1, 0, 1000))
print(head(result))

Step 5: Remove backup (after validation)

# After thorough testing
system("rm -rf /path/to/mydb.backup")

Copying Tracks Between Databases

You can freely copy tracks between databases with different formats.

Method 1: Export and Import

# Export from source database
gsetroot("/path/to/source_db")
gextract("mytrack", gintervals.all(),
    iterator = "mytrack",
    file = "/tmp/mytrack.txt"
)

# Import to target database (format auto-detected)
gsetroot("/path/to/target_db")
gtrack.import("mytrack", "Copied track", "/tmp/mytrack.txt", binsize = 0)
# Automatically converted to target database format!

Method 2: Batch Copy

# Copy multiple tracks
tracks <- c("track1", "track2", "track3")

for (track in tracks) {
    # Export
    gsetroot("/path/to/source_db")
    file_path <- sprintf("/tmp/%s.txt", track)
    gextract(track, gintervals.all(), iterator = track, file = file_path)

    # Import
    gsetroot("/path/to/target_db")
    info <- gtrack.info(track) # Get description
    gtrack.import(track, info$description, file_path, binsize = 0)
    unlink(file_path)
}

Performance Comparison

Based on comprehensive benchmarks comparing indexed vs legacy formats:

Operations Faster with Indexed Format

Very large workloads (10M+ bp): 14% faster
2D track operations: 2% faster (was 36% slower in legacy)
Repeated extractions: 14% faster
Real workflows: 8% faster on average

Operations Similar Performance

Single chromosome extraction: Within 5%
Multi-chromosome (10-22 chr): Within 1%
PWM operations: Within 3%
Small workloads (<1M bp): Within 10%

Summary

64% of operations are faster with indexed format
93% within 5% of legacy format (statistically equal)
Average: 4% faster across all benchmarks
No regressions for common use cases

Backward Compatibility

Fully Compatible

✅ Existing databases work without modification
✅ Existing scripts work without changes
✅ Same API for both formats
✅ Automatic format detection
✅ Can mix formats in same analysis

Example: Mixed Environment

# Work with both formats in same session
gsetroot("/path/to/legacy_db")
data1 <- gextract("track1", gintervals(1, 0, 1000))

gsetroot("/path/to/indexed_db")
data2 <- gextract("track2", gintervals(1, 0, 1000))

Troubleshooting

“File descriptor limit reached”

This occurs with many-contig genomes in legacy format:

Solution: Convert to indexed format

gdb.convert_to_indexed()

“Track not found after copying files”

After manually copying track directories:

Solution: Reload database

gdb.reload()

“Conversion fails with disk space error”

Conversion needs 2x track size temporarily:

Solution: Free disk space or convert tracks individually

# Convert one track at a time
gtrack.convert_to_indexed("track1")
gtrack.convert_to_indexed("track2")
# etc.

Best Practices

For New Projects

Use indexed format (default) for new databases
Use gdb.create_genome() for standard genomes
Use gdb.create() with multi-FASTA for custom genomes

For Existing Projects

Check format with gdb.info()
Migrate if beneficial (many contigs, large analyses)
No rush - legacy format remains fully supported

Summary

Indexed format is the default and recommended for new databases
Legacy format remains fully supported - no forced migration
Automatic format detection - users don’t need to think about it
Significant performance benefits for large-scale analyses
Easy migration with gdb.convert_to_indexed()
Fully backward compatible - existing code works unchanged