misha database from UCSC
The easiest way to create a misha database is to use the
gdb.create_genome function:
gdb.create_genome("hg19") # creates a database for the hg19 genome
gdb.create_genome("hg38") # creates a database for the hg38 genome
gdb.create_genome("mm10") # creates a database for the mm10 genome
gdb.create_genome("mm9") # creates a database for the mm9 genome
gdb.create_genome("mm39") # creates a database for the mm39 genome

However, if you need to create a database for a genome that is not
supported by gdb.create_genome, or if you want to make sure
that the database is created from the latest version of the genome at
UCSC, you can create it manually using the commands below.
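The manual recipes in this section all use the same UCSC directory layout and differ only in the genome name, the number of autosomes, and the annotation column names. As a minimal sketch of that pattern, the helper below builds the three sets of download URLs. The function name ucsc_genome_urls and its arguments are hypothetical and not part of misha; the URL layout is copied from the commands in this section.

```r
# Hypothetical helper (not part of misha): builds the UCSC download URLs
# used by the manual recipes in this section.
ucsc_genome_urls <- function(genome, n_autosomes,
                             base = "ftp://hgdownload.soe.ucsc.edu/goldenPath") {
    root <- paste(base, genome, sep = "/")
    # Autosomes 1..n plus X, Y and the mitochondrial chromosome
    chroms <- paste0("chr", c(seq_len(n_autosomes), "X", "Y", "M"))
    list(
        fasta = paste(root, "chromosomes", paste0(chroms, ".fa.gz"), sep = "/"),
        genes = paste(root, "database/knownGene.txt.gz", sep = "/"),
        annots = paste(root, "database/kgXref.txt.gz", sep = "/")
    )
}

urls <- ucsc_genome_urls("mm10", n_autosomes = 19)
length(urls$fasta) # one FASTA URL per chromosome
urls$genes
```

The three list elements correspond, in order, to the second, third, and fourth arguments of the gdb.create calls shown in this section; the annotation column names still have to be supplied per genome.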
To create a misha database for the hg19 genome, run the following commands (assuming “hg19” is the path of your new database):
ftp <- "ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19"
gdb.create(
    "hg19",
    # per-chromosome FASTA files
    paste(ftp, "chromosomes", paste0("chr", c(1:22, "X", "Y", "M"), ".fa.gz"), sep = "/"),
    # gene definitions
    paste(ftp, "database/knownGene.txt.gz", sep = "/"),
    # gene annotation table and the names of its columns
    paste(ftp, "database/kgXref.txt.gz", sep = "/"),
    c(
        "kgID", "mRNA", "spID", "spDisplayID", "geneSymbol",
        "refseq", "protAcc", "description", "rfamAcc",
        "tRnaName"
    )
)
gdb.init("hg19")

To create a misha database for the hg38 genome, run the following commands (assuming “hg38” is the path of your new database):
ftp <- "ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38"
gdb.create(
    "hg38",
    # per-chromosome FASTA files
    paste(ftp, "chromosomes", paste0("chr", c(1:22, "X", "Y", "M"), ".fa.gz"), sep = "/"),
    # gene definitions
    paste(ftp, "database/knownGene.txt.gz", sep = "/"),
    # gene annotation table and the names of its columns
    paste(ftp, "database/kgXref.txt.gz", sep = "/"),
    c(
        "kgID", "mRNA", "spID", "spDisplayID", "geneSymbol",
        "refseq", "protAcc", "description", "rfamAcc",
        "tRnaName"
    )
)
gdb.init("hg38")

To create a misha database for the mm9 genome, run the following commands (assuming “mm9” is the path of your new database):
ftp <- "ftp://hgdownload.soe.ucsc.edu/goldenPath/mm9"
gdb.create(
    "mm9",
    # per-chromosome FASTA files (mouse has 19 autosomes)
    paste(ftp, "chromosomes", paste0("chr", c(1:19, "X", "Y", "M"), ".fa.gz"), sep = "/"),
    # gene definitions
    paste(ftp, "database/knownGene.txt.gz", sep = "/"),
    # gene annotation table and the names of its columns
    paste(ftp, "database/kgXref.txt.gz", sep = "/"),
    c(
        "kgID", "mRNA", "spID", "spDisplayID", "geneSymbol",
        "refseq", "protAcc", "description"
    )
)
gdb.init("mm9")

To create a misha database for the mm10 genome, run the following commands (assuming “mm10” is the path of your new database):
ftp <- "ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10"
gdb.create(
    "mm10",
    # per-chromosome FASTA files (mouse has 19 autosomes)
    paste(ftp, "chromosomes", paste0("chr", c(1:19, "X", "Y", "M"), ".fa.gz"), sep = "/"),
    # gene definitions
    paste(ftp, "database/knownGene.txt.gz", sep = "/"),
    # gene annotation table and the names of its columns
    paste(ftp, "database/kgXref.txt.gz", sep = "/"),
    c(
        "kgID", "mRNA", "spID", "spDisplayID", "geneSymbol",
        "refseq", "protAcc", "description", "rfamAcc",
        "tRnaName"
    )
)
gdb.init("mm10")

To create a misha database for the mm39 genome, run the following commands (assuming “mm39” is the path of your new database):
ftp <- "ftp://hgdownload.soe.ucsc.edu/goldenPath/mm39"
gdb.create(
    "mm39",
    # per-chromosome FASTA files (mouse has 19 autosomes)
    paste(ftp, "chromosomes", paste0("chr", c(1:19, "X", "Y", "M"), ".fa.gz"), sep = "/"),
    # gene definitions
    paste(ftp, "database/knownGene.txt.gz", sep = "/"),
    # gene annotation table and the names of its columns
    paste(ftp, "database/kgXref.txt.gz", sep = "/"),
    c(
        "kgID", "mRNA", "spID", "spDisplayID", "geneSymbol",
        "refseq", "protAcc", "description", "rfamAcc",
        "tRnaName"
    )
)
gdb.init("mm39")

UCSC mammal hubs host 224 of the 241 Cactus / Zoonomia mammalian genomes under a uniform per-species directory layout, including chromAlias, RepeatMasker output, and one or more GTFs per species. Build a misha groot from a hub assembly with:
# One-shot. Default sets = c("genes", "rmsk", "cgi", "cytoband"); cytoband and
# (often) cgi are skipped with a warning since hubs don't ship them.
gdb.build_genome("GCA_004023825.1",
    path = "/genomes/arctic_fox",
    prefix = "intervs.global."
)

If you already have a private FASTA build for the same accession (or a HAL-derived assembly) and just want to layer canonical annotations on it:
gdb.install_intervals(
    groot = "/genomes/my_private_assembly",
    source = list(source = "ucsc-hub", accession = "GCA_004023825.1"),
    sets = c("genes", "rmsk"),
    prefix = "intervs.global."
)

If your groot was built with a different chromosome-naming scheme
than the hub uses (for example, Cactus-derived names), the hub’s
chromAlias.txt is consulted to translate annotation chroms
into the groot’s canonical names. The translation is strict: the chosen alias column
must contain every groot chrom and every distinct
chrom in the source file; otherwise the install errors with a diagnostic
per-column overlap table.
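To make the strict check concrete, here is a small base-R sketch of one way such a translation could work. It is not misha's implementation. It assumes the alias table is a data frame with one column per naming scheme and one row per chromosome, and it reads "strict" as: some column must cover every groot chrom and some column must cover every distinct source chrom. The function name translate_chroms, the column names, and the accession-style names in the example are all hypothetical.

```r
# Illustrative sketch only; NOT misha's implementation.
# `alias`: data frame, one column per naming scheme, one row per chromosome.
translate_chroms <- function(alias, groot_chroms, source_chroms) {
    src <- unique(source_chroms)
    # TRUE for each column that contains every name in x
    covers <- function(x) vapply(alias, function(col) all(x %in% col), logical(1))
    groot_col <- names(alias)[covers(groot_chroms)]
    source_col <- names(alias)[covers(src)]
    if (length(groot_col) == 0 || length(source_col) == 0) {
        # Per-column overlap table to show how close each column came
        overlap <- data.frame(
            column = names(alias),
            groot_found = vapply(alias, function(col) sum(groot_chroms %in% col), integer(1)),
            source_found = vapply(alias, function(col) sum(src %in% col), integer(1)),
            row.names = NULL
        )
        stop(
            "no alias column covers all chromosomes:\n",
            paste(utils::capture.output(print(overlap)), collapse = "\n")
        )
    }
    # Row-wise lookup: source name -> row -> groot name
    idx <- match(source_chroms, alias[[source_col[1]]])
    alias[[groot_col[1]]][idx]
}

# Made-up alias table for illustration
alias <- data.frame(
    ucsc = c("chr1", "chr2", "chrM"),
    genbank = c("CM000001.1", "CM000002.1", "CM000003.1"),
    stringsAsFactors = FALSE
)
translate_chroms(
    alias,
    groot_chroms = c("chr1", "chr2", "chrM"),
    source_chroms = c("CM000002.1", "CM000001.1", "CM000002.1")
)
```

With this table the call returns the groot names in the order of the input. Any groot or source name missing from every column triggers the error branch instead of a partial translation.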
To convert hub GTFs, the install needs UCSC’s
gtfToGenePred binary (and gff3ToGenePred for
NCBI Datasets sources). On the first call the package prompts for
consent to download UCSC’s prebuilt binaries (~5 MB each). To
pre-install in non-interactive contexts:
The prebuilt binaries require glibc ≥ 2.34. On older systems (e.g. CentOS 8) the download will succeed but the binary won’t run. Workaround:
Then point misha at the env’s binaries:
Sys.setenv(MISHA_GFF3_TO_GENEPRED = "/path/to/env/bin/gff3ToGenePred")
Sys.setenv(MISHA_GTF_TO_GENEPRED = "/path/to/env/bin/gtfToGenePred")
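After setting these variables, it can help to confirm that each one points at a real, executable file. The base-R sketch below does only that; the helper name check_converter_paths is hypothetical and not part of misha. It does not test whether the binary actually runs against the system's glibc.

```r
# Hypothetical helper (not part of misha): reports whether each configured
# converter path exists and has the executable bit set.
check_converter_paths <- function(vars = c(
                                      "MISHA_GTF_TO_GENEPRED",
                                      "MISHA_GFF3_TO_GENEPRED"
                                  )) {
    # Unset variables come back as empty strings
    paths <- unname(Sys.getenv(vars, unset = ""))
    is_set <- nzchar(paths)
    data.frame(
        variable = vars,
        path = paths,
        exists = is_set & file.exists(paths),
        # file.access() returns 0 when the requested access is permitted
        executable = is_set & file.access(paths, mode = 1) == 0,
        row.names = NULL,
        stringsAsFactors = FALSE
    )
}

check_converter_paths()
```

A row with exists = FALSE usually means a typo in the path, while exists = TRUE with executable = FALSE points to a permissions problem.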