Builds a misha genomic database for a named assembly. Resolves the name
through the registry chain (or by pattern fallback for GC[FA]_*
accessions), downloads the FASTA, calls gdb.create to build
the sequence-only groot, then dispatches to gdb.install_intervals
for the requested sets.

gdb.build_genome(
name,
path = name,
registry = NULL,
sets = c("genes", "rmsk", "cgi", "cytoband"),
prefix = "",
gene_sets = c(tss = "tss", exons = "exons", utr3 = "utr3", utr5 = "utr5"),
gtf_priority = c("ncbiRefSeq", "bestRefSeq", "ensGene", "augustus", "xenoRefGene"),
chrom_naming = NULL,
target_chroms = NULL,
target_lengths = NULL,
min_coverage = 1,
match_by_length = TRUE,
format = NULL,
verbose = TRUE
)

name: Genome name (registry key, alias, or GC[FA]_* accession).
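
As a rough illustration of the pattern fallback, the sketch below tests whether a name has the shape of an NCBI assembly accession. The helper looks_like_accession and its regular expression are assumptions made for illustration; the pattern misha actually applies may differ.

```r
# Hypothetical helper, not part of misha: TRUE when `name` has the shape of an
# NCBI assembly accession (GCA_ or GCF_, digits, a dot, then a version number).
looks_like_accession <- function(name) {
  grepl("^GC[FA]_[0-9]+\\.[0-9]+$", name)
}

accession_check <- looks_like_accession(
  c("hg38", "GCA_004023825.1", "GCF_000298355.1")
)
print(accession_check)
```

Names that fail such a check would presumably have to resolve through the registry chain instead.
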
path: Output directory; must not exist.
registry: Optional path to an explicit registry YAML.
sets: Subset of c("genes", "rmsk", "cgi", "cytoband").
An empty vector, character(0), gives a sequence-only build.
prefix: Character scalar prepended to set names (see
gdb.install_intervals).
gene_sets: Named character vector mapping the four
gintervals.import_genes() roles to on-disk set names; NA skips
a role.
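
The interaction between the set-name prefix and the gene-role mapping can be sketched in base R. The helper resolve_gene_set_names is hypothetical, and it assumes the prefix is simply pasted in front of each mapped name; the real logic lives in gdb.install_intervals.

```r
# Hypothetical helper, not part of misha: combine `prefix` with the `gene_sets`
# mapping, dropping roles whose set name is NA.
resolve_gene_set_names <- function(gene_sets, prefix = "") {
  kept <- gene_sets[!is.na(gene_sets)]
  if (length(kept) == 0) {
    return(character(0))
  }
  # paste0() drops names, so restore the role names afterwards.
  stats::setNames(paste0(prefix, kept), names(kept))
}

gene_set_names <- resolve_gene_set_names(
  c(tss = "tss", exons = "exons", utr3 = NA, utr5 = "utr5"),
  prefix = "intervs.global."
)
print(gene_set_names)
```

Here the utr3 role is skipped because its entry is NA, and the remaining roles receive the prefix.
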
gtf_priority: Character vector ordering GTF source preference.
chrom_naming: Optional override for the recipe's chrom_naming.
Selects which namespace the canonical chrom names should come from. For
ucsc-hub: any chromAlias column ("ucsc", "genbank",
"refseq", "ncbi"), plus the friendly aliases
"sequence_name" (= "assembly") and "accession" (keep
the FASTA's source column). For ncbi: "sequence_name" (default),
"ucsc", or "accession". NULL (default) keeps whatever
the recipe specifies.
target_chroms: Optional character vector of chrom names the resulting
groot should align to (typically taken from the output of halStats
--sequenceStats, i.e. the chrom names in a HAL file you intend to lift over
against). When supplied, misha auto-picks the chromAlias column whose
values best cover target_chroms and uses that column as the
canonical naming, overriding chrom_naming. Honored only by the
ucsc-hub backend; supplying it with any other source is an error
(raised before any download).
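
The column auto-pick can be pictured with a toy chromAlias table. This is a minimal sketch under stated assumptions: the table values are made up, and scoring each column by the fraction of targets it contains is one plausible reading of "cover best", not misha's confirmed implementation.

```r
# Toy chromAlias table (made-up values): one row per sequence, one column per
# naming scheme. NA marks a sequence with no name in that scheme.
chrom_alias <- data.frame(
  ucsc    = c("chr1", "chr2", "chrM"),
  genbank = c("CM000001.1", "CM000002.1", NA),
  refseq  = c("NC_000001.1", "NC_000002.1", "NC_000003.1"),
  stringsAsFactors = FALSE
)

# Names the groot should align to, e.g. taken from a HAL file.
target_chroms <- c("CM000001.1", "CM000002.1")

# Fraction of targets found in each column; the highest-scoring column wins.
coverage <- vapply(
  chrom_alias,
  function(col) mean(target_chroms %in% col),
  numeric(1)
)
best_column <- names(coverage)[which.max(coverage)]
print(coverage)
print(best_column)
```
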
target_lengths: Optional numeric vector aligned with
target_chroms (typically the second field of
halStats --sequenceStats). When supplied alongside
target_chroms and with match_by_length = TRUE, this is
the strong-guarantee path: misha force-aligns the hub FASTA to
target_chroms, placing every target on its chromAlias row either by
name match across columns or by unique-on-both-sides length pairing. If
any target can't be placed, the build errors in the pre-flight check,
before the multi-GB FASTA download. On success, the resulting groot names
every placed target exactly as given in target_chroms; alias rows not in
target_chroms keep their original FASTA-header accession.
Honored only by the ucsc-hub backend.
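
Unique-on-both-sides length pairing can be sketched as follows. The helper pair_by_unique_length and all names and lengths in the example are made up for illustration; the sketch only shows the idea that a length is usable for pairing when it occurs exactly once among the targets and exactly once among the alias rows.

```r
# Hypothetical helper, not part of misha: pair targets with alias rows by
# length, but only for lengths that occur exactly once on each side.
pair_by_unique_length <- function(target_chroms, target_lengths,
                                  alias_names, alias_lengths) {
  # Keep only the values that appear exactly once in x.
  once <- function(x) x[!(x %in% x[duplicated(x)])]
  shared <- intersect(once(target_lengths), once(alias_lengths))
  data.frame(
    target = target_chroms[match(shared, target_lengths)],
    alias  = alias_names[match(shared, alias_lengths)],
    length = shared,
    stringsAsFactors = FALSE
  )
}

length_pairs <- pair_by_unique_length(
  target_chroms  = c("JH880237.1", "JH880238.1", "JH880239.1"),
  target_lengths = c(5000, 1200, 1200),
  alias_names    = c("scaffold_1", "scaffold_2", "scaffold_3"),
  alias_lengths  = c(5000, 1200, 800)
)
print(length_pairs)
```

In this toy input only the length 5000 is unique on both sides, so the two targets sharing length 1200 stay unplaced by length; in the strong-guarantee path described above, such targets would need a name match or the build would error.
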
min_coverage: Minimum fraction of groot chroms that must appear in a
chromAlias column for that column to be picked as canonical (forwarded to
gdb.install_intervals). Default 1.0 (strict). Lower it
to e.g. 0.99 when a column has small gaps, which is typical when a target
column doesn't span every contig (e.g. UCSC's genbank column has
no value for the mitochondrion in many hubs, leaving 1 stray chrom).
Honored only by the ucsc-hub backend; supplying a non-default value
for any other source is an error (raised before any download).
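
A small worked example of the coverage threshold, using made-up names: a column that lacks only the mitochondrion fails the strict default but passes a relaxed threshold. Comparing the fraction with >= is an assumption about how the cutoff is applied.

```r
# Made-up groot: 199 contigs plus the mitochondrion.
groot_chroms <- c(sprintf("contig%03d", 1:199), "chrM")

# Made-up alias column that names every contig but has no entry for chrM.
alias_column <- c(sprintf("contig%03d", 1:199), NA)

# Fraction of groot chroms that appear in the column: 199 / 200.
column_coverage <- mean(groot_chroms %in% alias_column)

passes_strict  <- column_coverage >= 1     # default min_coverage = 1
passes_relaxed <- column_coverage >= 0.99  # relaxed threshold
print(column_coverage)
print(c(strict = passes_strict, relaxed = passes_relaxed))
```

With only a handful of chroms, one stray sequence would drop coverage well below 0.99, so the threshold that tolerates a single gap depends on how many chroms the assembly has.
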
match_by_length: When TRUE (default), complements column-based canonical
detection with a per-row length match for alias rows the chosen column
couldn't cover, and switches asset translation to a cross-column per-row
lookup so GFFs in any naming scheme import cleanly. Set FALSE for the
stricter single-column-only behavior.
format: "indexed" or "per-chromosome"; NULL (default) falls back to
getOption("gmulticontig.indexed_format", TRUE).
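
The option fallback uses base R's getOption() with a default, which the sketch below exercises on its own. How misha maps the resulting TRUE or FALSE onto "indexed" versus "per-chromosome" is not shown here and is left as an assumption.

```r
# With the option unset, getOption() falls back to the supplied default.
old <- options(gmulticontig.indexed_format = NULL)
default_value <- getOption("gmulticontig.indexed_format", TRUE)

# Setting the option changes what a NULL `format` would resolve to.
options(gmulticontig.indexed_format = FALSE)
overridden_value <- getOption("gmulticontig.indexed_format", TRUE)

# Restore the previous setting.
options(old)
print(c(default = default_value, overridden = overridden_value))
```
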
verbose: If TRUE, prints progress.

Value: None (invisible NULL).

For details on resolution, sources, sets, and chromosome-alias handling,
see gdb.install_intervals.

if (FALSE) { # not run
  # Build hg38 with the default interval sets.
  gdb.build_genome("hg38", path = "~/genomes/hg38")

  # Build from an assembly accession, prefixing the installed set names.
  gdb.build_genome("GCA_004023825.1",
    path = "~/genomes/arctic_fox",
    prefix = "intervs.global."
  )

  # Match HAL/Cactus canonical names (GenBank accessions like JH880237.1):
  gdb.build_genome("GCF_000298355.1",
    path = "~/genomes/Bos_mutus",
    chrom_naming = "genbank",
    prefix = "intervs.global."
  )
}
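
To use the target_chroms and target_lengths arguments, the two vectors can be read from halStats --sequenceStats output. The sketch below parses made-up text standing in for that output; the header row and comma-separated layout are assumptions, as is the idea that this accession resolves to the ucsc-hub backend. The build call itself is guarded because it needs misha and network access.

```r
# Made-up text standing in for `halStats --sequenceStats` output; the header
# row and comma-separated layout are assumptions.
hal_text <- "SequenceName, Length, NumTopSegments, NumBottomSegments
JH880237.1, 5000, 12, 0
JH880238.1, 1200, 3, 0"

hal_stats <- read.csv(text = hal_text, strip.white = TRUE, stringsAsFactors = FALSE)
target_chroms  <- hal_stats[[1]]  # first field: sequence names
target_lengths <- hal_stats[[2]]  # second field: sequence lengths

if (FALSE) { # not run: needs misha and network access
  gdb.build_genome("GCF_000298355.1",
    path = "~/genomes/Bos_mutus",
    target_chroms = target_chroms,
    target_lengths = target_lengths
  )
}
```
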