Computes a Markov model of order k (default 5), optionally stratified by
bins of one or more track expressions (e.g., GC content and CG dinucleotide
frequency). This model can be used to generate synthetic genomes that preserve
the k-mer statistics of the original genome within each stratification bin.
When called with no dimension specifications, trains a single unstratified model.
gsynth.train(
...,
mask = NULL,
intervals = NULL,
iterator = NULL,
pseudocount = 1,
min_obs = 0,
k = 5L
)
...: Zero or more dimension specifications. Each specification is a list containing:
    expr: Track expression for this dimension (required)
    breaks: Numeric vector of bin boundaries for this dimension (required)
    bin_merge: Optional list of merge specifications for merging sparse bins. Each specification is a named list with 'from' and 'to' elements.
If no dimensions are provided, trains an unstratified model with a single bin.
mask: Optional intervals to exclude from training. Regions in the mask
will not contribute to k-mer counts. Can be computed using gscreen().
intervals: Genomic intervals to process. If NULL, uses all chromosomes.
iterator: Iterator for track evaluation; determines the resolution at which track values are computed.
pseudocount: Pseudocount added to all k-mer counts to avoid zero probabilities. Default is 1.
min_obs: Minimum number of observations ((k+1)-mers) required per bin. Bins
with fewer observations are marked as NA (not learned) and a warning is
issued. Default is 0 (no minimum). During sampling, NA bins fall back
to uniform sampling unless merged via bin_merge.
k: Integer Markov order (1–10). Default is 5, which models 6-mer (context of length 5 plus the emitted base) transition probabilities. Higher values capture longer-range sequence dependencies but require exponentially more memory (\(4^k\) context states).
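To make the pseudocount and k arguments more concrete, here is a minimal base-R sketch. It assumes that smoothing is applied per context, by adding the pseudocount to the count of each possible next base and then normalising; the counts are hypothetical, and the internal representation used by gsynth.train() may differ.

```r
# Number of context states for a few Markov orders: 4^k
k_values <- c(1, 5, 10)
num_contexts <- 4^k_values

# Hypothetical counts of the next base observed after one context (not real data)
counts <- c(A = 8, C = 0, G = 3, T = 1)
pseudocount <- 1

# Additive smoothing (assumed scheme): add the pseudocount, then normalise
probs <- (counts + pseudocount) / sum(counts + pseudocount)

# A context never seen in training becomes uniform under this scheme
unseen <- c(A = 0, C = 0, G = 0, T = 0)
unseen_probs <- (unseen + pseudocount) / sum(unseen + pseudocount)

print(num_contexts)
print(probs)
print(unseen_probs)
```

With a pseudocount of 1, the base C keeps a small nonzero probability even though it was never observed after this context.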
A gsynth.model object containing:
Markov order used for training
Number of context states (\(4^k\))
Number of stratification dimensions
List of dimension specifications (expr, breaks, num_bins, bin_map)
Vector of bin counts per dimension
Total number of bins (product of dim_sizes)
Total number of valid (k+1)-mers counted
Number of (k+1)-mers counted per bin
Number of positions skipped due to mask
Number of positions skipped due to N bases
Internal model data (counts and CDFs)
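The internal model data is described as holding counts and CDFs. As a rough illustration of why a CDF is convenient for sampling, the base-R sketch below turns one hypothetical row of transition probabilities into a cumulative distribution and picks the next base by inverse-CDF lookup. The helper pick_base() and the probabilities are assumptions made for illustration, not part of the package.

```r
# Hypothetical transition probabilities for one context (not real data)
probs <- c(A = 0.5625, C = 0.0625, G = 0.25, T = 0.125)

# Cumulative distribution over the four bases
cdf <- cumsum(probs)

# Inverse-CDF lookup: first base whose cumulative probability exceeds u,
# where u is a uniform draw in [0, 1)
pick_base <- function(u, cdf) {
    names(cdf)[which(u < cdf)[1]]
}

# Fixed values of u are used here so the result is reproducible;
# in practice u would come from runif(1)
picked <- vapply(c(0.1, 0.6, 0.7, 0.9), pick_base, character(1), cdf = cdf)
print(picked)
```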
Strand symmetry: The training process counts both the forward strand (k+1)-mer and its reverse complement for each position, ensuring strand-symmetric transition probabilities. This means the reported total_kmers is approximately double the number of genomic positions processed.
N bases: Positions where the (k+1)-mer contains any N (unknown) bases
are skipped during training and counted in total_n. The model only learns
from valid A/C/G/T sequences.
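The two notes above can be sketched in base R on a toy sequence. This is an illustrative reimplementation under simple assumptions (a single uppercase string, no mask, no stratification), not the package's training code; revcomp() and count_symmetric() are hypothetical helper names.

```r
# Reverse complement of an A/C/G/T string
revcomp <- function(s) {
    paste(rev(strsplit(chartr("ACGT", "TGCA", s), "")[[1]]), collapse = "")
}

# Count every (k+1)-mer together with its reverse complement,
# skipping windows that contain anything other than A/C/G/T
count_symmetric <- function(dna, k) {
    w <- k + 1
    counts <- integer(0)
    n_skipped <- 0L
    for (i in seq_len(max(0, nchar(dna) - w + 1))) {
        kmer <- substr(dna, i, i + w - 1)
        if (grepl("[^ACGT]", kmer)) {
            n_skipped <- n_skipped + 1L
            next
        }
        for (key in c(kmer, revcomp(kmer))) {
            counts[key] <- if (is.na(counts[key])) 1L else counts[key] + 1L
        }
    }
    list(counts = counts, total_kmers = sum(counts), total_n = n_skipped)
}

res <- count_symmetric("ACGTNAC", k = 1)
print(res$counts)
print(res$total_kmers)
print(res$total_n)
```

In this toy run, four valid windows yield eight counted 2-mers, the two windows touching the N are skipped, and AC and its reverse complement GT end up with the same count.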
gdb.init_examples()
# Create virtual tracks for stratification
gvtrack.create("g_frac", NULL, "kmer.frac", kmer = "G")
gvtrack.create("c_frac", NULL, "kmer.frac", kmer = "C")
gvtrack.create("cg_frac", NULL, "kmer.frac", kmer = "CG")
#> Warning: kmer sequence 'CG' is palindromic, please set strand to 1 or -1 to avoid double counting
gvtrack.create("masked_frac", NULL, "masked.frac")
# Define repeat mask
repeats <- gscreen("masked_frac > 0.5",
    intervals = gintervals.all(),
    iterator = 100
)
# Train unstratified model (no stratification)
model_0d <- gsynth.train(
    mask = repeats,
    intervals = gintervals.all(),
    iterator = 200
)
#> Setting up iterator positions...
#> Training Markov model...
#> Trained unstratified Markov-5 model: 835,310 6-mers (no stratification)
# Train model with 2D stratification (GC content and CG dinucleotide)
model <- gsynth.train(
    list(
        expr = "g_frac + c_frac",
        breaks = seq(0, 1, 0.025),
        bin_merge = list(list(from = 0.7, to = c(0.675, 0.7)))
    ),
    list(
        expr = "cg_frac",
        breaks = c(0, 0.01, 0.02, 0.03, 0.04, 0.2),
        bin_merge = list(list(from = 0.04, to = c(0.03, 0.04)))
    ),
    mask = repeats,
    intervals = gintervals.all(),
    iterator = 200
)
#> Extracting track values...
#> Training Markov model...
#> Trained Markov-5 model: 835,310 6-mers across 200 bins (2 dimensions)
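The 200 bins reported in the training message follow from the two break vectors: n breaks delimit n - 1 bins, and the total is the product over dimensions. The base-R sketch below checks that arithmetic and assigns two hypothetical track values to bins with findInterval(). How gsynth.train() treats values that fall exactly on a boundary or outside the breaks is not stated here, so the bin assignment is only an approximation of the idea.

```r
# Break vectors copied from the 2D example above
gc_breaks <- seq(0, 1, 0.025)
cg_breaks <- c(0, 0.01, 0.02, 0.03, 0.04, 0.2)

# n breaks delimit n - 1 bins per dimension
dim_sizes <- c(length(gc_breaks) - 1, length(cg_breaks) - 1)
total_bins <- prod(dim_sizes)

# Hypothetical track values for one iterator window (not real data)
gc_value <- 0.41
cg_value <- 0.015

# Bin index per dimension; boundary handling in gsynth.train() may differ
gc_bin <- findInterval(gc_value, gc_breaks)
cg_bin <- findInterval(cg_value, cg_breaks)

print(dim_sizes)
print(total_bins)
print(c(gc_bin, cg_bin))
```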