Trains a 5th-order Markov model of the genome sequence, optionally stratified by bins of one or more track expressions (e.g., GC content and CG dinucleotide frequency). In a 5th-order model each base is conditioned on the preceding five bases, so the model is estimated from 6-mer counts. The model can be used to generate synthetic genomes that preserve the k-mer statistics of the original genome within each stratification bin. When called with no dimension specifications, it trains a single unstratified model.

gsynth.train(
  ...,
  mask = NULL,
  intervals = NULL,
  iterator = NULL,
  pseudocount = 1,
  min_obs = 0
)

Arguments

...

Zero or more dimension specifications. Each specification is a list containing:

expr

Track expression for this dimension (required)

breaks

Numeric vector of bin boundaries for this dimension (required)

bin_merge

Optional list of merge specifications for merging sparse bins. Each specification is a named list with 'from' and 'to' elements, for example list(from = 0.7, to = c(0.675, 0.7)) as used in the Examples.

If no dimensions are provided, trains an unstratified model with a single bin.

mask

Optional intervals to exclude from training. Regions in the mask do not contribute to k-mer counts. The mask can be computed with gscreen().

intervals

Genomic intervals to process. If NULL, uses all chromosomes.

iterator

Iterator for track evaluation; determines the resolution at which track values are computed.

pseudocount

Pseudocount added to all k-mer counts to avoid zero probabilities. Default is 1.

min_obs

Minimum number of observations (6-mers) required per bin. Bins with fewer observations will be marked as NA (not learned) and a warning will be issued. Default is 0 (no minimum). During sampling, NA bins will fall back to uniform sampling unless merged via bin_merge.
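
As a rough illustration of how pseudocount and min_obs might interact, the base-R sketch below uses a made-up count table for a single 5-mer context in two bins. The helper name smooth_probs, the toy counts, and the per-bin thresholding are assumptions for illustration only; they are not the package's internal code.

```r
# Hypothetical next-base counts for one 5-mer context in two bins.
counts <- rbind(
    bin1 = c(A = 6, C = 0, G = 3, T = 1),
    bin2 = c(A = 0, C = 1, G = 0, T = 0)
)

# Add the pseudocount to every cell, then normalize each row so that the
# four next-base probabilities sum to 1. (Illustrative helper, not a misha API.)
smooth_probs <- function(counts, pseudocount = 1) {
    smoothed <- counts + pseudocount
    smoothed / rowSums(smoothed)
}

probs <- smooth_probs(counts, pseudocount = 1)

# Mark bins whose total observation count is below min_obs as NA (not learned).
min_obs <- 5
sparse <- rowSums(counts) < min_obs
probs[sparse, ] <- NA

print(probs)
```

With a pseudocount of 1, the unobserved transition to C in bin1 still receives a non-zero probability, while bin2, which has a single observation, is left as NA.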

Value

A gsynth.model object containing:

n_dims

Number of stratification dimensions

dim_specs

List of dimension specifications (expr, breaks, num_bins, bin_map)

dim_sizes

Vector of bin counts per dimension

total_bins

Total number of bins (product of dim_sizes)

total_kmers

Total number of valid 6-mers counted

per_bin_kmers

Number of 6-mers counted per bin

total_masked

Number of positions skipped due to mask

total_n

Number of positions skipped due to N bases

model_data

Internal model data (counts and CDFs)
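
The relationship between breaks, dim_sizes, and total_bins can be checked by hand. The base-R sketch below uses the breaks from the stratified example on this page. The flat_index helper and its layout (first dimension varying fastest) are assumptions made for illustration and may not match how the package stores bins internally.

```r
# Breaks taken from the stratified example on this page.
gc_breaks <- seq(0, 1, 0.025)
cg_breaks <- c(0, 0.01, 0.02, 0.03, 0.04, 0.2)

# n break points delimit n - 1 bins.
dim_sizes <- c(length(gc_breaks) - 1, length(cg_breaks) - 1)
total_bins <- prod(dim_sizes)

# Hypothetical flat index for a vector of 1-based per-dimension bin indices,
# assuming the first dimension varies fastest.
flat_index <- function(idx, dim_sizes) {
    strides <- c(1, cumprod(dim_sizes)[-length(dim_sizes)])
    sum((idx - 1) * strides) + 1
}

print(dim_sizes)
print(total_bins)
print(flat_index(c(40, 5), dim_sizes))
```

The product of the per-dimension bin counts here is 200, which matches the "200 bins (2 dimensions)" reported in the example output.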

Details

Strand symmetry: The training process counts both the forward strand 6-mer and its reverse complement for each position, ensuring strand-symmetric transition probabilities. This means the reported total_kmers is approximately double the number of genomic positions processed.
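
The strand-symmetric counting described above can be sketched in base R as follows. The function names revcomp and count_symmetric and the toy sequence are illustrative assumptions; the package performs the equivalent counting internally.

```r
# Reverse complement of an uppercase A/C/G/T string.
revcomp <- function(s) {
    paste(rev(strsplit(chartr("ACGT", "TGCA", s), "")[[1]]), collapse = "")
}

# Count every k-mer in `dna` together with its reverse complement.
# Returns a named integer vector of counts.
count_symmetric <- function(dna, k = 6) {
    starts <- seq_len(nchar(dna) - k + 1)
    fwd <- substring(dna, starts, starts + k - 1)
    rc <- vapply(fwd, revcomp, character(1), USE.NAMES = FALSE)
    tab <- table(c(fwd, rc))
    setNames(as.integer(tab), names(tab))
}

dna <- "ACGTTGCAAC"
counts <- count_symmetric(dna)
print(counts)
```

Each of the 5 window positions in the 10-base toy sequence contributes two counts, and every 6-mer ends up with the same count as its reverse complement.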

N bases: Positions where the 6-mer contains any N (unknown) bases are skipped during training and counted in total_n. The model only learns from valid A/C/G/T sequences.
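
A minimal sketch of the N-skipping rule, again in base R with hypothetical names (count_valid_kmers, skipped_n) and a toy sequence. It assumes uppercase input and treats any window containing a character other than A, C, G, or T as skipped.

```r
# Split `dna` into overlapping k-mers, keep only those made entirely of
# A/C/G/T, and report how many windows were skipped.
count_valid_kmers <- function(dna, k = 6) {
    starts <- seq_len(nchar(dna) - k + 1)
    kmers <- substring(dna, starts, starts + k - 1)
    valid <- grepl("^[ACGT]+$", kmers)
    list(
        counts = table(kmers[valid]),
        skipped_n = sum(!valid)
    )
}

# One N at position 7 of a 14-base sequence.
res <- count_valid_kmers("ACGTACNGTACGTA")
print(res$counts)
print(res$skipped_n)
```

A single N invalidates every window that overlaps it, so in this toy sequence 6 of the 9 windows are skipped and only 3 contribute to the counts.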

Examples

gdb.init_examples()

# Create virtual tracks for stratification
gvtrack.create("g_frac", NULL, "kmer.frac", kmer = "G")
gvtrack.create("c_frac", NULL, "kmer.frac", kmer = "C")
gvtrack.create("cg_frac", NULL, "kmer.frac", kmer = "CG")
#> Warning: kmer sequence 'CG' is palindromic, please set strand to 1 or -1 to avoid double counting
gvtrack.create("masked_frac", NULL, "masked.frac")

# Define repeat mask
repeats <- gscreen("masked_frac > 0.5",
    intervals = gintervals.all(),
    iterator = 100
)

# Train an unstratified model (single bin)
model_0d <- gsynth.train(
    mask = repeats,
    intervals = gintervals.all(),
    iterator = 200
)
#> Setting up iterator positions...
#> Training Markov model...
#> Trained unstratified model: 835,310 6-mers (no stratification)

# Train model with 2D stratification (GC content and CG dinucleotide)
model <- gsynth.train(
    list(
        expr = "g_frac + c_frac",
        breaks = seq(0, 1, 0.025),
        bin_merge = list(list(from = 0.7, to = c(0.675, 0.7)))
    ),
    list(
        expr = "cg_frac",
        breaks = c(0, 0.01, 0.02, 0.03, 0.04, 0.2),
        bin_merge = list(list(from = 0.04, to = c(0.03, 0.04)))
    ),
    mask = repeats,
    intervals = gintervals.all(),
    iterator = 200
)
#> Extracting track values...
#> Training Markov model...
#> Trained model: 835,310 6-mers across 200 bins (2 dimensions)