Genome Synthesis¶
Classes and functions for training generative models on genomic sequences and sampling synthetic genomes.
pymisha.GsynthModel dataclass ¶
Trained stratified Markov model for genome synthesis.
Stores the transition probabilities (as CDFs) for a k-th order Markov
chain, optionally stratified by one or more genomic track dimensions.
Each of the 4^k possible k-mer contexts maps to a probability
distribution over the four nucleotides (A, C, G, T), independently for
every flat bin in the stratification grid. The model is created by
:func:gsynth_train and consumed by :func:gsynth_sample.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `k` | Markov order (context length). Default 5 for backward compatibility. |
| `n_dims` | Number of stratification dimensions. |
| `dim_sizes` | Number of bins per dimension. |
| `dim_specs` | Per-dimension specification (expr, breaks, num_bins, bin_map). |
| `total_bins` | Product of all dim_sizes (total flat bins). |
| `model_data` | Contains |
| `total_kmers` | Total k-mers counted during training. |
| `per_bin_kmers` | K-mers per flat bin. |
| `total_masked` | Positions skipped due to mask. |
| `total_n` | Positions skipped due to N bases. |
| `pseudocount` | Pseudocount used for CDF normalization. |
See Also
gsynth_train : Create a GsynthModel from genome sequences.
gsynth_sample : Generate synthetic sequences from a model.
gsynth_save : Persist a model to disk.
gsynth_load : Restore a model from disk.
Examples:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> model = pm.gsynth_train()
>>> model.n_dims
0
>>> model.total_bins
1
pymisha.gsynth_bin_map ¶
Compute bin mapping for merging sparse bins.
Converts value-based merge specifications into an integer index array that redirects source bins to target bins. This is useful when certain stratification bins have too few observations to learn reliable transition probabilities -- their counts can be folded into a neighbouring, better-populated bin.
| PARAMETER | DESCRIPTION |
|---|---|
| `breaks` | Sorted bin boundaries (length = |
| `merge_ranges` | Each dict has: |

| RETURNS | DESCRIPTION |
|---|---|
| `ndarray` | Integer array of length |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If breaks has fewer than 2 elements, or if a |
See Also
gsynth_train : Accepts bin_merge specifications per dimension.
Examples:
>>> from pymisha import gsynth_bin_map
>>> breaks = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
>>> gsynth_bin_map(breaks, [{"from": (0.4, 0.5), "to": (0.3, 0.4)}])
array([0, 1, 2, 3, 3], dtype=int32)
Multiple merges -- fold both tails into the centre:
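pymisha itself is needed to run the doctest above, but the redirection logic can be sketched in plain Python. This is an illustrative re-implementation, not the library's code (`bin_map_sketch` is a hypothetical helper): start from the identity mapping, then point each `from` bin at the bin matching its `to` range. It reproduces both the single-merge mapping shown above and a two-tail merge into the centre bin:

```python
from bisect import bisect_left

def bin_map_sketch(breaks, merge_ranges):
    """Illustrative sketch of the bin redirection: identity mapping,
    with each "from" bin redirected to the bin whose left boundary
    matches the "to" range."""
    n_bins = len(breaks) - 1
    mapping = list(range(n_bins))          # bin i -> bin i
    for spec in merge_ranges:
        src = bisect_left(breaks, spec["from"][0])  # source bin index
        dst = bisect_left(breaks, spec["to"][0])    # target bin index
        mapping[src] = dst
    return mapping

# Fold both tails of a 5-bin grid into the centre bin (0.2, 0.3):
breaks = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
merges = [
    {"from": (0.0, 0.1), "to": (0.2, 0.3)},
    {"from": (0.4, 0.5), "to": (0.2, 0.3)},
]
print(bin_map_sketch(breaks, merges))  # -> [2, 1, 2, 3, 2]
```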
pymisha.gsynth_train ¶
gsynth_train(*dim_specs, mask=None, intervals=None, iterator=None, pseudocount=1.0, min_obs=0, k=5, allow_parallel=True, num_cores=None, max_chunk_size=None)
Train a stratified Markov model from genome sequences.
Computes a k-th order Markov model optionally stratified by bins of one or
more track expressions (e.g., GC content and CpG dinucleotide frequency).
The resulting :class:GsynthModel can be used with :func:gsynth_sample
to generate synthetic genomes that preserve the k-mer statistics of the
original genome within each stratification bin.
Both the forward-strand (k+1)-mer and its reverse complement are counted
for every valid position, ensuring strand-symmetric transition
probabilities. Positions containing N bases are skipped and counted
separately in the returned model's total_n attribute.
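The strand-symmetric counting described above can be sketched without pymisha. This is a toy illustration of the idea, not the library's implementation: every valid forward-strand (k+1)-mer window is counted together with its reverse complement, and windows containing N are skipped and tallied separately:

```python
from collections import Counter

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of an uppercase DNA string."""
    return s.translate(COMP)[::-1]

def count_kmers(seq, k):
    """Count every valid (k+1)-mer on the forward strand together with
    its reverse complement; windows containing N are skipped."""
    counts = Counter()
    skipped_n = 0
    for i in range(len(seq) - k):
        win = seq[i : i + k + 1]
        if "N" in win:
            skipped_n += 1
            continue
        counts[win] += 1
        counts[revcomp(win)] += 1  # strand symmetry by construction
    return counts, skipped_n

counts, skipped = count_kmers("ACGTNACGT", k=1)
# Each 2-mer is counted with its reverse complement, so
# counts["AC"] == counts["GT"] holds automatically.
```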
When called with no dimension specifications, trains a single unstratified (0-D) model.
For large genomes (total bases > threshold), intervals can be split into
chunks and processed in parallel using multiple cores. Each chunk trains
independently, and the resulting k-mer count arrays are merged before
computing the final CDF. This matches the R misha parallel gsynth
behavior.
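The merge-then-normalise step can be illustrated in miniature. This sketch (not the library's code) merges per-chunk counts for one (context, bin) cell elementwise, adds the pseudocount, and accumulates a CDF over A/C/G/T:

```python
def counts_to_cdf(count_chunks, pseudocount=1.0):
    """Merge per-chunk nucleotide counts for one (context, bin) cell,
    then build a CDF with a pseudocount so no transition has
    probability zero."""
    merged = [sum(col) for col in zip(*count_chunks)]   # elementwise sum
    smoothed = [c + pseudocount for c in merged]
    total = sum(smoothed)
    cdf, acc = [], 0.0
    for c in smoothed:
        acc += c / total
        cdf.append(acc)
    return cdf

# Counts from two chunks for a single k-mer context, order A, C, G, T:
cdf = counts_to_cdf([[3, 0, 1, 0], [1, 0, 2, 0]], pseudocount=1.0)
# cdf ends at 1.0 (up to rounding); zero-count bases still get mass.
```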
| PARAMETER | DESCRIPTION |
|---|---|
| `*dim_specs` | Each positional argument is a dict specifying a stratification dimension with the following keys: |
| `mask` | Intervals to exclude from training. Regions in the mask do not contribute to k-mer counts but are tallied in |
| `intervals` | Genomic intervals to train on. If |
| `iterator` | Iterator bin size for track extraction. Determines the resolution at which track values are evaluated. |
| `pseudocount` | Pseudocount added to all k-mer counts to avoid zero probabilities in the CDF. |
| `min_obs` | Minimum number of (k+1)-mer observations required per bin. Reserved for future use. |
| `k` | Markov order (context length). Must be in |
| `allow_parallel` | Whether to enable parallel chunking for large genomes. When |
| `num_cores` | Number of worker processes. If |
| `max_chunk_size` | Total-base threshold above which parallel processing triggers. Defaults to |

| RETURNS | DESCRIPTION |
|---|---|
| `GsynthModel` | Trained model containing transition CDFs, dimension metadata, and training statistics. |

| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If a dimension spec is not a dict. |
| `ValueError` | If a dimension spec is missing |
See Also
gsynth_sample : Sample synthetic sequences from a trained model.
gsynth_random : Generate random sequences without a model.
gsynth_save : Persist a trained model to disk.
gsynth_load : Restore a model from disk.
gsynth_bin_map : Compute bin-merge mappings for sparse bins.
Examples:
Train an unstratified (0-D) model over the whole genome:
>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> model = pm.gsynth_train()
>>> model.total_bins
1
pymisha.gsynth_sample ¶
gsynth_sample(model, output=None, *, output_format='fasta', intervals=None, iterator=None, mask_copy=None, n_samples=1, seed=None, bin_merge=None, allow_parallel=True, num_cores=None, max_chunk_size=None)
Sample synthetic genome sequences from a trained model.
Generates a synthetic genome by sampling from a trained stratified Markov model. For each genomic position the sampler looks up the current k-mer context and the position's stratification bin, then draws the next nucleotide from the corresponding CDF. The result preserves the k-mer statistics of the original genome within each bin.
When the sampler needs to initialise the first k-mer context and encounters regions with only N bases, it falls back to uniform random base selection until a valid context is established.
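The per-position draw can be sketched as CDF inversion. The code below is a toy single-bin illustration of the mechanism, not the library's sampler: `sample_sketch` and its uniform fallback for unseen contexts are assumptions for the sketch.

```python
import random
from bisect import bisect_left

BASES = "ACGT"

def draw_next(cdf, rng):
    """Draw one nucleotide by inverting a 4-entry CDF over A/C/G/T
    with a uniform variate."""
    return BASES[bisect_left(cdf, rng.random())]

def sample_sketch(cdfs, k, length, seed=0):
    """Toy sampler for a single stratification bin: look up the CDF
    for the current k-mer context and append the drawn base.  `cdfs`
    maps a k-mer string to its CDF; unseen contexts fall back to the
    uniform distribution (an assumption of this sketch)."""
    rng = random.Random(seed)
    seq = "".join(rng.choice(BASES) for _ in range(k))  # uniform start
    while len(seq) < length:
        cdf = cdfs.get(seq[-k:], [0.25, 0.5, 0.75, 1.0])
        seq += draw_next(cdf, rng)
    return seq

s = sample_sketch({}, k=5, length=40, seed=1)
```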
For large genomes (total bases > threshold), intervals can be split into
chunks and processed in parallel using multiple cores. Each chunk samples
independently and the resulting sequences are concatenated. For file
output modes ("fasta" or "seq"), the parallel path first samples
to in-memory vectors and then writes the combined result.
| PARAMETER | DESCRIPTION |
|---|---|
| `model` | Trained Markov model from :func: |
| `output` | Output file path. If |
| `output_format` | Output format: |
| `intervals` | Genomic intervals to synthesise. If |
| `iterator` | Iterator bin size for track extraction during bin-index computation. |
| `mask_copy` | Intervals where the original reference sequence is preserved verbatim instead of being sampled. Useful for keeping repetitive or regulatory regions intact. Should be non-overlapping and sorted by start position within each chromosome. |
| `n_samples` | Number of independent samples to generate per interval. When |
| `seed` | Random seed for reproducibility. If |
| `bin_merge` | Sampling-time bin merge overrides, one element per model dimension. Each element is either |
| `allow_parallel` | Whether to enable parallel chunking for large genomes. When |
| `num_cores` | Number of worker processes. If |
| `max_chunk_size` | Total-base threshold above which parallel processing triggers. Defaults to |

| RETURNS | DESCRIPTION |
|---|---|
| `list of str or None` | List of nucleotide strings when output is |

| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If model is not a :class: |
See Also
gsynth_train : Train the model consumed by this function.
gsynth_random : Generate random sequences without a model.
gsynth_save : Persist a model for later sampling.
Examples:
pymisha.gsynth_random ¶
gsynth_random(*, intervals=None, nuc_probs=None, output=None, output_format='fasta', mask_copy=None, n_samples=1, seed=None)
Generate random genome sequences without a trained model.
Produces random DNA sequences where each nucleotide is sampled
independently according to the specified probabilities. Unlike
:func:gsynth_sample, no Markov context is used -- consecutive bases
are statistically independent. This is useful for generating baseline
random sequences or sequences with a specific GC content.
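The independent-draw behaviour can be sketched with the standard library. This is an illustration, not pymisha's implementation, and it assumes `nuc_probs` is keyed by the single-letter bases "A"/"C"/"G"/"T" (the doc's key format is truncated above):

```python
import random

def random_seq(length, nuc_probs=None, seed=None):
    """Draw each base independently.  With no probabilities given,
    fall back to the uniform distribution over A/C/G/T."""
    probs = nuc_probs or {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    rng = random.Random(seed)
    bases, weights = zip(*sorted(probs.items()))
    return "".join(rng.choices(bases, weights=weights, k=length))

# ~60% GC sequence: raise C/G mass at the expense of A/T.
s = random_seq(1000, {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}, seed=7)
gc = (s.count("C") + s.count("G")) / len(s)
```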
| PARAMETER | DESCRIPTION |
|---|---|
| `intervals` | Genomic intervals to generate. If |
| `nuc_probs` | Nucleotide probabilities keyed by |
| `output` | Output file path. If |
| `output_format` | Output format: |
| `mask_copy` | Intervals where the original reference sequence is preserved instead of being randomly generated. |
| `n_samples` | Number of independent samples to generate per interval. |
| `seed` | Random seed for reproducibility. If |

| RETURNS | DESCRIPTION |
|---|---|
| `list of str or None` | List of nucleotide strings when output is |
See Also
gsynth_sample : Sample from a trained Markov model.
gsynth_train : Train a Markov model for context-dependent sampling.
Examples:
Uniform random sequence:
pymisha.gsynth_replace_kmer ¶
gsynth_replace_kmer(target, replacement, *, intervals=None, output=None, output_format='fasta', check_composition=True)
Iteratively replace a k-mer in genome sequences.
Scans each sequence and replaces every occurrence of target with
replacement. If a replacement creates a new instance of target
(e.g., replacing "CG" with "GC" in the sequence "CCG"
produces "CGC"), the new instance is also replaced. The scan
repeats until the sequence is completely free of target.
When target and replacement are permutations of each other (e.g.,
"CG" and "GC"), the operation acts as a local "bubble sort" of
nucleotides, preserving the total base counts and GC content of the
genome.
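The iterative scan described above can be sketched in a few lines. This is an illustration of the algorithm, not the library's implementation (it also assumes the replacement does not contain the target, which would loop forever):

```python
def replace_kmer_sketch(seq, target, replacement):
    """Repeat left-to-right replacement until no occurrence of the
    target survives, including occurrences created by earlier passes.
    Assumes `target` is not a substring of `replacement`."""
    target, replacement = target.upper(), replacement.upper()
    while target in seq:
        seq = seq.replace(target, replacement)
    return seq

out = replace_kmer_sketch("CCG", "CG", "GC")
# "CCG" -> "CGC" (a new CG appears) -> "GCC"; base counts are preserved.
```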
| PARAMETER | DESCRIPTION |
|---|---|
| `target` | K-mer to remove. Case-insensitive (converted to uppercase internally). |
| `replacement` | Replacement sequence. Must be the same length as target. |
| `intervals` | Genomic intervals to process. If |
| `output` | Output file path. If |
| `output_format` | Output format: |
| `check_composition` | If |

| RETURNS | DESCRIPTION |
|---|---|
| `list of str or None` | List of modified nucleotide strings when output is |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If target or replacement is empty, if they differ in length, or if |
See Also
gsynth_sample : Markov-model-based genome synthesis.
gsynth_random : Independent random nucleotide generation.
Examples:
Remove all CpG dinucleotides while preserving GC content:
pymisha.gsynth_save ¶
Save a trained model to disk in .gsm format.
Serialises a :class:GsynthModel to a cross-platform .gsm directory
(or ZIP archive when compress=True) containing YAML metadata and raw
binary arrays. The file can later be restored with :func:gsynth_load.
The .gsm format stores counts and CDFs as raw float64 arrays in
row-major (C) order, making them readable from both Python and R without
any language-specific serialisation quirks.
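The raw-array half of such a layout can be sketched with a round trip. Everything below is illustrative: the file name, the little-endian byte order, and the single-array file are assumptions of this sketch, not the documented .gsm layout (which also carries YAML metadata):

```python
import os
import struct
import tempfile

def write_f64(path, values):
    """Write float64 values as raw bytes in order (little-endian
    assumed here) -- a language-neutral layout readable from both
    Python and R."""
    with open(path, "wb") as fh:
        fh.write(struct.pack("<%dd" % len(values), *values))

def read_f64(path):
    """Read back a flat float64 array written by write_f64."""
    with open(path, "rb") as fh:
        data = fh.read()
    return list(struct.unpack("<%dd" % (len(data) // 8), data))

path = os.path.join(tempfile.mkdtemp(), "cdf.f64")  # hypothetical file name
write_f64(path, [0.25, 0.5, 0.75, 1.0])
assert read_f64(path) == [0.25, 0.5, 0.75, 1.0]
```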
| PARAMETER | DESCRIPTION |
|---|---|
| `model` | Trained model to save. |
| `path` | Destination path. When compress is |
| `compress` | If |

| RETURNS | DESCRIPTION |
|---|---|
| `None` | |

| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If model is not a :class: |
See Also
gsynth_load : Restore a model saved by this function.
gsynth_train : Create a model.
gsynth_convert : Convert legacy pickle models to .gsm format.
Examples:
pymisha.gsynth_load ¶
Load a trained model from disk.
Auto-detects the format: .gsm directory, .gsm ZIP archive, or
legacy pickle.
| PARAMETER | DESCRIPTION |
|---|---|
| `path` | Path to the saved model. Can be a |

| RETURNS | DESCRIPTION |
|---|---|
| `GsynthModel` | The deserialised model, ready for use with :func: |

| RAISES | DESCRIPTION |
|---|---|
| `TypeError` | If the deserialised object is not a :class: |
| `FileNotFoundError` | If path does not exist. |
| `ValueError` | If the format or version is unrecognised. |
See Also
gsynth_save : Save a model to disk.
gsynth_train : Create a new model from scratch.
gsynth_sample : Sample sequences from the loaded model.
Examples: