Genome Synthesis¶

Classes and functions for training generative models on genomic sequences and sampling synthetic genomes.

pymisha.GsynthModel `dataclass` ¶

Trained stratified Markov model for genome synthesis.

Stores the transition probabilities (as CDFs) for a k-th order Markov chain, optionally stratified by one or more genomic track dimensions. Each of the 4^k possible k-mer contexts maps to a probability distribution over the four nucleotides (A, C, G, T), independently for every flat bin in the stratification grid. The model is created by :func:gsynth_train and consumed by :func:gsynth_sample.

ATTRIBUTE	DESCRIPTION
`k`	Markov order (context length). Default 5 for backward compatibility. TYPE: `int`
`n_dims`	Number of stratification dimensions. TYPE: `int`
`dim_sizes`	Number of bins per dimension. TYPE: `list of int`
`dim_specs`	Per-dimension specification (expr, breaks, num_bins, bin_map). TYPE: `list of dict`
`total_bins`	Product of all dim_sizes (total flat bins). TYPE: `int`
`model_data`	Contains `'counts'` (list of 2-D arrays, one per flat bin, shape `(4^k, 4)`) and `'cdf'` (same layout, cumulative probabilities). TYPE: `dict`
`total_kmers`	Total k-mers counted during training. TYPE: `int`
`per_bin_kmers`	K-mers per flat bin. TYPE: `ndarray`
`total_masked`	Positions skipped due to mask. TYPE: `int`
`total_n`	Positions skipped due to N bases. TYPE: `int`
`pseudocount`	Pseudocount used for CDF normalization. TYPE: `float`

See Also

gsynth_train : Create a GsynthModel from genome sequences. gsynth_sample : Generate synthetic sequences from a model. gsynth_save : Persist a model to disk. gsynth_load : Restore a model from disk.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> model = pm.gsynth_train()
>>> model.n_dims
0
>>> model.total_bins
1

num_kmers `property` ¶

num_kmers

Number of k-mer contexts (4^k).

pymisha.gsynth_bin_map ¶

gsynth_bin_map(breaks, merge_ranges)

Compute bin mapping for merging sparse bins.

Converts value-based merge specifications into an integer index array that redirects source bins to target bins. This is useful when certain stratification bins have too few observations to learn reliable transition probabilities -- their counts can be folded into a neighbouring, better-populated bin.

PARAMETER	DESCRIPTION
`breaks`	Sorted bin boundaries (length = `num_bins + 1`). TYPE: `array - like`
`merge_ranges`	Each dict has: `"from"` : float or tuple of (lo, hi) -- source value or range to remap. A scalar `v` is shorthand for `(v, inf)`. `"to"` : tuple of (lo, hi) -- target value range whose bin receives the merged counts. Must overlap exactly one bin defined by breaks. TYPE: `list of dict`

RETURNS	DESCRIPTION
`ndarray`	Integer array of length `num_bins`, where `bin_map[i]` is the 0-based target bin index for source bin `i`. Unmapped bins map to themselves (identity).

RAISES	DESCRIPTION
`ValueError`	If breaks has fewer than 2 elements, or if a `"to"` range does not match any bin in breaks.

See Also

gsynth_train : Accepts bin_merge specifications per dimension.

Examples:

>>> from pymisha import gsynth_bin_map
>>> breaks = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
>>> gsynth_bin_map(breaks, [{"from": (0.4, 0.5), "to": (0.3, 0.4)}])
array([0, 1, 2, 3, 3], dtype=int32)

Multiple merges -- fold both tails into the centre:

>>> breaks = [0.0, 0.25, 0.5, 0.75, 1.0]
>>> gsynth_bin_map(breaks, [
...     {"from": (0.0, 0.25), "to": (0.25, 0.5)},
...     {"from": (0.75, 1.0), "to": (0.5, 0.75)},
... ])
array([1, 1, 2, 2], dtype=int32)

pymisha.gsynth_train ¶

gsynth_train(*dim_specs, mask=None, intervals=None, iterator=None, pseudocount=1.0, min_obs=0, k=5, allow_parallel=True, num_cores=None, max_chunk_size=None)

Train a stratified Markov model from genome sequences.

Computes a k-th order Markov model optionally stratified by bins of one or more track expressions (e.g., GC content and CpG dinucleotide frequency). The resulting :class:GsynthModel can be used with :func:gsynth_sample to generate synthetic genomes that preserve the k-mer statistics of the original genome within each stratification bin.

Both the forward-strand (k+1)-mer and its reverse complement are counted for every valid position, ensuring strand-symmetric transition probabilities. Positions containing N bases are skipped and counted separately in the returned model's total_n attribute.

When called with no dimension specifications, trains a single unstratified (0-D) model.

For large genomes (total bases > threshold), intervals can be split into chunks and processed in parallel using multiple cores. Each chunk trains independently, and the resulting k-mer count arrays are merged before computing the final CDF. This matches the R misha parallel gsynth behavior.

PARAMETER	DESCRIPTION
`*dim_specs`	Each positional argument is a dict specifying a stratification dimension with the following keys: `"expr"` (str): Track expression for this dimension (required). `"breaks"` (array-like): Sorted bin boundaries (required). Length must be at least 2. `"bin_merge"` (list of dict, optional): Merge specifications for sparse bins, in the format accepted by :func:`gsynth_bin_map`. TYPE: `dict` DEFAULT: `()`
`mask`	Intervals to exclude from training. Regions in the mask do not contribute to k-mer counts but are tallied in `total_masked`. TYPE: `DataFrame` DEFAULT: `None`
`intervals`	Genomic intervals to train on. If `None`, uses all chromosomes. TYPE: `DataFrame` DEFAULT: `None`
`iterator`	Iterator bin size for track extraction. Determines the resolution at which track values are evaluated. TYPE: `int` DEFAULT: `None`
`pseudocount`	Pseudocount added to all k-mer counts to avoid zero probabilities in the CDF. TYPE: `float` DEFAULT: `1.0`
`min_obs`	Minimum number of (k+1)-mer observations required per bin. Reserved for future use. TYPE: `int` DEFAULT: `0`
`k`	Markov order (context length). Must be in `[1, 10]`. The model stores `4^k` context states; higher values capture longer-range dependencies but require more training data and memory. TYPE: `int` DEFAULT: `5`
`allow_parallel`	Whether to enable parallel chunking for large genomes. When `True` and the total bases exceed max_chunk_size, intervals are split across multiple processes. When `False`, always runs single-threaded. TYPE: `bool` DEFAULT: `True`
`num_cores`	Number of worker processes. If `None`, defaults to `multiprocessing.cpu_count()`. Capped at the number of interval rows. TYPE: `int` DEFAULT: `None`
`max_chunk_size`	Total-base threshold above which parallel processing triggers. Defaults to `GSYNTH_MAX_CHUNK_SIZE` (1 billion). TYPE: `int` DEFAULT: `None`

RETURNS	DESCRIPTION
`GsynthModel`	Trained model containing transition CDFs, dimension metadata, and training statistics.

RAISES	DESCRIPTION
`TypeError`	If a dimension spec is not a dict.
`ValueError`	If a dimension spec is missing `"expr"` or `"breaks"`, or if breaks has fewer than 2 elements, or if no data is extracted for a given expression, or if k is not in `[1, 10]`.

See Also

gsynth_sample : Sample synthetic sequences from a trained model. gsynth_random : Generate random sequences without a model. gsynth_save : Persist a trained model to disk. gsynth_load : Restore a model from disk. gsynth_bin_map : Compute bin-merge mappings for sparse bins.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()

Train an unstratified (0-D) model over the whole genome:

>>> model_0d = pm.gsynth_train()
>>> model_0d.n_dims
0

pymisha.gsynth_sample ¶

gsynth_sample(model, output=None, *, output_format='fasta', intervals=None, iterator=None, mask_copy=None, n_samples=1, seed=None, bin_merge=None, allow_parallel=True, num_cores=None, max_chunk_size=None)

Sample synthetic genome sequences from a trained model.

Generates a synthetic genome by sampling from a trained stratified Markov model. For each genomic position the sampler looks up the current k-mer context and the position's stratification bin, then draws the next nucleotide from the corresponding CDF. The result preserves the k-mer statistics of the original genome within each bin.

When the sampler needs to initialise the first k-mer context and encounters regions with only N bases, it falls back to uniform random base selection until a valid context is established.

For large genomes (total bases > threshold), intervals can be split into chunks and processed in parallel using multiple cores. Each chunk samples independently and the resulting sequences are concatenated. For file output modes ("fasta" or "seq"), the parallel path first samples to in-memory vectors and then writes the combined result.

PARAMETER	DESCRIPTION
`model`	Trained Markov model from :func:`gsynth_train`. TYPE: `GsynthModel`
`output`	Output file path. If `None`, sequences are returned in memory (equivalent to `output_format="vector"`). TYPE: `str` DEFAULT: `None`
`output_format`	Output format: `"fasta"` -- FASTA text format. `"seq"` -- misha binary `.seq` format. `"vector"` -- return sequences as a Python list of strings (does not write to file). TYPE: `(fasta, seq, vector)` DEFAULT: `"fasta"`
`intervals`	Genomic intervals to synthesise. If `None`, uses all chromosomes. TYPE: `DataFrame` DEFAULT: `None`
`iterator`	Iterator bin size for track extraction during bin-index computation. TYPE: `int` DEFAULT: `None`
`mask_copy`	Intervals where the original reference sequence is preserved verbatim instead of being sampled. Useful for keeping repetitive or regulatory regions intact. Should be non-overlapping and sorted by start position within each chromosome. TYPE: `DataFrame` DEFAULT: `None`
`n_samples`	Number of independent samples to generate per interval. When `n_samples > 1` and `output_format="fasta"`, headers include a `_sampleN` suffix. When `output_format="vector"`, the returned list has length `n_intervals * n_samples`. TYPE: `int` DEFAULT: `1`
`seed`	Random seed for reproducibility. If `None`, uses the current random state. TYPE: `int` DEFAULT: `None`
`bin_merge`	Sampling-time bin merge overrides, one element per model dimension. Each element is either `None` (use the training-time bin_merge) or a list of merge-range dicts as accepted by :func:`gsynth_bin_map`. This allows redirecting sparse bins to better-populated ones at sampling time without retraining the model. TYPE: `list` DEFAULT: `None`
`allow_parallel`	Whether to enable parallel chunking for large genomes. When `True` and the total bases exceed max_chunk_size, intervals are split across multiple processes. When `False`, always runs single-threaded. TYPE: `bool` DEFAULT: `True`
`num_cores`	Number of worker processes. If `None`, defaults to `multiprocessing.cpu_count()`. Capped at the number of interval rows. TYPE: `int` DEFAULT: `None`
`max_chunk_size`	Total-base threshold above which parallel processing triggers. Defaults to `GSYNTH_MAX_CHUNK_SIZE` (1 billion). TYPE: `int` DEFAULT: `None`

RETURNS	DESCRIPTION
`list of str or None`	List of nucleotide strings when output is `None` or output_format is `"vector"`. `None` otherwise (output is written to file).

RAISES	DESCRIPTION
`TypeError`	If model is not a :class:`GsynthModel`.

See Also

gsynth_train : Train the model consumed by this function. gsynth_random : Generate random sequences without a model. gsynth_save : Persist a model for later sampling.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> model = pm.gsynth_train()
>>> seqs = pm.gsynth_sample(
...     model,
...     intervals=pm.gintervals(["1"], [0], [1000]),
...     seed=42,
... )
>>> len(seqs[0])
1000

pymisha.gsynth_random ¶

gsynth_random(*, intervals=None, nuc_probs=None, output=None, output_format='fasta', mask_copy=None, n_samples=1, seed=None)

Generate random genome sequences without a trained model.

Produces random DNA sequences where each nucleotide is sampled independently according to the specified probabilities. Unlike :func:gsynth_sample, no Markov context is used -- consecutive bases are statistically independent. This is useful for generating baseline random sequences or sequences with a specific GC content.

PARAMETER	DESCRIPTION
`intervals`	Genomic intervals to generate. If `None`, uses all chromosomes. TYPE: `DataFrame` DEFAULT: `None`
`nuc_probs`	Nucleotide probabilities keyed by `'A'`, `'C'`, `'G'`, `'T'`. Values are automatically normalised to sum to 1. Default is uniform (0.25 each). TYPE: `dict` DEFAULT: `None`
`output`	Output file path. If `None`, sequences are returned in memory. TYPE: `str` DEFAULT: `None`
`output_format`	Output format: `"fasta"` -- FASTA text format. `"seq"` -- misha binary `.seq` format. `"vector"` -- return sequences as a Python list of strings. TYPE: `(fasta, seq, vector)` DEFAULT: `"fasta"`
`mask_copy`	Intervals where the original reference sequence is preserved instead of being randomly generated. TYPE: `DataFrame` DEFAULT: `None`
`n_samples`	Number of independent samples to generate per interval. TYPE: `int` DEFAULT: `1`
`seed`	Random seed for reproducibility. If `None`, uses the current random state. TYPE: `int` DEFAULT: `None`

RETURNS	DESCRIPTION
`list of str or None`	List of nucleotide strings when output is `None` or output_format is `"vector"`. `None` otherwise (output is written to file).

See Also

gsynth_sample : Sample from a trained Markov model. gsynth_train : Train a Markov model for context-dependent sampling.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()

Uniform random sequence:

>>> seqs = pm.gsynth_random(
...     intervals=pm.gintervals(["1"], [0], [1000]),
...     seed=42,
... )
>>> len(seqs[0])
1000

pymisha.gsynth_replace_kmer ¶

gsynth_replace_kmer(target, replacement, *, intervals=None, output=None, output_format='fasta', check_composition=True)

Iteratively replace a k-mer in genome sequences.

Scans each sequence and replaces every occurrence of target with replacement. If a replacement creates a new instance of target (e.g., replacing "CG" with "GC" in the sequence "CCG" produces "CGC"), the new instance is also replaced. The scan repeats until the sequence is completely free of target.

When target and replacement are permutations of each other (e.g., "CG" and "GC"), the operation acts as a local "bubble sort" of nucleotides, preserving the total base counts and GC content of the genome.

PARAMETER	DESCRIPTION
`target`	K-mer to remove. Case-insensitive (converted to uppercase internally). TYPE: `str`
`replacement`	Replacement sequence. Must be the same length as target. TYPE: `str`
`intervals`	Genomic intervals to process. If `None`, uses all chromosomes. TYPE: `DataFrame` DEFAULT: `None`
`output`	Output file path. If `None`, sequences are returned in memory. TYPE: `str` DEFAULT: `None`
`output_format`	Output format: `"fasta"` -- FASTA text format. `"seq"` -- misha binary `.seq` format. `"vector"` -- return sequences as a Python list of strings. TYPE: `(fasta, seq, vector)` DEFAULT: `"fasta"`
`check_composition`	If `True`, verify that target and replacement contain the same nucleotides (i.e., are anagrams). Set to `False` to allow replacements that change base composition. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`list of str or None`	List of modified nucleotide strings when output is `None` or output_format is `"vector"`. `None` otherwise (output is written to file).

RAISES	DESCRIPTION
`ValueError`	If target or replacement is empty, if they differ in length, or if `check_composition=True` and their nucleotide compositions differ.

See Also

gsynth_sample : Markov-model-based genome synthesis. gsynth_random : Independent random nucleotide generation.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()

Remove all CpG dinucleotides while preserving GC content:

>>> seqs = pm.gsynth_replace_kmer(
...     "CG", "GC",
...     intervals=pm.gintervals(["1"], [0], [1000]),
... )
>>> "CG" not in seqs[0]
True

pymisha.gsynth_save ¶

gsynth_save(model, path, *, compress=False)

Save a trained model to disk in .gsm format.

Serialises a :class:GsynthModel to a cross-platform .gsm directory (or ZIP archive when compress=True) containing YAML metadata and raw binary arrays. The file can later be restored with :func:gsynth_load.

The .gsm format stores counts and CDFs as raw float64 arrays in row-major (C) order, making them readable from both Python and R without any language-specific serialisation quirks.

PARAMETER	DESCRIPTION
`model`	Trained model to save. TYPE: `GsynthModel`
`path`	Destination path. When compress is `False` (default), a directory is created at this path. When `True`, a ZIP archive is written. Parent directories are not created automatically. TYPE: `str`
`compress`	If `True`, write a ZIP archive instead of a directory. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`None`

RAISES	DESCRIPTION
`TypeError`	If model is not a :class:`GsynthModel`.

See Also

gsynth_load : Restore a model saved by this function. gsynth_train : Create a model. gsynth_convert : Convert legacy pickle models to .gsm format.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> model = pm.gsynth_train()
>>> import tempfile, os
>>> path = os.path.join(tempfile.mkdtemp(), "model.gsm")
>>> pm.gsynth_save(model, path)

pymisha.gsynth_load ¶

gsynth_load(path)

Load a trained model from disk.

Auto-detects the format: .gsm directory, .gsm ZIP archive, or legacy pickle.

PARAMETER	DESCRIPTION
`path`	Path to the saved model. Can be a `.gsm` directory, a ZIP file, or a legacy pickle file. TYPE: `str`

RETURNS	DESCRIPTION
`GsynthModel`	The deserialised model, ready for use with :func:`gsynth_sample`.

RAISES	DESCRIPTION
`TypeError`	If the deserialised object is not a :class:`GsynthModel`.
`FileNotFoundError`	If path does not exist.
`ValueError`	If the format or version is unrecognised.

See Also

gsynth_save : Save a model to disk. gsynth_train : Create a new model from scratch. gsynth_sample : Sample sequences from the loaded model.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> model = pm.gsynth_train()
>>> import tempfile, os
>>> path = os.path.join(tempfile.mkdtemp(), "model.gsm")
>>> pm.gsynth_save(model, path)
>>> restored = pm.gsynth_load(path)
>>> restored.total_bins == model.total_bins
True

Genome Synthesis¶

pymisha.GsynthModel dataclass ¶

num_kmers property ¶

pymisha.gsynth_bin_map ¶

pymisha.gsynth_train ¶

pymisha.gsynth_sample ¶

pymisha.gsynth_random ¶

pymisha.gsynth_replace_kmer ¶

pymisha.gsynth_save ¶

pymisha.gsynth_load ¶

pymisha.GsynthModel `dataclass` ¶

num_kmers `property` ¶