Skip to content

Genome Synthesis

Classes and functions for training generative models on genomic sequences and sampling synthetic genomes.

pymisha.GsynthModel dataclass

Trained stratified Markov model for genome synthesis.

Stores the transition probabilities (as CDFs) for a k-th order Markov chain, optionally stratified by one or more genomic track dimensions. Each of the 4^k possible k-mer contexts maps to a probability distribution over the four nucleotides (A, C, G, T), independently for every flat bin in the stratification grid. The model is created by :func:gsynth_train and consumed by :func:gsynth_sample.

ATTRIBUTE DESCRIPTION
k

Markov order (context length). Default 5 for backward compatibility.

TYPE: int

n_dims

Number of stratification dimensions.

TYPE: int

dim_sizes

Number of bins per dimension.

TYPE: list of int

dim_specs

Per-dimension specification (expr, breaks, num_bins, bin_map).

TYPE: list of dict

total_bins

Product of all dim_sizes (total flat bins).

TYPE: int

model_data

Contains 'counts' (list of 2-D arrays, one per flat bin, shape (4^k, 4)) and 'cdf' (same layout, cumulative probabilities).

TYPE: dict

total_kmers

Total k-mers counted during training.

TYPE: int

per_bin_kmers

K-mers per flat bin.

TYPE: ndarray

total_masked

Positions skipped due to mask.

TYPE: int

total_n

Positions skipped due to N bases.

TYPE: int

pseudocount

Pseudocount used for CDF normalization.

TYPE: float

See Also

gsynth_train : Create a GsynthModel from genome sequences. gsynth_sample : Generate synthetic sequences from a model. gsynth_save : Persist a model to disk. gsynth_load : Restore a model from disk.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> model = pm.gsynth_train()
>>> model.n_dims
0
>>> model.total_bins
1

num_kmers property

num_kmers

Number of k-mer contexts (4^k).

pymisha.gsynth_bin_map

gsynth_bin_map(breaks, merge_ranges)

Compute bin mapping for merging sparse bins.

Converts value-based merge specifications into an integer index array that redirects source bins to target bins. This is useful when certain stratification bins have too few observations to learn reliable transition probabilities -- their counts can be folded into a neighbouring, better-populated bin.

PARAMETER DESCRIPTION
breaks

Sorted bin boundaries (length = num_bins + 1).

TYPE: array - like

merge_ranges

Each dict has:

  • "from" : float or tuple of (lo, hi) -- source value or range to remap. A scalar v is shorthand for (v, inf).
  • "to" : tuple of (lo, hi) -- target value range whose bin receives the merged counts. Must overlap exactly one bin defined by breaks.

TYPE: list of dict

RETURNS DESCRIPTION
ndarray

Integer array of length num_bins, where bin_map[i] is the 0-based target bin index for source bin i. Unmapped bins map to themselves (identity).

RAISES DESCRIPTION
ValueError

If breaks has fewer than 2 elements, or if a "to" range does not match any bin in breaks.

See Also

gsynth_train : Accepts bin_merge specifications per dimension.

Examples:

>>> from pymisha import gsynth_bin_map
>>> breaks = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
>>> gsynth_bin_map(breaks, [{"from": (0.4, 0.5), "to": (0.3, 0.4)}])
array([0, 1, 2, 3, 3], dtype=int32)

Multiple merges -- fold both tails into the centre:

>>> breaks = [0.0, 0.25, 0.5, 0.75, 1.0]
>>> gsynth_bin_map(breaks, [
...     {"from": (0.0, 0.25), "to": (0.25, 0.5)},
...     {"from": (0.75, 1.0), "to": (0.5, 0.75)},
... ])
array([1, 1, 2, 2], dtype=int32)

pymisha.gsynth_train

gsynth_train(*dim_specs, mask=None, intervals=None, iterator=None, pseudocount=1.0, min_obs=0, k=5, allow_parallel=True, num_cores=None, max_chunk_size=None)

Train a stratified Markov model from genome sequences.

Computes a k-th order Markov model optionally stratified by bins of one or more track expressions (e.g., GC content and CpG dinucleotide frequency). The resulting :class:GsynthModel can be used with :func:gsynth_sample to generate synthetic genomes that preserve the k-mer statistics of the original genome within each stratification bin.

Both the forward-strand (k+1)-mer and its reverse complement are counted for every valid position, ensuring strand-symmetric transition probabilities. Positions containing N bases are skipped and counted separately in the returned model's total_n attribute.

When called with no dimension specifications, trains a single unstratified (0-D) model.

For large genomes (total bases > threshold), intervals can be split into chunks and processed in parallel using multiple cores. Each chunk trains independently, and the resulting k-mer count arrays are merged before computing the final CDF. This matches the R misha parallel gsynth behavior.

PARAMETER DESCRIPTION
*dim_specs

Each positional argument is a dict specifying a stratification dimension with the following keys:

  • "expr" (str): Track expression for this dimension (required).
  • "breaks" (array-like): Sorted bin boundaries (required). Length must be at least 2.
  • "bin_merge" (list of dict, optional): Merge specifications for sparse bins, in the format accepted by :func:gsynth_bin_map.

TYPE: dict DEFAULT: ()

mask

Intervals to exclude from training. Regions in the mask do not contribute to k-mer counts but are tallied in total_masked.

TYPE: DataFrame DEFAULT: None

intervals

Genomic intervals to train on. If None, uses all chromosomes.

TYPE: DataFrame DEFAULT: None

iterator

Iterator bin size for track extraction. Determines the resolution at which track values are evaluated.

TYPE: int DEFAULT: None

pseudocount

Pseudocount added to all k-mer counts to avoid zero probabilities in the CDF.

TYPE: float DEFAULT: 1.0

min_obs

Minimum number of (k+1)-mer observations required per bin. Reserved for future use.

TYPE: int DEFAULT: 0

k

Markov order (context length). Must be in [1, 10]. The model stores 4^k context states; higher values capture longer-range dependencies but require more training data and memory.

TYPE: int DEFAULT: 5

allow_parallel

Whether to enable parallel chunking for large genomes. When True and the total bases exceed max_chunk_size, intervals are split across multiple processes. When False, always runs single-threaded.

TYPE: bool DEFAULT: True

num_cores

Number of worker processes. If None, defaults to multiprocessing.cpu_count(). Capped at the number of interval rows.

TYPE: int DEFAULT: None

max_chunk_size

Total-base threshold above which parallel processing triggers. Defaults to GSYNTH_MAX_CHUNK_SIZE (1 billion).

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
GsynthModel

Trained model containing transition CDFs, dimension metadata, and training statistics.

RAISES DESCRIPTION
TypeError

If a dimension spec is not a dict.

ValueError

If a dimension spec is missing "expr" or "breaks", or if breaks has fewer than 2 elements, or if no data is extracted for a given expression, or if k is not in [1, 10].

See Also

gsynth_sample : Sample synthetic sequences from a trained model. gsynth_random : Generate random sequences without a model. gsynth_save : Persist a trained model to disk. gsynth_load : Restore a model from disk. gsynth_bin_map : Compute bin-merge mappings for sparse bins.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()

Train an unstratified (0-D) model over the whole genome:

>>> model_0d = pm.gsynth_train()
>>> model_0d.n_dims
0

pymisha.gsynth_sample

gsynth_sample(model, output=None, *, output_format='fasta', intervals=None, iterator=None, mask_copy=None, n_samples=1, seed=None, bin_merge=None, allow_parallel=True, num_cores=None, max_chunk_size=None)

Sample synthetic genome sequences from a trained model.

Generates a synthetic genome by sampling from a trained stratified Markov model. For each genomic position the sampler looks up the current k-mer context and the position's stratification bin, then draws the next nucleotide from the corresponding CDF. The result preserves the k-mer statistics of the original genome within each bin.

When the sampler needs to initialise the first k-mer context and encounters regions with only N bases, it falls back to uniform random base selection until a valid context is established.

For large genomes (total bases > threshold), intervals can be split into chunks and processed in parallel using multiple cores. Each chunk samples independently and the resulting sequences are concatenated. For file output modes ("fasta" or "seq"), the parallel path first samples to in-memory vectors and then writes the combined result.

PARAMETER DESCRIPTION
model

Trained Markov model from :func:gsynth_train.

TYPE: GsynthModel

output

Output file path. If None, sequences are returned in memory (equivalent to output_format="vector").

TYPE: str DEFAULT: None

output_format

Output format:

  • "fasta" -- FASTA text format.
  • "seq" -- misha binary .seq format.
  • "vector" -- return sequences as a Python list of strings (does not write to file).

TYPE: (fasta, seq, vector) DEFAULT: "fasta"

intervals

Genomic intervals to synthesise. If None, uses all chromosomes.

TYPE: DataFrame DEFAULT: None

iterator

Iterator bin size for track extraction during bin-index computation.

TYPE: int DEFAULT: None

mask_copy

Intervals where the original reference sequence is preserved verbatim instead of being sampled. Useful for keeping repetitive or regulatory regions intact. Should be non-overlapping and sorted by start position within each chromosome.

TYPE: DataFrame DEFAULT: None

n_samples

Number of independent samples to generate per interval. When n_samples > 1 and output_format="fasta", headers include a _sampleN suffix. When output_format="vector", the returned list has length n_intervals * n_samples.

TYPE: int DEFAULT: 1

seed

Random seed for reproducibility. If None, uses the current random state.

TYPE: int DEFAULT: None

bin_merge

Sampling-time bin merge overrides, one element per model dimension. Each element is either None (use the training-time bin_merge) or a list of merge-range dicts as accepted by :func:gsynth_bin_map. This allows redirecting sparse bins to better-populated ones at sampling time without retraining the model.

TYPE: list DEFAULT: None

allow_parallel

Whether to enable parallel chunking for large genomes. When True and the total bases exceed max_chunk_size, intervals are split across multiple processes. When False, always runs single-threaded.

TYPE: bool DEFAULT: True

num_cores

Number of worker processes. If None, defaults to multiprocessing.cpu_count(). Capped at the number of interval rows.

TYPE: int DEFAULT: None

max_chunk_size

Total-base threshold above which parallel processing triggers. Defaults to GSYNTH_MAX_CHUNK_SIZE (1 billion).

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
list of str or None

List of nucleotide strings when output is None or output_format is "vector". None otherwise (output is written to file).

RAISES DESCRIPTION
TypeError

If model is not a :class:GsynthModel.

See Also

gsynth_train : Train the model consumed by this function. gsynth_random : Generate random sequences without a model. gsynth_save : Persist a model for later sampling.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> model = pm.gsynth_train()
>>> seqs = pm.gsynth_sample(
...     model,
...     intervals=pm.gintervals(["1"], [0], [1000]),
...     seed=42,
... )
>>> len(seqs[0])
1000

pymisha.gsynth_random

gsynth_random(*, intervals=None, nuc_probs=None, output=None, output_format='fasta', mask_copy=None, n_samples=1, seed=None)

Generate random genome sequences without a trained model.

Produces random DNA sequences where each nucleotide is sampled independently according to the specified probabilities. Unlike :func:gsynth_sample, no Markov context is used -- consecutive bases are statistically independent. This is useful for generating baseline random sequences or sequences with a specific GC content.

PARAMETER DESCRIPTION
intervals

Genomic intervals to generate. If None, uses all chromosomes.

TYPE: DataFrame DEFAULT: None

nuc_probs

Nucleotide probabilities keyed by 'A', 'C', 'G', 'T'. Values are automatically normalised to sum to 1. Default is uniform (0.25 each).

TYPE: dict DEFAULT: None

output

Output file path. If None, sequences are returned in memory.

TYPE: str DEFAULT: None

output_format

Output format:

  • "fasta" -- FASTA text format.
  • "seq" -- misha binary .seq format.
  • "vector" -- return sequences as a Python list of strings.

TYPE: (fasta, seq, vector) DEFAULT: "fasta"

mask_copy

Intervals where the original reference sequence is preserved instead of being randomly generated.

TYPE: DataFrame DEFAULT: None

n_samples

Number of independent samples to generate per interval.

TYPE: int DEFAULT: 1

seed

Random seed for reproducibility. If None, uses the current random state.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
list of str or None

List of nucleotide strings when output is None or output_format is "vector". None otherwise (output is written to file).

See Also

gsynth_sample : Sample from a trained Markov model. gsynth_train : Train a Markov model for context-dependent sampling.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()

Uniform random sequence:

>>> seqs = pm.gsynth_random(
...     intervals=pm.gintervals(["1"], [0], [1000]),
...     seed=42,
... )
>>> len(seqs[0])
1000

pymisha.gsynth_replace_kmer

gsynth_replace_kmer(target, replacement, *, intervals=None, output=None, output_format='fasta', check_composition=True)

Iteratively replace a k-mer in genome sequences.

Scans each sequence and replaces every occurrence of target with replacement. If a replacement creates a new instance of target (e.g., replacing "CG" with "GC" in the sequence "CCG" produces "CGC"), the new instance is also replaced. The scan repeats until the sequence is completely free of target.

When target and replacement are permutations of each other (e.g., "CG" and "GC"), the operation acts as a local "bubble sort" of nucleotides, preserving the total base counts and GC content of the genome.

PARAMETER DESCRIPTION
target

K-mer to remove. Case-insensitive (converted to uppercase internally).

TYPE: str

replacement

Replacement sequence. Must be the same length as target.

TYPE: str

intervals

Genomic intervals to process. If None, uses all chromosomes.

TYPE: DataFrame DEFAULT: None

output

Output file path. If None, sequences are returned in memory.

TYPE: str DEFAULT: None

output_format

Output format:

  • "fasta" -- FASTA text format.
  • "seq" -- misha binary .seq format.
  • "vector" -- return sequences as a Python list of strings.

TYPE: (fasta, seq, vector) DEFAULT: "fasta"

check_composition

If True, verify that target and replacement contain the same nucleotides (i.e., are anagrams). Set to False to allow replacements that change base composition.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
list of str or None

List of modified nucleotide strings when output is None or output_format is "vector". None otherwise (output is written to file).

RAISES DESCRIPTION
ValueError

If target or replacement is empty, if they differ in length, or if check_composition=True and their nucleotide compositions differ.

See Also

gsynth_sample : Markov-model-based genome synthesis. gsynth_random : Independent random nucleotide generation.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()

Remove all CpG dinucleotides while preserving GC content:

>>> seqs = pm.gsynth_replace_kmer(
...     "CG", "GC",
...     intervals=pm.gintervals(["1"], [0], [1000]),
... )
>>> "CG" not in seqs[0]
True

pymisha.gsynth_save

gsynth_save(model, path, *, compress=False)

Save a trained model to disk in .gsm format.

Serialises a :class:GsynthModel to a cross-platform .gsm directory (or ZIP archive when compress=True) containing YAML metadata and raw binary arrays. The file can later be restored with :func:gsynth_load.

The .gsm format stores counts and CDFs as raw float64 arrays in row-major (C) order, making them readable from both Python and R without any language-specific serialisation quirks.

PARAMETER DESCRIPTION
model

Trained model to save.

TYPE: GsynthModel

path

Destination path. When compress is False (default), a directory is created at this path. When True, a ZIP archive is written. Parent directories are not created automatically.

TYPE: str

compress

If True, write a ZIP archive instead of a directory.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
None
RAISES DESCRIPTION
TypeError

If model is not a :class:GsynthModel.

See Also

gsynth_load : Restore a model saved by this function. gsynth_train : Create a model. gsynth_convert : Convert legacy pickle models to .gsm format.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> model = pm.gsynth_train()
>>> import tempfile, os
>>> path = os.path.join(tempfile.mkdtemp(), "model.gsm")
>>> pm.gsynth_save(model, path)

pymisha.gsynth_load

gsynth_load(path)

Load a trained model from disk.

Auto-detects the format: .gsm directory, .gsm ZIP archive, or legacy pickle.

PARAMETER DESCRIPTION
path

Path to the saved model. Can be a .gsm directory, a ZIP file, or a legacy pickle file.

TYPE: str

RETURNS DESCRIPTION
GsynthModel

The deserialised model, ready for use with :func:gsynth_sample.

RAISES DESCRIPTION
TypeError

If the deserialised object is not a :class:GsynthModel.

FileNotFoundError

If path does not exist.

ValueError

If the format or version is unrecognised.

See Also

gsynth_save : Save a model to disk. gsynth_train : Create a new model from scratch. gsynth_sample : Sample sequences from the loaded model.

Examples:

>>> import pymisha as pm
>>> _ = pm.gdb_init_examples()
>>> model = pm.gsynth_train()
>>> import tempfile, os
>>> path = os.path.join(tempfile.mkdtemp(), "model.gsm")
>>> pm.gsynth_save(model, path)
>>> restored = pm.gsynth_load(path)
>>> restored.total_bins == model.total_bins
True