NEWS.md
gsynth.sample() with output_format = "fasta" now writes a samtools-compatible .fai alongside the FASTA (tracking byte offsets during the write loop, no extra pass over the file). Removes the need to call samtools faidx by hand on every sampled genome. Matches the convention already used by ggenome.implant().cell_merge argument to gsynth.sample() and a companion exported utility gsynth.cell_merge(). Unlike bin_merge, which redirects bins independently along each dimension, cell_merge redirects specific per-joint-cell CDFs to other per-joint-cell CDFs (e.g. (GC=0.725, CG=0.05) -> (GC=0.70, CG=0.08)). Redirects are applied after bin_merge and after sparse-bin uniform fallback via pointer-level list-element copies in cdf_list, so there is no matrix duplication and no change to the C++ sampling hot path. Intended for reassigning under-trained joint cells to a nearest-sufficient neighbor to avoid sequence leakage from tiny training sets.gsynth.forbid_kmer(model, pattern) returning a new gsynth.model whose samples are guaranteed pattern-free downstream of the seeding window (e.g. gsynth.forbid_kmer(model, "CG") for a CpG-null background). Analytically equivalent to rejection sampling, implemented by zeroing the transitions that would produce the pattern and renormalizing per state-row; pattern length is capped at model$k + 1.gsynth.sample() silently falling back to uniform-random sampling for most positions inside any interval whose start was not aligned to model$iterator (#94). The iterator bin size is now passed explicitly from R instead of inferred from the first same-chrom diff in iter_starts (which was the partial first-bin width, not the full bin width). Output k-mer statistics on unaligned intervals were wrong; composition was partially uniform.gsynth.train(), gsynth.sample(), and gsynth.random_seqs() silently reading sequences from the wrong chromosome when the intervals argument covered a subset of the genome that omitted one or more earlier chromosomes in the chromkey. For every chromosome in the input that came after a missing one, the C++ side opened the wrong chromosome’s sequence (shifted by the number of earlier missing chromosomes), producing invalid models and corrupted sampled genomes without any error. Calls that passed intervals = gintervals.all() or left intervals at its default (which is gintervals.all()) were not affected. Users who ran these functions on custom interval subsets should re-run them with this version.track.dat for every chromosome during iterator init and chromosome transitions. Made indexed tracks unusable on genomes with many contigs (e.g. Pan_troglodytes with 4344 contigs).pwm.max.pos returning wrong strand sign. The direction (positive/negative) of the returned position was determined by the last scanned position in the interval rather than the position with the best score. Bug existed since pwm.max.pos was introduced (v4.3.0).rexit() due to R’s SIGTERM handler. All multitasked operations (gmultitasking = TRUE) were affected: after forking, child processes continued executing R-level post-processing code instead of terminating. Track-creating functions (gtrack.create, gtrack.smooth, gtrack.create_pwm_energy) could corrupt data on indexed databases — children ran gtrack.convert_to_indexed concurrently, reading incomplete files and deleting files still being written by other children. On non-indexed databases, track data was correct. Query functions (gextract, gscreen, gdist, gquantiles, gcor, gsummary, gcis.decay, gapply, gbins.quantiles, gbins.summary) returned correct results since data was written to shared memory before the failed exit.ggenome.implant() for replacing intervals in a reference genome with donor sequences and writing a new FASTA. Supports literal donor sequences or extraction from a misha database, with optional trackdb creation.ggenome.transplant() as sugar for cross-genome sequence swaps — extracts from a source genome and implants into a target genome in a single call.direction="below" with bidirect=TRUE taking min across strands instead of max. A genomic substitution changes both strands, so disrupting a motif requires both strands to fall below the threshold.score.min = score.thresh when direction = "below". score.min now defaults to NULL (no filter) for both directions. Set score.min explicitly to filter windows.k parameter to gsynth.train() to configure the Markov order (1-8, default 5).gextract with PWM vtracks, gseq.pwm, PWM edit distance). Replaced switch-statement DNA base encoding with O(1) lookup tables, eliminating branch mispredictions in the inner scoring loop.DnaPSSM::integrate_energy where case 'h' was written instead of case 'g' in the reverse-complement scoring path. This caused lowercase ‘g’ bases to be silently skipped (contributing zero to the score) instead of being complemented to ‘C’. The affected function was dead code (never called by any user-facing API — all callers use integrate_energy_logspat instead), so no user-visible results were affected.direction parameter (“above”/“below”) to PWM edit distance functions for computing minimum edits to bring score below a threshold (motif disruption).pwm.edit_distance, pwm.edit_distance.pos, pwm.max.edit_distance, pwm.edit_distance.lse, and pwm.edit_distance.lse.pos. Compute minimum edit distance (substitutions and indels) to reach a PWM score threshold, with per-window (max) and aggregate (LSE) scoring modes.gseq.pwm_edits() for retrieving detailed per-edit information (positions and replacement bases) from PWM edit distance computation.gscreen and gextract, improving parallel efficiency on genomes with large chromosomes.gscreen returning split intervals at sub-chromosome parallel boundaries instead of merged contiguous intervals.gsummary, gdist, and gcor (Pearson), improving parallel efficiency with explicit iterators.gscreen and gextract with dense iterators.pipe.Rd documenting %>% with a \usage section.gintervals.load() failing with “Intervals set does not exist” for intervals from databases loaded via gdataset.load().Fixed gdataset.load() performance: loading a large database as a dataset is now fast regardless of call order. Previously, gsetroot(small_db); gdataset.load(large_db) was extremely slow because the dataset scan used R’s list.files(recursive=TRUE), which lists every file in the tree. Replaced with a fast path that reads .db.cache or falls back to C++ fts-based scanning. Also changed gdb.reload() calls in gdataset.load()/gdataset.unload() to use rescan=FALSE, so existing databases use their cache instead of being rescanned.
Fixed collision detection for hierarchical tracks in gdataset.load(): the R-level scan returned /-separated names while the dataset map uses .-separated names, causing collision detection to silently fail for tracks in subdirectories.
Added motif format import functions: gseq.read_meme(), gseq.read_jaspar(), and gseq.read_homer() for reading MEME, JASPAR PFM, and HOMER motif formats. Returns named lists of position probability matrices directly usable with gseq.pwm(). All parsers are native with no new dependencies.
Added track export functions: gtrack.export_bedgraph() and gtrack.export_bigwig() for exporting tracks and track expressions to standard bedGraph and BigWig formats. Supports gzip compression, virtual tracks, track expressions, and custom iterators.
gsynth.save() and gsynth.load() now use the cross-platform .gsm format (YAML metadata + binary arrays) instead of R-specific RDS. Models saved with pymisha can now be loaded in R and vice versa. Legacy RDS files are still supported for backward compatibility.
Added compress parameter to gsynth.save() to optionally save as a ZIP archive.
Added gsynth.convert() to convert legacy RDS model files to the new .gsm format.
Added gintervals.attr.get(), gintervals.attr.set(), gintervals.attr.export(), and gintervals.attr.import() for managing interval set attributes. Attributes are stored as .iattr binary files (null-separated key/value pairs) next to .interv files for small interval sets, or inside the directory for big interval sets.
gintervals.rm() now cleans up companion .iattr attribute files when deleting interval sets.
Replaced the naive variance formula (E[X²]-E[X]²) with Welford’s numerically stable online algorithm for standard deviation computation in all track types (GenomeTrackFixedBin, GenomeTrackSparse, GenomeTrackInMemory, GenomeTrackArrays). The naive formula is prone to catastrophic cancellation when values are large or have small variance relative to their mean.
Improved sum accumulation precision by using double-precision intermediate accumulators in all non-sliding sum paths. The m_last_sum member remains float for API compatibility, but per-interval accumulation now happens in double, eliminating float→float rounding errors for large sums.
Fixed potential out-of-bounds access in SAMPLE/SAMPLE_POS functions when the random number generator returns exactly 1.0. Added bounds clamping (ported from pymisha).
Fixed umask leak in GenomeTrack::write_type() and GenomeTrackFixedBin::init_write() — the old umask is now saved and restored immediately after open(), preventing permanent process-wide umask changes.
Improved header validation in GenomeTrackFixedBin::init_read() to use integer modulo instead of floating-point division, with an additional underflow guard.
Added reusable scratch buffers (m_scratch_all_values, m_scratch_all_positions) to GenomeTrackFixedBin, avoiding per-call heap allocation for SAMPLE/SAMPLE_POS operations (ported from pymisha).
Added indexed format support for 2D tracks (rectangles and points). Per-chromosome-pair files can now be consolidated into a single track.dat + track.idx, matching the indexed format already available for 1D tracks.
New function gtrack.2d.convert_to_indexed() converts existing 2D tracks to indexed format with optional removal of old per-pair files.
gtrack.convert_to_indexed() now dispatches to 2D conversion automatically when given a rectangles or points track.
gdb.convert_to_indexed() with convert_tracks = TRUE now includes 2D tracks in batch conversion.
2D tracks created via gtrack.2d.create(), gtrack.2d.import(), and gtrack.2d.import_contacts() are automatically converted to indexed format when the database is in indexed mode.
Fixed NaN comparison bug in distance.center virtual track with iterator modifiers — used != instead of std::isnan(), causing the guard condition to always be true under IEEE 754 semantics.
Fixed distance virtual track returning NaN instead of a valid distance when one side of the sorted interval array has no intervals on the current chromosome (NaN-unsafe min in sequential path).
Fixed gtrack.smooth MEAN algorithm producing spurious NaN values during periodic recalculation when NaN values are present in the smoothing window. The recalculation checked isnan on the accumulator instead of the incoming value, permanently poisoning the sum.
Fixed incorrect standard deviation/variance computation in virtual tracks with filter/mask — the parallel variance formula used the already-incremented weight instead of the pre-increment value.
Fixed min.pos/max.pos virtual track aggregation comparing genomic positions instead of signal values when combining sub-intervals, returning the leftmost/rightmost sub-interval position instead of the position of the actual min/max value.
Fixed missing out_of_range check in distance.center virtual track with iterator modifiers, causing undefined behavior when all source intervals lie upstream of the query coordinate.
Fixed copy-paste error in quantile track computation where highest_vals buffer was resized using kid_lowest_vals_buf_size instead of kid_highest_vals_buf_size.
Fixed int32 truncation of int64_t genomic coordinates in gintervals.normalize — could silently produce wrong results for chromosomes longer than ~2.1 Gbp.
Fixed p-value transformation (pv function) being skipped when virtual track has both a filter and a p-value function active.
Fixed lse_accumulate double overload producing NaN when both inputs are -infinity (missing isinf guards that the float overload had).
Fixed GInterval::dist2coord treating coord == end as inside the interval, inconsistent with the half-open [start, end) convention used throughout the codebase.
Fixed distance.center virtual track function returning incorrect values when gextract is called with overlapping input intervals. The sequential scanner could miss containing source intervals when processing bins that go backward in position between overlapping regions.
Fixed neighbor.count virtual track function undercounting neighbors when gextract is called with overlapping input intervals. Same root cause as the distance/distance.edge fix in 5.4.6.
distance and distance.edge virtual track functions returning incorrect values when gextract is called with overlapping input intervals. The sequential scanner could miss closer source intervals when processing bins that go backward in position between overlapping regions.GenomeTrackBinnedTransform.cpp by avoiding arithmetic between distinct anonymous enum types.gdb.export_fasta() to export all contigs from a database to a multi-FASTA file, defaulting to the current gdb with optional groot override.lse virtual track function that computes the log-sum-exp of the values in the iterator interval.gsetroot() for your primary writable database, then gdataset.load() to add read-only datasetsforce=TRUE to override (working db always wins)gdataset.load(): Load a dataset into the namespace (tracks and intervals become available)gdataset.unload(): Remove a dataset from the namespacegdataset.save(): Create a new dataset from selected tracks/intervalsgdataset.ls(): List working database and all loaded datasetsgdataset.info(): Show metadata and contents of a datasetgtrack.dataset(): Get the source path for a track (working db or dataset)gtrack.dbs(): Get all paths where a track exists (for debugging shadowed tracks)gintervals.dataset(): Get the source path for an interval setgintervals.dbs(): Get all paths where an interval set existsdb parameter for filtering by source:
gtrack.ls(db = "/path/to/dataset"): List tracks from a specific sourcegintervals.ls(db = "/path/to/dataset"): List intervals from a specific sourcechrom_sizes.txt files (same genome assembly)gsetroot() works unchangedgtrack.mv(): Rename or move a track within the same databasegtrack.copy(): Copy a track (can copy between databases when multiple are loaded)gcor function that computes correlation between two tracks, or between multiple pairs of tracks.gextract and gscreen almost always did not enable multitasking mode due to incorrect gating.dataframe and names parameters to gdist function that return a data frame instead of an N-dimensional vector.gsynth.train, gsynth.sample and gsynth.save functions that train a Markov model from a genome sequence and sample a synthetic genome from the modelgseq.kmer.dist function that counts the number of occurrences of k-mers in genomic intervals.gtrack.liftover that created overlapping intervals when lifting sparse tracks.gintervals.normalize.gintervals.normalize now returns +1bp for intervals with odd sizes:
interval_relative parameter to giterator.intervals() for interval-aligned bin iteration.genome.seq + genome.idx files instead of per-chromosome filesgdb.info(), gdb.convert_to_indexed(), gtrack.convert_to_indexed(), gintervals.convert_to_indexed(), gintervals.2d.convert_to_indexed()
options(gmulticontig.indexed_format = FALSE) to create databases in legacy format for compatibility with older misha versionsvignette("Database-Formats") for more details.gmax.processes automatically set to 70% of available CPU coresgmax.data.size coordinated with process limits to ensure total memory usage <= 70% of RAM (capped at 10GB per process)gmax.data.size = min((RAM * 0.7) / gmax.processes, 10GB) ensures safe memory usage across all parallel processesgmax.processes * 1000 records (e.g., 2K on laptops, 89K on 128-core servers)options()
vignette("Manual") for detailsgvtrack.create with src parameter). These tracks behave exactly like regular sparse tracks, but are stored in memory and can be used in track expressions.sshift, eshift and filter parameters to gvtrack.create.gintervals.path() and gtrack.path() functions that return the actual file system paths for interval sets and tracks.masked.count and masked.frac virtual track functions that count and fraction masked base pairs (lowercase letters) in the current iterator interval.distance.edge virtual track function that computes edge-to-edge distance from the iterator interval to the closest source interval, using the same calculation as gintervals.neighbors.gtrack.liftover did not fill chromosomes missing the chain with NA values. This caused errors when trying to access the tracks afterwards.gintervals.as_chain function that converts a data frame to a chain object.gintervals.liftover via value_col and multi_target_agg parameters.src_overlap_policy and tgt_overlap_policy parameters to gintervals.liftover, gintervals.load_chain, and gtrack.liftover functions.gtrack.liftover via multi_target_agg parameter.gintervals.load_chain now returns valid misha intervals instead of a chain object.gintervals.load_chain now includes score and chain_id columns for all loaded chainsmin_score parameter in gintervals.load_chain, gintervals.liftover, and gtrack.liftover filters out low-quality chainstgt_overlap_policy = "auto_score" (or "auto") selects the best chain mapping based on alignment score (highest score → longest span → lowest chain_id)include_metadata parameter in gintervals.liftover optionally returns score and chain_id for each mapping BREAKING: “auto” is now an alias for “auto_score”. For the old behavior, use tgt_overlap_policy = "auto_first".canonic parameter to gintervals.liftover (default FALSE) to merge adjacent target intervals resulting from the same source interval and chain.tgt_overlap_policy = "best_cluster_union" (default, aliased as "best_source_cluster"): Uses source union coveragetgt_overlap_policy = "best_cluster_sum": Uses sum of target lengthstgt_overlap_policy = "best_cluster_max": Uses longest single membermax.pos.abs, max.pos.relative, min.pos.abs, min.pos.relative: Returns the position of the maximum/minimum value in the iterator intervalexists: Returns 1 if any value exists (or specific vals if provided), 0 otherwisesize: Returns the number of non-NaN values in the iterator intervalsample: Returns a uniformly sampled source value from the iterator intervalsample.pos.abs and sample.pos.relative: Returns the position of a uniformly sampled valuefirst and last: Returns the first/last value in the iterator intervalfirst.pos.abs, first.pos.relative, last.pos.abs, last.pos.relative: Returns the position of the first/last valuegintervals.neighbors when using mindist=0, maxdist=0: the function would miss zero-distance (touching) intervals when using mindist=0, maxdist=0.pwm.count with spatial sliding windows double-counting bidirectional hits (forward + reverse) at the same genomic position; the sliding path now matches the baseline per-position union semantics.gintervals.load_chain now returns a data frame with 8 columns instead of 7. Columns are: chrom, start, end, strand, chromsrc, startsrc, endsrc, strandsrc.src_overlap_policy and tgt_overlap_policy parameters to gintervals.load_chain, gtrack.liftover and gintervals.liftover functions.neighbor.count virtual track.gintervals.mark_overlaps function that marks overlapping intervals with a group ID.pssm parameter of gvtrack.create and gseq.pwm functions.gseq.pwm and added neutral_chars_policy parameter.pwm, pwm.max and pwm.count) for dense iterators when spatial weighting is disabled, providing significant performance improvements for consecutive genomic intervals.pwm.count(bidirect=TRUE) now counts per-position union of strands (via log-sum-exp), aligning with pwm/pwm.max. Each position contributes at most 1 to the count. To reproduce the old per-strand-sum behavior, add the two strand-specific counts: pwm.count(bidirect=FALSE, strand=1) + pwm.count(bidirect=FALSE, strand=-1).gseq.pwm and gseq.kmer functions that compute pwm and kmer scores on sequences without the need for a genome database.gseq.rev and gseq.comp functions that reverse and complement DNA sequences without the need for a genome database.gseq.revcomp alias for grevcomp function.gintervals.random function that generates random genome intervals.gintervals.covered_bp and gintervals.coverage_fraction functions that calculate the number of base pairs and the fraction of base pairs covered by a set of intervals.pwm.count virtual track function that counts the number of occurrences of a PWM in the current iterator interval.gintervals.neighbors.upstream() - Find upstream neighbors relative to query strandgintervals.neighbors.downstream() - Find downstream neighbors relative to query strandgintervals.neighbors.directional() - Find both upstream and downstream neighborsuse_intervals1_strand parameter to gintervals.neighbors() to use query intervals’ strand for distance directionality.warn.ignored.strand parameter to gintervals.neighbors() to control warnings when query strand is ignored.gintervals.neighbors: a stack imbalance in the C++ code in very rare cases of 2D intervals.gintervals.neighbors due to unbalanced rprotect calls.gintervals.normalize and gintervals.annotate functions.m1-asan build.pwm and kmer virtual track functions: iterator shifts were not applied.colnames parameter to gintervals.mapply function.attrs parameter to gtrack.import function.created.user default attribute in track creation functions.gtrack.import function.gtrack.create_dense function - creates a dense track from an intervals and values.clock_gettime is missing).gtrack.import_bigwig: intern argument was not passed to system calls.grevcomp function (reverse complement of a DNA sequence).gdb.create_genome function.R_curErrorBuf, SET_TYPEOF
Rf_ prefix in the c++ code.ALLGENOME is now only soft deprecated in order to support old misha scripts.gtrack.create_dirs function.gcluster.run.gintervals.neighbors..misha. Variables such as ALLGENOME can now be accessed as .misha$ALLGENOME. This change is not backwards compatible, please update your code accordingly.gintervals.neighbors (same as gintervals.neighbors1 from misha.ext). This means that instead of having two columns of ‘chrom’, ‘start’ and ‘end’, the resulting data frame would have ‘chrom1’, ‘start1’ and ‘end1’.gwget now uses curl in order to work on systems that do not have ftp installed.markdown format.Genomes vignette that demonstrates how to create a new genome database.