Changelog
v0.8.3 (2026-05-31)
- A value-based 2D vtrack with no explicit iterator now iterates the source 2D track's rects. pm.gvtrack_create("v", "hic.track", "weighted.sum"); pm.gextract("v", scope_2d) (no iterator=) used to fall through to one-row-per-scope-interval aggregation; it now defaults the iterator to the source 2D track's rectangles (one row per source rect in the scope), matching R's per-vtrack default iterator. Pass iterator=scope_2d to keep the prior one-row-per-scope-interval behavior. 1D iterator shifts (gvtrack_iterator(sshift=, eshift=)) set on a 2D-source vtrack are now rejected by the scanner path too (the legacy path already rejected them).
- Mixing a 1D dim-projected vtrack with a bare 2D track under a 2D scope now projects each rect. pm.gextract(["dist1_dim1", "dist2_dim2", "hic.track"], scope_2d) used to compute the dim-projected vtracks once for the whole scope and broadcast that single value to every rect row; it now defaults the iterator to the lone 2D track in the expression and projects each rect's [start1, end1] / [start2, end2] independently per row, matching R's implicit-iterator behaviour for a 2D-track-bearing expression.
- gintervals_neighbors on 2D intervals now normalizes chrom names on both sides. Passing two 2D-interval DataFrames whose chrom1 / chrom2 columns disagreed on the chr prefix (one carried it, the other didn't) silently produced None because the per-chrom-pair grouping is a literal string compare; both inputs are now normalized through the same loader path as the 1D gintervals_neighbors, so a query of e.g. ("1", ...) matches targets of e.g. ("chr1", ...).
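The chrom-name normalization described above can be pictured with a minimal sketch. The helper normalize_chrom and the db set are hypothetical names used only for illustration; this is not pymisha's loader path, just the prefix-toggling idea under the assumption that the DB stores exactly one spelling per chromosome.

```python
def normalize_chrom(name, db_chroms):
    """Return the DB spelling of `name`, trying it with and without a 'chr' prefix.

    Hypothetical helper: illustrates the idea only, not pymisha's loader.
    """
    if name in db_chroms:
        return name
    alt = name[3:] if name.startswith("chr") else "chr" + name
    if alt in db_chroms:
        return alt
    raise KeyError(f"unknown chromosome: {name!r}")


# Assumed DB that stores chromosomes with the prefix.
db = {"chr1", "chr2", "chrX"}
query_pairs = [("1", "2"), ("chr1", "X")]

# Normalizing both axes makes ("1", ...) and ("chr1", ...) group together.
normalized = [(normalize_chrom(a, db), normalize_chrom(b, db)) for a, b in query_pairs]
```

Grouping by the normalized pair rather than the raw strings is what lets a query on "1" meet a target on "chr1".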
v0.8.2 (2026-05-31)
- gintervals_neighbors on 2D interval sets now handles large unbounded inputs. A nearest-neighbor query without a maxdist* window on intervals beyond a few million rects previously raised NotImplementedError; it now runs an in-memory quadtree NN iterator in C++ (port of R misha's StatQuadTree::NNIterator), matching R's per-axis-unsigned-gap geometry. R's priority_queue tie-break order on equidistant maxn=1 neighbors is not portable across STL implementations and is not replicated; query rows that hit such ties may pick a different (but equidistant) neighbor. 2D track names passed as intervals1 / intervals2 are now materialized via gextract over the 2D ALLGENOME (full) scope (R parity).
v0.8.1 (2026-05-31)
- 2D tracks built with NaN-valued rectangles now retain the NaN rects, matching R. gtrack_lookup with force_binning=False on a 2D source, and any other path constructing a 2D RECTS track from NaN-bearing values, previously silently dropped NaN rects so the resulting track was incomplete (an off-diagonal gtrack_lookup query returned None); they now write the NaN rects to disk, exclude them from per-bin stat aggregations (avg / min / max / sum / weighted.sum), and gextract returns all rects including NaN values. gtrack_lookup also now evaluates the 2D scope under mode="full" so off-diagonal chrom-pair queries return data.
v0.8.0 (2026-05-29)
- distance / distance.edge / distance.center virtual tracks now return the true nearest source interval on overlapping or nested sources. The previous scan could return a non-nearest interval, even a nonzero distance for a query that overlaps a source (e.g. against rmsk-style nested intervals). All three now use the same nearest-neighbor index as gintervals_neighbors, so distance.edge matches it exactly. distance.center also no longer errors when the source intervals overlap: a bin center inside several intervals resolves to the nearest center.
- gdb_install_intervals / gdb_build_genome now attach name and geneName columns to the installed tss / exons / utr3 / utr5 sets. name is the transcript accession; geneName is the gene symbol (from the gene_name attribute, falling back to gene_id, blank when the source has neither). Overlapping features that unify to one interval concatenate their distinct symbols with ";".
- Indexed dense tracks whose first chromosome has no data no longer report bin_size = 0 or crash on read. A packed/converted indexed track whose leading chrom has an empty index entry left the bin size uninitialized, so gtrack_info reported bin_size = 0 and subsequent reads divided by zero. The bin size is now back-filled from the first non-empty chromosome.
v0.7.1 (2026-05-28)
- 2D gextract with a diagonal band now clips emitted rectangle coords to the band-intersected area. Each contact rectangle returned by a 2D extract (raw track or scanner-driven aggregation over an intervals iterator) is now shrunk to the smallest bounding box containing its intersection with the band, matching R's DiagonalBand::shrink2intersected (so e.g. a 10kb-bin contact on the diagonal with band=(-1024, 1024) keeps its on-diagonal slice instead of returning the full bin). Previously pymisha returned the raw stored coords. Inter-chrom rectangles under an active band are now skipped, matching R.
- weighted.sum (and other 2D stats) over a near-diagonal band now agree with R. The band-intersected area formula used a wrong corner (y1 instead of y2) in the d1-triangle subtraction, over-reporting area by ~3x on contact bins that straddle the diagonal. Fixed to match R's DiagonalBand::intersected_area.
- 2D gextract over a 2D-intervals scope/iterator now normalizes chromosome names (chr1 ↔ 1) before looking them up. Previously the C++ scanner path used a raw chrom-name lookup, so passing a DataFrame whose chrom1 / chrom2 carried a chr prefix against a DB that stores chroms without it (or vice versa) produced chromid=-1 and combined with a registered virtual track could crash with std::bad_alloc. The 1D gextract path already normalized; the 2D scanner path now does too.
- gintervals_2d recycles a length-1 axis against a length-N axis. gintervals_2d("chr1", starts, ends) (axis2 omitted) now produces N rows of (chr1, starts[i], ends[i]) × (chr1, 0, chrom_size), matching R's data.frame-style recycling. Previously this raised ValueError("chroms1 and chroms2 must produce the same number of intervals") when axis1 was a vector and axis2 fell back to chromosome-wide defaults. Equal-length 1D-vector arguments continue to pair positionally.
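The length-1 against length-N recycling can be sketched in plain Python. recycle and intervals_2d_rows are hypothetical helpers, and the chromosome size of 1000 is made up; the sketch only mirrors the row shape described above and does not call pymisha.

```python
def recycle(values, n):
    """Repeat `values` up to length n; n must be a multiple of len(values)."""
    values = list(values)
    if not values or n % len(values) != 0:
        raise ValueError("argument lengths must be multiples of each other")
    return values * (n // len(values))


def intervals_2d_rows(chrom1, starts1, ends1, chrom_size):
    """Pair N axis-1 intervals with a recycled chromosome-wide axis-2 default."""
    n = len(starts1)
    chroms1 = recycle([chrom1], n)                 # length-1 chrom recycled to N
    axis2 = recycle([(chrom1, 0, chrom_size)], n)  # omitted axis2 falls back to whole chrom
    return [(c, s, e) + a2 for c, s, e, a2 in zip(chroms1, starts1, ends1, axis2)]


# chrom_size=1000 is an assumed value for the illustration.
rows = intervals_2d_rows("chr1", [0, 100, 200], [50, 150, 250], chrom_size=1000)
```

Each resulting row has the layout (chrom1, start1, end1, chrom2, start2, end2).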
v0.7.0 (2026-05-28)
- gextract now honors a 2D-intervals iterator (DataFrame or interval-set name). gextract("rects_track", scope_2d, iterator=other_2d_intervals) (or iterator="my_2d_set") iterates the rectangles of iterator ∩ scope and evaluates the expression on each (with intervalID attributing each row to its scope interval), matching R's 2D intervals iterator. Previously the 2D iterator was ignored and the whole scope was object-enumerated.
- giterator_intervals over a bare 2D track now visits all chrom pairs. giterator_intervals("rects_track") (no scope) now enumerates every rectangle of the track, not just the intra-chromosomal (diagonal) ones, matching R's whole-genome 2D scope. It also accepts a 2D-intervals DataFrame or interval-set name as the iterator (returning iterator ∩ scope cells), and a 2D-track name as the scope.
- gintervals_summary now supports a 2D-intervals iterator (DataFrame, interval-set name, or 2D-track name). gintervals_summary("rects_track", scope_2d, iterator=other_2d_intervals) summarizes the expression over the iterator cells within each scope interval (one row per scope interval), matching R.
- gintervals_neighbors now supports 2D intervals. For two 2D-interval sets it finds, per query rectangle, the nearest target rectangles on the same chrom-pair within a per-axis distance window (mindist1 / maxdist1 / mindist2 / maxdist2), ordered by Manhattan distance, returning dist1 / dist2 columns - matching R. Previously any 2D input raised NotImplementedError. (A nearest-neighbour search over very large unbounded sets still needs a scalable quadtree NN iterator and raises a clear error.)
- gintervals_2d_intersect is now scalable. The pairwise rectangle intersection is computed per chrom-pair with an in-memory quadtree instead of an O(n1·n2) broadcast, so intersecting large 2D screens (e.g. 10^5 × 10^5 rectangles on one chrom-pair) no longer exhausts memory. Results are unchanged.
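As a rough illustration of the 2D neighbor ordering, the sketch below ranks target rectangles by the sum of their per-axis unsigned gaps to a query rectangle. It is a brute-force stand-in, not the quadtree used by the library; axis_gap and nearest_rects are hypothetical names, the half-open coordinate convention and the index-based tie-break are assumptions, and the distance windows are left out.

```python
def axis_gap(s1, e1, s2, e2):
    """Unsigned gap between half-open intervals [s1, e1) and [s2, e2); 0 when they overlap or touch."""
    return max(0, max(s1, s2) - min(e1, e2))


def nearest_rects(query, targets, maxn=1):
    """Return up to maxn (target_index, dist1, dist2) tuples ordered by dist1 + dist2.

    Rectangles are (start1, end1, start2, end2) on one chrom pair.
    Ties are broken by target index here, which is an assumption of this sketch.
    """
    qs1, qe1, qs2, qe2 = query
    scored = []
    for idx, (ts1, te1, ts2, te2) in enumerate(targets):
        d1 = axis_gap(qs1, qe1, ts1, te1)
        d2 = axis_gap(qs2, qe2, ts2, te2)
        scored.append((d1 + d2, idx, d1, d2))
    scored.sort()
    return [(idx, d1, d2) for _total, idx, d1, d2 in scored[:maxn]]


query = (100, 200, 100, 200)
targets = [(300, 400, 100, 200), (210, 220, 250, 260), (0, 50, 0, 50)]
hits = nearest_rects(query, targets, maxn=2)
```

Here the second target wins with a Manhattan distance of 10 + 50 = 60, ahead of two targets tied at 100.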
v0.6.0 (2026-05-27)
- Array tracks now work in the track-expression scanner. An array track can be used directly in a track expression (gextract("my.array", ...), gextract("2 * my.array + 17", ...)), mixed with dense/sparse tracks under an explicit iterator, as the iterator itself (iterator="my.array", one bin per row), and as the source of a value-based virtual track (gvtrack_create("v", "my.array", "min")). Each bin's columns are reduced to a scalar (default: average over all columns) which is then aggregated over the iterator interval, matching R. Previously any array track in an expression or as an iterator raised "scanner does not support array tracks".
- gvtrack_array_slice virtual tracks now evaluate through the fast C++ scanner and support a column quantile. Selecting a column subset and reduction (gvtrack_array_slice("v", ["col1", "col3"], func="max")) is now computed in C++ and, with no explicit iterator, iterates the array's native bins (one value per bin), matching R exactly - including R's float32 standard-deviation accumulation. func="quantile" with params=<percentile> is now accepted (previously raised NotImplementedError).
v0.5.2 (2026-05-27)
- giterator_intervals now accepts a numeric 2D iterator. giterator_intervals(expr, scope, iterator=(width, height), band=...) over a 2D scope enumerates the fixed-size grid cells (clipped to the scope and the optional diagonal band), matching R's giterator.intervals(expr, scope, iterator=c(width, height)). Previously a two-element numeric iterator over a 2D scope raised "intervals must have 'chrom', 'start', and 'end' columns".
- A dim-projected 1D virtual track now honors a 2D iterator. gextract("v", scope_2d, iterator="rects_track") for a 1D vtrack with gvtrack_iterator(dim=1/2) now iterates the iterator's 2D cells - projecting each onto the chosen axis and evaluating the vtrack there (one row per iterator cell), matching R. Previously the iterator was ignored and the vtrack was evaluated once per 2D scope interval.
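A fixed-size 2D grid clipped to a scope rectangle might be enumerated roughly as follows. grid_cells is a hypothetical helper, and the alignment of cells to multiples of (width, height) from coordinate 0 is an assumption of this sketch rather than something the entry above states; band filtering is omitted.

```python
def grid_cells(scope, width, height):
    """Enumerate (start1, end1, start2, end2) grid cells clipped to `scope`.

    Assumes cells are aligned to multiples of width/height from coordinate 0.
    """
    s1, e1, s2, e2 = scope
    cells = []
    x = (s1 // width) * width           # first grid column touching the scope
    while x < e1:
        y = (s2 // height) * height     # first grid row touching the scope
        while y < e2:
            cells.append((max(x, s1), min(x + width, e1),
                          max(y, s2), min(y + height, e2)))
            y += height
        x += width
    return cells


cells = grid_cells((50, 250, 0, 100), width=100, height=100)
```

The edge cells come out narrower than the nominal width because they are clipped to the scope.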
v0.5.1 (2026-05-27)
- gtrack_create can now build a 2D track from a 2D track expression. gtrack_create("t", desc, "rects_track + 10") (and with a 2D-intervals iterator) evaluates the expression over the source track's rectangles and writes a 2D RECTS/POINTS track, matching R's gtrack.create on a 2D expression. Previously the 1D scanner could not iterate a rectangles track and no usable track was produced.
- giterator_intervals now accepts a cartesian-grid iterator. giterator_intervals(expr, scope, iterator=giterator_cartesian_grid(..., stream=True), band=...) enumerates the 2D grid cells over the scope, matching R's giterator.intervals(expr, scope, iterator=cartesian_grid). The cells are built with grid-point (center) de-duplication and adjacent-center midpoint clipping, intersected with the 2D scope, and filtered by the optional diagonal band. A 2D-track name or None (the whole 2D genome) is accepted as the scope.
- Single-strand spatial PWM scores no longer double-count the reverse strand. For a non-bidirectional PWM virtual track (bidirect=False) with spatial weighting (spat_factor / spat_bin), the spatial sliding-window optimization scored both strands when no strand was explicitly selected, inflating pwm / pwm.max energies (only at the first interval of each scan segment). It now scores a single strand, matching R. Bidirectional and non-spatial scoring were unaffected.
- global.percentile.min / global.percentile.max virtual tracks now match R exactly. They map each per-bin statistic through the source track's precomputed vars/pv.percentiles binned quantile table (the same file R uses) instead of an exact empirical CDF, so the returned percentile is bit-for-bit identical to R for R-created tracks. Tracks without that file (e.g. pymisha-created) fall back to the empirical CDF as before.
- gintervals_neighbors now signs the distance by the target's strand. When the second (target) interval set has a strand column, the reported distance is negative for upstream neighbors and positive for downstream ones (as in R); previously the target's strand was ignored and every distance was unsigned (positive), so a signed distance window such as mindist=-10000, maxdist=-2000 matched nothing and returned None. Target sets without a strand column are unchanged (unsigned distance).
- Values landing exactly on a bin break are now assigned to the lower bin. Binning is right-closed (breaks[i], breaks[i+1]] as in R, so a value equal to a break belongs to the bin ending at it, not the one starting at it. Previously such values went one bin too high. This was invisible for continuous track data but mis-binned integer values (e.g. gcis_decay distances against multiple-of-1000 breaks), and also affected gbins_summary / gbins_quantiles / gdist over virtual tracks.
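The right-closed rule can be checked with a few lines of standard-library Python. bin_index is a hypothetical helper; treating a value equal to the lowest break as out of range is an assumption of this sketch.

```python
from bisect import bisect_left


def bin_index(value, breaks):
    """Return i such that breaks[i] < value <= breaks[i + 1], or None if out of range.

    `breaks` must be sorted ascending. A value equal to breaks[0] falls outside
    every right-closed bin in this sketch.
    """
    if value <= breaks[0] or value > breaks[-1]:
        return None
    # bisect_left gives the first position whose break is >= value,
    # so a value sitting exactly on a break lands in the bin that ends there.
    return bisect_left(breaks, value) - 1


breaks = [0, 1000, 2000, 3000]
on_break = bin_index(1000, breaks)    # bin (0, 1000]
just_above = bin_index(1001, breaks)  # bin (1000, 2000]
```

Using bisect_right instead would reproduce the old behavior, sending 1000 one bin too high.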
v0.5.0 (2026-05-26)
- A value-based virtual track with no explicit iterator now uses its source track's native iterator. gextract("vt", intervals) for a vtrack over a dense track previously evaluated one value over each whole input interval; it now iterates the source track's native bins (applying any gvtrack_iterator shifts), matching R.
- gintervals_summary and gintervals_quantiles now return one row per scope interval when the iterator is a track or interval-set. Previously, passing a track/interval-set name (or a DataFrame) as the iterator produced one row per (iterator-bin ∩ scope) piece and silently dropped scope intervals that contained no bin. They now emit exactly one row per input scope interval - empty intervals get Total intervals = 0 and NaN statistics - matching R. The per-interval standard deviation also uses R's one-pass formula for exact parity.
- gintervals_diff now canonicalizes its inputs first. When either operand contained overlapping intervals (e.g. an rbind / concat of two screen results), the difference double-counted the overlaps and returned too much genomic space. Both operands are now sorted and overlap-unified before the difference, matching R.
- giterator_intervals gains band and intervals_set_out arguments. A diagonal band=(d1, d2) now restricts a 2D iterator to rectangles whose offset from the diagonal falls within the band (as in gextract), and intervals_set_out="name" saves the resulting iterator intervals as a named set (returning None), matching R's giterator.intervals(..., band=, intervals.set.out=).
- distance.center virtual tracks no longer return NaN for some bins whose center lies inside a source interval. When several source intervals began within a single iterator bin, the search inspected only the first one and could miss the interval actually containing the bin center, returning NaN instead of the true distance. It now locates the containing interval by the center coordinate, matching R misha.
- Sparse-track stddev is now numerically stable. Virtual tracks and extractions using func="stddev" over a sparse track previously used a one-pass formula that produced a tiny nonzero value (e.g. 1e-4) for a bin of identical large-magnitude values, where the true standard deviation is 0. They now use the Welford online algorithm (matching R misha and the dense-track path), returning exactly 0 for constant bins.
- gvtrack_create with an intervals-set source now defaults to func="distance". When func is omitted and the source is an intervals set (a DataFrame with no value column, or an intervals-set name), the virtual track now measures distance to the nearest interval, matching R. Previously it defaulted to "avg" and raised "DataFrame source must include one value column". Track and value-bearing sources still default to "avg".
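For readers unfamiliar with it, the Welford online algorithm mentioned above can be sketched as below. welford_stddev is a hypothetical name, and the n - 1 (sample) denominator is an assumption of this sketch, not a statement about which denominator the library uses.

```python
import math


def welford_stddev(values):
    """Single-pass standard deviation using Welford's running mean and M2 update."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n < 2:
        return float("nan")
    return math.sqrt(m2 / (n - 1))  # sample stddev: an assumption of this sketch


# A constant bin of large-magnitude values: every delta after the first is
# exactly 0, so m2 stays 0 and no cancellation can occur.
constant_bin = [1.0e9 + 0.1] * 1000
sd_constant = welford_stddev(constant_bin)
sd_small = welford_stddev([1.0, 2.0, 3.0, 4.0])
```

The update never subtracts two large, nearly equal sums, which is what the sum-of-squares formula does and why it can leave a spurious residual.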
v0.4.0 (2026-05-26)
- Track expressions now follow R operator precedence for & and |. Expressions such as gscreen("track > 0.1 & track < 0.3") previously raised a bitwise-operator error because & / | bound tighter than the comparisons. They now bind looser, as in R, so a > x & b < y means (a > x) & (b < y) (and & binds tighter than |). Affects gscreen, gextract, gsummary, gdist, gquantiles, gpartition and any other track-expression evaluation.
- Indexed sparse tracks no longer read back empty. gextract, gsummary and value-based virtual tracks returned no data (and gsummary reported 0 intervals) for sparse tracks stored in the indexed format; dense tracks were unaffected. They now read correctly.
- The iterator and scope may be a track or interval-set name. gextract(expr, intervals, iterator="some.track") iterates over that track's bins (dense) or intervals (sparse); a track or interval-set name passed as intervals uses its intervals/rectangles as the scope; giterator_intervals infers the grid from a sparse track. Previously only a numeric bin size or an explicit intervals DataFrame was accepted.
- Bare 2D expressions no longer require explicit intervals. gscreen("rects_track > 10"), gsummary("rects_track") and similar now default to the whole 2D genome (as in R) instead of raising "rectangles not yet supported".
- gintervals / gintervals_2d recycle shorter argument vectors. gintervals([1, 2], starts, ends) with longer starts / ends now repeats the chromosome list to match (as in R), instead of raising "chroms, starts, and ends must have the same length". Lengths must still be multiples.
- gintervals_chrom_sizes now returns interval counts per chromosome. It returns chrom / size for 1D intervals and chrom1 / chrom2 / size for 2D (the number of intervals on each chromosome or chromosome pair), matching R. Previously it returned only the unique chromosome names with no counts. It also accepts an interval-set or track name directly.
- global.percentile virtual tracks no longer read back empty on indexed databases. gvtrack.create(..., func="global.percentile" / "global.percentile.min" / "global.percentile.max") returned all-NaN for source tracks stored in the indexed format; these functions use a read path whose per-chromosome-file gate skipped indexed tracks (which keep their data in track.dat). They now read correctly. Per-chromosome databases were unaffected.
- gtrack_array_extract no longer returns an empty result for array tracks stored in the indexed format. The array reader only knew the legacy per-chromosome files, so on an indexed database it found no data and returned zero rows (column names still read correctly). It now also reads each chromosome's block from track.dat / track.idx. Per-chromosome databases were unaffected.
v0.3.0 (2026-05-25)
- Fixed a correctness bug in coarsening-iterator extraction. When a numeric iterator larger than a dense track's native bin size was used, gextract, gsummary, gquantiles, gbins_summary and gbins_quantiles sampled the value at each output bin's midpoint instead of averaging the native bins it covers, returning wrong values. They now average, matching R misha. Extraction with the default (native) iterator was unaffected.
- Mixed dense and sparse tracks in one expression are now supported. Expressions such as gextract("dense_track + sparse_track", intervals, iterator=50) work with an explicit iterator, and (as in R) raise when no iterator can be inferred. Previously any dense+sparse mix raised "Mixed track types in expression are not supported".
- gintervals_canonic, gintervals_intersect, gintervals_union and gintervals_diff now reject intervals with start >= end or start < 0 instead of silently dropping zero-width intervals or passing inverted intervals through unchanged.
- gintervals_neighbors(..., na_if_notfound=True) returns NaN for the start/end coordinates of rows with no neighbor, matching R (previously a -1 sentinel).
- Fixed two small reference leaks (result-dict float objects in gtrack_import_mappedseq and the strands-autocorrelation routine).
- gsummary and gintervals_summary no longer return NaN for the standard deviation of near-constant, large-magnitude input; the tiny negative variance produced by catastrophic cancellation is clamped to 0 (matching R misha 5.7.4).
- gsynth_random now validates nuc_probs: a missing, extra, or duplicated nucleotide name is rejected with a clear error instead of being silently mishandled.
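The difference between midpoint sampling and averaging under a coarsening iterator can be shown with a toy dense track. Both helpers are hypothetical, the bin sizes and values are made up, and skipping NaN bins in the average is an assumption of this sketch.

```python
def coarsen_average(native_values, native_bin, out_bin):
    """Average the native bins covered by each output bin (the corrected behavior)."""
    if out_bin % native_bin != 0:
        raise ValueError("out_bin must be a multiple of native_bin in this sketch")
    k = out_bin // native_bin
    out = []
    for i in range(0, len(native_values), k):
        chunk = [v for v in native_values[i:i + k] if v == v]  # v == v drops NaN
        out.append(sum(chunk) / len(chunk) if chunk else float("nan"))
    return out


def coarsen_midpoint(native_values, native_bin, out_bin):
    """Sample the native bin at each output bin's midpoint (the old, wrong behavior)."""
    k = out_bin // native_bin
    last = len(native_values) - 1
    return [native_values[min(i + k // 2, last)] for i in range(0, len(native_values), k)]


# Ten native 10 bp bins coarsened to two 50 bp bins; the spike of 60 sits off-center.
native = [1, 2, 3, 4, 5, 10, 10, 10, 10, 60]
averaged = coarsen_average(native, native_bin=10, out_bin=50)
sampled = coarsen_midpoint(native, native_bin=10, out_bin=50)
```

The two approaches agree on the first output bin (3) but differ on the second (20 versus 10), because midpoint sampling never sees the spike.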
v0.2.4 (2026-05-25)
- gtrack_import() gains a func argument selecting the per-bin reduction when importing with a binsize (Dense track): one of "weighted.mean" (default), "weighted.sum", "max", "min", "median", "count", "coverage". For example gtrack_import("pileup", desc, "reads.bed", binsize=20, func="coverage") builds a ChIP-style pileup track in one call. Works for every input format (BED/WIG/bedGraph/BigWig/tab). Passing a non-default func without a binsize raises an error.
v0.2.3 (2026-05-21)
- CI green again on main. 15 mypy errors in pymisha/liftover.py were silently introduced during the G1 liftover C++ port (v0.1.87-v0.1.94) and only surfaced once dev shipped to main as v0.2.0. All fixed: # type: ignore[no-any-return] on the two C++-entry returns (pm_parse_chain_file, pm_chain_intervals_resolve), explicit float() / int() conversions on the aggregate reducers, and an inner-loop variable rename (m -> mc) that was shadowing the outer median-index m.
- Conda package builds again. conda-recipe/meta.yaml now lists zlib in host and run requirements (broken since v0.2.0 added zlib for SAM gzip support).
v0.2.2 (2026-05-21)
- Internal: BAM auto-detect for gtrack_import_mappedseq moved from the Python wrapper into the C++ entry. samtools view is now spawned via popen() inside pm_import_mappedseq; the Python side no longer manages a subprocess or dups file descriptors. User-visible behavior is unchanged. This makes the architecture symmetric with the upcoming R misha port (R can't easily share fds with C, but both languages can let C drive popen).
- The C++ side surfaces a clear pymisha.error when samtools is missing (exit code 127) or when samtools view exits non-zero. The exception type changed from RuntimeError to pymisha.error - both are catchable as Exception.
v0.2.1 (2026-05-21)
- gtrack_import_mappedseq now accepts BAM files directly: bgzip magic bytes (1f 8b 08 04) are auto-detected and the file is streamed through samtools view into the existing C++ FSM. Requires samtools on PATH; a clear RuntimeError is raised otherwise. The legacy default cols_order=(9, 11, 13, 14) is silently switched to SAM mode (None) for BAM inputs since samtools view always emits SAM-format payload.
- New stdin / fd source in the C++ FSM: file="-" reads from stdin and file="fd:N" reads from an arbitrary file descriptor. Lets users compose pipelines such as samtools view -q 30 reads.bam | python -c "pm.gtrack_import_mappedseq(...)" or pre-filter via samtools markdup / region restriction.
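The magic-byte check can be illustrated in a few lines. The function names are hypothetical and the sketch only inspects a byte prefix; it does not reproduce the C++ detection or spawn samtools. The fourth byte, 0x04, is the gzip FEXTRA flag that bgzip sets, which is what separates a bgzip stream from a plain gzip one with no flags set.

```python
BGZF_MAGIC = bytes([0x1F, 0x8B, 0x08, 0x04])  # gzip ID1, ID2, CM=deflate, FLG=FEXTRA
GZIP_MAGIC = bytes([0x1F, 0x8B])


def looks_like_bgzip(prefix):
    """True when the first four bytes match the bgzip signature quoted above."""
    return prefix[:4] == BGZF_MAGIC


def looks_like_gzip(prefix):
    """True for any gzip member, bgzip included."""
    return prefix[:2] == GZIP_MAGIC


# Hand-written prefixes for illustration; real code would read them from the file.
bam_prefix = bytes([0x1F, 0x8B, 0x08, 0x04, 0x00, 0x00])
plain_gz_prefix = bytes([0x1F, 0x8B, 0x08, 0x00, 0x00, 0x00])
sam_prefix = b"@HD\tVN:1.6"
```

Because every bgzip file is also a valid gzip file, the four-byte test has to run before any generic gzip fallback.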
v0.2.0 (2026-05-21)
Bundle release covering v0.1.75 through v0.1.95. Major themes:
- Liftover ported to C++ end-to-end (G1, v0.1.87-v0.1.94). _aggregate_overlapping, chain-file parser, source-track reader (1D dense / sparse / indexed + 2D RECTS / POINTS), overlap-policy resolution, _map_intervals, the gtrack_liftover orchestrator, and the dispatcher all run in C++ with R-parity semantics. ARRAYS source tracks still wait on G3. Env-var fallbacks: PYMISHA_FORCE_PY_AGGREGATE_OVERLAPPING, _PARSE_CHAIN_FILE, _READ_SOURCE_TRACK, _CHAIN_INTERVALS_RESOLVE, _MAP_INTERVALS, _LIFTOVER_TRACK.
- SAM import ported to C++ (G2, v0.1.95). pm_import_mappedseq per-byte FSM + direct dense / sparse writers. 3-5x faster on a 100k-read synthetic SAM. R-parity: exact chrom-name match, tab-only split. Gzip auto-detect retained. PYMISHA_FORCE_PY_IMPORT_MAPPEDSEQ=1 keeps the Python path.
- 2D scanner + virtual-track R-parity (v0.1.75-v0.1.85). FixedRect, TrackRects, CartesianGrid iterators. Opt-in scanner routing for the intervals iterator. Reducing 2D vtracks, scanner-side exists / size / first / last / sample object functions. ARRAYS branch in the C++ scanner + gvtrack_array_slice. 2D vtrack shifts through the scanner. Multi-track 2D compound expressions (Spec C).
- LLM agent guides (agent-guides/) ported from the R misha guides. Excluded from the sdist; designed for raw-github-URL fetch by downstream agents.
- Misc: gtrack_create / _modify / _smooth now resolve vtracks in expressions (v0.1.84); various R-parity corrections for 2D vtracks (exists / size NaN handling, v0.1.81); CI green on main (mypy + test isolation, v0.1.86).
R parity is the spec wherever the prior Python implementation diverged. See individual v0.1.x entries below for per-release details.
v0.1.95 (2026-05-21)
- gtrack_import_mappedseq ported to C++ (pm_import_mappedseq): per-byte FSM SAM/tab parser + direct dense / sparse track writers. Dense speedup 3.34x and sparse speedup 5.41x on a 100k-read synthetic SAM (measured vs. the prior pure-Python implementation). R-parity: chromosome names must match the DB exactly (no normalization), tab is the only field separator. Gzip auto-detect via the magic bytes is retained. PYMISHA_FORCE_PY_IMPORT_MAPPEDSEQ=1 selects the Python fallback.
v0.1.94 (2026-05-21)
- gtrack_liftover now supports 2D source tracks (RECTS and POINTS) end-to-end in C++ via the new pm_liftover_track_2d entry point. The dispatcher auto-detects 2D sources by quadtree signature and routes them through the dedicated 2D path. No aggregation is performed on the 2D side, matching R behavior. ARRAYS source tracks are still out of scope.
v0.1.93 (2026-05-21)
- gtrack_liftover now preserves source track type: dense (FIXED_BIN) source produces a dense target track; sparse source produces sparse. Previously always produced sparse. This is a breaking change for code relying on the always-sparse output; aggregation semantics now match R for both paths.
- gtrack_liftover runs end-to-end in C++ via the new pm_liftover_track entry point (~3.56x faster on a 1M-bin + 10k-chain workload). Set PYMISHA_FORCE_PY_LIFTOVER_TRACK=1 to fall back to the pure-Python path.
v0.1.92 (2026-05-21)
- gintervals_liftover: ~1.7x end-to-end speedup on 100k src x 100k chain.
- best_cluster_* policies now union by chain_id AND source overlap, matching R behavior. Fixes a divergence where multi-block chains were incorrectly split into multiple clusters.
- Set PYMISHA_FORCE_PY_MAP_INTERVALS=1 to fall back to the pure-Python liftover path.
v0.1.91 (2026-05-20)
Fixes (R-parity)
- gintervals_load_chain now matches R misha behavior on two edge cases that v0.1.90 (and earlier) handled differently:
  - src_overlap_policy="discard" uses a pair-only scan after sort by source coordinates (matching R rdbinterval.cpp handle_src_overlaps). Previously this was a whole-cluster discard that also dropped non-overlapping intervals nested inside a wider overlap. A nested interval separated by a gap from the prior overlap pair is now kept.
  - tgt_overlap_policy="auto_score" / "auto_first" / "auto_longer" / "agg" no longer collapse the split pieces of a negative-strand chain back into a single row via min(startsrc) / max(endsrc). The merge step now requires prev.endsrc == slice.startsrc (matching R append_slice), so negative-strand chains that were split by an overlapping target chain stay as N separate rows with their reversed source coordinates intact.
v0.1.90 (2026-05-20)
Performance
- Chain overlap-policy resolution ported to C++ (~22x faster on a 100k-row synthetic chain under auto_score). gintervals_load_chain picks up the speedup automatically for all src_overlap_policy / tgt_overlap_policy combinations. Set PYMISHA_FORCE_PY_CHAIN_INTERVALS_RESOLVE=1 (or pass _force_pure_python=True) to fall back to the pure-Python implementation.
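The env-var fallback pattern used across these ports can be pictured with a small dispatcher. Everything here is a hypothetical stand-in: resolve_overlaps and its two placeholder back ends are not pymisha functions, and treating only the literal value "1" as enabling the fallback is an assumption of this sketch.

```python
import os

ENV_VAR = "PYMISHA_FORCE_PY_CHAIN_INTERVALS_RESOLVE"


def use_pure_python(env_var, force_pure_python=False):
    """Prefer the Python path when the keyword flag or the env var asks for it."""
    return force_pure_python or os.environ.get(env_var) == "1"


def resolve_overlaps(rows, _force_pure_python=False):
    """Hypothetical dispatcher: both branches do the same placeholder work here."""
    if use_pure_python(ENV_VAR, _force_pure_python):
        return ("python", sorted(rows))  # stand-in for the pure-Python fallback
    return ("cpp", sorted(rows))         # stand-in for the C++ entry point


os.environ.pop(ENV_VAR, None)
default_backend, _ = resolve_overlaps([3, 1, 2])
forced_backend, _ = resolve_overlaps([3, 1, 2], _force_pure_python=True)
```

Reading the variable at call time rather than at import time lets a test flip the backend without reloading the module.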
Fixes
- C++ tgt-overlap sweep now skips zero-length intervals (matching the Python np.unique / coverage path) instead of emitting a phantom slice spanning to the next event.
v0.1.89 (2026-05-20)
Performance
- gtrack.liftover source-track reader (_read_source_track) ported to C++. Per-chrom dense, per-chrom sparse (32-bit + 64-bit record layouts), and indexed track.idx / track.dat formats all decode directly into numpy arrays. ~6x speedup on a 1M-bin synthetic dense track. Set PYMISHA_FORCE_PY_READ_SOURCE_TRACK=1 (or pass _force_pure_python=True) to fall back to the pure-Python reader.
Fixes
- Indexed source tracks with sibling subdirectories (e.g. the vars/ directory created by gtrack.convert_to_indexed) were silently being read as empty sparse tracks. Subdirectories are now filtered out of the per-chrom file list, so the indexed branch runs as intended.
v0.1.88 (2026-05-20)
Performance
- gintervals_load_chain is ~5x faster on large chain files. Set PYMISHA_FORCE_PY_CHAIN_PARSER=1 to fall back to the pure-Python parser.
v0.1.87 (2026-05-20)
Performance
- gtrack.liftover overlap aggregation ported to C++. The per-chrom sweep-line in _aggregate_overlapping now runs through _pymisha.pm_liftover_aggregate. Pure-Python fallback retained for custom aggregator callables.
Fixes
- _aggregate_overlapping now iterates the active set in row order (was hash order, non-deterministic by value). Affects first / last / nth aggregators.
v0.1.86 (2026-05-20)
Fixes
- CI green. Mypy now passes (pymisha/extract.py policy-dict and CartesianGrid resolver were inferred as dict[str, str] and object; both annotated explicitly. pymisha/vtracks.py::gvtrack_array_slice slice-list narrowing was tightened). The test_computed_tracks module fixture re-inits the canonical test db before writing the COMPUTED stub so the on-disk file matches the active db pointer regardless of which prior test last called gdb_init_examples(). No runtime behavior change.
v0.1.85 (2026-05-20)
Features
- Multi-track 2D compound expressions (R parity). gextract("v_a + v_b", ...), gextract("track_a - track_b", ...), and any compound 2D expression mixing bare 2D tracks and reducing vtracks now route through the C++ scanner. Each symbol is computed once per rectangle, then the compound expression is evaluated over the per-symbol arrays. Works with iterator=(N, M), iterator="<2D track>", iterator=CartesianGridSpec(...), and the opt-in PYMISHA_USE_SCANNER_FOR_INTERVALS=1 intervals-iterator path. Previously raised NotImplementedError or silently fell through.
v0.1.84 (2026-05-20)
Fixes
- gtrack_create, gtrack_modify, gtrack_smooth now resolve virtual tracks in their expr argument. Previously these handed the raw expression string to the C++ engine without forwarding the Python-side vtrack registry, so any expression referencing a gvtrack_create-registered name failed with name '<vtrack>' is not defined. The canonical R misha pattern gvtrack.create(...) ; gtrack.create("smoothed", ..., expr="log2(vt_sum + 1)", iterator=20) now works in pymisha.
v0.1.83 (2026-05-20)
Features
- 2D vtrack shifts now work through all scanner paths (R parity). gvtrack_iterator_2d(name, sshift1=, eshift1=, sshift2=, eshift2=) shifts are now applied per-var when querying the underlying 2D track, for gextract calls using iterator=(N, M), iterator=track_name, or iterator=CartesianGridSpec(...). Previously any vtrack with non-zero shifts fell back to the slow legacy path or raised NotImplementedError with FixedRect/CartesianGrid iterators.
Limitations
- global.percentile is not routed through the scanner (deferred - requires a two-pass population accumulation). It continues to work via the legacy path for plain gextract calls, but raises NotImplementedError with FixedRect/CartesianGrid iterators.
v0.1.82 (2026-05-20)
Features
- gvtrack_array_slice(vtrack, slice, func) - R-aligned API for gvtrack.array.slice. Mutates an existing vtrack (created via gvtrack_create(name, src=array_track)) rather than creating one. Matches R's two-step pattern: gvtrack_create then gvtrack_array_slice. Supports all R functions: avg, min, max, sum, stddev. gextract("vtrack_name", intervals=...) on such a vtrack returns one scalar per iterator interval.
- Array-slice vtracks work in gextract without an explicit iterator. When all expressions are array-slice vtracks and no physical track is present, the query intervals serve as the iterator (one row per input interval).
Limitations
- Bare gextract("array_track", ...) (without a vtrack) still raises, pointing to gtrack_array_extract. Use gvtrack_array_slice to aggregate first.
- quantile func not supported (requires a different data path); deferred.
v0.1.81 (2026-05-20)
Fixes
- exists and size 2D vtracks now return 0 (not NaN) for chrom pairs with no track data. R parity. Affects gextract(vtrack, intervals=scope, iterator=...) where the iterator emits cells on pairs the track does not cover.
v0.1.80 (2026-05-20)¶
Features¶
- Scanner supports object-level reducer funcs (R parity).
PMTrackExpression2DVarsnow handlesexists,size,first,last,samplefor use with 2D iterators.gextract(vtrack_name, intervals=scope, iterator=(N,M))(or CartesianGrid iterator) now works when the vtrack has any of these five object-level funcs. Closes the remaining R-parity gap from v0.1.79 for reducing 2D vtracks + 2D iterators.
Limitations¶
- global.percentile not yet supported through the scanner path (deferred - requires a precomputed population).
- 2D vtracks with shifts still raise (deferred).
v0.1.79 (2026-05-20)¶
Features¶
- Reducing 2D vtracks now supported with 2D iterators (R parity). gextract(vtrack_name, intervals=scope, iterator=(N,M)) (or iterator="<2D track>" or iterator=CartesianGridSpec(...)) now computes one aggregated value per emitted cell when the vtrack has a 2D reducer func (area, avg, min, max, weighted.sum). Previously raised NotImplementedError or fell through to the wrong path. Closes 8 xfail-strict R-parity gaps from v0.1.75-v0.1.77.
Limitations¶
- Vtracks with 2D shifts (sshift1 / sshift2) still raise - deferred.
- Multi-vtrack compound expressions still raise - deferred.
- exists, size, first, last, sample, global.percentile funcs still raise - C++ scanner does not support object-level funcs; deferred.
v0.1.78 (2026-05-20)¶
Features¶
- Intervals iterator via scanner (opt-in). PYMISHA_USE_SCANNER_FOR_INTERVALS=1 routes the implicit intervals-iterator case (gextract(track, intervals=scope_df)) through the new C++ scanner path. The scanner returns one row per scope interval (per-rect aggregation via the var's func, e.g. avg). This complements the existing default path which returns one row per (scope_rect, track_object) intersection (per-object enumeration). The two paths solve different problems; both remain available.
- Bench (testdb, 10 scope rects on chr1 x chr1): scanner path ~4.9 ms (10 rows, aggregated); existing per-object enumeration path ~7.3 ms (7 rows, per-object). Different output shapes; the numbers are informational only.
Internal¶
- New IntervalsPolicy dataclass + parser support in pymisha/_iterator_policy.py.
- pm_extract_2d_scanner accepts kind="intervals".
- The bypass paths pm_extract_2d and pm_extract_2d_objects are intentionally retained as the default - they provide per-object enumeration semantics that the scanner aggregation cannot directly replace.
- Multitask regression guard added: test_intervals_via_scanner_multitask_equivalence confirms scanner result is identical under single-process and multi-process CONFIG.
Limitations¶
- Same as v0.1.77.
- PYMISHA_USE_SCANNER_FOR_INTERVALS=1 only affects the default no-iterator case for bare physical tracks. vtracks and explicit-DataFrame iterators continue to use the legacy path.
- Scanner is single-process; multitasking deferred.
v0.1.77 (2026-05-20)¶
Features¶
- CartesianGrid 2D iterator (giterator_cartesian_grid(..., stream=True) + gextract(iterator=spec)). New C++ streaming iterator generates the cartesian product of 1D windows around two sets of interval centers. Optional band_idx filtering keeps only (center-i, center-j) pairs within a specified index-delta range. Bench (testdb rects_track, 5 centers, 3 windows/axis, 225 cells, chr1 x chr1): ~5.5 ms pymisha vs ~8.0 ms R misha 5.7.2 (1.45x faster).
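The cell enumeration described above can be sketched in plain Python. This is only an illustration of the geometry, not the C++ iterator: the function name, the representation of windows as offsets relative to each center, and the reading of band_idx as inclusive bounds on the index delta j - i are all assumptions made for the example.

```python
from itertools import product


def cartesian_grid_cells(centers1, centers2, breaks1, breaks2, band_idx=None):
    """Enumerate 2D cells as the product of 1D windows around two center sets.

    Hypothetical helper, not part of pymisha. `breaks1` / `breaks2` are sorted
    offsets relative to a center; each consecutive pair defines one window.
    `band_idx` is assumed to be an inclusive (lo, hi) bound on j - i.
    """
    windows1 = list(zip(breaks1, breaks1[1:]))
    windows2 = list(zip(breaks2, breaks2[1:]))
    cells = []
    for (i, c1), (j, c2) in product(enumerate(centers1), enumerate(centers2)):
        # Optional band filter on the pair of center indices.
        if band_idx is not None and not (band_idx[0] <= j - i <= band_idx[1]):
            continue
        for (s1, e1), (s2, e2) in product(windows1, windows2):
            cells.append((c1 + s1, c1 + e1, c2 + s2, c2 + e2))
    return cells


centers = [10_000, 20_000, 30_000, 40_000, 50_000]
breaks = [-300, -100, 100, 300]  # three windows per axis
all_cells = cartesian_grid_cells(centers, centers, breaks, breaks)
diagonal_only = cartesian_grid_cells(centers, centers, breaks, breaks, band_idx=(0, 0))
```

With 5 centers and 3 windows per axis the unfiltered product has 5 x 5 x 3 x 3 = 225 cells, the same cell count as the benchmark quoted above.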
Internal¶
- giterator_cartesian_grid accepts a new stream=True kwarg that returns a CartesianGridSpec instead of a materialized DataFrame. The materialize path (stream=False) remains the default for backward compatibility. Stream path produces one row per cell (avg aggregation); materialize path produces one row per (cell, track-object) intersection.
- bench_2d_iterators.py extended with Workload D (CartesianGrid) including an R misha comparison.
Limitations¶
- Same as v0.1.76 (vtrack + iterator policies, multitasking, memory materialization deferred).
- Reducing-vtrack regression-guard xfail remains.
v0.1.76 (2026-05-20)¶
Features¶
- TrackRects 2D iterator (gextract(..., iterator="<2D track name>")). New C++ streaming iterator yields the rects from a 2D rects/points track that intersect the user's 2D scope. Bench (testdb rects_track, chr1 x chr1 scope, 76 rows): 6.3 ms pymisha vs 8.0 ms R misha 5.7.2 (1.27x faster).
Fixes¶
- gextract with iterator=<1D track> or iterator=<unknown track> now raises a clear ValueError. Previously the argument was silently ignored.
Internal¶
- Iterator constructor contract: 2D iterators no longer prime in the constructor; callers must call begin() explicitly. Affects FixedRect (v0.1.75-shipped) and the new TrackRects. Externally visible only via the test bindings.
Limitations¶
- Same as v0.1.75 (vtrack + iterator policies, multitasking, memory materialization deferred).
- Reducing 2D vtracks (e.g. func="avg") with iterator=<2D track> still route through the legacy one-row-per-scope path - gap documented with xfail-strict regression test.
v0.1.75 (2026-05-20)¶
Features¶
- FixedRect 2D iterator (gextract(..., iterator=(N1, N2))). New C++ streaming iterator subdivides each 2D scope rect into an N1 x N2 grid, yielding cells in row-major order with diagonal-band shrinking. Replaces the Python materialize-then-iterate path for this case. Bench (testdb, 100-bin scope): ~5.5 ms in pymisha vs ~106 ms in R misha (~19x speedup). First step of Group K parity.
Limitations¶
- iterator=(N, M) with virtual tracks raises NotImplementedError until vtrack support lands in a later release. Use a bare physical 2D track name.
- The FixedRect path runs single-process. Multitask integration is deferred to a later release.
- Large grids (N1 × N2 cells across many chrom pairs) materialize the full rect set in memory before aggregation; for genome-wide Hi-C resolution grids this can exceed available RAM. Streaming aggregation is planned for a later release.
v0.1.74 (2026-05-19)¶
Fixes¶
- R factor decode in legacy .interv files. A factor chrom column was decoded as bare 1-based integer codes ("1".."22") instead of the level labels ("chr1".."chrY"), producing misleading Chromosome "20" does not exist errors on databases with fewer chromosomes than factor levels. Factors now decode to pandas.Categorical, including ordered factors and NA codes.
- R NA_LOGICAL and NA_integer_ preserved. Atomic logical NAs no longer silently become True; integer NAs no longer surface as the -INT_MAX sentinel. Vectors with NAs decode to pandas.arrays.BooleanArray / IntegerArray; NA-free vectors keep the existing bool / int32 ndarray return type.
Internal¶
- Remove unused _write_factor_column; the writer dispatch flattens Categorical to character on disk (unchanged).
v0.1.73 (2026-05-18)¶
Fixes¶
- gdb_install_intervals resolves assembly_name from the FTP listing when the NCBI Datasets /dataset_report is empty or suppressed. Previously, accessions whose Datasets API record was suppressed (e.g. GCF_000001635.26 GRCm38.p6, replaced by GRCm39 in the active index) left assembly_name empty and the GFF was silently skipped even though the FTP directory still hosted <acc>_<asm>_genomic.gff.gz. Now falls back to parsing the parent FTP listing for the assembly subdirectory. Ports R commit d6cd6047.
Internal¶
- Regression tests added for: cache invalidation on gtrack_rm + gtrack_create_* and gintervals_rm + gintervals_save on indexed DBs (R commit 4c3803b0); min_coverage gate position in chrom-alias resolution (R commit 537bfe29). Both pass on current pymisha; tests guard against drift.
v0.1.72 (2026-05-18)¶
Features¶
- gintervals_to_mat / gintervals_from_mat. Pivot an intervals + values DataFrame into a matrix-shaped DataFrame indexed by intervals (3-level MultiIndex of chrom / start / end, or 4-level when id_col is given). Inverse via gintervals_from_mat. Pandas-native iloc slicing and pd.concat preserve the intervals correspondence. Ports R PR #120.
Fixes¶
- mypy CI: narrow the type of df after _apply_intervals_join in _apply_extract_output so the post-processing block type-checks cleanly. No behavior change.
v0.1.71 (2026-05-18)¶
Features¶
- gextract gains intervals_join argument. Replaces the misha.ext::gextract.left_join workflow with a single built-in call. Modes: "id" (default; appends intervalID), "intervals" (drops intervalID, attaches every column of the input intervals DataFrame to each output row, suffixing conflicting names with "1"), "none" (drops intervalID, attaches nothing). Supported attach dtypes: numeric, bool, string, category. file= and intervals_set_out= are rejected with intervals_join="intervals". Ports R PR #124.
v0.1.70 (2026-05-15)¶
Performance¶
- 2D object-level vtrack functions are ~5x faster. The exists, size, first, last, and sample reductions over a 2D RECTS/POINTS track route through the new pm_extract_2d_objects C++ binding instead of the per-interval Python loop. Synthetic 10M-rect track, 100k queries, warm cache: 6.8 s -> 1.26 s.
- Vectorized chromid name lookups in _gextract_2d_single (matters at multi-million-row outputs).
v0.1.69 (2026-05-15)¶
Performance¶
- gquantiles / gsummary / gscreen / gpartition 20-30x faster on 100k-1M interval inputs. Multitasking no longer fires for sparse-track workloads where fork + IPC overhead exceeds the serial scan. Phylo447 / 100k 500-bp intervals: gquantiles 190 -> 7 ms, gsummary 188 -> 7 ms, gscreen 188 -> 8 ms, gpartition 186 -> 6 ms.
- gextract 20-90x faster on small-to-medium workloads. Phylo447 / 100k 500-bp intervals: 491 -> 7 ms (single track), 634 -> 7 ms (3 tracks).
- Child-process wakeup polling reduced from 100 ms to 1 ms in all multitask call sites.
- gintervals_neighbors does a single pass over the sorted target set per chrom run instead of rescanning it on every query-chrom transition.
- gseq_extract drops a redundant per-interval std::string copy.
- gpartition reuses the chrom PyUnicode across same-chrom output runs (matches the v0.1.64 intervals_to_py pattern).
Configuration¶
- pm.CONFIG["min_scope4process"] (default 1_000_000_000 bp): minimum scope per worker required to enable multitasking. Set to 0 to fork unconditionally.
- pm.CONFIG["min_intervs4process"] (default 250_000): minimum total intervals required before any fork.
v0.1.68 (2026-05-15)¶
Performance¶
- gextract on raw 2D RECTS/POINTS tracks is ~2x faster. Replaces the per-interval Python loop in _gextract_2d_single with a native C++ object-enumeration path (pm_extract_2d). Synthetic 3M-rect track, 100k queries, 2.4M output rows: 6.18 s -> 2.61 s (2.37x). Vtrack-aggregated paths (avg / area / weighted.sum / min / max) and object-level vtracks are unchanged in this release.
v0.1.67 (2026-05-15)¶
Fixes¶
- mypy CI: annotate the pm.CONFIG["track_create_parallel_writers"] cast in _apply_create_parallel_writers_from_config (CONFIG values are typed as object). No behavior change.
v0.1.66 (2026-05-15)¶
Configuration¶
- pm.CONFIG["track_create_parallel_writers"] controls how many threads gtrack_create_sparse uses for empty per-chrom signature file dispatch on non-indexed DBs. Default 4. Set to 1 to force sequential, higher to push parallelism. multitasking=False also forces 1 worker. Was hardcoded to 4 in v0.1.65.
v0.1.65 (2026-05-15)¶
Performance¶
- gintervals_intersect / union / diff / canonic 1.4-1.9x faster on million-row inputs. Hg38, 1M-row 1D ops: intersect 522 -> 331 ms, union 391 -> 212 ms, diff 447 -> 260 ms, canonic 241 -> 145 ms.
- gtrack_create_sparse ~1.5x faster on million-row inputs. Hg38 (non-indexed, 455 chroms): 1.88 s -> 1.24 s. Indexed DBs (Phylo447, 194 chroms): 599 -> 511 ms.
- gtrack_create_dense is 215 ms faster per call on databases with thousands of tracks.
- gextract with iter=intvar ~1.4x faster on dense tracks. Hg38, 5 Mb / 200-bp bins: 30.3 -> 21.5 ms.
v0.1.64 (2026-05-15)¶
Fixes¶
- mypy CI: expression helpers (_parse_expr_vars, _validate_expr_security, and friends) now accept collections.abc.Set[str] for track_names / vtrack_names, so the cached frozenset returned by the v0.1.63 pm_track_names cache type-checks at every call site. No behavior change.
v0.1.63 (2026-05-15)¶
Performance¶
- gintervals_ls() is now O(1) on warm databases (was O(N_files)). PMDb caches interval-set names alongside tracks during the existing gdb_init scan; the old Python Path.rglob("*.interv*") walked every per-chrom file under tracks/. On hg38 (15k tracks, 20 interval sets): ~38 s -> ~0.04 ms. gintervals_save / gintervals_rm incrementally register / unregister names without paying a full pm_dbreload rescan.
- gextract / gsummary / gscreen / gdist / glookup with iterator=intervals is 100-1000x faster on million-row inputs. Vectorized the per-chromosome two-pointer sweep that intersects a scope DataFrame with an iterator DataFrame; the old O(K*J) Python loop ate ~95% of total time on real workloads (hg38 phyloP447, 10k 500-bp intervals: 9.4 s -> 77 ms).
- gintervals_save is ~50x faster on million-row data frames. Replaced pyreadr.write_rds() with a native R-serialize XDR writer (pymisha._r_serialize.write_dataframe); the writer streams numeric/string columns to disk in bulk numpy.tobytes() ops instead of round-tripping through the librdata C++ binding row by row (1M rows, 24 unique chroms: 16.8 s -> 0.34 s).
- gintervals_load is ~2x faster on million-row interval sets. The R-serialize reader's STRSXP path now inlines CHARSXP parsing instead of recursing through _read_item for every string (1M rows: 1.0 s -> 0.55 s).
- gextract / gsummary / gscreen tight loops are 5-8x faster. Cached _check_computed_tracks (per-(exprs, vtracks) and per-track results) plus a Python-side cache for pm_track_names() shave ~12 ms of fixed Python overhead off every call. Caches are cleared on gdb_init / gdb_reload / gdb_unload; a _pymisha.pm_dbreload monkey-patch keeps the cache in sync even for callers that hit the C extension directly. Hg38, 1000 intervals, tight iter=intervals loop: ~77 ms -> ~10 ms per call.
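For readers unfamiliar with the two-pointer sweep mentioned in the iterator=intervals entry, a scalar sketch follows. It is not the vectorized implementation that shipped; it only shows why a sweep over two sorted interval lists costs O(K + J) instead of O(K*J). The function name and the half-open (start, end) tuple representation are assumptions for the example.

```python
def intersect_sorted(scope, iterator):
    """Intersect two sorted, non-overlapping lists of half-open (start, end)
    intervals on a single chromosome with one linear pass.

    Hypothetical helper for illustration; not a pymisha function.
    """
    out = []
    i = j = 0
    while i < len(scope) and j < len(iterator):
        start = max(scope[i][0], iterator[j][0])
        end = min(scope[i][1], iterator[j][1])
        if start < end:  # non-empty overlap
            out.append((start, end))
        # Advance whichever interval ends first; it cannot overlap anything later.
        if scope[i][1] < iterator[j][1]:
            i += 1
        else:
            j += 1
    return out


pieces = intersect_sorted([(0, 100), (200, 300)], [(50, 250), (280, 400)])
```

Each loop iteration advances one of the two pointers, so the total number of iterations is bounded by the combined length of the inputs.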
Internal¶
- _intervals_to_cpp now constructs the pymisha-internal list-of-arrays format directly, skipping a DataFrame.copy() + per-column iloc round-trip.
- New C++ API: pm_interv_names, pm_interv_register, pm_interv_unregister. New PMDb members: m_interv_cache, interv_names(), register_interv(), unregister_interv().
v0.1.62 (2026-05-15)¶
Performance¶
- WIG/BedGraph import is ~7x faster. gtrack_import from plain (non-gzipped) WIG and BedGraph files now streams through a C++ parser (pm_parse_wig_or_bedgraph) instead of building Python lists line-by-line. Bench: 50 MB / 3M-row WIG variableStep parses in 0.7 s vs 5.6 s; 91 MB / 3M-row BedGraph in 0.9 s vs 5.9 s. Gzipped inputs still use the pure-Python streamer. Closes part of Group M.2 of the 2026-05-15 parity audit.
v0.1.61 (2026-05-15)¶
Fixes¶
- Docs build: removed a ../../dev/notes/ link from docs/guides/parity.md that broke mkdocs --strict on main (dev/ is excluded from the shipped tree).
v0.1.60 (2026-05-15)¶
Fixes¶
- _read_r_readonly_format now uses the native R-serialize reader (pymisha._r_serialize) instead of pyreadr. The runtime no longer needs pyreadr to load a database with R-written read-only-attribute files. pyreadr remains a soft dependency only for gintervals_save (RDS writer).
Docs¶
- Refreshed docs/guides/parity.md to reflect Groups G/H/I/J shipped in v0.1.55-v0.1.59.
v0.1.59 (2026-05-15)¶
Features¶
- gtrack_array_create(track, description, intervals, values, colnames) writes a new array track from a DataFrame + 2-D matrix in memory. NaN cells are stored sparsely, matching R's array-track invariant. The on-disk format is byte-compatible with R misha - .colnames written by PyMisha is readable via unserialize() in R. Complements gtrack_array_extract / gtrack_array_get_colnames shipped in v0.1.57.
v0.1.58 (2026-05-15)¶
Fixes¶
- CI mypy errors from v0.1.55-v0.1.57 are resolved. mypy is clean across pymisha/_r_serialize.py, pymisha/genome/registry.py, pymisha/intervals.py, and pymisha/tracks.py. No runtime behaviour change.
Improvements¶
- Obsolete 2D track formats now produce an actionable error message instead of "requires conversion". Lists the obsolete format name, points to R misha's gtrack.convert as the upgrade path, and notes that a PyMisha in-process converter is tracked under Group J of the 2026-05-15 parity audit. The legacy formats (OLD_RECTS1, OLD_RECTS2, OLD_COMPUTED1, OLD_COMPUTED2, OLD_COMPUTED3) have not been written by misha for years; the converter port is deferred.
v0.1.57 (2026-05-15)¶
Features¶
- Array track read support. Closes B1 of the 2026-05-15 parity audit. Three new public functions mirror R's gtrack.array.*:
  - gtrack_array_get_colnames(track) reads the column names from <track>/.colnames via the new R-serialize reader.
  - gtrack_array_set_colnames(track, names) writes the file in a format both R misha and pymisha can read.
  - gtrack_array_extract(track, slice=, intervals=) returns a DataFrame with one row per overlapping track interval and one column per requested array column (NaN where the track has no value at that index). slice accepts either column names or 0-based indices.
- Improved gextract error message on array tracks. Previously raised Track type 'array' not yet supported; now points the user at gtrack_array_extract. Full C++ scanner integration for array tracks is deferred to Group K.
v0.1.56 (2026-05-15)¶
Features¶
- Native R-serialize reader (drops the Rscript hard dependency). pymisha._r_serialize.read() decodes R's XDR (binary) format plus gzip-compressed RDS, with support for character / integer / numeric / logical vectors, NULL, raw bytes, named lists, data frames, and the common ALTREP encodings (compact_intseq, compact_realseq, wrap_*, deferred_string). Used by:
  - gintervals_load (legacy_bigset): legacy .meta files no longer need Rscript at runtime - the previous "Rscript is required to load legacy intervals metadata" error is gone.
  - gtrack_var_get: variables written by R misha (XDR or gzipped XDR) are now readable from Python. The ASCII variants (A\n, B\n) are still rejected with a clear message pointing to serialize(..., ascii=FALSE) as the workaround.
v0.1.55 (2026-05-15)¶
Features¶
- gdb_list_genomes() and gdb_genome_info(name) are now public APIs (R parity for gdb.list_genomes / gdb.genome_info). They walk the registry chain (explicit arg -> PYMISHA_GENOME_REGISTRY -> ./misha.yaml -> bundled recipes.yaml) and return a DataFrame / recipe dict.
- gintervals_neighbors(intervals_set_out=, warn_ignored_strand=, mindist1/maxdist1/mindist2/maxdist2=) now accepts the same parameter surface as R gintervals.neighbors. The 2D-distance params are accepted as no-ops for 1D inputs (matching R); 2D inputs raise NotImplementedError (deferred to the 2D C++ scanner work).
- gintervals_mapply(enable_gapply_intervals=, band=) R parity. enable_gapply_intervals=True passes the current iterator interval as a gapply_intervals kwarg to func (PyMisha analogue of R's GAPPLY.INTERVALS).
- gcis_decay now accepts compound 2D expressions that reference exactly one 2D track (e.g., "track + 0"). Distance is computed from coordinates so the expression value is unused.
Fixes¶
- gsynth_sample / gsynth_random / gsynth_replace_kmer default output_format is now "misha" to match R. "seq" remains accepted as a legacy alias. Unknown values now raise ValueError instead of silently falling back to "fasta".
- gsynth_random accepts an iterator=1 parameter for R API parity (no-op in PyMisha because sampling is per-position).
- gintervals_neighbors no longer ignores strand columns silently - emits a warning by default when intervals1 has a strand column and use_intervals1_strand=False. The directional helpers (_upstream, _downstream) suppress the warning.
v0.1.54 (2026-05-14)¶
Fixes¶
- gsynth_sample on 0D models with unaligned intervals no longer falls back to uniform-random output. A signed-int64 overflow when adding iter_size = INT64_MAX (the no-constraint sentinel for 0D models) to a positive bin start position made the bin-bounds check fail, returning bin_idx = -1 and falling through to uniform-random base selection. gsynth_forbid_kmer on a 0D model now correctly produces samples that respect the forbidden pattern regardless of interval alignment. Aligned-interval samples are byte-identical to v0.1.53. (Roadmap follow-up #1.)
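The overflow in this entry is easy to reproduce outside C++. The sketch below emulates wrapping int64 addition with ctypes (Python integers do not overflow on their own) and contrasts it with a saturating add. The helper names are hypothetical, and the saturating guard is just one possible fix; the changelog does not say which guard the actual C++ change uses.

```python
import ctypes

INT64_MAX = 2**63 - 1


def add_int64_wrapping(a, b):
    """Emulate two's-complement int64 addition, as typical hardware computes it."""
    return ctypes.c_int64(a + b).value


def add_int64_saturating(a, b):
    """Clamp to INT64_MAX instead of wrapping (assumes a >= 0 and b >= 0)."""
    if b > INT64_MAX - a:
        return INT64_MAX
    return a + b


def in_bin(pos, bin_start, bin_end):
    return bin_start <= pos < bin_end


bin_start = 100            # unaligned interval: positive start
iter_size = INT64_MAX      # "no constraint" sentinel, as described in the entry

wrapped_end = add_int64_wrapping(bin_start, iter_size)
safe_end = add_int64_saturating(bin_start, iter_size)
```

With a wrapped (negative) bin end, the bounds check rejects every position, which matches the symptom described above; with the saturating end, position 150 falls inside the bin as expected.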
v0.1.53 (2026-05-14)¶
Features¶
- gtrack_create_dense(func=) knob for per-bin aggregation. Seven reductions: weighted.mean (default, byte-identical to prior output), weighted.sum, max, min, median, count, coverage. The coverage mode with values=[1]*N and defval=0 produces a one-call ChIP-seq-style pileup track from BED inputs. defval acts as a synthetic uncovered contribution for every func except count. (R misha 5.6.32 068a02a2, 5e69c2c8.)
Notes¶
- R 5.6.32 1b4f5065 (unsigned-wrap guard) is N/A in pymisha: the ov_end > ov_start guard was already present at src/PMTrackCreate.cpp:637; the bug never existed here. Wrap-regression test added as a sentinel.
v0.1.52 (2026-05-14)¶
Fixes¶
- gdb_install_intervals now produces the full TSS/UTR sets on NCBI and UCSC backends. _install_genes previously filtered only Ensembl/GENCODE feature names (transcript, five_prime_utr, three_prime_utr); production NCBI GFF3 uses mRNA + five_prime_UTR + three_prime_UTR, and UCSC's ncbiRefSeq.gtf.gz uses 5UTR + 3UTR. The NCBI path was producing empty TSS+UTR sets; UCSC was producing empty UTR sets. The synthetic test fixture used GENCODE naming and hid the bug.
- pwm.edit_distance family: direction="below" + bidirect=True now takes MAX across strands (was MIN). A genomic substitution affects both strands, so disrupting a motif site needs both strands below threshold; the harder strand bounds the answer. (R misha 5.6.10 19c51158.)
- pwm.edit_distance family: removed hidden score_min = score_thresh default for direction="below". The hidden default was a footgun: users pre-filtering for strong matches and then calling edit distance got unexpected NAs. score_min now defaults to no filter for both directions; pass it explicitly when needed. (R misha 5.6.10 88e49b62.)
Performance¶
- gextract gained a multitasking_strategy config knob ("auto" | "tracks" | "tiles"). auto (default) picks track-parallel for >= 8 expressions with a non-streaming iterator and tile-parallel (legacy chrom-parallel) otherwise. Track-parallel runs each worker over a subset of expressions across the full interval set, which substantially outperforms the legacy chrom-parallel path on many-track / few-interval workloads (e.g., scoring thousands of motif vtracks across a peak set). (R misha 5.6.18 gmultitasking.strategy.)
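The "auto" rule stated above reduces to a small decision function. The sketch below restates it in Python under the assumption that the only inputs are the expression count and whether the iterator is streaming; the function name and signature are hypothetical, and the real selection logic may consider more than this.

```python
def pick_multitasking_strategy(strategy, n_exprs, streaming_iterator):
    """Resolve "auto" to "tracks" or "tiles" following the rule in the entry.

    Hypothetical helper for illustration; not the pymisha implementation.
    """
    if strategy not in ("auto", "tracks", "tiles"):
        raise ValueError(f"unknown multitasking strategy: {strategy!r}")
    if strategy != "auto":
        return strategy  # explicit choice wins
    if n_exprs >= 8 and not streaming_iterator:
        return "tracks"  # each worker takes a subset of expressions
    return "tiles"       # legacy chrom-parallel split


many_motifs = pick_multitasking_strategy("auto", n_exprs=2000, streaming_iterator=False)
few_tracks = pick_multitasking_strategy("auto", n_exprs=3, streaming_iterator=False)
```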
Breaking¶
- direction="below" PWM edit distance results change for bidirect=True callers (now MAX across strands) and for any caller relying on the implicit score_min = score_thresh default. Set score_min explicitly to recover the prior filter, or score_min=-inf to keep no filter.
v0.1.51 (2026-05-14)¶
Performance¶
- gintervals_random auto-routes to a C++ implementation for large genomes. Genomes with > 1000 contigs OR > 10M total bp get the new pm_intervals_random C++ path; smaller genomes keep the pure-Python implementation. 5000-contig synthetic benchmark: 6.3x speedup (~9 ms -> ~1.4 ms). C++ path uses std::mt19937_64; output is statistically equivalent to the Python path but NOT bit-identical (different RNG family). Edge case from R 1b41bceb (contig length exactly size + 2*dist_from_edge) handled.
v0.1.50 (2026-05-14)¶
Performance¶
- gtrack_create_sparse and gtrack_create_dense stream directly to track.dat + track.idx on indexed databases. Removed the per-chromosome intermediate files + post-create gtrack_convert_to_indexed step. Saves ~2.5M filesystem operations per track on million-contig genomes. Byte-identical regression tests guard the on-disk layout. gtrack_liftover inherits the optimization (it calls gtrack_create_sparse). Non-indexed databases keep the previous per-chromosome write path. (R misha 5.6.30 94a6446d, b2ca08cc, 18985be4.)
Notes¶
- R 5.6.30 747b0076 (TrackIndexWriter refactor) is N/A in pymisha: the abstraction never existed here; the new direct-write path adds an IndexedTrackWriter helper inside PMTrackCreate.cpp rather than refactoring shared code.
v0.1.49 (2026-05-14)¶
Fixes¶
- gdb_reload now clears the Python-side dataset scan cache. Track and interval-set names created externally between two gdb_reload calls become immediately visible. The C++ side was already rescanning correctly; this closes the Python-cache gap. (R misha 5.6.30 c82b01f0.)
Notes¶
- R 5.6.30 6dc476a8 (meta short-circuit per-chrom probe) is N/A in pymisha: pm_track_info already takes the indexed fast path (two stat() calls on track.idx + track.dat) when an index is present and only falls back to per-chromosome stats for per-chrom-format tracks - matching R's post-6dc476a8 behavior.
v0.1.48 (2026-05-14)¶
Performance¶
- pm_intervals_all caches its result on PMDb. Repeat calls return in microseconds instead of rebuilding the chrom-intervals DataFrame each time. Cache invalidated on gdb_init / gdb_reload. (R misha 5.6.30 ce788e75.)
- find_existing_1d_filename short-circuits for indexed tracks. Skips the O(N_aliases) chromAlias scan + per-candidate access() syscalls when the track has an index. Hot read paths on indexed million-contig databases no longer pay the alias-search cost per chromosome transition. (R misha 5.6.30 b340ccfa.)
- Eliminated redundant stat(track.idx) calls inside init_read. Both GenomeTrackFixedBin::init_read and GenomeTrackSparse::init_read now ask get_track_index() directly (which caches its own stat result) instead of stat-then-load. Removes ~2 stat syscalls per chromosome transition. (R misha 5.6.30 1db467d1.)
Notes¶
- R 5.6.30 5a9828e6 (GenomeChromKey caching) and 327bcbb2 (scanner index-aware iterator setup) are N/A in pymisha: the singleton PMDb already caches the chrom-key for the database lifetime, and the scanner's bin-size discovery loop already breaks after the first non-empty chromosome, making it constant work. Verified by measurement.
v0.1.47 (2026-05-14)¶
Fixes¶
- mypy compliance for the pymisha/genome/ modules added in v0.1.43-v0.1.46. Restores green CI on main. No behavior change.
v0.1.46 (2026-05-14)¶
Features¶
- gdb_build_genome(..., source={"source": "ncbi", "accession": "GCF_..."}). NCBI Datasets API v2 backend with FTP fallback. Install path fetches SEQUENCE_REPORT + (optional) GENOME_GFF only (full GENOME_FASTA skipped, ~900 MB saved per call on a human accession). FTP fallback at https://ftp.ncbi.nlm.nih.gov/genomes/all/<GCx>/<NNN>/<NNN>/<NNN>/<acc>_<asm>/ covers empty-zip GFF and rmsk (<acc>_<asm>_genomic.gff.gz, <acc>_<asm>_rm.out.gz). cgi/cytoband are not provided by NCBI; requesting them warns and skips. (R misha 5.6.30 409c235e, d6cd6047.)
- gdb_install_intervals(..., force=True) parity confirmed across all three install-capable backends (ucsc / ucsc-hub / ncbi). With force=False (default), requesting a set the backend does not provide raises ValueError. With force=True, the orchestrator emits a UserWarning and installs only the available subset. (R misha 5.6.29 968bf782.)
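The FTP fallback path template in the NCBI entry follows the public NCBI genomes layout, where the nine accession digits are split into three groups of three. A minimal sketch of building that directory URL is shown below; the function name is hypothetical, and the assembly name is assumed to be already in its FTP form (the sketch does no escaping or lookup).

```python
def ncbi_ftp_dir(accession, assembly_name):
    """Build the genomes/all directory URL for an assembly accession.

    Hypothetical helper for illustration; not a pymisha function.
    `accession` looks like "GCF_000001635.26"; `assembly_name` is the
    <asm> part of the directory name, e.g. "GRCm38.p6".
    """
    prefix, _, rest = accession.partition("_")
    digits = rest.split(".", 1)[0]
    if prefix not in ("GCF", "GCA") or len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"unexpected accession format: {accession!r}")
    # Split the nine digits into three path components of three digits each.
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    base = "https://ftp.ncbi.nlm.nih.gov/genomes/all"
    return f"{base}/{prefix}/{'/'.join(parts)}/{accession}_{assembly_name}/"


def gff_url(accession, assembly_name):
    """URL of the <acc>_<asm>_genomic.gff.gz file inside that directory."""
    return ncbi_ftp_dir(accession, assembly_name) + f"{accession}_{assembly_name}_genomic.gff.gz"


mouse_dir = ncbi_ftp_dir("GCF_000001635.26", "GRCm38.p6")
```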
v0.1.45 (2026-05-14)¶
Features¶
- gdb_build_genome(..., source={"source": "ucsc-hub", "accession": "GCA_..."}). UCSC mammal-hub backend. Probes known hub filename conventions (no HTML scraping); 404 on a per-asset URL is treated as "not available". Hub directories provide chromAlias + chrom.sizes + FASTA + RepeatMasker + cpgIslandExt + GTFs under genes/. Cytoband is never available from hubs. (R misha 5.6.16.)
- Multi-pass chromAlias rescue (match_by_length=True, default). Four-pass algorithm fills missing canonical entries via unique-length matches, overrides misnamed rows when a length pair resolves them, breaks cross-row name collisions by length, and synthesizes target_chroms rows when the upstream sources don't carry them. Post-rescue min_coverage gate (R 5.6.30 537bfe29). Single-pass mode (match_by_length=False) retained for callers that want the strict pre-rescue behavior.
v0.1.44 (2026-05-14)¶
Features¶
- gdb_install_intervals(groot, source, sets=...). Install UCSC intervals sets (genes, rmsk, cgi, cytoband) into an existing groot. Parses gzipped GTF + RepeatMasker .out + UCSC database TSVs; bp-weighted chromAlias detection picks the canonical chrom column. Writes per-rmsk-class subsets for the major classes (SINE/LINE/LTR/DNA/Simple_repeat/Low_complexity). Provenance written to <groot>/tracks/.misha_install.json. (R misha 5.6.16 48c8a700 partial port; ucsc-hub and ncbi backends land in v0.1.45-v0.1.46.)
- gdb_build_genome(..., sets=...) re-enabled. When sets is non-empty, gdb_build_genome now invokes gdb_install_intervals after the sequence build. Same set of backends (ucsc only in v0.1.44).
v0.1.43 (2026-05-14)¶
Features¶
- gdb_build_genome(name, ...) skeleton for the manual, local, and s3 backends. New entry point with a bundled registry of 11 genomes (hg19, hg38, mm9, mm10, mm39, rn6, rn7, dm6, ce11, sacCer3, danRer11). Resolution chain: explicit registry= arg, then $PYMISHA_GENOME_REGISTRY, then ./misha.yaml, then the bundled recipes.yaml. The ucsc, ucsc-hub, ncbi backends and gdb_install_intervals land in later releases. (R misha 5.6.16 partial port.)
v0.1.42 (2026-05-14)¶
Robustness¶
- gtrack_rm and gintervals_rm return in microseconds even on directories with millions of files. The doomed directory is renamed to a hidden .trash.<base>.<pid>.<rand> sibling and the actual filesystem cleanup runs in a detached background process. Falls back to synchronous unlink when atomic rename is not available (cross-filesystem). The cross-database overwrite path in gtrack_copy uses the same mechanism. Stale .trash.* and .<name>.tmp.* entries are swept on gdb_init (24h cutoff). (R misha 5.6.30.)
- gtrack_create_* is now atomic. An interrupted create no longer leaves a partial track directory that blocks re-creation. Writers mkdir into a hidden .<base>.tmp.<pid>.<rand> directory and os.rename to the final name on success; on failure the tmp dir is trashed. Concurrent rescans (gtrack_ls) skip in-flight tmp dirs. Applies to gtrack_create_sparse, gtrack_create_dense, gtrack_create_dense_direct, gtrack_smooth, gtrack_create, gtrack_2d_create, and gtrack_2d_import_contacts. Functions that delegate (gtrack_import, gtrack_import_set, gtrack_import_mappedseq, gtrack_create_pwm_energy, gtrack_2d_import) inherit atomicity from the wrapped writers they call. (R misha 5.6.30.)
- gdb_convert_to_indexed(threads=N) runs per-track conversions in parallel. Default threads is min(os.cpu_count(), 8). Per-track failures surface as warnings without aborting the batch. Falls back to serial on Windows and when threads == 1. (R misha 5.6.30.)
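The write-to-hidden-temp-then-rename pattern behind the atomic gtrack_create_* entry can be sketched with the standard library alone. The helper below is a simplified illustration, not the pymisha writer: its name and callback signature are hypothetical, and on failure it deletes the temporary directory directly rather than moving it to a .trash.* sibling as the entry describes.

```python
import os
import secrets
import shutil
import tempfile


def atomic_create_dir(final_path, populate):
    """Create `final_path` atomically: build into a hidden sibling, then rename.

    `populate(tmp_dir)` writes the directory contents. Hypothetical helper.
    """
    parent, base = os.path.split(os.path.abspath(final_path))
    # Hidden sibling on the same filesystem, so the rename below is atomic.
    tmp_dir = os.path.join(parent, f".{base}.tmp.{os.getpid()}.{secrets.token_hex(4)}")
    os.mkdir(tmp_dir)
    try:
        populate(tmp_dir)
        os.rename(tmp_dir, final_path)
    except BaseException:
        # Simplification: remove the partial directory instead of trashing it.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    return final_path


def write_track(tmp_dir):
    with open(os.path.join(tmp_dir, "track.dat"), "w") as fh:
        fh.write("payload")


def failing_writer(tmp_dir):
    raise RuntimeError("interrupted")


with tempfile.TemporaryDirectory() as root:
    created = atomic_create_dir(os.path.join(root, "t1"), write_track)
    created_ok = os.path.isfile(os.path.join(created, "track.dat"))
    try:
        atomic_create_dir(os.path.join(root, "t2"), failing_writer)
        failed = False
    except RuntimeError:
        failed = True
    leftovers = sorted(os.listdir(root))
```

Because readers only ever see the final name after the rename, an interrupted writer leaves nothing at the destination path that could block a retry.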
v0.1.41 (2026-05-13)¶
Testing¶
- gintervals_random regression tests for the R 5.6.30 1b41bceb edge case: contigs of length exactly size + 2*dist_from_edge (a single valid start position is available) and contigs of length exactly equal to size with dist_from_edge=0. PyMisha's pure-Python implementation already handles these correctly - the C++ bug never made it to the Python port. (R misha 5.6.30.)
v0.1.40 (2026-05-13)¶
Features¶
- gsynth_forbid_kmer(model, pattern) returns a new model whose samples avoid pattern as a substring (subject to a seeding caveat for the first k bp of each interval). Useful for CpG-null or motif-null synthetic backgrounds. Pattern length is capped at model.k + 1. (R misha 5.6.16.)
- gsynth_cell_merge() + gsynth_sample(cell_merge=). Per-joint-cell CDF redirects let you reassign under-trained joint cells to a nearest-sufficient neighbor at sample time without retraining. Accepts a list of {"from": [...], "to": [...]} dicts or a DataFrame with from_<d> / to_<d> columns. Warns on self-redirects (dropped) and duplicate sources (later entry wins). (R misha 5.6.16.)
- gsynth_sample(output_format="fasta") and gsynth_random(output_format="fasta") now write a samtools-compatible .fai alongside the FASTA (<output>.fai). Removes the need to call samtools faidx by hand. The gsynth_random .fai is a pymisha-only extension (R only added it to gsynth.sample()). (R misha 5.6.16.)
Testing¶
- Project-wide RNG seed convention: every random seed in production code, tests, benchmarks, and doctest examples now uses 60427 (was a mix of 42 and 60427).
v0.1.39 (2026-05-13)¶
Fixes¶
- gpartition, gquantiles, gscreen, gintervals_quantiles with ±inf breaks routed every value to the last bin. BinFinder used a uniform-binsize fast path that hit Inf/Inf = NaN on breaks like c(-inf, x, inf) and silently misrouted output. The binary-search path is now used whenever the binsize is non-finite. (R misha #110.)
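The Inf/Inf = NaN failure is reproducible with ordinary floats. The sketch below shows the arithmetic that breaks and a bin finder that only takes the uniform fast path when the bin size is finite. It is an illustration rather than the BinFinder code: the function names are hypothetical, and bins are assumed to be half-open on the left, (breaks[i], breaks[i+1]].

```python
import bisect
import math


def uniform_position(value, breaks):
    """The fast-path arithmetic: (value - lo) / binsize. NaN for infinite breaks."""
    nbins = len(breaks) - 1
    binsize = (breaks[-1] - breaks[0]) / nbins  # inf when the range is infinite
    return (value - breaks[0]) / binsize        # inf / inf -> nan


def find_bin(value, breaks, uniform=False):
    """Return the bin index of `value`, or None if it falls outside the breaks.

    Hypothetical helper; `uniform=True` means the caller believes the breaks
    are evenly spaced and would like the O(1) path.
    """
    if math.isnan(value):
        return None
    nbins = len(breaks) - 1
    if uniform:
        binsize = (breaks[-1] - breaks[0]) / nbins
        # Only trust the fast path when the bin size is a finite positive number.
        if math.isfinite(binsize) and binsize > 0:
            idx = math.ceil((value - breaks[0]) / binsize) - 1
            return idx if 0 <= idx < nbins else None
    # Binary search works for any sorted breaks, including -inf / inf.
    idx = bisect.bisect_left(breaks, value) - 1
    return idx if 0 <= idx < nbins else None


inf_breaks = [-math.inf, 0.0, math.inf]
broken = uniform_position(5.0, inf_breaks)
```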
Features¶
- gtrack_copy now supports cross-database copy. New optional arguments:
  - db=: destination database root. Accepts the active GROOT, any member of GDATASETS, or a valid unloaded misha root (chrom_sizes.txt + tracks/). Cross-genome destinations work.
  - overwrite=: replace an existing destination track.
  - src accepts a single track name or an iterable. With an iterable, dest= becomes a namespace prefix ("ns" -> ns.track1, ns.track2, ...); dest=None keeps each track's name.

Format mismatches between source and destination (per-chromosome vs indexed) are converted on the fly. Chromosome-order differences are remapped per file, with chr-prefix canonicalization. Chromosomes present in source but not destination are dropped with a warning; the copy refuses to create an empty track. 2D tracks (rectangles, points) require identical chromosome order. gtrack_copy now returns the list of created destination track names. (R misha 5.6.28.)
Performance¶
- gdb_init on fragmented assemblies (>1000 contigs, no chr prefix, no mito chrom) no longer creates a chr<name> alias for every contig. Ensembl-style mammalian or insect genomes are unaffected; only genomes like Phylo447 (2.4M scaffold_* contigs) avoid the alias blowup. (R misha #112.)
v0.1.38 (2026-05-07)¶
CI fixes¶
- conda publish: install anaconda-client in the activated build env, not base. v0.1.36 moved conda-build to -n base so the conda build subcommand resolves; that surfaced the symmetric issue on the upload step, where the workflow's defaults.run.shell: bash -el {0} activates the build env, and anaconda (provided by anaconda-client) wasn't on its PATH. With v0.1.36/v0.1.37 conda builds, packages built fine but the upload step bailed with anaconda: command not found. Split the install: conda-build in base (next to the conda CLI), anaconda-client in the activated env.
v0.1.37 (2026-05-07)¶
CI fixes¶
- mypy under numpy 2.2.x in CI. _numpy.where(...) returns a generic-shape ndarray[tuple[int, ...], ...] that mypy (with the typing tightening that landed in numpy 2.2) refuses to assign back to a variable previously bound to a 1-D ndarray[tuple[int], ...]. Two _numpy.where(...) results in gsynth_score (the per-flat-bin index in _extract_bin_data and the NaN-masked raw log-p) are now wrapped in cast(ndarray, ...). Newer numpy (≥2.4) didn't reproduce this locally, hence the slip.
v0.1.36 (2026-05-07)¶
CI fixes¶
- mypy: clean type errors introduced in v0.1.34/v0.1.35. Type-correct fixes only; no functional changes.
  - gintervals_from_tuples cast its list[int] strands to the wider list[int | str] accepted by gintervals after the character-strand widening in v0.1.34.
  - Wrap a couple of fancy-indexed _DNA_BASE_CODE[seq_bytes] and arr[arr[:, 0].argsort()] returns in _numpy.asarray so mypy recognises the ndarray result.
  - _extract_bin_data's string lookup branch in gsynth_score now narrows the result of _maybe_load_intervals_set (which can return a name string) to DataFrame before calling reset_index.
  - Replace unique(..., return_index=True) -> list(...) reassignment with a fresh edge_list: list[int] to avoid the array→list type clash.
  - Use cast(dict[str, Any], metadata["data"]) to type-check the prior-bin nested dict assignment in gsynth_save.
- conda publish: install conda-build in the base env, not the activated build env. The recent setup-miniconda@v3 change leaves the activated env without the conda build subcommand if conda-build is installed there, which was causing every conda build step to fail with "argument COMMAND: invalid choice: 'build'" since v0.1.34. Install with -n base so the subcommand is registered against the conda CLI.
v0.1.35 (2026-05-06)¶
Features¶
- Per-bin Dirichlet prior in gsynth_train. New prior= argument selects how the per-bin prior pi(b) is resolved for the Bayesian posterior P(a|c,b) = (N + alpha * pi_a(b)) / (sum_a N + alpha) (with alpha = pseudocount):
  - "marginal" (default) -- per-bin empirical base composition computed on post-merge counts. Bins with zero observations fall back to uniform.
  - "global" -- pooled empirical base composition broadcast to every bin.
  - "uniform" -- 1/4 per base for every bin (legacy symmetric Laplace smoothing).
  - array-like (total_bins, 4) -- explicit per-bin pi.
The resolved prior is exposed as model.prior_mode, model.prior_matrix, and model.marginal_fallbacks, and is round-tripped through gsynth_save/gsynth_load. Legacy pickles backfill to prior_mode='uniform'. Matches R misha 5.6.21 (commits 1a49d803..e89b1738).
- gsynth_score() -- score reference sequence under a trained model and write per-bp summed log-probability into a misha dense track. Supports mask= (NA-poisons output bins overlapping mask intervals), resolution= (default model.iterator), n_policy={"NA","uniform"}, sparse_policy={"NA","uniform"}, overwrite=, and stratified or 0D models. Bin lookup is aligned to pos - k (training convention). Matches R misha 5.6.21 commits ba88e197 and 3fba28c2.
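The posterior formula given for the prior= argument, P(a|c,b) = (N + alpha * pi_a(b)) / (sum_a N + alpha), can be worked through on toy counts. The sketch below is illustrative only: posterior and marginal_prior are hypothetical helpers, the counts are made up, and the "marginal" mode is reduced to a single bin with the uniform fallback for zero observations.

```python
def posterior(counts, prior, alpha):
    """(N_a + alpha * pi_a) / (sum_a N_a + alpha) for one context in one bin."""
    total = sum(counts)
    return [(n + alpha * p) / (total + alpha) for n, p in zip(counts, prior)]


def marginal_prior(bin_counts):
    """Per-bin base composition; zero-observation bins fall back to uniform."""
    total = sum(bin_counts)
    if total == 0:
        return [0.25] * 4
    return [n / total for n in bin_counts]


pi = marginal_prior([60, 20, 10, 10])  # toy A, C, G, T counts for one bin
p_seen = posterior([8, 0, 0, 0], pi, alpha=4.0)    # context observed 8 times
p_unseen = posterior([0, 0, 0, 0], pi, alpha=4.0)  # context never observed
```

An unobserved context reproduces the prior exactly, while observed counts pull the distribution toward the data in proportion to sum_a N relative to alpha.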
Fixes¶
- gsynth_sample on 0D models past model.iterator bp. The 0D _extract_bin_data path emits a single iter entry per input interval (covering iter_size bp), so positions past the first iter_size bp on a sample interval fell back to uniform random sampling instead of the trained CDF. The sample path now passes the INT64_MAX iter_size sentinel for 0D models (matching gsynth_random), so the single bin always resolves correctly. The previously-skipped regression test test_0d_sample_on_non_first_chrom_uses_trained_cdf now exercises this path under the marginal prior and passes.
Notes¶
- gquantiles perf fixes (R 625438a7, 9970dbf5) not ported. PyMisha's pm_quantiles uses StreamPercentiler::get_percentile (sort-once on the reservoir, then index by position), so it never had the O(k * N) nth_element suffix walk that R 5.6.20 introduced and 5.6.21 reverted. Same reasoning for the per-kid sort + parent k-way merge optimisation -- the existing implementation already sorts per-kid at access time.
- gintervals chrom-factor normalisation (R ba88e197 part 1) not ported. pandas interval frames use string chrom columns, so the R-only factor-level mismatch on bigset save/load doesn't have a pymisha analogue.
v0.1.34 (2026-05-06)¶
Fixes¶
- gsynth_sample stratum bin lookup off by k bp at every iter-window boundary. pm_gsynth_sample queried bin_at(pos) at the predicted-base position, but training attributes each (k+1)-mer event to bin_at(pos - k) (the leftmost base of the context window). At iter-window boundaries the sampler therefore picked up the next bin's CDF for the last k predicted positions, so cross-bin context dependencies were silently miscounted. Realigned the sample-time bin lookup to pos - k to match the convention used by gsynth_train and the cached .gsm model. Cached models stay valid; downstream tracks built from gsynth_sample should be regenerated. Matches R misha commit 3fba28c2.
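The off-by-k lookup is easy to see on a toy example. The sketch assumes fixed-width bins aligned to multiples of the iterator and uses a hypothetical bin_at helper; the values iterator=200 and k=5 are chosen for illustration.

```python
def bin_at(pos, iterator):
    """Index of the fixed-width iterator bin containing pos."""
    return pos // iterator


iterator, k = 200, 5
boundary = 200  # first position of the second iter window

# Bin used for the k predicted positions right after the boundary.
old_bins = [bin_at(pos, iterator) for pos in range(boundary, boundary + k)]
new_bins = [bin_at(pos - k, iterator) for pos in range(boundary, boundary + k)]
```

Under the training convention these k positions belong to the previous bin, because the leftmost base of their context window still lies in it.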
Features¶
- gsynth_sample and gsynth_random now preserve reference N positions by default (preserve_n=True). Centromeres, telomeric Ns and other gap regions are written verbatim into the output instead of being filled with a fabricated ACGT base. mask_copy regions still take precedence. Pass preserve_n=False to recover the previous behaviour. Matches R misha #109 (commit e90314be).
- gintervals() accepts BED-style character strand input. Strand values can now be "+", "-", ".", "*", or "" (mapped to 1/-1/0/0/0) in addition to numeric -1/0/1. Output remains numeric. Matches R misha #104 (commit 1de5131e).
- gintervals_import_bed(), gintervals_import_gff(), gintervals_import_vcf(). Three new file-format importers that preserve common metadata columns and normalise chromosome names through the active database's chromosome-alias mechanism:
  - gintervals_import_bed(file, name=True, score=True, strand=True) -- BED is already 0-based half-open, coordinates are passed through. Optionally keeps the BED4/BED5/BED6 metadata columns.
  - gintervals_import_gff(file, feature=None, strand=True, attrs=True) -- GFF/GTF is 1-based closed; converts to 0-based half-open by subtracting 1 from start. Optional feature-type filter; keeps source, type, score, optional raw attrs.
  - gintervals_import_vcf(file, info=True) -- sets start = POS - 1 and end = POS - 1 + len(REF). Keeps id, ref, alt, qual, filter, optional raw info.
Matches R misha #105 (commit da592845).
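The strand mapping and the coordinate conversions described for the importers are small enough to sketch directly. The helpers below are hypothetical stand-ins that apply only the arithmetic stated above; they are not the pymisha implementations and do no file parsing.

```python
STRAND_CODES = {"+": 1, "-": -1, ".": 0, "*": 0, "": 0}


def strand_to_int(strand):
    """Map BED-style or numeric strand values to -1/0/1."""
    if isinstance(strand, str):
        return STRAND_CODES[strand]
    if strand in (-1, 0, 1):
        return int(strand)
    raise ValueError(f"invalid strand: {strand!r}")


def gff_to_interval(start, end):
    """GFF/GTF is 1-based closed; 0-based half-open shifts start and keeps end."""
    return start - 1, end


def vcf_to_interval(pos, ref):
    """VCF POS is 1-based; the interval spans the REF allele."""
    return pos - 1, pos - 1 + len(ref)
```

For instance, a GFF feature at 100..200 becomes [99, 200), and a VCF record at POS 1000 with REF "ACG" becomes [999, 1002).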
v0.1.33 (2026-04-26)¶
Fixes¶
- gsynth_sample silently fell back to uniform-random sampling when intervals were not aligned to the iterator bin boundary. pm_gsynth_train/pm_gsynth_sample inferred iter_size from the first same-chrom diff in iter_starts. For an interval whose start was not a multiple of the iterator (e.g. iterator=200, interval start at 64), gextract emits a partial first bin, so the inferred iter_size equalled the partial width (e.g. 136) instead of the true iterator. Every position past the partial bin then fell through bin_idx = -1 and was sampled uniformly at random, including for k-mer constraints the caller explicitly tried to enforce. Matches R misha #94 (commit 02f7ad2f). The fix:
  - pm_gsynth_train/pm_gsynth_sample now require the iterator as an explicit positional argument; non-positive values raise.
  - GsynthModel stores the training-time iterator (model.iterator) and gsynth_sample defaults to it when the caller does not pass iterator=.
  - .gsm save/load and legacy pickle load preserve / backfill model.iterator.
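The iterator=200, start=64 case above can be reproduced with a few lines of plain Python. iterator_bins is a hypothetical stand-in for the bin layout described in the entry (bins aligned to multiples of the iterator, with a partial first bin); it is not the gextract implementation.

```python
def iterator_bins(start, end, iterator):
    """Yield (bin_start, bin_end) for bins aligned to multiples of iterator."""
    pos = start
    while pos < end:
        nxt = min((pos // iterator + 1) * iterator, end)
        yield pos, nxt
        pos = nxt


bins = list(iterator_bins(64, 1000, 200))
starts = [s for s, _ in bins]
# The old heuristic: take the first difference between consecutive starts.
inferred = starts[1] - starts[0]
```

Here the inferred size is the width of the partial first bin rather than 200, which is why the iterator is now passed explicitly instead of being inferred.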
Performance¶
- Drop MAP_POPULATE from MmapFile. MmapFile::open() previously requested MAP_POPULATE on Linux, which forces synchronous page-in of the entire mapped file on every mmap call. Per-chromosome track files trigger a fresh mmap on every chrom transition during expression evaluation, so multi-track multi-chrom queries paid full page-walk cost per track per chrom. MADV_SEQUENTIAL (still set) is sufficient to drive read-ahead for the bin-scan access pattern. Matches R misha #96 (commit eb30be95); R measured ~10× speedups on realistic motif-extract workloads.
v0.1.32 (2026-04-19)¶
Fixes¶
- gsynth_train/gsynth_sample silently dropped non-first-chromosome intervals in 0D (unstratified) models. _extract_bin_data hardcoded iter_chroms to zero for the 0D branch, so the C++ backend routed every iterator entry to chrom_bins[0] and left bins empty for every other chromosome. In training, k-mers on any chromosome other than chromkey ID 0 were silently uncounted (chrX-only training produced total_kmers == 0); in sampling, positions on such chromosomes fell back to drand48() * 4 uniform random instead of the trained Markov CDF. Multi-dimensional models, gsynth_random, and gsynth_replace_kmer were not affected. Users who trained 0D (unstratified) models on intervals spanning multiple chromosomes should re-train and re-sample with this version.
v0.1.31 (2026-04-16)¶
Fixes¶
- pwm.max.pos strand sign: DnaPSSM::max_like_match() used to update best_dir at every iteration, so the returned strand reflected the last scanned position rather than the best-scoring one. Fixed to match R misha commit b7d469a6 (bug present since R misha v4.3.0). Added golden-master regression test against R misha for the signed position output.
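The shape of this bug can be illustrated with a simplified scan. The function below is a hypothetical Python analogue, not the C++ DnaPSSM code: it takes precomputed per-position scores for each strand and returns the position and strand of the best one.

```python
def max_like_match(scores_fwd, scores_rev):
    """Return (position, strand) of the best score across both strands."""
    best_score = float("-inf")
    best_pos, best_dir = -1, 0
    for pos, (fwd, rev) in enumerate(zip(scores_fwd, scores_rev)):
        score, direction = (fwd, 1) if fwd >= rev else (rev, -1)
        if score > best_score:
            # Update the strand together with the score and position;
            # assigning best_dir outside this branch reproduces the old bug.
            best_score, best_pos, best_dir = score, pos, direction
    return best_pos, best_dir


pos, strand = max_like_match([-9.0, -2.0, -8.0], [-7.0, -5.0, -3.0])
```

In this toy input the best score sits at position 1 on the forward strand, while the last scanned position favours the reverse strand, so an unconditional update would have flipped the sign.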
v0.1.30 (2026-04-15)¶
Performance¶
- ggenome_implant C++ fast path: FASTA read/write/perturb loop now runs in C++ with 4 MB I/O buffers, matching the misha R package implementation.
v0.1.29 (2026-04-15)¶
Features¶
- Genome editing: Added ggenome_implant() and ggenome_transplant() for replacing intervals in a reference genome with donor sequences. Supports literal donor strings or extraction from a misha database, with optional trackdb creation and .fai index generation.
v0.1.28 (2026-04-14)¶
Fixes¶
- mypy CI green: Resolved 3 no-any-return errors from C++ bridge calls in _crc64.py and sequence.py.
v0.1.27 (2026-04-14)¶
Fixes¶
- Example DB dotfiles in wheel: Added examples/**/.* glob to package-data so .attributes, .colnames, and .ro_attributes files are included in wheels. Without these the bundled example DB was non-functional.
v0.1.26 (2026-04-14)¶
Features¶
- Bundled example database: gdb_init_examples() now works on any machine after pip install pymisha. The example trackdb is shipped inside the wheel under pymisha/examples/.
v0.1.25 (2026-04-12)¶
Maintenance¶
- Public API cleanup: Removed 10 underscore-prefixed internal symbols from __all__; added noqa: F401 for the C++ bridge imports.
- C++ memory safety: Replaced new char[]/delete[] with std::vector<char> in indexed format writers. Replaced manual new/delete with std::unique_ptr in PMWilcox, GenomeSeqFetch, PMTrackExpressionVars. Added snprintf bounds checking for shared-memory error buffer.
- Compiler warnings: Removed -Wno-switch and -Wno-strict-aliasing suppressions from setup.py. Fixed misleading indentation in GenomeTrackFixedBin.cpp and uninitialized variable in PMTrackCreate.cpp. Zero warnings from project sources.
- Coverage reporting: Added [tool.coverage.run]/[tool.coverage.report] to pyproject.toml; added --cov=pymisha --cov-report=term-missing to Linux CI.
Documentation¶
- Thread safety: Added concurrency constraints section to README.md and quickstart guide (single-threaded, one DB per process, global CONFIG).
- _PMLOCALS comment: Expanded explanation of the C++ namespace bridge at the end of __init__.py.
v0.1.24 (2026-04-12)¶
Features¶
- Comprehensive type annotations: Added inline type annotations to all function signatures across all 26 modules (~500 functions). Created py.typed marker (PEP 561) and pymisha/_types.py with shared type aliases (Intervals, PMDataFrame, Iterator, TrackExpr, Chroms). Full mypy pass with zero errors.
Maintenance¶
- Dynamic __version__: Replaced hardcoded version string with importlib.metadata.version() to stay in sync with pyproject.toml automatically.
- mypy in CI: Added type checking step to GitHub Actions workflow.
- mypy config: Enabled check_untyped_defs and warn_unused_ignores in pyproject.toml.
v0.1.23 (2026-04-04)¶
Performance¶
- LUT-based DNA encoding: Replaced all switch-statement character encoding in DnaPSSM with static 256-entry lookup tables (BASE_ENCODE, COMPLEMENT_ENCODE, NEUTRAL_CHAR), eliminating branch misprediction overhead in PWM scoring hot loops. Fixed latent case 'h' bug in integrate_energy reverse-complement path. Ported from R misha PR #83.
- DP buffer reuse: PWMEditDistanceScorer compute_with_indels() now reuses a persistent buffer instead of per-window heap allocation, reducing malloc/free overhead for indel-mode edit distance scoring.
v0.1.22 (2026-04-03)¶
Features¶
- Direction parameter for edit distance: New direction="above"/"below" parameter in gseq_pwm_edits(), virtual tracks (pwm.edit_distance, pwm.edit_distance.lse), and all edit distance scorers. "above" (default) finds min edits to reach threshold; "below" finds min edits to disrupt score below threshold. Ported from R misha.
- Variable Markov order k: gsynth_train() now accepts k=1..10 (default 5) for configurable Markov order. Includes format v2 for .gsm files with v1 backward compatibility, non-integer k rejection, and full propagation through parallel train/sample.
Performance¶
- Edit distance 1.5x speedup (subs-only genome scans on hg38): IC-ordered column processing, score-aware pigeonhole viable tables, sliding-window N-count skip, persistent allocation in heuristic, inline pigeonhole prefilter. Benchmarked on hg38 chr1 (249Mb): 10.6s → 7.0s.
Testing¶
- 98 new tests (2551 total): 54 direction=below tests (subs, indels, LSE, vtracks, edge cases), 44 variable-k tests (k=1/3/5/7/10 training, validation, save/load, parallel).
v0.1.21 (2026-03-29)¶
Performance¶
- C++ data structure optimizations (ported from R misha): Bitmask replacement for m_functions (vector<bool> → uint32_t), vector<uint8_t> in PWMEditDistanceScorer, golden-ratio multiplicative hashing, StreamPercentiler template comparators for compiler inlining.
- C++ I/O optimizations: MmapFile RAII utility, coalesced packed-struct I/O in TrackIndex2D/TrackIndex/GenomeIndex/PMTrackIndexedFormat, mmap-backed fixed-bin reads in GenomeTrackFixedBin.
- C++ safety: SegmentFinder overflow-safe midpoint and iterative destructor.
- PyMisha-specific C++ optimizations: CHROM string interning at scanner init, NumPy array reuse across evaluation batches, sparse interval vector adaptive pre-allocation.
- Direct-to-NumPy accumulation: New PMDirectAccumulator writes gextract scan results directly into pre-allocated NumPy arrays, bypassing intermediate C++ vector storage.
- Python vectorization: Replaced iterrows()/itertuples() with vectorized groupby in analysis.py/extract.py/vtracks.py, NumPy advanced indexing in _compute_value_df_vtrack(), removed 7 unnecessary DataFrame .copy() calls, batch filter resolution with pd.concat.
- Real hg38 benchmarks: gsummary +87%, sparse extract +10%, dense extract +2-4%.
v0.1.20 (2026-03-25)¶
Features¶
- PWM edit distance optimizations: Ported all R misha performance enhancements: flat PSSM lookup tables, precomputed base index arrays, pigeonhole pre-filter, suffix-bound early-abandon, quick deficit check, and specialized exact solvers for max_indels=1 and max_indels=2. Mandatory edit handling fixes for log-zero PSSM entries.
- gseq_pwm_edits() indel support: New max_indels parameter enables insertion/deletion detection via banded 3D Needleman-Wunsch DP with alignment traceback. Output includes edit_type column ("sub", "ins", "del") and gap characters in window_seq/mutated_seq.
- Intra-chromosome parallelization: Large chromosomes are now split across multiple workers by genomic range with bin-aligned boundaries (ported from R misha's split_intervals_1d_by_range). Affects gscreen, gextract, gsummary, gquantiles, gdist, gpartition, gcor. Minimum 50,000 bins per worker.
- Vtrack C++ scanner integration: Virtual track expressions (PWM, edit_distance, kmer, masked, and value-based aggregations like avg/sum/min/max) now evaluate inline in the C++ scanner loop instead of falling back to a serial Python eval path. This enables full fork/FIFO parallelism and intra-chromosome splitting for vtrack-heavy workloads.
Performance¶
- Edit distance genome scans: 19x speedup with parallelism (76.5s serial → 4.0s parallel on chr1:0-10Mb, CTCF D=2 K=2).
- Single-threaded edit distance throughput matches R misha (8.78 vs 7.32 sec/Mb for D=2 K=2).
- Hit counts match R misha exactly across all benchmark configurations.
Testing¶
- 72 new tests (2453 total): 55 edit distance tests (adversarial, indel, optimization consistency), 7 intra-chromosome parallelization tests, 10 vtrack C++ path tests.
- Adversarial test suite validates specialized vs generic DP solvers, exhaustive L=2 enumeration, mandatory edit handling, and known bug regressions.
v0.1.19 (2026-03-17)¶
Features¶
- DataFrame intervals as iterator: gextract(), gscreen(), gsummary(), gquantiles(), gdist(), gpartition(), giterator_intervals(), gtrack_create(), gtrack_smooth(), and glookup() now accept a pandas DataFrame of intervals as the iterator parameter, matching R misha behavior. Iterator intervals are intersected with the scope in Python and passed to C++ with iterator=-1.
- Interval set names as iterator: All iterator-accepting functions now also accept a string naming a saved interval set as the iterator parameter.
- String interval set names in set operations: gintervals_intersect(), gintervals_union(), gintervals_diff(), gintervals_canonic(), gintervals_neighbors(), gintervals_covered_bp(), gintervals_coverage_fraction(), gintervals_force_range(), gintervals_mark_overlaps(), and gintervals_annotate() now accept string interval set names in addition to DataFrames.
- partial_bins parameter: giterator_intervals() gained a partial_bins parameter ("clip", "drop", or "exact") controlling how bins that don't fit entirely within an interval are handled.
- gintervals_covered_bp() src parameter: Optional src parameter restricts counting to the intersection of intervals with src.
- gtrack_create() band support: The band parameter is now wired through for 2D track creation instead of raising ValueError.
Testing¶
- 21 new tests for API type gap fixes (string names, partial_bins, band, covered_bp src).
- Tests for DataFrame and interval-set-name iterators across all affected functions.
v0.1.18 (2026-03-16)¶
Features¶
- User variables in expressions: Python variables from the caller's scope can now be used in expression strings passed to gextract(), gscreen(), gdist(), gsummary(), and gquantiles(). Variables are resolved via frame introspection at the API boundary. An optional vars= parameter allows explicit control. Track names and coordinates (CHROM, START, END) take priority over user variables. The AST-validated security sandbox is preserved.
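Resolving caller variables by frame introspection can be sketched with the standard library alone. resolve_user_vars is a hypothetical helper that shows the general technique and the stated priority rule; it is not the pymisha implementation, does not reproduce its sandbox, and takes the list of track names as an assumed input.

```python
import ast
import sys

RESERVED = {"CHROM", "START", "END"}


def resolve_user_vars(expr, track_names=(), vars=None):
    """Collect values for free names in expr from the caller's scope."""
    tree = ast.parse(expr, mode="eval")
    names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    names -= RESERVED | set(track_names)  # tracks and coordinates win
    if vars is not None:
        return {n: vars[n] for n in names if n in vars}
    frame = sys._getframe(1)  # the caller's frame
    scope = {**frame.f_globals, **frame.f_locals}
    return {n: scope[n] for n in names if n in scope}


def caller():
    threshold = 0.5
    return resolve_user_vars("dense_track > threshold", track_names=["dense_track"])


resolved = caller()
explicit = resolve_user_vars("x + y", vars={"x": 1})
```

Passing vars= skips the frame lookup entirely, which is useful when the call site is wrapped in helper functions and the caller's frame is not the one holding the variables.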
Testing¶
- 16 new tests covering module-level variables, function locals, closures, explicit vars=, virtual track integration, priority/shadowing, error handling, and numpy operations.
v0.1.17 (2026-03-16)¶
Features¶
- Motif format import: gseq_read_meme(), gseq_read_jaspar(), and gseq_read_homer() for reading MEME, JASPAR PFM, and HOMER motif formats. Returns dict[str, pd.DataFrame] with A/C/G/T probability columns directly usable with gseq_pwm(). All parsers are native (no new dependencies).
- Track export: gtrack_export_bedgraph() and gtrack_export_bigwig() for exporting tracks and track expressions to standard bedGraph and BigWig formats. Supports gzip compression, virtual tracks, track expressions, and custom iterators.
Testing¶
- 69 motif import tests covering MEME, JASPAR (header + simple PFM), HOMER parsing, error handling, and integration with gseq_pwm().
- 14 track export tests covering bedGraph format, gzip compression, NaN exclusion, sparse tracks, track expressions, and BigWig conversion.
- Cross-validated with R misha: all 7 parsed matrices identical (max diff: 3.3e-16).
v0.1.16 (2026-03-13)¶
Features¶
- Cross-platform .gsm format for gsynth models: gsynth_save() and gsynth_load() now use a YAML metadata + binary arrays format readable by both Python and R misha. Legacy pickle files are still supported via automatic format detection.
- Added compress parameter to gsynth_save() for optional ZIP archive output.
- Added gsynth_convert() to migrate legacy pickle model files to .gsm format.
- Added min_obs field to GsynthModel dataclass.
- Added pyyaml as a dependency.
Testing¶
- 9 new tests covering directory/ZIP round-trip, legacy pickle backward compatibility, conversion, and min_obs preservation.
- Cross-platform compatibility verified with R misha (Python saves, R loads and vice versa).
v0.1.15 (2026-03-09)¶
Features¶
- Interval set attributes: gintervals_attr_get(), gintervals_attr_set(), gintervals_attr_export(), gintervals_attr_import() for managing interval set attributes stored as .iattr binary files next to .interv files. Matches R misha gintervals.attr.* API.
- gintervals_rm() now cleans up companion .iattr files when deleting interval sets.
Testing¶
- 24 new tests covering basic get/set, export/import, cleanup on deletion, bulk operations, and edge cases.
v0.1.14 (2026-03-04)¶
Features¶
- Indexed 2D track support: gtrack_2d_convert_to_indexed() converts per-chromosome-pair 2D tracks to single-file indexed format (track.dat + track.idx), matching R misha 5.5.0.
- Auto-conversion of 2D tracks to indexed format when database is indexed via gdb_convert_to_indexed().
- gdb_convert_to_indexed() now includes 2D tracks (rectangles and points) in batch conversion.
- gtrack_convert_to_indexed() auto-dispatches to 2D conversion for 2D tracks.
- 2D tracks created via gtrack_2d_create() and gtrack_2d_import_contacts() are automatically converted to indexed format when the database is in indexed mode.
Performance¶
- Indexed 2D tracks reduce file descriptor usage from O(N^2) to O(1) per track.
- Single mmap for indexed track.dat eliminates per-pair file open/close overhead.
v0.1.13 (2026-03-03)¶
Features¶
- gcompute_strands_autocorr: Strand autocorrelation for nascent transcription analysis, matching R misha's C++ GenomeComputeStrandAutocorr algorithm. Parses mapped reads files, builds binned strand coverage, computes Pearson cross-correlation at distance offsets.
- gintervals_annotate tie_method: Added tie_method parameter ("first", "min.start", "min.end") for controlling tie-breaking when multiple annotations are equidistant.
- gtrack_2d_import multi-file: Accepts a list of file paths, reads and concatenates all before building the quad-tree.
- grevcomp: Standalone reverse complement function for DNA strings.
- gdb_mark_cache_dirty: Cache invalidation function (delegates to gdb_reload).
- gdataset_example_path: Returns filesystem path to a bundled example dataset.
- COMPUTED track detection: COMPUTED 2D tracks (Hi-C normalization) now raise an informative NotImplementedError in 7 API functions instead of failing silently.
Testing¶
- Multi-process hardening: 18 new edge-case tests for gmax_processes/_parallel_extract covering parity, intervalID remapping, single-chrom, empty intervals, virtual track bypass, and stress scenarios.
- Test suite: 2103 passed, 25 skipped (up from 1993).
Documentation¶
- Track arrays explicitly excluded: GAP-025 marked NOT PLANNED; 5 gtrack.array.* functions permanently out of scope.
- Gap coverage: 136/138 in-scope R misha functions (98.6%), up from 130/144 (90%).
v0.1.12 (2026-03-02)¶
Features¶
- 2D vtrack non-aggregation functions: Added exists, size, first, last, sample, and global.percentile for 2D virtual tracks. All return one row per query interval.
- 2D set operations: gintervals_2d_intersect (vectorized numpy pairwise rectangle intersection) and gintervals_2d_union (concatenate + sort).
- 2D iterator: giterator_intervals_2d generator yields one DataFrame per input 2D interval. Supports band filtering, virtual tracks, multiple expressions.
- Trans contact mirroring: gtrack_2d_import_contacts now writes both chrA-chrB and chrB-chrA files for trans contacts, matching R misha symmetric behavior.
- Path functions: Added gtrack_path(track) and gintervals_path(name) convenience functions returning filesystem paths to track/interval set directories.
- R-serialization detection: gtrack_var_get now detects R-serialized track variables (RDS/serialize format) and raises an informative error instead of returning garbage or crashing.
- gextract file output: Added file parameter (streaming TSV write) and intervals_set_out parameter (save result intervals as named set) to gextract.
- PWM spatial weighting: Implemented spat_factor/spat_bin parameters for gseq_pwm, matching R misha's log-space spatial weight modulation. Removed NotImplementedError.
- Bigset transparent iteration: Named bigset interval sets are now transparently loaded in 21 functions across all modules (extract, summary, intervals, liftover, lookup, sequence, analysis, gsynth).
- gtrack_dbs/gintervals_dbs: Return the dataset that provides each track or interval set, matching R misha gtrack.dbs/gintervals.dbs.
- intervals_set_out parameter: Added to 8 functions (gscreen, gpartition, glookup, gintervals_force_range, gintervals_union, gintervals_intersect, gintervals_diff, gintervals_normalize) for saving results as named interval sets.
- gsynth_sample bin_merge override: Added bin_merge parameter for sampling-time bin merge overrides without modifying the model.
- Parallel extraction (gmax_processes): Multi-process gextract splits work by chromosome across forked workers. Configurable via gmax_processes(n).
Bug fixes¶
- dim parameter in gvtrack_iterator: Fixed correctness bug where dim=1/dim=2 was silently ignored. 2D tracks can now be projected to 1D for extraction over 1D intervals.
- gintervals_force_range column preservation: Extra columns beyond chrom/start/end are now preserved when clipping intervals to chromosome boundaries, matching R misha behavior.
Performance¶
- C++ quad-tree reader: Replaced pure-Python struct.unpack quad-tree traversal with C++ implementation (QuadTreeReader.h/cpp). Stats queries 182x faster, object queries 14x faster. Batch stats API (pm_quadtree_query_stats_batch) eliminates per-interval Python→C++ overhead for 2D vtrack aggregation.
- gcis_decay vectorization: C++ bulk quad-tree object extraction + numpy vectorized distance computation, binning (np.searchsorted + np.bincount), and domain containment checks. Eliminates per-object Python loop.
- Liftover mapping vectorization: Replaced per-interval Python mapping loop with numpy prefix-max overlap search, batch searchsorted, and vectorized strand-aware coordinate transformation.
- DataFrame construction: Replaced list-of-dicts pd.DataFrame(rows) patterns with column-wise numpy array construction in liftover.py and intervals.py (5 sites, 2-5x faster for large results).
- gbins optimization: Vectorized gbins_summary with numpy.bincount and optimized gbins_quantiles with sort-based grouping. 1.4-1.5x speedup.
- K-mer vectorization: Numpy stride_tricks-based k-mer hashing in gseq_kmer and gseq_kmer_dist. 3.5x average speedup over per-sequence Python loops.
- Liftover overlap resolution: Vectorized 7 overlap resolution functions using pandas groupby, numpy cumsum merging, and vectorized interval operations.
- PWM scoring vectorization: Numpy stride-tricks vectorized PWM scoring in gseq_pwm: sliding window via as_strided, fancy indexing into log_pssm, vectorized base encoding. 17.6x speedup for batch scoring.
- VTrack per-row vectorization: Replaced 4 iterrows/per-row loops in vtracks.py with numpy operations: _build_unmasked_segments no-mask path, overlap matching, nearest fallback, base_starts extraction.
- Pre-computed vtrack values: Eliminated per-chunk vtrack recomputation in mixed C++/vtrack extraction. Vtracks are now computed once for the full interval set and sliced per chunk.
- Multi-chunk quad-tree writer: _quadtree.py now supports multi-chunk serialization matching R misha's StatQuadTreeCached format. Prevents OOM on very large 2D tracks.
- Batch gintervals_mapply: Replaced per-interval gextract calls with single batch extraction + intervalID grouping. Eliminates N separate C++ calls.
- C++ band-filtered query: Added pm_quadtree_query_objects_band for C++ band-filtered quad-tree object enumeration, replacing pure-Python band filtering.
v0.1.11 (2026-03-01)¶
Features¶
- 2D virtual track aggregation: All five 2D vtrack functions (area, weighted.sum, min, max, avg) are now supported, matching R misha feature parity. Previously only alias-style vtracks (avg/mean) were allowed in 2D extraction.
- Hybrid quad-tree stat traversal: 2D aggregation uses R misha's get_stat algorithm: O(1) for fully-contained subtrees via pre-computed node stats, O(K) enumeration only at partially-overlapping leaves. Arena-clamped 3-way intersection prevents double-counting across sibling nodes.
- Band filter support: 2D aggregation vtracks work with band filters (falls back to per-object enumeration since node-level stats don't account for diagonal band constraints).
Bug fixes¶
- pandas 3.0 compatibility: Fixed C++ extension and Python codebase for pandas 3.0 (DataFrame construction, deprecated APIs).
v0.1.10 (2026-02-27)¶
Documentation¶
- Fixed API reference: added docstring_style: numpy to mkdocstrings config so Parameters, Returns, Examples, and See Also sections render correctly instead of as plain text.
- Split monolithic API page (856KB, 136 functions) into 10 per-section pages: Database, Datasets, Tracks, Virtual Tracks, Intervals, Data Operations, Liftover, Sequence Analysis, Genome Synthesis.
- Disabled inline source code display (show_source: false) to reduce page bloat.
- Added signature annotations and separate signature rendering for better readability.
- Limited TOC depth to prevent sub-sections (Parameters, Returns) from cluttering the sidebar.
v0.1.9 (2026-02-27)¶
Bug fixes¶
- Fixed multi-chunk quad-tree reader: cross-chunk references (negative kid offsets) now correctly read the target chunk header instead of treating the file position as a node offset.
- Fixed gintervals_summary and gintervals_quantiles for 2D intervals: replaced hardcoded 1D column names with dynamic coordinate column selection.
- Added _maybe_load_2d_intervals_set calls to gsummary, gquantiles, gdist so string-named 2D interval sets are auto-detected.
Features¶
- 2D vtrack iterator shifts: gvtrack_iterator_2d shifts (sshift1/eshift1/sshift2/eshift2) are now applied during 2D extraction.
Performance¶
- Cache file mmap per chrom pair in 2D extraction — opens each file once instead of per-interval.
- Replace iterrows() with vectorized numpy extraction in gtrack_2d_create and gtrack_2d_import_contacts.
v0.1.8 (2026-02-26)¶
Bug fixes¶
- Fixed GInterval::dist2coord treating coord == end as inside the interval, inconsistent with the half-open [start, end) convention used throughout the codebase. This could affect distance calculations for coordinates that fall exactly on an interval boundary.
v0.1.7 (2026-02-23)¶
Performance¶
- Batch chromosome normalization: _canonicalize_known_chroms now normalizes only unique chromosome names (one C++ call per unique name instead of per row), then applies the mapping vectorially via Series.map. ~50× faster on large interval sets.
- Vectorized dense pileup in gtrack_import_mappedseq: Replace per-coordinate Python loops with NumPy-based duplicate detection, vectorized bin assignment, and np.add.at accumulation. Replace per-bin dict.append row building with np.arange/np.concatenate. ~22× faster row building.
- Cached chromosome normalization during SAM parsing: Per-read pm_normalize_chroms calls are now cached so each unique chromosome string is normalized only once.
- Removed redundant DataFrame copy in gtrack_create_dense.
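The "normalize unique names once, then map" idea behind the batch chromosome normalization can be shown without pandas. The sketch below is a pure-Python analogue under stated assumptions: add_chr_prefix is a hypothetical stand-in for the per-name C++ call, and a dict lookup plays the role of Series.map.

```python
def normalize_chroms(chroms, normalize_one):
    """Normalize each unique name once, then map every row through the cache."""
    mapping = {name: normalize_one(name) for name in set(chroms)}
    return [mapping[name] for name in chroms]


calls = []


def add_chr_prefix(name):
    """Stand-in for the expensive per-name normalization call (hypothetical)."""
    calls.append(name)
    return name if name.startswith("chr") else "chr" + name


rows = ["1", "chr2", "1", "X", "chr2", "1"]
normalized = normalize_chroms(rows, add_chr_prefix)
```

The number of expensive calls scales with the number of distinct chromosome names rather than the number of rows, which is where the speedup on large interval sets comes from.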
Features¶
- gtrack_create_dense_direct: New function that writes Misha dense track binary files directly, bypassing the C++ bridge. Supports reload=False for batch creation (call gdb_reload() once after many tracks). Inspired by borzoi_finetune's ~100× faster direct-write approach for multi-track workloads.
v0.1.6 (2026-02-17)¶
Documentation¶
- Replace the docs favicon with an icon-only transparent asset (no pymisha wordmark text) and configure MkDocs to use it.
- Re-export the docs logo with transparent background and cleaner edges while reducing file size from 5.2MB to ~3.6MB.
v0.1.5 (2026-02-17)¶
Features¶
- Add gdb_export_fasta for efficient full-database genome export to FASTA with streaming I/O for indexed and per-chromosome database formats, line wrapping, chunked reads, overwrite guard, and optional temporary groot switching.
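To make the combination of chunked reads and line wrapping concrete, here is a minimal Python sketch. The function write_fasta_record and its width parameter are hypothetical and do not reflect the actual gdb_export_fasta signature; the overwrite guard and groot switching are not shown.

```python
import io


def write_fasta_record(out, name, chunks, width=60):
    """Write one FASTA record, wrapping at `width` bases per line.

    `chunks` is any iterable of sequence strings; chunk boundaries need not
    align with line boundaries, so a partial line is carried between chunks.
    """
    out.write(f">{name}\n")
    carry = ""
    for chunk in chunks:
        carry += chunk
        # Emit every complete line currently buffered.
        while len(carry) >= width:
            out.write(carry[:width] + "\n")
            carry = carry[width:]
    if carry:
        # Flush the final, possibly shorter, line.
        out.write(carry + "\n")


buf = io.StringIO()
write_fasta_record(buf, "chr1", ["ACGTAC", "GT", "ACGTA"], width=5)
text = buf.getvalue()
```

Because only the unfinished line is held in memory, the wrapped output is the same no matter how the sequence is split into chunks.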
Bug fixes¶
- Fix gtrack_liftover indexed-source detection to ignore non-file entries (for example vars/), ensuring indexed-only source tracks are parsed from track.idx/track.dat correctly.
Tests and benchmarks¶
- Add tests for gdb_export_fasta covering chunking/wrapping parity, overwrite behavior, root restoration, and per-chromosome chr prefix fallback.
- Make Python-vs-R benchmark comparison fair by forcing single-process R timing in benchmark helper (options(gmax.processes = 1)), and add a new large-database multiprocess benchmark for gsummary.
v0.1.4 (2026-02-16)¶
Documentation¶
- Add a concise "Misha Basics (Short Guide)" tutorial focused on core concepts: tracks, intervals, iterator policies, virtual tracks (including sshift/eshift), and PWM basics with examples from the bundled example DB.
- Add the new basics tutorial to MkDocs navigation under Tutorials.
v0.1.3 (2026-02-15)¶
Features¶
- 2D extraction parity: Added 2D support for arithmetic expressions, virtual-track expressions, named 2D interval-set scopes in extraction/screening, and 2D iterator intervals from track-name iterators.
- Intervals utilities: Added gintervals_is_bigset API and exported it from the public package namespace.
Bug fixes¶
- Value-based virtual tracks: Fixed DataFrame-source handling for interval-only functions, multi-chrom behavior, overlap validation by function class, and Python fallback parity for nearest and position reducers.
- Filtered value semantics: Fixed filtered value-based avg to use overlap-length weighting and aligned empty-bin behavior for reductions.
- PWM spatial validation: Enforced positive finite spat_factor and positive integer spat_bin at vtrack creation.
- 2D range clipping: Added 2D support in gintervals_force_range.
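The overlap-length weighting mentioned in the filtered value semantics item can be illustrated with a short Python sketch. The function name, the segment tuple layout, and the None result for an empty bin are all assumptions made for illustration, not a description of PyMisha's implementation or of its empty-bin behavior.

```python
def overlap_weighted_avg(bin_start, bin_end, segments):
    """Average of segment values weighted by their overlap with the bin.

    `segments` is an iterable of (start, end, value) in half-open coordinates.
    Returns None when nothing overlaps the bin (an assumption of this sketch).
    """
    total = 0.0
    covered = 0
    for start, end, value in segments:
        overlap = min(end, bin_end) - max(start, bin_start)
        if overlap > 0:
            total += value * overlap
            covered += overlap
    return total / covered if covered else None


# Bin [0, 100): 10 bp at value 1.0 and 90 bp at value 2.0.
avg = overlap_weighted_avg(0, 100, [(0, 10, 1.0), (10, 100, 2.0)])
empty = overlap_weighted_avg(0, 100, [(200, 300, 5.0)])
```

An unweighted mean of the two values would give 1.5, whereas weighting by overlap length gives 1.9, which reflects that most of the bin carries the value 2.0.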
v0.1.2 (2026-02-15)¶
Features¶
- global.percentile vtracks: Python-side support for global.percentile, global.percentile.min, and global.percentile.max virtual track functions.
- Sparse vtrack C++ fast path: Forward-scan cursor for avg/sum/min/max/size/exists reducers on sparse tracks, replacing per-interval generic reducer flow.
Infrastructure¶
- Conda packaging: Automated conda package builds on release (Python 3.10–3.12 × NumPy 1.26/2.0/2.1 × Linux/macOS). Install via conda install -c aviezerl pymisha.
Bug fixes¶
- Fix vtrack cache key to include DB root, avoiding cross-DB cache collisions.
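A minimal Python sketch of the idea behind this fix follows. The key layout and the function vtrack_cache_key are hypothetical; the point is only that two databases defining a vtrack with the same name and parameters no longer share a cache entry once the DB root is part of the key.

```python
def vtrack_cache_key(db_root, vtrack_name, params):
    """Build a hashable cache key; all names here are hypothetical."""
    return (db_root, vtrack_name, tuple(sorted(params.items())))


cache = {}
cache[vtrack_cache_key("/data/hg38", "v", {"func": "avg"})] = "hg38 result"
cache[vtrack_cache_key("/data/mm10", "v", {"func": "avg"})] = "mm10 result"

n_entries = len(cache)  # two entries: same vtrack name, different DB roots
hit = cache[vtrack_cache_key("/data/hg38", "v", {"func": "avg"})]
```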
v0.1.1 (2026-02-14)¶
Performance¶
- Phase 1 optimizations: Reduce BufferedFile default buffer (2MB → 128KB) for multitask workloads, cache per-chromosome CHROM strings to avoid per-row PyUnicode_FromString, skip fseek for sequential fixed-bin reads, add reducer fast-path in fixed-bin to skip unused function bookkeeping, stream sparse overlaps lazily instead of materializing all upfront.
- Phase 2 optimizations: Add basic-only sparse fast path in calc_vals (tight loop for avg/sum/min/max when no position/stddev/sample needed), replace dynamic_cast with static_cast in per-row hot loop, skip CHROM/START/END array population when expressions don't reference them, reuse scratch buffers in fixed-bin multi-bin path, eliminate extra copy in sparse track loading.
- Combined effect: 13–21% speedup across extraction workloads.
Documentation¶
- Migrate docs from Sphinx/Furo to MkDocs Material.
- Port R misha vignettes to pymisha docs.
- Add pymisha logo and favicon.
v0.1.0 (2026-02-13)¶
Initial public release.
Core functionality¶
- Track operations: gextract, gscreen, gsummary, gquantiles, gdist, glookup, gpartition, gsample, gcor with C++ streaming backends.
- Track creation: gtrack_create, gtrack_create_dense, gtrack_create_sparse, gtrack_modify, gtrack_smooth, gtrack_lookup, gtrack_create_pwm_energy.
- 2D tracks: gtrack_2d_create, gtrack_2d_import, gtrack_2d_import_contacts, 2D extraction, gintervals_2d_band_intersect.
- Interval operations: Union, intersection, difference, canonicalization, neighbors (k-nearest, directional), annotation, normalization, random generation, mark overlaps, mapply, import genes.
- Virtual tracks: 30+ aggregation functions, filtering with mask support, iterator shifts, 2D iterators.
- Statistical analysis: gsegment (Wilcoxon-based segmentation), gwilcox (sliding-window Wilcoxon), gbins_summary, gbins_quantiles, gcis_decay.
- Liftover: gintervals_load_chain, gintervals_as_chain, gintervals_liftover, gtrack_liftover with full overlap policy support.
- Sequence analysis: gseq_extract, gseq_kmer, gseq_kmer_dist, gseq_pwm.
- Genome synthesis: gsynth_train, gsynth_sample, gsynth_random, gsynth_replace_kmer, gsynth_bin_map, gsynth_save, gsynth_load.
- Database management: gdb_init, gdb_create, gdb_create_genome, gdb_create_linked, gdb_convert_to_indexed, gdb_info, gdb_reload, dataset and directory management.
- Track management: List, info, attributes, variables, import (BED, WIG, BigWig, TSV), copy, move, remove.
R misha compatibility¶
- 123 of 145 R misha exports covered with compatible on-disk formats.
- Full database interoperability: tracks and interval sets created by either R misha or PyMisha are readable by both.
Not yet implemented¶
- Track arrays (gtrack.array.*, gvtrack.array.slice).
- Legacy 2D format conversion (gtrack.convert).