For each input sequence (or genomic interval), finds the optimal motif window
and the specific base changes needed to reach score.thresh. Returns a
long-format data frame with one row per edit.
gseq.pwm_edits(
  seqs,
  pssm,
  score.thresh,
  max_edits = NULL,
  max_indels = NULL,
  bidirect = TRUE,
  prior = 0.01,
  score.min = NULL,
  score.max = NULL,
  extend = TRUE,
  strand = 1L
)
seqs: character vector of DNA sequences, or a data frame of genomic
intervals (with columns chrom, start, end). When intervals are provided,
sequences are extracted automatically via gseq.extract.
pssm: numeric matrix or data frame with columns A, C, G, T. Each row is a motif position.
score.thresh: numeric; target PWM log-likelihood score to reach.
max_edits: integer or NULL; maximum number of edits to search. NULL means no cap. Default NULL.
max_indels: integer or NULL; maximum number of insertions and deletions
allowed (default NULL, substitutions only). When > 0, a banded
Needleman-Wunsch DP is used. Edits are reported with edit_type
"sub", "ins", or "del".
bidirect: logical; scan both strands? Default TRUE.
prior: numeric; pseudocount for PSSM frequencies. Default 0.01.
score.min: numeric or NULL; skip windows with PWM score below this. Default NULL (no filter).
score.max: numeric or NULL; skip windows with PWM score above this. Default NULL (no filter).
extend: logical or integer; extend sequence for boundary motifs. Default TRUE.
strand: integer; which strand to scan when bidirect=FALSE.
1=forward, -1=reverse. Default 1.
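To make prior and score.thresh concrete, the following base-R sketch computes a log-likelihood PWM score for a single window. It assumes the pseudocount is added to every PSSM entry and each row is then renormalized; the exact normalization used by the function is not stated here, so pwm_log_probs and score_window are hypothetical helpers, not the package's implementation.

```r
# Hypothetical helper: convert a PSSM (rows = motif positions, columns A/C/G/T)
# into per-position log-probabilities.
# ASSUMPTION: the pseudocount is added to every entry, then rows are renormalized.
pwm_log_probs <- function(pssm, prior = 0.01) {
    m <- as.matrix(pssm)[, c("A", "C", "G", "T"), drop = FALSE]
    m <- m + prior
    log(m / rowSums(m))
}

# Hypothetical helper: sum the log-probability of each base in a window whose
# length equals the number of motif positions.
score_window <- function(window, log_probs) {
    bases <- strsplit(window, "")[[1]]
    stopifnot(length(bases) == nrow(log_probs))
    sum(log_probs[cbind(seq_along(bases), match(bases, colnames(log_probs)))])
}

# Two-position motif "AG" (matrix() fills column-wise).
pssm <- matrix(c(1, 0, 0, 0, 0, 1, 0, 0),
    nrow = 2,
    dimnames = list(NULL, c("A", "C", "G", "T"))
)

score_window("AG", pwm_log_probs(pssm, prior = 0))    # perfect match: 0
score_window("GG", pwm_log_probs(pssm, prior = 0))    # impossible base: -Inf
score_window("GG", pwm_log_probs(pssm, prior = 0.01)) # finite, but strongly negative
```

With prior = 0 a single mismatching base can drive the score to -Inf, which is why a small positive pseudocount is the usual choice when comparing scores against a finite threshold.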
A data frame (long format) with one row per edit, containing columns:
seq_idx: Index into input sequences/intervals (1-based)
strand: +1 (forward) or -1 (reverse strand)
window_start: 1-based position of optimal window within sequence
score_before: PWM score before edits
score_after: PWM score after all edits
n_edits: Total number of edits needed
edit_num: Which edit this row represents (1, 2, ...)
motif_col: 1-based position within the motif where the edit occurs
ref: Current base at this position
alt: Suggested replacement base
gain: Score improvement from this individual edit
edit_type: Type of edit: "sub", "ins", or "del"
window_seq: Motif-length sequence at the optimal window (as seen by PSSM, reverse-complemented if on reverse strand)
mutated_seq: Same sequence with all edits applied
When intervals are provided, additional columns chrom, start,
end are included.
Sequences already above the threshold produce a single row with
n_edits = 0. Sequences that cannot reach the threshold under the given
constraints are omitted from the output.
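Because each row is one edit, per-window results can be rebuilt from the long format. The sketch below does not call the function: it hand-builds a one-row data frame using values copied from the example output of this entry, then re-applies the substitution edits to window_seq. apply_subs is a hypothetical helper and deliberately ignores "ins" and "del" edits.

```r
# Hand-built stand-in for the function's output (values copied from the
# example output of this entry; only the columns needed here are included).
edits <- data.frame(
    seq_idx = 1L, strand = -1L, window_start = 1L,
    n_edits = 1L, edit_num = 1L, motif_col = 1L,
    ref = "G", alt = "A", edit_type = "sub",
    window_seq = "GG", mutated_seq = "AG",
    stringsAsFactors = FALSE
)

# Hypothetical helper: apply the substitution rows of one window to window_seq.
# motif_col indexes window_seq directly, since window_seq is already oriented
# as the PSSM sees it.
apply_subs <- function(rows) {
    bases <- strsplit(rows$window_seq[1], "")[[1]]
    subs <- rows[which(rows$edit_type == "sub"), , drop = FALSE]
    stopifnot(all(bases[subs$motif_col] == subs$ref)) # ref must match the window
    bases[subs$motif_col] <- subs$alt
    paste(bases, collapse = "")
}

# One reconstructed sequence per input, keyed by seq_idx.
rebuilt <- vapply(split(edits, edits$seq_idx), apply_subs, character(1))
rebuilt # "AG", matching mutated_seq
```

Splitting on seq_idx keeps all edits of one input together, which is the natural grouping key for any per-sequence summary of this output.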
This is an investigation tool: use it on a small set of positions (e.g., from
gscreen) to see what mutations would activate latent binding sites.
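For intuition about what "edits needed to reach score.thresh" means, here is a substitution-only sketch for a single, fixed window in base R. Because a PWM score is a sum over positions, applying the substitutions with the largest gain first reaches a threshold in the fewest substitutions. This illustrates the idea under stated assumptions and is not a description of how the function searches (it also scans windows and strands, and can allow indels); min_subs and the renormalized-pseudocount scoring are hypothetical.

```r
# Hypothetical helper: fewest substitutions that lift one window to `thresh`.
# `log_probs` is a positions x 4 matrix of log-probabilities (columns A/C/G/T).
# Returns NULL when the threshold cannot be reached by substitutions alone.
min_subs <- function(window, log_probs, thresh) {
    bases <- strsplit(window, "")[[1]]
    stopifnot(length(bases) == nrow(log_probs))
    pos <- seq_along(bases)
    score_of <- function(b) {
        sum(log_probs[cbind(pos, match(b, colnames(log_probs)))])
    }
    current <- log_probs[cbind(pos, match(bases, colnames(log_probs)))]
    best_col <- apply(log_probs, 1, which.max) # best base at each position
    gain <- apply(log_probs, 1, max) - current # improvement if substituted
    edited <- bases
    score <- score_of(edited)
    n <- 0L
    for (i in order(gain, decreasing = TRUE)) {
        if (score >= thresh) break
        if (gain[i] <= 0) break # no remaining substitution improves the score
        edited[i] <- colnames(log_probs)[best_col[i]]
        n <- n + 1L
        score <- score_of(edited)
    }
    if (score < thresh) return(NULL)
    list(n_edits = n, mutated_seq = paste(edited, collapse = ""), score_after = score)
}

# Two-position motif "AG"; ASSUMPTION: pseudocount added, rows renormalized.
pssm <- matrix(c(1, 0, 0, 0, 0, 1, 0, 0),
    nrow = 2,
    dimnames = list(NULL, c("A", "C", "G", "T"))
)
log_probs <- log((pssm + 0.01) / rowSums(pssm + 0.01))

res <- min_subs("GG", log_probs, thresh = -0.5)
res$n_edits     # 1
res$mutated_seq # "AG"
is.null(min_subs("GG", log_probs, thresh = 0)) # TRUE: best possible score is below 0
```

The last call shows an unreachable case: with a positive pseudocount even the best-matching sequence scores slightly below 0, so a threshold of 0 cannot be met.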
gdb.init_examples()
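# Simple 2-position PSSM; matrix() fills column-wise, so position 1 is A and position 2 is G
pssm <- matrix(c(1, 0, 0, 0, 0, 1, 0, 0),
    nrow = 2,
    dimnames = list(NULL, c("A", "C", "G", "T"))
)
# Which edits are needed to reach score.thresh? (prior = 0: no pseudocount)
gseq.pwm_edits("CCGTACGT", pssm, score.thresh = -0.5, prior = 0)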
#> seq_idx strand window_start score_before score_after n_edits edit_num
#> 1 1 -1 1 -Inf 0 1 1
#> motif_col ref alt gain edit_type window_seq mutated_seq
#> 1 1 G A Inf sub GG AG
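The row above reports strand = -1 with window_seq "GG" at window_start 1, although the input begins with "CC". The base-R sketch below shows the reverse-complement relationship described for window_seq. revcomp is a hypothetical helper, and the reading that window_start indexes the forward-strand input is inferred from this single example.

```r
# Hypothetical helper: reverse-complement a DNA string (upper-case ACGT only).
revcomp <- function(x) {
    comp <- chartr("ACGT", "TGCA", x)
    paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

dna <- "CCGTACGT"
window <- substr(dna, 1, 2) # forward-strand bases at window_start = 1
window          # "CC"
revcomp(window) # "GG", the window_seq reported for strand -1
```

Read this way, the suggested G to A change in motif column 1 applies to the reverse-complemented window, not to the forward-strand bases as typed in the input.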