For each input sequence (or genomic interval), finds the optimal motif window and the specific base changes needed to reach score.thresh. Returns a long-format data frame with one row per edit.

gseq.pwm_edits(
  seqs,
  pssm,
  score.thresh,
  max_edits = NULL,
  max_indels = NULL,
  bidirect = TRUE,
  prior = 0.01,
  score.min = NULL,
  score.max = NULL,
  extend = TRUE,
  strand = 1L
)

Arguments

seqs

character vector of DNA sequences, OR a data frame of genomic intervals (with columns chrom, start, end). When intervals are provided, sequences are extracted automatically via gseq.extract.

pssm

numeric matrix or data frame with columns A, C, G, T. Each row is a motif position.

score.thresh

numeric; target PWM log-likelihood score to reach.

max_edits

integer or NULL; maximum number of edits to consider. Default NULL (no cap).

max_indels

integer or NULL; maximum number of insertions and deletions allowed. Default NULL (substitutions only). When greater than 0, a banded Needleman-Wunsch dynamic-programming alignment is used. Each edit is reported with an edit_type of "sub", "ins", or "del".

bidirect

logical; scan both strands? Default TRUE.

prior

numeric; pseudocount for PSSM frequencies. Default 0.01.

score.min

numeric or NULL; skip windows with PWM score below this. Default NULL (no filter).

score.max

numeric or NULL; skip windows with PWM score above this. Default NULL (no filter).

extend

logical or integer; extend sequence for boundary motifs. Default TRUE.

strand

integer; which strand to scan when bidirect=FALSE. 1=forward, -1=reverse. Default 1.
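To make the roles of score.thresh and prior concrete, here is a minimal base-R sketch of a PWM log-likelihood score. It assumes the pseudocount is added to every cell and each position is then renormalized; that is a common convention, not something this page confirms about the internals. pwm_score is a hypothetical helper, not part of the package.

```r
# Hypothetical helper for illustration only; not part of the package.
pwm_score <- function(window, pssm, prior = 0.01) {
  p <- as.matrix(pssm)[, c("A", "C", "G", "T"), drop = FALSE]
  p <- p + prior          # assumption: pseudocount added to every cell
  p <- p / rowSums(p)     # assumption: each motif position renormalized to sum to 1
  bases <- strsplit(window, "")[[1]]
  stopifnot(length(bases) == nrow(p))
  # Pick one probability per motif position and sum the logs
  sum(log(p[cbind(seq_along(bases), match(bases, colnames(p)))]))
}

pssm <- matrix(c(1, 0, 0, 0, 0, 1, 0, 0),
    nrow = 2,
    dimnames = list(NULL, c("A", "C", "G", "T"))
)

pwm_score("AG", pssm, prior = 0)     # 0: both bases have probability 1
pwm_score("GG", pssm, prior = 0)     # -Inf: G has probability 0 at position 1
pwm_score("GG", pssm, prior = 0.01)  # finite, because the pseudocount removes zeros
```

With prior = 0, a single zero-probability base drives the score to -Inf, which is why a nonzero prior is usually preferable when comparing windows.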

Value

A data frame (long format) with one row per edit, containing columns:

seq_idx

Index into input sequences/intervals (1-based)

strand

1 (forward strand) or -1 (reverse strand)

window_start

1-based position of optimal window within sequence

score_before

PWM score before edits

score_after

PWM score after all edits

n_edits

Total number of edits needed

edit_num

Which edit this row represents (1, 2, ...)

motif_col

1-based position within the motif where the edit occurs

ref

Current base at this position

alt

Suggested replacement base

gain

Score improvement from this individual edit

edit_type

Type of edit: "sub", "ins", or "del"

window_seq

Motif-length sequence at the optimal window (as seen by PSSM, reverse-complemented if on reverse strand)

mutated_seq

Same sequence with all edits applied

When intervals are provided, additional columns chrom, start, end are included.

Sequences already above the threshold produce a single row with n_edits = 0. Unreachable sequences are omitted from the output.
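Because the result has one row per edit, per-sequence values such as n_edits and score_after presumably repeat on every row for the same sequence. Under that assumption, a base-R sketch for collapsing the long format to one row per sequence could look like this. The data frame is typed by hand from the single-row example on this page.

```r
# Illustrative input: the one-row result from the example on this page.
edits <- data.frame(
  seq_idx = 1L, strand = -1L, window_start = 1L,
  score_before = -Inf, score_after = 0,
  n_edits = 1L, edit_num = 1L, motif_col = 1L,
  ref = "G", alt = "A", gain = Inf, edit_type = "sub",
  window_seq = "GG", mutated_seq = "AG"
)

# Keep only the per-sequence columns and drop the repeats
# (assumes these columns are constant within a sequence).
per_seq <- unique(edits[, c("seq_idx", "strand", "window_start",
                            "score_before", "score_after", "n_edits",
                            "window_seq", "mutated_seq")])

# Sequences that need at least one edit, as opposed to n_edits = 0 rows
needs_edits <- per_seq[per_seq$n_edits > 0, ]
```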

Details

This is an investigation tool: use it on a small set of positions (e.g., from gscreen) to see which mutations would activate latent binding sites.
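For intuition about what "edits needed to reach the threshold" means in the substitutions-only case, the sketch below picks the fewest substitutions for a single fixed window. It relies on the fact that motif positions contribute independently to the score, so taking the largest gains first is optimal. This is an illustration under that assumption, not the package's implementation; min_subs and logp are hypothetical names.

```r
# Hypothetical helper, not the package's implementation.
# logp: matrix of log probabilities, one row per motif position, columns A, C, G, T.
min_subs <- function(window, logp, thresh) {
  bases <- strsplit(window, "")[[1]]
  current <- logp[cbind(seq_along(bases), match(bases, colnames(logp)))]
  best <- apply(logp, 1, max)                              # best score per position
  best_base <- colnames(logp)[apply(logp, 1, which.max)]   # base achieving it
  gain <- best - current
  chosen <- integer(0)
  # Apply substitutions in order of decreasing gain until the target is met
  for (i in order(gain, decreasing = TRUE)) {
    if (sum(current) >= thresh || gain[i] <= 0) break
    current[i] <- best[i]
    chosen <- c(chosen, i)
  }
  if (sum(current) < thresh) return(NULL)                  # target unreachable
  data.frame(motif_col = chosen, ref = bases[chosen],
             alt = best_base[chosen], gain = gain[chosen])
}

# Same toy motif as in the Examples, with a 0.01 pseudocount added by hand
freq <- matrix(c(1, 0, 0, 0, 0, 1, 0, 0), nrow = 2,
               dimnames = list(NULL, c("A", "C", "G", "T"))) + 0.01
logp <- log(freq / rowSums(freq))

res <- min_subs("GG", logp, thresh = -0.5)  # one substitution: G to A at motif_col 1
```

The sketch ignores window selection, strand handling, and indels, all of which the real function takes into account.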

Examples


gdb.init_examples()

# Simple 2-position PSSM. matrix() fills column-wise, so this motif is A then G.
pssm <- matrix(c(1, 0, 0, 0, 0, 1, 0, 0),
    nrow = 2,
    dimnames = list(NULL, c("A", "C", "G", "T"))
)

# What edits are needed?
gseq.pwm_edits("CCGTACGT", pssm, score.thresh = -0.5, prior = 0)
#>   seq_idx strand window_start score_before score_after n_edits edit_num
#> 1       1     -1            1         -Inf           0       1        1
#>   motif_col ref alt gain edit_type window_seq mutated_seq
#> 1         1   G   A  Inf       sub         GG          AG
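The output above can be cross-checked by hand with base R. The sketch assumes window_start is given in the orientation of the input sequence, which is consistent with this example but is not stated explicitly on this page. revcomp and apply_sub are hypothetical helpers, not package functions.

```r
# Hypothetical helpers for illustration; neither is part of the package.
revcomp <- function(s) {
  paste(rev(strsplit(chartr("ACGT", "TGCA", s), "")[[1]]), collapse = "")
}
apply_sub <- function(window_seq, motif_col, alt) {
  bases <- strsplit(window_seq, "")[[1]]
  bases[motif_col] <- alt
  paste(bases, collapse = "")
}

input <- "CCGTACGT"
fwd_window <- substr(input, 1, 2)   # window_start = 1, motif length 2: "CC"
window_seq <- revcomp(fwd_window)   # strand = -1, so the PSSM sees "GG"
mutated_seq <- apply_sub(window_seq, motif_col = 1, alt = "A")  # "AG"
revcomp(mutated_seq)                # "CT": the edited window in input orientation
```

Note that ref and alt are reported as the PSSM sees them, so an edit on the reverse strand corresponds to the complementary change in the input sequence.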