Mask sequences by thresholding the PWM
Mask sequences by thresholding the PWM. Sequences with a PWM score above the threshold will be masked with 'N'. Positions at the edges of the sequences will also be masked with 'N'.
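The general idea of masking by a PWM threshold can be sketched in base R. This is a toy illustration only, not the package's implementation: `mask_windows` is a hypothetical helper, it assumes masking is applied per motif-length window, and it ignores the spatial model, the reverse strand, the prior, and the edge handling described above.

```r
# Toy PSSM for the motif "TAT": one row per position, columns A, C, G, T.
pssm <- matrix(
  c(
    0.01, 0.01, 0.01, 0.97, # T
    0.97, 0.01, 0.01, 0.01, # A
    0.01, 0.01, 0.01, 0.97  # T
  ),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("A", "C", "G", "T"))
)

# Hypothetical helper (illustration only): score every window of motif
# length and replace windows scoring above `thresh` with 'N'.
mask_windows <- function(sequence, pssm, thresh) {
  chars <- strsplit(sequence, "")[[1]]
  w <- nrow(pssm)
  to_mask <- rep(FALSE, length(chars))
  for (start in seq_len(length(chars) - w + 1)) {
    idx <- start:(start + w - 1)
    cols <- match(chars[idx], colnames(pssm))
    # Log-probability of the window under the PSSM.
    score <- sum(log(pssm[cbind(seq_len(w), cols)]))
    if (score > thresh) to_mask[idx] <- TRUE
  }
  chars[to_mask] <- "N"
  paste(chars, collapse = "")
}

masked <- mask_windows("GGTATCC", pssm, thresh = log(0.5))
print(masked)
```

Only the window "TAT" scores above the threshold here, so only those three characters are replaced.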
Usage
mask_sequences_by_pwm(
  sequences,
  pssm,
  mask_thresh,
  pos_bits_thresh = 0.2,
  spat = NULL,
  spat_min = 0,
  spat_max = NULL,
  bidirect = TRUE,
  prior = 0.01
)
Arguments
- sequences
a vector of sequences
- pssm
a PSSM matrix or data frame. The columns of the matrix or data frame should be named with the nucleotides ('A', 'C', 'G' and 'T').
- mask_thresh
Threshold for masking. Sequences with a PWM score above this threshold will be masked with 'N'.
- pos_bits_thresh
Mask only motif positions whose information content (Shannon entropy, measured in bits) is above this threshold. The scale is the same as the y-axis of the PSSM logo plots.
- spat
a data frame with the spatial model (as returned from the $spat slot from the regression). Should contain a column called 'bin' and a column called 'spat_factor'.
- spat_min
the minimum position to use from the sequences. The default is 1.
- spat_max
the maximum position to use from the sequences. The default is the length of the sequences.
- bidirect
whether the motif is bidirectional. If TRUE, the reverse complement of the motif will be used as well.
- prior
a prior probability for each nucleotide.
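To make the pos_bits_thresh scale concrete, the following base R sketch computes per-position information content for a toy PSSM. It assumes the usual sequence-logo convention of 2 bits minus the Shannon entropy of each position, and it does not apply the prior; `bits_per_position` is a hypothetical helper and may differ from what the package computes internally.

```r
# Toy PSSM: one row per motif position, columns named by nucleotide.
pssm <- matrix(
  c(
    0.97, 0.01, 0.01, 0.01, # strongly prefers A
    0.25, 0.25, 0.25, 0.25, # uninformative
    0.05, 0.05, 0.45, 0.45  # G or T
  ),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("A", "C", "G", "T"))
)

# Hypothetical helper (not part of the package): information content per
# position, assuming the logo convention of 2 bits minus Shannon entropy.
bits_per_position <- function(pssm) {
  p <- pssm / rowSums(pssm) # normalise each row to sum to 1
  entropy <- -rowSums(ifelse(p > 0, p * log2(p), 0))
  2 - entropy
}

bits <- bits_per_position(pssm)

# Positions that would pass a threshold of 0.2 bits.
informative <- which(bits > 0.2)
print(round(bits, 3))
print(informative)
```

Under this convention the uniform second position carries 0 bits and falls below the default threshold of 0.2, while the first and third positions pass it.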
Examples
res <- regress_pwm(cluster_sequences_example, cluster_mat_example[, 1])
#> ℹ Using 7 bins of size 40 bp
#> ℹ Using "ks" as the final metric
#> ℹ Number of response variables: 1
#>
#> ── Generate candidate kmers
#>
#> ── Regress each candidate kmer
#> ℹ Running regression on 10 candidate kmers
#> • Bidirectional: TRUE
#> • Spat bin size: 40
#> • Number of spatial bins 7
#> • Length of sequence: 280
#> • Min gap: 0
#> • Max gap: 1
#> • Kmer length: 8
#> • Improve epsilon: 0.0001
#> • Min nuc prob: 0.001
#> • Uniform prior: 0.05
#> • Score metric: "r2"
#> • Seed: 60427
#> • kmer: "TTAAT*ATT", score (ks): 0.847160350285881
#> • kmer: "AAT*ATTAA", score (ks): 0.843616775749975
#> • kmer: "TAAT*ATTA", score (ks): 0.850689772296759
#> • kmer: "GTTAAT*AT", score (ks): 0.838714558811362
#> • kmer: "AT*ATTAAC", score (ks): 0.838920314752157
#> • kmer: "AA*CATTAA", score (ks): 0.846266563897562
#> • kmer: "TTAA*CATT", score (ks): 0.84705039605297
#> • kmer: "TAATCATT", score (ks): 0.844248196097387
#> • kmer: "TTA*TCATT", score (ks): 0.852612338389051
#> • kmer: "TAA*CATTA", score (ks): 0.845319433376444
#> ℹ Best match in the database: "JOLMA.HNF1B_di_full_1", cor: 0.784
#> ✔ "JOLMA.HNF1B_di_full_1" KS test D: 0.8484, p-value: 0
#> ℹ Best motif: "***TTA*TCATT***", score (ks): 0.852612338389051
new_sequences <- mask_sequences_by_pwm(
  cluster_sequences_example,
  res$pssm,
  quantile(res$pred, 0.95),
  spat = res$spat
)
#> ℹ The following positions will be masked: 4, 5, 6, 7, 8, 10, 11, and 12. Overall 8 positions will be masked
head(new_sequences)
#> [1] "CAGTAAAAGCTTTAATGCGTCTTGAGAGGGAGAGCATCAGCTTACAGAGCGAAGACCCCGAATGGCAAAACCCCGTCCCTTTTATGGAGAATTGCCCTCCGCCTCAGACACGTCGCTCCCTGATTGGCTGCAGCCCATCGGCCGAGTTGTCCTCACGGGGAAGGCAGAGCACATGGAGTGGAAAACTACCCCGGGCACATGCACAGATTACTTGTTTACTACTTAGAACACAGGATGTCAGCACCATCTTGTAATGGCGAATGTGAGGGCGGCTCCTCATACTTANNNNNNNNNNNNNNN"
#> [2] "AATTGCTTCATTAAAACCAAGTTTTTCTTTGTTCATTAGGCGTTAGCCAGATGGGAATTCAGTGTTTTTAAGCAGACACTCACATGGGGTTTTGTTTCTGACATTGATGAATGACTGCCTGCATCCCAAGATGGAAGTTTCCACCCTGGGCTCTGACTGCAACTTTTGTTATTCATAGCAGAAGTCACACCAGTCCACAGCTGAATAGCCACAGTGTTAAGAACAGCTGTCTTACAGCACTGTATGTGGAGAACAGAAAGAGCGGGGTCAAGACTGGTCGCATTGNNNNNNNNNNNNNNN"
#> [3] "GAGTGAGGTGTAAACTGAGCACCCCTTTTTATGGTCTTCACTGTTGCTAGGTAACTGGGGAGGAGTTTAGCCTGAAGGTCAGAAGCTTGGGCCATTGATTATGTGACTACTGACCCTGCTTCTCTTGTGGGGGCTGTGGGAGGTGGTACCTTAGGCAGGGGCCGAGTTCCAGGAGCATGAGGGAACGCCCACTGTGTCATGTAGGTGATTTATGGCCATCGGGTTTCAGACCTCAGCTCGACTGGAGACCAGCCTGCAATTCCCCACAGGAAACTTTATAAGAACNNNNNNNNNNNNNNN"
#> [4] "TTCACTGAAGGTTTAAAACCATAGCTAAGTTATTAGTGAAGTTTTGTAGAGATAAGCCCAGTTGGTATTTTATCTTCTGTCCTAGCACCTACAATAAATCATTAGCTGCTTTTTAATGACCTTTGGTTAATTGTTTTACAACCTCTTGGAATGTGCTCTTAGTAGGAGAAAGTCTGGTTACCATCTAAGAGCAATTAACTGGTGACACTTGGGAGGCTGGCAGAGTTCTCATTGCAGCTTTGACTATCAGAAAAGGACCTAATAGCAGTCCTGTTACAAAAGAGCNNNNNNNNNNNNNNN"
#> [5] "CATGTTACGCTCCATCACTGAGAGACAAGGGTAGGAGCTCAGGGAGGAACCTAATCCAGAAACTGACAGCAGAACCAAAAACGACCACTGATATCTGACTTGCTCTCTGTGTCTGGCTCAGCTGCCTTCCATGTACATCCCAGGACCCACAGAGTGACACAGGCCACAATGGCTGGACCCTCCCACATCAATCATTAATCAAGAGAATGTTTGCCAGACATGCCCACAGACCAATCTGACGGAGGCAGCTCCTCAGCTGAGGCTCCCTTTCCCAGGCGGCTACTGNNNNNNNNNNNNNNN"
#> [6] "AATTTGTATTTAACTATTTCTGGTTACTTAGCTGTGTGTCCTACCTCTGTGTTTGCTGTAGGATTGCAGTTTGATTCCAGCCAACCTTGCAGAGGCATTTGGATTACTGATTAACTGCAGTGAGGCATCTCCTTGCCACCCACTTCTTCCTACTGTGCAGCACACTTTAAAGAAAGCAGTGGAGCCTGAGGGGCTGCTGTACTTAGCGTCTAAGCAGTTCACAACTGAAGACTTTGAAAAGGTAAAGTTAGATAAAATGATTACTCAGAATGCCTACTCGTTTCTNNNNNNNNNNNNNNN"
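One way to inspect a masked result like the one above is to count the 'N' characters and locate the masked runs with base R. The sketch below uses two short made-up strings in place of the real output, so the values are illustrative only; the same calls could be applied to a character vector such as new_sequences.

```r
# Made-up stand-ins for masked sequences (not real output).
masked <- c(
  paste0("ACG", strrep("N", 9), "ACGT"),
  paste0("GGGGCCCCAAAAT", strrep("N", 3))
)

# Number of masked characters per sequence.
n_masked <- nchar(masked) - nchar(gsub("N", "", masked, fixed = TRUE))

# Start and length of the first run of 'N' in each sequence.
runs <- gregexpr("N+", masked)
run_start <- vapply(runs, function(m) m[1], integer(1))
run_length <- vapply(runs, function(m) attr(m, "match.length")[1], integer(1))

summary_df <- data.frame(n_masked, run_start, run_length)
print(summary_df)
```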