Mask sequences by thresholding the PWM score. Parts of a sequence whose PWM score is above the threshold are replaced with 'N'. Positions at the edges of the sequences are also masked with 'N'.

Usage

mask_sequences_by_pwm(
  sequences,
  pssm,
  mask_thresh,
  pos_bits_thresh = 0.2,
  spat = NULL,
  spat_min = 0,
  spat_max = NULL,
  bidirect = TRUE,
  prior = 0.01
)

Arguments

sequences

a vector of sequences

pssm

a PSSM matrix or data frame. The columns of the matrix or data frame should be named with the nucleotides ('A', 'C', 'G' and 'T').

mask_thresh

Threshold for masking. Parts of a sequence whose PWM score is above this threshold are replaced with 'N'.

pos_bits_thresh

Mask only motif positions whose information content (Shannon entropy, measured in bits) is above this threshold. The scale is the same as the y axis of the PSSM logo plots.

spat

a data frame with the spatial model (as returned from the $spat slot from the regression). Should contain a column called 'bin' and a column called 'spat_factor'.

spat_min

the minimum position to use from the sequences. The default in the usage signature is 0.

spat_max

the maximum position to use from the sequences. When NULL (the default), the length of the sequences is used.

bidirect

whether the motif is bidirectional. If TRUE, the reverse complement of the motif is used as well.

prior

a prior probability for each nucleotide.
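
To make the `pos_bits_thresh` scale concrete, here is a minimal base-R sketch of per-position information content. It assumes the usual sequence-logo convention (2 bits minus the Shannon entropy of each position); the package's own computation, including how `prior` enters, may differ. The toy matrix and the helper `bits_per_pos()` are hypothetical and are not part of the package.

```r
# Toy PSSM (hypothetical values): rows are motif positions,
# columns are named with the nucleotides, as the `pssm` argument expects.
pssm <- matrix(
    c(0.97, 0.01, 0.01, 0.01,
      0.25, 0.25, 0.25, 0.25,
      0.70, 0.10, 0.10, 0.10),
    ncol = 4, byrow = TRUE,
    dimnames = list(NULL, c("A", "C", "G", "T"))
)

# Hypothetical helper: information content per position in bits,
# assuming the logo convention of 2 bits minus the Shannon entropy.
bits_per_pos <- function(pssm) {
    p <- pssm / rowSums(pssm)                          # normalise each row
    entropy <- -rowSums(ifelse(p > 0, p * log2(p), 0)) # treat 0 * log2(0) as 0
    2 - entropy
}

bits <- bits_per_pos(pssm)

# Positions that would pass the default threshold of 0.2 bits
which(bits > 0.2)
```

In this toy matrix the uniform second position carries no information, so only the first and third positions pass the threshold.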

Value

A vector with the masked sequences.

Examples

res <- regress_pwm(cluster_sequences_example, cluster_mat_example[, 1])
#>  Using 7 bins of size 40 bp
#>  Using "ks" as the final metric
#>  Number of response variables: 1
#> 
#> ── Generate candidate kmers 
#> 
#> ── Regress each candidate kmer 
#>  Running regression on 10 candidate kmers
#> • Bidirectional: TRUE
#> • Spat bin size: 40
#> • Number of spatial bins 7
#> • Length of sequence: 280
#> • Min gap: 0
#> • Max gap: 1
#> • Kmer length: 8
#> • Improve epsilon: 0.0001
#> • Min nuc prob: 0.001
#> • Uniform prior: 0.05
#> • Score metric: "r2"
#> • Seed: 60427
#> • kmer: "TTAAT*ATT", score (ks): 0.847160350285881
#> • kmer: "AAT*ATTAA", score (ks): 0.843616775749975
#> • kmer: "TAAT*ATTA", score (ks): 0.850689772296759
#> • kmer: "GTTAAT*AT", score (ks): 0.838714558811362
#> • kmer: "AT*ATTAAC", score (ks): 0.838920314752157
#> • kmer: "AA*CATTAA", score (ks): 0.846266563897562
#> • kmer: "TTAA*CATT", score (ks): 0.84705039605297
#> • kmer: "TAATCATT", score (ks): 0.844248196097387
#> • kmer: "TTA*TCATT", score (ks): 0.852612338389051
#> • kmer: "TAA*CATTA", score (ks): 0.845319433376444
#>  Best match in the database: "JOLMA.HNF1B_di_full_1", cor: 0.784
#>  "JOLMA.HNF1B_di_full_1" KS test D: 0.8484, p-value: 0
#>  Best motif: "***TTA*TCATT***", score (ks): 0.852612338389051
new_sequences <- mask_sequences_by_pwm(
    cluster_sequences_example,
    res$pssm,
    quantile(res$pred, 0.95),
    spat = res$spat
)
#>  The following positions will be masked: 4, 5, 6, 7, 8, 10, 11, and 12. Overall 8 positions will be masked

head(new_sequences)
#> [1] "CAGTAAAAGCTTTAATGCGTCTTGAGAGGGAGAGCATCAGCTTACAGAGCGAAGACCCCGAATGGCAAAACCCCGTCCCTTTTATGGAGAATTGCCCTCCGCCTCAGACACGTCGCTCCCTGATTGGCTGCAGCCCATCGGCCGAGTTGTCCTCACGGGGAAGGCAGAGCACATGGAGTGGAAAACTACCCCGGGCACATGCACAGATTACTTGTTTACTACTTAGAACACAGGATGTCAGCACCATCTTGTAATGGCGAATGTGAGGGCGGCTCCTCATACTTANNNNNNNNNNNNNNN"
#> [2] "AATTGCTTCATTAAAACCAAGTTTTTCTTTGTTCATTAGGCGTTAGCCAGATGGGAATTCAGTGTTTTTAAGCAGACACTCACATGGGGTTTTGTTTCTGACATTGATGAATGACTGCCTGCATCCCAAGATGGAAGTTTCCACCCTGGGCTCTGACTGCAACTTTTGTTATTCATAGCAGAAGTCACACCAGTCCACAGCTGAATAGCCACAGTGTTAAGAACAGCTGTCTTACAGCACTGTATGTGGAGAACAGAAAGAGCGGGGTCAAGACTGGTCGCATTGNNNNNNNNNNNNNNN"
#> [3] "GAGTGAGGTGTAAACTGAGCACCCCTTTTTATGGTCTTCACTGTTGCTAGGTAACTGGGGAGGAGTTTAGCCTGAAGGTCAGAAGCTTGGGCCATTGATTATGTGACTACTGACCCTGCTTCTCTTGTGGGGGCTGTGGGAGGTGGTACCTTAGGCAGGGGCCGAGTTCCAGGAGCATGAGGGAACGCCCACTGTGTCATGTAGGTGATTTATGGCCATCGGGTTTCAGACCTCAGCTCGACTGGAGACCAGCCTGCAATTCCCCACAGGAAACTTTATAAGAACNNNNNNNNNNNNNNN"
#> [4] "TTCACTGAAGGTTTAAAACCATAGCTAAGTTATTAGTGAAGTTTTGTAGAGATAAGCCCAGTTGGTATTTTATCTTCTGTCCTAGCACCTACAATAAATCATTAGCTGCTTTTTAATGACCTTTGGTTAATTGTTTTACAACCTCTTGGAATGTGCTCTTAGTAGGAGAAAGTCTGGTTACCATCTAAGAGCAATTAACTGGTGACACTTGGGAGGCTGGCAGAGTTCTCATTGCAGCTTTGACTATCAGAAAAGGACCTAATAGCAGTCCTGTTACAAAAGAGCNNNNNNNNNNNNNNN"
#> [5] "CATGTTACGCTCCATCACTGAGAGACAAGGGTAGGAGCTCAGGGAGGAACCTAATCCAGAAACTGACAGCAGAACCAAAAACGACCACTGATATCTGACTTGCTCTCTGTGTCTGGCTCAGCTGCCTTCCATGTACATCCCAGGACCCACAGAGTGACACAGGCCACAATGGCTGGACCCTCCCACATCAATCATTAATCAAGAGAATGTTTGCCAGACATGCCCACAGACCAATCTGACGGAGGCAGCTCCTCAGCTGAGGCTCCCTTTCCCAGGCGGCTACTGNNNNNNNNNNNNNNN"
#> [6] "AATTTGTATTTAACTATTTCTGGTTACTTAGCTGTGTGTCCTACCTCTGTGTTTGCTGTAGGATTGCAGTTTGATTCCAGCCAACCTTGCAGAGGCATTTGGATTACTGATTAACTGCAGTGAGGCATCTCCTTGCCACCCACTTCTTCCTACTGTGCAGCACACTTTAAAGAAAGCAGTGGAGCCTGAGGGGCTGCTGTACTTAGCGTCTAAGCAGTTCACAACTGAAGACTTTGAAAAGGTAAAGTTAGATAAAATGATTACTCAGAATGCCTACTCGTTTCTNNNNNNNNNNNNNNN"
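
One way to check how much of each sequence was masked is to count the 'N' characters in the result. The sketch below uses base R only and a small hypothetical vector, `masked`, standing in for the output of `mask_sequences_by_pwm()`; the same two lines should apply to `new_sequences` from the example above.

```r
# Hypothetical stand-in for the character vector returned by
# mask_sequences_by_pwm(); replace with `new_sequences` in practice.
masked <- c("ACGTNNNNACGT", "NNNNACGTACGT", "ACGTACGTACGT")

# Number of masked ('N') positions in each sequence
n_masked <- lengths(regmatches(masked, gregexpr("N", masked, fixed = TRUE)))

# Fraction of each sequence that is masked
frac_masked <- n_masked / nchar(masked)

n_masked
frac_masked
```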