Skip to contents

Screen for kmers

Usage

screen_kmers(
  sequences,
  response,
  kmer_length = 6,
  min_cor = 0.08,
  is_train = NULL,
  min_gap = 0,
  max_gap = 0,
  from_range = 0,
  to_range = NULL,
  return_mat = FALSE,
  seed = 60427,
  verbose = FALSE
)

Arguments

sequences

A vector of DNA sequences ('A', 'T', 'C' or 'G'. Will go through toupper)

response

A matrix of response variables - number of rows should equal the number of sequences

kmer_length

The number of non-gap characters in motifs that will be screened

min_cor

Only patterns for which the maximum correlation to one of the response variable is larger than min_cor will be reported

is_train

a boolean vector that determine which subset of sequences to use when screening

min_gap, max_gap

the length of a gap to be considered in the pattern. Only one gap, of length min_gap:max_gap, is being used, and is located anywhere in the motif. Note that this greatly expand the search space (and increase multiple testing severely).

from_range

Sequences will be considered only from position from_range (default 0)

to_range

Sequences will be considered only up to position to_range (default NULL - using the length of the sequences)

return_mat

Return a matrix of patterns and their correlation to the response variables instead of a data frame. (default: FALSE)

seed

random seed

verbose

show verbose messages

Value

A data frame with the following columns, together with a column for each response variable with the correlation of the kmers to the response variable:

kmer:

the kmer pattern, where "*" indicates a wildcard

max_r2:

the maximum R^2 to one of the response variables

avg_n:

the average number of times the kmer appears in the sequences

avg_var:

the variance of the number of times the kmer appears in the sequences

if return_mat is TRUE, a matrix with correlations to the response variables (where rows are the kmers) is returned instead of a data frame. If no kmer is found, an empty data frame is returned.

Examples

kmers <- screen_kmers(sequences_example, response_mat_example)
#>  Number of response variables: 5
#>  Screening kmers of length 6, from position 0 to position 300
#>  minimal correlation: 0.08
#>  Found 575 kmers in 1000 sequences.
head(kmers)
#>     kmer     max_r2 avg_n    avg_var          c1          c2         c3
#> 1 AGATAA 0.04998882 0.140 0.15240000 -0.01025383 -0.03046725 -0.2235818
#> 2 TTATCT 0.02975253 0.128 0.13761599  0.03136108  0.04927359 -0.1724892
#> 3 CTTATC 0.02680831 0.080 0.07960001  0.02861430  0.01967959 -0.1637324
#> 4 GGGGAG 0.02300140 0.243 0.28195098  0.04611732  0.02960520  0.1516621
#> 5 GGGCGG 0.02280306 0.062 0.06015600  0.06576437  0.02662074  0.1510068
#> 6 TAACTG 0.02215597 0.071 0.06995900 -0.01145993 -0.02249238 -0.1488488
#>             c4          c5
#> 1  0.004382457 -0.03498838
#> 2  0.035843253  0.02917432
#> 3  0.059437271  0.03772539
#> 4  0.044013925  0.03089626
#> 5  0.104473867  0.12732512
#> 6 -0.037136175 -0.01858661

kmers <- screen_kmers(sequences_example, response_mat_example, return_mat = TRUE)
#>  Number of response variables: 5
#>  Screening kmers of length 6, from position 0 to position 300
#>  minimal correlation: 0.08
#>  Found 575 kmers in 1000 sequences.
head(kmers)
#>                 c1          c2         c3           c4          c5
#> AGATAA -0.01025383 -0.03046725 -0.2235818  0.004382457 -0.03498838
#> TTATCT  0.03136108  0.04927359 -0.1724892  0.035843253  0.02917432
#> CTTATC  0.02861430  0.01967959 -0.1637324  0.059437271  0.03772539
#> GGGGAG  0.04611732  0.02960520  0.1516621  0.044013925  0.03089626
#> GGGCGG  0.06576437  0.02662074  0.1510068  0.104473867  0.12732512
#> TAACTG -0.01145993 -0.02249238 -0.1488488 -0.037136175 -0.01858661

kmers <- screen_kmers(sequences_example, response_mat_example, max_gap = 3)
#>  Number of response variables: 5
#>  Screening kmers of length 6, from position 0 to position 300
#>  Gaps of length 0:3 are allowed
#>  minimal correlation: 0.08
#>  Found 4884 kmers in 1000 sequences.
head(kmers)
#>       kmer     max_r2 avg_n    avg_var           c1           c2         c3
#> 1   AGATAA 0.05012437 0.139 0.15167901 -0.011159671 -0.033565160 -0.2238847
#> 2  AGA*AAG 0.03404951 0.220 0.23760001  0.062884174  0.063842162 -0.1845251
#> 3   TTATCT 0.02915060 0.127 0.13487099  0.036021627  0.051995803 -0.1707355
#> 4 ACAT**CT 0.02848161 0.090 0.09190000 -0.002383306 -0.023468828 -0.1687650
#> 5  AG*TAAG 0.02829135 0.118 0.12007600  0.034630746 -0.005794165 -0.1682003
#> 6   CTTATC 0.02680831 0.080 0.07960001  0.028614303  0.019679585 -0.1637324
#>            c4          c5
#> 1 0.001539792 -0.03616555
#> 2 0.025056362  0.02373981
#> 3 0.032191906  0.02568413
#> 4 0.019714575 -0.02059141
#> 5 0.046150818  0.02593128
#> 6 0.059437271  0.03772539