Advanced analysis

IMPORTANT: THIS IS ONLY A SANDBOX. IT IS (NOT EVEN) IN BETA STAGE, READ WITH CAUTION AND CONSULT WITH US IF YOU HAVE ANY QUESTIONS. NO GUARANTEES!

This part is intended for more advanced users that want to do additional analyses using gpatterns and ‘misha’. For a standard analytical pipline please refer to analysis section.

Data

We we will use the same data from the analysis section: a dataset of UMI-RRBS of breast and lung tumors. The dataset contains 16 breast tumors samples from 4 patients, with multiple sections from each sample.

library(gpatterns)
gsetroot('/home/aviezerl/proj/sandbox_db')
options(gmax.data.size=1e09)
tracks <- gpatterns.ls('gpatterns_nugget.Breast_.+_.+')
print(tracks)

##  [1] "gpatterns_nugget.Breast_2460_1_3"   
##  [2] "gpatterns_nugget.Breast_2460_1_6"   
##  [3] "gpatterns_nugget.Breast_2460_1_8"   
##  [4] "gpatterns_nugget.Breast_2460_Fresh" 
##  [5] "gpatterns_nugget.Breast_3174_1_2"   
##  [6] "gpatterns_nugget.Breast_3174_1_2p"  
##  [7] "gpatterns_nugget.Breast_3174_1_8"   
##  [8] "gpatterns_nugget.Breast_3174_1_9"   
##  [9] "gpatterns_nugget.Breast_3174_Fresh" 
## [10] "gpatterns_nugget.Breast_4310_1_2"   
## [11] "gpatterns_nugget.Breast_4310_Fresh" 
## [12] "gpatterns_nugget.Breast_4310_Normal"
## [13] "gpatterns_nugget.Breast_4550_3_1"   
## [14] "gpatterns_nugget.Breast_4550_3_2"   
## [15] "gpatterns_nugget.Breast_4550_3_3"   
## [16] "gpatterns_nugget.Breast_4550_Fresh"

names <- gsub('gpatterns_nugget.', '', tracks)
print(names)

##  [1] "Breast_2460_1_3"    "Breast_2460_1_6"    "Breast_2460_1_8"   
##  [4] "Breast_2460_Fresh"  "Breast_3174_1_2"    "Breast_3174_1_2p"  
##  [7] "Breast_3174_1_8"    "Breast_3174_1_9"    "Breast_3174_Fresh" 
## [10] "Breast_4310_1_2"    "Breast_4310_Fresh"  "Breast_4310_Normal"
## [13] "Breast_4550_3_1"    "Breast_4550_3_2"    "Breast_4550_3_3"   
## [16] "Breast_4550_Fresh"

Data internals

Each track contains (at least) 5 components:

4 ‘misha’ (sparse) tracks that contain data on CpGs in the genome: (Note that CpGs without data will be missing (NA) in the tracks)
- ‘track.meth’ with the pileup of methylated calls
- ‘track.unmeth’ with the pileup of unmethylated calls
- ‘track.cov’ with the total number of calls
- ‘track.avg’ with the average methylation (‘track.meth’ / ‘track.cov’)

gtrack.ls(tracks[1])

##  [1] "gpatterns_nugget.Breast_2460_1_3.avg"     
##  [2] "gpatterns_nugget.Breast_2460_1_3.cov"     
##  [3] "gpatterns_nugget.Breast_2460_1_3.epipoly" 
##  [4] "gpatterns_nugget.Breast_2460_1_3.fid"     
##  [5] "gpatterns_nugget.Breast_2460_1_3.meth"    
##  [6] "gpatterns_nugget.Breast_2460_1_3.n"       
##  [7] "gpatterns_nugget.Breast_2460_1_3.n0"      
##  [8] "gpatterns_nugget.Breast_2460_1_3.n1"      
##  [9] "gpatterns_nugget.Breast_2460_1_3.nc"      
## [10] "gpatterns_nugget.Breast_2460_1_3.ncpg"    
## [11] "gpatterns_nugget.Breast_2460_1_3.nx"      
## [12] "gpatterns_nugget.Breast_2460_1_3.pat_cov3"
## [13] "gpatterns_nugget.Breast_2460_1_3.pat_cov5"
## [14] "gpatterns_nugget.Breast_2460_1_3.pat_cov7"
## [15] "gpatterns_nugget.Breast_2460_1_3.pat_meth"
## [16] "gpatterns_nugget.Breast_2460_1_3.pat2"    
## [17] "gpatterns_nugget.Breast_2460_1_3.unmeth"

tidy CpGs - comma separated files that contain the raw calls for the sequencing data:

gpatterns.get_tidy_cpgs(tracks[1], intervals=gintervals(1,1e6,1e6 + 1e4))

## # A tibble: 1,949 × 12
##                                                read_id chrom   start
##                                                  <chr> <chr>   <dbl>
## 1    Breast_2406_1.3_2_A:2368391:AGAGGT:Xba_Xba_B_Dark  chr1 1001409
## 2  Breast_2406_1.3_2_A:3080811:ATATAA:Xba_EcoRI_T_Dark  chr1 1001409
## 3  Breast_2406_1.3_2_A:4179231:GAGGAG:Xba_EcoRI_T_Dark  chr1 1001409
## 4    Breast_2406_1.3_2_A:2462062:GGTGGT:Xba_Xba_B_Dark  chr1 1001409
## 5    Breast_2406_1.3_2_A:4163117:GTAGTA:Xba_Xba_B_Dark  chr1 1001409
## 6    Breast_2406_1.3_2_A:2677641:TGAGGG:Xba_Xba_T_Dark  chr1 1001409
## 7    Breast_2406_1.3_2_A:1844720:AGAGGT:Xba_Xba_T_Dark  chr1 1001510
## 8    Breast_2406_1.3_2_A:2527743:GGGTTT:Xba_Xba_T_Dark  chr1 1001510
## 9  Breast_2406_1.3_2_A:4552887:GTAGTA:Xba_EcoRI_T_Dark  chr1 1001510
## 10   Breast_2406_1.3_2_A:2384712:TAAGAG:Xba_Xba_B_Dark  chr1 1001510
##        end strand   umi1  umi2 insert_len   num  cg_pos
##      <chr>  <chr>  <chr> <chr>      <dbl> <int>   <dbl>
## 1  1001509      - AGAGGT     -         98     4 1001471
## 2  1001509      - ATATAA     -        -92     1 1001471
## 3  1001509      - GAGGAG     -         98     1 1001471
## 4  1001509      - GGTGGT     -         98     2 1001471
## 5  1001509      - GTAGTA     -         98     1 1001471
## 6  1001509      - TGAGGG     -        -98     5 1001471
## 7  1001664      + AGAGGT     -        -98     2 1001471
## 8  1001664      + GGGTTT     -        -98     2 1001471
## 9  1001664      + GTAGTA     -         98     1 1001471
## 10 1001664      + TAAGAG     -         98     4 1001471
## # ... with 1,939 more rows, and 2 more variables: meth <dbl>, qual <dbl>

The tidy_cpgs table contains the following fields:

read_id: ID of the read from which the methylation call came from.
chrom, start, end, strand: position of the read. single end reads would have ‘-’ in the ‘end’ field, and restriction based protocals such as RRBS would have the position of the restriction fragment.
umi1, umi2: unique moclecular identifiers (UMIs) if exists.
insert_len: insert length (only for paired end reads)
num (optional): number of times the molecule was observed. Protocols that do not have a duplicate filtering mechanism (such as vanilla RRBS) would not have this field.
cg_pos: position of the CpG (on the ‘C’ bp) in chromosome .
meth: 1 if the CpG was observed methylated and 0 if unmethyltated.
qual: quality of the bp in which the methylation call was observed (C in C->T conversion protocols, G in G->A conversion protocols).

Storing the raw reads allows us to analyze coupled methylation calls, and calculate features like epipolymorphism or other attributes of the full methylation pattern distribution, see ‘applying functions on tidy_CpGs’ section.

Extracting methylation data

In order to extract the methylation signal from a track we can use the basic misha functions, for example, to extract the methylation calls from chromosome1, 1,000,000 to 1,0010,000 we can do:

gextract(paste0(tracks[1], c('.meth', '.unmeth', '.avg', '.cov')), intervals=gintervals(1,1e6,1e6+1e4), iterator=paste0(tracks[1], '.cov'), colnames=c('meth', 'unmeth', 'avg', 'cov')) %>% tbl_df

## # A tibble: 170 × 8
##     chrom   start     end  meth unmeth       avg   cov intervalID
## *  <fctr>   <dbl>   <dbl> <dbl>  <dbl>     <dbl> <dbl>      <int>
## 1    chr1 1001471 1001472    12      9 0.5714286    21          1
## 2    chr1 1002663 1002664    15      1 0.9375000    16          1
## 3    chr1 1002666 1002667    16      0 1.0000000    16          1
## 4    chr1 1002687 1002688    13      2 0.8666667    15          1
## 5    chr1 1002689 1002690    11      3 0.7857143    14          1
## 6    chr1 1002697 1002698     8      0 1.0000000     8          1
## 7    chr1 1002702 1002703    12      1 0.9230769    13          1
## 8    chr1 1002705 1002706    14      1 0.9333333    15          1
## 9    chr1 1002737 1002738     4      1 0.8000000     5          1
## 10   chr1 1002739 1002740     4      2 0.6666667     6          1
## # ... with 160 more rows

gpatterns provides wrapper to this function, , that has the following additional features: (Note that all this features can be achived only with ‘misha’, e.g. using virtual tracks)

Extraction of multiple tracks.
Extraction of methyation of larger regions than a single CpG. Repeating the naive misha approach on larger intervals would give us average of averages. This is usually not what we want, so computes meth / (meth + unmeth) for the entire interval.
Filtering by coverage, number of CpGs, minimal variance, range, etc.
Extraction of summation of tracks.

For example, if we want to extract the average methylation in MSP1 fragments from all the tracks, limiting ourselves to fragments with covereage of 10 in at least 5 samples that have at least 2 CpGs we can write:

gpatterns.get_avg_meth(tracks, names=names, intervals=gintervals(1, 1e6, 1e6+1e4), iterator='intervs.msp1.fid', min_cov = 10, min_samples=5, min_cpgs = 2)

## Taking only intervals with at least 2 CpGs

## extracting...

## Taking only intervals with coverage >= 10 in at least 5 samples

## number of intervals: 30

## # A tibble: 480 × 9
##     chrom   start     end intervalID            samp  meth unmeth
##    <fctr>   <dbl>   <dbl>      <int>           <chr> <dbl>  <dbl>
## 1    chr1 1001409 1001509          1 Breast_2460_1_3    12      9
## 2    chr1 1001510 1001664          1 Breast_2460_1_3   NaN    NaN
## 3    chr1 1002618 1002731          1 Breast_2460_1_3    89      8
## 4    chr1 1002732 1002764          1 Breast_2460_1_3    23      9
## 5    chr1 1002824 1003025          1 Breast_2460_1_3   NaN    NaN
## 6    chr1 1003026 1003262          1 Breast_2460_1_3   NaN    NaN
## 7    chr1 1003283 1003374          1 Breast_2460_1_3    46    153
## 8    chr1 1003887 1004001          1 Breast_2460_1_3     6    159
## 9    chr1 1004538 1004706          1 Breast_2460_1_3     0     26
## 10   chr1 1004707 1004788          1 Breast_2460_1_3     1     85
##           avg   cov
##         <dbl> <dbl>
## 1  0.57142857    21
## 2         NaN   NaN
## 3  0.91752577    97
## 4  0.71875000    32
## 5         NaN   NaN
## 6         NaN   NaN
## 7  0.23115578   199
## 8  0.03636364   165
## 9  0.00000000    26
## 10 0.01162791    86
## # ... with 470 more rows

We can limit ourselves to ‘interesting’ fragments, e.g. fragments with minimal range (‘abs(min(m) - max(m) >= min_range’):

gpatterns.get_avg_meth(tracks, names=names, intervals=gintervals(1, 1e6, 1e6+1e4), iterator='intervs.msp1.fid', min_cov = 10, min_samples=5, min_cpgs = 2, min_range = 0.3)

## Taking only intervals with at least 2 CpGs

## extracting...

## Taking only intervals with coverage >= 10 in at least 5 samples

## number of intervals: 19

## # A tibble: 304 × 9
##     chrom   start     end intervalID            samp  meth unmeth
##    <fctr>   <dbl>   <dbl>      <int>           <chr> <dbl>  <dbl>
## 1    chr1 1001409 1001509          1 Breast_2460_1_3    12      9
## 2    chr1 1001510 1001664          1 Breast_2460_1_3   NaN    NaN
## 3    chr1 1002618 1002731          1 Breast_2460_1_3    89      8
## 4    chr1 1002824 1003025          1 Breast_2460_1_3   NaN    NaN
## 5    chr1 1003026 1003262          1 Breast_2460_1_3   NaN    NaN
## 6    chr1 1003283 1003374          1 Breast_2460_1_3    46    153
## 7    chr1 1005099 1005188          1 Breast_2460_1_3    58     29
## 8    chr1 1005189 1005260          1 Breast_2460_1_3    12     56
## 9    chr1 1005316 1005500          1 Breast_2460_1_3     1      3
## 10   chr1 1006940 1007053          1 Breast_2460_1_3     0     15
##          avg   cov
##        <dbl> <dbl>
## 1  0.5714286    21
## 2        NaN   NaN
## 3  0.9175258    97
## 4        NaN   NaN
## 5        NaN   NaN
## 6  0.2311558   199
## 7  0.6666667    87
## 8  0.1764706    68
## 9  0.2500000     4
## 10 0.0000000    15
## # ... with 294 more rows

If we want to extract data in single CpG level, we can explicitly set the iterator to CpG intervals or leave it NULL:

gpatterns.get_avg_meth(tracks, names=names, intervals=gintervals(1, 1e6, 1e6+1e4), min_cov = 10, min_samples=5, min_cpgs = 2, min_range = 0.3)

## Using cpgs as iterator

## extracting...

## 0%...100%

## Taking only intervals with coverage >= 10 in at least 5 samples

## number of intervals: 48

## # A tibble: 768 × 9
##     chrom   start     end intervalID            samp  meth unmeth
##    <fctr>   <dbl>   <dbl>      <int>           <chr> <dbl>  <dbl>
## 1    chr1 1001471 1001472          1 Breast_2460_1_3    12      9
## 2    chr1 1002663 1002664          1 Breast_2460_1_3    15      1
## 3    chr1 1002666 1002667          1 Breast_2460_1_3    16      0
## 4    chr1 1002687 1002688          1 Breast_2460_1_3    13      2
## 5    chr1 1002689 1002690          1 Breast_2460_1_3    11      3
## 6    chr1 1002697 1002698          1 Breast_2460_1_3     8      0
## 7    chr1 1002702 1002703          1 Breast_2460_1_3    12      1
## 8    chr1 1002705 1002706          1 Breast_2460_1_3    14      1
## 9    chr1 1003298 1003299          1 Breast_2460_1_3     6     20
## 10   chr1 1003302 1003303          1 Breast_2460_1_3     6     20
##          avg   cov
##        <dbl> <dbl>
## 1  0.5714286    21
## 2  0.9375000    16
## 3  1.0000000    16
## 4  0.8666667    15
## 5  0.7857143    14
## 6  1.0000000     8
## 7  0.9230769    13
## 8  0.9333333    15
## 9  0.2307692    26
## 10 0.2307692    26
## # ... with 758 more rows

Sometimes it is useful to extract only the average methylation in a non ‘tidy’ format. This may also boost performance:

gpatterns.get_avg_meth(tracks, names=names, intervals=gintervals(1, 1e6, 1e6+1e4), iterator='intervs.msp1.fid', min_cov = 10, min_cpgs = 2, min_samples=5, tidy=FALSE)

## Taking only intervals with at least 2 CpGs

## Taking only intervals with coverage >= 10 in at least 5 samples

## number of intervals: 30

## # A tibble: 30 × 20
##     chrom   start     end Breast_2460_1_3 Breast_2460_1_6 Breast_2460_1_8
## *  <fctr>   <dbl>   <dbl>           <dbl>           <dbl>           <dbl>
## 1    chr1 1001409 1001509      0.57142857      0.31578947      0.72727273
## 2    chr1 1001510 1001664             NaN      0.57894737      1.00000000
## 3    chr1 1002618 1002731      0.91752577      0.97849462      0.93548387
## 4    chr1 1002732 1002764      0.71875000      0.96825397      0.96551724
## 5    chr1 1002824 1003025             NaN             NaN             NaN
## 6    chr1 1003026 1003262             NaN             NaN             NaN
## 7    chr1 1003283 1003374      0.23115578      0.13815789      0.56122449
## 8    chr1 1003887 1004001      0.03636364      0.02000000      0.02857143
## 9    chr1 1004538 1004706      0.00000000      0.01724138      0.03636364
## 10   chr1 1004707 1004788      0.01162791      0.02586207      0.00000000
##    Breast_2460_Fresh Breast_3174_1_2 Breast_3174_1_2p
## *              <dbl>           <dbl>            <dbl>
## 1                NaN      0.75000000        0.8333333
## 2                NaN      1.00000000        1.0000000
## 3          0.9285714      0.88888889        0.8275862
## 4          1.0000000             NaN              NaN
## 5                NaN      1.00000000        1.0000000
## 6                NaN             NaN              NaN
## 7          0.4146341      0.81818182        0.8360656
## 8          0.0000000      0.03174603        0.0781250
## 9                NaN      0.07843137        0.0000000
## 10         0.0000000      0.01694915        0.0000000
## # ... with 20 more rows, and 11 more variables: Breast_3174_1_8 <dbl>,
## #   Breast_3174_1_9 <dbl>, Breast_3174_Fresh <dbl>, Breast_4310_1_2 <dbl>,
## #   Breast_4310_Fresh <dbl>, Breast_4310_Normal <dbl>,
## #   Breast_4550_3_1 <dbl>, Breast_4550_3_2 <dbl>, Breast_4550_3_3 <dbl>,
## #   Breast_4550_Fresh <dbl>, intervalID <int>

gpatterns provides a wrapper to extract the ‘sum’ of tracks, for example, if we want to combine the methylation of all the sections from tumor 2460 we can add ‘sum_tracks = TRUE’:

gpatterns.get_avg_meth(tracks[1:4], intervals=gintervals(1, 1e6, 1e6+1e4), iterator='intervs.msp1.fid', min_cov = 10, min_cpgs = 2, sum_tracks = TRUE)

## Taking only intervals with at least 2 CpGs

## Taking only intervals with coverage >= 10 in at least 1 samples

## # A tibble: 27 × 5
##     chrom   start     end        sum intervalID
## *  <fctr>   <dbl>   <dbl>      <dbl>      <int>
## 1    chr1 1001409 1001509 0.50980392          1
## 2    chr1 1001510 1001664 0.63636364          2
## 3    chr1 1002618 1002731 0.95011876          3
## 4    chr1 1002732 1002764 0.90697674          4
## 5    chr1 1003283 1003374 0.28367347          5
## 6    chr1 1003887 1004001 0.02721088          6
## 7    chr1 1004002 1004015 0.00000000          7
## 8    chr1 1004441 1004495 0.00000000          8
## 9    chr1 1004538 1004706 0.02158273          9
## 10   chr1 1004707 1004788 0.01351351         10
## # ... with 17 more rows

Again, note that all these functions are just wrappers of ‘misha’ functions, and we encourage the reader to try and implement some of them using the basic misha verbs, as in many cases these wrappers would not be enough (e.g. complex track expressions, combination with other tracks).

Aviezer Lifshitz

2017-06-18

Data

Data internals

Extracting methylation data

Applying functions on tidy_CpGs

Analyzing raw pattern data

Defining chipseq peaks

Parallelization

Contents