IMPORTANT: THIS IS ONLY A SANDBOX. IT IS (NOT EVEN) IN BETA STAGE, READ WITH CAUTION AND CONSULT WITH US IF YOU HAVE ANY QUESTIONS. NO GUARANTEES!
This part is intended for more advanced users that want to do additional analyses using gpatterns and ‘misha’. For a standard analytical pipline please refer to analysis
section.
We we will use the same data from the analysis section: a dataset of UMI-RRBS of breast and lung tumors. The dataset contains 16 breast tumors samples from 4 patients, with multiple sections from each sample.
library(gpatterns)
gsetroot('/home/aviezerl/proj/sandbox_db')
options(gmax.data.size=1e09)
tracks <- gpatterns.ls('gpatterns_nugget.Breast_.+_.+')
print(tracks)
## [1] "gpatterns_nugget.Breast_2460_1_3"
## [2] "gpatterns_nugget.Breast_2460_1_6"
## [3] "gpatterns_nugget.Breast_2460_1_8"
## [4] "gpatterns_nugget.Breast_2460_Fresh"
## [5] "gpatterns_nugget.Breast_3174_1_2"
## [6] "gpatterns_nugget.Breast_3174_1_2p"
## [7] "gpatterns_nugget.Breast_3174_1_8"
## [8] "gpatterns_nugget.Breast_3174_1_9"
## [9] "gpatterns_nugget.Breast_3174_Fresh"
## [10] "gpatterns_nugget.Breast_4310_1_2"
## [11] "gpatterns_nugget.Breast_4310_Fresh"
## [12] "gpatterns_nugget.Breast_4310_Normal"
## [13] "gpatterns_nugget.Breast_4550_3_1"
## [14] "gpatterns_nugget.Breast_4550_3_2"
## [15] "gpatterns_nugget.Breast_4550_3_3"
## [16] "gpatterns_nugget.Breast_4550_Fresh"
names <- gsub('gpatterns_nugget.', '', tracks)
print(names)
## [1] "Breast_2460_1_3" "Breast_2460_1_6" "Breast_2460_1_8"
## [4] "Breast_2460_Fresh" "Breast_3174_1_2" "Breast_3174_1_2p"
## [7] "Breast_3174_1_8" "Breast_3174_1_9" "Breast_3174_Fresh"
## [10] "Breast_4310_1_2" "Breast_4310_Fresh" "Breast_4310_Normal"
## [13] "Breast_4550_3_1" "Breast_4550_3_2" "Breast_4550_3_3"
## [16] "Breast_4550_Fresh"
Each track contains (at least) 5 components:
gtrack.ls(tracks[1])
## [1] "gpatterns_nugget.Breast_2460_1_3.avg"
## [2] "gpatterns_nugget.Breast_2460_1_3.cov"
## [3] "gpatterns_nugget.Breast_2460_1_3.epipoly"
## [4] "gpatterns_nugget.Breast_2460_1_3.fid"
## [5] "gpatterns_nugget.Breast_2460_1_3.meth"
## [6] "gpatterns_nugget.Breast_2460_1_3.n"
## [7] "gpatterns_nugget.Breast_2460_1_3.n0"
## [8] "gpatterns_nugget.Breast_2460_1_3.n1"
## [9] "gpatterns_nugget.Breast_2460_1_3.nc"
## [10] "gpatterns_nugget.Breast_2460_1_3.ncpg"
## [11] "gpatterns_nugget.Breast_2460_1_3.nx"
## [12] "gpatterns_nugget.Breast_2460_1_3.pat_cov3"
## [13] "gpatterns_nugget.Breast_2460_1_3.pat_cov5"
## [14] "gpatterns_nugget.Breast_2460_1_3.pat_cov7"
## [15] "gpatterns_nugget.Breast_2460_1_3.pat_meth"
## [16] "gpatterns_nugget.Breast_2460_1_3.pat2"
## [17] "gpatterns_nugget.Breast_2460_1_3.unmeth"
gpatterns.get_tidy_cpgs(tracks[1], intervals=gintervals(1,1e6,1e6 + 1e4))
## # A tibble: 1,949 × 12
## read_id chrom start
## <chr> <chr> <dbl>
## 1 Breast_2406_1.3_2_A:2368391:AGAGGT:Xba_Xba_B_Dark chr1 1001409
## 2 Breast_2406_1.3_2_A:3080811:ATATAA:Xba_EcoRI_T_Dark chr1 1001409
## 3 Breast_2406_1.3_2_A:4179231:GAGGAG:Xba_EcoRI_T_Dark chr1 1001409
## 4 Breast_2406_1.3_2_A:2462062:GGTGGT:Xba_Xba_B_Dark chr1 1001409
## 5 Breast_2406_1.3_2_A:4163117:GTAGTA:Xba_Xba_B_Dark chr1 1001409
## 6 Breast_2406_1.3_2_A:2677641:TGAGGG:Xba_Xba_T_Dark chr1 1001409
## 7 Breast_2406_1.3_2_A:1844720:AGAGGT:Xba_Xba_T_Dark chr1 1001510
## 8 Breast_2406_1.3_2_A:2527743:GGGTTT:Xba_Xba_T_Dark chr1 1001510
## 9 Breast_2406_1.3_2_A:4552887:GTAGTA:Xba_EcoRI_T_Dark chr1 1001510
## 10 Breast_2406_1.3_2_A:2384712:TAAGAG:Xba_Xba_B_Dark chr1 1001510
## end strand umi1 umi2 insert_len num cg_pos
## <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 1001509 - AGAGGT - 98 4 1001471
## 2 1001509 - ATATAA - -92 1 1001471
## 3 1001509 - GAGGAG - 98 1 1001471
## 4 1001509 - GGTGGT - 98 2 1001471
## 5 1001509 - GTAGTA - 98 1 1001471
## 6 1001509 - TGAGGG - -98 5 1001471
## 7 1001664 + AGAGGT - -98 2 1001471
## 8 1001664 + GGGTTT - -98 2 1001471
## 9 1001664 + GTAGTA - 98 1 1001471
## 10 1001664 + TAAGAG - 98 4 1001471
## # ... with 1,939 more rows, and 2 more variables: meth <dbl>, qual <dbl>
The tidy_cpgs table contains the following fields:
Storing the raw reads allows us to analyze coupled methylation calls, and calculate features like epipolymorphism or other attributes of the full methylation pattern distribution, see ‘applying functions on tidy_CpGs’ section.
In order to extract the methylation signal from a track we can use the basic misha functions, for example, to extract the methylation calls from chromosome1, 1,000,000 to 1,0010,000 we can do:
gextract(paste0(tracks[1], c('.meth', '.unmeth', '.avg', '.cov')), intervals=gintervals(1,1e6,1e6+1e4), iterator=paste0(tracks[1], '.cov'), colnames=c('meth', 'unmeth', 'avg', 'cov')) %>% tbl_df
## # A tibble: 170 × 8
## chrom start end meth unmeth avg cov intervalID
## * <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 chr1 1001471 1001472 12 9 0.5714286 21 1
## 2 chr1 1002663 1002664 15 1 0.9375000 16 1
## 3 chr1 1002666 1002667 16 0 1.0000000 16 1
## 4 chr1 1002687 1002688 13 2 0.8666667 15 1
## 5 chr1 1002689 1002690 11 3 0.7857143 14 1
## 6 chr1 1002697 1002698 8 0 1.0000000 8 1
## 7 chr1 1002702 1002703 12 1 0.9230769 13 1
## 8 chr1 1002705 1002706 14 1 0.9333333 15 1
## 9 chr1 1002737 1002738 4 1 0.8000000 5 1
## 10 chr1 1002739 1002740 4 2 0.6666667 6 1
## # ... with 160 more rows
gpatterns provides wrapper to this function, , that has the following additional features: (Note that all this features can be achived only with ‘misha’, e.g. using virtual tracks)
For example, if we want to extract the average methylation in MSP1 fragments from all the tracks, limiting ourselves to fragments with covereage of 10 in at least 5 samples that have at least 2 CpGs we can write:
gpatterns.get_avg_meth(tracks, names=names, intervals=gintervals(1, 1e6, 1e6+1e4), iterator='intervs.msp1.fid', min_cov = 10, min_samples=5, min_cpgs = 2)
## Taking only intervals with at least 2 CpGs
## extracting...
## Taking only intervals with coverage >= 10 in at least 5 samples
## number of intervals: 30
## # A tibble: 480 × 9
## chrom start end intervalID samp meth unmeth
## <fctr> <dbl> <dbl> <int> <chr> <dbl> <dbl>
## 1 chr1 1001409 1001509 1 Breast_2460_1_3 12 9
## 2 chr1 1001510 1001664 1 Breast_2460_1_3 NaN NaN
## 3 chr1 1002618 1002731 1 Breast_2460_1_3 89 8
## 4 chr1 1002732 1002764 1 Breast_2460_1_3 23 9
## 5 chr1 1002824 1003025 1 Breast_2460_1_3 NaN NaN
## 6 chr1 1003026 1003262 1 Breast_2460_1_3 NaN NaN
## 7 chr1 1003283 1003374 1 Breast_2460_1_3 46 153
## 8 chr1 1003887 1004001 1 Breast_2460_1_3 6 159
## 9 chr1 1004538 1004706 1 Breast_2460_1_3 0 26
## 10 chr1 1004707 1004788 1 Breast_2460_1_3 1 85
## avg cov
## <dbl> <dbl>
## 1 0.57142857 21
## 2 NaN NaN
## 3 0.91752577 97
## 4 0.71875000 32
## 5 NaN NaN
## 6 NaN NaN
## 7 0.23115578 199
## 8 0.03636364 165
## 9 0.00000000 26
## 10 0.01162791 86
## # ... with 470 more rows
We can limit ourselves to ‘interesting’ fragments, e.g. fragments with minimal range (‘abs(min(m) - max(m) >= min_range’):
gpatterns.get_avg_meth(tracks, names=names, intervals=gintervals(1, 1e6, 1e6+1e4), iterator='intervs.msp1.fid', min_cov = 10, min_samples=5, min_cpgs = 2, min_range = 0.3)
## Taking only intervals with at least 2 CpGs
## extracting...
## Taking only intervals with coverage >= 10 in at least 5 samples
## number of intervals: 19
## # A tibble: 304 × 9
## chrom start end intervalID samp meth unmeth
## <fctr> <dbl> <dbl> <int> <chr> <dbl> <dbl>
## 1 chr1 1001409 1001509 1 Breast_2460_1_3 12 9
## 2 chr1 1001510 1001664 1 Breast_2460_1_3 NaN NaN
## 3 chr1 1002618 1002731 1 Breast_2460_1_3 89 8
## 4 chr1 1002824 1003025 1 Breast_2460_1_3 NaN NaN
## 5 chr1 1003026 1003262 1 Breast_2460_1_3 NaN NaN
## 6 chr1 1003283 1003374 1 Breast_2460_1_3 46 153
## 7 chr1 1005099 1005188 1 Breast_2460_1_3 58 29
## 8 chr1 1005189 1005260 1 Breast_2460_1_3 12 56
## 9 chr1 1005316 1005500 1 Breast_2460_1_3 1 3
## 10 chr1 1006940 1007053 1 Breast_2460_1_3 0 15
## avg cov
## <dbl> <dbl>
## 1 0.5714286 21
## 2 NaN NaN
## 3 0.9175258 97
## 4 NaN NaN
## 5 NaN NaN
## 6 0.2311558 199
## 7 0.6666667 87
## 8 0.1764706 68
## 9 0.2500000 4
## 10 0.0000000 15
## # ... with 294 more rows
If we want to extract data in single CpG level, we can explicitly set the iterator to CpG intervals or leave it NULL:
gpatterns.get_avg_meth(tracks, names=names, intervals=gintervals(1, 1e6, 1e6+1e4), min_cov = 10, min_samples=5, min_cpgs = 2, min_range = 0.3)
## Using cpgs as iterator
## extracting...
## 0%...100%
## Taking only intervals with coverage >= 10 in at least 5 samples
## number of intervals: 48
## # A tibble: 768 × 9
## chrom start end intervalID samp meth unmeth
## <fctr> <dbl> <dbl> <int> <chr> <dbl> <dbl>
## 1 chr1 1001471 1001472 1 Breast_2460_1_3 12 9
## 2 chr1 1002663 1002664 1 Breast_2460_1_3 15 1
## 3 chr1 1002666 1002667 1 Breast_2460_1_3 16 0
## 4 chr1 1002687 1002688 1 Breast_2460_1_3 13 2
## 5 chr1 1002689 1002690 1 Breast_2460_1_3 11 3
## 6 chr1 1002697 1002698 1 Breast_2460_1_3 8 0
## 7 chr1 1002702 1002703 1 Breast_2460_1_3 12 1
## 8 chr1 1002705 1002706 1 Breast_2460_1_3 14 1
## 9 chr1 1003298 1003299 1 Breast_2460_1_3 6 20
## 10 chr1 1003302 1003303 1 Breast_2460_1_3 6 20
## avg cov
## <dbl> <dbl>
## 1 0.5714286 21
## 2 0.9375000 16
## 3 1.0000000 16
## 4 0.8666667 15
## 5 0.7857143 14
## 6 1.0000000 8
## 7 0.9230769 13
## 8 0.9333333 15
## 9 0.2307692 26
## 10 0.2307692 26
## # ... with 758 more rows
Sometimes it is useful to extract only the average methylation in a non ‘tidy’ format. This may also boost performance:
gpatterns.get_avg_meth(tracks, names=names, intervals=gintervals(1, 1e6, 1e6+1e4), iterator='intervs.msp1.fid', min_cov = 10, min_cpgs = 2, min_samples=5, tidy=FALSE)
## Taking only intervals with at least 2 CpGs
## Taking only intervals with coverage >= 10 in at least 5 samples
## number of intervals: 30
## # A tibble: 30 × 20
## chrom start end Breast_2460_1_3 Breast_2460_1_6 Breast_2460_1_8
## * <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 chr1 1001409 1001509 0.57142857 0.31578947 0.72727273
## 2 chr1 1001510 1001664 NaN 0.57894737 1.00000000
## 3 chr1 1002618 1002731 0.91752577 0.97849462 0.93548387
## 4 chr1 1002732 1002764 0.71875000 0.96825397 0.96551724
## 5 chr1 1002824 1003025 NaN NaN NaN
## 6 chr1 1003026 1003262 NaN NaN NaN
## 7 chr1 1003283 1003374 0.23115578 0.13815789 0.56122449
## 8 chr1 1003887 1004001 0.03636364 0.02000000 0.02857143
## 9 chr1 1004538 1004706 0.00000000 0.01724138 0.03636364
## 10 chr1 1004707 1004788 0.01162791 0.02586207 0.00000000
## Breast_2460_Fresh Breast_3174_1_2 Breast_3174_1_2p
## * <dbl> <dbl> <dbl>
## 1 NaN 0.75000000 0.8333333
## 2 NaN 1.00000000 1.0000000
## 3 0.9285714 0.88888889 0.8275862
## 4 1.0000000 NaN NaN
## 5 NaN 1.00000000 1.0000000
## 6 NaN NaN NaN
## 7 0.4146341 0.81818182 0.8360656
## 8 0.0000000 0.03174603 0.0781250
## 9 NaN 0.07843137 0.0000000
## 10 0.0000000 0.01694915 0.0000000
## # ... with 20 more rows, and 11 more variables: Breast_3174_1_8 <dbl>,
## # Breast_3174_1_9 <dbl>, Breast_3174_Fresh <dbl>, Breast_4310_1_2 <dbl>,
## # Breast_4310_Fresh <dbl>, Breast_4310_Normal <dbl>,
## # Breast_4550_3_1 <dbl>, Breast_4550_3_2 <dbl>, Breast_4550_3_3 <dbl>,
## # Breast_4550_Fresh <dbl>, intervalID <int>
gpatterns provides a wrapper to extract the ‘sum’ of tracks, for example, if we want to combine the methylation of all the sections from tumor 2460 we can add ‘sum_tracks = TRUE’:
gpatterns.get_avg_meth(tracks[1:4], intervals=gintervals(1, 1e6, 1e6+1e4), iterator='intervs.msp1.fid', min_cov = 10, min_cpgs = 2, sum_tracks = TRUE)
## Taking only intervals with at least 2 CpGs
## Taking only intervals with coverage >= 10 in at least 1 samples
## # A tibble: 27 × 5
## chrom start end sum intervalID
## * <fctr> <dbl> <dbl> <dbl> <int>
## 1 chr1 1001409 1001509 0.50980392 1
## 2 chr1 1001510 1001664 0.63636364 2
## 3 chr1 1002618 1002731 0.95011876 3
## 4 chr1 1002732 1002764 0.90697674 4
## 5 chr1 1003283 1003374 0.28367347 5
## 6 chr1 1003887 1004001 0.02721088 6
## 7 chr1 1004002 1004015 0.00000000 7
## 8 chr1 1004441 1004495 0.00000000 8
## 9 chr1 1004538 1004706 0.02158273 9
## 10 chr1 1004707 1004788 0.01351351 10
## # ... with 17 more rows
Again, note that all these functions are just wrappers of ‘misha’ functions, and we encourage the reader to try and implement some of them using the basic misha verbs, as in many cases these wrappers would not be enough (e.g. complex track expressions, combination with other tracks).