Data can be imported to gpatterns from raw reads (.fastq files), mapped reads (.bam files) or from tidy_cpgs files were generated by the gpatterns package.

Data: UMI-RRBS of Breast and Lung tumors

To explore gpatterns’ import functions, we’ll start with a dataset of UMI-RRBS of breast and lung tumors. The dataset contains 16 breast tumors samples from 4 patients, with multiple sections from each sample.

  • table

Importing from raw reads (.fastq)

bissli2

bissli2 is a wrapper around bowtie2 that maps bisulfite converted reads and extracts methylation calls. - more about bissli params and usage

gpatterns.bissli2(r1_fastq = 'Breast_2460_1.3_R1/fastq/Breast_2460_1.3_R1.fastq', r2_fastq = 'Breast_2460_1.3_R1/fastq/Breast_2460_1.3_R2.fastq', out_bam = 'Breast_2460_1.3_R1/bam/Breast_2460_1.3.bam', genome_seq = "/net/mraid14/export/data/db/tgdb/hg19/seq", genome_type="ct_ga", bissli2_idx = "/net/mraid14/export/data/tools/bissli2/hg19/hg19")

Importing from mapped reads (.bam)

gpatterns.import_from_bam('bam/Breast_2460_1.3.bam', workdir='Breast_2460_1.3_R1', trim=1, cgs_mask_file='/net/mraid14/export/tgdata/users/aviezerl/proj/gpatterns_nugget/msp1_stickey_ends', frag_intervs='intervs.msp1.fid', rm_off_target=T, maxdist=4, use_sge=F, parallel=TRUE, paired_end=TRUE)

Importing from tidy_cpgs files

gpatterns.import_from_tidy_cpgs(tidy_cpgs="Breast_2460_1.3_R1/tidy_cpgs", track="gpatterns_nugget.Breast_2460_1.3_R1", description="", groot="/home/aviezerl/hg19", max_span=600, parallel=TRUE)

Generating methylation patterns

In order to generate methylation patterns we first need to define the CpG scope, or ‘pattern space’, i.e. the set of adjacent CpGs we will be looking at. In theory, we could choose a different set of CpGs for each sample separately, in fact, gpatterns does this routinely by generating .patX tracks for each sample with patterns of length X for each CpG.

However, this approach becomes problematic if we want to compare the pattern distribution of samples directly. Therefore, the best practice is to define the ‘pattern space’ based on all the samples together. gpatterns provides two functions to calculate such space:

  • gpatterns.tracks_to_pat_space: generates patterns space by dividing the genome to non-overlapping chunks and taking the most covered stretches of pat_l adjacent CpGs.
  • gpatterns.intervs_to_pat_space: generates pattern space for specific intervals (e.g. MSP1 fragments, capture regions, enhancers, CpG islands etc.) by taking the most covered stretches of pat_l CpGs than cover each interval, and only then calls gpatterms.tracks_to_pat_space to calculate for the rest of the genome. The function can be limited to only adjacent CpGs (adjacent=TRUE) or take the most covered CpGs even if there are other CpGs in the middle.

Since our example is from RRBS data we would use gpatterns.intervs_to_pat_space with the msp1 fragments:

pat_space <- gpatterns.intervs_to_pat_space(tracks, intervals='intervs.msp1.fid', adjacent=TRUE, pat_len=5, parallel=TRUE)

And now we can create the patterns attributes for each of our tracks:

purrr::walk(tracks, ~ gpatterns.create_patterns_track(track = .x, description = paste0('patterns of ', .x), pat_space=pat_space, max_missing=0))