Data can be imported to gpatterns from raw reads (.fastq files), mapped reads (.bam files) or from tidy_cpgs
files were generated by the gpatterns package.
To explore gpatterns’ import functions, we’ll start with a dataset of UMI-RRBS of breast and lung tumors. The dataset contains 16 breast tumors samples from 4 patients, with multiple sections from each sample.
bissli2
is a wrapper around bowtie2 that maps bisulfite converted reads and extracts methylation calls. - more about bissli params and usage
gpatterns.bissli2(r1_fastq = 'Breast_2460_1.3_R1/fastq/Breast_2460_1.3_R1.fastq', r2_fastq = 'Breast_2460_1.3_R1/fastq/Breast_2460_1.3_R2.fastq', out_bam = 'Breast_2460_1.3_R1/bam/Breast_2460_1.3.bam', genome_seq = "/net/mraid14/export/data/db/tgdb/hg19/seq", genome_type="ct_ga", bissli2_idx = "/net/mraid14/export/data/tools/bissli2/hg19/hg19")
gpatterns.import_from_bam('bam/Breast_2460_1.3.bam', workdir='Breast_2460_1.3_R1', trim=1, cgs_mask_file='/net/mraid14/export/tgdata/users/aviezerl/proj/gpatterns_nugget/msp1_stickey_ends', frag_intervs='intervs.msp1.fid', rm_off_target=T, maxdist=4, use_sge=F, parallel=TRUE, paired_end=TRUE)
tidy_cpgs
filesgpatterns.import_from_tidy_cpgs(tidy_cpgs="Breast_2460_1.3_R1/tidy_cpgs", track="gpatterns_nugget.Breast_2460_1.3_R1", description="", groot="/home/aviezerl/hg19", max_span=600, parallel=TRUE)
In order to generate methylation patterns we first need to define the CpG scope, or ‘pattern space’, i.e. the set of adjacent CpGs we will be looking at. In theory, we could choose a different set of CpGs for each sample separately, in fact, gpatterns does this routinely by generating .patX tracks for each sample with patterns of length X for each CpG.
However, this approach becomes problematic if we want to compare the pattern distribution of samples directly. Therefore, the best practice is to define the ‘pattern space’ based on all the samples together. gpatterns provides two functions to calculate such space:
gpatterns.tracks_to_pat_space
: generates patterns space by dividing the genome to non-overlapping chunks and taking the most covered stretches of pat_l
adjacent CpGs.gpatterns.intervs_to_pat_space
: generates pattern space for specific intervals (e.g. MSP1 fragments, capture regions, enhancers, CpG islands etc.) by taking the most covered stretches of pat_l
CpGs than cover each interval, and only then calls gpatterms.tracks_to_pat_space
to calculate for the rest of the genome. The function can be limited to only adjacent CpGs (adjacent=TRUE
) or take the most covered CpGs even if there are other CpGs in the middle.Since our example is from RRBS data we would use gpatterns.intervs_to_pat_space
with the msp1 fragments:
pat_space <- gpatterns.intervs_to_pat_space(tracks, intervals='intervs.msp1.fid', adjacent=TRUE, pat_len=5, parallel=TRUE)
And now we can create the patterns attributes for each of our tracks:
purrr::walk(tracks, ~ gpatterns.create_patterns_track(track = .x, description = paste0('patterns of ', .x), pat_space=pat_space, max_missing=0))