Extracts average methylation data from tracks.

gpatterns.get_avg_meth(tracks, intervals, iterator = NULL, min_cov = NULL,
  mask_by_cov = FALSE, use_cpgs = FALSE, min_samples = NULL,
  min_cpgs = NULL, min_var = NULL, var_quantile = NULL,
  min_range = NULL, names = NULL, tidy = TRUE, pre_screen = FALSE,
  use_disk = FALSE, file = NULL, intervals.set.out = NULL,
  sum_tracks = FALSE)

Arguments

tracks
methylation tracks
intervals
genomic scope for which the function is applied
iterator
see iterator in gextract. If NULL, the iterator is set to CpGs
min_cov
minimal coverage for iterator interval
mask_by_cov
change loci with coverage < min_cov to NA. Not relevant when intervals.set.out or file is not NULL.
use_cpgs
use CpGs as iterator
min_samples
minimal number of samples with cov >= min_cov. If min_cov is NULL, it is set to 1.
min_cpgs
minimal number of CpGs per iterator interval. note that the intervalID column may be incorrect.
min_var
minimal variance (across samples) per iterator interval
var_quantile
minimal quantile of variance per iterator interval
min_range
take only iterator intervals with max(avg_meth) - min(avg_meth) >= min_range
names
alternative names for the tracks. Similar to colnames in gextract if tidy == FALSE. Note that names should be shorter than the maximal length of an R data frame column name
tidy
if TRUE, returns a tidy data frame with the following fields: chrom, start, end, intervalID, samp, meth, unmeth, avg, cov. If FALSE, returns a data frame with average methylation, similar to gextract. Note that for a large number of intervals tidy == FALSE may be the only memory-feasible option.
file
save output to file (only in non tidy mode, would not filter by variance)
intervals.set.out
save output to a big intervals set (only in tidy mode, would not filter by variance)
sum_tracks
get average methylation from all the tracks summed
pre_screen
pre-screen for min_samples and min_cov (for a large number of tracks / a large number of intervals). Note that the intervalID column may be incorrect and, if use_cpgs is TRUE, the intervals set would become the CpGs.

Details

There are two main modes:

  • not tidy: returns a data frame with intervals (chrom,start,end) and a column with average methylation for each sample.
  • tidy: returns a tidy data frame with 'meth','unmeth','avg','cov' for each iterator interval for each sample.
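
The difference between the two layouts can be sketched in base R on a toy table. This is only an illustration: the values are invented, and it assumes that cov is meth + unmeth and avg is meth / cov, which this page does not state explicitly.

```r
# Toy 'tidy' table: one row per (iterator interval, sample).
# All values are made up for illustration.
tidy_df <- data.frame(
  chrom  = "chr1",
  start  = c(100, 100, 500, 500),
  end    = c(200, 200, 600, 600),
  samp   = c("s1", "s2", "s1", "s2"),
  meth   = c(8, 2, 5, 0),
  unmeth = c(2, 8, 5, 10)
)

# Assumed definitions (not confirmed by the documentation above).
tidy_df$cov <- tidy_df$meth + tidy_df$unmeth
tidy_df$avg <- tidy_df$meth / tidy_df$cov

# 'Not tidy' layout: one row per interval, one avg column per sample.
wide_df <- reshape(
  tidy_df[, c("chrom", "start", "end", "samp", "avg")],
  idvar     = c("chrom", "start", "end"),
  timevar   = "samp",
  direction = "wide"
)
# reshape() names the new columns avg.s1, avg.s2; strip the prefix.
names(wide_df) <- sub("^avg\\.", "", names(wide_df))
```

Note that the wide table keeps only avg; the raw meth and unmeth counts are no longer recoverable from it.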

The 'tidy' option is very convenient for further analysis, but note that for a large amount of data it may be too slow. The 'not tidy' version, on the other hand, returns only average methylation and not the raw 'meth' and 'unmeth' calls. In general, choose the mode according to the following guidelines:

  • For extremely large datasets use the 'not tidy' version with use_disk == TRUE. Note that in general working with a huge number of genomic regions is not useful, both in terms of performance (memory consumption, slow algorithms) and analysis (more 'noise'). A good practice is to select the genomic regions carefully, for example by requiring minimal coverage (min_cov) in a minimal number of samples (min_samples), a minimal number of CpGs (min_cpgs), taking only the most variable regions (min_var, var_quantile) or by taking sets of annotated regions (e.g. promoters, enhancers).
  • For large datasets use the 'not tidy' version.
  • For intermediate size datasets use the 'tidy' version with pre_screen = TRUE. This would first filter the CpGs and only then extract the methylation to memory.
  • For small datasets use the 'vanilla' 'tidy' version.
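
The region screening recommended in the first guideline can be illustrated with a small base R sketch. It is a conceptual example on invented matrices, not the package's internal implementation; in particular, computing the variance only over samples that pass min_cov is an assumption made here.

```r
# Toy coverage and average-methylation matrices (regions x samples).
# Values are invented for illustration.
cov_mat <- matrix(c(12,  0, 15,
                     3,  2, 20,
                    10, 11,  9),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("r1", "r2", "r3"), c("s1", "s2", "s3")))
avg_mat <- matrix(c(0.9, NA,   0.1,
                    0.5, 0.5,  0.5,
                    0.2, 0.25, 0.2),
                  nrow = 3, byrow = TRUE,
                  dimnames = dimnames(cov_mat))

min_cov     <- 5     # minimal coverage per region per sample
min_samples <- 2     # minimal number of samples passing min_cov
min_var     <- 0.01  # minimal variance across samples

# Number of samples with sufficient coverage in each region.
n_covered <- rowSums(cov_mat >= min_cov)

# Mask low-coverage entries (analogous in spirit to mask_by_cov = TRUE).
masked <- avg_mat
masked[cov_mat < min_cov] <- NA

# Variance across the remaining samples (assumption: masked values ignored).
row_var <- apply(masked, 1, var, na.rm = TRUE)

keep <- n_covered >= min_samples & !is.na(row_var) & row_var >= min_var
selected <- names(keep)[keep]
```

Here r2 is dropped because only one sample reaches min_cov, and r3 is dropped because it is well covered but nearly constant across samples.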

To understand the concept of iterators and intervals, see gextract, and the misha package in general. The function works in the following way: for every interval in intervals the function extracts the methylation calls in each iterator interval and calculates the average.

Beware the difference between intervals and iterator: the intervals parameter sets the global genomic scope of the function (what part of the genome to look at to begin with). The iterator parameter sets the iterator intervals, which are the chunks of the genome from which we will extract the methylation calls. For example, setting the iterator to gintervals.all() would calculate the average methylation of every chromosome, whereas setting the intervals to gintervals.all() would just mean that the calculations of the iterator intervals would not be limited to a specific part of the genome, and, for example, if iterator=NULL, methylation would be extracted from all the genomic CpGs.
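
The distinction between scope and iterator can be mimicked in base R on a toy table of CpG calls. This is a sketch under stated assumptions: the positions and counts are invented, scope_start/scope_end and iter are hypothetical stand-ins for intervals and iterator, and the average is assumed to be the pooled meth / (meth + unmeth) within each iterator interval.

```r
# Toy per-CpG methylation calls on one chromosome (invented values).
cpgs <- data.frame(
  pos    = c(110, 150, 420, 480, 900),
  meth   = c(4, 6, 1, 1, 7),
  unmeth = c(1, 4, 9, 4, 3)
)

# Stand-in for `intervals`: the global scope to look at.
scope_start <- 100
scope_end   <- 500

# Stand-in for `iterator`: the chunks over which calls are pooled.
iter <- data.frame(start = c(100, 400), end = c(200, 500))

# Step 1: restrict to the scope (the CpG at position 900 is dropped).
in_scope <- cpgs[cpgs$pos >= scope_start & cpgs$pos < scope_end, ]

# Step 2: pool the calls inside each iterator interval and average.
res <- do.call(rbind, lapply(seq_len(nrow(iter)), function(i) {
  hit <- in_scope$pos >= iter$start[i] & in_scope$pos < iter$end[i]
  m <- sum(in_scope$meth[hit])
  u <- sum(in_scope$unmeth[hit])
  data.frame(start = iter$start[i], end = iter$end[i],
             meth = m, unmeth = u, cov = m + u,
             avg = m / (m + u))  # assumed definition of avg
}))
```

Changing the scope changes which CpGs are considered at all, while changing iter changes how the considered CpGs are grouped before averaging.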