umi4c-usage.Rmd
UMI-4C Package is a processing and analysis pipeline for UMI-4C experiments.
This package is intended for processing and analysis of UMI-4C experiments. The UMI-4C protocol exploits sonication of 3C template material to facilitate molecule counting and quantitative analysis of chromatin contacts of one or several genomic loci at high resolution.
The two parts of the package are:
The pipeline is designed to run on a standard linux machine. The basic requirements are:
Install the r dependencies:
# Install devtools if not installed
if(!require(devtools)){
install.packages("devtools")
}
# Install misha if not installed
if(!require(misha)){
install.packages('misha', repos=c(getOption('repos'), 'https://tanaylab.github.io/repo'))
}
Download and install umi4cPackage:
devtools::install_github("tanaylab/umi4cpackage", build_vignette = TRUE)
library(umi4cPackage)
The pipeline requires a special database called trackdb which contains binary genomic tracks and other data, as well as restriction site database in text format called redb.
Downloading the data (~2.5Gb) is done from inside the package.
We will first create a new project directory:
From R we will download the data to the provided folder:
library("umi4cPackage")
# Use build 'hg19' for DpnII enzyme
p4cBuildRequirements(output_dir = "my_u4c_proj", genome = "hg19", reseq = "GATC")
Two directory will be generated after downloading the extracting the data:
The next step is to set up the conf files. To do so we will first need to dump templates of conf files to a conf directory:
p4cDumpConfFiles(conf_dir = "my_u4c_proj/conf")
Now we will need to go to the supplied directory (my_u4c_proj/conf
in our case) and complete the conf files.
Make sure that all newlines are linux compatible (\n) and not windows (\r\n)!
We need to load the configuration files to our envirnoment by:
p4cLoadConfFiles(conf_dir = "my_u4c_proj/conf")
Download example fastqs from: http://compgenomics.weizmann.ac.il/tanay/?page_id=617. These files include three UMI-4C experiments done on three cell lines on the ANK1 promoter.
Extract the files to a directory.
In samples.txt - change fastqs_dir to the working directory with the downloaded example files. Notice that the pipeline detect the files associated files by the regular expression pattern defined in fastqs_regex. See that Bait_IDs of the samples is 1.
In baits.txt - Inspect that indeed the Bait_ID 1 is correctly filled.
Now we will run the pipeline on a single sample:
p4cCreate4CseqTrack(sample_ids = 1)
All intermediate files will be saved in the workdir that was defined in TG3C.workdir parameter.
The pipeline uses the samples and baits tables to retrieve the relevant information on experiment ID#1, and transform the raw fastq files to UMI counts (see more details on intermediate steps below). This might take some time, depending on the size of the fastq file and the machine being used. Then UMI counts are imported and saved in a special data structure we refer to as “genomic tracks”. For each sample-bait combination a genomic track will generated. In our example, a track named umi4C_example_CMK_ANK1_TSS
will be generated. This track contains data for sample CMK and bait ANK1_TSS.
After successfully importing the data, the pipeline will report some essential statistics.
These statistics can also be accessed from 4CQC.txt in the sample workdir directory.
Once imported the genomic tracks are saved in our trackdb, and can be listed by:
Now we can analyze the data:
Suppose we want to create an UMI-4C profile 200kb upstream and downstream the bait in our exmaple CMK experiment. The first step is to create p4cProfile object
:
CMK_fc <- p4cNewProfile("umi4C_example_CMK_ANK1_TSS", scope_5=200000, scope_3=200000)
Plot nearcis profile:
plot(CMK_fc)
This command generates a plot of the 4C profile with a smoothed trend and a domainogram. It accepts some more parameters which are described in the function documentation: ?plot.p4cProfile
.
For example, if we want to save the domainogram to a png file we can do:
plot(CMK_fc, png_fn="figs/CMK_ANK1_TSS.png")
To compare two 4C experiments, we first import two other profiles (notice, this time we will do it in batch):
p4cCreate4CseqTrack(sample_ids = c(2,3))
Two new tracks were created. To list the tracks again by gtrack.ls()
.
After the tracks were imported we can produce comparative plots. For example, we will compare CMK which express ANK1 gene to DND41 which does not.
# Generate a second p4cProfile on the same scope
DND41_fc <- p4cNewProfile("umi4C_example_DND41_ANK1_TSS", scope_5=200000, scope_3=200000)
# plot a comparative plot
plot(CMK_fc, DND41_fc)
Additional feature is the ability to derive mean contact intensity for defined genomic intervals. The following command will return normalized contact intensities of the two profiles, fold change, and p-value (Chi-square test)
fold_change <- p4cIntervalsMean(CMK_fc, DND41_fc, start=41665000, end=41675000)
knitr::kable(fold_change, align = 'l', digits=2)