UMI-4C Package is a processing and analysis pipeline for UMI-4C experiments.

Background

This package is intended for processing and analysis of UMI-4C experiments. The UMI-4C protocol exploits sonication of 3C template material to facilitate molecule counting and quantitative analysis of chromatin contacts of one or several genomic loci at high resolution.

The two parts of the package are:

  • Processing of of raw reads (FASTQ format) to genomic tracks: Paired-end FASTQ files are parsed, splitted according to the restriction enzyme sequence (Usually DpnII) and summarized in adjacency table (adj file). This part uses as a bunch of Perl scripts wrapped in R.
  • Analyzing UMI-4C genomic tracks and producing nearcis figures: Genomic tracks are dynamically processed to produce nearcis profiles at the desired genomic region specified by the user.

Installation

Requirements:

The pipeline is designed to run on a standard linux machine. The basic requirements are:

Installation

Install the r dependencies:

# Install devtools if not installed
if(!require(devtools)){
  install.packages("devtools")
}
# Install misha if not installed
if(!require(misha)){
  install.packages('misha', repos=c(getOption('repos'),       'https://tanaylab.github.io/repo'))
}

Download and install umi4cPackage:

devtools::install_github("tanaylab/umi4cpackage", build_vignette = TRUE)
library(umi4cPackage)
Preparing genomic database

The pipeline requires a special database called trackdb which contains binary genomic tracks and other data, as well as restriction site database in text format called redb.

Downloading the data (~2.5Gb) is done from inside the package.
We will first create a new project directory:

mkdir my_u4c_proj

From R we will download the data to the provided folder:

library("umi4cPackage")
# Use build 'hg19' for DpnII enzyme
p4cBuildRequirements(output_dir = "my_u4c_proj", genome = "hg19", reseq = "GATC")

Two directory will be generated after downloading the extracting the data:

  • my_u4c_proj/hg19/trackdb
  • my_u4c_proj/hg19/redb

Configuration files

The next step is to set up the conf files. To do so we will first need to dump templates of conf files to a conf directory:

p4cDumpConfFiles(conf_dir = "my_u4c_proj/conf")

Now we will need to go to the supplied directory (my_u4c_proj/conf in our case) and complete the conf files.
Make sure that all newlines are linux compatible (\n) and not windows (\r\n)!

  • paths.conf: Set the paths to the directories and dependencies.
  • samples.txt: Samples table, tab separated. One row for each experiment. Mandatory columns are:
    • Sample_ID: Unique ID.
    • Sample_name: Unique sample name. e.g., Cell type, condition etc…
    • Experiment_name: A shared name for the experiment batch.
    • fastqs_regex: Regular expression that captures all FASTQ files names associated with the sample. Important: the files should have the suffix *.fastq . The FASTQ filenames must also contain ‘R1’ and ‘R2’ which associate the file with the paired-end sequencing direction.
    • fastqs_dir: Path to FASTQ directory.
    • Bait_IDs: Unique IDs of baits associated with the experiment. A comma separated list of bait IDs from baits.txt.
  • baits.txt: Baits table, tab separated. One row for each bait. Mandatory columns are:
    • Bait_ID: Unique bait ID - associates sample to bait.
    • Bait_name: Unique bait name.
    • Bait_seq: Sequence of the (DS) bait primer without the padding sequence. It accepts also regular expressions (e.g., AGGCA.+GGG).
    • Bait_pad: Padding sequence - the sequence between the bait primer end to the bait fragment restriction site (including the restriction sequence, e.g., for DpnII: XXXXGATC).
    • Bait_chr: Bait chromosome.
    • Bait_coord: Bait coord. Ideally the coordinate of the nearest restriction site to the bait.
  • pipeline.conf: Pipeline settings. The defaults should work fine for DpnII.
  • p4c.conf: Analysis settings. The defaults should work fine.

Importing and analyzing UMI-4C files.

Initialization:

We need to load the configuration files to our envirnoment by:

p4cLoadConfFiles(conf_dir = "my_u4c_proj/conf")

Working example:

Download example fastqs from: http://compgenomics.weizmann.ac.il/tanay/?page_id=617. These files include three UMI-4C experiments done on three cell lines on the ANK1 promoter.
Extract the files to a directory.
In samples.txt - change fastqs_dir to the working directory with the downloaded example files. Notice that the pipeline detect the files associated files by the regular expression pattern defined in fastqs_regex. See that Bait_IDs of the samples is 1.
In baits.txt - Inspect that indeed the Bait_ID 1 is correctly filled.

Importing genomic tracks

Now we will run the pipeline on a single sample:

p4cCreate4CseqTrack(sample_ids = 1)

All intermediate files will be saved in the workdir that was defined in TG3C.workdir parameter.

The pipeline uses the samples and baits tables to retrieve the relevant information on experiment ID#1, and transform the raw fastq files to UMI counts (see more details on intermediate steps below). This might take some time, depending on the size of the fastq file and the machine being used. Then UMI counts are imported and saved in a special data structure we refer to as “genomic tracks”. For each sample-bait combination a genomic track will generated. In our example, a track named umi4C_example_CMK_ANK1_TSS will be generated. This track contains data for sample CMK and bait ANK1_TSS.
After successfully importing the data, the pipeline will report some essential statistics.
These statistics can also be accessed from 4CQC.txt in the sample workdir directory.

Once imported the genomic tracks are saved in our trackdb, and can be listed by:

Analyzing and plotting

Now we can analyze the data:
Suppose we want to create an UMI-4C profile 200kb upstream and downstream the bait in our exmaple CMK experiment. The first step is to create p4cProfile object:

CMK_fc <- p4cNewProfile("umi4C_example_CMK_ANK1_TSS", scope_5=200000, scope_3=200000)

Plot nearcis profile:

plot(CMK_fc)

This command generates a plot of the 4C profile with a smoothed trend and a domainogram. It accepts some more parameters which are described in the function documentation: ?plot.p4cProfile.
For example, if we want to save the domainogram to a png file we can do:

plot(CMK_fc, png_fn="figs/CMK_ANK1_TSS.png")

Compare the profile to another profile

To compare two 4C experiments, we first import two other profiles (notice, this time we will do it in batch):

p4cCreate4CseqTrack(sample_ids = c(2,3))

Two new tracks were created. To list the tracks again by gtrack.ls().

After the tracks were imported we can produce comparative plots. For example, we will compare CMK which express ANK1 gene to DND41 which does not.

# Generate a second p4cProfile on the same scope
DND41_fc <- p4cNewProfile("umi4C_example_DND41_ANK1_TSS", scope_5=200000, scope_3=200000)

# plot a comparative plot
plot(CMK_fc, DND41_fc)

Additional feature is the ability to derive mean contact intensity for defined genomic intervals. The following command will return normalized contact intensities of the two profiles, fold change, and p-value (Chi-square test)

fold_change <- p4cIntervalsMean(CMK_fc, DND41_fc, start=41665000, end=41675000)
knitr::kable(fold_change, align = 'l', digits=2)