umi4c-newDB.Rmd
UMI-4C Package is a processing and analysis pipeline for UMI-4C experiments. Some databases have been computed and are accessible via download. However, sometimes the database you need is not available and you need to create a new one. This vignette will guide you to create a new database.
To create a new genomic database, you will need:
The pipeline is designed to run on a standard linux machine. The basic requirements are:
# Install devtools if not installed
if(!require(devtools)){
install.packages("devtools")
}
# Install misha if not installed
if(!require(misha)){
install.packages('misha', repos=c(getOption('repos'), 'https://tanaylab.github.io/repo'))
}
# Install umi4cPackage if not installed
if(!require(umi4cPackage)){
devtools::install_github("tanaylab/umi4cpackage", build_vignette = TRUE)
}
Most of genomes are available on UCSC where you can download them through the download page in the section Genome sequence files and select annotations. You can also download them by the link:
If you have a fasta file from another source it is also find.
If the name of the chromosome does not begin by ‘chr’ you will need to add it by:
To create a new database, you need one file per chromosome. If you have only one file, you can split your file using awk:
genome=mm10
mkdir -p $genome
zcat ${genome}.fa.gz | awk -v dir=$genome '$0~"^>"{id = $1; gsub("^>", "", id)}{print >> dir"/"id".fa"}'
In some assemblies there are a lot of contigs or there are alternate contigs and you may be interested in keeping only some of them (for example only the numbered chromosome, the sexual and the mitochondrion).
You can remove the one you don’t want from the new folder created, for example:
As written above, you can provide a mappability bedgraph. This is not mandatory, alternatively you can arbitrary decide that all genome is fully mappable.
One way to get the mappability track is to use gem2. You can install it through conda:
or download the source:
wget https://paoloribeca.science/gem/gem2.INEEDTOCHECK.Linux-amd64.tar.xz
tar xvf gem2.INEEDTOCHECK.Linux-amd64.tar.xz
export PATH=$PATH:$PWD/gem2.INEEDTOCHECK.Linux-amd64
You need to have a single fasta file. If you selected only some chromosomes, you need to concatenate them:
Then you need to create an index:
nbThreads=28
fastaFile=${genome}_simplified.fa # or ${genome}.fa.gz if you did not select chromosomes
gem-indexer -T ${nbThreads} -c dna -i ${fastaFile} -o ${genome}_index
You need to decide a size for the kmer to compute mappability (we recommand 50).
Finally you need to convert the mappability file to bedgraph
The pipeline requires a special database called trackdb which contains binary genomic tracks and other data, as well as restriction site database in text format called redb.
From R we will create a genome database (the trackdb folder) using individual fasta files. All fasta files must begin by “chr”.
library("misha")
genome <- "mm10"
gdb.create("my_genome/trackdb", file.path(genome, "chr*.fa"))
You set this new database as your current database:
gdb.init("my_genome/trackdb")
We will now create a track with the mappability. It can be very long if you have a big genome…
gtrack.create(track="my_mappab", description="mappab set to 5",
expr="5", iterator = min(ALLGENOME[[1]]$end),
band = NULL)
mapab_fn <- paste0(genome, "_50.bg")
# Create a track from the bedgraph
gtrack.import(track="mappab_from_file", description="mappability from gem",
file=mapab_fn, 10)
# Get the maxiumum value (1 with gem):
max.value <- gsummary("mappab_from_file")["Max"]
# Create a new track with the values scaled to get max at 5:
gtrack.create(track="my_mappab", description="mappab from gem scaled to 5",
expr=paste("mappab_from_file * 5 /", max.value), iterator = 10,
band = NULL)
To create a redb you need to provide a tabulation-delimited file which provide the list of chromosomes and the path to the sequence in fasta format.
# You need to adapt here where your individual fasta are:
fasta.files <- list.files(file.path(genome))
seq.names <- sapply(fasta.files, function(fn){gsub("^chr", "",
gsub("\\.fa$", "", fn))})
seq.key.df <- data.frame(names = seq.names,
files = file.path(genome, fasta.files))
write.table(seq.key.df, "chrom_key.txt", sep ="\t",
row.names = F, col.names = F, quote = F)
The next step is to set up the conf file. To do so we will first need to dump templates of conf files to a conf directory:
library(umi4cPackage)
p4cDumpConfFiles(conf_dir = "my_genome/conf")
Now we will need to go to the supplied directory (my_genome/conf
in our case) and complete the redb.conf file.
Make sure that all newlines are linux compatible (\n) and not windows (\r\n)!
Now you need to create a database specific to the restriction enzyme you used.
# For example, DpnII:
gtrack.create_redb_tracks(re_seq="GATC", redb_params_fn="my_genome/conf/redb.conf")
After this, you should have two directories that contains all files necessary to do the umi4c analysis:
You can keep a copy of this new database, for example creating an archive:
When you want to use this new db, don’t forget to use the good path in paths.conf file for variables TG3C.trackdb and TG3C.redb.