Create a new database for umi4cPackage

UMI-4C Package is a processing and analysis pipeline for UMI-4C experiments. Some databases have been computed and are accessible via download. However, sometimes the database you need is not available and you need to create a new one. This vignette will guide you to create a new database.

Background

To create a new genomic database, you will need:

Fasta files (one per chromosome) where file name begins with ‘chr’ and each chromosome name also begins with ‘chr’. If you have only one file with all chromosomes the steps below will help you to convert it.
Optionnally a mappability bedgraph where values goes from 0 to 5. A section is dedicated to ways to get this mappability bedgraph

Get the inputs

Requirements:

The pipeline is designed to run on a standard linux machine. The basic requirements are:

Perl
R packages:
- devtools
- misha
- umi4cPackage

Install the R dependencies

# Install devtools if not installed
if(!require(devtools)){
  install.packages("devtools")
}
# Install misha if not installed
if(!require(misha)){
  install.packages('misha', repos=c(getOption('repos'), 'https://tanaylab.github.io/repo'))
}
# Install umi4cPackage if not installed
if(!require(umi4cPackage)){
  devtools::install_github("tanaylab/umi4cpackage", build_vignette = TRUE)
}

Fasta files

Most of genomes are available on UCSC where you can download them through the download page in the section Genome sequence files and select annotations. You can also download them by the link:

genome=mm10
wget http://hgdownload.cse.ucsc.edu/goldenPath/${genome}/bigZips/${genome}.fa.gz

If you have a fasta file from another source it is also find.

If the name of the chromosome does not begin by ‘chr’ you will need to add it by:

zcat original.fa.gz | awk '$0~/^>/{gsub(">",">chr", $0)}{print}' | gzip > ${genome}.fa.gz

To create a new database, you need one file per chromosome. If you have only one file, you can split your file using awk:

genome=mm10
mkdir -p $genome
zcat ${genome}.fa.gz | awk -v dir=$genome '$0~"^>"{id = $1; gsub("^>", "", id)}{print >> dir"/"id".fa"}'

In some assemblies there are a lot of contigs or there are alternate contigs and you may be interested in keeping only some of them (for example only the numbered chromosome, the sexual and the mitochondrion).

You can remove the one you don’t want from the new folder created, for example:

rm ${genome}/*random
rm ${genome}/chrUn*
rm ${genome}/*_alt

Build a mappability bedgraph

As written above, you can provide a mappability bedgraph. This is not mandatory, alternatively you can arbitrary decide that all genome is fully mappable.

One way to get the mappability track is to use gem2. You can install it through conda:

conda install -c bioconda gem2

or download the source:

wget https://paoloribeca.science/gem/gem2.INEEDTOCHECK.Linux-amd64.tar.xz
tar xvf gem2.INEEDTOCHECK.Linux-amd64.tar.xz
export PATH=$PATH:$PWD/gem2.INEEDTOCHECK.Linux-amd64

You need to have a single fasta file. If you selected only some chromosomes, you need to concatenate them:

cat ${genome}/*.fa > ${genome}_simplified.fa

Then you need to create an index:

nbThreads=28
fastaFile=${genome}_simplified.fa # or ${genome}.fa.gz if you did not select chromosomes
gem-indexer -T ${nbThreads} -c dna -i ${fastaFile} -o ${genome}_index

You need to decide a size for the kmer to compute mappability (we recommand 50).

kmer=50
gem-mappability -T ${nbThreads} -I ${genome}_index.gem -l ${kmer} -o ${genome}_${kmer}

Finally you need to convert the mappability file to bedgraph

gem-2-bed mappability -I ${genome}_index.gem -i ${genome}_${kmer}.mappability -o ${genome}_${kmer}

Build the genomic database

The pipeline requires a special database called trackdb which contains binary genomic tracks and other data, as well as restriction site database in text format called redb.

Initiate the database

From R we will create a genome database (the trackdb folder) using individual fasta files. All fasta files must begin by “chr”.

library("misha")
genome <- "mm10"
gdb.create("my_genome/trackdb", file.path(genome, "chr*.fa"))

You set this new database as your current database:

gdb.init("my_genome/trackdb")

Put mappability track

We will now create a track with the mappability. It can be very long if you have a big genome…

If you did not compute any mappability bedgraph:

gtrack.create(track="my_mappab", description="mappab set to 5",
              expr="5", iterator = min(ALLGENOME[[1]]$end),
              band = NULL)

If you computed the mappability with gem or any other way and you have a bedgraph:

mapab_fn <- paste0(genome, "_50.bg")
# Create a track from the bedgraph
gtrack.import(track="mappab_from_file", description="mappability from gem",
              file=mapab_fn, 10)
# Get the maxiumum value (1 with gem):
max.value <- gsummary("mappab_from_file")["Max"]
# Create a new track with the values scaled to get max at 5:
gtrack.create(track="my_mappab", description="mappab from gem scaled to 5",
              expr=paste("mappab_from_file * 5 /", max.value), iterator = 10,
              band = NULL)

Create the redb tracks

Build the chromosome_key table

To create a redb you need to provide a tabulation-delimited file which provide the list of chromosomes and the path to the sequence in fasta format.

# You need to adapt here where your individual fasta are:
fasta.files <- list.files(file.path(genome))
seq.names <- sapply(fasta.files, function(fn){gsub("^chr", "", 
                                                 gsub("\\.fa$", "", fn))})
seq.key.df <- data.frame(names = seq.names,
                         files = file.path(genome, fasta.files))
write.table(seq.key.df, "chrom_key.txt", sep ="\t",
            row.names = F, col.names = F, quote = F)

Configuration file

The next step is to set up the conf file. To do so we will first need to dump templates of conf files to a conf directory:

library(umi4cPackage)
p4cDumpConfFiles(conf_dir = "my_genome/conf")

Now we will need to go to the supplied directory (my_genome/conf in our case) and complete the redb.conf file.
Make sure that all newlines are linux compatible (\n) and not windows (\r\n)!

TG3C.mapab_track: Set the name of the mappability track
TG3C.re_workdir: Set the path to the redb directory
TG3C.chrom_seq_key: Set the path to the chomosome_key table created above.

Create the db

Now you need to create a database specific to the restriction enzyme you used.

# For example, DpnII:
gtrack.create_redb_tracks(re_seq="GATC", redb_params_fn="my_genome/conf/redb.conf")

After this, you should have two directories that contains all files necessary to do the umi4c analysis:

my_genome/trackdb
my_genome/redb

You can keep a copy of this new database, for example creating an archive:

genome=mm10
tar -zcvf ${genome}_umi4c_db.tar.gz my_genome/redb my_genome/trackdb

When you want to use this new db, don’t forget to use the good path in paths.conf file for variables TG3C.trackdb and TG3C.redb.

Omer Schwartzman

Aviezer Lifshitz

Lucille Lopez-Delisle

2021-10-24