Prerequisites

Genome assembly (optional)

Repsc can compute family-wise multiple sequence alignments of all repeat sequences in the genome to improve read mapping onto a consensus model. To do so, we require the genome assembly of our organism of choice stored as a BSgenome object. You can retrieve the full list of supported genomes by typing BSgenome::available.genomes() or create a custom BSgenome object following the instructions.

Since we are working with expression data from human cancer cell lines and mouse embryos, we will first install the UCSC hg38 and mm10 assemblies.

BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
BiocManager::install("BSgenome.Mmusculus.UCSC.mm10")

Important note: Do not use repeat-masked BSgenome objects (contain ‘masked’ suffix, e.g. BSgenome.Hsapiens.UCSC.hg38.masked)!

MAFFT multiple alignment program (optional)

To compute family-wise multiple sequence alignments (used to improve mapping of read/UMI signal along consensus TE models), Repsc utilizes the MAFFT multiple alignment program. To use this feature, make sure mafft is in your command PATH.

Repeat annotation (required)

A convient solution is to download transposon coordinates from the Repeatmasker homepage or the DFAM database for your genome and assembly of choice. You can import the Repeatmasker fa.out.gz or DFAM dfam.hits.gz files using the Repsc importRMSK and importDFAM functions, respectively. Another option is to provide custom annotation as long as it provides the basic information about chromosome, start, end, strand, repname (family identifier), and id_unique (unique locus identifier). In this tutorial, we will show you how to import such information using the provided example datasets.

Gene model annotation (required)

In addition to TE expression counts, Repsc also quantifies genic expression levels using common gene interval and annotation formats (e.g. gtf). In this tutorial, we will use the Gencode comprehensive gene annotation on the human reference chromosomes only. Other annotation ressources are currently untested and should be used with caution.

Read/UMI genomic coordinates (required)

Repsc requires read alignment coordinates stored in BAM format as input, which are routinely generated during most common scRNA-seq workflows, including 10x’ Cellranger pipeline. BAM inputs should be duplicate removed, position sorted, and indexed. Chunking BAM inputs (e.g. by chromosome) can accelerate the import into your R environment using the importBAM function. Important: Repsc assumes the cell barcode is either stored as CB tag or BAM input files are seperated per cell. Other formats are currently not supported (e.g. cell barcode in read name).

Workflow human 10x scRNA-seq dataset (5’)

In this tutorial, we are going to utilize 5’ scRNA-seq data on epigenetically de-repressed cancer cell lines to quantify transposable element (TE) expression levels at single-cell and locus resolution. Following the workflow, you’ll learn the specifics of Repsc to adapt it to your single-cell dataset.

Getting started

We start the workflow by loading Repsc and the human hg38 BSgenome object into our R environment:

devtools::load_all('/net/mraid14/export/tgdata/users/davidbr/src/Repsc/')

# adjust to your genome of interest (e.g. BSgenome.Mmusculus.UCSC.mm10)
library(BSgenome.Hsapiens.UCSC.hg38)

Create scSet

We then import our gene and TE annotation files as GRanges objects followed by Repsc-specific curation and formatting using the curateGenes and curateTEs functions.

# path to Gencode gtf file (provided)
gene_path <- system.file(package = 'Repsc', 
                        'extdata', 
                        'gencode.v29.annotation.gtf.gz')

# path to RepeatMasker hg38 repeat annotation (provided)
rmsk_path <- system.file(package = 'Repsc', 
                         'extdata', 
                         'hg38.fa.out.gz')
                         
# creating the scSet
sc <- createScSet(genome   = Hsapiens,
                  protocol = 'fiveprime',
                  tes      = rmsk_path,
                  genes    = gene_path)

Compute multiple sequence alignments

Repsc computes the read/UMI coverage along genes and the consensus model of TE families. This can be useful to sanity check 5’/3’ enrichment (depending on the protocol) and to identify putative TE consensus TSSs (5’ protocols), polyA-sites (3’ protocols), and to distinguish true de-repression from spurious background signal (e.g. intronic TE read mis-assignment, broad-scale genomic background transcription, etc.). As a rough estimate, we can utilize the consensus mapping information from Repeatmasker or DFAM output files for that purpose. This will usually provide reasonable results for highly conserved families. To increase accuracy, Repsc can also compute family-wise multiple sequence alignments to improve mapping of individual loci onto a de novo alignment. When time and computational ressources are no limitation, we recommend this step by running:


compAlignments(sc, 
               max_gap   = 0.95,       # max gap percentage in the alignment to consider
               n_jobs    = 500,        # number of parallel jobs (needs HPC access)
               outdir    = tempdir(),      
               overwrite = FALSE)

## Compute count matrix

After we have imported and curated the data, we can procede to generate the actual read/UMI count matrix using the `compCounts` function.

# path to bam files containing mapped reads                         
bam_paths <- dir('/net/mraid14/export/data/users/davidbr/proj/epitherapy/data/h1299/10x/dacsb/aligned/chunked_bams/',
                 pattern = '_deduplicated_chr[0-9, X, Y, MT]+.bam', 
                 full.names = TRUE)
                 
# create a data.frame specifying import parameters                 
bam_df    <- data.frame(paths   = bam_paths,
                        paired  = TRUE,       # use FALSE for single-end libraries
                        mate    = 'first',    # only imports the first mate of properly aligned read pairs, set to NA when using single-end libraries
                        barcode = 'CB',       # 10x barcode included in BAM flag
                        stringsAsFactors = FALSE)
sc <- addCounts(sc,
                reads    = bam_df,
                bin_size = 25,
                msa_dir  = NULL)

Call cells

To distinguish real cells from empty droplets, we utilize the emptyDrops function from the DropletUtils package[1].

sc <- callCells(sc, 
                fdr = 0.01)

Call TSSs

Workflow mouse Smart-seq2 scRNA-seq dataset

In this part of the tutorial, we will use Smart-seq2 data generated on early mouse embryos[1] to illustrate the specifics of full-coverage scRNA-seq data.

Getting started

We start the workflow by loading Repsc and the mouse mm10 BSgenome object into our R environment:

devtools::load_all('/net/mraid14/export/tgdata/users/davidbr/src/Repsc/')

# adjust to your genome of interest (e.g. BSgenome.Hsapiens.UCSC.hg38)
library(BSgenome.Mmusculus.UCSC.mm10)

Import genome annotation

As before, we first import our gene and TE annotation files as GRanges objects followed by Repsc-specific curation and formatting using the curateGenes and curateTEs functions.

# path to Gencode gtf file (provided)
gene_path <- system.file(package = 'Repsc', 
                        'extdata', 
                        'gencode.vM21.annotation.gtf.gz')
# path to RepeatMasker hg38 repeat annotation (provided)
rmsk_path <- system.file(package = 'Repsc', 
                         'extdata', 
                         'mm10.fa.out.gz')
                            
# import
genes <- rtracklayer::import(gene_path)
tes   <- importRMSK(rmsk_path)         # by default only imports DNA, SINE, LINE, and LTR class repeats on the main reference chromosomes                           

# curation
genes_cur <- curateGenes(genes)        # only retains exons, resolves naming ambiguity
tes_cur   <- curateTEs(tes,            # tries to resolve nested TE intervals and adds annotation
                       genes_cur)      # to filter exonic TEs

Import reads and compute count matrix in parallel

Given the size of the example Smart-seq2 dataset and memory limitations of most systems, we will compute single-cell gene/TE counts for subsets of the BAM files and combine the processed results afterwards. For enhanced speed performance, we will employ the parallel backend of the future package.

# path to bam files containing mapped reads                         
bam_paths <- dir('/net/mraid14/export/tgdata/users/evghenic/data/cheng2019/aligned/', 
                 pattern = 'duprm.bam', 
                 full.names = TRUE)
bam_paths

# vector of cell barcodes 
barcodes <- substring(basename(bam_paths), 1, 10)

# commands for parallel execution
cmds   <- list()
chunks <- chunk(bam_paths, chunk_size = 10)

for (i in 1:length(chunks))
{
  bam_chunks     <- glue("c({glue_collapse(single_quote(bam_paths[chunks[[i]]]), sep = ',')})")
  barcode_chunks <- glue("c({glue_collapse(single_quote(barcodes[chunks[[i]]]), sep = ',')})")
 
  cmds[[i]] <- glue("Repsc:::compCounts(reads = Repsc:::importBAM({bam_chunks}, paired = FALSE, mate = NA, anchor = 'fiveprime', barcode = {barcode_chunks}),
                     tes_cur,
                     genes_cur,
                     bin_size = 20,
                     protocol = 'full')")
}

# distribute commands

# create new empty env and fill with relevant
empty           <- new.env(parent = emptyenv())
empty$tes_cur   <- tes_cur
empty$genes_cur <- genes_cur
  
# distribute, compute and combine
counts <- 
  gcluster.run3(command_list = cmds,  
                packages = c("data.table", "plyranges", "base", "stats"), 
                envir = empty, 
                collapse_results = TRUE,
                io_saturation = TRUE)

Normalize counts

The full transcript coverage of Smart-seq2 comes with a cost: Unlike most scRNA-seq protocols it lacks UMIs to quantify true molecule counts per cell. As an approximation, Repsc samples N molecules per cell with empirical weights learned from the data to simulate integer molecule counts.

counts_norm <- normCounts(counts, 
                          N = 20000) # number of molecules to sample per cell

# Workflow mouse 10x scRNA-seq dataset (3')

In this part of the tutorial, we will use 3' scRNA-seq data generated on mouse embryos[1] to illustrate the specifics of 3' scRNA-seq data.

## Getting started

We start the workflow by loading Repsc and the mouse mm10 _BSgenome_ object into our R environment:

devtools::load_all('/net/mraid14/export/tgdata/users/davidbr/src/Repsc/')
library(BSgenome.Mmusculus.UCSC.mm10)

Import genome annotation

As before, we first import our gene and TE annotation files as GRanges objects followed by Repsc-specific curation and formatting using the curateGenes and curateTEs functions.

# path to Gencode gtf file (provided)
gene_path <- system.file(package = 'Repsc', 
                        'extdata', 
                        'gencode.vM21.annotation.gtf.gz')
# path to RepeatMasker hg38 repeat annotation (provided)
rmsk_path <- system.file(package = 'Repsc', 
                         'extdata', 
                         'mm10.fa.out.gz')
                            
# import
genes <- rtracklayer::import(gene_path)
tes   <- importRMSK(rmsk_path)         # by default only imports DNA, SINE, LINE, and LTR class repeats on the main reference chromosomes                           

# curation
genes_cur <- curateGenes(genes)        # only retains exons, resolves naming ambiguity
tes_cur   <- curateTEs(tes,            # tries to resolve nested TE intervals and adds annotation
                       genes_cur)      # to filter exonic TEs

Import reads and compute count matrix in parallel

# path to bam files containing mapped reads                         
bam_paths <- dir('/net/mraid14/export/tgdata/users/davidbr/data/goettgens_2019/aligned/', 
                 pattern = 'bam.bam$',
                 full.names = TRUE)
bam_paths

# commands for parallel execution
cmds   <- list()
chunks <- chunk(bam_paths, chunk_size = 1)

for (i in 1:length(chunks))
{
  bam_chunks     <- glue("c({glue_collapse(single_quote(bam_paths[chunks[[i]]]), sep = ',')})")
 
  cmds[[i]] <- glue("Repsc:::compCounts(reads = Repsc:::importBAM({bam_chunks}, paired = FALSE, mate = NA, anchor = 'fiveprime'),
                     tes_cur,
                     genes_cur,
                     bin_size = 20,
                     protocol = 'full')")
}

# distribute commands

# create new empty env and fill with relevant
empty           <- new.env(parent = emptyenv())
empty$tes_cur   <- tes_cur
empty$genes_cur <- genes_cur
  
# distribute, compute and combine
res_gclust <- 
  gcluster.run3(command_list = cmds,  
                packages = c("data.table", "plyranges", "base", "stats"), 
                envir = empty, 
                collapse_results = TRUE,
                io_saturation = TRUE)

References

[1]

Session information

sessionInfo()
#> R version 3.5.3 (2019-03-11)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /net/mraid14/export/data/users/eladch/tools/CO7/mkl/2018.3/compilers_and_libraries_2018.3.222/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.1       roxygen2_6.1.1   rprojroot_1.3-2  crayon_1.3.4    
#>  [5] assertthat_0.2.1 digest_0.6.19    commonmark_1.7   MASS_7.3-51.1   
#>  [9] R6_2.4.0         backports_1.1.4  magrittr_1.5     evaluate_0.13   
#> [13] stringi_1.4.3    rlang_0.3.2      fs_1.2.7         xml2_1.2.0      
#> [17] rmarkdown_1.12   pkgdown_1.3.0    desc_1.2.0       tools_3.5.3     
#> [21] stringr_1.4.0    yaml_2.2.0       xfun_0.5         compiler_3.5.3  
#> [25] memoise_1.1.0    htmltools_0.3.6  knitr_1.22

Repsc vignette

David Brocks

2019-06-27

Prerequisites

Genome assembly (optional)

MAFFT multiple alignment program (optional)

Repeat annotation (required)

Gene model annotation (required)

Read/UMI genomic coordinates (required)

Workflow human 10x scRNA-seq dataset (5’)

Getting started

Create scSet

Compute multiple sequence alignments

Call cells

Call TSSs

Workflow mouse Smart-seq2 scRNA-seq dataset

Getting started

Import genome annotation

Import reads and compute count matrix in parallel

Normalize counts

Import genome annotation

Import reads and compute count matrix in parallel

References

Session information

Contents