Write a sparse-matrix file from an indexed bam file — write_sparse_matrix_from

This function reads a bam file and writes a sparse-matrix file, where rows are the genomic coordinates and columns are the cells. It does so by:

filtering the bam file by the given region (e.g. a chromosome, using the "region" parameter)
extracting the cell name from the "CB" tag added by cellranger, while excluding reads that are marked as PCR duplicates (SAM flag 1024)
using awk to extract the start coordinate and the tag of each read
using unix "sort" and "uniq" to count the number of times each coordinate appears for each cell
using an R script to convert the cell names to indices
writing the sparse-matrix file (matrix-market format) and compressing it (.gz)
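
The counting and index-conversion steps can be sketched in base R on toy data. This is only an illustration under stated assumptions, not the package's implementation (which runs awk, sort and uniq on the command line): the barcodes, the coordinates and the `region_len` row count are made up, and `aggregate()` stands in for `sort | uniq -c`.

```r
# Toy stand-in for the awk output: one (start coordinate, cell barcode) pair per read.
# All values here are hypothetical.
reads <- data.frame(
  start = c(100L, 100L, 250L, 100L, 250L, 250L),
  cell = c("AAAC-1", "AAAC-1", "AAAC-1", "GGTT-1", "GGTT-1", "GGTT-1"),
  stringsAsFactors = FALSE
)
reads$n <- 1L

# Cell names as they would be passed in `cell_names`; "CCGA-1" has no reads.
cell_names <- c("AAAC-1", "CCGA-1", "GGTT-1")

# Assumed number of rows (length of the region); hypothetical value.
region_len <- 1000L

# Stand-in for `sort | uniq -c`: count reads per (coordinate, cell) pair.
counts <- aggregate(n ~ start + cell, data = reads, FUN = sum)

# Convert cell names to 1-based column indices.
counts$col <- match(counts$cell, cell_names)
counts <- counts[order(counts$col, counts$start), ]

# Matrix Market coordinate format: header, "rows cols nonzeros", then "row col value".
mm_lines <- c(
  "%%MatrixMarket matrix coordinate integer general",
  paste(region_len, length(cell_names), nrow(counts)),
  paste(counts$start, counts$col, counts$n)
)
writeLines(mm_lines)
```

Cells without reads simply contribute no entries, which is what keeps the file sparse.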

Note (1): To use this function, "samtools" (>= 1.15), "awk", "sed", "sort" and "uniq" must be available at the unix command line.

Note (2): The function takes a while to run: around 13 minutes using 24 cores on the PBMC data.
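
Whether the command-line tools named in Note (1) are on the PATH can be checked from R before starting a long run. The following is a minimal sketch using base R only; it assumes samtools is called as "samtools" (the default of `samtools_bin`) and leaves the comparison against version 1.15 to the reader.

```r
# Tools named in Note (1).
required_tools <- c("samtools", "awk", "sed", "sort", "uniq")

# Sys.which() returns the full path of each executable, or "" if it is not found.
tool_paths <- Sys.which(required_tools)
missing_tools <- names(tool_paths)[is.na(tool_paths) | tool_paths == ""]

if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
} else {
  # First line of the samtools version banner, to compare with the 1.15 minimum.
  print(system2("samtools", "--version", stdout = TRUE)[1])
}
```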

Usage

write_sparse_matrix_from_bam(
  bam_file,
  out_file,
  cell_names,
  region,
  genome = NULL,
  min_mapq = NULL,
  samtools_bin = "samtools",
  samtools_opts = NULL,
  num_reads = NULL,
  verbose = TRUE,
  overwrite = FALSE
)

Arguments

bam_file: name of the bam file
out_file: name of the output file. A ".gz" extension will be added to the file name.
cell_names: a vector with the cell names or an ATAC object
region: intervals set used to filter the reads. See the samtools documentation (http://www.htslib.org/doc/samtools-view.html) for details
genome: genome name (e.g. hg19). Will be inferred from the ScPeaks object if provided
min_mapq: minimal mapping quality (optional)
samtools_bin: path to samtools executable
samtools_opts: additional options for samtools (e.g. "--subsample 0.1")
num_reads: number of reads (within the region) to process (optional).
verbose: verbose output (optional)
overwrite: overwrite existing files (optional)

Value

None

Examples

if (FALSE) {
write_sparse_matrix_from_bam(
  "pbmc_data/possorted_bam.bam", "chrom1.mm",
  region = gintervals.all()[1, ], cell_names = atac_sc
)
}
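
The compressed matrix-market output can be read back into R for a quick sanity check. The sketch below assumes the Matrix package is installed and that the output is a standard Matrix Market coordinate file compressed with gzip. To stay self-contained it writes a tiny made-up file instead of using real output.

```r
library(Matrix)

# Stand-in for the function's output: a tiny gzip-compressed Matrix Market file
# (3 coordinates x 2 cells, 3 non-zero counts). The contents are made up.
mm_path <- tempfile(fileext = ".mm.gz")
con <- gzfile(mm_path, "w")
writeLines(c(
  "%%MatrixMarket matrix coordinate integer general",
  "3 2 3",
  "1 1 2",
  "3 1 1",
  "2 2 4"
), con)
close(con)

# readMM() accepts a connection, so the file is read without unzipping it first.
m <- readMM(gzfile(mm_path))

# Rows are genomic coordinates, columns are cells.
print(dim(m))
print(colSums(m))
```

For the example call above, the path would presumably be "chrom1.mm.gz", given that a ".gz" extension is added to `out_file`.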