
This function reads a BAM file and writes a sparse-matrix file in which rows are genomic coordinates and columns are cells. It does so by:

  1. filtering the BAM file by the given region (e.g. a chromosome, using the "region" parameter)

  2. extracting the cell name from the "CB" tag added by cellranger, while excluding reads marked as PCR duplicates (SAM flag 1024)

  3. using awk to extract the start coordinates and the tag

  4. using unix "sort" and "uniq" to count the number of times each coordinate appears for each cell

  5. using an R script to convert the cell names to indices

  6. writing the sparse-matrix file (Matrix Market format) and compressing it with gzip
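
As a rough illustration of steps 4 to 6, the sketch below reproduces the counting, index conversion and writing in plain R on a toy table. It is not the package's implementation (which uses unix "sort" and "uniq" for the counting). The barcodes, the number of rows, the use of the start coordinate as the row index and the dropping of cells missing from `cell_names` are all assumptions made for the example.

```r
# Toy stand-in for the (start coordinate, cell barcode) pairs from steps 1-3.
reads <- data.frame(
  start = c(100L, 100L, 100L, 250L, 250L, 400L),
  cell = c("AAAC-1", "AAAC-1", "TTTG-1", "AAAC-1", "TTTG-1", "GGGA-1"),
  stringsAsFactors = FALSE
)
cell_names <- c("AAAC-1", "TTTG-1") # hypothetical barcodes to keep
n_rows <- 1000L                     # assumed number of coordinates (rows)

# Step 4 equivalent: count occurrences of each (coordinate, cell) pair.
reads$n <- 1L
counts <- aggregate(n ~ start + cell, data = reads, FUN = sum)

# Step 5 equivalent: convert cell names to column indices.
# Assumption: reads from cells not listed in cell_names are dropped.
counts$col <- match(counts$cell, cell_names)
counts <- counts[!is.na(counts$col), ]
counts <- counts[order(counts$col, counts$start), ]

# Step 6 equivalent: write Matrix Market coordinate format via a gzip connection.
# Assumption: the start coordinate is used directly as the 1-based row index.
out_file <- tempfile(fileext = ".mm.gz")
con <- gzfile(out_file, "w")
writeLines("%%MatrixMarket matrix coordinate integer general", con)
writeLines(paste(n_rows, length(cell_names), nrow(counts)), con)
writeLines(paste(counts$start, counts$col, counts$n), con)
close(con)
```

The compressed file can be inspected with `readLines(out_file)`. The first line is the Matrix Market header, the second gives the dimensions and the number of non-zero entries, and each remaining line is a "row column count" triplet.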


Note (1): To use this function, "samtools" (>= 1.15), "awk", "sed", "sort" and "uniq" must be available on the unix command line.

Note (2): The function takes a while to run: around 13 minutes using 24 cores on the PBMC data.
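
One way to check these requirements before a long run is sketched below. `samtools_version_ok` is a hypothetical helper and not part of the package. It assumes the first line printed by `samtools --version` has the form "samtools 1.17".

```r
# Locate the required command-line tools; Sys.which() returns "" when not found.
required_tools <- c("samtools", "awk", "sed", "sort", "uniq")
tool_paths <- Sys.which(required_tools)
missing_tools <- names(tool_paths)[tool_paths == ""]
if (length(missing_tools) > 0) {
  message("Missing tools: ", paste(missing_tools, collapse = ", "))
}

# Hypothetical helper: compare the reported samtools version with a minimum.
# numeric_version() compares component-wise, so 1.9 is older than 1.15.
samtools_version_ok <- function(version_line, minimum = "1.15") {
  version <- sub("^samtools +([0-9]+(\\.[0-9]+)*).*$", "\\1", version_line)
  numeric_version(version) >= numeric_version(minimum)
}

# Only query samtools when it was actually found on the PATH.
if (!("samtools" %in% missing_tools)) {
  first_line <- system2("samtools", "--version", stdout = TRUE)[1]
  message("samtools >= 1.15: ", samtools_version_ok(first_line))
}
```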

Usage

write_sparse_matrix_from_bam(
  bam_file,
  out_file,
  cell_names,
  region,
  genome = NULL,
  min_mapq = NULL,
  samtools_bin = "samtools",
  samtools_opts = NULL,
  num_reads = NULL,
  verbose = TRUE,
  overwrite = FALSE
)

Arguments

bam_file

name of the bam file

out_file

name of the output file. A ".gz" extension will be added to the file name.

cell_names

a vector with the cell names or an ATAC object

region

intervals set used to filter the reads. See the samtools documentation (http://www.htslib.org/doc/samtools-view.html) for details

genome

genome name (e.g. hg19). Will be inferred from the ScPeaks object if provided

min_mapq

minimal mapping quality (optional)

samtools_bin

path to samtools executable

samtools_opts

additional options for samtools (e.g. "--subsample 0.1")

num_reads

number of reads (within the region) to process (optional).

verbose

verbose output (optional)

overwrite

overwrite existing files (optional)
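
To illustrate how "region", "min_mapq", "samtools_bin" and "samtools_opts" could map onto a samtools call, here is a minimal sketch that only assembles a command string. `build_pipeline` is a hypothetical helper, not the command the package actually runs. It assumes the region is given as a one-row data frame with 0-based, half-open `chrom`, `start` and `end` columns, and that the BAM file is indexed.

```r
# Hypothetical helper: assemble a shell pipeline resembling steps 1-3.
build_pipeline <- function(bam_file, region, min_mapq = NULL,
                           samtools_bin = "samtools", samtools_opts = NULL) {
  # Assumption: 0-based, half-open input; samtools regions are 1-based, inclusive.
  region_str <- sprintf(
    "%s:%d-%d", as.character(region$chrom), region$start + 1, region$end
  )
  view_args <- c(
    "view",
    "-F 1024",                                      # skip reads flagged as PCR duplicates
    if (!is.null(min_mapq)) paste("-q", min_mapq),  # minimal mapping quality
    samtools_opts,                                  # extra options, e.g. "--subsample 0.1"
    shQuote(bam_file, type = "sh"),
    shQuote(region_str, type = "sh")
  )
  # awk: print the start coordinate (SAM field 4) and the value of the CB tag.
  awk_prog <- "{ for (i = 12; i <= NF; i++) if ($i ~ /^CB:Z:/) print $4, substr($i, 6) }"
  paste(
    samtools_bin, paste(view_args, collapse = " "),
    "|", "awk", shQuote(awk_prog, type = "sh")
  )
}

cmd <- build_pipeline(
  "pbmc_data/possorted_bam.bam",
  region = data.frame(chrom = "chr1", start = 0, end = 1000),
  min_mapq = 30,
  samtools_opts = "--subsample 0.1"
)
cat(cmd, "\n")
```

Keeping the command as a string makes it easy to print and inspect before execution. Reads without a "CB" tag produce no output line in this sketch.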

Value

None. The function is called for its side effect of writing the output file.

Examples

if (FALSE) {
write_sparse_matrix_from_bam(
  "pbmc_data/possorted_bam.bam", "chrom1.mm",
  region = gintervals.all()[1, ], cell_names = atac_sc
)
}