Write a sparse-matrix file from an indexed bam file
Source: R/import-counts.R
This function reads a bam file and writes a sparse-matrix file, where rows are the genomic coordinates and columns are the cells. It does so by:
filtering the bam file by the given region (e.g. a chromosome, using the "region" parameter)
extracting the cell name from the "CB" tag added by cellranger, while excluding reads marked as PCR duplicates (SAM flag 1024)
using awk to extract the start coordinate and the tag of each read
using unix "sort" and "uniq" to count the number of times each coordinate appears for each cell
using an R script to convert the cell names to indices
writing the sparse-matrix file (matrix-market format) and compressing it with gzip
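The last three steps (counting, converting cell names to indices, and writing the file) can be sketched in plain R. This is only an illustration under stated assumptions, not the function's implementation: the `pairs` data frame, the barcodes and the temporary output path are made up, the `Matrix` package stands in for the awk/sort/uniq pipeline, and rows here cover only the observed coordinates.

```r
library(Matrix)

# Hypothetical (start coordinate, cell barcode) pairs, one per read
pairs <- data.frame(
    start = c(100L, 100L, 100L, 250L, 250L, 300L),
    cell = c("AAAC-1", "AAAC-1", "TTTG-1", "AAAC-1", "TTTG-1", "TTTG-1")
)
cell_names <- c("AAAC-1", "CCCA-1", "TTTG-1")

# Keep only reads from known cells and convert barcodes to column indices
pairs <- pairs[pairs$cell %in% cell_names, ]
col_idx <- match(pairs$cell, cell_names)

# Rows are the distinct observed coordinates, in increasing order
coords <- sort(unique(pairs$start))
row_idx <- match(pairs$start, coords)

# sparseMatrix() sums duplicated (i, j) entries, which yields the counts
counts <- sparseMatrix(
    i = row_idx, j = col_idx, x = rep(1, nrow(pairs)),
    dims = c(length(coords), length(cell_names)),
    dimnames = list(as.character(coords), cell_names)
)

# Write in matrix-market format, then compress with gzip
out_file <- tempfile(fileext = ".mtx")
writeMM(counts, out_file)
gz_file <- paste0(out_file, ".gz")
con <- gzfile(gz_file, "w")
writeLines(readLines(out_file), con)
close(con)
```

Summing duplicated entries inside `sparseMatrix()` plays the role of `sort | uniq -c` in the shell pipeline; a cell with no reads in the region (here "CCCA-1") simply ends up as an all-zero column.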
Note (1): In order to use this function, "samtools" (>= 1.15), "awk", "sed", "sort" and "uniq" must be available at the unix command line.
Note (2): The function takes a while to run - around 13 minutes using 24 cores on the PBMC data.
Usage
write_sparse_matrix_from_bam(
  bam_file,
  out_file,
  cell_names,
  region,
  genome = NULL,
  min_mapq = NULL,
  samtools_bin = "samtools",
  samtools_opts = NULL,
  num_reads = NULL,
  verbose = TRUE,
  overwrite = FALSE
)
Arguments
- bam_file
name of the bam file
- out_file
name of the output file. A ".gz" extension will be added to the file name.
- cell_names
a vector with the cell names or an ATAC object
- region
intervals set to filter. See samtools docs http://www.htslib.org/doc/samtools-view.html for details
- genome
genome name (e.g. hg19). Will be inferred from the ScPeaks object if provided
- min_mapq
minimal mapping quality (optional)
- samtools_bin
path to samtools executable
- samtools_opts
additional options for samtools (e.g. "--subsample 0.1")
- num_reads
number of reads (within the region) to process (optional).
- verbose
verbose output (optional)
- overwrite
overwrite existing files (optional)
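A call might look like the following sketch. The file paths, barcodes and parameter values are placeholders chosen for illustration, not taken from the package, and the call is guarded so that the snippet does nothing when the function or the bam file is not available.

```r
# Placeholder inputs (hypothetical); replace with real paths and barcodes
args <- list(
    bam_file = "possorted_bam.bam",
    out_file = "chr1.mtx", # a ".gz" extension is added by the function
    cell_names = c("AAAC-1", "TTTG-1"),
    region = "chr1", # samtools region syntax
    min_mapq = 30,
    overwrite = TRUE
)

# Run only if the function is loaded and the input file exists
if (exists("write_sparse_matrix_from_bam") && file.exists(args$bam_file)) {
    do.call(write_sparse_matrix_from_bam, args)
}
```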