Downsample

TanayLabUtilities.Downsample

— Module

Downsampling of data. The idea is that you have a vector containing a total of some (large) K counts of samples with values 1..N drawn from a multinomial distribution (with different probabilities for getting each of the 1..N values). Generate a vector with a total of some (smaller) k samples. This is typically done to a set of vectors, typically to all columns of a matrix (each with its own K(j)), to get a set of vectors with the same k.

This is useful for meaningfully comparing the vectors (for example, computing correlations between them). Without downsampling, distance measures between such vectors are biased by the sampling depth K. For example, correlations with deeper (higher total samples) vectors will tend to be higher.

Downsampling discards data, so we'd like the target k to be as large as possible. Typically k isn't the minimal K(j), to prevent a few shallowly sampled vectors from ruining the quality of the results; instead, we accept that a small fraction of the vectors will keep their original K(j) samples when this is less than the chosen k.
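To make the trade-off concrete, here is a small sketch in plain Julia. The depths and the target k are made-up illustrative numbers, not values from the package; it only shows how many samples survive and how many vectors stay below the target for a given k.

```julia
# Hypothetical per-vector depths K(j) and a chosen target k (illustrative numbers only).
depths = [120, 800, 950, 1000, 1500, 2400]
k = 900

# After downsampling, each vector holds min(K(j), k) samples.
resulting_depths = min.(depths, k)

# Fraction of the vectors that stay shallower than the target (K(j) < k).
shallow_fraction = count(depths .< k) / length(depths)

# Fraction of all the original samples that survive downsampling to k ...
kept_fraction = sum(resulting_depths) / sum(depths)

# ... compared to what would survive if k were the minimal K(j).
kept_fraction_at_minimum = sum(min.(depths, minimum(depths))) / sum(depths)

println((resulting_depths, shallow_fraction, kept_fraction, kept_fraction_at_minimum))
```

With these numbers, picking k = 900 leaves two of the six vectors below the target but keeps far more of the data than downsampling everything to the shallowest vector's 120 samples.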

TanayLabUtilities.Downsample.downsample

— Function

downsample(
    vector::AbstractVector{<:Integer},
    samples::Integer;
    rng::AbstractRNG = default_rng(),
    output::Maybe{AbstractVector} = nothing,
)::AbstractVector

downsample(
    matrix::AbstractMatrix{<:Integer},
    samples::Integer;
    dims::Integer,
    rng::AbstractRNG = default_rng(),
    output::Maybe{AbstractMatrix} = nothing,
)::AbstractMatrix

Given a vector of integer non-negative data values, return a new vector such that the sum of entries in it is samples. Think of the original vector as containing a number of marbles in each entry. We randomly pick samples marbles from this vector; each time we pick a marble we take it out of the original vector and move it to the same position in the result.

If the sum of the entries of a vector is less than samples, it is copied to the output. If output is not specified, it is allocated automatically using the same element type as the input.
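The marble picture can be sketched literally in plain Julia. The function `marbles_downsample` below is a hypothetical helper written for illustration; it is not part of TanayLabUtilities, and the package may well use a more efficient method. It assumes that a vector whose total does not exceed `samples` is returned as an unchanged copy.

```julia
using Random

# Hypothetical helper (not the package's implementation): downsample by
# explicitly drawing marbles one at a time, without replacement.
function marbles_downsample(rng::AbstractRNG, counts::AbstractVector{<:Integer}, samples::Integer)
    total = sum(counts)
    # Assumption: a vector with no more than `samples` marbles is copied as-is.
    total <= samples && return copy(counts)

    remaining = copy(counts)  # marbles still in the original vector
    result = zero(counts)     # marbles moved to the result so far
    for _ in 1:samples
        # Pick one of the marbles still left, uniformly at random.
        pick = rand(rng, 1:total)
        # Walk the entries to find which position holds the picked marble.
        position = 1
        while pick > remaining[position]
            pick -= remaining[position]
            position += 1
        end
        # Move the marble to the same position in the result.
        remaining[position] -= 1
        result[position] += 1
        total -= 1
    end
    return result
end

counts = [5, 0, 20, 75]
downsampled = marbles_downsample(MersenneTwister(123), counts, 40)
println(downsampled, " sums to ", sum(downsampled))
```

Each draw here scans the vector, so the sketch costs on the order of `samples` times the vector length; it is meant to show the semantics (the result sums to `samples` and no entry exceeds its original count), not to be fast.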

When downsampling a matrix, dims must be specified: either 1 (or Rows) to downsample each row separately, or 2 (or Columns) to downsample each column separately.

using Test

# Columns

data = rand(1:100, 10, 5)
samples_per_column = vec(sum(data; dims = 1))

for samples in (100, 250, 500, 750, 1000)
    downsampled = downsample(data, samples; dims = 2)
    downsamples_per_column = vec(sum(downsampled; dims = 1))
    @test all(downsamples_per_column .== min.(samples_per_column, samples))
    too_small_mask = samples_per_column .<= samples
    @test all(downsampled[:, too_small_mask] .== data[:, too_small_mask])
end

# Rows

data = flip(data)
samples_per_row = samples_per_column

for samples in (100, 250, 500, 750, 1000)
    downsampled = downsample(data, samples; dims = 1)
    downsamples_per_row = vec(sum(downsampled; dims = 2))
    @test all(downsamples_per_row .== min.(samples_per_row, samples))
    too_small_mask = samples_per_row .<= samples
    @test all(downsampled[too_small_mask, :] .== data[too_small_mask, :])
end

println("OK")

# output

OK

TanayLabUtilities.Downsample.downsamples

— Function

downsamples(
    samples_per_vector::AbstractVector{<:Integer};
    min_downsamples::Integer = 750,
    min_downsamples_quantile::AbstractFloat = 0.05,
    max_downsamples_quantile::AbstractFloat = 0.5,
)::Integer

When downsampling multiple vectors (the amount of data in each available in samples_per_vector), we need to pick a "reasonable" number of samples to downsample to. We have conflicting requirements, so this is a compromise. First, we want most vectors to have at least the target number of samples, so we start with the min_downsamples_quantile of the samples_per_vector. Second, we also want to have at least min_downsamples to ensure we don't throw away too much data even if many vectors are sparse, so we increase the target to this value. Finally, we don't want a target which is too big for too many vectors, so we reduce the result to the max_downsamples_quantile of the samples_per_vector.

Note

The defaults (especially min_downsamples) were chosen to fit our needs (downsampling UMIs of sc-RNA-seq data). You will need to tweak them when using this for other purposes.

downsamples([100, 500, 1000])

# output

500
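The three steps of the compromise can be traced with a short sketch using the standard `Statistics.quantile`. The function `sketch_downsamples` is a hypothetical stand-in written from the description above, not the package's code; in particular, the quantile interpolation and the final rounding to an integer are assumptions.

```julia
using Statistics

# Hypothetical re-statement of the described compromise (not the package's implementation).
function sketch_downsamples(
    samples_per_vector::AbstractVector{<:Integer};
    min_downsamples::Integer = 750,
    min_downsamples_quantile::AbstractFloat = 0.05,
    max_downsamples_quantile::AbstractFloat = 0.5,
)
    # 1. Start from a low quantile, so most vectors have at least this many samples.
    target = quantile(samples_per_vector, min_downsamples_quantile)
    # 2. Raise the target to the minimum, so sparse data doesn't force discarding too much.
    target = max(target, min_downsamples)
    # 3. Cap the target, so it isn't too big for too many vectors.
    target = min(target, quantile(samples_per_vector, max_downsamples_quantile))
    # Assumption: round to the nearest integer.
    return round(Int, target)
end

println(sketch_downsamples([100, 500, 1000]))                # 500
println(sketch_downsamples([1000, 2000, 3000, 4000, 5000]))  # 1200
```

For `[100, 500, 1000]` the 5% quantile is 140, which is raised to 750 and then capped at the median 500, agreeing with the output shown above. For the deeper second input the 5% quantile (1200) already exceeds min_downsamples and is below the median, so it is used as-is.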
