Downsample

TanayLabUtilities.Downsample Module

Downsampling of data. The idea is that you have a vector containing a total of some (large) K counts of samples with values 1..N, drawn from a multinomial distribution (with a different probability for each of the 1..N values). Downsampling generates a vector with a total of some (smaller) k samples. This is typically applied to a set of vectors, most often all the columns of a matrix (each with its own K(j)), to get a set of vectors that all have the same k.

This is useful for meaningfully comparing the vectors (for example, computing correlations between them). Without downsampling, distance measures between such vectors are biased by the sampling depth K. For example, correlations with deeper vectors (ones with a higher total number of samples) will tend to be higher.

Downsampling discards data, so we'd like the target k to be as large as possible. Typically we do not use the minimal K(j) as the target, so that a few shallowly sampled vectors do not ruin the quality of the results; instead, we accept that a small fraction of the vectors will keep their original K(j) samples when this is less than the chosen k.
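The trade-off above can be made concrete with a small sketch. The totals and the two helper functions below are hypothetical and exist only for illustration; they are not part of TanayLabUtilities. The sketch compares a strict target (the minimal K(j)) with a larger one that leaves a few shallow vectors untouched.

```julia
# Hypothetical totals K(j) for eight vectors; two of them are very shallow.
totals = [40, 55, 900, 1100, 1200, 1500, 2000, 3000]

# Fraction of vectors that have fewer than k samples and would therefore
# keep their original data instead of being downsampled.
fraction_kept_as_is(totals, k) = count(<(k), totals) / length(totals)

# Total number of samples that survive when every vector is capped at k.
samples_retained(totals, k) = sum(min.(totals, k))

strict_k = minimum(totals)  # every vector ends up with exactly 40 samples
relaxed_k = 900             # the two shallow vectors stay below the target

println(samples_retained(totals, strict_k))     # 320
println(samples_retained(totals, relaxed_k))    # 5495
println(fraction_kept_as_is(totals, relaxed_k)) # 0.25
```

With the strict target almost all the data is discarded, while the relaxed target retains far more at the cost of a quarter of the (hypothetical) vectors having a lower depth than the rest.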

TanayLabUtilities.Downsample.downsample Function
downsample(
    vector::AbstractVector{<:Integer},
    samples::Integer;
    rng::AbstractRNG = default_rng(),
    output::Maybe{AbstractVector} = nothing,
)::AbstractVector

downsample(
    matrix::AbstractMatrix{<:Integer},
    samples::Integer;
    dims::Integer,
    rng::AbstractRNG = default_rng(),
    output::Maybe{AbstractMatrix} = nothing,
)::AbstractMatrix

Given a vector of non-negative integer data values, return a new vector such that the sum of its entries is samples. Think of the original vector as containing a number of marbles in each entry. We randomly pick samples marbles from this vector; each time we pick a marble, we take it out of the original vector and move it to the same position in the result.

If the sum of the entries of a vector is less than samples, it is copied to the output unchanged. If output is not specified, it is allocated automatically using the same element type as the input.
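The marble picture can be written down directly as a naive sketch. This is not the package's implementation (which may use a different and more efficient algorithm); the function name naive_downsample is hypothetical, and the sketch allocates one array element per marble, so it is only practical for small totals.

```julia
using Random

# Hypothetical helper, NOT part of TanayLabUtilities: a literal version of
# the "marbles" description, sampling without replacement.
function naive_downsample(
    vector::AbstractVector{<:Integer},
    samples::Integer;
    rng::AbstractRNG = Random.default_rng(),
)
    total = sum(vector)
    # Not enough marbles to downsample: return the data unchanged.
    if total <= samples
        return copy(vector)
    end
    # One array element per marble, labelled by the position it came from.
    marbles = Vector{Int}(undef, total)
    next = 1
    for (position, count) in enumerate(vector)
        marbles[next:(next + count - 1)] .= position
        next += count
    end
    # A random permutation, truncated to `samples`, picks marbles without replacement.
    shuffle!(rng, marbles)
    result = zeros(eltype(vector), length(vector))
    for position in view(marbles, 1:samples)
        result[position] += 1
    end
    return result
end

original = [10, 0, 5, 85]
picked = naive_downsample(original, 20; rng = MersenneTwister(123))
println(sum(picked))  # 20
```

Because marbles are removed as they are picked, no entry of the result can exceed the corresponding entry of the input, and entries that start at zero stay at zero.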

When downsampling a matrix, dims must be specified: either 1 / Rows to downsample each row separately, or 2 / Columns to downsample each column separately. Note that this is the reverse of reductions such as sum, where dims = 1 yields one value per column.

using Test

# Columns

data = rand(1:100, 10, 5)
samples_per_column = vec(sum(data; dims = 1))

for samples in (100, 250, 500, 750, 1000)
    downsampled = downsample(data, samples; dims = 2)
    downsamples_per_column = vec(sum(downsampled; dims = 1))
    @test all(downsamples_per_column .== min.(samples_per_column, samples))
    too_small_mask = samples_per_column .<= samples
    @test all(downsampled[:, too_small_mask] .== data[:, too_small_mask])
end

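# Rows

# After flipping, the old columns are the new rows,
# so the per-row totals equal the earlier per-column totals.
data = flip(data)
samples_per_row = samples_per_column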

for samples in (100, 250, 500, 750, 1000)
    downsampled = downsample(data, samples; dims = 1)
    downsamples_per_row = vec(sum(downsampled; dims = 2))
    @test all(downsamples_per_row .== min.(samples_per_row, samples))
    too_small_mask = samples_per_row .<= samples
    @test all(downsampled[too_small_mask, :] .== data[too_small_mask, :])
end

println("OK")

# output

OK

TanayLabUtilities.Downsample.downsamples Function
downsamples(
    samples_per_vector::AbstractVector{<:Integer};
    min_downsamples::Integer = 750,
    min_downsamples_quantile::AbstractFloat = 0.05,
    max_downsamples_quantile::AbstractFloat = 0.5,
)::Integer

When downsampling multiple vectors (the amount of data available in each is given in samples_per_vector), we need to pick a "reasonable" number of samples to downsample to. The requirements conflict, so the result is a compromise. First, we want most vectors to have at least the target number of samples, so we start with the min_downsamples_quantile of the samples_per_vector. Second, we want the target to be at least min_downsamples, so that we don't throw away too much data even if many vectors are sparse; we therefore increase the target to this value if it is lower. Finally, we don't want a target that is too big for too many vectors, so we reduce the result to the max_downsamples_quantile of the samples_per_vector if it is higher.
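The three steps can be sketched as a clamp between two quantiles. The function pick_downsamples below is a hypothetical stand-in written from the description above, not the package's code; in particular, rounding the result to the nearest integer is an assumption.

```julia
using Statistics

# Hypothetical re-statement of the three steps described above,
# NOT the implementation used by TanayLabUtilities.
function pick_downsamples(
    samples_per_vector::AbstractVector{<:Integer};
    min_downsamples::Integer = 750,
    min_downsamples_quantile::AbstractFloat = 0.05,
    max_downsamples_quantile::AbstractFloat = 0.5,
)
    # Step 1: start from a low quantile, so most vectors reach the target.
    target = quantile(samples_per_vector, min_downsamples_quantile)
    # Step 2: raise the target to the floor, to avoid discarding too much data.
    target = max(target, min_downsamples)
    # Step 3: cap the target, so it is not too big for too many vectors.
    target = min(target, quantile(samples_per_vector, max_downsamples_quantile))
    # Assumption: round to the nearest integer.
    return round(Int, target)
end

println(pick_downsamples([100, 500, 1000]))  # 500
```

For [100, 500, 1000] the 0.05 quantile is 140, the floor raises it to 750, and the median of 500 caps it, which agrees with the result of 500 in the documented example.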

Note

The defaults (especially min_downsamples) were chosen to fit our needs (downsampling UMIs of scRNA-seq data). You will need to tweak them when using this for other purposes.

downsamples([100, 500, 1000])

# output

500
