Downsample
TanayLabUtilities.Downsample
—
Module
Downsampling of data. The idea is that you have a vector containing a total of some (large)
K
counts of samples with values
1..N
drawn from a multinomial distribution (with different probabilities for getting each of the
1..N
values). Generate a vector with a total of some (smaller)
k
samples. This is typically done to a set of vectors, usually all the columns of a matrix (each with its own
K(j)
), to get a set of vectors with the same
k
.
This is useful for meaningfully comparing the vectors (for example, computing correlations between them). Without downsampling, distance measures between such vectors are biased by the sampling depth
K
. For example, correlations with deeper (higher total samples) vectors will tend to be higher.
Downsampling discards data so we'd like the target
k
to be as large as possible. Typically this isn't the minimal
K(j)
, to prevent a few shallowly sampled vectors from ruining the quality of the results; we accept that a small fraction of the vectors will keep their original
K(j)
samples when this is less than the chosen
k
.
TanayLabUtilities.Downsample.downsample
—
Function
downsample(
vector::AbstractVector{<:Integer},
samples::Integer;
rng::AbstractRNG = default_rng(),
output::Maybe{AbstractVector} = nothing,
)::AbstractVector
downsample(
matrix::AbstractMatrix{<:Integer},
samples::Integer;
dims::Integer,
rng::AbstractRNG = default_rng(),
output::Maybe{AbstractMatrix} = nothing,
)::AbstractMatrix
Given a
vector
of non-negative integer data values, return a new vector such that the sum of entries in it is
samples
. Think of the original vector as containing a number of marbles in each entry. We randomly pick
samples
marbles from this vector; each time we pick a marble we take it out of the original vector and move it to the same position in the result.
If the sum of the entries of a vector is less than
samples
, it is copied to the output. If
output
is not specified, it is allocated automatically using the same element type as the input.
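The marble analogy above can be sketched in a few lines of plain Julia. This is only an illustration of sampling without replacement, not the package's implementation: the name sketch_downsample is hypothetical, it handles a single vector only, it ignores the output parameter, and it costs roughly samples × length(vector) steps, which a real implementation would likely avoid.

```julia
using Random

# Hypothetical helper illustrating the "marbles" description; not the package's code.
function sketch_downsample(
    vector::AbstractVector{<:Integer},
    samples::Integer;
    rng::AbstractRNG = Random.default_rng(),
)
    total = sum(vector)
    # Not enough marbles to draw from: return an unmodified copy.
    if total <= samples
        return collect(vector)
    end

    remaining = collect(vector)  # marbles still in the original vector
    result = zeros(eltype(vector), length(vector))
    remaining_total = total

    for _ in 1:samples
        # Pick one of the remaining marbles uniformly at random.
        marble = rand(rng, 1:remaining_total)
        # Walk the entries to find which position holds that marble.
        position = 1
        while marble > remaining[position]
            marble -= remaining[position]
            position += 1
        end
        # Move the marble to the same position in the result.
        remaining[position] -= 1
        result[position] += 1
        remaining_total -= 1
    end

    return result
end

counts = [5, 0, 20, 75]
downsampled = sketch_downsample(counts, 10; rng = MersenneTwister(1))
@assert sum(downsampled) == 10
@assert all(downsampled .<= counts)
```

Because each marble is removed once picked, no entry of the result can exceed the matching entry of the input, and empty entries stay empty.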
When downsampling a
matrix
, then
dims
must be specified to be
1
/
Rows
to separately downsample each row, or
2
/
Columns
to separately downsample each column.
using Test
# Columns
data = rand(1:100, 10, 5)
samples_per_column = vec(sum(data; dims = 1))
for samples in (100, 250, 500, 750, 1000)
downsampled = downsample(data, samples; dims = 2)
downsamples_per_column = vec(sum(downsampled; dims = 1))
@test all(downsamples_per_column .== min.(samples_per_column, samples))
too_small_mask = samples_per_column .<= samples
@test all(downsampled[:, too_small_mask] .== data[:, too_small_mask])
end
# Rows
data = flip(data)
samples_per_row = samples_per_column
for samples in (100, 250, 500, 750, 1000)
downsampled = downsample(data, samples; dims = 1)
downsamples_per_row = vec(sum(downsampled; dims = 2))
@test all(downsamples_per_row .== min.(samples_per_row, samples))
too_small_mask = samples_per_row .<= samples
@test all(downsampled[too_small_mask, :] .== data[too_small_mask, :])
end
println("OK")
# output
OK
TanayLabUtilities.Downsample.downsamples
—
Function
downsamples(
samples_per_vector::AbstractVector{<:Integer};
min_downsamples::Integer = 750,
min_downsamples_quantile::AbstractFloat = 0.05,
max_downsamples_quantile::AbstractFloat = 0.5,
)::Integer
When downsampling multiple vectors (with the amount of data in each given in
samples_per_vector
), we need to pick a "reasonable" number of samples to downsample to. We have conflicting requirements, so this is a compromise. First, we want most vectors to have at least the target number of samples, so we start with the
min_downsamples_quantile
of the
samples_per_vector
. Second, we also want to have at least
min_downsamples
to ensure we don't throw away too much data even if many vectors are sparse, so we increase the target to this value. Finally, we don't want a target which is too big for too many vectors, so we reduce the result to the
max_downsamples_quantile
of the
samples_per_vector
.
The defaults (especially
min_downsamples
) were chosen to fit our needs (downsampling UMIs of sc-RNA-seq data). You will need to tweak them when using this for other purposes.
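The three-step compromise described above amounts to clamping a floor between two quantiles. The following is a minimal sketch under stated assumptions: the name sketch_downsamples is hypothetical, the quantiles use the default linear interpolation of Statistics.quantile, and the final rounding to an integer is a guess; the package may differ in these details.

```julia
using Statistics

# Hypothetical helper mirroring the described steps; not the package's code.
function sketch_downsamples(
    samples_per_vector::AbstractVector{<:Integer};
    min_downsamples::Integer = 750,
    min_downsamples_quantile::AbstractFloat = 0.05,
    max_downsamples_quantile::AbstractFloat = 0.5,
)
    # Step 1: start from a low quantile so most vectors reach the target.
    low = quantile(samples_per_vector, min_downsamples_quantile)
    # Step 2: raise the target to the floor so not too much data is discarded.
    target = max(low, min_downsamples)
    # Step 3: cap the target so it is not too big for too many vectors.
    high = quantile(samples_per_vector, max_downsamples_quantile)
    target = min(target, high)
    # Assumption: round to the nearest integer.
    return round(Int, target)
end

@assert sketch_downsamples([100, 500, 1000]) == 500
```

For [100, 500, 1000], the 0.05 quantile is 140, the floor raises it to 750, and the 0.5 quantile caps it at 500.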
downsamples([100, 500, 1000])
# output
500