Slanter

Slanter Module

Reorder matrices rows and columns to move high values close to the diagonal.

Note

Instead of providing the limited oclust (like the R version), this includes ehclust , an extended version of [hclust]*https://github.com/JuliaStats/Clustering.jl/blob/master/src/hclust.jl).

Slanter.slanted_orders Function
function slanted_orders(
    data::AbstractMatrix{<:Real};
    order_rows::Bool=true,
    order_cols::Bool=true,
    squared_order::Bool=true,
    same_order::Bool=false,
    discount_outliers::Bool=true,
    max_spin_count::Integer=10
)::Tuple{AbstractVector{<:Integer}, AbstractVector{<:Integer}}

Compute rows and columns orders which move high values close to the diagonal.

For a matrix expressing the cross-similarity between two (possibly different) sets of entities, this produces better results than clustering. This is because clustering does not care about the order of each two sub-partitions. That is, clustering is as happy with ((2, 1), (4, 3)) as it is with the more sensible ((1, 2), (3, 4)) . As a result, visualizations of similarities using naive clustering can be misleading.

This situation is worse in Python or R than it is in Julia, which mercifully provides the :barjoseph method to reorder the branches. Still, the method here may be "more suitable" in specific circumstances (when the data depicts a clear gradient).

Parameters

  • data : A rectangular matrix containing non-negative values (may be negative if squared_order ).
  • order_rows : Whether to reorder the rows.
  • order_cols : Whether to reorder the columns.
  • squared_order : Whether to reorder to minimize the l2 norm (otherwise minimizes the l1 norm).
  • same_order : Whether to apply the same order to both rows and columns.
  • discount_outliers : Whether to do a final order phase discounting outlier values far from the diagonal.
  • max_spin_count : How many times to retry improving the solution before giving up.

Returns

A tuple with two vectors, which contain the order of the rows and the columns.

Slanter.slanted_reorder Function
slanted_reorder(
    data::AbstractMatrix{T};
    order_data::Union{AbstractMatrix{<:Real}, Nothing}=nothing,
    order_rows::Bool=true,
    order_cols::Bool=true,
    squared_order::Bool=true,
    same_order::Bool=false,
    discount_outliers::Bool=true
)::AbstractMatrix{T} where {T<:Real}

Reorder data rows and columns to move high values close to the diagonal.

Given a matrix expressing the cross-similarity between two (possibly different) sets of entities, this uses slanted_orders to compute the "best" order for visualizing the matrix, then returns the reordered data.

Parameters

  • data : A rectangular matrix to reorder, of non-negative values (unless order_data is specified, or squared_order )).
  • order_data : An optional matrix of non-negative values of the same size to use for computing the orders (may be negative if squared_order ).
  • order_rows : Whether to reorder the rows.
  • order_cols : Whether to reorder the columns.
  • squared_order : Whether to reorder to minimize the l2 norm (otherwise minimizes the l1 norm).
  • same_order : Whether to apply the same order to both rows and columns.
  • discount_outliers : Whether to do a final order phase discounting outlier values far from the diagonal.

Returns

A matrix of the same shape whose rows and columns are a permutation of the input.

Slanter.reorder_hclust Function
reorder_hclust(clusters, order)

Given a clustering of some data, and some ideal order we'd like to use to visualize it, reorder (but do not modify) the clustering to be as consistent as possible with this ideal order.

Parameters

  • clusters : The existing clustering of the data.
  • order : The ideal order we'd like to see the data in.

Returns

A reordered clustering which is consistent, wherever possible, with the ideal order.

Slanter.EnhancedHclust Module

Enhancements of [hclust]*https://github.com/JuliaStats/Clustering.jl/blob/master/src/hclust.jl)

Slanter.EnhancedHclust.ehclust Function
ehclust(d::AbstractMatrix; [linkage], [uplo], [branchorder], [order]) -> Hclust

Enhanced [hclust]*https://github.com/JuliaStats/Clustering.jl/blob/master/src/hclust.jl). This is similar to hclust with the following extensions:

  • If branchorder is a vector of Real numbers, one per leaf, then we reorder the branches so that each leaf position would be as close as possible to its branchorder value. Technically we compute a center of gravity for each node and reorder the tree such that that at each branch, the left sub-tree center of gravity is to the left (lower than) the center of gravity of the right sub-tree.
  • If order is specified, it must be a permutation of the 1:N leaf indices. This will be the final order of the result; that is, we constrain the tree so that each node covers a continuous range of leaves (by this order). If you specify an explicit branchorder , this will rotate some nodes so the result will no longer be in the specified order, but the tree is still constrained as above.
  • If groups is a vector of strings, then we first cluster all the entries for each group together, then combine the results. This is mutually exclusive with specifying an order .
  • If groups is a vector of integers, they are expected to cover a range 1:N. We again cluster each group separately, and then cluster the groups enforcing them to be in ascending order. Applying branchorder in this case will only reorder branches inside each group, preserving the ascending order between the groups.

The groups and order parameters are mutually exclusive.

using Test
using Clustering
using Distances

data = rand(4, 10)
distances = pairwise(Euclidean(), data; dims = 2)
result = hclust(distances)
eresult = ehclust(distances)
@test result.order == eresult.order

result = hclust(distances; linkage = :ward)
eresult = ehclust(distances; linkage = :ward)
@test result.order == eresult.order

println("OK")

# output

OK

using Test
using Distances

data = rand(4, 10)
distances = pairwise(Euclidean(), data; dims = 2)
positions = rand(10)
result = ehclust(distances; branchorder = positions)
merges_data = Vector{Tuple{Int, Float64}}(undef, 9)
for merge_index in 1:9
    left = result.merges[merge_index, 1]
    if left < 0
        left_size = 1
        left_center = positions[-left]
    else
        left_size, left_center = merges_data[left]
    end

    right = result.merges[merge_index, 2]
    if right < 0
        right_size = 1
        right_center = positions[-right]
    else
        right_size, right_center = merges_data[right]
    end

    @test left_center <= right_center
    merged_size = left_size + right_size
    merged_center = (left_center * left_size + right_center * right_size) / merged_size
    merges_data[merge_index] = (merged_size, merged_center)
end

println("OK")

# output

OK

using Test
using Distances
using Random

data = rand(4, 10)
distances = pairwise(Euclidean(), data; dims = 2)
order = collect(1:10)
shuffle!(order)
result = ehclust(distances; order)
@test result.order == order
result = ehclust(distances; branchorder = :r, order)
@test result.order != order

println("OK")

# output

OK

using Test
using Distances
using Random

data = rand(4, 10)
distances = pairwise(Euclidean(), data; dims = 2)
groups = ["A", "B"][rand(1:2, 10)]
result = ehclust(distances; groups)
a_indices = findall(groups[result.order] .== "A")
b_indices = findall(groups[result.order] .== "B")
@assert maximum(a_indices) < minimum(b_indices) || maximum(b_indices) < minimum(a_indices)

println("OK")

# output

OK

using Test
using Distances
using Random

data = rand(4, 10)
distances = pairwise(Euclidean(), data; dims = 2)

groups = rand(1:2, 10)
result = ehclust(distances; groups)
one_indices = findall(groups[result.order] .== 1)
two_indices = findall(groups[result.order] .== 2)
@assert maximum(one_indices) < minimum(two_indices)

groups = 3 .- groups
result = ehclust(distances; groups)
one_indices = findall(groups[result.order] .== 1)
two_indices = findall(groups[result.order] .== 2)
@assert maximum(one_indices) < minimum(two_indices)

bresult = ehclust(distances; branchorder = :r, groups)
@assert bresult.order != result.order
one_indices = findall(groups[bresult.order] .== 1)
two_indices = findall(groups[bresult.order] .== 2)
@assert maximum(one_indices) < minimum(two_indices)

println("OK")

# output

OK

Index