kmeans++ with return value similar to R kmeans

TGL_kmeans(
  df,
  k,
  metric = "euclid",
  max_iter = 40,
  min_delta = 0.0001,
  verbose = FALSE,
  keep_log = FALSE,
  id_column = FALSE,
  reorder_func = "hclust",
  hclust_intra_clusters = FALSE,
  seed = NULL,
  use_cpp_random = FALSE
)

Arguments

df

a data frame or a matrix. Each row is a single observation and each column is a dimension. The first column can contain an id for each observation (if id_column is TRUE); otherwise the rownames are used.

k

number of clusters. Note that in some cases the algorithm might return fewer clusters than k.

metric

distance metric for kmeans++ seeding. Can be 'euclid', 'pearson' or 'spearman'.

max_iter

maximal number of iterations

min_delta

minimal change in assignments (fraction out of all observations) to continue iterating
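
The stopping rule implied by max_iter and min_delta can be sketched in base R as below. This is only an illustration under assumptions: should_continue is a hypothetical helper, not part of the package, and whether the package compares with >= or > is a guess.

```r
# Hypothetical helper (not part of the package): decide whether another
# iteration is warranted, given how many observations changed cluster.
should_continue <- function(n_changed, n_obs, iter,
                            max_iter = 40, min_delta = 0.0001) {
    frac_changed <- n_changed / n_obs # fraction out of all observations
    iter < max_iter && frac_changed >= min_delta
}

# 3 of 500 observations changed cluster: 0.006 >= 0.0001, keep iterating
should_continue(n_changed = 3, n_obs = 500, iter = 1)

# no observation changed cluster: stop
should_continue(n_changed = 0, n_obs = 500, iter = 2)
```

With the default min_delta = 0.0001, a data set of 500 observations would keep iterating as long as at least one assignment changes, since 1 / 500 = 0.002.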

verbose

display algorithm messages

keep_log

keep algorithm messages in 'log' field

id_column

df's first column contains the observation id. If not set and the first column is character or factor, it will be automatically used as the ID column (with a warning).
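
The two input layouts and the automatic ID detection described above can be illustrated with base R. The data and the helper first_col_looks_like_id are hypothetical; the helper only mirrors the documented rule (first column is character or factor) and is not the package's own check.

```r
# Layout 1: observation ids in the first column (use id_column = TRUE)
df_ids <- data.frame(
    id = c("a", "b", "c"),
    V1 = c(1.1, 0.9, 3.0),
    V2 = c(1.0, 1.2, 2.9)
)

# Layout 2: observation ids in the rownames (use id_column = FALSE)
df_rownames <- data.frame(
    V1 = c(1.1, 0.9, 3.0),
    V2 = c(1.0, 1.2, 2.9),
    row.names = c("a", "b", "c")
)

# Hypothetical helper mirroring the documented rule: a character or
# factor first column is treated as the ID column.
first_col_looks_like_id <- function(df) {
    is.character(df[[1]]) || is.factor(df[[1]])
}

first_col_looks_like_id(df_ids)
first_col_looks_like_id(df_rownames)
```

Setting id_column explicitly avoids relying on the automatic detection and the warning it emits.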

reorder_func

function to reorder the clusters. Operates on each center and orders the clusters by the result, e.g. reorder_func = mean would calculate the mean of each center and reorder the clusters accordingly. If reorder_func = "hclust" (the default), the centers are ordered by hclust of the euclidean distance of the correlation matrix, i.e. hclust(dist(cor(t(centers)))). If NULL, no reordering is done.
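
The two reordering modes can be sketched on a toy matrix of centers using base R only. The centers are made up, and taking $order from the hclust result is an assumption about how the dendrogram is turned into a cluster order.

```r
# Toy centers: 4 clusters (rows) in 3 dimensions (columns)
centers <- rbind(
    c(1, 2, 3),
    c(5, 4, 3),
    c(0, 2, 1),
    c(2, 4, 9)
)

# reorder_func = mean: score each center by its mean, then sort by the score
ord_mean <- order(apply(centers, 1, mean))

# hclust ordering: correlate the centers with each other, compute euclidean
# distances between the rows of that correlation matrix, cluster them, and
# take the dendrogram leaf order (assumed here to be the final order)
ord_hclust <- hclust(dist(cor(t(centers))))$order

centers[ord_mean, ]
centers[ord_hclust, ]
```

The mean-based order sorts clusters by overall level, while the hclust-based order places centers with similar profiles next to each other.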

hclust_intra_clusters

run hierarchical clustering within each cluster and return an ordering of the observations.

seed

seed for the c++ random number generator

use_cpp_random

use the c++ random number generator instead of R's. This should be used only for backwards compatibility, as from version 0.4.0 onwards the default random number generator was changed to R's.

Value

list with the following components:

cluster:

A vector of integers (from ‘1:k’) indicating the cluster to which each point is allocated.

centers:

A matrix of cluster centers.

size:

The number of points in each cluster.

log:

messages from the algorithm run (only if keep_log = TRUE).

order:

A vector of integers with the new ordering of the observations (only if hclust_intra_clusters = TRUE).
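
Since the return value is modelled on R's kmeans, the relationship between cluster, size and centers can be illustrated with stats::kmeans as a stand-in. This sketch does not call TGL_kmeans, and it assumes that centers are column-wise means of the assigned observations, which holds for stats::kmeans but is not confirmed here for every metric of TGL_kmeans.

```r
set.seed(1)

# Toy data: two well-separated groups in 2 dimensions
x <- rbind(
    matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
    matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)

# stats::kmeans is used as a stand-in here, not TGL_kmeans
km <- kmeans(x, centers = 2)

# `size` is the tabulated `cluster` vector
sizes_from_cluster <- as.vector(table(km$cluster))

# each row of `centers` is the column-wise mean of that cluster's observations
centers_from_cluster <- rowsum(x, km$cluster) / sizes_from_cluster

sizes_from_cluster
centers_from_cluster
```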

Examples


# create 5 clusters normally distributed around 1:5
d <- simulate_data(
    n = 100,
    sd = 0.3,
    nclust = 5,
    dims = 2,
    add_true_clust = FALSE,
    id_column = FALSE
)

head(d)
#>          V1        V2
#> 1 1.0765951 0.7643702
#> 2 0.2688209 0.6829789
#> 3 0.9983286 0.7613376
#> 4 1.1864658 0.4731174
#> 5 1.3445235 0.7928386
#> 6 0.4534547 0.8324374

# cluster
km <- TGL_kmeans(d, k = 5, metric = "euclid", verbose = TRUE)
#> will generate seeds
#> generating seeds
#> at seed 0
#> add new core from 207 to 0
#> at seed 1
#> done update min distance
#> seed range 350 450
#> picked up 29
#> add new core from 29 to 1
#> at seed 2
#> done update min distance
#> seed range 300 400
#> picked up 430
#> add new core from 430 to 2
#> at seed 3
#> done update min distance
#> seed range 250 350
#> picked up 363
#> add new core from 363 to 3
#> at seed 4
#> done update min distance
#> seed range 200 300
#> picked up 131
#> add new core from 131 to 4
#> reassign after init
#> iter 0
#> iter 1 changed 3
#> iter 1
#> iter 2 changed 0
names(km)
#> [1] "cluster" "centers" "size"   
km$centers
#>            V1       V2
#> [1,] 2.008681 2.071147
#> [2,] 3.971403 4.040806
#> [3,] 4.974842 5.043703
#> [4,] 2.963238 3.015011
#> [5,] 1.027979 1.038583
head(km$cluster)
#> 1 2 3 4 5 6 
#> 5 5 5 5 5 5 
km$size
#>   1   2   3   4   5 
#> 101  98  99 101 101