kmeans++ with return value similar to R kmeans

TGL_kmeans(
  df,
  k,
  metric = "euclid",
  max_iter = 40,
  min_delta = 0.0001,
  verbose = FALSE,
  keep_log = FALSE,
  id_column = FALSE,
  reorder_func = "hclust",
  hclust_intra_clusters = FALSE,
  seed = NULL,
  use_cpp_random = FALSE
)

Arguments

df

a data frame or a matrix. Each row is a single observation and each column is a dimension. The first column can contain an id for each observation (if id_column is TRUE); otherwise the rownames are used.

k

number of clusters. Note that in some cases the algorithm might return fewer clusters than k.

metric

distance metric for kmeans++ seeding. Can be 'euclid', 'pearson' or 'spearman'.

max_iter

maximal number of iterations

min_delta

minimal change in assignments (fraction out of all observations) to continue iterating

verbose

display algorithm messages

keep_log

keep algorithm messages in 'log' field

id_column

df's first column contains the observation id

reorder_func

function to reorder the clusters. It operates on each center, and the clusters are ordered by the result. For example, reorder_func = mean would calculate the mean of each center and then reorder the clusters accordingly. If reorder_func = hclust, the centers are ordered by hclust of the euclidean distance of the correlation matrix, i.e. hclust(dist(cor(t(centers)))). If NULL, no reordering is done.

hclust_intra_clusters

run hierarchical clustering within each cluster and return an ordering of the observations.

seed

seed for the c++ random number generator

use_cpp_random

use the c++ random number generator instead of R's. This should be used only for backwards compatibility, as from version 0.4.0 onwards the default random number generator was changed to R.
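The two reordering modes described under reorder_func can be sketched in base R. This is an illustration only, not the package's internal code: the centers matrix is made up, and sorting the per-center scores in ascending order is an assumption, since the description above does not state the direction.

```r
# Hypothetical 4 x 3 matrix of cluster centers
# (rows = clusters, columns = dimensions); values are made up for illustration
centers <- rbind(
  c(1.0, 2.0, 3.0),
  c(3.0, 1.0, 1.0),
  c(1.1, 2.1, 2.9),
  c(5.0, 4.0, 0.5)
)

# reorder_func = mean: score each center by its mean, then sort the clusters
# by that score (ascending order is an assumption)
ord_mean <- order(apply(centers, 1, mean))

# reorder_func = "hclust": leaf order of the dendrogram built from
# hclust(dist(cor(t(centers)))), the expression quoted in the argument text
ord_hclust <- hclust(dist(cor(t(centers))))$order

ord_mean
ord_hclust
```

Both results are permutations of the cluster indices; applying one as `centers[ord, ]` gives the centers in their new order.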

Value

list with the following components:

cluster:

A vector of integers (from ‘1:k’) indicating the cluster to which each point is allocated.

centers:

A matrix of cluster centers.

size:

The number of points in each cluster.

log:

messages from the algorithm run (only if keep_log = TRUE).

order:

A vector of integers with the new ordering of the observations (only if hclust_intra_clusters = TRUE).
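For comparison, the components cluster, centers and size are also present in the list returned by base R's stats::kmeans. A minimal sketch using only base R (the data here are made up, and TGL_kmeans itself is not called):

```r
set.seed(1)
# Two well-separated hypothetical groups of 2-D points (20 points each)
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)

km_base <- stats::kmeans(x, centers = 2)

# Components that the Value section above also lists
shared <- c("cluster", "centers", "size")
shared %in% names(km_base)

# size is the tabulated cluster vector
all(km_base$size == as.vector(table(km_base$cluster)))
```

stats::kmeans returns additional components (for example totss and withinss) that are not listed in the Value section above.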

Examples


# create 5 clusters normally distributed around 1:5
d <- simulate_data(
    n = 100,
    sd = 0.3,
    nclust = 5,
    dims = 2,
    add_true_clust = FALSE,
    id_column = FALSE
)

head(d)
#>          V1        V2
#> 1 1.0765951 0.7643702
#> 2 0.2688209 0.6829789
#> 3 0.9983286 0.7613376
#> 4 1.1864658 0.4731174
#> 5 1.3445235 0.7928386
#> 6 0.4534547 0.8324374

# cluster
km <- TGL_kmeans(d, k = 5, metric = "euclid", verbose = TRUE)
#> will generate seeds
#> generating seeds
#> at seed 0
#> add new core from 207 to 0
#> at seed 1
#> done update min distance
#> seed range 350 450
#> picked up 29 dist was 1.46705
#> add new core from 29 to 1
#> at seed 2
#> done update min distance
#> seed range 300 400
#> picked up 430 dist was 1.53801
#> add new core from 430 to 2
#> at seed 3
#> done update min distance
#> seed range 250 350
#> picked up 363 dist was 0.700502
#> add new core from 363 to 3
#> at seed 4
#> done update min distance
#> seed range 200 300
#> picked up 131 dist was 0.49677
#> add new core from 131 to 4
#> reassign after init
#> iter 0
#> iter 1 changed 3
#> iter 1
#> iter 2 changed 0
names(km)
#> [1] "cluster" "centers" "size"   
km$centers
#>            V1       V2
#> [1,] 2.008681 2.071147
#> [2,] 3.971403 4.040806
#> [3,] 4.974842 5.043703
#> [4,] 2.963238 3.015011
#> [5,] 1.027979 1.038583
head(km$cluster)
#> 1 2 3 4 5 6 
#> 5 5 5 5 5 5 
km$size
#>   1   2   3   4   5 
#> 101  98  99 101 101
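The verbose log above can be read against min_delta with a little arithmetic. The sketch below assumes that iteration continues while the fraction of reassigned observations is at least min_delta; the exact comparison used internally is not stated in this page, so treat the `>=` as an assumption.

```r
# Values read from the verbose log and km$size above
n_obs     <- sum(c(101, 98, 99, 101, 101))  # 500 observations in total
changed   <- c(3, 0)                        # "iter 1 changed 3", "iter 2 changed 0"
min_delta <- 1e-04                          # default shown in the usage section

# Fraction of observations whose assignment changed in each iteration
frac_changed <- changed / n_obs

# Assumption: keep iterating while the fraction is at least min_delta
keep_going <- frac_changed >= min_delta
keep_going
#> [1]  TRUE FALSE
```

Under this reading, 3 changes out of 500 (0.006) exceed the default threshold, while 0 changes do not, which is consistent with the log ending after the second pass.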