kmeans++ with return value similar to R kmeans

```
TGL_kmeans(
  df,
  k,
  metric = "euclid",
  max_iter = 40,
  min_delta = 0.0001,
  verbose = FALSE,
  keep_log = FALSE,
  id_column = FALSE,
  reorder_func = "hclust",
  hclust_intra_clusters = FALSE,
  seed = NULL,
  parallel = getOption("tglkmeans.parallel"),
  use_cpp_random = FALSE
)
```

- df
a data frame or a matrix. Each row is a single observation and each column is a dimension. The first column can contain an id for each observation (if id_column is TRUE); otherwise the rownames are used.

- k
number of clusters. Note that in some cases the algorithm might return fewer clusters than k.

- metric
distance metric for kmeans++ seeding. Can be 'euclid', 'pearson' or 'spearman'.

- max_iter
maximal number of iterations

- min_delta
minimal change in assignments (fraction out of all observations) to continue iterating

- verbose
display algorithm messages

- keep_log
keep algorithm messages in 'log' field

- id_column
`df`'s first column contains the observation id

- reorder_func
function to reorder the clusters. Operates on each center and orders by the result. e.g. `reorder_func = mean` would calculate the mean of each center and then would reorder the clusters accordingly. If `reorder_func = hclust`, the centers would be ordered by hclust of the euclidean distance of the correlation matrix, i.e. `hclust(dist(cor(t(centers))))`. If NULL, no reordering would be done.

- hclust_intra_clusters
run hierarchical clustering within each cluster and return an ordering of the observations.

- seed
seed for the c++ random number generator

- parallel
run the hierarchical clustering of each cluster in parallel (if hclust_intra_clusters is TRUE)

- use_cpp_random
use c++ random number generator instead of R's. This should be used only for backwards compatibility, as from version 0.4.0 onwards the default random number generator was changed to R's.
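
The two `reorder_func` behaviours described above can be illustrated in base R. This is only a sketch on a made-up `centers` matrix: it follows the description given for the argument (apply the function to each center, then order by the result, or take the hclust of the correlation matrix) and is not taken from the package's source. Using the `order` element of the `hclust` result as the final ordering is an assumption.

```r
# Hypothetical centers: one row per cluster, one column per dimension
centers <- rbind(
  c(4, 1, 0, 1),
  c(0, 1, 5, 6),
  c(6, 2, 1, 1),
  c(1, 0, 3, 4)
)

# reorder_func = mean: compute one value per center, then order by it
ord_mean <- order(apply(centers, 1, mean))

# reorder_func = hclust: cluster the centers' correlation matrix
# (assumption: the leaf order of the tree is used as the new order)
hc <- hclust(dist(cor(t(centers))))
ord_hclust <- hc$order

centers[ord_mean, ]    # centers sorted by their mean
centers[ord_hclust, ]  # correlated centers end up next to each other
```

With the mean, clusters are sorted along a single summary value; with hclust, centers with similar profiles are placed next to each other, which tends to be more useful when plotting the centers as a heatmap.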

list with the following components:

- cluster:
A vector of integers (from ‘1:k’) indicating the cluster to which each point is allocated.

- centers:
A matrix of cluster centers.

- size:
The number of points in each cluster.

- log:
messages from the algorithm run (only if `keep_log = TRUE`).

- order:
A vector of integers with the new ordering of the observations (only if hclust_intra_clusters = TRUE).
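
Since the return value is described as similar to R's `kmeans`, the relationship between the `cluster`, `centers` and `size` components can be sketched with `stats::kmeans` as a stand-in. The data below is made up, and the sketch only shows how the three shared components relate to each other; it says nothing about the `log` and `order` fields, which base R does not return.

```r
set.seed(1)
# Two hypothetical groups of 20 two-dimensional points each
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)
km <- stats::kmeans(x, centers = 2)

# cluster: one label per observation
length(km$cluster) == nrow(x)

# centers: one row per cluster, one column per dimension
dim(km$centers)

# size: the number of observations per cluster, i.e. a tabulation of `cluster`
all(as.vector(table(km$cluster)) == km$size)
```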

```
# limit to a single core (only needed for CRAN checks)
tglkmeans.set_parallel(1)
# create 5 clusters normally distributed around 1:5
d <- simulate_data(
  n = 100,
  sd = 0.3,
  nclust = 5,
  dims = 2,
  add_true_clust = FALSE,
  id_column = FALSE
)
head(d)
#> V1 V2
#> 1 0.5799869 0.8838359
#> 2 1.0765951 0.7643702
#> 3 0.2688209 0.6829789
#> 4 0.9983286 0.7613376
#> 5 1.1864658 0.4731174
#> 6 1.3445235 0.7928386
# cluster
km <- TGL_kmeans(d, k = 5, metric = "euclid", verbose = TRUE)
#> will generate seeds
#> generating seeds
#> at seed 0
#> add new core from 184 to 0
#> at seed 1
#> done update min distance
#> seed range 350 450
#> picked up 446 dist was 2.12285
#> add new core from 446 to 1
#> at seed 2
#> done update min distance
#> seed range 300 400
#> picked up 10 dist was 0.853872
#> add new core from 10 to 2
#> at seed 3
#> done update min distance
#> seed range 250 350
#> picked up 336 dist was 0.74596
#> add new core from 336 to 3
#> at seed 4
#> done update min distance
#> seed range 200 300
#> picked up 231 dist was 0.63909
#> add new core from 231 to 4
#> reassign after init
#> iter 0
#> iter 1 changed 2
#> iter 1
#> iter 2 changed 0
names(km)
#> [1] "cluster" "centers" "size"
km$centers
#> V1 V2
#> [1,] 2.964869 3.008410
#> [2,] 3.969391 4.038842
#> [3,] 1.019032 1.034553
#> [4,] 2.000776 2.053331
#> [5,] 4.975310 5.048907
head(km$cluster)
#> 1 2 3 4 5 6
#> 3 3 3 3 3 3
km$size
#> 1 2 3 4 5
#> 102 98 100 101 99
```
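
The `iter ... changed ...` lines in the verbose output above report how many observations switched cluster in each iteration. A minimal base R sketch of how `max_iter` and `min_delta` could drive such a loop is given below. It is not the package's C++ implementation: `lloyd_sketch` is a hypothetical helper, it uses plain euclidean distance with user-supplied starting centers instead of kmeans++ seeding, and stopping once the changed fraction drops below `min_delta` is an assumption based on the argument description.

```r
# Hypothetical helper, for illustration only
lloyd_sketch <- function(x, centers, max_iter = 40, min_delta = 1e-4) {
  x <- as.matrix(x)
  k <- nrow(centers)
  assign <- rep(0L, nrow(x))  # no observation is assigned yet
  for (iter in seq_len(max_iter)) {
    # squared euclidean distance from every observation to every center
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_assign <- max.col(-d, ties.method = "first")
    changed <- sum(new_assign != assign)
    assign <- new_assign
    # recompute the center of every non-empty cluster
    for (j in seq_len(k)) {
      if (any(assign == j)) {
        centers[j, ] <- colMeans(x[assign == j, , drop = FALSE])
      }
    }
    # stop once the fraction of reassigned observations is below min_delta
    if (changed / nrow(x) < min_delta) break
  }
  list(cluster = assign, centers = centers, iterations = iter)
}

set.seed(42)
# Two made-up, well-separated groups of 50 points each
x <- rbind(
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2)
)
res <- lloyd_sketch(x, centers = x[c(1, 51), ])
res$iterations
table(res$cluster)
```

A larger `min_delta` stops earlier at the cost of a few observations possibly sitting in a non-nearest cluster, while `max_iter` bounds the run time when assignments keep oscillating.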