TGL kmeans with 'tidy' output

TGL_kmeans_tidy(
  df,
  k,
  metric = "euclid",
  max_iter = 40,
  min_delta = 0.0001,
  verbose = FALSE,
  keep_log = FALSE,
  id_column = FALSE,
  reorder_func = "hclust",
  add_to_data = FALSE,
  hclust_intra_clusters = FALSE,
  seed = NULL,
  parallel = getOption("tglkmeans.parallel"),
  use_cpp_random = FALSE
)

Arguments

df

a data frame or a matrix. Each row is a single observation and each column is a dimension. The first column can contain an id for each observation (if id_column is TRUE); otherwise the rownames are used.

k

number of clusters. Note that in some cases the algorithm might return fewer clusters than k.

metric

distance metric for kmeans++ seeding. Can be 'euclid', 'pearson' or 'spearman'.

max_iter

maximal number of iterations

min_delta

minimal change in assignments (fraction out of all observations) to continue iterating

verbose

display algorithm messages

keep_log

keep algorithm messages in 'log' field

id_column

df's first column contains the observation id

reorder_func

function to reorder the clusters. It operates on each center and orders the clusters by the result, e.g. reorder_func = mean would calculate the mean of each center and then reorder the clusters accordingly. If reorder_func = "hclust", the centers are ordered by hclust of the euclidean distance of the correlation matrix, i.e. hclust(dist(cor(t(centers)))). If NULL, no reordering is done.

add_to_data

return also the original data frame with an extra 'clust' column with the cluster ids ('id' is the first column)

hclust_intra_clusters

run hierarchical clustering within each cluster and return an ordering of the observations.

seed

seed for the c++ random number generator

parallel

run the hierarchical clustering of each cluster in parallel (only relevant if hclust_intra_clusters is TRUE)

use_cpp_random

use the c++ random number generator instead of R's. This should be used only for backwards compatibility, as from version 0.4.0 onwards the default random number generator was changed to R's.
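The two styles of `reorder_func` described above can be illustrated with base R alone. This is a minimal sketch, not the package's internal code: the `centers` matrix and its values are made up for illustration, and applying the function row-wise with `apply()` is an assumption about how "operates on each center" is realized.

```r
# Hypothetical cluster centers: 4 clusters (rows) in 3 dimensions (columns).
centers <- rbind(
    c1 = c(5, 1, 3),
    c2 = c(1, 2, 9),
    c3 = c(4, 0, 2),
    c4 = c(0, 3, 8)
)

# Style 1 (e.g. reorder_func = mean): summarise each center by a single
# number and sort the clusters by that number.
ord_mean <- order(apply(centers, 1, mean))
rownames(centers)[ord_mean]

# Style 2 (reorder_func = "hclust"): hierarchical clustering of the euclidean
# distances between the rows of the correlation matrix of the centers.
hc <- hclust(dist(cor(t(centers))))
ord_hclust <- hc$order
rownames(centers)[ord_hclust]
```

With the hierarchical ordering, centers with similar profiles (here `c1` and `c3`, which differ only by a constant shift) end up next to each other.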

Value

list with the following components:

cluster:

tibble with `id` column with the observation id (`1:n` if no id column was supplied), and `clust` column with the observation assigned cluster.

centers:

tibble with `clust` column and the cluster centers.

size:

tibble with `clust` column and `n` column with the number of points in each cluster.

data:

tibble with `clust` column and the original data frame (only if add_to_data = TRUE).

log:

messages from the algorithm run (only if keep_log = TRUE).

order:

tibble with 'id' column, 'clust' column, 'order' column with a new ordering of the observations and 'intra_clust_order' column with the order within each cluster (only if hclust_intra_clusters = TRUE).
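Because `centers` and `size` are both keyed by `clust`, they can be combined with an ordinary join. The sketch below does not call the package: it rebuilds the two components as plain data frames (rather than tibbles) from the values printed in the Examples section, purely to show the shape of the result. The object names `km_mock` and `summary_tbl` are hypothetical.

```r
# Mock of the `centers` and `size` components, using the values printed
# in the Examples section (plain data frames stand in for tibbles).
km_mock <- list(
    centers = data.frame(
        clust = 1:5,
        V1 = c(1.98, 3.98, 0.987, 3.07, 5.02),
        V2 = c(2.00, 3.98, 0.978, 3.02, 4.98)
    ),
    size = data.frame(
        clust = 1:5,
        n = c(104L, 99L, 99L, 98L, 100L)
    )
)

# Join the per-cluster center coordinates with the per-cluster sizes.
summary_tbl <- merge(km_mock$centers, km_mock$size, by = "clust")
summary_tbl

# The cluster sizes add up to the total number of observations.
sum(summary_tbl$n)
```

Here the sizes sum to 500, matching the 500 rows of the `cluster` tibble in the example output.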

See also

Examples

# \dontshow{
# this line is only for CRAN checks
tglkmeans.set_parallel(1)
# }

# create 5 clusters normally distributed around 1:5
d <- simulate_data(
    n = 100,
    sd = 0.3,
    nclust = 5,
    dims = 2,
    add_true_clust = FALSE,
    id_column = FALSE
)

head(d)
#>          V1        V2
#> 1 0.9695561 1.1848463
#> 2 1.2218107 0.5552327
#> 3 0.4162643 1.0408562
#> 4 0.8653655 1.3884966
#> 5 1.0570451 0.6128050
#> 6 0.5543186 0.7068720

# cluster
km <- TGL_kmeans_tidy(d, k = 5, metric = "euclid", verbose = TRUE)
#> will generate seeds
#> generating seeds
#> at seed 0
#> add new core from 269 to 0
#> at seed 1
#> done update min distance
#> seed range 350 450
#> picked up 417 dist was 1.57597
#> add new core from 417 to 1
#> at seed 2
#> done update min distance
#> seed range 300 400
#> picked up 73 dist was 1.23917
#> add new core from 73 to 2
#> at seed 3
#> done update min distance
#> seed range 250 350
#> picked up 368 dist was 0.662498
#> add new core from 368 to 3
#> at seed 4
#> done update min distance
#> seed range 200 300
#> picked up 193 dist was 0.438478
#> add new core from 193 to 4
#> reassign after init
#> iter 0
#> iter 1 changed 20
#> iter 1
#> iter 2 changed 11
#> iter 2
#> iter 3 changed 3
#> iter 3
#> iter 4 changed 0
km
#> $centers
#> # A tibble: 5 × 3
#>   clust    V1    V2
#>   <int> <dbl> <dbl>
#> 1     1 1.98  2.00 
#> 2     2 3.98  3.98 
#> 3     3 0.987 0.978
#> 4     4 3.07  3.02 
#> 5     5 5.02  4.98 
#> 
#> $cluster
#> # A tibble: 500 × 2
#>    id    clust
#>    <chr> <int>
#>  1 1         3
#>  2 2         3
#>  3 3         3
#>  4 4         3
#>  5 5         3
#>  6 6         3
#>  7 7         3
#>  8 8         3
#>  9 9         3
#> 10 10        3
#> # ℹ 490 more rows
#> 
#> $size
#> # A tibble: 5 × 2
#>   clust     n
#>   <int> <int>
#> 1     1   104
#> 2     2    99
#> 3     3    99
#> 4     4    98
#> 5     5   100
#>
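The `changed` counts in the verbose log can be related to the `min_delta` argument. The following is a sketch under stated assumptions, not the package's code: it assumes each `changed` value counts reassigned observations, and that iteration continues while the changed fraction exceeds `min_delta` (whether the comparison is strict is not stated on this page).

```r
n_obs <- 500                 # rows in the $cluster tibble above
changed <- c(20, 11, 3, 0)   # "changed" values reported in the verbose log
min_delta <- 0.0001          # default value of the argument

# Fraction of observations whose assignment changed in each iteration.
frac_changed <- changed / n_obs

# Assumed stopping rule: keep iterating while the fraction exceeds min_delta.
keep_iterating <- frac_changed > min_delta

data.frame(changed, frac_changed, keep_iterating)
```

Under these assumptions the run stops at the fourth reported iteration, the first one in which no assignment changed, well before `max_iter` is reached.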