TGL kmeans with 'tidy' output

TGL_kmeans_tidy(
  df,
  k,
  metric = "euclid",
  max_iter = 40,
  min_delta = 0.0001,
  verbose = FALSE,
  keep_log = FALSE,
  id_column = FALSE,
  reorder_func = "hclust",
  add_to_data = FALSE,
  hclust_intra_clusters = FALSE,
  seed = NULL,
  use_cpp_random = FALSE
)

Arguments

df

a data frame or a matrix. Each row is a single observation and each column is a dimension. The first column can contain an id for each observation (if id_column is TRUE); otherwise the rownames are used.

k

number of clusters. Note that in some cases the algorithm might return fewer clusters than k.

metric

distance metric for kmeans++ seeding. Can be 'euclid', 'pearson' or 'spearman'.

max_iter

maximal number of iterations

min_delta

minimal change in assignments (fraction out of all observations) to continue iterating

verbose

display algorithm messages

keep_log

keep algorithm messages in 'log' field

id_column

df's first column contains the observation id. If not set and the first column is character or factor, it will be automatically used as the ID column (with a warning).

reorder_func

function to reorder the clusters. It operates on each center and orders the clusters by the result, e.g. reorder_func = mean would calculate the mean of each center and then reorder the clusters accordingly. If reorder_func = "hclust", the centers are ordered by hclust of the euclidean distance of the correlation matrix, i.e. hclust(dist(cor(t(centers)))). If NULL, no reordering is done.

add_to_data

also return the original data frame with an extra 'clust' column containing the cluster ids ('id' is the first column)

hclust_intra_clusters

run hierarchical clustering within each cluster and return an ordering of the observations.

seed

seed for the c++ random number generator

use_cpp_random

use the C++ random number generator instead of R's. This should be used only for backwards compatibility, as from version 0.4.0 onwards the default random number generator was changed to R's.
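
The two `reorder_func` behaviours described above can be illustrated with base R alone. This is a minimal sketch, not the package's internal code: the `centers` matrix is made up for illustration, and ascending order for `reorder_func = mean` is an assumption.

```r
# Hypothetical 4 x 3 matrix of cluster centers
# (rows = clusters, columns = dimensions); values are made up.
centers <- rbind(
    c(5.0, 5.1, 4.9),
    c(1.0, 1.2, 0.9),
    c(3.0, 2.9, 3.1),
    c(2.0, 2.1, 1.8)
)
rownames(centers) <- 1:4

# reorder_func = mean: score each center, then sort clusters by the score
# (ascending order is assumed here).
ord_mean <- order(apply(centers, 1, mean))
ord_mean
#> [1] 2 4 3 1

# reorder_func = "hclust": take the leaf order from hierarchical clustering
# of the euclidean distances between rows of the center correlation matrix.
ord_hclust <- hclust(dist(cor(t(centers))))$order
ord_hclust
```

Either vector is a permutation of the cluster indices; the clusters would then be relabelled in that order.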

Value

list with the following components:

cluster:

tibble with `id` column with the observation id (`1:n` if no id column was supplied), and `clust` column with the cluster assigned to each observation.

centers:

tibble with `clust` column and the cluster centers.

size:

tibble with `clust` column and `n` column with the number of points in each cluster.

data:

tibble with `clust` column and the original data frame (only if add_to_data = TRUE).

log:

messages from the algorithm run (only if keep_log = TRUE).

order:

tibble with 'id' column, 'clust' column, 'order' column with a new ordering of the observations and 'intra_clust_order' column with the order within each cluster (only if hclust_intra_clusters = TRUE).
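
The optional components listed above are requested through the corresponding arguments. The following sketch assumes the tglkmeans package is installed and that the flags behave as described in the Arguments section; it only inspects the returned list and makes no claim about the exact values.

```r
# Runs only when the tglkmeans package is available.
if (requireNamespace("tglkmeans", quietly = TRUE)) {
    d <- tglkmeans::simulate_data(
        n = 100, sd = 0.3, nclust = 5, dims = 2,
        add_true_clust = FALSE, id_column = FALSE
    )

    km <- tglkmeans::TGL_kmeans_tidy(
        d,
        k = 5,
        metric = "euclid",
        keep_log = TRUE,              # request the 'log' component
        add_to_data = TRUE,           # request the 'data' component
        hclust_intra_clusters = TRUE  # request the 'order' component
    )

    names(km)       # components present in the returned list
    km$size         # number of points in each cluster
    head(km$data)   # original data with the 'clust' column
    head(km$order)  # overall and within-cluster ordering
}
```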

Examples


# create 5 clusters normally distributed around 1:5
d <- simulate_data(
    n = 100,
    sd = 0.3,
    nclust = 5,
    dims = 2,
    add_true_clust = FALSE,
    id_column = FALSE
)

head(d)
#>          V1        V2
#> 1 0.4162643 1.0408562
#> 2 0.8653655 1.3884966
#> 3 1.0570451 0.6128050
#> 4 0.5543186 0.7068720
#> 5 0.5403417 1.4741502
#> 6 0.8346604 0.9403844

# cluster
km <- TGL_kmeans_tidy(d, k = 5, metric = "euclid", verbose = TRUE)
#> will generate seeds
#> generating seeds
#> at seed 0
#> add new core from 97 to 0
#> at seed 1
#> done update min distance
#> seed range 350 450
#> picked up 410
#> add new core from 410 to 1
#> at seed 2
#> done update min distance
#> seed range 300 400
#> picked up 221
#> add new core from 221 to 2
#> at seed 3
#> done update min distance
#> seed range 250 350
#> picked up 175
#> add new core from 175 to 3
#> at seed 4
#> done update min distance
#> seed range 200 300
#> picked up 339
#> add new core from 339 to 4
#> reassign after init
#> iter 0
#> iter 1 changed 14
#> iter 1
#> iter 2 changed 2
#> iter 2
#> iter 3 changed 0
km
#> $centers
#> # A tibble: 5 × 3
#>   clust    V1    V2
#>   <int> <dbl> <dbl>
#> 1     1 1.98  2.00 
#> 2     2 3.98  3.97 
#> 3     3 3.07  3.01 
#> 4     4 0.983 0.980
#> 5     5 5.03  4.98 
#> 
#> $cluster
#> # A tibble: 500 × 2
#>    id    clust
#>    <chr> <int>
#>  1 1         4
#>  2 2         4
#>  3 3         4
#>  4 4         4
#>  5 5         4
#>  6 6         4
#>  7 7         4
#>  8 8         4
#>  9 9         4
#> 10 10        4
#> # ℹ 490 more rows
#> 
#> $size
#> # A tibble: 5 × 2
#>   clust     n
#>   <int> <int>
#> 1     1   104
#> 2     2    99
#> 3     3    98
#> 4     4    99
#> 5     5   100
#>
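
The `changed` counts in the verbose log above can be related to `min_delta`. This is a sketch of the stopping rule as described in the Arguments section, not the package's actual code; the strict `>` comparison is an assumption.

```r
n_obs <- 500            # observations in the example (5 clusters x 100)
min_delta <- 0.0001     # default value of min_delta
changed <- c(14, 2, 0)  # "iter ... changed" counts from the log above

# Fraction of observations whose assignment changed in each iteration.
frac_changed <- changed / n_obs

# Assumed rule: keep iterating while the fraction exceeds min_delta.
res <- data.frame(
    iter = seq_along(changed),
    changed = changed,
    frac_changed = frac_changed,
    continue = frac_changed > min_delta
)
res
```

Under this reading the run stops at the third iteration, where no assignment changed, well before `max_iter = 40` is reached.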