TGL kmeans with 'tidy' output

TGL_kmeans_tidy(
  df,
  k,
  metric = "euclid",
  max_iter = 40,
  min_delta = 0.0001,
  verbose = FALSE,
  keep_log = FALSE,
  id_column = FALSE,
  reorder_func = "hclust",
  add_to_data = FALSE,
  hclust_intra_clusters = FALSE,
  seed = NULL,
  use_cpp_random = FALSE
)

Arguments

df: a data frame or a matrix. Each row is a single observation and each column is a dimension. The first column can contain an id for each observation (if id_column is TRUE); otherwise the rownames are used.
k: number of clusters. Note that in some cases the algorithm might return fewer clusters than k.
metric: distance metric for kmeans++ seeding. Can be 'euclid', 'pearson' or 'spearman'.
max_iter: maximal number of iterations
min_delta: minimal change in assignments (fraction out of all observations) to continue iterating
verbose: display algorithm messages
keep_log: keep algorithm messages in 'log' field
id_column: df's first column contains the observation id
reorder_func: function used to reorder the clusters. It is applied to each center, and the clusters are ordered by the result. For example, reorder_func = mean calculates the mean of each center and reorders the clusters accordingly. If reorder_func = "hclust", the centers are ordered by hierarchical clustering of the euclidean distances of the correlation matrix, i.e. hclust(dist(cor(t(centers)))). If NULL, no reordering is done.
add_to_data: also return the original data frame with an extra 'clust' column holding the cluster ids ('id' is the first column)
hclust_intra_clusters: run hierarchical clustering within each cluster and return an ordering of the observations.
seed: seed for the c++ random number generator
use_cpp_random: use the c++ random number generator instead of R's. This should be used only for backwards compatibility, as from version 0.4.0 onwards the default random number generator was changed to R's.
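
The two reordering modes described for reorder_func can be illustrated in base R. This is only a sketch of the idea, not the package's internal code: the `centers` matrix below is hypothetical (4 made-up clusters in 3 dimensions), and applying the function row by row is an assumption based on the description "operates on each center".

```r
# Hypothetical cluster centers: 4 clusters (rows) x 3 dimensions (columns).
# These values are made up for illustration only.
centers <- rbind(
    c1 = c(3.0, 2.5, 1.0),
    c2 = c(1.0, 2.0, 3.0),
    c3 = c(3.5, 2.0, 1.5),
    c4 = c(1.2, 2.0, 3.4)
)

# reorder_func = mean: compute one score per center and sort clusters by it.
mean_order <- order(apply(centers, 1, mean))

# reorder_func = "hclust": correlate the centers with each other, take the
# euclidean distance between rows of that correlation matrix, and use the
# leaf order of the resulting hierarchical clustering.
hc <- hclust(dist(cor(t(centers))))
hclust_order <- hc$order

rownames(centers)[mean_order]
rownames(centers)[hclust_order]
```

With the mean, clusters are sorted along a single scalar; with hclust, centers that have similar profiles across dimensions end up next to each other, which is usually the more useful order for heatmaps.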

Value

list with the following components:

cluster: tibble with `id` column with the observation id (`1:n` if no id column was supplied), and `clust` column with the cluster assigned to each observation.
centers: tibble with `clust` column and the cluster centers.
size: tibble with `clust` column and `n` column with the number of points in each cluster.
data: tibble with the original data frame and an extra `clust` column (only if add_to_data = TRUE).
log: messages from the algorithm run (only if keep_log = TRUE).
order: tibble with 'id' column, 'clust' column, 'order' column with a new ordering of the observations and 'intra_clust_order' column with the order within each cluster (only if hclust_intra_clusters = TRUE).
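
The optional components are returned only when the corresponding arguments are switched on. The following sketch, which assumes the tglkmeans package is installed and uses only the arguments documented above, requests all of them in one call; `km_full` is just an illustrative variable name.

```r
library(tglkmeans)

# Simulated input, same settings as in the Examples section.
d <- simulate_data(
    n = 100, sd = 0.3, nclust = 5, dims = 2,
    add_true_clust = FALSE, id_column = FALSE
)

# Ask for the optional components in addition to cluster, centers and size.
km_full <- TGL_kmeans_tidy(
    d,
    k = 5,
    metric = "euclid",
    keep_log = TRUE,              # algorithm messages
    add_to_data = TRUE,           # original data plus a 'clust' column
    hclust_intra_clusters = TRUE  # ordering of observations within clusters
)

# Inspect which components came back and peek at the optional ones.
names(km_full)
head(km_full$data)
head(km_full$order)
```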

Examples


# create 5 clusters normally distributed around 1:5
d <- simulate_data(
    n = 100,
    sd = 0.3,
    nclust = 5,
    dims = 2,
    add_true_clust = FALSE,
    id_column = FALSE
)

head(d)
#>          V1        V2
#> 1 0.9695561 1.1848463
#> 2 1.2218107 0.5552327
#> 3 0.4162643 1.0408562
#> 4 0.8653655 1.3884966
#> 5 1.0570451 0.6128050
#> 6 0.5543186 0.7068720

# cluster
km <- TGL_kmeans_tidy(d, k = 5, metric = "euclid", verbose = TRUE)
#> will generate seeds
#> generating seeds
#> at seed 0
#> add new core from 269 to 0
#> at seed 1
#> done update min distance
#> seed range 350 450
#> picked up 417 dist was 1.57597
#> add new core from 417 to 1
#> at seed 2
#> done update min distance
#> seed range 300 400
#> picked up 73 dist was 1.23917
#> add new core from 73 to 2
#> at seed 3
#> done update min distance
#> seed range 250 350
#> picked up 368 dist was 0.662498
#> add new core from 368 to 3
#> at seed 4
#> done update min distance
#> seed range 200 300
#> picked up 193 dist was 0.438478
#> add new core from 193 to 4
#> reassign after init
#> iter 0
#> iter 1 changed 20
#> iter 1
#> iter 2 changed 11
#> iter 2
#> iter 3 changed 3
#> iter 3
#> iter 4 changed 0
km
#> $centers
#> # A tibble: 5 × 3
#>   clust    V1    V2
#>   <int> <dbl> <dbl>
#> 1     1 1.98  2.00 
#> 2     2 3.98  3.98 
#> 3     3 0.987 0.978
#> 4     4 3.07  3.02 
#> 5     5 5.02  4.98 
#> 
#> $cluster
#> # A tibble: 500 × 2
#>    id    clust
#>    <chr> <int>
#>  1 1         3
#>  2 2         3
#>  3 3         3
#>  4 4         3
#>  5 5         3
#>  6 6         3
#>  7 7         3
#>  8 8         3
#>  9 9         3
#> 10 10        3
#> # ℹ 490 more rows
#> 
#> $size
#> # A tibble: 5 × 2
#>   clust     n
#>   <int> <int>
#> 1     1   104
#> 2     2    99
#> 3     3    99
#> 4     4    98
#> 5     5   100
#>
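
Because every component is a tibble keyed by `clust`, the results can be combined with ordinary joins. As a minimal sketch (assuming tglkmeans is installed; `assigned` and `recount` are illustrative names, and the column names follow the output shown above), the code below attaches each observation's cluster center to its assignment and recomputes the cluster sizes from the assignments.

```r
library(tglkmeans)

# Re-create the example data and clustering so this block runs on its own.
d <- simulate_data(
    n = 100, sd = 0.3, nclust = 5, dims = 2,
    add_true_clust = FALSE, id_column = FALSE
)
km <- TGL_kmeans_tidy(d, k = 5, metric = "euclid")

# One row per observation: its id, its cluster and that cluster's center.
assigned <- merge(km$cluster, km$centers, by = "clust")
head(assigned)

# Recount the observations per cluster from the assignments; this should
# agree with the 'n' column of km$size.
recount <- table(km$cluster$clust)
recount
km$size
```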
