TGL kmeans with 'tidy' output
TGL_kmeans_tidy(
df,
k,
metric = "euclid",
max_iter = 40,
min_delta = 0.0001,
verbose = FALSE,
keep_log = FALSE,
id_column = FALSE,
reorder_func = "hclust",
add_to_data = FALSE,
hclust_intra_clusters = FALSE,
seed = NULL,
use_cpp_random = FALSE
)
df: a data frame or a matrix. Each row is a single observation and each column is a dimension. The first column can contain an id for each observation (if id_column is TRUE); otherwise the rownames are used.
k: number of clusters. Note that in some cases the algorithm might return fewer clusters than k.
metric: distance metric for kmeans++ seeding. Can be 'euclid', 'pearson' or 'spearman'.
max_iter: maximal number of iterations.
min_delta: minimal change in assignments (as a fraction of all observations) required to continue iterating.
verbose: display algorithm messages.
keep_log: keep algorithm messages in the 'log' field.
id_column: TRUE if df's first column contains the observation id.
reorder_func: function used to reorder the clusters. It operates on each center and the clusters are ordered by the result, e.g. reorder_func = mean would calculate the mean of each center and then reorder the clusters accordingly. If reorder_func = hclust, the centers are ordered by hclust of the euclidean distance of the correlation matrix, i.e. hclust(dist(cor(t(centers)))). If NULL, no reordering is done.
add_to_data: also return the original data frame with an extra 'clust' column holding the cluster ids ('id' is the first column).
hclust_intra_clusters: run hierarchical clustering within each cluster and return an ordering of the observations.
seed: seed for the C++ random number generator.
use_cpp_random: use the C++ random number generator instead of R's. This should be used only for backwards compatibility, as from version 0.4.0 onwards the default random number generator was changed to R's.
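The two reorder_func modes described above can be illustrated with base R alone. This is only a sketch of the documented expressions, not the package's internal code; the `centers` matrix and its values are made up for illustration.

```r
# Hypothetical cluster centers: one row per cluster, one column per dimension
centers <- matrix(
  c(1, 2, 3,
    6, 5, 4,
    2, 3, 5,
    9, 7, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("c", 1:4), paste0("V", 1:3))
)

# reorder_func = mean: compute one score per center, then order by that score
ord_mean <- order(apply(centers, 1, mean))

# reorder_func = hclust: order the centers by the dendrogram leaves of
# hclust(dist(cor(t(centers)))), the expression quoted in the argument text
hc <- hclust(dist(cor(t(centers))))
ord_hclust <- hc$order

print(ord_mean)
print(ord_hclust)
```

In this toy input the centers with similar profiles (c1 with c3, c2 with c4) end up next to each other in the hclust ordering, while the mean-based ordering sorts purely by magnitude.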
list with the following components:
cluster: tibble with an `id` column holding the observation id (`1:n` if no id column was supplied) and a `clust` column holding the cluster assigned to each observation.
centers: tibble with a `clust` column and the cluster centers.
size: tibble with a `clust` column and an `n` column holding the number of points in each cluster.
data: tibble with a `clust` column and the original data frame (only if add_to_data = TRUE).
log: messages from the algorithm run (only if keep_log = TRUE).
order: tibble with an 'id' column, a 'clust' column, an 'order' column with a new ordering of the observations and an 'intra_clust_order' column with the order within each cluster (only if hclust_intra_clusters = TRUE).
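Because the returned tables share the `clust` key, they can be combined with ordinary joins. The sketch below uses small hand-built data frames as stand-ins for the documented components (the values are made up, and the real components are tibbles), so it runs with base R only.

```r
# Hypothetical stand-ins for the assignment and center tables described above
cluster <- data.frame(
  id = as.character(1:6),
  clust = c(1L, 1L, 2L, 2L, 2L, 3L)
)
centers <- data.frame(
  clust = 1:3,
  V1 = c(1.0, 2.0, 3.0),
  V2 = c(1.1, 2.1, 2.9)
)

# Attach each observation's cluster center via the shared `clust` key
with_centers <- merge(cluster, centers, by = "clust")

# Recompute cluster sizes from the assignments; the result has the same
# `clust` / `n` layout as the documented size table
size <- as.data.frame(table(clust = cluster$clust), responseName = "n")

print(with_centers)
print(size)
```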
# this line is only for CRAN checks
tglkmeans.set_parallel(1)
# create 5 clusters normally distributed around 1:5
d <- simulate_data(
n = 100,
sd = 0.3,
nclust = 5,
dims = 2,
add_true_clust = FALSE,
id_column = FALSE
)
head(d)
#> V1 V2
#> 1 0.9695561 1.1848463
#> 2 1.2218107 0.5552327
#> 3 0.4162643 1.0408562
#> 4 0.8653655 1.3884966
#> 5 1.0570451 0.6128050
#> 6 0.5543186 0.7068720
# cluster
km <- TGL_kmeans_tidy(d, k = 5, metric = "euclid", verbose = TRUE)
#> will generate seeds
#> generating seeds
#> at seed 0
#> add new core from 269 to 0
#> at seed 1
#> done update min distance
#> seed range 350 450
#> picked up 417 dist was 1.57597
#> add new core from 417 to 1
#> at seed 2
#> done update min distance
#> seed range 300 400
#> picked up 73 dist was 1.23917
#> add new core from 73 to 2
#> at seed 3
#> done update min distance
#> seed range 250 350
#> picked up 368 dist was 0.662498
#> add new core from 368 to 3
#> at seed 4
#> done update min distance
#> seed range 200 300
#> picked up 193 dist was 0.438478
#> add new core from 193 to 4
#> reassign after init
#> iter 0
#> iter 1 changed 20
#> iter 1
#> iter 2 changed 11
#> iter 2
#> iter 3 changed 3
#> iter 3
#> iter 4 changed 0
km
#> $centers
#> # A tibble: 5 × 3
#> clust V1 V2
#> <int> <dbl> <dbl>
#> 1 1 1.98 2.00
#> 2 2 3.98 3.98
#> 3 3 0.987 0.978
#> 4 4 3.07 3.02
#> 5 5 5.02 4.98
#>
#> $cluster
#> # A tibble: 500 × 2
#> id clust
#> <chr> <int>
#> 1 1 3
#> 2 2 3
#> 3 3 3
#> 4 4 3
#> 5 5 3
#> 6 6 3
#> 7 7 3
#> 8 8 3
#> 9 9 3
#> 10 10 3
#> # ℹ 490 more rows
#>
#> $size
#> # A tibble: 5 × 2
#> clust n
#> <int> <int>
#> 1 1 104
#> 2 2 99
#> 3 3 99
#> 4 4 98
#> 5 5 100
#>
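The verbose log above can be read against the min_delta argument. The sketch below copies the "changed" counts from that log and applies an assumed stopping rule (stop once the fraction of reassigned observations falls below min_delta); the exact comparison used by the package is an assumption here, not something the documentation states.

```r
# Counts of reassigned observations per iteration, copied from the
# verbose log above ("iter 1 changed 20", "iter 2 changed 11", ...)
changed <- c(20, 11, 3, 0)
n_obs <- 500         # 5 clusters x 100 observations each
min_delta <- 0.0001  # default value from the usage section

# Fraction of all observations that changed cluster in each iteration
frac_changed <- changed / n_obs

# Assumed rule: stop at the first iteration whose fraction is below min_delta
stop_at <- which(frac_changed < min_delta)[1]

print(frac_changed)
print(stop_at)
```

With 500 observations and the default min_delta, even a single reassignment (1/500 = 0.002) exceeds the threshold, which is consistent with the run above stopping only when no observation changed cluster.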