Project new observations onto existing k-means cluster centers.

Usage

predict_tgl_kmeans(object, newdata, id_column = FALSE, ...)

Arguments

object

A tgl_kmeans object, as returned by TGL_kmeans_tidy.

newdata

A matrix or data frame of new observations. Must have the same features (columns) as the data used to create the k-means model. If the first column contains observation IDs (character/factor), it will be used as the id column.

id_column

Does newdata's first column contain observation IDs? If TRUE, the first column is used as IDs. If FALSE (default), row numbers are used as IDs.

...

Additional arguments (currently unused)

Value

A tibble with columns: id (observation identifier) and clust (assigned cluster).

Details

For each observation in newdata, the function computes the distance to every cluster center and assigns the observation to the nearest one. The distance metric is the one that was used to fit the k-means model ("euclid", "pearson", or "spearman").

Distance formulas:

  • euclid: sqrt(sum((x - center)^2, na.rm = TRUE))

  • pearson: -cor(x, center, use = "pairwise.complete.obs")

  • spearman: -cor(x, center, method = "spearman", use = "pairwise.complete.obs")
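The assignment rule described above can be sketched in base R. This is only an illustration of the nearest-center logic built from the listed formulas, not the package's actual implementation; the helper names dist_to_center and assign_nearest are hypothetical, and centers is assumed to be a plain numeric matrix with one row per cluster.

```r
# Hypothetical helper: distance between one observation and one center,
# using the formulas listed in Details.
dist_to_center <- function(x, center,
                           metric = c("euclid", "pearson", "spearman")) {
  metric <- match.arg(metric)
  switch(metric,
    euclid   = sqrt(sum((x - center)^2, na.rm = TRUE)),
    pearson  = -cor(x, center, use = "pairwise.complete.obs"),
    spearman = -cor(x, center, method = "spearman",
                    use = "pairwise.complete.obs")
  )
}

# Hypothetical helper: index of the nearest center for each row of newdata.
# Assumes newdata and centers have the same columns in the same order.
assign_nearest <- function(newdata, centers, metric = "euclid") {
  newdata <- as.matrix(newdata)
  centers <- as.matrix(centers)
  vapply(seq_len(nrow(newdata)), function(i) {
    d <- vapply(seq_len(nrow(centers)), function(j) {
      dist_to_center(newdata[i, ], centers[j, ], metric)
    }, numeric(1))
    which.min(d) # smallest distance wins; ties go to the first center
  }, integer(1))
}

# Toy data: two centers, three observations (one with a missing value).
centers <- rbind(c(0, 0, 0), c(10, 10, 10))
obs <- rbind(c(1, -1, 0.5), c(9, 11, NA), c(0.2, 0.1, -0.3))
assign_nearest(obs, centers, metric = "euclid")
#> [1] 1 2 1
```

With the "euclid" formula, a missing coordinate is dropped from the sum, so the second observation is still assigned using its two observed values. The correlation-based metrics need at least two complete, non-constant coordinate pairs to produce a defined distance.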

Examples


# create 5 clusters normally distributed around 1:5
data <- simulate_data(n = 100, sd = 0.3, nclust = 5, dims = 10)
km <- TGL_kmeans_tidy(data[, -1], k = 5, id_column = FALSE, seed = 60427)
new_data <- simulate_data(n = 10, sd = 0.3, nclust = 5, dims = 10)
predictions <- predict_tgl_kmeans(km, new_data[, -1])
predictions
#> # A tibble: 50 × 2
#>    id    clust
#>    <chr> <int>
#>  1 1         5
#>  2 2         5
#>  3 3         5
#>  4 4         5
#>  5 5         5
#>  6 6         5
#>  7 7         5
#>  8 8         5
#>  9 9         5
#> 10 10        5
#> # ℹ 40 more rows