Predict cluster assignments for new data — predict_tgl_kmeans

Assign new observations to the nearest center of an existing k-means clustering.

predict_tgl_kmeans(object, newdata, id_column = FALSE, ...)

Arguments

object: A tgl_kmeans result from TGL_kmeans_tidy
newdata: A matrix or data frame of new observations. Must have the same features (columns) as the data used to create the k-means model. If the first column contains observation IDs (character/factor), set id_column = TRUE so that it is used as the id column.
id_column: Does newdata's first column contain observation IDs? If TRUE, the first column is used as IDs. If FALSE (default), row numbers are used as IDs.
...: Additional arguments (currently unused)

Value

A tibble with columns: id (observation identifier) and clust (assigned cluster).

Details

For each observation in newdata, the function computes the distance to every cluster center and assigns the observation to the nearest center. The distance metric used is the same one that was used when creating the k-means model ("euclid", "pearson", or "spearman").

Distance formulas (matching the training implementation; n is the number of dimensions present in both x and center):

euclid: sqrt(sum((x - center)^2, na.rm = TRUE)) / n
pearson: -cor(x, center, use = "pairwise.complete.obs")
spearman: -cor(x, center, method = "spearman", use = "pairwise.complete.obs")
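
The formulas above can be sketched in base R as follows. This is only an illustration of the nearest-center rule, not the package's actual implementation; the helper names dist_to_center and assign_nearest, and the toy centers and observations, are hypothetical and exist only for this example.

```r
# Hypothetical helper (not part of the package): distance between one
# observation x and one cluster center, following the formulas above.
dist_to_center <- function(x, center,
                           metric = c("euclid", "pearson", "spearman")) {
  metric <- match.arg(metric)
  # n = number of dimensions present (non-NA) in both x and center
  n <- sum(!is.na(x) & !is.na(center))
  switch(metric,
    euclid   = sqrt(sum((x - center)^2, na.rm = TRUE)) / n,
    pearson  = -cor(x, center, use = "pairwise.complete.obs"),
    spearman = -cor(x, center, method = "spearman",
                    use = "pairwise.complete.obs")
  )
}

# Hypothetical helper: index of the nearest center for every row of newdata.
assign_nearest <- function(newdata, centers, metric = "euclid") {
  newdata <- as.matrix(newdata)
  centers <- as.matrix(centers)
  vapply(seq_len(nrow(newdata)), function(i) {
    d <- vapply(seq_len(nrow(centers)), function(k) {
      dist_to_center(newdata[i, ], centers[k, ], metric)
    }, numeric(1))
    which.min(d)
  }, integer(1))
}

# Toy data (made up for illustration): two centers, two new observations,
# one of which has a missing value.
centers <- rbind(c(1, 2, 3), c(3, 2, 1))
new_obs <- rbind(c(1.1, 2.2, NA), c(2.9, 2.1, 1.2))

clusters_euclid  <- assign_nearest(new_obs, centers, "euclid")
clusters_pearson <- assign_nearest(new_obs, centers, "pearson")
```

In this toy case both metrics assign the first observation to center 1 and the second to center 2. The missing value in the first observation is skipped, so its Euclidean distance is divided by n = 2 rather than 3.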

Examples


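# simulate 5 clusters (100 observations each), normally distributed around 1:5
data <- simulate_data(n = 100, sd = 0.3, nclust = 5, dims = 10)
# fit k-means on the data with the first column dropped
km <- TGL_kmeans_tidy(data[, -1], k = 5, id_column = FALSE, seed = 60427)
# simulate new observations (10 per cluster) and assign each to its nearest center
new_data <- simulate_data(n = 10, sd = 0.3, nclust = 5, dims = 10)
predictions <- predict_tgl_kmeans(km, new_data[, -1])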
predictions
#> # A tibble: 50 × 2
#>    id    clust
#>    <chr> <int>
#>  1 1         4
#>  2 2         4
#>  3 3         4
#>  4 4         4
#>  5 5         4
#>  6 6         4
#>  7 7         4
#>  8 8         4
#>  9 9         4
#> 10 10        4
#> # ℹ 40 more rows