R/knn.R
tgs_graph_cover_resample.Rd
Clusters a directed graph multiple times, each time using a randomized subset of the nodes.
tgs_graph_cover_resample(
graph,
knn,
min_cluster_size,
cooling = 1.05,
burn_in = 10,
p_resamp = 0.75,
n_resamp = 500,
method = "hash"
)
graph: directed graph in the format returned by tgs_graph
knn: maximal number of edges used per node for each sample subset
min_cluster_size: used to determine the candidates for seeding (= min weight)
cooling: factor used to gradually increase the chance of a node staying in its cluster
burn_in: number of node reassignments after which cooling is applied
p_resamp: fraction of the total number of nodes used in each sample subset
n_resamp: number of times the clustering is run, each time on a different sample subset
method: method for calculating co_cluster and co_sample; valid values: "hash", "full", "edges"
If method == "hash", a list with two members. The first member is a data frame with 3 columns: "node1", "node2" and "cnt", where "cnt" is the number of times "node1" and "node2" appeared in the same cluster. The second member is a vector with one entry per node, giving the number of times that node was included in a sample subset.
If method == "full", a list containing two matrices: co_cluster and co_sample.
If method == "edges", a list containing two data frames: co_cluster and co_sample.
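To make the meaning of the co_cluster and co_sample counts concrete, here is a minimal base-R sketch. It does not call tgs_graph_cover_resample and does not implement the graph cover algorithm: a trivial threshold split stands in for the clustering step, and the node positions x, the 2.5 threshold, and the names samples and hash_like are hypothetical. The layout of the real "hash" output (for example, which node pairs are listed) may differ.

```r
# Toy illustration only: a threshold split stands in for the real clustering.
set.seed(1)
n_nodes  <- 8
p_resamp <- 0.75   # fraction of nodes drawn in each sample subset
n_resamp <- 200    # number of sample subsets
x <- c(0.1, 0.2, 0.3, 0.4, 5.1, 5.2, 5.3, 5.4)  # hypothetical node positions

co_cluster <- matrix(0L, n_nodes, n_nodes)  # times i and j shared a cluster
co_sample  <- matrix(0L, n_nodes, n_nodes)  # times i and j were sampled together
samples    <- integer(n_nodes)              # times each node was sampled

for (i in seq_len(n_resamp)) {
  # Draw a random subset of the nodes
  idx <- sort(sample(n_nodes, round(p_resamp * n_nodes)))
  # Stand-in clustering: two clusters, split at x = 2.5
  cl <- ifelse(x[idx] < 2.5, 1L, 2L)
  same <- outer(cl, cl, "==")
  co_cluster[idx, idx] <- co_cluster[idx, idx] + same
  co_sample[idx, idx]  <- co_sample[idx, idx] + 1L
  samples[idx] <- samples[idx] + 1L
}

# Sparse, "hash"-like view: one row per node pair that co-clustered at least once
pairs <- which(upper.tri(co_cluster) & co_cluster > 0, arr.ind = TRUE)
hash_like <- data.frame(
  node1 = pairs[, "row"],
  node2 = pairs[, "col"],
  cnt   = co_cluster[pairs]
)
```

Dividing co_cluster by co_sample gives, for each pair of nodes, the fraction of shared sample subsets in which the two nodes ended up in the same cluster.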
The algorithm is described in the paper "MetaCell: analysis of single-cell RNA-seq data using K-nn graph partitions", published in Genome Biology 20: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1812-2
# Note: all available CPU cores might be used
set.seed(seed = 0)
rows <- 100
cols <- 200
# Random integer matrix used as input data
vals <- sample(1:(rows * cols / 2), rows * cols, replace = TRUE)
m <- matrix(vals, nrow = rows, ncol = cols)
# Spearman correlation, then nearest neighbours, then the directed graph
r1 <- tgs_cor(m, pairwise.complete.obs = FALSE, spearman = TRUE)
r2 <- tgs_knn(r1, knn = 20, diag = FALSE)
r3 <- tgs_graph(r2, knn = 3, k_expand = 10)
# Resampled clustering with the default method ("hash")
r4 <- tgs_graph_cover_resample(r3, knn = 10, min_cluster_size = 1)