Cluster analysis in R: determine the optimal number of clusters



























Being a newbie in R, I'm not sure how to choose the best number of clusters for a k-means analysis. After plotting a subset of the data below, how many clusters would be appropriate? And how can I perform a cluster dendrogram analysis?



n = 1000
kk = 10
x1 = runif(kk)
y1 = runif(kk)
z1 = runif(kk)
x4 = sample(x1, length(x1))
y4 = sample(y1, length(y1))
randObs <- function()
{
  ix = sample(1:length(x4), 1)
  iy = sample(1:length(y4), 1)
  rx = rnorm(1, x4[ix], runif(1)/8)
  ry = rnorm(1, y4[ix], runif(1)/8)
  return(c(rx, ry))
}
x = c()
y = c()
for (k in 1:n)
{
  rPair = randObs()
  x = c(x, rPair[1])
  y = c(y, rPair[2])
}
z <- rnorm(n)
d <- data.frame(x, y, z)









  • If you are not completely wedded to k-means, you could try the DBSCAN clustering algorithm, available in the fpc package. It's true, you then have to set two parameters... but I've found that fpc::dbscan then does a pretty good job at automatically determining a good number of clusters. Plus it can actually output a single cluster if that's what the data tell you - some of the methods in @Ben's excellent answer won't help you determine whether k=1 is actually best.

    – Stephan Kolassa
    Jun 26 '14 at 14:08

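A minimal sketch of that DBSCAN suggestion, assuming the simulated data frame d from the question; the eps and MinPts values are illustrative guesses and normally need tuning for your own data:

    library(fpc)
    # cluster on the two informative columns; points labelled 0 are treated as noise
    ds <- dbscan(d[, c("x", "y")], eps = 0.15, MinPts = 5)
    ds                           # printing the object tabulates points per cluster
    plot(ds, d[, c("x", "y")])   # quick visual check of the clusters found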












  • See also stats.stackexchange.com/q/11691/478

    – Richie Cotton
    Oct 23 '14 at 12:38
















Tags: r, cluster-analysis, k-means






asked Mar 13 '13 at 2:39 by user2153893 · edited May 6 '14 at 16:07 by Agnius Vasiliauskas
7 Answers














If your question is how can I determine how many clusters are appropriate for a kmeans analysis of my data?, then here are some options. The wikipedia article on determining numbers of clusters has a good review of some of these methods.



First, some reproducible data (the data in the Q are... unclear to me):



n = 100
g = 6
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
                y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
plot(d)





One. Look for a bend or elbow in the sum of squared error (SSE) scree plot. See http://www.statmethods.net/advstats/cluster.html & http://www.mattpeeples.net/kmeans.html for more. The location of the elbow in the resulting plot suggests a suitable number of clusters for the kmeans:



mydata <- d
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
                                     centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")


We might conclude that 4 clusters would be indicated by this method:



Two. You can do partitioning around medoids to estimate the number of clusters using the pamk function in the fpc package.



library(fpc)
pamk.best <- pamk(d)
cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "n")
plot(pam(d, pamk.best$nc))





# we could also do:
library(fpc)
asw <- numeric(20)
for (k in 2:20)
  asw[[k]] <- pam(d, k)$silinfo$avg.width
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")
# still 4


Three. Calinski criterion: Another approach to diagnosing how many clusters suit the data. In this case we try 1 to 10 groups.



require(vegan)
fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter = 1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "n")
# 5 clusters!





Four. Determine the optimal model and number of clusters according to the Bayesian Information Criterion for expectation-maximization, initialized by hierarchical clustering for parameterized Gaussian mixture models



# See http://www.jstatsoft.org/v18/i06/paper
# http://www.stat.washington.edu/research/reports/2006/tr504.pdf
#
library(mclust)
# Run the function to see how many clusters
# it finds to be optimal; set it to search for
# at least 1 and up to 20 components.
d_clust <- Mclust(as.matrix(d), G=1:20)
m.best <- dim(d_clust$z)[2]
cat("model-based optimal number of clusters:", m.best, "\n")
# 4 clusters
plot(d_clust)





Five. Affinity propagation (AP) clustering, see http://dx.doi.org/10.1126/science.1136800



library(apcluster)
d.apclus <- apcluster(negDistMat(r=2), d)
cat("affinity propogation optimal number of clusters:", length(d.apclus@clusters), "n")
# 4
heatmap(d.apclus)
plot(d.apclus, d)





Six. Gap Statistic for Estimating the Number of Clusters. See also some code for a nice graphical output. Trying 1 to 10 clusters here:



library(cluster)
clusGap(d, kmeans, 10, B = 100, verbose = interactive())

Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
.................................................. 50
.................................................. 100
Clustering Gap statistic ["clusGap"].
B=100 simulated reference sets, k = 1..10
--> Number of clusters (method 'firstSEmax', SE.factor=1): 4
          logW   E.logW        gap     SE.sim
 [1,] 5.991701 5.970454 -0.0212471 0.04388506
 [2,] 5.152666 5.367256  0.2145907 0.04057451
 [3,] 4.557779 5.069601  0.5118225 0.03215540
 [4,] 3.928959 4.880453  0.9514943 0.04630399
 [5,] 3.789319 4.766903  0.9775842 0.04826191
 [6,] 3.747539 4.670100  0.9225607 0.03898850
 [7,] 3.582373 4.590136  1.0077628 0.04892236
 [8,] 3.528791 4.509247  0.9804556 0.04701930
 [9,] 3.442481 4.433200  0.9907197 0.04935647
[10,] 3.445291 4.369232  0.9239414 0.05055486

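If you save the clusGap result rather than just printing it, the chosen k can also be pulled out programmatically; a small sketch, assuming the same d, where maxSE applies the same "firstSEmax" rule as the printed summary:

library(cluster)
gs <- clusGap(d, kmeans, K.max = 10, B = 100)
maxSE(gs$Tab[, "gap"], gs$Tab[, "SE.sim"], method = "firstSEmax")
# should return 4 for this data, matching the summary above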

Here's the output from Edwin Chen's implementation of the gap statistic:
[image: gap statistic plot from Edwin Chen's implementation]



Seven. You may also find it useful to explore your data with clustergrams to visualize cluster assignment, see http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/ for more details.



Eight. The NbClust package provides 30 indices to determine the number of clusters in a dataset.



library(NbClust)
nb <- NbClust(d, diss = NULL, distance = "euclidean",
              min.nc = 2, max.nc = 15, method = "kmeans",
              index = "alllong", alphaBeale = 0.1)
hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
# Looks like 3 is the most frequently determined number of clusters
# and curiously, four clusters is not in the output at all!

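If you prefer a single number over reading the histogram, a quick base-R way to take the majority vote across the indices (a small sketch, assuming the nb object from the call above):

best.k <- as.integer(names(which.max(table(nb$Best.nc[1, ]))))
best.k   # the number of clusters proposed most often across the indices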




If your question is how can I produce a dendrogram to visualize the results of my cluster analysis, then you should start with these:
http://www.statmethods.net/advstats/cluster.html
http://www.r-tutor.com/gpu-computing/clustering/hierarchical-cluster-analysis
http://gastonsanchez.wordpress.com/2012/10/03/7-ways-to-plot-dendrograms-in-r/ And see here for more exotic methods: http://cran.r-project.org/web/views/Cluster.html



Here are a few examples:



d_dist <- dist(as.matrix(d))   # find distance matrix 
plot(hclust(d_dist)) # apply hierarchical clustering and plot





# a Bayesian clustering method, good for high-dimension data, more details:
# http://vahid.probstat.ca/paper/2012-bclust.pdf
install.packages("bclust")
library(bclust)
x <- as.matrix(d)
d.bclus <- bclust(x, transformed.par = c(0, -50, log(16), 0, 0, 0))
viplot(imp(d.bclus)$var); plot(d.bclus); ditplot(d.bclus)
dptplot(d.bclus, scale = 20, horizbar.plot = TRUE,varimp = imp(d.bclus)$var, horizbar.distance = 0, dendrogram.lwd = 2)
# I just include the dendrogram here


[image: bclust dendrogram]



Also useful for high-dimensional data is the pvclust library, which calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Here's the example from the documentation (it won't work on such low-dimensional data as in my example):



library(pvclust)
library(MASS)
data(Boston)
boston.pv <- pvclust(Boston)
plot(boston.pv)





Does any of that help?






answered Mar 13 '13 at 3:20 by Ben · edited Apr 12 '15 at 21:22 by ybungalobill





















  • This might be the most graphy answer I've ever seen on SO. +1
    – K. Barresi
    Jun 17 '13 at 23:12








  • Thanks, it was fun to put together (I'm an adherent of the graphism thesis)
    – Ben
    Jun 17 '13 at 23:32






  • Yes, this was excellent... thanks so much for the awesome explanation. Not just the graphs, but the packages & the code to show how to utilize them! :)
    – Mike Williamson
    Oct 12 '13 at 1:39






  • This is simply the greatest stack answer I have ever seen. Awesome effort and great explanation!!!
    – Cybernetic
    Mar 29 '14 at 16:52






  • Such a high standard answer deserves a unique and special platinum tag.
    – Colonel Beauvel
    Feb 24 '15 at 13:26

































It's hard to add something to such an elaborate answer. Though I feel we should mention identify here, particularly because @Ben shows a lot of dendrogram examples.



d_dist <- dist(as.matrix(d))   # find distance matrix 
plot(hclust(d_dist))
clusters <- identify(hclust(d_dist))


identify lets you interactively choose clusters from a dendrogram and stores your choices in a list. Hit Esc to leave interactive mode and return to the R console. Note that the list contains the indices, not the rownames (as opposed to cutree).

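A small follow-up sketch (assuming the d and clusters objects above): since the list holds row indices, you can translate each interactively chosen cluster back to row names yourself.

# one character vector of row names per cluster you clicked on
lapply(clusters, function(idx) rownames(d)[idx])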



















    To determine the optimal k for a clustering method, I usually use the Elbow method together with parallel processing to avoid excessive computation time. The code looks like this:



    Elbow method



    # css.hclust() and elbow.batch() come from the GMD package
    elbow.k <- function(mydata){
      dist.obj <- dist(mydata)
      hclust.obj <- hclust(dist.obj)
      css.obj <- css.hclust(dist.obj, hclust.obj)
      elbow.obj <- elbow.batch(css.obj)
      k <- elbow.obj$k
      return(k)
    }


    Running the Elbow method in parallel



    library(parallel)
    no_cores <- detectCores()
    cl <- makeCluster(no_cores)
    clusterEvalQ(cl, library(GMD))
    # data.clustering, data.convert and clustering.kmeans are objects from my own workflow
    clusterExport(cl, list("data.clustering", "data.convert", "elbow.k", "clustering.kmeans"))
    start.time <- Sys.time()
    k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering))
    end.time <- Sys.time()
    stopCluster(cl)
    cat('Time to find k using Elbow method is', (end.time - start.time), 'seconds with k value:', k.clusters)


    It works well.











    • The elbow and css functions come from the GMD package: cran.r-project.org/web/packages/GMD/GMD.pdf
      – Rohan
      Jun 28 '17 at 18:22

































    Splendid answer from Ben. However, I'm surprised that the Affinity Propagation (AP) method has been suggested here just to find the number of clusters for the k-means method, when in general AP does a better job of clustering the data. Please see the scientific paper supporting this method in Science:



    Frey, Brendan J., and Delbert Dueck. "Clustering by passing messages between data points." Science 315.5814 (2007): 972-976.



    So if you are not biased toward k-means, I suggest using AP directly; it will cluster the data without requiring you to know the number of clusters in advance:



    library(apcluster)
    apclus = apcluster(negDistMat(r=2), data)
    show(apclus)


    If negative Euclidean distances are not appropriate, you can use other similarity measures provided in the same package. For example, for similarities based on Spearman correlations, this is what you need:



    sim = corSimMat(data, method="spearman")
    apclus = apcluster(s=sim)


    Please note that those similarity functions in the AP package are just provided for simplicity. In fact, the apcluster() function will accept any matrix of correlations. The same as before with corSimMat() can be done with this:



    sim = cor(data, method="spearman")


    or



    sim = cor(t(data), method="spearman")


    depending on whether you want to cluster the rows or the columns of your matrix.

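    A short sketch for inspecting the result, assuming the apclus object from above (exemplars and clusters are slots of the returned APResult object):

    length(apclus@clusters)   # number of clusters AP settled on
    apclus@exemplars          # the data points chosen as cluster exemplars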



















      These methods are great, but when trying to find k for much larger data sets, they can be crazy slow in R.



      A good solution I have found is the "RWeka" package, which has an efficient implementation of the X-Means algorithm - an extended version of K-Means that scales better and will determine the optimum number of clusters for you.



      First you'll want to make sure that Weka is installed on your system and that XMeans is installed through Weka's package manager tool.



      library(RWeka)

      # Print a list of available options for the X-Means algorithm
      WOW("XMeans")

      # Create a Weka_control object which will specify our parameters
      weka_ctrl <- Weka_control(
      I = 1000, # max no. of overall iterations
      M = 1000, # max no. of iterations in the kMeans loop
      L = 20, # min no. of clusters
      H = 150, # max no. of clusters
      D = "weka.core.EuclideanDistance", # distance metric Euclidean
      C = 0.4, # cutoff factor ???
      S = 12 # random number seed (for reproducibility)
      )

      # Run the algorithm on your data, d
      x_means <- XMeans(d, control = weka_ctrl)

      # Assign cluster IDs to original data set
      d$xmeans.cluster <- x_means$class_ids

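      A quick sanity check on the result (a sketch, assuming the x_means object above; class_ids is the field the snippet already uses):

      # how many clusters X-Means settled on, and how large each one is
      length(unique(x_means$class_ids))
      table(x_means$class_ids)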


















        The answers are great. If you want to give another clustering method a chance, you can use hierarchical clustering and see how the data split.



        > set.seed(2)
        > x=matrix(rnorm(50*2), ncol=2)
        > hc.complete = hclust(dist(x), method="complete")
        > plot(hc.complete)





        Depending on how many classes you need, you can cut your dendrogram as:



        > cutree(hc.complete,k = 2)
        [1] 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1
        [26] 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 2


        If you type ?cutree you will see the definitions. If your data set has three classes it will simply be cutree(hc.complete, k = 3). The equivalent of cutree(hc.complete, k = 2) is cutree(hc.complete, h = 4.9).

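        A tiny sketch, assuming the hc.complete object above: tabulating the cutree output is a quick way to see how many observations fall into each cluster for a given cut.

        table(cutree(hc.complete, k = 2))   # cluster sizes for the two-class cut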





        • I prefer Wards over complete.

          – Chris
          Nov 29 '17 at 1:47

































        A simple solution is the library factoextra. You can change the clustering method and the method for calculating the best number of groups. For example, if you want to know the best number of clusters for k-means:



        Data: mtcars



        library(factoextra)
        fviz_nbclust(mtcars, kmeans, method = "wss") +
          geom_vline(xintercept = 3, linetype = 2) +
          labs(subtitle = "Elbow method")


        Finally, we get a graph like this:

        [image: within-groups sum of squares elbow plot with a vertical line at k = 3]

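        The same helper also supports other criteria; a sketch using the average-silhouette method instead of WSS (factoextra additionally offers method = "gap_stat"):

        library(factoextra)
        fviz_nbclust(mtcars, kmeans, method = "silhouette") +
          labs(subtitle = "Silhouette method")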





        share|improve this answer






















          protected by Community Mar 27 '14 at 16:55



          Thank you for your interest in this question.
          Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



          Would you like to answer one of these unanswered questions instead?














          7 Answers
          7






          active

          oldest

          votes








          7 Answers
          7






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          983














          If your question is how can I determine how many clusters are appropriate for a kmeans analysis of my data?, then here are some options. The wikipedia article on determining numbers of clusters has a good review of some of these methods.



          First, some reproducible data (the data in the Q are... unclear to me):



          n = 100
          g = 6
          set.seed(g)
          d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
          y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
          plot(d)


          enter image description here



          One. Look for a bend or elbow in the sum of squared error (SSE) scree plot. See http://www.statmethods.net/advstats/cluster.html & http://www.mattpeeples.net/kmeans.html for more. The location of the elbow in the resulting plot suggests a suitable number of clusters for the kmeans:



          mydata <- d
          wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
          for (i in 2:15) wss[i] <- sum(kmeans(mydata,
          centers=i)$withinss)
          plot(1:15, wss, type="b", xlab="Number of Clusters",
          ylab="Within groups sum of squares")


          We might conclude that 4 clusters would be indicated by this method:
          enter image description here



          Two. You can do partitioning around medoids to estimate the number of clusters using the pamk function in the fpc package.



          library(fpc)
          pamk.best <- pamk(d)
          cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "n")
          plot(pam(d, pamk.best$nc))


          enter image description hereenter image description here



          # we could also do:
          library(fpc)
          asw <- numeric(20)
          for (k in 2:20)
          asw[[k]] <- pam(d, k) $ silinfo $ avg.width
          k.best <- which.max(asw)
          cat("silhouette-optimal number of clusters:", k.best, "n")
          # still 4


          Three. Calinsky criterion: Another approach to diagnosing how many clusters suit the data. In this case we try 1 to 10 groups.



          require(vegan)
          fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter = 1000)
          plot(fit, sortg = TRUE, grpmts.plot = TRUE)
          calinski.best <- as.numeric(which.max(fit$results[2,]))
          cat("Calinski criterion optimal number of clusters:", calinski.best, "n")
          # 5 clusters!


          enter image description here



          Four. Determine the optimal model and number of clusters according to the Bayesian Information Criterion for expectation-maximization, initialized by hierarchical clustering for parameterized Gaussian mixture models



          # See http://www.jstatsoft.org/v18/i06/paper
          # http://www.stat.washington.edu/research/reports/2006/tr504.pdf
          #
          library(mclust)
          # Run the function to see how many clusters
          # it finds to be optimal, set it to search for
          # at least 1 model and up 20.
          d_clust <- Mclust(as.matrix(d), G=1:20)
          m.best <- dim(d_clust$z)[2]
          cat("model-based optimal number of clusters:", m.best, "n")
          # 4 clusters
          plot(d_clust)


          enter image description hereenter image description hereenter image description here



          Five. Affinity propagation (AP) clustering, see http://dx.doi.org/10.1126/science.1136800



          library(apcluster)
          d.apclus <- apcluster(negDistMat(r=2), d)
          cat("affinity propogation optimal number of clusters:", length(d.apclus@clusters), "n")
          # 4
          heatmap(d.apclus)
          plot(d.apclus, d)


          enter image description hereenter image description here



          Six. Gap Statistic for Estimating the Number of Clusters. See also some code for a nice graphical output. Trying 2-10 clusters here:



          library(cluster)
          clusGap(d, kmeans, 10, B = 100, verbose = interactive())

          Clustering k = 1,2,..., K.max (= 10): .. done
          Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
          .................................................. 50
          .................................................. 100
          Clustering Gap statistic ["clusGap"].
          B=100 simulated reference sets, k = 1..10
          --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
          logW E.logW gap SE.sim
          [1,] 5.991701 5.970454 -0.0212471 0.04388506
          [2,] 5.152666 5.367256 0.2145907 0.04057451
          [3,] 4.557779 5.069601 0.5118225 0.03215540
          [4,] 3.928959 4.880453 0.9514943 0.04630399
          [5,] 3.789319 4.766903 0.9775842 0.04826191
          [6,] 3.747539 4.670100 0.9225607 0.03898850
          [7,] 3.582373 4.590136 1.0077628 0.04892236
          [8,] 3.528791 4.509247 0.9804556 0.04701930
          [9,] 3.442481 4.433200 0.9907197 0.04935647
          [10,] 3.445291 4.369232 0.9239414 0.05055486


          Here's the output from Edwin Chen's implementation of the gap statistic:
          enter image description here



          Seven. You may also find it useful to explore your data with clustergrams to visualize cluster assignment, see http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/ for more details.



          Eight. The NbClust package provides 30 indices to determine the number of clusters in a dataset.



          library(NbClust)
          nb <- NbClust(d, diss="NULL", distance = "euclidean",
          min.nc=2, max.nc=15, method = "kmeans",
          index = "alllong", alphaBeale = 0.1)
          hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
          # Looks like 3 is the most frequently determined number of clusters
          # and curiously, four clusters is not in the output at all!


          enter image description here



          If your question is how can I produce a dendrogram to visualize the results of my cluster analysis, then you should start with these:
          http://www.statmethods.net/advstats/cluster.html
          http://www.r-tutor.com/gpu-computing/clustering/hierarchical-cluster-analysis
          http://gastonsanchez.wordpress.com/2012/10/03/7-ways-to-plot-dendrograms-in-r/ And see here for more exotic methods: http://cran.r-project.org/web/views/Cluster.html



          Here are a few examples:



          d_dist <- dist(as.matrix(d))   # find distance matrix 
          plot(hclust(d_dist)) # apply hirarchical clustering and plot


          enter image description here



          # a Bayesian clustering method, good for high-dimension data, more details:
          # http://vahid.probstat.ca/paper/2012-bclust.pdf
          install.packages("bclust")
          library(bclust)
          x <- as.matrix(d)
          d.bclus <- bclust(x, transformed.par = c(0, -50, log(16), 0, 0, 0))
          viplot(imp(d.bclus)$var); plot(d.bclus); ditplot(d.bclus)
          dptplot(d.bclus, scale = 20, horizbar.plot = TRUE,varimp = imp(d.bclus)$var, horizbar.distance = 0, dendrogram.lwd = 2)
          # I just include the dendrogram here


          enter image description here



          Also for high-dimension data is the pvclust library which calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Here's the example from the documentation (wont work on such low dimensional data as in my example):



          library(pvclust)
          library(MASS)
          data(Boston)
          boston.pv <- pvclust(Boston)
          plot(boston.pv)


          enter image description here



          Does any of that help?






          share|improve this answer





















          • 136





            This might be the most graphy answer I've ever seen on SO. +1

            – K. Barresi
            Jun 17 '13 at 23:12








          • 34





            Thanks, it was fun to put together (I'm an adherent of the graphism thesis)

            – Ben
            Jun 17 '13 at 23:32






          • 12





            Yes, this was excellent... thanks so much for the awesome explanation. Not just the graphs, but the packages & the code to show how to utilize them! :)

            – Mike Williamson
            Oct 12 '13 at 1:39






          • 26





            This is simply the greatest stack answer I have ever seen. Awesome effort and great explanation!!!

            – Cybernetic
            Mar 29 '14 at 16:52






          • 17





            Such a high standard answer deserves a unique and special platinium tag.

            – Colonel Beauvel
            Feb 24 '15 at 13:26
















          983














          If your question is how can I determine how many clusters are appropriate for a kmeans analysis of my data?, then here are some options. The wikipedia article on determining numbers of clusters has a good review of some of these methods.



          First, some reproducible data (the data in the Q are... unclear to me):



          n = 100
          g = 6
          set.seed(g)
          d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
          y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
          plot(d)


          enter image description here



          One. Look for a bend or elbow in the sum of squared error (SSE) scree plot. See http://www.statmethods.net/advstats/cluster.html & http://www.mattpeeples.net/kmeans.html for more. The location of the elbow in the resulting plot suggests a suitable number of clusters for the kmeans:



          mydata <- d
          wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
          for (i in 2:15) wss[i] <- sum(kmeans(mydata,
          centers=i)$withinss)
          plot(1:15, wss, type="b", xlab="Number of Clusters",
          ylab="Within groups sum of squares")


          We might conclude that 4 clusters would be indicated by this method:
          enter image description here



          Two. You can do partitioning around medoids to estimate the number of clusters using the pamk function in the fpc package.



          library(fpc)
          pamk.best <- pamk(d)
          cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "n")
          plot(pam(d, pamk.best$nc))


          enter image description hereenter image description here



          # we could also do:
          library(fpc)
          asw <- numeric(20)
          for (k in 2:20)
          asw[[k]] <- pam(d, k) $ silinfo $ avg.width
          k.best <- which.max(asw)
          cat("silhouette-optimal number of clusters:", k.best, "n")
          # still 4


          Three. Calinsky criterion: Another approach to diagnosing how many clusters suit the data. In this case we try 1 to 10 groups.



          require(vegan)
          fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter = 1000)
          plot(fit, sortg = TRUE, grpmts.plot = TRUE)
          calinski.best <- as.numeric(which.max(fit$results[2,]))
          cat("Calinski criterion optimal number of clusters:", calinski.best, "n")
          # 5 clusters!


          enter image description here



          Four. Determine the optimal model and number of clusters according to the Bayesian Information Criterion for expectation-maximization, initialized by hierarchical clustering for parameterized Gaussian mixture models



          # See http://www.jstatsoft.org/v18/i06/paper
          # http://www.stat.washington.edu/research/reports/2006/tr504.pdf
          #
          library(mclust)
          # Run the function to see how many clusters
          # it finds to be optimal, set it to search for
          # at least 1 model and up 20.
          d_clust <- Mclust(as.matrix(d), G=1:20)
          m.best <- dim(d_clust$z)[2]
          cat("model-based optimal number of clusters:", m.best, "n")
          # 4 clusters
          plot(d_clust)


          enter image description hereenter image description hereenter image description here



          Five. Affinity propagation (AP) clustering, see http://dx.doi.org/10.1126/science.1136800



          library(apcluster)
          d.apclus <- apcluster(negDistMat(r=2), d)
          cat("affinity propogation optimal number of clusters:", length(d.apclus@clusters), "n")
          # 4
          heatmap(d.apclus)
          plot(d.apclus, d)


          enter image description hereenter image description here



          Six. Gap Statistic for Estimating the Number of Clusters. See also some code for a nice graphical output. Trying 2-10 clusters here:



          library(cluster)
          clusGap(d, kmeans, 10, B = 100, verbose = interactive())

          Clustering k = 1,2,..., K.max (= 10): .. done
          Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
          .................................................. 50
          .................................................. 100
          Clustering Gap statistic ["clusGap"].
          B=100 simulated reference sets, k = 1..10
          --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
          logW E.logW gap SE.sim
          [1,] 5.991701 5.970454 -0.0212471 0.04388506
          [2,] 5.152666 5.367256 0.2145907 0.04057451
          [3,] 4.557779 5.069601 0.5118225 0.03215540
          [4,] 3.928959 4.880453 0.9514943 0.04630399
          [5,] 3.789319 4.766903 0.9775842 0.04826191
          [6,] 3.747539 4.670100 0.9225607 0.03898850
          [7,] 3.582373 4.590136 1.0077628 0.04892236
          [8,] 3.528791 4.509247 0.9804556 0.04701930
          [9,] 3.442481 4.433200 0.9907197 0.04935647
          [10,] 3.445291 4.369232 0.9239414 0.05055486


          Here's the output from Edwin Chen's implementation of the gap statistic:
          enter image description here



          Seven. You may also find it useful to explore your data with clustergrams to visualize cluster assignment, see http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/ for more details.



          Eight. The NbClust package provides 30 indices to determine the number of clusters in a dataset.



          library(NbClust)
          nb <- NbClust(d, diss="NULL", distance = "euclidean",
          min.nc=2, max.nc=15, method = "kmeans",
          index = "alllong", alphaBeale = 0.1)
          hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
          # Looks like 3 is the most frequently determined number of clusters
          # and curiously, four clusters is not in the output at all!


          enter image description here



          If your question is how can I produce a dendrogram to visualize the results of my cluster analysis, then you should start with these:
          http://www.statmethods.net/advstats/cluster.html
          http://www.r-tutor.com/gpu-computing/clustering/hierarchical-cluster-analysis
          http://gastonsanchez.wordpress.com/2012/10/03/7-ways-to-plot-dendrograms-in-r/ And see here for more exotic methods: http://cran.r-project.org/web/views/Cluster.html



          Here are a few examples:



          d_dist <- dist(as.matrix(d))   # find distance matrix 
          plot(hclust(d_dist)) # apply hirarchical clustering and plot


          enter image description here



          # a Bayesian clustering method, good for high-dimension data, more details:
          # http://vahid.probstat.ca/paper/2012-bclust.pdf
          install.packages("bclust")
          library(bclust)
          x <- as.matrix(d)
          d.bclus <- bclust(x, transformed.par = c(0, -50, log(16), 0, 0, 0))
          viplot(imp(d.bclus)$var); plot(d.bclus); ditplot(d.bclus)
          dptplot(d.bclus, scale = 20, horizbar.plot = TRUE,varimp = imp(d.bclus)$var, horizbar.distance = 0, dendrogram.lwd = 2)
          # I just include the dendrogram here


          enter image description here



          Also for high-dimension data is the pvclust library which calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Here's the example from the documentation (wont work on such low dimensional data as in my example):



          library(pvclust)
          library(MASS)
          data(Boston)
          boston.pv <- pvclust(Boston)
          plot(boston.pv)


          enter image description here



          Does any of that help?






          share|improve this answer





















          • 136





            This might be the most graphy answer I've ever seen on SO. +1

            – K. Barresi
            Jun 17 '13 at 23:12








          • 34





            Thanks, it was fun to put together (I'm an adherent of the graphism thesis)

            – Ben
            Jun 17 '13 at 23:32






          • 12





            Yes, this was excellent... thanks so much for the awesome explanation. Not just the graphs, but the packages & the code to show how to utilize them! :)

            – Mike Williamson
            Oct 12 '13 at 1:39






          • 26





            This is simply the greatest stack answer I have ever seen. Awesome effort and great explanation!!!

            – Cybernetic
            Mar 29 '14 at 16:52






          • 17





            Such a high standard answer deserves a unique and special platinium tag.

            – Colonel Beauvel
            Feb 24 '15 at 13:26














          983












          983








          983







          If your question is how can I determine how many clusters are appropriate for a kmeans analysis of my data?, then here are some options. The wikipedia article on determining numbers of clusters has a good review of some of these methods.



          First, some reproducible data (the data in the Q are... unclear to me):



          n = 100
          g = 6
          set.seed(g)
          d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
          y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
          plot(d)


          enter image description here



          One. Look for a bend or elbow in the sum of squared error (SSE) scree plot. See http://www.statmethods.net/advstats/cluster.html & http://www.mattpeeples.net/kmeans.html for more. The location of the elbow in the resulting plot suggests a suitable number of clusters for the kmeans:



          mydata <- d
          wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
          for (i in 2:15) wss[i] <- sum(kmeans(mydata,
          centers=i)$withinss)
          plot(1:15, wss, type="b", xlab="Number of Clusters",
          ylab="Within groups sum of squares")


          We might conclude that 4 clusters would be indicated by this method:
          enter image description here



          Two. You can do partitioning around medoids to estimate the number of clusters using the pamk function in the fpc package.



          library(fpc)
          pamk.best <- pamk(d)
          cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "n")
          plot(pam(d, pamk.best$nc))


          enter image description hereenter image description here



          # we could also do:
          library(fpc)
          asw <- numeric(20)
          for (k in 2:20)
          asw[[k]] <- pam(d, k) $ silinfo $ avg.width
          k.best <- which.max(asw)
          cat("silhouette-optimal number of clusters:", k.best, "n")
          # still 4


          Three. Calinsky criterion: Another approach to diagnosing how many clusters suit the data. In this case we try 1 to 10 groups.



          require(vegan)
          fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter = 1000)
          plot(fit, sortg = TRUE, grpmts.plot = TRUE)
          calinski.best <- as.numeric(which.max(fit$results[2,]))
          cat("Calinski criterion optimal number of clusters:", calinski.best, "n")
          # 5 clusters!


          enter image description here



          Four. Determine the optimal model and number of clusters according to the Bayesian Information Criterion for expectation-maximization, initialized by hierarchical clustering for parameterized Gaussian mixture models



          # See http://www.jstatsoft.org/v18/i06/paper
          # http://www.stat.washington.edu/research/reports/2006/tr504.pdf
          #
          library(mclust)
          # Run the function to see how many clusters
          # it finds to be optimal, set it to search for
          # at least 1 model and up 20.
          d_clust <- Mclust(as.matrix(d), G=1:20)
          m.best <- dim(d_clust$z)[2]
          cat("model-based optimal number of clusters:", m.best, "n")
          # 4 clusters
          plot(d_clust)


          enter image description hereenter image description hereenter image description here



          Five. Affinity propagation (AP) clustering, see http://dx.doi.org/10.1126/science.1136800



          library(apcluster)
          d.apclus <- apcluster(negDistMat(r=2), d)
          cat("affinity propogation optimal number of clusters:", length(d.apclus@clusters), "n")
          # 4
          heatmap(d.apclus)
          plot(d.apclus, d)


          enter image description hereenter image description here



          Six. Gap Statistic for Estimating the Number of Clusters. See also some code for a nice graphical output. Trying 2-10 clusters here:



          library(cluster)
          clusGap(d, kmeans, 10, B = 100, verbose = interactive())

          Clustering k = 1,2,..., K.max (= 10): .. done
          Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
          .................................................. 50
          .................................................. 100
          Clustering Gap statistic ["clusGap"].
          B=100 simulated reference sets, k = 1..10
          --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
          logW E.logW gap SE.sim
          [1,] 5.991701 5.970454 -0.0212471 0.04388506
          [2,] 5.152666 5.367256 0.2145907 0.04057451
          [3,] 4.557779 5.069601 0.5118225 0.03215540
          [4,] 3.928959 4.880453 0.9514943 0.04630399
          [5,] 3.789319 4.766903 0.9775842 0.04826191
          [6,] 3.747539 4.670100 0.9225607 0.03898850
          [7,] 3.582373 4.590136 1.0077628 0.04892236
          [8,] 3.528791 4.509247 0.9804556 0.04701930
          [9,] 3.442481 4.433200 0.9907197 0.04935647
          [10,] 3.445291 4.369232 0.9239414 0.05055486


          Here's the output from Edwin Chen's implementation of the gap statistic:
          enter image description here



          Seven. You may also find it useful to explore your data with clustergrams to visualize cluster assignment, see http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/ for more details.



          Eight. The NbClust package provides 30 indices to determine the number of clusters in a dataset.



          library(NbClust)
          nb <- NbClust(d, diss="NULL", distance = "euclidean",
          min.nc=2, max.nc=15, method = "kmeans",
          index = "alllong", alphaBeale = 0.1)
          hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
          # Looks like 3 is the most frequently determined number of clusters
          # and curiously, four clusters is not in the output at all!


          enter image description here



          If your question is how can I produce a dendrogram to visualize the results of my cluster analysis, then you should start with these:
          http://www.statmethods.net/advstats/cluster.html
          http://www.r-tutor.com/gpu-computing/clustering/hierarchical-cluster-analysis
          http://gastonsanchez.wordpress.com/2012/10/03/7-ways-to-plot-dendrograms-in-r/ And see here for more exotic methods: http://cran.r-project.org/web/views/Cluster.html



          Here are a few examples:



          d_dist <- dist(as.matrix(d))   # find distance matrix 
          plot(hclust(d_dist)) # apply hirarchical clustering and plot


          enter image description here



          # a Bayesian clustering method, good for high-dimension data, more details:
          # http://vahid.probstat.ca/paper/2012-bclust.pdf
          install.packages("bclust")
          library(bclust)
          x <- as.matrix(d)
          d.bclus <- bclust(x, transformed.par = c(0, -50, log(16), 0, 0, 0))
          viplot(imp(d.bclus)$var); plot(d.bclus); ditplot(d.bclus)
          dptplot(d.bclus, scale = 20, horizbar.plot = TRUE,varimp = imp(d.bclus)$var, horizbar.distance = 0, dendrogram.lwd = 2)
          # I just include the dendrogram here


          enter image description here



          Also for high-dimension data is the pvclust library which calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Here's the example from the documentation (wont work on such low dimensional data as in my example):



          library(pvclust)
          library(MASS)
          data(Boston)
          boston.pv <- pvclust(Boston)
          plot(boston.pv)


          enter image description here



          Does any of that help?






          share|improve this answer















          If your question is how can I determine how many clusters are appropriate for a kmeans analysis of my data?, then here are some options. The wikipedia article on determining numbers of clusters has a good review of some of these methods.



          First, some reproducible data (the data in the Q are... unclear to me):



          n = 100
          g = 6
          set.seed(g)
          d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
          y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
          plot(d)


          enter image description here



          One. Look for a bend or elbow in the sum of squared error (SSE) scree plot. See http://www.statmethods.net/advstats/cluster.html & http://www.mattpeeples.net/kmeans.html for more. The location of the elbow in the resulting plot suggests a suitable number of clusters for the kmeans:



          mydata <- d
          wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
          for (i in 2:15) wss[i] <- sum(kmeans(mydata,
          centers=i)$withinss)
          plot(1:15, wss, type="b", xlab="Number of Clusters",
          ylab="Within groups sum of squares")


          We might conclude that 4 clusters would be indicated by this method:
          enter image description here



          Two. You can do partitioning around medoids to estimate the number of clusters using the pamk function in the fpc package.



          library(fpc)
          pamk.best <- pamk(d)
          cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "n")
          plot(pam(d, pamk.best$nc))


          enter image description hereenter image description here



          # we could also do:
          library(fpc)
          asw <- numeric(20)
          for (k in 2:20)
          asw[[k]] <- pam(d, k) $ silinfo $ avg.width
          k.best <- which.max(asw)
          cat("silhouette-optimal number of clusters:", k.best, "n")
          # still 4


          Three. Calinsky criterion: Another approach to diagnosing how many clusters suit the data. In this case we try 1 to 10 groups.



          require(vegan)
          fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter = 1000)
          plot(fit, sortg = TRUE, grpmts.plot = TRUE)
          calinski.best <- as.numeric(which.max(fit$results[2,]))
          cat("Calinski criterion optimal number of clusters:", calinski.best, "n")
          # 5 clusters!


          enter image description here



          Four. Determine the optimal model and number of clusters according to the Bayesian Information Criterion for expectation-maximization, initialized by hierarchical clustering for parameterized Gaussian mixture models



          # See http://www.jstatsoft.org/v18/i06/paper
          # http://www.stat.washington.edu/research/reports/2006/tr504.pdf
          #
          library(mclust)
          # Run the function to see how many clusters
          # it finds to be optimal, set it to search for
          # at least 1 model and up 20.
          d_clust <- Mclust(as.matrix(d), G=1:20)
          m.best <- dim(d_clust$z)[2]
          cat("model-based optimal number of clusters:", m.best, "n")
          # 4 clusters
          plot(d_clust)


          enter image description hereenter image description hereenter image description here



          Five. Affinity propagation (AP) clustering, see http://dx.doi.org/10.1126/science.1136800



          library(apcluster)
          d.apclus <- apcluster(negDistMat(r=2), d)
          cat("affinity propogation optimal number of clusters:", length(d.apclus@clusters), "n")
          # 4
          heatmap(d.apclus)
          plot(d.apclus, d)


          enter image description hereenter image description here



          Six. Gap Statistic for Estimating the Number of Clusters. See also some code for a nice graphical output. Trying 2-10 clusters here:



          library(cluster)
          clusGap(d, kmeans, 10, B = 100, verbose = interactive())

          Clustering k = 1,2,..., K.max (= 10): .. done
          Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
          .................................................. 50
          .................................................. 100
          Clustering Gap statistic ["clusGap"].
          B=100 simulated reference sets, k = 1..10
          --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
          logW E.logW gap SE.sim
          [1,] 5.991701 5.970454 -0.0212471 0.04388506
          [2,] 5.152666 5.367256 0.2145907 0.04057451
          [3,] 4.557779 5.069601 0.5118225 0.03215540
          [4,] 3.928959 4.880453 0.9514943 0.04630399
          [5,] 3.789319 4.766903 0.9775842 0.04826191
          [6,] 3.747539 4.670100 0.9225607 0.03898850
          [7,] 3.582373 4.590136 1.0077628 0.04892236
          [8,] 3.528791 4.509247 0.9804556 0.04701930
          [9,] 3.442481 4.433200 0.9907197 0.04935647
          [10,] 3.445291 4.369232 0.9239414 0.05055486


          Here's the output from Edwin Chen's implementation of the gap statistic:
          enter image description here



          Seven. You may also find it useful to explore your data with clustergrams to visualize cluster assignment, see http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/ for more details.



          Eight. The NbClust package provides 30 indices to determine the number of clusters in a dataset.



          library(NbClust)
          nb <- NbClust(d, diss="NULL", distance = "euclidean",
          min.nc=2, max.nc=15, method = "kmeans",
          index = "alllong", alphaBeale = 0.1)
          hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
          # Looks like 3 is the most frequently determined number of clusters
          # and curiously, four clusters is not in the output at all!


          enter image description here



          If your question is how can I produce a dendrogram to visualize the results of my cluster analysis, then you should start with these:
          http://www.statmethods.net/advstats/cluster.html
          http://www.r-tutor.com/gpu-computing/clustering/hierarchical-cluster-analysis
          http://gastonsanchez.wordpress.com/2012/10/03/7-ways-to-plot-dendrograms-in-r/ And see here for more exotic methods: http://cran.r-project.org/web/views/Cluster.html



          Here are a few examples:



          d_dist <- dist(as.matrix(d))   # find distance matrix 
          plot(hclust(d_dist)) # apply hirarchical clustering and plot


          enter image description here



          # a Bayesian clustering method, good for high-dimension data, more details:
          # http://vahid.probstat.ca/paper/2012-bclust.pdf
          install.packages("bclust")
          library(bclust)
          x <- as.matrix(d)
          d.bclus <- bclust(x, transformed.par = c(0, -50, log(16), 0, 0, 0))
          viplot(imp(d.bclus)$var); plot(d.bclus); ditplot(d.bclus)
          dptplot(d.bclus, scale = 20, horizbar.plot = TRUE,varimp = imp(d.bclus)$var, horizbar.distance = 0, dendrogram.lwd = 2)
          # I just include the dendrogram here


          enter image description here



          Also for high-dimension data is the pvclust library which calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Here's the example from the documentation (wont work on such low dimensional data as in my example):



          library(pvclust)
          library(MASS)
          data(Boston)
          boston.pv <- pvclust(Boston)
          plot(boston.pv)


          enter image description here



          Does any of that help?







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Apr 12 '15 at 21:22









          ybungalobill

          45.5k1395162




          45.5k1395162










          answered Mar 13 '13 at 3:20









          BenBen

          31.8k1398170




          31.8k1398170








          • 136





            This might be the most graphy answer I've ever seen on SO. +1

            – K. Barresi
            Jun 17 '13 at 23:12








          • 34





            Thanks, it was fun to put together (I'm an adherent of the graphism thesis)

            – Ben
            Jun 17 '13 at 23:32






          • 12





            Yes, this was excellent... thanks so much for the awesome explanation. Not just the graphs, but the packages & the code to show how to utilize them! :)

            – Mike Williamson
            Oct 12 '13 at 1:39






          • 26





            This is simply the greatest stack answer I have ever seen. Awesome effort and great explanation!!!

            – Cybernetic
            Mar 29 '14 at 16:52






          • 17





            Such a high standard answer deserves a unique and special platinium tag.

            – Colonel Beauvel
            Feb 24 '15 at 13:26














It's hard to add anything to such an elaborate answer, but I feel we should mention identify here, particularly because @Ben shows a lot of dendrogram examples.



          d_dist <- dist(as.matrix(d))   # find distance matrix 
          plot(hclust(d_dist))
          clusters <- identify(hclust(d_dist))


identify lets you interactively choose clusters from a dendrogram and stores your choices in a list. Hit Esc to leave interactive mode and return to the R console. Note that the list contains the indices, not the row names (as opposed to cutree).
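For a non-interactive, reproducible alternative (my sketch, not part of the original answer), cutree() cuts the same tree into a fixed number of groups and returns a vector keyed by the row names of the data:

hc <- hclust(d_dist)
groups <- cutree(hc, k = 4)   # k = 4 is only an illustrative choice
table(groups)                 # size of each cluster
head(groups)                  # assignments, named by the rows of d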






– answered Apr 19 '16 by Matt Bannert (edited Aug 1 '16)
To determine the optimal number of clusters k, I usually combine the elbow method with parallel processing to save time. The code looks like this:



                  Elbow method



library(GMD)   # provides css.hclust() and elbow.batch()

elbow.k <- function(mydata){
  dist.obj <- dist(mydata)                       # distance matrix
  hclust.obj <- hclust(dist.obj)                 # hierarchical clustering
  css.obj <- css.hclust(dist.obj, hclust.obj)    # clustering sum of squares per number of clusters
  elbow.obj <- elbow.batch(css.obj)              # locate the elbow
  k <- elbow.obj$k
  return(k)
}


Running the elbow method in parallel



library(parallel)   # detectCores(), makeCluster(), parSapply()

no_cores <- detectCores()
cl <- makeCluster(no_cores)
clusterEvalQ(cl, library(GMD))
# export the objects the workers need (data.clustering etc. come from the author's own workspace)
clusterExport(cl, list("data.clustering", "data.convert", "elbow.k", "clustering.kmeans"))
start.time <- Sys.time()
k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering))
end.time <- Sys.time()
stopCluster(cl)
cat('Time to find k using the elbow method:',
    as.numeric(end.time - start.time, units = "secs"),
    'seconds, with k value:', k.clusters)


                  It works well.
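If you'd rather avoid the GMD dependency, here is a minimal, non-parallel sketch of the same elbow idea (my addition, not from the original answer): compute the total within-cluster sum of squares from kmeans() over a range of k and look for the bend in the curve.

set.seed(123)
k.range <- 1:10
wss <- sapply(k.range, function(k) kmeans(d, centers = k, nstart = 25)$tot.withinss)
plot(k.range, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")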






– answered Aug 9 '16 by VanThaoNguyen

• "The elbow and css functions are coming from the GMD package: cran.r-project.org/web/packages/GMD/GMD.pdf" – Rohan, Jun 28 '17
Splendid answer from Ben. However, I'm surprised that the Affinity Propagation (AP) method has been suggested here just to find the number of clusters for k-means, when in general AP does a better job of clustering the data. See the scientific paper supporting this method in Science:



Frey, Brendan J., and Delbert Dueck. "Clustering by passing messages between data points." Science 315.5814 (2007): 972-976.



So if you are not committed to k-means, I suggest using AP directly; it will cluster the data without requiring you to know the number of clusters in advance:



                  library(apcluster)
                  apclus = apcluster(negDistMat(r=2), data)
                  show(apclus)


If negative Euclidean distances are not appropriate, you can use other similarity measures provided in the same package. For example, for similarities based on Spearman correlations, this is what you need:



                  sim = corSimMat(data, method="spearman")
                  apclus = apcluster(s=sim)


Please note that those similarity functions in the apcluster package are just provided for convenience. In fact, the apcluster() function will accept any matrix of correlations. The same result as with corSimMat() above can be obtained with:



                  sim = cor(data, method="spearman")


                  or



                  sim = cor(t(data), method="spearman")


depending on whether you want to cluster the rows or the columns of your matrix.
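To connect this back to the question of how many clusters the data contain, here is a small sketch of mine (assuming the standard accessors of the APResult object returned by apcluster): the number of exemplars equals the number of clusters AP found.

length(apclus@clusters)   # number of clusters found by affinity propagation
apclus@exemplars          # the exemplar (prototype) observation of each cluster
plot(apclus, data)        # for 2-D data, overlays the clusters on a scatter plot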






– answered Apr 12 '17 by zsram (edited Aug 5 '17)
These methods are great, but when trying to find k for much larger data sets they can be crazy slow in R.



                          A good solution I have found is the "RWeka" package, which has an efficient implementation of the X-Means algorithm - an extended version of K-Means that scales better and will determine the optimum number of clusters for you.



First you'll want to make sure that Weka is installed on your system and that XMeans has been installed through Weka's package manager tool.



                          library(RWeka)

                          # Print a list of available options for the X-Means algorithm
                          WOW("XMeans")

                          # Create a Weka_control object which will specify our parameters
weka_ctrl <- Weka_control(
  I = 1000,                          # max no. of overall iterations
  M = 1000,                          # max no. of iterations in the kMeans loop
  L = 20,                            # min no. of clusters
  H = 150,                           # max no. of clusters
  D = "weka.core.EuclideanDistance", # distance metric: Euclidean
  C = 0.4,                           # cutoff factor ???
  S = 12                             # random number seed (for reproducibility)
)

                          # Run the algorithm on your data, d
                          x_means <- XMeans(d, control = weka_ctrl)

                          # Assign cluster IDs to original data set
                          d$xmeans.cluster <- x_means$class_ids
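As a quick sanity check of the result (my addition, plain base R; I'm assuming the class IDs returned by RWeka are 0-based, as in Weka itself, hence the +1 when mapping them to plotting colours), you can colour the original points by their assigned cluster:

plot(d$x, d$y, col = d$xmeans.cluster + 1, pch = 19,
     xlab = "x", ylab = "y",
     main = "XMeans cluster assignments")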





– answered Dec 27 '17 by RDRR
The answers are great. If you want to give another clustering method a chance, you can use hierarchical clustering and see how the data split.



                                  > set.seed(2)
                                  > x=matrix(rnorm(50*2), ncol=2)
                                  > hc.complete = hclust(dist(x), method="complete")
                                  > plot(hc.complete)


(dendrogram from complete-linkage hierarchical clustering)



Depending on how many classes you need, you can cut your dendrogram like this:



                                  > cutree(hc.complete,k = 2)
                                  [1] 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1
                                  [26] 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 2


If you type ?cutree you will see the definitions. If your data set has three classes, it is simply cutree(hc.complete, k = 3). The equivalent of cutree(hc.complete, k = 2) is cutree(hc.complete, h = 4.9).
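Following up on the comment below about preferring Ward's linkage over complete linkage (a sketch of mine, not part of the original answer): hclust supports it via method = "ward.D2", and the tree is cut in exactly the same way.

hc.ward <- hclust(dist(x), method = "ward.D2")   # Ward's minimum-variance linkage
plot(hc.ward)
cutree(hc.ward, k = 2)                           # same cutting interface as before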






– answered Aug 22 '17 by boyaronur

• "I prefer Wards over complete." – Chris, Nov 29 '17
A simple solution is the factoextra library. You can change both the clustering method and the method for calculating the best number of groups. For example, if you want to know the best number of clusters for k-means:



                                  Data: mtcars



library(factoextra)
fviz_nbclust(mtcars, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2) +
  labs(subtitle = "Elbow method")


                                  Finally, we get a graph like:



(elbow plot: total within-cluster sum of squares against the number of clusters, with a dashed line at k = 3)
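fviz_nbclust is not limited to the elbow criterion; as a further sketch (my addition, same package and data), the average silhouette width and the gap statistic can be requested in the same way, and both mark their suggested k on the plot:

# average silhouette method
fviz_nbclust(mtcars, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")

# gap statistic (bootstrap-based, so slower)
set.seed(123)
fviz_nbclust(mtcars, kmeans, method = "gap_stat", nboot = 50) +
  labs(subtitle = "Gap statistic method")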






– answered Dec 12 '18 by Cro-Magnon