Microsoft Excel is omni present. It is also an excellent vehicle for implementing many algorithms in their basic form to gain a better understanding of the underlying concepts. In my previous post, I had provided an Excel implementation of a classification problem to help understand naive Bayes classifier. In this post I want to demonstrate K-means clustering via an Excel implementation.
Clustering implies partitioning data into meaningful groups such that data items within each group are similar but are dissimilar across different groups. One of the most widely used clustering procedure is the K-means clustering algorithm, which seeks to group data into K clusters. The K-means algorithm is meant for situations where data items are represented as vectors of continuous attributes, i.e. numeric attributes, and each cluster is represented by the mean/centroid of respective cluster members. The similarity between different data items in such a case is measured by the Euclidean distance. The steps of the K-means algorithm are:
Step 1: Randomly assign every data item to one of the K clusters. (K is a user specified parameter)
Step2: Calculate center for each cluster by taking mean of its corresponding data item vectors.
Step 3: Using the cluster centers of Step 2, Calculate new membership for each data item by locating its closest cluster center. Go back to Step 2.
Termination: The algorithm terminates when the cluster centers do not change from one iteration to the next.
The K-means clustering process can be viewed as an optimization process where the objective is to minimize the sum of the squared error (SSE), defined as follows:
where , the center of the i-th cluster with data items, is defined by
The SSE criteria means that grouping of data items should be as compact as possible. In other words, data items in a cluster should reside close to the cluster center.
The initial clustering assignment has an important effect on the final resulting clusters. This is generally true when the clusters are not well separated. This means that some initializations of cluster centers can result in algorithm terminating in local minima. Thus, it is a good idea to run K-means many times with different initial assignments. It should be obvious that as K goes up, the SSE goes down. The SSE goes to zero when the number of clusters equals the number of data items, i.e. each data item is its own cluster. Since the number of clusters is generally not known, it is common to run K-means for a range of K values and plot SSE against different K values. A knee point in such a plot generally indicates the value of K that should be selected.
Excel Illustration of K-means
I will illustrate K-means using a small set, 10 to be exact, of datapoints in a two-dimensional space for easy visualization. To run through successive iterations, Excel will be used in the manual calculation mode. Lets set the maximum number of iterations to 5 by entering this value in a cell, naming it as max_count. Next, we set up some indicators to start the clustering process and to step through it. For this, lets name a cell iteration_count and enter in it the formula =IF(indicator_2,iteration_count+1,0), where indicator_2 is a cell that takes on logical True/False values. The formula in cell iteration_count behaves as a counter which gets initialized by the status of cell indicator_2. We use another cell named indicator_1 to limit the iterations to the number specified in cell max_count, and initialize the clustering process. We do this by entering in cell indicator_1 the formula =IF(iteration_count<max_count, TRUE,FALSE) and linking cell indicator_2 to indicator_1 by using the formula =indicator_1 in cell indicator_2. At this stage, our worksheet in formula view looks as shown below.
The next step is to set up a formula to assign cluster membership to each data point. We set it up in Column C next to data values. The formula is =IF(indicator_1,label,RANDBETWEEN(1,3)). It assigns randomly a membership number between 1 to 3 assuming we are going to create three clusters. The random assignment is done only at the beginning of the clustering process as determined by indicator_1; otherwise the cluster label is picked up by the nearest center, calculated separately and shown at cell location defined as label. A plot of initial assignments is shown below. I have also applied conditional formatting to cluster membership column to show memberships in different colors.
Next, we need to set up calculations for computing cluster centers based upon assigned data points. We do this at locations, say A19:B21, using the following formulas for cluster number 1 and similar formulas for other two clusters. In these formulas, cmembership refers to cell range C2:C11 where we are showing cluster membership.
=SUMIF(cmembership,1,feat1)/COUNTIF(cmembership,1), and =SUMIF(cmembership,1,feat2)/COUNTIF(cmembership,1)
To find the closest cluster centers for data points, we set up distance calculations and finding the cluster label of the closest center. A formula view of the appropriate part of the worksheet is shown below.
Once we have cluster centers, we can plot them along with data points and their respective cluster labels. At this stage, our worksheet is ready to perform clustering as we manually click through successive calculations. A series of plots shown below illustrate the K-means clustering process showing how cluster centers move and assignments of data points change through the process. In each plot, the numeric label next to a point indicates the current cluster membership of that point. These memberships determine the cluster centers shown in the plots. The three panels of plots correspond to three different runs of the K-means algorithm with different initializations, and are arranged from left to right as we step through the algorithm. Since the three groups off data points are well separated, the clustering process yields identical results through three different initializations; however, the number of iterations required to reach stable groupings are different.
Although I demonstrated the algorithm for three clusters with a small set of data points, it is pretty straightforward to extend thie demo to more points and a different number of clusters.