You have a large set of numeric, continuous data, with a variety of features, but no labels. You'd like to see if you can find interesting patterns in this data, but you're not sure where to start.
Enter K-Means clustering.
"Clustering" is finding groups of points in your data that share similar features. For example, if you were trying to divide up a city into neighborhoods, you would cluster similar houses together.
Continuing the example of neighborhoods: We don't necessarily know how many neighborhoods are in our city before we try to cluster them. There could be 3, 5, 10, or 50. That's where "K" comes in - K is the number of clusters you're identifying. "Means" is simply the idea that for all K clusters, you're taking their average - their mean!
K-Means is a clustering algorithm for "unsupervised learning" - when you don't have labels for your data. If we had data for every house in a city, but no labels for neighborhoods, we would want to use K-Means.
This is a question for which there are many answers. The simplest answer may just be, "trial and error." There are a variety of ways to do so; one being the "Elbow Method." The elbow method, while a bit advanced to implement, is quite intuitive to understand. It's a method used to visualize the best choice for K - as shown here (source: stackoverflow):
The X-axis is our choice for K, and the Y-axis is a measure of error. We can see the "elbow" point where the error sharply decreases; this is usually a good choice for K.
That's what we're about to do! However, before we dive into the programming aspects of it, let's think about the algorithm in English first.
That is one iteration of the K-Means algorithm! The algorithm can be repeated many times over to improve accuracy. Here's a helpful visual representation, thanks to Mubaris:
Now that we know how it works, let's return to the elbow method momentarily. As our value for K increases, so, too, does the time it takes for K-Means to complete. A downside to the algorithm is that it's not necessarily the fastest; especially when we have a huge amount of data and features, and a big K-value.
Thinking about it intuitively, the larger our value for K, the more times we'll have to iterate through our centroids when comparing them to our points. Over a large dataset, this can really start to add up. When choosing K, over a large set of data, you'll want to balance accuracy (which we can think about intuitively for this workshop) with run time.
Now, let's dive in!
Let's begin by importing all of the packages we need, as well as making some sample data to work on.
We'll import all our usual packages for data manipulation and visualization - making sure to include %matplotlib inline when we're working in a notebook.
import numpy as np %matplotlib inline import matplotlib.pyplot as plt import seaborn as sns plt.style.use('ggplot')
We're also going to import another function - make_blobs(), a function that simply returns a set of ordered pairs, as well their corresponding label. This is going to be the data we're working with.
We'll set a variable "pairs" to take the ordered pairs from the make_blobs() function, and "labels" will receive the value for the cluster each point is assigned to.
from sklearn.datasets.samples_generator import make_blobs pairs, labels = make_blobs(n_samples=300, cluster_std=1.5, centers=3, n_features=2, random_state=0) #cluster_std=0.25, pairs[:3]
array([[-0.39296326, 5.97931176], [ 0.96687145, -1.17738227], [ 2.11782667, 4.48629985]])
Perfect. So now we know what our data looks like numerically. What about visualy? Let's plot our data to get a feel for what it looks like.
plt.scatter(pairs[:,0], pairs[:,1], c='k')
<matplotlib.collections.PathCollection at 0x22990e1def0>
If you've followed along thus far, you should see three rough clusters to our data - or, if you've changed the value for "centers" in make_blobs, you'll see that many clusters. However, these really are rough. Where does one cluster end, and the next cluster start? K-Means is going to help us answer that by showing us what the midpoints - or, centroids - of this data are.
Now we can get to work on our clustering. We're going to start off by randomly initializing our centroids (k-number of them). Let's write a function to implement this.
Our function will take in two arguments - a dataset, and k, the number of clusters in our data. Our function will then take our dataset and slice it. The way our function slices our dataset is by returning k-number of points, all selected randomly.
Putting that together, we now have k-number of randomly selected centroids.
def choose_centroids(dataset, k): return dataset[np.random.randint(0, len(dataset) - 1, size=k)] #randomly choose our initial centroids choose_centroids(pairs, 3)
array([[ 0.12352249, 4.86491522], [ 3.80099612, 2.28214874], [ 1.28868255, 5.76874588]])
Sometimes you'll get lucky, and your centroids will already be pretty close to the midpoints they're trying to find - this is good! It means the algorithm won't have to repeat as many times. However, if your random centroids are all bunched up or are way off, then the algorithm will have to run more times.
So let's see what this looks like. Try the following cell a bunch of times, and see how close or far your centroids randomly initialize!
example_centroids = choose_centroids(pairs, 3) plt.scatter(pairs[:,0], pairs[:,1], c='k') plt.scatter(example_centroids[:,0], example_centroids[:,1], c='r')
<matplotlib.collections.PathCollection at 0x22991edccc0>
Next, we need to define how close or far our centroids are from their goal. We'll do so using Euclidean distance. We can think about Euclidean distance very similarly to the Pythagorean theorem.
def euclidean(a, b): return np.linalg.norm(a-b)
The goal of the Euclidean distance is to determine how far two points are from each other on a real number plane (like our plot). In our image below, we see how we can think about our two points - (x1,y1) and (x2, y2) - being connected by a triangle. The Euclidean distance is simply the hypotenuse of this triangle - the straight line between them!
So, all we're really calculating is something a lot like the Pythagorean theorem. Credit for this image to Rosalind.info.
Okay, now we can finally get into the actual K-Means algorithm. Fair warning, it's long - don't be intimidated! Nothing we're doing here is too complicated to understand. Let's go through this step by step, in plain English. We'll create a function, kmeans, that takes a dataset, a k value, and a maximum number of iterations as its arguments.
def kmeans(dataset, k, max_iterations):
We create a list to hold the history of how our centroids change, and variables to hold the number of rows and columns in our data.
old_centroids =  num_rows, num_columns = dataset.shape
We use the function we wrote to randomly choose our first centroids. We then add them to our list as the very first point in our centroids' history.
current_centroids = choose_centroids old_centroids.append(current_centroids)
We create a couple more empty variables - previous_centroids will keep track of the most recent centroid locations, and cluster_list will simply hold the cluster (1, 2, or 3) that each data point has been assigned to.
Now, we've set up the data structures we need for our algorithm, and we can begin our analysis. I recommend stopping every few steps, or wherever it feels natural, to look away from all the text and code and to simply think about what's happening in our code.
Our first for loop is simply the number of times our algorithm is going to run.
for i in range(max_iterations):
Now, we're going to go through every point in our data set. For every point in our data set, we're going to calculate the Euclidean distance between our point and our current centroids. Here, we set up an empty array to hold our distances.
for row, data_point in enumerate(dataset): distance_array = np.zeros((k,1))
Now we're going to go through each of our centroids. So, for each point in our dataset, we're going to compare it with our k (three, here) centroids. We're then going to assign the distance between our point and each centroid to our distance array.
We use enumerate to allow us to create two variables - row and data_point - as our loop counters. This is just to help us understand what we're doing.
for row, data_point in enumerate(dataset): distance_array = np.zeros((k,1)) for centroid_index, centroid in enumerate(current_centroids): distance_array[centroid_index] = euclidean(centroid, data_point)
np.argmin will take the smallest distance between our point and our centroids, and assign our point to that cluster (by storing it in cluster_list). In other words, we're finding the centroid that our point is closest to.
for row, data_point in enumerate(dataset): distance_array = np.zeros((k,1)) for centroid_index, centroid in enumerate(current_centroids): distance_array[centroid_index] = euclidean(centroid, data_point) cluster_list[row, 0] = np.argmin(distance_array)
If you haven't already, now's a good time to stop and ponder. Why are we finding the centroid our point is closest to?
That completes our first loop. Let's take a moment and review where we stand. So far, we've gone through each point in our dataset, and computed the distance between our point and our randomly chosen centroids. If you're feeling a bit lost, take a break and try to read back over the previous steps after you've cleared your mind a bit.
Now, we begin another loop. This time we're going to go through all of our centroids. We'll create a temporary variable to hold our new centroid locations so that no values are overwritten as we iterate through.
temp_centroid = np.zeros((k, num_columns)) for index in range(len(current_centroids)):
This step uses a neat trick called list comprehension. It's simply a shortcut to do what you could also do with a normally written for loop. Here, we'd normally create an empty list, then loop through and append each point. This is essentially doing the same thing.
The goal of this step is to re-evaluate our centroids. The way we'll do that is by taking every point belonging to each cluster, and finding the average. So, we'll start off by using list comprehension to create a list of points in the current cluster. We do this by comparing our cluster_list index to the current centroid we're looking at.
temp_centroid = np.zeros((k, num_columns)) for index in range(len(current_centroids)): points_in_cluster = [i for i in range(len(cluster_list)) if cluster_list[i] == index]
Almost done! We'll complete our loop by using numpy's mean function to create our newest centroid (which is no longer random, now the average of all the points in the cluster), and assigning it to our temp_centroid list that holds our new centroid locations.
temp_centroid = np.zeros((k, num_columns)) for index in range(len(current_centroids)): points_in_cluster = [i for i in range(len(cluster_list)) if cluster_list[i] == index] centroid = np.mean(dataset[points_in_cluster], axis=0) temp_centroid[index, :] = centroid
We then set our most recent/current centroids to what we've just calculated, and add it to our history list.
current_centroids = temp_centroid old_centroids.append(temp_centroid)
Finally, we return our calculations! We want our function to provide the most up-to-date centroids, our history of the locations of all of our centroids previously, and the list of clusters that each point has been assigned to.
return current_centroids, old_centroids, cluster_list
We did it! That's our implementation of K-Means clustering. Now, let's put it all together:
def kmeans(dataset, k, max_iterations): old_centroids =  #list to store past centroids num_rows, num_columns = dataset.shape #number of rows and columns from the dataset current_centroids = choose_centroids(dataset, 3) #randomly choose centroids old_centroids.append(current_centroids) #set these to our list of past centroid (to show progress over time) previous_centroids = np.zeros(current_centroids.shape) #to keep track of centroid at every iteration cluster_list = np.zeros((num_rows, 1)) #a null array to hold the assigned cluster of every point in the dataset for i in range(max_iterations): for row, data_point in enumerate(dataset): #for each row/data point in the dataset distance_array = np.zeros((k,1)) #a null array to hold the distances between the point and our three centroids for centroid_index, centroid in enumerate(current_centroids): #for each centroid distance_array[centroid_index] = euclidean(centroid, data_point) #compute the distance between the point and centroid cluster_list[row, 0] = np.argmin(distance_array) #find the smallest distance, assign that distance to a cluster temp_centroid = np.zeros((k, num_columns)) #acts as a buffer while we compute new centroids for index in range(len(current_centroids)): #for each cluster (k of them) points_in_cluster = [i for i in range(len(cluster_list)) if cluster_list[i] == index] #get all the points assigned to a cluster centroid = np.mean(dataset[points_in_cluster], axis=0) #find the mean of those points, this is our new centroid temp_centroid[index, :] = centroid #add our new centroid to our new temporary list current_centroids = temp_centroid #set the new list to the current list old_centroids.append(temp_centroid) #add our calculated centroids to our history for plotting return current_centroids, old_centroids, cluster_list #Our function computes and returns our centroids with their histories, as well as data point's cluster
If you want to stop there, that's completely fine; that's K-Means! However, it's good practice to visualize what you're doing - both to catch errors, and to understand what you're really doing.
Let's define a function to plot our data. I won't go through this quite as in depth. We'll plot all of our clusters in different colors. Let's take a list of colors, and narrow it down to how many we'll need (by how many clusters we have).
We create our pyplot figure and axis. Now we have to plot our points, and then our centroids.
Let's go through every point in the data set. We use list comprehension to create a list of all points in each cluster, and then take each point and plot them with the color for that cluster. Then we loop through our centroids, and plot them as well.
def plot(dataset, old_centroids, cluster_list, k): list_of_colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k', 'w'] #list of colors for each cluster #it is possible but extremely unlikely that you will have more clusters, and need more colors, than this colors = list_of_colors[:k] #take colors for as many clusters as we're using figure, axis = plt.subplots() for index in range(dataset.shape): #for every point in the data set points_in_cluster = [i for i in range(len(cluster_list)) if cluster_list[i] == index] #Let's grab every point in the cluster for point_index in points_in_cluster: #for each point in cluster axis.plot(dataset[point_index,0], dataset[point_index,1], (colors[index] + 'o')) #plot the point with the cluster's color centroid_history =  #null list to hold all of our old centroids for graphing for index, centroids in enumerate(old_centroids): #go through our history of centroids from each iteration of the algorithm for inner, item in enumerate(centroids): #go through each of our k-number of centroids at each iteration if index == 0: centroid_history.append(axis.plot(item, item, 'ko', ms=10)) else: centroid_history[inner].set_data(item, item) plt.pause(0.8) plt.show()
Finally, let's tie together everything we've done. We'll make a run() method that first executes our K-Means algorithm. We then plot each stage of our algorithm (by using our old_centroids), and this is what we see:
def run(data, k, max_iterations): current_centroids, old_centroids, cluster_list = kmeans(data, k, max_iterations) for item in old_centroids: #this includes the last one plot(data, [item], cluster_list, k) run(pairs, 3, 5)
Watch how the centroids adjust. Depending on how lucky you get with your initialization, you may not even need as many as 5 iterations. Or, you may need more!
Congratulations; you've completed your first clustering algorithm! Remember, K-Means is used with unlabeled, continuous data, with several (or many) features.
There are lots of cool ways K-Means can be implemented with real world data - namely, for things like identifying disease, cybersecurity, search engines, consumer analysis, and many more. A super interesting application can be found here.
How can you apply K-Means to your data?
Notice that in our algorithm, we specified how many times we wanted our algorithm to run. It can be cumbersome to try running it 10 times, 20 times, 30, 31, 32...
Can you think of a way to automatically tell our algorithm how many times it should run? Hint: Another way to think about it is, "When should the algorithm stop running?"
What you've learned in this workshop is how K-Means works from the ground up. However, after you gain a full understanding of the algorithm and how it works, there's no need to continually re-implement it over and over. The SciKit-Learn package has a K-Means function built-in, should you prefer to use a pre-built function.