Welcome to ML Made Simple #3! This time we will talk about a little algorithm called k-Means Clustering: how it works, and when you should use it.

k-Means Clustering is an unsupervised learning algorithm, meaning that you use it when your data is unlabeled (when you have a bunch of points, but don't know what they refer to).

Unsupervised algorithms are commonly used to find the "structure" of a dataset: to discover where the different groups are and separate them. They tell you which data points behave similarly.

Pros

  • Efficient
  • Simple

Cons

  • Sensitive to outliers
  • Output depends on the initial state, so it may not find the same results every time

Let's see how it finds groups in data. Our goal is to apply k-means to this randomly generated dataset:
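If you want to follow along in code, here is one way to generate a similar toy dataset with scikit-learn's `make_blobs` (the parameters below are illustrative assumptions, not the exact ones behind this example):

```python
import numpy as np
from sklearn.datasets import make_blobs

# A toy 2-D dataset with 3 loose groups of points.
# Parameters are illustrative, not the article's exact setup.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.2, random_state=42)
```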

We will proceed step by step, and follow the algorithm.

Centroids

As its name implies, k-Means Clustering works by finding where data clusters. Centroids are where the algorithm thinks the center of each cluster is.

When it begins, every centroid is initialized to a random position, which gives us something like this:
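A minimal sketch of this initialization step, assuming NumPy and one common convention: picking k random data points as the starting centroids.

```python
import numpy as np

def init_centroids(X, k, seed=None):
    """Pick k distinct random data points as the initial centroids."""
    rng = np.random.default_rng(seed)
    indices = rng.choice(len(X), size=k, replace=False)
    return X[indices].astype(float)
```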

Wait a second. How does it know how many clusters to look for?

Well, the k in the name doesn't have the same meaning as in k-Nearest Neighbors. It's still a parameter, but here it refers to the number of clusters. To find its optimal value, a good option is to look at the mean distance between data points and their centroid, across different values of k. You need to find the "elbow point": the value of k where the decreasing curve becomes almost flat. For our example, this is what we get:

3 is probably the best value for k in this example.
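If you want to reproduce a curve like this, here is a rough sketch using scikit-learn's KMeans (the range of k values and the dataset parameters are my own assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.2, random_state=42)

# For each candidate k, fit k-means and record the mean distance
# between each point and its closest centroid.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    print(k, dists.mean())
```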

Separating data points between centroids

Now that our centroids are positioned (randomly), we need to separate all the data between them. That's a piece of cake: each data point groups with the closest centroid.

We've already seen it in the last ML Made Simple, but to evaluate a distance, an algorithm needs a distance function, like the Pythagorean theorem (Euclidean distance):

$$d(A,B) = \sqrt{\sum_{i=1}^{n} (A_i-B_i)^2}$$

k-Means will compute the 3 distances (because k = 3) between a data point and the centroids, and choose the closest one. Repeat that for every point and you get this:
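In code, this assignment step is just a distance computation plus an argmin. A minimal NumPy sketch (the function name is mine, not a standard API):

```python
import numpy as np

def assign_to_centroids(X, centroids):
    """Assign each data point to the index of its closest centroid."""
    # Euclidean distances from every point to every centroid,
    # giving an array of shape (n_points, k).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)
```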

Moving the centroids

Here comes the real "learning" part: it's time to find those clusters! Actually, it's as easy as the previous steps. The algorithm computes the average position of the data points in each group, and moves that group's centroid there.
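Sketched the same way (again, the name is mine, and keeping an empty group's centroid where it is happens to be one common convention):

```python
import numpy as np

def move_centroids(X, labels, centroids):
    """Move each centroid to the mean position of its group of points."""
    new_centroids = centroids.copy()
    for i in range(len(centroids)):
        members = X[labels == i]
        if len(members) > 0:  # leave a centroid alone if its group is empty
            new_centroids[i] = members.mean(axis=0)
    return new_centroids
```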

That's way closer!

Repeat

We've got everything we need. The algorithm just needs to repeat the process, as sketched in full below:

  • Separate data points
  • Move the centroids
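Putting the two steps together, a complete (if bare-bones) sketch might look like this. Running for a fixed number of iterations is my simplification; real implementations usually stop once the centroids no longer move.

```python
import numpy as np

def kmeans(X, k, n_iters=10, seed=None):
    """Plain k-means: assign points, move centroids, repeat."""
    rng = np.random.default_rng(seed)
    # Start from k random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Separate data points between centroids.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each centroid to the mean of its group.
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
    return centroids, labels

# Example usage on a dataset X like the one generated earlier:
# centroids, labels = kmeans(X, k=3, n_iters=5, seed=0)
```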

And magic happens:

After only 5 iterations, k-means found the clusters!