Welcome to ML Made Simple #3! In this installment, we will talk about an algorithm called *k*-Means Clustering: how it works, and when you should use it.

*k*-Means Clustering is an unsupervised learning algorithm, meaning that you use it when your data is *unlabeled* (when you have a bunch of points, but don't know what they refer to).

Unsupervised algorithms are commonly used to find the "structure" of a dataset: to discover where the different groups are and separate them. They tell you which data points share the same behavior.

### Pros

- Efficient
- Simple

### Cons

- Sensitive to outliers
- Output depends on the initial state, so it may not find the same results every time

Let's see how it finds groups in data. Our goal is to apply *k*-means to this randomly generated dataset:

We will proceed step by step, and follow the algorithm.

## Centroids

As its name implies, *k*-Means Clustering works by finding where data clusters. Centroids are where the algorithm thinks the center of each cluster is.

When it begins, every centroid is initialized to a random position, which gives us something like this:
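A common way to do this random initialization is to pick *k* data points at random as starting centroids. Here is a minimal sketch in NumPy (the dataset `X` and the function name are illustrative, not the article's actual code):

```python
import numpy as np

def init_centroids(X, k, rng):
    # Pick k distinct data points at random as the starting centroids.
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx].copy()

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))       # stand-in for our random 2D dataset
centroids = init_centroids(X, 3, rng)
```

Starting from actual data points (rather than arbitrary coordinates) guarantees each centroid begins somewhere near the data.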

Wait a second. How does it know how many clusters to look for?

Well, the *k* in the name doesn't have the same meaning as in *k*-Nearest Neighbors. It's still a *parameter*, but here it refers to the number of clusters. To find its optimal value, a good option is to look at the mean distance between data points and their centroid across different values of *k*. You need to find the "*elbow point*": the value of *k* where the decreasing curve becomes almost flat. For our example, this is what we get:

**3** is *probably* the best value for *k* in this example.
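To make the elbow method concrete, here is a sketch that computes the mean point-to-centroid distance for several values of *k* (it runs a basic *k*-means internally; the three-blob dataset and all names are illustrative):

```python
import numpy as np

def mean_distance(X, k, rng, iters=20):
    # Run a basic k-means, then return the mean distance from each
    # point to its nearest centroid (lower means tighter clusters).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    return d.min(axis=1).mean()

rng = np.random.default_rng(0)
# Three well-separated blobs, so the true k is 3.
X = np.concatenate([rng.normal(c, 0.5, size=(100, 2))
                    for c in ((0, 0), (5, 5), (0, 5))])
curve = [mean_distance(X, k, rng) for k in range(1, 7)]
# The curve drops sharply up to k = 3, then flattens: that bend is the elbow.
```

Plotting `curve` against `k` would reproduce the elbow chart described above.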

## Separating data points between centroids

Now that our centroids are positioned (randomly), we need to split all the data between them. That's a piece of cake: each data point groups with the closest centroid.

We've already seen it in the last *ML Made Simple*: to evaluate a distance, an algorithm needs a distance function, like the Euclidean distance (a generalization of the Pythagorean theorem):

$$d(A,B) = \sqrt{\sum_{i=1}^{n} (A_i-B_i)^2}$$

*K*-Means will compute the 3 distances (because *k* = 3) between a data point and the centroids, and choose the closest one. Repeat that for every point and you get this:
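This assignment step can be sketched in a few lines of NumPy (the points and centroids below are toy values, not the article's dataset):

```python
import numpy as np

def assign(X, centroids):
    # Euclidean distance from every point to every centroid,
    # then the index of the closest centroid for each point.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
labels = assign(X, centroids)   # → array([0, 0, 1])
```

Broadcasting computes all point-to-centroid distances at once, so no explicit loop over points is needed.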

## Moving the centroids

Here comes the real "learning" part: it's time to find those clusters! Actually, it's as easy as the previous steps. The algorithm computes the average position of the data points in each group, and moves that group's centroid there.
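The update step is just a per-group mean. A minimal sketch (toy values again; the empty-cluster guard is a common convention, not something the article specifies):

```python
import numpy as np

def move_centroids(X, labels, centroids):
    # Replace each centroid with the mean of its assigned points;
    # a centroid with no points keeps its old position.
    return np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])

X = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[1.0, 1.0], [4.0, 4.0], [9.0, 9.0]])
new = move_centroids(X, labels, centroids)
# cluster 0 moves to [1, 0], cluster 1 to [5, 5], cluster 2 stays at [9, 9]
```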

## Repeat

We have everything we need; the algorithm just needs to repeat the process:

- Separate data points
- Move the centroids

And magic happens:

After only **5 iterations**, *k*-means found the clusters!
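Putting the steps together, the whole algorithm fits in one short function. This is a sketch under the same assumptions as before (random-point initialization, a fixed iteration count, a three-blob stand-in dataset):

```python
import numpy as np

def kmeans(X, k, iters=5, seed=0):
    # Minimal k-means: random init, then alternate the assignment
    # and centroid-update steps for a fixed number of iterations.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(c, 0.4, size=(100, 2))
                    for c in ((0, 0), (5, 5), (0, 5))])
labels, centroids = kmeans(X, 3)
```

A production version would stop when the centroids stop moving instead of using a fixed iteration count, and would restart from several random initializations to reduce the sensitivity to the initial state noted in the cons above.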