Today’s Fact: Linear Discriminant Analysis

Linear Discriminant Analysis, or LDA for short, is an ML model used for classification. Specifically, for a model with d features and n training points, it assumes that each class's features follow a multivariate Normal (Gaussian) distribution, with all classes sharing the same covariance. More formally, taken from my professor's CS 189 notes, we get:

Lecture Note 1

This doesn’t help much for those who are unfamiliar with Machine Learning, so he also included helpful diagrams. Below is a two-class classification problem (imagine each input belongs to either class A or class B, and you’re deciding which class it should be assigned to). The decision boundary (the breaking point between choosing A versus B) is the black line; a posterior below 0.5 indicates a stronger guess toward the left class, and above 0.5 a stronger guess toward the right class. Below that, we have a multi-class problem, where each class’s region is separated from its neighbors by black lines:

Lecture Note 2
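To make this concrete, here is a minimal sketch (my own synthetic data, not from the post's notebook) of fitting LDA with scikit-learn on two Gaussian classes that share a covariance, where the predicted posterior crossing 0.5 plays the role of the black decision line:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
cov = [[1.0, 0.3], [0.3, 1.0]]  # shared covariance, as LDA assumes

# Two Gaussian classes with different means but the same covariance.
X_a = rng.multivariate_normal([0, 0], cov, size=100)  # class A
X_b = rng.multivariate_normal([3, 3], cov, size=100)  # class B
X = np.vstack([X_a, X_b])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)

# Posterior probability of class B; predictions flip where it crosses 0.5.
p_b = lda.predict_proba(X)[:, 1]
preds = (p_b > 0.5).astype(int)
print("training accuracy:", (preds == y).mean())
```

With classes this well separated, the learned linear boundary classifies nearly every training point correctly.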

When Andi saw this, he immediately thought of k-means clustering, a type of unsupervised learning meant to cluster similar data without knowing their actual labels. K-means output is often drawn as a similar Voronoi diagram, so he raised a question: how does K-means compare to LDA, even though the two are fundamentally quite different (supervised vs. unsupervised, classification vs. grouping)? And so we simulated it.
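One way to run that comparison (a sketch with made-up blobs, not Andi's exact setup): fit LDA on the labeled points, fit K-means with k equal to the number of classes on the same points without labels, and check how often each agrees with the true labels. Since K-means cluster indices are arbitrary, we score it against the best label permutation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X_a = rng.normal([0, 0], 1.0, size=(150, 2))  # class 0, well separated
X_b = rng.normal([5, 5], 1.0, size=(150, 2))  # class 1
X = np.vstack([X_a, X_b])
y = np.array([0] * 150 + [1] * 150)

# Supervised: LDA sees the labels.
lda_acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)

# Unsupervised: K-means only sees the points; its 0/1 cluster ids are
# arbitrary, so take the better of the two label assignments.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km_acc = max((km.labels_ == y).mean(), (km.labels_ != y).mean())

print(f"LDA accuracy: {lda_acc:.2f}, k-means label agreement: {km_acc:.2f}")
```

On blobs this far apart, both numbers come out near 1.0, matching what we saw in the first simulation.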


First, we examined the case of Normally distributed data whose classes are well separated from each other, like so:

Simulation 1, Data

Here, we see that LDA and K-means (somewhat surprisingly!) performed about the same:

Simulation 1, KMeans Simulation 1, LDA

Next, we looked at the case where the data overlap, like so:

Simulation 2, Data

We see that the results are different this time (!):

Simulation 2, KMeans Simulation 2, LDA

Notice that in the second case, where the data overlap, LDA still learns a distinction between class Yellow and class Purple. K-means completely loses it there, since it has no idea what the labels are supposed to be. This is expected: K-means is meant to group similar data, not to recover label boundaries it never sees. We thought it was quite interesting that K-means performed as well as it did in part 1.
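That failure mode is easy to reproduce. In this sketch (again made-up data, not the post's), the classes are separated along one axis while the overall point cloud is stretched along another, so K-means' variance-minimizing split cuts the long axis and ignores the labels entirely, while LDA still recovers them:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
cov = [[9.0, 0.0], [0.0, 0.5]]  # wide in x, narrow in y

# Classes differ only in their y means, buried in a wide x spread.
X_a = rng.multivariate_normal([0, -1], cov, size=150)  # class Purple
X_b = rng.multivariate_normal([0, 1], cov, size=150)   # class Yellow
X = np.vstack([X_a, X_b])
y = np.array([0] * 150 + [1] * 150)

# LDA uses the labels and separates the classes along y.
lda_acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)

# K-means prefers to split the long x axis; score the best permutation.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km_acc = max((km.labels_ == y).mean(), (km.labels_ != y).mean())

print(f"LDA accuracy: {lda_acc:.2f}, k-means label agreement: {km_acc:.2f}")
```

LDA lands well above chance here while K-means hovers near 50%, which is the same qualitative gap the second simulation showed.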


Jupyter Notebook