Statistical Modelling
Big Data Analytics™ is in vogue at the moment, and there’s nothing quite so fashionable as the neural network. Capable of capturing complex non-linear relationships and scaling to high-dimensional datasets, neural networks are here to stay.
For your garden-variety neural network, you need two things: a set of features, X, and a label, Y. But what do you do if labelling is prohibitively expensive or your expert labeller goes on holiday for 2 months and all you have in the meantime is a set of features? Happily, we can still learn something about the labels, even if we might not know what they are!
In this blog post, we’re going to take a quick look at the Linear Discriminant Analysis (LDA) model before looking at its unsupervised extension, the Gaussian Mixture Model. We won’t use any more algebra than is necessary to give some intuition as to what’s going on, so fair warning to any mathematician who might find my vigorous hand waving personally offensive!
Linear Discriminant Analysis
Let’s start by defining our dataset. For each observation we have a set of continuous features, X, and an associated categorical label, Y. Our task is to predict the correct value Y* for a new example X*. For the LDA model, we know exactly what the true label is for each observation in our training set and we make the following assumption:
- For a given observation, the features X were generated by a ‘ground truth’ multivariate normal distribution. Each label is associated with a different normal distribution, so if two observations have the same label, they were generated by the same distribution, but if they have a different label they were generated by different distributions.
Given this assumption we have a recipe for generating a new data point (X*, Y*). First we randomly pick what the label, Y*, of the data point is. Then we sample from the normal distribution associated with Y* to get X*.
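To make that recipe concrete, here’s a minimal sketch in Python. The label probabilities and means are made up purely for illustration, and the covariance is the shared, known matrix we’re assuming throughout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 'ground truth' parameters (made up for illustration):
# three labels, their probabilities, and the mean of each label's Gaussian
label_probs = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
cov = np.eye(2)  # a shared, known covariance matrix, as assumed in the post

def generate_point():
    y = rng.choice(len(label_probs), p=label_probs)  # 1. randomly pick the label Y*
    x = rng.multivariate_normal(means[y], cov)       # 2. sample X* from Y*'s Gaussian
    return x, y

x_star, y_star = generate_point()
```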
Even assuming this is the way the data were actually generated, we don’t know for certain what the true distribution parameters are (although for simplicity we’re assuming we do know the covariance matrix of each Normal distribution). We need to estimate the underlying label probabilities, and then for each label we need to estimate the mean of the associated Normal distribution. Once we have estimates for all of these parameters, we can predict the label of a new data point X* using the following procedure (sketched in code after the list):
- Calculate the conditional probability P(Y*=y|X*) for each possible label y
- Whichever label has the highest probability is the one we assign X* to.
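A sketch of that prediction rule, reusing the (hypothetical) `label_probs`, `means` and `cov` from the sampling sketch above as our parameter estimates. By Bayes’ rule, P(Y*=y|X*) is proportional to P(Y*=y) times the Normal density of X* under label y’s distribution.

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_label(x_star, label_probs, means, cov):
    # Unnormalised P(Y*=y | X*): prior probability of label y times the Normal
    # density of X* under label y's distribution (Bayes' rule)
    scores = np.array([
        p * multivariate_normal.pdf(x_star, mean=m, cov=cov)
        for p, m in zip(label_probs, means)
    ])
    probs = scores / scores.sum()  # normalise so the probabilities sum to 1
    return probs.argmax(), probs   # assign X* to the most probable label

# e.g. using x_star and the hypothetical parameters from the sketch above
predicted_label, label_distribution = predict_label(x_star, label_probs, means, cov)
```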
Parameter estimation
For each distribution, the parameters are estimated using Maximum Likelihood Estimation. This involves some dense algebra if you want to prove it for yourself, but happily the end results are exactly what you might expect intuitively. As mentioned above, for the sake of simplicity, we’re going to make the additional assumption that all of the normal distributions share the same known covariance matrix – whether or not that’s a reasonable assumption depends on the specific use case (but the method works in much the same way).
Estimation of label probabilities: To estimate the probability of a point having label y (unconditional on the value of X), simply calculate the proportion of observations in the training set which have the label y
Estimation of Normal distribution parameters: To estimate the theoretical mean associated with a label, take all of the data points with that label and calculate the average of their feature vectors
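In code, both estimates are essentially one-liners. A sketch, assuming `X` is an (n, d) array of features and `Y` an array of n labels:

```python
import numpy as np

def estimate_lda_parameters(X, Y):
    labels = np.unique(Y)
    # Label probabilities: the proportion of training points carrying each label
    label_probs = np.array([np.mean(Y == y) for y in labels])
    # Means: the average feature vector among the points carrying each label
    means = np.array([X[Y == y].mean(axis=0) for y in labels])
    return labels, label_probs, means
```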
Model fitting and evaluation
By fitting an LDA model to the data above, we can predict the label of each point in the training set (see results below). As the three clusters are separated quite clearly, we’re able to assign the correct label to the points in the dataset with high accuracy.
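In practice you might reach for scikit-learn’s implementation rather than hand-rolling the estimates above. A sketch on a synthetic three-cluster dataset standing in for the data described here:

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# A synthetic stand-in: three well-separated Gaussian clusters with labels
X, Y = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X, Y)
print("Training accuracy:", lda.score(X, Y))  # close to 1.0 when clusters are well separated
```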
Modelling unlabelled data
To estimate the parameters in the way described above, we needed to know the label of each point – without the labels, the algorithm doesn’t work.
A Gaussian mixture model makes much the same assumptions as the Linear Discriminant Analysis model – each data point is associated with a label, and each label is associated with a normal distribution which generates the data point. The difference is that for the Gaussian mixture model the label is a latent variable (often denoted Z): we never actually know for sure what the label is, so it’s not possible to compute the maximum likelihood estimates in the way we did for LDA.
The key difference when fitting the model is that the true label (which identifies the correct group for an observation with 100% certainty) is replaced by a probability distribution over what the label is, conditional on the associated value of X. This means that every data point contributes to the estimate of a given cluster’s mean, but if we believe there is only a small chance that a particular observation truly has that label given its value of X, then we weight that point so that it makes only a very small contribution to the estimate.
The tricky part is working out the probability that a point belongs to a given cluster – if we knew that, we’d have already accomplished our goal! Instead we’re forced to adopt a two-stage iterative approximation scheme called the Expectation-Maximisation (EM) algorithm. The EM algorithm works as follows (there’s a sketch in code after the steps):
1. Start by randomly initialising values for the unconditional label probabilities and normal distribution parameters
2. Given the current values for the unconditional label probabilities and normal distribution parameters, calculate the probability that each point belongs to each cluster, conditional on its feature vector
3. Use the calculated probabilities to weight the contribution of each point to each cluster and update the current values for the unconditional label probabilities and normal distribution parameters
4. Repeat steps (2)-(3) until convergence
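Under the same simplifying assumption as before (a shared, known covariance matrix, here just the identity), a bare-bones sketch of those steps might look like this; a real implementation would also check for convergence rather than running a fixed number of iterations.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, n_clusters, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    cov = np.eye(d)  # shared, known covariance (a simplifying assumption)

    # Step 1: random initialisation of the label probabilities and cluster means
    label_probs = np.full(n_clusters, 1.0 / n_clusters)
    means = X[rng.choice(n, size=n_clusters, replace=False)]

    for _ in range(n_iter):
        # Step 2 (E-step): the probability each point belongs to each cluster,
        # conditional on its feature vector
        resp = np.column_stack([
            label_probs[k] * multivariate_normal.pdf(X, mean=means[k], cov=cov)
            for k in range(n_clusters)
        ])
        resp /= resp.sum(axis=1, keepdims=True)

        # Step 3 (M-step): use those probabilities as weights to update the
        # label probabilities and cluster means
        weights = resp.sum(axis=0)  # 'effective' number of points in each cluster
        label_probs = weights / n
        means = (resp.T @ X) / weights[:, None]

    return label_probs, means, resp
```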
It can be proved that repeating steps (2)-(3) never makes the model’s fit to the data any worse (although the improvements will plateau eventually), so if you run the algorithm for long enough you’ll eventually arrive at a set of parameters and probabilities that the algorithm can’t improve on any further.
If you’re familiar with the K-means algorithm, we can intuitively think of the EM algorithm in this setting as being quite similar to K-means – assigning points to a cluster is analogous to calculating the conditional cluster probabilities (step 2 above), and updating the cluster centroids can be compared to recalibrating the parameter estimates (step 3 above). **That statement is devoid of any rigour whatsoever, please don’t hate me, mathematicians**
Fitting the mixture model
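In practice you’d most likely use scikit-learn’s GaussianMixture rather than the hand-rolled sketch above. Fitting it to the same kind of synthetic data as before (with the labels thrown away) might look something like this:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# The same kind of synthetic data as before, but this time we discard the labels
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# 'spherical' keeps the covariances simple, loosely mirroring our assumption above
gmm = GaussianMixture(n_components=3, covariance_type="spherical", random_state=0)
cluster_assignments = gmm.fit_predict(X)      # most probable cluster for each point
cluster_probabilities = gmm.predict_proba(X)  # full distribution over clusters for each point
```

The `predict_proba` output is exactly the per-point probability distribution that we lean on in the comparison with K-means below.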
Comparison to K-means algorithm
You’d be forgiven for wondering why, rather than putting ourselves through the pain of this approximation scheme, we don’t just use K-means, one of the simplest ML methods around. For the dataset we’ve presented, it is indeed hard to see what GMMs offer that K-means doesn’t.
However, Gaussian Mixture Models do offer some advantages over K-means:
- For our implementation of the GMM, we’ve assumed a shared, known, diagonal covariance matrix. This need not be the case, and relaxing that assumption allows the GMM to make non-linear partitions of the feature space, which would be necessary for more complex data than we’ve used here.
- When predicting the label of a new example, both GMMs and K-means give us a predicted cluster, but GMMs also give us an associated probability distribution. That means we get an estimate of how confident the model is that a point belongs to the cluster we’ve assigned it to.
- Because all data points contribute to the estimation of every parameter, a GMM can cope with overlap between two clusters, whereas K-means’ hard assignments cannot account for that.
Conclusion
That’s it for this post! We’ve seen how the Gaussian Mixture Model provides a framework for probabilistically identifying clusters in unlabelled datasets. Hopefully next time you’re faced with some unlabelled data, you’ll think about applying a probabilistic approach (even if for no other reason than it sounds fancier than using K-means!)