I have recently stumbled upon this paper which, quite unexpectedly, sent me down a rabbit hole reading about compression, generalisation, algorithmic information theory and looking at gifs of milk mixing with coffee on the internet. Here are some half-processed takeaways from this weird journey.
Complexity of a Cup of Coffee
First, check out this cool video.
What you can hopefully see, if content embedding still works, is a simple demonstration of how complexity evolves in physical systems. Start with a cup of black coffee – a simple, uniform state. Add milk – and suddenly you’ve got these complicated swirls forming through the liquid, creating elaborate patterns. The complexity graph shoots up here. But then, as the milk gradually diffuses, those patterns fade away. The system maxes out its entropy – totally mixed – but circles back to simplicity: one uniform (now brown) liquid. High entropy, sure, but low complexity, just like at the start.
Hopefully this gives you an intuitive sense that complexity is a measure of the information required to describe the system. It is important, though, to make the distinction that we are not describing the whole system (i.e. every particle within the cup) but just the “interesting” parts for us – the visual patterns in the liquid. This is where modern image compression algorithms come in handy – they’re already optimised for human visual perception, designed to preserve the patterns we find meaningful while discarding imperceptible details. This is in fact what the video above uses to quantify complexity. When the patterns were intricate, the frames compressed less efficiently (higher complexity), but when the liquid was uniform, either before mixing or after full diffusion, the frames compressed very well (lower complexity).
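To make that concrete, here is a minimal sketch of a compression-based complexity meter – my own toy version, not the video’s actual pipeline. The idea: coarse-grain a greyscale frame so pixel-level noise averages away, then take the compressed size as the score. I’m using `zlib` purely as a stand-in for a proper perceptual image codec like the ones the video presumably relies on.

```python
import zlib

import numpy as np

def apparent_complexity(frame: np.ndarray, block: int = 4) -> int:
    """Compressed size (in bytes) of a coarse-grained greyscale frame.

    Coarse-graining = averaging over block x block patches - the 'squint'
    that keeps the visible swirls but discards particle-level noise.
    """
    n, m = frame.shape[0] // block, frame.shape[1] // block
    coarse = (
        frame[: n * block, : m * block]
        .reshape(n, block, m, block)
        .mean(axis=(1, 3))
    )
    data = np.round(coarse).astype(np.uint8).tobytes()
    return len(zlib.compress(data, level=9))
```

Swapping in a real image codec (PNG or JPEG file size, say) would get closer to “information the eye cares about”; the point is just that compressed size is the measurable proxy.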
The key insight from Aaronson et al. is that this rise and fall of complexity might be fundamental to how closed systems evolve. They show mathematically that as entropy increases (which it must in closed systems), complexity doesn’t simply track it – it rises to a characteristic peak and then falls away. The really interesting bit is how they formalise “apparent complexity” using compression – essentially saying that if something can be described simply (compresses well), it’s not complex. And crucially, they point out that what we consider “interesting” or “complex” often depends on our choice of how we look at the system – like focusing on visual patterns rather than individual particles. This seemingly simple observation turns out to have deep implications for everything from neural networks to the evolution of the universe itself. Not bad for a paper about coffee mixing.
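Out of curiosity, here’s a toy in the spirit of the paper’s “coffee automaton” – cream particles doing independent random walks on a grid, with the `apparent_complexity` helper from above scoring periodic snapshots. All the parameters are arbitrary choices of mine, and the paper is careful to show that modelling details (e.g. whether particles interact) change the picture, so treat this as a sketch, not a reproduction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Cream starts as the top half of the 'cup'; the bottom half is coffee.
N, steps = 128, 5000
xs, ys = np.meshgrid(np.arange(N), np.arange(N // 2))
pos = np.stack([xs.ravel(), ys.ravel()], axis=-1)

complexities = []
for t in range(steps):
    # every particle takes one lazy random step, clipped at the cup walls
    pos = np.clip(pos + rng.integers(-1, 2, size=pos.shape), 0, N - 1)
    if t % 50 == 0:
        frame = np.zeros((N, N), dtype=np.uint8)
        frame[pos[:, 1], pos[:, 0]] = 255
        complexities.append(apparent_complexity(frame))

# Plotting `complexities` should, if the toy behaves, trace the curve from
# the video: low for the clean initial split, higher while the boundary is
# ragged and structured, low again once everything is uniformly mixed.
```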
Swirling in Deep Models
If you have done any machine learning, you have probably encountered the Occam’s razor principle: the idea that simpler hypotheses which fit the data should be preferred over complex ones. Modern neural networks are massively overparameterised – they could theoretically learn incredibly complex solutions. Yet, much like how our coffee-milk system naturally evolves toward simpler states, these networks tend to find surprisingly simple solutions. Recent work on “grokking” shows that neural networks often follow a similar complexity trajectory: they start simple, become complex while memorising data, then suddenly simplify again as they discover generalisable patterns. This suggests that the bias toward simplicity isn’t just a helpful design principle – it might be baked into the fundamental mathematics of how systems learn. Perhaps this offers a way to robustly quantify when our models have actually learned something (signalled by sufficient compression of data plus algorithm), especially in the low-data regimes common in biology – or even to judge whether generalisation is possible at all with the data we have for these tasks.
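If you buy the analogy, the natural experiment is to point the same compression trick at the weights during training. The sketch below is one crude way to do it – quantise the parameters and record their compressed size once per epoch. It is emphatically not the estimator used in the grokking literature, just the cheapest proxy I can think of.

```python
import zlib

import numpy as np

def parameter_complexity(params: list[np.ndarray], bins: int = 256) -> int:
    """Crude proxy for model complexity: quantise all parameters into
    `bins` levels and report their zlib-compressed size in bytes."""
    flat = np.concatenate([p.ravel() for p in params])
    lo, hi = flat.min(), flat.max()
    q = np.round((flat - lo) / (hi - lo + 1e-12) * (bins - 1)).astype(np.uint8)
    return len(zlib.compress(q.tobytes(), level=9))

# Logged every epoch on a grokking task, the hope would be to see the
# coffee curve again: simple initialisation, a complexity bump while
# memorising, then a drop as the network settles on the general rule.
```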