Last week, Google DeepMind announced AlphaGeometry, a novel deep learning system that can solve geometry problems of the kind posed at the International Mathematical Olympiad (IMO). The work is described in a recent Nature paper, and is accompanied by a GitHub repo including full code and weights.
This paper has caused quite a stir in some circles. Well, at least the kind of circles you tend to come into close contact with when you work at a Department of Statistics. Much like folks in structural biology wondered three years ago, those who earn a living by veering into the mathematical void and crafting proofs were wondering whether their jobs might also have a close-by expiration date. I found this quite interesting, so I decided to read the paper and try to understand it; to motivate myself, I set out to present it at an upcoming journal club and to write this blog post.
So, let’s ask: what has actually been achieved, and how powerful is this model?
What has been achieved
The image that has been making the rounds this time is the following benchmark:
These are the results of testing AlphaGeometry on a test set of 30 geometry problems drawn from IMO competitions from the year 2000 onwards. As one tends to say with DeepMind’s big papers, these results are really quite impressive, but there are also a few caveats that are important for putting the advances in context.
The first caveat is that, of course, this benchmark does not represent a realistic IMO competition. In a normal IMO there are six problems (divided into two sets of three, one per day), of which typically only two are geometry. While the authors state that their system would still achieve Bronze Medalist status in at least two editions of the IMO, this context has been missing from many of the relatively serious discussions of the paper that I have seen.
Another caveat is that the way the authors encode the geometric problems (they use a specialised geometric proof language that is apparently common in the literature) only covers ~75% of the geometry problems presented at the IMO. This limitation is quite stark once we notice that the excluded quarter includes important topics like combinatorial geometry and, notably, geometric inequalities. On this point, the authors argue that finding a complete solution to the geometry representation problem “is a separate and extremely challenging research topic that demands substantial investment from the mathematical formalization community”. Which, in my opinion, is fair enough.
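To make concrete what “encoding” a problem means here, below is a toy illustration in Python. This is emphatically not the proof language used by the authors (whose syntax I am not reproducing); it just shows the general shape of a premises-plus-goal encoding with a fixed predicate vocabulary, which also hints at why statements such as inequalities are hard to express.

```python
# Toy, hypothetical encoding of a geometry problem as premises plus a goal.
# This is NOT the specialised proof language from the paper; the predicate
# names are made up for illustration.

problem = {
    "premises": [
        ("triangle", "A", "B", "C"),
        ("midpoint", "M", "B", "C"),            # M is the midpoint of BC
        ("perpendicular", "A", "M", "B", "C"),  # line AM is perpendicular to BC
    ],
    # Goal: prove that segment AB is congruent to segment AC.
    "goal": ("cong", "A", "B", "A", "C"),
}
```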
These caveats aside, I would like to note that these are not easy problems, in any way, shape, or form. In fact, this is what they look like:
How it works
The central idea behind AlphaGeometry is to combine two hitherto separate paradigms: an automated theorem prover and a language model. In the official blog post, the authors present this in the light of “thinking fast and slow”. The theorem prover acts as the “thinking slow” logical part, using rigorous, provable logical constructs to derive complex statements from simple axioms; the brute-force part of the creative process, if you will. The language model, on the other hand, acts as the “thinking fast” intuition, proposing new statements that the theorem prover can then use to get to the end result.
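As a mental model, here is a minimal sketch of that loop in Python. It is purely illustrative: the `engine` and `language_model` objects and all of their methods are hypothetical stand-ins for the deductive prover and the language model, not anything from the released codebase.

```python
# Purely illustrative sketch of the "thinking fast and slow" loop; every
# object and method below is a hypothetical stand-in, not the released code.

def solve(problem, engine, language_model, max_rounds=10):
    """Alternate between symbolic deduction and LM-proposed constructions."""
    state = engine.initialise(problem.premises)

    for _ in range(max_rounds):
        # "Thinking slow": exhaustively derive everything the rules allow.
        state = engine.deduce_closure(state)
        if engine.proves(state, problem.goal):
            return engine.extract_proof(state, problem.goal)

        # "Thinking fast": ask the language model for an auxiliary
        # construction (a new point, line or circle) to unblock the prover.
        construction = language_model.propose(state, problem.goal)
        state = engine.add_construction(state, construction)

    return None  # no proof found within the search budget
```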
And… that’s it, really. There are lots of details about how the authors improve automated theorem-proving systems for this paper. I am not an expert on these systems, and in fact only have a light intuition about how they work, so I will skip them and refer the interested reader to the Methods section of the paper.
The secret sauce behind the paper appears to be the massive amount of fully synthetic proofs that the model was trained on. Aside from their enormous variation and diversity, these proofs are interesting because they do not overlap with the proofs of the IMO problems and, frankly, because they do not require any “human” training set. To generate these proofs, the authors developed a customised diagrammatic generation language and produced a very large number of diagrams like the ones below, using on the order of 800 CPU-years of compute!
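My simplified understanding of that pipeline, sketched below with entirely hypothetical helper objects: sample a random diagram, compute everything the symbolic engine can derive from it, and turn each derived fact, together with the minimal premises that produce it, into a synthetic theorem and proof.

```python
# Simplified, hypothetical sketch of the synthetic data pipeline; `sampler`
# and `engine` and their methods are stand-ins, not the authors' code.

def generate_synthetic_proofs(n_diagrams, sampler, engine):
    """Turn random diagrams into (premises, fact, proof) training triples."""
    examples = []
    for _ in range(n_diagrams):
        diagram = sampler.random_diagram()        # random points, lines, circles
        closure = engine.deduce_closure(diagram)  # every fact provable from it
        for fact in closure.derived_facts():
            # Trace back the minimal premises and steps that yield this fact.
            premises, proof = closure.trace_back(fact)
            examples.append((premises, fact, proof))
    return examples
```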
The diagrams were then used to build a large number of proofs, which in turn were used to train the language model. An interesting aspect of the synthetic data generation is that it is not given to the model as is: instead, the authors identify important “constructs” in each proof sequence (which I urge you to think of as an imaginary line, or angle, used as an intermediate step to assist the proof) and train the language model to suggest these.
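Again as a hedged sketch (the serialisation format and helper functions below are invented for illustration), a training example would pair the problem statement with its auxiliary constructions, i.e. the objects that appear in the proof but in neither the premises nor the conclusion:

```python
# Illustrative only: building a training example whose target is the set of
# auxiliary constructions. `objects_in` and `serialise` are invented helpers.

def make_training_example(premises, conclusion, proof):
    stated = objects_in(premises) | objects_in(conclusion)
    auxiliary = [step for step in proof.construction_steps()
                 if step.new_object() not in stated]

    prompt = serialise(premises) + " ? " + serialise(conclusion)
    target = " ; ".join(serialise(step) for step in auxiliary)
    return prompt, target  # the LM learns to map prompt -> auxiliary constructions
```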
Discussion
The most interesting part of the paper is perhaps how little deep learning is actually involved in its success. The language model used is very small by current standards: with 12 layers and roughly 150 million parameters, it is something that could probably be trained at an academic lab without major resources. For reference, CaLM, the cDNA language model we published last year, had ~80 million parameters and took only about a month to train on 4 not-too-beefy GPUs.
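As a back-of-the-envelope check of just how small that is, a vanilla 12-layer decoder lands in the right ballpark; the hidden size and MLP expansion below are my assumptions for illustration, not figures from the post.

```python
# Rough parameter count for a vanilla decoder-only transformer. Hidden size
# and MLP expansion are assumptions for illustration, not quoted values.
d_model, n_layers, mlp_ratio = 1024, 12, 4

attn_params = 4 * d_model**2                 # Q, K, V and output projections
mlp_params = 2 * mlp_ratio * d_model**2      # up- and down-projections
per_layer = attn_params + mlp_params         # ~12.6M, ignoring biases and norms

print(f"~{n_layers * per_layer / 1e6:.0f}M parameters (embeddings excluded)")  # ~151M
```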
One interesting observation is that, when the authors benchmark their model against (carefully prompted) GPT-4, it solves none of the problems. This result is not entirely surprising, given GPT-4’s tendency to struggle with mathematics. More puzzling is that, when GPT-4 is combined with the same automated theorem prover used in AlphaGeometry, it still manages to solve nearly 50% of the problems. That is certainly not bad performance, given that it has not been fine-tuned for the task; I would definitely have asked for a comparison against GPT-4, or at least an open-source LLM, fine-tuned for the task.
Another thought is that the central mechanism they propose, using “intuition” to “think fast”, is also a potential flaw. Given that the constructs suggested by the transformer are crucial to the process, if the language model cannot find a step that is essential for proving the theorem, then there is no way for the “thinking slow” system to reach the end. Similarly, I wonder how much of the system’s exploration time is spent on dead ends or even hallucinated steps.
More broadly, it is not entirely surprising that the first class of problems “cracked” by AI systems is precisely geometry. Geometric problems are the most “mechanical”, the main difficulty being precisely to find the right intermediate constructs with which to explore a rapidly branching space of solutions. Without taking any merit away from this work, it seems like there may be a long way to go before this same class of problem solvers can tackle problems in other areas like, say, number theory.
Regardless, I am quite interested to see what cleverly designed neurosymbolic systems like this one, where a pure AI model that brings the intuition is combined with a logical, algorithmic component, can achieve in the future.