A week ago I participated in the Copenhagen Bioinformatics Hackathon 2021, a hackathon focusing on machine learning and proteins, as a mentor for a challenge proposed by our group. The whole experience was fun, but I am also sitting here thinking about a lot of things I wish I had done differently. In this blog post, I therefore want to highlight two changes which I believe would have greatly improved my challenge, and which can hopefully also serve as inspiration for others presenting a hackathon challenge.
Going into this event, I had some experience from a few hackathons I had previously attended. Based on this, I wanted to create a challenge with two parts: first, a simple task that everyone would be able to create a solution for, and second, a more challenging extension of the first task for more experienced participants. I decided to go with the challenge of predicting which antibody heavy and light chains can form a pair, where the additional challenge was to try to visualize which residues were relevant for this interaction. With OAS (the Observed Antibody Space database) providing a really nice positive dataset of paired chains, I thought this was going to be an amazing challenge, but as soon as the event began I started seeing its flaws.
My first problem was the dataset. While it is a nice dataset, it is also an extremely rich one, containing a lot of additional information about the sequences (more than 200 columns) and a lot of redundancy (two-thirds of the sequences were redundant). Additionally, it only contains positive examples, as part of the challenge was to create your own negatives. For a two-day hackathon where most participants hadn’t worked with antibodies before, the dataset was overwhelming, and most of the participants spent too much time on it. My first regret is therefore not having simplified the dataset by removing irrelevant columns (trimming it down to around 20) and redundant sequences. This would have made the data much less confusing and allowed the participants to focus on solving the core of the problem.
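A minimal sketch of what such a cleanup could look like, using only the Python standard library. The column names `heavy` and `light` are hypothetical placeholders, not the actual OAS column names, and the negatives are produced by simply re-pairing heavy and light chains at random, which is one possible strategy and only yields presumed (not experimentally confirmed) non-pairs.

```python
import random


def simplify(rows, keep_columns, heavy_col="heavy", light_col="light"):
    """Keep only the selected columns and drop repeated (heavy, light) pairs.

    `rows` is a list of dicts, one per paired sequence. The column names are
    hypothetical and would need to match the real dataset.
    """
    seen = set()
    cleaned = []
    for row in rows:
        key = (row[heavy_col], row[light_col])
        if key in seen:
            continue  # redundant pair, already kept once
        seen.add(key)
        cleaned.append({col: row[col] for col in keep_columns})
    return cleaned


def make_negatives(pairs, seed=0):
    """Build presumed negatives by giving each heavy chain a random light chain.

    Any re-pairing that happens to reproduce a known positive is skipped, so
    the result can be shorter than the input.
    """
    rng = random.Random(seed)
    positives = set(pairs)
    lights = [light for _, light in pairs]
    negatives = []
    for heavy, _ in pairs:
        light = rng.choice(lights)
        if (heavy, light) not in positives:
            negatives.append((heavy, light))
    return negatives
```

Handing participants the output of something like `simplify(rows, ["heavy", "light"])`, possibly together with a ready-made set of negatives, would let them start from a small table of unique pairs instead of the full export.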
My second problem was that the simple task was not simple for anyone who hadn’t worked with antibodies before or wasn’t proficient with deep learning. To solve this “simple task”, you needed to know that the sequences have to be aligned and encoded, e.g. with one-hot encoding, before they can be fed into an ML algorithm such as a Random Forest, or you needed to know how to code and train a CNN or RNN model. Instead of having participants create a model from scratch, I should have provided them with a good baseline model as their starting point, e.g. a CNN (which can be trained much faster than an RNN and therefore fits the tight schedule). This way, participants unfamiliar with machine learning would spend the hackathon playing around with a deep learning model and learning how it can be used to solve a protein problem, without the panic of trying to code one in a short period of time. Moreover, participants familiar with ML would be able to spend their time expanding and improving the model straight away, instead of having to use the first day coding a simple CNN.
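For readers who haven’t seen the encoding step before, here is a minimal sketch of one-hot encoding a heavy/light pair in plain Python. It assumes the sequences are either already aligned (with `-` as the gap character) or are simply padded to a fixed length, which is a simplification of a proper alignment step; the function names and the default lengths of 150 and 130 are arbitrary choices for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence, length):
    """Encode a sequence as a `length` x 20 matrix of 0s and 1s.

    Sequences longer than `length` are truncated, shorter ones are padded
    with all-zero rows. Gaps ("-") and unknown residues also stay all-zero.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(length)]
    for pos, aa in enumerate(sequence[:length]):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


def encode_pair(heavy, light, heavy_len=150, light_len=130):
    """Stack the heavy and light encodings into one (heavy_len + light_len) x 20 matrix.

    The default lengths are illustrative assumptions, not values from the dataset.
    """
    return one_hot(heavy, heavy_len) + one_hot(light, light_len)


def flatten(matrix):
    """Flatten a matrix into a single feature vector, e.g. for a Random Forest."""
    return [value for row in matrix for value in row]
```

The matrix from `encode_pair` is the kind of input a CNN would take, while `flatten(encode_pair(heavy, light))` gives a fixed-length vector for classical models.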
There are definitely other things I could have done better, but I believe these two points (providing an easy-to-understand dataset and a good solution to the “simple” task) would have greatly improved the challenge, not just for the participants but also for us, as we might have seen more creative ideas.