“Dead shopping malls rise like mountains beyond mountains. And there’s no end in sight.”
Régine Chassagne
Sometimes I wonder whether my PhD would have been simpler if I had broken the findings up into three smaller papers. In the end there were 7 main figures, 7 supplementary figures, 5 supplementary tables and one supplementary data section in one solitary publication: the contents of a 3-year-3-month tour through the helper T cell response to the inner proteins of the flu virus. The experimental work comprised crystal structures, cell assays, tetramer staining and TCR sequencing. During the following years, as it was batted back and forth between last authors, different journals and reviewers, I continually reworked the figures and added extra bioinformatic analyses. I was fortunate that others in the lab kindly performed some in vivo experiments which helped cement the findings. It all started in January 2014, but the paper wasn’t published until July 2020. There are many terms which could be used to describe how the process of writing and re-writing felt as it dragged on through my 3-year postdoc; for the purposes of this very public blog I will refer to it as “a slog”.
Binding a desired protein tightly is important for biotechnology. Recent advances in deep learning have allowed the de novo design of (mostly α-helical) binding proteins, sidestepping the laborious process of raising antibodies or nanobodies, or of evolving affibodies, DARPins and the like. These deep-learning-designed binders will bind with okay affinity, but what if the required affinity were much stronger? <Enter autocatalytic isopeptide bonds>
At the recent OPIG retreat, I was tasked with writing the pub quiz. The quiz included five rounds, and it’s always fun to do a couple of “how well do you know your group?”-style rounds. Since I work with Transformers, I thought it would be fun to get AI to create haiku summaries of OPIGlet research descriptions from the website.
AI isn’t as funny as it used to be, but it’s a lot easier to get it to write something coherent. There are also lots of knobs you can turn, like temperature, top_p, and the details of the prompt. I decided to use Meta’s new Llama 3.2-3B-Instruct model, which is publicly available on Hugging Face. I ran it locally using vLLM and instructed it to write a haiku for each member’s description using a short script which parses the HTML from the website.
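For anyone curious, a minimal sketch of that pipeline might look like the code below, assuming vLLM is installed and the research descriptions have already been scraped into a list of strings; the model name is the public Hugging Face identifier, while the prompt wording, sampling settings and example descriptions here are illustrative rather than the exact ones I used.

```python
# Minimal sketch of the haiku pipeline (illustrative prompt and sampling settings).
from vllm import LLM, SamplingParams

# Assume the research descriptions were already scraped from the website.
descriptions = [
    "I work on deep learning methods for antibody structure prediction.",
    "My research focuses on virtual screening of small molecules.",
]

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.9, max_tokens=64)

prompts = [
    f"Write a haiku summarising this research description:\n{d}\nHaiku:"
    for d in descriptions
]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
    print()
```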
Many of our projects in OPIG come in the form of code bases written in Python: databases, machine learning models, and other software tools. Often, the user interface for these tools is developed as both a web app and a command-line application. Here, I will discuss one of my favourite tools for testing command-line applications: prysk!
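As a taste of what’s to come: prysk tests are plain-text .t files that read like shell sessions, where indented lines starting with “$ ” are executed and the indented lines beneath them are the expected output. A minimal, hypothetical example (the mytool command is made up for illustration) might look like this:

```
This is a prysk test for a hypothetical command-line tool:

  $ mytool --version
  mytool 1.0.0

  $ mytool greet World
  Hello, World!
```

Running prysk on that file executes each command and diffs the actual output against the expected lines, failing the test on any mismatch.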
The XChem facility at Diamond Light Source is a truly impressive feat of automation in fragment-based drug discovery. Visitors come clutching a styrofoam ice box teeming with apo-form protein crystals, which the shifter soaks with compounds from one or more fragment libraries; a robot at the i04-1 beamline then kindly processes each of the thousands of crystal-laden pins while the visitor enjoys the excellent food in the Diamond canteen (R22). I would especially recommend the jambalaya. Following data collection, the magic of data processing happens: the PanDDA method is used to find partial occupancy in the density, the hits are processed semi-automatically, and most open targets are uploaded to the Fragalysis web app, allowing the ligand binding to be studied and further compounds to be elaborated. This collection of targets bound to hundreds of small molecules is a true treasure trove of data, as many structures have yet to be deposited in the PDB, making it a perfect test set for algorithm design: fragments are notoriously fickle to model, and deep learning models cannot cheat by remembering them from the protein database.
The power of machine learning (ML) has captivated the field of small-molecule drug discovery. Increasingly, researchers and organisations are employing ML to build more accurate algorithms and improve the efficiency of the discovery process.
To be published, methods have to show that they improve upon others. Often, methods are tested against the same benchmarks within a field, allowing us to track progress over time. To explore the rate of improvement, I curated reported performance on three popular benchmarks. The first benchmark is CASF-2016, used to test the accuracy of methods that predict the binding affinity of experimentally determined protein-ligand complexes. Accuracy was measured using the Pearson’s R value between predicted and experimental affinity values.
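As a concrete illustration of that metric, the headline CASF “scoring power” number is just the Pearson correlation between a method’s predicted affinities and the measured ones; a toy sketch with made-up numbers:

```python
# Toy example of the CASF-style scoring-power metric: Pearson's R between
# predicted and experimental binding affinities (all values made up).
from scipy.stats import pearsonr

experimental = [6.2, 4.8, 7.5, 5.1, 8.0]  # e.g. measured pKd values
predicted = [5.9, 5.2, 7.1, 4.7, 7.6]     # a method's predictions

r, _ = pearsonr(experimental, predicted)
print(f"Pearson's R = {r:.2f}")
```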
Pandas is one of my favourite tools for data analysis in Python! Data frames offer a lot of power and organization for any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.
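As a preview, here is a minimal sketch of the kind of thing I mean: reading the ATOM records of a PDB file into a data frame using the fixed-width columns of the PDB format. The file name is a placeholder, and a real analysis would likely pull out more fields.

```python
# Minimal sketch: load ATOM/HETATM records from a PDB file into a DataFrame.
# "example.pdb" is a placeholder; column slices follow the fixed-width PDB format.
import pandas as pd

rows = []
with open("example.pdb") as fh:
    for line in fh:
        if line.startswith(("ATOM", "HETATM")):
            rows.append({
                "record": line[0:6].strip(),
                "atom_name": line[12:16].strip(),
                "res_name": line[17:20].strip(),
                "chain": line[21],
                "res_seq": int(line[22:26]),
                "x": float(line[30:38]),
                "y": float(line[38:46]),
                "z": float(line[46:54]),
            })

df = pd.DataFrame(rows)
print(df.groupby("chain").size())  # e.g. number of atoms per chain
```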
Fragmenstein is a Python module that combines hits, or places a derivative compound against given templates, while being very strict in obeying them. This is done by creating a “monster”, a compound that has the atomic positions of the templates, which is then reanimated by very strict energy minimisation. This happens in two steps: first in RDKit with an extracted, frozen neighbourhood, and then in PyRosetta within a flexible protein. The mapping for both combinations and placements is complicated, but here I will focus on a particular step, the minimisation, primarily in answer to an enquiry: how does the RDKit minimisation work?
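Before diving in, a toy RDKit sketch of the general idea, position-restrained minimisation with the MMFF force field, may help. To be clear, this is an illustration of the concept rather than Fragmenstein’s actual code, and the molecule and restrained atoms below are arbitrary.

```python
# Toy illustration of position-restrained MMFF minimisation in RDKit;
# this is NOT Fragmenstein's actual code, just the general concept.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("c1ccccc1CC(=O)N"))
AllChem.EmbedMolecule(mol, randomSeed=42)  # generate starting 3D coordinates

props = AllChem.MMFFGetMoleculeProperties(mol)
ff = AllChem.MMFFGetMoleculeForceField(mol, props)

# Pretend the six aromatic-ring atoms came from a template hit:
# restrain them near their current positions with a stiff harmonic term.
for idx in range(6):
    ff.MMFFAddPositionConstraint(idx, 0.1, 100.0)  # max displacement (A), force constant

ff.Initialize()
ff.Minimize()
print("Energy after minimisation:", ff.CalcEnergy())
```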
16:30 BST 27/06/2023 Oxford, UK. A large number of scientists were spotted riding bicycles across town, to the consternation of onlookers. The event was the Oxford Protein Informatics Group (OPIG) “tour de farce” 2023: a circular bike ride from the Department of Statistics to The Up in Arms (Marston), The Trout Inn (Godstow), The Perch (Port Meadow) and The Holly Bush (Osney Island). This spurred great bystander-anxiety due to any one of a multitude of factors: the impressive size of the jovial horde, the erraticism of the cycling, the deplorable maintenance of certain bikes, and the unchained bizarrerie of the overheard dialogue.
Last month, I had the privilege of being invited to the KAUST Research Conference on Computational Advances in Structural Biology, held from May 1-3, 2023. This gave me the opportunity to present some of OPIG’s latest work on small molecules while visiting an exceptional campus with state-of-the-art facilities in one of those corners of the world that are not widely known. Moreover, the experience went beyond the impressive surroundings, as I had the chance to attend a highly engaging conference and meet many scientists from different backgrounds.
KAUST Library (left) and Dining Hall (right)
The conference brought together experts in the field to explore cutting-edge developments in computational structural biology. It had a primary focus on advancements in protein structure prediction, multi-scale simulations, and integrative structural biology. Cryo-electron microscopy (cryo-EM) was the most popular experimental technique, with more than a third of the talks dedicated to its applications. These talks showcased impressive examples where structure prediction, simulations, and mid-resolution cryo-EM maps were combined to construct atomic models of large macromolecular complexes.
Notable examples of integrative works were presented by Jan Kosinski and Thomas Miller, among others. Jan Kosinski shared insights into the model of the human nuclear pore complex, highlighting the integration of cryo-electron tomography (cryo-ET), prior experimental knowledge, and AlphaFold predictions. Thomas Miller, on the other hand, presented his work on EM-based visual biochemistry, which combines single-particle cryo-EM and time-resolved experiments as a tool to study the molecular mechanisms of eukaryotic DNA replication.
There were also several talks about novel algorithms. Nazim Bouatta presented some lesser-known details about OpenFold and introduced some of their approaches to tackling the problem of multimer modelling. He also announced the future release of folding methods for predicting protein-ligand complexes. Jianlin Cheng presented MULTICOM, their new protein structure predictor based on consensus predictions from AlphaFold. Sergei Grudinin showed deep-learning tools able to predict protein dynamics, as well as some integrative modelling tools driven by low-resolution experimental observations, such as small-angle scattering.
On the cryo-EM methods side, Mikhail Kudryashev presented TomoBEAR and SUSAN, cryo-EM tools developed to automate the analysis of tomographic data, and Johannes Schwab presented DynaMight, a deep learning-based approach for heterogeneity analysis in single-particle cryo-EM. On the computational chemistry side, Haribabu Arthanari showed their ultra-large virtual screening platform, and Jean-Louis Reymond talked about tools to enumerate, visualize and search the vast chemical space of drug-like molecules.
Overall, the conference provided a diverse set of talks that facilitated multidisciplinary views and discussions. From protein structure prediction to integrative approaches combining experimental and computational methods, the talks showed the transformative potential of computational analysis in unravelling the complexities of biological macromolecules.