Tag Archives: Protein Structure Prediction

Estimating the Generalisability of Machine Learning Models in Drug Discovery

Machine learning (ML) has significantly advanced key computational tasks in drug discovery, including virtual screening, binding affinity prediction, protein-ligand structure prediction (co-folding), and docking. However, the extent to which these models generalise beyond their training data is often overestimated due to shortcomings in benchmarking datasets. Existing benchmarks frequently fail to account for similarities between the training and test sets, leading to inflated performance estimates. This issue is particularly pronounced in tasks where models tend to memorise training examples rather than learning generalisable biophysical principles. The figure below demonstrates two examples of model performance decreasing with increased dissimilarity between training and test data, for co-folding (left) and binding affinity prediction (right).

Conference summary: Generative AI in Life Science

This year I attended the second edition of Generative AI in Life Science (GenLife – https://genlife.dk/) and it was an enriching experience that I thoroughly enjoyed. Held in Copenhagen, the event brought together researchers from different areas of AI applied to the life sciences and provided a fantastic platform for networking, learning and sharing ideas. The programme included a mix of long and short talks from experts in the field, but also had a significant presence of emerging PIs, making the conference a perfect place to discover emerging groups in the field. Here I have collected some highlights of the talks I have enjoyed the most at the conference.

PyRosetta for RFdiffusion

I will not lie: I often struggle to find a snippet of code that does something in PyRosetta, or I spend hours on a problem caused by something not working as I expect it to. I recently did a tricky project involving RFdiffusion and I kept slipping on the PyRosetta side. So to make future me, others, and ChatGPT5 happy, here are some common operations to make working with PyRosetta for RFdiffusion easier.

Useful metrics and their meanings

Short and selfish blog post here. It has probably been done before, but I shall carry on regardless. I am going to review some metrics relevant to our area of immunoinformatics. In other words, I will try to dissect things such as perplexity, logits, pTM, pLDDT and the ABodyBuilder2 confidence score. These numbers can help inform us about the likelihood of predictions, and whether we should have confidence in them.
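To make two of these terms concrete, here is a minimal sketch of how logits relate to probabilities and how perplexity is usually defined (the exponential of the mean negative log-probability of the observed tokens). The function names and the toy numbers are my own illustration and are not taken from any particular model.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def perplexity(token_probs):
    """exp(mean negative log-probability) of the tokens that were observed."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Toy example: the model gave each observed residue a probability of 0.25,
# which is as uncertain as choosing uniformly between four options.
toy_probs = [0.25, 0.25, 0.25, 0.25]
toy_perplexity = perplexity(toy_probs)
```

A perplexity close to 1 means the model was nearly certain about each observed token; larger values mean it was effectively choosing between more options.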

Current strategies to predict structures of multiple protein conformational states

Since the release of AlphaFold2 (AF2), the problem of protein structure prediction is widely believed to be solved. Current structure prediction tools, such as AF2, are able to model most proteins with high accuracy. These methods, however, have a major limitation as they have been trained to predict a single structure for a given protein. Proteins are highly dynamic molecules, and their function often depends on transitions between several conformational states. Despite research focusing on the task of predicting the structures of multiple conformations of a protein, currently, no accurate and reliable method is available. In this blog post, I will provide a short overview of the strategies developed for predicting protein conformations. I have grouped these into three sets of related approaches. To conclude, I will also demonstrate how to run one of these strategies on your own.

AlphaFold 2 is here: what’s behind the structure prediction miracle

Nature has now released that AlphaFold 2 paper, after eight long months of waiting. The main text reports more or less what we have known for nearly a year, with some added tidbits, although it is accompanied by a painstaking description of the architecture in the supplementary information. Perhaps more importantly, the authors have released the entirety of the code, including all details to run the pipeline, on GitHub. And there is no small print this time: you can run inference on any protein (I’ve checked!).

Have you not heard the news? Let me refresh your memory. In November 2020, a team of AI scientists from Google DeepMind indisputably won the 14th Critical Assessment of Structure Prediction (CASP14) competition, a biennial blind test where computational biologists try to predict the structure of several proteins whose structures have been determined experimentally but not publicly released. Their results were so astounding, and the problem so central to biology, that it took the entire world by surprise and left an entire discipline, computational biology, wondering what had just happened.

Journal Club: Large-scale structure prediction by improved contact predictions and model quality assessment.

With the advent of statistical techniques to infer protein contacts from multiple sequence alignments (which you can read more about here), accurate protein structure prediction in the absence of a template has become possible. Taking advantage of this fact, there have been efforts to brave the sea of protein families for which no structure is known (about 8,500 – over 50% of known protein families) in an attempt to predict their topology. This is particularly exciting given that protein structure prediction has been an open problem in biology for over 50 years and, for the first time, the community is able to perform large-scale predictions and have confidence that at least some of those predictions are correct.

Based on these trends, last group meeting I presented a paper entitled “Large-scale structure prediction by improved contact predictions and model quality assessment”. This paper is the culmination of years of work, making use of a large number of computational tools developed by the Elofsson Lab at Stockholm University. With this blog post, I hope to offer some insights as to the innovative findings reported in their paper.

Let me begin by describing their structure prediction pipeline, PconsFold2. Their method for large-scale structure prediction can be broken down into three components: contact prediction, model generation and model quality assessment. As the very name of their article suggests, most of the innovation of the paper stems from improvements in contact prediction and the quality assessment protocols used, whereas for their model generation routine, they opted to sacrifice some quality in favour of speed. I will try and dissect each of these components over the next paragraphs.

Contact prediction relates to the process in which residues that share spatial proximity in a protein’s structure are inferred from multiple sequence alignments by co-evolution. I will not go into the details of how these protocols work, as they have been previously discussed in more detail here and here. The contact predictor used in PconsFold2 is PconsC3, which is another product of the Elofsson Lab. There was some weirdness with the referencing of PconsC3 in the PconsFold2 article, but after a quick Google search, I was able to retrieve the article describing PconsC3 and it was worth a read. Other than showcasing PconsC3’s state-of-the-art contact prediction capabilities, the original PconsC3 paper also provides figures for the number of protein families for which accurate contact prediction is possible (over 5,000 of the ~8,500 protein families in Pfam without a member of known structure). To me, the PconsC3 article feels like a prequel to the paper I presented. The bottom line here is that PconsC3 is a reliable tool for predicting contacts from multiple sequence alignments and is a sensible choice for the PconsFold2 pipeline.

Another aspect of contact prediction that the authors explore is the idea that the precision of contact prediction is dependent on the quality of the underlying multiple sequence alignment (MSA). They provide a comparison of the Positive Predictive Value (PPV) of PconsC3 using different MSAs on a test set of 626 protein domains from Pfam. This is the first time I have encountered such a comparison, and it serves to highlight the importance the MSA has on the quality of the resulting contact predictions. In the PconsFold2 pipeline, the authors use a consensus approach: they identify the consensus of four predicted contact maps, each using a different alignment. Alignments were generated using Jackhmmer and HHblits at E-value cutoffs of 1 and 10^-4.
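For readers who have not computed a PPV for contacts before, a minimal sketch follows. It assumes predicted contacts come as (i, j, score) tuples and the true contacts as a set of residue pairs; these data structures and the function name are my own illustration, not the format used by PconsC3.

```python
def contact_ppv(predicted, true_contacts, top_n):
    """Fraction of the top_n highest-scoring predicted contacts that are real.

    predicted:     list of (i, j, score) tuples (hypothetical format)
    true_contacts: set of frozenset({i, j}) pairs observed in the structure
    """
    ranked = sorted(predicted, key=lambda c: c[2], reverse=True)[:top_n]
    if not ranked:
        return 0.0
    hits = sum(1 for i, j, _ in ranked if frozenset((i, j)) in true_contacts)
    return hits / len(ranked)

# Toy data: three predictions, two of which are true contacts.
toy_predicted = [(1, 10, 0.9), (2, 20, 0.8), (3, 30, 0.1)]
toy_true = {frozenset((1, 10)), frozenset((3, 30))}
ppv_top2 = contact_ppv(toy_predicted, toy_true, top_n=2)
```

In practice top_n is usually chosen relative to the sequence length L, so that proteins of different sizes can be compared.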

Now, moving on to the model generation routine. PconsFold2 makes use of CONFOLD to perform model generation. CONFOLD, in turn, uses the simulated annealing routine of the Crystallographic and NMR System (CNS) to produce models based on spatial and geometric constraints. To derive those constraints, predicted secondary structure and the top 2.5 L predicted contacts are given as input. The authors do note that the refinement stage of CONFOLD is omitted, which is a convenience I assume was adopted to save computational time. The article also acknowledges that models generated by CONFOLD are likely to be less accurate than the ones produced by Rosetta, yet a compromise was made in order to make the large-scale comparison feasible in terms of resources.

One particular issue that we often discuss when performing structure prediction is the number of models that should be produced for a particular target. The authors performed a test to assess how many decoys should be produced and, albeit simplistic in their formulation, their results suggest that 50 models per target should be sufficient. Increasing this number further did not lead to improvements in the average quality of the best models produced for their test set of 626 proteins.

After producing 50 models using CONFOLD, the final step in the PconsFold2 protocol is to select the best possible model from this ensemble. Here, they present a novel method, PcombC, for ranking models. PcombC combines the clustering-based method Pcons, the single-model deep learning method ProQ3D, and the proportion of predicted contacts that are present in the model. These three scores are combined linearly, and are given weights that were optimised via a parameter sweep. One of my reservations relating to this paper is that little detail is given regarding the data set that was used to perform this optimisation. It is unclear from their methods section whether the parameter sweep was run on the test set of 626 proteins used throughout the manuscript. Given that no other data set (with known structures) is ever introduced, this scenario seems likely. Therefore, all the classification results obtained by PcombC, and all of the reported TM-score Top results, should be interpreted with care, since performance on a validation set tends to be poorer than on a training set.
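The idea of a linear combination of three quality scores can be sketched as below. The weights here are placeholders chosen purely for illustration; the optimised values should be taken from the paper, and the function names are hypothetical.

```python
def pcomb_like_score(pcons, proq3d, contact_fraction, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of three per-model scores (placeholder weights, not the paper's)."""
    w_pcons, w_proq3d, w_contacts = weights
    return w_pcons * pcons + w_proq3d * proq3d + w_contacts * contact_fraction

def rank_models(models, weights=(0.4, 0.4, 0.2)):
    """Return model names sorted from best to worst combined score.

    models: dict mapping a model name to (pcons, proq3d, contact_fraction)
    """
    return sorted(models,
                  key=lambda name: pcomb_like_score(*models[name], weights=weights),
                  reverse=True)

# Toy ensemble of three decoys with made-up scores.
toy_models = {
    "decoy_a": (0.5, 0.5, 0.5),
    "decoy_b": (0.9, 0.2, 0.1),
    "decoy_c": (0.6, 0.7, 0.8),
}
toy_ranking = rank_models(toy_models)
```

A parameter sweep would simply repeat this ranking for a grid of weight triples and keep the one that best recovers the true model quality on a set of proteins with known structures, which is why the choice of that set matters so much.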

Recapitulating the PconsFold2 pipeline:

Step 1: Generate four multiple sequence alignments using HHblits and Jackhmmer.
Step 2: Generate four predicted contact maps using PconsC3.
Step 3: Use CONFOLD to produce 50 models using a consensus of the contact maps from step 2.
Step 4: Use PcombC to rank the models based on a linear combination of the Pcons and ProQ3D scores and the proportion of predicted contacts that are present in the model.

So, how well does PconsFold2 perform? The conclusion is that it depends on the quality of the contact predictions. For the protein families where abundant sequence information is available, PconsFold2 produces a correct model (TM-score > 0.5) for 51% of the cases. This is great news. First, because we know which cases have abundant sequence information beforehand. Second, because this comprises a large number of protein families of unknown structure. As the number of effective sequences (a common way to assess the amount of information available in an MSA) decreases, the proportion of families for which a correct model has been generated also decreases, which restricts the applicability of their method to protein families with abundant sequence information. Nonetheless, given that protein sequence databases are growing exponentially, the number of cases where protein structure prediction achieves success is likely to increase over the next years.
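As an illustration of what a number of effective sequences measures, here is a minimal sketch of one common definition: each sequence is down-weighted by the number of sequences in the alignment that are near-identical to it. The 80% identity cutoff is an assumption on my part, and the paper may use a different definition or cutoff.

```python
def sequence_identity(a, b):
    """Fraction of identical columns between two aligned, equal-length sequences.

    Simplification: columns where both sequences have a gap also count as matches.
    """
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def effective_sequences(msa, identity_cutoff=0.8):
    """Sum of 1 / (number of sequences within the identity cutoff, self included)."""
    total = 0.0
    for seq in msa:
        neighbours = sum(1 for other in msa
                         if sequence_identity(seq, other) >= identity_cutoff)
        total += 1.0 / neighbours
    return total

# Toy alignment: two identical sequences and one unrelated sequence.
toy_msa = ["AAAAA", "AAAAA", "CCCCC"]
toy_neff = effective_sequences(toy_msa)
```

The point of the weighting is that a thousand copies of nearly the same sequence carry little more co-evolutionary signal than one copy, so the raw depth of an alignment can be misleading.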

One interesting detail that I was curious about was the length distribution of the cases where modelling was successful. Can we detect the cases for which good models were produced simply by looking at a combination of length and number of effective sequences? The authors never address this question, and I think it would provide some nice insights as to which protein features are correlated to modelling success.

We are still left with one final problem to solve: how do we separate the cases for which we have a correct model from the ones where modelling has failed? This is what the authors address with the last two subsections of their Results. In the first of these sections, the authors compare four ways of ranking decoys: PcombC, Pcons, ProQ3D, and the CNS contact score. They report that, for the test set of 626 proteins, PcombC obtains the highest Pearson’s Correlation Coefficient (PCC) between the predicted and observed TM-Score of the highest ranking models. As mentioned before, this measure could be overestimated if PcombC was, indeed, trained on this test set. Reported PCCs are as follows: PcombC = 0.79, Pcons = 0.73, ProQ3D = 0.67, and CNS-contact = -0.56.

In their final analysis, the authors compare the ability of each of the different Quality Assessment (QA) scores to discern between correct and incorrect models. To do this, they only consider the top-ranked model for each target according to different QA scores. They vary the false positive rate and note the number of true positives they are able to recall. At a 10% false positive rate, PcombC is able to recall about 50% of the correct models produced for the test set. This is another piece of good news. The bottom line is: if we have sufficient sequence information available, PconsFold2 can generate a correct model 51% of the time. Furthermore, it can detect 50% of these cases, meaning that for ~25% of the cases it produced something good and it knows the model is good. This opens the door for looking at these protein families with no known structure and trying to accurately predict their topology.
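The "recall at a fixed false positive rate" analysis, and the back-of-the-envelope 25% figure, can be sketched as follows. The function name and toy data are my own; the sketch assumes that a higher QA score means a more trustworthy model and that both correct and incorrect models are present.

```python
def recall_at_fpr(scores, labels, max_fpr):
    """Highest recall achievable by a score threshold whose FPR stays <= max_fpr.

    scores: QA score of the top-ranked model for each target (higher is better)
    labels: True if that model is correct (e.g. TM-score > 0.5), else False
    """
    n_pos = sum(1 for l in labels if l)
    n_neg = len(labels) - n_pos
    best = 0.0
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best

# Toy data: six targets, three with a correct top-ranked model.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [True, True, False, True, False, False]
toy_recall = recall_at_fpr(toy_scores, toy_labels, max_fpr=0.0)

# The arithmetic from the text: 51% correct models, half of which are detected.
confident_fraction = 0.51 * 0.50  # about a quarter of all cases
```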

That is exactly what the authors did! In the most interesting section of the paper (in my opinion), the authors predict the topology of 114 protein families (at an FPR of 1%) and 558 protein families (at an FPR of 10%). Furthermore, the authors compare the overlap of their results with the ones reported by a similar study from the Baker group (previously presented at group meeting here) and find that, at least for some cases, the predictions agree. These large-scale efforts force us to revisit the way we see template-free structure prediction, which can no longer be dismissed as a way of obtaining structural models when sufficient sequences are available. This is a remarkable achievement for the protein structure prediction community, with the potential to change the way we conduct structural biology research.

CCP4 Study Weekend 2017: From Data to Structure

This year’s CCP4 study weekend focused on providing an overview of the processes and pipelines available to take crystallographic diffraction data from spot intensities right through to structure. Sessions therefore included processing diffraction data, phasing through molecular replacement and experimental techniques, and automated model building and refinement, as well as updates to CCP4 and a look at where crystallography is going to take us in the future.

Alongside the meeting there was also a session for macromolecular crystallography (MX) users of Diamond Light Source (DLS), which gave an update on the beamlines and scientific software, as well as examples of how fragment screening at DLS has been used. The VMXi (Versatile Macromolecular X-tallography in-situ) beamline is being developed to image crystals that are forming in in situ crystallisation plates. This should allow crystallography to be optimised, as crystallisation conditions can be screened and data collected on experiments as they crystallise, which is especially helpful in cases where crystallisation has routinely led to non-diffracting crystals. VMXm is a micro/nanofocus MX beamline in development, with a focus on getting crystallographic data from very small crystals (~300 nm to 10 micron diameters, with a bias to the smaller size), thereby allowing crystallography of targets for which it has previously been hard to obtain adequate crystals. Other updates included how technology developed for fast solid-state data collection on X-ray free electron lasers (XFEL) can be used on synchrotron beamlines.

A slightly more in-depth discussion of two tools presented that were developed for use alongside and within CCP4, which might be of interest more broadly:

ConKit: A python interface for contact prediction tools

Contact prediction for proteins, at its simplest, involves estimating which residues are within a certain spatial proximity of each other, given the sequence of the protein, or proteins (for complexes and interfaces). Two major types of contact prediction exist:

Evolutionary Coupling
- Take a series of sequence homologues and identify co-evolved residues from a multiple sequence alignment of the protein family. These co-evolved residues are hypothesized to share a functional dependence. Discussed previously on BLOPIG: Predicted protein contacts: is it the solution to (de novo) protein structure prediction?
Supervised machine learning
- Use ab initio structure prediction tools, without sequence homologues, to predict which contacts exist, but with a much lower accuracy than evolutionary coupling.
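To make "within a certain spatial proximity" concrete, here is a minimal sketch that derives the true contacts from a known structure. The 8 Angstrom cutoff and the use of one representative atom per residue are common conventions that I am assuming here; this is plain Python and not the ConKit API.

```python
import math

def contact_map(coords, cutoff=8.0, min_separation=1):
    """Return the set of residue index pairs (i, j), i < j, closer than the cutoff.

    coords:         one (x, y, z) tuple per residue (e.g. a representative atom)
    cutoff:         distance threshold in Angstrom (assumed convention)
    min_separation: minimum sequence separation j - i, to skip trivial neighbours
    """
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts

# Toy chain: three residues 3.8 Angstrom apart on a line, plus one far away.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_contacts = contact_map(toy_coords, min_separation=2)
```

A contact predictor tries to recover this set of pairs from sequence information alone, which is why a known structure is needed to score it.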

ConKit is a python interface (API) for contact prediction tools, consisting of three major modules:

Core: A module for constructing hierarchies, thereby storing necessary data such as sequences in a parsable format.
- Providing common functionality through functions that for example declare a contact as a false positive.
Application: Python wrappers for common contact prediction and sequence alignment applications
- CCMPred
- CdHit
- HHblits
- HHfilter
- Jackhmmer
- Psicov
- BbContacts
I/O: I/O interface for file reading, writing and conversions.

Contact prediction can be used in crystallographic structure determination, during unconventional molecular replacement, using a tool such as AMPLE. Molecular replacement is a computational strategy to solve the phase problem. In the typical case, homologous structures are used to build a model of the protein that best fits the experimental diffraction intensities, and thus to estimate the phases. AMPLE utilises ab initio modelling (using Rosetta) to generate a model for the protein; contact prediction can provide input to this ab initio modelling, thereby making it more feasible to generate an appropriate structure from which to solve the phase problem. Contact prediction can also be used to analyse known and unknown structures, to identify potential functional sites.

For more information: Talk given at CCP4 study weekend (Felix Simkovic), ConKit documentation

ACEDRG: Generating Crystallographic Restraints for Ligands

Small molecule ligands are present in many crystallographic structures, especially in drug development campaigns. Proteins are formed (almost exclusively) from a sequence containing a selection of 20 amino acids, which means there are well-known restraints (for example: bond lengths, bond angles, torsion angles and rotamer positions) for model building or refinement of amino acids. As ligands can be built from a much wider selection of chemical moieties, they have not previously been restrained as well during MX refinement. Ligands found in PDB depositions can be used as models for the model building or refinement of ligands in new structures; however, there are a limited number of ligands available (~23,000). Furthermore, the resolution of the ligands is limited to the resolution of the macromolecular structure from which they are extracted.

ACEDRG utilises the Crystallography Open Database (COD), a library of (>300,000) small molecules usually with atomic resolution data (often at least 0.84 Angstrom), to generate a dictionary of restraints to be used in refining the ligand. To create these restraints, ACEDRG utilises the RDKit chemoinformatics package, generating a detailed descriptor of each atom of the ligands in COD. The descriptor utilises properties of each atom including the element name, number of bonds, environment of nearest neighbours, and third-degree neighbours that are aromatic ring systems. The descriptor is stored alongside the electron density values from the COD. When an ACEDRG query is generated, for each atom in the ligand, the atom type is compared to those for which a COD structure is available; the nearest match is then used to generate a series of restraints for the atom.

ACEDRG can take a molecular description (SMILES, SDF MOL, SYBYL MOL2) of your ligand and generate appropriate restraints for refinement (atom types, bond lengths and angles, torsion angles, planes and chirality centers) as an mmCIF file. These restraints can be generated for a number of different probable conformations for the ligand, such that it can be refined in these alternate conformations; the refinement program can then use local scoring criteria to select the ligand conformation that best fits the observed electron density. ACEDRG can be accessed through the CCP4i2 interface, and as a command line interface.

Hopefully a useful insight to some of the tools presented at the CCP4 Study weekend. For anyone looking for further information on the CCP4 Study weekend: Agenda, Recording of Sessions, Proceedings from previous years.

Is “fragment-based” still the way forward in template-free protein structure prediction?

Out of the many questions surrounding the notion that you can predict a protein’s structure from its sequence, there is one in particular that I decided to tackle during last group meeting.

Protein structure prediction is a hard problem (do I sound repetitive?). One of the many cop outs employed by the structure prediction community is the idea that you can break down known structures into fragments and use these protein pieces to perform predictions. This is known as fragment-assembly or fragment-based template-free protein structure prediction.

As absurd as the idea may seem, there is robust evidence that suggests that this is actually a viable strategy. There is a notion that the fragment space is complete; you can reconstruct the backbone of any known structure based on the torsion angles of fragments from other structures. In less technical jargon, you can effectively use fragments and combine them to re-create any of the protein structures that we know and to a fairly acceptable level of precision.

So, technically, it is possible to predict a protein structure using fragments from other structures. In practice, you are still left with the problem of choosing the right fragments to model your sequence of interest. How easy do you think that is?

We can look at this question in light of observations that were made back in the early 80s. Kabsch and Sander reported that two protein fragments having exactly the same sequence can present completely different structures [1]. This complies with the notion that global properties can affect and even define local structure, which in turn suggests that selecting the right fragments to assemble a structure is not necessarily a straightforward process.
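A minimal sketch of how one might look for such cases: collect every sequence fragment of a given length together with the secondary structure it adopts, and report fragments seen in more than one state. The input format (a sequence paired with a per-residue secondary-structure string) and the toy data are my own assumptions, not the procedure used in [1].

```python
from collections import defaultdict

def conflicting_fragments(entries, k=5):
    """Find identical k-residue fragments that occur with different structures.

    entries: list of (sequence, secondary_structure) string pairs of equal
             length, e.g. H = helix, E = strand, C = coil (hypothetical format)
    Returns a dict mapping each conflicting fragment to its set of states.
    """
    seen = defaultdict(set)
    for seq, ss in entries:
        for i in range(len(seq) - k + 1):
            seen[seq[i:i + k]].add(ss[i:i + k])
    return {frag: states for frag, states in seen.items() if len(states) > 1}

# Toy data: the same pentapeptide is helical in one protein and a strand in another.
toy_entries = [
    ("AVLKEGA", "HHHHHCC"),
    ("GAVLKEG", "CEEEEEC"),
]
toy_conflicts = conflicting_fragments(toy_entries)
```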

The starting point for protein structure prediction is a sequence. Since we are talking about template-free protein structure prediction, it is safe to assume that there is no good global sequence match to your target with a known structure (otherwise you would use that match/structure as a template). Hence, fragment selection is restricted to local sequence similarity, which, as suggested in the previous paragraph, is not necessarily ideal.

On the other hand, we are becoming increasingly more accurate in inferring one-dimensional properties from a protein’s sequence. These properties can and often are used to enhance our fragment-selection capabilities. Yet, even using the state-of-the-art in secondary structure and torsion angle prediction, fragment selection is still fairly imprecise.

During group meeting I highlighted a possible contrast between practical fragment space and general (or possible) fragment space. My premise is simple. I define practical fragment space as the fragments that we can accurately select from the possible fragment space to model protein structures. In my opinion, it would be extremely interesting to quantify the difference between the two. This would answer the fundamental question of how useful fragment-assembly actually is. More importantly, it would help the community make an educated decision in regards to whether template-free structure prediction strategies should shift from fragment-based to ones based on distance constraints, an approach that is gaining popularity due to the success of contact predictions.

I am very keen to investigate this further. Maybe for my next blog post, we will have an answer! Stay tuned.

[1] Kabsch, Wolfgang, and Christian Sander. “On the use of sequence homologies to predict protein structure: identical pentapeptides can have completely different conformations.” Proceedings of the National Academy of Sciences 81.4 (1984): 1075-1078.

Strachey Lecture – “Artificial Intelligence and the Future” by Dr. Demis Hassabis

For this week’s group meeting, some of us had the pleasure of attending a very interesting lecture by Dr. Demis Hassabis, founder of DeepMind. Personally, I found the lecture quite thought-provoking and left the venue with a plethora of ideas sizzling in my brain. Since one of the best ways to end mental sizzlingness is by writing things down, I volunteered to write this week’s blog post in order to say my piece about yesterday’s Strachey Lecture.

Dr. Hassabis began by listing some very audacious goals: “To solve intelligence” and “To use it to make a better world”. At the end of his talk, someone in the audience asked him if he thought it was possible to achieve these goals (“to fully replicate the brain”), to which he responded with a simple “there is nothing that tells us that we can’t”.

After his bold introductory statement, Dr. Hassabis pressed on. For the first part of his lecture, he engaged the audience with videos and concepts of a reinforcement learning agent trained to learn and play several ATARI games. I was particularly impressed with the notion that the same agent could be used to achieve a professional level of gaming for 49 different games. Some of the videos are quite impressive and can be seen here or here. Suffice it to say that their algorithm was much better at playing ATARI than I’ll ever be. It was also rather impressive to know that all the algorithm received as input was the game’s score and the pixels on the screen.

Dr. Hassabis mentioned in his lecture that games provide the ideal training ground for any form of AI. He presented several reasons for this, but the one that stuck with me was the notion that games quite often present a very simplistic and clear score. Your goal in a game is usually very well defined. You help the frog cross the road or you defeat some aliens for points. However, what I perceive to be the greatest challenge for AI is the fact that real world problems do not come with such a clear-cut, incremental score.

For instance, let us relate back to my particular scientific question: protein structure prediction. It has been suggested that much simpler algorithms such as Simulated Annealing are able to model protein structures as long as we have a perfect scoring system [Yang and Zhou, 2015]. The issue is, currently, the only way we have to define a perfect score is to use the very structure we are trying to predict (which kinda takes the whole prediction part out of the story).
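To illustrate the point, here is a minimal sketch of simulated annealing on a toy problem where the "perfect score" is simply the squared distance to a known answer, which is exactly the circularity described above. The cooling schedule, step size and all names are my own choices for illustration and are not taken from [Yang and Zhou, 2015].

```python
import math
import random

def simulated_annealing(score, start, propose, steps=5000,
                        t_start=1.0, t_end=1e-3, seed=0):
    """Minimise score() with the Metropolis acceptance rule and geometric cooling."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for step in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = propose(current, rng)
        delta = score(candidate) - current_score
        # Always accept improvements; accept worse moves with probability exp(-delta/t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_score = candidate, current_score + delta
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score

# Toy "perfect score": squared distance to the answer we are trying to find.
TARGET = (1.0, -2.0, 0.5)

def perfect_score(x):
    return sum((a - b) ** 2 for a, b in zip(x, TARGET))

def perturb(x, rng):
    """Move one randomly chosen coordinate by a small Gaussian step."""
    i = rng.randrange(len(x))
    return tuple(v + rng.gauss(0.0, 0.1) if k == i else v for k, v in enumerate(x))

toy_start = (0.0, 0.0, 0.0)
toy_best, toy_best_score = simulated_annealing(perfect_score, toy_start, perturb)
```

With a score this informative, even a simple search finds its way towards the answer; the hard part in real structure prediction is that no such score is available without already knowing the structure.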

Real world problems are hard. I am sure this is no news to anyone, including the scientists at DeepMind.

During the second part of his talk, Dr. Hassabis focused on AlphaGo. AlphaGo is DeepMind’s effort at mastering the ancient game of Go. What appealed to me in this part of the talk is the fact that Go has such a large number of possible configurations that devising an incremental score is no simple task (sound familiar?). Yet, somehow, DeepMind scientists were able to train their algorithm to a point where it defeated a professional Go player.

Their next challenge? In two weeks, AlphaGo will face the professional Go player with the highest number of titles in the last decade (the best player in the world?). This reminds me of when Garry Kasparov faced Deep Blue. After the talk, my fellow OPIG colleagues also seemed to be pretty excited about the outcome of the match (man vs. ~~food~~ computer).

Dr. Hassabis finished by saying that his career goal would be to develop AI that is capable of helping scientists tackle the big problems. From what I gather (and from my extremely biased point of view; protein structure prediction mindset), AI will only be able to achieve this goal once it is capable of coming up with its own scores for the games we present it to play with (hence developing some form of impetus). Regardless of how far we are from achieving this, at least we have a reason to cheer for AlphaGo in a couple of weeks (because hey, if you are trying to make our lives easier with clever AI, I am all up for it).

Oxford Protein Informatics Group

or "OPIG" to friends
