Since the release of AlphaFold2 (AF2), the problem of protein structure prediction has been widely regarded as solved. Current structure prediction tools, such as AF2, are able to model most proteins with high accuracy. These methods, however, have a major limitation: they have been trained to predict a single structure for a given protein. Proteins are highly dynamic molecules, and their function often depends on transitions between several conformational states. Despite research into predicting the structures of multiple conformations of a protein, no accurate and reliable method is currently available. In this blog post, I will provide a short overview of the strategies developed for predicting protein conformations. I have grouped these into three sets of related approaches. To conclude, I will also demonstrate how to run one of these strategies yourself.
What can you do with the OPIG Immunoinformatics Suite? v3.0
OPIG’s growing immunoinformatics team continues to develop and openly distribute a wide variety of databases and software packages for antibody/nanobody/T-cell receptor analysis. Below is a summary of all the latest updates (follows on from v1.0 and v2.0).
9th Joint Sheffield Conference on Cheminformatics
Over the next few days, researchers from around the world will be gathering in Sheffield for the 9th Joint Sheffield Conference on Cheminformatics. As one of the organizers (wearing my Molecular Graphics and Modeling Society ‘hat’), I can say we have an exciting array of speakers and sessions:
- De Novo Design
- Open Science
- Chemical Space
- Physics-based Modelling
- Machine Learning
- Property Prediction
- Virtual Screening
- Case Studies
- Molecular Representations
It has traditionally taken place every three years, but despite the global pandemic it is returning this year, once again in person in the excellent conference facilities at The Edge. You can download the full programme in iCal format, and here is the conference calendar:
Checking your PDB file for clashing atoms
Detecting atom clashes in protein structures can be useful in a number of scenarios, for example if you are just about to start a molecular dynamics simulation, or if you want to check that a structure generated by a deep learning model is reasonable. It is quite straightforward to code, but I get the feeling that this sort of function has been written from scratch hundreds of times. So to save you the effort, here is my implementation!
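As a rough illustration of the idea (this is a minimal sketch, not the implementation from the full post), a clash check can be written with the standard library alone. The function names `parse_atoms` and `find_clashes`, the 2.0 Å cutoff, and the rule of skipping atoms in the same or sequence-adjacent residues are all assumptions made for this example; a more careful check would use element-specific radii and explicit bond information.

```python
import math

def parse_atoms(pdb_text):
    """Collect name, (chain, residue number) and coordinates from ATOM/HETATM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),
                "res": (line[21], int(line[22:26])),  # chain ID, residue number
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms

def find_clashes(atoms, cutoff=2.0):
    """Return (atom_a, atom_b, distance) for pairs closer than `cutoff` angstroms.

    Pairs in the same residue or in sequence-adjacent residues of the same
    chain are skipped, as a crude way of ignoring covalently bonded atoms.
    The cutoff is an assumed value, not a recommendation from the post.
    """
    clashes = []
    for i, a in enumerate(atoms):
        for b in atoms[i + 1:]:
            same_chain = a["res"][0] == b["res"][0]
            if same_chain and abs(a["res"][1] - b["res"][1]) <= 1:
                continue
            distance = math.dist(a["xyz"], b["xyz"])
            if distance < cutoff:
                clashes.append((a, b, distance))
    return clashes
```

The double loop is quadratic in the number of atoms, which is fine for a single small structure; for large systems a spatial grid or k-d tree would be the usual replacement.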
Cross-linking mass-spectrometry: a guide to conformational confusions.
In the age of highly accurate structure prediction methods, I have seen more and more usage of cross-linking mass-spectrometry (XL-MS) and I wanted to understand its limitations more carefully. This is more of a guide to interpreting the data rather than how to perform the experiment.
Train Your Own Protein Language Model In Just a Few Lines of Code
Language models have token the world by storm recently and, given the already explored analogies between protein primary sequence and text, there’s been a lot of interest in applying these models to protein sequences. Interest is not only coming from academia and the pharmaceutical industry, but also some very unlikely suspects such as ByteDance – yes the same ByteDance of TikTok fame. So if you also fancy trying your hand at building a protein language model then read on, it’s surprisingly easy.
Training your own protein language model from scratch is made remarkably easy by the HuggingFace Transformers library, which allows you to specify a model architecture, tokenise your training data, and train a model in only a few lines of code. Under the hood, the Transformers library uses PyTorch (or optionally TensorFlow) models, allowing you to dig deeper into customising training or model architecture, or simply leave it to the highly abstracted Transformers library to handle it all for you.
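To make the tokenisation step a little more concrete, here is a minimal sketch in plain Python of a character-level tokeniser and masked-language-model masking for protein sequences. It does not use the Transformers library (whose own tokeniser and data collator classes you would normally rely on), and the vocabulary layout, special token names, and the `encode` and `mask_tokens` helpers are assumptions made purely for illustration. The label value -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>"]  # assumed names
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}

def encode(sequence, max_length=None):
    """Map a protein sequence to token IDs, one token per residue."""
    ids = [VOCAB["<cls>"]]
    ids += [VOCAB.get(aa, VOCAB["<unk>"]) for aa in sequence.upper()]
    ids.append(VOCAB["<eos>"])
    if max_length is not None:
        ids = ids[:max_length]                               # truncate
        ids += [VOCAB["<pad>"]] * (max_length - len(ids))    # pad
    return ids

def mask_tokens(ids, mask_prob=0.15, seed=None):
    """Replace a random subset of residue tokens with <mask>.

    Returns (inputs, labels); labels hold the original ID at masked
    positions and -100 elsewhere so those positions are ignored by the loss.
    """
    rng = random.Random(seed)
    n_special = len(SPECIAL_TOKENS)
    inputs, labels = [], []
    for tok in ids:
        if tok >= n_special and rng.random() < mask_prob:
            inputs.append(VOCAB["<mask>"])
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(-100)
    return inputs, labels
```

For example, `encode("MKT", max_length=8)` gives a `<cls>` token, three residue tokens, an `<eos>` token, and three `<pad>` tokens.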
For this article, I’ll assume you already understand how language models work, and are now looking to implement one yourself, trained from scratch.
Can AlphaFold predict protein-protein interfaces?
Since its release, AlphaFold has been the buzz of the computational biology community. It seems that every group in the protein science field is trying to apply the model in their respective areas of research. Already we are seeing numerous papers attempting to adapt the model to specific niche domains across a broad range of life sciences. In this blog post I summarise a recent paper’s use of the technology for predicting protein-protein interfaces.
histo.fyi: A Useful New Database of Peptide:Major Histocompatibility Complex (pMHC) Structures
pMHCs are set to become a major target class in drug discovery; unusual peptide fragments presented by MHC can be used to distinguish infected/cancerous cells from healthy cells more precisely than over-expressed biomarkers. In this blog post, I will highlight a prototype resource: Dr. Chris Thorpe’s new database of pMHC structures, histo.fyi.
histo.fyi provides a one-stop shop for data on (currently) around 1400 pMHC complexes. Similar to our dedicated databases for antibody/nanobody structures (SAbDab) and T-cell receptor (TCR) structures (STCRDab), histo.fyi will scrape the PDB on a weekly basis for any new pMHC data and process these structures in a way that facilitates their analysis.
How to build a Python dictionary of residues for each molecule in PyMOL
Sometimes it can be handy to work with multiple structures in PyMOL using Python.
Here’s a snippet of code you might find useful: we iterate over all the α-carbon atoms in a protein and append tuples such as (‘GLY’, 1) to a list. The dictionary, ‘reslist’, returns a list of residue names and indices for each molecule, where the key is a string containing the name of the molecule.

    from pymol import cmd

    # Create a list of all the objects, called 'mols':
    mols = cmd.get_object_list('*')

    # Create an empty dictionary that will return a list of residues
    # given the name of the molecule object
    reslist = {}

    # Set the dictionary values to be empty lists
    for m in mols:
        reslist[m] = []

    # Use PyMOL's iterate command to go over every α-carbon and append
    # a tuple consisting of each residue's name ('resn') and
    # residue index ('resi'). Passing 'space' makes 'reslist' visible to
    # the expression regardless of how the script is run.
    for m in mols:
        cmd.iterate('%s and n. ca' % m,
                    'reslist["%s"].append((resn, int(resi)))' % m,
                    space={'reslist': reslist})

This script assumes you only have protein molecules loaded, and ignores things like chain ID and insertion codes.
Once you have your list of residues, you can use it with the cmd.align command, e.g., to align a particular residue to a reference structure.
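As a sketch of how such a dictionary might be used afterwards, the plain-Python helpers below find the residues two molecules have in common and turn them into a PyMOL selection string. The helper names `shared_residues` and `to_selection`, and the object names in the usage note, are hypothetical and only for illustration; like the script above, they ignore chain IDs and insertion codes.

```python
def shared_residues(reslist, mol_a, mol_b):
    """Return (resn, resi) tuples present in both molecules, sorted by residue index."""
    common = set(reslist[mol_a]) & set(reslist[mol_b])
    return sorted(common, key=lambda residue: residue[1])

def to_selection(mol, residues):
    """Build a PyMOL selection string for the α-carbons of the given residues.

    Assumes `residues` is non-empty; an empty list would give an invalid selection.
    """
    indices = "+".join(str(resi) for _, resi in residues)
    return "%s and resi %s and name CA" % (mol, indices)
```

With two loaded objects called, say, `ref` and `mob`, the strings returned by `to_selection("mob", common)` and `to_selection("ref", common)` could then be passed as the mobile and target selections of cmd.align.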
Do you have cis peptide bonds in your simulation inputs?
People who run molecular simulations quickly become familiar with all of the things about a PDB file – missing residues, missing heavy atoms in residues, missing hydrogens, non-standard amino acids, multiple conformations, crystallization ligands, etc. – that might need to be fixed before setting up a simulation. This blog post is a reminder to check, after you have “fixed” your PDB, if you have accidentally introduced aberrant cis peptide bonds into your structure during rebuilding.
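One way to run such a check, sketched here under stated assumptions rather than taken from the full post, is to compute the ω dihedral of each peptide bond from the CA and C atoms of one residue and the N and CA atoms of the next. Trans bonds have ω near 180° and cis bonds have ω near 0°. The function names and the 30° tolerance below are illustrative choices, and the residues are assumed to be simple dictionaries mapping atom names to (x, y, z) coordinates.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def omega(residue, next_residue):
    """Omega angle of the peptide bond between two consecutive residues."""
    return dihedral(residue["CA"], residue["C"],
                    next_residue["N"], next_residue["CA"])

def is_cis(omega_deg, tolerance=30.0):
    """True if omega is within `tolerance` degrees of 0 (an assumed threshold)."""
    return abs(omega_deg) < tolerance
```

Not every cis bond flagged this way is an error, since cis peptide bonds do occur naturally (most often before proline), so flagged bonds are best compared against the original structure.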