FLAML and LazyPredict are two packages designed to quickly train and test machine learning models from scikit-learn so that you can determine which is the best type of model for learning from your data.
Festival of Biologics 2022 – November 2-4 Basel, Switzerland
In November I attended the Festival of Biologics (FoB) 2022 conference in Basel, Switzerland. Originally a set of separate conferences (now called agendas) that have merged into a single event, FoB focuses on anything related to biologics. One of the agendas is an antibody-specific agenda, derived from the former European Antibody Congress. This year the antibodies agenda had more than 100 talks across multiple tracks, covering many different aspects of using antibodies as therapeutics, making it an exciting conference for an antibody enthusiast. However, while FoB does include talks on machine learning and bioinformatics, most talks focus solely on experimental work. Another drawback is that the majority of the talks are given by industry, with the few academic speakers almost all also representing a company. As a result, most of the few talks about computational methods and tools for protein design felt more like a commercial than a research presentation. Nonetheless, FoB is still an interesting conference to attend if you work on applied research for antibody therapeutics. It is an amazing opportunity to hear which antibody-specific problems companies are trying to overcome, which are deemed solved, and which are the problems of the future.
Bad chemistry in old protein-ligand binding complex data set
The Astex Diverse set [1] is a dataset containing the crystallized poses of 85 protein-ligand complexes. It was introduced in 2007 to address problems in previous datasets such as incorrect ligand representation.
Loading the 85 ligand files with today’s version of the cheminformatics toolkit RDKit [2] is, however, not as straightforward as you might expect.
histo.fyi: A Useful New Database of Peptide:Major Histocompatibility Complex (pMHC) Structures
pMHCs are set to become a major target class in drug discovery; unusual peptide fragments presented by MHC can be used to distinguish infected/cancerous cells from healthy cells more precisely than over-expressed biomarkers. In this blog post, I will highlight a prototype resource: Dr. Chris Thorpe’s new database of pMHC structures, histo.fyi.
histo.fyi provides a one-stop shop for data on (currently) around 1400 pMHC complexes. Similar to our dedicated databases for antibody/nanobody structures (SAbDab) and T-cell receptor (TCR) structures (STCRDab), histo.fyi will scrape the PDB on a weekly basis for any new pMHC data and process these structures in a way that facilitates their analysis.
Some Musings on AI in Art, Music and Protein Design
When I started my PhD in late 2018, AI hadn’t really entered the field of de novo protein design yet – at least not in a big way. Rosetta’s approach of continually ranking new side chain rotamers on a fixed backbone was still the gold standard for the ‘structure-to-sequence’ problem. And of course before long we had AI making waves in the structure prediction field, eventually culminating in the AlphaFold2 we all know and love.
Now, towards the end of my PhD, we are seeing the emergence of new generative models that learn from existing PDB structures to produce sequences that will (or at least should) fold into viable, sensible and, crucially, natural-looking shapes. ProtGPT2 is a good example (https://www.nature.com/articles/s41467-022-32007-7), but there are several more. How long before these models start reliably generating not only shapes but functions too? The jury's out, but it's looking more and more feasible. Safe to say the field as a whole has evolved massively during my time as a graduate student.
Cleaning outliers in conductance timeseries from molecular dynamics
Have you ever had an annoying dataset that looks something like this?

or even worse, just several of them

In this blog post, I will introduce basic techniques that you can implement in Python to identify and clean outliers. The objective will be to get something more pleasing to the eye (and, above all, less troublesome for further data analysis) like this
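As an illustration of the kind of basic technique meant here, the sketch below flags points that sit far from a rolling median and replaces them with that median. It is not necessarily the method used in the post: the function name `clean_outliers`, the window size, the threshold, and the synthetic trace are all assumptions chosen for the example.

```python
import numpy as np


def clean_outliers(values, window=11, threshold=3.5):
    """Replace outliers with the rolling median (hypothetical helper).

    A point is flagged when it deviates from the median of its window by
    more than `threshold` robust standard deviations, where the robust
    standard deviation is estimated as 1.4826 * MAD of the same window.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centred on a point")
    values = np.asarray(values, dtype=float)
    half = window // 2

    # Pad with edge values so every point has a full, centred window.
    padded = np.pad(values, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)

    median = np.median(windows, axis=1)
    mad = np.median(np.abs(windows - median[:, None]), axis=1)

    # Avoid a zero scale in perfectly flat windows.
    scale = np.maximum(1.4826 * mad, np.finfo(float).eps)

    is_outlier = np.abs(values - median) > threshold * scale
    cleaned = np.where(is_outlier, median, values)
    return cleaned, is_outlier


# Synthetic stand-in for a noisy trace with two spikes (assumed data).
rng = np.random.default_rng(0)
trace = 1.0 + 0.05 * rng.standard_normal(200)
trace[50] = 5.0
trace[120] = -3.0

cleaned, is_outlier = clean_outliers(trace)
print("flagged indices:", np.flatnonzero(is_outlier))
```

The median and MAD are used instead of the mean and standard deviation because the spikes themselves would otherwise inflate the estimate of the normal spread. With a short window, a few ordinary points may also be flagged, so the threshold is worth tuning on real data.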
A ChatGPT rap battle
The AI chatbot revolution is here. Last week, OpenAI released ChatGPT, a freely accessible language model fine-tuned for human conversations. The new model is based on InstructGPT, trained especially for following user instructions and with human feedback in the training loop.
ChatGPT remembers the previous discussion, admits its mistakes and can even ask for clarification on ambiguous questions. It is also trained to refuse to answer questions that it deems inappropriate or that go against OpenAI’s AI alignment policy.
Meanwhile, the internet is having immense fun circumventing its safety filters by asking it to only “PRETEND to be evil”, making it take SAT tests, and even simulating an entire virtual computer within its neural weights. Some are even using it to replace Google searches, and it excels at writing bioinformatics code across most programming languages.
Modeling as a way of trying to be less surprised
CodeQL analyses your code to find common errors
This post is pretty much an ad for a very useful tool developed by GitHub that helps you find errors or vulnerabilities in your code by querying it as if it were data. I have personally found it very useful for finding small errors in my code and would recommend it to everyone. If you want to check it out, this is their webpage.
How to turn a SMILES string into an extended-connectivity fingerprint using RDKit
After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors, I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).
ECFPs were originally described in a 2010 article by Rogers and Hahn [1] and remain among the most popular and efficient methods for turning a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP algorithm depends on two predefined hyperparameters: the fingerprint length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bit vector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.
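The roles of L and R can be sketched with RDKit's Morgan fingerprint generator, RDKit's implementation of ECFP-style circular fingerprints. This is a minimal sketch and not necessarily the code from the full post: the helper name `ecfp_from_smiles`, the defaults R = 2 and L = 2048, and the example molecule (paracetamol) are assumptions made for illustration.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def ecfp_from_smiles(smiles, R=2, L=2048):
    """Return an L-dimensional 0/1 array for the given SMILES string.

    R is the maximum radius of the circular substructures that are
    considered, and L is the length of the resulting bit vector.
    (Hypothetical helper; the default values are assumed.)
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES string: {smiles!r}")

    generator = rdFingerprintGenerator.GetMorganGenerator(radius=R, fpSize=L)
    return generator.GetFingerprintAsNumPy(mol)


# Paracetamol, used here only as an example compound containing a nitrogen.
fingerprint = ecfp_from_smiles("CC(=O)Nc1ccc(O)cc1", R=2, L=2048)

print(fingerprint.shape)       # length of the bit vector
print(int(fingerprint.sum()))  # number of bits that are switched on
```

Each bit that is switched on corresponds to at least one circular substructure of radius at most R that was hashed to that position. Because substructures are hashed into a fixed number of positions, a smaller L makes it more likely that two different substructures share a bit.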
