We’ve been investigating deep learning-based protein-ligand docking methods which often claim to be able to generate ligand binding modes within 2Å RMSD of the experimental one. We found, however, this simple criterion can conceal a multitude of chemical and structural sins…
DeepDock attempted to generate the ligand binding mode from PDB ID 1t9b (light blue carbons, left), but gave pretzeled rings instead (white carbons, right).
After being recommended by a friend, I really wanted to try plotly express but I never had the inclination to read more documentation when matplotlib gives me enough grief. While experimenting with ChatGPT I finally decided to functionalise my figure making scripts. With these scripts I manage to produce figures that made people question what I had actually been doing with my time – but I promise this will be worth your time.
I have been using with dimensionality reducition techniques recently and I came across this paper by Moon et al. PHATE is a technique that represents high dimensional (ie biological) data in a way that aims to preserve connections over preserving distance and I knew I wanted to try this as soon as I saw it. Why should you care? PHATE in 3D is faster that t-SNE in 2D. It would almost be rude to not try it out.
PHATE
In my opinion PHATE (or potential of heat diffusion for affinity-based transition embedding) does have a lot going on but that the choices at each stage feel quite sensisble. It might not come as a surprise this was primarily designed to make visual inspection of data easier on the eyes.
During the first 6 months of my DPhil, I worked on clustering antibodies and I thought I would share what I learned about these algorithms. Clustering is an unsupervised data analysis technique that groups a data set into subsets of similar data points. The main uses of clustering are in exploratory data analysis to find hidden patterns or data compression, e.g. when data points in a cluster can be treated as a group. Clustering algorithms have many applications in computational biology, such as clustering antibodies by structural similarity. Actually, this is objectively the most important application and I don’t see why anyone would use it for anything else.
There are several types of clustering algorithms that offer different advantages.
The Rectified Linear Unit (ReLU) activation function was first used in 1975, but its use exploded when it was used by Nair & Hinton in their 2010 paper on Restricted Boltzmann Machines. ReLU and its derivative are fast to compute, and it has dominated deep neural networks for years. The main problem with the activation function is the so-called dead ReLU problem, where significant negative input to a neuron can cause its gradient to always be zero. To rectify this (har har), modified versions have proposed, including leaky ReLU, GeLU and SiLU, wherein the gradient for x < 0 is not always zero.
A 2020 paper by Naizat et al., which builds upon ideas set out in a 2014 Google Brain blog post seeks to explain why ReLU and its variants seem to be better in general for classification problems than sigmoidal functions such as tanh and sigmoid.
Disclaimer – the title is a Quality Street pun only and bears no relation to the quality of the data or analysis presented below. This whole blog post is basically to discredit the personal chocolate preferences of a group member who shall remain nameless. Safe to say though, they Vostly overestimated people’s love for the Toffee Finger. Long live the Orange Creme.
pMHCs are set to become a major target class in drug discovery; unusual peptide fragments presented by MHC can be used to distinguish infected/cancerous cells from healthy cells more precisely than over-expressed biomarkers. In this blog post, I will highlight a prototype resource: Dr. Chris Thorpe’s new database of pMHC structures, histo.fyi.
histo.fyi provides a one-stop shop for data on (currently) around 1400 pMHC complexes. Similar to our dedicated databases for antibody/nanobody structures (SAbDab) and T-cell receptor (TCR) structures (STCRDab), histo.fyi will scrape the PDB on a weekly basis for any new pMHC data and process these structures in a way that facilitates their analysis.
Have you ever had an annoying dataset that looks something like this?
or even worse, just several of them
In this blog post, I will introduce basic techniques you can use and implement with Python to identify and clean outliers. The objective will be to get something more eye-pleasing (and mostly less troublesome for further data analysis) like this
Drawing molecules on your laptop usually requires access to proprietary software such as ChemDraw (link) or free websites such as PubChem’s online sketcher (link). However, if you are feeling adventurous, you can build your personal sketcher in React/Typescript using the Ketcher package!
Ketcher is an open-source package that allows easy implementation of a molecule sketcher into a web application. Unfortunately, it does require TypeScript so the script to run it cannot be imported directly into an HTML page. Therefore we will set up a simple React app to get it working.
The sketcher is very sleek and has a vast array of functionality, such as choosing any atom from the periodic table and being able to directly import molecules from either SMILES or Mol2/SDF file format into the sketcher. These molecules can then be edited and saved to a new file in the chemical file format of your choosing.
The default plots made by Python’s matplotlib.pyplot module are almost always insufficient for publication. With a ~20 extra lines of code, however, you can generate high-quality plots suitable for inclusion in your next article.
The fake data I generated for the plot look something like Root Mean Square Deviation (RMSD) versus time for a converged molecular dynamics simulation, so let’s pretend they are. There are a number of problems with this plot: it’s overall ugly, the color scheme is not very attractive and may not be color-blind friendly, the y-axis range of the data extends outside the range of the tick labels, etc.