Making Peace with Molecular Entropy

I first stumbled upon OPIG blogs through a post on ligand-binding thermodynamics, which refreshed some thermodynamics concepts from my undergraduate days and brought me face-to-face with the concept that makes most molecular physics students break out in a cold sweat: entropy. Entropy is that perplexing measure of disorder and randomness in a system. In the context of molecular dynamics (MD) simulations, it quantifies the conformational freedom and disorder within protein molecules, which becomes particularly relevant when calculating binding free energies.

In MD, MM/GBSA and MM/PBSA are fancy terms for trying to predict how strongly molecules stick together, and they are the go-to methods for binding free energy calculations. MM/PBSA uses the Poisson–Boltzmann (PB) equation to account for solvent polarisation and ionic effects accurately, but at a high computational cost. MM/GBSA, by contrast, approximates PB using the Generalised Born (GB) model, offering faster calculations suitable for large systems, though with reduced accuracy. Consider MM/PBSA the careful accountant who considers every detail but takes forever, while MM/GBSA is its faster, slightly less accurate coworker who gets the job done when you’re in a hurry.

Like many before me, I made the classic error of ignoring entropy, assuming that the entropy changes would be similar across the systems being compared, so the terms would cancel and could be neglected. This simplifies calculations and eases computational constraints (in other words, it was too complicated, and I had deadlines breathing down my neck). This worked fine… until it didn’t. The wake-up call came during a project studying metal–isocitrate complexes in IDH1. For context, IDH1 is a homodimer with a flexible ‘hinge’ region that becomes unstable without its corresponding subunit, giving rise to very high fluctuations. By ignoring entropy in this unstable system, I managed to generate binding free energy results that violated several laws of thermodynamics and would make Clausius roll in his grave.

My then-supervisor commented:

Funnily enough, a year after finishing my master’s, an old lab mate messaged me on LinkedIn (yes, LinkedIn) to chat about entropic terms:

So, I revisited the theory, application, and implementation of entropic contributions in free energy calculations, focusing on their use in MM/PBSA and MM/GBSA methods, and here I provide a high-level overview. More detailed reviews can be found in the works of Homeyer and Gohlke (2012), Steinbrecher and Labahn (2010), and Miller et al. (2008).

Fundamentals of Molecular Dynamics and Mechanics

We first need to understand MD simulation. This is a method that numerically integrates Newton’s equations of motion to model the microscopic behaviour of molecules. It updates atomic positions and velocities over discrete time steps using algorithms like the Verlet and velocity Verlet methods (these algorithms are explained further in my GitHub repo). By incorporating thermostats and barostats, MD simulations can operate under various ensembles (NVE, NVT, NPT), enabling the study of systems under different thermodynamic conditions.
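As a toy illustration, the velocity Verlet update can be sketched for a one-dimensional harmonic oscillator (a stand-in “force field” with F = −kx; the function names here are my own, not from any MD package):

```python
import numpy as np

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations of motion with the velocity Verlet scheme."""
    f = force(x)
    traj = [x]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        traj.append(x)
    return np.array(traj), v

# Toy system: harmonic oscillator with F = -k*x (k = m = 1, reduced units)
force = lambda x: -x
traj, v_final = velocity_verlet(x=1.0, v=0.0, force=force, mass=1.0,
                                dt=0.01, n_steps=1000)
```

The appeal of velocity Verlet is that it is symplectic: the total energy of this toy oscillator stays essentially constant over the whole trajectory, which is exactly the property a long MD run relies on.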

MD simulations are built on the principles of molecular mechanics (MM) – computational methods that model molecular systems using classical physics. This approach calculates molecular structures, energies, and properties by treating atoms as particles and bonds as springs. At its heart are force fields – mathematical functions and parameters that describe the system’s potential energy. The total potential energy (E) of a system is expressed as the sum of individual energy terms representing different interactions:

E = Ebonded + Enon-bonded

The Ebonded interactions include bond stretching (modelled using Hooke’s Law or Morse potential), angle bending (energy from deviations in equilibrium bond angles), and dihedral angles (energy from rotation around bonds). The Enon-bonded interactions consist of van der Waals forces (modelled using Lennard-Jones potential) and electrostatic interactions (calculated using Coulomb’s Law).
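For a single atom pair, the non-bonded terms might be sketched like this (the parameter values and the kcal/mol unit convention are illustrative, not taken from any specific force field):

```python
def nonbonded_energy(r, epsilon, sigma, q1, q2, ke=332.0637):
    """Pairwise non-bonded energy: Lennard-Jones plus Coulomb.

    r       : interatomic distance (Angstrom)
    epsilon : LJ well depth (kcal/mol); sigma: LJ diameter (Angstrom)
    q1, q2  : partial charges (units of e)
    ke      : Coulomb constant in kcal*Angstrom/(mol*e^2), a common MM convention
    """
    lj = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    coulomb = ke * q1 * q2 / r
    return lj + coulomb

# Two neutral argon-like atoms at the LJ minimum r = 2^(1/6) * sigma:
e_min = nonbonded_energy(2 ** (1 / 6) * 3.4, epsilon=0.238, sigma=3.4,
                         q1=0.0, q2=0.0)   # equals -epsilon by construction
```

Note the familiar LJ landmarks fall out directly: the energy crosses zero at r = σ and reaches its minimum of −ε at r = 2^(1/6)σ.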

The potential energy functions in force fields are parameterised using experimental data or high-level quantum mechanical calculations. Key parameters include force constants, equilibrium bond lengths, and partial charges – all critical for accurate simulations. However, MM does not account for electronic polarisation or bond breaking/forming. Additionally, modelling chemical reactions requires specialised force fields, and overall accuracy depends heavily on parameter quality.

While MM and MD simulations provide detailed insights into molecular structures and dynamics, they cannot accurately compute thermodynamic properties like free energies. Advanced computational methods such as Free Energy Perturbation (FEP), replica exchange FEP, and thermodynamic integration are needed to estimate these properties, particularly binding free energies, but these become computationally demanding for large systems and often need intermediate steps to converge properly.

End-state methods for binding free energy calculations

End-state free energy methods provide a more efficient alternative by eliminating the need to simulate these intermediate states. Using implicit solvent models further reduces computational demands by removing fluctuations from explicit solvent molecules. Consequently, end-state methods are widely used in computational studies. These methods are frequently used for two analyses: assessing the relative stability of multiple conformations within a system and determining the binding free energy of a non-covalently bound receptor-ligand complex, as shown in thermodynamic cycles in Figure 1. Stability calculations compare free energies of different conformations to determine their relative stabilities. For instance, when a biomolecule transitions from conformation A to B, the free energy change is calculated using equation 1:

ΔGtransition (A to B), solvated = ΔGsolvated, B – ΔGsolvated, A (1)

Therefore, to determine binding free energies, we calculate the difference between the free energy of the bound complex and the sum of the free energies of the unbound receptor and ligand using equation 2:

ΔGbinding (solvated) = ΔGsolvated, complex – (ΔGsolvated, receptor + ΔGsolvated, ligand) (2)

Think of it like figuring out how much you save on rent when two people share an apartment versus living separately, subtracting individual expenses from the combined cost. The changes in free energy for each term in equations 1 and 2 are estimated using equation 3:

ΔGtotal, solvated = Egas + ΔGsolvation – T × Ssolute (3)

In equation 3, ΔGsolvation represents a true free energy because the solvent’s effects have been averaged out by the implicit solvent model. Typically, the gas-phase energies (Egas) are derived from the MM energies provided by force fields, the solvation free energies (ΔGsolvation) are calculated using an implicit solvent model, and the entropy contribution (Ssolute) is estimated using the established approximations discussed below.
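Putting equations 2 and 3 together, the bookkeeping can be sketched in a few lines. The component values below are entirely hypothetical, chosen only to illustrate how the terms combine (and how large the −TΔS penalty can be):

```python
# Hypothetical ensemble-averaged components: energies in kcal/mol,
# entropies in kcal/(mol*K). These numbers are illustrative only.
species = {
    "complex":  {"E_gas": -5150.0, "dG_solv": -820.0, "S": 4.10},
    "receptor": {"E_gas": -4200.0, "dG_solv": -760.0, "S": 3.60},
    "ligand":   {"E_gas":  -850.0, "dG_solv": -110.0, "S": 0.55},
}
T = 298.15  # temperature in K

def g_total(c, T):
    # Equation 3: G = E_gas + dG_solvation - T*S
    return c["E_gas"] + c["dG_solv"] - T * c["S"]

# Equation 2: dG_bind = G(complex) - [G(receptor) + G(ligand)]
dG_bind = g_total(species["complex"], T) - (
    g_total(species["receptor"], T) + g_total(species["ligand"], T))
```

With these numbers the enthalpy-like terms favour binding by 50 kcal/mol, but the lost entropy claws back about 15 kcal/mol of that: exactly the kind of correction that vanishes if you drop the S term.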

Figure 1: Thermodynamic cycle for binding free energy calculation, showing ligand and receptor in solvated (top) and vacuum (bottom) states. Free energies of binding in solution (ΔGsolvated binding) and in vacuum (ΔGvacuum binding) are illustrated, facilitating the calculation of binding free energy in solution.

The free energy of solvation is further broken down into electrostatic and nonpolar contributions. Several implicit solvent models are available to compute solvation free energies, including the PB/GB models mentioned above and the Reference Interaction Site Model (RISM). Note that, as written, these equations describe a single structural snapshot, i.e. single-point energies of the solute: kind of like judging a movie based on a single scene (hopefully one without spoilers). In practice, end-state calculations estimate these energies by averaging over an ensemble of representative structures generated by MD simulations.

Accounting for Entropy (S)

In end-state calculations for relative stability and binding free energies, implicit solvent models often overlook certain contributions to the solute’s entropy. By treating biological systems as rigid rotors, translational and rotational entropies can be calculated using standard formulas from statistical mechanics. The vibrational entropy can be approximated using either normal mode analysis (NMA) or quasi-harmonic analysis (QHA).
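For the rigid-rotor part, the translational contribution is given by the Sackur–Tetrode equation from statistical mechanics. A minimal sketch (function name my own), with argon’s standard molar entropy as a sanity check:

```python
import math

# Physical constants (SI)
k_B = 1.380649e-23     # Boltzmann constant, J/K
h   = 6.62607015e-34   # Planck constant, J*s
N_A = 6.02214076e23    # Avogadro's number, 1/mol
amu = 1.66053907e-27   # atomic mass unit, kg

def translational_entropy(mass_amu, T, p=1.0e5):
    """Molar translational entropy via the Sackur-Tetrode equation,
    in J/(mol*K), for an ideal gas at temperature T (K) and pressure p (Pa)."""
    m = mass_amu * amu
    # translational partition function per molecule (volume factor k_B*T/p)
    q_trans = (2 * math.pi * m * k_B * T / h ** 2) ** 1.5 * (k_B * T / p)
    return N_A * k_B * (math.log(q_trans) + 2.5)

# Argon at 298.15 K: should reproduce its standard molar entropy (~154.8 J/mol/K)
S_ar = translational_entropy(39.948, 298.15)
```

That the formula recovers argon’s tabulated standard entropy almost exactly is a nice reminder that, for translation and rotation, the “standard formulas” really are quantitative; it is the vibrational part that needs NMA or QHA.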

NMA involves calculating vibrational frequencies at local minima on the potential energy surface. This method is computationally intensive for large systems because it requires minimising each frame, constructing the Hessian matrix, and diagonalising it to obtain the vibrational frequencies (eigenvalues). The computational cost scales approximately as (3N)^3 with the number of atoms N, reflecting the cubic cost of diagonalising a 3N×3N matrix.

QHA is computationally less demanding but requires many ensemble frames to accurately estimate the total entropy, which increases the computational load of the initial simulation. This method treats the eigenvalues of the mass-weighted covariance matrix of the ensemble as frequencies of global, orthogonal motions, allowing vibrational entropies to be calculated using standard formulas.
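The QHA recipe can be sketched as follows (function and variable names are my own): build the mass-weighted covariance matrix from the trajectory, convert its eigenvalues λ into quasi-harmonic frequencies ω = √(k_B·T/λ), and plug those into the quantum harmonic oscillator entropy formula:

```python
import numpy as np

k_B  = 1.380649e-23      # Boltzmann constant, J/K
hbar = 1.054571817e-34   # reduced Planck constant, J*s

def quasi_harmonic_entropy(coords, masses, T=300.0, n_skip=6):
    """Quasi-harmonic vibrational entropy (J/K) from an MD ensemble.

    coords : (n_frames, n_atoms, 3) atomic positions in metres
    masses : (n_atoms,) atomic masses in kg
    The n_skip smallest modes (overall translation/rotation) are discarded."""
    n_frames, n_atoms, _ = coords.shape
    w = np.repeat(np.sqrt(masses), 3)               # per-coordinate mass weights
    x = coords.reshape(n_frames, 3 * n_atoms) * w   # mass-weighted coordinates
    cov = np.cov(x, rowvar=False)                   # mass-weighted covariance
    lam = np.linalg.eigvalsh(cov)[n_skip:]          # drop trans/rot modes
    lam = lam[lam > 0]
    omega = np.sqrt(k_B * T / lam)                  # quasi-harmonic frequencies
    u = hbar * omega / (k_B * T)
    # quantum harmonic oscillator entropy, summed over modes
    return k_B * np.sum(u / np.expm1(u) - np.log1p(-np.exp(-u)))
```

This is a bare-bones sketch: a production implementation would also align frames to remove rigid-body motion before building the covariance matrix.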

A standard NMA Implementation

Since I have more hands-on experience with NMA, I will focus on its implementation details over QHA. A more detailed application can be found in the AMBER tutorials.

Implementing NMA in MD simulations involves several crucial steps to ensure accurate results. It begins with energy minimisation of the system to reach an equilibrium geometry. Common algorithms, including steepest descent, conjugate gradient, and Newton–Raphson methods, are used to do this, and you can see an implementation of these here. This step is followed by constructing the Hessian matrix, which contains the second derivatives of the potential energy. It can be calculated using analytical derivatives or finite differences, which involve slightly displacing atoms and computing the changes in forces (very expensive!). The next steps involve mass-weighting to form a dynamical matrix, which is then diagonalised to obtain eigenvalues and eigenvectors representing the vibrational frequencies and atomic displacement patterns. I tend to use AMBER, but several software packages, like GROMACS and NAMD/VMD, facilitate NMA calculations, though the choice depends on system size and computational resources.
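The pipeline above (minimise, build the Hessian, mass-weight, diagonalise) can be demonstrated on a toy 1D diatomic with a single harmonic bond; the helper names are illustrative, not AMBER’s:

```python
import numpy as np

def potential(x, k=500.0, r0=1.0):
    """Toy bond-stretch potential for a 1D diatomic (Hooke's law)."""
    return 0.5 * k * (x[1] - x[0] - r0) ** 2

def hessian_fd(f, x, h=1e-5):
    """Hessian of f by central finite differences (the expensive step)."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            xpp = x.copy(); xpp[i] += h; xpp[j] += h
            xpm = x.copy(); xpm[i] += h; xpm[j] -= h
            xmp = x.copy(); xmp[i] -= h; xmp[j] += h
            xmm = x.copy(); xmm[i] -= h; xmm[j] -= h
            H[i, j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * h * h)
    return H

masses = np.array([1.0, 2.0])
x_min = np.array([0.0, 1.0])            # already at the energy minimum
H = hessian_fd(potential, x_min)        # second derivatives of the potential
M = np.sqrt(np.outer(masses, masses))
eigvals = np.linalg.eigvalsh(H / M)     # mass-weighted (dynamical) matrix
freqs = np.sqrt(np.clip(eigvals, 0, None))
# One zero mode (overall translation) and one vibration at sqrt(k/mu),
# where mu = m1*m2/(m1+m2) is the reduced mass.
```

For this system the analytic answer is ω = √(k/μ) with μ = 2/3, and the finite-difference result matches it; the same machinery applied to a 3N-dimensional protein Hessian is what makes NMA so expensive.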

Limitations and Challenges of NMA

Beyond its considerable computational cost, NMA is limited by its assumption that atoms vibrate around the equilibrium position with small amplitudes. At higher temperatures or in flexible molecules, anharmonicity becomes significant, and the harmonic approximation may fail. Essentially, NMA works like predicting how a swing moves, but it makes a big assumption: it thinks the swing only moves in small, perfect back-and-forth motions, like a pendulum. Real molecules don’t always behave so neatly; just as a kid might swing wildly or twist the swing chains, molecules can make bigger, messier movements, especially when they get hot or are naturally flexible. When this happens, the simple back-and-forth prediction fails, meaning we might miss important details about how they work in real life.

NMA also tends not to account for solvent effects, because including every water molecule in the calculation is like trying to count every raindrop in a storm. You can either treat the water as a general influence (implicit solvent) or include actual water molecules (explicit solvent), but the latter makes everything much slower to calculate. Finally, there are mathematical hiccups. Sometimes the calculations give impossible answers, like negative frequencies, or different motions get tangled together in the maths. Over time, small numerical errors pile up, and eventually things blow up. Researchers try to fix these issues with fancier methods, such as QHA and anharmonic corrections, to account for these wild movements, but it’s a bit like trying to write rules for playground chaos: possible, but much more complicated than describing simple swinging.

Despite its limitations, NMA remains a valued tool for understanding molecular systems and tends to effectively bridge the gap between molecular structure and dynamic behaviour. However, make sure to carefully consider its constraints when applying it to your simulation.

Conclusion

To conclude, I hope to have clarified that while MM/PBSA and MM/GBSA are powerful tools, their accuracy fundamentally depends on how we handle entropy. Yes, these calculations are computationally expensive, complicated, and can technically be skipped, but they’re essential for meaningful results. So, my advice is simple: respect entropy from the beginning. Your future self (and your supervisor) will thank you.

Navigating Hallucinations in Large Language Models: A Simple Guide

AI is moving fast, and large language models (LLMs) are at the centre of it all, doing everything from generating coherent, human-like text to tackling complex coding challenges. And this is just scratching the surface—LLMs are popping up everywhere, and their list of talents keeps growing by the day.

However, these models aren’t infallible. One of their most intriguing and concerning quirks is the phenomenon known as “hallucination”: instances where the AI confidently produces information that is fabricated or factually incorrect. As we increasingly rely on AI-powered systems in our daily lives, understanding hallucinations is crucial. This post briefly explores LLM hallucinations: what they are, why they occur, and how we can navigate them to get the most out of our new favourite tools.

Continue reading

Protein Property Prediction Using Graph Neural Networks

Proteins are fundamental biological molecules whose structure and interactions underpin a wide array of biological functions. To better understand and predict protein properties, scientists leverage graph neural networks (GNNs), which are particularly well-suited for modeling the complex relationships between protein structure and sequence. This post will explore how GNNs provide a natural representation of proteins, the incorporation of protein language models (PLMs) like ESM, and the use of techniques like residual layers to improve training efficiency.

Why Graph Neural Networks are Ideal for Representing Proteins

Graph Neural Networks (GNNs) have emerged as a promising framework to fuse primary and secondary structure representation of proteins. GNNs are uniquely suited to represent proteins by modeling atoms or residues as nodes and their spatial connections as edges. Moreover, GNNs operate hierarchically, propagating information through the graph in multiple layers and learning representations of the protein at different levels of granularity. In the context of protein property prediction, this hierarchical learning can reveal important structural motifs, local interactions, and global patterns that contribute to biochemical properties.
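As a minimal illustration of this message-passing idea, here is a toy mean-aggregation graph-convolution layer in NumPy (not any particular GNN library’s API; the residue graph, features, and weights are invented for the example):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step: average each node's neighbourhood
    (including itself), then apply a learned linear map and a ReLU."""
    A_hat = A + np.eye(A.shape[0])        # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    H_agg = (A_hat @ H) / deg             # mean over neighbours
    return np.maximum(H_agg @ W, 0.0)     # ReLU non-linearity

# Toy protein graph: 4 residues connected in a chain
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 2))   # per-residue input features (e.g. from a PLM)
W = rng.normal(size=(2, 8))   # learned weight matrix (random here)
H1 = gcn_layer(H, A, W)       # (4, 8) updated residue embeddings
```

Stacking several such layers is what gives the hierarchy described above: after k layers, each residue’s embedding has seen its k-hop structural neighbourhood.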

Continue reading

Testing python (or any!) command line applications

Much of our work in OPIG comes in the form of code bases written in Python. These can be many different things: databases, machine learning models, and other software tools. Often, the user interface for these tools is developed as both a web app and a command-line application. Here, I will discuss one of my favourite tools for testing command-line applications: prysk!
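For a flavour of what this looks like: prysk inherits cram’s test format, where commands and their expected output live together in a `.t` file, roughly like this (the commands here are my own toy examples):

```text
Lines starting with two spaces and "$ " run a shell command;
the indented lines after it are the expected output.

  $ echo "hello"
  hello

  $ python -c "print(1 + 1)"
  2
```

Running `prysk` on the file executes each command and diffs the actual output against the expectations, which makes the tests read like a transcript of a terminal session.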

Continue reading

The XChem trove of protein–small-molecules structures not in the PDB

The XChem facility at Diamond Light Source is a truly impressive feat of automation in fragment-based drug discovery. Visitors come clutching a styrofoam ice box teeming with apo-form protein crystals, which the shifter soaks with compounds from one or more fragment libraries, and a robot at the i04-1 beamline kindly processes each of the thousands of crystal-laden pins while the visitor enjoys the excellent food in the Diamond canteen (R22). I would especially recommend the jambalaya. Following data collection, the magic of data processing happens: the PanDDA method is used to find partial occupancy in the density, which is processed semi-automatedly, and most open targets are uploaded to the Fragalysis web app, allowing the ligand binding to be studied and further compounds elaborated. This collection of targets bound to hundreds of small molecules is a true treasure trove of data, as many have yet to be deposited in the PDB, making it a perfect test set for algorithm design: fragments are notoriously fickle to model, and deep learning models cannot cheat by remembering them from the protein database.

Continue reading

Why the vegans will say “I told you so…”

I am writing this on Wednesday 2nd October 2024. The news has all eyes on the middle eastern skies. Yesterday a story was circulating on BBC news warning of a drop in uptake of the seasonal flu jab.

https://www.bbc.co.uk/news/articles/c62d8r0nnl6o

Four days ago, on Friday 27th September, several news outlets reported that a number of healthcare workers had shown flu-like symptoms following exposure to the first patient known to have contracted avian flu (H5N1) without any animal contact. PCR testing has been inconclusive, with none of these workers testing positive for the virus.

https://www.bbc.co.uk/news/articles/czd1v3vn6ero

Continue reading

Aider and Cheap, Free, and Local LLMs

Aider and the Future of Coding: Open-Source, Affordable, and Local LLMs

The landscape of AI coding is rapidly evolving, with tools like Cursor gaining popularity for multi-file editing and Copilot for AI-assisted autocomplete. However, these solutions are closed-source and require a subscription.

This blog post will explore Aider, an open-source AI coding tool that offers flexibility, cost-effectiveness, and impressive performance, especially when paired with affordable, free, and local LLMs like DeepSeek, Google Gemini, and Ollama.

Continue reading

MDAnalysis: Work with dynamics trajectories of proteins

For a long time, crystallographers, and subsequently the authors of AlphaFold2, had you believe that proteins are a static group of atoms written to a .pdb file. Turns out this was a HOAX. If you don’t want to miss out on the latest trend of working with dynamic structural ensembles of proteins, this blog post is exactly right for you. MDAnalysis is a Python package which, as the name suggests, was designed to analyse molecular dynamics simulations and lets you work with trajectories of protein structures easily.

Continue reading

Our future health: A new UK health research programme

Last week I walked into Boots and, after giving some physical measurements, including my blood pressure and cholesterol levels, I gave a blood sample to be part of the Our Future Health initiative. Our Future Health (https://ourfuturehealth.org.uk/) is set to become the UK’s largest health research programme ever. By recruiting five million volunteers across the country, it aims to revolutionise the way we detect, prevent and treat disease.

The breadth, depth and detail of Our Future Health makes it a world-leading resource. The data collected could hold the key to a wide range of health discoveries, such as:

  • Identifying early signals to detect disease much earlier.
  • Accurately predicting who is at higher risk of disease.
  • Developing better interventions and more effective treatments and technologies.

How’s it going so far?

Since the start of recruitment in July 2022 (delayed because of Covid), the programme has recruited over one million participants, where:

Continue reading

Walk through a cell

In 2022, Maritan et al. released the first ever macromolecular model of an entire cell. The cell in question is a bacterial cell from the genus Mycoplasma. If you’re a biologist, you likely know Mycoplasma as a common cell culture contaminant.

Now, through the work of app developer Timothy Davison, you can interactively explore this cell model from the comfort of your iPhone or Apple Vision Pro. Here are three reasons why I like CellWalk:

1. It’s pretty

The visuals of CellWalk are striking. The app offers a rich depiction of the cell, allowing the user to zoom from the whole cell down to individual atoms. I spent a while clicking through each protein to see if I could guess what it was or what it did. Zooming out, CellWalk offers a beautiful tripartite cross section of the cell, showing first the lipid membrane, then a colourful jumble-bag of all its cellular proteins, and finally the spaghetti-like polynucleic acids.

Tripartite cross section of a Mycoplasma cell. Screengrab taken from the CellWalk app on my phone.
Continue reading