The XChem facility at Diamond Light Source is a truly impressive feat of automation in fragment-based drug discovery. Visitors come clutching a styrofoam ice box teeming with apo-form protein crystals, which the shifter soaks with compounds from one or more fragment libraries, and a robot at the i04-1 beamline kindly processes each of the thousands of crystal-laden pins while the visitor enjoys the excellent food in the Diamond canteen (R22). I would especially recommend the jambalaya. Following data collection, the magic of data processing happens: the PanDDA method is used to find partial-occupancy ligands in the density, the data are processed semi-automatically, and most open targets are uploaded to the Fragalysis web app, allowing the ligand binding to be studied and further compounds elaborated. This collection of targets bound to hundreds of small molecules is a true treasure trove of data, as many have yet to be deposited in the PDB, making it a perfect test set for algorithm design: fragments are notoriously fickle to model, and deep learning models cannot cheat by remembering these from the PDB.
Tag Archives: Small Molecules
Out of the box RDKit-valid is an imperfect metric: a review of the KekulizeException and nitrogen protonation to correct this
In deep-learning-based compound generation models, the fraction of RDKit-valid compounds is a ubiquitous metric, but it is problematic from the cheminformatics viewpoint, as a large fraction of the invalid compounds may be driven by pyrrolic nitrogens (see below) rather than Texas carbons (carbons with five bonds, like the Star of Texas). In RDKit, no error is more irksome than the KekulizeException or ValenceException from RDKit sanitisation. These are raised when the molecule is not correct. This would make RDKit-validity a good metric, except for a small detail: the validity is as interpreted from the stated implicit and explicit hydrogens and formal charges on the atoms, which most models do not assign. Therefore, a compound may fail to be RDKit-valid because it is actually impossible, like a Texas carbon, but in many cases it fails because the formal charge or implicit hydrogen count of some atoms is incorrect. In both cases, the major culprit is nitrogen. Herein I go through what these errors are and how to fix them, with a focus on aromatic nitrogens.
Tanimoto similarity of ECFPs with RDKit: Common pitfalls
A common measure for the similarity of two molecules is the Tanimoto similarity of their ECFPs (Extended Connectivity FingerPrints). However, there is no clear standard in the literature for what kind of ECFP should be used when calculating the Tanimoto similarity, and that choice can lead to substantially different results. In this post I wish to shed light on some results you should know about before you jump into your calculations.
A blog post on how ECFPs are generated was written by Markus Dablander in 2022, so please take a look at that. In short, ECFPs have a hyperparameter called the radius r, and sometimes a fingerprint length L. Each entry in the fingerprint indicates the presence or absence of a particular substructure in the molecule of interest, and the radius r defines how large the substructures that you consider are. If you have r=3 then you consider substructures made by going up to three hops away from each atom in your molecule. This is best explained by the figure in Markus’ post.
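To make the role of the fingerprint length L concrete, here is a minimal sketch of the Tanimoto similarity computed on sets of "on" bits, before and after folding to a fixed length. The substructure identifiers below are made up for illustration, and folding by a simple modulo is an assumption of this sketch rather than a description of any particular library.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity |A & B| / |A | B| of two sets of on-bits."""
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def fold(identifiers: set, length: int) -> set:
    """Fold substructure identifiers into a fingerprint of fixed length."""
    return {identifier % length for identifier in identifiers}


# Made-up substructure identifiers for two hypothetical molecules.
mol_a = {3, 17, 1027, 2051}
mol_b = {3, 17, 4099, 5000}

# Without folding, every distinct substructure keeps its own entry.
unfolded = tanimoto(mol_a, mol_b)

# With L = 1024, several distinct identifiers collide on the same bit.
folded = tanimoto(fold(mol_a, 1024), fold(mol_b, 1024))

print(f"unfolded: {unfolded:.3f}, folded to 1024 bits: {folded:.3f}")
```

In this toy case bit collisions inflate the similarity, which is one reason the same pair of molecules can score differently depending on which kind of ECFP is used.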
I really hope my compounds get the green light
As a cheminformatician in a drug discovery campaign or an algorithm developer making the perfect Figure 1, when one generates a list of compounds for a given target there is a deep desire that the compounds are well received by the reviewer, be it a med chemist on the team or a peer reviewer. This is despite scientific rigour and training and is due to the time invested. So to avoid the slightest shadow of med chem grey zone, here is a hopefully handy filter against common medchem grey-zone groups.
Sort and Slice Tutorial – An alternative to extended connectivity fingerprints
Background
Sort and Slice (SNS) was developed by a former OPIGlet, Markus, as a method for improving Extended Connectivity Fingerprints (ECFPs) by overcoming bit collisions. ECFPs are a form of topological fingerprint that encodes the presence or absence of circular substructures in a molecule. The steps for deriving an ECFP from a molecule are as follows:
Identifier assignment:
Each atom in the molecule is assigned an initial numerical identifier; this is typically generated by hashing a tuple of atomic properties called Daylight atomic invariants into a 32-bit integer. These properties are:
- Number of non-hydrogen neighbours.
- Valence minus the number of neighbouring hydrogens.
- Atomic number.
- Atomic mass.
- Atomic charge.
- Number of hydrogen neighbours.
- Ring membership.*
*Ring membership is an additional property that is often used but is not one of the original Daylight atomic invariants.
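The identifier assignment step above can be sketched as follows. This is only an illustration: the field names are hypothetical, and `zlib.crc32` is a stand-in chosen because it yields a deterministic 32-bit integer; it is not the hash function used by RDKit or by the original ECFP implementation.

```python
import zlib
from typing import NamedTuple


class AtomInvariants(NamedTuple):
    """The atomic properties listed above (field names are hypothetical)."""
    heavy_neighbours: int   # number of non-hydrogen neighbours
    valence_minus_hs: int   # valence minus number of hydrogens
    atomic_number: int
    atomic_mass: int
    charge: int
    hydrogens: int          # number of hydrogen neighbours
    in_ring: bool           # the additional ring-membership property


def initial_identifier(invariants: AtomInvariants) -> int:
    """Hash the tuple of invariants into a 32-bit integer (CRC32 stand-in)."""
    payload = ",".join(str(int(value)) for value in invariants).encode("ascii")
    return zlib.crc32(payload) & 0xFFFFFFFF


# An sp2 CH carbon in a chain and the same kind of carbon in a ring:
# they differ only in ring membership, so they get different identifiers.
chain_ch = AtomInvariants(2, 3, 6, 12, 0, 1, False)
ring_ch = AtomInvariants(2, 3, 6, 12, 0, 1, True)

print(initial_identifier(chain_ch), initial_identifier(ring_ch))
```

Two atoms with identical invariants always receive the same initial identifier, which is what lets later steps recognise the same substructure across different molecules.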
Comparing pose and affinity prediction methods for follow-up designs from fragments
In any virtual screening task, many filters need to be applied to a dataset of ligands to downselect the ‘best’ ones on a number of parameters and produce a set of manageable size. One popular filter is whether a compound has a physical pose and good affinity as predicted by tools such as docking or energy minimisation. In my pipeline for downselecting elaborations of compounds proposed as fragment follow-ups, I calculate the pose and ΔΔG by energy minimizing the ligand with atom restraints to matching atoms in the fragment inspiration. I use either RDKit with its MMFF94 force field or PyRosetta with its ref2015 score function, all made possible by the lovely tool Fragmenstein.
With RDKit as the minimizer, the protein neighborhood around the ligand is fixed and placements take on average 21 s, whereas PyRosetta placements take on average 238 s (luckily, I can run placements in parallel). I would ideally like to use RDKit as the placement method since it is so fast and I would like to perform 500K placements within a few days, but I wanted to confirm that RDKit is ‘good enough’ compared to the slightly more rigorous PyRosetta (which, I think, allows residues to relax and samples more conformations in its longer runtime).
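To put the two average runtimes in perspective, the back-of-the-envelope arithmetic can be sketched as below. The 500K placements and the 21 s and 238 s averages come from the text; the three-day deadline (one reading of "a few days") and the assumption that placements scale perfectly across cores are mine.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def single_core_days(n_placements: int, seconds_each: float) -> float:
    """Wall-clock days needed if every placement ran on one core."""
    return n_placements * seconds_each / SECONDS_PER_DAY


def cores_needed(n_placements: int, seconds_each: float, deadline_days: float) -> int:
    """Cores required to meet the deadline, assuming ideal parallel scaling."""
    total_seconds = n_placements * seconds_each
    return math.ceil(total_seconds / (deadline_days * SECONDS_PER_DAY))


N_PLACEMENTS = 500_000
DEADLINE_DAYS = 3  # assumption: one reading of "a few days"

rdkit_cores = cores_needed(N_PLACEMENTS, 21, DEADLINE_DAYS)
pyrosetta_cores = cores_needed(N_PLACEMENTS, 238, DEADLINE_DAYS)

print(f"RDKit: {single_core_days(N_PLACEMENTS, 21):.1f} core-days, {rdkit_cores} cores")
print(f"PyRosetta: {single_core_days(N_PLACEMENTS, 238):.1f} core-days, {pyrosetta_cores} cores")
```

Under these assumptions the roughly elevenfold difference in runtime translates into tens of cores for RDKit versus hundreds for PyRosetta.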
Mapping derivative compounds to parent hits
Whereas it is easy to say in a paper “Given the HT-Sequential-ITC results, 42 led to 113, a substituted decahydro-2,6-methanocyclopropa[f]indene”, it is frequently rather trickier to figure out algorithmically which atoms map to which. In Fragmenstein, for the placement route, for example, a lot goes on behind the scenes, yet for some cases a human-provided mapping may be required. Here I discuss how to get the mapping from Fragmenstein and what goes on behind the scenes.
Tracking the change in ML performance for popular small molecule benchmarks
The power of machine learning (ML) techniques has captivated the field of small molecule drug discovery. Increasingly, researchers and organisations have employed ML to create more accurate algorithms to improve the efficiency of the discovery process.
To be published, methods have to prove they have improved upon others. Often, methods are tested against the same benchmarks within a field, allowing us to track progress over time. To explore the rate of improvement, I curated the performance on three popular benchmarks. The first benchmark is CASF 2016, used to test the accuracy of methods that predict the binding affinity of experimentally determined protein-ligand complexes. Accuracy was measured using the Pearson’s R value between predicted and experimental affinity values.
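For readers who want the accuracy measure spelled out, here is a minimal sketch of Pearson's R between predicted and experimental affinities. The affinity values are made up for illustration and are not taken from CASF 2016 or from any published method.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("Pearson's R is undefined for a constant sequence")
    return covariance / math.sqrt(variance_x * variance_y)


# Made-up affinities (e.g. pK units) for five hypothetical complexes.
experimental = [4.2, 5.1, 6.3, 7.0, 8.4]
predicted = [4.8, 5.0, 6.9, 6.5, 8.0]

r = pearson_r(predicted, experimental)
print(f"Pearson's R = {r:.3f}")
```

Note that Pearson's R only measures linear correlation: a method whose predictions are systematically offset or rescaled can still score highly.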
RSC Fragments 2024
I attended RSC Fragments 2024 (Hinxton, 4–5 March 2024), a conference dedicated to fragment-based drug discovery. The various talks were really good, because they gave overviews of projects involving teams across long stretches of time. As a result there were no slides discussing wet lab protocol optimisations and not a single Western blot was seen. The focus was primarily on either illustrating a discovery platform or recounting a declassified campaign. The latter were interesting, although I’d admit I wish there had been more talk of organic chemistry: there was not a single moan or gloat about a yield. This top-down focus was nice as topics kept overlapping, namely:
- target choice,
- covalents,
- molecular glues,
- whether to escape Flatland,
- thermodynamics, and
- cryptic pockets.
Taking Equivariance in deep learning for a spin?
I recently went to Sheh Zaidi’s brilliant introduction to Equivariance and Spherical Harmonics and I thought it would be useful to cement my understanding of it with a practical example. In this blog post I’m going to start with serotonin in two coordinate frames, and build a small equivariant neural network that featurises it.
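As a taste of what "the same molecule in two coordinate frames" means in practice, here is a minimal sketch that is not the network from the post. It uses three arbitrary points instead of real serotonin coordinates, rotates them about the z axis, and checks that a scalar feature is invariant while a vector feature is equivariant, i.e. f(Rx) = R f(x). The two feature functions are hypothetical examples chosen for simplicity.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z axis by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = point
    return (c * x - s * y, s * x + c * y, z)


def scalar_feature(points):
    """Invariant feature: summed distance from each point to all the others."""
    return [sum(math.dist(p, q) for q in points) for p in points]


def vector_feature(points):
    """Equivariant feature: distance-weighted sum of relative positions."""
    features = []
    for i, p in enumerate(points):
        total = [0.0, 0.0, 0.0]
        for j, q in enumerate(points):
            if i == j:
                continue
            weight = math.exp(-math.dist(p, q))  # depends on distance only
            for k in range(3):
                total[k] += weight * (q[k] - p[k])
        features.append(tuple(total))
    return features


# Arbitrary coordinates standing in for atoms (not real serotonin geometry).
frame_a = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.2), (0.7, 1.2, -0.5)]
ANGLE = 0.7
frame_b = [rotate_z(p, ANGLE) for p in frame_a]

# Featurise then rotate, versus rotate then featurise.
rotated_features = [rotate_z(v, ANGLE) for v in vector_feature(frame_a)]
features_of_rotated = vector_feature(frame_b)
```

Because the weights depend only on interatomic distances, rotating the input simply rotates the output vectors, while the scalar features do not change at all.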