In deep-learning-based compound generation models, the fraction of RDKit-valid compounds is a ubiquitous metric, but it is problematic from a cheminformatics viewpoint, as a large fraction of the invalid compounds may be driven by pyrrolic nitrogens (see below) rather than Texas carbons (carbons with five bonds, like the Star of Texas). In RDKit, no error is more irksome than the KekulizeException or ValenceException from RDKit sanitisation. These are raised when the molecule is not chemically correct as written. This would make RDKit validity a good metric, except for a small detail: validity is interpreted from the stated implicit and explicit hydrogens and formal charges on the atoms, which most models do not assign. Therefore, a compound may be RDKit-invalid because it is actually impossible, like a Texas carbon, but in many cases it is because the formal charge or implicit hydrogen count of some atoms is incorrect. In both cases, the major culprit is nitrogen. Herein I go through what these errors are and how to fix them, with a focus on aromatic nitrogens.
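As a minimal sketch of the two failure modes described above, the snippet below parses a SMILES without sanitisation and then reports which exception sanitisation raises. The helper name `diagnose` and the example SMILES are illustrative assumptions, not taken from the original post; in recent RDKit releases the valence error is exposed as `Chem.rdchem.AtomValenceException`.

```python
from rdkit import Chem, RDLogger

# Silence RDKit's own error logging so only our labels are shown.
RDLogger.DisableLog("rdApp.*")


def diagnose(smiles: str) -> str:
    """Hypothetical helper: label the sanitisation outcome of a SMILES string."""
    # Parse without sanitising so that problematic molecules are still returned.
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return "unparsable"
    try:
        Chem.SanitizeMol(mol)
    except Chem.rdchem.AtomValenceException:
        return "valence"    # e.g. a Texas carbon
    except Chem.rdchem.KekulizeException:
        return "kekulize"   # e.g. an aromatic nitrogen missing its hydrogen
    return "ok"


examples = {
    "pyrrole with stated [nH]": "c1cc[nH]c1",
    "pyrrole without the hydrogen": "c1ccnc1",
    "Texas carbon": "CC(C)(C)(C)C",
}
for label, smiles in examples.items():
    print(f"{label:30s} {smiles:15s} -> {diagnose(smiles)}")
```

The second example is not an impossible molecule: stating the hydrogen on the nitrogen is enough to make it sanitise, whereas the five-bonded carbon cannot be rescued this way.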
Category Archives: Small Molecules
Tanimoto similarity of ECFPs with RDKit: Common pitfalls
A common measure for the similarity of two molecules is the Tanimoto similarity of their ECFPs (Extended Connectivity FingerPrint). However, there is no clear standard in literature for what kind of ECFPs should be used when calculating the Tanimoto similarity, and that choice can lead to substantially different results. In this post I wish to shed light on some results you should know about before you jump into your calculations.
A blog post on how ECFPs are generated was written by Markus Dablander in 2022, so please take a look at that. In short, ECFPs have a hyperparameter called the radius r, and sometimes a fingerprint length L. Each entry in the fingerprint indicates the presence or absence of a particular substructure in the molecule of interest, and the radius r defines how large the substructures that you consider are. If you have r=3, then you consider substructures made by going up to three hops away from each atom in your molecule. A figure in Markus’ post illustrates this best.
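To make the effect of r and L concrete, here is a small sketch that computes the Tanimoto similarity of two molecules over a few radius and length settings using RDKit's Morgan fingerprint generator (radius 2 corresponds to what is commonly called ECFP4). The function name `ecfp_tanimoto` and the two example molecules are assumptions chosen for illustration.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator


def ecfp_tanimoto(smiles_a: str, smiles_b: str, radius: int = 2, fp_size: int = 2048) -> float:
    """Tanimoto similarity of two folded binary Morgan (ECFP-like) fingerprints."""
    mol_a = Chem.MolFromSmiles(smiles_a)
    mol_b = Chem.MolFromSmiles(smiles_b)
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=fp_size)
    fp_a = generator.GetFingerprint(mol_a)  # bit vector of length fp_size
    fp_b = generator.GetFingerprint(mol_b)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
salicylic_acid = "O=C(O)c1ccccc1O"

# The same pair of molecules, scored under different fingerprint settings.
for radius in (1, 2, 3):
    for fp_size in (1024, 2048):
        score = ecfp_tanimoto(aspirin, salicylic_acid, radius, fp_size)
        print(f"r={radius}, L={fp_size}: Tanimoto = {score:.3f}")
```

Reporting the radius and length alongside any similarity value makes results comparable between studies.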
My CCDC Science Day Experience
In June, I had the opportunity to visit the Cambridge Crystallographic Data Centre (CCDC) for Science Day to give a lightning talk on my rotation project with OPIG. The day was packed with presentations from researchers and PhD students collaborating with the CCDC, offering a great opportunity to hear about some of the fascinating work happening there in the fields of Structural and Computational Chemistry.
We kicked off with a dinner at the University Arms in Cambridge. This was a great opportunity to meet people who were attending Science Day in a relaxed environment, complemented by the lovely food and drink.
The next day was all about the talks. The first part of the day was filled with longer talks by more senior PhD students and CCDC researchers, followed by lightning talks from first-year PhD or master’s students. These shorter presentations provided a fast-paced overview of each project.
Sort and Slice Tutorial – An alternative to extended connectivity fingerprints
Background
Sort and Slice (SNS) was developed by a former OPIGlet, Markus, as a method for improving Extended Connectivity Fingerprints (ECFPs) by overcoming bit collisions. ECFPs are a form of topological fingerprint which denote the absence and presence of circular substructures in a molecule. The steps for deriving an ECFP from a molecule are as follows:
Identifier assignment:
Each atom in the molecule is assigned an initial numerical identifier; this is typically generated by hashing a tuple of atomic properties called Daylight atomic invariants into a 32-bit integer. These properties are:
- Number of non-hydrogen neighbours.
- Valence minus the number of neighbouring hydrogens.
- Atomic number.
- Atomic mass.
- Atomic charge.
- Number of hydrogen neighbours.
- Ring membership.*
*Ring membership is an additional property that is often used but is not one of the original Daylight atomic invariants.
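The identifier assignment step can be sketched as follows. This is an illustration under stated assumptions, not RDKit's internal implementation: the invariants are read from RDKit atom accessors, the helper name `initial_identifiers` is hypothetical, and `zlib.crc32` stands in for whatever hash function a real ECFP implementation uses to obtain a 32-bit integer.

```python
import zlib

from rdkit import Chem


def initial_identifiers(smiles: str, use_ring_membership: bool = True) -> list:
    """Hypothetical helper: one 32-bit initial identifier per heavy atom."""
    mol = Chem.MolFromSmiles(smiles)
    identifiers = []
    for atom in mol.GetAtoms():
        n_hydrogens = atom.GetTotalNumHs()
        invariants = [
            atom.GetDegree(),                       # non-hydrogen neighbours (Hs are implicit here)
            atom.GetTotalValence() - n_hydrogens,   # valence minus hydrogens
            atom.GetAtomicNum(),                    # atomic number
            atom.GetMass(),                         # atomic mass
            atom.GetFormalCharge(),                 # atomic charge
            n_hydrogens,                            # number of hydrogen neighbours
        ]
        if use_ring_membership:
            invariants.append(int(atom.IsInRing()))  # optional extra property
        # Assumption: CRC32 of the tuple's text form as a stand-in 32-bit hash.
        identifiers.append(zlib.crc32(repr(tuple(invariants)).encode()))
    return identifiers


print(initial_identifiers("CCO"))        # three different atom environments
print(initial_identifiers("c1ccccc1"))   # six equivalent aromatic carbons
```

Atoms with identical invariants receive identical identifiers, which is why all six benzene carbons start out indistinguishable.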
Interactive visualization of protein–ligand complexes with Py3Dmol
I recently had a problem where I wanted to provide an interactive visualization of multiple different protein–ligand complexes, requiring minimal setup by the user, allowing them to zoom in and out and change the visualization style, without just providing multiple PDB files or a PyMOL session.
Comparing pose and affinity prediction methods for follow-up designs from fragments
In any virtual screening task, many filters need to be applied to a dataset of ligands to downselect the ‘best’ ones on a number of parameters and produce a set of manageable size. One popular filter is whether a compound has a physically plausible pose and good affinity, as predicted by tools such as docking or energy minimisation. In my pipeline for downselecting elaborations of compounds proposed as fragment follow-ups, I calculate the pose and ΔΔG by energy minimizing the ligand with atom restraints to matching atoms in the fragment inspiration. I use either RDKit with its MMFF94 forcefield or PyRosetta with its ref2015 scorefunction, all made possible by the lovely tool Fragmenstein.
With RDKit as the minimizer, the protein neighborhood around the ligand is fixed and placements take 21 s on average, whereas PyRosetta placements take 238 s on average (luckily, I can run placements in parallel). Ideally I would use RDKit as the placement method, since it is so fast and I would like to perform 500K placements within a few days, but I wanted to confirm that RDKit is ‘good enough’ compared to the slightly more rigorous PyRosetta (which, I think, allows residues to relax and samples more conformations during its longer runtime).
Fine-tune generated molecular poses with a force field
Some molecular pose generation methods benefit from an energy relaxation post-processing step.
Here is a quick way to do this using OpenMM via a short script I prepared:
RSC Fragments 2024
I attended RSC Fragments 2024 (Hinxton, 4–5 March 2024), a conference dedicated to fragment-based drug discovery. The various talks were really good, because they gave overviews of projects involving teams across long stretches of time. As a result, there were no slides discussing wet lab protocol optimisations and not a single Western blot was seen. The focus was primarily either illustrating a discovery platform or recounting a declassified campaign. The latter were interesting, although I’ll admit I wish there had been more talk of organic chemistry: there was not a single moan/gloat about a yield. This top-down focus was nice as topics kept overlapping, namely:
- Target choice,
- covalents,
- molecular glues,
- whether to escape Flatland,
- thermodynamics, and
- cryptic pockets
Taking Equivariance in deep learning for a spin?
I recently went to Sheh Zaidi’s brilliant introduction to Equivariance and Spherical Harmonics and I thought it would be useful to cement my understanding of it with a practical example. In this blog post I’m going to start with serotonin in two coordinate frames, and build a small equivariant neural network that featurises it.
Finding and testing a reaction SMARTS pattern for any reaction
Have you ever needed to find a reaction SMARTS pattern for a certain reaction but don’t have it already written out? Do you have a reaction SMARTS pattern but need to test it on a set of reactants and products to make sure it transforms them correctly and doesn’t allow for odd reactants to work? I recently did and I spent some time developing functions that can:
- Generate a reaction SMARTS for a reaction given two reactants, a product, and a reaction name.
- Check the reaction SMARTS on a list of reactants and products that have the same reaction name.
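The checking step can be sketched with RDKit's reaction machinery as below. This is not the author's code: the function name `reaction_gives_product`, the amide-coupling SMARTS, and the example reactants are all assumptions made for illustration.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Illustrative amide coupling: carboxylic acid + primary/secondary amine (not an amide N).
AMIDE_COUPLING = "[C:1](=[O:2])[OX2H1].[NX3;H2,H1;!$(NC=O):3]>>[C:1](=[O:2])[N:3]"


def reaction_gives_product(reaction_smarts, reactant_smiles, expected_product_smiles):
    """Hypothetical check: does the reaction SMARTS turn the reactants into the expected product?"""
    reaction = AllChem.ReactionFromSmarts(reaction_smarts)
    reactants = tuple(Chem.MolFromSmiles(smiles) for smiles in reactant_smiles)
    # Canonicalise the expected product so string comparison is meaningful.
    expected = Chem.MolToSmiles(Chem.MolFromSmiles(expected_product_smiles))
    for product_set in reaction.RunReactants(reactants):
        for product in product_set:
            try:
                Chem.SanitizeMol(product)  # reaction products are returned unsanitised
            except ValueError:
                continue  # skip chemically invalid outcomes
            if Chem.MolToSmiles(product) == expected:
                return True
    return False


# Acetic acid + methylamine should give N-methylacetamide.
print(reaction_gives_product(AMIDE_COUPLING, ("CC(=O)O", "CN"), "CC(=O)NC"))
# Acetamide is excluded as the amine partner, so no product is expected.
print(reaction_gives_product(AMIDE_COUPLING, ("CC(=O)O", "CC(N)=O"), "CC(=O)NC(C)=O"))
```

Running the same check over reactants that should not react, as in the second call, is what catches patterns that are too permissive.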