In deep-learning-based compound generation models, the fraction of RDKit-valid compounds is a ubiquitous metric, but it is problematic from the cheminformatics viewpoint because a large fraction of the failures may be driven by pyrrolic nitrogens (see below) rather than Texas carbons (a carbon with 5 bonds, like the Star of Texas). In RDKit, no error is more irksome than the KekulizeException or AtomValenceException from RDKit sanitisation. These are raised when the molecule is not chemically consistent. This would make RDKit validity a good metric, except for a small detail: validity is interpreted from the stated implicit and explicit hydrogens and formal charges on the atoms, which most models do not assign. Therefore, a compound may fail to be RDKit-valid because it is actually impossible, like a Texas carbon, but in many cases it fails because the formal charge or implicit hydrogen count of some atoms is incorrect. In both cases, the major culprit is nitrogen. Herein I go through what these errors are and how to fix them, with a focus on aromatic nitrogens.
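The difference between the two kinds of failure can be shown with a minimal sketch, assuming a recent RDKit release in which sanitisation errors derive from MolSanitizeException. The helper name sanitisation_error and the three example SMILES are illustrative choices of mine, not taken from the post.

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence RDKit's own error log for this demo


def sanitisation_error(smiles):
    """Return the name of the sanitisation exception, or None if the SMILES sanitises."""
    # Parse without sanitising so that the exception can be caught explicitly.
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return "unparsable SMILES"
    try:
        Chem.SanitizeMol(mol)
    except Chem.rdchem.MolSanitizeException as error:
        return type(error).__name__
    return None


# Hypothetical examples chosen for illustration.
EXAMPLES = {
    "pyrrole, hydrogen stated": "c1ccc[nH]1",
    "pyrrole, hydrogen missing": "c1cccn1",
    "Texas carbon": "CC(C)(C)(C)C",
}

if __name__ == "__main__":
    for label, smiles in EXAMPLES.items():
        print(f"{label}: {smiles} -> {sanitisation_error(smiles)}")
```

The pyrrole without its stated hydrogen is a perfectly sensible molecule that fails kekulisation only because of a missing annotation, whereas the Texas carbon fails the valence check because it cannot exist.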
Monthly Archives: September 2024
The Patterns that Escape Us
Part The First: An Outrageous Claim
Reproduced below is the introductory passage from a psycholinguistics paper, published in the mid-nineties. Riveted, as I’m sure you are, having just read that banger opening line to my blog post, humour me and read on; I promise it gets interesting.
Drug Discovery Tools, but they’re Olympic sports…
The Olympic Games may have come and gone, but like me, I’m sure you’re all wondering which Olympic sport your favourite drug discovery tool would compete in. Fear not, I have taken it upon myself to answer this pressing question. In this blogpost, we’ll match some of the most popular tools in our field with their Olympic counterparts. Before we begin, let me clarify that I’m using the term ‘tool’ rather loosely here; I’ve included a variety of resources. I don’t claim these to be the most popular, just the ones I thought were most sport-like.
RDKit: Athletics. I’m biased, but we must start with the big one. Like track and field events at the heart of the Olympics, RDKit is at the centre of many other tools in our field. It’s versatile, essential, and it’s hard to imagine our work without it. RDKit does it all.
Do not forget to add your data folder to .gitignore
It is good practice not to commit a data folder to version control if the data is available elsewhere and you do not want to track changes to the data. But do not forget to also add an entry for this folder to .gitignore, because otherwise git iterates over all the files in the folder when checking for file changes, which may take a long time if there are many files.
Tanimoto similarity of ECFPs with RDKit: Common pitfalls
A common measure for the similarity of two molecules is the Tanimoto similarity of their ECFPs (Extended Connectivity FingerPrint). However, there is no clear standard in literature for what kind of ECFPs should be used when calculating the Tanimoto similarity, and that choice can lead to substantially different results. In this post I wish to shed light on some results you should know about before you jump into your calculations.
A blog post on how ECFPs are generated was written by Marcus Dablander in 2022, so please take a look at that. In short, ECFPs have a hyperparameter called the radius r, and sometimes a fingerprint length L. Each entry in the fingerprint indicates the presence or absence of a particular substructure in the molecule of interest, and the radius r defines how large the substructures that you consider are. If you have r=3 then you consider substructures made by going up to three hops away from each atom in your molecule. This is best explained by a figure in Marcus’ post.
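To see why the fingerprint length L can matter for the similarity, here is a minimal sketch in pure Python. It treats a fingerprint as the set of its on-bit indices and folds it to length L with a simple modulo, which is a simplifying assumption about how folding works; the bit indices are made up for illustration and do not come from real molecules.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


def fold(bits, length):
    """Fold a sparse fingerprint to a fixed length by taking indices modulo length."""
    return {bit % length for bit in bits}


# Made-up substructure identifiers for two hypothetical molecules.
MOL_A_BITS = {3, 17, 1030, 2051}
MOL_B_BITS = {3, 1041, 2051, 4100}

UNFOLDED = tanimoto(MOL_A_BITS, MOL_B_BITS)
FOLDED = tanimoto(fold(MOL_A_BITS, 1024), fold(MOL_B_BITS, 1024))

if __name__ == "__main__":
    print(f"unfolded: {UNFOLDED:.3f}")
    print(f"folded to L=1024: {FOLDED:.3f}")
```

In this toy case folding makes distinct substructures collide on the same bit, so the folded similarity comes out higher than the unfolded one.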
I really hope my compounds get the green light
As a cheminformatician in a drug discovery campaign or an algorithm developer making the perfect Figure 1, when one generates a list of compounds for a given target there is a deep desire that the compounds are well received by the reviewer, be it a med chemist on the team or a peer reviewer. This is despite scientific rigour and training and is due to the time invested. So to avoid the slightest shadow of med chem grey zone, here is a hopefully handy filter against common medchem grey-zone groups.
Roche Continents 2024
This July I had the opportunity to be part of the Roche Continents programme [1]. The programme was organised by Roche and LUMA Arles and took place in the beautiful city of Arles in the south of France. Together with 40 students from various disciplines and European universities we discussed and explored the connection between arts, science, and sustainability. The theme of the week was resourcefulness.
For students considering applying to Roche Continents next year, I’d like to offer some insights on what to expect, as well as share a few of my personal highlights from the experience.
The wider applications of nanobodies
This week, it was my turn to give the short talk at our group meeting. I chose to present a recently published paper on thermostability prediction for nanobodies. The motivation for this work, at least in part, is the need for thermostability in the diverse applications of nanobodies. At OPIG, our research primarily revolves around the therapeutic uses of nanobodies, but their potential extends beyond this. I thought it would be interesting to highlight some of these broader applications here:
Making your code pip installable
aka when to use a CustomBuildCommand or a CustomInstallCommand when building Python packages with setup.py
Bioinformatics software is complicated, and often a little bit messy. Recently I found myself wading through a Python package building quagmire and thought I could share something I learnt about when to use a custom build command and when to use a custom install command. I have also provided some information about how to copy executables to your package installation bin. Note: ChatGPT wrote the initial skeleton draft of this post, and I have corrected and edited it.
Next time you need to create a pip installable package yourself, hopefully this can save you some time!
Five-word stories about a world where AI dominates the world
“For sale: baby shoes, never worn.” ~ Ernest Hemingway??
This is a six-word story famously misattributed to Ernest Hemingway. According to Wikipedia, this story first appeared in 1906, when Hemingway was 7 years old, and was only attributed to him in 1991, 30 years after his death. So, no chance it was his.
Regardless of its origin, I found this type of story very creative.
In this blog post, as the title says, I will dare to push the boundary to present 5-word stories on the topic of AI taking over the world, BUT with a humorous spin.