In deep-learning-based compound generation models, the fraction of RDKit-valid compounds is a ubiquitous metric, but it is problematic from a cheminformatics viewpoint, as a large fraction of the failures may be driven by pyrrolic nitrogens (see below) rather than Texas carbons (a carbon with five bonds, like the Star of Texas). In RDKit, no error is more irksome than the KekulizeException or ValenceException from RDKit sanitisation. These are raised when the molecule is not correct. This would make RDKit validity a good metric, except for a small detail: validity is interpreted from the stated implicit and explicit hydrogens and formal charges on the atoms, which most models do not assign. Therefore, a compound may fail to be RDKit-valid because it is actually impossible, like a Texas carbon, but in many cases it fails because the formal charge or implicit hydrogen count of some atoms is incorrect. In both cases, the major culprit is nitrogen. Herein I go through what these errors are and how to fix them, with a focus on aromatic nitrogens.
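To make the distinction concrete, here is a minimal sketch that parses a SMILES without sanitisation and then reports which exception, if any, sanitisation raises. The helper name `sanitisation_error` and the three example SMILES are my own illustration, and it assumes a reasonably recent RDKit in which `rdchem.MolSanitizeException` is the common base class of the sanitisation errors.

```python
from rdkit import Chem
from rdkit.Chem import rdchem


def sanitisation_error(smiles):
    """Return the name of the exception raised by sanitisation, or None if it passes."""
    # Parse without sanitising so that we can trigger and catch the error ourselves.
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return "unparsable SMILES"
    try:
        Chem.SanitizeMol(mol)
    except rdchem.MolSanitizeException as error:
        # Assumption: kekulisation and valence errors both derive from this base class.
        return type(error).__name__
    return None


if __name__ == "__main__":
    examples = {
        "pyrrole without its hydrogen": "c1cccn1",
        "pyrrole with explicit [nH]": "c1ccc[nH]1",
        "Texas carbon": "CC(C)(C)(C)C",
    }
    for label, smiles in examples.items():
        print(f"{label}: {sanitisation_error(smiles)}")
```

The first example is a perfectly reasonable ring that only lacks a stated hydrogen on the nitrogen, so it should fail at kekulisation, whereas the Texas carbon is genuinely impossible and should fail the valence check.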
Category Archives: Technical
Making your code pip installable
aka when to use a CustomBuildCommand or a CustomInstallCommand when building Python packages with setup.py
Bioinformatics software is complicated, and often a little bit messy. Recently I found myself wading through a Python package-building quagmire and thought I could share something I learnt about when to use a custom build command and when to use a custom install command. I have also provided some information about how to copy executables to your package installation bin. ChatGPT wrote the initial skeleton draft of this post, which I have corrected and edited.
Next time you need to create a pip installable package yourself, hopefully this can save you some time!
Converting or renaming files, whilst still maintaining the directory structure
For various reasons we might need to convert files from one format to another, for instance from lossless FLAC to MP3:
ffmpeg -i lossless-audio.flac -acodec libmp3lame -ab 128k compressed-audio.mp3
This could be any conversion, but it implies that the input file and the output file are in the same directory. What if we have a carefully curated directory structure and we want to convert (or rename) every file within that structure?
find . -name "*.whateveryouneed" -exec somecommand {} \; is the tool for you.
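If you would rather stay in Python, the same idea can be sketched with pathlib: walk the tree, mirror each file's relative path under a new root, and run the conversion on each pair. The function names `plan_conversions` and `convert_tree` are my own, and the ffmpeg arguments are simply those from the example above.

```python
import subprocess
from pathlib import Path


def plan_conversions(src_root, dst_root, in_ext=".flac", out_ext=".mp3"):
    """Pair every matching input file with an output path that mirrors its sub-directory."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    pairs = []
    for src in sorted(src_root.rglob(f"*{in_ext}")):
        if not src.is_file():
            continue
        # Keep the path relative to the source root so the structure is preserved.
        relative = src.relative_to(src_root)
        pairs.append((src, (dst_root / relative).with_suffix(out_ext)))
    return pairs


def convert_tree(src_root, dst_root):
    """Run the ffmpeg command from the text on every planned (input, output) pair."""
    for src, dst in plan_conversions(src_root, dst_root):
        dst.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["ffmpeg", "-i", str(src), "-acodec", "libmp3lame", "-ab", "128k", str(dst)],
            check=True,
        )
```

Calling `convert_tree("music", "music-mp3")` would then write each converted file to the matching sub-directory of `music-mp3`, assuming ffmpeg is on your PATH.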
Mounting a remote file system with SSHFS
If you’re working with data stored on a remote server, you might not want to (or even have the space to) copy data to your local file system when you work on it. Instead, we can use SSHFS to mount a remote file system via SSH, allowing us to read and write data on the remote file system without manually copying files.
An Open-Source CUDA for AMD GPUs – ZLUDA
Lots of work has been put into making AMD-designed GPUs work nicely with GPU-accelerated frameworks like PyTorch. Despite this, getting performant code on non-NVIDIA graphics cards can be challenging for both users and developers. Even where the developer has appropriately optimised for each platform, there are often gaps in performance because, at the driver level, instructions to the GPU may not be fully optimised. This is because software developed using CUDA can benefit from optimisations such as operation fusing, in many cases without the developer having to specify them.
This may not be much of a concern for most researchers, as we simply use what is available to us. Most of the time this is NVIDIA GPUs, and there is hardly a choice in the matter. NVIDIA is aware of this and prices its products accordingly. Part of the problem is that system designers just don’t have an incentive to build AMD platforms other than for highly specialised machines.
Out of Band Management
We’ve all had things go wrong with computers. However, when they go catastrophically wrong, there’s often little you can do other than be physically on site to reinstall. This doesn’t have to be the case though. Most PCs have a tiny secondary processor which can allow full remote control of a computer that’s crashed, unresponsive or even switched off.
Writing a BLOPIG Post With ChatGPT: A Personal Take on Using AI for Assisted Writing
Disclaimer: I used ChatGPT to improve the writing style of this article, in combination with some personal curation before obtaining a final version.
You’ve probably heard it all already, from ChatGPT writing code and doing proofreading for you to a rap battle between OPIG’s Antibodies and Small Molecules groups, and more.
Whether you like it or not, ChatGPT has unleashed people’s creative side regarding applications and attempts to find shortcuts. Questionable? Absolutely!
In this BLOPIG post, I show how I used ChatGPT to easily write a post summarising some material of my own intellectual property, which I presented as part of my group meeting talk. Mainly, I list some personal thoughts on the ethical concerns around using ChatGPT to assist your writing.
To start off, I passed on content from my own publication draft to ChatGPT, asking to generate a blog post in plain English for BLOPIG. The outcome:
Not bad.
But, it made me realise a number of things:
- With great power comes great responsibility [Uncle Ben – Spiderman].
You are responsible for the ethics that go into using ChatGPT. Are you faking expertise? Are you actually being lazy or just being efficient? Think twice (or many more times) about whether you’re doing the right thing.
- It can significantly reduce the number of writing iterations, but don’t take it at face value.
Can you actually trust the plain output? No.
Never take its output as the ground truth, as Large Language Models such as ChatGPT often produce biased writing outputs.
Keep in mind that whatever you produce as a scientist will be picked up by others and, if incorrect, is prone to spread misinformation. It is OK to reduce mechanical iterations, but it’s NOT OK to skip quality control.
- Be open about it.
You don’t want to set the wrong example for your colleagues. So, mention if you used it and how you used it. It is fine to encourage efficiency, but not to incentivise a culture of scientific misconduct and plagiarism. Don’t skip the step of producing quality ideas on your own. This is such a concern that publishers like Elsevier have already reacted by publishing guidelines contemplating this possibility, while Springer Nature is working on ways to spot AI-generated outputs.
The bottom line
What are the dos and don’ts of using ChatGPT?
Yes, use it to have fun. Yes, use it to proofread or polish your writing. Yes, use it to summarise your own ideas. No, don’t use it to do the analysis and interpretation of your results. No, don’t copy and paste its direct output into your publication. No, don’t hide that you used it. Finally, NO, you can’t add ChatGPT as a contributing author!
Entering a Stable Relationship with your Neural Network
Over the past year, I have been working on building a graph-based paratope (antibody binding site) prediction tool – Paragraph. Fortunately, I have had moderate success with this and you can now check out the preprint of this work here.
However, for a long time, I struggled with a highly unstable network, where different random seeds yielded very different results. I believe this instability was largely due to the high class imbalance in my data – only ~10% of all residues in the Fv (variable region of the antibody) belong to the paratope.
I tried many different things in an attempt to stabilise my training, most of which failed. I will share all of these ideas with you though – successful or not – as what works for one person/network is never guaranteed to work for another. I hope that the below may provide some ideas to try out for others facing similar issues. Where possible, I also provide some example hyperparameter values that could act as sensible starting points.
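To give a feel for what a ~10% positive class means for the loss, here is a small sketch of one common remedy: up-weighting the positive class in a binary cross-entropy. This is a generic illustration and not necessarily one of the approaches tried in the full post; the function names are my own.

```python
import math


def positive_class_weight(labels):
    """Ratio of negatives to positives, often used to up-weight the rare class."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


def weighted_bce(probabilities, labels, pos_weight):
    """Mean binary cross-entropy with the positive term scaled by pos_weight."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


if __name__ == "__main__":
    # Toy data: 10 paratope residues out of 100, mirroring the ~10% imbalance.
    labels = [1] * 10 + [0] * 90
    weight = positive_class_weight(labels)
    print(weight, weighted_bce([0.5] * 100, labels, weight))
```

With this weighting, the rare positive residues contribute as much to the total loss as the far more numerous negatives, which is the intuition behind trying it on imbalanced data.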
OpenMM Setup: Start Simulating Proteins in 5 Minutes
Molecular dynamics (MD) simulations are a good way to explore the dynamical behaviour of a protein you might be interested in. One common problem is that they often have a relatively steep learning curve when using most MD engines.
What if you just want to run a simple, one-off simulation with no fancy enhanced sampling methods? OpenMM Setup is a useful tool for exactly this. It is built on the open-source OpenMM engine and provides an easy to install (via conda) GUI that can have you running a simulation in less than 5 minutes. Of course, running a simulation requires careful setting of parameters and being familiar with best practices and while this is beyond the scope of this post, there are many guides out there that can easily be found. Now on to the good stuff: using OpenMM Setup!
When you first run OpenMM Setup, you’ll be greeted by a browser window asking you to choose a structure to use. This can be a crystal structure or a model. Remember, sometimes these will have problems that need fixing like missing density or charged, non-physiological termini that would lead to artefacts, so visual inspection of the input is key! You can then choose the force field and water model you want to use, and tell OpenMM to do some cleaning up of the structure. Here I am running the simulation on hen egg-white lysozyme:
How to prepare a molecule for RDKit
RDKit is very fussy when it comes to inputs in SDF format. Using the SDMolSupplier, we get a significant rate of failure even on curated datasets such as the PDBBind refined set. Pymol has no such scruples, and with that, I present a function which has proved invaluable to me over the course of my DPhil. For reasons I have never bothered to explore, using pymol to convert from sdf, into mol2 and back to sdf format again (adding in missing hydrogens along the way) will almost always make a molecule safe to import using RDKit:
from pathlib import Path

from pymol import cmd


def py_mollify(sdf, overwrite=False):
    """Use pymol to sanitise an SDF file for use in RDKit.

    Arguments:
        sdf: location of faulty sdf file
        overwrite: whether or not to overwrite the original sdf. If False,
            a new file will be written in the form <sdf_fname>_pymol.sdf

    Returns:
        Filename of the sanitised output: the original sdf filename if
        overwrite == True, else <sdf_fname>_pymol.sdf.
    """
    sdf = Path(sdf).expanduser().resolve()
    mol2_fname = str(sdf).replace('.sdf', '_pymol.mol2')
    new_sdf_fname = sdf if overwrite else str(sdf).replace('.sdf', '_pymol.sdf')

    # Load the original SDF, add missing hydrogens and write it out as mol2.
    cmd.load(str(sdf))
    cmd.h_add('all')
    cmd.save(mol2_fname)

    # Start from a clean session, then convert the mol2 back to SDF.
    cmd.reinitialize()
    cmd.load(mol2_fname)
    cmd.save(str(new_sdf_fname))
    return new_sdf_fname