When recently looking at some reaction data, I was confronted with the problem of atom-to-atom mapping (AAM) and what tools are available to tackle it. AAM refers to the process of mapping individual atoms in reactants to their corresponding atoms in the products, which is important for defining a reaction template and identifying which bonds are being formed and broken. This has many downstream uses for computational chemists, such as for reaction searching and forward and retrosynthesis planning1. The problem is that many reaction databases do not contain these mappings, and annotation by expert chemists is impractical for databases containing thousands (or more) data points.
Continue readingCategory Archives: Code
How to easily use pharmacophoric atom features to turn ECFPs into FCFPs
Today’s post builds on my earlier blogpost on how to turn a SMILES string into an extended-connectivity fingerprint using RDKit and describes an interesting and easily implementable modification of the extended-connectivity fingerprint (ECFP) featurisation. This modification is based on representing the atoms in the input compound at a different (and potentially more useful) level of abstraction.
We remember that each binary component of an ECFP indicates the presence or absence of a particular circular subgraph in the input compound. Circular subgraphs that are structurally isomorphic are further distinguished according to their inherited atom- and bond features, i.e. two structurally isomorphic circular subgraphs with distinct atom- or bond features correspond to different components of the ECFP. For chemical bonds, this distinction is made on the basis of simple bond types (single, double, triple, or aromatic). To distinguish atoms, standard ECFPs use seven features based on the Daylight atomic invariants [1]; but there is also another less commonly used and often overlooked version of the ECFP that uses pharmacophoric atom features instead [2]. Pharmacophoric atom features attempt to describe atomic properties that are critical for biological activity or binding to a target protein. These features try to capture the potential for important chemical interactions such as hydrogen bonding or ionic bonding. ECFPs that use pharmacophoric atom features instead of standard atom features are called functional-connectivity fingerprints (FCFPs). The exact sets of standard- vs. pharmacophoric atom features for ECFPs vs. FCFPs are listed in the table below.
In RDKit, ECFPs can be changed to FCFPs extremely easily by changing a single input argument. Below you can find a Python/RDKit implementation of a function that turns a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False.
# import packages import numpy as np from rdkit.Chem import AllChem # define function that transforms a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False def FCFP_from_smiles(smiles, R = 2, L = 2**10, use_features = True, use_chirality = False): """ Inputs: - smiles ... SMILES string of input compound - R ... maximum radius of circular substructures - L ... fingerprint-length - use_features ... if true then use pharmacophoric atom features, if false then use standard DAYLIGHT atom features - use_chirality ... if true then append tetrahedral chirality flags to atom features Outputs: - np.array(feature_list) ... FCFP/ECFP with length L and maximum radius R """ molecule = AllChem.MolFromSmiles(smiles) feature_list = AllChem.GetMorganFingerprintAsBitVect(molecule, radius = R, nBits = L, useFeatures = use_features, useChirality = use_chirality) return np.array(feature_list)
The use of pharmacophoric atom features makes FCFPs more specific to molecular interactions that drive biological activity. In certain molecular machine-learning applications, replacing ECFPs with FCFPs can therefore lead to increased performance and decreased learning time, as important high-level atomic properties are presented to the learning algorithm from the start and do not need to be inferred statistically. However, the standard atom features used in ECFPs contain more detailed low-level information that could potentially still be relevant for the prediction task at hand and thus be utilised by the learning algorithm. It is often unclear from the outset whether FCFPs will provide a substantial advantage over ECFPs in a given application; however, given how easy it is to switch between the two, it is almost always worth trying out both options.
[1] Weininger, David, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for generation of unique SMILES notation.” Journal of Chemical Information and Computer Sciences 29.2 (1989): 97-101.
[2] Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 50.5 (2010): 742-754.
Datamining Wikipedia and writing JS with ChatGTP just to swap the colours on university logos…
A few months back I moved from the Oxford BRC to OPIG, both within the university of Oxford, but like many in academia I have moved across a few universities. As this is my first post here I wanted to do something neat: a JS tool that swapped colours in university logos!
It was a rather laborious task requiring a lot of coding, but once I got it working, I ended up tripping up at the last metre. So for technical reasons, I have resorted to hosting it in my own blog (see post), but nevertheless the path towards it is worth discussing.
Does ChatGPT know how to translate images?
Yesterday I spent a couple of hours playing with ChatGPT. I know, we have some other recent posts about it. It’s so amazing that I couldn’t resist writing another. Apologies for that.
The goal of this post is to determine if I can effectively use ChatGPT as a programmer/mathematician assistant. OK. It was not my original intention, but let’s pretend it was, just to make this post more interesting.
So, I started asking a few very simple programming answers like the following:
Can you implement a function to compute the factorial of a number using a cache? Use python.
And this is what I got.
A clear and efficient implementation of the factorial. This is the kind of answer you would expect from a first year CS student.
Continue readingCleaning outliers in conductance timeseries from molecular dynamics
Have you ever had an annoying dataset that looks something like this?
or even worse, just several of them
In this blog post, I will introduce basic techniques you can use and implement with Python to identify and clean outliers. The objective will be to get something more eye-pleasing (and mostly less troublesome for further data analysis) like this
Continue readingCodeQL analyses your code to find common errors
This post is pretty much an ad for a very useful tool developed by GitHub that helps you find errors or vulnerabilities in your code by querying it as if it were data. I have personally found it very useful in finding small errors in my code and would recommend everyone to use it. If you want to check it out, this is their webpage.
Continue readingHow to turn a SMILES string into an extended-connectivity fingerprint using RDKit
After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).
ECFPs were originally described in a 2010 article of Rogers and Hahn [1] and still belong to the most popular and efficient methods to turn a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP-algorithm is dependent on two predefined hyperparameters: the fingerprint-length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bitvector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.
Continue readingUnreasonably faster notes, with command-line fuzzy search
A good note system should act like a second brain:
- Accessible in seconds
- Adding information should be frictionless
- Searching should be exhaustive – if it’s there, you must find it
The benefits of such a note system are immense – never forget anything again! Search, perform the magic ritual of Copy Paste, and rejoice in the wisdom of your tried and tested past.
But how? Through the unreasonable effectiveness of interactive fuzzy search. This is how I have used Fuz, a terminal-based file fuzzy finder, for about 4 years.
Briefly, Fuz extracts all text within a directory using ripgrep, enables interactive fuzzy search with FZF, and returns you the selected item. As you type, the search results get narrowed down to a few matches. Files are opened at the exact line you found. And it’s FAST – 100,000 lines in half a second fast.
Continue readingHow to build a Python dictionary of residues for each molecule in PyMOL
Sometimes it can be handy to work with multiple structures in PyMOL using Python.
Here’s a snippet of code you might find useful: we iterate over all the α-carbon atoms in a protein and append to a list tuples such as (‘GLY’, 1). The dictionary, ‘reslist’, returns a list of residue names and indices for each molecule, where the key is a string containing the name of the molecule.
from pymol import cmd # Create a list of all the objects, called 'mpls': mols = cmd.get_object_list('*') # Create an empty dictionary that will return a list of residues # given the name of the molecule object reslist = {} # Set the dictionaries to be empty lists for m in mols: reslist[m] = [] # Use PyMOL's iterate command to go over every α-Carbon and append # a tuple consisting of the each residue's residue name ('resn') and # residue index ('resi '): for m in mols: cmd.iterate('%s and n. ca'%m, 'reslist["%s"].append((resn,int(resi)))'%m)
This script assumes you only have protein molecules loaded, and ignores things like chain ID and insertion codes.
Once you have your list of residues, you can use it with the cmd.align
command, e.g., to align a particular residue to a reference structure.
Running code that fails with style
We have all been there, working on code that continuously fails while staring at a dull and colorless command-line. However, we are in luck, as there is a way to make the constant error messages look less depressing. By changing our shell to one which enables a colorful themed command-line and fancy features like automatic text completion and web search your code won’t just fail with ease, but also with style!
A shell is your command-line interpreter, meaning you use it to process commands and output results of the command-line. The shell therefore also holds the power to add a little zest to the command-line. The most well-known shell is bash, which comes pre-installed on most UNIX systems. However, there exist many different shells, all with different pros and cons. The one we will focus on is called Z Shell or zsh for short.
Zsh was initially only for UNIX and UNIX-Like systems, but its popularity has made it accessible on most systems now. Like bash, zsh is extremely customizable and their syntax so similar that most bash commands will work in zsh. The benefit of zsh is that it comes with additional features, plugins and options, and open-source frameworks with large communities. The framework which we will look into is called Oh My Zsh.
Continue reading