Finding the parts in common between two molecules appears to be a straightforward, but actually is a maze of layers. The task, maximum common substructure (MCS) searching, in RDKit is done by Chem.rdFMCS.FindMCS
, which is highly customisable with lots of presets. What if one wanted to control in minute detail if a given atom X and is a match for atom Y? There is a way and this is how.
Category Archives: Python
Exploring the Observed Antibody Space (OAS)
The Observed Antibody Space (OAS) [1,2] is an amazing resource for investigating observed antibodies or as a resource for training antibody specific models, however; its size (over 2.4 billion unpaired and 1.5 million paired antibody sequences as of June 2023) can make it painful to work with. Additionally, OAS is extremely information rich, having nearly 100 columns for each antibody heavy or light chain, further complicating how to handle the data.
From spending a lot of time working with OAS, I wanted to share a few tricks and insights, which I hope will reduce the pain and increase the joy of working with OAS!
Continue readingPairwise sequence identity and Tanimoto similarity in PDBbind
In this post I will cover how to calculate sequence identity and Tanimoto similarity between any pairs of complexes in PDBbind 2020. I used RDKit in python for Tanimoto similarity and the MMseqs2 software for sequence identity calculations.
A few weeks back I wanted to cluster the protein-ligand complexes in PDBbind 2020, but to achieve this I first needed to precompute the sequence identity between all pairs sequences in PDBbind, and Tanimoto similarity between all pairs of ligands. PDBbind 2020 includes 19.443 complexes but there are much fewer distinct ligands and proteins than that. However, I kept things simple and calculated the similarities for all 19.443*19.443 pairs. Calculating the Tanimoto similarity is relatively easy thanks to the BulkTanimotoSimilarity function in RDKit. The following code should do the trick:
from rdkit.Chem import AllChem, MolFromMol2File from rdkit.DataStructs import BulkTanimotoSimilarity import numpy as np import os fps = [] for pdb in pdbs: mol = MolFromMol2File(os.path.join('data', pdb, f'{pdb}_ligand.mol2')) fps.append(AllChem.GetMorganFingerprint(mol, 3)) sims = [] for i in range(len(fps)): sims.append(BulkTanimotoSimilarity(fps[i],fps)) arr = np.array(sims) np.savez_compressed('data/tanimoto_similarity.npz', arr)
Sequence identity calculations in python with Biopandas turned out to be too slow for this amount of data so I used the ultra fast MMseqs2. The first step to running MMseqs2 is to create a .fasta file of all the sequences, which I call QUERY.fasta. This is what the first few lines look like:
Continue readingChecking your PDB file for clashing atoms
Detecting atom clashes in protein structures can be useful in a number of scenarios. For example if you are just about to start some molecular dynamics simulation, or if you want to check that a structure generated by a deep learning model is reasonable. It is quite straightforward to code, but I get the feeling that these sort of functions have been written from scratch hundreds of times. So to save you the effort, here is my implementation!!!
Continue readingUnclear documentation? ChatGPT can help!
The PyMOL Python API is a useful resource for most people doing research in OPIG, whether focussed on antibodies, small molecule drug design or protein folding. However, the documentation is poorly structured and difficult to interpret without first having understood the structure of the module. In particular, the differences between use of the PyMOL command line and the API can be unclear, leading to a much longer debugging process for code than you’d like.
While I’m reluctant to continue the recent theme of ChatGPT-related posts, this is a use for ChatGPT that would have been incredibly useful to me when I was first getting to grips with the PyMOL API.
Continue readingBRICS Decomposition in 3D
Inspired by this blog post by the lovely Kate, I’ve been doing some BRICS decomposing of molecules myself. Like the structure-based goblin that I am, though, I’ve been applying it to 3D structures of molecules, rather than using the smiles approach she detailed. I thought it may be helpful to share the code snippets I’ve been using for this: unsurprisingly, it can also be done with RDKit!
I’ll use the same example as in the original blog post, propranolol.

First, I import RDKit and load the ligand in question:
Continue readingPLIP on PDBbind with Python
Today’s blog post is about using PLIP to extract information about interactions between a protein and ligand in a bound complex, using data from PDBbind. The blog post will cover how to combine the protein pdb file and the ligand mol2 file into a pdb file, and how to use PLIP in a high-throughput manner with python.
In order for PLIP to consider the ligand as one molecule interacting with the protein, we need to modify the mol2 file of the ligand. The 8th column of the atom portion of a mol2 file (the portion starts with @<TRIPOS>ATOM) includes the ID of the ligand that the atom belongs to. Most often all the atoms have the same ligand ID, but for peptides for instance, the atoms have the ID of the residue they’re part of. The following code snippet will make the required changes:
ligand_file = 'data/5oxm/5oxm_ligand.mol2' with open(ligand_file, 'r') as f: ligand_lines = f.readlines() mod = False for i in range(len(ligand_lines)): line = ligand_lines[i] if line == '@<TRIPOS>BOND\n': mod = False if mod: ligand_lines[i] = line[:59] + 'ISK ' + line[67:] if line == '@<TRIPOS>ATOM\n': mod = True with open('data/5oxm/5oxm_ligand_mod.mol2', 'w') as g: for j in ligand_lines: g.write(j)Continue reading
The ultimate modulefile for conda
Environment modules is a great tool for high-performance computing as it is a modular system to quickly and painlessly enable preset configurations of environment variables, for example a user may be provided with modulefile for an antiquated version of a tool and a bleeding-edge alpha version of that same tool and they can easily load whichever they wish. In many clusters the modules are created with a tool called EasyBuild, which delivered an out-of-the-box installation. This works for things like a single binary, but for conda this severely falls short as there are many many configuration changes needed.
Continue readingBRICS Decomposition and Synthetic Accessibility
Recently I’ve been thinking a lot about how to decompose a compound into smaller fragments specifically for a retrosynthetic purpose. My question is: given a compound, can I return building blocks that are likely to synthesize together to produce this compound simply by breaking likely bonds formed in a reaction? A method that is nearly 15 years old named, breaking of retrosynthetically interesting chemical substructures (BRICS), is one approach to do this. Here I’ll explore how BRICS can reflect synthetic accessibility.
Continue readingHow to easily use pharmacophoric atom features to turn ECFPs into FCFPs
Today’s post builds on my earlier blogpost on how to turn a SMILES string into an extended-connectivity fingerprint using RDKit and describes an interesting and easily implementable modification of the extended-connectivity fingerprint (ECFP) featurisation. This modification is based on representing the atoms in the input compound at a different (and potentially more useful) level of abstraction.
We remember that each binary component of an ECFP indicates the presence or absence of a particular circular subgraph in the input compound. Circular subgraphs that are structurally isomorphic are further distinguished according to their inherited atom- and bond features, i.e. two structurally isomorphic circular subgraphs with distinct atom- or bond features correspond to different components of the ECFP. For chemical bonds, this distinction is made on the basis of simple bond types (single, double, triple, or aromatic). To distinguish atoms, standard ECFPs use seven features based on the Daylight atomic invariants [1]; but there is also another less commonly used and often overlooked version of the ECFP that uses pharmacophoric atom features instead [2]. Pharmacophoric atom features attempt to describe atomic properties that are critical for biological activity or binding to a target protein. These features try to capture the potential for important chemical interactions such as hydrogen bonding or ionic bonding. ECFPs that use pharmacophoric atom features instead of standard atom features are called functional-connectivity fingerprints (FCFPs). The exact sets of standard- vs. pharmacophoric atom features for ECFPs vs. FCFPs are listed in the table below.

In RDKit, ECFPs can be changed to FCFPs extremely easily by changing a single input argument. Below you can find a Python/RDKit implementation of a function that turns a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False.
# import packages import numpy as np from rdkit.Chem import AllChem # define function that transforms a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False def FCFP_from_smiles(smiles, R = 2, L = 2**10, use_features = True, use_chirality = False): """ Inputs: - smiles ... SMILES string of input compound - R ... maximum radius of circular substructures - L ... fingerprint-length - use_features ... if true then use pharmacophoric atom features, if false then use standard DAYLIGHT atom features - use_chirality ... if true then append tetrahedral chirality flags to atom features Outputs: - np.array(feature_list) ... FCFP/ECFP with length L and maximum radius R """ molecule = AllChem.MolFromSmiles(smiles) feature_list = AllChem.GetMorganFingerprintAsBitVect(molecule, radius = R, nBits = L, useFeatures = use_features, useChirality = use_chirality) return np.array(feature_list)
The use of pharmacophoric atom features makes FCFPs more specific to molecular interactions that drive biological activity. In certain molecular machine-learning applications, replacing ECFPs with FCFPs can therefore lead to increased performance and decreased learning time, as important high-level atomic properties are presented to the learning algorithm from the start and do not need to be inferred statistically. However, the standard atom features used in ECFPs contain more detailed low-level information that could potentially still be relevant for the prediction task at hand and thus be utilised by the learning algorithm. It is often unclear from the outset whether FCFPs will provide a substantial advantage over ECFPs in a given application; however, given how easy it is to switch between the two, it is almost always worth trying out both options.
[1] Weininger, David, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for generation of unique SMILES notation.” Journal of Chemical Information and Computer Sciences 29.2 (1989): 97-101.
[2] Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 50.5 (2010): 742-754.