Inspired by this blog post by the lovely Kate, I’ve been doing some BRICS decomposition of molecules myself. Like the structure-based goblin that I am, though, I’ve been applying it to 3D structures of molecules rather than using the SMILES approach she detailed. I thought it might be helpful to share the code snippets I’ve been using for this: unsurprisingly, it can also be done with RDKit!
I’ll use the same example as in the original blog post, propranolol.
First, I import RDKit and load the ligand in question:
Today’s blog post is about using PLIP to extract information about interactions between a protein and a ligand in a bound complex, using data from PDBbind. The blog post will cover how to combine the protein PDB file and the ligand mol2 file into a single PDB file, and how to use PLIP in a high-throughput manner with Python.
In order for PLIP to consider the ligand as one molecule interacting with the protein, we need to modify the mol2 file of the ligand. The 8th column of the atom portion of a mol2 file (the portion that starts with @<TRIPOS>ATOM) contains the ID of the ligand that the atom belongs to. Most often all the atoms have the same ligand ID, but for peptides, for instance, each atom has the ID of the residue it is part of. The following code snippet makes the required changes:
ligand_file = 'data/5oxm/5oxm_ligand.mol2'

# Read the original ligand file
with open(ligand_file, 'r') as f:
    ligand_lines = f.readlines()

# Overwrite the ligand ID column of every line in the ATOM section
mod = False
for i, line in enumerate(ligand_lines):
    if line == '@<TRIPOS>BOND\n':
        mod = False  # the ATOM section has ended
    if mod:
        ligand_lines[i] = line[:59] + 'ISK ' + line[67:]
    if line == '@<TRIPOS>ATOM\n':
        mod = True  # the lines that follow are atom records

# Write the modified ligand file
with open('data/5oxm/5oxm_ligand_mod.mol2', 'w') as g:
    g.writelines(ligand_lines)
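The snippet above relies on fixed character positions (59 to 67), which only works if every atom line is laid out exactly like the PDBbind files. As a hedged alternative, the sketch below splits each atom record on whitespace and replaces the 8th field instead. The function name `set_substructure_name`, the output column widths and the sample lines are made up for illustration, and the sketch assumes that whatever reads the file afterwards accepts whitespace-delimited mol2 records.

```python
def set_substructure_name(mol2_lines, new_name="ISK"):
    """Return a copy of mol2_lines with the 8th ATOM column set to new_name."""
    out = []
    in_atoms = False
    for line in mol2_lines:
        stripped = line.strip()
        if stripped.startswith("@<TRIPOS>"):
            # Track whether we are inside the ATOM section
            in_atoms = stripped == "@<TRIPOS>ATOM"
            out.append(line)
            continue
        fields = line.split()
        if in_atoms and len(fields) >= 8:
            fields[7] = new_name
            # Keep any trailing columns (charge, status bits) as one block
            cols = fields[:8] + [" ".join(fields[8:])]
            # Column widths are an arbitrary choice, not part of the format
            line = "{:>7} {:<8} {:>10} {:>10} {:>10} {:<8} {:>4} {:<8} {}".format(*cols)
            line = line.rstrip() + "\n"
        out.append(line)
    return out


# Hypothetical two-atom peptide fragment, made up for this example
sample = [
    "@<TRIPOS>MOLECULE\n",
    "5oxm_ligand\n",
    "@<TRIPOS>ATOM\n",
    "      1 N          12.3450   -4.5670    8.9010 N.3       1 ALA1       -0.3000\n",
    "      2 CA         13.0000   -5.0000    9.0000 C.3       2 GLY2        0.1000\n",
    "@<TRIPOS>BOND\n",
    "     1     1     2    1\n",
]
result = set_substructure_name(sample)
print("".join(result))
```

Only the lines between @<TRIPOS>ATOM and the next section header are touched; everything else is passed through unchanged.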
Environment Modules is a great tool for high-performance computing: it is a modular system for quickly and painlessly enabling preset configurations of environment variables. For example, a user may be provided with one modulefile for an antiquated version of a tool and another for a bleeding-edge alpha version of that same tool, and can easily load whichever they wish. In many clusters the modules are created with a tool called EasyBuild, which delivers an out-of-the-box installation. This works for things like a single binary, but for conda it falls severely short, as many configuration changes are needed.
Recently I’ve been thinking a lot about how to decompose a compound into smaller fragments specifically for a retrosynthetic purpose. My question is: given a compound, can I return building blocks that are likely to react together to produce this compound, simply by breaking bonds that are likely to have been formed in a reaction? A nearly 15-year-old method named breaking of retrosynthetically interesting chemical substructures (BRICS) is one approach to doing this. Here I’ll explore how BRICS can reflect synthetic accessibility.
Alternative Title: The tragic story of how I got trapped making slides with LaTeX.
Typically, after I give a presentation, at least one person will approach me and ask whether they could have access to my custom LaTeX template for making Beamer slides that don’t look rubbish.
When recently looking at some reaction data, I was confronted with the problem of atom-to-atom mapping (AAM) and what tools are available to tackle it. AAM refers to the process of mapping individual atoms in reactants to their corresponding atoms in the products, which is important for defining a reaction template and identifying which bonds are being formed and broken. This has many downstream uses for computational chemists, such as for reaction searching and forward and retrosynthesis planning¹. The problem is that many reaction databases do not contain these mappings, and annotation by expert chemists is impractical for databases containing thousands (or more) data points.
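To make the idea of a mapping concrete: in reaction SMILES, atom maps are written as a number after a colon inside the atom brackets, e.g. `[CH3:1]`. The sketch below is only an illustration using the standard library, not one of the AAM tools discussed in the post. It pulls the map numbers out of each side of a reaction SMILES with a simple regular expression and reports product map numbers that have no counterpart in the reactants. The function names and the hand-mapped esterification example are my own assumptions, and the regex is a rough heuristic rather than a full SMILES parser.

```python
import re

# Matches the ":<number>]" that closes a mapped atom such as "[CH3:1]"
MAP_NUMBER = re.compile(r":(\d+)\]")


def map_numbers(side):
    """Return the set of atom-map numbers found in one side of a reaction SMILES."""
    return {int(n) for n in MAP_NUMBER.findall(side)}


def product_maps_missing_in_reactants(reaction_smiles):
    """Return product map numbers that do not appear on the reactant side."""
    reactants, _agents, products = reaction_smiles.split(">")
    return map_numbers(products) - map_numbers(reactants)


# Hand-mapped esterification (acetic acid + methanol), written for this example
mapped = (
    "[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][OH:6]"
    ">>"
    "[CH3:1][C:2](=[O:3])[O:6][CH3:5].[OH2:4]"
)
unmapped = "CC(=O)O.CO>>CC(=O)OC.O"

print(sorted(map_numbers(mapped.split(">")[0])))   # map numbers on the reactant side
print(product_maps_missing_in_reactants(mapped))   # empty set if the mapping is consistent
print(map_numbers(unmapped))                       # empty set: no mapping present
```

A check like this can flag unmapped or inconsistently mapped entries in a database, but it says nothing about whether a given mapping is chemically correct.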
Today’s post builds on my earlier blogpost on how to turn a SMILES string into an extended-connectivity fingerprint using RDKit and describes an interesting and easily implementable modification of the extended-connectivity fingerprint (ECFP) featurisation. This modification is based on representing the atoms in the input compound at a different (and potentially more useful) level of abstraction.
We remember that each binary component of an ECFP indicates the presence or absence of a particular circular subgraph in the input compound. Circular subgraphs that are structurally isomorphic are further distinguished according to their inherited atom- and bond features, i.e. two structurally isomorphic circular subgraphs with distinct atom- or bond features correspond to different components of the ECFP. For chemical bonds, this distinction is made on the basis of simple bond types (single, double, triple, or aromatic). To distinguish atoms, standard ECFPs use seven features based on the Daylight atomic invariants [1]; but there is also another less commonly used and often overlooked version of the ECFP that uses pharmacophoric atom features instead [2]. Pharmacophoric atom features attempt to describe atomic properties that are critical for biological activity or binding to a target protein. These features try to capture the potential for important chemical interactions such as hydrogen bonding or ionic bonding. ECFPs that use pharmacophoric atom features instead of standard atom features are called functional-connectivity fingerprints (FCFPs). The exact sets of standard- vs. pharmacophoric atom features for ECFPs vs. FCFPs are listed in the table below.
In RDKit, ECFPs can be changed to FCFPs extremely easily by changing a single input argument. Below you can find a Python/RDKit implementation of a function that turns a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False.
# import packages
import numpy as np
from rdkit.Chem import AllChem

# define function that transforms a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False
def FCFP_from_smiles(smiles,
                     R = 2,
                     L = 2**10,
                     use_features = True,
                     use_chirality = False):
    """
    Inputs:
    - smiles ... SMILES string of input compound
    - R ... maximum radius of circular substructures
    - L ... fingerprint-length
    - use_features ... if true then use pharmacophoric atom features, if false then use standard DAYLIGHT atom features
    - use_chirality ... if true then append tetrahedral chirality flags to atom features
    Outputs:
    - np.array(feature_list) ... FCFP/ECFP with length L and maximum radius R
    """
    molecule = AllChem.MolFromSmiles(smiles)
    # MolFromSmiles returns None for a SMILES string it cannot parse
    if molecule is None:
        raise ValueError("Could not parse SMILES string: " + str(smiles))
    feature_list = AllChem.GetMorganFingerprintAsBitVect(molecule,
                                                         radius = R,
                                                         nBits = L,
                                                         useFeatures = use_features,
                                                         useChirality = use_chirality)
    return np.array(feature_list)
The use of pharmacophoric atom features makes FCFPs more specific to molecular interactions that drive biological activity. In certain molecular machine-learning applications, replacing ECFPs with FCFPs can therefore lead to increased performance and decreased learning time, as important high-level atomic properties are presented to the learning algorithm from the start and do not need to be inferred statistically. However, the standard atom features used in ECFPs contain more detailed low-level information that could potentially still be relevant for the prediction task at hand and thus be utilised by the learning algorithm. It is often unclear from the outset whether FCFPs will provide a substantial advantage over ECFPs in a given application; however, given how easy it is to switch between the two, it is almost always worth trying out both options.
[1] Weininger, David, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for generation of unique SMILES notation.” Journal of Chemical Information and Computer Sciences 29.2 (1989): 97-101.
[2] Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 50.5 (2010): 742-754.
A few months back I moved from the Oxford BRC to OPIG, both within the University of Oxford, but like many in academia I have moved across a few universities. As this is my first post here I wanted to do something neat: a JS tool that swaps colours in university logos! It was a rather laborious task requiring a lot of coding, and once I got it working I ended up tripping up at the last metre. So, for technical reasons, I have resorted to hosting it on my own blog (see post), but the path towards it is nevertheless worth discussing.
Yesterday I spent a couple of hours playing with ChatGPT. I know, we have some other recent posts about it. It’s so amazing that I couldn’t resist writing another. Apologies for that.
The goal of this post is to determine if I can effectively use ChatGPT as a programmer/mathematician assistant. OK. It was not my original intention, but let’s pretend it was, just to make this post more interesting.
So, I started by asking a few very simple programming questions like the following:
Can you implement a function to compute the factorial of a number using a cache? Use python.
And this is what I got.
A clear and efficient implementation of the factorial. This is the kind of answer you would expect from a first year CS student.
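The response itself is not reproduced in this excerpt. For readers who want a reference point, here is a minimal sketch of a cached factorial, written for illustration and not ChatGPT's actual output. It assumes the standard-library `functools.lru_cache` decorator as the cache; a hand-rolled dictionary would do the same job.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def factorial(n: int) -> int:
    """Return n! for a non-negative integer n, caching intermediate results."""
    if n < 0:
        raise ValueError("factorial is undefined for negative numbers")
    # Base cases 0! = 1! = 1; otherwise reuse the cached (n - 1)!
    return 1 if n < 2 else n * factorial(n - 1)


print(factorial(5))
print(factorial(10))
```

Because the function is recursive, very large inputs can still hit Python's recursion limit on a cold cache; an iterative loop avoids that at the cost of losing the shared intermediate results.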
Have you ever had an annoying dataset that looks something like this?
or even worse, just several of them
In this blog post, I will introduce basic techniques you can use and implement with Python to identify and clean outliers. The objective will be to get something more eye-pleasing (and mostly less troublesome for further data analysis) like this
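The techniques covered in the full post are not shown in this excerpt. As one common example of what a basic approach can look like, the sketch below flags outliers with the interquartile-range (IQR) rule using only the standard library. The function name `iqr_filter`, the multiplier `k = 1.5` and the sample values are assumptions made for illustration, not taken from the post.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Split values into (kept, outliers) using the IQR rule with multiplier k."""
    # statistics.quantiles with n=4 returns the three quartile cut points
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if not (low <= v <= high)]
    return kept, outliers


# Made-up measurements with one obviously suspicious value
data = [10, 11, 12, 11, 10, 12, 11, 95]
kept, outliers = iqr_filter(data)
print("kept:", kept)
print("outliers:", outliers)
```

The IQR rule is a reasonable default because the quartiles themselves are barely affected by a few extreme points; whether flagged points should be removed, capped or investigated depends on the dataset.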