Tag Archives: Python

PLIP on PDBbind with Python

Today’s blog post is about using PLIP to extract information about interactions between a protein and ligand in a bound complex, using data from PDBbind. The blog post will cover how to combine the protein pdb file and the ligand mol2 file into a pdb file, and how to use PLIP in a high-throughput manner with python.

In order for PLIP to consider the ligand as one molecule interacting with the protein, we need to modify the mol2 file of the ligand. The 8th column of the atom portion of a mol2 file (the portion starts with @<TRIPOS>ATOM) includes the ID of the ligand that the atom belongs to. Most often all the atoms have the same ligand ID, but for peptides for instance, the atoms have the ID of the residue they’re part of. The following code snippet will make the required changes:

ligand_file = 'data/5oxm/5oxm_ligand.mol2'

with open(ligand_file, 'r') as f:
    ligand_lines = f.readlines()

mod = False
for i in range(len(ligand_lines)):
    line = ligand_lines[i]
    if line == '@&lt;TRIPOS&gt;BOND\n':
        mod = False
        
    if mod:
        ligand_lines[i] = line[:59] + 'ISK     ' + line[67:]
        
    if line == '@&lt;TRIPOS&gt;ATOM\n':
        mod = True

with open('data/5oxm/5oxm_ligand_mod.mol2', 'w') as g:
    for j in ligand_lines:
        g.write(j)

Continue reading →

Molecular conformation generation with a DL-based force field

Deep learning (DL) methods in structural modelling are outcompeting force fields because they overcome the two main limitations to force fields methods – the prohibitively large search space for large systems and the limited accuracy of the description of the physics [4].

However, the two methods are also compatible. DL methods are helping to close the gap between the applications of force fields and ab initio methods [3]. The advantage of DL-based force fields is that the functional form does not have to be specified explicitly and much more accurate. Say goodbye to the 12-6 potential function.

In principle DL-based force fields can be applied anywhere where regular force fields have been applied, for example conformation generation [2]. The flip-side of DL-based methods commonly is poor generalization but it seems that force fields, when properly trained, generalize well. ANI trained on molecules with up to 8 heavy atoms is able to generalize to molecules with up to 54 atoms [1]. Excitingly for my research, ANI-2 [2] can replace UFF or MMFF as the energy minimization step for conformation generation in RDKit [5].

So let’s use Auto3D [2] to generated low energy conformations for the four molecules caffeine, Ibuprofen, an experimental hybrid peptide, and Imatinib:

CN1C=NC2=C1C(=O)N(C(=O)N2C)C CFF
CC(C)Cc1ccc(cc1)C(C)C(O)=O IBP
Cc1ccccc1CNC(=O)[C@@H]2C(SCN2C(=O)[C@H]([C@H](Cc3ccccc3)NC(=O)c4cccc(c4C)O)O)(C)C JE2
Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C STI

Continue reading →

The ultimate modulefile for conda

Environment modules is a great tool for high-performance computing as it is a modular system to quickly and painlessly enable preset configurations of environment variables, for example a user may be provided with modulefile for an antiquated version of a tool and a bleeding-edge alpha version of that same tool and they can easily load whichever they wish. In many clusters the modules are created with a tool called EasyBuild, which delivered an out-of-the-box installation. This works for things like a single binary, but for conda this severely falls short as there are many many configuration changes needed.

Continue reading →

Ligands of CASF-2016

CASF-2016 is a commonly used benchmark for docking tools. Unfortunately, some of the provided ligand files cannot be loaded using RDKit (version 2022.09.1) but there is an easy remedy.

Continue reading →

How to easily use pharmacophoric atom features to turn ECFPs into FCFPs

Today’s post builds on my earlier blogpost on how to turn a SMILES string into an extended-connectivity fingerprint using RDKit and describes an interesting and easily implementable modification of the extended-connectivity fingerprint (ECFP) featurisation. This modification is based on representing the atoms in the input compound at a different (and potentially more useful) level of abstraction.

We remember that each binary component of an ECFP indicates the presence or absence of a particular circular subgraph in the input compound. Circular subgraphs that are structurally isomorphic are further distinguished according to their inherited atom- and bond features, i.e. two structurally isomorphic circular subgraphs with distinct atom- or bond features correspond to different components of the ECFP. For chemical bonds, this distinction is made on the basis of simple bond types (single, double, triple, or aromatic). To distinguish atoms, standard ECFPs use seven features based on the Daylight atomic invariants [1]; but there is also another less commonly used and often overlooked version of the ECFP that uses pharmacophoric atom features instead [2]. Pharmacophoric atom features attempt to describe atomic properties that are critical for biological activity or binding to a target protein. These features try to capture the potential for important chemical interactions such as hydrogen bonding or ionic bonding. ECFPs that use pharmacophoric atom features instead of standard atom features are called functional-connectivity fingerprints (FCFPs). The exact sets of standard- vs. pharmacophoric atom features for ECFPs vs. FCFPs are listed in the table below.

In RDKit, ECFPs can be changed to FCFPs extremely easily by changing a single input argument. Below you can find a Python/RDKit implementation of a function that turns a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False.

# import packages
import numpy as np
from rdkit.Chem import AllChem

# define function that transforms a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False
def FCFP_from_smiles(smiles,
                     R = 2,
                     L = 2**10,
                     use_features = True,
                     use_chirality = False):
    """
    Inputs:
    
    - smiles ... SMILES string of input compound
    - R ... maximum radius of circular substructures
    - L ... fingerprint-length
    - use_features ... if true then use pharmacophoric atom features, if false then use standard DAYLIGHT atom features
    - use_chirality ... if true then append tetrahedral chirality flags to atom features
    
    Outputs:
    - np.array(feature_list) ... FCFP/ECFP with length L and maximum radius R
    """
    
    molecule = AllChem.MolFromSmiles(smiles)
    feature_list = AllChem.GetMorganFingerprintAsBitVect(molecule,
                                                         radius = R,
                                                         nBits = L,
                                                         useFeatures = use_features,
                                                         useChirality = use_chirality)
    return np.array(feature_list)

The use of pharmacophoric atom features makes FCFPs more specific to molecular interactions that drive biological activity. In certain molecular machine-learning applications, replacing ECFPs with FCFPs can therefore lead to increased performance and decreased learning time, as important high-level atomic properties are presented to the learning algorithm from the start and do not need to be inferred statistically. However, the standard atom features used in ECFPs contain more detailed low-level information that could potentially still be relevant for the prediction task at hand and thus be utilised by the learning algorithm. It is often unclear from the outset whether FCFPs will provide a substantial advantage over ECFPs in a given application; however, given how easy it is to switch between the two, it is almost always worth trying out both options.

[1] Weininger, David, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for generation of unique SMILES notation.” Journal of Chemical Information and Computer Sciences 29.2 (1989): 97-101.

[2] Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 50.5 (2010): 742-754.

Two useful modules to help you find the best ML model for your task

FLAML and LazyPredict are two packages designed to quickly train and test machine learning models from scikit-learn so that you can determine which is the best type of model for learning from your data.

Continue reading →

How to turn a SMILES string into an extended-connectivity fingerprint using RDKit

After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).

ECFPs were originally described in a 2010 article of Rogers and Hahn [1] and still belong to the most popular and efficient methods to turn a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP-algorithm is dependent on two predefined hyperparameters: the fingerprint-length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bitvector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.

Continue reading →

Using Conda environments with Flask and Apache

With the advent of ABlooper, we’ve recently introduced OpenMM as a new dependency for the SAbDab-SAbPred antibody modelling platform. By far the easiest way to install the OpenMM Python API is via Conda, so we’ve moved to Conda environments for the entire platform. This has made installation of the platform much easier, but introduces complications when it comes to running its web applications under Apache. In this post, I’ll briefly explain the reason for this, and provide a basic guide for running Flask apps using Conda environments under Apache.

Continue reading →

Exploring topological fingerprints in RDKit

Finding a way to express the similarity of irregular and discrete molecular graphs to enable quantitative algorithmic reasoning in chemical space is a fundamental problem in data-driven small molecule drug discovery.

Virtually all algorithms that are widely and successfully used in this setting boil down to extracting and comparing (multi-)sets of subgraphs, differing only in the space of substructures they consider and the extent to which they are able to adapt to specific downstream applications.

A large body of recent work has explored approaches centred around graph neural networks (GNNs), which can often maximise both of these considerations. However, the subgraph-derived embeddings learned by these algorithms may not always perform well beyond the specific datasets they are trained on and for many generic or resource-constrained applications more traditional “non-parametric” topological fingerprints may still be a viable and often preferable choice .

This blog post gives an overview of the topological fingerprint algorithms implemented in RDKit. In general, they count the occurrences of a certain family of subgraphs in a given molecule and then represent this set/multiset as a bit/count vector, which can be compared to other fingerprints with the Jaccard/Dice similarity metric or further processed by other algorithms.

Continue reading →

Visualise with Weight and Biases

Understanding what’s going on when you’ve started training your shiny new ML model is hard enough. Will it work? Have I got the right parameters? Is it the data? Probably. Any tool that can help with that process is a Godsend. Weights and biases is a great tool to help you visualise and track your model throughout your production cycle. In this blog post, I’m going to detail some basics on how you can initialise and use it to visualise your next project.

Installation

To use weights and biases (wandb), you need to make an account. For individuals it is free, however, for team-oriented features, you will have to pay. Wandb can then be installed using pip or conda.

$ 	conda install -c conda-forge wandb

or 

$   pip install wandb

To initialise your project, import the package, sign in, and then use the following command using your chosen project name and username (if you want):

import wandb

wandb.login()

wandb.init(project='project1')

In addition to your project, you can also initialise a config dictionary with starting parameter values:

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Tag Archives: Python

PLIP on PDBbind with Python

Molecular conformation generation with a DL-based force field

The ultimate modulefile for conda

Ligands of CASF-2016

How to easily use pharmacophoric atom features to turn ECFPs into FCFPs

Two useful modules to help you find the best ML model for your task

How to turn a SMILES string into an extended-connectivity fingerprint using RDKit

Using Conda environments with Flask and Apache

Exploring topological fingerprints in RDKit

Visualise with Weight and Biases