Tag Archives: Small Molecules

Issues with graph neural networks: the cracks are where the light shines through

Deep convolutional neural networks have lead to astonishing breakthroughs in the area of computer vision in recent years. The reason for the extraordinary performance of convolutional architectures in the image domain is their strong ability to extract informative high-level features from visual data. For prediction tasks on images, this has lead to superhuman performance in a variety of applications and to an almost universal shift from classical feature engineering to differentiable feature learning.

Unfortunately, the picture is not quite as rosy yet in the area of molecular machine learning. Feature learning techniques which operate directly on raw molecular graphs without intermediate feature-engineering steps have only emerged in the last few years in the form of graph neural networks (GNNs). GNNs, however, still have not managed to definitively outcompete and replace more classical non-differentiable molecular representation methods such as extended-connectivity fingerprints (ECFPs). There is an increasing awareness in the computational chemistry community that GNNs have not quite lived up to the initial hype and still suffer from a number of technical limitations.

Continue reading

Watch out when using PDBbind!

Now that PDBbind 2020 has been released, I want to draw some attention to an issue with using the SDF files that are supplied in the PDBbind refined set 2020.

Normally, SDF files save the chirality information of compounds in the atom block of the file which is shown belowas a snipped of the full sdf file for the ligand of PDB entry 4qsv. The column that defines chirality is marked in red.

As you can see, all columns shown here are 0. The SDF files supplied by PDBbind for some reason do NOT encode chirality information explicitly. This will be a problem when using RDKit to read the molecule and transform it into a smiles string. By using the following commands to read the ligand for 4qsv from PDBBind 2020 and write a SMILES string, we get:

Continue reading

Out-of-distribution generalisation and scaffold splitting in molecular property prediction

The ability to successfully apply previously acquired knowledge to novel and unfamiliar situations is one of the main hallmarks of successful learning and general intelligence. This capability to effectively generalise is amongst the most desirable properties a prediction model (or a mind, for that matter) can have.

In supervised machine learning, the standard way to evaluate the generalisation power of a prediction model for a given task is to randomly split the whole available data set X into two sets – a training set X_{\text{train}} and a test set X_{\text{test}}. The model is then subsequently trained on the examples in the training set X_{\text{train}} and afterwards its prediction abilities are measured on the untouched examples in the test set X_{\text{test}} via a suitable performance metric.

Since in this scenario the model has never seen any of the examples in X_{\text{test}} during training, its performance on X_{\text{test}} must be indicative of its performance on novel data X_{\text{new}} which it will encounter in the future. Right?

Continue reading

NeurIPS 2020: Chemistry / Biology papers

Another blog post, another look at accepted papers for a major ML conference. NeurIPS joins the other major machine learning conferences (and others) in moving virtual this year, running from 6th – 12th December 2020. In a continuation of past posts (ICML 2020, NeurIPS 2019), I will highlight several of potential interest to the chem-/bio-informatics communities

The list of accepted papers can be found here, with 1,903 papers accepted out of 9,467 submissions (20% acceptance rate).

In addition to the main conference, there are several workshops highly related to the type of research undertaken in OPIG: Machine Learning in Structural Biology and Machine Learning for Molecules.

The usual caveat: given the large number of papers, these were selected either by “accident” (i.e. I stumbled across them in one way or another) or through a basic search (e.g. Ctrl+f “molecule”). If you find any I have missed, please reach out and I will update accordingly.

Continue reading

ICML 2020: Chemistry / Biology papers

ICML is one of the largest machine learning conferences and, like many other conferences this year, is running virtually from 12th – 18th July.

The list of accepted papers can be found here, with 1,088 papers accepted out of 4,990 submissions (22% acceptance rate). Similar to my post on NeurIPS 2019 papers, I will highlight several of potential interest to the chem-/bio-informatics communities. As before, given the large number of papers, these were selected either by “accident” (i.e. I stumbled across them in one way or another) or through a basic search (e.g. Ctrl+f “molecule”).

Continue reading

NeurIPS 2019: Chemistry/Biology papers

NeurIPS is the largest machine learning conference (by number of participants), with over 8,000 in 2017. This year, the conference will be held in Vancouver, Canada from 8th-14th December.

Recently, the list of accepted papers was announced, with 1430 papers accepted. Here, I will highlight several of potential interest to the chem-/bio-informatics communities. Given the large number of papers, these were selected either by “accident” (i.e. I stumbled across them in one way or another) or through a basic search (e.g. Ctrl+f “molecule”).

Continue reading

Experimental Binding Modes of Small Molecules in Protein-Ligand Docking

Protein-ligand docking tends to be very good at generating binding modes that resemble experimental binding modes from X-ray crystallography and other methods (assuming we have a high quality structure…); but it is also very good at generating plausible models for ligands that don’t bind. These so-called “false positives” lead to reduced accuracy in structure-based virtual screening campaigns.

Structure-based methods are not the only way of approaching virtual screening: when all we know is the chemical structure of an active molecule, but nothing about its target (or targets), we can use ligand-based virtual screening methods, which operate on the principle of molecular similarity (Maggiora et al., 2014).

But what if we combine both methods?

Continue reading

Using RDKit to load ligand SDFs into Pandas DataFrames

If you have downloaded lots of ligand SDF files from the PDB, then a good way of viewing/comparing all their properties would be to load it into a Pandas DataFrame.

RDKit has a very handy function just for this – it’s found under the PandasTool module.

I show an example below within Jupypter-notebook, in which I load in the SDF file, view the table of molecules and perform other RDKit functions to the molecules.

First import the PandasTools module:

from rdkit.Chem import PandasTools

Read in the SDF file:

SDFFile = "./Ligands_noHydrogens_noMissing_59_Instances.sdf"
BRDLigs = PandasTools.LoadSDF(SDFFile)

You can see the whole table by calling the dataframe:

BRDLigs

The ligand properties in the SDF file are stored as columns. You can view what these properties are, and in my case I have loaded 59 ligands each having up to 26 properties:

BRDLigs.info()

It is also very easy to perform other RDKit functions on the dataframe. For instance, I noticed there is no heavy atom column, so I added my own called ‘NumHeavyAtoms’:

BRDLigs['NumHeavyAtoms']=BRDLigs.apply(lambda x: x['ROMol'].GetNumHeavyAtoms(), axis=1)

Here is the column added to the table, alongside columns containing the molecules’ SMILES and RDKit molecule:

BRDLigs[['NumHeavyAtoms','SMILES','ROMol']]

CCP4 Study Weekend 2017: From Data to Structure

This year’s CCP4 study weekend focused on providing an overview of the process and pipelines available, to take crystallographic diffraction data from spot intensities right through to structure. Therefore sessions included; processing diffraction data, phasing through molecular replacement and experimental techniques, automated model building and refinement. As well as updates to CCP4 and where is crystallography going to take us in the future?

Surrounding the meeting there was also a session for Macromolecular (MX) crystallography users of Diamond Light Source (DLS), which gave an update on the beamlines, and scientific software, as well as examples of how fragment screening at DLS has been used. The VMXi (Versatile Macromolecular X-tallography in-situ) beamline is being developed to image crystals that are forming in situ crystallisation plates. This should allow for crystallography to be optimized, as crystallization conditions can be screened, and data collected on experiments as they crystallise, especially helpful in cases where crystallisation has routinely led to non-diffracting crystals. VXMm is a micro/nanofocus MX beamline, which is in development, with a focus to get crystallographic from very small crystals (~300nm to 10 micron diameters, with a bias to the smaller size), thereby allowing crystallography of targets that have previously been hard to get sufficient crystals. Other updates included how technology developed for fast solid state data collection on x-ray free electron lasers (XFEL) can be used on synchrotron beamlines.

A slightly more in-depth discussion of two tools presented that were developed for use alongside and within CCP4, which might be of interest more broadly:

ConKit: A python interface for contact prediction tools

Contact prediction for proteins, at its simplest, involves estimating which residues within a certain certain spatial proximity of each other, given the sequence of the protein, or proteins (for complexes and interfaces). Two major types of contact prediction exist:

  • Evolutionary Coupling
  • Supervised machine learning
    • Using ab initio structure prediction tools, without sequence homologues, to predict which contacts exist, but with a much lower accuracy than evolutionary coupling.

fullscreen

ConKit is a python interface (API) for contact prediction tools, consisting of three major modules:

  • Core: A module for constructing hierarchies, thereby storing necessary data such as sequences in a parsable format.
    • Providing common functionality through functions that for example declare a contact as a false positive.
  • Application: Python wrappers for common contact prediction and sequence alignment applications
  • I/O: I/O interface for file reading, writing and conversions.

Contact prediction can be used in the crystallographic structure determination field, during unconventional molecular replacement, using a tool such as AMPLE. Molecular replacement is a computational strategy to solve the phase problem. In the typical case, by using homologous structures to determine an estimate a model of the protein, which best fits the experimental diffraction intensities, and thus estimate the phase. AMPLE utilises ab initio modeling (using Rosetta) to generate a model for the protein, contact prediction can provide input to this ab initio modeling, thereby making it more feasible to generate an appropriate structure, from which to solve the phase problem. Contact prediction can also be used to analyse known and unknown structures, to identify potential functional sites.

For more information: Talk given at CCP4 study weekend (Felix Simkovic), ConKit documentation

ACEDRG: Generating Crystallographic Restraints for Ligands

Small molecule ligands are present in many crystallographic structures, especially in drug development campaigns. Proteins are formed (almost exclusively) from a sequence containing a selection of 20 amino acids, this means there are well known restraints (for example: bond lengths, bond angles, torsion angles and rotamer position) for model building or refinement of amino acids. As ligands can be built from a much wider selection of chemical moieties, they have not previously been restrained as well during MX refinement. Ligands found in PDB depositions can be used as models for the model building/ refinement of ligands in new structures, however there are a limited number of ligands available (~23,000). Furthermore, the resolution of the ligands is limited to the resolution of the macro-molecular structure from which they are extracted.

ACEDRG utilises the crystallorgraphy open database (COD), a library of (>300,000) small molecules usually with atomic resolution data (often at least 0.84 Angstrom), to generate a dictionary of restraints to be used in refining the ligand. To create these restraints ACEDRG utilises the RDkit chemoinformatics package, generating a detailed descriptor of each atom of the ligands in COD. The descriptor utilises properties of each atom including the element name, number of bonds, environment of nearest neighbours, third degree neighbours that are aromatic ring systems. The descriptor, is stored alongside the electron density values from the COD.  When a ACEDRG query is generated, for each atom in the ligand, the atom type is compared to those for which a COD structure is available, the nearest match is then used to generate a series of restraints for the atom.

ACEDRG can take a molecular description (SMILES, SDF MOL, SYBYL MOL2) of your ligand, and generate appropriate restraints for refinement, (atom types, bond lengths and angles, torsion angles, planes and chirality centers) as a mmCIF file. These restraints can be generated for a number of different probable conformations for the ligand, such that it can be refined in these alternate conformations, then the refinement program  can use local scoring criteria to select the ligand conformation that best fits the observed electron density. ACEDRG can accessed through the CCP4i2 interface, and as a command line interface.

Hopefully a useful insight to some of the tools presented at the CCP4 Study weekend. For anyone looking for further information on the CCP4 Study weekend: Agenda, Recording of Sessions, Proceedings from previous years.

Network Representations of Allostery

Allostery is the process by which action at one site, such as the binding of an effector molecule, causes a functional effect at a distant site. Allosteric mechanisms are important for the regulation of cellular processes, altering the activity of a protein, or the whole biosynthetic pathway. Triggers for allosteric action include binding of small molecules, protein-protein interaction, phosphorylation events and modification of disulphide bonds. These triggers can lead to changes in accessibility of the active site, through large or small motions, such as hinge motion between two domains, or the motion of a single side chain.

Figure 1 from

Figure 1 from [1]: Rearrangement of a residue–residue interaction in phosphofructokinase. Left panel: interaction between E241 and H160 of chain A in the inactive state; right: this interaction in the active state. Red circles mark six atoms unique to the residue–residue interface in the I state, green circles mark four atoms unique to the A state, and yellow circles mark three atoms present in both states. In these two residues, there are a total of 19 atoms, so the rearrangement factor R(i,j) = max(6, 4)/19 = 0.32

One way to consider allostery is as signal propagation from one site to another, as a change in residue to residue contacts. Networks provide a way to represent these changes. Daily et al [1] introduce the idea of contact rearrangement networks, constructed from a local comparison of the protein structure with and without molecules bound to the allosteric site. These are referred to as the active and inactive structures respectively. To measure the whether a residue to residue contact is changed between the active and inactive states, the authors use a rearrangement factor (R(i,j)). This is the ratio of atoms which are within a threshold distance (5 angstroms) in only one of the active or inactive states (whichever is greater), to the total number of atoms in the two residues.The rearrangement factor is distributed such that the large majority of residues have low rearrangement factors (as they do not change between the active and inactive state). To consider when a rearrangement is significant the authors use a benchmark set of non allosteric proteins to set a threshold for the rearrangement factor. The residues above this threshold form the contact rearrangement network, which can be analysed to assess whether the allosteric and functional sites are linked by residue to residue contacts. In the paper 5/15 proteins analysed are found to have linked functional and allosteric sites.

Contact rearrangement network

Adaption of Figure 2 from [1]. Contact rearrangement network for phosphofructokinase. Circles in each graph represent protein residues, and red and green squares represent substrate and effector molecules, respectively. Lines connect pairs of residues with R(i,j) ≥ 0.3 and residues in the graph with any ligands which are adjacent (within 5.0 Å) in either structure. All connected components which include at least one substrate or effector molecule are shown.

Collective rigid body domain motion was not initially analysed by these contact rearrangement networks, however a later paper [2], discusses how considering these motions alongside the contact rearrangement networks can lead to a detection of allosteric activity in a greater number of proteins analysed. These contact rearrangement networks provide a way to assess the residues that are likely to be involved in allosteric signal propagation. However this requires a classification of allosteric and non-allosteric proteins, to undertake the thresholding for significance of the change in contacts, as well as multiple structures that have and do not have a allosteric effector molecule bound.

CONTACT

Figure 1 from [3]. (a) X-ray electron density map contoured at 1σ (blue mesh) and 0.3σ (cyan mesh) of cyclophilin A (CYPA) fit with discrete alternative conformations using qFit. Alternative conformations are colored red, orange or yellow, with hydrogen atoms added in green. (b) Visualizing a pathway in CYPA: atoms involved in clashes are shown in spheres scaled to van der Waals radii, and clashes between atoms are highlighted by dotted lines. This pathway originates with the OG atom of Ser99 conformation A (99A) and the CE1 atom of Phe113 conformation B (113B), which clash to 0.8 of their summed van der Waals radii. The pathway progresses from Phe113 to Gln63, and after the movement of Met61 to conformation B introduces no new clashes, the pathway is terminated. A 90° rotation of the final panel is shown to highlight how the final move of Met61 relieves the clash with Gln63. (c) Networks identified by CONTACT are displayed as nodes connected by edges representing contacts that clash and are relieved by alternative conformations. The node number represents the sequence number of the residue. Line thickness between a pair of nodes represents the number of pathways that the corresponding residues are part of. The pathway in b forms part of the red contact network in CYPA. (d) The six contact networks comprising 29% of residues are mapped on the three-dimensional structure of CYPA.

Alternatively, Van den Bedem et al [3]  define contact networks of conformationally coupled residues, in which movement of an alternative conformation of a residue likely influences the conformations of all other residues in the contact network. They utilise qFit, a tool for exploring conformational heterogeneity in a single electron density map of a protein, by fitting alternate conformations to the electron density.  For each conformation of a residue, it assesses whether it is possible to reduce steric clashes with another residue, by changing conformations. If a switch in conformations reduces steric clashes, then a pathway is extend to the neighbours of the residue that is moved. This continued until no new clashes are introduced. Pathways that share common members are considered as conformationally coupled, and grouped into a single contact network. As this technique is suitable for a single structure, it is possible to estimate residues which may be involved in allosteric signalling without prior knowledge of the allosteric binding region.

These techniques show two different ways to locate and annotate local conformational changes in a protein, and determine how they may be linked to one another. Considering whether these, and similar techniques highlight the same allosteric networks within proteins will be important in the integration of many data types and sources to inform the detection of allostery. Furthermore, the ability to compare networks, for example finding common motifs, will be important as the development of techniques such as fragment based drug discovery present crystal structures with many differently bound fragments.

[1] Daily, M. D., Upadhyaya, T. J., & Gray, J. J. (2008). Contact rearrangements form coupled networks from local motions in allosteric proteins. Proteins: Structure, Function and Genetics. http://doi.org/10.1002/prot.21800

[2] Daily, M. D., & Gray, J. J. (2009). Allosteric communication occurs via networks of tertiary and quaternary motions in proteins. PLoS Computational Biology. http://doi.org/10.1371/journal.pcbi.1000293

[3] van den Bedem, H., Bhabha, G., Yang, K., Wright, P. E., & Fraser, J. S. (2013). Automated identification of functional dynamic contact networks from X-ray crystallography. Nature Methods, 10(9), 896–902. http://doi.org/10.1038/nmeth.2592