Category Archives: Publications

State of the art in AI for drug discovery: more wet-lab please

The medchem community has received ML approaches for the drug discovery pipeline rather skeptically, especially those focused on the hit-to-lead optimization process. One of the main drivers for that is the way many ML publications benchmark their models: historic datasets are split into two parts, with the larger part used to train and the smaller to test ML models. In order to standardize that validation process, computational chemists have constructed widely used benchmark datasets such as the DUD-E set, which is commonly used as a standard for protein-ligand binding classification tasks. Common criticism from medicinal chemists centers on the main problem associated with benchmark datasets: the absence of direct lab validation.

Continue reading →

The evolution of contact prediction – a new paper

I’m so pleased to be able to write about our work on The evolution of contact prediction: evidence that contact selection in statistical contact prediction is changing (Bioinformatics btz816). Contact prediction – the prediction of parts of the amino-acid chain that are close together – has been critical to improving the ability of scientists to predict protein structures over the last decade. Here we look at the properties of these predictions, and what that might mean for their use.

The paper begins with a question. If contact prediction methods are based on statistical properties of sequence alignments, and those alignments are generated in the presence of ecological and physical constraints, what effect do the physical constraints have on the statistical properties of real sequence alignments? More concisely: when we predict contacts, do we predict particularly important contacts?

Continue reading →

BOKEI: Bayesian Optimization Using Knowledge of Correlated Torsions and Expected Improvement for Conformer Generation

In a previous blog post, we introduced the idea of Bayesian optimization and its application to finding the lowest-energy conformation of a given molecule [1]. Here, we extend this approach to incorporate knowledge of correlated torsions and accelerate the search.
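For readers who have not met expected improvement (the "EI" in the title) before, here is a minimal sketch of the acquisition function for a minimisation problem such as conformer energy. It assumes the surrogate model gives a Gaussian posterior with mean `mu` and standard deviation `sigma` at each candidate; the function name, the `xi` exploration parameter and the toy candidate values are illustrative assumptions, not taken from the paper.

```python
import math


def expected_improvement(mu, sigma, best_energy, xi=0.0):
    """Expected improvement (for minimisation) at one candidate point.

    mu, sigma   : posterior mean and standard deviation of the surrogate model
    best_energy : lowest energy observed so far
    xi          : optional exploration margin (hypothetical default of 0)
    """
    improvement = best_energy - mu - xi
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + sigma * pdf


# Toy candidates: (label, posterior mean, posterior std). Values are made up.
candidates = [("conf_a", -3.0, 0.1), ("conf_b", -2.5, 1.5), ("conf_c", -1.0, 0.2)]
best_so_far = -2.8
next_label = max(
    candidates, key=lambda c: expected_improvement(c[1], c[2], best_so_far)
)[0]
print("Next candidate to evaluate:", next_label)
```

The point of the formula is that a candidate can be attractive either because its predicted energy is low or because the model is still uncertain about it.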

Continue reading →

Modelling Conformational Flexibility of Kinases in Inactive States

I would like to shamelessly advertise my master's thesis project, which has just been published in Proteins. Keep on reading if you are interested in kinases and/or systematic modelling of protein families.

Continue reading →

Finding the lowest energy conformation of given molecule!

Generating low-energy molecular conformers is important for many areas of computational chemistry, molecular modeling and cheminformatics. Many tools have been developed to generate conformers, including BALLOON (1), Confab (2), FROG2 (3), MOE (4), OMEGA (5) and RDKit (6). The search algorithms implemented in these tools can be broadly classified as either systematic or stochastic, and they primarily focus on generating geometrically diverse low-energy conformers. Here, we are interested in finding the lowest-energy conformation of a molecule rather than achieving geometric diversity, and we use Bayesian optimization to find it (7).

Continue reading →

What can you do with the OPIG Antibody Suite?

OPIG has now developed a whole range of tools for antibody analysis. I thought it might be helpful to summarise all the different tools we are maintaining (some of which are brand new, and some are not hosted at opig.stats), and what they are useful for.

Immunoglobulin Gene Sequencing (Ig-Seq/NGS) Data Analysis

1. OAS
Link: http://antibodymap.org/
Required Input: N/A (Database)
Paper: http://www.jimmunol.org/content/201/8/2502

OAS (Observed Antibody Space) is a quality-filtered, consistently-annotated database of all of the publicly available next generation sequencing (NGS) data of antibodies. Here you can:

Continue reading →

ISMB 2018 (Chicago): Summary of Interesting Talks/Posters

Catherine’s Selection

Network approach integrates 3D structural and sequence data to improve protein structural comparison

Why: Current graph mapping in protein structural comparison ignores sequence order of residues. Residues distant in sequence but close in 3D space are more important.
How: Introduce sequence order of residues, set a sequence-distance cutoff to consider structurally important residues, count the graphlet frequency and embed into PCA space.
Results: the new method is predictive of SCOP and CATH ‘groups’. Certain graphlets are enriched in alpha and beta folds.
Link: https://www.nature.com/articles/s41598-017-14411-y

Investigating the molecular determinants of Ebola virus pathogenicity

Why: Reston virus is the only Ebola virus that is not pathogenic to humans.
What they do: multiple sequence alignment to look for specificity determining positions (SDPs) using s3det, then predict the effect of each individual SDP on the stability of the protein with mCSM.
Results: VP40 SDPs alter octamer formation and the hydrophobic core of the structure. VP24 SDPs are predicted to impair binding to human KPNA5, the interaction through which VP24 inhibits interferon signalling.
Impact: Only a few SDPs distinguish Reston VP24 from the VP24 of the other Ebola viruses, so human-pathogenic Reston viruses may emerge.
Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5558184/#__ffn_sectitle

Computational Analysis Highlights Key Molecular Interactions and Conformational Flexibility of a New Epitope on the Malaria Circumsporozoite Protein and Paves the Way for Vaccine Design

Why: An antibody with a strong binding affinity was found in a group of subjects. This antibody prevents cleavage of the surface protein.
What they do: They identify the linear epitope, crystallise the strong and medium binders, and run molecular dynamics simulations to assess the flexibility of the structures.
Results: The strong binder is less flexible. Moreover, the strong binder is similar to the germline sequence, which may mean that this antibody could have formed readily.
Link: https://www.nature.com/articles/nm.4512

—

Matt’s Selection

“Analysis of sequence and structure data to understand nanobody architectures and antigen interactions”
Laura S. Mitchell (Colwell Group)
University of Cambridge, UK

This poster detailed the work from Laura’s two most recent publications, which can be found here: https://doi.org/10.1002/prot.25497, https://doi.org/10.1093/protein/gzy017

They describe a comprehensive analysis of the binding properties of the 156 non-redundant nanobody-antigen (Nb-Ag) complexes in the PDB/SAbDab (October 2017). Their analyses include Nb sequence variability (both global and across the binding regions), contact maps of nanobody-antigen interactions by region, and the typical chemical properties of each paratope. Nb-Ag complexes are compared to a reference set of monoclonal antibody-antigen (mAb-Ag) complexes. This work is a key first step in advancing our understanding of Nb paratopes, and will aid the development of new diagnostics and therapeutics.

“OSPREY 3.0: Open-Source Protein Redesign for You, with Powerful New Features”
Jeffrey W. Martin (Donald Group)
Duke University, USA

OSPREY 3.0 (https://www.biorxiv.org/content/early/2018/04/23/306324) represents a large advance towards time-efficient continuous flexibility modelling of protein-protein interfaces.

Its new algorithms LUTE and BBK* allow for continuous rotamer flexibility searching and entropy-aware binding constant approximation in a much more efficient manner. The CATS algorithm also introduces local backbone flexibility as a long-awaited feature. This software now has an easy-to-use Python interface and is fully open source, making it an extremely attractive alternative to proprietary protein design tools.

“Functional annotation of chemical libraries across diverse biological processes”
Scott Simpkins
University of Minnesota-Twin Cities, USA

This interesting talk detailed the work published in Nature Chemical Biology in September 2017 (https://doi.org/10.1038/nchembio.2436).

310 yeast gene-deletion mutants were isolated to perform chemical-genetic profile studies across six diverse small molecule high-throughput screening libraries. By studying which gene-deletion mutants were hypersensitive or resistant to each compound, the researchers could assign most members of each chemical library a probable functional annotation. Mapping back to gene-interaction profile data also allowed them to infer likely targets for some compounds. The GO annotations associated with these genes could then be used to assess whether a given starting library is likely to contain promising starting points that affect a given biological function. For example, the authors highlighted a deficiency across all libraries against the cellular processes of cytokinesis and ribosome biogenesis. Conversely, they found a large enrichment across all libraries for compounds likely to affect glycosylation or cell wall biogenesis. Compounds that target transcription and chromatin organisation were found to be enriched in certain datasets, and depleted in others. This kind of profiling gives researchers a way of judging a priori whether a given screening library is likely to contain promising lead compounds, given the functional role of the target of interest.

Interesting Antibody Papers.

Below are several antibody papers that should be of interest to those dealing with antibody engineering, be it computational or experimental. The running motif in this post will be humanization, or the process of engineering a mouse antibody sequence which binds to a target to look ‘more human’ so as to reduce the immune response (if you need an early citation on this issue, here it is).

We present two papers which talk about antibody humanization directly, one from a structural point of view (Choi et al. 2015), the other highlighting issues facing antibody engineers mining for information (Martin & Rees, 2016). The third paper (Collins et al. 2015) takes a step back from the issues presented in the other papers and talks broadly about the nature of mouse sequences raised in the lab.

Humanization via structural means [here] (Bailey-Kellogg group). The authors introduce a novel methodology named CoDAH to facilitate humanization of antibodies. They design an approach which makes a tradeoff between sequence and structural humanization scores. The sequence score used is the Human String Content (Lazar et al. 2007, Mol Immunol), which calculates how similar the query (murine) sequence is to short stretches of human sequences (mostly germline). In line with the fact that T-cells are one of the main drivers of anti-biologics immunity, they define the sequence stretches to be 9-mers, as recognized by T-cells. For the structural score, they use rotameric energy as calculated by Amber. They demonstrate that constructs designed using their score express and retain affinity towards the target antigen; however, they do not appear to prove that the new sequences are not immunogenic.
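To make the 9-mer idea concrete, here is a minimal sketch of a Human String Content-style score. It is a simplification and not the CoDAH implementation: each query 9-mer is compared against every 9-mer in the human set regardless of position, whereas the published score may well restrict comparisons to aligned positions. The function names and the toy sequences are made up for illustration.

```python
def kmers(seq, k=9):
    """All overlapping k-length substrings of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def identity(a, b):
    """Number of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b))


def human_string_content(query, human_seqs, k=9):
    """Mean, over query k-mers, of the best fractional identity to any human k-mer.

    Simplified, position-independent variant (an assumption of this sketch).
    """
    human_kmers = {km for seq in human_seqs for km in kmers(seq, k)}
    query_kmers = kmers(query, k)
    if not query_kmers or not human_kmers:
        raise ValueError("sequences must be at least k residues long")
    total = sum(max(identity(q, h) for h in human_kmers) for q in query_kmers)
    return total / (k * len(query_kmers))


# Toy strings, not real antibody sequences.
toy_human = ["ACDEFGHIKW"]
toy_query = "ACDEFGHIKL"
print(round(human_string_content(toy_query, toy_human), 3))
```

A score of 1.0 means every 9-mer in the query also occurs in the human set; lower values flag stretches that look less human.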

Extracting data from databases for humanization [here] (Martin group and Rees consulting). The main purpose of this manuscript is to warn potential antibody engineers of the pitfalls of species mis-annotations. They point out that in a routine ‘humanization’ pipeline where we aim to find human sequences given a mouse sequence, a great number of seemingly good ‘human’ templates are not human at all (from sources as diverse as IMGT or PDB). This might lead to errors down the line if the engineer does not double check the annotations (unfortunate but true). Many such annotations arise because the cells in which mouse antibodies are expressed are human cells or because the sequences are chimeric; in either case the annotation does not read mouse or chimeric, but erroneously ‘human’. NB. Another thing to watch in this publication is the fact that the authors are working on a sequence database of their own, EMBLIG, which is said to collect data from EMBL-ENA (the nucleotide repository from EMBL). Hopefully the authors will address in their database the issues that they point out here.

What can we say about antibodies produced by laboratory mice? [here] (Collins group). The authors of this manuscript address the fact that the now available High Throughput Sequencing (HTS) has overlooked mouse repertoires. Different mouse strains have different susceptibilities to diseases (Houpt, 2002, J Immunol), which might mean that you need to think twice about which strain to choose for a given target. The currently known antibody repertoire of mice is based on the sequencing of two strains, BALB/c and C57BL/6. Here the authors apply HTS to these two strains of laboratory mice (eight mice per strain) to get a better snapshot of antibody gene usage. Specifically, they pay close attention to the different gene combinations (VDJ) in the sequences that they obtain. The authors conclude that the repertoires of the two strains are strikingly different and quite restricted, which might mean that the laboratory mice were under very specific pressures (read inbred/overbred). All in all, the VDJ usage numbers that they produce in this publication are a useful reference for knowing which sequence combinations might be used by antibody engineers.

Interesting Antibody Papers

De Novo H3 prediction by C-terminal kink-biasing (Gray Lab) [here].

The authors introduce an improvement to the prediction of CDR-H3 in the form of a constraint for de novo decoy generation. Working from the observation that 80% of CDR-H3 loops have a kinked C-terminus (Weitzner et al., 2015, Structure), they bias the loops to assume this conformation (they show that it does not force ALL loops to do so!). The constraint takes the form of a pseudo bond angle between the Cα atoms of the three C-terminal residues and a pseudo dihedral angle for the three C-terminal residues and one adjacent residue in the framework. The bias is applied as a penalty score if the generated angle falls outside the mean ± 1 standard deviation. They use a quite stringent H3 loop benchmark of only 49 loops. Using this constraint on this dataset improves prediction for the majority of the loops. They also demonstrate the utility of the score for full Fv homology modeling and Ab-Ag docking.
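The pseudo bond angle part of such a constraint can be sketched in a few lines. This is only an illustration of the idea: the flat penalty outside the window, the `weight` parameter, the reference mean and standard deviation, and the coordinates are all assumptions of this sketch rather than values from the paper. The pseudo dihedral term is left out; it would also need to handle angle periodicity.

```python
import math


def pseudo_bond_angle(a, b, c):
    """Angle in degrees at point b formed by three points a-b-c (e.g. Cα atoms)."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    # Clamp to guard against rounding just outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def kink_penalty(angle, mean, sd, weight=1.0):
    """Zero inside [mean - sd, mean + sd]; a flat penalty (assumed form) outside."""
    return 0.0 if abs(angle - mean) <= sd else weight


# Hypothetical Cα coordinates and reference statistics, for illustration only.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0)]
angle = pseudo_bond_angle(*ca_coords)
print(angle, kink_penalty(angle, mean=101.0, sd=7.0))
```

Because the penalty is zero inside the window, loops that are already kink-like are left alone, which fits the observation that the bias does not force every loop into the kinked conformation.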

Therapeutic vs synthetic vs natural antibodies (Ofran Lab) [here].

The authors analyzed 137 Ab-Ag complexes from the PDB. Complexes were classified as either ‘natural’ (for example, those from hybridomas) or ‘synthetic’ (those coming from synthetic libraries). They demonstrate that synthetic libraries overuse H3 in the number of contacts the antibody forms with the antigen, whereas natural constructs share the paratope with H1 and H2 to a larger extent. This, together with their tool, CDRs analyzer (analysis of structural and biochemical properties of Ab-Ag complexes), can be a useful way to inform the design of antibodies.

From the past: TABHU, tools for antibody humanization (Tramontano Lab) [here].

The authors have created a tool to aid antibody humanization. Given the sequence of an antibody, the system looks for the most suitable template from their extensive sequence database (DIGIT) and from the germline sequences in IMGT. The templates are assessed on sequence similarity to the query and on the similarity of the ‘binding’ mode, which is assessed by their paratope prediction tool proABC. After the template has been chosen, the user can produce a structural model of the sequence.

Network Pharmacology

The dominant paradigm in drug discovery has been one of finding small molecules (or more recently, biologics) that bind selectively to one target of therapeutic interest. This reductionist approach conveniently ignores the fact that many drugs do, in fact, bind to multiple targets. Indeed, systems biology is uncovering an unsettling picture for comfortable reductionists: the so-called ‘magic bullet’ of Paul Ehrlich, a single compound that binds to a single target, may be less effective than a compound with multiple targets. This new approach—network pharmacology—offers new ways to improve drug efficacy, to rescue orphan drugs, re-purpose existing drugs, predict targets, and predict side-effects.

Building on work Stuart Armstrong and I did at InhibOx, a spinout from the University of Oxford’s Chemistry Department, and inspired by the work of Shoichet et al. (2007), Álvaro Cortes-Cabrera and I took our ElectroShape method, designed for ultra-fast ligand-based virtual screening (Armstrong et al., 2010 & 2011), and built a new way of exploring the relationships between drug targets (Cortes-Cabrera et al., 2013). Ligand-based virtual screening is predicated on the molecular similarity principle: similar chemical compounds have similar properties (see, e.g., Johnson & Maggiora, 1990). ElectroShape built on the earlier pioneering USR (Ultra-fast Shape Recognition) work of Pedro Ballester and Prof. W. Graham Richards at Oxford (Ballester & Richards, 2007).

Our new approach addressed two inherent limitations of the network pharmacology approaches available at the time:

Chemical similarity is calculated on the basis of the chemical topology of the small molecule; and
Structural information about the macromolecular target is neglected.

Our method addressed these issues by taking into account 3D information from both the ligand and the target.

The approach involved comparing the similarity of each set of ligands known to bind to a protein to the equivalent sets of ligands of all other known drug targets in DrugBank. DrugBank is a tremendous “bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information.” This analysis generated a network of related proteins, connected by the similarity of the sets of ligands known to bind to them.
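As a rough sketch of how such a network can be assembled, the snippet below connects two targets whenever their ligand sets are similar enough. Several things here are assumptions for illustration and are not taken from the paper: scoring a pair of sets by their best-matching ligand pair, the threshold value, the two-number toy descriptors, and the target names.

```python
from itertools import combinations


def descriptor_similarity(a, b):
    """Toy similarity between two equal-length descriptor vectors, in (0, 1]."""
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 / (1.0 + mean_abs_diff)


def target_network(ligand_sets, similarity, threshold):
    """Return edges (target1, target2, score) between targets with similar ligands.

    The set-level score is the best ligand-pair similarity (an assumed choice).
    """
    edges = []
    for t1, t2 in combinations(sorted(ligand_sets), 2):
        score = max(
            similarity(a, b) for a in ligand_sets[t1] for b in ligand_sets[t2]
        )
        if score >= threshold:
            edges.append((t1, t2, score))
    return edges


# Hypothetical targets, each with made-up ligand descriptors.
toy_sets = {
    "target_A": [(0.0, 0.0)],
    "target_B": [(0.0, 0.1)],
    "target_C": [(5.0, 5.0)],
}
for edge in target_network(toy_sets, descriptor_similarity, threshold=0.5):
    print(edge)
```

Swapping in a different `similarity` function (3D shape-based or 2D fingerprint-based) gives a different network over the same targets, which is the kind of comparison described next.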

We looked at two different kinds of ligand similarity metric, comparing the inverse Manhattan distance of our ElectroShape descriptor with 2D Morgan fingerprints, calculated using the wonderful open source cheminformatics toolkit, RDKit from Greg Landrum. Morgan fingerprints use connectivity information similar to that used for the well known ECFP family of fingerprints, which had been used in the SEA method of Keiser et al. We also looked at the problem from the receptor side, comparing the active sites of the proteins. These complementary approaches produced networks that shared a minimal fraction (0.36% to 6.80%) of nodes: while the direct comparison of target ligand-binding sites could give valuable information in order to achieve some kind of target specificity, ligand-based networks may contribute information about unexpected interactions for side-effect prediction and polypharmacological profile optimization.
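For the shape-based metric, an inverse Manhattan distance similarity can be written as below. The normalisation by descriptor length follows the form used in the USR work; whether ElectroShape scores were scaled in exactly this way is an assumption of this sketch, and the descriptor values are made up.

```python
def inverse_manhattan_similarity(a, b):
    """Similarity in (0, 1]: 1 / (1 + mean absolute difference of descriptors)."""
    if len(a) != len(b) or not a:
        raise ValueError("descriptors must be non-empty and of equal length")
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 / (1.0 + mean_abs_diff)


# Made-up descriptor vectors standing in for two molecules.
query = [4.2, 1.1, 0.3, 6.0]
candidate = [4.0, 1.3, 0.2, 6.5]
print(inverse_manhattan_similarity(query, candidate))
```

Identical descriptors score 1.0 and the score decays towards 0 as the vectors diverge. Since it only needs a sum of absolute differences, this kind of comparison is cheap enough for ultra-fast screening.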

Our new target-fishing approach was able to predict drug adverse effects, build polypharmacology profiles, and relate targets from two complementary viewpoints: ligand-based and target-based networks. We used the DUD and WOMBAT benchmark sets for on-target validation, and the results were directly comparable to those obtained using other state-of-the-art target-fishing approaches. Off-target validation was performed using a limited set of non-annotated secondary targets for already known drugs. Comparison of the predicted adverse effects with data contained in the SIDER 2 database showed good specificity and reasonable selectivity. All of these features were implemented in a user-friendly web interface that: (i) can be queried for both polypharmacology profiles and adverse effects, (ii) links to related targets in ChEMBLdb in the three networks (2D, 4D ligand and 3D receptor), and (iii) displays the 2D structure of already annotated drugs.