Customising MCS mapping in RDKit

Finding the parts in common between two molecules appears to be a straightforward, but actually is a maze of layers. The task, maximum common substructure (MCS) searching, in RDKit is done by Chem.rdFMCS.FindMCS, which is highly customisable with lots of presets. What if one wanted to control in minute detail if a given atom X and is a match for atom Y? There is a way and this is how.

Continue reading →

Machine learning strategies to overcome limited data availability

Machine learning (ML) for biological/biomedical applications is very challenging – in large part due to limitations in publicly available data (something we recently published about [1]). Substantial amounts of time and resources may be required to generate the types of data (eg protein structures, protein-protein binding affinity, microscopy images, gene expression values) required to train ML models, however.

In cases where there is sufficient data available to provide signal, but not enough for the desired performance, ML strategies can be employed:

Continue reading →

Exploring the Observed Antibody Space (OAS)

The Observed Antibody Space (OAS) [1,2] is an amazing resource for investigating observed antibodies or as a resource for training antibody specific models, however; its size (over 2.4 billion unpaired and 1.5 million paired antibody sequences as of June 2023) can make it painful to work with. Additionally, OAS is extremely information rich, having nearly 100 columns for each antibody heavy or light chain, further complicating how to handle the data.

From spending a lot of time working with OAS, I wanted to share a few tricks and insights, which I hope will reduce the pain and increase the joy of working with OAS!

Continue reading →

Academic Reading? There’s an AI for that.

AI tools are literally everywhere. Recently, I stumbled across an AI aggregator website (theresanaiforthat.com) that, given a task, will find an AI solution. At the time of writing this article, there are 4871 AI’s across 1369 tasks, with solutions ranging from scribes to polygraph examiners. Recently, I stumbled across SciSpace (formerly typeset – https://typeset.io), an “AI assistant to understand scientific literature.” So, of course, I tested it out. In this blog post, we will explore the capabilities of SciSpace and discuss how it can potentially enhance your literature review process.

The user experience of a tool can make or break its adoption. Thankfully, SciSpace isn’t bad. Its main website offers basic search functionality, enabling you to find specific papers, topics, or authors within their database. I did notice that it is missing many new papers in its database; however, users have the option to upload a PDF for analysis. Additionally, each search result includes a TL;DR summary, providing a concise overview of the paper’s contents at a glance. As expected, this summary serves as a helpful reminder for familiar papers, but I often found it inadequate in providing enough information to grasp the main arguments or story of a paper. One interesting feature of SciSpace is the ability to “trace” papers in their database. By following the citations of a paper, users can navigate through related works, authors, and topics. I think this feature would be helpful during exploration and makes finding connections between related topics a little easier.

The best thing about SciSpace is the Copilot Chrome extension. Available whenever you open a paper’s PDF or journal link, it offers text analysis, summarization, and mathematical or table comprehension. It provides a set of common template prompts, which I found helpful. For example, “What were the key contributions of that paper?”, “What data and methods have been used in this paper?”, or “What are the limitations of this paper?” I found these prompts helpful in getting a quick overview of the work faster than reading the abstract, figures, and conclusion.

To put SciSpace Copilot to the test, I used it on my recent publication. The extension provided an accurate summary of the abstract and introduction. It effectively extracted the key result and arguments plus highlighted the main contributions of the work well. To be honest, it also offered a fair and accurate summary of the limitations of the study. It was helpful; however, it does not replace the need to read the full paper.

Tools like SciSpace are clearly becoming more popular and could potentially play a larger role in how we write, read, and understand research output. In the meantime, I’ve found it helpful to significantly improve the efficiency and effectiveness of my academic reading. Its clean, user-friendly interface, TL;DR summaries, and the impressive Copilot Chrome extension save me time. Plus, it’s completely free! I do expect that at some point it will become a paid tool. Until then, it’s a great way to stay on top of published work and build an understanding of related, but unfamiliar, fields.

Pairwise sequence identity and Tanimoto similarity in PDBbind

In this post I will cover how to calculate sequence identity and Tanimoto similarity between any pairs of complexes in PDBbind 2020. I used RDKit in python for Tanimoto similarity and the MMseqs2 software for sequence identity calculations.

A few weeks back I wanted to cluster the protein-ligand complexes in PDBbind 2020, but to achieve this I first needed to precompute the sequence identity between all pairs sequences in PDBbind, and Tanimoto similarity between all pairs of ligands. PDBbind 2020 includes 19.443 complexes but there are much fewer distinct ligands and proteins than that. However, I kept things simple and calculated the similarities for all 19.443*19.443 pairs. Calculating the Tanimoto similarity is relatively easy thanks to the BulkTanimotoSimilarity function in RDKit. The following code should do the trick:

from rdkit.Chem import AllChem, MolFromMol2File
from rdkit.DataStructs import BulkTanimotoSimilarity
import numpy as np
import os

fps = []
for pdb in pdbs:
    mol = MolFromMol2File(os.path.join('data', pdb, f'{pdb}_ligand.mol2'))
    fps.append(AllChem.GetMorganFingerprint(mol, 3))

sims = []
for i in range(len(fps)):
    sims.append(BulkTanimotoSimilarity(fps[i],fps))

arr = np.array(sims)
np.savez_compressed('data/tanimoto_similarity.npz', arr)

Sequence identity calculations in python with Biopandas turned out to be too slow for this amount of data so I used the ultra fast MMseqs2. The first step to running MMseqs2 is to create a .fasta file of all the sequences, which I call QUERY.fasta. This is what the first few lines look like:

Continue reading →

Checking your PDB file for clashing atoms

Detecting atom clashes in protein structures can be useful in a number of scenarios. For example if you are just about to start some molecular dynamics simulation, or if you want to check that a structure generated by a deep learning model is reasonable. It is quite straightforward to code, but I get the feeling that these sort of functions have been written from scratch hundreds of times. So to save you the effort, here is my implementation!!!

Continue reading →

Streamlining Your Terminal Commands With Custom Bash Functions and Aliases

If you’ve ever found yourself typing out the same long commands over and over again, or if you’ve ever wished you could teleport directly to your favourite directories, then this post is for you.

Before we jump into some useful examples, let’s go over what bash functions and aliases are, and how to set them up.

Bash Functions vs Aliases

A bash function is like a mini script stored in your .bashrc or .bash_profile file. It can accept arguments, execute a series of commands, and even return a value.

Continue reading →

Unclear documentation? ChatGPT can help!

The PyMOL Python API is a useful resource for most people doing research in OPIG, whether focussed on antibodies, small molecule drug design or protein folding. However, the documentation is poorly structured and difficult to interpret without first having understood the structure of the module. In particular, the differences between use of the PyMOL command line and the API can be unclear, leading to a much longer debugging process for code than you’d like.

While I’m reluctant to continue the recent theme of ChatGPT-related posts, this is a use for ChatGPT that would have been incredibly useful to me when I was first getting to grips with the PyMOL API.

Continue reading →

Cross-linking mass-spectrometry: a guide to conformational confusions.

In the age of highly accurate structure prediction methods, I have seen more and more usage of cross-linking mass-spectrometry (XL-MS) and I wanted to understand its limitations more carefully. This is more of a guide to interpreting the data rather than how to perform the experiment.

Continue reading →

Coding a Progress Bar for your Google Slides Presentation

Presentations are a great opportunity to explain your work to a new audience and receive valuable feedback. A vital aspect of a presentation is keeping the audience’s attention which is generally quite tricky I have found (from experience).

One thing that I have noticed other presenters using, which has helped maintain my focus, is an indication of the progression of the presentation. Including in your slides information that there are only a few slides remaining, encourages the listeners to keep their focus for a little longer.

Instead I will show you how to do it using Apps Script, Google’s cloud platform that allows you to write JavaScript code which can work with its online products such as Docs or Slides.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Customising MCS mapping in RDKit

Machine learning strategies to overcome limited data availability

Exploring the Observed Antibody Space (OAS)

Academic Reading? There’s an AI for that.

Pairwise sequence identity and Tanimoto similarity in PDBbind

Checking your PDB file for clashing atoms

Streamlining Your Terminal Commands With Custom Bash Functions and Aliases

Bash Functions vs Aliases

Unclear documentation? ChatGPT can help!

Cross-linking mass-spectrometry: a guide to conformational confusions.

Coding a Progress Bar for your Google Slides Presentation