Category Archives: Data Science

What can you do with the OPIG Immunoinformatics Suite? v3.0

OPIG’s growing immunoinformatics team continues to develop and openly distribute a wide variety of databases and software packages for antibody/nanobody/T-cell receptor analysis. Below is a summary of all the latest updates (follows on from v1.0 and v2.0).

Continue reading

PHinally PHunctionalising my PHigures with PHATE feat. Plotly Express.

After a friend recommended it, I really wanted to try Plotly Express, but I never had the inclination to read more documentation when matplotlib gives me enough grief. While experimenting with ChatGPT I finally decided to functionalise my figure-making scripts. With these scripts I managed to produce figures that made people question what I had actually been doing with my time – but I promise this will be worth your time.

I have been working with dimensionality reduction techniques recently and I came across this paper by Moon et al. PHATE is a technique that represents high-dimensional (i.e. biological) data in a way that aims to preserve connections rather than distances, and I knew I wanted to try it as soon as I saw it. Why should you care? PHATE in 3D is faster than t-SNE in 2D. It would almost be rude not to try it out.

PHATE

In my opinion PHATE (potential of heat diffusion for affinity-based transition embedding) has a lot going on, but the choices at each stage feel quite sensible. It might not come as a surprise that it was primarily designed to make visual inspection of data easier on the eyes.
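To give a feel for the stages involved, here is a rough NumPy sketch of a PHATE-like pipeline as I understand it from the paper: affinities, a diffusion step, a log "potential" transform, then an embedding of the resulting distances. It is a simplification, not the reference implementation. The function name phate_like_embedding, the fixed-bandwidth Gaussian kernel, the fixed number of diffusion steps t and the use of classical MDS are all assumptions made for brevity; the real method uses an adaptive kernel, chooses t automatically and refines the embedding further.

```python
import numpy as np


def phate_like_embedding(X, sigma=1.0, t=5, n_components=2, eps=1e-7):
    """Simplified PHATE-style embedding (illustrative only)."""
    X = np.asarray(X, dtype=float)

    # 1. Pairwise Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # 2. Local affinities from a Gaussian kernel (assumption: fixed bandwidth).
    K = np.exp(-(dist / sigma) ** 2)

    # 3. Row-normalise into a Markov matrix and diffuse for t steps.
    P = K / K.sum(axis=1, keepdims=True)
    P_t = np.linalg.matrix_power(P, t)

    # 4. Log-transform the diffusion probabilities and compare rows.
    U = -np.log(P_t + eps)
    pdiff = U[:, None, :] - U[None, :, :]
    pot_dist = np.sqrt((pdiff ** 2).sum(axis=-1))

    # 5. Classical MDS on the potential distances.
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (pot_dist ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))


# Toy data: two well-separated blobs in five dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 5)), rng.normal(3, 0.3, (10, 5))])
emb = phate_like_embedding(X)
print(emb.shape)  # (20, 2)
```

Because the distances are computed on diffusion probabilities rather than on the raw coordinates, points that are well connected through many short hops end up close together, which is the "connections over distance" idea in a nutshell.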

Continue reading

9th Joint Sheffield Conference on Cheminformatics

Over the next few days, researchers from around the world will be gathering in Sheffield for the 9th Joint Sheffield Conference on Cheminformatics. As one of the organizers (wearing my Molecular Graphics and Modeling Society ‘hat’), I can say we have an exciting array of speakers and sessions:

  • De Novo Design
  • Open Science
  • Chemical Space
  • Physics-based Modelling
  • Machine Learning
  • Property Prediction
  • Virtual Screening
  • Case Studies
  • Molecular Representations

It has traditionally taken place every three years, but despite the global pandemic it is returning this year, once again in person in the excellent conference facilities at The Edge. You can download the full programme in iCal format, and here is the conference calendar:

Continue reading

Machine learning strategies to overcome limited data availability

Machine learning (ML) for biological/biomedical applications is very challenging – in large part due to limitations in publicly available data (something we recently published about [1]). Generating the types of data required to train ML models (e.g. protein structures, protein-protein binding affinity, microscopy images, gene expression values) can demand substantial amounts of time and resources.

In cases where there is sufficient data available to provide signal, but not enough for the desired performance, ML strategies can be employed:

Continue reading

Pairwise sequence identity and Tanimoto similarity in PDBbind

In this post I will cover how to calculate sequence identity and Tanimoto similarity between any pair of complexes in PDBbind 2020. I used RDKit in Python for Tanimoto similarity and the MMseqs2 software for sequence identity calculations.

A few weeks back I wanted to cluster the protein-ligand complexes in PDBbind 2020, but to achieve this I first needed to precompute the sequence identity between all pairs of sequences in PDBbind, and the Tanimoto similarity between all pairs of ligands. PDBbind 2020 includes 19,443 complexes, but there are far fewer distinct ligands and proteins than that. Nevertheless, I kept things simple and calculated the similarities for all 19,443 × 19,443 pairs. Calculating the Tanimoto similarity is relatively easy thanks to the BulkTanimotoSimilarity function in RDKit. The following code should do the trick:

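import os

import numpy as np
from rdkit.Chem import AllChem, MolFromMol2File
from rdkit.DataStructs import BulkTanimotoSimilarity

# PDB IDs to process. Assumes one sub-directory per complex under data/,
# e.g. data/<pdb>/<pdb>_ligand.mol2 (adapt this to your own layout).
pdbs = sorted(d for d in os.listdir('data') if os.path.isdir(os.path.join('data', d)))

# Morgan fingerprint (radius 3) for every ligand
fps = []
for pdb in pdbs:
    mol = MolFromMol2File(os.path.join('data', pdb, f'{pdb}_ligand.mol2'))
    if mol is None:
        raise ValueError(f'RDKit could not parse the ligand of {pdb}')
    fps.append(AllChem.GetMorganFingerprint(mol, 3))

# One row of similarities per ligand, compared against all ligands
sims = [BulkTanimotoSimilarity(fp, fps) for fp in fps]

arr = np.array(sims)
np.savez_compressed('data/tanimoto_similarity.npz', arr)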
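
For intuition about what is being computed, here is a small pure-Python sketch of the textbook Tanimoto similarity on sets of "on" bits: the number of shared bits divided by the number of distinct bits. The function names and toy fingerprints are made up for illustration, and returning 0.0 for two empty fingerprints is an assumed convention rather than something taken from RDKit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two collections of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(union)


def pairwise_tanimoto(fingerprints):
    """Full similarity matrix as a list of lists."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # The measure is symmetric, so each pair is computed once and mirrored.
            matrix[i][j] = matrix[j][i] = tanimoto(fingerprints[i], fingerprints[j])
    return matrix


toy_fps = [{1, 2, 3}, {2, 3, 4}, {7}]
matrix = pairwise_tanimoto(toy_fps)
print(matrix[0][1])  # 0.5 (two shared bits out of four distinct bits)
```

Note that GetMorganFingerprint returns count-based fingerprints, so the numbers RDKit reports need not match this bit-set version exactly. It is also worth keeping in mind that a dense 19,443 × 19,443 matrix of 64-bit floats takes roughly 3 GB in memory before compression.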

Sequence identity calculations in Python with Biopandas turned out to be too slow for this amount of data, so I used the ultra-fast MMseqs2. The first step to running MMseqs2 is to create a .fasta file of all the sequences, which I call QUERY.fasta. This is what the first few lines look like:

Continue reading

An Overview of Clustering Algorithms

During the first 6 months of my DPhil, I worked on clustering antibodies and I thought I would share what I learned about these algorithms. Clustering is an unsupervised data analysis technique that groups a data set into subsets of similar data points. The main uses of clustering are in exploratory data analysis to find hidden patterns or data compression, e.g. when data points in a cluster can be treated as a group. Clustering algorithms have many applications in computational biology, such as clustering antibodies by structural similarity. Actually, this is objectively the most important application and I don’t see why anyone would use it for anything else.

There are several types of clustering algorithms that offer different advantages.
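
As a concrete example of what "grouping similar data points" means in practice, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm), one widely used clustering method. It is chosen purely for illustration; the function name, the toy points and the hand-picked starting centroids are all assumptions.

```python
from math import dist


def k_means(points, centroids, n_iter=10):
    """Lloyd's algorithm: alternate between assignment and centroid update."""
    centroids = list(centroids)
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave empty clusters in place
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids


points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

k-means needs the number of clusters up front and a meaningful notion of a "mean" data point, which is exactly the kind of trade-off that differs between algorithm families.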

Continue reading

PLIP on PDBbind with Python

Today’s blog post is about using PLIP to extract information about interactions between a protein and ligand in a bound complex, using data from PDBbind. The blog post will cover how to combine the protein pdb file and the ligand mol2 file into a single pdb file, and how to use PLIP in a high-throughput manner with Python.

In order for PLIP to consider the ligand as one molecule interacting with the protein, we need to modify the mol2 file of the ligand. The 8th column of the atom portion of a mol2 file (the portion starts with @<TRIPOS>ATOM) includes the ID of the ligand that the atom belongs to. Most often all the atoms have the same ligand ID, but for peptides for instance, the atoms have the ID of the residue they’re part of. The following code snippet will make the required changes:

ligand_file = 'data/5oxm/5oxm_ligand.mol2'

with open(ligand_file, 'r') as f:
    ligand_lines = f.readlines()

# Overwrite the ligand ID (characters 60-67) of every atom record so that
# all atoms share a single ID
mod = False
for i, line in enumerate(ligand_lines):
    if line == '@<TRIPOS>BOND\n':
        mod = False  # the atom section ends where the bond section starts

    if mod:
        ligand_lines[i] = line[:59] + 'ISK     ' + line[67:]

    if line == '@<TRIPOS>ATOM\n':
        mod = True  # atom records start on the next line

with open('data/5oxm/5oxm_ligand_mod.mol2', 'w') as g:
    g.writelines(ligand_lines)
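
To check whether a given ligand file needs this treatment at all, one option is to list the distinct values in the 8th column of the atom section. The sketch below does that on a made-up three-atom mol2 fragment. The helper name substructure_names and the toy coordinates are hypothetical, and it assumes the fields are separated by whitespace.

```python
def substructure_names(mol2_text):
    """Distinct values of the 8th column in the @<TRIPOS>ATOM section, in file order."""
    names = []
    in_atoms = False
    for line in mol2_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('@<TRIPOS>'):
            # Only the ATOM section is of interest; any other record type ends it.
            in_atoms = (stripped == '@<TRIPOS>ATOM')
            continue
        if in_atoms and stripped:
            fields = stripped.split()
            if len(fields) >= 8 and fields[7] not in names:
                names.append(fields[7])
    return names


# Made-up fragment of a peptide ligand: the atoms carry two different residue IDs.
toy_mol2 = """@<TRIPOS>MOLECULE
toy_peptide
@<TRIPOS>ATOM
      1 N           1.000    2.000    3.000 N.3     1  ALA1       -0.100
      2 CA          2.000    2.000    3.000 C.3     1  ALA1        0.050
      3 N           3.000    2.000    3.000 N.am    2  GLY2       -0.200
@<TRIPOS>BOND
     1     1     2    1
     2     2     3    1
"""

print(substructure_names(toy_mol2))  # ['ALA1', 'GLY2']
```

More than one name in the result means the atoms are spread over several IDs, which is the situation the snippet above is meant to fix.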
Continue reading

How to easily use pharmacophoric atom features to turn ECFPs into FCFPs

Today’s post builds on my earlier blogpost on how to turn a SMILES string into an extended-connectivity fingerprint using RDKit and describes an interesting and easily implementable modification of the extended-connectivity fingerprint (ECFP) featurisation. This modification is based on representing the atoms in the input compound at a different (and potentially more useful) level of abstraction.

We remember that each binary component of an ECFP indicates the presence or absence of a particular circular subgraph in the input compound. Circular subgraphs that are structurally isomorphic are further distinguished according to their inherited atom- and bond features, i.e. two structurally isomorphic circular subgraphs with distinct atom- or bond features correspond to different components of the ECFP. For chemical bonds, this distinction is made on the basis of simple bond types (single, double, triple, or aromatic). To distinguish atoms, standard ECFPs use seven features based on the Daylight atomic invariants [1]; but there is also another less commonly used and often overlooked version of the ECFP that uses pharmacophoric atom features instead [2]. Pharmacophoric atom features attempt to describe atomic properties that are critical for biological activity or binding to a target protein. These features try to capture the potential for important chemical interactions such as hydrogen bonding or ionic bonding. ECFPs that use pharmacophoric atom features instead of standard atom features are called functional-connectivity fingerprints (FCFPs). The exact sets of standard- vs. pharmacophoric atom features for ECFPs vs. FCFPs are listed in the table below.

In RDKit, ECFPs can be changed to FCFPs extremely easily by changing a single input argument. Below you can find a Python/RDKit implementation of a function that turns a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False.

# import packages
import numpy as np
from rdkit.Chem import AllChem

# define function that transforms a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False
def FCFP_from_smiles(smiles,
                     R = 2,
                     L = 2**10,
                     use_features = True,
                     use_chirality = False):
    """
    Inputs:
    
    - smiles ... SMILES string of input compound
    - R ... maximum radius of circular substructures
    - L ... fingerprint-length
    - use_features ... if true then use pharmacophoric atom features, if false then use standard DAYLIGHT atom features
    - use_chirality ... if true then append tetrahedral chirality flags to atom features
    
    Outputs:
    - np.array(feature_list) ... FCFP/ECFP with length L and maximum radius R
    """
    
    molecule = AllChem.MolFromSmiles(smiles)
    feature_list = AllChem.GetMorganFingerprintAsBitVect(molecule,
                                                         radius = R,
                                                         nBits = L,
                                                         useFeatures = use_features,
                                                         useChirality = use_chirality)
    return np.array(feature_list)

The use of pharmacophoric atom features makes FCFPs more specific to molecular interactions that drive biological activity. In certain molecular machine-learning applications, replacing ECFPs with FCFPs can therefore lead to increased performance and decreased learning time, as important high-level atomic properties are presented to the learning algorithm from the start and do not need to be inferred statistically. However, the standard atom features used in ECFPs contain more detailed low-level information that could potentially still be relevant for the prediction task at hand and thus be utilised by the learning algorithm. It is often unclear from the outset whether FCFPs will provide a substantial advantage over ECFPs in a given application; however, given how easy it is to switch between the two, it is almost always worth trying out both options.
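
To make the difference in abstraction level concrete, here is a toy pure-Python illustration that does not use RDKit. The atoms are annotated by hand, the class and function names are hypothetical, and the "standard" identifier uses only a subset of the Daylight-style invariants; real toolkits derive these properties from the molecular graph and hash them to integers.

```python
from typing import NamedTuple


class ToyAtom(NamedTuple):
    """Hand-annotated atom; a real toolkit would compute these from the structure."""
    atomic_number: int
    heavy_neighbours: int
    attached_hydrogens: int
    formal_charge: int
    donor: bool = False
    acceptor: bool = False
    aromatic: bool = False
    halogen: bool = False
    basic: bool = False
    acidic: bool = False


def standard_invariant(atom):
    """ECFP-style starting identifier: low-level description of the atom itself."""
    return (atom.atomic_number, atom.heavy_neighbours,
            atom.attached_hydrogens, atom.formal_charge)


def pharmacophoric_invariant(atom):
    """FCFP-style starting identifier: only the functional role of the atom."""
    return (atom.donor, atom.acceptor, atom.aromatic,
            atom.halogen, atom.basic, atom.acidic)


# A chlorine and a bromine substituent, each bonded to a single heavy atom.
chlorine = ToyAtom(atomic_number=17, heavy_neighbours=1, attached_hydrogens=0,
                   formal_charge=0, halogen=True)
bromine = ToyAtom(atomic_number=35, heavy_neighbours=1, attached_hydrogens=0,
                  formal_charge=0, halogen=True)

print(standard_invariant(chlorine) == standard_invariant(bromine))              # False
print(pharmacophoric_invariant(chlorine) == pharmacophoric_invariant(bromine))  # True
```

Under the standard features the two halogens start out as different atoms, whereas under the pharmacophoric features they are interchangeable. That is the sense in which FCFPs trade low-level detail for functional similarity.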

[1] Weininger, David, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for generation of unique SMILES notation.” Journal of Chemical Information and Computer Sciences 29.2 (1989): 97-101.

[2] Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 50.5 (2010): 742-754.

Datamining Wikipedia and writing JS with ChatGPT just to swap the colours on university logos…

I am not sure the University of Oxford logo works in the gold from the University of Otago…

A few months back I moved from the Oxford BRC to OPIG, both within the University of Oxford, but like many in academia I have moved across a few universities. As this is my first post here I wanted to do something neat: a JS tool that swapped colours in university logos!
It was a rather laborious task requiring a lot of coding, but once I got it working, I ended up tripping up at the last metre. So for technical reasons, I have resorted to hosting it in my own blog (see post), but nevertheless the path towards it is worth discussing.

Continue reading

histo.fyi: A Useful New Database of Peptide:Major Histocompatibility Complex (pMHC) Structures

pMHCs are set to become a major target class in drug discovery; unusual peptide fragments presented by MHC can be used to distinguish infected/cancerous cells from healthy cells more precisely than over-expressed biomarkers. In this blog post, I will highlight a prototype resource: Dr. Chris Thorpe’s new database of pMHC structures, histo.fyi.

histo.fyi provides a one-stop shop for data on (currently) around 1400 pMHC complexes. Similar to our dedicated databases for antibody/nanobody structures (SAbDab) and T-cell receptor (TCR) structures (STCRDab), histo.fyi will scrape the PDB on a weekly basis for any new pMHC data and process these structures in a way that facilitates their analysis.

Continue reading