Tag Archives: machine learning

Understanding positional encoding in Transformers

Transformers are a very popular architecture in machine learning. While they were first introduced in natural language processing, they have been applied to many fields such as protein folding and design.
Transformers were first introduced in the excellent paper Attention is all you need by Vaswani et al. The paper describes the key elements, including multiheaded attention, and how they come together to create a sequence to sequence model for language translation. The key advance in Attention is all you need is the replacement of all recurrent layers with pure attention + fully connected blocks. Attention is very efficeint to compute and allows for fast comparisons over long distances within a sequence.
One issue, however, is that attention does not natively include a notion of position within a sequence. This means that all tokens could be scrambled and would produce the same result. To overcome this, one can explicitely add a positional encoding to each token. Ideally, such a positional encoding should reflect the relative distance between tokens when computing the query/key comparison such that closer tokens are attended to more than futher tokens. In Attention is all you need, Vaswani et al. propose the slightly mysterious sinusoidal positional encodings which are simply added to the token embeddings:

Continue reading

Current strategies to predict structures of multiple protein conformational states

Since the release of AlphaFold2 (AF2), the problem of protein structure prediction is widely believed to be solved. Current structure prediction tools, such as AF2, are able to model most proteins with high accuracy. These methods, however, have a major limitation as they have been trained to predict a single structure for a given protein. Proteins are highly dynamic molecules, and their function often depends on transitions between several conformational states. Despite research focusing on the task of predicting the structures of multiple conformations of a protein, currently, no accurate and reliable method is available. In this blog post, I will provide a short overview of the strategies developed for predicting protein conformations. I have grouped these into three sets of related approaches. To conclude, I will also demonstrate how to run one of these strategies on your own.

Continue reading

Machine learning strategies to overcome limited data availability

Machine learning (ML) for biological/biomedical applications is very challenging – in large part due to limitations in publicly available data (something we recently published about [1]). Substantial amounts of time and resources may be required to generate the types of data (eg protein structures, protein-protein binding affinity, microscopy images, gene expression values) required to train ML models, however.

In cases where there is sufficient data available to provide signal, but not enough for the desired performance, ML strategies can be employed:

Continue reading

An Overview of Clustering Algorithms

During the first 6 months of my DPhil, I worked on clustering antibodies and I thought I would share what I learned about these algorithms. Clustering is an unsupervised data analysis technique that groups a data set into subsets of similar data points. The main uses of clustering are in exploratory data analysis to find hidden patterns or data compression, e.g. when data points in a cluster can be treated as a group. Clustering algorithms have many applications in computational biology, such as clustering antibodies by structural similarity. Actually, this is objectively the most important application and I don’t see why anyone would use it for anything else.

There are several types of clustering algorithms that offer different advantages.

Continue reading

How to easily use pharmacophoric atom features to turn ECFPs into FCFPs

Today’s post builds on my earlier blogpost on how to turn a SMILES string into an extended-connectivity fingerprint using RDKit and describes an interesting and easily implementable modification of the extended-connectivity fingerprint (ECFP) featurisation. This modification is based on representing the atoms in the input compound at a different (and potentially more useful) level of abstraction.

We remember that each binary component of an ECFP indicates the presence or absence of a particular circular subgraph in the input compound. Circular subgraphs that are structurally isomorphic are further distinguished according to their inherited atom- and bond features, i.e. two structurally isomorphic circular subgraphs with distinct atom- or bond features correspond to different components of the ECFP. For chemical bonds, this distinction is made on the basis of simple bond types (single, double, triple, or aromatic). To distinguish atoms, standard ECFPs use seven features based on the Daylight atomic invariants [1]; but there is also another less commonly used and often overlooked version of the ECFP that uses pharmacophoric atom features instead [2]. Pharmacophoric atom features attempt to describe atomic properties that are critical for biological activity or binding to a target protein. These features try to capture the potential for important chemical interactions such as hydrogen bonding or ionic bonding. ECFPs that use pharmacophoric atom features instead of standard atom features are called functional-connectivity fingerprints (FCFPs). The exact sets of standard- vs. pharmacophoric atom features for ECFPs vs. FCFPs are listed in the table below.

In RDKit, ECFPs can be changed to FCFPs extremely easily by changing a single input argument. Below you can find a Python/RDKit implementation of a function that turns a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False.

# import packages
import numpy as np
from rdkit.Chem import AllChem

# define function that transforms a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False
def FCFP_from_smiles(smiles,
                     R = 2,
                     L = 2**10,
                     use_features = True,
                     use_chirality = False):
    """
    Inputs:
    
    - smiles ... SMILES string of input compound
    - R ... maximum radius of circular substructures
    - L ... fingerprint-length
    - use_features ... if true then use pharmacophoric atom features, if false then use standard DAYLIGHT atom features
    - use_chirality ... if true then append tetrahedral chirality flags to atom features
    
    Outputs:
    - np.array(feature_list) ... FCFP/ECFP with length L and maximum radius R
    """
    
    molecule = AllChem.MolFromSmiles(smiles)
    feature_list = AllChem.GetMorganFingerprintAsBitVect(molecule,
                                                         radius = R,
                                                         nBits = L,
                                                         useFeatures = use_features,
                                                         useChirality = use_chirality)
    return np.array(feature_list)

The use of pharmacophoric atom features makes FCFPs more specific to molecular interactions that drive biological activity. In certain molecular machine-learning applications, replacing ECFPs with FCFPs can therefore lead to increased performance and decreased learning time, as important high-level atomic properties are presented to the learning algorithm from the start and do not need to be inferred statistically. However, the standard atom features used in ECFPs contain more detailed low-level information that could potentially still be relevant for the prediction task at hand and thus be utilised by the learning algorithm. It is often unclear from the outset whether FCFPs will provide a substantial advantage over ECFPs in a given application; however, given how easy it is to switch between the two, it is almost always worth trying out both options.

[1] Weininger, David, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for generation of unique SMILES notation.” Journal of Chemical Information and Computer Sciences 29.2 (1989): 97-101.

[2] Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 50.5 (2010): 742-754.

A ChatGPT rap battle

The AI chatbot revolution is here. Last week, OpenAI released ChatGPT, a freely accessible language model fine-tuned for human conversations. The new model is based on InstructGPT, trained especially for following user instructions and with human feedback in the training loop. 

ChatGPT remembers the previous discussion, admits its mistakes and can even ask for clarification on ambiguous questions. It is also trained to refuse answering questions it deems inappropriate or goes against OpenAI’s AI alignment policy.

In the meanwhile, the internet is having immense fun circumventing its safety filters by asking it to only “PRETEND to be evil”, making it take SAT tests, and even simulating an entire virtual computer within its neural weights. Some are even using it to replace Google searches, and it excels at writing bioinformatics code across most programming languages.

Continue reading

Graphormer: Merging GNNs and Transformers for Cheminformatics

This is my first OPIG blog! I’m going to start with a summary of the Graphormer, a Graph Neural Network (GNN) that borrows concepts from Transformers to boost performance on graph tasks. This post is largely based on the NeurIPS paper Do Transformers Really Perform Bad for Graph Representation? by Ying et. al., which introduces the Graphormer, and which we read for our last deep learning journal club. The project has now been integrated as a Microsoft Research project.

I’ll start with a cheap and cheerful summary of Transformers and GNNs before diving into the changes in the Graphormer. Enjoy!

Continue reading

Universal graph pooling for GNNs

Graph neural networks (GNNs) have quickly become one of the most important tools in computational chemistry and molecular machine learning. GNNs are a type of deep learning architecture designed for the adaptive extraction of vectorial features directly from graph-shaped input data, such as low-level molecular graphs. The feature-extraction mechanism of most modern GNNs can be decomposed into two phases:

  • Message-passing: In this phase the node feature vectors of the graph are iteratively updated following a trainable local neighbourhood-aggregation scheme often referred to as message-passing. Each iteration delivers a set of updated node feature vectors which is then imagined to form a new “layer” on top of all the previous sets of node feature vectors.
  • Global graph pooling: After a sufficient number of layers has been computed, the updated node feature vectors are used to generate a single vectorial representation of the entire graph. This step is known as global graph readout or global graph pooling. Usually only the top layer (i.e. the final set of updated node feature vectors) is used for global graph pooling, but variations of this are possible that involve all computed graph layers and even the set of initial node feature vectors. Commonly employed global graph pooling strategies include taking the sum or the average of the node features in the top graph layer.

While a lot of research attention has been focused on designing novel and more powerful message-passing schemes for GNNs, the global graph pooling step has often been treated with relative neglect. As mentioned in my previous post on the issues of GNNs, I believe this to be problematic. Naive global pooling methods (such as simply summing up all final node feature vectors) can potentially form dangerous information bottlenecks within the neural graph learning pipeline. In the worst case, such information bottlenecks pose the risk of largely cancelling out the information signal delivered by the message-passing step, no matter how sophisticated the message-passing scheme.

Continue reading

Exploring topological fingerprints in RDKit

Finding a way to express the similarity of irregular and discrete molecular graphs to enable quantitative algorithmic reasoning in chemical space is a fundamental problem in data-driven small molecule drug discovery.

Virtually all algorithms that are widely and successfully used in this setting boil down to extracting and comparing (multi-)sets of subgraphs, differing only in the space of substructures they consider and the extent to which they are able to adapt to specific downstream applications.

A large body of recent work has explored approaches centred around graph neural networks (GNNs), which can often maximise both of these considerations. However, the subgraph-derived embeddings learned by these algorithms may not always perform well beyond the specific datasets they are trained on and for many generic or resource-constrained applications more traditional “non-parametric” topological fingerprints may still be a viable and often preferable choice .

This blog post gives an overview of the topological fingerprint algorithms implemented in RDKit. In general, they count the occurrences of a certain family of subgraphs in a given molecule and then represent this set/multiset as a bit/count vector, which can be compared to other fingerprints with the Jaccard/Dice similarity metric or further processed by other algorithms.

Continue reading