Monthly Archives: February 2022

How to Blopig

Blopig has a wealth of knowledge, with everything from a Bayesian answer to the question “should ketchup be stored in the fridge?“* to the Nobel-Prize-Winner-approved analysis of AlphaFold2. Blopig runs on WordPress and uses blocks, components for adding different types of content to a post. These are blocks like paragraphs, headers, images, image galleries, and videos. Here are some hints and tips for getting the most out of WordPress.

One of the first blocks worth mentioning is the “Read More…” block…

Continue reading

How to turn a SMILES string into a molecular graph for Pytorch Geometric

Despite some of their technical issues, graph neural networks (GNNs) are quickly being adopted as one of the state-of-the-art methods for molecular property prediction. The differentiable extraction of molecular features from low-level molecular graphs has become a viable (although not always superior) alternative to classical molecular representation techniques such as Morgan fingerprints and molecular descriptor vectors.

But molecular data usually comes in the sequential form of labeled SMILES strings. It is not obvious for beginners how to optimally transform a SMILES string into a structured molecular graph object that can be used as an input for a GNN. In this post, we show how to convert a SMILES string into a molecular graph object which can subsequently be used for graph-based machine learning. We do so within the framework of Pytorch Geometric which currently is one of the best and most commonly used Python-based GNN-libraries.

We divide our task into three high-level steps:

  1. We define a function that maps an RDKit atom object to a suitable atom feature vector.
  2. We define a function that maps an RDKit bond object to a suitable bond feature vector.
  3. We define a function that takes as its input a list of SMILES strings and associated labels and then uses the functions from 1.) and 2.) to create a list of labeled Pytorch Geometric graph objects as its output.
Continue reading

NeurIPS 2021 Conference Feedback

Held annually in December, the Neural Information Processing Systems meetings aim to encourage researchers using machine learning techniques in their work – whether it be in economics, physics, or any number of fields – to get together to discuss their findings, hear from world-leading experts, and in many years past, ski. The virtual nature of this year’s conference had an enormously negative impact on attendees’ skiing experiences, but it nevertheless was a pleasure to attend – the machine learning in structural biology workshop, in particular, provided a useful overview of the hottest topics in the field, and of the methods that people are using to tackle them. 

This year’s NeurIPS highlighted the growing interest in applying the newest Natural Language Processing (NLP) algorithms on proteins. This includes antibodies, as seen by two presentations in the MLSB workshop, which focused on using these algorithms for the discovery and design of antibodies. Ruffolo et al. presented their version of a BERT-inspired language model for antibodies. The purpose of such a model is to create representations that encapsulate all information of an antibody sequence, which can then be used to predict antibody properties. In their work, they showed how the representations could be used to predict high-redundancy sequences (a proxy for strong binders) and how continuous trajectories consistent with the number of mutations could be observed when using umap on the representations. While such representations can be used to predict properties of antibodies, another work by Shuai et al. instead focused on training a generative language model for antibodies, able to generate a region in an antibody based on the rest of the antibody. This can then potentially be used to generate new viable CDR regions of variable length, better than randomly mutating them.

Continue reading

Python’s Data Classes

When writing code, you have inevitably needed to store data throughout your pipeline. In these cases you store your value, list or data frame as a variable to easily use it elsewhere in your code. However, sometimes your data has an awkward form, consisting of a number of different length lists or data of different types and sizes. While it is still doable to work with, and using tuples or dictionaries can help, accessing different elements in your data quickly becomes messy and it is less intuitive what your code is actually doing.

To solve the above stated problem, data classes were introduced as a new feature in Python 3.7. A data class is a regular Python class, but with certain methods already implemented for you. This makes them easy to create and removes a lot of boilerplate (repeated code) making them simpler, more intuitive and pretty. Further, as data classes are part of the standard library, you can directly import it without needing to install any external dependencies (noice).

With the sales pitch out of the way, let us look at how we can use data classes.

from dataclasses import dataclass
from typing import Any

@dataclass
class Antibody:
    vgene: str
    jgene: None
    sequence: Any = 'EVQ'
Continue reading

A quantitative way to measure targeted protein degradation

Whenever we order consumables in the Chemistry department, the whole lab gets an email notification once they arrive. So I can understand why I got some puzzled reactions from my colleagues when one such email arrived saying that my ‘artichoke’ was ready to collect from stores. Had I been sneakily doing my grocery shopping on a university research budget?

Artichoke is, in fact, the name of a plasmid designed by the Ebert lab (https://www.addgene.org/73320/), which I have been using in some of my research on targeted protein degradation. The premise is simple enough: genes for two different fluorescent proteins, one of which is fused to a protein-of-interest.

Continue reading