Category Archives: Python

Cleaning outliers in conductance timeseries from molecular dynamics

Have you ever had an annoying dataset that looks something like this?

or, even worse, several of them?

In this blog post, I will introduce basic techniques that you can implement in Python to identify and clean outliers. The objective is to get something more eye-pleasing (and, above all, less troublesome for further data analysis) like this
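As a minimal sketch of one such basic technique (not necessarily the one used in the full post), the snippet below flags outliers with the modified z-score, which is built on the median and the median absolute deviation (MAD), and replaces them with the median of the remaining points. The function names, the threshold of 3.5 and the conductance values are all assumptions made up for illustration.

```python
from statistics import median


def mad_outlier_mask(values, threshold=3.5):
    """Flag points whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All points (or at least half of them) are identical: flag nothing
        return [False] * len(values)
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


def replace_outliers(values, threshold=3.5):
    """Replace flagged points with the median of the unflagged points."""
    mask = mad_outlier_mask(values, threshold)
    fill = median(v for v, bad in zip(values, mask) if not bad)
    return [fill if bad else v for v, bad in zip(values, mask)]


# Made-up conductance trace with two obvious spikes
trace = [1.0, 1.1, 0.9, 1.0, 25.0, 1.05, 0.95, -18.0, 1.0, 1.1]
mask = mad_outlier_mask(trace)
cleaned = replace_outliers(trace)
print(mask)
print(cleaned)
```

Median-based statistics are used here because the spikes themselves would inflate a mean and standard deviation; for a drifting signal, the same test could be applied within a sliding window instead of over the whole trace.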

Continue reading

How to turn a SMILES string into an extended-connectivity fingerprint using RDKit

After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors, I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).

ECFPs were originally described in a 2010 article by Rogers and Hahn [1] and remain among the most popular and efficient methods to turn a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP algorithm depends on two predefined hyperparameters: the fingerprint length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bit vector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.
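To make the roles of L and R concrete, here is a deliberately simplified toy sketch in pure Python. It is not the real ECFP algorithm and not RDKit's implementation: it ignores bond orders, uses a crude made-up identifier for each circular substructure, and does no duplicate removal. It only illustrates the two ideas above: every atom is the center of circular substructures of radius 0 to R, and each substructure is hashed to one of L bit positions. All names in it are hypothetical.

```python
import hashlib
from collections import deque


def atoms_within_radius(adjacency, center, radius):
    """Return sorted indices of atoms at most `radius` bonds from `center`."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if distance[atom] == radius:
            continue  # do not expand beyond the requested radius
        for neighbour in adjacency[atom]:
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    return sorted(distance)


def toy_fingerprint(symbols, adjacency, length, max_radius):
    """Set one bit per (center atom, radius) circular substructure."""
    bits = [0] * length
    for center in range(len(symbols)):
        for radius in range(max_radius + 1):
            env = atoms_within_radius(adjacency, center, radius)
            # Toy identifier: center element, radius, elements in the environment
            elements = "".join(sorted(symbols[a] for a in env))
            label = f"{symbols[center]}|{radius}|{elements}"
            digest = hashlib.sha256(label.encode()).digest()
            # Fold the hashed identifier into one of `length` positions
            bits[int.from_bytes(digest[:8], "big") % length] = 1
    return bits


# Heavy-atom graph of ethylamine (C-C-N), written out by hand
symbols = ["C", "C", "N"]
adjacency = {0: [1], 1: [0, 2], 2: [1]}

print(atoms_within_radius(adjacency, 2, 1))  # substructure around N, radius 1
fp = toy_fingerprint(symbols, adjacency, length=16, max_radius=2)
print(fp)
```

Because different substructures can be folded onto the same position, a small L leads to bit collisions; this is why fingerprint lengths in practice are usually much larger than 16.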

Continue reading

How to build a Python dictionary of residues for each molecule in PyMOL

Sometimes it can be handy to work with multiple structures in PyMOL using Python.

Here's a snippet of code you might find useful: we iterate over all the α-carbon atoms in a protein and append tuples such as ('GLY', 1) to a list. The dictionary 'reslist' maps the name of each molecule (a string) to a list of residue names and indices for that molecule.

from pymol import cmd

# Create a list of all the objects, called 'mols':
mols = cmd.get_object_list('*')

# Create an empty dictionary that will return a list of residues
# given the name of the molecule object
reslist = {}

# Set each dictionary value to an empty list
for m in mols:
    reslist[m] = []

# Use PyMOL's iterate command to go over every α-carbon and append
# a tuple consisting of each residue's name ('resn') and
# residue index ('resi'):
for m in mols:
    cmd.iterate('%s and n. ca' % m, 'reslist["%s"].append((resn,int(resi)))' % m)

This script assumes you only have protein molecules loaded, and ignores things like chain ID and insertion codes.

Once you have your list of residues, you can use it with the cmd.align command, e.g., to align a particular residue to a reference structure.
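As a small sketch of that last step, the helper below turns one (resn, resi) entry into a PyMOL selection string. The helper name, the object names 'ref' and 'mobile', and the residue lists are hypothetical; inside a PyMOL session the two strings could then be passed to cmd.align.

```python
def residue_selection(mol, residue):
    """Build a PyMOL selection string for one residue of one object."""
    resn, resi = residue
    return f"{mol} and resn {resn} and resi {resi}"


# Hypothetical contents of reslist for two loaded objects
reslist = {
    "ref": [("MET", 1), ("GLY", 2)],
    "mobile": [("MET", 1), ("GLY", 2)],
}

target = residue_selection("ref", reslist["ref"][1])
mobile = residue_selection("mobile", reslist["mobile"][1])
print(mobile)  # mobile and resn GLY and resi 2

# Inside PyMOL (not run here): cmd.align(mobile, target)
```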

Automatic argument parsers for python

One of the recurrent problems I used to have when writing argument parsers is that, after refactoring code, I also had to change the argument parser options. This often led to inconsistencies between the arguments of the function and the options of the argument parser. The following example illustrates the problem:

def main(a,b):
  """
  This function adds together two numbers a and b
  param a: first number
  param b: second number
  """
  print(a+b)

if __name__ == "__main__":
  import argparse
  parser = argparse.ArgumentParser()
  parser.add_argument("--a", type=int, required=True, help="first number")
  parser.add_argument("--b", type=int, required=True, help="second number")
  args = parser.parse_args()
  main(**vars(args))

This code is nothing but a simple function that prints a+b, together with an argument parser that asks for a and b. The perhaps less obvious part is the invocation of the function, which uses ** and vars. vars converts the argparse.Namespace object args into a dictionary of the form {"a": 1, "b": 2}, and ** unpacks the dictionary into keyword arguments for the function. So main(**{"a": 1, "b": 2}) is equivalent to main(a=1, b=2).
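The vars and ** mechanics can be checked in isolation. In this short sketch, the argument list is passed explicitly to parse_args instead of being read from the command line, and main returns the sum rather than printing it, purely so that the result is easy to inspect.

```python
import argparse


def main(a, b):
    """Return the sum of two numbers a and b."""
    return a + b


parser = argparse.ArgumentParser()
parser.add_argument("--a", type=int, required=True, help="first number")
parser.add_argument("--b", type=int, required=True, help="second number")

# Parse an explicit list so the example does not depend on sys.argv
args = parser.parse_args(["--a", "1", "--b", "2"])

kwargs = vars(args)
print(type(args).__name__)  # Namespace
print(kwargs)               # {'a': 1, 'b': 2}
print(main(**kwargs) == main(a=1, b=2))  # True
```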

Let’s refactor the function so that we change the name of the argument a to num.
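One possible way to keep the parser and the function in sync after such a rename, sketched here under my own assumptions and not necessarily the approach taken in the full post, is to build the parser from the function signature with the standard inspect module. The helper name parser_from_signature is hypothetical, and the sketch only handles simple parameters with type annotations that can be called on a string.

```python
import argparse
import inspect


def main(num: int, b: int):
    """Add together two numbers num and b."""
    return num + b


def parser_from_signature(func):
    """Create one --option per parameter of `func` (simple types only)."""
    parser = argparse.ArgumentParser(description=func.__doc__)
    for name, param in inspect.signature(func).parameters.items():
        # Fall back to str when the parameter has no annotation
        has_type = param.annotation is not inspect.Parameter.empty
        arg_type = param.annotation if has_type else str
        if param.default is inspect.Parameter.empty:
            parser.add_argument(f"--{name}", type=arg_type, required=True)
        else:
            parser.add_argument(f"--{name}", type=arg_type, default=param.default)
    return parser


parser = parser_from_signature(main)
args = parser.parse_args(["--num", "1", "--b", "2"])
result = main(**vars(args))
print(result)  # 3
```

Renaming a parameter of main now changes the command-line option automatically, so the two cannot drift apart.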

Continue reading

Retrieving AlphaFold models from AlphaFoldDB

There are now nearly a million AlphaFold [1] protein structure predictions openly available via AlphaFoldDB [2]. This represents a huge set of new data that can be used for the development of new methods. The options for downloading structures are either in bulk (sorted by genome), or individually from the webpage for a prediction.

If you want just a few hundred or a few thousand specific structures, across different genomes, neither of these options is particularly practical. For example, if you have the PDB [3] codes of several thousand experimental structures and want to obtain the equivalent AlphaFold predictions, there is another way!

If we take the example of the PDB’s current molecule of the month, pyruvate kinase (PDB code 4FXF), this is how you can go about downloading the equivalent AlphaFold prediction programmatically.

  1. Query UniProt [4] for the corresponding accession number – an example Python script is shown below:
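As a rough sketch of where this is heading: once a UniProt accession is known, the download URL of the prediction can be assembled from it. The file-naming pattern, the model version and the fragment number used below are assumptions that should be checked against AlphaFoldDB, the function names are hypothetical, and 'P12345' is only a placeholder accession, not the one for 4FXF.

```python
import urllib.request


def alphafold_model_url(accession, version=4, fragment=1):
    """Assemble the assumed AlphaFoldDB URL for a UniProt accession."""
    return (
        "https://alphafold.ebi.ac.uk/files/"
        f"AF-{accession}-F{fragment}-model_v{version}.pdb"
    )


def download_model(accession, out_path, version=4):
    """Download the predicted structure to `out_path` (needs network access)."""
    url = alphafold_model_url(accession, version=version)
    urllib.request.urlretrieve(url, out_path)
    return out_path


# Placeholder accession; no download is triggered here
print(alphafold_model_url("P12345"))
```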
Continue reading

Meeko: Docking straight from SMILES string

When docking with software like AutoDock Vina, you must prepare your ligand by protonating the molecule, generating 3D coordinates, and converting it to a specific file format (in the case of Vina, PDBQT). Docking software typically needs the protein and ligand inputs to be written to disk as files. This is limiting, as generating tens of thousands of files for a large virtual screen can be annoying and can slow down your docking.

Fortunately, the Forli group at Scripps Research has developed a Python package, Meeko, to prepare ligands directly from SMILES or other molecule formats for docking with AutoDock 4 or Vina, without writing any files to disk. This means you can dock directly from a single file containing all the SMILES of the ligands you are investigating!

Continue reading

Visualise with Weights and Biases

Understanding what’s going on when you’ve started training your shiny new ML model is hard enough. Will it work? Have I got the right parameters? Is it the data? Probably. Any tool that can help with that process is a godsend. Weights and Biases is a great tool to help you visualise and track your model throughout your production cycle. In this blog post, I’m going to detail some basics on how you can initialise and use it to visualise your next project.

Installation

To use Weights and Biases (wandb), you need to make an account. It is free for individuals; team-oriented features, however, are paid. wandb can then be installed using pip or conda.

$ conda install -c conda-forge wandb

or

$ pip install wandb

To initialise your project, import the package, sign in, and then run the following command with your chosen project name (and, optionally, your username):

import wandb

wandb.login()

wandb.init(project='project1')

In addition to your project, you can also initialise a config dictionary with starting parameter values:

Continue reading

Making better plots with matplotlib.pyplot in Python3

The default plots made by Python’s matplotlib.pyplot module are almost always insufficient for publication. With ~20 extra lines of code, however, you can generate high-quality plots suitable for inclusion in your next article.

Let’s start with code for a very default plot:

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
d1 = np.random.normal(1.0, 0.1, 1000)
d2 = np.random.normal(3.0, 0.1, 1000)
xvals = np.arange(1, 1000+1, 1)

plt.plot(xvals, d1, label='data1')
plt.plot(xvals, d2, label='data2')
plt.legend(loc='best')
plt.xlabel('Time, ns')
plt.ylabel('RMSD, Angstroms')
plt.savefig('bad.png', dpi=300)

The result of this will be:

Plot generated with matplotlib.pyplot defaults

The fake data I generated for the plot look something like Root Mean Square Deviation (RMSD) versus time for a converged molecular dynamics simulation, so let’s pretend they are. There are a number of problems with this plot: it is ugly overall, the color scheme is not very attractive and may not be color-blind friendly, the y-axis range of the data extends outside the range of the tick labels, etc.

We can easily convert this to a much better plot:
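As one hedged sketch of what such an improvement might look like (my own choices, not necessarily those of the full post), the version below addresses the problems listed above: it uses two colors from the Okabe-Ito color-blind-friendly palette, sets axis limits and ticks that enclose the data, and removes some visual clutter. The figure size, font sizes and output filename are arbitrary assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render to file without needing a display
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
d1 = np.random.normal(1.0, 0.1, 1000)
d2 = np.random.normal(3.0, 0.1, 1000)
xvals = np.arange(1, 1000 + 1, 1)

fig, ax = plt.subplots(figsize=(6, 4))

# Blue and vermillion from the Okabe-Ito color-blind-friendly palette
ax.plot(xvals, d1, color="#0072B2", linewidth=1.0, label="data1")
ax.plot(xvals, d2, color="#D55E00", linewidth=1.0, label="data2")

# Axis limits and ticks that fully enclose the data
ax.set_xlim(0, 1000)
ax.set_ylim(0.0, 4.0)
ax.set_yticks(np.arange(0.0, 4.5, 1.0))

ax.set_xlabel("Time (ns)", fontsize=12)
ax.set_ylabel("RMSD (Å)", fontsize=12)
ax.tick_params(direction="in", labelsize=10)

# Remove the top and right spines and the legend frame to reduce clutter
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
ax.legend(loc="center right", frameon=False, fontsize=10)

fig.tight_layout()
fig.savefig("better.png", dpi=300)
```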

Continue reading

How to turn a SMILES string into a vector of molecular descriptors using RDKit

Molecular descriptors are quantities associated with small molecules that specify physical or chemical properties of interest. They can be used to numerically describe many different aspects of a molecule such as:

  • molecular graph structure,
  • lipophilicity (logP),
  • molecular refractivity,
  • electrotopological state,
  • druglikeness,
  • fragment profile,
  • molecular charge,
  • molecular surface.

Vectors whose components are molecular descriptors can be used (amongst other things) as high-level feature representations for molecular machine learning. In my experience, molecular descriptor vectors tend to fall slightly short of more low-level molecular representation methods such as extended-connectivity fingerprints or graph neural networks when it comes to predictive performance on large and medium-sized molecular property prediction data sets. However, one advantage of molecular descriptor vectors is their interpretability; there is a reasonable chance that the meaning of a physicochemical descriptor can be intuitively understood by a chemical expert.

A wide variety of useful molecular descriptors can be automatically and easily computed via RDKit purely on the basis of the SMILES string of a molecule. Here is a code snippet to illustrate how this works:

Continue reading