Category Archives: Python
Automatic argument parsers for python
One of the recurrent problems I used to have when writing argument parsers is that after refactoring code, I also had to change the argument parser options which generally led to inconsistency between the arguments of the function and some of the options of the argument parser. The following example can illustrate the problem:
def main(a,b): """ This function adds together two numbers a and b param a: first number param b: second number """ print(a+b) if __name__ == "__main__": import argparse parser = argparse.ArgumentParser() parser.add_argument("--a", type=int, required=True, help="first number") parser.add_argument("--b", type=int, required=True, help="second number") args = parser.parse_args() main(**vars(args))
This code is nothing but a simple function that prints a+b and the argument parser asks for a and b. The perhaps not so obvious part is the invocation of the function in which we have ** and vars
. vars
converts the named tuple args
to a dictionary of the form {“a":1, "b":2}
, and ** expands the dictionary to be used as arguments for the function. So if you have main(**{"a":1, "b":2})
it is equivalent to main(a=1, b=2)
.
Let’s refactor the function so that we change the name of the argument a to num.
Continue readingRetrieving AlphaFold models from AlphaFoldDB
There are now nearly a million AlphaFold [1] protein structure predictions openly available via AlphaFoldDB [2]. This represents a huge set of new data that can be used for the development of new methods. The options for downloading structures are either in bulk (sorted by genome), or individually from the webpage for a prediction.
If you want just a few hundred or a few thousand specific structures, across different genomes, neither of these options are particularly practical. For example, if you have several thousand experimental structures for which you have their PDB [3] code, and you want to obtain the equivalent AlphaFold predictions, there is another way!
If we take the example of the PDB’s current molecule of the month, pyruvate kinase (PDB code 4FXF), this is how you can go about downloading the equivalent AlphaFold prediction programmatically.
- Query UniProt [4] for the corresponding accession number – an example python script is shown below:
Meeko: Docking straight from SMILES string
When docking, using software like AutoDock Vina, you must prepare your ligand by protonating the molecule, generating 3D coordinates, and converting it to a specific file format (in the case of Vina, PDBQT). Docking software typically needs the protein and ligand file inputs to be written on disk. This is limiting as generating 10,000s of files for a large virtual screen can be annoying and hinder the speed at which you dock.
Fortunately, the Forli group in Scripps Research have developed a Python package, Meeko, to prepare ligands directly from SMILES or other molecule formats for docking to AutoDock 4 or Vina, without writing any files to disk. This means you can dock directly from a single file containing all the SMILES of the ligands you are investigating!
Continue readingVisualise with Weight and Biases
Understanding what’s going on when you’ve started training your shiny new ML model is hard enough. Will it work? Have I got the right parameters? Is it the data? Probably. Any tool that can help with that process is a Godsend. Weights and biases is a great tool to help you visualise and track your model throughout your production cycle. In this blog post, I’m going to detail some basics on how you can initialise and use it to visualise your next project.
Installation
To use weights and biases (wandb), you need to make an account. For individuals it is free, however, for team-oriented features, you will have to pay. Wandb can then be installed using pip or conda.
$ conda install -c conda-forge wandb or $ pip install wandb
To initialise your project, import the package, sign in, and then use the following command using your chosen project name and username (if you want):
import wandb wandb.login() wandb.init(project='project1')
In addition to your project, you can also initialise a config dictionary with starting parameter values:
Continue readingMaking better plots with matplotlib.pyplot in Python3
The default plots made by Python’s matplotlib.pyplot module are almost always insufficient for publication. With a ~20 extra lines of code, however, you can generate high-quality plots suitable for inclusion in your next article.
Let’s start with code for a very default plot:
import matplotlib.pyplot as plt import numpy as np np.random.seed(1) d1 = np.random.normal(1.0, 0.1, 1000) d2 = np.random.normal(3.0, 0.1, 1000) xvals = np.arange(1, 1000+1, 1) plt.plot(xvals, d1, label='data1') plt.plot(xvals, d2, label='data2') plt.legend(loc='best') plt.xlabel('Time, ns') plt.ylabel('RMSD, Angstroms') plt.savefig('bad.png', dpi=300)
The result of this will be:
![](https://i0.wp.com/www.blopig.com/blog/wp-content/uploads/2022/06/bad.png?resize=625%2C417&ssl=1)
The fake data I generated for the plot look something like Root Mean Square Deviation (RMSD) versus time for a converged molecular dynamics simulation, so let’s pretend they are. There are a number of problems with this plot: it’s overall ugly, the color scheme is not very attractive and may not be color-blind friendly, the y-axis range of the data extends outside the range of the tick labels, etc.
We can easily convert this to a much better plot:
Continue readingHow to turn a SMILES string into a vector of molecular descriptors using RDKit
Molecular descriptors are quantities associated with small molecules that specify physical or chemical properties of interest. They can be used to numerically describe many different aspects of a molecule such as:
- molecular graph structure,
- lipophilicity (logP),
- molecular refractivity,
- electrotopological state,
- druglikeness,
- fragment profile,
- molecular charge,
- molecular surface,
- …
Vectors whose components are molecular descriptors can be used (amongst other things) as high-level feature representations for molecular machine learning. In my experience, molecular descriptor vectors tend to fall slightly short of more low-level molecular representation methods such as extended-connectivity fingerprints or graph neural networks when it comes to predictive performance on large and medium-sized molecular property prediction data sets. However, one advantage of molecular descriptor vectors is their interpretability; there is a reasonable chance that the meaning of a physicochemical descriptor can be intuitively understood by a chemical expert.
A wide variety of useful molecular descriptors can be automatically and easily computed via RDKit purely on the basis of the SMILES string of a molecule. Here is a code snippet to illustrate how this works:
Continue readingMake your code do more, with less
When you wrangle data for a living, you start to wonder why everything takes so darn long. Through five years of introspection, I have come to conclude that two simple factors limit every computational project. One is, of course, your personal productivity. Your time of focused work, minus distractions (and yes, meetings figure here), times your energy and mental acuity. All those things you have little control over, unfortunately. But the second is the productivity of your code and tools. And this, in principle, is a variable that you have full control over.
![](https://i0.wp.com/www.blopig.com/blog/wp-content/uploads/2022/05/image.png?resize=625%2C35&ssl=1)
This is a post about how to increase your productivity, by helping you navigate all those instances when the progress bar does not seem to go fast enough. I want to discuss actionable tools to make your code run faster, and generate more results, with less effort, in less time. Instructions to tinker less and think more, so you can do the science that you truly want to be doing. And, above all, I want to give out advice that is so counter-intuitive that you should absolutely consider following it.
Continue readingBetter Models Through Molecular Standardization
“Cheminformatics is hard.”
— Paul Finn
I would add: “Chemistry is nuanced”… Just as there are many different ways of drawing the same molecule, SMILES is flexible enough to allow us to write the same molecule in different ways. While canonical SMILES can resolve this problem, we sometimes have different problem. In some situations, e.g., in machine learning, we need to map all these variants back to the same molecule. We also need to make sure we clean up our input molecules and eliminate invalid or incomplete structures.
Different Versions of the Same Molecule: Salt, Neutral or Charged?
Sometimes, a chemical supplier or compound vendor provides a salt of the compound, e.g., sodium acetate, but all we care about is the organic anion, i.e., the acetate. Very often, our models are built on the assumption we have only one molecule as input—but a salt will appear as two molecules (the sodium ion and the acetate ion). We might also have been given just the negatively-charged acetate instead of the neutral acetic acid.
Tautomers
Another important chemical phenomenon exists where apparently different molecules with identical heavy atoms and a nearby hydrogen can be easily interconverted: tautomers. By moving just one hydrogen atom and exchanging adjacent bond orders, the molecule can convert from one form to another. Usually, one tautomeric form is most stable. Warfarin, a blood-thinning drug, can exist in solution in 40 distinct tautomeric forms. A famous example is keto-enol tautomerism: for example, ethenol (not ethanol) can interconvert with the ketone form. When one form is more stable than the other form(s), we need to make sure we convert the less stable form(s) into the most stable form. Ethenol, a.k.a. vinyl alcohol, (SMILES: ‘C=CO[H]’), will be more stable when it is in the ketone form (SMILES: ‘CC(=O)([H])’):
from IPython.display import SVG # to use Scalar Vector Graphics (SVG) not bitmaps, for cleaner lines import rdkit from rdkit import Chem from rdkit.Chem import AllChem from rdkit.Chem import Draw # to draw molecules from rdkit.Chem.Draw import IPythonConsole # to draw inline in iPython from rdkit.Chem import rdDepictor # to generate 2D depictions of molecules from rdkit.Chem.Draw import rdMolDraw2D # to draw 2D molecules using vectors AllChem.ReactionFromSmarts('[C:1]-[C:2](-[O:3]-[H:4])>>[C:1]-[C:2](=[O:3])(-[H:4])')Continue reading
Einops: Powerful library for tensor operations in deep learning
Tobias and I recently gave a talk at the OPIG retreat on tips for using PyTorch. For this we created a tutorial on Google Colab notebook (link can be found here). I remember rambling about the advantages of implementing your own models against using other peoples code. Well If I convinced you, einops is for you!!
Basically, einops lets you perform operations on tensors using the Einstein Notation. This package comes with a number of advantages a few of which I will try and summarise here:
Continue reading