This post is pretty much an ad for a very useful tool developed by GitHub that helps you find errors or vulnerabilities in your code by querying it as if it were data. I have personally found it very useful for finding small errors in my code and would recommend it to everyone. If you want to check it out, this is their webpage.
Category Archives: Code
How to turn a SMILES string into an extended-connectivity fingerprint using RDKit
After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors, I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).
ECFPs were originally described in a 2010 article by Rogers and Hahn [1] and remain among the most popular and efficient methods for turning a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP algorithm depends on two predefined hyperparameters: the fingerprint length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bit vector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.
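To make the roles of L and R concrete, here is a deliberately simplified, pure-Python sketch of an ECFP-style fingerprint. It is not the RDKit implementation and not the exact algorithm of Rogers and Hahn: the function names `toy_ecfp` and `stable_hash`, the atom description (element plus number of neighbours), and the use of CRC32 as the hash are all assumptions made for illustration, and bond orders and duplicate-substructure removal are ignored.

```python
import zlib


def stable_hash(obj):
    """Deterministic 32-bit hash (Python's built-in hash() is salted for strings)."""
    return zlib.crc32(repr(obj).encode("utf-8"))


def toy_ecfp(atoms, bonds, length=64, max_radius=2):
    """Toy circular fingerprint.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom-index pairs
    length: number of bits L; max_radius: maximum radius R
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # Radius 0: each atom is described by its element and its number of neighbours.
    identifiers = [stable_hash((atoms[i], len(neighbours[i])))
                   for i in range(len(atoms))]
    found = set(identifiers)

    # Radius 1..R: combine each atom's identifier with those of its neighbours,
    # so the new identifier describes a circular substructure one bond larger.
    for radius in range(1, max_radius + 1):
        identifiers = [
            stable_hash((radius, identifiers[i],
                         tuple(sorted(identifiers[n] for n in neighbours[i]))))
            for i in range(len(atoms))
        ]
        found.update(identifiers)

    # Fold every substructure identifier into one of the L bit positions.
    bits = [0] * length
    for identifier in found:
        bits[identifier % length] = 1
    return bits


# Heavy atoms of ethanol, written with two different atom numberings.
ethanol = toy_ecfp(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_renumbered = toy_ecfp(["O", "C", "C"], [(0, 1), (1, 2)])
print(sum(ethanol), "bits set out of", len(ethanol))
```

Because neighbour identifiers are sorted before hashing, the result does not depend on how the atoms are numbered. Two different substructures can still land on the same bit, which is why a larger L gives fewer collisions.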
Unreasonably faster notes, with command-line fuzzy search
A good note system should act like a second brain:
- Accessible in seconds
- Adding information should be frictionless
- Searching should be exhaustive – if it’s there, you must find it
The benefits of such a note system are immense – never forget anything again! Search, perform the magic ritual of Copy Paste, and rejoice in the wisdom of your tried and tested past.
But how? Through the unreasonable effectiveness of interactive fuzzy search. This is how I have used Fuz, a terminal-based file fuzzy finder, for about 4 years.
Briefly, Fuz extracts all text within a directory using ripgrep, enables interactive fuzzy search with FZF, and returns you the selected item. As you type, the search results get narrowed down to a few matches. Files are opened at the exact line you found. And it’s FAST – 100,000 lines in half a second fast.
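The narrowing behaviour can be illustrated with a tiny Python sketch. This is not how Fuz, ripgrep or FZF are implemented: it only shows the core idea of fuzzy matching, where a line matches if the typed characters appear in it in order. The function names and the example notes are made up for illustration, and real fuzzy finders also rank the matches, which this sketch does not.

```python
def is_fuzzy_match(query, text):
    """True if the characters of query appear in text in the same order."""
    remaining = iter(text.lower())
    # "ch in remaining" consumes the iterator up to the match, so order is enforced.
    return all(ch in remaining for ch in query.lower())


def fuzzy_filter(query, records):
    """Keep the (path, line_number, text) records whose text matches the query."""
    return [record for record in records if is_fuzzy_match(query, record[2])]


# Hypothetical extracted lines: (file path, line number, line content).
notes = [
    ("notes/git.md", 12, "git rebase --interactive"),
    ("notes/ssh.md", 3, "ssh-keygen -t ed25519"),
    ("notes/git.md", 40, "git bisect start"),
]

for typed in ("gi", "gtrb"):
    print(typed, "->", fuzzy_filter(typed, notes))
```

Keeping the path and line number alongside each line is what lets the selected result be opened at the exact line.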
How to build a Python dictionary of residues for each molecule in PyMOL
Sometimes it can be handy to work with multiple structures in PyMOL using Python.
Here’s a snippet of code you might find useful: we iterate over all the α-carbon atoms in a protein and append tuples such as (‘GLY’, 1) to a list. The dictionary, ‘reslist’, maps the name of each molecule (a string) to a list of that molecule’s residue names and indices.
from pymol import cmd

# Create a list of all the objects, called 'mols':
mols = cmd.get_object_list('*')

# Create an empty dictionary that will return a list of residues
# given the name of the molecule object
reslist = {}

# Set the dictionary values to be empty lists
for m in mols:
    reslist[m] = []

# Use PyMOL's iterate command to go over every α-carbon and append
# a tuple consisting of each residue's residue name ('resn') and
# residue index ('resi'). Passing reslist via 'space' makes the
# dictionary visible to the expression that iterate evaluates.
for m in mols:
    cmd.iterate('%s and n. ca' % m,
                'reslist["%s"].append((resn, int(resi)))' % m,
                space={'reslist': reslist})
This script assumes you only have protein molecules loaded, and ignores things like chain ID and insertion codes.
Once you have your list of residues, you can use it with the cmd.align command, e.g., to align a particular residue to a reference structure.
Running code that fails with style
We have all been there, working on code that continuously fails while staring at a dull and colorless command-line. However, we are in luck, as there is a way to make the constant error messages look less depressing. By changing our shell to one which enables a colorful themed command-line and fancy features like automatic text completion and web search, your code won’t just fail with ease, but also with style!
A shell is your command-line interpreter, meaning you use it to process commands and output results of the command-line. The shell therefore also holds the power to add a little zest to the command-line. The most well-known shell is bash, which comes pre-installed on most UNIX systems. However, there exist many different shells, all with different pros and cons. The one we will focus on is called Z Shell or zsh for short.
Zsh was initially only for UNIX and UNIX-like systems, but its popularity has made it accessible on most systems now. Like bash, zsh is extremely customizable, and their syntax is so similar that most bash commands will work in zsh. The benefit of zsh is that it comes with additional features, plugins and options, and open-source frameworks with large communities. The framework which we will look into is called Oh My Zsh.
How to make your own singularity container zero fuss!
In this blog post, I’ll show you guys how to make your own shiny container for your tool! Zero fuss(*) and in FOUR simple steps.
As an example, I will show how to make a singularity container for one of our public tools, ANARCI, the antibody numbering tool everyone in OPIG and external users are familiar with – If not, check the web app and the GitHub repo here and here.
(*) Provided you have your own Linux machine with sudo permissions; otherwise, you can’t do it – sorry. Same if you have a Mac or Windows – sorry again.
BUT, there are workarounds for these cases such as using the remote singularity builder here, for which you only need to sign up and create an account, and the use of Virtual Machines (VMs), as described here.
Identify Silly Molecules
Automatic argument parsers for python
One of the recurrent problems I used to have when writing argument parsers is that after refactoring code, I also had to change the argument parser options which generally led to inconsistency between the arguments of the function and some of the options of the argument parser. The following example can illustrate the problem:
def main(a, b):
    """
    This function adds together two numbers a and b
    param a: first number
    param b: second number
    """
    print(a + b)

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--a", type=int, required=True, help="first number")
    parser.add_argument("--b", type=int, required=True, help="second number")
    args = parser.parse_args()
    main(**vars(args))
This code is nothing but a simple function that prints a+b, and the argument parser asks for a and b. The perhaps not so obvious part is the invocation of the function, in which we have ** and vars. vars converts the argparse.Namespace object args to a dictionary of the form {"a":1, "b":2}, and ** expands the dictionary to be used as keyword arguments for the function. So main(**{"a":1, "b":2}) is equivalent to main(a=1, b=2).
Let’s refactor the function so that we change the name of the argument a to num.
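The inconsistency this rename introduces can be reproduced in a few lines. The sketch below is only an illustration: the helper `build_parser` is a hypothetical name, the command-line values are passed as an explicit list so that it runs without a terminal, and `main` returns the sum instead of printing it.

```python
import argparse


def build_parser():
    # The parser is left untouched by the refactoring: it still defines --a.
    parser = argparse.ArgumentParser()
    parser.add_argument("--a", type=int, required=True, help="first number")
    parser.add_argument("--b", type=int, required=True, help="second number")
    return parser


def main(num, b):
    """Refactored function: the first argument is now called num instead of a."""
    return num + b


# Simulate running the script with "--a 1 --b 2".
args = build_parser().parse_args(["--a", "1", "--b", "2"])
kwargs = vars(args)  # the Namespace as a plain dictionary with keys "a" and "b"

try:
    outcome = main(**kwargs)  # same as main(a=1, b=2), but main has no parameter a
except TypeError as err:
    outcome = f"TypeError: {err}"

print(outcome)
```

The call fails because the dictionary keys come from the parser while the parameter names come from the function, and nothing keeps the two in sync.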
Retrieving AlphaFold models from AlphaFoldDB
There are now nearly a million AlphaFold [1] protein structure predictions openly available via AlphaFoldDB [2]. This represents a huge set of new data that can be used for the development of new methods. The options for downloading structures are either in bulk (sorted by genome), or individually from the webpage for a prediction.
If you want just a few hundred or a few thousand specific structures across different genomes, neither of these options is particularly practical. Fortunately, there is another way, which helps if, for example, you have several thousand experimental structures with known PDB [3] codes and want to obtain the equivalent AlphaFold predictions.
If we take the example of the PDB’s current molecule of the month, pyruvate kinase (PDB code 4FXF), this is how you can go about downloading the equivalent AlphaFold prediction programmatically.
- Query UniProt [4] for the corresponding accession number – an example python script is shown below:
Filtering molecules with long linkers
Recently I was tasked with filtering out ‘stringy’ molecules that were being produced with the fragment merging method I’m working on (that is, molecules with lots of consecutive non-ring bonds that weren’t necessarily caught with my rotatable bond filter). While this is quite a niche/specific task, through this I discovered a couple of RDKit functions that I wasn’t previously aware of but might be helpful for other people regularly looking at small molecules. The demo adapts code from this helpful blogpost on cutting a molecule into rings and linkers from ‘Is life worth living?’ (which is a useful source of cheminformatics wisdom; https://iwatobipen.wordpress.com/2020/01/23/cut-molecule-to-ring-and-linker-with-rdkit-rdkit-chemoinformatics-memo/). Obviously in practice you may be applying lots of different filters to enumerated molecules, but this is just a small example of something I found useful.
The Jupyter Notebook can be found at:
https://github.com/stephwills/Demo-removing-stringy-molecules/blob/main/Molecule%20filter.ipynb
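Separately from the notebook, and without RDKit, the idea of "lots of consecutive non-ring bonds" can be sketched on a bare molecular graph. This is not the notebook's code: the function names, the graph representation (atom indices and bond pairs) and the example molecules are assumptions made for illustration. A bond is treated as a ring bond if its two atoms stay connected when the bond is ignored, and the score is the longest path made only of non-ring bonds.

```python
from collections import deque


def build_adjacency(n_atoms, bonds):
    adjacency = {i: set() for i in range(n_atoms)}
    for i, j in bonds:
        adjacency[i].add(j)
        adjacency[j].add(i)
    return adjacency


def is_ring_bond(adjacency, i, j):
    """True if atoms i and j are still connected when the bond i-j is ignored."""
    queue, visited = deque([i]), {i}
    while queue:
        atom = queue.popleft()
        for nbr in adjacency[atom]:
            if (atom, nbr) in ((i, j), (j, i)) or nbr in visited:
                continue
            if nbr == j:
                return True
            visited.add(nbr)
            queue.append(nbr)
    return False


def farthest(adjacency, start):
    """Breadth-first search returning (farthest atom, distance in bonds)."""
    distances = {start: 0}
    queue = deque([start])
    last = start
    while queue:
        atom = queue.popleft()
        last = atom
        for nbr in adjacency[atom]:
            if nbr not in distances:
                distances[nbr] = distances[atom] + 1
                queue.append(nbr)
    return last, distances[last]


def longest_non_ring_chain(n_atoms, bonds):
    """Length, in bonds, of the longest path that uses only non-ring bonds."""
    adjacency = build_adjacency(n_atoms, bonds)
    chain_bonds = [(i, j) for i, j in bonds if not is_ring_bond(adjacency, i, j)]
    chain_adjacency = build_adjacency(n_atoms, chain_bonds)
    longest = 0
    # Non-ring bonds form a forest, so two searches per start atom find the
    # longest path in that atom's tree.
    for atom in range(n_atoms):
        end, _ = farthest(chain_adjacency, atom)
        _, length = farthest(chain_adjacency, end)
        longest = max(longest, length)
    return longest


# Heavy-atom graphs: a six-membered ring, alone and with a four-carbon chain.
benzene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
butylbenzene = benzene + [(0, 6), (6, 7), (7, 8), (8, 9)]
print(longest_non_ring_chain(6, benzene), longest_non_ring_chain(10, butylbenzene))
# prints: 0 4
```

A filter would then reject molecules whose score exceeds a chosen cutoff; what cutoff is sensible depends on the molecules being enumerated.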
Happy coding,
Steph