This post is pretty much an ad for a very useful tool developed by GitHub that helps you find errors or vulnerabilities in your code by querying it as if it were data. I have personally found it very useful in finding small errors in my code and would recommend everyone to use it. If you want to check it out, this is their webpage.
Continue readingHow to turn a SMILES string into an extended-connectivity fingerprint using RDKit
After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).
ECFPs were originally described in a 2010 article of Rogers and Hahn [1] and still belong to the most popular and efficient methods to turn a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP-algorithm is dependent on two predefined hyperparameters: the fingerprint-length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bitvector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.
Continue readingUnreasonably faster notes, with command-line fuzzy search
A good note system should act like a second brain:
- Accessible in seconds
- Adding information should be frictionless
- Searching should be exhaustive – if it’s there, you must find it
The benefits of such a note system are immense – never forget anything again! Search, perform the magic ritual of Copy Paste, and rejoice in the wisdom of your tried and tested past.
But how? Through the unreasonable effectiveness of interactive fuzzy search. This is how I have used Fuz, a terminal-based file fuzzy finder, for about 4 years.
Briefly, Fuz extracts all text within a directory using ripgrep, enables interactive fuzzy search with FZF, and returns you the selected item. As you type, the search results get narrowed down to a few matches. Files are opened at the exact line you found. And it’s FAST – 100,000 lines in half a second fast.
Continue readingNaga101: A Guide to Getting Started with (OPIG) Slurm Servers
Over the past months, I’ve been working with a few new members of OPIG, which left me answering (and asking) lots of questions about working with Slurm. In this blog post, I will try to cover key, practical basics to interacting with servers that are set up on Slurm.
Over the past months, I’ve been working with a few new members of OPIG, which left me answering (and asking) lots of questions about working with Slurm. In this blog post, I will try to cover key, practical basics to interacting with servers that are set up on Slurm.
Slurm is a workload manager or job scheduler for Linux, meaning that it helps with allocating resources (eg CPUs and GPUs) on a server to users’ jobs.
To note, all of the commands and files shown here are run from a so-called ‘head’ node, from which you access Slurm servers.
1. Entering an interactive session
Unlike many other servers, you cannot access a Slurm server via ‘ssh’. Instead, you can enter an interactive (or ‘debug’) session – which, in OPIG, is limited to 30 minutes – via the srun command. This is incredibly useful for copying files, setting up environments and checking that your code runs.
srun -p servername-debug --pty --nodes=1 --ntasks-per-node=1 -t 00:30:00 --wait=0 /bin/bash
2. Submitting jobs
While the srun command is easy and helpful, many of the jobs we want to run on a server will take longer than the debug queue time limit. You can submit a job, which can then run for a longer (although typically still capped) time but is not interactive, via sbatch.
Continue readingCoarse-grained models of antibody solutions
Various coarse-grained (CG) models have become increasingly common in studies of antibody-antibody interactions in solution. These models appear poised to enter development pipelines in the near future to help predict and understand how antibody-antibody interactions influence the suitability of a given monoclonal antibody (mAb) for mass production and delivery as an antibody therapy. This blog post is a non-exhaustive summary of some of the highlights I found during a recent literature search.
Continue readingSupercharge Your Literature Review With These Tools
When starting a new project, conducting a literature review of the field can be one of the most daunting prospects. Not only do you need to get through a mountain of research papers, you also need to work out which mountain of papers to get through. You don’t want to start a project only to realise a few weeks (or months!) in that you missed a key paper which would have completely changed the course of your research. Luckily, there are now several handy tools which can help speed up this process.
Continue readingThinking of going to a conference
As so many members of the group have never attended an in-person conference, I thought it might be worth answering the question “why do people attend conferences?”
First- up, we should remember that flying around the world is not a zero cost to the planet, so all of us lucky enough to be able to travel should think hard every time before we choose to do so.
This means it’s really important to make sure that we know why we are going to any conference and maximise the benefits from attendance. Below are a few things to think about in terms of why you attend a conference and what to do when you are there, but this is definitely not a complete list, more a starter for four.
Continue readingAm I better? Performance metrics unravelled
What’s the deal with all these numbers? Accuracy, Precision, Recall, Sensitivity, AUC and ROCs.
The basic stuff:
Given a method that produces a numerical outcome either catagorical (classification) or continuous (regression), we want to know how well our method did. Let’s start simple:
True positives (TP): You said something was a cow and it was in fact a cow – duh.
False positives (FP): You said it was a cow and it wasn’t – sad.
True negative (TN): You said it was not a cow and it was not – good job.
False negative (FN): You said it was not a cow but it was a cow – do better.
I can optimise these metrics artificially. Just call everything a cow and I have a 100% true positive rate. We are usually interested in a trade-off, something like the relative value of metrics. This gives us:
Continue readingCode your own molecule sketcher in 4 easy steps!
Drawing molecules on your laptop usually requires access to proprietary software such as ChemDraw (link) or free websites such as PubChem’s online sketcher (link). However, if you are feeling adventurous, you can build your personal sketcher in React/Typescript using the Ketcher package!
Ketcher is an open-source package that allows easy implementation of a molecule sketcher into a web application. Unfortunately, it does require TypeScript so the script to run it cannot be imported directly into an HTML page. Therefore we will set up a simple React app to get it working.
The sketcher is very sleek and has a vast array of functionality, such as choosing any atom from the periodic table and being able to directly import molecules from either SMILES or Mol2/SDF file format into the sketcher. These molecules can then be edited and saved to a new file in the chemical file format of your choosing.
Continue readingHow to build a Python dictionary of residues for each molecule in PyMOL
Sometimes it can be handy to work with multiple structures in PyMOL using Python.
Here’s a snippet of code you might find useful: we iterate over all the α-carbon atoms in a protein and append to a list tuples such as (‘GLY’, 1). The dictionary, ‘reslist’, returns a list of residue names and indices for each molecule, where the key is a string containing the name of the molecule.
from pymol import cmd # Create a list of all the objects, called 'mpls': mols = cmd.get_object_list('*') # Create an empty dictionary that will return a list of residues # given the name of the molecule object reslist = {} # Set the dictionaries to be empty lists for m in mols: reslist[m] = [] # Use PyMOL's iterate command to go over every α-Carbon and append # a tuple consisting of the each residue's residue name ('resn') and # residue index ('resi '): for m in mols: cmd.iterate('%s and n. ca'%m, 'reslist["%s"].append((resn,int(resi)))'%m)
This script assumes you only have protein molecules loaded, and ignores things like chain ID and insertion codes.
Once you have your list of residues, you can use it with the cmd.align
command, e.g., to align a particular residue to a reference structure.