Category Archives: Hints and Tips

Unreasonably faster notes, with command-line fuzzy search

A good note system should act like a second brain:

  1. Accessible in seconds
  2. Adding information should be frictionless
  3. Searching should be exhaustive – if it’s there, you must find it

The benefits of such a note system are immense – never forget anything again! Search, perform the magic ritual of Copy Paste, and rejoice in the wisdom of your tried and tested past.

But how? Through the unreasonable effectiveness of interactive fuzzy search. This is how I have used Fuz, a terminal-based file fuzzy finder, for about 4 years.

Briefly, Fuz extracts all text within a directory using ripgrep, enables interactive fuzzy search with FZF, and returns you the selected item. As you type, the search results get narrowed down to a few matches. Files are opened at the exact line you found. And it’s FAST – 100,000 lines in half a second fast.
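To make the idea concrete, here is a minimal Python sketch of the same pipeline: gather every line under a directory, then keep the lines that fuzzily match a query. This is not how Fuz, ripgrep or FZF are implemented. The function names are made up for illustration, and the matching rule (query characters must appear in order, ignoring case) is a simplifying assumption.

```python
import os


def collect_lines(root):
    """Return (path, line_number, text) for every non-blank line under root."""
    records = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    lines = handle.read().splitlines()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for number, text in enumerate(lines, start=1):
                if text.strip():
                    records.append((path, number, text))
    return records


def is_fuzzy_match(query, text):
    """True if the characters of query occur in text in order (case-insensitive)."""
    remaining = iter(text.lower())
    # 'char in remaining' consumes the iterator up to the first match,
    # so each query character must be found after the previous one.
    return all(char in remaining for char in query.lower())


def fuzzy_search(query, records):
    """Keep only the records whose text fuzzily matches the query."""
    return [record for record in records if is_fuzzy_match(query, record[2])]
```

Calling fuzzy_search("seqparse", collect_lines("notes")) would return each matching line together with its file path and line number, which is the information needed to open the file at the right place.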

Using Fuz to quickly add a code-snippet in our note directory – then retrieving it with fuzzy-search. Here, on how to read FASTA files with Biopython, conveniently added to a file called biopython.py.
Continue reading

Naga101: A Guide to Getting Started with (OPIG) Slurm Servers

Over the past months, I’ve been working with a few new members of OPIG, which left me answering (and asking) lots of questions about working with Slurm. In this blog post, I will try to cover the key practical basics of interacting with servers that are set up on Slurm.

Slurm is a workload manager or job scheduler for Linux, meaning that it helps with allocating resources (e.g. CPUs and GPUs) on a server to users’ jobs.

To note, all of the commands and files shown here are run from a so-called ‘head’ node, from which you access Slurm servers.

1. Entering an interactive session

Unlike many other servers, you cannot access a Slurm server via ‘ssh’. Instead, you can enter an interactive (or ‘debug’) session – which, in OPIG, is limited to 30 minutes – via the srun command. This is incredibly useful for copying files, setting up environments and checking that your code runs.

srun -p servername-debug --pty --nodes=1 --ntasks-per-node=1 -t 00:30:00 --wait=0 /bin/bash

2. Submitting jobs

While the srun command is easy and helpful, many of the jobs we want to run on a server will take longer than the debug queue time limit. You can submit a job, which can then run for a longer (although typically still capped) time but is not interactive, via sbatch.
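As a rough illustration of what gets submitted, the sketch below builds a minimal batch script as text. The helper name, job name, partition name ("servername", mirroring the placeholder in the srun command above), time limit and command are all hypothetical. The #SBATCH lines use standard sbatch options, but the values your cluster accepts may differ.

```python
def make_sbatch_script(job_name, partition, time_limit, command):
    """Return the text of a minimal Slurm batch script (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",   # hypothetical partition name
        f"#SBATCH --time={time_limit}",       # HH:MM:SS wall-clock limit
        "#SBATCH --nodes=1",
        "#SBATCH --ntasks-per-node=1",
        "#SBATCH --output=slurm_%j.out",      # %j is replaced by the job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


script = make_sbatch_script("my_job", "servername", "12:00:00", "python my_script.py")
print(script)
```

Saved to a file such as job.sh, a script like this would be submitted from the head node with sbatch job.sh.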

Continue reading

Supercharge Your Literature Review With These Tools

When starting a new project, conducting a literature review of the field can be one of the most daunting prospects. Not only do you need to get through a mountain of research papers, you also need to work out which mountain of papers to get through. You don’t want to start a project only to realise a few weeks (or months!) in that you missed a key paper which would have completely changed the course of your research. Luckily, there are now several handy tools which can help speed up this process.

Continue reading

How to build a Python dictionary of residues for each molecule in PyMOL

Sometimes it can be handy to work with multiple structures in PyMOL using Python.

Here’s a snippet of code you might find useful: we iterate over all the α-carbon atoms in each loaded molecule and append tuples such as (‘GLY’, 1) to a list. The dictionary, ‘reslist’, maps the name of each molecule (a string) to its list of residue names and indices.

from pymol import cmd

# Create a list of all the objects, called 'mols':
mols = cmd.get_object_list('*')

# Create an empty dictionary that will return a list of residues
# given the name of the molecule object
reslist = {}

# Initialise each molecule's entry to an empty list
for m in mols:
    reslist[m] = []

# Use PyMOL's iterate command to go over every α-carbon and append
# a tuple consisting of each residue's name ('resn') and
# residue index ('resi'). Passing 'space' makes 'reslist' visible to
# the expression regardless of the namespace the script runs in.
for m in mols:
    cmd.iterate('%s and name CA' % m,
                'reslist["%s"].append((resn, int(resi)))' % m,
                space={'reslist': reslist})

This script assumes you only have protein molecules loaded, and ignores things like chain ID and insertion codes.

Once you have your list of residues, you can use it with the cmd.align command, e.g., to align a particular residue to a reference structure.
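As a small sketch of that last step, the helpers below look up residues in a dictionary shaped like ‘reslist’ and build a PyMOL selection string for one of them. The example data, molecule names and helper names are hypothetical. The code only builds strings, so it runs without PyMOL.

```python
# Hypothetical data in the same shape as 'reslist':
# molecule name -> list of (residue name, residue index) tuples
reslist = {
    "mol_a": [("MET", 1), ("GLY", 2)],
    "mol_b": [("GLY", 5), ("ALA", 6)],
}


def residues_named(reslist, mol, resn):
    """Return the indices of all residues called resn in molecule mol."""
    return [resi for name, resi in reslist[mol] if name == resn]


def ca_selection(mol, resi):
    """Build a PyMOL selection string for the α-carbon of one residue."""
    return "%s and resi %d and name CA" % (mol, resi)


glycines = residues_named(reslist, "mol_b", "GLY")
print(glycines)
print(ca_selection("mol_b", glycines[0]))
```

Selection strings like these can then be passed to PyMOL commands that accept selections, such as cmd.align.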

Automatic argument parsers for python

One of the recurrent problems I used to have when writing argument parsers is that, after refactoring code, I also had to change the argument parser options. Forgetting to do so generally led to inconsistencies between the arguments of the function and the options of the argument parser. The following example illustrates the problem:

def main(a,b):
  """
  This function adds together two numbers a and b
  param a: first number
  param b: second number
  """
  print(a+b)

if __name__ == "__main__":
  import argparse
  parser = argparse.ArgumentParser()
  parser.add_argument("--a", type=int, required=True, help="first number")
  parser.add_argument("--b", type=int, required=True, help="second number")
  args = parser.parse_args()
  main(**vars(args))

This code is nothing but a simple function that prints a+b, with an argument parser that asks for a and b. The perhaps not so obvious part is the invocation of the function, in which we have ** and vars. vars converts the argparse Namespace object args to a dictionary of the form {"a":1, "b":2}, and ** expands the dictionary to be used as keyword arguments for the function. So main(**{"a":1, "b":2}) is equivalent to main(a=1, b=2).
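The two steps can be checked in isolation with a short sketch. Here the command line is passed to parse_args as an explicit list so that the example runs anywhere, and main returns the sum instead of printing it, purely to make the comparison easy.

```python
import argparse


def main(a, b):
    """Add together two numbers a and b."""
    return a + b


parser = argparse.ArgumentParser()
parser.add_argument("--a", type=int, required=True, help="first number")
parser.add_argument("--b", type=int, required=True, help="second number")

# Equivalent to running the script with: --a 1 --b 2
args = parser.parse_args(["--a", "1", "--b", "2"])

as_dict = vars(args)  # Namespace -> plain dictionary
print(as_dict)        # {'a': 1, 'b': 2}

# Unpacking the dictionary is the same as passing keyword arguments
print(main(**as_dict) == main(a=1, b=2))  # True
```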

Let’s refactor the function so that we change the name of the argument a to num.
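A minimal sketch of what can go wrong, assuming the function is renamed but the parser is left untouched (the stale parser is my illustration of the mismatch, not code from the post):

```python
import argparse


def main(num, b):
    """Add together two numbers num and b (argument 'a' renamed to 'num')."""
    return num + b


# The parser was not updated during the refactor and still defines --a
parser = argparse.ArgumentParser()
parser.add_argument("--a", type=int, required=True, help="first number")
parser.add_argument("--b", type=int, required=True, help="second number")
args = parser.parse_args(["--a", "1", "--b", "2"])

try:
    main(**vars(args))  # passes a=1, b=2, but main no longer accepts 'a'
    failed = False
except TypeError as error:
    failed = True
    print("Call failed:", error)
```

The parser and the function signature are maintained separately, so nothing flags the mismatch until the script is actually run.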

Continue reading

Do you have cis peptide bonds in your simulation inputs?

People who run molecular simulations quickly become familiar with all of the things about a PDB file – missing residues, missing heavy atoms in residues, missing hydrogens, non-standard amino acids, multiple conformations, crystallization ligands, etc. – that might need to be fixed before setting up a simulation. This blog post is a reminder to check, after you have “fixed” your PDB, if you have accidentally introduced aberrant cis peptide bonds into your structure during rebuilding.

Continue reading

Words of Wisdom from final year PhD students

NB: These are entirely subjective so please ignore them all if you want.

1.     Write everything down in a searchable place 

Maybe you are gifted with a brilliant memory but, for the rest of us, write everything down (either in a notebook, or better yet, some kind of searchable typed document). This includes notes from supervisor meetings, industry meetings, clever suggestions over coffee, group meetings, etc… 

In our experience, writing things on paper is risky unless you have a decent filing system (see our desks for examples of how not to file notes). It also requires writing legibly. Typed notes are also particularly useful for saving common error messages/bug fixes/useful installation instructions/functions etc in one place so that you can easily search for them again! This can be just a Word document, or Gemma showed me “Notion”, which has so far been really useful (and you get to put emojis next to your notes).

This also leads to the second tip…

2.     Type up notes on papers you’ve read or use a reference manager 

Continue reading

Retrieving AlphaFold models from AlphaFoldDB

There are now nearly a million AlphaFold [1] protein structure predictions openly available via AlphaFoldDB [2]. This represents a huge set of new data that can be used for the development of new methods. The options for downloading structures are either in bulk (sorted by genome), or individually from the webpage for a prediction.

If you want just a few hundred or a few thousand specific structures, across different genomes, neither of these options is particularly practical. For example, if you have several thousand experimental structures whose PDB [3] codes you know, and you want to obtain the equivalent AlphaFold predictions, there is another way!

If we take the example of the PDB’s current molecule of the month, pyruvate kinase (PDB code 4FXF), this is how you can go about downloading the equivalent AlphaFold prediction programmatically.

  1. Query UniProt [4] for the corresponding accession number – an example python script is shown below:
Continue reading

Ten quick tips for proofreading your work

For my blog post, I thought I’d revisit my dark past on the other side of academic publishing when I worked as a copy editor and proofreader for two years between my undergrad and Masters. During this time, I worked primarily on review papers and news content. While I don’t claim to be a great writer or editor, I thought I’d share some easy tips to help refine your writing and make it more consistent. This is by no means an exhaustive list and probably most of them will already be familiar to you!

1. Consistency is key

I think two of the most important aspects of proofreading are ensuring consistency and using your common sense. For example, instead of agonizing over how you style a word, choose what you think is most appropriate and check that you’ve applied it consistently throughout the text. Check that style matches between the main text, headings, figure legends and footnotes. Some specific things to look for include the following:

  • Capitalization
    • If you’ve capitalized headings, has this been done throughout?
  • Italicization
    • E.g., have you italicized all your mentions of ‘in silico’? 
  • Superscript and subscript
    • E.g., is ‘half-maximal inhibitory concentration (IC50)’ the same throughout? 
  • Numbers
    • Have you mixed up numerical and spelled-out numbers?  
    • E.g., I drank five cups of coffee and 4 cups of tea. 
Continue reading

Tackling horizontal and vertical limitations

A blog post about reviewing papers and preparing papers for publication.

We start with the following premise: all papers have limitations. There is not a single paper without limitations. A method may not be generally applicable, a result may not be completely justified by the data or a theory may make restrictive assumptions. To cover all limitations would make a paper infinitely long, so we must stop somewhere.

Many limitations fall into the following scenario: the results or methods are presented, but they could have been extended in some way. Suppose we obtain results on a particular cell type using an immortalized cell line. Are the results still true if we perform the experiments on primary or patient-derived cells? If the signal from the original cells was sufficiently robust then we would hope so. However, we cannot be one hundred percent sure. A similar example is a method that can be applied to a certain type of data. It may be possible to extend the method to other data types, but this may require some new methodology. I call this flavor of limitations vertical limitations. They are vertical in the sense that they build upon an already developed result in the manuscript. Certain journals will require that you tackle vertical limitations by adapting the original idea or method to demonstrate broad appeal, or to show that the idea could permeate multiple fields. Most of the time, however, the premise of an approach is not to keep extending it. It works. Leave it alone. Do not ask for more. An idea done well does not need more.

Continue reading