Category Archives: Hints and Tips

SSH, the boss-fight level: Jupyter notebooks from compute nodes

Secure shell (SSH) is an essential tool for remote operations. However, not everything with it is smooth-sailing. Especially, when you want to do things like reverse–port-forwarding via a proxy-hump or two a Jupyter notebook to your local machine from a compute node on a no-home container . Even if it sounds less plausible than the exploits on Mr Robot, it actually can work and requires zero social-engineering or sneaking in server rooms to install Raspberry Pis while using a baseball cap as a disguise.

Continue reading

Placeholder compounds: distraction vs. accuracy

When showcasing an approach in computational chemistry, an example molecule is required as a placeholder. But which to chose from? I would classify there different approaches: choosing a recognisable molecules, a top selling drugs, or a randomly sketched compound.

At a recent conference, Sheffield Cheminformatics 2023, I saw examples of all three and one problem I had that some placeholders distracted me into searching to figure out what it was.

Continue reading

The dangers of Conda-Pack and OpenMM

If you are running lots of little jobs in SLURM and want to make use of free nodes that suddenly become available, it is helpful to have a way of rapidly shipping your environments that does not rely on installing conda or rebuilding the environment from scratch every time. This is useful with complex rebuilds where exported .yml files do not always work as expected, even when specifying exact versions and source locations.

In these situations a tool such a conda-pack becomes incredibly useful. Once you have perfected the house of cards that is your conda environment, you can use conda-pack to save that exact state as a tar.gz file.

conda-pack -n my_precious_env -o my_precious_env.tar.gz

This can provide you with a backup to be used when you accidentally delete conda from your system, or if you irreparable corrupt the environment and cannot roll back to the point in time when everything worked. These tar.gz files can also be copied to distant locations by the use of rsync or scp, unpacked, sourced and used without installing conda…

Continue reading

A match made in heaven: academic writing with latex and git

Alternative titles:

  • A match made in heavenhell: academic writing with latex and git
  • Procrastinating writing by over-engineering my workflow

If you are like me, you can happily write code for hours and hours on end but as soon as you need to write a paper you end up staring at a blank page. Luckily, I have come up with a fool proof way to trick myself into thinking I am coding when in reality I am finalling getting around to writing up the work my supervisor has been wanting for the last month. Introducing Latex and git- this was my approach to draft a review paper recently and in this blopig post I will go through some of the ups and downs I had using these tools.

Continue reading

Exploring the Observed Antibody Space (OAS)

The Observed Antibody Space (OAS) [1,2] is an amazing resource for investigating observed antibodies or as a resource for training antibody specific models, however; its size (over 2.4 billion unpaired and 1.5 million paired antibody sequences as of June 2023) can make it painful to work with. Additionally, OAS is extremely information rich, having nearly 100 columns for each antibody heavy or light chain, further complicating how to handle the data. 

From spending a lot of time working with OAS, I wanted to share a few tricks and insights, which I hope will reduce the pain and increase the joy of working with OAS!

Continue reading

Pairwise sequence identity and Tanimoto similarity in PDBbind

In this post I will cover how to calculate sequence identity and Tanimoto similarity between any pairs of complexes in PDBbind 2020. I used RDKit in python for Tanimoto similarity and the MMseqs2 software for sequence identity calculations.

A few weeks back I wanted to cluster the protein-ligand complexes in PDBbind 2020, but to achieve this I first needed to precompute the sequence identity between all pairs sequences in PDBbind, and Tanimoto similarity between all pairs of ligands. PDBbind 2020 includes 19.443 complexes but there are much fewer distinct ligands and proteins than that. However, I kept things simple and calculated the similarities for all 19.443*19.443 pairs. Calculating the Tanimoto similarity is relatively easy thanks to the BulkTanimotoSimilarity function in RDKit. The following code should do the trick:

from rdkit.Chem import AllChem, MolFromMol2File
from rdkit.DataStructs import BulkTanimotoSimilarity
import numpy as np
import os

fps = []
for pdb in pdbs:
    mol = MolFromMol2File(os.path.join('data', pdb, f'{pdb}_ligand.mol2'))
    fps.append(AllChem.GetMorganFingerprint(mol, 3))

sims = []
for i in range(len(fps)):
    sims.append(BulkTanimotoSimilarity(fps[i],fps))

arr = np.array(sims)
np.savez_compressed('data/tanimoto_similarity.npz', arr)

Sequence identity calculations in python with Biopandas turned out to be too slow for this amount of data so I used the ultra fast MMseqs2. The first step to running MMseqs2 is to create a .fasta file of all the sequences, which I call QUERY.fasta. This is what the first few lines look like:

Continue reading

Streamlining Your Terminal Commands With Custom Bash Functions and Aliases

If you’ve ever found yourself typing out the same long commands over and over again, or if you’ve ever wished you could teleport directly to your favourite directories, then this post is for you.

Before we jump into some useful examples, let’s go over what bash functions and aliases are, and how to set them up.

Bash Functions vs Aliases

A bash function is like a mini script stored in your .bashrc or .bash_profile file. It can accept arguments, execute a series of commands, and even return a value.

Continue reading

Unclear documentation? ChatGPT can help!

The PyMOL Python API is a useful resource for most people doing research in OPIG, whether focussed on antibodies, small molecule drug design or protein folding. However, the documentation is poorly structured and difficult to interpret without first having understood the structure of the module. In particular, the differences between use of the PyMOL command line and the API can be unclear, leading to a much longer debugging process for code than you’d like.

While I’m reluctant to continue the recent theme of ChatGPT-related posts, this is a use for ChatGPT that would have been incredibly useful to me when I was first getting to grips with the PyMOL API.

Continue reading

Writing a BLOPIG Post With ChatGPT: A Personal Take on Using AI for Assisted Writing

Disclaimer: I used ChatGPT to improve the writing style of this article, in combination with some personal curation before obtaining a final version.

You’ve probably heard it all already, from ChatGPT writing code and doing proofreading for you to a rap battle between OPIG’s Antibodies and Small Molecules groups, and more.

Whether you like it or not, ChaGPT has unleashed people’s creative side regarding applications and attempts to find shortcuts. Questionable? Absolutely!

In this BLOPIG post, I show how I used ChatGPT to easily write a post summarising some material of my own intellectual property, which I presented as part of my group meeting talk. Mainly, I list some personal thoughts on the ethical concerns around using ChatGPT to assist your writing.

To start off, I passed on content from my own publication draft to ChatGPT, asking to generate a blog post in plain English for BLOPIG. The outcome:

Not bad.

But, it made me realise a number of things:

  • With great power comes great responsibility [Uncle Ben – Spiderman].
    You are responsible for the ethics that go into using ChatGPT. Are you faking expertise? Are you being actually lazy or just being efficient? Think twice (or many more times) if you’re doing the right thing.
  • It can significantly reduce the number of writing iterations but don’t take it at face value.
    Can you actually trust the plain output? No.
    Never take its output as the ground truth, as Large Language Models such as ChatGPT often produce biased writing outputs.
    Keep in mind that whatever you produce as a scientist will be picked up by others, and prone to drive misinformation, if incorrect. It is OK to reduce mechanical iterations, but it’s NOT OK to skip quality control.
  • Be open about it.
    You don’t want to set the wrong example for your colleagues. So, mention if you use it, how you used it, and it is fine to encourage efficiency, but not incentivising a culture of scientific misconduct and plagiarism. Don’t skip the step of producing quality ideas on your own. This is such a concern that publishers like Elsevier have already reacted by publishing guidelines contemplating this possibility. While Nature Springer is working on ways to spot AI-generated outputs.

The bottom line

What are the dos and don’ts of using ChatGPT?

Yes, use it to have fun. Yes, use it to proofread or polish your writing. Yes, use it to summarise your own ideas. No, don’t use it to do the analysis and interpretation of your results. No, don’t copy and paste its direct output into your publication. No, don’t hide that you used it. Finally, NO, you can’t add ChatGPT as a contributing author!