Category Archives: Hints and Tips

Writing a BLOPIG Post With ChatGPT: A Personal Take on Using AI for Assisted Writing

Disclaimer: I used ChatGPT to improve the writing style of this article, in combination with some personal curation before obtaining a final version.

You’ve probably heard it all already, from ChatGPT writing code and doing proofreading for you to a rap battle between OPIG’s Antibodies and Small Molecules groups, and more.

Whether you like it or not, ChatGPT has unleashed people’s creative side regarding applications and attempts to find shortcuts. Questionable? Absolutely!

In this BLOPIG post, I show how I used ChatGPT to easily write a post summarising some material of my own intellectual property, which I presented as part of my group meeting talk. Mainly, I list some personal thoughts on the ethical concerns around using ChatGPT to assist your writing.

To start off, I passed content from my own publication draft to ChatGPT, asking it to generate a blog post in plain English for BLOPIG. The outcome:

Not bad.

But, it made me realise a number of things:

  • With great power comes great responsibility [Uncle Ben – Spider-Man].
    You are responsible for the ethics that go into using ChatGPT. Are you faking expertise? Are you actually being lazy or just being efficient? Think twice (or many more times) about whether you’re doing the right thing.
  • It can significantly reduce the number of writing iterations but don’t take it at face value.
    Can you actually trust the plain output? No.
    Never take its output as the ground truth, as Large Language Models such as ChatGPT often produce biased writing outputs.
    Keep in mind that whatever you produce as a scientist will be picked up by others and, if incorrect, is liable to spread misinformation. It is OK to reduce mechanical iterations, but it’s NOT OK to skip quality control.
  • Be open about it.
    You don’t want to set the wrong example for your colleagues. So mention if you used it and how you used it. It is fine to encourage efficiency, but not to incentivise a culture of scientific misconduct and plagiarism. Don’t skip the step of producing quality ideas on your own. This is such a concern that publishers like Elsevier have already reacted by publishing guidelines that address this possibility, while Springer Nature is working on ways to spot AI-generated outputs.

The bottom line

What are the dos and don’ts of using ChatGPT?

Yes, use it to have fun. Yes, use it to proofread or polish your writing. Yes, use it to summarise your own ideas. No, don’t use it to do the analysis and interpretation of your results. No, don’t copy and paste its direct output into your publication. No, don’t hide that you used it. Finally, NO, you can’t add ChatGPT as a contributing author!

Train Your Own Protein Language Model In Just a Few Lines of Code

Language models have token the world by storm recently and, given the already explored analogies between protein primary sequence and text, there’s been a lot of interest in applying these models to protein sequences. Interest is not only coming from academia and the pharmaceutical industry, but also some very unlikely suspects such as ByteDance – yes the same ByteDance of TikTok fame. So if you also fancy trying your hand at building a protein language model then read on, it’s surprisingly easy.

Training your own protein language model from scratch is made remarkably easy by the HuggingFace Transformers library, which allows you to specify a model architecture, tokenise your training data, and train a model in only a few lines of code. Under the hood, the Transformers library uses PyTorch (or optionally TensorFlow) models, so you can either dig deeper into customising the training or model architecture, or simply leave it to the highly abstracted Transformers library to handle it all for you.

For this article, I’ll assume you already understand how language models work, and are now looking to implement one yourself, trained from scratch.

Continue reading

How ChatGPT changed my writing as an ESL speaker

It’s not always easy to live in an Anglophone scientific world when English isn’t your first language. When careers are built upon the ability to communicate ideas clearly and eloquently, struggling to find the right words can be a real hindrance to explaining your science in a way that is taken seriously. Contrary to popular belief, it’s not something you can simply “work” on. Often, it doesn’t matter how many books you’ve read, how many years of education you have, or how articulate you are in your original language: your brain will refuse to summon the right expression, or get stuck in a construction that a native speaker would never use. Struggling with a second language is very much a biological phenomenon.

The standard recommendation for ESL (English as a Second Language) speakers has long been to ask a native-speaking colleague to read through any text that needs to be published or submitted somewhere (such as an article or a grant application). Well-intentioned as this advice may be, there are multiple problems with it. Lingua franca or not, only 15% of the world population speaks English, of which only 5% are native speakers, meaning that for most scientists not working in Anglophone countries, the option is rarely available. Even when available, it is unreasonable to expect these colleagues to add charitable proof-reading to their workload simply because they happened to be born speaking a different language. But, most importantly, I have always felt (and I want to emphasize that I truly believe most people who issue this kind of advice to be well-intentioned) that the underlying message sounds too much like “you need vetting by a member of our select linguistic club if you want your ideas to be taken seriously”.

Continue reading

LaTeX Beamer Template with Logos

Alternative Title: The tragic story of how I got trapped making slides with LaTeX.

Typically, after giving a presentation, at least one person will approach me and ask if they could have access to my custom LaTeX template for making Beamer slides that don’t look rubbish.

TL;DR Yes you can: https://github.com/npqst/latex-beamer-template

Continue reading

Creating a Personal Website

Personal websites are a great and increasingly important way to build your online presence. Along with professional social media pages, such as on LinkedIn and Twitter, a website can provide a boost to your career and/or job search.

This blog post is based on my recent experience creating a personal website, following guidelines from Lewis’ talk at the OPIG Retreat last year (thank you Lewis!). The method I used and will cover here, based on an HTML5 UP! template and GitHub pages, is free and fast.

Why have a personal website?

  • Improves your online presence and brand
  • Boost for your career, including by allowing potential future employers to find you
  • Share things you have accomplished or are interested in
Continue reading

Unreasonably faster notes, with command-line fuzzy search

A good note system should act like a second brain:

  1. Accessible in seconds
  2. Adding information should be frictionless
  3. Searching should be exhaustive – if it’s there, you must find it

The benefits of such a note system are immense – never forget anything again! Search, perform the magic ritual of Copy Paste, and rejoice in the wisdom of your tried and tested past.

But how? Through the unreasonable effectiveness of interactive fuzzy search. This is how I have used Fuz, a terminal-based file fuzzy finder, for about 4 years.

Briefly, Fuz extracts all text within a directory using ripgrep, enables interactive fuzzy search with FZF, and returns you the selected item. As you type, the search results get narrowed down to a few matches. Files are opened at the exact line you found. And it’s FAST – 100,000 lines in half a second fast.
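To make the idea of fuzzy matching concrete, here is a minimal pure-Python sketch. It is not how Fuz, ripgrep or fzf are implemented; it only illustrates the core idea that a query matches a line if its characters appear in that line in the same order. The function names and the in-memory `notes` dictionary are made up for illustration.

```python
from typing import Dict, List, Tuple


def fuzzy_match(query: str, text: str) -> bool:
    """Return True if the characters of `query` appear in `text` in order (case-insensitive)."""
    remaining = iter(text.lower())
    # `ch in remaining` consumes the iterator up to the first match,
    # so later characters must be found after earlier ones.
    return all(ch in remaining for ch in query.lower())


def search_notes(query: str, notes: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Search every line of every note; return (filename, line_number, line) for each hit."""
    hits = []
    for filename, content in notes.items():
        for line_number, line in enumerate(content.splitlines(), start=1):
            if fuzzy_match(query, line):
                hits.append((filename, line_number, line))
    return hits


# Hypothetical notes directory, held in memory as filename -> file content
notes = {
    "biopython.py": "from Bio import SeqIO\nrecords = list(SeqIO.parse('seqs.fasta', 'fasta'))\n",
    "slurm.txt": "srun opens an interactive session\n",
}

hits = search_notes("seqparse", notes)
for filename, line_number, line in hits:
    print(f"{filename}:{line_number}: {line}")
```

Returning the file name and line number alongside each hit is what lets a tool open the file at the exact line you found.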

Using Fuz to quickly add a code-snippet in our note directory – then retrieving it with fuzzy-search. Here, on how to read FASTA files with Biopython, conveniently added to a file called biopython.py.
Continue reading

Naga101: A Guide to Getting Started with (OPIG) Slurm Servers

Over the past months, I’ve been working with a few new members of OPIG, which left me answering (and asking) lots of questions about working with Slurm. In this blog post, I will try to cover key, practical basics to interacting with servers that are set up on Slurm.

Slurm is a workload manager or job scheduler for Linux, meaning that it helps with allocating resources (e.g. CPUs and GPUs) on a server to users’ jobs.

To note, all of the commands and files shown here are run from a so-called ‘head’ node, from which you access Slurm servers.

1. Entering an interactive session

Unlike many other servers, you cannot access a Slurm server via ‘ssh’. Instead, you can enter an interactive (or ‘debug’) session – which, in OPIG, is limited to 30 minutes – via the srun command. This is incredibly useful for copying files, setting up environments and checking that your code runs.

srun -p servername-debug --pty --nodes=1 --ntasks-per-node=1 -t 00:30:00 --wait=0 /bin/bash

2. Submitting jobs

While the srun command is easy and helpful, many of the jobs we want to run on a server will take longer than the debug queue time limit. You can submit a job, which can then run for a longer (although typically still capped) time but is not interactive, via sbatch.
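As a rough sketch of what a batch job looks like, the Python helper below writes out a minimal submission script using standard `#SBATCH` directives. The helper name, the partition name `servername`, the 12-hour time limit and the command `python my_script.py` are all placeholders for illustration; check which partitions and limits apply on your own servers.

```python
def render_sbatch_script(job_name, partition, time_limit, command,
                         nodes=1, ntasks_per_node=1):
    """Build the text of a minimal Slurm batch script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={ntasks_per_node}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --output=slurm_%j.out",  # %j is replaced by the job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# All values below are placeholders
script = render_sbatch_script(
    job_name="my_job",
    partition="servername",
    time_limit="12:00:00",
    command="python my_script.py",
)
print(script)
```

Saving the printed text to a file such as `job.sh` and running `sbatch job.sh` from the head node would queue the job.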

Continue reading

Supercharge Your Literature Review With These Tools

When starting a new project, conducting a literature review of the field can be one of the most daunting prospects. Not only do you need to get through a mountain of research papers, you also need to work out which mountain of papers to get through. You don’t want to start a project only to realise a few weeks (or months!) in that you missed a key paper which would have completely changed the course of your research. Luckily, there are now several handy tools which can help speed up this process.

Continue reading

How to build a Python dictionary of residues for each molecule in PyMOL

Sometimes it can be handy to work with multiple structures in PyMOL using Python.

Here’s a snippet of code you might find useful: we iterate over all the α-carbon atoms in each loaded molecule and append tuples such as (‘GLY’, 1) to a list. The dictionary, ‘reslist’, maps the name of each molecule (a string) to its list of residue names and indices.

from pymol import cmd

# Create a list of all the objects, called 'mols':
mols = cmd.get_object_list('*')

# Create an empty dictionary that will return a list of residues
# given the name of the molecule object
reslist = {}

# Set each dictionary entry to an empty list
for m in mols:
    reslist[m] = []

# Use PyMOL's iterate command to go over every α-carbon and append
# a tuple consisting of each residue's name ('resn') and
# residue index ('resi'):
for m in mols:
    cmd.iterate('%s and n. ca' % m,
                'reslist["%s"].append((resn, int(resi)))' % m)

This script assumes you only have protein molecules loaded, and ignores things like chain ID and insertion codes.

Once you have your list of residues, you can use it with the cmd.align command, e.g., to align a particular residue to a reference structure.

Automatic argument parsers for python

One of the recurrent problems I used to have when writing argument parsers is that after refactoring code, I also had to change the argument parser options which generally led to inconsistency between the arguments of the function and some of the options of the argument parser. The following example can illustrate the problem:

def main(a,b):
  """
  This function adds together two numbers a and b
  param a: first number
  param b: second number
  """
  print(a+b)

if __name__ == "__main__":
  import argparse
  parser = argparse.ArgumentParser()
  parser.add_argument("--a", type=int, required=True, help="first number")
  parser.add_argument("--b", type=int, required=True, help="second number")
  args = parser.parse_args()
  main(**vars(args))

This code is nothing but a simple function that prints a+b, with an argument parser that asks for a and b. The perhaps not so obvious part is the invocation of the function, in which we have ** and vars. vars converts the Namespace object args to a dictionary of the form {"a":1, "b":2}, and ** expands the dictionary to be used as keyword arguments for the function. So if you have main(**{"a":1, "b":2}) it is equivalent to main(a=1, b=2).
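The two steps can be checked in isolation with a short sketch. To keep it runnable without a command line, the arguments are passed to parse_args as an explicit list, and main returns the sum instead of printing it.

```python
import argparse


def main(a, b):
    """Return the sum of two numbers a and b."""
    return a + b


parser = argparse.ArgumentParser()
parser.add_argument("--a", type=int, required=True, help="first number")
parser.add_argument("--b", type=int, required=True, help="second number")

# Pass the arguments as a list instead of reading them from sys.argv
args = parser.parse_args(["--a", "1", "--b", "2"])

kwargs = vars(args)      # Namespace -> plain dictionary
result = main(**kwargs)  # same as main(a=1, b=2)

print(kwargs)
print(result)
```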

Let’s refactor the function so that we change the name of the argument a to num.
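As a small sketch of how the mismatch shows up (an illustration of the problem, not necessarily the example used in the rest of the post), here the function has been renamed to take num while the parser still defines --a. The call fails with a TypeError because the dictionary key a no longer matches any parameter of main.

```python
import argparse


def main(num, b):
    """Return the sum of two numbers; the first argument was renamed from a to num."""
    return num + b


# The parser was NOT updated after the refactoring and still defines --a
parser = argparse.ArgumentParser()
parser.add_argument("--a", type=int, required=True, help="first number")
parser.add_argument("--b", type=int, required=True, help="second number")
args = parser.parse_args(["--a", "1", "--b", "2"])

try:
    main(**vars(args))
    error = None
except TypeError as exc:
    # main() does not accept a keyword argument called 'a' any more
    error = str(exc)

print(error)
```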

Continue reading