Category Archives: AI

“The Rise of ChatGPT 4.0: Is the Future of Work in Jeopardy?”

In my previous blog post, I explored the capabilities of ChatGPT 3.5, testing its skills as a programmer and mathematician’s assistant. The results were mixed, to say the least. While it could handle simple coding tasks with ease, it faltered when faced with more complex mathematical problems and image manipulation tasks. I concluded that while ChatGPT 3.5 was impressive, it was far from replacing us in our jobs. It was a tool, yes, but one that needed careful supervision and frequent corrections.

Fast forward to today, and we have ChatGPT 4.0 with code interpreter enabled. I approached this new version with the same scepticism I had for its predecessor, but I must admit, my initial impressions have been challenged. The leap from 3.5 to 4.0 is not just a step forward; it’s a giant leap. You can see my experiment here https://chat.openai.com/share/0099a12f-e577-44bd-aa10-7c7508f15f19.

In my previous experiment, I asked ChatGPT 3.5 to translate a 2D image, a task that combines both programming and a bit of maths. The results were less than satisfactory. It was able to write code to translate images in real space, a solution that is easy to find by searching the web. However, it struggled to handle images represented in Fourier space correctly, and even with multiple hints, the best it could do was a rough approximation of the correct solution. It took six attempts to get working code generated, and even then, it was far from perfect.

With ChatGPT 4.0, however, the story is different. When asked to perform a similar task, it provided a clear, concise, and correct solution. It understood the concept of image translation in the context of computer vision and image processing, and it was able to generate Python code using the scikit-image package to perform the task. The code was not only syntactically correct but also logically sound, demonstrating a significant improvement in understanding and problem-solving capabilities.

"""
#Me: Can you now provide some python code to translate (pixel shifts) images? And can you execute it?
#Chat gpt: Here's a simple Python example using scikit-image to translate an image:
"""

import numpy as np
import matplotlib.pyplot as plt
from skimage import data
from skimage.transform import AffineTransform, warp

# Load an example image
image = data.camera()

# Define a translation transform
tform = AffineTransform(translation=(50, 50))

# Apply the transform to the image
translated_image = warp(image, tform.inverse)

# Plot the original and translated images
fig, ax = plt.subplots(1, 2)
ax[0].imshow(image, cmap='gray')
ax[0].set_title('Original Image')
ax[1].imshow(translated_image, cmap='gray')
ax[1].set_title('Translated Image')
plt.show()

"""
#Chat gpt: 
In this code, AffineTransform is used to define a translation of 50 pixels to the right and 50 pixels down, and warp is used to apply this transform to the image. The result is an image that has been shifted.
However, I must clarify that the current environment doesn't support displaying plots, so the matplotlib part of the code won't work here. You should be able to run it in a Jupyter notebook or similar environment on your local machine. Let me know if you need further help or clarification!
"""


"""
ME: perform the same operation as before but with an image represented in Fourier space. 
#Chat gpt: 
Sure! However, we need to keep in mind that translating an image in the Fourier domain is not the same as translating an image in the spatial domain. In the Fourier domain, translations are performed by multiplying the image by a complex exponential, which essentially changes the phase of the Fourier transform.

Here is a Python code snippet demonstrating translation in the Fourier domain:


"""

import numpy as np
from skimage import data
from scipy.fft import fft2, ifft2, fftshift, ifftshift

# Load an example image
image = data.camera()

# Compute the 2-dimensional FFT of the image
image_fft = fft2(image)

# Define a translation
dy, dx = 30, 30

# Generate an array with the same size as the image
x = np.arange(0, image.shape[1])
y = np.arange(0, image.shape[0])
xv, yv = np.meshgrid(x, y)

# Create the translation function in Fourier space
translation = np.exp(-2j * np.pi * (dx * xv / image.shape[1] + dy * yv / image.shape[0]))

# Translate the image in Fourier space
translated_image_fft = image_fft * translation

# Transform back to spatial domain
translated_image = np.abs(ifft2(translated_image_fft))

# Now you could plot the images
"""
#Chat gpt: 
This code first computes the Fourier transform of the image using fft2. Then it generates a complex exponential function that represents a shift in the Fourier domain. This function is multiplied with the Fourier-transformed image to achieve the translation. Finally, the inverse Fourier transform ifft2 is used to transform the image back to the spatial domain.

As before, the current environment doesn't support displaying plots, so you would need to run this on your local machine to see the translated image.
"""
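One quick way to convince yourself that the Fourier-space approach is right is to compare it against a plain circular shift in real space. The snippet below is a minimal sketch of that check, not part of the ChatGPT conversation: the helper name fourier_shift is hypothetical, a random array stands in for data.camera() so that only NumPy is needed, and np.fft.fftfreq is used to build the frequencies, which for integer shifts gives the same phase factors as the index-based version above.

```python
import numpy as np


def fourier_shift(image, dy, dx):
    """Circularly shift a 2D array by (dy, dx) pixels using the Fourier shift theorem."""
    rows, cols = image.shape
    # Frequencies in cycles per pixel along each axis
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    # A shift in real space is a linear phase ramp in Fourier space
    phase = np.exp(-2j * np.pi * (dy * fy + dx * fx))
    # The input is real, so any imaginary part left over is rounding noise
    return np.fft.ifft2(np.fft.fft2(image) * phase).real


# Random stand-in image (assumption: any real 2D array works for this check)
rng = np.random.default_rng(0)
image = rng.random((64, 48))

shifted = fourier_shift(image, dy=30, dx=30)
expected = np.roll(image, shift=(30, 30), axis=(0, 1))

max_error = np.max(np.abs(shifted - expected))
print(f"largest difference from np.roll: {max_error:.2e}")
```

Note that the Fourier shift wraps around: pixels pushed off one edge reappear on the opposite edge, whereas warp in the real-space example fills the vacated region with zeros by default.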

Another test I revisited was a mathematical problem I had posed to ChatGPT 3.5: “What is the intersection between the unit circle and the curve y = x^2?” The previous version had failed to provide a correct answer. However, ChatGPT 4.0 not only understood the problem but also wrote its own code to solve it, correctly identifying the intersection points. This was a clear demonstration of the AI’s improved mathematical problem-solving capabilities. It’s important to note, though, that only the premium version of ChatGPT 4.0 can run the code it suggests, as it did in the example of the intersection of the unit circle and the parabola.

ME: Can you tell me which is the intersection between the unit circle and the curve “y= x*x”?
ChatGPT: The unit circle is defined by the equation x²+y²=1. The curve y=x² is a parabola. To find their intersection, we can substitute y² from the equation of the parabola into the equation of the circle, and solve for x². So, we have x²+(x²)²=1, or x²+x⁴=1. This is a quartic equation and can be difficult to solve, but let’s give it a try. [AND IT STARTS WRITING AND EXECUTING CODE]
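For reference, this particular quartic has a closed form, so the answer is easy to check independently. The snippet below is a minimal sketch in plain Python and is not the code ChatGPT generated (that code is not reproduced here). Substituting y = x² into the circle gives y² + y - 1 = 0, whose only non-negative root is y = (√5 - 1)/2 ≈ 0.618, so x = ±√y ≈ ±0.786.

```python
import math

# Substituting y = x**2 into x**2 + y**2 = 1 gives y**2 + y - 1 = 0.
# Only the non-negative root can equal x**2 for a real x.
y = (-1 + math.sqrt(5)) / 2
x = math.sqrt(y)

# The parabola is symmetric about the y-axis, so there are two intersection points
points = [(-x, y), (x, y)]

for px, py in points:
    on_circle = math.isclose(px**2 + py**2, 1.0)
    on_parabola = math.isclose(py, px**2)
    print(f"({px:.6f}, {py:.6f})  on circle: {on_circle}  on parabola: {on_parabola}")
```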

So, is my job in danger now? Well, it’s a complex question. The advancements in AI, as demonstrated by the jump from ChatGPT 3.5 to 4.0, are indeed impressive. The AI’s ability to understand complex tasks and generate accurate solutions is growing quite fast. However, it’s important to remember that AI, at its core, is a tool. It’s a tool that can augment our capabilities, automate mundane tasks, and help us solve complex problems. In the end, whether AI becomes a threat or an ally in our jobs depends largely on how we choose to use it. If we see it as a tool to enhance our skills and productivity, then there’s no danger, only opportunity. But if we see it as a replacement for human intelligence and creativity, then we might indeed have cause for concern. For now, though, I believe we’re safe. The Turing test might be a thing of the past, but the “human test” is still very much alive.

9th Joint Sheffield Conference on Cheminformatics

Over the next few days, researchers from around the world will be gathering in Sheffield for the 9th Joint Sheffield Conference on Cheminformatics. As one of the organizers (wearing my Molecular Graphics and Modeling Society ‘hat’), I can say we have an exciting array of speakers and sessions:

De Novo Design
Open Science
Chemical Space
Physics-based Modelling
Machine Learning
Property Prediction
Virtual Screening
Case Studies
Molecular Representations

It has traditionally taken place every three years, but despite the global pandemic it is returning this year, once again in person in the excellent conference facilities at The Edge. You can download the full programme in iCal format, and here is the conference calendar:

Continue reading →

Machine learning strategies to overcome limited data availability

Machine learning (ML) for biological/biomedical applications is very challenging, in large part due to limitations in publicly available data (something we recently published about [1]). Generating the types of data (e.g. protein structures, protein-protein binding affinity, microscopy images, gene expression values) required to train ML models can take substantial amounts of time and resources.

In cases where there is sufficient data available to provide signal, but not enough for the desired performance, ML strategies can be employed:

Continue reading →

Academic Reading? There’s an AI for that.

AI tools are literally everywhere. Recently, I stumbled across an AI aggregator website (theresanaiforthat.com) that, given a task, will find an AI solution. At the time of writing this article, there are 4871 AIs across 1369 tasks, with solutions ranging from scribes to polygraph examiners. I also came across SciSpace (formerly typeset – https://typeset.io), an “AI assistant to understand scientific literature.” So, of course, I tested it out. In this blog post, we will explore the capabilities of SciSpace and discuss how it can potentially enhance your literature review process.

The user experience of a tool can make or break its adoption. Thankfully, SciSpace isn’t bad. Its main website offers basic search functionality, enabling you to find specific papers, topics, or authors within their database. I did notice that it is missing many new papers in its database; however, users have the option to upload a PDF for analysis. Additionally, each search result includes a TL;DR summary, providing a concise overview of the paper’s contents at a glance. As expected, this summary serves as a helpful reminder for familiar papers, but I often found it inadequate in providing enough information to grasp the main arguments or story of a paper. One interesting feature of SciSpace is the ability to “trace” papers in their database. By following the citations of a paper, users can navigate through related works, authors, and topics. I think this feature would be helpful during exploration and makes finding connections between related topics a little easier.

The best thing about SciSpace is the Copilot Chrome extension. Available whenever you open a paper’s PDF or journal link, it offers text analysis, summarization, and mathematical or table comprehension. It provides a set of common template prompts, which I found helpful. For example, “What were the key contributions of that paper?”, “What data and methods have been used in this paper?”, or “What are the limitations of this paper?” I found these prompts helpful in getting a quick overview of the work faster than reading the abstract, figures, and conclusion.

To put SciSpace Copilot to the test, I used it on my recent publication. The extension provided an accurate summary of the abstract and introduction. It effectively extracted the key results and arguments, and it highlighted the main contributions of the work well. To be honest, it also offered a fair and accurate summary of the limitations of the study. It was helpful; however, it does not replace the need to read the full paper.

Tools like SciSpace are clearly becoming more popular and could potentially play a larger role in how we write, read, and understand research output. In the meantime, I’ve found that it significantly improves the efficiency and effectiveness of my academic reading. Its clean, user-friendly interface, TL;DR summaries, and the impressive Copilot Chrome extension save me time. Plus, it’s completely free! I do expect that at some point it will become a paid tool. Until then, it’s a great way to stay on top of published work and build an understanding of related, but unfamiliar, fields.

Unclear documentation? ChatGPT can help!

The PyMOL Python API is a useful resource for most people doing research in OPIG, whether focussed on antibodies, small molecule drug design or protein folding. However, the documentation is poorly structured and difficult to interpret without first having understood the structure of the module. In particular, the differences between use of the PyMOL command line and the API can be unclear, leading to a much longer debugging process for code than you’d like.

While I’m reluctant to continue the recent theme of ChatGPT-related posts, this is a use for ChatGPT that would have been incredibly useful to me when I was first getting to grips with the PyMOL API.

Continue reading →

Writing a BLOPIG Post With ChatGPT: A Personal Take on Using AI for Assisted Writing

Disclaimer: I used ChatGPT to improve the writing style of this article, in combination with some personal curation before obtaining a final version.

You’ve probably heard it all already, from ChatGPT writing code and doing proofreading for you to a rap battle between OPIG’s Antibodies and Small Molecules groups, and more.

Whether you like it or not, ChatGPT has unleashed people’s creative side regarding applications and attempts to find shortcuts. Questionable? Absolutely!

In this BLOPIG post, I show how I used ChatGPT to easily write a post summarising some material of my own intellectual property, which I presented as part of my group meeting talk. Mainly, I list some personal thoughts on the ethical concerns around using ChatGPT to assist your writing.

To start off, I passed on content from my own publication draft to ChatGPT, asking it to generate a blog post in plain English for BLOPIG. The outcome:

Not bad.

But, it made me realise a number of things:

With great power comes great responsibility [Uncle Ben – Spiderman].
You are responsible for the ethics that go into using ChatGPT. Are you faking expertise? Are you actually being lazy or just being efficient? Think twice (or many more times) about whether you’re doing the right thing.
It can significantly reduce the number of writing iterations but don’t take it at face value.
Can you actually trust the plain output? No.
Never take its output as the ground truth, as Large Language Models such as ChatGPT often produce biased writing outputs.
Keep in mind that whatever you produce as a scientist will be picked up by others and, if incorrect, is prone to drive misinformation. It is OK to reduce mechanical iterations, but it’s NOT OK to skip quality control.
Be open about it.
You don’t want to set the wrong example for your colleagues. So, mention if you used it and how you used it. It is fine to encourage efficiency, but not to incentivise a culture of scientific misconduct and plagiarism. Don’t skip the step of producing quality ideas on your own. This is such a concern that publishers like Elsevier have already reacted by publishing guidelines that address this possibility, while Springer Nature is working on ways to spot AI-generated outputs.

The bottom line

What are the dos and don’ts of using ChatGPT?

Yes, use it to have fun. Yes, use it to proofread or polish your writing. Yes, use it to summarise your own ideas. No, don’t use it to do the analysis and interpretation of your results. No, don’t copy and paste its direct output into your publication. No, don’t hide that you used it. Finally, NO, you can’t add ChatGPT as a contributing author!

Train Your Own Protein Language Model In Just a Few Lines of Code

Language models have taken the world by storm recently and, given the already explored analogies between protein primary sequence and text, there’s been a lot of interest in applying these models to protein sequences. Interest is not only coming from academia and the pharmaceutical industry, but also from some very unlikely suspects such as ByteDance (yes, the same ByteDance of TikTok fame). So if you also fancy trying your hand at building a protein language model then read on; it’s surprisingly easy.

Training your own protein language model from scratch is made remarkably easy by the HuggingFace Transformers library, which allows you to specify a model architecture, tokenise your training data, and train a model in only a few lines of code. Under the hood, the Transformers library uses PyTorch (or optionally TensorFlow) models, allowing you to dig deeper into customising training or model architecture, or simply leave it to the highly abstracted Transformers library to handle it all for you.

For this article, I’ll assume you already understand how language models work, and are now looking to implement one yourself, trained from scratch.

Continue reading →

How ChatGPT changed my writing as an ESL speaker

It’s not always easy to live in an Anglophone scientific world when English isn’t your first language. When careers are built upon the ability to communicate ideas clearly and eloquently, struggling to find the right words can be a real hindrance to explaining your science in a way that is taken seriously. Contrary to popular belief, it’s not something you can simply “work” on. Often, it doesn’t matter how many books you’ve read, how many years of education you have, or how articulate you are in your original language: your brain will refuse to summon the right expression, or get stuck in a construction that a native speaker would never use. Struggling with a second language is very much a biological phenomenon.

The standard recommendation for ESL (English as a Second Language) speakers has long been to ask a native colleague to read through any text that needs to be published or submitted somewhere (such as an article or a grant application). Well-intentioned as this advice may be, there are multiple problems with it. Lingua franca or not, only 15% of the world population speaks English, of which only 5% are native speakers, meaning that for most scientists not working in Anglophone countries, the option is rarely available. Even when available, it is unreasonable to expect these colleagues to add charitable proof-reading to their workload simply because they happened to be born speaking a different language. But, most importantly, I have always felt (and I want to emphasize that I truly believe most people who issue this kind of advice to be well-intentioned) that the underlying message sounds too much like “you need vetting by a member of our select linguistic club if you want your ideas to be taken seriously”.

Continue reading →

AI-pril Fools

As my turn to write a blog post has fallen a few days after April 1st, I decided I would write an April Fools’ Day-inspired post and ask everyone’s favourite chatbot to tell me some jokes.

I asked ChatGPT to tell me a knock-knock joke, prompting it with various topics relevant to OPIG (including AI, antibodies, drug discovery and proteins) to see what it could come up with. I’d argue that we’re playing fast and loose with the definition of a joke (several of these just made me cringe), but here are some of the results…

Continue reading →

How to easily use pharmacophoric atom features to turn ECFPs into FCFPs

Today’s post builds on my earlier blogpost on how to turn a SMILES string into an extended-connectivity fingerprint using RDKit and describes an interesting and easily implementable modification of the extended-connectivity fingerprint (ECFP) featurisation. This modification is based on representing the atoms in the input compound at a different (and potentially more useful) level of abstraction.

We remember that each binary component of an ECFP indicates the presence or absence of a particular circular subgraph in the input compound. Circular subgraphs that are structurally isomorphic are further distinguished according to their inherited atom- and bond features, i.e. two structurally isomorphic circular subgraphs with distinct atom- or bond features correspond to different components of the ECFP. For chemical bonds, this distinction is made on the basis of simple bond types (single, double, triple, or aromatic). To distinguish atoms, standard ECFPs use seven features based on the Daylight atomic invariants [1]; but there is also another less commonly used and often overlooked version of the ECFP that uses pharmacophoric atom features instead [2]. Pharmacophoric atom features attempt to describe atomic properties that are critical for biological activity or binding to a target protein. These features try to capture the potential for important chemical interactions such as hydrogen bonding or ionic bonding. ECFPs that use pharmacophoric atom features instead of standard atom features are called functional-connectivity fingerprints (FCFPs). The exact sets of standard- vs. pharmacophoric atom features for ECFPs vs. FCFPs are listed in the table below.

In RDKit, ECFPs can be changed to FCFPs extremely easily by changing a single input argument. Below you can find a Python/RDKit implementation of a function that turns a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False.

# import packages
import numpy as np
from rdkit.Chem import AllChem

# define function that transforms a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False
def FCFP_from_smiles(smiles,
                     R = 2,
                     L = 2**10,
                     use_features = True,
                     use_chirality = False):
    """
    Inputs:
    
    - smiles ... SMILES string of input compound
    - R ... maximum radius of circular substructures
    - L ... fingerprint-length
    - use_features ... if true then use pharmacophoric atom features, if false then use standard DAYLIGHT atom features
    - use_chirality ... if true then append tetrahedral chirality flags to atom features
    
    Outputs:
    - np.array(feature_list) ... FCFP/ECFP with length L and maximum radius R
    """
    
    molecule = AllChem.MolFromSmiles(smiles)
    feature_list = AllChem.GetMorganFingerprintAsBitVect(molecule,
                                                         radius = R,
                                                         nBits = L,
                                                         useFeatures = use_features,
                                                         useChirality = use_chirality)
    return np.array(feature_list)

The use of pharmacophoric atom features makes FCFPs more specific to molecular interactions that drive biological activity. In certain molecular machine-learning applications, replacing ECFPs with FCFPs can therefore lead to increased performance and decreased learning time, as important high-level atomic properties are presented to the learning algorithm from the start and do not need to be inferred statistically. However, the standard atom features used in ECFPs contain more detailed low-level information that could potentially still be relevant for the prediction task at hand and thus be utilised by the learning algorithm. It is often unclear from the outset whether FCFPs will provide a substantial advantage over ECFPs in a given application; however, given how easy it is to switch between the two, it is almost always worth trying out both options.
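To make the "try both" advice concrete, here is a minimal usage sketch that computes both fingerprints for a single molecule and compares them. It relies on the same RDKit call as the function above; the helper name morgan_bits and the choice of aspirin as the example molecule are assumptions made purely for illustration. Recent RDKit releases may emit a deprecation warning for GetMorganFingerprintAsBitVect.

```python
import numpy as np
from rdkit.Chem import AllChem


def morgan_bits(smiles, use_features, radius=2, n_bits=2**10):
    """Return an ECFP (use_features=False) or FCFP (use_features=True) as a 0/1 array."""
    molecule = AllChem.MolFromSmiles(smiles)
    if molecule is None:
        # MolFromSmiles returns None when the SMILES string cannot be parsed
        raise ValueError(f"could not parse SMILES: {smiles}")
    bit_vector = AllChem.GetMorganFingerprintAsBitVect(molecule,
                                                       radius,
                                                       nBits=n_bits,
                                                       useFeatures=use_features)
    return np.array(bit_vector)


# Aspirin, chosen only as a small illustrative example
smiles = "CC(=O)Oc1ccccc1C(=O)O"

ecfp = morgan_bits(smiles, use_features=False)
fcfp = morgan_bits(smiles, use_features=True)

print("bits set in ECFP:", int(ecfp.sum()))
print("bits set in FCFP:", int(fcfp.sum()))
print("positions set in both:", int(np.sum((ecfp == 1) & (fcfp == 1))))
```

Because the two fingerprints hash different atom invariants, their bit positions are not directly comparable; the overlap count is only a sanity check that the two featurisations really differ. In practice, each fingerprint would be fed to the same downstream model and the resulting performance compared.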

[1] Weininger, David, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for generation of unique SMILES notation.” Journal of Chemical Information and Computer Sciences 29.2 (1989): 97-101.

[2] Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 50.5 (2010): 742-754.

Oxford Protein Informatics Group

or "OPIG" to friends
