In 2019, I tried my hand at using large language models, specifically GPT-2, for text generation. In a blogpost at the time, I used Hansard files to fine-tune the publicly released GPT-2 to generate speeches by several speakers in the House of Commons (link).
In 2020, OpenAI released GPT-3, their new and improved text generation model (paper), which uses a whopping 175 billion parameters (as opposed to its predecessor’s 1.5 billion). It not only proved capable of state-of-the-art performance on common text prediction benchmarks, but also generated a considerable amount of interest in the news media:
- Here is an article in the Guardian co-written by GPT-3: link
- And another one in the NYT, discussing applications and dangers of GPT-3: link
Though there are serious concerns about inherent biases learned by GPT-3*, the potential of GPT-3 as a few-shot learner for chat systems and information queries can hardly be overstated.
Last week, I finally received my invite for the OpenAI API beta, allowing me to use their text completion models myself. Given that GPT-3 is trained on datasets including CommonCrawl (> 1 trillion words of scraped website data) as well as the entire Wikipedia corpus, I was interested in exploring whether, given the correct prompts, the OpenAI API would be able to perform simple bioinformatics tasks.
Unlike GPT-2, for which a smaller model was made available to the public, the model weights for GPT-3 have not been released in any form. While this means that fine-tuning the model weights for specific tasks is not possible (nor would it be feasible given the size of the model), the demonstrated capacity of GPT-3 as a few-shot learner (i.e. it only needs a small number of prompts to adapt to a new task domain) nevertheless made it possible to generate the results below. For reproduction, I highlight the prompt given to the model in italics and the model-generated text in bold.
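As a note on mechanics, all of the completions below were obtained through the API's plain text completion endpoint. A minimal sketch in Python, using the openai package as it was available during the beta, might look as follows; the engine name, sampling settings and exact prompt layout are illustrative assumptions rather than the settings I actually used.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # issued with the beta invite

# A few-shot prompt along the lines of the amino-acid example below:
# a couple of solved examples, followed by the query to be completed.
prompt = (
    "1-letter amino acid sequence: AAPGYTRA\n"
    "3-letter amino acid sequence: AlaAlaProGlyTyrThrArgAla\n"
    "1-letter amino acid sequence: TTYV\n"
    "3-letter amino acid sequence: ThrThrTyrVal\n"
    "1-letter amino acid sequence: NDQC\n"
    "3-letter amino acid sequence:"
)

response = openai.Completion.create(
    engine="davinci",   # the largest GPT-3 model exposed by the beta
    prompt=prompt,
    max_tokens=20,
    temperature=0.3,
    stop="\n",          # stop once the completed line ends
)
print(response.choices[0].text)
```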
Amino-acid code translation
A simple bioinformatics task is translating between the 1-letter and 3-letter amino acid codes. I provided the model with two translated examples and then asked it to translate three more strings of 1-letter amino acid code:
1-letter amino acid sequence: AAPGYTRA
3-letter amino acid sequence: AlaAlaProGlyTyrThrArgAla
1-letter amino acid sequence: TTYV
3-letter amino acid sequence: ThrThrTyrVal
1-letter amino acid sequence: NDQC
3-letter amino acid sequence: AsnAspProGln
1-letter amino acid sequence: NQTY
3-letter amino acid sequence: AsnAsnProTyr
1-letter amino acid sequence: GPR
3-letter amino acid sequence: GlyProArg
While it did make some mistakes (see the second completion, which should read AsnGlnThrTyr as opposed to AsnAsnProTyr), the model correctly identified the task structure and, importantly, was able to translate amino acids which I had not provided in the prompt (neither of the two initial examples included N).
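The translation itself is of course deterministic, so completions like these are easy to verify; here is a small sketch using Biopython's seq3 helper (assuming Biopython is installed) to produce the reference translations for the three query sequences.

```python
from Bio.SeqUtils import seq3

# Reference 3-letter translations for the three query sequences above
for one_letter in ["NDQC", "NQTY", "GPR"]:
    print(one_letter, "->", seq3(one_letter))

# NDQC -> AsnAspGlnCys
# NQTY -> AsnGlnThrTyr
# GPR -> GlyProArg
```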
Generate an antibody heavy chain sequence
This was a task at which I did not expect the model to succeed, as it seems unlikely that it would have encountered many examples of specifically labeled heavy chain amino acid sequences before.
I first tried a lightweight prompt:
This is an antibody heavy chain amino acid sequence: AAAALACTVLTTVSSQSQSVGDRVSFSTSCRASQEQGHSSPVTPLTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPA
I tried parsing the output with ANARCI and, unsurprisingly, it failed to parse the sequence into an antibody numbering scheme. The sequence does, however, have 86% identity to an antibody identified via a BLAST search:
```
Query  28   TSCRASQEQGHSSPVTPLTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVT  87
            TSCR    QG     T +TVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVT
Sbjct  107  TSCRDYWGQG-----TLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVT  161

Query  88   VSWNSGALTSGVHTFPA  104
            VSWNSGALTSGVHTFPA
Sbjct  162  VSWNSGALTSGVHTFPA  178
```
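For completeness, a similarity search like this one can also be run programmatically; below is a rough sketch using Biopython's NCBI web BLAST interface. The database choice and the way the top hit is summarised are assumptions on my part, and a web query of this kind can take several minutes to return.

```python
from Bio.Blast import NCBIWWW, NCBIXML

# The GPT-3-generated sequence from the prompt above
generated = (
    "AAAALACTVLTTVSSQSQSVGDRVSFSTSCRASQEQGHSSPVTPLTVSSASTKGPSV"
    "FPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPA"
)

# Submit a protein BLAST search against the non-redundant database
result_handle = NCBIWWW.qblast("blastp", "nr", generated)
record = NCBIXML.read(result_handle)

# Summarise the best-scoring alignment
best_alignment = record.alignments[0]
best_hsp = best_alignment.hsps[0]
print(best_alignment.title)
print(f"{best_hsp.identities}/{best_hsp.align_length} identical positions")
```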
That is quite an impressive result given the minimal priming of the model. To push the model further towards generating a parseable heavy chain sequence, I prompted it with the first 4 amino acids of an antibody heavy chain:
This is an antibody heavy chain amino acid sequence:
EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSGGGTVTVSS
Bingo! This one parses in ANARCI, starting at IMGT position 0 and ending at 65. This particular sequence has 100% identity to a BLAST hit, so it is possible that the model encountered this exact sequence during training, but I did not assess this further.
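For anyone who wants to repeat the parsing step, here is a minimal sketch of how the generated sequence can be numbered, assuming the ANARCI command-line tool is installed and available on the PATH.

```python
import subprocess

# The GPT-3-generated sequence (the prompt supplied only the first four residues, EVQL)
generated = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSGGGTVTVSS"

# Ask ANARCI to number the sequence with the IMGT scheme; if the sequence is not
# recognised as an antibody variable domain, no numbering is reported.
result = subprocess.run(
    ["ANARCI", "-i", generated, "--scheme", "imgt"],
    capture_output=True,
    text=True,
)
print(result.stdout)
```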
Sequences generated in this manner are not directly usable in any meaningful application, but it is nonetheless impressive that a model trained primarily for general language generation is able to produce meaningful domain-specific amino acid sequences.
Generate a description of OPIG
Lastly, I also asked it to describe OPIG:
Describe OPIG, the Oxford Protein Informatics Group.
The Oxford Protein Informatics Group (OPIG) is a group of scientists with interests in protein science and informatics. We come from different departments in the University of Oxford, but we have a common interest in using computer science and informatics to solve problems in biology. We have a number of projects in progress, ranging from developing a new bioinformatics tool to experimental work studying new proteins and enzymes.
Not bad.
*I should add that OpenAI has included a warning in their API for content generated by their models which might be harmful or unsafe.