Language models have taken the world by storm recently and, given the already well-explored analogies between protein primary sequence and text, there’s been a lot of interest in applying these models to protein sequences. Interest is coming not only from academia and the pharmaceutical industry, but also from some very unlikely suspects such as ByteDance – yes, the same ByteDance of TikTok fame. So if you also fancy trying your hand at building a protein language model, read on: it’s surprisingly easy.
Training your own protein language model from scratch is made remarkably easy by the HuggingFace Transformers library, which lets you specify a model architecture, tokenise your training data, and train a model in only a few lines of code. Under the hood, the Transformers library uses PyTorch (or optionally TensorFlow) models, so you can dig deeper into customising training or the model architecture, or simply leave it all to the highly abstracted Transformers library.
For this article, I’ll assume you already understand how language models work, and are now looking to implement one yourself, trained from scratch.
1. Install Required Libraries
We’ll be using both the Transformers and Datasets libraries from HuggingFace, as well as PyTorch. Installing transformers from conda can sometimes be problematic, so I’d recommend using pip, which you can still do inside a conda virtual environment.
pip install transformers
pip install datasets
You’ll also need to install PyTorch; you can check the system-dependent installation guides here.
2. Prepare Your Dataset
With your dataset of protein sequences separated into train, validation, and test sets, we’ll use the Datasets library from HuggingFace for easy and efficient tokenisation.
Protein language models typically tokenise at the character level (i.e., one token per residue), so you won’t need to train a custom tokeniser as you would for most text-based language models. Download this example tokeniser, which includes a token for each of the 20 amino acids, as well as padding, N-terminus, and C-terminus tokens (represented by ‘1’ and ‘2’ respectively).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('path/to/tokeniser', use_fast=True)
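If you want to sanity-check the tokeniser, a quick, purely illustrative example is shown below; the sequence itself is made up, and is simply a short run of residues flanked by the ‘1’ and ‘2’ terminal tokens.

# A made-up example sequence, wrapped in the N-term ('1') and C-term ('2') tokens
example_sequence = "1MKVLAAGIV2"

# A character-level tokeniser should produce one token per residue
print(tokenizer.tokenize(example_sequence))
print(tokenizer(example_sequence)["input_ids"])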
Next, we’ll load the train, val, and test sets. The Datasets library has some handy features such as caching any data preparation steps (e.g. tokenisation), meaning this step won’t be repeated between training runs. This saves a lot of compute, especially with the large datasets typically used in language models.
from datasets import load_dataset

data_set_paths = {"train": "/path/to/train.csv", "test": "/path/to/test.csv", "val": "/path/to/val.csv"}
datasets = load_dataset('csv', data_files=data_set_paths, cache_dir='/where/to/store/the/cache')
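The training step below expects a tokenised version of these splits (referred to as tokenized_datasets). A minimal sketch of that step follows, assuming each CSV has a column named 'sequence' containing the protein sequences; adjust the column name to match your own files.

# Tokenise every split; thanks to the Datasets cache this only runs once per dataset version.
# Assumes the CSVs store sequences in a column named 'sequence'.
def tokenise_function(examples):
    return tokenizer(examples["sequence"])

tokenized_datasets = datasets.map(
    tokenise_function,
    batched=True,
    remove_columns=["sequence"],
)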
3. Define the Model Architecture
Choose a model architecture from the Transformers library (check their documentation for a full list). Here, we’ll use GPT2. Customize any model architecture hyperparameters as needed, since we won’t keep any pretrained model weights:
from transformers import GPT2Config, GPT2LMHeadModel

transformer_config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_layer=12,
    n_embd=512,
    n_head=8,      # n_embd must be divisible by n_head
    n_inner=2048,
)
model = GPT2LMHeadModel(config=transformer_config)
Remember to move the model to the GPU using model.to('cuda') if using CUDA.
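As a quick sanity check at this point, you can print the model’s parameter count and handle the GPU move conditionally; a small sketch:

import torch

# Report the model size and move it to the GPU when one is available
n_params = sum(p.numel() for p in model.parameters())
print(f"Model has {n_params:,} parameters")
if torch.cuda.is_available():
    model = model.to('cuda')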
4. Train Your Language Model
Now, it’s time to train our model! The Trainer class in HuggingFace makes this easy, taking its training parameters from a TrainingArguments instance. Check the TrainingArguments documentation here for customization options, or write your own training loop using PyTorch (a minimal sketch of such a loop follows the Trainer example below).
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# The collator pads each batch and copies the input IDs into labels for causal LM training
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["val"],
)

trainer.train()
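If you’d rather write your own training loop, as mentioned above, a minimal sketch might look like the following; it reuses the data collator defined above, and the batch size, learning rate, and epoch count are just illustrative values.

from torch.optim import AdamW
from torch.utils.data import DataLoader

# Build a DataLoader over the tokenised training split; the collator pads each batch
train_loader = DataLoader(
    tokenized_datasets["train"],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator,
)
optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(2):
    for batch in train_loader:
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)   # the 'labels' added by the collator give us a causal LM loss
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()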
5. Generate Samples
After training your language model, you’ll want to generate new samples. There are several ways to generate samples from a language model, with further details on methods and implementation in HuggingFace available here. Below is an example of generating sequences using top-p sampling.
import math

from transformers import GenerationConfig

# Example generation settings -- these values are illustrative, tune them for your use case
num_return_sequences = 1000   # total number of sequences to generate
batch_size = 100              # sequences generated per call to generate()
max_new_tokens = 200          # maximum number of tokens to generate per sequence
top_p = 0.9
temperature = 1.0

pad_token_id = tokenizer.pad_token_id
# get the token id for the N-term token ('1'), used as the beginning-of-sequence token
bos_token_id = tokenizer.encode('1')[0]

n_batches = math.ceil(num_return_sequences / batch_size)

# make sure we're using the GPU
model.to('cuda')

# top-p sampling draws from the token probability distribution, so 'do_sample' is True
generation_config = GenerationConfig(
    max_new_tokens=max_new_tokens,
    do_sample=True,
    top_p=top_p,
    temperature=temperature,
    pad_token_id=pad_token_id,
    bos_token_id=bos_token_id,
    num_return_sequences=batch_size,
)

generated_token_ids = []
for _ in range(n_batches):
    batch_generated_token_ids = model.generate(generation_config=generation_config)
    generated_token_ids.append(batch_generated_token_ids)

# flatten the per-batch outputs into a single list of sequences
generated_token_ids = [ids for batch in generated_token_ids for ids in batch]

# decode from token IDs back to residue letters
decoded_sequences = tokenizer.batch_decode(generated_token_ids, skip_special_tokens=True)
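What you do with the decoded sequences from here is up to you; one simple option is to write them out in FASTA format for downstream analysis (the file name and record IDs below are just placeholders):

# Write the generated sequences to a FASTA file for downstream analysis
with open("generated_sequences.fasta", "w") as fasta_file:
    for i, sequence in enumerate(decoded_sequences):
        fasta_file.write(f">generated_{i}\n{sequence}\n")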
And that’s it! You’ve now successfully trained a protein language model from scratch and generated new samples from it.