Conformational diversity analysis reveals three functional mechanisms in proteins

Conformational diversity analysis reveals three functional mechanisms in proteins

This paper was published recently in Plos Comp Bio and looks at the conformational diversity (flexibility) of protein structures by comparing solved structures of identical sequences.

The premise of the work is that different crystal structures of the same protein represent instances of the conformational space of the protein. These different instances are identical in amino acid sequence but often differ in other ways they could come from different crystal forms or the protein could have different co-factors bound or have undergone post translational modifications.

The data set used in the paper came from CoDNaS (conformational diversity of the native state) Database URL: http://ufq.unq.edu.ar/codnas.

Only structures solved using X-ray crystallography to a resolution better than 2.5A were used and only proteins for which at least 5 conformers were available (average of 15.53 conformers per protein). Just under 5000 different protein chains made up the set. In order to describe the protein chains the measure used was maximum conformational diversity (the maximum RMSD between any of the conformers of a given protein chain).

The authors describe a relationship between this maximum conformational diversity and the presence absence of intrinsically disordered regions (IDRs). An IDR was defined as a segment of at least 5 contiguous residues with missing electron density (the first and last 20 residues of the chain were not included).

The proteins were divided into three groups.

Rigid

No IDRS

Partially disordered

IDRs in at least one conformer
IDR in the maximum RMSD pair of conformational diversity

Malleable

IDRs in at least one conformer
No IDR in the maximum RMSD pair of conformational diversity

Rigid proteins have in general lower conformational diversity than partially disordered than Malleable. The authors describe how these differences are not due to crystallographic conditions, protein length, number of crystal contacts or number of conformers.

The authors then go on to compare other properties based on these three types of protein chains including amino acid composition, loop RMSD and cavities and tunnels.

They summarise their findings with the figure below.

Interesting Antibody Papers

Here we highlight two antibody papers, one from the past one more recent. The more recent one talks about developing an affinity maturation model. The older one is a refresher on the Developability Index — how to computationally harness hydrophobicity and accessible surface areas to predict aggregation.

Mouse antibody maturation model — the most expanded (common) clones might not be the ones with highest affinities here (van Kampen lab). The authors of the paper define a model of affinity maturation. The main take-home message of the paper is that the ‘most expanded’ clones might not be the ones with highest affinity — expanded clones are assumed to be the ones ‘responding’ to the antigenic challenge. The model is based on Ordinary Differential Equations, tracing cell fate in a germinal center. The model was compared to experimental expansion data from lymph nodes for accuracy. In each such model one needs to assume a lot of parameters, such as which day post-immunization do we start somatic hypermuatation? The paper is a very nice example of a model of maturation and a good starting point for tracing references citing germinal center biology and numbers for parameters used for models (also the general canon of construction of such models!).

Developability index here. (Trout lab at MIT). The authors touch on a very important subject of antibody developability: after you produced your ab binder, does it have physicochemical characteristics which are suitable to carry on with it as a therapeutic. Such characteristics include stability, expression yields and aggregation propensity. Aggregation propensity is one of the most important factors here as it affects the pharmacokinetics of the drug as well as shelf life. In this manuscript, authors address attempt to predict the aggregation propensity of antibodies. As background data, they use twelve antibodies whose long term stability has been measured over several years. To develop a computational method to predict antibody aggregation propensity, they use a score which combines hydrophobicity and electrostatic factors. The hydrophobicity is an adapted SAP score which the authors developed previously, and whose main parameters are the exposed residue area and hydrophobicity of the residue as defined by Black and Mould. The electrostatics are calculated using PROPKA. Since combining the scores into a predictive model involved parametrization, they use seven of the antibodies to adjust the coefficients. They use the rest to demonstrate that their model has predictive power. Calculation of their models requires a structure of an antibody which they obtain using WAM. Take home messages? It is a nice dataset to play with aggregation prediction and it demonstrates how to calculate electrostatics and hydrophobicity of a molecule.

Protein structure determination using metagenome sequence data

For this week’s journal club, I presented a recent paper from Ovchinnikov, and the David Baker group – Protein structure determination using metagenome sequencing data. This discussed how incorporating metagenome sequence data into multiple sequence alignments, can assist with and improve residue-residue contact prediction. The paper concludes with the prediction of over 600 structures from protein families that currently have no solved structures.

The Pfam database contains 14,849 protein families with 50 or more residues. However, only 4752 of these families have at least one member with an experimentally determined structure. 3984 of the remaining 10,097 families have reliable comparative models built on the basis of homologs of known structure. Less confident comparative models can be built for a further 902 families, however this leaves 5211 families with no structural information.

The recent technological advances in genome sequencing have provided an increasingly large number of amino acid sequences to work with. Large numbers of sequences allows the identification of compensatory mutations that have occurred in residues that are in contact with each other. This is called evolutionary co-variance and can allow the relatively accurate prediction of residues that are in contact in a structure. Rosetta utilises these co-evolutionary couplings, along with partial structural matches (found by combining the predicted contacts with contact patterns of known structures, using the map_align algorithm ) to predict structures from a number of families with fold-level accuracy ( TM-score > 0.5 ). However, it was unknown if this method could be used to accurately predict protein structures on a large-scale.

One challenge in using co-evolutionary couplings to predict residue-residue contacts is that a large number of sequences (hundreds to thousands) are needed. The accuracy of the predicted contacts is also dependent on the diversity of the sequences in a family, and the length of the protein. N_f is a measure that incorporates all of these factors :

Figure 1A shows the dependence of Rosetta structure prediction accuracy on the N_f. In general, where N_f≥64, accuracy typical of comparative modelling (TM-score > 0.7) can be achieved. For N_f≥32, fold-level accuracy (TM-score > 0.5) can be achieved, below this, accuracy falls off. Of the 5211 families with no structural information, only ~400 of these had N_f≥64; therefore accurate structural modelling could not be achieved for the remaining ~4800 of these families using the sequencing data available on UniRef100.

Fig 1. (a) Accuracy of predicted structures produced with and without refinement by Rosetta for families with different Nf values. (b) Number of protein families with Nf≥64 between 2009 and 2015 using UniRef100 database, and UniRef100 and Metagenome data. (c) Percentage of protein families with Nf scores 4, 8, 16, 32, and 64 including sequences from UniRef100 and metagenome data.

The addition of metagenome sequence data (from shotgun sequencing microbial DNA from environmental samples) increased the proportion of families with N_f≥64 from 0.08, to 0.25. The proportion of families with N_f≥32 also increased from 0.16, to 0.33. The difference in the fraction of protein families with N_f≥64 before and after the addition of metagenome sequence data can be seen in Figure 1B, and Figure 1C shows the percentage of families with N_f scores above 4, 8, 16, 32 and 64.

After running a set of benchmark calculations, this larger set of sequence data were used to generate models for 921 protein families, which now had N_f≥64 and also had number of long range contacts greater than half the number of residues in the protein. Of these 921 protein families, models with predicted TM scores > 0.65 were generated for 614 families. Although these were only predicted TM scores, crystal structures for members of 5 of the 614 families have since been published and had a TM-score > 0.7 when compared with the corresponding model.

Limitations with this using this data include the lack of eukaryotic genetic information currently, as well as the lack of explicit modeling of ligands, co-factors and lipids using the Rosetta workflow. However, the fast rate of increase in metagenome sequencing data (as compared to the rate of increase of sequencing data in UniRef100) means that while these new models fill roughly 12% of the unknown structural information for protein families, the potential for future structural prediction is bright.

Bitbucket and PyCharm – Tools to make a DPhil less problematic

I find Git a wonderful tool for my work, with version control providing much needed damage control to my projects. I also find Git incredibly powerful at making my working life easier, with the ability to use git push and git pull to synchronise my code between the various computers that I use for my DPhil. Via a BitBucket account, providing a remote Git repository, I am able to move my code around to wherever I am working and allow more room for either more procrastination or staring at my screen in confusion.

As simple as GIT is, it can be a fiddle entering the git commands in command line as well as remembering to do this as you rush to leave the building. This has all been made much easier with PyCharm, from JetBrains. This IDE (integrated development environment) has many tools including version control such as support for a variety of file types, PEP8 checks to ensure good quality code and its ability to work with ipython notebooks.

I’ve put the following mini-tutorial together for those who want to make or bring in an existing repository to PyCharm and get version control working:

Continue reading →

Colour page counter

So you’ve written the thesis, you’ve been examined, the corrections are done, and now you are left with just wearing the silly clown robes to get a piece of paper with your name on it. However, you’ve been informed that you aren’t allowed to don the silly robes until you print the damn thing (again) and submit it to the Bod to be ignored for generations to come. Oh, and the added bonus is that you have to pay for it. Naturally, you want the high-quality printing and paper to match for the final versions, but it’s all so expensive. At least you can save a few meagre pounds by specifying only the pages for colour printing. Naturally, I decided that I would spend far more time making a script than just counting them myself (which I did anyway to verify it works). Enjoy.

#!/usr/bin/env Rscript


library(data.table)

args <- commandArgs(trailingOnly=TRUE) x <- system(paste("gs -q -o - -sDEVICE=inkcov",args[1]," | awk '{print $1,$2,$3}'"),intern=TRUE) x <- as.data.table(tstrsplit(x,' ')) x[,c("V1","V2","V3"):=.(as.numeric(V1),as.numeric(V2),as.numeric(V3))] print(paste("Colour pages total:",sum(rowSums(x)!=0))) print(paste("Colour pages:", paste(which(rowSums(x) != 0),collapse=', ')))

Faster FREAD with Pandas

One of the things I like to do is to scale up things using the ridiculous amount of cores at my disposal (sometimes even for a good reason). One of these examples is when I had to model millions of CDRs (or loops) using FREAD.

The process through which you model a loop in Fread is:

Pre-filtering step: Anchor Ca separation and ESST score in between your target and all the templates in the DB. The ones that pass a threshold are saved for step 2
Anchor RMSD test

The major bottleneck for such an analysis is step 1, where most of the templates are filtered out so for step 2 you get a very reduced subset. The data for doing the Anchor Ca separation and ESST score is all stored for each possible template in one row of an sqlite database. So when you do step 1 you will go through each row of this table and calculate the score, with the database is stored on the hard drive so costly I/O. This is fine for the original purpose of Fread, where you filled in a missing loop for one structure, but when you are doing it for 100 million examples, going through a table stored on a hard drive 100 million times, sequentially, it is going to be SLOW. I say sequentially because for the python implementation using sqlite3 I had a lot of trouble trying to use a db handle on multiple threads, or load the same sql file on different instances on threads, it just crashes for no good reason. There has been chat about this on stackoverflow and I think this has moved on since I implemented it in 2015. Nevertheless, I wanted a simple and clean solution.

I decided to transform the sqlite3 database into a Pandas object. Pandas objects are basically a convenient way of storing tables with methods available that mimick conventional querying mechanisms for databases. These are stored in memory, easily dumped as pickle files, and can be easily duplicated between threads so no issues with thread safety. Obviously you need to have enough memory to store all of that, but for my application that was not a problem. Below is some sample code on how I used it to transform the template DB from FREAD.

import pandas as pd
import sqlite3 as sql

rows = []

# connect to your fread sql file
conn = sql.connect("fread_sql_file.sql")
try:
    query = "SELECT dihedral, sequence, pdbcode, start, anchor, bound FROM loops"
    results = conn.execute(query)
    for row in results:
        # store the rows as a list of dictionaries
        rows.append(dict(zip(['dihedral', 'length', 'pdbcode', 'anchor', 'sequence', 'start', 'bound'], [row[0], len(row[1]), row[2], row[4] ,row[1], row[3], row[5]])))
        
except Exception as e:
    print "Error during query", str(e)
    conn.close()

# create a pandas dataframe from the list of dictionaries 
df = pd.DataFrame(rows)
# store the table as a pickle file which you can reload later (this is very fast!)
df.to_pickle("fread_pandas_file.pickle")

After running this you will have your sql database as a pandas dataframe, and you can write methods which are thread safe to model loops as below:

import pandas as pd

THRESHOLD = 25
cdr_db pd.read_pickle("fread_pandas_file.sqlite")


def model_loop(query_sequence, query_anchors_ca):
    # score_sequence_db_helper is your function that attaches a scores based on your query sequence and the row in the template db
    scores = cdr_db.apply(lambda row: score_sequence_db_helper(row, query_sequence, query_anchors_ca), axis=1)

    # attach the score
    results = zip(list(cdr_db['pdbcode']), scores, list(cdr_db['sequence']))

    # keep the ones that are over the threshold
    results = filter(lambda (pdb_code, score, sequence): score>=THRESHOLD, results)
    
    return results

A very basic introduction to Random Forests using R

Random Forests is a powerful tool used extensively across a multitude of fields. As a matter of fact, it is hard to come upon a data scientist that never had to resort to this technique at some point. Motivated by the fact that I have been using Random Forests quite a lot recently, I decided to give a quick intro to Random Forests using R.

So what are Random Forests? Well, I am probably not the most suited person to answer this question (a google search will reveal much more interesting answers) , still I shall give it a go. Random Forests is a learning method for classification (and others applications — see below). It is based on generating a large number of decision trees, each constructed using a different subset of your training set. These subsets are usually selected by sampling at random and with replacement from the original data set. The decision trees are then used to identify a classification consensus by selecting the most common output (mode). While random forests can be used for other applications (i.e. regression), for the sake of keeping this post short, I shall focus solely on classification.

Why R? Well, the quick and easy question for this is that I do all my plotting in R (mostly because I think ggplot2 looks very pretty). I decided to explore Random Forests in R and to assess what are its advantages and shortcomings. I am planning to compare Random Forests in R against the python implementation in scikit-learn. Do expect a post about this in the near future!

The data: to keep things simple, I decided to use the Edgar Anderson’s Iris Data set. You can have a look at it by inspecting the contents of iris in R. This data set contains observations for four features (sepal length and width, and petal length and width – all in cm) of 150 flowers, equally split between three different iris species. This data set is fairly canon in classification and data analysis. Let us take a look at it, shall we:

As you can observe, there seems to be some separation in regards to the different features and our three species of irises [note: this set is not very representative of a real world data set and results should be taken with a grain of salt].

Training and Validation sets: great care needs to be taken to ensure clear separation between training and validation sets. I tend to save the cases for which I am actually interested in performing predictions as a second validation set (Validation 2). Then I split the remaining data evenly into Training and Validation 1.

Let us split our data set then, shall we?

# Set random seed to make results reproducible:
set.seed(17)
# Calculate the size of each of the data sets:
data_set_size <- floor(nrow(iris)/2)
# Generate a random sample of "data_set_size" indexes
indexes <- sample(1:nrow(iris), size = data_set_size)

# Assign the data to the correct sets
training <- iris[indexes,]
validation1 <- iris[-indexes,]

Before we can move on, here are some things to consider:

1- The size of your data set usually imposes a hard limit on how many features you can consider. This occurs due to the curse of dimensionality, i.e. your data becomes sparser and sparser as you increase the number of features considered, which usually leads to overfitting. While there is no rule of thumb relating to how many features vs. the number of observations you should use, I try to keep e^Nf < No (Nf = number of features, No = number of observations) to minimise overfitting [this is not always possible and it does not ensure that we won’t overfit]. In this case, our training set has 75 observations, which suggests that using four features (e^4 ~ 54.6) is not entirely absurd. Obviously, this depends on your data, so we will cover some further overfitting checks later on.

2- An important thing to consider when assembling training sets is the proportion of negatives vs. positives in your data. Think of an extreme scenario where you have many, many more observations for one class vs. the others. How will this affect classification? This would make it more likely for the classifier to predict the dominant class when given new values. I mentioned before that the iris set is quite nice to play with. It comes with exactly 50 observations for each species of irises. What happens if you have a data set with a much higher number of observations for a particular class? You can bypass any imbalance regarding the representation of each class by carefully constructing your training set in order not to favour any particular class. In this case, our randomly selected set has 21 observations for species setosa and 27 observations for each of species versicolor and virginica, so we are good to go.

3- Another common occurrence that is not represented by the iris data set is missing values (NAs) for observations. There are many ways of dealing with missing values, including assigning the median or the mode for that particular feature to the missing observation or even disregarding some observations entirely, depending on how many observations you have. There are even ways to use random forests to estimate a good value to assign to the missing observations, but for the sake of brevity, this will not be covered here.

Right, data sets prepared and no missing values, it is time to fire our random forests algorithm. I am using the randomForest package. You can click the link for additional documentation. Here is the example usage code:

#import the package
library(randomForest)
# Perform training:
rf_classifier = randomForest(Species ~ ., data=training, ntree=100, mtry=2, importance=TRUE)

Note some important parameters:

-The first parameter specifies our formula: Species ~ . (we want to predict Species using each of the remaining columns of data).
–ntree defines the number of trees to be generated. It is typical to test a range of values for this parameter (i.e. 100,200,300,400,500) and choose the one that minimises the OOB estimate of error rate.
–mtry is the number of features used in the construction of each tree. These features are selected at random, which is where the “random” in “random forests” comes from. The default value for this parameter, when performing classification, is sqrt(number of features).
–importance enables the algorithm to calculate variable importance.

We can quickly look at the results of our classifier for our training set by printing the contents of rf_classifier:

> rf_classifier

Call:
 randomForest(formula = Species ~ ., data = training,ntree=100,mtry=2, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 2

        OOB estimate of  error rate: 5.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         21          0         0  0.00000000
versicolor      0         25         2  0.07407407
virginica       0          2        25  0.07407407

As you can see, it lists the call used to build the classifier, the number of trees (100), the variables at each split (2), and it outputs a very useful confusion matrix and OOB estimate of error rate. This estimate is calculated by counting however many points in the training set were misclassified (2 versicolor and 2 virginica observations = 4) and dividing this number by the total number of observations (4/75 ~= 5.33%).

The OOB estimate of error rate is a useful measure to discriminate between different random forest classifiers. We could, for instance, vary the number of trees or the number of variables to be considered, and select the combination that produces the smallest value for this error rate. For more complicated data sets, i.e. when a higher number of features is present, a good idea is to use cross-validation to perform feature selection using the OOB error rate (see rfcv from randomForest for more details).

Remember the importance parameter? Let us take a look at the importance that our classifier has assigned to each variable:

varImpPlot(rf_classifier)

Each features’s importance is assessed based on two criteria:

-MeanDecreaseAccuracy: gives a rough estimate of the loss in prediction performance when that particular variable is omitted from the training set. Caveat: if two variables are somewhat redundant, then omitting one of them may not lead to massive gains in prediction performance, but would make the second variable more important.

-MeanDecreaseGini: GINI is a measure of node impurity. Think of it like this, if you use this feature to split the data, how pure will the nodes be? Highest purity means that each node contains only elements of a single class. Assessing the decrease in GINI when that feature is omitted leads to an understanding of how important that feature is to split the data correctly.

Do note that these measures are used to rank variables in terms of importance and, thus, their absolute values could be disregarded.

Ok, great. Looks like we have a classifier that was properly trained and is producing somewhat good predictions for our training set. Shall we evaluate what happens when we try to use this classifier to predict classes for our validation1 set?

# Validation set assessment #1: looking at confusion matrix
prediction_for_table <- predict(rf_classifier,validation1[,-5])
table(observed=validation1[,5],predicted=prediction_for_table)

            predicted
observed     setosa versicolor virginica
  setosa         29          0         0
  versicolor      0         20         3
  virginica       0          1        22

The confusion matrix is a good way of looking at how good our classifier is performing when presented with new data.

Another way of assessing the performance of our classifier is to generate a ROC curve and compute the area under the curve:

# Validation set assessment #2: ROC curves and AUC

# Needs to import ROCR package for ROC curve plotting:
library(ROCR)

# Calculate the probability of new observations belonging to each class
# prediction_for_roc_curve will be a matrix with dimensions data_set_size x number_of_classes
prediction_for_roc_curve <- predict(rf_classifier,validation1[,-5],type="prob")

# Use pretty colours:
pretty_colours <- c("#F8766D","#00BA38","#619CFF")
# Specify the different classes 
classes <- levels(validation1$Species)
# For each class
for (i in 1:3)
{
 # Define which observations belong to class[i]
 true_values <- ifelse(validation1[,5]==classes[i],1,0)
 # Assess the performance of classifier for class[i]
 pred <- prediction(prediction_for_roc_curve[,i],true_values)
 perf <- performance(pred, "tpr", "fpr")
 if (i==1)
 {
     plot(perf,main="ROC Curve",col=pretty_colours[i]) 
 }
 else
 {
     plot(perf,main="ROC Curve",col=pretty_colours[i],add=TRUE) 
 }
 # Calculate the AUC and print it to screen
 auc.perf <- performance(pred, measure = "auc")
 print(auc.perf@y.values)
}

Here is the final product (ROC curve):

And here are the values for our AUCs:

Setosa
AUC = 1

Versicolor
AUC = 0.98

Virginica
AUC = 0.98

Voila! I hope this was somewhat useful!

Interesting Antibody Papers

This time round, one older paper, one recent paper. The older one talks about estimating how many H3s can there be in a human body based on sequencing of two individuals (they cap it at 9 million — not that much!). The more recent one is an attempt to define what makes a good antibody in terms of its developability properties (a battery of biophys assays on 150 therapeutic antibodies- amazing dataset to work with).

High resolution description of antibody heavy chain repertoires in (two) humans (Koralov lab at NYU). Here. Two individuals were sequenced and their VDJ frequencies measured. It is widely believed that the VDJ recombination events are largely independent and random. Here however they demonstrate some biases/interplay between the D and J regions. Since H3 falls on the VDJ junction, it might suggest that it affects the total choice of H3. Another quite important point is that they compared the productive vs nonproductive sequences (out of frame or with stop codons). If there were significant differences between the VDJ frequencies of productive vs nonproductive sequences, it would suggest selection at the VDJ frequency stage. However they do not see any significant differences here suggesting that VDJ combinations have little bearing on this initial selection step. Finally, they estimate the number of H3 in repertoire. The technique is interesting — they sample 1000 H3s from their set and see how many unique sequences it contributes. Each next sample contributes less and less unique sequences which leads to a log-decay curve. By doing so they get a rough estimate of when there will be no more new sequences added and hence an estimate of diversity (think why do this rather than counting the number of uniques!). They allow themselves to extrapolate this estimate to the whole organism by multiplying their blood sample by the total human body volume — they motivate this extrapolation by the fact that there were precious little overlaps between the two human subjects.

Biophysical landscape of clinical stage antibodies [here]. Paper from Adimab. Designing an antibody which binding its target is only a first step on the way to bring the drug on the market. The molecule needs to fulfill a variety of characteristics such as colloidal stability (does not aggregate or ‘clump up’), does not instantly clear from the organism (which is usually down to off target binding), is stable and can be expressed in reasonable quantities. In an effort to delineate what makes a good antibody, the authors take inspiration from earlier work on small molecules, namely the Lipinski Rules of Five. This set of rules describes what makes a ‘good’ small molecule drug, which was assessed by looking at ~2000 therapeutic drugs. The rules came down to certain numbers of hydrogen bond donors, acceptors, molecular weight & lipophilicity. Therefore, Jain et al would like a similar methodology, but for antibodies: give me an antibody and using methodology/rules we define, we will tell you either to carry on with development or maybe not. Since antibodies are far more complex and the data on therapeutic abs orders of magnitude smaller (around 50 therapeutic abs to date) Jain et al, had to devise a more nuanced approach than simply counting hb donors/acceptors mass etc. The underlying ‘good’ molecule data though is similar: they picked therapeutic antibodies and those in late clinical testing stages (2,3). This resulted in ~150 antibodies. So as to devise the benchmark ‘rules/methodology’, they went for a battery of assays to serve as a benchmark — if your ab raises too many red flags according to these assays, it’s not great (what constitutes a red flag to be defined). These assays were supposed to not be obscure and relatively easy to use as the point was that an arbitrary antibody can be relatively easy checked against them. The assays are a range of expression, cross reactivity, self reactivity, thermal stability etc. To define red flags, they run their therapeutic/clinical antibodies through the tests. To their surprise quite a lot of these molecules turn out to have quite ‘undesirable characteristics’. Following the Lipinski Rules, they define a red flag as being in the bottom 10th percentile of the assay values as evaluated on the therapeutic abs. They show that the antibodies which are approved or in more advanced clinical trials stages have less red flags. Therefore, the take-home messages from this paper: very nice dataset for any computational work, raising red flags does not disqualify you from being a therapeutic.

Biophysical Society 61st Annual Meeting – New Orleans, February 2017

As the sole representative of OPIG attending Biophys 2017 in New Orleans, I had to bear the heavy burden of a long and lonely flight and the fear of missing out on a week of the very grey Oxford winter. Having successfully crossed the border into the US, which was thankfully easier for me than it was for some of our scientific colleagues from around the world, I found my first time attending the conference to be full of very interesting and relevant science. While also covering a wide variety of experimental techniques and non-protein topics, the conference is so large and broad that there was more than enough to keep me busy over the five days, featuring folding, structure prediction, docking, networks, and molecular dynamics.

There were several excellent talks on the subject of folding pathways, misfolding and aggregation. A common theme was the importance of the kinetic stability of the native state, and the mechanisms by which it may be prevented from reaching a non-native global thermodynamic minimum. This is particularly important for serpins, large protease inhibitors which inactivate proteases by a suicide mechanism. The native and active state can be transformed into a lower energy conformation over long timescales. However, this also occurs by cleavage near the C-terminal end, which allows insertion of the C-terminal tail into a beta sheet, holding the cleaving protease inactive and therefore the stored energy is very important for function. Anne Gershenson described recent simulations and experiments to elucidate the order in which substructures of the complete fold assemble. There are many cooperative substructures in this case, and N-terminal helices form at an early stage. The overall topology appears to be consistent with a cotranslational folding mechanism inside the ER, but requires significant rearrangements after translation for adoption of the full native fold.

Cotranslational folding was also discussed by several others including the following: Patricia Clark is now using the YKB system of alternately folding fluorescent protein to find new translation stalling sequences; Anais Cassaignau described NMR experiments to show the interactions taking place between nascent chains and the ribosome at different stalled positions during translation; and Daniel Nissley presented a model to predict a shift in folding mechanism from post-translational to cotranslational due to specific designed synonymous codon changes, which agreed very well with experimental data.

To look more deeply into the evolution of folding mechanisms and protein stability, Susan Marqusee presented a study of the kinetics of folding of RNases, comparing the properties of inferred ancestral sequences to a present day thermophile and mesophilic E. coli. A number of reconstructed sequences were expressed, and it was found that moving along either evolutionary branch from the ancestor to modern day, folding and unfolding rates had both decreased, but the same three-state folding pathway via an intermediate is conserved for all ancestors. However, the energy transition between the intermediate and the unfolded state has evolved in opposite directions even while the kinetic stability remains similar. This has led to the greater thermodynamic stability seen in the modern day thermophile compared to the mesophile at higher temperatures and concentrations of denaturant.

Panel C shows that kinetic stability (low unfolding rate) seems to be selected for in both environments. Panel D shows that the thermodynamic stability of the intermediate (compared to the unfolded state) accounts for the differences in thermodynamic stability of the native state, when compared to the common ancestor (0,0). Link to paper

There were plenty of talks discussing the problems and mechanisms of protein aggregation, with two focussing on light chain amyloidosis. Marina Ramirez-Alvarado was investigating how fibrils begin to grow and showed using microscopy that both soluble light chains and fibrils (more slowly) are internalised by heart muscle cells. They can then be exposed at the cell surface and become a seed to recruit other soluble light chains to form fibrils. Shannon Esswein presented work on the enhancement of VL-VL dimerisation to prevent amyloid formation. The variable domain of the light chain (VL) can pair with itself in a similar orientation to its pairing with VH domains in normal antibodies, or in a non-canonical orientation. Adding disulphide bonds to stabilise these dimers prevented fibril formation, therefore they carried out a small scale screen of 27 aromatic and hydrophobic ligands to find those which would favour dimer formation by binding at the interface. Sulfasalazine was detected in this screen and was also shown to significantly reduce fibril formation and could therefore be used as a template for future drug design.

A ligand stabilises the dimer therefore fewer light chains are present as monomers, slowing the rate of the only route by which fibrils can be formed. Link to paper

Among the posters, Alan Perez-Rathke presented loop modelling by DiSGro in beta barrel membrane proteins which showed that the population of structures generated and scored favourably after relaxation at a pH 7 led to an open pore more often than at pH 5, consistent with experimental observations. There were two posters on the topic of prediction of membrane protein expression in bacteria and yeast presented by students of Bill Clemons, who also gave a great talk. Shyam Saladi has carefully curated datasets of successes and failures in expression in E. coli and trained a linear SVM on features such as RNA secondary structure and transmembrane segment hydrophobicity to predict the outcome for unknown proteins. This simple approach (preprint available here) achieved area under ROC curve of around 0.6 on a separate test set, and using more complex machine learning techniques is likely to improve this. Samuel Schulte is adapting the same method for prediction of expression in yeast.

Overall, it was a great conference and it was nice to hear about plenty of experimental work alongside the more familiar computational work. I would also highly recommend New Orleans as an excellent place to find great food, jazz and sunshine!

Using Antibody Next Generation Sequencing data to aid antibody engineering

I consider myself a wet lab scientist and I had not done any dynamic programming language like Python before starting my DPhil. My main interests lie in development of improved antibody humanization campaigns, rational antibody phage display library constructions and antibody evolution. Having completed industrial placement at MedImmune, I saw the biotechnology industry from the inside and realized that scientists who could bridge computer science and wet lab fields are in high demand.

The title of my DPhil is very broad, and research itself is data rather than hypothesis driven. Our research group collaborates with UCB Pharma, which has sequenced whole antibody repertoires across a number of species. Datasets might contain more than 10 million sequences of heavy and light variable chains. But even these datasets do not cover more than 1% of the theoretical repertoire, hence looking at entropies of sequences rather than mere sequences could provide insights into differences between intra- and inter- species datasets.

NGS of antibody repertoires provides snapshots of repertoire diversity, entropy as well as sequences. Reddy, S.T. et al 2010 showed that this information could be successfully used to pull target specific variable chains. But most of research groups believe that main application of NGS is immunodiagnostics (Grieff et al., 2015).

My project involves applying software developed by our research group namely, Anarci (Dunbar J and Deane CM., 2016) and ABodyBuilder (Leem J. et al 2016). Combination of both softwares allows analysis of NGS datasets at an unprecedented rate (1 million sequences per 7 hours). A number of manipulations can be performed on datasets to standardize them and make data reproducible, which is a big issue in science. It is possible to re-assign germlines, numbering schemes and complementary determining region (CDR) definitions of a 10 million dataset in less than a day. For instance, UCB provided data required our variable chains to be re-numbered according to IMGT numbering and CDR definition (Lefranc M., 2011). The reason for the IMGT numbering scheme selection is that it supports symmetrical amino acid numbering of CDRs, which allows for improved assignment of positions to amino acids that are located in the same structural space between different length CDRs (Figure 1).

Figure 1. IMGT numbering and CDR definition of CDR3. Symmetrical assignment of positions to amino acids in HCDR3 allows for better localization of V,D,J genes: V gene encodes for the amino terminus, J gene encodes the carboxyl terminus of CDR3, and D gene the mid portion.

To sum up, analysis of CDR lengths, CDR and framework amino acid compositions, finding novel patterns in antibody repertoires will open up new rational steps of antibody humanization and affinity maturation. The key step will be to determine amino acid scaffolds that define humanness of antibody or in other words, scaffolds that are not immunogenic in humans.

References:

Dunbar J., and Deane CM., ANARCI: Antigen receptor numbering and receptor classification. Bioinformatics (2016)
Grieff V., A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status. Genome Medicine (2015)
Leem J., et al. ABodyBuilder: automated antibody structure prediction with data-driven accuracy estimation. mAbs. (2016)
Lefranc M., IMGT, the International ImMunoGeneTics Information System. Cold Spring Harb Protoc. (2011)
Reddy ST., et al. Monoclonal antibodies isolated without screening by analyzing the variable-gene repertoire of plasma cells. Nat Biotech. (2010)

Oxford Protein Informatics Group

or "OPIG" to friends

Conformational diversity analysis reveals three functional mechanisms in proteins

Interesting Antibody Papers

Protein structure determination using metagenome sequence data

Bitbucket and PyCharm – Tools to make a DPhil less problematic

Colour page counter

Faster FREAD with Pandas

A very basic introduction to Random Forests using R

Interesting Antibody Papers

Biophysical Society 61st Annual Meeting – New Orleans, February 2017

Using Antibody Next Generation Sequencing data to aid antibody engineering