Category Archives: Protein Structure

More Fun With 3D Printing

Recently the students of the Systems Approaches to Biomedical Science Centre for Doctoral Training took a 2-week module on our favourite subject: structural biology! As part of this, they were given the option to create their very own 3D printed model of a protein.

This year we had some great models created, some of which are shown in the picture above. The proteins are (clockwise from top left):

Clathrin (PDB 1XI4) – a really interesting protein that forms cages around vesicles inside the cell. This one was mine; I wrote about clathrin as part of my undergraduate dissertation many years ago…
GTPase (PDB 1YZN) – a protein that can bind and hydrolyse guanosine triphosphate (GTP), involved in membrane trafficking
TAL effector (PDB 3UGM) – this bacterial protein binds to specific regions of DNA in a host plant to activate the expression of plant genes that aid bacterial infection. The DNA here is in blue, the orange wrapped around it is the protein.
Mechanotransduction ion channel (PDB 5VKQ) – converts mechanical stimuli into electrical signals in specialized sensory cells.
ATP synthase – this protein machine builds most of the energy storage molecule ATP, which powers our cellular processes.
DNA (PDB 5F9I) – a double-helix strand of DNA, 20 base pairs long.

Protein Structure Classification: Order in the Chaos

The number of known protein structures has increased exponentially over the past decades; there are currently over 127,000 structures deposited in the PDB [1]. To bring order to this large volume of data, and to further our understanding of protein function and evolution, these structures are systematically classified according to sequence and structural similarity. Downloadable classification data can be used for annotating datasets, exploring the properties of proteins and for the training and benchmarking of new methods [2].

Yearly growth of structures in the PDB (adapted from [1])

Typically, proteins are grouped by structural similarity and organised using hierarchical clustering. Proteins are sorted into classes based on overall secondary structure composition, and grouped into related families and superfamilies. Although this process could originally be manually curated, as with Structural Classification of Proteins (SCOP) [3] (last updated in June 2009), the growing number of protein structures now requires semi- or fully-automated methods, such as SCOP-extended (SCOPe) [4] and Class, Architecture, Topology, Homology (CATH) [5]. These resources are comprehensive and widely used, particularly in computational protein research. There is a large proportion of agreement between these databases, but subjectivity of protein classification is to be expected. Variation in methods and hierarchical structure result in differences in classifications. For example, different criteria for defining and classifying domains results in inconsistencies between CATH and SCOPe.

The arrangements of secondary structure elements in space are known as folds. As a result of evolution, the number of folds that exist in nature is thought to be finite, predicted to be between 1000-10,000 [6]. Analysis of currently known structures appears to support this hypothesis, although solved structures in the PDB are likely to be a skewed sample of all protein structures space. Some folds are extremely commonly observed in protein structures.

In his ‘periodic table for protein structures’, William Taylor went one step further in his goal to find a comprehensive, non-hierarchical method of protein classification [7]. He attempted to identify a minimal set of building blocks, referred to as basic Forms, that can be used to assemble as many globular protein structures as possible. These basic Forms can be combined systematically in layers in a way analogous to the combination of electrons into valence shells to form the periodic table. An individual protein structure can then be described as the closest matching combination of these basic Forms. Related proteins can be identified by the largest combination of basic Forms they have in common.

The ‘basic Forms’ that make up Taylor’s ‘periodic table of proteins’. These secondary structure elements accounted for, on average, 80% of each protein in a set of 2,230 structures (all-alpha proteins were excluded from the dataset) [7]

The classification of proteins by sequence, secondary and tertiary structure is extensive. A relatively new frontier for protein classification is the quaternary structure: how proteins assemble into di-, tri- and multimeric complexes. In a recent publication by an interdisciplinary team of researchers, an analysis of multimeric protein structures in combination with mass spectrometry data was used to create a ‘periodic table of protein complexes’ [8]. Three main types of assembly steps were identified: dimerisation, cyclisation and heteromeric subunit addition. These types are systematically combined to predict many possible topologies of protein complexes, within which the majority of known complexes were found to reside. As has been the case with tertiary structure, this classification and exploration of of quaternary structure space could lead to a better understanding of protein structure, function and evolutionary relationships. In addition, it may inform the modelling and docking of multimeric proteins.

RCSB PDB Statistics
Fox, N.K., Brenner, S.E., Chandonia, J.-M., 2015. The value of protein structure classification information-Surveying the scientific literature. Proteins Struct. Funct. Bioinforma. 83, 2025–2038.
Murzin AG, Brenner SE, Hubbard T, Chothia C., 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 247, 536–540.
Fox, N.K., Brenner, S.E., Chandonia, J.-M., 2014. SCOPe: Structural Classification of Proteins–extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 42, 304-9.
Dawson NL, Lewis TE, Das S, et al., 2017. CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Research. 45, 289-295.
Derek N Woolfson, Gail J Bartlett, Antony J Burton, Jack W Heal, Ai Niitsu, Andrew R Thomson, Christopher W Wood,. 2015. De novo protein design: how do we expand into the universe of possible protein structures?, Current Opinion in Structural Biology, 33, 16-26.
Taylor, W.R., 2002. A “periodic table” for protein structures. Nature. 416, 657–660.
Ahnert, S.E., Marsh, J.A., Hernandez, H., Robinson, C. V., Teichmann, S.A., 2015. Principles of assembly reveal a periodic table of protein complexes. Science. 80, 350

Start2Fold: A database of protein folding and stability data

Hydrogen/deuterium exchange (HDX) experiments are used to probe the tertiary structures and folding pathways of proteins. The rate of proton exchange between a given residue’s backbone amide proton and the surrounding solvent depends on the solvent exposure of the residue. By refolding a protein under exchange conditions, these experiments can identify which regions quickly become solvent-inaccessible, and which regions undergo exchange for longer, providing information about the refolding pathway.

Although there are many examples of individual HDX experiments in the literature, the heterogeneous nature of the data has deterred comprehensive analyses. Start2Fold (Start2Fold.eu) [1] is a curated database that aims to present protein folding and stability data derived from solvent-exchange experiments in a comparable and accessible form. For each protein entry, residues are classified as early/intermediate/late based on folding data, or strong/medium/weak based on stability data. Each entry includes the PDB code, length, and sequence of the protein, as well as details of the experimental method. The database currently includes 57 entries, most of which have both folding and stability data. Hopefully, this database will grow as scientists add their own experimental data, and reveal useful information about how proteins refold.

The folding data available in Start2Fold is visualised in the figure below, with early, intermediate and late folding residues coloured light, medium and dark blue, respectively.

[1] Pancsa, R., Varadi, M., Tompa, P., Vranken, W.F., 2016. Start2Fold: a database of hydrogen/deuterium exchange data on protein folding and stability. Nucleic Acids Res. 44, D429-34.

A program to aid primary protein structure determination -1962 style.

This year, OPIG have been doing series of weekly lectures on papers we considered to be seminal in the field of protein informatics. I initially started looking at “Comprotein: A computer program to aid primary protein structure determination” as it was one of the earliest (1960s) papers discussing a computational method of discovering the primary structure of proteins. Many bioinformaticians use these well-formed, tidy, sterile arrays of amino acids as the input to their work, for example:

MGLSDGEWQL VLNVWGKVEA DIPGHGQEVL IRLFKGHPET LEKFDKFKHL KSEDEMKASE DLKKHGATVL TALGGILKKK GHHEAEIKPL AQSHATKHKI PVKYLEFISE CIIQVLQSKH PGDFGADAQG AMNKALELFR KDMASNYKEL GFQG
(For those of you playing at home, that’s myoglobin.)

As the OPIG crew come from a diverse background and frequently ask questions well beyond my area of expertise, if for nothing other than posterior-covering, I needed to do some background reading. Though I’m not a researcher by trade any more, I began to realise despite the lectures/classes/papers/seminars I’d been exposed to, regarding all the clever things you do with a sequence when you have it, I didn’t know how you would actually go from a bunch of cells expressing (amongst a myriad of other molecules) the protein you were interested in, to the neat array of characters shown above. So without further ado:

The first stage in obtaining your protein is: cell lysis and there’s not much in it for the cell.
Mangle your cell using chemicals, enzymes, sonification or a French press (not your coffee one).

The second stage is producing a crude extract by centrifuging the above cell-mangle. This, terrifyingly, appears to be done between 10,000G and 100,000G and removes the cellular debris leaving it as a pellet in the bottom of the container, with the supernatant containing little but a mix of the proteins which were present in the cytoplasm along with some additional macromolecules.

Stage three is to purify the crude extract. Depending on the properties of the protein you’re interested in, one or more of the following stages are required:

Reverse-phase chromatography to separate based on hydrophobicity
Ion-exchange to separate based on the charge of the proteins
Gel-filtration to separate based on the size of the proteins

If all of the above are preformed, whilst the sequence of these variously charged/size-sorted/polar proteins will still be still unknown, they will now be sorted into various substrates based upon their properties. This is where the the third stage departs from science and lands squarely in the realm of art. The detergents/protocols/chemicals/enzymes/temperatures/pressures of the above techniques all differ depending on the hydrophobicity/charge/animal source of the type of protein one is aiming to extract.

Since at this point we still don’t know their sequence, working out the concentrations of the various constituent amino acids will be useful. One of the simplest methods of determining the amino acid concentrations of a protein is follow a procedure similar to:

Heat the sample in 6M HCL at at a temperature of 110C for 18-24h (or more) to fully hydrolyse all the peptide bonds. This may require an extended period (over 72h) to hydrolyse peptide bonds which are known to be more stable, such as those involving valine, isoleucine and leucine. This however can degrade Ser/Thr/Tyr/Try/Gln and Cys which will subsequently skew results. An alternative is to raise the pressure in the vessel to allow temperatures of 145-155C to for 20-240 minutes.

TL;DR: Take the glassware that’s been lying about your lab since before you were born, put 6M hydrochloric acid in it and bring to the boil. Take one difficultly refined and still totally unknown protein and put it in your boiling hydrochloric acid. Seal the above glassware in order to use it as a pressure vessel. Retreat swiftly whilst the apparatus builds up the appropriate pressure and cleaves the protein as required. -What could go wrong?

At this point I wondered if the almost exponential growth in PDB entries was due to humanity’s herd of biochemists now having been thinned to those which remained simply being several generations worth of lucky.

Once you have an idea of how many of each type of amino acid comprise your protein, we can potentially rebuild it. However at this point it’s like we’ve got a jigsaw puzzle and though we’ve got all the pieces and each piece can only be one of a limited selection of colours (thus making it a combinatorial problem) we’ve no idea what the pattern on the box should be. To further complicate matters, since this isn’t being done on but a single copy of the protein at a time, it’s like someone has put multiple copies of the same jigsaw into the box.

Once we have all the pieces, to determine the actual sequence, a second technique needs to be used. Though invented in 1950, Edman degradation appears not to have been a particularly widespread protocol, or at least it wasn’t in the National Biomedical Research Foundation from which the above paper emerged. This means of degradation tags the N-terminal amino acid and cleaves it from the rest of the protein. This can then be identified and the protocol repeated. Whilst this would otherwise be ideal, it suffers from a few issues in that it takes about an hour per cycle, only works reliably on sequences of about 30 amino acids and doesn’t work at all for proteins which have their n-terminus bonded or buried.

Instead, the refined protein is cleaved into a number of fragments at known points using a single enzyme. For example, Trypsin will cleave on the carboxyl side of arginine and lysine residues. A second copy of the protein is then cleaved using a different enzyme at a different point. These individual fragments are then sorted as above and their individual (non-sequential) components determined.

For example, if we have a protein which has an initial sequence ABCDE
Which then gets cleaved by two different enzymes to give:
Enzyme 1 : (A, B, C) and (D, E)
Enzyme 2 : (A, B) and (C, D)

We can see that the (C, D) fragment produced by Enzyme 2 overlaps with the (A, B, C) and (D, E) fragments produced by Enzyme 1. However, as we don’t know the order in which the amino acid appear within in each fragment, thus there are a number of different sequences which can come to light:
Possibility 1 : A B C D E Possibility 2 : B A C D E Possibility 3 : E D C A B Possibility 4 : E D C B A

At this point the paper comments that such a result highlights to the biochemist that the molecule requires further work for refinement. Sadly the above example whilst relatively simple doesn’t include the whole host of other issues which plague the biochemist in their search for an exact sequence.

Oxford Protein Informatics Group

or "OPIG" to friends

Category Archives: Protein Structure

More Fun With 3D Printing

Protein Structure Classification: Order in the Chaos

Start2Fold: A database of protein folding and stability data

A program to aid primary protein structure determination -1962 style.