Category Archives: Conferences

Preparing a five minute conference talk: an honest account

On 26 September I had the opportunity to give a short talk at the COSTNET18 conference in Warsaw. I’d never done anything like it before, which made it both exciting and a tiny bit terrifying. I thought I’d share how I prepared for it, in the hope that other conference newbies might find some of it useful, or at least funny.

20 July

I register for the conference, and apply to give a talk. I use a version of my paper draft abstract, to which I add a couple of introductory sentences. I submit successfully, but at the end of the day accidentally delete this version of the abstract from my computer. I guess if I need it, I just need to wait until the conference programme becomes available. #fail

Cinder: Crystallographic Tinder

Protein structure determination is still dominated by X-ray diffraction. For diffraction studies, structural biologists need to grow and optimise protein crystals until they diffract to a usable and optimal resolution. A purified protein sample is exposed to a number of crystallisation screens, each comprising a selection of chemical conditions that are designed to explore a reasonably wide area of potential crystallisation conditions.

Many crystallography labs routinely image these in large plate storage systems, which reduces the human interaction to viewing a set of usually 100-1000 images at various time points. This is a slow and laborious process, and one well suited to machine learning approaches tailored to looking at images. TexRank, a texton analysis ranking software, was developed by Jia Tsing in OPIG and is used at the Structural Genomics Consortium (SGC). This ranking reduces the number of images that a human needs to search through, providing a quicker review process.

ISMB 2018: Collaborative Structural Biology using Machine Learning and Jupyter Notebook

This post is a summary of the talk, Collaborative Structural Biology using Machine Learning and Jupyter Notebook, given by Fergus Imrie and Fergus Boyles at ISMB 2018. Materials for the experiments can be found here and here.

Four other members of the Oxford Protein Informatics Group (a.k.a. OPIGlets) and I recently had the pleasure of attending the Intelligent Systems for Molecular Biology (ISMB) conference in Chicago. Organised by the International Society for Computational Biology (ISCB), ISMB is the largest computational biology conference in the world, with several thousand attendees.

Spread over four action-packed days in July (not including workshops/tutorial sessions), it was an eye-opening experience, showcasing the depth and breadth of computational biology research; particularly striking was the range of problems tackled, techniques applied, and data sources used.

I was fortunate enough to have the opportunity to present alongside my colleague, Fergus Boyles, as part of the 3DSIG Community of Special Interest (COSI). We led the first hands-on practical demonstration at 3DSIG, entitled “Collaborative Structural Biology using Machine Learning and Jupyter Notebook”. While this was a new format at the conference, and our presentation was somewhat of an experiment, I understand the organising committee is keen to repeat the format next year.

In what follows, I’ll briefly outline the key themes and outcomes from our presentation. Full materials to reproduce all of the results presented can be found here and here.

Reproducibility crisis?

In a survey of 1,500 scientists by Nature in 2016, more than 70% of participants had tried and failed to reproduce another scientist’s experiments, while 90% said there was a reproducibility crisis to some extent. Most striking, perhaps, was the revelation that “more than half have failed to reproduce their own experiments”!

Nature, 2016, M. Baker, 1,500 scientists lift the lid on reproducibility

While the focus of the survey was, admittedly, on traditional, lab-based, experimental research, this is certainly also an issue in computational approaches, with the machine learning community under the heaviest scrutiny.

This is clearly unsustainable and many efforts are being made to address this across the scientific world. As one example, Nature has introduced a code and submission checklist that requires authors to submit custom algorithms or software that are central to the paper for peer review and editorial assessment. While only directly affecting a small portion of research, this is a big step in the right direction and I think we’re only going to see more of this in the future.

Software to the rescue?

With the rise of cloud computing, the open-source community, and much more, there is a plethora of software available that can be used to improve the accessibility of methods and improve the reproducibility of computational experiments. Below, I touch on a couple of general areas that are increasingly used in computational pipelines and setups.

Cloud computing (such as Amazon Web Services, Google Cloud, and Microsoft Azure) provides widely accessible, standardised compute environments, and allows the use of anything from a single core to near-HPC-level resources for a short period of time at relatively low cost.
Container solutions (such as Docker and Kubernetes) allow developers to package an application, with all required libraries and dependencies, into a single executable for the end user, with no further dependencies.

Our approach

We didn’t use any of the above tools for purposes of our talk, but instead constructed our pipeline based on three other widely-used solutions: Conda, Project Jupyter, and Git/GitHub. For those unfamiliar, here is a brief overview of each.

Conda is an open-source package and environment management system. It works by creating distinct virtual environments and installing standalone interpreters or compilers within that virtual environment. You can then install additional packages within that virtual environment, that are completely isolated and separate from your system default packages, and other virtual environments.

For those of you who are familiar with the IPython notebook, Jupyter is an extension of this format to multiple languages. Jupyter provides an interactive browser-based coding environment in the form of a notebook that can be thought of as similar to a lightweight IDE. The power of Jupyter notebooks comes from a combination of (1) the ability to intersperse code with markdown, which is much more human readable and friendly on the eye compared to traditional comments; (2) the cell-based format, where small pieces of code are contained in cells that can be run, and re-run, individually and without re-running the remainder of your code; (3) the ability to display figures and tables (among other things) inline, rendered in HTML.

Git is an open-source version control system. Version control is an essential bedrock of good programming that we don’t have time to go into in more detail here, but long story short, Git takes the headache out of version control.

GitHub is a code hosting platform built for collaboration with Git at its core. Beyond a simple code repository, GitHub allows collaboration and development through two key features. “Forking” allows you to clone other projects, and either develop them yourself, or keep a record of a fixed version for integration within another project. “Pull requests” make large scale community collaboration projects possible, with users providing code for specific modifications for the original projects, which the owners/admin of the original project can choose to merge or reject.

Experiments

As a toy problem to showcase this approach to building a reproducible pipeline, we address the problem of protein classification according to the SCOP classification scheme. While the dataset we have shared contains examples of protein pairs that are in the same fold, superfamily, and family (as well as none of these), we focussed on the most straightforward task of determining whether a pair of proteins belong to the same family or not.

Our dataset is based on the Astral data set (06.02.2016 build), and consists of 8 pairwise features computed from the sequences of the two proteins. We won’t go into the details of the exact features here.

Using a simple random forest on these 8 pairwise features between the target and template protein, we achieved an accuracy of 88.0% and an area under the receiver operating characteristic (ROC) curve of 0.95. A confusion matrix and ROC curve summarising our results can be found below.
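For readers less familiar with the two metrics quoted above, here is a minimal sketch of how accuracy, a confusion matrix and the area under the ROC curve can be computed for a binary "same family or not" classifier. It is not the code from our notebooks: the function names, the labels and the scores are all made up for illustration, and the random forest itself is replaced by a hand-written list of predicted probabilities.

```python
from typing import List, Tuple


def confusion_matrix(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tn, fp, fn, tp) for binary labels encoded as 0/1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of pairs whose predicted label matches the true label."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
    return (tn + tp) / (tn + fp + fn + tp)


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """Area under the ROC curve, computed as the probability that a random
    positive example receives a higher score than a random negative one
    (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (invented): 1 = same family, 0 = different family.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]          # stand-in for classifier probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

print(confusion_matrix(y_true, y_pred))
print(accuracy(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas the AUC summarises the ranking of scores across all thresholds, which is why it is useful to report both.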

Instructions to reproduce these results, together with all materials needed, can be found here and here.

Conclusions

Reproducibility in science is facing a challenging time. All stakeholders, from researchers to funders and publishers, are placing more emphasis on work being reproducible, and are taking measures to ensure this. In computational research, and in particular with stochastic algorithms such as those prevalent throughout machine learning, the problem is no less serious, even though on the face of it it should be readily solvable.

In our demonstration, we have illustrated one approach to tackling this in a simple, efficient way. In addition, we only looked to tackle one possible problem or question, and only used a subset of the overall dataset. Please feel free to explore the dataset and pose your own questions. We’d love to hear from you if you do!

Acknowledgements

I’d like to thank all of OPIG for providing feedback on an early version of the talk. Crucially, I’d like to thank Dr Saulo de Oliveira who provided us with the dataset used in our exploratory analysis. Finally, I’d like to thank my co-presenter Fergus Boyles, without whom I couldn’t have done this.

ISMB 2018 (Chicago): Summary of Interesting Talks/Posters

Catherine’s Selection

Network approach integrates 3D structural and sequence data to improve protein structural comparison

Why: Current graph mapping in protein structural comparison ignores sequence order of residues. Residues distant in sequence but close in 3D space are more important.
How: Introduce sequence order of residues, set a sequence-distance cutoff to consider structurally important residues, count the graphlet frequency and embed into PCA space.
Results: the new method is predictive of SCOP and CATH ‘groups’. Certain graphlets are enriched in alpha and beta folds.
Link: https://www.nature.com/articles/s41598-017-14411-y

Investigating the molecular determinants of Ebola virus pathogenicity

Why: Reston virus is the only Ebola virus that is not pathogenic to humans.
What they do: multiple sequence alignment to look for specificity determining positions (SDPs) using s3det, then predict the effect of each individual SDP on the stability of the protein with mCSM.
Results: VP40 SDPs alter octamer formation and the structure of the hydrophobic core. VP24 SDPs impair binding to human KPNA5, the interaction through which VP24 inhibits interferon signalling.
Impact: only a few SDPs distinguish Reston VP24 from VP24 of others. Human-pathogenic Reston viruses may emerge.
Link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5558184/#__ffn_sectitle
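To make the idea of a specificity determining position concrete, here is a deliberately simplified sketch. It is not the s3det algorithm: it just flags alignment columns that are fully conserved within each of two groups but differ between them. The function name `find_sdps` and the toy sequences are invented for illustration.

```python
from typing import List, Tuple


def find_sdps(group_a: List[str], group_b: List[str]) -> List[Tuple[int, str, str]]:
    """Return (1-based column, residue in A, residue in B) for every column
    that is conserved within each group but differs between the groups.
    Columns containing gaps ('-') are skipped."""
    lengths = {len(seq) for seq in group_a + group_b}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    sdps = []
    for col in range(lengths.pop()):
        residues_a = {seq[col] for seq in group_a}
        residues_b = {seq[col] for seq in group_b}
        conserved = len(residues_a) == 1 and len(residues_b) == 1
        if conserved and residues_a != residues_b and "-" not in residues_a | residues_b:
            sdps.append((col + 1, next(iter(residues_a)), next(iter(residues_b))))
    return sdps


# Toy alignment (invented): group A could stand for Reston sequences,
# group B for the human-pathogenic species.
reston_like = ["MKTLV", "MKTLV"]
others = ["MRTLI", "MRSLI"]

print(find_sdps(reston_like, others))
```

Column 3 is not reported because the second group is not conserved there; a real method would score such positions statistically rather than requiring perfect conservation.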

Computational Analysis Highlights Key Molecular Interactions and Conformational Flexibility of a New Epitope on the Malaria Circumsporozoite Protein and Paves the Way for Vaccine Design

Why: An antibody with a strong binding affinity was found in a group of subjects. This antibody prevents cleavage of the surface protein.
What they do: They identified the linear epitope, crystallised the strong and medium binders, and ran molecular dynamics simulations to assess the flexibility of the structures.
Results: The strong binder is less flexible. Moreover, the strong binder is similar to the germline sequence which may mean that this antibody could have been readily formed.
Link: https://www.nature.com/articles/nm.4512

—

Matt’s Selection

“Analysis of sequence and structure data to understand nanobody architectures and antigen interactions”
Laura S. Mitchell (Colwell Group)
University of Cambridge, UK

This poster detailed the work from Laura’s two most recent publications, which can be found here: https://doi.org/10.1002/prot.25497, https://doi.org/10.1093/protein/gzy017

They describe a comprehensive analysis of the binding properties of the 156 non-redundant nanobody-antigen (Nb-Ag) complexes in the PDB/SAbDab (October 2017). Their analyses include Nb sequence variability (both global and across the binding regions), contact maps of nanobody-antigen interactions by region, and the typical chemical properties of each paratope. Nb-Ag complexes are compared to a reference set of monoclonal antibody-antigen (mAb-Ag) complexes. This work is a key first step in advancing our understanding of Nb paratopes, and will aid the development of new diagnostics and therapeutics.

“OSPREY 3.0: Open-Source Protein Redesign for You, with Powerful New Features”
Jeffrey W. Martin (Donald Group)
Duke University, USA

OSPREY 3.0 (https://www.biorxiv.org/content/early/2018/04/23/306324) represents a large advance towards time-efficient continuous flexibility modelling of protein-protein interfaces.

Its new algorithms LUTE and BBK* allow for continuous rotamer flexibility searching and entropy-aware binding constant approximation in a much more efficient manner. The CATS algorithm also introduces local backbone flexibility as a long-awaited feature. This software now has an easy-to-use Python interface, and is fully open-source, making it an extremely attractive alternative to other proprietary protein design tools.

“Functional annotation of chemical libraries across diverse biological processes”
Scott Simpkins
University of Minnesota-Twin Cities, USA

This interesting talk detailed the work published in Nature Chemical Biology in September 2017 (https://doi.org/10.1038/nchembio.2436).

310 yeast gene-deletion mutants were isolated to perform chemical-genetic profile studies across six diverse small molecule high-throughput screening libraries. By studying which gene-deletion mutants were hypersensitive or resistant to each compound, the researchers could assign most members of each chemical library a probable functional annotation. Mapping back to gene-interaction profile data also allowed them to infer likely targets for some compounds. The GO annotations associated with these genes could then be used to assess whether a given starting library is likely to contain promising starting-points that affect a given biological function. For example, the authors highlighted a deficiency across all libraries against the cellular processes of cytokinesis and ribosome biogenesis. Conversely, they found a large enrichment across all libraries for compounds likely to affect glycosylation or cell wall biogenesis. Compounds that target transcription and chromatin organisation were found to be enriched in certain datasets, and depleted in others. This genre of profiling provides researchers with a way of judging a priori whether a given screening library is likely to contain promising lead compounds, given the functional role of the target of interest.

Prague Protein Spring 2018

We, Constantin and Dominik, the newest members of OPIG (SABS rotation students, as usual) were lucky to have a conference suitable to our research within our rotation period and, granted an allowance from the powers that be, were able to visit this year’s Prague Protein Spring with the topic ‘Proteins at Work’. There, we spent four busy but very inspirational days with about 50 participants in a little palace, the Vila Lanna.

The general topic of this meeting led to a broad variety of talks representing a multitude of fields of protein research: from origins of life, via fuzzy intrinsically disordered proteins and crowded cells, to metagenomics and functional sequence alignment annotation.

We picked four thought-provoking talks to present at the group meeting on 08/05/2018; here are their summaries:

Protein engineering and in vitro evolution studies for the origins of life

Kosuke Fujishima from Tokyo Institute of Technology presented several examples of the research he conducts in the area of origins of life. Research on the origins of life is generally based around the questions of how prebiotic monomers were created, how they condensed into polymers and how functionality emerged within these polymers.

The first example of his research deals with the condensation of prebiotic monomers on the ocean-earth crust-interface. Water cycling between the ocean and the outer layers of the earth’s core provided an environment of high pressure and high temperatures (80 – 200 °C) which is necessary for amino acid polymerisation. The mineral Olivine was found to attract amino acids to its surface and the serpentinisation reaction happening with Olivine might provide the necessary wet/dry cycle. Therefore, the researchers built a reactor aiming to investigate this potential polymerisation mechanism. They found that, when the reactor was supplied with six prebiotic amino acids, 28 out of 36 possible dipeptides could be detected. Furthermore, up to 10-mer linear polypeptides could be detected as well, providing evidence for a mechanism of early earth’s generation of polypeptides [unpublished].
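The figure of 36 possible dipeptides follows from counting ordered pairs of the six monomers (6 × 6), since the two positions in a dipeptide are distinguishable. A small sketch of that enumeration is given below; the six one-letter codes are placeholders chosen for illustration, as the talk summary above does not say which amino acids were used.

```python
from itertools import product
from typing import List, Sequence


def possible_peptides(monomers: Sequence[str], length: int) -> List[str]:
    """Enumerate every linear peptide of the given length, treating
    N-to-C order as significant (so 'AG' and 'GA' are different)."""
    return ["".join(combo) for combo in product(monomers, repeat=length)]


# Placeholder monomer set (an assumption, not taken from the talk).
monomers = ["A", "D", "E", "G", "S", "V"]

dipeptides = possible_peptides(monomers, 2)
observed = 28  # number of dipeptides reported as detected in the reactor

print(len(dipeptides))
print(f"{observed / len(dipeptides):.1%} of possible dipeptides detected")
```

The same function shows how quickly the search space grows: for 10-mers built from six monomers there are 6 ** 10 possible sequences.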

The second project showed that both enzymes, CysE/CysK, responsible for the current production of cysteine from serine, could be re-engineered to contain no cysteine in their sequence. Interestingly, cysteine-free CysE showed higher reaction rates than the wild type. Additional reduction to cysteine- and methionine-free enzyme sequences only worked for CysE but not for CysK.[Fujishima et al. (2018)] Still, the experiments indicate that an enzyme world could have existed with a reduced number of amino acids compared to the 20(+) amino acids that we know today.

The third project we wanted to point out used a type of mRNA display that not only links the genotype (mRNA) with its corresponding phenotype (translated protein) but also allows the translated protein to interact with a randomised, non-translated part of the mRNA. This provided a framework for investigating the evolution of ribonucleotide-binding (RNP) proteins. When selecting for ATP-binding, it was observed that protein together with RNA had the best fitness landscape compared to protein selection or RNA selection alone. Further analysis revealed that most binding affinity of the ribonucleotide protein stemmed from its RNA part.[unpublished] These results give rise to the suggestion that RNA and proteins co-evolved, opposing the idea of a pure RNA world.

RNA-protein interactions and the structure of the genetic code

The next speaker added more to the research area of RNA-protein interaction and evolution. Bojan Zagrovic from the University of Vienna presented his research around the finding that pyrimidine (PYR) density of RNA regions is correlated with the corresponding protein region’s affinity to pyrimidine-containing bases (running means of 21 amino acids or 63 bases were used), with the highest correlation between mRNA PYR density and guanine affinity, having an average ‘typical’ Pearson correlation coefficient of 0.80.[Polyansky & Zagrovic (2013)]
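As a rough illustration of the kind of profile comparison described above, the sketch below computes a running-mean pyrimidine density along an mRNA (63-base windows), a running-mean affinity profile along a protein (21-residue windows), and the Pearson correlation between the two. Everything specific here is an assumption for illustration: the affinity scale is made-up numbers, the toy mRNA does not actually encode the toy protein, and pairing every third mRNA window with one protein window (codon alignment) is our guess at a sensible alignment rather than the published procedure.

```python
from math import sqrt
from typing import Dict, List, Sequence


def running_mean(values: Sequence[float], window: int) -> List[float]:
    """Mean of every contiguous window of the given size."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    means = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        means.append(total / window)
    return means


def pyrimidine_profile(mrna: str, window: int = 63) -> List[float]:
    """Fraction of pyrimidines (C, U or T) in each window of the mRNA."""
    flags = [1.0 if base in "CUT" else 0.0 for base in mrna.upper()]
    return running_mean(flags, window)


def affinity_profile(protein: str, scale: Dict[str, float], window: int = 21) -> List[float]:
    """Running mean of a per-residue affinity scale along the protein."""
    return running_mean([scale[aa] for aa in protein], window)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


# Toy inputs (all invented): 30 residues and 90 bases.
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
toy_scale = {aa: float(i) for i, aa in enumerate(amino_acids)}  # NOT a real affinity scale
protein = amino_acids + amino_acids[:10]
mrna = ("AUGCCUGAUUCG" * 8)[:90]

# Keep every third mRNA window so each one lines up with a protein window.
pyr = pyrimidine_profile(mrna)[::3]
aff = affinity_profile(protein, toy_scale)
print(len(pyr), len(aff), pearson(pyr, aff))
```

With real data, `toy_scale` would be replaced by an experimentally or computationally derived scale of amino acid affinities for the nucleobase of interest.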

This correlation is specific for the current genetic code, shown by random generation of genetic codes which could not reproduce such a correlated behaviour and by looking into three organisms with very different codon usage bias (Homo sapiens, E. coli, M. jannaschii). Even though the three averages of codon usage were very different, the highest correlating pairs of mRNA and cognate proteins clustered together, having very similar codon usage. This was also true for the worst correlating pairs.[Hlevnjak & Zagrovic (2015)]

But the big question is: what does this correlation imply functionally?

Annotation analysis revealed that the highest correlating pairs were enriched in nucleotide-binding functions and intrinsically disordered proteins. Without claiming generalisability, Professor Zagrovic pointed out a case study done on RNA polymerase II, which has a long disordered C-terminus built up of 26 repeats of a 7 amino acid motif. 248 RNAs were found to interact with RNA polymerase II and in all three reading frames of the interacting RNAs, amino acid codons of the polymerase’s C-terminus were enriched.[unpublished]

This indicates some regulation over gene expression but also several other hypotheses were made: the correlation between the protein regions’ affinity for their cognate mRNA regions might be relevant in virus assembly, since coding RNA and translated proteins have to be in close proximity with each other. The same could be true for some non-membrane-bound compartments, e.g. P-bodies. Or is this correlation characteristic a hint to mRNAs acting as chaperones for their respective proteins? The functional implications of this correlation, while highly speculative, nevertheless suggest exciting research to come in the future.

Fuzziness in protein assemblies

Research from a different, but equally thought provoking field was presented by Mónika Fuxreiter from the University of Debrecen. Her talk on the concept of fuzziness in protein complexes, which she introduced 10 years ago [Tompa & Fuxreiter (2008)], shed light on some more recent developments in the field as well as explaining the underlying concept for those of us (ourselves included) who have not encountered the concept as such before.

Fuzziness in the context of protein complexes describes a phenomenon in which intrinsically disordered proteins, instead of folding upon binding as one would usually observe, can sample several conformational states with different propensities, leading to the sampled states contributing with different strengths to the function of the protein complex and further leading to varying degrees of disorder in the bound state.

This observation has several implications for the understanding of the functionality of disordered proteins, since the relative propensity for different ensemble states in the bound form is thought to be highly susceptible to milieu influences, such as tissue specific splicing and post-translational modifications. Fuzziness (a term that was borrowed from the mathematical theory of fuzzy sets) could thus be a driver of functional adaptability of disordered proteins to cell-cycle stage, environmental influences or tissue type.

Evidence for fuzziness has been curated by the Fuxreiter group since 2015 in the FuzDB database [Miskei et al. (2017)] and has recently been used to develop a prediction algorithm [unpublished] that, according to Professor Fuxreiter, achieves highly accurate predictions of fuzziness on a comprehensive validation dataset.

The implications of fuzziness for understanding the mode of action of disordered proteins (and disordered regions in otherwise ordered proteins) certainly piqued our interest, not least due to the potential importance of a clear understanding of these modes of action for drug development.

Investigation of mutually exclusive splicing events using the CATH FunFam framework

The last of the 4 talks we would like to single out in this blogpost highlighted recent progress in using structure-based databases for the investigation of complex cellular events.

Christine Orengo from UCL presented her group’s work on mutually exclusive splicing, which employed the FunFam framework of the CATH database to probe the structural and functional implications of these splicing events [Lam et al. (2018), under review].

The FunFams are a subcategory of CATH’s homologous superfamilies, which further divides the superfamilies based on clusters of residue conservation within each family, thus creating groupings of functionally related proteins [Rentzsch & Orengo (2013)].

The mutually exclusive splicing events investigated using this framework are a group of splicing events in which only one of several specific exons is present in the spliced mRNA. These exons usually show a high level of sequence similarity, leading to a low disruption of the protein structure by the splicing event. It is thought that this feature is a reason for the relative enrichment of mutually exclusive exons amongst alternative splicing events in the proteome.

This high degree of sequence similarity further enabled the mapping of the mutually exclusive exons to FunFams in the CATH database and thus further onto protein structures. This allowed the Orengo group to conduct a ‘large scale systematic study of the structural/functional effects of MXE splicing’.

Their analysis found that variable residues between the exons are significantly enriched at the protein surface, both compared to other stretches of the protein sequence and compared to non-variable residues in the exons, and in close proximity (< 6 Angstroms) to functional sites of the protein.
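The proximity criterion can be illustrated with a short sketch that flags variable residues lying within 6 Å of any functional-site residue. This is only a schematic of the distance test, not the Orengo group's pipeline: representing each residue by a single coordinate (for example its Cα atom), the function name and the toy coordinates are all assumptions.

```python
from math import dist
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[float, float, float]


def near_functional_sites(
    variable_residues: Iterable[int],
    functional_sites: Iterable[int],
    coords: Dict[int, Coord],
    cutoff: float = 6.0,
) -> List[int]:
    """Return the variable residues whose representative atom lies strictly
    closer than `cutoff` (in Angstroms) to at least one functional-site residue."""
    sites = [coords[res] for res in functional_sites]
    return sorted(
        res
        for res in variable_residues
        if any(dist(coords[res], site) < cutoff for site in sites)
    )


# Toy structure (invented coordinates, one point per residue).
coords = {
    1: (0.0, 0.0, 0.0),   # functional-site residue
    2: (3.0, 4.0, 0.0),   # 5 A from residue 1
    3: (10.0, 0.0, 0.0),  # 10 A from residue 1
    4: (0.0, 0.0, 6.0),   # exactly 6 A from residue 1, so not "< 6"
}

print(near_functional_sites([2, 3, 4], [1], coords))
```

In practice one would likely use the minimum distance over all heavy atoms of each residue pair, which tends to flag more residues than a single-atom representation.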

The main conclusion drawn from these findings was that, as previously hypothesised, mutually exclusive exons are likely functional switches, since changes in the surface exposed area close to functional sites are likely to affect the protein function without strongly disrupting its structure.

In the eyes of the Orengo group, this makes these splicing events good candidates for drug targeting, particularly in cases where a tissue specific isoform can be drugged, since in that case off-target effects could potentially be significantly reduced.

Sources:

Fujishima et al. (2018). Reconstruction of cysteine biosynthesis using engineered cysteine-free enzymes. Scientific Reports

Hlevnjak & Zagrovic (2015). Malleable nature of mRNA-protein compositional complementarity and its functional significance. Nucleic Acids Res

Lam, S. D., Orengo, C., & Lees, J. (2018). Protein structure and function analyses to understand the implication of mutually exclusive splicing. BioRxiv

Miskei, M. et al (2017). FuzDB: Database of fuzzy complexes, a tool to develop stochastic structure-function relationships for protein complexes and higher-order assemblies. Nucleic Acids Research

Polyansky & Zagrovic (2013). Evidence of direct complementary interactions between messenger RNAs and their cognate proteins. Nucleic Acids Res

Rentzsch, R., & Orengo, C. A. (2013). Protein function prediction using domain families. BMC Bioinformatics

Tompa, P., & Fuxreiter, M. (2008). Fuzzy complexes: polymorphism and structural disorder in protein-protein interactions. Trends in Biochemical Sciences

Biophysical Society 62nd Annual Meeting

In February I was very fortunate to attend the Biophysical Society 62nd Annual Meeting, which was held in San Francisco – my first real conference and my first trip to North America. Despite arriving with the flu, I had a great time! The conference took place over five days, during which there were manageable 15-minute talks covering a huge range of Biophysics-related topics, and a few thousand more posters on display (including mine). With almost 6,500 attendees, it was also large enough to slip across the road to the excellent SF Museum of Modern Art without anyone noticing.

The best presentation of the conference was, of course, Saulo’s talk on integrating biological folding features into protein structure prediction [1]. Aside from that, here are a few more of my favourites:

Folding proteins from one end to the other
Micayla A. Bowman, Patricia L. Clark [2]

Here in the COFFEE (COtranslational Folding Family of Expert Enthusiasts) office, we love to talk about the vectorial nature of cotranslational folding and how it contributes to the efficiency of protein folding in vivo. Micayla Bowman and Patricia Clark have created a novel technique that will allow the effects of this vectorial folding to be investigated specifically in vitro.

The Clp complex grabs, unfolds and degrades proteins (diagram from [3]). ClpX, the translocase unit of this complex, was used to recapitulate vectorial protein refolding in vitro for the first time.

ClpX is an AAA+ molecular motor that grabs proteins and translocates them through its pore. In vivo, its role is to denature substrates and feed them to an associated protease (ClpP) [3]. Bowman & Clark have used protein tags to initiate translocation of the target protein through ClpX, resulting in either N-C or C-N vectorial refolding.

The YKB construct used to demonstrate the vectorial folding mediated by ClpX (diagram from [4]).

They demonstrate the effect using YKB, a construct with two mutually exclusive native states: YK-B (fluoresces yellow) and Y-KB (fluoresces blue) [4]. In vitro refolding results in an equal proportion of yellow and blue states. Cotranslational folding, which proceeds in the N-C direction, biases towards the yellow (YK-B) state. C-N refolding in the presence of ClpX and ATP biases towards the blue (Y-KB) state. With this neat assay, they demonstrate that ClpX can mediate vectorial folding in vitro, and they plan to use the assay to investigate its effect on protein folding pathways and yields.

An ambiguous view of protein architecture
Guillaume Postic, Charlotte Périn, Yassine Ghouzam, Jean-Christophe Gelly [Poster abstract: 5, Paper: 6]

This work addresses the ambiguity of domain definition by assigning multiple possible domain boundaries to protein structures. Their automated method, SWORD (Swift and Optimised Recognition of Domains), performs protein partitioning via the hierarchical clustering of protein units (PUs) [7], which are smaller than domains and larger than secondary structures. The structure is first decomposed into protein units, which are then merged depending on the resulting “separation criterion” (relative contact probabilities) and “compactness” (contact density).

Their method is able to reproduce the multiple conflicting definitions that often exist between domain databases such as SCOP and CATH. Additionally, they present a number of cases for which the alternative domain definitions have interesting implications, such as highlighting early folding regions or functional subdomains within “single-domain” structures.

Alternative SWORD domain delineations identify (R) an ultrafast folding domain and (S,T) stable autonomous folding regions within proteins designated single-domain by other methods [6]

Dual function of the trigger factor chaperone in nascent protein folding
Kaixian Liu, Kevin Maciuba, Christian M. Kaiser [8]

The authors of this work used optical tweezers to study the cotranslational folding of the first two domains of the five-domain protein elongation factor G.

In agreement with a number of other presentations at the conference, they report that interactions with the ribosome surface during the early stages of translation slow folding by stabilising disordered states, preventing both native and misfolded conformations. They found that the N-terminal domain (G domain) folds independently, while the subsequent folding of the second domain (Domain II) requires the presence of the folded G domain. Furthermore, while partially extruded, unfolded domain II destabilises the native G domain conformation and leads to misfolding. This is prevented in the presence of the chaperone Trigger factor, which protects the G domain from unproductive interactions and unfolding by stabilising the native conformation. This work demonstrates interesting mechanisms by which Trigger factor and the ribosome can influence the cotranslational folding pathway.

Optical tweezers are used to interrogate the folding pathway of a protein during stalled cotranslational folding. Mechanical force applied to the ribosome and the N-terminus of the nascent chain causes unfolding events, which can be identified as sudden increases in the extension of the chain. (Figure from [9])

Predicting protein contact maps directly from primary sequence without the need for homologs
Thrasyvoulos Karydis, Joseph M. Jacobson [10]

The prediction of protein contacts from primary sequence is an enormously powerful tool, particularly for predicting protein structures. A major limitation is that current methods using coevolution inference require a large multiple sequence alignment, which is not possible for targets without many known homologous sequences.

In this talk, Thrasyvoulos Karydis presented CoMET (Convolutional Motif Embeddings Tool), a tool to predict protein contact maps without a multiple sequence alignment or coevolution data. They extract structural and sequence motifs from known sequence-structure pairs, and use a Deep Convolutional Neural Network to associate sequence and structure motif embeddings. The method was trained on 137,000 sequence-structure pairs with a maximum of 256 residues, and is able to recreate contact map patterns with low resolution from primary sequence alone. There is no paper on this yet, but we’ll be looking out for it!

1. de Oliveira, S.H. and Deane, C.M., 2018. Exploring Folding Features in Protein Structure Prediction. Biophysical Journal, 114(3), p.36a.
2. Bowman, M.A. and Clark, P.L., 2018. Folding Proteins From One End to the Other. Biophysical Journal, 114(3), p.200a.
3. Baker, T.A. and Sauer, R.T., 2012. ClpXP, an ATP-powered unfolding and protein-degradation machine. Biochimica et Biophysica Acta (BBA)-Molecular Cell Research, 1823(1), pp.15-28.
4. Sander, I.M., Chaney, J.L. and Clark, P.L., 2014. Expanding Anfinsen’s principle: contributions of synonymous codon selection to rational protein design. Journal of the American Chemical Society, 136(3), pp.858-861.
5. Postic, G., Périn, C., Ghouzam, Y. and Gelly, J.C., 2018. An Ambiguous View of Protein Architecture. Biophysical Journal, 114(3), p.46a.
6. Postic, G., Ghouzam, Y., Chebrek, R. and Gelly, J.C., 2017. An ambiguity principle for assigning protein structural domains. Science advances, 3(1), p.e1600552.
7. Gelly, J.C. and de Brevern, A.G., 2010. Protein Peeling 3D: new tools for analyzing protein structures. Bioinformatics, 27(1), pp.132-133.
8. Liu, K., Maciuba, K. and Kaiser, C.M., 2018. Dual Function of the Trigger Factor Chaperone in Nascent Protein Folding. Biophysical Journal, 114(3), p.552a.
9. Liu, K., Rehfus, J.E., Mattson, E. and Kaiser, C., 2017. The ribosome destabilizes native and non‐native structures in a nascent multi‐domain protein. Protein Science.
10. Karydis, T. and Jacobson, J.M., 2018. Predicting Protein Contact Maps Directly from Primary Sequence without the Need for Homologs. Biophysical Journal, 114(3), p.36a.

Young Entrepreneurs Scheme (YES) Competition

Fair warning: I’m going to use this BLOPIG post to promote the YES competition and talk about how semi-amazingly we did at it!

For those who don’t know, the YES competition runs yearly and is designed to develop the entrepreneurial spirit amongst graduates and post-graduates. The YES workshops come in two parts, the first being an intensive crash course in small business start-ups. These talks were delivered by financial experts, successful start-ups, and intellectual property teams. Carefully mixing theory with useful anecdotes, they were hugely insightful and all enthusiastically given by people passionate about science start-ups. We were lucky to have many of these speakers mentoring for the second part: the development of our own business plan.

Our team, the fantastically named Team SolOx, developed a licence-selling business for a theoretical catalyst, which mimicked photosynthesis. Our product produced lightweight hydrocarbons from atmospheric gases quickly and efficiently. The comic value of our idea aside, we designed a 10-year business plan that saw SolOx develop and license our catalyst. This process was eye-opening, with the mentors highlighting the hurdles we would face and teaching us how to overcome them. Our pitch landed us a place in the final at the Royal Society in London, as one of the winners of the YES Industrial Challenges workshop 2017. Although the final judging panel didn’t find our plan as financially sound as others, we had a fantastic experience and would thoroughly recommend it to anyone interested in business start-ups.

Team SolOx: Winners of the YES Industrial Challenges 2017 Workshop. Left to right: Natasha Rhys, Tom Dixon, Joe Bluck, Sarah-Beth Amos and Alex Skates.

Finally, I would like to thank the Systems Approaches to Medical Science Centre for Doctoral Training for their financial support and their focus on promoting entrepreneurial skills.

Biophysical Society 61st Annual Meeting – New Orleans, February 2017

As the sole representative of OPIG attending Biophys 2017 in New Orleans, I had to bear the heavy burden of a long and lonely flight and the fear of missing out on a week of the very grey Oxford winter. Having successfully crossed the border into the US, which was thankfully easier for me than it was for some of our scientific colleagues from around the world, I found my first time attending the conference to be full of very interesting and relevant science. Although the conference also covers a wide variety of experimental techniques and non-protein topics, it is so large and broad that there was more than enough to keep me busy over the five days, with sessions featuring folding, structure prediction, docking, networks, and molecular dynamics.

There were several excellent talks on the subject of folding pathways, misfolding and aggregation. A common theme was the importance of the kinetic stability of the native state, and the mechanisms by which it may be prevented from reaching a non-native global thermodynamic minimum. This is particularly important for serpins, large protease inhibitors which inactivate proteases by a suicide mechanism. The native and active state can be transformed into a lower energy conformation over long timescales. However, this transition also occurs on cleavage near the C-terminal end, which allows insertion of the C-terminal tail into a beta sheet and holds the cleaving protease inactive; the stored energy is therefore very important for function. Anne Gershenson described recent simulations and experiments to elucidate the order in which substructures of the complete fold assemble. There are many cooperative substructures in this case, and N-terminal helices form at an early stage. The overall topology appears to be consistent with a cotranslational folding mechanism inside the ER, but requires significant rearrangements after translation for adoption of the full native fold.

Cotranslational folding was also discussed by several others including the following: Patricia Clark is now using the YKB system of alternately folding fluorescent protein to find new translation stalling sequences; Anais Cassaignau described NMR experiments to show the interactions taking place between nascent chains and the ribosome at different stalled positions during translation; and Daniel Nissley presented a model to predict a shift in folding mechanism from post-translational to cotranslational due to specific designed synonymous codon changes, which agreed very well with experimental data.

To look more deeply into the evolution of folding mechanisms and protein stability, Susan Marqusee presented a study of the kinetics of folding of RNases, comparing the properties of inferred ancestral sequences to a present day thermophile and mesophilic E. coli. A number of reconstructed sequences were expressed, and it was found that moving along either evolutionary branch from the ancestor to modern day, folding and unfolding rates had both decreased, but the same three-state folding pathway via an intermediate is conserved for all ancestors. However, the energy transition between the intermediate and the unfolded state has evolved in opposite directions even while the kinetic stability remains similar. This has led to the greater thermodynamic stability seen in the modern day thermophile compared to the mesophile at higher temperatures and concentrations of denaturant.

Panel C shows that kinetic stability (low unfolding rate) seems to be selected for in both environments. Panel D shows that the thermodynamic stability of the intermediate (compared to the unfolded state) accounts for the differences in thermodynamic stability of the native state, when compared to the common ancestor (0,0). Link to paper

There were plenty of talks discussing the problems and mechanisms of protein aggregation, with two focussing on light chain amyloidosis. Marina Ramirez-Alvarado was investigating how fibrils begin to grow and showed using microscopy that both soluble light chains and fibrils (more slowly) are internalised by heart muscle cells. They can then be exposed at the cell surface and become a seed to recruit other soluble light chains to form fibrils. Shannon Esswein presented work on the enhancement of VL-VL dimerisation to prevent amyloid formation. The variable domain of the light chain (VL) can pair with itself in a similar orientation to its pairing with VH domains in normal antibodies, or in a non-canonical orientation. Adding disulphide bonds to stabilise these dimers prevented fibril formation, so they carried out a small scale screen of 27 aromatic and hydrophobic ligands to find those which would favour dimer formation by binding at the interface. Sulfasalazine was detected in this screen and was also shown to significantly reduce fibril formation, so it could be used as a template for future drug design.

A ligand stabilises the dimer, so fewer light chains are present as monomers, slowing the rate of the only route by which fibrils can be formed. Link to paper

Among the posters, Alan Perez-Rathke presented loop modelling by DiSGro in beta barrel membrane proteins, which showed that the population of structures generated and scored favourably after relaxation at pH 7 led to an open pore more often than at pH 5, consistent with experimental observations. There were two posters on the topic of prediction of membrane protein expression in bacteria and yeast presented by students of Bill Clemons, who also gave a great talk. Shyam Saladi has carefully curated datasets of successes and failures in expression in E. coli and trained a linear SVM on features such as RNA secondary structure and transmembrane segment hydrophobicity to predict the outcome for unknown proteins. This simple approach (preprint available here) achieved an area under the ROC curve of around 0.6 on a separate test set, and using more complex machine learning techniques is likely to improve this. Samuel Schulte is adapting the same method for prediction of expression in yeast.
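To put an area under the ROC curve of around 0.6 in context, here is a minimal sketch of how the metric can be computed from a classifier's scores. It is not taken from the poster or preprint; the function name `roc_auc` and the toy labels and scores are made up for illustration. The AUC is the probability that a randomly chosen success scores higher than a randomly chosen failure, so 0.5 is random guessing and 1.0 is a perfect ranking.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a positive (e.g. expressed), 0 for a negative.
    scores: classifier outputs, higher meaning "more likely positive".
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three of the four positive/negative pairs
# are ranked correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a rank-based version would be preferable for large datasets.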

Overall, it was a great conference and it was nice to hear about plenty of experimental work alongside the more familiar computational work. I would also highly recommend New Orleans as an excellent place to find great food, jazz and sunshine!

CCP4 Study Weekend 2017: From Data to Structure

This year’s CCP4 study weekend focused on providing an overview of the processes and pipelines available to take crystallographic diffraction data from spot intensities right through to structure. Sessions therefore included processing diffraction data, phasing through molecular replacement and experimental techniques, and automated model building and refinement, as well as updates to CCP4 and a look at where crystallography is going to take us in the future.

Alongside the meeting there was also a session for macromolecular crystallography (MX) users of Diamond Light Source (DLS), which gave an update on the beamlines and scientific software, as well as examples of how fragment screening at DLS has been used. The VMXi (Versatile Macromolecular X-tallography in-situ) beamline is being developed to image crystals as they form in in situ crystallisation plates. This should allow crystallography to be optimised, as crystallisation conditions can be screened and data collected on experiments as they crystallise, which is especially helpful in cases where crystallisation has routinely led to non-diffracting crystals. VMXm is a micro/nanofocus MX beamline, also in development, which aims to collect crystallographic data from very small crystals (~300 nm to 10 micron diameters, with a bias towards the smaller sizes), thereby allowing crystallography of targets for which it has previously been hard to obtain sufficiently large crystals. Other updates included how technology developed for fast solid state data collection on x-ray free electron lasers (XFEL) can be used on synchrotron beamlines.

Below is a slightly more in-depth discussion of two of the tools presented, both developed for use alongside and within CCP4, which might be of broader interest:

ConKit: A python interface for contact prediction tools

Contact prediction for proteins, at its simplest, involves estimating which residues are within a certain spatial proximity of each other, given the sequence of the protein, or proteins (for complexes and interfaces). Two major types of contact prediction exist:

Evolutionary Coupling
- Taking a series of sequence homologues and identifying co-evolved residues from a multiple sequence alignment of the protein family. These co-evolved residues are hypothesized to share a functional dependence. Discussed previously on BLOPIG: Predicted protein contacts: is it the solution to (de novo) protein structure prediction?
Supervised machine learning
- Using ab initio structure prediction tools, without sequence homologues, to predict which contacts exist, but with a much lower accuracy than evolutionary coupling.
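To make the notion of a "contact" concrete, here is a minimal sketch of how a contact map can be derived from a known structure, which is what predicted contacts are usually compared against. The 8 Angstrom cutoff and the use of one representative atom per residue are common conventions but are assumptions here, not something stated in the talk, and `contact_map` is a hypothetical helper rather than part of ConKit.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=1):
    """Return a symmetric boolean matrix of residue-residue contacts.

    coords: one (x, y, z) tuple per residue, e.g. its C-beta position.
    cutoff: distance threshold in Angstrom (assumed convention).
    min_separation: minimum sequence separation |i - j| to consider,
        so that trivially close neighbours can be ignored.
    """
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts[i][j] = contacts[j][i] = True
    return contacts


# Toy example: four residues spaced 3.8 Angstrom apart along a line.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(toy)
print(cmap[0][2], cmap[0][3])  # residues 0 and 2 touch, 0 and 3 do not
```

A contact predictor outputs a score for each residue pair; thresholding those scores gives a predicted map in the same form, which can then be compared with a map like this one.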


ConKit is a python interface (API) for contact prediction tools, consisting of three major modules:

Core: A module for constructing hierarchies, thereby storing necessary data such as sequences in a parsable format.
- Providing common functionality through functions that for example declare a contact as a false positive.
Application: Python wrappers for common contact prediction and sequence alignment applications
- CCMPred
- CdHit
- HHblits
- HHfilter
- Jackhmmer
- Psicov
- BbContacts
I/O: I/O interface for file reading, writing and conversions.

Contact prediction can be used in crystallographic structure determination during unconventional molecular replacement, using a tool such as AMPLE. Molecular replacement is a computational strategy to solve the phase problem. In the typical case, homologous structures are used to build a model of the protein that best fits the experimental diffraction intensities, from which the phases can be estimated. AMPLE utilises ab initio modelling (using Rosetta) to generate a model for the protein; contact prediction can provide input to this ab initio modelling, making it more feasible to generate an appropriate structure from which to solve the phase problem. Contact prediction can also be used to analyse known and unknown structures, to identify potential functional sites.

For more information: Talk given at CCP4 study weekend (Felix Simkovic), ConKit documentation

ACEDRG: Generating Crystallographic Restraints for Ligands

Small molecule ligands are present in many crystallographic structures, especially in drug development campaigns. Proteins are formed (almost exclusively) from a sequence containing a selection of 20 amino acids, which means there are well-known restraints (for example: bond lengths, bond angles, torsion angles and rotamer position) for model building or refinement of amino acids. As ligands can be built from a much wider selection of chemical moieties, they have not previously been restrained as well during MX refinement. Ligands found in PDB depositions can be used as models for the model building/refinement of ligands in new structures; however, there are a limited number of ligands available (~23,000). Furthermore, the resolution of the ligands is limited to the resolution of the macromolecular structure from which they are extracted.

ACEDRG utilises the Crystallography Open Database (COD), a library of (>300,000) small molecules usually with atomic resolution data (often at least 0.84 Angstrom), to generate a dictionary of restraints to be used in refining the ligand. To create these restraints ACEDRG utilises the RDKit chemoinformatics package, generating a detailed descriptor of each atom of the ligands in COD. The descriptor utilises properties of each atom including the element name, number of bonds, environment of nearest neighbours, and third degree neighbours that are aromatic ring systems. The descriptor is stored alongside the electron density values from the COD. When an ACEDRG query is generated, for each atom in the ligand, the atom type is compared to those for which a COD structure is available; the nearest match is then used to generate a series of restraints for the atom.
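The "nearest match" idea can be illustrated with a toy lookup that tries the most detailed atom description first and falls back to coarser ones. This is only a sketch of the general technique under my own assumptions: the `AtomType` fields, the three levels of detail and the fallback order are all hypothetical and are not ACEDRG's actual descriptor or matching algorithm.

```python
from typing import NamedTuple, Optional


class AtomType(NamedTuple):
    """Hypothetical, heavily simplified atom descriptor."""
    element: str
    n_bonds: int
    neighbours: tuple  # element symbols of the bonded atoms

    def keys(self):
        """Lookup keys ordered from most to least specific."""
        return [
            (self.element, self.n_bonds, tuple(sorted(self.neighbours))),
            (self.element, self.n_bonds),
            (self.element,),
        ]


def build_index(known):
    """Index reference atom types under every level of detail."""
    index = {}
    for atom in known:
        for key in atom.keys():
            index.setdefault(key, atom)  # keep the first reference seen
    return index


def nearest_match(query: AtomType, index) -> Optional[AtomType]:
    """Return the reference atom type sharing the most specific key."""
    for key in query.keys():
        if key in index:
            return index[key]
    return None


# Toy reference set standing in for atom types observed in a database.
reference = [
    AtomType("C", 4, ("C", "H", "H", "H")),
    AtomType("O", 2, ("C", "H")),
]
index = build_index(reference)
query = AtomType("C", 4, ("C", "C", "H", "H"))
print(nearest_match(query, index))  # falls back to the (element, n_bonds) level
```

In a real restraint dictionary the matched reference type would carry the observed geometry (bond lengths, angles and so on) from which the restraints are generated; that part is left out here.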

ACEDRG can take a molecular description (SMILES, SDF MOL, SYBYL MOL2) of your ligand and generate appropriate restraints for refinement (atom types, bond lengths and angles, torsion angles, planes and chirality centers) as an mmCIF file. These restraints can be generated for a number of different probable conformations for the ligand, such that it can be refined in these alternate conformations; the refinement program can then use local scoring criteria to select the ligand conformation that best fits the observed electron density. ACEDRG can be accessed through the CCP4i2 interface, and as a command line interface.

Hopefully this has been a useful insight into some of the tools presented at the CCP4 Study weekend. For anyone looking for further information on the CCP4 Study weekend: Agenda, Recording of Sessions, Proceedings from previous years.

Seventh Joint Sheffield Conference on Cheminformatics Part 1 (#ShefChem16)

In early July I attended the Seventh Joint Sheffield Conference on Cheminformatics. There was a variety of talks with speakers at all stages of their career. I was lucky enough to be invited to speak at the conference, and gave my first conference talk! I have written two blog posts about the conference: part 1 briefly describes a talk that I found interesting and part 2 describes the work I spoke about at the conference.

One of the most interesting parts of the conference was the active Twitter presence under #ShefChem16. All of the talks were live tweeted, which provided a summary of each talk and also included links to software or references. It also allowed speakers to gain insight and feedback on their talk instantly.

One of the talks I found most interesting presented the Protein-Ligand Interaction Profiler (PLIP). It is a method for the detection of protein-ligand interactions. PLIP is open-source and has a web-based online tool and a command-line tool. Unlike PyMOL, which only calculates polar contacts and not the type of interaction, PLIP calculates 8 different types of interactions: hydrogen bonding, hydrophobic, π-π stacking, π-cation interactions, salt bridges, water bridges, halogen bonds, metal complexes. For a given pdb file the interactions are calculated and shown in a publication quality figure shown here.

The display can also be downloaded as a PyMOL session so the display can be modified.

This tool is an extremely useful way to calculate protein-ligand interactions and can be used to find the types of interactions formed by the protein-ligand complex.

PLIP can be found here: https://projects.biotec.tu-dresden.de/plip-web/plip/