Measuring correlation

Correlation is defined as how close two variables are to having a dependence relationship with each other. At first sight, it looks kind of simple, but there are two main problems:

  1. Despite the obvious situations (i.e. correlation = 1), it is difficult to say whether 2 variables are correlated or not (i.e correlation = 0.7). For instance, would you be able to say if the variables X and Y from the following to plots are correlated?
  2. There are different ways of measure of correlation that may not agree when comparing different distributions. As an example, which plot shows a higher correlation? The answer will depend on how you do measure the correlation since if you use Pearson correlation, you would pick A whereas if you choose Spearman correlation you will take B

Here, I will explain some of the different correlation measures you can use:

Pearson product-moment correlation coefficient

  • What does it measure? Only linear dependencies between the variables.
  • How it is obtained? By dividing the covariance of the two variables by the product of their standard deviations. (It is defined only if both of the standard deviations are finite and nonzero). \rho _{X,Y}={\frac {\operatorname {cov} (X,Y)}{\sigma _{X}\sigma _{Y}}}
  • Properties:
  1. ρ (X,Y) = +1 : perfect direct (increasing) linear relationship (correlation).
  2. ρ (X,Y) = -1 : perfect decreasing (inverse) linear relationship (anticorrelation).
  3. In all other cases, ρ (X,Y) indicates the degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated).
  4. Only gives a perfect value when X and Y are related by a linear function.
  • When is it useful? For the case of a linear model with a single independent variable, the coefficient of determination (R squared) is the square of r, Pearson’s product-moment coefficient.

 

Spearman’s rank correlation coefficient:

  • What does it measure? How well the relationship between two variables can be described using a monotonic function (a function that only goes up or only goes down).
  • How it is obtained? Pearson correlation between the rank values of the two variables.

{\displaystyle r_{s}=\rho _{\operatorname {rg} _{X},\operatorname {rg} _{Y}}={\frac {\operatorname {cov} (\operatorname {rg} _{X},\operatorname {rg} _{Y})}{\sigma _{\operatorname {rg} _{X}}\sigma _{\operatorname {rg} _{Y}}}}}

Only if all n ranks are distinct integers, it can be computed using the popular formula.

{\displaystyle r_{s}={1-{\frac {6\sum d_{i}^{2}}{n(n^{2}-1)}}}.}

Where di is the difference between the two ranks of each observation.

  • Properties:
  1. rs (X,Y) = +1:  X and Y are related by any increasing monotonic function.
  2. rs (X,Y) = -1:  X and Y are related by any decreasing monotonic function.
  3. The Spearman correlation increases in magnitude as X and Y become closer to being perfect monotone functions of each other.
  • When is it useful? It is appropriate for both continuous and discrete ordinal variables. It can be use for looking for non-linear dependence relationships.

Kendall’s tau coefficient

  • What does it measure? The ordinal association between two measured quantities.
  • How it is obtained?

{\displaystyle \tau ={\frac {({\text{number of concordant pairs}})-({\text{number of discordant pairs}})}{n(n-1)/2}}.}

Any pair of observations (xi , yi)  and (xj, yj) are said to be concordant if the ranks for both elements agree. That happens if xi-xj and yi-xj have the same sign. If their sign are different, they are considered as discordant pairs

  • Properties:
  1. τ (X,Y) = +1: The agreement between the two rankings is perfect (i.e., the two rankings are the same)
  2. τ (X,Y) = -1: The disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other)
  3. If X and Y are independent, then we would expect the coefficient to be approximately zero.
  • When is it useful? It is appropriate for both continuous and discrete ordinal variables. It can be use for looking for non-linear dependence relationships.

Distance correlation:

  • What does it measure? Both linear and nonlinear association between two random variables or random vectors.
  • How is it obtained? By dividing the variable’s distance covariance by the product of their distance standard deviations:

\operatorname {dCor}(X,Y)={\frac {\operatorname {dCov}(X,Y)}{{\sqrt {\operatorname {dVar}(X)\,\operatorname {dVar}(Y)}}}},

The distance covariance is defined as:

{\displaystyle \operatorname {dCov} _{n}^{2}(X,Y):={\frac {1}{n^{2}}}\sum _{j=1}^{n}\sum _{k=1}^{n}A_{j,k}\,B_{j,k}.}

Where:

{\displaystyle A_{j,k}:=a_{j,k}-{\overline {a}}_{j\cdot }-{\overline {a}}_{\cdot k}+{\overline {a}}_{\cdot \cdot },\qquad B_{j,k}:=b_{j,k}-{\overline {b}}_{j\cdot }-{\overline {b}}_{\cdot k}+{\overline {b}}_{\cdot \cdot },}

{\begin{aligned}a_{{j,k}}&=\|X_{j}-X_{k}\|,\qquad j,k=1,2,\ldots ,n,\\b_{{j,k}}&=\|Y_{j}-Y_{k}\|,\qquad j,k=1,2,\ldots ,n,\end{aligned}}

where || ⋅ || denotes Euclidean norm.

  • Properties:
  1. dCor (X,Y) = 0 if and only if the random vectors are independent.
  2. dCor (X,Y) = 1: Perfect dependence between the two distributions.
  3. dCor (X,Y) is defined for X and Y in arbitrary dimension.
  • When is it useful? It is appropriate to find any kind  dependence relationships between the 2 variables. Also if X and Y have different dimensions.

OPIGTREAT

On the 19th of March OPIG set off on our group retreat – henceforth referred to as the OPIGTREAT.

We kicked off a little late as apparently Saulo and check in times are not a good combination (though he is an expert at reversing on an icy road).

Jin and Flo gave the first talk on web programming specifically Flask and D3. If I understood correctly flask is a web development framework for python that runs everything on the server side. Whereas D3 is data/driven/document, which appears to be a way of making very pretty things.

Garrett then gave us an impressive overview on the area of docking, thinking about whether docking had improved in the last 10 years. He discussed how docking can be used to both predict the binding mode (the orientation and conformation) as well as the binding affinity. The state of the art appears to be if we are docking a small molecule into approximately the correct binding site a native like pose can be identified but binding affinity prediction in all cases remains challenging.

Mark then attempted the impossible, he tried to give a talk explaining how to give a good talk. In this case in the context of public engagement and taking our work out to schools. I am now versed in the 4 Ms Manageable, Measurable, Made first and Most Important. I am also weirdly aware that my head shouldn’t move when I am teaching.

Ellliot then took us through how we should judge a PDB structure, a really useful skill for everyone in the group. He described measures such as resolution, B factors Rfree, Clash score, Ramachandran outliers, sidechain outliers and RSRZ outliers. Interesting facts that I collected the average resolution of an X-ray structure in the PDB is ~2A and the average Rfree is 0.25. I also learnt of the existence of PDBredo a service that re-refines datasets in the PDB.

Saulo and the Fergi were up next and they treated us all to a short talk and then a Jupyter notebook practical on machine learning. They discussed supervised, unsupervised and reinforcement learning. Giving examples of each and how and when they should/could be used. Claire and I then learnt a great deal about Jupyter notebooks, the most important thing being to press shift enter. Useful facts “out of the bag” is a method for measuring the error of random forests, score using all data points apart from those used to make that tree.  

The evening finished with a film about the evil iniquities of smoking (very high brow stuff!?!).

The second day began with Bernhard (a visitor from the far of land of Barcelona these days) talking to us about his latest research project. As this is his story – no details in the blog.

Claire then gave an update of the talk she gave at the last OPIGTREAT – how to make “stuff” pretty. Obviously a popular topic as we all wish to display our data and findings in a way that is easily interpretable as well as visually appealing. Claire took us through some of the tools to use like ggplot and Pymol – showed us where to find the lists of useful commands and then showed us the types of images you could make if you really put some thought into it.

Anne was up next, she discussed the challenges and opportunities of integrating heterogeneous data sources and she came up with a lot of data sources to think about, running from protein structures, protein interactions, small molecule structures, drug safety, drug targets, functional annotation and pathways. One thing to remember probably don’t tell your boss when she should or shouldn’t be taking notes……

It was then the turn of team networks Javi, James and Lyuba who walked us through the basics of networks and expanded on their uses across multiple data types in biology. They mentioned areas from simple motifs to protein structure, MD simulations, ontologies, disease prediction, drug target identification…. We then had a practical to check we had understood the power of networks! The networks under consideration were dolphins, Myoglobin structure, Facebook data and the mystery voter network (where we discovered that Fergus the first in no way tried to rig the vote for what film to watch).

That afternoon I visited the bird sanctuary just down the road, others went to a gin distillery or on a walk. Top quote of the afternoon was from James “I want the birds to eat from my pants”. I believe he is from one of those countries that has the misguided belief that pants means trousers. Actually I could have a different top quote from Alex about somebody being a cheap ride in his dreams but I think I should pass over that one.

That evening we were treated to a fragment based drug discovery extravaganza headed up Hannah, Susan and Joe. They took us through the use of fragments for drug discovery and then we attempted a practical. I seem to remember that Claire and I once again excelled at shift enter on the Jupyter notebook.

That evening we had a pub quiz, which apparently ended in a draw between all the teams playing. I feel that Claire and Flo as quizmasters might have made a minor miscalculation. I was happy though as I ended up with the minions bowl and cup. I also managed to persuade several grown men to jump and smash chocolate eggs on their heads on the ceiling.

Next morning Alex and Matt were up first. In their talk they demonstrated not only their knowledge on the area of the future immunotherapy repertoire but also their ability to finish each other’s sentences. They gave a really excellent overview of current immunotherapies and where the field is moving and what might be the future. Facts to store in the head, first ever approved AB therapeutic Muromonab (1986). Currently most successful Humira (Adalimumab) from Abbvie worth 18.4b dollars in 2017, this is a fully human AB for autoimmune diseases and binds to the mediator of inflammation (TNF-alpha).

Next up Catherine and Lucian who discussed distributed computing in PySpark, they started by explaining why distributed computing is going to become so important. Basic info by 2025, 100 million to 2 billion human genomes will have been sequenced that is 2 – 40 exabytes of data. They discussed distributed computing vs centralised and Pyspark compared to Hadoop. There was a practical but Mark had to solo perform for the audience leading to one of the top photos of the whole OPIGTREAT.

As a punishment for being in charge I gave the final talk where I discussed future research direction and how you decide what those might be.

So with thanks to all of the group that concludes the OPIGTREAT report.

New avenues in antibody engineering

Hi everyone,

In this blog post I would like to review an unusual antibody scaffold that can potentially give rise to a new avenue in antibody engineering. Here, I will discuss a couple of papers that complement each others research.

My DPhil is centered on antibody NGS (Ig-seq) data analysis. I always map an antibody sequence to its structure as the three-dimensional antibody configuration dictates its function, the piece of information that cannot be obtained from just the nucleotide or amino acid sequence. When I work with human Ig-seq data, I bear in mind that antibodies are composed of two pairs of light and heavy chains that tune the antibody towards its cognate antigen. In the light of recent research discoveries, Tan et al., found that antibody repertoires of people that live in malaria endemic regions have adopted a unusual property to defend the body from the pathogen (1). Several studies followed up on this discovery to further dissect the yet uncharacterized property of antibodies.

Malaria parasites in the erythrocytic stage produce RIFIN proteins that are displayed on the surface of the erythrocytes. The main function of RIFINs is to bind to the LAIR1 receptors that are found on the surface on the immune cells. The LAIR1 receptor is inhibitory, which leads to inhibition of the immune system. The endogenous ligand of the LAIR1 receptor is collagen, which is found on the surface of body cells. This is to make sure that the immune cells will not be activated against its own body. Activating the LAIR1 receptors is one of the escape mechanisms that the malaria parasite has evolved.

Tan et al., (1) showed that in an evolutionary arms race between human and malaria, our immune system has harnessed the property of RIFINs to bind to LAIR1 against the parasite itself. By doing single B cell isolation and sequencing, it was discovered that antibodies, which are the effector molecules of our immune system, can incorporate the LAIR1 protein in its structure. Taking into account our knowledge of antibody engineering, the idea of incorporating a 100 amino acid long protein into antibody structure is very hard to comprehend. Sequences of these antibodies showed that the LAIR1 insertion was introduced to CDR-H3. Recently, the crystal structure of this construct has become available (2). The crystal structure revealed that the LAIR1 insertion indeed is structurally functional. All 5 of antibody canonical CDRs interact with the LAIR1 protein and its linkers to accommodate the insertion. The CDR-L3 forms two disulfide bonds with the liker to orientate the LAIR1 protein in the way, it will interact with RIFINs. It is worth to stress that LAIR1 sequence differs from the wild type, but the structure is very similar (<0.5 RMSD). The change in sequence and structure is crucial to prevent the LAIR1 containing antibody from interacting with collagen, but only with RIFINs.

Pieper et al., (3) tried to interrogate the modality of LAIR1 insertions into antibody structures. It was performed by single cell sequences as well as NGS of the antibody shift region. It turns out that human antibodies can accommodate two types of insertion modalities and can form   camelid-like antibodies. The insertion of LAIR1 can happen to CDR-H3, leading to the loss of antibody binding to its cognate antigen. Another modality is the incorporation of the LAIR1 protein to the shift region of the antibody. This kind of insertion does not interfere with the Fv domain binding properties, which leads to creating of  bi-specific antibodies. The last finding was the insertion of the LAIR1 into antibody structure where D, J and most of V genes, and the light chain were deleted. The resultant scaffold is structurally viable and only possesses the heavy chain. Hence, it is the evidence that human antibodies can also form camelid-like antibodies. Interestingly, these insertions into the shift region are not exclusive to people that live in malaria endemic regions. By doing NGS of the shift domain from European donors, around 1 in 1000 antibody sequences had an insertion of varying lengths. These insertions are introduced from different chromosomes of both intergenic and genic regions.

To sum up, it is very intriguing that our immune system has evolved to create camelid-like and bi-specific antibodies. It will be very informative to try to crystallize these structures to see how these antibodies accommodate the insertion of LAIR1. Current antibody NGS data analysis primarily concentrates on the heavy chain due to sequencing technology limitations. It will be invaluable information if we could sequence the entire heavy chain as well as adjacent shift region to see how our immune system matures and activates against pathogens.

 

  1. Tan J, Pieper K, Piccoli L, Abdi A, Foglierini M, Geiger R, Maria Tully C, Jarrossay D, Maina Ndungu F, Wambua J, et al. A LAIR1 insertion generates broadly reactive antibodies against malaria variant antigens. Nature (2016) 529:105–109. doi:10.1038/nature16450
  2. Hsieh FL, Higgins MK. The structure of a LAIR1-containing human antibody reveals a novel mechanism of antigen recognition. Elife (2017) 6: doi:10.7554/eLife.27311
  3. Pieper K, Tan J, Piccoli L, Foglierini M, Barbieri S, Chen Y, Silacci-Fregni C, Wolf T, Jarrossay D, Anderle M, et al. Public antibodies to malaria antigens generated by two LAIR1 insertion modalities. Nature (2017) 548:597–601. doi:10.1038/nature23670

 

Helpful resources for people studying therapeutic antibodies

My work within OPIG involves studying therapeutic antibodies. It can be tough to find information about these commercial molecules, often known by unintelligible developmental names until the later stages of clinical trials. Their structures are frequently absent, as one might expect, but even their sequences are sometimes a nightmare to get hold of! Below is a list of resources that I have found particularly helpful.

IDENTITIES OF RELEVANT ANTIBODIES

1. Wikipedia (don’t judge!) is an extremely helpful resource to get started. They have the following databases:

(a) A list of FDA-approved therapeutic monoclonal antibody therapies
(b) A more general list of therapeutic, diagnostic and preventive monoclonal antibodies (includes some things that have been withdrawn)

2. The Antibody Society has list of FDA/EU approved and antibodies to watch on their website. NB: This is only available to members of the society (free for students and other concessions, standard membership is $100pa).

3. The journal ‘mAbs’ also has a series of ‘Antibodies to Watch in [Year]’ papers. Here are the ones for 2016, 2017 and 2018.

SEQUENCES

4. 137 clinical-stage (post-phase I) mAb sequences can be found in the SI of this paper by Jain et al.

5. A slightly outdated (last updated Nov 2016), but still extremely useful, resource of antibody seqeunces is this FASTA list, written by Dr Martin’s Group at UCL.

SEQUENCES & STRUCTURES

6. The IMGT monoclonal antibody database (mAb-DB) has been possibly the most helpful resource. This includes 798 entries of both therapeutics and non-therapeutics, so it’s helpful to get a list of the antibodies you are interested in first. You can search it with a wide range of parameters, including antibody name. A typical antibody result will include its mAb-DB ID, INN details, common & developmental names, species, receptor type and isotype, sequence (via the “IMGT/2Dstructure-DB” link), target, clinical trials details and – if available – the 3D structure (via the “IMGT/3Dstructure-DB” link).

7. SAbDab has a continually-updated section for all therapeutic antibody structures deposited in the PDB.

CURRENT STATUS OF THE THERAPEUTIC

8. Search the therapeutic name on AdisInsight, or Pharmacodia to see its current clinical trial status, and whether or not it has been withdrawn.

I just wanted TensorFlow

Finally got TensorFlow to install on my Mac. You’d be tempted to think, “Jin, it’s just a pip install, surely?”

No, MacOS begs to differ! You see, if you’re on a slightly older macOS version like I was (10.12), then you’d still be using TLS 1.0 – long story short, when querying PyPI via pip to get any packages on TLS 1.0, your requests will get rejected. And this cutoff was chosen something like a week ago – SAD! If you have MacOS 10.13 and onward, TLS should be set to 1.2 so you need not worry.

TL;DR:

  1. Get a new version of pip (10.0); see Stack Overflow post.
  2. Install any dependencies for pip as necessary by doing tons of source compilations.
  3. Install desired package(s) as necessary.

Biophysical Society 62nd Annual Meeting

In February I was very fortunate to attend the Biophysical Society 62nd Annual Meeting, which was held in San Francisco – my first real conference and my first trip to North America. Despite arriving with the flu, I had a great time! The conference took place over five days, during which there were manageable 15-minute talks covering a huge range of Biophysics-related topics, and a few thousand more posters on display (including mine). With almost 6,500 attendees, it was also large enough to slip across the road to the excellent SF Museum of Modern Art without anyone noticing.

The best presentation of the conference was, of course, Saulo’s talk on integrating biological folding features into protein structure prediction [1]. Aside from that, here are a few more of my favourites:

Folding proteins from one end to the other
Micayla A. Bowman, Patricia L. Clark [2]

Here in the COFFEE (COtranslational Folding Family of Expert Enthusiasts) office, we love to talk about the vectorial nature of cotranslational folding and how it contributes to the efficiency of protein folding in vivo. Micayla Bowman and Patricia Clark have created a novel technique that will allow the effects of this vectorial folding to be investigated specifically in vitro.

The Clp complex grabs, unfolds and degrades proteins (diagram from [3]). ClpX, the translocase unit of this complex, was used to recapitulate vectorial protein refolding in vitro for the first time.

ClpX is an A+++ molecular motor that grabs proteins and translocates them through its pore. In vivo, its role is to denature substrates and feed them to an associated protease (ClpP) [3]. Bowman & Clark have used protein tags to initiate translocation of the target protein through ClpX, resulting in either N-C or C-N vectorial refolding.

The YKB construct used to demonstrate the vectorial folding mediated by ClpX (diagram from [4]).

They demonstrate the effect using YKB, a construct with two mutually exclusive native states: YK-B (fluoresces yellow) and Y-KB (fluoresces blue) [4]. In vitro refolding results in an equal proportion of yellow and blue states. Cotranslational folding, which proceeds in the N-C direction, biases towards the yellow (YK-B) state. C-N refolding in the presence of ClpX and ATP biases towards the blue (Y-KB) state. With this neat assay, they demonstrate that ClpX can mediate vectorial folding in vitro, and they plan to use the assay to investigate its effect on protein folding pathways and yields.

An ambiguous view of protein architecture
Guillaume Postic, Charlotte Perin, Yassine Ghouzam, Jean-Christope Gelly [Poster abstract: 5, Paper: 6]

This work addresses the ambiguity of domain definition by assigning multiple possible domain boundaries to protein structures. Their automated method, SWORD (Swift and Optimised Recognition of Domains), performs protein partitioning via the hierarchical clustering of protein units (PUs) [7], which are smaller than domains and larger than secondary structures. The structure is first decomposed into protein units, which are then merged depending on the resulting “separation criterion” (relative contact probabilities) and “compactness” (contact density).

Their method is able to reproduce the multiple conflicting definitions that often exist between domain databases such as SCOP and CATH. Additionally, they present a number of cases for which the alternative domain definitions have interesting implications, such as highlighting early folding regions or functional subdomains within “single-domain” structures.

Alternative SWORD domain delineations identify (R) an ultrafast folding domain and (S,T) stable autonomous folding regions within proteins designated single-domain by other methods [6]

Dual function of the trigger factor chaperone in nascent protein folding
Kaixian Liu, Kevin Maciuba, Christian M. Kaiser [8]

The authors of this work used optical tweezers to study the cotranslational folding of the first two domains of 5-domain protein elongation factor G.

In agreement with a number of other presentations at the conference, they report that interactions with the ribosome surface during the early stages of translation slows folding by stabilising disordered states, preventing both native and misfolded conformations. They found that the N-terminal domain (G domain) folds independently, while the subsequent folding of the second domain (Domain II) requires the presence of the folded G domain. Furthermore, while partially extruded, unfolded domain II destabilises the native G domain conformation and leads to misfolding. This is prevented in the presence of the chaperone Trigger factor, which protects the G domain from unproductive interactions and unfolding by stabilising the native conformation. This work demonstrates interesting mechanisms by which Trigger factor and the ribosome can influence the cotranslational folding pathway.

Optical tweezers are used to interrogate the folding pathway of a protein during stalled cotranslational folding. Mechanical force applied to the ribosome and the N-terminal of the nascent chain causes unfolding events, which can be identified as sudden increases in the extension of the chain. (Figure from [9])

Predicting protein contact maps directly from primary sequence without the need for homologs
Thrasyvoulos Karydis, Joseph M. Jacobson [10]

The prediction of protein contacts from primary sequence is an enormously powerful tool, particularly for predicting protein structures. A major limitation is that current methods using coevolution inference require a large multiple sequence alignment, which is not possible for targets without many known homologous sequences.

In this talk, Thrasyvoulos Karydis presented CoMET (Convolutional Motif Embeddings Tool), a tool to predict protein contact maps without a multiple sequence alignment or coevolution data. They extract structural and sequence motifs from known sequence-structure pairs, and use a Deep Convolutional Neural Network to associate sequence and structure motif embeddings. The method was trained on 137,000 sequence-structure pairs with a maximum of 256 residues, and is able to recreate contact map patterns with low resolution from primary sequence alone. There is no paper on this yet, but we’ll be looking out for it!


1. de Oliveira, S.H. and Deane, C.M., 2018. Exploring Folding Features in Protein Structure Prediction. Biophysical Journal, 114(3), p.36a.
2. Bowman, M.A. and Clark, P.L., 2018. Folding Proteins From One End to the Other. Biophysical Journal, 114(3), p.200a.
3. Baker, T.A. and Sauer, R.T., 2012. ClpXP, an ATP-powered unfolding and protein-degradation machine. Biochimica et Biophysica Acta (BBA)-Molecular Cell Research, 1823(1), pp.15-28.
Acta (BBA) – Molecular Cell Research, 2012, 1823 (1), 15-28
4. Sander, I.M., Chaney, J.L. and Clark, P.L., 2014. Expanding Anfinsen’s principle: contributions of synonymous codon selection to rational protein design. Journal of the American Chemical Society, 136(3), pp.858-861.
5. Postic, G., Périn, C., Ghouzam, Y. and Gelly, J.C., 2018. An Ambiguous View of Protein Architecture. Biophysical Journal, 114(3), p.46a.
6. Postic, G., Ghouzam, Y., Chebrek, R. and Gelly, J.C., 2017. An ambiguity principle for assigning protein structural domains. Science advances, 3(1), p.e1600552.
7. Gelly, J.C. and de Brevern, A.G., 2010. Protein Peeling 3D: new tools for analyzing protein structures. Bioinformatics, 27(1), pp.132-133.
8. Liu, K., Maciuba, K. and Kaiser, C.M., 2018. Dual Function of the Trigger Factor Chaperone in Nascent Protein Folding. Biophysical Journal, 114(3), p.552a.
9. Liu, K., Rehfus, J.E., Mattson, E. and Kaiser, C., 2017. The ribosome destabilizes native and non‐native structures in a nascent multi‐domain protein. Protein Science.
10. Karydis, T. and Jacobson, J.M., 2018. Predicting Protein Contact Maps Directly from Primary Sequence without the Need for Homologs. Biophysical Journal, 114(3), p.36a.

Working with Jupyter notebook on a remote server

To celebrate the recent beta release of Jupyter Lab (try it out of you haven’t already), today we’re going to look at how to run a Jupyter session (Notebook or Lab) on a remote server.

Suppose you have lots of data which lives on a remote server and you want to play with it in a Jupyter notebook. You can’t copy the data to your local machine (well, you can, but you’re sensible so you won’t), but you can run your Jupyter session on the remote server. There’s just one problem – since Jupyter notebook is browser-based and works by connecting to the Jupyter session running locally, you can’t just run Jupyter remotely and forward X11 like you would a traditional graphical IDE. Fortunately, the solution is simple: we run Jupyter remotely, create an ssh tunnel connecting a local port to the one used by the Jupyter session, and connect directly to the Jupyter session using our local browser. The best part about this is that you can set up the Jupyter session once then connect to it from any browser on any machine once an ssh tunnel is created, without worrying about X11 forwarding.

Here’s how to do it.

1. First, connect to the remote server if you haven’t already

ssh fergus@funkyserver

1.5. Jupyter takes browser security very seriously, so in order to access a remote session from a local browser we need to set up a password associated with the remote Jupyter session. This is stored in jupyter_notebook_config.py which by default lives in ~/.jupyter. You can edit this manually, but the easiest option is to set the password by running Jupyter with the password argument:

jupyter notebook password
>>> Enter password:

This password will be used to access any Jupyter session running from this installation, so pick something sensible. You can set a new password at any time on the remote server in exactly the same way.

2: Launch a Jupyter session on the remote server. You can specify the access port using the --port option. This might be useful on a shared server where others might be doing the same thing. You’ll also want to run this without launching a browser on the remote server since this is of no use to you.

jupyter lab --port=9000 --no-browser &

Here I’m using Jupyter Lab, but this works in exactly the same way for Jupyter Notebook.

3: Now for the fun part. Jupyter is running on our remote server, but what we really want is to work in our favourite browser on our local machine. To do this we just need to create an ssh tunnel between a port on our machine and the port our Jupyter session is using on the remote server. On our local machine:

ssh -N -f -L 8888:localhost:9000 fergus@funkyserver

For those not familiar with ssh tunneling, we’ve just created a secure, encrypted connection between port 8888 on our local machine and port 9000 on our remote server.

  • -N tells ssh we won’t be running any remote processes using the connection. This is useful for situations like this where all we want to do is port forwarding.
  • -f runs ssh in the background, so we don’t need to keep a terminal session running just for the tunnel.
  • -L specifies that we’ll be forwarding a local port to a remote address and port. In this case, we’re forwarding port 8888 on our machine to port 9000 on the remote server. The name ‘localhost’ just means ‘this computer’. If you’re a Java programmer who lives for verbosity, you could equivalently pass -L localhost:8888:localhost:9000.

4: If you’ve done everything correctly, you should now be able to access your Jupyter session via port 8888 on your machine. Fire up your favourite browser and type localhost:8888 into the address bar. This should bring up a Jupyter session and prompt you for a password. Enter the password you specified for Jupyter on the remote server.

Congratulations! You now have a Jupyter session running remotely which you can connect to anytime, anywhere, from any machine.

Disclaimer: I haven’t tried this on Windows, nor do I intend to. I value my sanity.

Dealing with indexes when processing co-evolution signals (or how to navigate through “sequence hell”)

Co-evolution techniques provide a powerful way to extract structural information from the wealth of protein sequence data that we now have available. These techniques are predicated upon the notion that residues that share spatial proximity in a protein structure will mutate in a correlated fashion (co-evolve). This co-evolution signal can be inferred from a multiple sequence alignment, which tells us a bit about the evolutionary history of a particular protein family. If you want to have a better gauge at the power of co-evolution, you can refer to some of our previous posts (post1, post2).

This is more of a practical post, where I hope to illustrate an indexing problem (and how to circumvent it) that one commonly encounters when dealing with co-evolution signals.

Most of the co-evolution tools available Today output pairs of residues (i,j) that were predicted to be co-evolving from a multiple sequence alignment. One of the main applications of these techniques is to predict protein contacts, that is pairs of residues that are within a predetermined distance (quite often 8Å).  Say you want to compare the precision of different co-evolution methods for a particular test set. Your test set would consist of a number of proteins for which the structure is known and for which sufficient sequence information is available for the contact prediction to be carried out. Great!

So you start with your target sequences, generate a number of contact predictions of the form (i,j) for each sequence and, for each pair, you check if the ith and jth residues are in contact (say, less than 8Å apart) on the corresponding known protein structure. If you actually carry out this test, you will obtain appalling precision for a large number of test cases. This is due to an index disparity that a friend of mine quite aptly described as “sequence hell”.

This indexing disparity occurs because there is a mismatch between the protein sequence that was used to produce the contact predictions and the sequence of residues that are represented in a protein structure. Ask a crystallographer friend if you have one, and you will find that in the process of resolving a protein’s structure experimentally, there are many reasons why residues would be missing in the final structure. More so, there are even cases where residues had to be added to enable protein expression and/or crystallisation. This implies that the protein sequence (represented by a fasta file) frequently has more (and sometimes fewer) residues than the proteins structure (represented by a PDB file).  This means that if the ith  and jth residues in your sequence were predicted to be in contact, that does not mean that they correspond to the ith and jth residues in order of appearance in your protein structure. So what do we do now?

A true believer in the purity and innocence of the world would assume that the SEQRES entries in your PDB file, for instance, would come to the rescue. The SEQRES describes the sequence of residues exactly as they appear on the atom coordinates of a particular PDB file. This would be a great way of mitigating the effects of added or altered residues, and would potentially mitigate the effects of residues that were not present in the construct. In other words, the sequences described by SEQRES would be a good candidate to validate whether your predicted contacts are present in the structure. They do, however, contain one limitation. SEQRES also describe any residues whose coordinates were missing in the PDB. This means that if you process the PDB sequentially and that some residues could not be resolved, the ith residue to appear on the PDB could be different to the ith residue in the SEQRES.

An even more innocent person, shielded from all the ugliness of the the universe, would simply hope that the indexing on the PDB is correct, i.e. that one can use the residue indexes presented on the “6th column” of the ATOM entries and that these would match perfectly to the (i,j) pair you obtained using your protein sequence. While, in theory, I believe this should be the case, in my experience this indexing is often incorrect and more frequently than not, will lead to errors when validating protein contacts.

My solution to the indexing problem is to parse the PDB sequentially and extract the sequence of all the residues for which coordinates are actually present. To my knowledge, this is the only true and tested way of obtaining this information. If you do that, you will be armed with a sequence and indexing that correctly represent the indexing of your PDB. From now on, I will refer to these as the PDB-sequence and PDB-sequence indexing.

All that is left is to find a correspondence (a mapping) between the sequence you used for the contact prediction and the PDB-sequence. I do that by performing a standard (global) sequence alignment using the Needleman–Wunsch algorithm. Once in possession of such an alignment, the indexes (i,j) of your original sequence can be matched to adjusted indexes (i',j') on your PDB-sequence indexing. In short, you extracted a sequential list of residues as they appeared on the PDB, aligned these to the original protein sequence, and created a new set of residue pairings of the form (i',j') which are representative of the indexing in PDB-sequence space. That means that the i’th residue to appear on the PDB was predicted to be in contact with the j’th residue to appear.

The problem becomes a little more interesting when you hope to validate the contact predictions for other proteins with known structure in the same protein family. A more robust approach is to use the sequence alignment that is created as part of the co-evolution prediction as your basis. You then identify the sequence that best represents the PDB-sequence of your homologous protein by performing N global sequence alignments (where N is the number of sequences in your MSA), one per entry of the MSA. The highest scoring alignment can then be used to map the indexing. This approach is robust enough that if your homologous PDB-sequence of interest was not present in the original MSA for whatever reason, you should still get a sensible mapping at the end (all limitations of sequence alignment considered).

One final consideration should be brought to the reader’s attention. What happens if, using the sequence as a starting point, one obtains one or more (i,j) pairs where either i or j is not resolved/present in the protein structure? For validation purposes, often these pairs are disregards. Yet, what does this co-evolutionary signal tell us about the missing residues in the structure? Are they disordered/flexible? Could the co-evolution help us identify low occupancy conformations?

I’ll leave the reader with these questions to digest. I hope this post proves helpful to those braving the seas of “sequence hell” in the near future.

Interactive Illustration of Collaboration Networks with D3

D3 is a JavaScript Library that allows the creation of interactive data visualisations. D3 stands for Data-Driven Documents and its advantage is that it is using internet standards as HTML, CSS, and SVG as the foundation. This gives maximal compatibility across all moderns browsers. It is widely used by journalists, data scientists, and starts to be used by academic scientists, too.

Here I want to present a simple way of creating interactive illustrations of collaboration networks, e.g., for your own website. Before going into details, have a look at the example below, it presents some of Rosalind Franklin’s papers and her co-authors.

[d3-source canvas=”wpd3-3903-2″]

The network illustration is a so-called bipartite network because it consists of two types of nodes, blue nodes represent scientists and orange nodes represent publications. These nodes are connected if a scientist is an author on a publication. This way of presenting is a so-called force layout, which simulates repelling Coulomb forces between all nodes and spring-like attractive forces act on nodes that are connected via links. You can drag the nodes around and explore the behaviour of the network.

You will have noticed that you can also see the name of the publications and co-authors when you hover over the nodes. For some of them, a representative figure or photograph is shown, too. You can also double-click the nodes and will be directed to the webpage of the publication or author.

For larger collaboration networks the visualisation is still possible but might get a bit messy. Thus, not showing any figures is advised. Furthermore, the name of the nodes should be shown either below the illustration or as a tooltip (the later is not possible in WordPress but on normal web pages as here). The illustration below shows all 124 publications written by OPIG, together with the 310 authors. Here, I had to remove the Head of the Group Charlotte Deane, as she is on almost all of these papers and would clutter the illustration, unfortunately. In network science, we call such a node an ego node.

[d3-source canvas=”wpd3-3903-0″]

If you want to play around with this, check out my Github repository. The creation of the network is fairly simple. You only need a Bibtex file and can then execute a Python script that creates a JSON file that is read into the JavaSript. This can then be incorporated in all web pages easily.

PS: To enable D3 in WordPress you need a special plugin.  Apparently, it is not up to date with the current stable D3 version 4 but you can load any missing functions manually.

Fun with Proteins and 3D Printing!

When I’m not postdoc-ing, as part of my job I’m also involved with teaching at the Doctoral Training Centre here in Oxford. I mainly teach the first-year students of the Systems Approaches to Biomedical Science CDT – many members of this group are doing (or have done) their DPhils through this program (including myself!). Recently, I and some other OPIGlets were responsible for two modules called Structural Biology and Structure-Based Drug Discovery, and as part of those modules we arranged a practical session on 3D printing.

Most of the time, the way we ‘see’ protein structures is through a computer screen, using visualisation software such as PyMOL. While useful, these virtual representations have their limitations – since the screen is flat, it’s difficult to get a proper feel for the structure1, and seeing how your protein could interact and form assemblies with others is difficult. Physical, three-dimensional models, on the other hand, allow you to get hands-on with your structure, and understand aspects of your protein that couldn’t be gained from simply looking at images. Plus, they look pretty cool!

This year, I printed three proteins for myself (shown in the photo above). Since my most recent work has focused on transmembrane proteins, I felt it was only right to print one – these are proteins that cross membranes, usually to facilitate the transport of molecules in and out of the cell. I chose the structure of a porin (top of the photo), which (as the name suggests) forms a pore in the cell membrane to allow diffusion across it. This particular protein (1A0S) is a sucrose-specific porin from a type of bacteria called Salmonella typhimurium, and it has three chains (coloured blue, pink and purple in the printed model), each of which has a beta barrel structure. You can just about see in the photo that each chain has regions which are lighter in colour – these are the parts that sit in the cell membrane layer; the darker regions are therefore the parts that stick out from the membrane.

My second printed model was the infamous Zika virus (bottom right). Despite all the trouble it has caused in recent years, in my opinion the structure of the Zika virus is actually quite beautiful, with the envelope proteins forming star-like shapes in a highly symmetrical pattern. This sphere of proteins contains the viral RNA. The particular structure I used to create the model (5IRE) was solved using cryo-electron microscopy, and required aligning over 10,000 images of the virus.

Finally, I printed the structure of a six-residue peptide, that’s probably only interesting to me… Can you tell why?!2

 

1 – However, look at this link for an example of looking at 3D structures using augmented reality!

2 – Hint: Cysteine, Leucine, Alanine, Isoleucine, Arginine, Glutamic Acid…