Teaching Network Science to High School Students

In recent years, a lot of effort has gone into outreach events, particularly for science and mathematics. Here, I am going to describe a summer course on network science which I organised and taught together with Benjamin F. Maier from Humboldt University Berlin.

The course was part of an established German summer school called the Deutsche Schülerakademie (German Pupils' Academy), an extracurricular event for highly motivated pupils. It lasts sixteen days, and the participants join one of six courses, which cover the full range of academic disciplines, from philosophy and music to the sciences.

Our course was titled Netzwerke und Komplexe Systeme (Networks and Complex Systems). Rather than going into great depth in one particular area, we covered a broad selection of topics, as we wanted to give the students an overview and an idea of how different disciplines approach complex phenomena. We discussed pure-mathematics topics such as graph colouring, algorithmic problems such as the travelling salesman problem, as well as social network analysis, computational neuroscience, dynamical systems, and fractals.

A network of the former monastery in Rossleben, where the summer school was held. The students created the network themselves. To parallelise the task they split up into four groups, each covering one level of the building. They then used this network to simulate the spread of a contagious disease, starting at the biological lab (A35, in red).

A couple of thoughts on what went well and which parts might need improvement for future events of this kind:

  • We sent out a questionnaire beforehand, asking the pupils questions such as “Do you know what a vector is?” as well as about their motivation for joining the course. This was very helpful for getting a rough idea of their knowledge level.
  • We gave them some material to read before the course. In retrospect, it would probably have been better to give them something to read as well as some problems to solve, so that the learning outcome is clearer and more effective.
  • The students gave presentations on topics we chose for them based on their answers to the questionnaire. The presentations were good, but many students overran the allocated time because they were very enthusiastic about their topics.
  • The students were also enthusiastic about the programming exercises, for which we used Python and the NetworkX library. One challenge was the heterogeneity in programming experience, which made it necessary to split the participants into two groups, beginners and advanced.
  • In contrast to courses covering similar topics at university level, the students did not have the necessary mathematical background for the more complicated aspects of network science. Accordingly, it is better to cover fewer of these and to allocate time beforehand to introduce the mathematical methods, for example eigenvectors or differential equations.
  • The students very much liked hands-on exercises, for example the creation of random networks with different connection probabilities with the help of dice, or the creation of a network from the floor plan of the building in which the summer school was held, as shown in the figure.

It was great fun to introduce the students to the topic of network science and I can strongly recommend organising similar outreach events! You can find some of our teaching materials, including the worksheets and programming exercises in the original German and a translated English version, online. A paper describing our endeavours is under review.

Four Erdős–Rényi random graphs as generated by the participants by rolling dice. A twenty-sided die was used for the probabilities p = 1/20 and p = 1/10, and a six-sided die for p = 1/6 and p = 1/3. This fun exercise allows the discussion of degree distributions, the size of the largest connected component, and similar topics for ER random graphs.
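If you would like to recreate the dice exercise in code, the snippet below is a minimal sketch using Python and NetworkX (the library we used in the course); the number of nodes and the probabilities are illustrative choices, not the exact values from the exercise.

import networkx as nx

n = 50  # illustrative number of nodes
for p in (1/20, 1/10, 1/6, 1/3):
    G = nx.gnp_random_graph(n, p)  # Erdős–Rényi G(n, p) random graph
    degrees = [d for _, d in G.degree()]
    largest = max(nx.connected_components(G), key=len)
    print(f"p = {p:.3f}: mean degree {sum(degrees) / n:.2f}, "
          f"largest connected component {len(largest)} of {n} nodes")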

Young Entrepreneurs Scheme (YES) Competition

Fair warning: I’m going to use this BLOPIG post to promote the YES competition and talk about how semi-amazingly we did at it!

For those who don’t know, the YES competition runs yearly and is designed to develop entrepreneurial spirit amongst graduates and post-graduates. The YES workshops come in two parts, the first being an intensive crash course in small business start-ups. These sessions are delivered by financial experts, successful start-ups, and intellectual property teams. Carefully mixing theory with useful anecdotes, the talks were hugely insightful and all enthusiastically given by people passionate about science start-ups. We were lucky to have many of these speakers mentoring for the second part: the development of our own business plan.

Our team, the fantastically named Team SolOx, developed a licence-selling business for a theoretical catalyst which mimicked photosynthesis. Our product produced lightweight hydrocarbons from atmospheric gases quickly and efficiently. The comic value of our idea aside, we designed a ten-year business plan that saw SolOx develop and license our catalyst. The process was eye-opening, with the mentors highlighting the hurdles we would face and teaching us how to overcome them. Our pitch landed us a place in the final at the Royal Society in London, as one of the winners of the YES Industrial Challenges workshop 2017. Although the final judging panel didn’t find our plan as financially sound as others, we had a fantastic experience and would thoroughly recommend the competition to anyone interested in business start-ups.

Team SolOx: Winners of the YES Industrial Challenges 2017 Workshop. Left to right: Natasha Rhys, Tom Dixon, Joe Bluck, Sarah-Beth Amos and Alex Skates.

Finally, I would like to thank the Systems Approaches to Medical Science Centre for Doctoral Training for their financial support and their focus on promoting entrepreneurial skills.

A short intro to machine precision and how to beat it

Most people who’ve ever sat through a Rounding Error is Bad lecture will be familiar with the following example:

> (0.1+0.1+0.1) == 0.3
FALSE

The reason this is so unsettling is that most of the time we think about numbers in base-10. This means we use ten digits \{0, 1, \dots, 9\}, and we perform arithmetic based on this ten-digit notation. This doesn’t always matter much for pen-and-paper maths, but it’s an integral part of how we think about more complex operations and, in particular, how we think about accuracy. We see 0.1 as a finite decimal fraction, so it’s only natural that we should be able to do accurate sums with it. And if we can do simple arithmetic, then surely computers can too? In this blog post I’m going to try to briefly explain what causes rounding errors such as the one above, and how we might get away with going beyond machine precision.

Take a number x \in [0; 1), say x=1/3. The decimal representation of x is of the form x=\sum_{i=1}^{\infty} a_i \times 10^{-i}. The a_i \in \{0, 1, \dots, 9\} here are the digits that go after the radix point. In the case of x=1/3 these are all equal, a_i=3, i.e. x=0.333\dots _{10}. Some numbers, such as our favourite x, don’t have a finite decimal expansion. Others, such as 0.3, do, meaning that after some i \in \mathbb{N}, all subsequent digits a_{i+j}=0. When we talk about rounding errors and accuracy, what we actually mean is that we only care about the first few digits, say i\leq 5, and we’re happy to approximate to x\approx \sum_{i=1}^{5} a_i \times 10^{-i}=0.33333, potentially rounding up at the last digit.

Computers, on the other hand, store numbers in base-2 rather than base-10, which means that they use a different series expansion x=\sum_{i=1}^{\infty} b_i \times 2^{-i}, b_i \in \{0, 1\} to represent the same number. Our favourite number x is actually stored as 0.1010101\dots _{2} rather than 0.3333333\dots _{10}, despite the fact it appears as the latter on a computer screen. Crucially, arithmetic is done in base-2 and, since only a finite number of binary digits are stored (i\leq 52 for most purposes these days), rounding errors also occur in base-2.

All numbers with a finite binary expansion, such as 0.25_{10}=0\times 1/2+1\times 1/4=0.01_{2}, also have a finite decimal expansion, meaning we can do accurate arithmetic with them in both systems. However, the reverse isn’t true, which is what causes the issue with 0.1+0.1+0.1\neq 0.3. In binary, the nice and tidy 0.1_{10} becomes the repeating 0.00011001100\dots _{2}. We observe the rounding error because, unlike us, the computer is trying to sum an infinite series.
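If you want to see this for yourself, the short Python snippet below (a quick illustration on my part, not part of the original example) exposes the exact value that is actually stored when you write 0.1:

from fractions import Fraction

print(Fraction(0.1))            # the exact dyadic rational stored for the double 0.1
print((0.1).hex())              # its binary significand and exponent, in hex notation
print(0.1 + 0.1 + 0.1 == 0.3)   # False: the three rounding errors do not cancel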

While it’s not possible to do infinite sums with finite resources, there is a way to go beyond machine precision if you wanted to, at least for rational x=p/q, where p, q \in \mathbb{N}. In the example above, the issue comes from dividing by 10 on each side of the (in)equality. Luckily for us, we can avoid doing so. Integer arithmetic is easy in any base, and so

> (1+1+1) == 3 
TRUE

Shocking, I know. On a more serious note, it is possible to write an algorithm which calculates the binary expansion of x=p/q using only integer arithmetic. The usual binary expansion algorithm looks like this:

set x, maxIter
initialise b, i=1
while x>0 AND i<=maxIter {
 if 2*x>=1
    b[i]=1
 else
    b[i]=0
 x = 2*x-b[i]
 i = i+1 
}
return b

Problems arise whenever we try to compute something non-integer (lines 3, 4, and 8, which involve the non-integer x). However, we can rewrite these using x = p/q and shifting the division by q to the right-hand side of each inequality or assignment operator:

set p, q, maxIter
initialise b, i=1 
while p>0 AND i<=maxIter { 
 if 2*p>=q 
    b[i]=1 
 else 
    b[i]=0 
 p = 2*p-b[i]*q 
 i = i+1 
} 
return b

Provided we’re not dealing with monstrously large integers (i.e. as long as we can safely double p), implementing the above lets us compute p/q with arbitrary precision given by maxIter. So we can beat machine precision for rationals! And the combination of arbitrarily accurate rationals and arbitrarily accurate series approximations (think Riemann zeta function, for example) means we can also get the occasional arbitrarily accurate irrational.
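For the curious, here is a minimal Python sketch of the integer-only algorithm above; the function name and the default precision of 52 binary digits are my own choices rather than part of the original pseudocode.

def binary_expansion(p, q, max_iter=52):
    # Binary digits of the rational p/q, with 0 <= p/q < 1, computed using
    # only integer arithmetic so that no intermediate result is ever rounded.
    bits = []
    i = 1
    while p > 0 and i <= max_iter:
        bit = 1 if 2 * p >= q else 0
        bits.append(bit)
        p = 2 * p - bit * q
        i += 1
    return bits

print(binary_expansion(1, 10, 12))  # [0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1], i.e. 0.000110011001..._2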

To sum up, rounding errors are annoying, partly because it’s not always intuitive when and how they happen. As a general rule the best way to avoid them is to make your computer do as little work as possible, and to avoid non-integer calculations whenever you can. But you already knew that, didn’t you?

This post was partially inspired by the undergraduate course on Simulation and Statistical Programming lectured by Prof Julien Berestycki and Prof Robin Evans. It was also inspired by my former maths teacher who used to mark us down for doing more work than necessary even when our solutions were correct. He had a point.

The Seven Summits

Last week my boyfriend Ben Rainthorpe returned from Argentina having successfully climbed Aconcagua – the highest mountain in South America. At a staggering 6963 m above sea level, it is the highest peak outside Asia. The climb took 20 days in total, with a massive 14 hours of hiking and climbing on summit day.

Aconcagua is part of the mountaineering challenge known as the Seven Summits, which is achieved by summiting the highest mountain on each of the seven continents. The challenge was first completed in 1985 by Richard Bass, and in 1992 Junko Tabei became the first woman to complete it. In December, Ben quit his job as a primary school teacher to follow his dream of achieving this feat. Which mountains constitute the Seven Summits is debated and there are a number of different lists; in addition, the challenge can be extended by including the highest volcano on each continent.

The Peaks:

1. Kilimanjaro – Africa (5895 m)

Kilimanjaro is usually the starting point for the challenge. At 5895 m above sea level, with no technical climbing required, it is a good introduction to high-altitude trekking. However, this often means it is underestimated, and the most common cause of death on the mountain is altitude sickness.

2. Aconcagua – South America (6963 m)

The next step up from Kilimanjaro, Aconcagua is the second highest of the Seven Summits. The lack of technical climbing required makes it a good second peak to ascend, although crampons and ice axes are needed and the trek takes three weeks instead of one.

3. Elbrus – Europe (5642 m)

Heralded as the Kilimanjaro of Europe, Elbrus even has a chair lift part of the way up! This mountain is regularly underestimated, causing a high number of fatalities per year. Due to the snowy conditions, crampons and ice axes are once again required. Some believe that Elbrus should not count as the European peak and that Mont Blanc should be climbed instead – a much more technical and dangerous ascent.

4. Denali – North America (6190 m).

Denali is a difficult mountain to summit. Although slightly lower than some other peaks on the list, its distance from the equator means the effects of altitude are more keenly felt, and more technical skills are needed. In addition, there are no porters to help carry gear, so climbers must carry a full pack and drag a sled.

5. Vinson Massif – Antarctica (4892 m).

Vinson is difficult because of its location rather than any technical climbing. The cost of getting to Antarctica is high and the conditions have to be battled with.

6. Puncak Jaya – Australasia (4884 m) or Kosciuszko – Australia (2228 m)

The original Seven Summits list included Mount Kosciuszko in Australia – the shortest and easiest climb on the list. However, it is now generally agreed that Puncak Jaya is the offering from the Australasian continent. Despite being lower than the others on the list, it is the hardest of the seven to climb, with the highest technical rating. It is also located in an area that is highly inaccessible to the public due to a large mine, and it is one of the few where rescue by helicopter is not possible.

7. Everest – Asia (8848 m).

Everest is the highest mountain in the world at 8848 m above sea level. Many regard the trek to Everest Base Camp as challenge enough. Some technical climbing is required, as well as bottled oxygen to safely reach that altitude. One of the most dangerous parts is the Khumbu Icefall, which must be traversed every time climbers leave Base Camp. As of 2017, at least 300 people have died on Everest – and most of their bodies still remain on the mountain.

Ben has now climbed two of the Seven Summits. His immediate plans are to tackle Elbrus in July (which I might try to tag along to) and Vinson next January. If you are interested in his progress, check out his Instagram (@benrainthorpe).

TCR Database

Back-to-back posting – I wanted to talk about the growing volume of TCR structures in the PDB. A couple of weeks ago I presented my database, STCRDab, to the group; it is now available at http://opig.stats.ox.ac.uk/webapps/stcrdab.

Unlike other databases, STCRDab is fully automated and updates on Fridays at 9AM (GMT), downloading new TCR structures and annotating them with the IMGT numbering (this also applies to MHCs!). Although the amount of data is significantly smaller than, say, the number of antibody structures (currently 3000+ and growing), the recent approval of CAR-T therapies (Kymriah, Yescarta) and the rise of interest in TCR engineering (e.g. Glanville et al., Nature, 2017; Dash et al., Nature, 2017) point toward the value of TCR structures.

Feel free to read more in the paper, and here are some screenshots. 🙂

STCRDab front page.

Look! 5men, literally.

Possibly my new favourite PDB code.

STCRDab annotates structures automatically every Friday!

ABodyBuilder and model quality

Currently I’m working on developing a new strategy for using FREAD within the ABodyBuilder pipeline. While running some tests, I realised that there were some minor miscalculations of the CDR loops’ RMSD in my paper.

To start with, the main message of the paper remains the same: the overall quality of the models (Fv RMSD) was correct, and still is. ABodyBuilder isn’t necessarily the most accurate modelling methodology per se, but it’s unique in its ability to estimate RMSD. ABodyBuilder would still be capable of doing this calculation regardless of what the CDR loops’ RMSD may be, because the accuracy estimation looks at the RMSD data and assigns a probability that a new model structure would have some RMSD value “x” (given the CDR loop’s length). Our website has now been updated in light of these changes too.
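To illustrate the idea (this is just a sketch of an empirical estimate, not the actual ABodyBuilder code or its RMSD data), one can estimate the probability that a new loop model falls within a given RMSD from RMSD values previously observed for loops of the same length:

import numpy as np

def estimated_probability(observed_rmsds, x):
    # Empirical probability that a new model's RMSD is at most x Angstroms,
    # based on RMSDs previously observed for loops of the same length.
    return float(np.mean(np.asarray(observed_rmsds) <= x))

# Hypothetical RMSD values (Angstroms) for CDR loops of one particular length.
rmsds_length_12 = [0.8, 1.1, 1.5, 2.3, 0.9, 1.7, 3.0]
print(estimated_probability(rmsds_length_12, x=2.0))  # ~0.71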

Update to Figure 2 of the paper.

Update to Figure S4 of the paper.

Update to Figure S5 of the paper.

Crystallographic programming: Super short tour of the cctbx

Two of the leading software packages in crystallography are Phenix and CCP4. Most practicing crystallographers will interact with these to take a single crystallographic dataset from diffraction images, through integration, merging, phasing and model building, to (hopefully) deposition.

However, if you want to develop crystallographic software, you will likely need to decide on a framework to build upon. Phenix is built on the comprehensive cctbx library, whereas CCP4 programs are typically standalone, although they make use of common crystallographic libraries such as clipper and cctbx.

CCTBX is written mainly in Python, with its core crystallographic functionality written in C++. My usual starting place for understanding the functionality is the pdb parser tutorial. This introduces the concept of a hierarchy, an iterative way to represent a macromolecule:

from iotbx.pdb import hierarchy

# Read a model, remove every zinc atom, prune any groups left empty,
# and write the result back out with the original crystal symmetry.
pdb_in = hierarchy.input(file_name="model.pdb")
for chain in pdb_in.hierarchy.only_model().chains() :
  for residue_group in chain.residue_groups() :
    for atom_group in residue_group.atom_groups() :
      for atom in atom_group.atoms() :
        if (atom.element.strip().upper() == "ZN") :
          atom_group.remove_atom(atom)
      if (atom_group.atoms_size() == 0) :
        residue_group.remove_atom_group(atom_group)
    if (residue_group.atom_groups_size() == 0) :
      chain.remove_residue_group(residue_group)
f = open("model_Zn_free.pdb", "w")
f.write(pdb_in.hierarchy.as_pdb_string(
  crystal_symmetry=pdb_in.input.crystal_symmetry()))
f.close()

Although there are many ways to parse a pdb file, the introduction to iotbx.pdb gives a view of how X-ray structure data can be associated with the model. The tour of the cctbx is a helpful starting place, especially for understanding how the Python and C++ functionality interact through Boost and the scitbx.array_family.flex. Unfortunately, the documentation on cctbx tends to vary in quality and quantity throughout the modules.
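As a quick taster, here is a minimal sketch (based on the iotbx.pdb tutorial and the flex documentation, so treat it as illustrative rather than canonical; the file name is a placeholder) of associating X-ray data with a parsed model and of working with a flex array:

from iotbx.pdb import hierarchy
from scitbx.array_family import flex

# Associate X-ray scatterers and crystal symmetry with the parsed model.
pdb_in = hierarchy.input(file_name="model.pdb")  # placeholder file name
xrs = pdb_in.input.xray_structure_simple()
xrs.show_summary()  # unit cell, space group, number of scatterers

# flex arrays are the C++-backed containers used throughout the cctbx.
b_factors = pdb_in.hierarchy.atoms().extract_b()  # a flex.double array
print(flex.mean(b_factors))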

Other components of the library include ways to simulate crystallographic data through simtbx, and tools for processing XFEL data.

As the library is open source, the GitHub-hosted source code allows exploration of previously written routines, which can be very helpful for understanding the inner workings of the library. Note that there are also bulletin boards for users and developers of Phenix and cctbx respectively, and a few tutorials can also be found.

Hopefully this post will give someone other than me a reminder of where to find resources to get started developing within CCTBX.

Paper review: “Inside the black box”

There are nearly 17,000 Oxford students on taught courses. They turn up reliably every October. We send them to an army of lecturers and tutors, drawn from every rank of the research hierarchy. As members of that hierarchy, we owe it to the students – all 17,000 of them – to teach them as best we can.

And where can we learn the most about how to teach? There are 438,000 professional teachers in the UK. Maybe people who spend all of their working time on the subject might have good strategies to help people learn.

The context of the paper

Teachers obsess over assessment. Assessment is the process by which teachers figure out what students have learned. It is probably true that assessment is the only reason we have classrooms at all.

Inside the Black Box is part of the vanguard of recent changes in educational thinking. Modern teaching regards good pedagogy as a practical skill. Like other types of performance, it depends on a specific set of concrete actions which can be taught and learned. Not everyone is a natural teacher – but nearly everyone can become a competent teacher.

Formative assessment is the focus of Inside the Black Box. The article argues that this process, in which teachers figure out what students know and tell them where it is going wrong, is essential to good classroom practice.

What is the black box?

The black box is the classroom. After societal convulsions over class sizes, funding deficits, curriculum reforms, and examination structure, it’s time – says the article, in 2001 – that we focus on what actually goes on inside the classroom. These social changes, it says, adjust the inputs to the black box, and society expects better things out of the black box. But what if changing the inputs makes the work inside the black box harder? Don’t we have an obligation to figure out what needs to happen to get students to learn?

The article touches three questions:

  • Is there evidence that improving formative assessment raises standards?
  • Is there evidence that there is room for improvement?
  • Is there evidence about how to improve formative assessment?

The answers are yes, yes, and yes. In meta-analyses of educational experiments, formative assessment consistently raises standards. These experiments match the experience of teachers, who know that the least effective lessons are those which do not respond to students’ needs. Standard observations – such as those from Ofsted – ask teachers: what are they learning? How do you know? And what are you doing about it?

The second question – is there room for improvement? – is one they address in great detail in the context of primary and secondary education. Some criticisms (the giving of grades for its own sake, unintentional encouragement of “rote or superficial learning”, relentless competition between students) seem applicable in different parts of our university context. A greater weakness is one of emphasis: people engaged in university teaching frequently center the delivery of knowledge instead of learning, an idea exacerbated by our obsession with lectures and masked by the long lag between those lectures and the exams in which we assess them.

Recommendations

Inside the Black Box makes specific recommendations for instructors about how to engage in formative assessment. Those recommendations – unusually, for an item in the educational literature – are specific and detailed. But rather than focus on them, it is worth examining three themes which run across the article.

The overriding focus is the importance of formative assessment. If we care about what students learn, then we’ve got to be checking what it is that they actually are learning. Opportunities for formative assessment should be “designed into any piece of teaching”. In extremis, this idea has interesting implications for the institution of lectures, which generally lack them entirely.

A subsidiary idea is the importance of setting clear objectives for learning. Too many students view learning as a series of exercises rather than a step in the formation of a coherent body of knowledge. The overarching direction should be made clear. And on a more detailed level, we need to be explicit about what outcomes we want our students to obtain so that they know whether they are making satisfactory progress. Formative assessment must make reference to expectations, and formative self- or peer assessment becomes impossible if those expectations are not well-understood.

And this discussion ties into a final point: when students truly apply themselves to the task of learning, their self-perception and self-esteem become bound up in it. Ineffective expectation-setting and insufficient clarity about the means for improvement leave students feeling demotivated, which causes them to revise their goals downward. They put in less effort and achieve worse outcomes. These effects are costly and can be avoided by effective formative assessment.

Inside the Black Box is a diversion from our usual diet of scientific articles, but I think it is worth our attention. Pedagogy is difficult to get right. In the university context, good practice receives little attention and is rarely assessed. Thinking about good assessment means that our students benefit.

But all communication activities are a form of teaching. Really good teachers communicate really well. When good communication happens, everyone benefits, inside and outside the black box.

Journal Club: Large-scale structure prediction by improved contact predictions and model quality assessment.

With the advent of statistical techniques to infer protein contacts from multiple sequence alignments (which you can read more about here), accurate protein structure prediction in the absence of a template has become possible. Taking advantage of this fact, there have been efforts to brave the sea of protein families for which no structure is known (about 8,500 – over 50% of known protein families) in an attempt to predict their topology. This is particularly exciting given that protein structure prediction has been an open problem in biology for over 50 years and, for the first time, the community is able to perform large-scale predictions and have confidence that at least some of those predictions are correct.

Based on these trends, last group meeting I presented a paper entitled “Large-scale structure prediction by improved contact predictions and model quality assessment”. This paper is the culmination of years of work, making use of a large number of computational tools developed by the Elofsson Lab at Stockholm University. With this blog post, I hope to offer some insights as to the innovative findings reported in their paper.

Let me begin by describing their structure prediction pipeline, PconsFold2. Their method for large-scale structure prediction can be broken down into three components: contact prediction, model generation and model quality assessment. As the very name of their article suggests, most of the innovation of the paper stems from improvements in contact prediction and the quality assessment protocols used, whereas for their model generation routine, they opted to sacrifice some quality in favour of speed. I will try and dissect each of these components over the next paragraphs.

Contact prediction is the process by which residues that share spatial proximity in a protein’s structure are inferred from multiple sequence alignments via co-evolution. I will not go into the details of how these protocols work, as they have been previously discussed in more detail here and here. The contact predictor used in PconsFold2 is PconsC3, another product of the Elofsson Lab. There was some weirdness with the referencing of PconsC3 in the PconsFold2 article, but after a quick Google search I was able to retrieve the article describing PconsC3, and it was worth a read. Other than showcasing PconsC3’s state-of-the-art contact prediction capabilities, the original PconsC3 paper also provides figures for the number of protein families for which accurate contact prediction is possible (over 5,000 of the ~8,500 protein families in Pfam without a member of known structure). The PconsC3 article feels like a prequel to the paper I presented. The bottom line here is that PconsC3 is a reliable tool for predicting contacts from multiple sequence alignments and a sensible choice for the PconsFold2 pipeline.

Another aspect of contact prediction that the authors explore is the idea that the precision of contact prediction depends on the quality of the underlying multiple sequence alignment (MSA). They provide a comparison of the positive predictive value (PPV) of PconsC3 using different MSAs on a test set of 626 protein domains from Pfam. To my knowledge, this is the first time I have encountered such a comparison, and it highlights the effect the MSA has on the quality of the resulting contact predictions. In the PconsFold2 pipeline, the authors use a consensus approach: they identify the consensus of four predicted contact maps, each based on a different alignment. Alignments were generated using Jackhmmer and HHblits at E-value cutoffs of 1 and 10^-4.
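The paper does not spell out the combination step in code, but a simple way to form such a consensus (shown here purely as an illustration, not as the authors’ implementation) is to average the predicted contact-probability matrices and keep the top-scoring residue pairs:

import numpy as np

def consensus_contact_map(contact_maps):
    # contact_maps: list of L x L matrices of predicted contact probabilities,
    # one per alignment; here the consensus is simply their element-wise mean.
    return np.mean(contact_maps, axis=0)

# Four hypothetical 100-residue contact maps standing in for the predictions
# from the Jackhmmer/HHblits alignments at the two E-value cutoffs.
maps = [np.random.rand(100, 100) for _ in range(4)]
consensus = consensus_contact_map(maps)

# Keep the top 2.5L residue pairs (L = sequence length), as used downstream.
L = consensus.shape[0]
i, j = np.triu_indices(L, k=1)
order = np.argsort(consensus[i, j])[::-1][: int(2.5 * L)]
print(list(zip(i[order], j[order]))[:5])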

Now, moving on to the model generation routine. PconsFold2 makes use of CONFOLD to perform model generation. CONFOLD, in turn, uses the simulated annealing routine of the Crystallography and NMR System (CNS) to produce models based on spatial and geometric constraints. To derive those constraints, the predicted secondary structure and the top 2.5L predicted contacts (where L is the sequence length) are given as input. The authors do note that the refinement stage of CONFOLD is omitted, which I assume was done to save computational time. The article also acknowledges that models generated by CONFOLD are likely to be less accurate than those produced by Rosetta, but this compromise was made to keep the large-scale comparison feasible in terms of resources.

One particular issue that we often discuss when performing structure prediction is the number of models that should be produced for a particular target. The authors performed a test to assess how many decoys should be produced and, albeit simplistic in its formulation, their results suggest that 50 models per target should be sufficient. Increasing this number further did not lead to improvements in the average quality of the best models produced for their test set of 626 proteins.

After producing 50 models using CONFOLD, the final step in the PconsFold2 protocol is to select the best possible model from this ensemble. Here, they present a novel method, PcombC, for ranking models. PcombC combines the clustering-based method Pcons, the single-model deep-learning method ProQ3D, and the proportion of predicted contacts that are present in the model. These three scores are combined linearly, with weights that were optimised via a parameter sweep. One of my reservations about this paper is that little detail is given regarding the data set used for this training. It is unclear from their methods section whether the parameter sweep was trained on the test set of 626 proteins used throughout the manuscript. Given that no other data set (with known structures) is ever introduced, this scenario seems likely. Therefore, all the classification results obtained by PcombC, and all of the reported top-model TM-score results, should be interpreted with care, since performance on a validation set tends to be poorer than on a training set.

Recapitulating the PconsFold2 pipeline:

  • Step 1: Generate four multiple sequence alignments using HHblits and Jackhmmer.
  • Step 2: Generate four predicted contact maps using PconsC3.
  • Step 3: Use CONFOLD to produce 50 models using a consensus of the contact maps from step 2.
  • Step 4: Use PcombC to rank the models based on a linear combination of the Pcons and ProQ3D scores and the proportion of predicted contacts that are present in the model (see the sketch after this list).
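A minimal sketch of a PcombC-style score is given below; the weights are hypothetical placeholders rather than the values optimised in the paper’s parameter sweep.

def pcombc_like_score(pcons, proq3d, contact_fraction, w=(0.3, 0.3, 0.4)):
    # Linear combination of the Pcons score, the ProQ3D score and the fraction
    # of predicted contacts satisfied by the model (weights are made up).
    return w[0] * pcons + w[1] * proq3d + w[2] * contact_fraction

# Rank an ensemble of decoys by the combined score (scores are invented).
decoys = {"model_01": (0.55, 0.60, 0.71), "model_02": (0.48, 0.66, 0.40)}
ranking = sorted(decoys, key=lambda m: pcombc_like_score(*decoys[m]), reverse=True)
print(ranking)  # ['model_01', 'model_02']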

So, how well does PconsFold2 perform? The conclusion is that it depends on the quality of the contact predictions. For the protein families where abundant sequence information is available, PconsFold2 produces a correct model (TM-score > 0.5) in 51% of cases. This is great news. First, because we know beforehand which cases have abundant sequence information. Second, because this comprises a large number of protein families of unknown structure. As the number of effective sequences (a common way to assess the amount of information available in an MSA) decreases, the proportion of families for which a correct model has been generated also decreases, which restricts the applicability of the method to protein families with abundant sequence information. Nonetheless, given that protein sequence databases are growing exponentially, the number of cases where protein structure prediction achieves success is likely to increase over the next few years.

One interesting detail that I was curious about was the length distribution of the cases where modelling was successful. Can we detect the cases for which good models were produced simply by looking at a combination of length and number of effective sequences? The authors never address this question, and I think it would provide some nice insights as to which protein features are correlated to modelling success.

We are still left with one final problem to solve: how do we separate the cases for which we have a correct model from those where modelling has failed? This is what the authors address in the last two subsections of their Results. In the first of these, the authors compare four ways of ranking decoys: PcombC, Pcons, ProQ3D, and the CNS contact score. They report that, for the test set of 626 proteins, PcombC obtains the highest Pearson’s correlation coefficient (PCC) between the predicted and observed TM-scores of the highest-ranking models. As mentioned before, this measure could be overestimated if PcombC was indeed trained on this test set. The reported PCCs are as follows: PcombC = 0.79, Pcons = 0.73, ProQ3D = 0.67, and CNS-contact = -0.56.

In their final analysis, the authors compare the ability of the different quality assessment (QA) scores to discern between correct and incorrect models. To do this, they only consider the top-ranked model for each target according to each QA score. They vary the false positive rate and note the number of true positives they are able to recall. At a 10% false positive rate, PcombC is able to recall about 50% of the correct models produced for the test set. This is another piece of good news. The bottom line is: if we have sufficient sequence information available, PconsFold2 can generate a correct model 51% of the time. Furthermore, it can detect 50% of these cases, meaning that for ~25% of the cases it produces something good and knows the model is good. This opens the door to looking at protein families with no known structure and trying to accurately predict their topology.

That is exactly what the authors did! In the most interesting section of the paper (in my opinion), the authors predict the topology of 114 protein families (at an FPR of 1%) and 558 protein families (at an FPR of 10%). Furthermore, the authors compare the overlap of their results with those reported in a similar study from the Baker group (previously presented at group meeting here) and find that, at least for some cases, the predictions agree. These large-scale efforts force us to revisit the way we see template-free structure prediction: it can no longer be dismissed, and it is a viable way of obtaining structural models when sufficient sequences are available. This is a remarkable achievement for the protein structure prediction community, with the potential to change the way we conduct structural biology research.

Latexing with gvim

Here I’ll share my set-up for writing Latex with gvim instead of a separate Latex editor. If you are text-editor averse, this blog post is not for you. But if, like me, you love vim and hate useless GUIs, this might be helpful.

We’re lucky to have nice big screens in the Stats Department, but I tend to prefer writing on my MacBook (I find it’s easier to transport to, e.g., a cafe or my home). Until now, I’ve been happily using TexMaker for writing, but during a recent period of intense Latexing I started to find the usable screen space oppressively small. The unnecessary GUI had to go.

No offence TexMaker but I don’t like you

One of our good friends in Statistical Genetics recommended some things to help me with the transition to just using good old (g)vim, which I will now recommend to you.

The key thing is the LaTeX-Box plug-in for vim, which gives you the compilation commands as well as essentials such as smart indentation, highlight matching, command completion, etc. I used pathogen to install it (see the GitHub page for instructions).

Of course, you can then customise your .vimrc file to add more helpful things. These can be simple preferences, such as using a light background when running gvim:

 

if has("gui_running")
        set background=light
endif

You can also do more complicated magic, like tabbing through available commands and the ability to minimise sections. Sidenote: to make working with paragraphs easier, I recommend setting the up/down arrows to move the cursor to the next displayed line in the GUI rather than the next actual line. I prefer overriding this behaviour only in gvim, while leaving the normal behaviour in vim (for actual coding). But each to their own.

To get started, open a .tex file, then compile and view the document with the command Latexmk.

Command suggestions are an example of a magical feature added in .vimrc

The configurations for this command are set in the file .latexmkrc. Mine looks like this:

 

$recorder = 1;
$pdf_mode = 1;
$bibtex_use = 2;
$pdflatex = "pdflatex --shell-escape %O %S";
$pdf_previewer = "start open -a skim %O %S";

My pdf viewer of choice on Mac is Skim, which autoupdates. I view the source and preview at the same time using split view. Please admire the beauty below:

Wow what a beautiful screen

My favourite part is that whenever you save (:w), it recompiles and updates the preview. As someone who accidentally types :w everywhere that isn’t vim, it’s nice that this is now productive. It also recompiles automatically if the .bib file is updated. Note that if you have errors at compilation (I’m sure you don’t), you can view them with the command LatexErrors.

Now you too can be a (nearly) GUI-free lightweight Latexer. Enjoy!