Category Archives: Uncategorized

Typography in graphs.

Typography [tʌɪˈpɒɡrəfi]
n.: the style and appearance of printed matter.

Perhaps a “glossed” feature of making graphs, having the right font goes a long way. Not only do we have the advantage of using a “pretty” font that we like, it also provides an aesthetic satisfaction of having everything (e.g. in a PhD thesis) in the same font, i.e. both the text and graph use the same font.

Fonts can be divided into two types: serif and sans-serif. Basically, serif fonts are those where the letters have little “bits” at the end; think of Times New Roman or Garamond as the classic examples. Sans-serif fonts are those that lack these bits, and give it a more “blocky”, clean finish – think of Arial or Helvetica as a classic example.

Typically, serif fonts are better for books/printed materials, whereas sans-serif fonts are better for web/digital content. As it follows, then what about graphs? Especially those that may go out in the public domain (whether it’s through publishing, or in a web site)?

This largely bottles down to user preference, and choosing the right font is not trivial. Supposing that you have (say, from Google Fonts), then there are a few things we need to do (e.g. make sure that your TeX distribution and Illustrator have the font). However, this post is concerned with how we can use custom fonts in a graph generated by Matplotlib, and why this is useful. My favourite picks for fonts include Roboto and Palatino.

The default font in matplotlib isn’t the prettiest ( I think) for publication/keeping purposes, but I digress…

To start, let’s generate a histogram of 1000 random numbers from a normal distribution.

The default font in matplotlib, bitstream sans, isn’t the prettiest thing on earth. Does the job but it isn’t my go-to choice if I can change it. Plus, with lots of journals asking for Type 1/TrueType fonts for images, there’s even more reason to change this anyway (matplotlib, by default, generates graphs using Type 3 fonts!). If we now change to Roboto or Palatino, we get the following:

Sans-serif Roboto.

Serif font Palatino.

Basically, the bits we need to include at the beginning of our code are here:

# Need to import matplotlib options setting method
# Set PDF font types - not necessary but useful for publications
from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42

# For sans-serif
from matplotlib import rc
rc("font", **{"sans-serif": ["Roboto"]}

# For serif - matplotlib uses sans-serif family fonts by default
# To render serif fonts, you also need to tell matplotlib to use LaTeX in the backend.
rc("font", **{"family": "serif", "serif": ["Palatino"]})
rc("text", usetex = True)

This not only guarantees that images are generated using a font of our choice, but it gives a Type 1/TrueType font too. Ace!

Happy plotting.

Biological Space – a starting point in in-silico drug design and in experimentally exploring biological systems

What is the “biological space” and why is this space so important for all researchers interested in developing novel drugs? In the following, I will first establish a definition of the biological space and then highlight its use in computationally developing novel drug compounds and as a starting point in the experimental exploration of biological systems.

While chemical space has been defined as the entirety of all possible chemical compounds which could ever exist, the definition of biological space is less clear. In the following, I define biological space as the area(s) of chemical space that possess biologically active (”bioactive”) compounds for a specific target or target class¹. As such, they can modulate a given biological system and subsequently influence disease development and progression. In literature, this space has also been called “biologically relevant chemical space”².

Only a small percentage of the vast chemical space has been estimated to be biologically active and is thus relevant for drug development, as randomly searching bioactive compounds in chemical space with no prior information resembles the search for “the needle in a haystack”. Hence, it should come as no surprise that bioactive molecules are often used as a starting point in in-silico explorations of biological space.
The plethora of in-silico methods for this task includes similarity and pharmacophore searching methods^3-6 for novel compounds, scaffold-hopping approaches to derive novel chemotypes^7-8 or the development of quantitative structure-activity relationships (QSAR)^9-10 to explore the interplay between the 3D chemical structure and its biological activity towards a specific target.

The biological space is comprised of small molecules which are active on specific targets. If researchers want to explore the role the role of targets in a given biological system experimentally, they can use small molecules which are potent and selective towards a specific target (thus confided to a particular area in chemical space)^11-12.
Due to their high selectivity ( f.e. a greater than 30-fold selectivity towards proteins of the same family¹²), these so-called “tool compounds” can help establish the biological tractability – the relationship between the target and a given phenotype – and its clinical tractability – the availability of biomarkers – of a target¹¹. They are thus highly complementary to methods such as RNAi, CRISPR¹² and knock-out animals¹¹. Consequently, tool compounds are used in drug target validation and the information they provide on the biological system can increase the probability of a successful drug ¹¹. Most importantly, tool compounds are particularly important to annotate targets in currently unexplored biological systems and thus important for novel drug development¹³.

Sophie Petit-Zeman, http://www.nature.com/horizon/chemicalspace/background/figs/explore_b1.html, accessed on 03.07.2016.
Koch, M. A. et al. Charting biologically relevant chemical space: a structural classification of natural products (SCONP). Proceedings of the National Academy of Sciences of the United States of America 102, 17272–17277 (2005).
Stumpfe, D. & Bajorath, J. Similarity searching. Wiley Interdisciplinary Reviews: Computational Molecular Science 1, 260–282 (2011).
Bender, A. et al. How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space. Journal of Chemical Information and Modeling 49, 108–119 (2009).
Ai, G. et al. A combination of 2D similarity search, pharmacophore, and molecular docking techniques for the identification of vascular endothelial growth factor receptor-2 inhibitors: Anti-Cancer Drugs 26, 399–409 (2015).
Willett, P., Barnard, J. M. & Downs, G. M. Chemical Similarity Searching. Journal of Chemical Information and Computer Sciences 38, 983–996 (1998)
Sun, H., Tawa, G. & Wallqvist, A. Classification of scaffold-hopping approaches. Drug Discovery Today 17, 310–324 (2012).
Hu, Y., Stumpfe, D. & Bajorath, J. Recent Advances in Scaffold Hopping: Miniperspective. Journal of Medicinal Chemistry 60, 1238–1246 (2017)
Cruz-Monteagudo, M. et al. Activity cliffs in drug discovery: Dr Jekyll or Mr Hyde? Drug Discovery Today 19, 1069–1080 (2014).
Bradley, A. R., Wall, I. D., Green, D. V. S., Deane, C. M. & Marsden, B. D. OOMMPPAA: A Tool To Aid Directed Synthesis by the Combined Analysis of Activity and Structural Data. Journal of Chemical Information and Modeling 54, 2636–2646 (2014).
Garbaccio, R. & Parmee, E. The Impact of Chemical Probes in Drug Discovery: A Pharmaceutical Industry Perspective. Cell Chemical Biology 23, 10–17 (2016).
Arrowsmith, C. H. et al. The promise and peril of chemical probes. Nature Chemical Biology 11, 536–541 (2015).
Fedorov, O., Müller, S. & Knapp, S. The (un) targeted cancer kinome. Nature chemical biology 6, 166–169 (2010).

A Day in the Life of a DPhil Student… that also rows for Oxford.

I couldn’t decide whether to write this blog post. However, I sifted through the archives of BLOPIG and found in the original post this excerpt:

“And if your an athlete, like Anna (Dr. Lewis) who crossed the atlantic in a rowing boat or Eleanor who used to row for the blues – what can I say, this is how we roll, or row [feeble attempt at humour] – thats a non-scientific but unique and interesting experience too (Idea #8). .”

Therefore I’ve decided that it might be an interesting post to look into what life is like when you are studying for a DPhil and also training for the blues. Rowing in particular is a controversial sport – I have heard of many stories advocating that rowing will be the absolute detriment to your DPhil. I’ve never felt pressured as part of OPIG to give up rowing – all of my supervisors have been very fair, in that if I get the work done then they accept this is part of my life. However, I realise all supervisors are not so understanding. I hope this blog post will give some insight into what it is like to trial for a Blues sport (in this case Women’s Lightweight Rowing), whilst studying for a DPhil at Oxford.

4:56 am – Alarm goes off. If its after September it’s dark, cold and likely raining. No breakfast as I will do the first training session fasted – just get dressed and go!

5:15 am – Leave the house with a bag full of kit, food for the day, laptop and papers to cycle to Iffley Sport’s Centre

5:45 am – Lightweight Women’s minibus leaves from Iffley to drive to Wallingford. Some girls try to study in the bus, but to be honest its too dark and we’re all a bit too sleepy.

6:15 am – Arrive at Wallingford. Get onto the water for a session in the boats. Although in the Boat Race we race in an 8 (8 rowers with one oar each, with a cox steering), we spend lots of time in different boats throughout the season. Perhaps unlike our openweight counterparts, we also do a lot of sculling (two oars per rower) as the only Olympic class boat for lightweight women is a sculling boat. We travel to Wallingford for a much longer, emptier stretch of river and normally get to see the sunrise.

8:10 am – We leave Wallingford to head back to Oxford. Start waiting in A LOT of traffic once you hit the ring road, and there’s a lot of panic in the bus about whether 9 am lectures will be made on time!

8:50 am – Arrive back at Iffley Sport’s Centre. Grab bike and cycle to the department.

9:00-9:15 am – Arrive at the Department. Quick shower to thaw frozen fingers and to not repulse my fellow OPIG members. I then get to eat warm porridge (highlight of the day) and go through my emails. I also check whether any of my jobs have finished on the group servers – one of the great perks of being in OPIG is the computational resources available to the group. Check the to-do list from yesterday and write a to-do list for today and get to work (coding, plotting results, reading papers or writing)!

11:00 am (Tuesdays & Thursdays) – Coffee morning! Although if it’s any time close to a race no bourbon biscuits or cake for me. This is a bit of an issue because at OPIG we eat a lot of cake. However, one member can usually be relied upon to eat my portion..

1:00 pm – Lunchtime! As a lightweight rower I am required to weigh-in at 59kg on the day of the Boat Race. If I am over that weight I don’t get to race. Therefore, I spend a portion of the year dieting to make sure I hit that target. The dieting lunch consists of soup and Greek yogurt. The post race non-dieting lunch consists of pasta from Taylors, chocolate and a Coke (yum!). OPIG members generally all have lunch at this time and enjoy solving the Times Cryptic Crossword. I’m not the best at crosswords so I normally chat to Laura and don’t concentrate.

2:00 pm – Back to work. Usually coding whilst listening to music. I normally start rushing to be able to submit some jobs to the group servers before I have to leave the office.

3:00 pm – Go to get a chocomilk with Clare. A chocomilk from the vending machine in our department costs 20p and is only 64 calories!

5:30 pm – Cycle to Iffley Sports Centre for the second training session of the day.

5:45 pm – If it’s light enough we hop in the minibus to go to Wallingford for another outing on the water. However, for most of the season its too dark and we head to the gym. This will either consist of weights to build strength, or we will use the indoor rowing machine (erg) to build fitness. The erg is my nemesis, so this is not a session I look forward to. Staring at a screen that constantly tells you how hard you are pushing, or if you are no longer pushing as hard I find to be psychologically quite tough. I’d much rather be gliding along the river.

8:35 pm – Leave Iffley after a long session to head home. Quickly down a Yazoo (strawberry milk) to boost recovery as I won’t be eating dinner until 45 minutes to an hour after the end of the session.

9:00 pm – Arrive home. I “cook” dinner which when I’m dieting consists of chucking sweet potato and healthy sausages from M&S in the oven while I pack my kit bag for the next day.

9:30 pm – Wolf down dinner and drink about a pint of milk, whilst finally catching up with my boyfriend about both our days.

10:00 pm – Bedtime at the latest.

Repeat!

Computational immunogenicity reduction

In my last presentation, I talked about the article by King et al. describing a method for computationally removing T-cell receptor epitopes from proteins. The work could have significant impact on the field of designing protein therapeutics, where immunogenicity is a serious obstacle.

One of the major challenges when developing a protein therapeutic is the activation of the immune system by the drug and subsequent production of antibodies against it, rendering the therapeutic ineffective. This process is known as immunogenicity. Immunogenicity is triggered by T-cells recognition of peptide epitopes displayed on the MHC (major histocompatibility complex). This recognition can be impeded by designing the protein therapeutic to remove the potential T-cell epitopes from its surface. There has been some success in experimental T-cell epitope removal, but the process remains resource and time consuming.

In this work, King et al. created a function which assigns to each residue a score that measures its propensity to be a part of a T-cell epitope. The score consists of three parts. The first part is based on a SVM (Support Vector Machine) score calculated over each 15-residue long window, that attempts to predict how likely is the corresponding peptide sequence to bind the MHC. The SVM has been trained on the available immunological data from the Immune Epitope Database (IEDB). The second part of the score is calculated on each 9-residue window and compares the frequency of the 9-mer in the host genomic data and in the known epitope data (a sequence occurring with a high frequency in a human genome would be rewarded while the opposite is true for sequences occurring in the known epitope data). The third part penalizes any deviations from the original charge of the protein. These three parts are combined with a standard Rosetta score that measures the stability of the protein. The weights assigned to each segment were calibrated on existing protein structures. The combined score would be used to score the mutations in the sequence of the protein of interest, according to their propensity of reducing immunogenicity. The top scoring mutations would then be combined in a greedy fashion.

The authors tested their method on fluorescent reporter protein superfolder GFP (sfGFP) and the toxin domain of the cancer therapeutic HA22. In the case of sfGFP the authors targeted the four top-scoring T-cell epitopes. They created eight different proteins designs, out of which all preserved the function of the original protein (fluorescence). The authors selected the top scoring design for experimental immunogenicity testing. The experiments have shown that the selected design had a significantly reduced immunogenicity in comparison to the original protein. In the case study of HA22 the authors created five designs, out of which three displayed cytotoxicities at the same level or higher than the original protein. The two most cytotoxic designs were further characterized experimentally for their propensity to induce immune response. The authors have found that the two mutants elicited a significantly reduced T-cell response.

Figure 1: Reduction of immunogenicity without loss of function. A) Three of the five designs show cytotoxicity at the same level or higher than the original protein. B) Two of the three cytotoxicity-preserving designs show reduced immunogenicity

Overall, this very interesting study showed that computational methods can be successfully used for reducing immunogenicity of protein therapeutics, opening new avenues for computational protein design.

Computationally designing antibodies using a known binding motif

This blogpost is be about the “Computational design of an epitope-specific Keap1 binding antibody using hotspot residues grafting and CDR loop swapping” by Liu et. al. that I presented at group meeting in May.

Antibody design is a subject that I am closely interested in, especially methods that have an important computational step. So far the go-to methods for designing an antibody used by industry are animal model immunisation and/or phage display, with little or no use of computational methods. In the past few years, however, a few computational methods for rational design of antibodies have been making a showing. Firstly, there are the ones where a structure of the docked antibody-antigen already exists, and the antibody is further refined computationally to increase binding affinity. Then there are the ones where the paratope of the antibody is proposed by the designer against a specific target. The paper I am summarizing here by Liu et. al follows the latter idea in a neat way.

Liu et. al. show that if a specific motif is important for binding a certain target, i.e. there is a crystal structure which shows that the motif is buried in the target and/or you predict that its residues are important for binding, it is worthwhile trying to graft that that motif in the CDR area of antibody (the one which is responsible for antibody specificity and affinity). Grafting of entire CDR loops has been long used for antibody humanisation, with many examples where CDR loops maintaining conformation and binding specificity when being transferred from a non-human scaffold to a human scaffold. This is somewhat aided by the fact that the starting and end points of the area being grafted is stable (i.e. the anchors are conformationally the same in all the antibody structures that we observe), which is not the case in Liu et al where they graft a four residue motif. The cool thing they do which makes it more probable for the motif to maintain conformation is identify an antibody which has in one of its CDR loops a fragment with the same backbone conformation with the motif they are trying to graft. They then just replace the residue types to the ones that are known to bind the target. For the Nrf2 motifs (that binds Keap1) they managed to create 5 potential designs. These were further expanded, using rational point mutations on the rest of the antibody in order to increase possibility of binding, to 10. Out of the 10 two showed binding.

One of the potential issues in a real scenario however is the fact that not an entire binding site is copied on antibody, the motif being a subset of the whole, which means the possibility of a low affinity and/or low chances of competing with the original protein (i.e. Nrf2) from which the motif was copied. This actually turned out to be the case, with the initial designs showing low mM affinity. Liu et. al. further worked on improving the initial designs, and they did so by computationally swapping the H3 CDR of the initial designs to a set of other H3 structures that have been seen in other solved antibodies using the Rosetta design protocol. They retained the ones that had a predicted buried SASA of > 2000 A^2, a change in energy of more than 20 REU and a shape complementarity greater than 0.6. These were then tested experimentally with a few of them showing nM affinities, a result which at this time should make you very happy if your entire design phase was done computationally.

A brief history of usage of the word “decoy” in protein structure prediction

Some concepts in science are counter-intuitive, like the Monty Hall problem or the Mpemba effect. Occasionally, this is also true for terminology, despite the best efforts of scientists to ensure that their work can be explained unambiguously to newcomers. Specifically, in our field of protein structure prediction, the word “decoy” has been used to mean one of many conformations generated by a de novo modelling protocol such as Rosetta, or alternative conformations of loops produced by an ab initio program e.g. Sphinx. Though slightly baffled by this usage when I started working in the field, I have now become so familiar with its strange new meaning that I have to remind myself to explain it in talks to a more general audience, or simply aim to avoid the term altogether. Nonetheless, following a heated discussion over the term in a recent group meeting, I thought it would be interesting to trace the roots of the new meaning.

Let’s begin with a definition from Google:

decoy

noun

noun: decoy; plural noun: decoys

/ˈdiːkɔɪ,dɪˈkɔɪ/

a bird or mammal, or an imitation of one, used by hunters to attract other birds or mammals.

“a decoy duck”

a person or thing used to mislead or lure someone into a trap.

“we need a decoy to distract their attention”

So we start with the idea of something distracting, resembling the true thing but with the intent to deceive. So how has this sense of the word evolved into what we use now? I attempted to dig out the earliest mention of decoy for a computationally generated protein conformation with a Google scholar search for “decoy protein”, which led to the work of Thomas and Dill published in 1996. Here the authors describe a method of distinguishing the native fold of a protein from the sequence threaded, without gaps, onto alternative structures from the PDB. This problem of discrimination between native and non-native had been carried out previously, but Thomas and Dill chose to describe the alternatives as “decoy conformations” or just “decoys”.

A similar problem was commonly attempted over the following years, of separating native structures from sets of computationally generated conformations. Due to the demands of conformer generation at this time, some sets were published themselves in online databases to be used as a resource for training scoring functions.

When it comes to the problem of de novo protein structure prediction, unfortunately it isn’t as simple as picking out the correct answer from a population of incorrect answers. Even among hundreds of thousands of conformations generated by the best methods, the exact native crystal structure will not be found (though a complication here that the protein is dynamic and will occupy an ensemble of native conformations). Therefore, the aim of any scoring function in structure prediction is instead to select which incorrect conformation is closest to the native structure, hoping to obtain at least the correct fold.

It is for this reason that we move towards the idea of choosing a model from a pool of decoys. Zhu et al. (2003) use “decoy” in precisely this way:

“One strategy for ab initio protein structure prediction is to generate a large number of possible structures (decoys) and select the most fitting ones based on a scoring or free energy function”

This seems to be where the idea of a decoy as incorrect and distracting is lost, and takes on its new meaning as one of a large and diverse set of protein-like conformations, which has continued until now.

So is it ever helpful to refer to “decoys” as opposed to “models”? What is communicated by “decoy” that is not achieved by using the word “model”? I think this may come down to the impression which is given by talking about a pool of decoys. People would not generally assume that each decoy on its own has any effective use for prediction of function. There is a sense that this is not the final result of the structure prediction pipeline, there is work yet to be done in refining, clustering, and making human judgments on the suitability of the output. Only after these stages would I feel more comfortable using the word “model”, to express the greater confidence we have in the structure (small though that may be in the de novo structure prediction world). However, the inadequacy of “model” does not alone justify this tenuous usage of “decoy”. Perhaps we could speak more often about populations of “conformations”. In any case, “decoy” is widespread in the community, and easily understood by those who are most likely to be reading, reviewing and editing the literature so I think we will be stuck with it for a while yet.

Interesting Antibody Papers

Here we highlight two antibody papers, one from the past one more recent. The more recent one talks about developing an affinity maturation model. The older one is a refresher on the Developability Index — how to computationally harness hydrophobicity and accessible surface areas to predict aggregation.

Mouse antibody maturation model — the most expanded (common) clones might not be the ones with highest affinities here (van Kampen lab). The authors of the paper define a model of affinity maturation. The main take-home message of the paper is that the ‘most expanded’ clones might not be the ones with highest affinity — expanded clones are assumed to be the ones ‘responding’ to the antigenic challenge. The model is based on Ordinary Differential Equations, tracing cell fate in a germinal center. The model was compared to experimental expansion data from lymph nodes for accuracy. In each such model one needs to assume a lot of parameters, such as which day post-immunization do we start somatic hypermuatation? The paper is a very nice example of a model of maturation and a good starting point for tracing references citing germinal center biology and numbers for parameters used for models (also the general canon of construction of such models!).

Developability index here. (Trout lab at MIT). The authors touch on a very important subject of antibody developability: after you produced your ab binder, does it have physicochemical characteristics which are suitable to carry on with it as a therapeutic. Such characteristics include stability, expression yields and aggregation propensity. Aggregation propensity is one of the most important factors here as it affects the pharmacokinetics of the drug as well as shelf life. In this manuscript, authors address attempt to predict the aggregation propensity of antibodies. As background data, they use twelve antibodies whose long term stability has been measured over several years. To develop a computational method to predict antibody aggregation propensity, they use a score which combines hydrophobicity and electrostatic factors. The hydrophobicity is an adapted SAP score which the authors developed previously, and whose main parameters are the exposed residue area and hydrophobicity of the residue as defined by Black and Mould. The electrostatics are calculated using PROPKA. Since combining the scores into a predictive model involved parametrization, they use seven of the antibodies to adjust the coefficients. They use the rest to demonstrate that their model has predictive power. Calculation of their models requires a structure of an antibody which they obtain using WAM. Take home messages? It is a nice dataset to play with aggregation prediction and it demonstrates how to calculate electrostatics and hydrophobicity of a molecule.

Protein structure determination using metagenome sequence data

For this week’s journal club, I presented a recent paper from Ovchinnikov, and the David Baker group – Protein structure determination using metagenome sequencing data. This discussed how incorporating metagenome sequence data into multiple sequence alignments, can assist with and improve residue-residue contact prediction. The paper concludes with the prediction of over 600 structures from protein families that currently have no solved structures.

The Pfam database contains 14,849 protein families with 50 or more residues. However, only 4752 of these families have at least one member with an experimentally determined structure. 3984 of the remaining 10,097 families have reliable comparative models built on the basis of homologs of known structure. Less confident comparative models can be built for a further 902 families, however this leaves 5211 families with no structural information.

The recent technological advances in genome sequencing have provided an increasingly large number of amino acid sequences to work with. Large numbers of sequences allows the identification of compensatory mutations that have occurred in residues that are in contact with each other. This is called evolutionary co-variance and can allow the relatively accurate prediction of residues that are in contact in a structure. Rosetta utilises these co-evolutionary couplings, along with partial structural matches (found by combining the predicted contacts with contact patterns of known structures, using the map_align algorithm ) to predict structures from a number of families with fold-level accuracy ( TM-score > 0.5 ). However, it was unknown if this method could be used to accurately predict protein structures on a large-scale.

One challenge in using co-evolutionary couplings to predict residue-residue contacts is that a large number of sequences (hundreds to thousands) are needed. The accuracy of the predicted contacts is also dependent on the diversity of the sequences in a family, and the length of the protein. N_f is a measure that incorporates all of these factors :

Figure 1A shows the dependence of Rosetta structure prediction accuracy on the N_f. In general, where N_f≥64, accuracy typical of comparative modelling (TM-score > 0.7) can be achieved. For N_f≥32, fold-level accuracy (TM-score > 0.5) can be achieved, below this, accuracy falls off. Of the 5211 families with no structural information, only ~400 of these had N_f≥64; therefore accurate structural modelling could not be achieved for the remaining ~4800 of these families using the sequencing data available on UniRef100.

Fig 1. (a) Accuracy of predicted structures produced with and without refinement by Rosetta for families with different Nf values. (b) Number of protein families with Nf≥64 between 2009 and 2015 using UniRef100 database, and UniRef100 and Metagenome data. (c) Percentage of protein families with Nf scores 4, 8, 16, 32, and 64 including sequences from UniRef100 and metagenome data.

The addition of metagenome sequence data (from shotgun sequencing microbial DNA from environmental samples) increased the proportion of families with N_f≥64 from 0.08, to 0.25. The proportion of families with N_f≥32 also increased from 0.16, to 0.33. The difference in the fraction of protein families with N_f≥64 before and after the addition of metagenome sequence data can be seen in Figure 1B, and Figure 1C shows the percentage of families with N_f scores above 4, 8, 16, 32 and 64.

After running a set of benchmark calculations, this larger set of sequence data were used to generate models for 921 protein families, which now had N_f≥64 and also had number of long range contacts greater than half the number of residues in the protein. Of these 921 protein families, models with predicted TM scores > 0.65 were generated for 614 families. Although these were only predicted TM scores, crystal structures for members of 5 of the 614 families have since been published and had a TM-score > 0.7 when compared with the corresponding model.

Limitations with this using this data include the lack of eukaryotic genetic information currently, as well as the lack of explicit modeling of ligands, co-factors and lipids using the Rosetta workflow. However, the fast rate of increase in metagenome sequencing data (as compared to the rate of increase of sequencing data in UniRef100) means that while these new models fill roughly 12% of the unknown structural information for protein families, the potential for future structural prediction is bright.

Colour page counter

So you’ve written the thesis, you’ve been examined, the corrections are done, and now you are left with just wearing the silly clown robes to get a piece of paper with your name on it. However, you’ve been informed that you aren’t allowed to don the silly robes until you print the damn thing (again) and submit it to the Bod to be ignored for generations to come. Oh, and the added bonus is that you have to pay for it. Naturally, you want the high-quality printing and paper to match for the final versions, but it’s all so expensive. At least you can save a few meagre pounds by specifying only the pages for colour printing. Naturally, I decided that I would spend far more time making a script than just counting them myself (which I did anyway to verify it works). Enjoy.

#!/usr/bin/env Rscript


library(data.table)

args <- commandArgs(trailingOnly=TRUE) x <- system(paste("gs -q -o - -sDEVICE=inkcov",args[1]," | awk '{print $1,$2,$3}'"),intern=TRUE) x <- as.data.table(tstrsplit(x,' ')) x[,c("V1","V2","V3"):=.(as.numeric(V1),as.numeric(V2),as.numeric(V3))] print(paste("Colour pages total:",sum(rowSums(x)!=0))) print(paste("Colour pages:", paste(which(rowSums(x) != 0),collapse=', ')))

Faster FREAD with Pandas

One of the things I like to do is to scale up things using the ridiculous amount of cores at my disposal (sometimes even for a good reason). One of these examples is when I had to model millions of CDRs (or loops) using FREAD.

The process through which you model a loop in Fread is:

Pre-filtering step: Anchor Ca separation and ESST score in between your target and all the templates in the DB. The ones that pass a threshold are saved for step 2
Anchor RMSD test

The major bottleneck for such an analysis is step 1, where most of the templates are filtered out so for step 2 you get a very reduced subset. The data for doing the Anchor Ca separation and ESST score is all stored for each possible template in one row of an sqlite database. So when you do step 1 you will go through each row of this table and calculate the score, with the database is stored on the hard drive so costly I/O. This is fine for the original purpose of Fread, where you filled in a missing loop for one structure, but when you are doing it for 100 million examples, going through a table stored on a hard drive 100 million times, sequentially, it is going to be SLOW. I say sequentially because for the python implementation using sqlite3 I had a lot of trouble trying to use a db handle on multiple threads, or load the same sql file on different instances on threads, it just crashes for no good reason. There has been chat about this on stackoverflow and I think this has moved on since I implemented it in 2015. Nevertheless, I wanted a simple and clean solution.

I decided to transform the sqlite3 database into a Pandas object. Pandas objects are basically a convenient way of storing tables with methods available that mimick conventional querying mechanisms for databases. These are stored in memory, easily dumped as pickle files, and can be easily duplicated between threads so no issues with thread safety. Obviously you need to have enough memory to store all of that, but for my application that was not a problem. Below is some sample code on how I used it to transform the template DB from FREAD.

import pandas as pd
import sqlite3 as sql

rows = []

# connect to your fread sql file
conn = sql.connect("fread_sql_file.sql")
try:
    query = "SELECT dihedral, sequence, pdbcode, start, anchor, bound FROM loops"
    results = conn.execute(query)
    for row in results:
        # store the rows as a list of dictionaries
        rows.append(dict(zip(['dihedral', 'length', 'pdbcode', 'anchor', 'sequence', 'start', 'bound'], [row[0], len(row[1]), row[2], row[4] ,row[1], row[3], row[5]])))
        
except Exception as e:
    print "Error during query", str(e)
    conn.close()

# create a pandas dataframe from the list of dictionaries 
df = pd.DataFrame(rows)
# store the table as a pickle file which you can reload later (this is very fast!)
df.to_pickle("fread_pandas_file.pickle")

After running this you will have your sql database as a pandas dataframe, and you can write methods which are thread safe to model loops as below:

import pandas as pd

THRESHOLD = 25
cdr_db pd.read_pickle("fread_pandas_file.sqlite")


def model_loop(query_sequence, query_anchors_ca):
    # score_sequence_db_helper is your function that attaches a scores based on your query sequence and the row in the template db
    scores = cdr_db.apply(lambda row: score_sequence_db_helper(row, query_sequence, query_anchors_ca), axis=1)

    # attach the score
    results = zip(list(cdr_db['pdbcode']), scores, list(cdr_db['sequence']))

    # keep the ones that are over the threshold
    results = filter(lambda (pdb_code, score, sequence): score>=THRESHOLD, results)
    
    return results

Oxford Protein Informatics Group

or "OPIG" to friends