In MATLAB, it’s colormaps all the way down

My overriding emotion, working in R, has been incomprehension: incomprehension at the gallery of ugly gnomes that populate the namespace, and worried puzzlement over the strange incantations required to get them to dance in a statistically harmonious way. Still, on joining the group, I resolved to put aside my misgivings and give the gnomes another try.

Soon, I found myself engaged in a reassessment of my life choices. I realized that life’s too short to spend it tickling gnomes – especially when only one of them knows how to do linear regression, but he won’t tell you your p value unless you give him the right kinds of treats. I fired up MATLAB and I haven’t looked back.

However, one issue continued to perplex me, and I’m not referring to why MATLAB insists on shouting itself at you. I need to make a lot of 2-D plots of protein distance matrices. The trouble is that I like to highlight parts of them, and that’s not straightforward in MATLAB. Let’s have a look at an example:

>> dists=dlmread('1hel.distances');
>> colormap gray;
>> imagesc(dists>8);
>> axis square;

Contact map

Now, let’s load up a set of residues and try to overlay them on top of the first image:

>> resn=dlmread('1hel.resn');
>> mask = zeros(size(dists));
>> mask(resn,resn)=1;
>> hold on
>> imagesc(1-mask, 'AlphaData',mask*.5);

So far, so easy. To review the main points:

mask is a matrix with a one at every pixel that we want to highlight. But we use imagesc(1-mask) because the gray colormap displays black at 0 and white at 1. If we did imagesc(mask), the mask would paint the highlighted pixels white instead of black, washing them out – the opposite of the effect that we sought.

AlphaData is a property which sets the transparency of the image. We want the overlay to be fully transparent where mask is 0 – so as not to fog out the underlying image – and partially transparent where mask is 1. 0.5*mask is a matrix which is 0.5 everywhere that mask is 1 and 0 everywhere else. If we set 0.5*mask as the AlphaData property, the colour we add will sit at half opacity and the white areas (where mask is 0) will be fully transparent.

But this isn’t a very pleasant image. We want to be able to highlight the regions in some colour other than grey. Let’s try.

>> close all
>> imagesc(dists>8)
>> colormap gray
>> axis square
>> imagesc(1-mask, 'AlphaData',mask*.3,'ColorMap','jet');
Error using image
There is no ColorMap property on the Image class.

Error in imagesc (line 39)
hh = image(varargin{:},'CDataMapping','scaled');

No luck! What’s more, setting the colormap between calls to image() and imagesc() also doesn’t work. Here’s the problem: the colormap is a property of the figure, not of the image data (indeed, it is not even a property of the MATLAB axes). When you change the colormap, you change the colors of every datapoint in the image.

The fix

MATLAB’s colormap mechanism is just simple enough to be confusing. MATLAB stores colours as 1×3 vectors, where each element in the vector is the proportion of red, green, or blue, respectively. [1 1 1] is white, [0 0 0] is black, and [1 0 0] is a frightfully iridescent red. A colormap is just a list of colors – 64 will normally do – which change smoothly from one colour to another. To have a look at the built-in MATLAB colormaps, see here.
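To make this concrete, here is a minimal sketch (the variable names are mine, not anything built into MATLAB) of a colour and a hand-rolled 64-entry colormap:

% A colour is a 1x3 RGB triple with entries on [0,1]
purple = [0.5 0 0.5];

% A colormap is just an n-by-3 stack of such triples. Here: 64 rows
% fading smoothly from white ([1 1 1]) down to purple.
n    = 64;
cmap = [linspace(1, purple(1), n)' ...
        linspace(1, purple(2), n)' ...
        linspace(1, purple(3), n)'];

colormap(cmap)   % install it on the current figure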

image rounds every value in the matrix to the nearest whole number (call that number i) and plots that pixel with the color given by colormap(i,:). Zero or below gets the first entry in the colormap, and any index higher than the maximum is displayed with the last color in the colormap. So: if we construct a new colormap by concatenating two colormaps – the first occupying rows 1 to 64 and the second rows 65 to 128 – then data scaled so that its minimum is 65 and its maximum is 128 will never use the first set of colors, and, likewise, data scaled to run from 1 to 64 will only ever use the first colormap. This seems like the sort of thing that we could manage automatically – and should, in fact. So I set out to replace image and imagesc so that they would accept a ColorMap parameter.
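Before looking at the replacement functions, the underlying trick can be sketched by hand – a minimal sketch reusing dists and mask from above, with the scaling constants following the 64-row convention:

% Stack two 64-row colormaps: rows 1-64 are gray, rows 65-128 are jet
colormap([gray(64); jet(64)]);

% image (with its default 'direct' mapping) rounds values and indexes
% straight into the colormap, so scale the background onto rows 1-64...
bg = 1 + 63 * double(dists > 8);
image(bg); hold on; axis square

% ...and the overlay onto rows 65-128, so it can only use the jet half
ov = 65 + 63 * mask;
image(ov, 'AlphaData', 0.5*mask);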

How would it work?

>> colormap bone
>> imagesc(dists>8)
>> hold on
>> imagesc(mask,'ColorMap',[0.5 0 0.5],'AlphaData',0.5*(mask>0))
>> axis square

Beautiful!

Implementation notes

  • image is a builtin (it has no MATLAB-code source), but imagesc is a wrapper around image, written directly in MATLAB code. Therefore, overloading image requires the new function to be placed in a special directory called @double, while imagesc can be placed anywhere (except that it cannot be placed in @double). If you then want to call the original version of image(), you can use builtin('image',arg1,arg2,…); if you want to call the original imagesc, it is a right pain. Instead, I used type imagesc to extract the source of imagesc and modified that source directly – obviating any need to call the original imagesc. For reference, though, the most efficient way works out to be to find the function with which('imagesc'), cd into the containing directory, create a function handle to imagesc, and then cd out. As I said, it’s a mess. A minimal sketch of the overloading approach appears after this list.
  • These edits break colorbars. I added a spacer entry in each colormap which stores the maximum and minimum ‘real’ values of the data – in case that is useful when I get around to extending colorbar. Colormap entries must lie inside [0,1], so these data are stored in the first twelve decimal places of the spacer’s entries: a strange burlesque on floating point. It’s a hack, but for my purposes it works.
  • In addition to the standard colormaps, I often require a mask in a particular color. For this purpose it helps to have a colormap that varies smoothly from white to the color in question. (It actually doesn’t matter whether it varies from white or any other colour – ultimately, I only use the full-colour end, since I set the transparency of all other pixels to maximum.) Either way, passing a colour on the [0,1] or [0,255] scale sets a colormap which varies from white to that color.
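To make the first note concrete, here is a minimal sketch of the overloading approach – my illustration, not the actual File Exchange code. Saved as @double/image.m, it peels off the extra parameter and defers to the builtin:

function h = image(varargin)
% Overloaded image() for double arrays: accept a 'ColorMap' parameter,
% strip it out, then defer to the builtin. (A sketch only: the real
% version also rescales the data onto the right block of rows.)
ix = find(strcmpi(varargin, 'ColorMap'), 1);
if ~isempty(ix)
    cm = varargin{ix+1};           % requested colormap (n-by-3 matrix)
    varargin(ix:ix+1) = [];        % remove the pair before the real call
    colormap([colormap; cm]);      % naively append to the current colormap
end
h = builtin('image', varargin{:});

Because the file lives in @double, it shadows the builtin only when dispatching on a double first argument – which is exactly what imagesc feeds it.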

The code is available on MATLAB File Exchange at this link and is installable by copying imagesc.m, bootleg_fp.m, and the directory @double into your working directory. The idea of concatenating colormaps is widely described online – for example, here.

A Day in the Life of a DPhil Student… that also rows for Oxford.

I couldn’t decide whether to write this blog post. However, I sifted through the archives of BLOPIG and found in the original post this excerpt:

“And if your an athlete, like Anna (Dr. Lewis) who crossed the atlantic in a rowing boat or Eleanor who used to row for the blues – what can I say, this is how we roll, or row [feeble attempt at humour] – thats a non-scientific but unique and interesting experience too (Idea #8).  .”

Therefore I’ve decided that it might be an interesting post to look into what life is like when you are studying for a DPhil and also training for the Blues. Rowing in particular is a controversial sport – I have heard many stories claiming that rowing will be the absolute detriment to your DPhil. I’ve never felt pressured as part of OPIG to give up rowing – all of my supervisors have been very fair, in that if I get the work done then they accept this is part of my life. However, I realise not all supervisors are so understanding. I hope this blog post will give some insight into what it is like to trial for a Blues sport (in this case Women’s Lightweight Rowing) whilst studying for a DPhil at Oxford.

4:56 am – Alarm goes off. If it’s after September it’s dark, cold and likely raining. No breakfast as I will do the first training session fasted – just get dressed and go!

5:15 am – Leave the house with a bag full of kit, food for the day, laptop and papers, to cycle to Iffley Sports Centre.

5:45 am – Lightweight Women’s minibus leaves from Iffley to drive to Wallingford. Some girls try to study in the bus, but to be honest it’s too dark and we’re all a bit too sleepy.

6:15 am – Arrive at Wallingford. Get onto the water for a session in the boats. Although in the Boat Race we race in an 8 (8 rowers with one oar each, with a cox steering), we spend lots of time in different boats throughout the season. Perhaps unlike our openweight counterparts, we also do a lot of sculling (two oars per rower) as the only Olympic class boat for lightweight women is a sculling boat. We travel to Wallingford for a much longer, emptier stretch of river and normally get to see the sunrise.


8:10 am – We leave Wallingford to head back to Oxford. We start waiting in A LOT of traffic once we hit the ring road, and there’s a lot of panic in the bus about whether 9 am lectures will be made on time!

8:50 am – Arrive back at Iffley Sports Centre. Grab my bike and cycle to the department.

9:00-9:15 am – Arrive at the Department. Quick shower to thaw frozen fingers and to not repulse my fellow OPIG members. I then get to eat warm porridge (highlight of the day) and go through my emails. I also check whether any of my jobs have finished on the group servers – one of the great perks of being in OPIG is the computational resources available to the group. Check the to-do list from yesterday and write a to-do list for today and get to work (coding, plotting results, reading papers or writing)!

11:00 am (Tuesdays & Thursdays) – Coffee morning! Although if it’s any time close to a race there are no bourbon biscuits or cake for me. This is a bit of an issue because at OPIG we eat a lot of cake. However, one member can usually be relied upon to eat my portion…

1:00 pm – Lunchtime! As a lightweight rower I am required to weigh in at 59 kg on the day of the Boat Race. If I am over that weight I don’t get to race. Therefore, I spend a portion of the year dieting to make sure I hit that target. The dieting lunch consists of soup and Greek yogurt. The post-race, non-dieting lunch consists of pasta from Taylors, chocolate and a Coke (yum!). OPIG members generally all have lunch at this time and enjoy solving the Times Cryptic Crossword. I’m not the best at crosswords so I normally chat to Laura and don’t concentrate.

2:00 pm – Back to work. Usually coding whilst listening to music. I normally start rushing to be able to submit some jobs to the group servers before I have to leave the office.

3:00 pm – Go to get a chocomilk with Clare. A chocomilk from the vending machine in our department costs 20p and is only 64 calories!

5:30 pm – Cycle to Iffley Sports Centre for the second training session of the day.

5:45 pm – If it’s light enough we hop in the minibus to go to Wallingford for another outing on the water. However, for most of the season it’s too dark and we head to the gym instead. This will either consist of weights to build strength, or the indoor rowing machine (erg) to build fitness. The erg is my nemesis, so this is not a session I look forward to. I find staring at a screen that constantly tells you how hard you are pushing – or that you are no longer pushing as hard – psychologically quite tough. I’d much rather be gliding along the river.

8:35 pm – Leave Iffley after a long session to head home. Quickly down a Yazoo (strawberry milk) to boost recovery as I won’t be eating dinner until 45 minutes to an hour after the end of the session.

9:00 pm – Arrive home. I “cook” dinner which when I’m dieting consists of chucking sweet potato and healthy sausages from M&S in the oven while I pack my kit bag for the next day.

9:30 pm – Wolf down dinner and drink about a pint of milk, whilst finally catching up with my boyfriend about both our days.

10:00 pm – Bedtime at the latest.

Repeat!


When Does Chemical Elaboration Induce a Ligand To Change Its Binding Mode?


For my journal club in June, I chose to present a Journal of Medicinal Chemistry article entitled “When Does Chemical Elaboration Induce a Ligand To Change Its Binding Mode?” by Malhotra and Karanicolas. This article uses a large-scale collection of ligand pairs to investigate the circumstances in which elaboration of a ligand changes the original binding mode.

One of the primary goals in medicinal chemistry is the optimisation of biological activity by chemical elaboration of a hit compound. This hit-to-lead optimisation often assumes that addition of functional groups to a given hit scaffold will not change the original binding mode.

In order to investigate how often, and under what circumstances, this assumption holds true, they built up a large-scale collection of 297 related ligand pairs solved in complex with the same protein partner. Each pair consisted of a larger and a smaller ligand, where the larger ligand could have arisen from elaboration of the smaller one. They found that for 41 of the 297 pairs (14%), the binding mode changed upon elaboration of the smaller ligand.

They investigated many physicochemical properties of the ligand, the protein-ligand complex and the protein binding pocket. They summarise the statistical significance and predictive power of the investigated properties with the table shown below.

They found that the property with the lowest p-value was the “rmsd after minimisation of the aligned complex” (RMAC). They developed this metric to probe whether the larger ligand could be accommodated in the protein without changing binding mode. They did so by aligning the shared substructure of the larger ligand onto the smaller ligand’s complex and then carrying out an energy minimisation. By monitoring the RMSD difference of the larger ligand relative to the initial pose (RMAC), they can gauge how compatible the larger ligand is with the protein. Larger RMAC values indicate greater incompatibility, hence a greater likelihood for the binding mode to not be preserved.

The authors generated receiver operating characteristic (ROC) plots to compare the predictive power of the properties considered. ROC curves are made by plotting the true positive rate (TPR) against the false positive rate (FPR). A random classifier would yield the dotted line from the bottom left to the top right, shown in the plots below. The best predictors would give a point in the top left corner of the plot. The properties that do well include RMAC, pocket volume, molecular weight, lipophilicity and potency.
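As a quick reminder of the mechanics of these plots, here is a toy MATLAB illustration of drawing a ROC curve – the labels and scores below are simulated (loosely mimicking the 41/297 split), not the paper’s data:

% Simulated classifier output: 1 = binding mode changed, 0 = conserved
labels = [ones(41,1); zeros(256,1)];
scores = [randn(41,1) + 1; randn(256,1)];   % higher score = predict change

% Sweep the threshold from high to low by sorting on score
[~, idx] = sort(scores, 'descend');
sorted   = labels(idx);
tpr = cumsum(sorted)  / sum(sorted);        % true positive rate
fpr = cumsum(~sorted) / sum(~sorted);       % false positive rate

plot(fpr, tpr); hold on
plot([0 1], [0 1], ':')                     % dotted line: random classifier
xlabel('FPR'); ylabel('TPR'); axis square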

They also combined properties to enhance predictive power and conclude that RMAC and molecular weight together offer good predictivity. Finally, the authors look at the pairs that have low RMAC values (i.e. the elaboration should be compatible with the protein pocket), yet show a change in binding mode. For these cases, a specific substitution may enable formation of a new, stronger interaction, or, for pseudosymmetric ligands, the alternate pose can mimic many of the interactions of the original pose.

Antibody Developability: Experimental Screening Assays

[This blog post is centered around the paper “Biophysical properties of the clinical-stage antibody landscape” (http://www.pnas.org/content/114/5/944.abstract) by Tushar Jain and coworkers. It is designed as a very basic intro for computational scientists into the world of experimental biophysical assays.]

A major concern in the development of antibody therapies is being able to predict “developability issues” at the screening stage, to avoid costly developmental dead-ends. Examples of such issues include an antibody being difficult to manufacture, possessing unsuitable pharmacodynamic or pharmacokinetic profiles, having a propensity to aggregate (both in storage and in vivo), and being highly immunogenic.

This post is designed to give a clear and concise summary of the principles behind some of the most common biophysical experimental assays used to assess antibody candidates for future developability issues.

1. Ease of manufacture

HEK Titre (HEKt): This assay tests the expression level of the antibody (the higher the better). The heavy and light chain sequences are subcloned into vectors (such as pcDNA 3.4+, ThermoFisher) and these vectors are subsequently transfected into a suspension of human embryonic kidney (HEK293) cells. After a set number of days the supernatant is harvested to assess the degree of expression.

2. Stability of 3D structure

Melting temperature using Differential Scanning Fluorimetry (Tm with DSF) Assay: This assay tests the thermal stability of the antibody. The higher the thermal stability, the less likely the protein is to spontaneously unfold and become immunogenic. The antibody is mixed with a dye that fluoresces when in contact with hydrophobic regions, such as SYPRO Orange. The mixture is then taken through a range of temperatures (e.g. 40°C → 95°C at a rate of 0.5°C per 2 min). As the protein begins to unfold, buried hydrophobic residues become exposed and the level of fluorescence suddenly increases. The temperature at which the increase in fluorescence intensity is greatest gives us the Tm value.

(Further reading: http://www.beta-sheet.org/resources/T22-Niesen-fingerprinting_Oxford.pdf)
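For the computationally minded, the readout boils down to finding where the melt curve is steepest. A toy MATLAB sketch with a simulated fluorescence trace (not real assay data):

% Simulated melt curve: sigmoidal rise in fluorescence around 70 degrees C
T = 40:0.5:95;                        % temperature ramp
F = 1 ./ (1 + exp(-(T - 70)/2));      % fluorescence signal

% Tm = temperature where the fluorescence increase is steepest
[~, imax] = max(diff(F) ./ diff(T));
Tm = T(imax);
fprintf('Estimated Tm: %.1f degrees C\n', Tm)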

3. Stickiness assays (Aggregation propensity/Low solubility/High viscosity)

Affinity-capture Self-interaction Nanoparticle Spectroscopy (AC-SINS) Assay: This assay tests how likely an antibody is to interact with itself. It uses gold nanoparticles that are coated with anti-Fc antibodies. When a dilute solution of antibodies is added, they rapidly become immobilised on the gold beads. If these antibodies subsequently attract one another, it leads to shorter interparticle distances and longer absorption wavelengths that can be detected by spectroscopy.

(Further reading: https://www.ncbi.nlm.nih.gov/pubmed/24492294)

Clone Self-interaction by Bio-layer Interferometry (CSI-BLI) Assay: A more high-throughput method that uses label-free technology to measure self-interaction. Antibodies are loaded onto a biosensor tip and white light is shone down the instrument to yield an internal reflection interference pattern. The tip is then inserted into a solution of the same antibody; if self-interaction occurs, the interference pattern shifts by an amount proportional to the change in thickness of the biological layer.

(Further Reading: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3896597/)

Hydrophobic Interaction Chromatography (HIC) Assay: Antibodies are mixed into a polar mobile phase and then washed over a hydrophobic column. UV-absorbance or other techniques can then be used to determine the degree of adhesion.

(Further Reading: https://www.ncbi.nlm.nih.gov/pubmed/4094424)

Standup Monolayer Adsorption Chromatography (SMAC) Assay: Antibodies are injected onto a pre-packed Zenix HPLC column and their retention times are measured. The longer the retention time, the lower the colloidal stability and the more prone the antibody is to aggregate.

(Further Reading: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4622974/)

Size-exclusion Chromatography (SEC) Assay: Antibodies are flowed through a column consisting of spherical beads with minuscule pores. Non-aggregated antibodies are small enough to get trapped in the pores, whereas aggregated antibodies will flow through the column more rapidly. Percentage aggregation can be worked out from the concentrations of the different fractions.

4. Degree of specificity

Cross-Interaction Chromatography (CIC) Assay: This assay measures an antibody’s retention time as it flows across a column conjugated with polyclonal human serum antibodies. If an antibody takes longer to exit the column, it indicates that its surface is likely to interact with several different in vivo targets.

(Further Reading: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3896597/)

Enzyme-linked Immunosorbent Assay (ELISA) – with common antigens or Baculovirus Particles (BVPs): Common antigens or BVPs are fixed onto a solid surface, and a solution containing the antibody of interest linked to an enzyme (such as horseradish peroxidase, HRP) is washed over them. Incubation lasts for about an hour before any unreacted antibodies are washed off. When the appropriate enzyme substrate is then added, it produces a visible, fluorescent or luminescent signal, which can be detected. The intensity is proportional to the amount of antibody stuck to the surface.

(Further Reading: https://www.thermofisher.com/uk/en/home/life-science/protein-biology/protein-biology-learning-center/protein-biology-resource-library/pierce-protein-methods/overview-elisa.html)

Poly-Specificity Reagent (PSR) Binding Assay: A more high-throughput method that uses fluorescence-activated cell sorting (FACS), a type of flow cytometry. A PSR is generated by biotinylating soluble membrane proteins (from Chinese hamster ovary (CHO) cells, for example) and then incubated with IgG-presenting yeast. After washing, a secondary labelling mix is added, and flow cytometry is used to determine a median fluorescence intensity – the higher the median intensity, the greater the chance of non-specific binding.

(Further Reading: https://www.ncbi.nlm.nih.gov/pubmed/24046438)

Le Tour de Farce v5.0

Every summer the OPIGlets go on a cycle ride across the scorched earth of Oxford in search of life-giving beer. Now in its fifth iteration, the annual Tour de Farce took place on Tuesday the 13th of June.

Establishments frequented included The Victoria, The Plough, Jacobs Inn (where we had dinner and didn’t get licked by their goats, certainly not), The Perch and finally The Punter. Whilst there were plans to go to The One for their inimitable “lucky 13s”, by 11 PM we were alas too late, so we doubled down in The Punter.

Highlights of this year’s trip included certain members of the group almost immediately giving up when trying to ride a fixie, and subsequently being shown up by our unicycling brethren.

Computational immunogenicity reduction

In my last presentation, I talked about the article by King et al. describing a method for computationally removing T-cell receptor epitopes from proteins. The work could have a significant impact on the field of protein therapeutic design, where immunogenicity is a serious obstacle.

One of the major challenges when developing a protein therapeutic is the activation of the immune system by the drug and the subsequent production of antibodies against it, rendering the therapeutic ineffective. This process is known as immunogenicity. Immunogenicity is triggered by T-cell recognition of peptide epitopes displayed on the MHC (major histocompatibility complex). This recognition can be impeded by designing the protein therapeutic to remove potential T-cell epitopes from its surface. There has been some success in experimental T-cell epitope removal, but the process remains resource- and time-consuming.

In this work, King et al. created a function which assigns each residue a score that measures its propensity to be part of a T-cell epitope. The score consists of three parts. The first is an SVM (Support Vector Machine) score calculated over each 15-residue window, which attempts to predict how likely the corresponding peptide sequence is to bind the MHC; the SVM was trained on immunological data from the Immune Epitope Database (IEDB). The second part is calculated on each 9-residue window and compares the frequency of the 9-mer in host genomic data and in known epitope data (a sequence occurring with high frequency in the human genome is rewarded, while the opposite is true for sequences occurring in the known epitope data). The third part penalises any deviation from the original charge of the protein. These three parts are combined with a standard Rosetta score that measures the stability of the protein, with the weight assigned to each term calibrated on existing protein structures. The combined score is used to rank candidate mutations in the sequence of the protein of interest according to their propensity to reduce immunogenicity, and the top-scoring mutations are then combined in a greedy fashion.
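The greedy step can be pictured with a small MATLAB sketch – entirely simulated data and an invented compatibility rule, not the authors’ implementation:

% Simulated candidates: each point mutation has a residue position and a
% combined score (higher = better at reducing immunogenicity)
nMut      = 20;
positions = randi(100, nMut, 1);
scores    = randn(nMut, 1);

% Greedy combination: take candidates best-first, skipping any whose
% position has already been mutated
[~, order] = sort(scores, 'descend');
accepted   = [];
for i = order'
    if ~ismember(positions(i), positions(accepted))
        accepted(end+1) = i;   %#ok<AGROW>
    end
end
disp(positions(accepted)')     % positions chosen for the final design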

The authors tested their method on the fluorescent reporter protein superfolder GFP (sfGFP) and on the toxin domain of the cancer therapeutic HA22. In the case of sfGFP the authors targeted the four top-scoring T-cell epitopes. They created eight different protein designs, all of which preserved the function of the original protein (fluorescence). The top-scoring design was selected for experimental immunogenicity testing, which showed that it had significantly reduced immunogenicity in comparison to the original protein. In the case of HA22 the authors created five designs, of which three displayed cytotoxicity at the same level as or higher than the original protein. The two most cytotoxic designs were further characterised experimentally for their propensity to induce an immune response; both elicited a significantly reduced T-cell response.

Figure 1: Reduction of immunogenicity without loss of function. A) Three of the five designs show cytotoxicity at the same level or higher than the original protein. B) Two of the three cytotoxicity-preserving designs show reduced immunogenicity

Overall, this very interesting study showed that computational methods can be successfully used for reducing immunogenicity of protein therapeutics, opening new avenues for computational protein design.


Computationally designing antibodies using a known binding motif

This blogpost is about “Computational design of an epitope-specific Keap1 binding antibody using hotspot residues grafting and CDR loop swapping” by Liu et al., which I presented at group meeting in May.

Antibody design is a subject that I am closely interested in, especially methods that include an important computational step. So far the go-to methods used by industry for designing an antibody are animal-model immunisation and/or phage display, with little or no use of computational methods. In the past few years, however, a few computational methods for the rational design of antibodies have been making a showing. First, there are those where a structure of the docked antibody-antigen complex already exists, and the antibody is refined computationally to increase binding affinity. Then there are those where the paratope of the antibody is proposed by the designer against a specific target. The paper I am summarising here, by Liu et al., follows the latter idea in a neat way.

Liu et al. show that if a specific motif is important for binding a certain target – i.e. there is a crystal structure which shows that the motif is buried in the target, and/or you predict that its residues are important for binding – it is worthwhile trying to graft that motif into the CDR region of an antibody (the region responsible for antibody specificity and affinity). Grafting of entire CDR loops has long been used for antibody humanisation, with many examples of CDR loops maintaining conformation and binding specificity when transferred from a non-human scaffold to a human scaffold. This is somewhat aided by the fact that the start and end points of the grafted region are stable (i.e. the anchors are conformationally the same in all the antibody structures that we observe), which is not the case in Liu et al., where they graft a four-residue motif. The cool thing they do, which makes it more probable that the motif maintains its conformation, is to identify an antibody which has, in one of its CDR loops, a fragment with the same backbone conformation as the motif they are trying to graft. They then simply replace the residue types with the ones known to bind the target. For the Nrf2 motifs (which bind Keap1) they managed to create five potential designs. These were expanded to ten, using rational point mutations on the rest of the antibody to increase the likelihood of binding. Of the ten, two showed binding.

One potential issue in a real scenario, however, is that the entire binding site is not copied onto the antibody – the motif is only a subset of the whole – which raises the possibility of low affinity and/or little chance of competing with the original protein (i.e. Nrf2) from which the motif was copied. This indeed turned out to be the case, with the initial designs showing low mM affinity. Liu et al. worked on improving the initial designs by computationally swapping the H3 CDR of each design for a set of other H3 structures seen in other solved antibodies, using the Rosetta design protocol. They retained the ones that had a predicted buried SASA of > 2000 Å², a change in energy of more than 20 REU, and a shape complementarity greater than 0.6. These were then tested experimentally, with a few of them showing nM affinities – a result which, at this time, should make you very happy if your entire design phase was done computationally.

Using bare git repos

Git is a fantastic method of doing version control of your code. Whether it’s to share with collaborators, or just for your own reference, it almost acts as an absolute point of reference for a wide variety of applications and needs. The basic concept of git is that you have your own folder (in which you edit your code, etc.) and you commit/push those changes to a git repository. Note that Git is a version control SYSTEM, while GitHub/BitBucket etc. are services that host repositories using Git as their backend!

The basic procedure of git can be summarised to:

1. Change/add/delete files in your current working directory as necessary. This is followed by a git add or git rm command.
2. “Commit” those changes; we usually put a message reflecting the change from step 1. e.g. git commit -m "I changed this file because it had a bug before."
3. You “push” those changes with git push to a git repository (e.g. hosted by BitBucket, GitHub, etc.); this is sort of like saying “save” that change.

Typically we use services like GitHub to HOST a repository. We then push our changes to that repository (or git pull from it) and all is good. However, a powerful concept to bear in mind is the ‘bare’ git repository. This is especially useful if you have code that’s private and should be strictly kept within your company/institution’s server, yet you don’t want people messing about too much with the master version of the code. The diagram below makes the bare git repository concept quite clear:

The bare repo acts as a “master” version of sorts, and every other “working”, or non-bare repo pushes/pulls changes out of it.

Let’s start with the easy stuff first. Every git repository (e.g. the one you’re working on on your machine) is a WORKING/NON-BARE git repository. This shows the files in your code as you expect them, e.g. *.py or *.c files, etc. A BARE repository is a folder hosted by a server which only holds git OBJECTS. In it, you’ll never see a single .py or .c file, but rather a bunch of folders and text files that look nothing like your code. By the magic of git, these are easily translated into .py or .c files (basically a version of the working repo) when you git clone it. Since the bare repo doesn’t contain any of the actual code, you can safely assume that no one can really mess with the master version without having gone through the process of git add/commit/push, making everything documented. To start a bare repo…

# Start up a bare repository on a server
user@server:~$ git init --bare name_to_repo.git

# Go back to your machine, then clone it
user@machine:~$ git clone user@server:/path/to/repo/name_to_repo.git

# This will clone an empty git repo into name_to_repo on your machine
cd name_to_repo
ls
# Nothing should come up.

touch README
echo "Hello world" >> README
git add README
git commit -m "Adding a README to initialise the bare repo."
git push origin master # This pushes to your origin, which is user@server:/path/to/repo/name_to_repo.git

If we check our folders, we will see the following:

user@machine:~$ ls name_to_repo
README # only the README exists in the working clone

user@server:~$ ls /path/to/repo/name_to_repo.git/
branches/ config description HEAD hooks/ info/ objects/ refs/

Magic! README doesn’t appear as a file in the bare repo. Again, this is because the repo is BARE: the file we pushed is stored as git objects, not as a plain file. But when we clone it on a different machine…

user@machine2:~$ git clone user@server:/path/to/repo/name_to_repo.git
cd name_to_repo
ls
README
cat README
Hello world # magic!

This was a bit of a lightning tour, but hopefully you can see that the purpose of a bare repo is to let you host code as a “master version” without worrying that people will see the contents directly until they do a git clone. Once they clone and push changes, everything is documented via git, so you’ll know exactly what’s going on!

Experimental Binding Modes of Small Molecules in Protein-Ligand Docking

Protein-ligand docking tends to be very good at generating binding modes that resemble experimental binding modes from X-ray crystallography and other methods (assuming we have a high-quality structure…); but it is also very good at generating plausible models for ligands that don’t bind. These so-called “false positives” lead to reduced accuracy in structure-based virtual screening campaigns.

Structure-based methods are not the only way of approaching virtual screening: when all we know is the chemical structure of an active molecule, but nothing about its target (or targets), we can use ligand-based virtual screening methods, which operate on the principle of molecular similarity (Maggiora et al., 2014).

But what if we combine both methods?


A brief history of usage of the word “decoy” in protein structure prediction

Some concepts in science are counter-intuitive, like the Monty Hall problem or the Mpemba effect. Occasionally, this is also true for terminology, despite the best efforts of scientists to ensure that their work can be explained unambiguously to newcomers. Specifically, in our field of protein structure prediction, the word “decoy” has been used to mean one of many conformations generated by a de novo modelling protocol such as Rosetta, or alternative conformations of loops produced by an ab initio program e.g. Sphinx. Though slightly baffled by this usage when I started working in the field, I have now become so familiar with its strange new meaning that I have to remind myself to explain it in talks to a more general audience, or simply aim to avoid the term altogether. Nonetheless, following a heated discussion over the term in a recent group meeting, I thought it would be interesting to trace the roots of the new meaning.

Let’s begin with a definition from Google:

decoy

noun
noun: decoy; plural noun: decoys
/ˈdiːkɔɪ,dɪˈkɔɪ/
1.
a bird or mammal, or an imitation of one, used by hunters to attract other birds or mammals.
“a decoy duck”
  • a person or thing used to mislead or lure someone into a trap.
    “we need a decoy to distract their attention”

So we start with the idea of something distracting, resembling the true thing but with the intent to deceive. So how has this sense of the word evolved into what we use now? I attempted to dig out the earliest mention of decoy for a computationally generated protein conformation with a Google Scholar search for “decoy protein”, which led to the work of Thomas and Dill published in 1996. Here the authors describe a method of distinguishing the native fold of a protein from the same sequence threaded, without gaps, onto alternative structures from the PDB. This problem of discrimination between native and non-native had been carried out previously, but Thomas and Dill chose to describe the alternatives as “decoy conformations” or just “decoys”.

A similar problem was commonly attempted over the following years, of separating native structures from sets of computationally generated conformations. Due to the demands of conformer generation at this time, some sets were published themselves in online databases to be used as a resource for training scoring functions.

When it comes to the problem of de novo protein structure prediction, unfortunately it isn’t as simple as picking out the correct answer from a population of incorrect answers. Even among hundreds of thousands of conformations generated by the best methods, the exact native crystal structure will not be found (though a complication here is that the protein is dynamic and will occupy an ensemble of native conformations). Therefore, the aim of any scoring function in structure prediction is instead to select which incorrect conformation is closest to the native structure, hoping to obtain at least the correct fold.

It is for this reason that we move towards the idea of choosing a model from a pool of decoys. Zhu et al. (2003) use “decoy” in precisely this way:

“One strategy for ab initio protein structure prediction is to generate a large number of possible structures (decoys) and select the most fitting ones based on a scoring or free energy function”

This seems to be where the idea of a decoy as incorrect and distracting is lost, and takes on its new meaning as one of a large and diverse set of protein-like conformations, which has continued until now.

So is it ever helpful to refer to “decoys” as opposed to “models”? What is communicated by “decoy” that is not achieved by using the word “model”? I think this may come down to the impression which is given by talking about a pool of decoys. People would not generally assume that each decoy on its own has any effective use for prediction of function. There is a sense that this is not the final result of the structure prediction pipeline, there is work yet to be done in refining, clustering, and making human judgments on the suitability of the output. Only after these stages would I feel more comfortable using the word “model”, to express the greater confidence we have in the structure (small though that may be in the de novo structure prediction world). However, the inadequacy of “model” does not alone justify this tenuous usage of “decoy”. Perhaps we could speak more often about populations of “conformations”. In any case, “decoy” is widespread in the community, and easily understood by those who are most likely to be reading, reviewing and editing the literature so I think we will be stuck with it for a while yet.