Category Archives: Code

[Database] SAbDab – the Structural Antibody Database

An increasing proportion of our research at OPIG is about the structure and function of antibodiesCompared to other types of proteins, there is a large number of antibody structures publicly available in the PDB (approximately 1.8% of structures contain an antibody chain). For those of us working in the fields of antibody structure prediction, antibody-antigen docking and structure-based methods for therapeutic antibody design, this is great news!

However, we find that these data are not in a standard format with respect to antibody nomenclature. For instance, which chains are “heavy” chains and which are “light“? Which heavy and light chains pair? Is there an antigen present? If so, to which H-L pair does it bind to? Which numbering system is used … etc.

To address this problem, we have developed SAbDab: the Structural Antibody Database. Its primary aim is for easy creation of antibody structure and antibody-antigen complex datasets for further analysis by researchers such as ourselves. These sets can be selected using a number of criteria (e.g. experimental method, species, presence of constant domains…) and redundancy filters can be applied over the sequences of both the antibody and antigen. Thanks to Jin, SAbDab now also includes associated curated affinity (Kd) values for around 190 antibody-antigen complexes. We hope this will serve as a benchmarking tool for antibody-antigen docking prediction algorithms.

sabdab

Alternatively, the database can be used to inspect and compare properties of individual structures. For instance, we have recently published a method to characterise the orientation between the two antibody variable domains, VH and VL. Using the ABangle tool, users can select structures with a particular VH-VL orientation, visualise and quantify conformational changes (e.g. between bound and unbound forms) and inspect the pose of structures with certain amino acids at specific positions. Similarly, the CDR (complimentary determining region) search and clustering tools, allow for the antibody hyper-variable loops to be selected by length, type and canonical class and their structures visualised or downloaded.

structure_viewer

 

SAbDab also contains features such as the template search. This allows a user to submit the sequence of either an antibody heavy or light chain (or both) and to find structures in the database that may offer good templates to use in a homology modelling protocol. Specific regions of the antibody can be isolated so that structures with a high sequence identity over, for example, the CDR H3 loop can be found. SAbDab’s weekly automatic updates ensures that it contains the latest available data. Using each method of selection, the structure, a standardised and re-numbered version of the structure, and a summary file containing information about the antibody, can be downloaded both individually or en-masse as a dataset. SAbDab will continue to develop with new tools and features and is freely available at: opig.stats.ox.ac.uk/webapps/sabdab.

Making Protein-Protein Interfaces Look (decently) Good

This is a little PyMOL script that I’ve used to draw antibody-antigen interfaces. If you’d like a commented version on what each and every line does, contact me! This is a slight modification of what has been done in PyMOL Wiki.

load FILENAME
set_name FILENAME, complex	

set bg_rgb, [1,1,1]  	

color white 	     		

hide lines
show cartoon

select antibody, chain a
select antigen, chain b

select paratopeAtoms, antibody within 4.5 of antigen 
select epitopeAtoms, antigen within 4.5 of antibody

select paratopeRes, byres paratopeAtoms
select epitopeRes, byres epitopeAtoms

distance interactions, paratopeAtoms, epitopeAtoms, 4.5, 0

color red, interactions
hide labels, interactions

show sticks, paratopeRes
show sticks, epitopeRes

set cartoon_side_chain_helper, on

set sphere_quality, 2
set sphere_scale, 0.3
show spheres, paratopeAtoms
show spheres, epitopeAtoms
color tv_blue, paratopeAtoms
color tv_yellow, epitopeAtoms

set ray_trace_mode, 3
unset depth_cue
set specular, 0.5

Once you orient it to where you’d like it and ray it, you should get something like this.
contacts

Standardize a PDB

In many applications you need to constrain PDB files to certain chains. You can do it using this program.

A. What does it do?

Given a pdb file, write out the ATOM and HETATM entries for the supplied chain(s), put them on a single chain with the name provided by the user.

PDB_standardize needs four arguments:

  1. PDB file to constrain.
  2. Chains from the pdb file to constrain.
  3. Output file.
  4. Name of the new chain

B. Requirements:

Biopython – should be installed on your machines but in case you want to use it locally, download the latest version into the PDB_constrain.py’s directory (don’t need to build).

C. Example use:

C.1 Constrain 1A2Y.pdb to chains A and B, placed on a single chain C – write results in constr.pdb

python PDB_standardize.py -f 1A2Y.pdb -c AB -o const.pdb -s C

 

C.2 Constrain 1ACY to chain L, with a new chain name F, write results in const.pdb – this example shows that the constrainer works well with ‘insertion’ residue numbering as in antibodies where you have 27A, 27B etc.

python PDB_standardize.py -f 1ACY.pdb -c L -o const.pdb -s F

 

Inside Memoir: MP-T aligns membrane proteins

Although Memoir has received a lot of air-time on this blog, we haven’t gone into a great deal of detail about how it models membrane proteins. Memoir is a pipeline involving a series of programs iMembrane -> MP-T -> Medeller -> Fread, and in this post I’ll explain the MP-T step (I’ll briefly touch on Medeller too).

Let’s first look at the big-picture. There are several ways of modelling a protein’s 3D structure. In an ideal world we could specify an extended polypeptide, teach a computer some physics, set if off simulating, and watch the exact folding pathway of a protein. This doesn’t work. A second method would be to build up a protein from lots of fragments of unrelated proteins… this is usually what is meant by ‘ab initio’ modelling. The most accurate (and least sophisticated) approach is to find a protein of known structure with similar sequence, align the sequences, and copy over the coordinates of the aligned residues to make a model for the query protein. This is the approach taken by Memoir and is called homology modelling or comparative modelling.

The diagram below shows an example of how homology modelling might work. Four membrane protein sequences are aligned (left) and the alignment specifies a structural superposition (right). Assume now that the red structure is unknown: we could make a good model for it just by copying over the aligned parts of the blue, green and yellow structures.

Screen shot 2013-05-20 at 14.26.06
The greatest difficulty in the modelling described above is making an accurate alignment. As sequences become more distantly related they share less and less sequence identity, and working out the optimum alignment becomes challenging. This problem is especially acute for membrane protein modelling: there are so few structures from which to copy coordinates that a randomly chosen query protein has a good chance of having <30% sequence identity to the nearest related structure.

Although alignment is the most important facet of homology modelling it is not the only consideration. In the above diagram the centres of the proteins are structurally very conserved (so copying coordinates will lead to a good model in this region), but the top of the proteins differ (the stringy loops don’t sit on top of one another). It is the role of coordinate generation software to distinguish which coordinates to copy. It turns out that the pattern of a conserved centre and varying top/bottom is generally true for membrane proteins, and Memoir uses our Medeller coordinate generation software to take advantage of this pattern.

Back then to alignment. The aim of alignment is to work out which amino acids in one protein are related to amino acids in another. All alignment methods have at their heart a set of scores which encode the propensity for one amino acid to mutate to another, and for that mutation to become fixed in a population. These scores form a substitution table (here mutation + fixation = substitution). More sophisticated alignment methods augment these scores in different ways — for example by adding in scoring based on secondary structure, smoothing scores over a window, or estimating a statistical supplement to the score determined from a related set of pre-aligned sequences — but at some level a substitution table is always present. Using a substitution table, the most likely evolutionary relationship between two sequences can be detected and this is reported in the form of an alignment.

So that’s general alignment, now to apply this to membrane proteins. The cell membrane is composed of a lipid bilayer: a sandwich with a hydophobic filling and hydrophilic crusts. The part of a membrane protein that touches the filling will have different preferences for amino acids (and, more importantly, substitutions between these amino acids) than the part of a membrane protein that touches the crust. Similarly there are systematic preferences for amino acid substitutions depending on whether part of a protein is buried or exposed, and on which type of secondary structure it assumes. The figure below shows a membrane protein with different regions of the membrane and different types of secondary structure annotated.

Screen shot 2013-05-20 at 14.27.19 Screen shot 2013-05-20 at 14.28.42

 

It is possible to make separate substitution tables for each environment within a membrane protein, where an environment specifies where the protein sits in the membrane, what secondary structure it has, and whether it is accessible or buried. Below is a principal components analysis of the resulting set of tables: each table is represented by a single point and the axes show the direction of the greatest variation between the tables. The plot on the right shows a separation of the points based on whether they are buried (more hydrophobic) or accessible (more hydrophilic). The hydrophobic centre (red circles) and hydrophilic edges (green circles) of the membrane fall into this general pattern. The table on the left shows that the tables further divide by secondary structure type. In summary there are systematic substitution preferences in practice as well as theory, and for membrane proteins it is most important to consider hydrophobicity when aligning two protein sequences.

Screen shot 2013-05-20 at 14.30.09

On then to modelling. The conventional approach to aligning a pair of sequences for homology modelling is to take a set of pre-aligned sequences (a sequence profile), and use them to estimate a supplement to the standard substitution score for aligning two sequences. This is termed profile-profile alignment. Memoir takes a different approach by using the MP-T program to construct a multiple sequence alignment scored with environment-specific substitution tables. The alignment includes a set of homologous sequences to the pair of interest.

Profile-profile alignment methods and MP-T are very different. It is unclear whether the substitution preferences at a position are best estimated by MP-T’s tables or the supplements derived from sequence profiles, and the answer probably depends on how well the profiles are made — garbage in, garbage out. Similarly the MP-T algorithm only determines the upper limit of alignment accuracy, and the actual accuracy depends on how the homologous sequences in the alignment are chosen.

In general we find little difference between the fraction of an alignment that MP-T and either HHsearch or Promals (profile-profile alignment methods) gets right. However we do find a difference in the fraction of the alignment that these methods get wrong (part of an alignment can be right, wrong or simply not aligned, so it’s possible to get a lower fraction wrong whilst getting the same fraction right). It turns out that on average MP-T gets less of an alignment wrong for simple reasons of combinatorics: for a pair of proteins, the number of possible multiple sequence alignments is much greater than the number of possible profile-profile alignments. This means that, just by chance, the number of incorrectly aligned positions between the two sequences of interest will be lower for MP-T than for a conventional profile-profile alignment method.

Now for a little sales-pitch. The source code for MP-T is freely available and easy to expand (if you have a passing familiarity with Haskell). Only two or three lines of code need to be changed to define a new set of protein environments, and to feed it a substitution table for each environment. I’d be happy to help anyone who wants to try it out.

Making small molecules look good in PyMOL

Another largely plagiarized post for my “personal notes” (thanks Justin Lorieau!) and following on from the post about pretty-fication of macromolecules.  For my slowly-progressing confirmation report I needed some beautiful small molecule representation.  Here is some PyMOL code:

show sticks
set ray_opaque_background, off
set stick_radius, 0.1
show spheres
set sphere_scale, 0.15, all
set sphere_scale, 0.12, elem H
color gray40, elem C
set sphere_quality, 30
set stick_quality, 30
set sphere_transparency, 0.0
set stick_transparency, 0.0
set ray_shadow, off
set orthoscopic, 1
set antialias, 2
ray 1024,768

And the result:

ligand

Beautiful, no?

A javascript function to validate FASTA sequences

I was more than a bit annoyed of not finding this out there in the interwebs, being a strong encourager of googling (or in Jamie’s case duck-duck-going) and re-use.

So I proffer my very own fasta validation javascript function.

/*
 * Validates (true/false) a single fasta sequence string
 * param   fasta    the string containing a putative single fasta sequence
 * returns boolean  true if string contains single fasta sequence, false 
 *                  otherwise 
 */
function validateFasta(fasta) {

	if (!fasta) { // check there is something first of all
		return false;
	}

	// immediately remove trailing spaces
	fasta = fasta.trim();

	// split on newlines... 
	var lines = fasta.split('\n');

	// check for header
	if (fasta[0] == '>') {
		// remove one line, starting at the first position
		lines.splice(0, 1);
	}

	// join the array back into a single string without newlines and 
	// trailing or leading spaces
	fasta = lines.join('').trim();

	if (!fasta) { // is it empty whatever we collected ? re-check not efficient 
		return false;
	}

	// note that the empty string is caught above
	// allow for Selenocysteine (U)
	return /^[ACDEFGHIKLMNPQRSTUVWY\s]+$/i.test(fasta);
}

Let me know, by comments below, if you spot a no-no.  Please link to this post if you find use for it.

p.s. I already noticed that this only validates one sequence.  This is because this function is taken out of one of our web servers, Memoir, which specifically only requires one sequence.  If there is interested for multi sequence validation I will add it.