Author Archives: Elliot Nelson

Which fragment first?

Crystallographic fragment based lead discovery is a now a routine technique, which can sample 1000’s of compounds per week. But how do we identify the most appropriate compounds to screen against our target of interest?

Continue reading →

Check My Blob

A brief overview and discussion of: Automatic recognition of ligands in electron density by machine learning .This paper aims to reduce the bias of crystallographers fitting ligands into electron density for protein ligand complexes. The authors train a supervised machine learning model using known ligand sites across the whole protein databank, to produce a classifier that can identify which common ligands could fit to that electron density.

Continue reading →

Cinder: Crystallographic Tinder

Protein structure determination is still dominated by xray diffraction. For diffraction studies structural biologists need to grow and optimise protein crystals until they diffract to an usable and optimal resolution. A purified protein sample is exposed to a number of crystallisation screens, each comprising a selection of chemical conditions that are designed to explore a reasonably wide area of potential crystallisation conditions.

Many crystallography labs routinely image these in large plate storage systems, which reduces the human interaction to viewing a set of usually 100-1000 images at various time points. This is a slow and laborious process, and highly applicable to machine learning approaches tailored to looking at images. TexRank, a texton analysis ranking software was developed by Jia Tsing in OPIG and is used at the Structural Genomics Consortium (SGC). This ranking reduces the number of images that a human needs to search through, providing a quicker review process. Continue reading →

Crystallographic programming: Super short tour of the cctbx

Two of the leading packages in crystallography are Phenix and CCP4. For most practicing crystallographers they will interact via with these to progress a single crystallographic data-set from diffraction images, through integration, merging, phasing, model building and hopefully deposition.

However, if you want to develop crystallographic software, you will likely need to decide on a framework to build upon. Phenix is built on the comprehensive cctbx library, whereas CCP4 programs are typically standlone, although common crystallographic libraries such as clipper and cctbx are utilised.

CCTBX is written mainly in python, with core crystallographic functionality written in C++. My usual starting place for understanding functionality is through the pdb parser tutorial. This introduces the concept of a hierarchy, a iterative way to represent a macromolecule:

from iotbx.pdb import hierarchy
pdb_in = hierarchy.input(file_name="model.pdb")
for chain in pdb_in.hierarchy.only_model().chains() :
  for residue_group in chain.residue_groups() :
    for atom_group in residue_group.atom_groups() :
      for atom in atom_group.atoms() :
        if (atom.element.strip().upper() == "ZN") :
          atom_group.remove_atom(atom)
      if (atom_group.atoms_size() == 0) :
        residue_group.remove_atom_group(atom_group)
    if (residue_group.atom_groups_size() == 0) :
      chain.remove_residue_group(residue_group)
f = open("model_Zn_free.pdb", "w")
f.write(pdb_in.hierarchy.as_pdb_string(
  crystal_symmetry=pdb_in.input.crystal_symmetry()))
f.close()

Although there are many ways to parse a pdb file, the introduction to iotbx.pdb, gives a view of how xray structure data can be associated to the model. The tour of the cctbx can be helpful starting place, especially for understanding how the python and c++ functionality interact through boost and the scitbx.array_family.flex. Unfortunately, documentation on cctbx tends to vary in quality and quantity throughout the modules:

Other components of the library include ways to simulate crystallographic data through simtbx, and tools for processing xfel data.

As the library is open source, github hosted source code allows exploration of previously written routines, which can be very helpful for understanding the inner workings of the library. Note that there are also bulletin boards for users and developers of phenix and cctbx respectively. A few tutorials can also be found.

Hopefully this post will give someone other than me a reminder of where to find resources to get started developing within CCTBX.

Proteins evolve on the edge of supramolecular self-assembly

Inspired by Eoin’s interesting talks on prions and prion diseases, and Nick’s discussion of how Cyro-Electron microscopy is going to be the end of an era for Crystallography. I thought I’d look at a paper that discusses aggregation of protein complexes, with some cryo-electron microscopy thrown in for good measure.

a, A molecule gaining a single self-interacting patch forms a finite dimer. A self-interacting patch repeated on opposite sides of a symmetric molecule can result in infinite assembly. b, A point mutation in a dihedral octamer creates a new self-interacting patch (red), triggering assembly into a fibre.

Supramolecular assemblies are folded protein complexes forming into much larger units. This formation can be triggered by a mutation on a copy of the constituent homomers of the complex, acting as a self-interacting patch. If this patch were to form in a non-symmetric complex, it would likely form a finite assemble with a limited number of copies of the complex. However, if the complex has dihedral symmetry such that a patch is accessible at multiple separated locations, then complex can potentially form near infinite supramolecular assemblies. Continue reading →

CCP4 Study Weekend 2017: From Data to Structure

This year’s CCP4 study weekend focused on providing an overview of the process and pipelines available, to take crystallographic diffraction data from spot intensities right through to structure. Therefore sessions included; processing diffraction data, phasing through molecular replacement and experimental techniques, automated model building and refinement. As well as updates to CCP4 and where is crystallography going to take us in the future?

Surrounding the meeting there was also a session for Macromolecular (MX) crystallography users of Diamond Light Source (DLS), which gave an update on the beamlines, and scientific software, as well as examples of how fragment screening at DLS has been used. The VMXi (Versatile Macromolecular X-tallography in-situ) beamline is being developed to image crystals that are forming in situ crystallisation plates. This should allow for crystallography to be optimized, as crystallization conditions can be screened, and data collected on experiments as they crystallise, especially helpful in cases where crystallisation has routinely led to non-diffracting crystals. VXMm is a micro/nanofocus MX beamline, which is in development, with a focus to get crystallographic from very small crystals (~300nm to 10 micron diameters, with a bias to the smaller size), thereby allowing crystallography of targets that have previously been hard to get sufficient crystals. Other updates included how technology developed for fast solid state data collection on x-ray free electron lasers (XFEL) can be used on synchrotron beamlines.

A slightly more in-depth discussion of two tools presented that were developed for use alongside and within CCP4, which might be of interest more broadly:

ConKit: A python interface for contact prediction tools

Contact prediction for proteins, at its simplest, involves estimating which residues within a certain certain spatial proximity of each other, given the sequence of the protein, or proteins (for complexes and interfaces). Two major types of contact prediction exist:

Evolutionary Coupling
- Take a series of sequence homologues, and identifying co-evolved residues from multiple sequence alignment of the protein family. These co-evolved residues are hypothesized to share a functional dependence. Discussed previously on BLOPIG: Predicted protein contacts: is it the solution to (de novo) protein structure prediction?
Supervised machine learning
- Using ab initio structure prediction tools, without sequence homologues, to predict which contacts exist, but with a much lower accuracy than evolutionary coupling.

fullscreen

ConKit is a python interface (API) for contact prediction tools, consisting of three major modules:

Core: A module for constructing hierarchies, thereby storing necessary data such as sequences in a parsable format.
- Providing common functionality through functions that for example declare a contact as a false positive.
Application: Python wrappers for common contact prediction and sequence alignment applications
- CCMPred
- CdHit
- HHblits
- HHfilter
- Jackhmmer
- Psicov
- BbContacts
I/O: I/O interface for file reading, writing and conversions.

Contact prediction can be used in the crystallographic structure determination field, during unconventional molecular replacement, using a tool such as AMPLE. Molecular replacement is a computational strategy to solve the phase problem. In the typical case, by using homologous structures to determine an estimate a model of the protein, which best fits the experimental diffraction intensities, and thus estimate the phase. AMPLE utilises ab initio modeling (using Rosetta) to generate a model for the protein, contact prediction can provide input to this ab initio modeling, thereby making it more feasible to generate an appropriate structure, from which to solve the phase problem. Contact prediction can also be used to analyse known and unknown structures, to identify potential functional sites.

For more information: Talk given at CCP4 study weekend (Felix Simkovic), ConKit documentation

ACEDRG: Generating Crystallographic Restraints for Ligands

Small molecule ligands are present in many crystallographic structures, especially in drug development campaigns. Proteins are formed (almost exclusively) from a sequence containing a selection of 20 amino acids, this means there are well known restraints (for example: bond lengths, bond angles, torsion angles and rotamer position) for model building or refinement of amino acids. As ligands can be built from a much wider selection of chemical moieties, they have not previously been restrained as well during MX refinement. Ligands found in PDB depositions can be used as models for the model building/ refinement of ligands in new structures, however there are a limited number of ligands available (~23,000). Furthermore, the resolution of the ligands is limited to the resolution of the macro-molecular structure from which they are extracted.

ACEDRG utilises the crystallorgraphy open database (COD), a library of (>300,000) small molecules usually with atomic resolution data (often at least 0.84 Angstrom), to generate a dictionary of restraints to be used in refining the ligand. To create these restraints ACEDRG utilises the RDkit chemoinformatics package, generating a detailed descriptor of each atom of the ligands in COD. The descriptor utilises properties of each atom including the element name, number of bonds, environment of nearest neighbours, third degree neighbours that are aromatic ring systems. The descriptor, is stored alongside the electron density values from the COD. When a ACEDRG query is generated, for each atom in the ligand, the atom type is compared to those for which a COD structure is available, the nearest match is then used to generate a series of restraints for the atom.

ACEDRG can take a molecular description (SMILES, SDF MOL, SYBYL MOL2) of your ligand, and generate appropriate restraints for refinement, (atom types, bond lengths and angles, torsion angles, planes and chirality centers) as a mmCIF file. These restraints can be generated for a number of different probable conformations for the ligand, such that it can be refined in these alternate conformations, then the refinement program can use local scoring criteria to select the ligand conformation that best fits the observed electron density. ACEDRG can accessed through the CCP4i2 interface, and as a command line interface.

Hopefully a useful insight to some of the tools presented at the CCP4 Study weekend. For anyone looking for further information on the CCP4 Study weekend: Agenda, Recording of Sessions, Proceedings from previous years.

Network Representations of Allostery

Allostery is the process by which action at one site, such as the binding of an effector molecule, causes a functional effect at a distant site. Allosteric mechanisms are important for the regulation of cellular processes, altering the activity of a protein, or the whole biosynthetic pathway. Triggers for allosteric action include binding of small molecules, protein-protein interaction, phosphorylation events and modification of disulphide bonds. These triggers can lead to changes in accessibility of the active site, through large or small motions, such as hinge motion between two domains, or the motion of a single side chain.

Figure 1 from [1]: Rearrangement of a residue–residue interaction in phosphofructokinase. Left panel: interaction between E241 and H160 of chain A in the inactive state; right: this interaction in the active state. Red circles mark six atoms unique to the residue–residue interface in the I state, green circles mark four atoms unique to the A state, and yellow circles mark three atoms present in both states. In these two residues, there are a total of 19 atoms, so the rearrangement factor R(i,j) = max(6, 4)/19 = 0.32

One way to consider allostery is as signal propagation from one site to another, as a change in residue to residue contacts. Networks provide a way to represent these changes. Daily et al [1] introduce the idea of contact rearrangement networks, constructed from a local comparison of the protein structure with and without molecules bound to the allosteric site. These are referred to as the active and inactive structures respectively. To measure the whether a residue to residue contact is changed between the active and inactive states, the authors use a rearrangement factor (R(i,j)). This is the ratio of atoms which are within a threshold distance (5 angstroms) in only one of the active or inactive states (whichever is greater), to the total number of atoms in the two residues.The rearrangement factor is distributed such that the large majority of residues have low rearrangement factors (as they do not change between the active and inactive state). To consider when a rearrangement is significant the authors use a benchmark set of non allosteric proteins to set a threshold for the rearrangement factor. The residues above this threshold form the contact rearrangement network, which can be analysed to assess whether the allosteric and functional sites are linked by residue to residue contacts. In the paper 5/15 proteins analysed are found to have linked functional and allosteric sites.

Adaption of Figure 2 from [1]. Contact rearrangement network for phosphofructokinase. Circles in each graph represent protein residues, and red and green squares represent substrate and effector molecules, respectively. Lines connect pairs of residues with R(i,j) ≥ 0.3 and residues in the graph with any ligands which are adjacent (within 5.0 Å) in either structure. All connected components which include at least one substrate or effector molecule are shown.

Collective rigid body domain motion was not initially analysed by these contact rearrangement networks, however a later paper [2], discusses how considering these motions alongside the contact rearrangement networks can lead to a detection of allosteric activity in a greater number of proteins analysed. These contact rearrangement networks provide a way to assess the residues that are likely to be involved in allosteric signal propagation. However this requires a classification of allosteric and non-allosteric proteins, to undertake the thresholding for significance of the change in contacts, as well as multiple structures that have and do not have a allosteric effector molecule bound.

Figure 1 from [3]. (a) X-ray electron density map contoured at 1σ (blue mesh) and 0.3σ (cyan mesh) of cyclophilin A (CYPA) fit with discrete alternative conformations using qFit. Alternative conformations are colored red, orange or yellow, with hydrogen atoms added in green. (b) Visualizing a pathway in CYPA: atoms involved in clashes are shown in spheres scaled to van der Waals radii, and clashes between atoms are highlighted by dotted lines. This pathway originates with the OG atom of Ser99 conformation A (99A) and the CE1 atom of Phe113 conformation B (113B), which clash to 0.8 of their summed van der Waals radii. The pathway progresses from Phe113 to Gln63, and after the movement of Met61 to conformation B introduces no new clashes, the pathway is terminated. A 90° rotation of the final panel is shown to highlight how the final move of Met61 relieves the clash with Gln63. (c) Networks identified by CONTACT are displayed as nodes connected by edges representing contacts that clash and are relieved by alternative conformations. The node number represents the sequence number of the residue. Line thickness between a pair of nodes represents the number of pathways that the corresponding residues are part of. The pathway in b forms part of the red contact network in CYPA. (d) The six contact networks comprising 29% of residues are mapped on the three-dimensional structure of CYPA.

Alternatively, Van den Bedem et al [3] define contact networks of conformationally coupled residues, in which movement of an alternative conformation of a residue likely influences the conformations of all other residues in the contact network. They utilise qFit, a tool for exploring conformational heterogeneity in a single electron density map of a protein, by fitting alternate conformations to the electron density. For each conformation of a residue, it assesses whether it is possible to reduce steric clashes with another residue, by changing conformations. If a switch in conformations reduces steric clashes, then a pathway is extend to the neighbours of the residue that is moved. This continued until no new clashes are introduced. Pathways that share common members are considered as conformationally coupled, and grouped into a single contact network. As this technique is suitable for a single structure, it is possible to estimate residues which may be involved in allosteric signalling without prior knowledge of the allosteric binding region.

These techniques show two different ways to locate and annotate local conformational changes in a protein, and determine how they may be linked to one another. Considering whether these, and similar techniques highlight the same allosteric networks within proteins will be important in the integration of many data types and sources to inform the detection of allostery. Furthermore, the ability to compare networks, for example finding common motifs, will be important as the development of techniques such as fragment based drug discovery present crystal structures with many differently bound fragments.

[1] Daily, M. D., Upadhyaya, T. J., & Gray, J. J. (2008). Contact rearrangements form coupled networks from local motions in allosteric proteins. Proteins: Structure, Function and Genetics. http://doi.org/10.1002/prot.21800

[2] Daily, M. D., & Gray, J. J. (2009). Allosteric communication occurs via networks of tertiary and quaternary motions in proteins. PLoS Computational Biology. http://doi.org/10.1371/journal.pcbi.1000293

[3] van den Bedem, H., Bhabha, G., Yang, K., Wright, P. E., & Fraser, J. S. (2013). Automated identification of functional dynamic contact networks from X-ray crystallography. Nature Methods, 10(9), 896–902. http://doi.org/10.1038/nmeth.2592

SMEs in Research

Scientific research relies on collaboration, between academics across the world, between big and small groups of people, and between companies and universities. Many people’s first thoughts of companies involved in academia will be large multinational corporations, such as pharmaceutical or aerospace companies. However mutually beneficial relationships exist between smaller companies and academia. Scientific spin out companies, such as Oxford Nanopore Technologies, Theta Technologies or Reinnervate, are one way research is expanded from an idea held by researchers at a university and expanded into a commercial product or service. However, there are many other ways and reasons for scientists collaborating with small companies.

Small and medium sized enterprises (SMEs) are independent businesses with up to 249 people, they represent over 99% of all UK and EU businesses, and 45-50% of turnover and employment. But why would researchers be interested in involvement with SMEs? Access to unique intellectual property and innovation in the SMEs, as well as access to funding targeted at fostering research led projects in SMEs. Innovate UK, support growth by enabling and funding innovative opportunities. SMEs can access this support through Knowledge Transfer Partnerships (KTP), a three-way partnership between a graduate, academic institution and a business. These KTPs can last between 12 months and 3 years, and provide financial support for academics to monitor the graduate’s work. KTP projects lead to an average of 2 papers per project, and are attractive to SMEs as they only fund 33% of the project cost.

Impact of research outside academia is becoming increasingly important to identify as impact case studies form a part of the Research Excellence Framework (REF), making many sources of funding contingent on providing a strong case for the wider impact of the research. Work with innovation focused SMEs can provide a focus for research impact in engagement and economic terms. For example, a REF case study highlighting the impact of spintronics research in the development of non-contact sensors, through a spin out company, Salunda.

Oxford Protein Informatics Group

or "OPIG" to friends