A brief overview and discussion of: Automatic recognition of ligands in electron density by machine learning. This paper aims to reduce the bias introduced when crystallographers fit ligands into electron density for protein-ligand complexes. The authors train supervised machine learning models on known ligand sites from across the whole Protein Data Bank (PDB), producing a classifier that identifies which common ligands could fit a given region of electron density.
Crystallographic refinement is an iterative process that aims to improve the quality of the protein model, which in turn improves the associated phases and electron density maps. As proteins are composed of the 20 standard amino acids, there is a well-determined set of restraints on the possible conformations of protein structure. These are used in automated protein structure building programs such as phenix.autobuild or Buccaneer.
Potential binding ligands are drawn from a very large chemical space and can adopt different conformations on binding. Geometrical restraints can be defined from small-molecule crystal structures, from the distribution of conformers in the PDB, or from predicted values. Ligand density needs to be modelled after the protein model is near completion, and is often weaker due to lower occupancy. Tools that fit ligands into electron density, such as FLYNN, RHOFIT and phenix.ligandfit, have therefore typically been limited to fitting a single known ligand. This is problematic: once a ligand is fitted into electron density, the phases are biased towards the model used, which encourages fitting ligands into density that does not support them well.
The authors utilise existing conformations of common ligands present in the Protein Data Bank. They extract the corresponding electron density and describe each blob of density with numerical descriptors. These include Zernike moment invariants, the bounding box volume of the density, and the differences between features at different contour levels of the electron density. The feature set is reduced from 382 features to the top 60 descriptors by recursive feature selection. These descriptors are the training data supplied to four machine learning models: k-nearest neighbours, random forests, a gradient boosting machine, and a stacked combination of the three previous methods. The data are split into three datasets: CMB (CheckMyBlob) contains the majority of the data, filtered by quality metrics that the authors set to retain ligand sites of sufficient quality, while TAMC and CL are used to replicate other ligand fitting papers.
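To make the descriptor idea concrete, here is a minimal sketch of one of the simpler feature types mentioned above: the bounding box volume of a blob at several contour levels, and the difference between levels. It is an illustration only, not the authors' code. The function names, the sparse dictionary representation of the map, the contour levels and the toy blob are all assumptions made for this example.

```python
from itertools import product


def bounding_box_volume(density, level, voxel_volume=1.0):
    """Volume of the axis-aligned box enclosing all voxels at or above `level`.

    `density` is a hypothetical sparse map: {(i, j, k): density value}.
    """
    above = [idx for idx, rho in density.items() if rho >= level]
    if not above:
        return 0.0
    # Extent along each axis, counted in voxels.
    extents = [max(axis) - min(axis) + 1 for axis in zip(*above)]
    return extents[0] * extents[1] * extents[2] * voxel_volume


def blob_descriptors(density, levels=(1.0, 2.0), voxel_volume=1.0):
    """Bounding box volume per contour level, plus differences between levels."""
    volumes = [bounding_box_volume(density, lv, voxel_volume) for lv in levels]
    deltas = [low - high for low, high in zip(volumes, volumes[1:])]
    return {"volumes": volumes, "deltas": deltas}


# Toy blob (made-up values): a 4x4x4 block at 1.5 sigma
# with a 2x2x2 core at 3.0 sigma.
density = {idx: 1.5 for idx in product(range(4), repeat=3)}
density.update({idx: 3.0 for idx in product(range(1, 3), repeat=3)})

features = blob_descriptors(density)
print(features)
```

A blob that shrinks sharply between contour levels, as this toy one does, has a weak outer shell around a strong core, which is the kind of shape information such difference features are presumably meant to capture.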
The reported accuracy of the stacked machine learning model (the best performing) on the largest dataset was 0.575. Although this top-1 accuracy is relatively low, the current use case of the software is to present a crystallographer with a list of the most common ligands that could be fitted to the electron density, so the accuracy in the top n is the more relevant measure. The top-5 accuracy is reported as 0.852 and the top-10 as 0.913, which means a crystallographer would most likely be presented with an appropriate common ligand for a blob within reasonably few tries.
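For readers unfamiliar with top-n accuracy, the following sketch shows how it can be computed from ranked predictions: a prediction counts as a hit if the true ligand appears anywhere in the first n suggestions. The function name and the toy rankings are invented for illustration and are not taken from the paper.

```python
def top_n_accuracy(ranked_predictions, true_labels, n):
    """Fraction of blobs whose true ligand is among the first n suggestions."""
    hits = sum(
        truth in ranked[:n]
        for ranked, truth in zip(ranked_predictions, true_labels)
    )
    return hits / len(true_labels)


# Made-up ranked suggestions for four blobs (most likely ligand first).
ranked = [
    ["GOL", "SO4", "EDO"],
    ["SO4", "PO4", "GOL"],
    ["HEM", "NAG", "ZN"],
    ["EDO", "GOL", "PEG"],
]
truth = ["GOL", "PO4", "ZN", "ATP"]

top1 = top_n_accuracy(ranked, truth, 1)
top2 = top_n_accuracy(ranked, truth, 2)
top3 = top_n_accuracy(ranked, truth, 3)
print(top1, top2, top3)  # 0.25 0.5 0.75
```

The last blob is never a hit because its true ligand is not in the suggestion list at all, which mirrors the situation of an uncommon ligand that the classifier was never trained to propose.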
The CheckMyBlob routine presents the user with only the most common ligands in the PDB, so its suggestions would need to be compared against the quality of fit (i.e. Real Space Correlation Coefficient, RSCC) of any known but novel or uncommon ligand bound in the structure. Furthermore, the model does not appear to estimate the likelihood that no common ligand is a good fit to the density. This is problematic both for automated fitting of ligands into density and for human modellers, as it becomes difficult to test the hypothesis of a novel ligand, or that the density is better explained without a ligand, for example by alternate protein conformations.
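As a reminder of what such a comparison involves, RSCC is the Pearson correlation between the observed density and the density calculated from the model, sampled at the same grid points around the ligand. The sketch below assumes the two sets of density values have already been extracted as equal-length lists. Real programs differ in which grid points and which map types they use, and the candidate names and values here are made up.

```python
from math import sqrt


def rscc(rho_obs, rho_calc):
    """Pearson correlation between observed and calculated density values."""
    n = len(rho_obs)
    mean_obs = sum(rho_obs) / n
    mean_calc = sum(rho_calc) / n
    cov = sum((o - mean_obs) * (c - mean_calc) for o, c in zip(rho_obs, rho_calc))
    var_obs = sum((o - mean_obs) ** 2 for o in rho_obs)
    var_calc = sum((c - mean_calc) ** 2 for c in rho_calc)
    return cov / sqrt(var_obs * var_calc)


# Made-up density samples at six grid points around a blob.
observed = [0.2, 0.9, 1.4, 1.1, 0.5, 0.1]
candidates = {
    "common_ligand": [0.8, 0.7, 0.6, 0.9, 0.8, 0.7],  # hypothetical poor fit
    "novel_ligand": [0.3, 1.0, 1.5, 1.0, 0.4, 0.2],   # hypothetical good fit
}

scores = {name: rscc(observed, calc) for name, calc in candidates.items()}
best = max(scores, key=scores.get)
print(best)
```

Scoring every candidate on the same metric, including ligands outside the classifier's list, is one way to check whether a suggested common ligand really explains the density better than the alternatives.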
The authors present a tool that is likely to be very valuable to crystallographers in speeding up the fitting of common co-factors. It should reduce the bias towards fitting the soaked or “desired” ligand into any area of unexplained electron density. Further work is likely to be directed at improving the accuracy, with a view to removing the human user altogether, and at enabling the tool to propose that no common ligand fits the electron density blob well.
The software is currently available as source code, and would likely benefit the community most, at least initially, if it were developed into a web server.