Bad chemistry in old protein-ligand binding complex data set

The Astex Diverse set [1] is a dataset containing the crystallized poses of 85 protein-ligand complexes. It was introduced in 2007 to address problems in previous datasets such as incorrect ligand representation.

Loading the 85 ligand files with today’s version of the cheminformatics toolkit RDKit [2] is, however, not as straightforward as you might expect.

29 of the 85 files fail RDKit’s sanitization checks because each contains a neutral nitrogen atom bonded to two carbon atoms and two hydrogen atoms. That gives the nitrogen four bonds, one more than the three a neutral nitrogen can form.
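As a minimal sketch of how such a failure can be detected, the snippet below parses a molecule without sanitization and then asks RDKit which sanitization step fails. The helper name `sanitization_failure` and the SMILES stand-ins (dimethylamine with one hydrogen too many, and its charged counterpart) are our own illustration and are not taken from the Astex Diverse set. For the real ligand files one would presumably use a file reader such as `Chem.MolFromMolFile(path, sanitize=False, removeHs=False)`, depending on the file format.

```python
from rdkit import Chem


def sanitization_failure(mol):
    """Return the sanitization step that failed, or None if the molecule passes."""
    # Work on a copy because SanitizeMol modifies the molecule in place.
    probe = Chem.Mol(mol)
    # With catchErrors=True, RDKit returns the failed step instead of raising.
    failed = Chem.SanitizeMol(probe, catchErrors=True)
    if failed == Chem.SanitizeFlags.SANITIZE_NONE:
        return None
    return failed


# Hypothetical stand-ins: a neutral nitrogen with two carbons and two hydrogens,
# and the same atom carrying a positive charge.
bad = Chem.MolFromSmiles("C[NH2]C", sanitize=False)
charged = Chem.MolFromSmiles("C[NH2+]C", sanitize=False)

bad_result = sanitization_failure(bad)          # a failed step is reported
charged_result = sanitization_failure(charged)  # None, the molecule is valid
```

Parsing with `sanitize=False` matters here: with the default settings RDKit would return `None` for the invalid molecule, and we could no longer inspect or repair it.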

Figure: Ligand LI9 of complex 1YWR

Luckily, we have ways to rectify this situation. First, we can download a chemically valid ligand file with the same coordinates from the PDB [3], at the cost of losing the protonation state that was chosen for the Astex Diverse set.

Alternatively, we can either add a positive charge to the nitrogen or delete one hydrogen atom to satisfy the sanitization checks. Adding a positive charge to ligand LI9 shown above yields a protonated secondary amine. With a pKa of 0.8, this protonated form would be almost entirely deprotonated near physiological pH, so it is unlikely to be the intended structure of the ligand.
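To put a number on this argument, the Henderson-Hasselbalch relation gives the fraction of molecules that stay protonated at a given pH. The sketch below assumes a physiological pH of 7.4, which is our choice for illustration; the function name is likewise hypothetical.

```python
def protonated_fraction(pka, ph):
    """Fraction of molecules in the protonated form (Henderson-Hasselbalch)."""
    # [BH+] / ([BH+] + [B]) = 1 / (1 + 10^(pH - pKa))
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# pKa from the text; pH 7.4 is an assumed value for "physiological pH".
fraction = protonated_fraction(pka=0.8, ph=7.4)
print(f"{fraction:.1e}")  # prints 2.5e-07
```

Under these assumptions, only about one molecule in four million would carry the extra proton.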

So the best option in this case is to remove one of the two hydrogens. We only have to be careful to follow the Astex Diverse set’s intended protocol and optimize the positions of the remaining hydrogen atoms with a force field.
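The removal step could be sketched as follows. This is one possible approach rather than a definitive procedure: the function name `drop_extra_hydrogen` is hypothetical, the SMILES string is a stand-in for a ligand file read with explicit hydrogens, and the rule "neutral nitrogen with four neighbors" is our reading of the problem described above.

```python
from rdkit import Chem


def drop_extra_hydrogen(mol):
    """Remove one hydrogen atom from each neutral nitrogen that has four neighbors."""
    rw = Chem.RWMol(mol)  # editable copy; the input molecule is left untouched
    to_remove = []
    for atom in rw.GetAtoms():
        if atom.GetAtomicNum() != 7 or atom.GetFormalCharge() != 0:
            continue
        if atom.GetDegree() != 4:
            continue
        hydrogens = [n.GetIdx() for n in atom.GetNeighbors() if n.GetAtomicNum() == 1]
        if hydrogens:
            to_remove.append(hydrogens[0])
    # Delete from the highest index down so the remaining indices stay valid.
    for idx in sorted(to_remove, reverse=True):
        rw.RemoveAtom(idx)
    fixed = rw.GetMol()
    Chem.SanitizeMol(fixed)  # raises if the molecule is still invalid
    return fixed


# Hypothetical stand-in: with sanitize=False the [H] atoms are kept as real atoms,
# as they would be when reading a ligand file with removeHs=False.
broken = Chem.MolFromSmiles("C[N]([H])([H])C", sanitize=False)
fixed = drop_extra_hydrogen(broken)
```

The remaining hydrogen still sits where it was placed for a nitrogen with four substituents, which is why its position should be re-optimized afterwards. In RDKit this could plausibly be done with a force field such as MMFF94 while keeping the heavy atoms fixed, for example via `AllChem.MMFFGetMoleculeForceField` and `AddFixedPoint`, although the force field used in the original protocol is not specified here.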

[1] Hartshorn, M. J. et al. Diverse, high-quality test set for the validation of protein−ligand docking performance. J. Med. Chem. 50, 726–741 (2007).
[2] RDKit: Open-source cheminformatics. https://www.rdkit.org
[3] Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
