In deep learning based compound generation models the metric of fraction of RDKit-valid compounds is ubiquitous, but is problematic from the cheminformatics viewpoint as a large fraction may be driven by pyrrolic nitrogens (see below) rather than Texas carbons (carbon with 5 bonds like the Star of Texas). In RDKit, no error is more irksome that the KekulizeException
or ValenceException
from RDKit sanitisation. These are raised when the molecule is not correct. This would make the RDKit-valid a good metric, except for a small detail: the validity is as interpreted from the the stated implicit and explicit hydrogens and formal charges on the atoms, which most models do not assign. Therefore, a compound may not be RDKit-valid because it is actually impossible, like a Texas carbon, but in many cases it is because the formal charge or implicit hydrogen numbers of some atoms are incorrect. In both case, the major culprit is nitrogen. Herein I go through what they are and how to fix them, with a focus on aromatic nitrogens.
Continue reading →