RDKit is very fussy when it comes to inputs in SDF format. Using the SDMolSupplier, we get a significant rate of failure even on curated datasets such as the PDBBind refined set. PyMOL has no such scruples, and with that, I present a function which has proved invaluable to me over the course of my DPhil. For reasons I have never bothered to explore, using PyMOL to convert from SDF to mol2 and back to SDF again (adding missing hydrogens along the way) will almost always make a molecule safe to import using RDKit:
from pathlib import Path

from pymol import cmd


def py_mollify(sdf, overwrite=False):
    """Use pymol to sanitise an SDF file for use in RDKit.

    Arguments:
        sdf: location of faulty sdf file
        overwrite: whether or not to overwrite the original sdf. If False,
            a new file will be written in the form <sdf_fname>_pymol.sdf

    Returns:
        Filename of the sanitised output: the original sdf filename if
        overwrite == True, else <sdf_fname>_pymol.sdf.
    """
    sdf = Path(sdf).expanduser().resolve()
    # Build output names from the file stem so only the extension part changes
    mol2_fname = sdf.with_name(sdf.stem + '_pymol.mol2')
    new_sdf_fname = sdf if overwrite else sdf.with_name(sdf.stem + '_pymol.sdf')

    # Start from an empty session so only this molecule gets saved
    cmd.reinitialize()
    cmd.load(str(sdf))
    cmd.h_add('all')  # add in missing hydrogens
    cmd.save(str(mol2_fname))

    # Round trip: reload the intermediate mol2 and write it back out as sdf
    cmd.reinitialize()
    cmd.load(str(mol2_fname))
    cmd.save(str(new_sdf_fname))
    return str(new_sdf_fname)
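In practice it can be convenient to run the PyMOL round trip only when the first read fails. The sketch below is one way that might look. The name `load_with_fallback` and its two callable arguments are hypothetical and not part of RDKit or PyMOL; it assumes the reader returns `None` on failure, which is how RDKit's `Chem.MolFromMolFile` reports a molecule it cannot parse or sanitise.

```python
def load_with_fallback(sdf, read_mol, sanitise):
    """Read a molecule, repairing the file and retrying once on failure.

    Hypothetical helper (not an RDKit or PyMOL function).
    read_mol: callable taking a filename, returning a molecule or None
    sanitise: callable taking a filename, returning the repaired filename
    Returns (molecule_or_None, whether_a_repair_was_attempted).
    """
    sdf = str(sdf)
    mol = read_mol(sdf)
    if mol is not None:
        return mol, False  # parsed first time, no repair needed
    fixed = sanitise(sdf)  # e.g. the PyMOL round trip
    return read_mol(str(fixed)), True  # may still be None if the repair failed


# Possible usage with RDKit and py_mollify ('ligand.sdf' is a placeholder):
#   from rdkit import Chem
#   mol, repaired = load_with_fallback('ligand.sdf', Chem.MolFromMolFile, py_mollify)
```

Passing the reader and the sanitiser in as arguments keeps the retry logic independent of either library, and the returned flag makes it easy to count how many files in a dataset needed the PyMOL treatment.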