Finding the parts in common between two molecules appears to be a straightforward, but actually is a maze of layers. The task, maximum common substructure (MCS) searching, in RDKit is done by Chem.rdFMCS.FindMCS
, which is highly customisable with lots of presets. What if one wanted to control in minute detail if a given atom X and is a match for atom Y? There is a way and this is how.
Aim
First, there are several resources on the basics of MCS, such as Greg Landrum’s tutorial on the subject. Herein, I want to discuss a specific use case: full control over the atom matching itself.
There are two options: calculate all combinations and filter them, or create a custom atom matcher. The first option will grind to a halt very quickly, so the second option is often the only viable one.
For Fragmenstein in order to map a molecule onto overlapping 3D molecules I had to resort to the latter deep in the code.
Basics
As the basics are well covered elsewhere, I am going to wizz through them.
To find the MCS between two molecules, the following code can be used.
The function `name2mol` (get a mol from a name via NIH Cactus) and `flatgrid` (a `Draw.MolsToGrid` wrapper) are defined in the footnote.
from rdkit import Chem from rdkit.Chem import rdFMCS capsaicin = name2mol('capsaicin') jasminaldehyde = name2mol('jasminaldehyde') result: rdFMCS.MCSResult = rdFMCS.FindMCS([capsaicin, jasminaldehyde]) flatgrid([capsaicin, jasminaldehyde, Chem.MolFromSmarts(result.smartsString)])
This is not what one would expect, because one needs to tweak the parameters of the search to match that expectation.
The function rdFMCS.FindMCS
went throw several evolutions and is overloaded. The first older/messier way is that it accepts several arguments, which I would say fall into two groups:
- The boolean arguments
matchValences
,ringMatchesRingOnly
,completeRingsOnly
,matchChiralTag
- The compare enum arguments
atomCompare
,bondCompare
,ringCompare
The two are mostly mutually exclusive but don’t have a univocal mapping (ringMatchesRingOnly
, completeRingsOnly
are both covered by ringCompare
). The compare enums are more powerful.
For example, here are the values they can take:
atomCompare
argument accepts therdFMCS.AtomCompare
enum, which has the values
❧CompareAny
,CompareElements
,CompareIsotopes
,CompareAnyHeavyAtom
.bondCompare
wantsrdFMCS.BondCompare
❧CompareAny
,CompareOrder
,CompareOrderExact
ringCompare
wantsrdFMCS.RingCompare
❧IgnoreRingFusion
,PermissiveRingFusion
,StrictRingFusion
The newer one, accepts a single parameter object (rdFMCS.MCSParameters
). This objects two attributes with boolean-holding namedtuple-like objects (AtomCompareParameters
and BondCompareParameters
), which are a reshuffle of the boolean arguments; and two “typer” attributes (AtomTyper
and BondTyper
), which can be more than just the enums and allows some nice stuff.
For Fragmenstein, I was in need of transpiling the former system into the latter as arguments may be passed in the former system but the latter was used. As a result I had a function to do so:
def transmute_FindMCS_parameters( **mode: Unpack[ExtendedFMCSMode]) -> rdFMCS.MCSParameters: # noqa lowercase not applicable """ The function ``rdFMCS.FindMCS`` has two ways of being used. In one, a series of arguments are passed, in another a ``rdFMCS.MCSParameters`` object is passed (a wrapped C++ structure). Unfortunately, there does not seem to be a way to transmute the former into the other. Hence, this function The ``params.AtomTyper`` and ``params.BondTyper`` members can be either * the enum ``rdFMCS.AtomCompare`` or ``rdFMCS.BondCompare`` * or a subclass of ``rdFMCS.MCSBondCompare`` or ``rdFMCS.MCSAtomCompare`` The ``rdFMCS.RingCompare`` appears to be absorbed into the ``params.BondCompareParameters`` member. """ params = rdFMCS.MCSParameters() # three integers: https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/FMCS/FMCS.h # they are not rdFMCS.AtomCompare rdFMCS.BondCompare and rdFMCS.RingCompare enums? # atom parameters atomCompare: rdFMCS.AtomCompare = mode.get('atomCompare', rdFMCS.AtomCompare.CompareElements) params.AtomTyper = atomCompare # int or a callable params.AtomCompareParameters.MatchIsotope = atomCompare == rdFMCS.AtomCompare.CompareIsotopes params.AtomCompareParameters.CompleteRingsOnly = mode.get('completeRingsOnly', False) params.AtomCompareParameters.MatchChiralTag = mode.get('matchChiralTag', False) params.AtomCompareParameters.MatchValences = mode.get('matchValences', False) params.AtomCompareParameters.RingMatchesRingOnly = mode.get('ringMatchesRingOnly', False) # bond parameters bondCompare: rdFMCS.BondCompare = mode.get('bondCompare', rdFMCS.BondCompare.CompareOrder) ringCompare: rdFMCS.RingCompare = mode.get('ringCompare', rdFMCS.RingCompare.IgnoreRingFusion) params.BondTyper = bondCompare params.BondCompareParameters.CompleteRingsOnly = mode.get('completeRingsOnly', False) params.BondCompareParameters.MatchFusedRings = ringCompare != rdFMCS.RingCompare.IgnoreRingFusion params.BondCompareParameters.MatchFusedRingsStrict = ringCompare == rdFMCS.RingCompare.StrictRingFusion params.BondCompareParameters.RingMatchesRingOnly = mode.get('ringMatchesRingOnly', False) params.Threshold = mode.get('threshold', 1.0) params.MaximizeBonds = mode.get('maximizeBonds', True) params.Timeout = mode.get('timeout', 3600) params.Verbose = mode.get('verbose', False) params.InitialSeed = mode.get('seedSmarts', '') # parameters with no equivalence (i.e. made up) params.BondCompareParameters.MatchStereo = mode.get('matchStereo', False) params.AtomCompareParameters.MatchFormalCharge = mode.get('matchFormalCharge', False) params.AtomCompareParameters.MaxDistance = mode.get('maxDistance', -1) # params.ProgressCallback Deprecated return params
Comparators
This meandering around legacy code-burden is to get to the new comparator objects. The atomTyper
attribute accepts the aforementioned enums, but also can accept a callable… such as a subclass of rdFMCS.MCSAtomCompare
or rdFMCS.MCSBondCompare
. These are callable class instances that when called with the parameter obejct, the two molecules and the two atoms or bonds that are to be compare, they return a boolean verdict. The largest number of contiguous truths wins!
Obviously, there’s a catch. RDKit is PyBoost11–wrapped C++ code —as is evident by its flagrant disregard for PEP8 guidelines turned dogmas. In the case of subclassing a wrapped C++-class one cannot do super calls to unwrapped methods of the base class as the Python signatures will not match the C++ ones.
As a result the method __call__ needs to be written without parent calls:
class VanillaCompareAtoms(rdFMCS.MCSAtomCompare): """ Give than the following does not work, one has to do it in full: .. code-block:: python :caption: This will raise an error: super().__call__(parameters, mol1, atom_idx1, mol2, atom_idx2) This class replicates the vanilla functionality """ def __init__(self, comparison: rdFMCS.AtomCompare = rdFMCS.AtomCompare.CompareAnyHeavyAtom): """ Whereas the atomCompare is an enum, this is a callable class. But in parameters there is no compareElement booleans etc. only Isotope... In https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/FMCS/Wrap/testFMCS.py it is clear one needs to make one's own. """ super().__init__() # noqa the p_obect (python object aka ctypes.py_object) is not used self.comparison = comparison def __call__(self, # noqa signature matches... it is just Boost being Boost parameters: rdFMCS.MCSAtomCompareParameters, mol1: Chem.Mol, atom_idx1: int, mol2: Chem.Mol, atom_idx2: int) -> bool: a1: Chem.Atom = mol1.GetAtomWithIdx(atom_idx1) a2: Chem.Atom = mol2.GetAtomWithIdx(atom_idx2) # ------- isotope ------------------------ if parameters.MatchIsotope and a1.GetIsotope() != a2.GetIsotope(): # noqa return False elif self.comparison == rdFMCS.AtomCompare.CompareIsotopes and a1.GetIsotope() != a2.GetIsotope(): # noqa return False elif self.comparison == rdFMCS.AtomCompare.CompareElements and a1.GetAtomicNum() != a2.GetAtomicNum(): # noqa return False elif self.comparison == rdFMCS.AtomCompare.CompareAnyHeavyAtom \ and (a1.GetAtomicNum() == 1 or a2.GetAtomicNum() == 1): # noqa return False elif self.comparison == rdFMCS.AtomCompare.CompareAny: pass # ------- valence ------------------------ if parameters.MatchValences and a1.GetTotalValence() != a2.GetTotalValence(): # noqa return False # ------- chiral ------------------------ if parameters.MatchChiralTag and not self.CheckAtomChirality(parameters, mol1, atom_idx1, mol2, atom_idx2): return False # ------- formal ------------------------ if parameters.MatchFormalCharge and not self.CheckAtomCharge(parameters, mol1, atom_idx1, mol2, atom_idx2): return False # ------- ring ------------------------ if parameters.RingMatchesRingOnly and not self.CheckAtomRingMatch(parameters, mol1, atom_idx1, mol2, atom_idx2): return False # ------- complete ------------------------ if parameters.CompleteRingsOnly: # don't know its intended to be used pass # ------- distance ------------------------ if parameters.MaxDistance: # integer... # todo fill! pass return True
Now, we can subclass it and change the call as we please.
In my case, I wrote a class that accepts a mapping that has to be obeyed. For that I still however, need to add a method to filter out the matching substructures, but that is an easy feat as there is already a method that says when two atoms match, i.e. the __call__
method of the comparator!
Edit
In recent versions, the comparisons are called by rdFMCS.FindMCS not only between molA to molB atoms, but also molB to molB atoms, presumably to address isomorphisms.
Footnote
The following snippet to gets me a molecule from a name using Cactus:
import requests from urllib.parse import quote from rdkit import Chem from rdkit.Chem import AllChem def name2mol(name) -> Chem.Mol: """ Using Cactus get a molecule from a name. """ response: requests.Response = requests.get( f'https://cactus.nci.nih.gov/chemical/structure/{quote(name)}/smiles') response.raise_for_status() smiles: str = response.text mol: Chem.Mol = Chem.MolFromSmiles(smiles) mol.SetProp('_Name', name) AllChem.EmbedMolecule(mol) return mol
I always want to see the molecules on a grid but in 2D, so:
from rdkit.Chem import Draw from typing import List from IPython.display import Image def flatgrid(mols, *args, **kwargs) -> Image: copies: List[Chem.Mol] = [Chem.Mol(m) for m in mols] *map(AllChem.Compute2DCoords, copies), # noqa, it's in place if 'legends' not in kwargs: kwargs['legends'] = [m.GetProp('_Name') if m.HasProp('_Name') else '-' for m in mols] return Draw.MolsToGridImage(copies, *args, **kwargs)
Obviously, in the main text I have a figure, so the code was in reality:
from IPython.display import Image img: Image = flatgrid([capsaicin, jasminaldehyde, Chem.MolFromSmarts(result.smartsString)]) with open('molecules.png', 'wb') as fh: fh.write(image.data)