I really hope my compounds get the green light

As a cheminformatician in a drug discovery campaign, or an algorithm developer making the perfect Figure 1, when one generates a list of compounds for a given target there is a deep desire for those compounds to be well received by the reviewer, be it a med chemist on the team or a peer reviewer. This desire persists despite scientific rigour and training, and stems from the time invested. So, to avoid the slightest shadow of the med chem grey zone, here is a hopefully handy filter against common grey-zone groups.

SMARTS

In RDKit one can check whether a compound is a substructure of another using the method Chem.Mol.HasSubstructMatch. When the smaller compound is derived from a SMARTS pattern, it can be ambiguous, allowing broader matches. Ambiguity in SMARTS is super-powerful. In a project (Arthorian Quest) I convert ‘regular’ molecules (e.g. read from SMILES) to ‘ambiguous’ ones (akin to those read from SMARTS) to tailor my searches to my needs; technically, the ‘ambiguous’ molecules have Chem.QueryAtom atoms as opposed to Chem.Atom.
One nice feature of SMARTS patterns is the ability to specify whether an atom is part of a ring or not ([R] vs. [!R]) and whether it is aromatic or not. An example of the benefit is a nitrogen bonded to an oxygen, which as an N-hydroxyl group is toxic, whereas in an isoxazole ring it is chemically useful (e.g. in flucloxacillin). Obviously, no group is perfect: an isoxazole ring is UV labile, for example.
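
As a minimal sketch of the ring-membership point, the snippet below runs the N-O pattern used later in the filter against two molecules. The two test molecules (acetohydroxamic acid and bare isoxazole) are my own illustrative picks, not compounds from this post.

```python
from rdkit import Chem

# SMARTS used later in the filter: aliphatic N single-bonded to a non-ring O.
n_hydroxyl = Chem.MolFromSmarts('[N]-[O!R]')

# Illustrative molecules (assumed examples, not from the post).
examples = {
    'acetohydroxamic acid': 'CC(=O)NO',  # exocyclic N-OH
    'isoxazole': 'c1ccon1',              # N-O inside an aromatic ring
}

hits = {}
for name, smiles in examples.items():
    mol = Chem.MolFromSmiles(smiles)
    hits[name] = mol.HasSubstructMatch(n_hydroxyl)

# Atoms parsed from SMARTS carry a query; atoms parsed from SMILES do not.
smarts_atom_has_query = n_hydroxyl.GetAtomWithIdx(0).HasQuery()
smiles_atom_has_query = Chem.MolFromSmiles('NO').GetAtomWithIdx(0).HasQuery()

print(hits)
print(smarts_atom_has_query, smiles_atom_has_query)
```

The exocyclic N-OH is caught while the ring N-O is not, because the aromatic ring atoms satisfy neither the aliphatic `N` nor the `!R` constraint.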

In RDKit, there is a class called FilterCatalog that allows a query compound to be compared against a catalogue. It is only marginally faster than running a for-loop of Chem.Mol.HasSubstructMatch calls, simply because it circumvents the continual traversal of the Boost layer between Python and C++, but it is tidier. There are some prebuilt catalogues, most notably the PAINS catalogue:

from rdkit.Chem import rdfiltercatalog
# some code will import rdkit.Chem.FilterCatalog but
# `rdkit.Chem.FilterCatalog.FilterCatalog` is `rdkit.Chem.rdfiltercatalog.FilterCatalog`
# and the latter's module contains the same functions, so I would avoid `from rdkit.Chem import FilterCatalog`
# as `FilterCatalog.FilterCatalog()` will cause mistakes.
# Also `rdfiltercatalog.FilterCatalogParams.FilterCatalogs` is an Enum, whose attributes are _not_ `FilterCatalog` catalogues.

# ## Example of a commonly used catalogue: PAINS
# these are compounds that aggregate or cause interference in assays
pains_catalog_params = rdfiltercatalog.FilterCatalogParams()
pains_catalog_params.AddCatalog(rdfiltercatalog.FilterCatalogParams.FilterCatalogs.PAINS)
pains_catalog = rdfiltercatalog.FilterCatalog(pains_catalog_params)
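
Querying such a prebuilt catalogue could look like the sketch below. The helper name `pains_alerts` and the two test SMILES are my own assumptions for illustration; the benzylidene rhodanine is a typical PAINS-like scaffold, but I make no claim here about which entry it matches.

```python
from rdkit import Chem
from rdkit.Chem import rdfiltercatalog

params = rdfiltercatalog.FilterCatalogParams()
params.AddCatalog(rdfiltercatalog.FilterCatalogParams.FilterCatalogs.PAINS)
pains_catalog = rdfiltercatalog.FilterCatalog(params)


def pains_alerts(smiles):
    """Return the descriptions of the PAINS entries matched by a SMILES (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f'Could not parse SMILES: {smiles}')
    return [entry.GetDescription() for entry in pains_catalog.GetMatches(mol)]


print(pains_catalog.GetNumEntries())  # number of PAINS patterns loaded
print(pains_alerts('CCO'))  # ethanol: too small to match any PAINS pattern
print(pains_alerts('O=C1NC(=S)SC1=Cc1ccccc1'))  # a benzylidene rhodanine
```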

However, in this case, I am after a filter that removes groups in the med chem grey zone. Although I can name a couple of FDA-approved compounds for each group, a reviewer (especially one who needs to show off) will nitpick and raise flags at them. And once one warning flag is raised, other flags get raised faster, so playing it overly safe may be advantageous (disclaimer: this is not life advice).

Advice not rules

For example, recently, in one campaign for a neuronal target I was involved in, a follow-up compound was forming interesting interactions via a nitro group, yet it was killed because a nitro group can be toxic, even though flunitrazepam (Rohypnol) is a BBB-crossing drug approved in many countries. It is a group that raises a warning flag: a counterexample for the nitro group is ranitidine (Zantac), whose degradation product may be carcinogenic. Therefore, it must go. This borders on the endless debate over the stage of drug discovery at which one should start worrying about ADME properties. During hit discovery, including fragment-hit optimisation, I, for one, believe it depends on the screening assay: for example, in XChem’s crystal soaking assay, enthalpic interactions will be detected more readily than entropic ones, so a high TPSA is good for the assay but bad for BBB permeability.

Some groups are even less problematic, for example diphenhydramine has a hydrazine, which is likely to be flagged as it can be metabolised easily. This is ironic in the case of diphenhydramine used as a sleeping medication (Nytol) as the half life of diphenhydramine (~10 hours) is too long and leaves one groggy in the morning.

These examples raise the point that there is no single ideal set of useful drug features. Metronidazole kills bacteria thanks to the toxicity of its nitro group. Novocaine has a really short half-life, which is a good thing, as anyone who has snuck out to the dentist during work hours can vouch. GABA_A receptor modulators make a nice example, as they have a range of different lifetimes and uses. Namely, Z-drugs, like zopiclone, and benzodiazepine drugs, like diazepam (Valium), bind in the same pocket, yet have different half-lives, which drive their use (along with specificity for other GABA receptors): the ~2 and ~5 hour half-lives of zolpidem (Ambien in the US) and zopiclone make them suitable against insomnia, while the ~50 and ~18 hours of diazepam and flunitrazepam do not.

Another example of case-specific groups is covalent warheads. For example, in the recent FBDD poster child, sotorasib, the acrylamide is essential for its activity against KRas, yet when not intended as a warhead it is a hard no. Even the topic of covalent drugs is highly contentious, as discussed in a previous post. From the comp-chem point of view, these are irksome to deal with, so for once the med chem personal-opinion debate comes in handy.

PAINS groups are often taken as hard rules, but it depends on the context, and most are neither oncogenic nor a cause of interference in a given assay. The ‘filters’ here are even less rule-like: they are advice based on utterly arbitrary rules. Some reviewers may, for example, dislike all alicyclics beyond morpholine, piperazine or piperidine. The reason is commonly that they were burnt by a given group or scaffold in the past, as opposed to a hard and fast rule. Furthermore, this is not meant to be exhaustive, but simply a first pass.

Filter

Here is the filter: remember it is a starting point, so do alter it.

Methyl groups are utterly fine in moderation, hence why the entry is commented out. The methyl group is prone to cytochrome oxidation, but some 67% of FDA drugs have one. In some cases it does result in scaffold hops or unusual strategies like deuteration (cf. deutetrabenazine). So it is a case of moderation!
As for why humans are efficient at breaking down methyl groups in xenobiotics, the reason is an evolutionary battle against isoprenoids from organisms that do not wish to be eaten (a blog topic for another day!).
I have also commented out carboxylic acids, as I would rather gamble on their rejection than censor a group that forms such strong interactions during hit discovery.

Below are four categories. Two are reactive or oncogenic. One category is driven by chemical synthesis: reactive groups and protecting groups. And one is arbitrary.

Regarding reactive/protecting groups, this does kill some potentially interesting chemistry. Boronic acids are an exciting group that is gaining traction, as the boron can adopt a tetrahedral coordination bearing two hydroxyl groups and can form reversible covalent bonds with serine or threonine hydroxyls, as seen in Velcade (bortezomib), which targets the catalytic threonine of the proteasome. However, they are used in several reactions, such as Suzuki or Chan-Lam couplings, so without an explanation it would just seem that one forgot to go through the compounds by eye to remove protecting groups and the like from automated results (“I totally painstakingly checked the results. cough, maybe”). Less of a loss, silicon is very underrepresented in drug discovery, with only one compound, a pesticide, having a silane, whereas silyls are protecting groups for oxygen, so it is best to get rid of them en masse. The same goes for the exocyclic carbamate, a protecting group for amines, which does appear in a drug or two.

The last category is an example of groups that are utterly personal choice. For example, long alkyl chains are added for solubility reasons at lead-opt, not at hit-discovery. The Fmoc protecting group is a carbamate of methylfluorene, but there is nothing wrong with fluorene by itself, although I can guarantee it will be erroneously flagged as an Fmoc group, so switching to a carbazole is an easy fix. The latter is not a chemistry problem, but a human one (as said, I am playing it very safe here!).
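
To make the fluorene point concrete, here is a small sketch of what the hydroxymethylfluorene pattern from the filter does and does not catch. The three test SMILES (Fmoc-glycine, fluorene, carbazole) are my own illustrative choices. Note that the SMARTS only catches the Fmoc-like substitution; the worry about plain fluorene is about human reviewers, not the pattern.

```python
from rdkit import Chem

# Pattern from the filter: fluorene carrying a CH2-O on its sp3 carbon.
fmoc_like = Chem.MolFromSmarts('c1ccc2c3ccccc3C(CO)c2c1')

# Illustrative molecules (assumed examples, not from the post).
examples = {
    'Fmoc-glycine': 'O=C(O)CNC(=O)OCC1c2ccccc2-c2ccccc21',
    'fluorene': 'c1ccc2c(c1)Cc1ccccc12',
    'carbazole': 'c1ccc2c(c1)[nH]c1ccccc12',
}

flagged = {
    name: Chem.MolFromSmiles(smiles).HasSubstructMatch(fmoc_like)
    for name, smiles in examples.items()
}
print(flagged)
```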

from rdkit import Chem
from rdkit.Chem import rdfiltercatalog

# these are pieces of advice, treated as rules to play it extra safe.
unwanted = {}

# ## protection groups or reactive groups
unwanted['exocyclic carbamate'] = Chem.MolFromSmarts('[N!R]-C(=O)-O')
unwanted['silicon'] = Chem.MolFromSmarts('[Si]')
unwanted['boron'] = Chem.MolFromSmarts('[B]')

# ## Easily metabolised
# several drugs have imines, such as brotidine
unwanted['exocyclic imine'] = Chem.MolFromSmarts('[C!R]=[N!R]')
# loads of drugs have esters, eg. aspirin
unwanted['exocyclic ester'] = Chem.MolFromSmarts('[C!R](=O)-[OH0!R]')
# unwanted['methyl'] = Chem.MolFromSmarts('[CH3]')
# unwanted['carboxylic acid'] = Chem.MolFromSmarts('C(=O)-[O-,OH1]')

# ## Oncogenic
unwanted['hydrazine'] = Chem.MolFromSmarts('[N,n]-[N!R]') # see text for examples
unwanted['nitro'] = Chem.MolFromSmarts('[N+](-[O-])=[O]') # see text for examples
unwanted['nitroso'] = Chem.MolFromSmarts('[N!R]=[O]')
# these are worse, but do appear in a few drugs
unwanted['diazene'] = Chem.MolFromSmarts('[N,n]=[N!R]')
unwanted['hydroxylamine'] = Chem.MolFromSmarts('[N]-[O!R]')

# ## Warheads
unwanted['acrylamide'] = Chem.MolFromSmarts('[N,c]-[C!R](=O)-[C!R]=[C!R]')  # covalent form: 'NC(=O)CC*' or 'NC(=O)C(*)C'
unwanted['ynamide'] = Chem.MolFromSmarts('[N,O,c]-[C!R](=O)-[C!R]#[C!R]')  # covalent form: 'NC(=O)C=C*'
unwanted['haloacetamide'] = Chem.MolFromSmarts('[N,c]-[C!R](=O)-[C!R]-[Cl,Br,I]')  # covalent form: 'NC(=O)C*'
unwanted['vinylsulfonamide'] = Chem.MolFromSmarts('[N,c]-[S!R](=O)(=O)-[C!R]=[C!R]')  # covalent form: 'NS(=O)(=O)CC*'
unwanted['sulfonylhalide'] = Chem.MolFromSmarts('[S](=O)(=O)-[Cl,Br,I]')  # covalent form: 'S(=O)(=O)*'
# phosphonylhalide is a hard no regardless (cf. sarin gas...)
unwanted['phosphonylhalide'] = Chem.MolFromSmarts('[P](=O)(-[O-,OX2])-[Cl,Br,I]')  # covalent form: 'P(=O)(-O)*'
unwanted['aldehyde'] = Chem.MolFromSmarts('[CH1]=O')  # covalent form: 'C(O)*'
unwanted['amidonitrile'] = Chem.MolFromSmarts('[N,O,c]-[C!R](=O)-C#N')  # covalent form: 'NC(=O)C(=N)*'
unwanted['epoxide'] = Chem.MolFromSmarts('C1CO1')  # covalent form: '*CCO'
unwanted['aziridine'] = Chem.MolFromSmarts('C1CN1')  # covalent form: '*CCN', requires UV
unwanted['maleimide'] = Chem.MolFromSmarts('[CH1]1=[CH1]-C(=O)-[NX3]-C1=O')  # classic biochem thiol crosslinker
unwanted['NHS'] = Chem.MolFromSmarts('C1-C-C(=O)-N(-O)-C1=O')  # N-hydroxysuccinimide (saturated ring): classic biochem amine crosslinker

# ## Personal choice
unwanted['aliphatic alkane'] = Chem.MolFromSmarts('[CH2!R]-[CH2!R]-[CH2!R]-[CH2!R]')
unwanted['hemiacetal'] = Chem.MolFromSmarts('[C!R](-O)-O')  # these are unstable
unwanted['hydroxylmethylflourene'] = Chem.MolFromSmarts('c1ccc2c3ccccc3C(CO)c2c1')  # avoid confusion

# ## Make catalogue
catalog = rdfiltercatalog.FilterCatalog()
for name, baddie in unwanted.items():
    sm = rdfiltercatalog.SmartsMatcher(baddie)
    entry = rdfiltercatalog.FilterCatalogEntry(name, sm)
    catalog.AddEntry(entry)

# ## Example
# use a ``rdfiltercatalog.FilterCatalog`` thusly, where ``mol`` is a ``Chem.Mol``;
# each returned entry gives its name via ``entry.GetDescription()``
catalog.GetMatches(mol)
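
As a usage sketch, the snippet below screens a tiny library and reports which groups were flagged. To stay self-contained it rebuilds a catalogue from only three of the patterns above; the helper name `flagged_groups` and the four library compounds are hypothetical, chosen purely for illustration.

```python
from rdkit import Chem
from rdkit.Chem import rdfiltercatalog

# A small subset of the patterns above, repeated so this sketch runs on its own.
patterns = {
    'nitro': '[N+](-[O-])=[O]',
    'boron': '[B]',
    'aldehyde': '[CH1]=O',
}

catalog = rdfiltercatalog.FilterCatalog()
for name, smarts in patterns.items():
    matcher = rdfiltercatalog.SmartsMatcher(Chem.MolFromSmarts(smarts))
    catalog.AddEntry(rdfiltercatalog.FilterCatalogEntry(name, matcher))


def flagged_groups(smiles):
    """Return the sorted names of the unwanted groups found in a SMILES (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f'Could not parse SMILES: {smiles}')
    return sorted(entry.GetDescription() for entry in catalog.GetMatches(mol))


# Illustrative library (assumed compounds, not from the post).
library = {
    'nitrobenzene': 'O=[N+]([O-])c1ccccc1',
    'phenylboronic acid': 'OB(O)c1ccccc1',
    'benzaldehyde': 'O=Cc1ccccc1',
    'toluene': 'Cc1ccccc1',
}

report = {name: flagged_groups(smiles) for name, smiles in library.items()}
clean = [name for name, flags in report.items() if not flags]

print(report)
print(clean)
```

Keeping the flagged names, rather than a bare pass/fail, makes it easier to override a flag by hand when a group is there on purpose, such as an intended warhead.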

Author