Tanimoto similarity of ECFPs with RDKit: Common pitfalls

A common measure of the similarity of two molecules is the Tanimoto similarity of their ECFPs (Extended-Connectivity Fingerprints). However, there is no clear standard in the literature for what kind of ECFP should be used when calculating the Tanimoto similarity, and that choice can lead to substantially different results. In this post I wish to shed light on some results you should know about before you jump into your calculations.

Marcus Dablander wrote a blog post in 2022 on how ECFPs are generated, so please take a look at that. In short, ECFPs have a hyperparameter called the radius r, and sometimes a fingerprint length L. Each entry in the fingerprint indicates the presence or absence of a particular substructure in the molecule of interest, and the radius r defines how large the substructures you consider are. With r=3, you consider the substructures reached by going up to three hops away from each atom in your molecule. This is best explained by this figure from Marcus’ post:

The vector of all possible substructures that can arise in molecules for e.g. r=3 is very large, so it’s common to fold the output vector into a fixed length L (e.g. 1024 or 2048) by hashing. This means that there are bit collisions, i.e. a single entry in the output vector may have to indicate the presence or absence of several different substructures. In RDKit, you can choose whether to use a very large sparse vector representation or to fold your vector into a fixed length L. You can also choose whether your fingerprint is a bit vector, with entries 0 or 1, or a count vector, which tells you how many times a substructure occurs. We will only cover bit vectors in this blog post.
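To make the idea of a bit collision concrete, here is a toy sketch in plain Python. It assumes folding is a simple modulo of the substructure identifier, which is an illustrative simplification rather than a statement about RDKit's internals, and the identifiers are made-up numbers, not real ECFP hashes.

```python
# Toy illustration of folding: map large substructure identifiers onto L bits.
# Assumption: a plain modulo is used for clarity; RDKit's internals may differ.

def fold(feature_ids, length):
    """Return the set of bit positions that are on after folding to `length` bits."""
    return {fid % length for fid in feature_ids}

# Hypothetical substructure identifiers (made-up numbers, not real ECFP hashes).
feature_ids = {3, 17, 1027, 2051, 4099, 70000}

for length in (8, 1024, 2048):
    bits = fold(feature_ids, length)
    # Every identifier that lands on an already occupied bit is a collision.
    collisions = len(feature_ids) - len(bits)
    print(f"L={length}: {len(bits)} bits on, {collisions} identifiers lost to collisions")
```

The shorter the folded vector, the more distinct substructures end up sharing a bit, which is the information loss discussed in the next section.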

The Tanimoto similarity of two bit vectors is a number between 0 and 1, calculated as the ratio of two counts: the number of entries that are activated in both vectors, divided by the number of entries that are activated in either vector.
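The definition above can be written in a few lines of plain Python if each fingerprint is represented as the set of indices of its activated bits. This is a minimal sketch for illustration only; the function name and the handling of two empty fingerprints are choices made here, not taken from RDKit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(bits_a & bits_b) / union

a = {1, 4, 7, 9}
b = {4, 7, 8}
print(tanimoto(a, b))  # 2 shared bits out of 5 bits set in either vector: 0.4
```

In practice you would let RDKit do this on its own fingerprint objects, but the set version makes it easy to see why extra bits in either molecule pull the score down.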

In the following tutorial I will use RDKit to generate ECFPs and calculate the pairwise Tanimoto similarity between 42 molecules, i.e. 861 distinct pairs of molecules. I will demonstrate that the distribution of similarities differs depending on the ECFP choices.

Watch out for bit collisions?

When you fold your sparse vector representation into a bit vector of fixed size you inevitably lose some information, but how much does that affect the Tanimoto similarity? In the figure below we calculate the Tanimoto similarity for the 861 pairs for different values of L, and with the sparse vector representation. For this set of molecule pairs there is minimal difference in the resulting distribution, as long as you don’t choose a small length. I would nevertheless recommend using the sparse representation, as it preserves all the information.

Below is a code snippet to generate the fingerprints.

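from rdkit import Chem
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator

# Read the ligands once; skip entries RDKit could not parse (returned as None)
suppl = Chem.SDMolSupplier("data/ligands.sdf", removeHs=True)
ligands = [lig for lig in suppl if lig is not None]

# Folded bit vector of fixed length L = 2048
fp_gen = GetMorganGenerator(radius=3, fpSize=2048)
fps_2048 = [fp_gen.GetFingerprint(lig) for lig in ligands]

# Sparse (unfolded) bit vector
fp_gen = GetMorganGenerator(radius=3)
fps_sparse = [fp_gen.GetSparseFingerprint(lig) for lig in ligands]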
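Once you have the fingerprints, the pairwise similarities can be computed with RDKit's `DataStructs.TanimotoSimilarity`. The sketch below is self-contained, so it uses a handful of placeholder SMILES instead of the ligand file from the post; the molecule list and the helper name `pairwise_tanimoto` are assumptions made for illustration.

```python
from itertools import combinations

from rdkit import Chem, DataStructs
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator

# Placeholder molecules (illustrative SMILES, not the ligands from the post).
smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N", "CC(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

folded_gen = GetMorganGenerator(radius=3, fpSize=2048)
sparse_gen = GetMorganGenerator(radius=3)

fps_folded = [folded_gen.GetFingerprint(m) for m in mols]
fps_sparse = [sparse_gen.GetSparseFingerprint(m) for m in mols]

def pairwise_tanimoto(fps):
    """Tanimoto similarity for every unordered pair of fingerprints."""
    return [DataStructs.TanimotoSimilarity(a, b) for a, b in combinations(fps, 2)]

sims_folded = pairwise_tanimoto(fps_folded)
sims_sparse = pairwise_tanimoto(fps_sparse)

# n molecules give n * (n - 1) / 2 pairs: 10 here, 861 for the 42 ligands.
print(len(sims_folded), len(sims_sparse))

# How far apart are the folded and sparse similarities for the same pair?
largest_gap = max(abs(f - s) for f, s in zip(sims_folded, sims_sparse))
print(f"largest folded-vs-sparse difference: {largest_gap:.4f}")
```

Comparing the two lists pair by pair is a quick way to check, on your own molecules, whether folding changes the similarities enough to matter.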

The radius matters

It’s intuitive that it’s “easier” for two molecules to be similar if the radius hyperparameter is small. For example, if r=0, each substructure is a single atom, so two molecules only need to contain the same kinds of atoms to have a Tanimoto similarity of 1. For larger radii, more complicated substructures have to be present in both molecules for a high Tanimoto similarity. We highlight this in the figure below. There’s clearly a large difference in the resulting distributions, and they shift to lower similarity as r increases. It is therefore imperative that you carefully consider which radius you choose for your Tanimoto similarity calculations, and that you report it.
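A quick way to get a feel for this on your own data is to compute the similarity of the same pair of molecules at several radii. The sketch below does that with sparse fingerprints; the two SMILES and the helper name `similarity_by_radius` are illustrative assumptions, not part of the analysis in this post.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator

def similarity_by_radius(smiles_a, smiles_b, radii=(0, 1, 2, 3)):
    """Return {radius: Tanimoto similarity} for one pair, using sparse fingerprints."""
    mol_a = Chem.MolFromSmiles(smiles_a)
    mol_b = Chem.MolFromSmiles(smiles_b)
    result = {}
    for radius in radii:
        gen = GetMorganGenerator(radius=radius)
        fp_a = gen.GetSparseFingerprint(mol_a)
        fp_b = gen.GetSparseFingerprint(mol_b)
        result[radius] = DataStructs.TanimotoSimilarity(fp_a, fp_b)
    return result

# Illustrative pair (not from the post's dataset): toluene and ethylbenzene.
sims = similarity_by_radius("Cc1ccccc1", "CCc1ccccc1")
for radius, sim in sims.items():
    print(f"r={radius}: Tanimoto = {sim:.3f}")
```

Running this over all pairs in a dataset, rather than a single pair, gives the kind of per-radius distributions discussed above.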

Author

Ísak Valsson

Oxford Protein Informatics Group
