Finding a way to express the similarity of irregular and discrete molecular graphs to enable quantitative algorithmic reasoning in chemical space is a fundamental problem in data-driven small molecule drug discovery.
Virtually all algorithms that are widely and successfully used in this setting boil down to extracting and comparing (multi-)sets of subgraphs, differing only in the space of substructures they consider and the extent to which they are able to adapt to specific downstream applications.
A large body of recent work has explored approaches centred around graph neural networks (GNNs), which can often maximise both of these considerations. However, the subgraph-derived embeddings learned by these algorithms may not always perform well beyond the specific datasets they are trained on and for many generic or resource-constrained applications more traditional “non-parametric” topological fingerprints may still be a viable and often preferable choice .
This blog post gives an overview of the topological fingerprint algorithms implemented in RDKit. In general, they count the occurrences of a certain family of subgraphs in a given molecule and then represent this set/multiset as a bit/count vector, which can be compared to other fingerprints with the Jaccard/Dice similarity metric or further processed by other algorithms.
MACCS keys fingerprints
MACCS keys are the simplest and most restrictive fingerprint implemented in RDKit. It counts the occurrences of a collection of pre-defined, expert-derived substructures that are commonly used to quantify pharmacologically relevant molecular similarity. The fingerprints can only be accessed in dense bit vector format.
Atom pair and topological torsion fingerprints
Atom pair fingerprints were first introduced by Carhart et al. JCICS 25:64-73 (1985). An atom pair substructure is defined as a triplet of two (non-hydrogen) atoms and their shortest path distance in the molecular graph, i.e. (atom type 1, atom type 2, geodesic distance). In the standard RDKit implementation, distinct atom types are defined by tuples of atomic number, number of heavy atom neighbours, aromaticity and chirality. All unique triplets in a molecule are enumerated and stored in sparse count or bit vector format.
Topological torsion fingerprints were first introduced in Nilakatan et al. JCICS 27:82-5 (1987). They aim to complement the predominantly long-range relationships captured in atom pair fingerprints by representing short-range information contained in the torsion angles of a molecule. They use the same atom type definitions as atom pair fingerprints, but only count four-atom sequences, i.e. (atom type 1, atom type 2, atom type 3, atom type 4). They can only be accessed in sparse count vector format.
Both of these fingerprints can be interpreted as domain-specific special cases of random walk graph kernels and differentiable versions thereof.
Extended-connectivity and functional-class fingerprints
Extended-connectivity fingerprints (ECFPs) were first introduced in Rogers and Hahn JCIM 50:742-54 (2010). The algorithm starts by assigning each atom an integer label (e.g. its atomic number) and then updating these labels by aggregating them with the labels of their one-hop neighbours using a hash function. This aggregation procedure is repeated a certain number of times, counting all distinct labels that arise throughout these iterations. The authors also propose a variation called functional-class fingerprints (FCFPs), which replace the initial labels with more abstract chemical concepts, denoting whether an atom is an H-bond donor/acceptor, part of an aromatic system, a halogen, and basic/acidic.
Both of these fingerprints can be accessed as sparse count vectors and dense (folded) bit vectors. They can be interpreted as domain-specific special cases of the Weisfeiler-Lehman kernel and differentiable versions thereof.
RDKit fingerprints
RDKit also provides an eponymous fingerprint that starts by labelling all atoms with their atomic number/aromaticity and all edges with their bond type and then exhaustively enumerates the (multi-)set of all subgraphs within a certain size limit (1-7 atoms by default). It can be accessed as sparse count vectors and dense (folded) bit vectors.