After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors, I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).
ECFPs were originally described in a 2010 article by Rogers and Hahn [1] and remain among the most popular and efficient methods for turning a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP algorithm depends on two predefined hyperparameters: the fingerprint length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bit vector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.
Circular subgraphs that are structurally isomorphic are further distinguished according to their inherited atom and bond features, i.e. two structurally isomorphic circular subgraphs with distinct atom or bond features correspond to different components of the ECFP. For chemical bonds, this distinction is made on the basis of simple bond types: single, double, triple, or aromatic. To distinguish atoms, standard ECFPs use seven features based on the Daylight atomic invariants [2], but other versions of ECFPs that use pharmacophoric atom features also exist [1]. Optionally, the algorithm can also distinguish atoms by their tetrahedral chirality.
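To make the role of the atom and bond features concrete, here is a minimal pure-Python sketch of the iterative identifier update behind circular fingerprints. It is a toy model under stated assumptions, not RDKit's implementation: the helper names `stable_hash` and `toy_ecfp`, the three-component atom tuples, and the use of CRC32 as the hash function are all hypothetical choices made for illustration, and the duplicate-substructure removal step described in [1] is left out.

```python
import zlib


def stable_hash(obj):
    """Deterministic 32-bit hash (CRC32 of repr); a toy stand-in for the real ECFP hash."""
    return zlib.crc32(repr(obj).encode("utf-8"))


def toy_ecfp(atoms, bonds, R=2, L=1024):
    """
    Simplified circular fingerprint (illustrative only).

    - atoms ... list of atom-feature tuples, one per atom (hypothetical invariants)
    - bonds ... list of (i, j, bond_type) tuples with atom indices i and j
    - R ... maximum radius of circular substructures
    - L ... fingerprint length
    """
    # build an adjacency list that remembers the bond type of each neighbour
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j, bond_type in bonds:
        neighbours[i].append((bond_type, j))
        neighbours[j].append((bond_type, i))

    # radius 0: every atom is identified by its own features only
    ids = [stable_hash(atom) for atom in atoms]
    all_ids = set(ids)

    # radius 1..R: combine each atom's identifier with those of its neighbours
    for radius in range(1, R + 1):
        new_ids = []
        for i in range(len(atoms)):
            # sorting makes the result independent of the atom numbering
            environment = sorted((bt, ids[j]) for bt, j in neighbours[i])
            new_ids.append(stable_hash((radius, ids[i], tuple(environment))))
        ids = new_ids
        all_ids.update(ids)

    # fold the integer identifiers into a bit vector of length L
    fingerprint = [0] * L
    for identifier in all_ids:
        fingerprint[identifier % L] = 1
    return fingerprint


# ethanol (CCO) with assumed features (element, heavy-atom degree, attached hydrogens)
ethanol_atoms = [("C", 1, 3), ("C", 2, 2), ("O", 1, 1)]
ethanol_bonds = [(0, 1, "single"), (1, 2, "single")]

fp = toy_ecfp(ethanol_atoms, ethanol_bonds, R=2, L=1024)
print("number of set bits:", sum(fp))
```

Because each new identifier is built from the atom's previous identifier together with the sorted (bond type, neighbour identifier) pairs, two environments with the same shape but different atom or bond features end up with different identifiers.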
Using RDKit, a SMILES string can be transformed into an ECFP in a straightforward manner via the following function:
```python
# import packages
import numpy as np
from rdkit.Chem import AllChem


# define function that transforms SMILES strings into ECFPs
def ECFP_from_smiles(smiles,
                     R = 2,
                     L = 2**10,
                     use_features = False,
                     use_chirality = False):
    """
    Inputs:

    - smiles ... SMILES string of input compound
    - R ... maximum radius of circular substructures
    - L ... fingerprint-length
    - use_features ... if false then use standard DAYLIGHT atom features, if true then use pharmacophoric atom features
    - use_chirality ... if true then append tetrahedral chirality flags to atom features

    Outputs:

    - np.array(feature_list) ... ECFP with length L and maximum radius R
    """

    molecule = AllChem.MolFromSmiles(smiles)
    feature_list = AllChem.GetMorganFingerprintAsBitVect(molecule,
                                                         radius = R,
                                                         nBits = L,
                                                         useFeatures = use_features,
                                                         useChirality = use_chirality)

    return np.array(feature_list)
```
If the length L of the ECFP is chosen to be very large, then each of its components indicates the presence or absence of one particular, unambiguous circular subgraph with its atom and bond features. The associated bit is set to 1 if and only if this circular substructure is present somewhere in the molecule; otherwise it is set to 0. If L is small, however, hash collisions start to occur and reduce the resolution of the ECFP: a single fingerprint component can then correspond to any one of several distinct circular substructures. L should therefore be chosen large enough to keep such collisions rare and preserve the expressivity of the ECFP. Common choices are L = 1024 or L = 2048.
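The effect of the fingerprint length on collisions can be sketched with a few lines of plain Python. The example assumes that integer substructure identifiers are mapped to bit positions by taking them modulo L, which is a common folding scheme; the identifiers and the helper names `fold` and `count_collisions` are made up for illustration.

```python
def fold(identifiers, L):
    """Set of bit positions switched on when folding the identifiers into L bits."""
    return {identifier % L for identifier in identifiers}


def count_collisions(identifiers, L):
    """Number of distinct identifiers that do not get a bit position of their own."""
    return len(set(identifiers)) - len(fold(identifiers, L))


# hypothetical substructure identifiers (real ones are large hashed integers)
identifiers = [10, 18, 1034, 77, 2125]

for L in (8, 1024, 2048, 4096):
    print(L, sorted(fold(identifiers, L)), count_collisions(identifiers, L))
```

With these made-up identifiers, three are lost to collisions at L = 8, two at L = 1024, one at L = 2048, and none at L = 4096. A longer fingerprint reduces collisions but never rules them out for arbitrary inputs.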
In the literature, ECFP featurisations with radius R are often written in the form ECFP2R, where the number 2R is interpreted as the maximum fingerprint diameter. For example, the frequently used 1024-bit ECFP4 featurisation describes an ECFP with maximum radius R = 2 and length L = 1024. ECFPs are simple yet powerful molecular featurisations, and I like to think of them as a non-differentiable type of message-passing graph neural network (GNN). Have fun using them for molecular machine learning!
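As a small illustration of this naming convention, the following sketch maps the hyperparameters R and L to the label typically used in the literature. The function name `ecfp_label` is hypothetical. The "FCFP" prefix for the pharmacophoric (functional-class) variant follows the terminology of [1].

```python
def ecfp_label(R, L, use_features=False):
    """Return a label such as '1024-bit ECFP4' for maximum radius R and length L."""
    prefix = "FCFP" if use_features else "ECFP"
    # the number in the name is the diameter 2R, not the radius R
    return f"{L}-bit {prefix}{2 * R}"


print(ecfp_label(R=2, L=1024))
print(ecfp_label(R=3, L=2048, use_features=True))
```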
[1] Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 50.5 (2010): 742-754.
[2] Weininger, David, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for generation of unique SMILES notation.” Journal of Chemical Information and Computer Sciences 29.2 (1989): 97-101.