Background¶
Sort and Slice (SNS) was developed by a former OPIGlet, Markus, as a method for improving Extended Connectivity Fingerprints (ECFPs) by overcoming bit collisions. ECFPs are a form of topological fingerprint which denote the absence and presence of circular substructures in a molecule. The steps for deriving an ECFP from a molecule are as follows:
Identifier assignment:
Each atom in the molecule is assigned an initial numerical identifier; this is typically generated by hashing a tuple of atomic properties called Daylight atomic invariants into a 32-bit integer. These properties are:
- Number of non-hydrogen neighbours.
- Valence – number of neighbouring hydrogens.
- Atomic number.
- Atomic mass.
- Atomic charge.
- Number of hydrogen neighbours.
- Ring membership.*
*Ring membership is an additional property that is often used but is not one of the original Daylight atomic invariants.
Iterative updating:
Each substructure identifier is updated by hashing the central identifier with all the identifiers of its neighbours, i.e. performing a 1-hop aggregation. This processes is repeated upto a predetermined maximum radius.
Duplicate removal:
Duplicate substructures are identified and removed.
Folding into a smaller bit vector:
The 32-bit integers are used to produce a sparse vector, with ‘on’ bits for each substructure identifier in the molecule. This sparse bit vector is folded into smaller fixed length bit vector, typically to 1024 or 2048 bits. This final folding step often produces ‘bit collisions’, with the same bit being activated by different substructures, adding noise to the representation.
For background more on ECFPs, I recommend this paper by Rogers and Hahn (2010) which introduced ECFPs, Leo’s blopig post on topological fingerprints, Markus’ blopig post on turning SMILES into ECFPs with RDKit, and Greg Landrum’s blog post on fingerprint generators in RDKit.
SNS is a form of dataset dependent feature selection applied to ECFPs. Instead of folding the final sparse vector, SNS retains only the most frequent substructures in a previously observed dataset. Let’s see how we might do this in Python.
# imports
import pandas as pd
from rdkit.Chem.rdFingerprintGenerator import AdditionalOutput, GetMorganGenerator
import numpy as np
from sklearn.model_selection import train_test_split
from rdkit import Chem
from tqdm import tqdm # progress bar
import jdc # allows dynamic class definition across cells
import warnings
Data¶
First of all we need a dataset to test our code on. QM9 is a commonly used dataset of small molecules with calculated quantum properties, which can be easily downloaded from MoleculeNet using pandas:
qm9 = 'https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv'
qm9 = pd.read_csv(qm9)
qm9.head().iloc[:,:2]
mol_id | smiles | |
---|---|---|
0 | gdb_1 | C |
1 | gdb_2 | N |
2 | gdb_3 | O |
3 | gdb_4 | C#C |
4 | gdb_5 | C#N |
Let’s perform a train/test split and convert the SMILES strings into RDKit molecule objects. This will give us train data to generate an SNS featurizer from and test data to run the SNS featurizer over.
train, test = train_test_split(qm9, test_size=0.4, random_state=42)
train_smiles = train['smiles'].to_list()
test_smiles = test['smiles'].to_list()
train_mols = [Chem.MolFromSmiles(smi) for smi in train_smiles]
test_mols = [Chem.MolFromSmiles(smi) for smi in test_smiles]
Coding Sort and Slice¶
Now we have our data, how can we build a SortAndSlice
class? There are several components our class needs to possess:
- An RDKit Morgan Fingerprint generator for performing substructure identifier generation.
- An RDKit AdditionalOutput object for obtaining substructure identifiers from the Morgan generator.
- A method for obtaining all the substructure identifiers in the training dataset.
- A dictionary for holding all the identifiers in the training dataset.
- An encoding dictionary of a fixed length for mapping substructures to bits.
- A method for converting new molecules to SNS bit vectors.
Class Constructor¶
Below is the constructor method of our SortAndSlice
class. This method takes a list of training data molecules, a Morgan Generator, the desired bit length, and a verbosity flag as input arguments. The constructor defines a fingerprint generator and AdditionalOutput instance as attributes of our class, before computing the identifiers in the dataset and the encoder.
class SortAndSlice:
encoder = {}
def __init__(
self,
data: list[Chem.Mol],
generator: GetMorganGenerator = GetMorganGenerator(radius=2),
bit_length: int = 2048,
verbose: bool = False,
):
self.generator = generator
self.ao: AdditionalOutput = AdditionalOutput()
self.ao.AllocateBitInfoMap()
self.verbose = verbose
self.identifiers = self.get_identifiers(data)
self.set_encoder(bit_length)
Sort¶
With our constructor written, the next requirement is a method which calculates the frequency of substructure identifiers across our dataset and sorts the identifiers by frequency. The method below, get_identifiers
, does the following:
- Creates an empty dictionary to tally substructure identifiers.
- Iterates over each molecule in the input data. For each iteration:
- Calculates the sparse Morgan fingerprint of a molecule.
- Obtains all the substructure integer identifiers from the additional output object.
- For each integer, adds 1 to the total tally.
- Sorts the identifiers by frequency.
- Returns the identifiers as a dictionary, with identifiers as keys and tallies as values.
%%add_to SortAndSlice
def get_identifiers(self, data: list[Chem.Mol]) -> dict[int, int]:
identifiers = {}
pbar = tqdm(total=len(data), desc='Collecting identifiers', disable=not self.verbose)
for mol in data:
self.generator.GetSparseFingerprint(mol, additionalOutput=self.ao)
bitmap = self.ao.GetBitInfoMap()
for identifier in bitmap:
count = identifiers.get(identifier, 0)
identifiers[identifier] = count + 1
pbar.update(1)
pbar.close()
identifiers = dict(sorted(identifiers.items(), key=lambda x: x[1], reverse=True))
return identifiers
Slice¶
Having computed the identifiers in our dataset and sorted them by frequency, the slicing operation comes next. The method set_encoder
takes an integer, bit_length
, as an argument. The SortAndSlice
encoder is then set by taking only the top bit_length
substructures by frequency and labelling them with a rank, i.e. the identifier dictionary is sliced and enumerated.
%%add_to SortAndSlice
def set_encoder(self, bit_length: int):
if self.verbose:
print(f'Setting bit length of encoder to a max of {bit_length}.')
encoder = {}
for i, k in enumerate(self.identifiers.keys()):
if i >= bit_length:
break
encoder[k] = i
self.encoder = encoder
if len(encoder) < bit_length:
warnings.warn(f'Encoder is only {len(encoder)} bits long.')
if self.verbose:
print(f'Encoder set to {len(encoder)} bits.')
Featurization¶
With an encoder dictionary now set, new molecules can be encoded based on the molecules that have already been seen. The encode
method takes an rdkit molecule as an argument and performs the following:
- Calculates the sparse morgan fingerprint of the molecule.
- Obtains the 32-bit substructure identifiers in the molecule.
- Generates an output vector of zeros that’s the length of the encoder.
- Iterates over the identifiers in the molecule. For each identifier:
- Checks if the identifier is in the encoder.
- If the identifier is encoded, the rank of the identifier is used to turn a bit in the output vector to 1.
- Returns the SNS feature vector for the molecule.
%%add_to SortAndSlice
def encode(self, mol: Chem.Mol) -> np.ndarray:
self.generator.GetSparseFingerprint(mol, additionalOutput=self.ao)
bitmap = self.ao.GetBitInfoMap()
out = np.zeros(len(self.encoder))
for identifier in bitmap:
if identifier in self.encoder:
out[self.encoder[identifier]] = 1
return out
Usage¶
Let’s see how this SortAndSlice
class can be used. Below, a Morgan Fingerprint generator is defined. Next, this generator and the QM9 training molecules are used as arguments for an instance of the SortAndSlice class:
gen = GetMorganGenerator(radius=2, includeChirality=True)
sns = SortAndSlice(train_mols, generator=gen, bit_length=2048, verbose=True)
Collecting identifiers: 100%|██████████| 80331/80331 [00:02<00:00, 31337.66it/s]
Setting bit length of encoder to a max of 2048. Encoder set to 2048 bits.
To use this SortAndSlice
instance to featurize a set of molecules, we can do the following:
test_sort_and_slice = [sns.encode(mol) for mol in test_mols]
print(
f'Number of feature vectors: {len(test_sort_and_slice)},\n\
Length of the 0th feature vector: {test_sort_and_slice[0].shape[0]},\n\
On bits in the 0th feature vector: {np.where(test_sort_and_slice[0])[0]}'
)
Number of feature vectors: 53554, Length of the 0th feature vector: 2048, On bits in the 0th feature vector: [ 0 1 4 5 10 20 21 22 35 53 69 100 629 901 1242]
And there you have it! A set of topological fingerprints without bit collisions!
For clarity, here’s how the SortAndSlice class looks without cell breaks:
class SortAndSlice:
"""
Class to sort and slice the output of a Morgan fingerprint generator.
Args:
data (list[Chem.Mol]): List of RDKit molecules.
generator (GetMorganGenerator): RDKit Morgan fingerprint generator.
bit_length (int): Length of the output vector.
verbose (bool): Whether to print progress.
Attributes:
generator (GetMorganGenerator): RDKit Morgan fingerprint generator.
ao (AdditionalOutput): RDKit fingerprint generator output.
verbose (bool): Whether to print progress.
identifiers (dict[int, int]): Dictionary of identifiers and counts.
encoder (dict[int, int]): Dictionary of identifiers and indices.
Methods:
get_identifiers: Collects identifiers from data.
set_encoder: Sets the encoder to a specific length.
encode: Encodes a molecule into a binary vector.
"""
encoder = {}
def __init__(
self,
data: list[Chem.Mol],
generator: GetMorganGenerator = GetMorganGenerator(radius=2),
bit_length: int = 2048,
verbose: bool = False,
):
self.generator = generator
self.ao: AdditionalOutput = AdditionalOutput()
self.ao.AllocateBitInfoMap()
self.verbose = verbose
self.identifiers = self.get_identifiers(data)
self.set_encoder(bit_length)
def get_identifiers(self, data: list[Chem.Mol]) -> dict[str, int]:
"""
Collects and sorts identifiers from molecule data.
Args:
data (list[Chem.Mol]): List of RDKit molecules.
Returns:
dict[int, int]: Dictionary of identifiers and counts.
"""
identifiers = {}
pbar = tqdm(total=len(data), desc='Collecting identifiers', disable=not self.verbose)
for mol in data:
self.generator.GetSparseFingerprint(mol, additionalOutput=self.ao)
bitmap = self.ao.GetBitInfoMap()
for identifier in bitmap:
count = identifiers.get(identifier, 0)
identifiers[identifier] = count + 1
pbar.update(1)
pbar.close()
identifiers = dict(sorted(identifiers.items(), key=lambda x: x[1], reverse=True))
return identifiers
def set_encoder(self, bit_length: int):
"""
Slices substructure identifiers to a specific length enumerates them.
Sets the encoder attribute to the enumerated identifiers.
Args:
bit_length (int): Length of the output vector.
"""
if self.verbose:
print(f'Setting bit length of encoder to a max of {bit_length}.')
encoder = {}
for i, k in enumerate(self.identifiers.keys()):
if i >= bit_length:
break
encoder[k] = i
self.encoder = encoder
if len(encoder) < bit_length:
warnings.warn(f'Encoder is only {len(encoder)} bits long.')
if self.verbose:
print(f'Encoder set to {len(encoder)} bits.')
def encode(self, mol: Chem.Mol) -> np.ndarray:
"""
Encodes a molecule into a binary sort and slice vector.
Args:
mol (Chem.Mol): RDKit molecule.
Returns:
np.ndarray: Binary vector indicating substructure presence.
"""
self.generator.GetSparseFingerprint(mol, additionalOutput=self.ao)
bitmap = self.ao.GetBitInfoMap()
out = np.zeros(len(self.encoder))
for identifier in bitmap:
if identifier in self.encoder:
out[self.encoder[identifier]] = 1
return out
References:¶
- Dablander, M. et al. (2024) Sort and Slice: A Simple and Superior Alternative to Hash-Based Folding for Extended-Connectivity Fingerprints. https://doi.org/10.48550/arXiv.2403.17954
- Rogers, D. and Hahn, M. (2010) Extended-Connectivity Fingerprints. https://doi.org/10.1021/ci100050t
- Weininger, D., Weininger, A., Weininger, J.L. (1989) SMILES. 2. Algorithm for generation of unique SMILES notation. https://doi.org/10.1021/ci00062a008
- Klarner, L. (2022) Exploring Topological Fingerprints. https://www.blopig.com/blog/2022/06/exploring-topological-fingerprints-in-rdkit/
- Dablander, M. (2022) How to turn a SMILES string into an extended-connectivity fingerprint using RDKit. https://www.blopig.com/blog/2022/11/how-to-turn-a-smiles-string-into-an-extended-connectivity-fingerprint-using-rdkit/
- Landrum, G. (2023) FingerprintGenerator tutorial. https://greglandrum.github.io/rdkit-blog/posts/2023-01-18-fingerprint-generator-tutorial.html
- Ramakrishnan, R. et al (2014) Quantum chemistry structures and properties of 134 kilo molecules. https://doi.org/10.1038/sdata.2014.22
- Wu, Z. et al. (2018) MoleculeNet: a benchmark for molecular machine learning. https://pubs.rsc.org/en/content/articlelanding/2018/sc/c7sc02664a
WordPress conversion from tutorial.ipynb by nb2wp v0.3.1
GitHub repo: https://github.com/smkyrle/sort-and-slice-tutorial