1. Introduction
Molecular docking with graph neural networks works by representing molecules as featurised graphs. In DiffDock, each ligand becomes a graph of atoms (nodes) and bonds (edges), with features assigned to every atom from chemical properties such as atom type, implicit valence and formal charge.
We recently discovered that a change in RDKit versions significantly reduces DiffDock's performance on the PoseBusters benchmark, due to a change in the “implicit valence” feature. This post walks through:
- How DiffDock featurises ligands
- What happened when we upgraded RDKit 2022.03.3 → 2025.03.1
- Why training with zero-only features and testing on non-zero features is so bad
TL;DR: Use the dependencies listed in the environment.yml file, especially in the case of DiffDock, or your performance could halve!
2. Graph Representation in DiffDock
DiffDock turns a ligand into input for a graph neural net by:
- Loading the ligand from an SDF file via RDKit.
- Stripping all hydrogens to keep heavy atoms only.
- Featurising each atom into a 16-dimensional vector (see the sketch after this list):
0: Atomic number
1: Chirality tag
2: Total bond degree
3: Formal charge
4: Implicit valence
5: Number of implicit H’s
6: Radical electrons
7: Hybridisation
8: Aromatic flag
9: Number of rings the atom is in
10–15: Ring-membership flags (rings of size 3–8)
- Building a PyG HeteroData object containing the node features and bond edges.
- Randomising the position, orientation and torsion angles before passing the graph to the model for inference.
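To make the featurisation concrete, here is a minimal sketch in the spirit of DiffDock's featuriser; it is not the project's exact code (the real implementation differs in details, e.g. mapping each raw value through lookup tables of allowed values):

```python
from rdkit import Chem

def featurise_atom(atom, ring_info):
    """Build the raw 16-dimensional feature vector for one heavy atom."""
    idx = atom.GetIdx()
    return [
        atom.GetAtomicNum(),                 # 0: atomic number
        int(atom.GetChiralTag()),            # 1: chirality tag
        atom.GetTotalDegree(),               # 2: total bond degree
        atom.GetFormalCharge(),              # 3: formal charge
        atom.GetImplicitValence(),           # 4: implicit valence (the culprit)
        atom.GetTotalNumHs(),                # 5: number of implicit H's
        atom.GetNumRadicalElectrons(),       # 6: radical electrons
        int(atom.GetHybridization()),        # 7: hybridisation
        int(atom.GetIsAromatic()),           # 8: aromatic flag
        ring_info.NumAtomRings(idx),         # 9: number of rings
        *[int(ring_info.IsAtomInRingOfSize(idx, size))  # 10-15: ring sizes 3-8
          for size in range(3, 9)],
    ]

mol = Chem.MolFromMolFile("ligand.sdf", removeHs=False)  # load ligand from SDF
mol = Chem.RemoveHs(mol)                                 # heavy atoms only
node_features = [featurise_atom(a, mol.GetRingInfo()) for a in mol.GetAtoms()]
```

Index 4, the GetImplicitValence() call, is the feature at the heart of this post.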
3. PoseBusters Benchmark & RDKit Version Bump
Using the supplied evaluation.py script (which docks into whichever protein chains the ground truth is bound to), we evaluated on the 428-complex PoseBusters set with two different RDKit versions:
| RDKit version | RMSD < 2 Å success rate |
| --- | --- |
| 2022.03.3 | 50.89 % |
| 2025.03.1 | 23.72 % |
With no change other than the RDKit version, the success rate more than halved.
Having checked the evaluation and conformer-generation steps, we took a more detailed look at the preprocessed data being fed into the model under each RDKit version. Everything was identical except implicit valence:
- RDKit 2022.03.3: implicit valence = 0 for every atom
- RDKit 2025.03.1: implicit valence ranges from 0 to 3
Relevant Changes to RDKit’s GetImplicitValence()
Between 2022.03.3 and 2025.03.1, RDKit was refactored so that implicit hydrogen counts are recomputed rather than permanently zeroed out after stripping explicit H’s.
Old 2022.03.3 behavior:
- RemoveHs() deletes all explicit hydrogens and sets each heavy atom’s internal flag df_noImplicit = true, keeping only a heavy atom representation.
- Once df_noImplicit is set, asking for implicit valence always returns 0, even if you re-run sanitization.
New 2025.03.1 behavior:
- RemoveHs() deletes explicit hydrogens but does not flag df_noImplicit = true, allowing recomputation of implicit valence.
- Sanitization calculates implicit valence = allowed valence – sum of explicit bonds
- GetImplicitValence() returns the correct implicit valence, even after stripping all H’s.
These changes mean:
Old (2022.03.3): RemoveHs() → df_noImplicit → GetImplicitValence() always 0
New (2025.03.1): RemoveHs() (flag untouched) → sanitization recomputes → GetImplicitValence() returns the correct implicit-H count
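The difference is easy to see in isolation. Here is a quick sketch using ethanol as a stand-in ligand; the printed values reflect the old and new behaviors described above:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")   # ethanol: two carbons and an oxygen
mol = Chem.AddHs(mol)             # make the hydrogens explicit...
mol = Chem.RemoveHs(mol)          # ...then strip them, as DiffDock does

print([atom.GetImplicitValence() for atom in mol.GetAtoms()])
# RDKit 2022.03.3: [0, 0, 0]  (df_noImplicit set; the zeros are frozen in)
# RDKit 2025.03.1: [3, 2, 1]  (implicit H counts recomputed by sanitisation)
```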
Because DiffDock was only ever trained on zeros at that index, suddenly inputting non-zero values at inference caused this collapse in performance.
We force-zeroed that column and recovered performance under the new RDKit, confirming that this feature caused the drop:
```diff
- implicit_valence = atom.GetImplicitValence()
+ implicit_valence = 0
```
| RDKit build | Success rate |
| --- | --- |
| 2022.03.3 baseline | 50.89 % |
| 2025.03.1 unpatched | 23.72 % |
| 2025.03.1 patched | 50.26 % |
4. Why Zero-Trained → Non-Zero-Tested Is So Bad
Consider a single weight w that controls how much the implicit-valence feature influences the network, together with a bias b and an activation function ϕ. Together they compute:
output = ϕ(w·v + b)
where v is the implicit valence feature.
What Happens When You Train on Only Zeros?
- Implicit valence (v) = 0 on every training example.
- Since the input is always zero, the gradient with respect to w is also zero, so there is no signal telling w to move. In the absence of an explicit mechanism pushing the weights toward zero, such as weight decay, they remain non-zero (see the sketch below).
- Effectively, the model learns that the implicit-valence column doesn't matter, and w remains at its random starting point.
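A toy example makes the gradient argument concrete. This is a hypothetical single-unit model, not DiffDock's architecture:

```python
import torch

# One weight w and bias b with a tanh activation: output = tanh(w*v + b)
w = torch.randn(1, requires_grad=True)   # random initialisation
b = torch.zeros(1, requires_grad=True)

v = torch.zeros(8)                       # training: implicit valence always 0
target = torch.ones(8)
loss = ((torch.tanh(w * v + b) - target) ** 2).mean()
loss.backward()

print(w.grad)  # tensor([0.]) -- the gradient w.r.t. w is proportional to v
print(b.grad)  # non-zero    -- the bias still receives a learning signal
```

With v = 0 on every example, w.grad is exactly zero, so no amount of training moves w from its random initial value.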
What happens at test time?
- The implicit valence feature (v) might now be 1, 2, or 3.
- The unchanged, random w multiplies this new v, producing unpredictable activations ϕ(w_random·v + b), as the continuation of the toy example below shows.
- These activations propagate through the downstream layers to the final prediction.
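Continuing the toy example above, the same randomly initialised w now meets non-zero inputs:

```python
# Test time: implicit valence is suddenly 1, 2 or 3
v_test = torch.tensor([0.0, 1.0, 2.0, 3.0])
with torch.no_grad():
    print(torch.tanh(w * v_test + b))
# Only the v = 0 entry matches anything seen in training; the rest are scaled
# by the untrained, random w and feed noise into the downstream layers.
```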
5. Conclusion
Featurisation is very important: in the case of DiffDock, one library tweak changed one feature column and halved the performance! The fix was easy once found, but remember:
- Featurisation is key
- Particularly in the case of DiffDock, use the listed dependency versions!
- If you see a sudden large change in performance, it might be worth checking the package versions and the features…
Happy docking!