The XChem trove of protein–small-molecule structures not in the PDB

The XChem facility at Diamond Light Source is a truly impressive feat of automation in fragment-based drug discovery. Visitors come clutching a styrofoam ice box teeming with apo-form protein crystals, which the shifter soaks with compounds from one or more fragment libraries; a robot at the i04-1 beamline then kindly processes each of the thousands of crystal-laden pins, while the visitor enjoys the excellent food in the Diamond canteen (R22). I would especially recommend the jambalaya. Following data collection, the magic of data processing happens: the PanDDA method is used to find partial-occupancy ligands in the density, the results are processed semi-automatically, and most open targets are uploaded to the Fragalysis web app, allowing the ligand binding to be studied and further compounds elaborated. This collection of targets bound to hundreds of small molecules is a true treasure trove of data, as many have yet to be deposited in the PDB, making it a perfect test set for algorithm design: fragments are notoriously fickle to model and deep-learning models cannot cheat by remembering these from the PDB.

API

This is not a guide to using Fragalysis programmatically, but simply a guide to downloading all the data.

The API root in Fragalysis is https://fragalysis.diamond.ac.uk/api/.
Interactions with the Fragalysis database are via Django, and thanks to Swagger (now OpenAPI) definitions the endpoints are auto-documented as HTML when viewed in a browser (which sends a GET request whose Accept header asks for text/html). The JSON responses follow the paginated-response convention: a dictionary with the keys count, next, previous and results, where the last is the list of entries requested.

There was a Python SDK to access Fragalysis called Fragalysis-API. It was a tad clunky due to its all-in-one nature (backend and frontend, deposition and retrieval), so I (Matteo) made a simpler SDK, but that may no longer be functional in the future, as something better will be made at some point. However, the principle is the same: each target of /targets has an associated /media route (under the key zip_archive) with all the files.

import requests
import pandas as pd
    
def retrieve_targets(url: str) -> pd.DataFrame:
    response: requests.Response = requests.get(url)
    response.raise_for_status()  # beware: this raises only for 4xx/5xx codes (e.g. 404 or 503),
    # ... whereas permission errors might come back as a 200 with an error payload
    paginated_response: dict = response.json()
    assert 'error' not in paginated_response, paginated_response
    return pd.DataFrame(paginated_response['results'])
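
The function above reads only the first page. A minimal sketch of following the next links until they run out is below. It uses only the standard library (urllib rather than requests) so that it runs without extra packages, and the names fetch_json and iter_results are made up for illustration. Whether a given endpoint actually splits its results over several pages is an assumption to check against the live server.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body (hypothetical helper)."""
    request = urllib.request.Request(url, headers={'Accept': 'application/json'})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_results(url: str, fetch: Callable[[str], dict] = fetch_json) -> Iterator[dict]:
    """Yield every entry of a paginated endpoint, one page at a time."""
    next_url: Optional[str] = url
    while next_url:
        page = fetch(next_url)
        yield from page['results']
        # 'next' is null (None) on the last page, which ends the loop
        next_url = page.get('next')
```

Passing list(iter_results(url)) to pd.DataFrame would then give the full table rather than the first page only. The fetch argument is there so the loop can be tried on canned dictionaries without touching the network.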

There are three URLs:

https://fragalysis.diamond.ac.uk/api/targets/ — the current production server
https://fragalysis-legacy.xchem.diamond.ac.uk/api/targets/ — the legacy server
https://fragalysis.xchem.diamond.ac.uk/api/targets/ — the staging server, which may contain one or more extra targets relative to the production server.

To download the data, the field zip_archive provides a URL. Not all will work, due to link rot (or archiving), and allow_redirects=True in requests.get is advisable.
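
As a sketch of the download step, assuming the zip_archive value is a plain URL that needs no authentication: the hypothetical helper download_archive below streams the file to disk, returns None for dead links, and discards anything that is not a zip (say, an HTML login page served with a 200 status). It uses urllib from the standard library, which follows redirects by default.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path
from typing import Optional


def download_archive(url: str, dest: Path, timeout: float = 60.0) -> Optional[Path]:
    """Save the archive at `url` to `dest`; return None if it cannot be used."""
    try:
        # stream to disk rather than holding a large archive in memory
        with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, 'wb') as handle:
            shutil.copyfileobj(response, handle)
    except OSError as error:  # URLError, HTTPError and timeouts are all OSError subclasses
        print(f'Skipping {url}: {error}')
        return None
    if not zipfile.is_zipfile(dest):
        # e.g. an error page that came back with a 200 status
        dest.unlink()
        return None
    return dest
```

Looping this over the zip_archive column of the targets table and skipping the None results gives a local copy of whatever is still reachable.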

Privacy can play a role. If you have a DLS FedID you can log in via the browser, but for API access things are a tad more complicated, requiring a Keycloak key; in most cases (e.g. you have one private target) it is way faster to do it manually: log in with a browser and then change the URL to the API one.

Target meaning

Of note is /targets: think of a target as a project, as it is a whole campaign against a target protein. If a visitor runs a second screening run (/visit), with some exceptions it will be added to the same target. If a visitor makes a new crystal form or protein variant, it will be added too, but if they make a crystal with a new oligomeric state bound to some different protein, it will most likely be a new target. Another case is when one or more structures in a target need a different privacy setting: for example, the Moonshot data is public, but there may or may not be one or more secret MPro projects.

As of version 2.1 of Fragalysis, there is not much metadata on the targets. For the nature of the protein, I wrote a script in 2022 to infer the biological origin of the targets (not maintained, as it was for a teaching course that no longer runs). Targets from the CMD (formerly the Oxford SGC) may end in ‘A’: this is an isoform annotation needed by the internal data manager, Scarab. Who owns a dataset can be answered by a web search or by asking, as owners are not disclosed online due to privacy.

Metadata file

The metadata.csv file contains a row for every crystal structure that was solved, each with a compound soak ID like x0001.

Fragalysis as of October 2024 is on version 2. As the data-processing stack changes with time, targets from the older version (legacy) come in different formats (pre-alpha, alpha, beta, v1), so the data is inconsistent and some archives lack metadata.csv. In later formats there is a column containing the PDB accession, if present. This is filled in by the visitor, who may not update the records.
For legacy entries without metadata.csv, I would suggest simply using the method .PerceiveBondOrders() of an openbabel.openbabel.OBMol instance to assign bond orders. If this sounds like gibberish, see my previous post on rdkit sanitisation.
The compound of interest is universally the residue LIG.

NB. The SMILES of covalent compounds are in warhead form, not reacted form, and the SMILES of charged compounds are occasionally salted, as sold.
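
Since the point is to find structures that are not in the PDB, here is a minimal sketch of reading metadata.csv straight from a downloaded archive and keeping the rows with an empty PDB accession. The column name pdb_entry and the lookup of the file by its name suffix are assumptions: check the header of your own file, as formats differ between versions. The function names are made up for illustration.

```python
import csv
import io
import zipfile
from typing import Dict, List


def read_metadata(archive, member_suffix: str = 'metadata.csv') -> List[Dict[str, str]]:
    """Return the rows of metadata.csv from a target archive (path or file object)."""
    with zipfile.ZipFile(archive) as zipped:
        names = [name for name in zipped.namelist() if name.endswith(member_suffix)]
        if not names:
            return []  # some legacy archives have no metadata.csv
        with zipped.open(names[0]) as handle:
            text = io.TextIOWrapper(handle, encoding='utf-8-sig', newline='')
            return list(csv.DictReader(text))


def not_in_pdb(rows: List[Dict[str, str]], pdb_column: str = 'pdb_entry') -> List[Dict[str, str]]:
    """Keep the rows whose PDB accession field is missing or blank."""
    # pdb_column is an assumed header name, not a documented one
    return [row for row in rows if not (row.get(pdb_column) or '').strip()]
```

Bear in mind that a blank accession only means the visitor has not recorded one, so a deposited structure can still show up in this list.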

Crystal alignments

As mentioned, each soak gets a code, generally starting with an x followed by sequential digits. The XChem codes for two different targets do not match (to avoid privacy leaks). However, there is often a need to align the crystal structures, as there may be multiple peptide chains in the crystal cell. In the downloaded files of many targets (all current ones, by default), the code is therefore followed by a letter or digit+letter suffix representing the alignment to the reference. Do note that in older legacy targets there is no reference.pdb.

Fragment-hit details

Four details should be noted:

multiple ligands,
ligand ≠ hit,
chemical isomorphism, and
chirality

Multiple instances

A crystal may have the compound bound more than once (new index), either in the same chain or in different chains (the chain ID of the ligand is that of the closest protein chain), or the binding might be ambiguous (altloc).
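
To see these cases in a file, one can group the LIG atoms of a PDB file by chain, residue number and altloc. The sketch below assumes standard fixed-column PDB records and the LIG residue name; ligand_instances is a made-up name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

InstanceKey = Tuple[str, int, str]  # (chain ID, residue number, altloc or '')


def ligand_instances(pdb_block: str, resname: str = 'LIG') -> Dict[InstanceKey, List[str]]:
    """Map each ligand instance in a PDB block to the names of its atoms."""
    instances: Dict[InstanceKey, List[str]] = defaultdict(list)
    for line in pdb_block.splitlines():
        if not line.startswith(('HETATM', 'ATOM')) or len(line) < 26:
            continue
        # PDB columns 18-20 hold the residue name
        if line[17:20].strip() != resname:
            continue
        # column 22 is the chain, 23-26 the residue number, 17 the altloc
        key = (line[21], int(line[22:26]), line[16].strip())
        instances[key].append(line[12:16].strip())
    return dict(instances)
```

More than one key means more than one copy or conformation of the compound, each of which is worth treating as a separate pose.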

Ligand ≠ hit

The terminology can get complex and no two users agree, but briefly: a bound compound in a structure is a ligand, but is not necessarily a hit.
A hit sensu stricto is a compound that has a kinetic effect on the activity of the protein and can be classed as an inhibitor (enzymatic) / antagonist (PPI) or an activator / agonist, and this modulation can be orthosteric (or “normal” in casual speak) or allosteric. A fragment that binds in the desired place is referred to as a fragment-hit (red in figure). Not all fragment-hits will have a detectable kinetic effect: this depends entirely on the assay. If the product, substrate or partner is hard to detect over noise, there will be less sensitivity for weak modulators. If the assay is enzymatic and the turnover is fast, things are problematic: too little substrate may lead to substrate depletion, which deviates from the steady-state assumption of Michaelis-Menten kinetics, while, based on the Cheng-Prusoff equation, if the substrate concentration is far higher than the Michaelis constant, the apparent IC50 of a competitive inhibitor is inflated well above its Ki, so weak binders give a small and noisy signal. The converse is also true. Some assays are less sensitive to certain compounds. In crystallographic screens, some entropic binders are harder to detect, as they slip and slide around and can be confused with bulk solvent (I call them sliders, grey in figure).
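
For reference, the Cheng-Prusoff relation for a competitive inhibitor is IC50 = Ki × (1 + [S]/Km). The sketch below simply evaluates it, to show how far the measured IC50 drifts from Ki as the substrate concentration rises. It assumes purely competitive inhibition; other mechanisms follow different relations.

```python
def ic50_competitive(ki: float, substrate: float, km: float) -> float:
    """Cheng-Prusoff for a competitive inhibitor: IC50 = Ki * (1 + [S]/Km).

    All three arguments must share consistent units
    ([S] and Km in the same unit; IC50 is returned in the unit of Ki).
    """
    if km <= 0:
        raise ValueError('Km must be positive')
    return ki * (1 + substrate / km)


if __name__ == '__main__':
    ki = 100.0  # µM, an arbitrary weak fragment for illustration
    for fold in (0.1, 1, 10, 100):  # [S] expressed as a multiple of Km
        print(f'[S] = {fold} x Km -> IC50 = {ic50_competitive(ki, fold, 1.0):.0f} µM')
```

At [S] equal to Km the IC50 is twice the Ki, and at ten times Km it is eleven times the Ki, which is how a weak fragment can fall outside the concentration range an assay can test.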

A ligand might bind at a crystal interface and is then called a crystallographic artefact (lime in figure). Flat arenes can stack with themselves, causing issues in assays (blue in figure); these are normally called chemical artefacts, swill or PAINS compounds.

Ligands bind in thermodynamically favourable places (sinks), but active sites as crystallised are not always thermodynamically great places, nor are all sinks active sites. A protein might be modulated by another protein or be scaffolded somewhere and so forth; if such a site is not important in the drug-discovery campaign, in XChem it gets informally referred to as a miss (green in figure).

The data for all the compounds soaked is not publicly available, as it sits in a soak file (SQLite). This is because the status of a soak can be complicated, i.e. the negatives are not all compounds that do not bind (true negatives, white in figure), as there are multiple ways to get a false negative:

user failures, e.g. I saw an entry ‘puck dropped’…
assay failure, e.g. the compound sticks to the plastic walls, the shifter missed, the compound precipitates in DMSO (termed brick dust), the compound cannot diffuse in by soaking but would work if co-crystallised, or it is a sliding compound (see above).
crystal cracking failure: the compound binds but the conformational shift cracks the crystal. This is the case when the crystal is in an apo form, while the substrate-bound form is profoundly different.

To make definitions worse, in some places I call fragments that bind multiple targets sluggers (top hitter in baseball), and compounds that never bind ligaints (ain’t a ligand). This is because some fragments are very promiscuous (such as the substituted N-aryl,aryl-amides), while others are never seen (such as the more hydrophobic compounds in the screening deck). The distribution is not Gaussian, but rather bimodal, but that is a story for another time.

Chemical isomorphism

C, N and O scatter X-rays too similarly to be told apart in the density. As a result, the orientation of groups with this ambiguity (a flipped amide, say) may be incorrectly assigned. With big compounds this is not too bad, but it can be fatal for a fragment.
I personally run a script to enumerate the alternatives and score them or run MD, as the deposited assignment is most often wrong.

Chirality

The compounds are frequently racemic, but sometimes only one enantiomer binds. In some studies with non-structural assays (example), these (“stereoprobes”) are used in signal enrichment. In XChem, the racemic nature keeps costs down. The SMILES will be racemic in the metadata even if only one enantiomer was seen. However, this does not mean that the bound ligand is not enantioenriched. A nice example of this is PDB:5SPD, where a racemic elaboration binds Mac1, but the isomer with the best affinity (0.5 µM) has the worst occupancy. In fact, when Stefan was refining the structure he first solved the unexpected enantiomer and shared it, which caused some panic.
This leads to another caveat: not all crystallographers may have considered all enantiomers and instead stick with only one solution.

Not in the PDB

There are three reasons why a structure may be in Fragalysis but not in the PDB.

work is ongoing and not paper-ready,
the project got cancelled/shelved as the researcher left, or
some structures were deposited but not all for a target: PDB deposition is rather painful, so only fragment-hits of interest were deposited, while the others were not refined well enough.

Footnote: docking

Lastly, if using these fragments (bound in cryo-diffracted crystals) for re-docking, I would personally suggest lowering the strength of entropic terms (hydrophobic interactions). This is from personal experience and I have only tried it in Gold, where a third of the strength gave a marginally better AUC. Generally, however, the entropic contribution requires MD trajectories with explicit solvent to be modelled even slightly correctly. The hedged wording is because the results are close to random noise for predicting a priori what may have detectable binding out of the whole screening library/deck, while the conformation of a redocked fragment-hit is within 2 Å in only about 20–40% of cases, depending on the target/receptor.
The reasons for this are addressed in a previous post about demystifying the thermodynamics of ligand binding, but briefly: a ligand will bind rigidly thanks to enthalpic interactions, which may come at an entropic cost of rigidification (but lower B-factors and a better chance of detection), and will also change the solvent interactions, by displacing weakly bound waters (increasing entropy, cf. chelate effect) and shielding waters from the hydrophobic effect. In the case of fragments (<250 Da), fewer water molecules are displaced than with drug-like compounds (<500 Da), so an implicit solvent model will be more susceptible to deviations. Not to mention, a smaller compound will be more prone to bouncing about due to collisions with water molecules at RT. As mentioned, a dynamic ligand is harder to discern from the bulk solvent, as the crystal represents an average across different macromolecules. The crystals are flash-frozen in liquid nitrogen, making the water network vitrify rather than crystallise into an ice cube: the whole crystal will reach 70 K in a few microseconds, while individual macromolecules will do so in 10–100 nanoseconds, which does give some time for the ligands to lock into a low-energy conformation, but not really enough to find the enthalpic minimum, so when there is a larger ensemble of conformations, the atoms will be more blurry (the B in B-factors) and harder to detect. However, this leads to the rabbit hole of false negatives that show inhibitory activity but no detected density when soaked: my advice is to keep away from these!
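
The 2 Å criterion quoted above is usually computed as a heavy-atom RMSD between the redocked pose and the crystallographic pose, without superposition, as both sit in the same receptor frame. A minimal sketch, assuming both poses list the same atoms in the same order and ignoring symmetry-equivalent atoms (which a proper tool would handle); the function names are made up:

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(reference: Sequence[Coord], pose: Sequence[Coord]) -> float:
    """In-place RMSD (no fitting) between two equally ordered sets of atoms."""
    if not reference or len(reference) != len(pose):
        raise ValueError('poses must be non-empty and have the same number of atoms')
    squared = sum((a - b) ** 2
                  for atom_ref, atom_pose in zip(reference, pose)
                  for a, b in zip(atom_ref, atom_pose))
    return math.sqrt(squared / len(reference))


def success_rate(rmsds: Sequence[float], cutoff: float = 2.0) -> float:
    """Fraction of redocked poses at or under the RMSD cutoff (in Å)."""
    if not rmsds:
        raise ValueError('no RMSD values given')
    return sum(value <= cutoff for value in rmsds) / len(rmsds)
```

For a small symmetric fragment the naive atom-order RMSD can overstate the error, for example when a phenyl ring is flipped, so a symmetry-aware RMSD is the fairer number to report.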

Author

Matteo Ferla

Oxford Protein Informatics Group

or "OPIG" to friends