Binding a desired protein tightly is important for biotechnology. Recent advances in deep learning have allowed the de novo design of (mostly α-helical) binding protein, sidestepping the laborious process of raising antibodies or nanobodies or evolving affibodies, darpins or similar. These deep learning designed binders will bind with okay affinity, but what if the affinity required were much stronger?
<Enter autocatalytic isopeptide bonds>

Dream-journal parenthesis

Generating cool ideas and hypotheses is easy—an neural network was even written to do it— but is very much like someone’s night-time dreams: nobody wants to hear about either of them. Time and cash are the biggest problem (for the former). I have an idea document that is reams of pages long, which I will never do. What I can do is write blog posts about some of the more challenging ones —like a dream journal but with less dragons…

In particular I want to write about an idea that I think is cool, but is not easily fundable: recently I wrote a project proposal that bombed, because it was very much a salad of engineering solutions for a scientific problem that everyone and their dog are keen on addressing —improving protein design. Not only that the solutions were in my opinion technically interesting, but for everyone else were high esoteric. One of the test cases proposed was overly ambitious and with a high predicted failure rate: making de novo binders that spontaneously form isopeptide bonds with their target protein.

Previously I wrote about using isopeptide bonds in Rosetta. In my spare time I half played around with finding internal positions in a given protein for locations that could host an isopeptide bond, which I started collating into a GitHub repo.

As I was reminded of this earlier this week, here I am discussion this idea, namely what things could be done without actually doing them, with an eye out for the challenges —with a sprinkle of code to pretend its didactic!

Background

Catalysis

An isopeptide bond is a covalent amide bond formed between a lysine and an aspartate/asparagine/glutamate/glutamine sidechain. Some enzymes can catalyse the formation of such bonds, while in certain protein the bond is formed autocatalytically inside the protein core. This is an irreversible bond that forms from the spontaneous maturation of three strained residues, and increases the stability of the protein.

Were it engineered into a protein, an isopeptide bond could be used to stabilise the protein structure, and could be used to create a binding protein that once bound and crosslinked to its target could not be removed.

Autocatalytic isopeptides are found in Ig-like folds of pili (extracellular attachment proteins) of bacteria and archaea, and are equivalent to a disulfide bond in eukaryotes: they are used to stabilise the protein structure, or, at least force, to refold without aggregation.

Putting aside the technical details about the topology (ResidueType definition), the maturation of an isopeptide bond appears trivial, but is a tight balancing act to have a folded protein but with strained starting residues. Three residue are required:

The nucleophile, such as the ε-amino group of lysine (Lys) —deprotonated
The electrophile, such as the β-carboxyl group of aspartic acid (Asp) —protonated
The catalytic residue, such as the γ-carboxyl group of glutamic acid (Glu) —protonated

Note the protonation states are not your normal pH 7.4 states, where the conjugate acids/bases are seen. Hence why lysine is a nucleophile that attacks the carbonyl carbon.

Additionally, an oxyanion hole to stabilitise the hemiacetal transition state is essential and results in a trapped water molecule in the mature protein. The transition state is a hemiacetal, which is chiral —a problem for later. One oxygen is hydrogen-bonded to the catalytic residue, while the other is snuggly in the oxyanion hole.

Unlike the hydrolysis of an amide, the amide form is less strained than the non-reacted form. That is because the residues are in a strained conformation and rigidly in place, whereas in proteolysis the products have more entropic rotational freedom. In a Schotten–Baumann reaction the amide has a lower energy than the acyl chloride —unfortunately acyl chlorides are explosively unstable in water so they are not found in protein. In ribosomes the new residue’s amine attacks the ester carbonyl anchor of the growing peptide. In DCC or EDC-driven amidation, a coupling agent activates the carboxylic acid. In all these cases, the carboxylic acid is an activated form, a luxury unavailable here, which is more akin to strain release reactions (eg. propellanes) than simple nucleophilic substitution, as the energy driving the reaction comes from the relief of conformational stress rather than external reagents or high-energy intermediates.

With the caveat that a carboxylic acid is a rubbish electrophile, the electrophile can be one of several choices, but aspartate is the headliner. Glutamic acid features in some examples. In some cases asparagine and glutamine feature, but in isolation they are less electrophilic. Contrastively, the leaving group for Asp is water, while for Asn is an ammonia: in the latter the reverse reaction would be hampered by the lower concentration of ammonia. However, the driver with enzymes is partial charge shifts from the enzyme pocket. An Asn to Asp mutant of Streptococcus pneumoniae RrgB domain (one such domain with asparagine) has a much decreased activity… and stability. So ‘teamwork makes the dream work’ —grumble
Parenthetically, threonine and serine can also form crosslinks, but they are different and reversible.

The maturation of an isopeptide bond or of the chromophore in GFP is a catalytic single use reaction —previously I noted that GFP could be seen as an enzyme, albeit one that does a suicide reaction (a term normally reserved for inhibitors or promiscuous activities). Therefore, theozyme design rules fully apply here. The transition state is crucial. Curiously, the strained starting pose is the opposite of normal case, as it needs to be energetically unhappy, whereas normally the bound product needs to be less happy than the substrate or risk product inhibition.

Parenthetically, “Energetically happy” is not official thermodynamics jargon, the correct term would lower potential energy, but the ‘lower is better’ does get some readers confused —if you’d like a quick primer on molecular thermodynamics, check out this post.

History

When isopeptide bonds were first reported in extracellular domains of bacterial pili, they were thought to be a mistake in the crystal structure, but retrospective analyses showed that several historical crystal structures had isopeptide bonds –and the value is likely higher.
As far as I know nobody has bothered check EBI-AlphaFold database for protein with (strained) buried 3-way salt bridges, in order to put a number on it. But then the result of that would be a “that’s nice, how do we use it?”. A dataset of fixed isopeptide bonds would be useful for training a neural network to predict them, but a problem is that many would be buried salt bridges that are not isopeptides.

The Streptococcus pyogenes fibronectin binding protein Fbab-B was split into two parts, SpyTag and SpyCatcher, which form an isopeptide bond when mixed, making them great biotech tools. My second job after my PhD was about this, hence my interest!

Engineering

Given any protein, two design avenues to play with are possible:
(a) in one the isopeptide bond is engineered into an existing protein,
(b) in the other a de novo binding protein is designed that forms an isopeptide bond with its target.
The application of the first is more stable proteins for, say, white/green biotechnology, while for the latter is a padlock that captures a native target for diagnostics or therapeutics.

The former can be done by scanning the protein for suitable sites and then designing the neighbourhood. The second would require getting a surface lysine on a desired target and then designing the neighbourhood and the domain around it.

One could do these with physics based methods or with deep learning. These are not dichotomous approaches but a synergistic, in my opinion —the catch is that the physics approach requires a lot of attention to detail to work. For example, a physics-based approach could generate a synthetic dataset, optimise the designs, and score them.

Considerations

Tight rope

As mentioned, the “substrate” form is strained. This means that isopeptides form in a goldilocks zone between:

Too little strain in unreacted form: Isopeptide forms less energetically favourable than unreacted form → no isopeptide
Too much strain: Overly unstable unreacted form → misfolding

To reiterate, in a normal biocatalysis challenge, the product needs to weakly binding to prevent product inhibition, not the substrate, even if these generally share most of their pharmacophoric features. So it’s backwards, but not quite.

How much is too much strain would depend on the specific protein. It could be guestimated in a case by case basis using one of several metrics, each with its flaws. For example, in a QM calculation the nearby atoms would be frozen, but these atoms are forcing the strain and can alleviate it, so would be partially misleading. Whereas the per-residue RMSF in an MM MD run of the unreacted form would give a numerical value, but given that proteins do not blow up in an MD run at a silly temperatures, ascribing a maximum value a priori without calibration would be foolhardy. Namely, the usual “we need a dataset for this” problem. My personal gut feeling is that, even though the rule of thumb for a destabilising mutation is +2 kcal/mol (or +10 kJ/mol), the goldilocks zone for the difference between isopeptide and unreacted is probably in the 5–20 “kcal/mol” or 20–50 “kJ/mol” —quotes because the units will be tinkered beyond their original calibration.

Tangled structures

A point that needs to be raised is that PDBs with isopeptide bonds need to be energy minimised before doing anything —against the CCP4 map is fine (discussed in previous post). The reason for this is that crystal structures are generally an average of the ensemble of positions found in the crystal lattice and in some cases a tangled structure happens, namely the average of two or more diverging sidechains or backbones leads to a density that if obeyed is energetically unfavourable. As an extreme version of this, getting an MD run and converting not only the temperature factors to per atom RMSF (normal) but the atomic positions to their average, one gets a very artistic mess (pictured).

Chiral transition state

The most annoying thing with sp³ transition states is that they are chiral or pseudochiral. In the case of a lys+asn isopeptide, the hemi-aminal moeity (CC(-[O-])(-[OH])NC) has two oxygen substituents, but in compchem terms, not only one is protonated, these have unique atom names and making the Cγ pseudochiral and need to be treated as two separate isomers when modelled —or four when considering an isopeptide can be formed by asparaginine too.

A key feature is the ex carbonyl oxygen is stabilised by an oxyanion hole, which can be seen as a trapped / crystal water in proteases and isopeptide bonds. This is crucial to lowering the transition state energy barrier.

The four transition states options are problematically different, but are only four, so running something four times is not unfeasible.

Fold polarity: little wiggle room?

Another concept to keep in mind is fold polarity. Fold polarity is the independence of an active site from the scaffold of the protein, the higher this is, the more amenable to new activities the enzyme would be. Some enzyme folds, like Rossmann folds or TIM barrels, dominate metabolism, because they have a high fold polarity and were evolved early on back in the LUCA days. Pilin, the natural isopeptide-bond containing protein are β-sandwiches with low fold polarity. It is rather conceivable that the positions most suitable for a de novo isopeptide bond would be the most rigid because the scaffold would constraint the sidechain packing resulting in the strain needed, yet this would mean that there would be a narrower wiggle room in terms of choice.
This would be in different ways: when the protein chassis is fixed, the isopeptide pocket is the only thing that can be changed, when the binding target is fixed, a lot of the isopeptide pocket is fixed.

Non-canonical amino acids: opening up choices!

Granted that I started the Wikipedia article on non-canonical amino acids a dozen years ago as my PhD was heavy on discussions of primordial metabolism, so I am quite partial to non-canonical amino acids, but in my opinion these could be the solution to better design the pocket.

One thing to note that there are different non-canonical amino acids are mentioned for different reasons… and the technological limitations are huge.

Post-translational modification: the human proteome is heavily regulated by post-translational modifications, such as phosphorylations.
Protein technology: biorthogonal reactive labels (like azido-leucine for CuAAC), photocrosslinking (azido or diazirine based), photocaged (o-nitrobenzyl based), fluorescent labels (e.g. coumarin based) and so forth. For PTMs, non-hydrolysable mimetics (phosphonate) of PTMs are some times used.
Primoridial metabolism: many non-canonical amino acids can be found abiotically (e.g. on the Murchison meteorite or in Miller-Urey–type experiments), including some that are easily confused or were usurped. For example, norleucine was likely usurped by methionine as SAM needs methionine, but it’s not found abiotically, while ornithine was likely usurped by arginine resulting in the funky AGP codon. The whole NUN section of the genetic code encodes residues with hydrophobic sidechains: given how hard it is to discriminate between different alkyl groups and the plethora of abiotic amino acids, it make me shudder think of how unutterably rubbish, messy and slow primordial life must have been —luckily billions of years of evolution fixed most things.
Biocatalysis: these can fall up into two categories: novel or enhanced catalysis and driving stability/selectivity.

Maybe not arenes

Biotechnologically, there are two ways to add ncAAs: in vivo genetically encoded, using a modified tRNA synthase, and in vitro translation, using purified ribosomes —commonly casually referred to by its product name, NEB PURExpress® (∆ kit in this case). In the latter, as chemical aminoacylation of tRNAs is problematic, enzymatic aminoacylation is more common. So in both, engineered tRNA synthases are needed. Methanococcus jannaschii Tyr-tRNA synthetase has been a workhorse for novel synthase design yielding many aryl-aminoacids. In this case, it is useless for increasing isopeptide reactivity, given that benzoic acid is a worse electrophile than an alkyl-carboxylic acid and that aniline is a worse nucleophile than an alkyl-amine base (i.e. dastardly delocalysed electrons).

In terms of interactions with the amide, the (unjustly) seldomly spoken of π-amide interactions could benefit the isopeptide form. In PDB:2X5P there is a tyrosine 4.8Å away from the electron-deficient π conjugated carbonyl of the isopeptide: a phenol is a more electron-rich arene than benzene, so there could be something there, but this is rather far. Ignoring clashes, it is hard to say if a ncAA with an electron-poor arene would be worse (like benzoic acid), while something more electron-rich would be better (like anisole, aniline, pyrrole, naphthalene, or an arene with more EDGs etc.) . Annoyingly, this is a strong guess and may have no effect in most cases.

The perfect fit

However, the isopeptide residues could be other alkyl chained residues, for example aminoadipate (homoglutamate) or ornithinine. Or better still, an aliclic sidechain with a primary amine or carboxylic acid / amide substituent, as these would be conformationally stapled avoiding entropic costs of rigidification. The catch is that in most cases the latter amino acids have not been made yet, let alone there is no synthase for these, so best not venture too much into SciFi territory.

The pocket residues could be ncAAs to attain a better fit, for example using residues with differ hydrophobic steric bulk (such as aminobutyrate (homoalanine), norvaline, norleucine, cyclopropylalanine, tert-leucine etc.), altered polarity (say ornithine, hydroxyproline or substituted phenylalanines: ortho-phenol, meta-phenol, aniline, fluorobenzene etc), or altered Ramachandran propensities (such as the alpha-helix inducing amino-isobutyrate).

The major catch to this argument is however that if one were to use ncAA, why not use one involved in a different reaction? For example diazirine-ncAA + serine + UV -> ether. That is a good question. My answer: “Ma, no! it’s less challenging, where is the fun in that?!”.

Quantification

While on the “let’s hide details under the carpet” front. There is the quantification problem. For a binder, dissociation is easy enough to quantify. Whereas quantifying the fraction of a protein with an isopeptide bond is trickier. SDS-PAGE is laborious, a shift of an heavy atom of the spectra of a protein in MS is not great, while thermal shift assay by thermaflour is funky with isopeptides similarly to intrinsically disordered protein.

Implementation thoughts

Now, in terms of implementation. Scoring a potential model requires a physics-based approach, which has its own caveats (discussed below). For scanning for a potential new isopeptide with an existing protein, as this is heavily constrained, so a classical approach, with smart screening could be done —brute force is time-intensive, but doable with tricks (and not that fun). For a de novo reactive binder, deep learning has to be used (more fun). There are many ways, each with caveats (discussed below).

A heuristic search for the best location within a protein to add an isopeptide would start with a transition state search. Namely,

iterating all pairs of scaffold residues and comparing their backbone to those of four reference rotamer sets of the four possible transition states (vide chiral transition state above). The scaffold only assumption is because loop mobility might be problematic (vide fold polarity above).
Given a shortlist of pairs, filter for those that can handle a well placed catalytic residue.
Then design neighbourhood residues to best suit the transition state. Scan for variants that have a larger/worse potential in the unreacted form than the isopeptide form. Sounds computationally heavy, but easy enough, what could go wrong?

The devil in the PyRosetta detail

(To pretend this post is didactic here is some technical details on scoring the physics based way)

PyRosetta is my physics-base modelling tool of choice, because it is intended for protein design and it’s in Python. So I am using it as example of some issues that one could encounter.

Scorefunction

Previously using PyRosetta I predicted the effect of pocket variants of isopeptide bond containing protein, both retrospectively and prospectively. I am rather pleased with a pointless graph that shows that the PyRosetta could predict that in the reference that the unreacted had a worse internal energy that the reacted form. In terms of actual point mutation scanning this has little importance as the variants are compared to a reference in the same state (roughly improving either ∆∆G^‡ or ∆∆G_isopeptide – ∆∆G_unreacted) and actually Rosetta scorefunctions are not suited for such predictions. For starters the default energy function, ref2015, has the cart_bonded term disabled (harmonic deviation from ideal bond length, angle and dihedral), then there is the fact that they are hydrid scorefunctions with several statistical frequency terms. Most notably, the ref term, which is effectively a correction to make a mutation prediction better match ProTherm data and simply consists of a scalar value for each amino acid. Charged residues happen to have more negative values, while a non-canonical amino acid, such as the transition state, has a placeholder value of zero —making the transition state less favourable than the reacted form regardlessly. The fa_dun term is the Dunbrack rotamer propensity and, in terms of the unreacted state, the residues involved in the reaction are contorted funkily. That both Rosetta scorefunctions (both vanilla and tweaked) and RDKit MMFF give a hill with a more stable product than unreacted is nice as these are molecular mechanics not quantum calculations and are static snapshots in implicit solvent (Rosetta) and in vacuo (MMFF), so can be done fast.

prc: ModuleType = pyrosetta.rosetta.core
pru: ModuleType = pyrosetta.rosetta.utility

scorefxn = pyrosetta.create_score_function('ref2015_cart')
# Good:
scorefxn.set_weight(prc.scoring.ScoreType.ref, 0) # default 1.
# Not good idea (lower yes, but not needed):
scorefxn.set_weight(prc.scoring.ScoreType.fa_dun, 0) # default 0.7
# better doing a crude subtraction on the lines of:
v = pru.vector1_bool(pose.total_residue())
v[asp_idx] = 1 
v[lys_idx] = 1
scorefxn(pose) - scorefxn.get_sub_score(pose, v)

As a test I had done a round trip in PyRosetta: I got PDB:2X5P, energy minimised it, broke the isopeptide, energy minimised it again, remaking the isopeptide but did not energy minimise the pose only the residues involved. The predicted energy of the isopeptide form was worse, but was still better than the unreacted form (cart_bonded enabled, fa_dun and ref disabled). This point indicates that a toy physics model might work at predicting isopeptide favourable residues.

For Fragmenstein, I do a preminimisation in RDKit for the ligand, discussed in another post. MMFF is in vacuo, but for the core of the protein, with the protein neighbourhood, this is not an issue, so that could also be done. The major catch is that the score is internal energy (≈enthalpy) so does not account for cost of rigidification —protein folding is entropically driven not enthalpically.

Bonding

In terms of the high magic of creating an isopeptide bond in PyRosetta without reading a PDB with a LINK record, the trick is pose.conformation().declare_chemical_bond:

def mutate_to_isopeptide(pose: pyrosetta.Pose, asx_idx1: int, lys_idx1: int) -> None:
    """
    Mutate the given residues to an isopeptide bond
    """
    MutateResidue(target=asx_idx1, new_res='ASP:SidechainConjugation').apply(pose)
    MutateResidue(target=self.lys_i, new_res='LYS:SidechainConjugation').apply(pose)
    pose.conformation().declare_chemical_bond(seqpos1=lys_idx1, atom_name1='NZ', seqpos2=asx_idx1, atom_name2='CG')
    constrain_isopeptide(pose, asx_idx1, lys_idx1)


def constrain_isopeptide(...):
    raise NotImplementedError
    # we will get here in a minute

As it is fiddlier than soldering SMT chips with a dirty tip with poor lighting, It is always best to check it actually worked, so:

def describe_res_cxns(pose: pyrosetta.Pose, idx1: int):
    """
    Print the details of the connects of residue `idx1` (pose index).
    """
    r = pose.residue(idx1)
    for c_idx in range(1, r.connect_map_size()+1):
        other = pi.pose2pdb(r.connected_residue_at_resconn(c_idx)).strip().replace(' ', ':') if r.connected_residue_at_resconn(c_idx) else '!!'
        print(f'connid={c_idx} w/ idx1={r.connected_residue_at_resconn(c_idx)} pdb={other} okay={not r.connection_incomplete(c_idx)}')

A notable case where the bond fails is if the atom has more than one connection, for example when starting to model transition states (in MM at first), in which case one has to specify the CONN numbers, but this does not apply to isopeptide bonds and transition states is a monolithic topic for another blog post:

# res1 CONN2 to res2 CONN1 say:
pose.residue(res1_idx1).residue_connection_partner(res, res1_idx1, 1)
pose.residue(res2_idx1).residue_connection_partner(1, res2_idx1, 2)

Deep learning considerations

Now to deep learning. A cool application is designing a binding protein that covalently crosslinks to a native protein. De novo protein are highly immunogenic, so the protein target would have to be an ex vivo native protein, so the use would be detection.

A major catch is that there is a limited dataset of isopeptide bonds to operate on, so the heuristic physics approach would be needed anyway to generate synthetic data to augment the dataset — a mad amount of effort.

In terms of machine learning, how much 3D geometry is made to plays a role can vary in enzyme engineering tasks. Assuming one found ancient magical scrolls that enabled quick DTMA cycles by providing a fast, high-throughput and flawless assay, one could do a random forest regression and then, through many cycles, optimise the primary sequence without ever worrying about the 3D geometry. The random forest regressor is great, but deep learning can do a better job, therefore a variety of architectures can and have been used in enzyme design on primary sequences alone. For certain targets like P450 enzymes where the dynamic entrance pocket plays a strong part, deep learning on primary sequence works better generally. However, this is a 3D problem as the fit needs to be sterically perfect —plus I like 3D so I will think about that.

The most popular models at the minute for this kind of task are diffusion models and flow matching models. Discussion of these is beyond the scope of this post. Diffusion models are generally more robust to noise (cf. tangled structures discussed), while flow matching generally requires less data (cf. synthetic data generation).

The next detail is the resolution. a residue level resolution. Many algorithms in protein design operate at a residue–level resolution, where each residue is a single particle. For example, RFdiffusion generates protein backbones with no specific residue, which are assigned by proteinMPNN, and placed with AlphaFold (or by threading as I discussed in a previous post). RFdiffusion all-atom models only the ligands at the atomic resolution, but not all residues. The problem with protein sidechains is that they have lots of rotatable bonds, making the trigonometric calculations in Cartesian space painful for a human, so it is not surprising that many diffusion models (e.g. DiffDock) do not, or not fully, operate in trig-heavy Cartesian space, but in internal space (or a variant, hence the name variants/rebrands: dihedral space, toroidal space etc).

Assuming one had a nice algorithm that could handle a full-atom pocket well despite the sidechains, the next problem then becomes that the amide/hemi-aminal and the sidechain carboxylic acid of the catalytic glutamic acid is what counts, not the conformation of the long lysine and aspartate sidechains. Were one to design a theozyme around the hemi-aminal transition state, the reasonable thing would be to inpaint these sidechains as opposed to try endless possible transition state alternatives, say the transition state was it’s own residue (formed of two fragments) covalently bound to norleucine (lysine), alanine (aspartate) and homoalanine (glutamate). but that would be a recipe for fiddly code with non-canonical amino acids.

As mentioned, the unreacted form needs to be okay, while the isopeptide form needs to be great. One could treat the problem as if it were not in 3D space, but in 3×3=9D space, 3D dimensions for the position of the atoms in each state, but the relationship between the states would need to be encoded in the loss function, thus precluding several neural network models (such as denoising diffusion, which operates differently). An alternative approach would be model the transition state, but with virtual particles And then the difference between unreacted and isopeptide needs to be encoded without resorting to forcefields if implemented in the network. I do not know if a happy transition state implies that the isopeptide would be happier than the unreacted, but there is a chance it could be true. Assuming it is not, one would have to hope that putting more weight into the transition state and isopeptide forms would address this.

Conclusion

The idea of designer isopeptide bonds within a protein, and better still, between a native target protein and a de novo binder is cool, but rather unfeasible as the risks for failure are high, other technologies could conceivably achieve somewhat similar results, and there are too many technical details in the way. In my failed proposal I budgeted a year for this and I feel very naïve in hindsight.

Author

Matteo Ferla

View all posts

Oxford Protein Informatics Group

or "OPIG" to friends

De novo protein padlocks