In molecular biology, cutting and tweaking a protein construct is an often under-appreciated yet essential operation. Some proteins have unwanted extra bits. Some proteins require a partner to be in the correct state, which would ideally be expressed as a fusion protein. Some proteins need parts replacing. Some proteins disfavour a desired state. Half a decade ago, toolkits existed that attempted to tackle these problems, and now, with the advent of de novo protein generation, new, powerful, precise and far less painful methods are here. Herein I will discuss how to generate de novo inserts and more with RFdiffusion and other tools, in order to quickly launch a project into the right orbit.
Furthermore, even when newer methods come out, these design principles will still apply —so ignore the name of the de novo tool used.
Overview
Herein I will discuss:
My reasons for designing inserts
There are many applications of de novo design, so I ought first to explain why I design inserts/linkers, as that drives many of the pointers given herein. One caveat I should raise is that I am discussing how I do things, and this is my way —other ways exist. My background is computational biochemistry with a lot of protein engineering and coding, so I am very comfortable doing the tasks discussed; other folk may feel differently. In other words, there are multiple ways up a mountain.
I work in collaboration with wet lab researchers in the Centre for Medicines Discovery planning to run experiments at the XChem facility (Diamond Light Source) for fragment-based drug-design screening by crystallography. One of my tasks is to use de novo design tools (and classical molecular mechanics ones) to engineer protein variants. I have redesigned a couple of targets to meet the requirement that the protein crystallise in a useful state, namely close to the substrate-bound state, but without the substrate present.

Force a given state
Wrong state
Substrate binding in proteins can trigger large scale changes: designed molecules ought to favour, or at least work in, a given state. Yet proteins often crystallise in the wrong state: I know of at least one target in the CMD where a non-hydrolysable analogue of the native substrate cracks the crystal, because it forces the crystal lattice into a more closed state. Therefore, it would be optimal to have the crystal in the ideal state to start with. Luckily, the energy difference between the open and closed states is not large, so a small helping nudge is all that is needed.
Balance
Catalysis is a combination of different binding potentials: namely for the substrate(s), transition state(s) and product(s). The transition state must be highly stabilised to lower the activation energy of the reaction; the substrate needs to bind well, in Nature generally with a Michaelis constant close to the physiological concentration; and the product needs to bind weakly, or else product inhibition occurs. Additionally, undesired reactions need to be penalised, which is achieved by weaker binding of the unwanted physiological substrates or plausible transition states. Kinases, for example, bind ATP via the hinge region, but not other NTPs, so chemical matter can bind the adenine site really tightly; they also bind a large protein substrate, the majority of which is common to the product, so the protein-binding parts are generally weaker, while specific for a given protein. Therefore I would argue that it is a fair trade-off to gain a screenable state at the cost of some weaker potential sites for chemical matter to bind near, but not in, the active site —especially as the most exciting fragment hits in a screen are those that form a bonding network with the catalytic residues, which can be capricious.
Car-jack

One approach is to make a part of the protein lock the state by acting like a car jack. By de novo design, one can pick a region on one part of the protein and design it to interact rigidly with the other side of the pocket. The designed region would be near the active site entrance, but not within the active site, and would preferably be one with missing density or a high B-factor. These locked-state variants will likely lose substrate binding ability, but as the active site is not occluded they are suitable for fragment screening, though not for biochemical assays. However, there are a few things to consider, as discussed below (cf. § Know your protein).
Point mutation scanning
It deserves mentioning that the “traditional” way of doing this is point mutational scanning. This is used to shift the balance of the states by running two or more point mutational scans, one for each state: this allows the shortlisting of mutations where the difference in folding potential between the preferred and non-preferred states decreases (=good) without destabilising the preferred state (formally, say for mutation m the folding potential change in the undesired default state is ΔΔG(m, default) and in the desired state is ΔΔG(m, desired), allowing ΔΔΔG(m) = ΔΔG(m, desired) − ΔΔG(m, default), so one can select all variants that satisfy ΔΔΔG(m) < 0 and ΔΔG(m, desired) ≤ 0). This works well, but sometimes nothing comes up, which means some section, close to but not in the active site, preferably a well positioned helix, needs extending to form a car-jack to keep the other side in place, as mentioned. Often the density in the active site is poor without the ligand and there is missing density. Using a well resolved structure with a non-cleavable analogue, or threading onto a close homologue, gives a nice starting point. Common targets with this problem are kinases, discussed below.
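The two-scan selection above can be sketched as a simple filter; a minimal, illustrative Python sketch (dictionary names and numbers are made up):

```python
def shortlist(ddG_desired: dict, ddG_default: dict) -> list:
    """Shortlist mutations from a two-state point mutational scan.
    ddG_desired / ddG_default map mutation -> folding potential change
    (e.g. REU or kcal/mol) in the desired and undesired default states."""
    picks = []
    for m in ddG_desired:
        dddG = ddG_desired[m] - ddG_default[m]
        # favour the desired state without destabilising it
        if dddG < 0 and ddG_desired[m] <= 0:
            picks.append(m)
    # most state-discriminating mutations first
    return sorted(picks, key=lambda m: ddG_desired[m] - ddG_default[m])

scan_desired = {'A45W': -1.2, 'S90P': +0.8, 'L101M': -0.3}
scan_default = {'A45W': +0.4, 'S90P': -0.2, 'L101M': -0.3}
print(shortlist(scan_desired, scan_default))  # only A45W satisfies both criteria
```

Sorting by ΔΔΔG puts the mutations that most discriminate between the two states at the top of the shortlist.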
The necessity to improve crystallisation
Protein crystallisation is a complex affair, where molecules fall out of solution not by aggregating but by forming an orderly lattice. In some instances proteins refuse to crystallise; to name a few reasons:
- Poor stability, which leads to aggregation, either because the proteins are in isolation without their binding partners or for a mechanistic reason
- Large mobile domains are wiggling too much —entropic cost of rigidification
- The sidechains are wiggling too much
- Or the surface is fine as it is —crystallisation is caused by unhappy surfaces, but not unhappy enough for the protein to aggregate
The latter has a corner case: high salt. Ammonium sulfate forces proteins to salt out of solution by making the hydrophobic effect more entropically costly. A problem with this is that in a fragment screen the greasy/hydrophobic/entropic binders will stick, whereas the polar/enthalpic binders will be happy interacting with the salt. The former tend to slide, while the latter are better for initial design, as they form interactions recapitulated in elaborations and analogues (cf. previous post about thermodynamics in drug discovery and Ferla et al. 2025).
Improve crystallisation (or solubility) by redesign of large disordered loops
Large disordered loops, “spaghetti loops” in AlphaFold vernacular, are common in eukaryotic proteins. These spaghetti loops are traditionally removed to aid crystallography. However, they leave a gap, which needs to be filled, or else there will be a loss of stability, as is often seen in a first pass.
An example can be seen in a nice cryoEM (not X-ray crystallography) paper, Gawriljuk et al. 2024, where disordered loops were removed and the melting temperature of the protein decreased from 60°C to 55°C. In many other cases, the resulting constructs are simply insoluble.
Parenthesis: Why do spaghetti loops exist?
Parenthetically, it should be said —because it always gets asked— that the fact that they exist does not mean they are critical for your protein in isolation.
Spaghetti under evolutionary selection. Conserved spaghetti (pLDDT > 60% in AF3) may be disordered in isolation, but might bind another protein which is absent (i.e. ordered in complex): for example, transcription factors feature long C-terminal tails that are heavily modified by PTMs, stick to DNA and competitively bind the mediator complex. The stress falls on ‘in complex’ vs. ‘in isolation’.
Spaghetti under neutral drift. To further muddy the waters, these loops might exist through evolutionary selection, but may also be neutrally drifting. Homopolymeric repeats exist due to in-frame polymerase slippage on AT-rich spans, resulting in stretches of Lys (AAA), Asn (AAT), Ile (ATT), or Phe (TTT) —Plasmodium, with a GC content of 18%, is littered with these.
Fake spaghetti. Lastly, these could be the result of human annotation error: intron boundaries can be tricky and occasionally you do get a misannotation. For a human protein this can be seen when there is no evidence for the canonical construct in the GTEx database; for a non-model organism protein it is common if the annotation was cursory, which is annoying for the lamprey proteome (as discussed in a previous post on MSAs for picture purposes).
Stabilise fusion protein
Another task is fusing a complex of a large domain and a peptide binder. A fusion helps avoid decreasing expression levels down a polycistronic operon, or having two promoters, which evolution will recombine to break your circuit.
For this task I tried two approaches: one was designing a linking insert; the other was designing a collage, wherein I inserted two secondary structure elements from the smaller protein into the bigger one, so that the designed de novo linkers were much shorter than in the A+B fusion. Another approach could have been circular permutation, wherein the N and C termini are fused and a new position is chosen as the N/C termini. Circular permutation is a technique used in biocatalysis, not because the wheel charts of activity look cool, but because a more plastic state may be found due to the increased mobility (decreased thermostability though), with promiscuous activities not detectable in the wild type enzyme. In terms of AlphaFold3 modelling, the MSA needs to also be permuted, but it is unable to predict whether a protein will be affected, as it will most often fold fine (not tested). My rule of thumb was to avoid splitting sheets or bundles whose strands are sequentially distant but physically close, as the ribosome has only a finite ability to buffer exposed hydrophobic patches.
Improve crystallisation (or solubility) by de novo oligomerisation
This is more esoteric and more ad lib. The simplest non-DL-aided approach is using a homologue. The PDB is full of crystallised proteins from mouse or other non-human animals without a human version, most likely because the human protein did not crystallise —if I had more time and the problem were not so niche, I’d love to assemble a training dataset of these missing/crystallised homologue pairs in order to make a predictor. In terms of engineering, common strategies include, among others, surface entropy reduction (replacing flexible Lys, Glx or Asx with Ala, Ser or Thr), introducing histidines, or copying surface patches from homologues (sometimes ambiguously called “crystal epitopes”).
A more extreme approach is designing an insert that oligomerises. This consists of picking a reference crystal to copy, or deciding on a confabulated cell, figuring out the symmetry, plonking down a relevant helix bundle layout and linking it up. This requires a lot of rototranslations and bounding box predictions / fudging. I have only played with this as a proof-of-principle, so I cannot say whether it works. The discussion of symmetry and bounding boxes is beyond the scope of this post.
Know the protein and its assays
Now that I have gone through why I design inserts, before jumping into the how-to fun, I ought to start with the caveats.
Cherry-picked
Each given problem is different and each protein is different. One crucial detail to keep in mind is that for technology development the target is chosen to showcase a tool, while for a target-specific problem the target is fixed. The former target is cherry-picked to be easy to test (cf. de novo luciferases), while the latter, by virtue of having a problem with it, might be a rotten apple in the dark. So small steps are often needed. However, with a bit of forethought the problem can be addressed in a workable manner, as opposed to blindly throwing huge amounts of GPU time at it only to find that no design was as hoped.
Wet lab testing time/money cost
I do not do any wet lab work, but I do need to make sure the constructs are wet lab sound, which actually makes it trickier as I have a limited idea of what will happen. Therefore, a crucial discussion that must happen is how a given construct will be tested and whether testing can be parallelised, how much time each construct requires to be tested, and what the budget is: the testing cost will drive the number of constructs tested and whether design iterations can be done. These time estimates need to be honest, not idealised. For example, I once made a series of designs but, once testing was thought about, the project was shelved. Another example, albeit molecular mechanics based: I recently did two-state point-mutational scanning for two collaborators —one wanted to test 96 point mutants, while the other was wary of going past 6 QuikChange runs.
Human time cost
Human time is the actual rate limiting factor, but slapdash approaches cost more than they save.
For me, a design needs about 2–3 full days: RFdiffusion can take a day for 500 designs, ProteinMPNN is lightning fast, AlphaFold3 is fairly fast if a butchered A3M MSA is given, and PyRosetta is CPU only. The human time drain is reading up, staring at the structure, setting things up and writing custom analysis code —the latter can drag things out. I will admit that a couple of hours reading papers on a protein I do not know is not a lot, and will result either in the collaborator asking about the angles in the catalytic triad discussed in the SI of some paper I did not read, or in the collaborator not knowing of a mouse structure of the given protein. The former makes good science, the latter makes a quick task.
It is not really a revelation, but running some small tests helps steer the design the right way, even though it eats into human time. Several times I have done low importance/urgency de novo design experiments in the background of other tasks by simply submitting a large set of jobs, only to get rather rubbish results that I had to analyse heavily (often tripping into analysis rabbit holes) or repeat with better fine tuning.
Structural questions to ask
There are many structural questions worth asking. I do not know the intricacies of most of the proteins I am tasked with, so I may be ham-fisted: to be frank, I am the first to fall asleep in cell biology talks about NF-κB. When I ask for details of a protein, I normally get sent mostly irrelevant cellular or physiological information with nearly irrelevant diagrams of regulation: I want thermodynamics, not cartoon physics of arrows between circles. However, some details are crucial and need to be ascertained. In my opinion these are:
Location
Where would you like the designed insert/binder to go?
The wise advice is to mimic where other proteins bind. If that is not known, a small number of unconstrained binder designs will give a good idea.
De novo binder design studies have found that surfaces that bind another protein are likely to be better at binding de novo binders in vitro. Therefore, in certain scenarios it can be advantageous to check with AlphaFold3 how a protein known to bind actually binds, as that could be a good region for certain types of de novo design operations. This is true even when a region is reported: once, for a protein, a region was reported as the binding site, but AlphaFold2 had a sharp consensus in favour of a different region —possibly because the reported constructs ignored the likely tertiary structure of the protein, and when run in AlphaFold2 they blew up. For one project, I predicted in AlphaFold3 what bound where for a protein and its partners, with the aim of getting inspiration for how to design a stabilising insert; however, the results ended up flipping the project upside-down and removing the need for a stabilising insert.
Running HADDOCK (protein–protein docking) will give an idea of what is sticky according to a thermodynamic model —a good probe protein is 4TQL, a designed 3-helix bundle. Agreement between toy models is a good sign, but not a smoking gun. I have done this once and did not glean much, so I do not know if it actually helps.
Is your preferred location a loop?
Loops will likely not be the location chosen by RFdiffusion unless there is a geometric reason for it, i.e. no other way round the loop; strategically placing fixed residues in the template will do the trick. Do note that loops may need to be optimised in PyRosetta with FastRelax, with design enabled and a suitable res_type_constraint weight (see previous post on the topic). This is not perfect, especially as protein–protein interactions are often water-bridge mediated (cf. the barnase·barstar complex: PDB:1BRS), which is not captured by implicit solvent: Rosetta uses statistically trained Lazaridis–Karplus solvation energies, which, for example, do not account for proton hopping and are thus said to be erroneously more tolerant of exposed apolar atoms.
Beyond the span you want to design, are there any other gaps?
This can go under the radar, but will need to be fixed. If two required inserts are somewhat close, it might be best to create and test them separately. In my hands, when I hoped to form interactions between separately designed parts, there were no or only limited resulting interactions. Doing the two parts separately may mean better designs, as one group is not piggybacking off the other, and it will be easier for wet lab assembly (2 parts vs. 4 parts) and testing.
Protein details
Accuracy
What is the resolution of the structure used?
If the resolution is worse than 3 Å, you might want to energy minimise the structure, especially if it is from the 90s or earlier. In my experience, energy minimisation is not needed in most cases, but can help, especially for downstream comparisons: if comparing RMSD and Rosetta score to a minimised template, what is one to make of the fact that the minimised copy has a bit of an RMSD from the original template? Plus, parts that moved despite a decreasing coordinate-restrained minimisation regime make a nice red flag —unless, of course, the minimised protein came from a structure of a complex and the protein–protein interface was exposed, which is perfect.
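Since a coordinate-restrained minimisation keeps the model in the original crystal frame, a plain per-atom RMSD (no superposition) is enough for this sanity check; a stdlib-only sketch with toy coordinates:

```python
import math

def rmsd(coords_a, coords_b):
    """Plain RMSD between two equal-length lists of (x, y, z) tuples.
    No superposition step: valid here because restrained minimisation
    leaves the model in the original frame."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

# toy CA coordinates before/after minimisation
before = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
after = [(0.1, 0.0, 0.0), (3.7, 0.1, 0.0)]
print(round(rmsd(before, after), 3))  # prints 0.122
```

A per-residue variant of the same sum (skip the division and square each residue separately) is what flags the parts that moved despite the restraints.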
What are the B-factors in the area of interest?
If they are high, are you going to redesign the high B-factor region? If not, combining point mutations with the insert design may seem sensible, but it adds to the cloning burden, so do check with the experimentalists that this is honestly fine.
One option is using the inpainting options allowing these parts to move during inference.
Crystal B-factors reflect the distribution of conformations in the ensemble, which is not quite the same as motion in solution, but close enough. The latter can be calculated by MD. Comparing potential energies of MD trajectories is overly complex, but the RMSF of the common residues is an easy metric. There will be differences, especially for parts of the protein pinned by crystal packing, but it is an okay proxy.
Does your planned design involve a ligand?
This would require LigandMPNN, which adds to the workload. That is not overly problematic, but I prefer to keep it simple, as often the ligand is not important for the design. Personally, the few times I have done LigandMPNN experiments the hypothesis was flimsy: for example, once I wanted to add a gold ion in a pocket for improved cryoEM contrast, but then got stuck by the fact that I could not analyse the designs for such an alien ion.
If an AlphaFold3 model is being used, are you sure it's a monomer?
The oligomer stoichiometry (in ColabFold, ‘cardinality’, after mathematical set cardinality) is crucial —and the EBI AlphaFold database is monomeric only.
…and if AlphaFold3, what structures inspired it?
SWISS-MODEL is an older webserver with a database of UniProt sequences threaded against close PDB entries. It gives a similarity score and the template used, which may be a complex. I-TASSER is another older server that hybridised various threaded models and reported them alongside the result. AlphaFold3 does not give this information, but it can be gleaned with Foldseek. In some cases, knowing the resolution and nature of the close templates is essential: AlphaFold3 often leaves holes where partners, ligands or PTMs ought to have gone —there is a whole body of literature investigating the ability to generate state ensembles with AlphaFold. The above concerns about 3 Å structures also apply.
What is the exact construct sequence and what are the cloning fluffs?
The cloning fluff is rather crucial, as it will be there even if it is fluff. For example, I once designed a beautiful trimeric bundle off the N-terminus of a protein, only to find out that a crucial cloning extra had to go there.
For His-tags, adding nickel in AlphaFold modelling makes things worse, as there are no nickel-bound His6-tags in the PDB (actually, most His6-tags are purposefully not modelled): the nickel-NTA stays bound to the beads while the protein is eluted with high imidazole. Nor is there likely to be nickel or zinc in the crystal, as there is no reason for nickel, chelated or otherwise, to be added at random —in the Hampton screens for crystallography, cobalt, cadmium, nickel or zinc appear in only a couple of conditions per plate— so the only case would be zinc added for a zinc finger protein, which often gets forgotten.
Parenthetically, I actually have a counter example due to luck: to model a scrambled mass-spec reference concatenated peptide (QconCAT), I accidentally added zinc (and phosphate) and fortuitously found with AlphaFold3 that a zinc finger was conserved, but that is a fluke and no excuse for not doing background reading!
Is your protein a kinase?
Kinases are annoying. They bind a protein or small molecule, ATP and two magnesium ions. In the canonical kinase fold, the active site is sandwiched between two lobes: the smaller, sheet-rich N-lobe and the larger, mostly helical C-lobe. The latter carries the activation loop, which starts with a DFG motif that can adopt two states: DFG-in (active, the Asp binds an ion, physiological) or DFG-out (inactive, the Phe sits in the pocket, transient physiologically / inactivated). The latter conformation is triggered by binding of unnatural ligands in the active site, or by protein binding further down causing the loop to twist. The DFG-out state is preferred for inhibitors, so it might need stabilising, by point mutations more so than by de novo design. A traditional strategy is to add a glycine before the DFG motif. One problem is that the apo state is highly disordered. The two lobes are connected by the hinge region, which binds the adenosine moiety of ATP: in its absence the two lobes can move, and likewise several loops around the active site are disordered until binding. Previously, I mentioned the car-jack approach to lock a conformation: the C-terminus of the activation loop, or the nearby helix, is useful for this. Additionally, de novo design might be handy to fix disordered loops on the C-lobe (these bind the substrate/product weakly): a crystallised protein with compromises is better than none.
Design
Workflow, not a pipeline
My workflow is not fully automated, as I keep manually intervening in different ways —a hands-off pipeline will not give you what you want. Plus, I run RFdiffusion as a Slurm job on any GPU and AlphaFold3 on a high-VRAM GPU, while ProteinMPNN works fine on CPU if there is no free GPU.
I generally first run a couple of designs and check the results —what was I expecting and does it match? I cannot stress how crucial this is, even if it uses up precious time: I always mess up the first few settings. If, after running RFdiffusion, most designs are rubbish but one design is good, then partial diffusion might be needed.
The more I optimise the RFdiffusion settings and the more I filter the results, the easier the final shortlisting gets and the more confident I feel.
For example, given the two cases:
- running a workflow on 1,000 designs and rejecting 95% of the designs
- running a dozen initial tests, rejecting all bar one or two conditions, or maybe even tweaking the template (cf. partial diffusion), and then running 250 designs and rejecting none
The latter seems a much better strategy, except for the human in-the-loop detail.
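For the record, tweaking the template and lightly re-noising it is the partial diffusion trick mentioned above; a sketch of the invocation as a config fragment (paths, lengths and step count are placeholders; flag names as in the RFdiffusion repository):

```shell
# Re-diffuse a tweaked 150-residue template for a handful of steps:
# the lower diffuser.partial_T is, the closer the designs stay to the input.
python scripts/run_inference.py \
    inference.input_pdb=tweaked_template.pdb \
    inference.output_prefix=out/partial \
    inference.num_designs=250 \
    'contigmap.contigs=[150-150]' \
    diffuser.partial_T=20
```

Note that for partial diffusion the contig length has to match the input exactly, as noise is only added to and removed from the existing coordinates.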
Multiple tasks
Often I have a very specific objective, which can be split into sub-objectives. Addressing all the parts at once often fails, where splitting the objective into steps will not. For example, I wanted to trimerise a protein in a specific configuration. When I asked RFdiffusion to design an N-terminal domain that did so, the designs failed, as RFdiffusion does not willingly oligomerise large domains (I have seen this twice). When instead I first asked RFdiffusion for isolated trimer designs, picked the most suitable helical bundle, placed it within the target protein trimer, then asked RFdiffusion for linkers and ran ProteinMPNN and the rest of the pipeline, I got several designs that AlphaFold3 predicted to be trimers, as hoped —the attrition rate of 85% was higher than I would have liked, but several is more than none.

Design decisions — improvements without remaking
An obvious but crucial detail I learnt the hard way: to improve a detail, such as inter-domain binding, one should avoid remaking the existing weak parts and instead build around them. I was tasked with improving a real but weak interface: when I designed new variants for it, the results were underwhelming —larger domains with worse binding energies that AlphaFold2 failed to recapitulate. When instead the interface was kept and a nearby, less attractive region was targeted, the tables were turned.
Design decisions — filling a gap
The easiest case is a gap between two adjacent antiparallel strands, which simply needs the β-hairpin completing: the β-turn can be tested (if at all) in AlphaFold3 or PyRosetta without any de novo design —generally the type I turn Asn-Pro-Gly-Ser or the type II turn Asn-Gly-Ala-Leu will do, and the internet is full of tables of examples, one of which will match your start and end residues. If using RFdiffusion & ProteinMPNN, I would not do more than a couple of designs for {1,2,3}-length inserts followed by ProteinMPNN, which will converge sharply on a textbook hairpin sequence. For short non-hairpin gaps (e.g. adjacent helices in a bundle), one is likely to get some spacer residues with β-turns, Asx turns and S/T turns, but the convergence will be sharp both for RFdiffusion and ProteinMPNN (e.g. a dozen will do).
For example, I ran 250 designs each for 2-, 3- and 4-residue inserts. After 5× ProteinMPNN, the diversity was all Asp-Gly for 2; for 3, 37% Pro-Ser-Gly, 20% Asp-Asp-Gly…
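Runs like these are plain contig jobs; a sketch of the invocation as a config fragment (chain, residue numbers and paths are invented; flag names as in the RFdiffusion repository):

```shell
# Bridge the gap between A60 and A63 with a 1-3 residue de novo insert,
# keeping the rest of chain A fixed.
python scripts/run_inference.py \
    inference.input_pdb=template.pdb \
    inference.output_prefix=out/hairpin \
    inference.num_designs=10 \
    'contigmap.contigs=[A1-60/1-3/A63-150]'
```

The middle `1-3` element samples the insert length per design, so a handful of designs covers all three lengths.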
For longer spans, the traditional rule was to measure the gap, divide by 3.0 Å or 3.5 Å and multiply by up to 150% —or a variant thereof depending on who trained you— and use that many glycines, with a serine added after every four glycines for good measure (commonly called a Gly4Ser linker), or a variant based on some 1990s paper.
A traditional way of designing linkers is with Rosetta Remodel (discussed in a previous blogpost in regards to its usage with PyRosetta). It is non-trivial to run properly, but its major problem is technical: for spans longer than half a dozen amino acids there is an increasing chance of a lack of convergence, and it generates simple loops, nothing fancy. In my hands, in controls wherein the native sequence was remodelled, it rarely converged.
Contemporary de novo DL methods rightfully like α-helices, wherein a span of 7 residues covers about 11 Å along the axis: each residue therefore bridges only 1.5 Å for the back-of-the-envelope maths mentioned.
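The back-of-the-envelope maths above can be wrapped in a helper; a sketch using the rules of thumb from the text (3.5 Å per extended residue, 1.5 Å per helical residue, 150% margin):

```python
import math

def linker_length(gap_A: float, helical: bool = False, margin: float = 1.5) -> int:
    """Rule-of-thumb residue count to bridge a Cα–Cα gap (in Å).
    Extended/glycine linkers span ~3.5 Å per residue; an α-helix
    rises only ~1.5 Å per residue, hence needs far more residues."""
    per_residue = 1.5 if helical else 3.5
    return math.ceil(gap_A / per_residue * margin)

print(linker_length(21.0))                # 9-residue flexible Gly/Ser linker
print(linker_length(21.0, helical=True))  # 21 residues if forced helical
```

The same 21 Å gap that a nine-residue Gly4Ser linker covers would need a three-turn-plus helix, which is why helix-loving de novo tools produce much longer inserts than the classic rule suggests.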
If staring at a structure in PyMol, here are some useful commands (click to expand)
To add a cylinder between residue this will do it:
cmd.load_cgo([9.0,
              *cmd.get_model('👾👾👾').atom[0].coord,
              *cmd.get_model('👽👽👽').atom[0].coord,
              3, 0, 0, 1, 1, 0, 0], '🏷️🏷️🏷️')
# where '👾👾👾' and '👽👽👽' are PyMOL selections, and '🏷️🏷️🏷️' is a new object name, say:
cmd.load_cgo([9.0,
              *cmd.get_model('AF and name CA and resi 490').atom[0].coord,
              *cmd.get_model('AF and name CA and resi 669').atom[0].coord,
              3, 0, 0, 1, 1, 0, 0], 'sausage')
To colour like AlphaFold:
color 0xFF7D45, element C and b < 50
color 0xFFDB13, element C and (b > 50 or b = 50) and b < 70
color 0x65CBF3, element C and (b > 70 or b = 70) and b < 90
color 0x0053D6, element C and (b > 90 or b = 90)
In some cases making a macaroni artwork is called for.
Design problem: “Don’t sit there”
In most cases, a design is part of a larger protein complex or involves an enzyme with an active site, which results in sites that designs stick to in an undesired way. There are three solutions to this:
- Make the conserved protein bigger
- Use hotspots
- Make macaroni art
The tampered reference golden hack
Before going into the details of the three options listed, I need to exclude a fourth: as far as I can tell there is no golden hack to address unwanted sticky spots by tampering with the reference (template/input PDB).
Munged source code. In an ideal world, I would love to spend time adding an extra guiding potential for the gradient in the RFdiffusion code to somehow implement a no-go zone (tricky, as it would involve, among other things, a rototranslated xyz ↔ adjacency plane in the t2d embedding). However, a new breadwinning model will surely come out this year or next, and I have zero time —I assume everyone else thinks likewise. But I have trialled some work-around kludges.
Alien amino acid composition. RFdiffusion cares about geometry way more than amino acid identity: changing a sticky patch to polytryptophan, polycysteine or polyproline has near zero effect.
Alien amino acid wall. RFdiffusion is scared of unnatural inter-residue lengths and configurations, but making a grid wall with random residue indices can result in a jump through the wall —cheatyface AI— or an uncaught error where np.linalg.svd did not converge.
Loopification. RFdiffusion tends towards bundling helices or appending more strands to sheets, so making the positions of the sticky part more loopy by shearing/blurring them is a nightmare to work with, and often fails even with the inpainted sequence specified.
Designed barricade. As RFdiffusion is not fussed by residue type, using a prior output (polyglycine) as a template works. However, using something that stuck to the no-go zone will result in more stuff sticking to it, as it will likely be a helical bundle. At that point, using the native binding partner works.
Size bloat
What can happen is that a test run makes the design creep where it should not, so more protein gets added. This quickly bloats the size of the complex and grinds calculations to a trickle.
A big limitation, in fact, is the size of the protein complex. RFdiffusion takes about 20 minutes for a 100 AA designed span in a 1,000 AA complex on an A100: it would take a week for 500 designs.
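Budgeting this is one multiplication; a trivial helper (20 minutes per design is the figure observed above):

```python
def campaign_days(n_designs: int, minutes_per_design: float = 20.0) -> float:
    """Serial wall-clock estimate, in days, for a design campaign on one GPU."""
    return n_designs * minutes_per_design / (60 * 24)

print(round(campaign_days(500), 1))  # prints 6.9, i.e. about a week on one A100
```

Running the arithmetic before queueing a large batch is exactly the kind of small test that saves human time later.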
Symmetry, a topic for another post, slows things down a wee bit for RFdiffusion, but in AlphaFold3, where symmetry is not a thing, multiple chains can grind things to a halt, e.g. 1,500 amino acids take 4 hours on an A100. Likewise, symmetry in Rosetta is so painful it is best avoided, so things can be slow, especially with constraints, even with LocalRelax.
500 AA is okay, but often the problem can be simplified into subparts, which can give better results.
Whereas truncating a protein sequence and thereby erroneously exposing its core will make it highly sticky in AlphaFold3 multimer inference, RFdiffusion is less alarmed, so truncation can be an option.
Hotspots
One selling point of RFdiffusion is the ability to make the design bind to given residues, termed hotspots. This mode is available when designing a chain, or part of a chain (“binder”), to bind to a constant one (“receptor”); otherwise an error is raised during preparation, because the key “receptor_con_ref_pdb_idx” is missing when converting the PDB positions to idx0 positions —this can be fixed by tweaking the code to use the other keys, which has the bonus of enabling intra-chain hotspots, but I have not benchmarked this hack. For a non-hacky approach to linker design, simply making the two preceding and two succeeding residues a different chain works.
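The flanking-residue trick is plain text processing on the input PDB; a stdlib-only sketch (column indices per the PDB format; the residue window is invented):

```python
def reassign_chain(pdb_text: str, chain: str, resi_window: range,
                   new_chain: str) -> str:
    """Move the residues in resi_window of the given chain onto a new chain ID,
    e.g. the two residues flanking each side of a planned linker, so that
    RFdiffusion treats them as the 'binder' chain and accepts hotspots.
    PDB fixed columns: chain ID at index 21, residue number at 22-26."""
    out = []
    for line in pdb_text.splitlines():
        if line.startswith(('ATOM', 'HETATM')) and line[21] == chain \
                and int(line[22:26]) in resi_window:
            line = line[:21] + new_chain + line[22:]
        out.append(line)
    return '\n'.join(out)

example = ("ATOM      1  CA  ALA A  59      11.000  13.000   2.000"
           "  1.00 20.00           C")
print(reassign_chain(example, 'A', range(58, 61), 'B')[21])  # prints B
```

Remember to merge the chains back (and renumber, if the downstream tool cares) before ProteinMPNN and structure prediction.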
One problem is that the hotspots are not as strongly enforced as one would want, as they are learnt and there is no way to increase their “weightiness”. In the source code there is too much mathemagic going on for a quick win from a quick read: the hotspot_idx0s become a channel in the t1d embedding, which the guiding potentials do not use, but which gets used in the forward propagation by the main class (RoseTTAFoldModule, a pytorch.nn.Module subclass), which in turn calls the forward method (technically the __call__ wrapper) of another (Templ_emb), which in turn calls the forward propagation of a variant of the Attention class from RoseTTAFold.
Macaroni art
Macaroni art is made by kindergarteners with sticky fingers and the ones displayed are fixed by a teacher. Here, RFdiffusion is the teacher that fixes a rubbish artwork made hastily in PyMOL or in TextEdit. The trick is to use the partial diffusion feature of RFdiffusion with a template that is a PVA glue mess.
Specifically, PyMol has an easy-to-use “sculpting” mode which is rather fun, albeit not too scientific.
- Find a 3-button mouse
- Save the session
- Protect the frozen residues from editing with the command protect (e.g. protect resi 1-60)
- Activate sculpting in the menu > Build > Activate or type sculpt_activate all
- Change the mouse to “3-button editing”
- Click the anchor point
- Shift+⌘ drag atoms around
The sculpting tries to minimise the repulsion (the protect command guards against it), which makes it really fun to drag around. There is also a function, cmd.fab, which, with ss=1, can make helices. Other useful commands are cmd.rotate(.., camera=False) and cmd.translate(.., camera=False), which I use with the marker salt PDB from below.
All this does not feel scientific, but the resulting scores from the downstream applications are very good. After all, do you not end up photoshopping DALL·E figures when you want something specific?
The partial diffusion argument in RFdiffusion requires a specific number of steps, which is discussed further below.
Methodology
Regarding methodologies, I use RFdiffusion + ProteinMPNN approaches along with other tools, specifically AlphaFold3, gemmi and PyRosetta. As the latter is powerful yet tricky, I have previously written a monographic post on the topic —I mention this because I generally do not do RosettaCM threading, which is reserved for the nasty cases, but use AlphaFold, or I use gemmi to superpose structures without worrying about missing sidechains.
Installation
Starting from the top is installation. The installation of RFdiffusion, ProteinMPNN and AlphaFold3 is easy —hopefully helped by better version compatibility of CUDA 12 (see previous post if still having problems), larger adoption of numpy version 2.2 and the ultimate death of CentOS 7 (stuck on glibc 2.18, see previous post). However, I would like to mention two pointers:
- RFdiffusion, ProteinMPNN and AlphaFold3 installation assumes one operates within the cloned GitHub repo. In my conda environment for RFdiffusion I have moved run_inference.py to $CONDA_PREFIX/bin, and I pass the flags --config-path=$RFDIFFUSION_CONFIG and hydra.output_subdir=$OUTPUT_FOLDER, where $RFDIFFUSION_CONFIG is a conda environment variable (conda env config vars) set to $CONDA_PREFIX/config/inference. Likewise for ProteinMPNN with the script protein_mpnn_run.py and the --path_to_model_weights $PROTEINMPNN_WEIGHTS flag, and for AlphaFold3 with the script run_alphafold.py and the --db_dir $AF3_DB --model_dir $AF3_MODEL flags. (One step further is to add a #!/usr/bin/env python3 shebang line, remove the .py suffix and hardcode the variables. I am not too keen on the latter because I would be tempted to mod it further, causing issues for future-me because I did not document the change.)
- AlphaFold3 databases cannot be placed on networked disks and have to stay on local NVMe disks on the node. The whole lot is 0.7 TB, but more importantly the folder mmcif_files has 196,000 files, which would cause severe NFS performance degradation.
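The first pointer can be condensed into a shell sketch — the paths match my setup above, but the script location inside the cloned repo is an assumption, so adjust to taste:

```shell
# one-off setup inside the RFdiffusion conda environment
conda activate RFdiffusion
# put the entry script on PATH
cp RFdiffusion/scripts/run_inference.py $CONDA_PREFIX/bin/
# copy the hydra configs into the environment
mkdir -p $CONDA_PREFIX/config && cp -r RFdiffusion/config/inference $CONDA_PREFIX/config/
# store the location as an environment-scoped variable
conda env config vars set RFDIFFUSION_CONFIG=$CONDA_PREFIX/config/inference
conda deactivate && conda activate RFdiffusion  # vars are loaded on activation
```

The same pattern applies to $PROTEINMPNN_WEIGHTS, $AF3_DB and $AF3_MODEL.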
RFdiffusion design — record keeping
For my RFdiffusion design runs (which I call experiments, to the annoyance of wet lab experimentalists), I write a YAML file of the settings used for record keeping, else it is rather hard to keep track after a few months.
Yaml usage example
Yaml file example:
%YAML 1.2
---
hairpin:
  codename: mini-hairpin
  change: 485-778
  contig: '[A304-486/5-12/A780-931/0 B1-79/0]'
  num_designs:
  template: '8OYP_stripped.pdb'
helixtopper:
  codename: helixtopper
  change: 424-434
  contig: '[A304-423/10-30/A435-486/8-8/A780-931/0 B1-79/0]'
  template: '8OYP_stripped.pdb'
  hotspots: '[A465,A467,A469,A481,A793,A467,A474,A465,A795,A790,A860,A812,A814]'
Example code in Python:
from pathlib import Path
import yaml

with open('USP11_RF.yaml', 'r') as file:
    experiment_definitions = yaml.safe_load(file)

text = ''
for definition in experiment_definitions.values():
    definition['hotspots'] = definition.get('hotspots', '')
    definition['template'] = definition.get('template', 'template.pdb')
    definition['numdesigns'] = int(definition.get('num_designs') or 1_000)
    text += 'export EXPERIMENT={codename};'.format(**definition)
    text += 'export NUMDESIGNS={numdesigns};'.format(**definition)
    text += 'export CONTIGMAP={contig};'.format(**definition)
    text += 'export TEMPLATE={template};'.format(**definition)
    text += 'sbatch rf.slurm.sh;\n\n'

Path('temp_jobs.sh').write_text(text)
Where rf.slurm.sh
is:
#!/bin/bash
# run: export EXPERIMENT='...'; export CONTIGMAP='...'; sbatch /opt/xchem-fragalysis-2/mferla/library_making/third_pass.slurm.sh
#SBATCH --job-name=RFdiffusion
#SBATCH --chdir=/opt/xchem-fragalysis-2/mferla
#SBATCH --output=/opt/xchem-fragalysis-2/mferla/logs/slurm-error_%x_%j.log
#SBATCH --error=/opt/xchem-fragalysis-2/mferla/logs/slurm-error_%x_%j.log
#SBATCH --partition=gpu
##SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --gres=gpu:1
#SBATCH --priority=0
#SBATCH --export=NONE,OUTPUT_FOLDER,EXPERIMENT,CONTIGMAP,NUMDESIGNS,HOTSPOTS,TEMPLATE,EXTRALINES,SLACK_WEBHOOK

# -------------------------------------------------------
export SUBMITTER_HOST=$HOST
export HOST=$( hostname )
export USER=${USER:-$(users)}
export DATA=/opt/xchem-fragalysis-2
export HOME2=$DATA/mferla
export HOME=$HOME2
source /etc/os-release
echo "Running $SLURM_JOB_NAME ($SLURM_JOB_ID) as $USER in $HOST which runs $PRETTY_NAME submitted from $SUBMITTER_HOST"
echo "Request had cpus=$SLURM_JOB_CPUS_PER_NODE mem=$SLURM_MEM_PER_NODE tasks=$SLURM_NTASKS jobID=$SLURM_JOB_ID partition=$SLURM_JOB_PARTITION jobName=$SLURM_JOB_NAME"
echo "Started at $SLURM_JOB_START_TIME"
echo "job_pid=$SLURM_TASK_PID job_gid=$SLURM_JOB_GID topology_addr=$SLURM_TOPOLOGY_ADDR home=$HOME cwd=$PWD"

export CONDA_PREFIX=$DATA/mferla/waconda-slurm
export CONDA_QUIET=true
export CONDA_YES=true
source $CONDA_PREFIX/etc/profile.d/conda.sh
#export CONDA_ENVS_PATH="$CONDA_PREFIX/envs:$DATA/mferla/waconda/envs"
conda activate RFdiffusion2

echo "************************"
echo "HELLO $SLURM_JOB_NAME!"
echo "************************"
echo "Greetings from $SLURM_JOB_NAME script ${0} as $USER in $HOST which runs on $PRETTY_NAME with $CONDA_PREFIX"
# END OF FLUFF
# -------------------------------------------------------
# ACTUAL COMMAND
echo $EXPERIMENT
if [ -z "$EXPERIMENT" ]; then
    echo "Error: EXPERIMENT is not set."
    curl -X POST -H 'Content-type: application/json' --data '{"text":"'$SLURM_JOB_NAME' missing EXPERIMENT var"}' $SLACK_WEBHOOK
    exit 1
fi
echo $CONTIGMAP
if [ -z "$CONTIGMAP" ]; then
    echo "Error: CONTIGMAP is not set."
    curl -X POST -H 'Content-type: application/json' --data '{"text":"'$SLURM_JOB_NAME' missing CONTIGMAP var"}' $SLACK_WEBHOOK
    exit 1
fi
export NUMDESIGNS=${NUMDESIGNS:-1}
mkdir -p $OUTPUT_FOLDER

if [ -n "$HOTSPOTS" ]; then
    run_inference.py \
        --config-path=$RFDIFFUSION_CONFIG \
        hydra.output_subdir=$OUTPUT_FOLDER \
        inference.output_prefix=$EXPERIMENT/$EXPERIMENT \
        inference.input_pdb=$TEMPLATE \
        inference.write_trajectory=false \
        contigmap.contigs="$CONTIGMAP" \
        inference.num_designs=$NUMDESIGNS \
        ppi.hotspot_res="$HOTSPOTS" \
        $EXTRALINES
else
    run_inference.py \
        --config-path=$RFDIFFUSION_CONFIG \
        hydra.output_subdir=$OUTPUT_FOLDER \
        inference.output_prefix=$EXPERIMENT/$EXPERIMENT \
        inference.input_pdb=$TEMPLATE \
        inference.write_trajectory=false \
        contigmap.contigs="$CONTIGMAP" \
        inference.num_designs=$NUMDESIGNS \
        $EXTRALINES
fi
# END OF ACTUAL COMMAND
# -------------------------------------------------------
# FINAL FLUFF
JOB_NAME=${SLURM_JOB_NAME:-"Unknown Job"}
JOB_ID=${SLURM_JOBID:-"Unknown Job ID"}
HOST=$(hostname)
TIMESTAMP=$(date)
# Construct the message payload
MESSAGE=":tada: *SLURM Job Completed* :tada:\n*Job Name:* $JOB_NAME\n*Job ID:* $JOB_ID\n*Host:* $HOST\n*Time:* $TIMESTAMP"
# Send message to Slack
curl -s -o /dev/null -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"$MESSAGE\"}" $SLACK_WEBHOOK
These analyses generate a lot of files in plain-text PDB format. AlphaFold3 generates a JSON with two huge MSAs in it and not one PDB, but six structures for a single seed (5 runs per seed). This can cause a large footprint. Luckily gemmi can save as .pdb.gz and .cif.gz, which are readable in PyMOL —note that Windows out-of-the-box cannot decompress gzip files in the GUI file manager (whatever their Files/Nautilus/Finder equivalent is called).
Gemmi snippet for compressed PDB / CIF and interconverting between Structure and Block
Saving as compressed PDB:
import gemmi
import datetime as dt

structure: gemmi.Structure = gemmi.read_structure(str(pdb_path))
model: gemmi.Model = structure[0]
...
# ## Model to Structure
# I for one always pass around gemmi.Model, forgetting gemmi.Structure way back in the mists of functions,
# so I need to make a new one
structure = gemmi.Structure()
structure.add_model(model)
# remember that a copy gets added, so **do not** touch `model` anymore:
del model
# ## Save
structure.write_pdb('👾👾👾.pdb.gz')
For the roundtrip… From gemmi.cif.Block to gemmi.Structure:
structure: gemmi.Structure = gemmi.make_structure_from_block(block)
Saving as a CIF requires making a CIF document, which is actually a handy structure:
doc: gemmi.cif.Document = structure.make_mmcif_document()
doc.write_file('👾👾.cif.gz')

# or if there are many one can use blocks:
block: gemmi.cif.Block = doc.sole_block()
# or:
block: gemmi.cif.Block = structure.make_mmcif_block()
# which can be written directly
block.write_file('👾👾.cif.gz')
PDBs/CIFs contain metadata too. The mmCIF dictionary term definitions are maddeningly narrow and unfriendly to computationally manipulated structures —this is because the PDB in the early days was a dumpster fire of errors, so the bulk of the archiving efforts are rightfully focused on quality metrics. However, this does not mean these cannot be used for computational biochemistry: AlphaFold3 uses the _ma_qa_metric_global and _ma_qa_metric_local loops for the pLDDT metric. Nor does it mean that non-standard approaches cannot be taken: Rosetta adds # comments at the end of PDB files (non-standard) or _rosetta_remarks in CIFs. Custom fields are considered heretical, but they can be very helpful; the catch is that they can break some parsers, so do check first. Within the guidelines are some useful metadata that are worth filling in an automated fashion for the sanity of future-you. One thing to note is that in gemmi there are info and meta attributes: the former contains title and keywords, while the latter is a strictly parsed version of the PDB REMARK entries (kept as raw_remarks). REMARK 300 / _struct_biol.details is kept, while free-text unnumbered REMARKs are binned. The full CIF dictionary (see the code snippet above for saving a CIF) is a bit trickier to deal with, as it involves finding the loops (=tables) and remembering that all the values have to be escaped (gemmi.cif.quote or gemmi.cif.quote_list) or a rogue space will break everything.
Gemmi snippets for manipulating metadata
import datetime as dt
import gemmi

structure: gemmi.Structure = ...
# add common metadata = info
structure.info.update({'_entry.id': 'XXXX',
                       '_exptl.method': 'THEORETICAL MODEL',  # traditional
                       # whatever you fancy:
                       '_struct.title': '👾👾👾'.upper(),
                       '_struct_keywords.pdbx_keywords': '👾👾 👾👾'.upper(),
                       '_struct_keywords.text': '👾👾 👾👾'.upper(),
                       # this helps:
                       '_pdbx_database_status.recvd_initial_deposition_date': dt.date.today().isoformat()})
# add REMARK metadata
structure.meta.solved_by = '👾👾👾'  # REMARK 300 is free text meant for bio detail
For the CIF dictionary entries themselves, given a block (e.g. from structure.make_mmcif_block()), one can theoretically iterate over its items (pairs / loops), but it is funky:
item: gemmi.cif.Item
for item in block:
    if item.pair:  # None or Tuple[str, str]
        print('pair', item.pair[0])
    elif item.loop:  # None or gemmi.cif.Loop
        print('loop', item.loop.tags)  # .tags is List[str]
So a slightly less tedious way is using block.find_loop or block.find_pair. If they don’t exist, one can create a new gemmi.cif.Item with a gemmi.cif.Loop via block.init_loop (see docs). The loops in turn can be modified with loop.set_all_values, loop.add_row or, for the adventurous, loop.add_columns. As mentioned, do remember to gemmi.cif.quote! For example saving a pandas table as a loop:
RFdiffusion results
Digression: granularity
In molecular mechanics, the objects that bounce around or act as anchors are referred to as particles. These commonly represent atoms, but non-atomic particles are possible, for example the Drude particle discussed in a previous post. This semantic distinction becomes helpful when coarse-grained models are used, namely models where a particle represents one or more atoms. In a Martini model, for example, a trio of atoms is represented as a single particle, called a bead. In Rosetta, there are two resolutions: centroid, wherein residues are represented by the backbone heavy atoms and a centroid atom, a faux β-carbon of different sizes, and full-atom, where every atom is present, including hydrogens. These particles have certain types, which are not simply elements, and in my opinion could be very useful inspiration for how to define atom types in novel neural networks. In the Martini bead types the abstraction is very interesting as there are four categories of bead types (charged, polar, apolar and nonpolar) with a size subcategory. In Rosetta and MD forcefields, the atom types are not solely elements, but also hybridisation states and much more, for example disentangling pyridinic and pyrrolic nitrogens, which cause problems in diffusion models as discussed previously.
This parenthesis on granularity is required to discuss which flavour of RFdiffusion to use. There are two versions: the original, “vanilla” version (vanilla in CS slang is the unmodified original, which may not be legacy), and the all-atom version. The two are similar and differ in how ligands are handled (or not): when ligands are absent, it makes sense to use the vanilla version as the code will be quicker. This is because in the all-atom version the ligands are represented with a particle for each atom, while the residues are represented in the same way, a single particle per residue, and the guiding gradient potentials are unchanged.
PDB file
The output PDB file has the designed residues as glycines with a B-factor of 0, while the conserved residues are without sidechains and have a B-factor of 1. The position of the design will differ from the template if the argument inference.recenter is not passed, or under certain conditions which ignore inference.recenter=false.
Salty axis PDB
Here is a PDB with salt at the origin and along the axes for sanity checks (useful for symmetry).
HEADER    FAKE PDB STRUCTURE
HETATM    1 NA    NA Z   1       0.000   0.000   0.000  1.00  0.00          NA
HETATM    2 CL    CL Z   2       0.000   0.000   5.000  1.00  0.00          CL
HETATM    3 MG    MG Z   3       0.000   0.000  10.000  1.00  0.00          MG
HETATM    4 MG    MG Z   4       0.000   0.000  20.000  1.00  0.00          MG
HETATM    5 CA    CA Z   5       0.000   5.000   0.000  1.00  0.00          CA
HETATM    6 CA    CA Z   6       5.000   0.000   0.000  1.00  0.00          CA
TER
END

The re-centred coordinates will have the centre of mass at the origin and the moments of inertia along the axes. A special case is symmetry, where the axis of rotational symmetry is the Z-axis, so the template needs to be placed just right.
Gemmi does not like the fact that the residues have no sidechains, so reading such a PDB will result in gemmi.EntityType.Unknown, which will not give a one-letter sequence, among other tripping points. Fix:
Gemmi snippet to read RFdiffusion PDB
from typing import Union
from pathlib import Path
import gemmi

def read_rf_pdb(path: Union[str, Path]) -> gemmi.Structure:
    """
    Gemmi has a problem with RFdiffusion PDBs as they lack sidechains,
    and will NOT assign ``gemmi.EntityType.Polymer``.
    Forcing it allows ``chain.get_polymer().make_one_letter_sequence()`` to work.
    """
    design: gemmi.Structure = gemmi.read_structure(str(path))
    # Who are you calling smelly? Force to be a polymer
    residue: gemmi.Residue
    for residue in design[0]['A']:
        residue.entity_type = gemmi.EntityType.Polymer
    return design
This works for sequence extraction (chain.get_polymer().make_one_letter_sequence().replace('-','') —where the ‘-’ in the extracted sequence are gaps, a classic tripping point), but it would be more sensible to mask the designed residues (using the B-factor):
Gemmi snippet to read sequence from PDB masking designed (using `atom.b_iso`)
from Bio.SeqUtils import IUPACData
import gemmi

def get_masked_seq(chain: gemmi.Chain) -> str:
    """
    RFdiffusion PDBs have a B-factor of 0. for the designed residues.
    This is the same as ``chain.get_polymer().make_one_letter_sequence()``,
    but adds X for the designed positions.
    """
    get_letter = lambda residue: IUPACData.protein_letters_3to1[residue.name.capitalize()]
    letters = [get_letter(residue) if residue.get_ca().b_iso != 0.0 else 'X'
               for residue in chain]
    return ''.join(letters)
To superpose by alignment to the template one can do the following (see also a previous post for the same but with PyRosetta):
Superpose by alignment in Gemmi
import gemmi
from typing import Tuple, Dict, Sequence
from Bio.Align import PairwiseAligner, Alignment, substitution_matrices

def get_transformation(ref_model, mob_model,
                       residue_pairs: Sequence[Tuple[Tuple[str, int], Tuple[str, int]]]) -> gemmi.SupResult:
    ref_positions = []
    mob_positions = []
    if isinstance(residue_pairs, dict):
        residue_pairs = list(residue_pairs.items())
    for (res1, res2) in residue_pairs:
        chain_1, resnum_1 = res1
        chain_2, resnum_2 = res2
        ref_residue: gemmi.Residue = ref_model[chain_1][resnum_1]
        mob_residue: gemmi.Residue = mob_model[chain_2][resnum_2]
        assert ref_residue.name == mob_residue.name, (res1, ref_residue.name, res2, mob_residue.name)
        ref_positions.append(ref_residue.get_ca().pos)
        mob_positions.append(mob_residue.get_ca().pos)
    return gemmi.superpose_positions(ref_positions, mob_positions)

def get_rmsd(ref_model, mob_model, residue_pairs) -> float:
    return get_transformation(ref_model, mob_model, residue_pairs).rmsd

def calculate_ref2query(ref_seq: str, query_seq: str) -> Dict[int, int]:
    aligner = PairwiseAligner()
    aligner.internal_gap_score = -2
    aligner.extend_gap_score = -0.01
    aligner.end_gap_score = -0.01
    aligner.match_score = 1
    aligner.mismatch_score = -1
    aln: Alignment = aligner.align(ref_seq, query_seq)[0]
    return {t: q for t, q in zip(aln.indices[0], aln.indices[1])}

# -------------------------
# ## Load (read_rf_pdb & get_masked_seq from the snippets above; `path` and `ref` assumed defined)
design: gemmi.Model = read_rf_pdb(str(path))[0]
hal_seq = get_masked_seq(design['A'])
ref_seq = ref['A'].get_polymer().make_one_letter_sequence().replace('-', '')  # remember gaps!
# ## Superpose
ref2query = calculate_ref2query(ref_seq=ref_seq, query_seq=hal_seq)
mapping = [(('A', i), ('A', j)) for i, j in ref2query.items() if i != -1 and j != -1]
transformation = get_transformation(ref, design, mapping)
design.transform_pos_and_adp(transformation.transform)
print(transformation.rmsd, 'Å RMSD')
# ## Save
out_struct = gemmi.Structure()
out_struct.add_model(design)
out_struct.write_pdb('test.pdb')
Often it may be beneficial to establish no-go zones, so given a PDB of some ghost to avoid after superposition, one can calculate the number of atoms that would clash:
Count clashes against ghost position
import numpy as np
from scipy.spatial.distance import cdist
# functions from earlier ...

ref: gemmi.Model = read_rf_pdb('👾👾👾.pdb')[0]
prior: gemmi.Model = gemmi.read_structure('prior_ligands.pdb')[0]
nogo: np.ndarray = model_to_numpy(prior, CA_only=False)
...  # superposition (see previous)
design.transform_pos_and_adp(transformation.transform)
pos = model_to_numpy(design)
distances = cdist(pos, nogo, metric='euclidean')
# who are you chaps? check ``set(np.where(distances < 2.0)[0])``
n_clashes = ((distances < 2.0).sum(axis=1) > 0).sum()
The above approach works in reverse too, namely to filter for closeness to a given aimed-for position. For enumeration of interacting residues in gemmi see the AF3 analysis section.
TRB file
The TRB file is a Python pickle that contains useful data, namely the lists of residues that are conserved (not designed, but possibly partially diffused) in three combinations:
- in zero-indexed format (*_idx0 suffix) or PDB numbering (*_pdb_idx suffix, a tuple of chain letter & number)
- for the template (*_ref_*) and the design (*_hal_*)
- for the main chain (no prefix, con_*), for the other chain if present (receptor_con_*) or both if multichain (complex_con_*)
One tripping point for me is that con_ref_idx0 is based on the contigmap used, so when the definition truncated the N-terminus then idx0 will start after it. So it is best to use con_ref_pdb_idx for referencing operations.
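To illustrate the tripping point with a hypothetical truncated design (values invented): the design numbering restarts at 1 while idx0 is contig-based, so mapping via the PDB indices is the safe route:

```python
# hypothetical TRB fragment for a design whose conserved template span starts at A304
trb = {'con_ref_pdb_idx': [('A', 304), ('A', 305), ('A', 306)],
       'con_ref_idx0':    [0, 1, 2],  # contigmap-based: starts at 0 despite the truncation
       'con_hal_pdb_idx': [('A', 1), ('A', 2), ('A', 3)]}

# design ('hal') position -> template ('ref') position
hal2ref = dict(zip(trb['con_hal_pdb_idx'], trb['con_ref_pdb_idx']))
hal2ref[('A', 3)]  # → ('A', 306), i.e. the template numbering
```
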
Load trb
import pickle
from pathlib import Path

def read_trb(design_name: str, folder_path=Path('.')) -> dict:
    if isinstance(design_name, Path):
        trb_path = design_name
    else:
        design_name = str(design_name).replace('.trb', '')
        # strip the proteinMPNN letter suffix if present:
        design_name = design_name if design_name[-1].isdigit() else design_name[:-1]
        trb_path = Path(folder_path) / f'{design_name}.trb'
    if not trb_path.exists():
        raise ValueError(f'{trb_path} does not exist')
    trb = pickle.load(trb_path.open('rb'))
    return trb
The keys for a single chain design are:
dict_keys(['config', 'plddt', 'device', 'time', 'con_ref_pdb_idx', 'con_hal_pdb_idx', 'con_ref_idx0', 'con_hal_idx0', 'inpaint_str', 'inpaint_seq', 'sampled_mask', 'mask_1d'])
Whereas for a two chain design they are:
dict_keys(['config', 'plddt', 'device', 'time', 'con_ref_pdb_idx', 'con_hal_pdb_idx', 'con_ref_idx0', 'con_hal_idx0', 'complex_con_ref_pdb_idx', 'complex_con_hal_pdb_idx', 'receptor_con_ref_pdb_idx', 'receptor_con_hal_pdb_idx', 'complex_con_ref_idx0', 'complex_con_hal_idx0', 'receptor_con_ref_idx0', 'receptor_con_hal_idx0', 'inpaint_str', 'inpaint_seq', 'sampled_mask', 'mask_1d'])
The TRB file is useful downstream to get which residues were originally designed. Say one has a pose with sidechains (e.g. an AlphaFold3 model after ProteinMPNN, or sidechains added in PyRosetta as seen in a previous post): one can select the designed residues and get what the interacting residues were in the starting PDB, for example:
PyRosetta selectors from TRB, with CloseContactResidueSelector to get interactions, but based on template PDB numbering (as RFdiffusion resets the PDB numbering)
from types import ModuleType
from typing import TypeAlias, List
import pyrosetta

Idx0Int: TypeAlias = int  # zero-based index
Idx1Int: TypeAlias = int  # one-based index
pr_res: ModuleType = pyrosetta.rosetta.core.select.residue_selector
pru: ModuleType = pyrosetta.rosetta.utility
...  # pyrosetta initialised?

def _get_full_key(trb: dict, base: str, use_complex: bool = True) -> str:
    if use_complex and 'complex_' + base in trb:
        return 'complex_' + base
    return base

def make_conserved_selector(trb: dict, use_complex: bool = True) -> pr_res.ResidueSelector:
    """Selector for the conserved residues."""
    con = pr_res.ResidueIndexSelector()
    key: str = _get_full_key(trb, 'con_hal_idx0', use_complex)
    for chain, idx0 in trb[key]:
        # technically, the numbering is reset by AF3 again, but the same way: 1 onwards
        con.append_index(idx0 + 1)
    return con

def make_novel_selector(trb: dict) -> pr_res.ResidueSelector:
    """Selector for the designed (not necessarily inpainted) residues."""
    con = make_conserved_selector(trb)
    novel = pr_res.AndResidueSelector(pr_res.NotResidueSelector(con),
                                      pr_res.ChainSelector('A'))
    return novel

def get_novel_itx_resi(pose, trb: dict, threshold: float = 3.5) -> pru.vector1_unsigned_long:
    """Get the close-contact residue idx1s between the novel span and the rest."""
    novel = make_novel_selector(trb)
    contacts = pr_res.CloseContactResidueSelector()
    contacts.central_residue_group_selector(novel)
    contacts.threshold(threshold)
    return pr_res.ResidueVector(contacts.apply(pose))

def get_itxs(pose, trb: dict, use_complex: bool = True) -> List[int]:
    """
    Convert the itxs to those of the ref PDB.
    This is for single chain design. For other cases, tweak as required.
    """
    neighs = get_novel_itx_resi(pose, trb)
    hal_key: str = _get_full_key(trb, 'con_hal_pdb_idx', use_complex)
    ref_key: str = _get_full_key(trb, 'con_ref_pdb_idx', use_complex)
    hal2ref = dict(zip(trb[hal_key], trb[ref_key]))
    return [hal2ref['A', idx1][1] for idx1 in neighs if ('A', idx1) in hal2ref]
Another useful value is the minimum pLDDT (the residue with the least confidence). The pLDDT np.array is shaped (residue_number, diffuser.T), where the latter is 50. The last column is the final model.
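For instance, with a stand-in array of the same shape (real values come from trb['plddt']):

```python
import numpy as np

# stand-in for trb['plddt'], shaped (n_residues, diffuser.T)
rng = np.random.default_rng(1)
plddt = rng.uniform(0.3, 0.9, size=(120, 50))

final = plddt[:, -1]                  # last column = final model
min_plddt = float(final.min())        # confidence of the worst residue
worst_residue_idx0 = int(final.argmin())
```
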
Secondary structure sequence
One detail that is missing but can be useful is the secondary structure. First a distinction needs to be made: SS annotation uses the PDB, whereas SS prediction uses the sequence. In 1983 a program called DSSP came out for SS annotation, and variants have made their way into several programs. There is a rewrite of DSSP for Python (dssp). There is also a function in PyRosetta:
DSS
...  # pyrosetta initialised?
pose: pyrosetta.Pose = pyrosetta.pose_from_pdb(pdb_path.as_posix())
dss: str = pyrosetta.rosetta.core.scoring.dssp.Dssp(pose).get_dssp_secstruct()
As far as I know, gemmi lacks a DSSP wrapper, but there are many tools that offer it, say mdtraj.compute_dssp. So it is basically down to personal preference.
Visualisation of pLDDT to show that the last column is the final
import pandas as pd
import plotly.express as px

df = (pd.DataFrame(trb['plddt'])
      .reset_index()
      .melt(id_vars="index", var_name="residue_idx", value_name="pLDDT")
      .rename(columns={"index": "generation"}))
fig = px.scatter(df, x="residue_idx", y="pLDDT", color="generation",
                 title='Improvement of pLDDT per generation', template="plotly_white")
fig.show()

Shape
RFdiffusion does tend to go wild and make long, skinny designs that stick out. An easy metric is extracting the coordinates in gemmi, getting the centroid and finding the most distant point. A more elegant solution is finding the principal moments of inertia (with numpy.linalg.eigh on an inertia tensor). A third, more convoluted solution is calculating the rotated bounding box:
Shape in gemmi
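A minimal numpy sketch of the first two metrics, assuming coords is an (N, 3) array of CA positions (here faked with an elongated point cloud in place of a gemmi-extracted array):

```python
import numpy as np

# fake, elongated CA coordinates standing in for a gemmi-extracted array
rng = np.random.default_rng(0)
coords = rng.normal(size=(100, 3)) * [30.0, 5.0, 5.0]

# metric 1: most distant atom from the centroid
centroid = coords.mean(axis=0)
max_radius = np.linalg.norm(coords - centroid, axis=1).max()

# metric 2: principal moments of inertia (unit masses)
d = coords - centroid
inertia = np.eye(3) * (d ** 2).sum() - np.einsum('ni,nj->ij', d, d)
moments, _ = np.linalg.eigh(inertia)
# a skinny design has one small moment (about the long axis) relative to the others
elongation = moments.max() / moments.min()
```
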
Parenthesis: Idx0
When dealing with poses, explicitly stating idx0 or idx1 in the variable names is the best way to avoid fencepost errors. This is especially true with PyRosetta, which is one-indexed / Fortran-indexed. This does make the code a bit Java-esque, so typehinting can help.
Typealias
from typing import TypeAlias, List

Idx0Int: TypeAlias = int  # zero-based index
Idx1Int: TypeAlias = int  # one-based index
# in 3.12 onwards one can use the type statement:
# type Idx0Int = int

mutanda: List[Idx0Int] = ...
Trajectory files
In the slurm script above the flag inference.write_trajectory=false is passed: the trajectories are about 0.1 MB per residue depending on the settings, so they get very heavy quickly. The 👾👾_n_pX0_traj.pdb file is the partial noising, while 👾👾_n_Xt-1_traj.pdb is the denoising. These are for pretty videos, as the only debugging info one gets is that each partial noising step can cause shifts of about 5 Å, which is useful for back-of-the-envelope maths to determine how many iterations to set the partial diffusion to (maximum lazy maths: double the max acceptable shift).
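The lazy maths could be sketched as a hypothetical helper — the 5 Å per-step figure is the eyeballed value from the trajectories, and the doubling is my reading of the rule of thumb:

```python
import math

ANGSTROM_PER_STEP = 5.0  # rough shift per partial-noising step, eyeballed from trajectories

def estimate_partial_steps(max_acceptable_shift: float) -> int:
    """Maximum-lazy estimate: budget double the acceptable shift,
    then count how many ~5 Angstrom steps that buys."""
    return max(1, math.ceil(2 * max_acceptable_shift / ANGSTROM_PER_STEP))

estimate_partial_steps(10.0)  # → 4
```
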
RFdiffusion analysis
pLDDT vs. final outcome
One point I would have hoped for is that the pLDDT scores at this stage would recapitulate downstream metrics: what is rubbish now will still be rubbish later. For most experiments there is only a weak correlation with most downstream metrics. I will discuss below the pitfall of comparing absolute Gibbs free energies of proteins of different lengths, i.e. it is not a great metric out-of-the-box, but generally there is a weak trend if any. This holds even for RFdiffusion pLDDT.

Visual inspection
John Tukey said that visual inspection is the most powerful tool in the arsenal of an analyst, as it provides indications of unexpected phenomena. This is as true for outliers beyond the namesake’s fences as it is for protein design. …Because the unexpected is the default.
See above for automated superposing with Gemmi or PyRosetta.
However, in a pinch, opening a dozen in PyMOL and ‘aligning’ to the template works —Actions > align > enabled to this, or in the terminal [cmd.align(name, 'template') for name in cmd.get_names() if '👾👾👾' in name] if they have a common pattern 👾👾👾 in their name. As the designed span will have zero B-factors, these can be highlighted nicely in PyMOL:
color 0xFF8559, element C  # coral all
color 0x40E0D0, element C and template  # turquoise for the object you called 'template'
color white, element C and b > 0.1 and not template
set cartoon_transparency, 0.5, b > 0.1 and not template  # 0.5 as an example value
ProteinMPNN
The next step is the inverse folding, i.e. assigning a sequence to a fold.
Custom helper functions
In a monographic post about PyRosetta used with RFdiffusion, I mention that I modified / rewrote the ProteinMPNN helper scripts to be usable as a Python API. This is because, as they are, they are rather problematic to use for what I want them to do.
The actual run is invoked with a few arguments, crucially one JSONL (--jsonl_path) with the chain definitions, one JSON (--chain_id_jsonl) with the fixed chains, and one JSON (--fixed_positions_jsonl) with the fixed positions. The first is a JSONL where each line is a JSON dictionary with the field name. Each of these lines triggers a run —any repeated names will be run again and overridden. The other JSON files each contain a single dictionary whose keys are those name values. The chain-definition JSONL contains atomic coordinates, so it can get big (i.e. delete after use).
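As a rough sketch of the file shapes (the fields beyond name are simplified placeholders — the real helper scripts emit more, such as the per-chain coordinates):

```python
import json
from pathlib import Path

# one JSON dictionary per line, each with a ``name`` field triggering a run
definitions = [{'name': 'design_001', 'num_of_chains': 2},
               {'name': 'design_002', 'num_of_chains': 2}]
Path('chains_definitions.jsonl').write_text('\n'.join(json.dumps(d) for d in definitions))

# the companion JSONs are single dictionaries keyed by those names
fixed_chains = {d['name']: [['A'], ['B']] for d in definitions}  # e.g. designed vs fixed chains
Path('fixed_chains.json').write_text(json.dumps(fixed_chains))
```
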
One argument (--path_to_model_weights) selects the model to use. The soluble_model_weights set is the recommended one and gets called solubleMPNN in some papers.
Results: scores
For each protein a fasta is generated. The first is the reference with polyglycine. The sequence headers contain the fields ‘score’, ‘global_score’, and ‘seq_recovery’. The two scores are the derived from the losses as calculated by torch.nn.NLLLoss
. Specificially they are the arithmetic mean of the losses either masked for design or all positions. That means low score is good, like the “score” in a golf game, not the score in say Carcassonne. High seq_recovery
is good. The script has also a scoring-only mode (argument --score_only 1
) wherein a provided fasta (--path_to_fasta
) is scored, which can give a nice point of reference for better-than-wild-type cutoffs.
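The relationship between the two scores can be sketched with made-up numbers (probs and the mask are invented; ProteinMPNN computes the real thing internally):

```python
import math

# made-up probabilities ProteinMPNN might assign to the chosen amino acids
probs = [0.60, 0.10, 0.30, 0.90, 0.85]
designed_mask = [True, True, True, False, False]  # first three positions were designed

nll = [-math.log(p) for p in probs]  # negative log-likelihood per position
score = sum(l for l, m in zip(nll, designed_mask) if m) / sum(designed_mask)  # designed only
global_score = sum(nll) / len(nll)  # all positions
# lower is better for both, golf-style
```
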
Snippet to read fasta files to dataframe
import string
from pathlib import Path
import re
import pandas as pd
from typing import List

def parse_header(header: str) -> dict:
    return dict(re.findall(r'([\w_]+)=([^\s,]+)', header))

def parse_seqs(path: Path) -> List[dict]:
    seq_info = []
    lines = path.read_text().strip().split('\n')
    if len(lines) < 2:
        print(f'{path} is empty...')
        return []  # 'not my problem'
    first = {'model': lines[0][1:].split(',')[0],
             'variant': 0,
             **parse_header(lines[0]),
             'is_control': True,
             'seq': lines[1]}
    seq_info.append(first)
    for i, (header, seq) in enumerate(zip(lines[2::2], lines[3::2])):
        current = {**first,
                   'variant': i + 1,
                   'is_control': False,
                   **parse_header(header),
                   'seq': seq}
        seq_info.append(current)
    return seq_info

def parse_seq_folder(seq_folder: Path) -> List[dict]:
    seq_info = []
    for path in Path(seq_folder).glob('*.fa'):
        seq_info.extend(parse_seqs(path))
    return seq_info

def seqs2df(seq_folder) -> pd.DataFrame:
    seq_info: List[dict] = parse_seq_folder(seq_folder)
    assert len(seq_info)
    df = pd.DataFrame(seq_info)
    df['score'] = df['score'].astype(float)
    df['global_score'] = df['global_score'].astype(float)
    df['seq_recovery'] = df['seq_recovery'].astype(float)
    df['letter'] = df.variant.map({0: 'Ø', **{i + 1: l for i, l in enumerate(string.ascii_uppercase)}})
    return df.sort_values('global_score').copy()

# --------------
df = seqs2df('sheet-extensions/chosen/seqs')
df['group'] = df.model.apply(lambda name: re.match(r'[^\d_]+', name).group(0).replace('B', '')
                             if re.match(r'\D+', name) else name.split('_')[0])
I keep the polyglycine as a negative control, sort by global_score, and use the first instance of a polyglycine sequence as a hard cutoff for global_score. For larger inserts there is a strong difference, while for smaller ones there is not, i.e. a polyglycine linker is more likely to pass off as okay.


The score and global_score tend to behave differently depending on the length: score tends to be lower with smaller insert designs, while global_score tends to get smaller with larger inserts on account of the greater number of contacts.
Number of runs
I normally perform 5 inverse folding runs (--num_seq_per_target) for each RFdiffusion design, although I believe the authors used 3. Using more might help, but in quick tests, in my hands, there is not a major improvement in either the scores or the sequence distance (see the Linear sequence distance section). Changing the sampling temperature (--sampling_temp) from the default 0.1 to a higher value will increase the diversity. I assign each sequence a letter, so, due to the bleaching of AlphaFold3 names, wherein unicode characters are stripped and all uppercase letters become lowercase, the hard limit in my workflow is 26.
Linear sequence distance
A possible task is determining the distance between the linear sequences, that is to say, independently of the 3D structure. This can be done with an alignment, albeit with tweaks, for example to force the alignment to match at the ends. Regardless of the exact weights, the distance metric has to use a substitution matrix. The pairwise distances can be used for tSNE plots or, more usefully, to annotate close repetitions in a sorted dataframe.
Linear sequence distance snippet
Using Bio.Align
from Bio import Align
from Bio.Align import substitution_matrices

def sequence_distance(seq1: str, seq2: str, length_normalized=False) -> float:
    """
    Given two sequences get the alignment distance.
    The returned value is a negative modified BLOSUM62 alignment score
    (see code for weights), wherein opening gaps at the ends is heavily penalised.
    """
    aligner = Align.PairwiseAligner()
    aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
    aligner.open_gap_score = -10  # P to W is -4
    aligner.end_open_gap_score = -20
    aligner.extend_gap_score = -8
    score = aligner.score(seq1, seq2)
    if not length_normalized:
        return -score
    return -score / max(min(len(seq1), len(seq2)), 1)
This distance function can be used to generate the triangular matrix of pairwise distances:
seqs: List[str] = ...
distances_tril = np.array([[sequence_distance(seqA, seqB) if i < j else float('nan')
                            for i, seqA in enumerate(seqs)]
                           for j, seqB in enumerate(seqs)])
distances = np.where(np.isnan(distances_tril), distances_tril.T, distances_tril)
Given a sorted dataframe, one can use the distances to mark repetitions, for example:
def add_distance_metrics(main_df, seq_series):
    """
    seq_series is a series with indices matching _some_ in main_df, i.e. it's a subset
    """
    seqs = list(seq_series.reset_index(drop=True).items())
    distances_tril = np.array([[sequence_distance(seqA, seqB) if i < j else float('nan')
                                for i, seqA in seqs]
                               for j, seqB in seqs])
    distances = np.where(np.isnan(distances_tril), distances_tril.T, distances_tril)
    main_df.loc[seq_series.index, 'lowest_seq_dist'] = pd.Series(np.nanmin(distances, axis=1),
                                                                 index=seq_series.index)
    tril_mins = np.nanmin(distances_tril, axis=1)
    betterers_lowest = pd.Series(tril_mins, index=seq_series.index).fillna(float('inf'))
    main_df.loc[seq_series.index, 'better_seq_dist'] = betterers_lowest
    main_df.loc[seq_series.index, 'is_better_than_closers'] = betterers_lowest >= -5
    print(len(seq_series), sum(betterers_lowest >= -5))
AlphaFold3
The next step is validating the sequences from the inverse folding step, by predicting their structure and comparing to the RFdiffusion structure. Recently, I have been using AlphaFold3 for this.
Input generation
The AlphaFold3 input is easy to make.
Example of Pythonic input JSON generation
name = 'wt'
out_folder = Path('👾👾')
other_sequences = []
sequence = '👾👾👾'
out_path = Path(out_folder) / f'{name}.json'
prot_sequence = dict(id='A', sequence=sequence, modifications=[],
                     unpairedMsa=None, pairedMsa=None)
job = dict(name=name,
           modelSeeds=[888],
           sequences=[dict(protein=prot_sequence), *other_sequences],
           dialect='alphafold3',
           version=2)
out_path.write_text(json.dumps(job))
The nice thing is one can add ATP, MG and other extras easily —in the above snippet other_sequences=[dict(ligand=dict(id='Z', ccdCodes=['ATP', 'MG', 'MG']))]
.
Any density gaps that were ignored at the design stage need to be fixed, however: ironically, in a first pass, polyglycine loops are okay.
Especially in the case where the N-terminus and C-terminus are wrapped inwards, getting the exact cloning construct details is important, as these might be incompatible with the designed span.
In some cases, splitting or joining the sequence might be needed.
Modify wild type MSA for each variant
With AlphaFold2 and AlphaFold3, the MSA can be the rate-limiting step.
AlphaFold3 uses a really tidy input set-up: in the sequence entry, one can set the key-value pairs unpairedMsa and pairedMsa to either None or a string. If None, the MSA will be generated; if a string, that will be used, so if it is blank, then no MSA is used. This allows these 5 options:
- Spend 20 minutes generating an MSA per variant…
- Quickly generate a MSA
- No MSA
- Use template PDB
- Modifying reference MSA
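In terms of the input JSON, these options differ only in what the MSA fields hold; a sketch of three of them (the sequence and A3M content are placeholders):

```python
# AF3 'protein' entries differing only in the MSA fields (placeholder sequence)
seq = 'MKTAYIAKQR'
generate_msa = dict(id='A', sequence=seq, unpairedMsa=None, pairedMsa=None)  # AF3 builds the MSA
no_msa = dict(id='A', sequence=seq, unpairedMsa='', pairedMsa='')  # blank string: no MSA
custom_msa = dict(id='A', sequence=seq,
                  unpairedMsa=f'>query\n{seq}\n',  # e.g. a modified reference A3M block
                  pairedMsa='')
```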
A faster MSA generation variant is found in ColabFold, which uses MMseqs2 and clustered UniProt sequences. However, generating an MSA per variant strikes me as wasteful.
One important detail is that the conserved part should preferably be reliably predicted. I ran a wild-type Rossmann fold with no MSA and the result looked mostly okay, but there was a problematic loop that differed in each and every repeat. This is the exact opposite of what I want to see.
One can provide a template and a mapping; however, in my hands, the results across repeats of the wild type were not great. I do not know if this is a common phenomenon, but it was a good reason for me to avoid it.
This means the best option, in my opinion, is to modify a reference MSA to suit the sequence.
The two MSA blocks generated during inference by AlphaFold3 are stored in the 👾👾_data.json file. These are, as for AlphaFold2, in the A3M format. Namely, it is like a fasta, but the first (query) sequence is all uppercase, while subsequent sequences have hyphens for gaps and lowercase letters for insertions relative to the query.
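To illustrate the convention with a made-up A3M entry:

```python
# Made-up A3M aligned sequence: uppercase/gap characters are columns aligned
# to the query; lowercase letters are insertions relative to it
a3m_entry = 'MK-Tav-LE'

aligned_columns = [c for c in a3m_entry if not c.islower()]
insertions = [c for c in a3m_entry if c.islower()]
```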
Read MSA
d = json.load(open('wt/wt_data.json'))
ref_paired_msa = d['sequences'][0]['protein']['pairedMsa']
ref_unpaired_msa = d['sequences'][0]['protein']['unpairedMsa']
Path('paired_msa.a3m').write_text(ref_paired_msa)
Path('unpaired_msa.a3m').write_text(ref_unpaired_msa)
If the query sequence has an insert, all aligned sequences would likely have a gap there. Therefore, for each sequence the aligned positions need to be matched and a gap inserted in the appropriate location. This is nominally simple, but rather convoluted in practice.
Modify MSA
An A3M block can be modified thusly:
import pickle
import json
import os
from typing import Dict, Optional
from Bio.Align import PairwiseAligner, Alignment, substitution_matrices
from Bio import SeqIO
import io
import re

def calculate_ref2query(ref_seq, query_seq) -> Dict[int, int]:
    """
    Predict mapping of ref to query sequence
    """
    aligner = PairwiseAligner()
    aligner.internal_gap_score = -2
    aligner.extend_gap_score = -0.01
    aligner.end_gap_score = -0.01
    aligner.match = 1
    aligner.mismatch = -1
    aln: Alignment = aligner.align(ref_seq, query_seq)[0]
    return {t: q for t, q in zip(aln.indices[0], aln.indices[1])}

def convert_msa(design_seq, msa, present2wanted):
    """
    Given an A3M-format MSA string (``msa``) generated against a sequence differing
    from the desired sequence ``design_seq``, and a dictionary mapping the idx0 of
    residues in the present sequence to the wanted sequence.
    The latter is generated via ``calculate_ref2query``.
    """
    new_msa = f'>query\n{design_seq}\n'
    for record in SeqIO.parse(io.StringIO(msa), "fasta"):
        new_seq = convert_seq(str(record.seq), present2wanted, full_len=len(design_seq))
        new_msa += f'>{record.description}\n{new_seq}\n'
    return new_msa

def convert_seq(seq, present2wanted, full_len):
    """
    Given an A3M sequence and a mapping, modify the former to match the latter.
    See the ``calculate_ref2query`` function for the mapping.
    This is achieved in two steps. First the sequence is fragmented into the parts
    that map to the reference sequence (each element is an uppercase letter / gap
    plus any extra lowercase letters). Then this is rebuilt with the new spacing.
    """
    fragmented_seq = []
    for res in seq:
        if 97 <= ord(res) < 97 + 26:  # lowercase: insertion relative to the query
            fragmented_seq[-1] += res
        else:
            fragmented_seq.append(res)
    new_seq = ['-' for p in range(full_len)]
    previous_wanted_idx0 = -1
    for present_idx0, wanted_idx0 in present2wanted.items():
        if present_idx0 == -1:
            continue  # gap in the present sequence: no fragment to place
        if wanted_idx0 == -1 and previous_wanted_idx0 != -1:
            new_seq[previous_wanted_idx0] += fragmented_seq[present_idx0].lower().replace('-', '')
        else:
            new_seq[wanted_idx0] = fragmented_seq[present_idx0]
            previous_wanted_idx0 = wanted_idx0
    return ''.join(new_seq)
This can be run twice (once per MSA block) for each query sequence in a dataframe:
def generate_AF3_jobs(df, out_folder: Path, listing_path: Path, max_job: int = -1,
                      name_col: str = 'name', sequence_col: str = 'seq',
                      ref_seq='', ref_paired_msa='', ref_unpaired_msa='',
                      other_sequences=()):
    """
    Generate a bunch of input JSONs for AlphaFold3 with modified MSAs for the
    sequences in column ``sequence_col``, with job names as specified in ``name_col``.

    ``ref_paired_msa`` and/or ``ref_unpaired_msa`` are A3M MSA blocks for ``ref_seq``,
    which may differ from that of the query sequences.
    If ``ref_seq`` is not specified, then an MSA will be generated.
    If ``ref_seq`` is specified, but ``ref_paired_msa`` or ``ref_unpaired_msa`` is not,
    no MSA will be used for that one.
    If ``ref_seq`` is specified and ``ref_paired_msa`` and/or ``ref_unpaired_msa`` are,
    then the A3M will be modified to match the query.

    ``out_folder`` is for the AF3 input JSONs: the design paths are specified at AF3 run time.
    """
    os.makedirs(out_folder, exist_ok=True)
    job_list = []
    for i, row in df.reset_index().iterrows():
        if row.is_control:
            break
        if i == max_job:
            break
        name = row[name_col]
        out_path = Path(out_folder) / f'{name}.json'
        sequence = row[sequence_col]
        prot_sequence = dict(id='A', sequence=sequence, modifications=[],
                             unpairedMsa=None, pairedMsa=None)
        if ref_seq:
            ref2query = calculate_ref2query(ref_seq, sequence)
            if ref_paired_msa:
                prot_sequence['pairedMsa'] = convert_msa(sequence, ref_paired_msa, ref2query)
            if ref_unpaired_msa:
                prot_sequence['unpairedMsa'] = convert_msa(sequence, ref_unpaired_msa, ref2query)
        job = dict(name=name,
                   modelSeeds=[888],
                   sequences=[dict(protein=prot_sequence), *other_sequences],
                   dialect='alphafold3',
                   version=2)
        out_path.write_text(json.dumps(job))
        job_list.append(out_path.as_posix())
    listing_path.write_text('\n'.join(job_list))
    return job_list
The latter can be run via a simple for loop in bash. Obviously, there are always issues, so my loop looks like this, with an external kill switch (a STOP file) and skipping of already-done sequences:
touch 'STOP'; rm 'STOP';
while IFS= read -r JSON_JOB; do
    # Check if STOP file exists
    if [[ -f STOP ]]; then
        echo "STOP file detected. Exiting loop.";
        break
    fi
    # hyphens are not bleached
    LOWERCASED=$(echo $(basename "$JSON_JOB" .json) | tr '[:upper:]' '[:lower:]');
    if [[ -f "af3_designs/${LOWERCASED}/${LOWERCASED}_model.cif" ]]; then
        echo "${JSON_JOB} DONE ALREADY";
        continue
    fi
    echo $JSON_JOB
    python run_alphafold.py --json_path="$JSON_JOB" \
        --output_dir=$DATA/mferla/designs/af3_designs \
        --db_dir=/tmp/public_databases \
        --model_dir=/tmp/af3_weights;
done < designs/jobs.txt
Analysis
The AlphaFold3 step is required to validate the sequence. There are a few metrics returned/available:
- RMSDs
- against original design
- of replicate against original design
- against desired and undesired scaffold state
- Confidence in structure
- Secondary structure
- PyRosetta score
Analysis — RMSDs
As mentioned, Gemmi is handy and fast for superposing structures.
Gemmi RMSD
import re
import io
import gemmi
import numpy as np
from pathlib import Path
from typing import List

def calc_rmsds(path_rf, path_af3):
    rf: gemmi.Structure = gemmi.read_structure(str(path_rf))
    af: gemmi.Structure = gemmi.read_structure(str(path_af3))
    rfA: gemmi.Chain = rf[0]['A']
    afA: gemmi.Chain = af[0]['A']
    assert len(rfA) == len(afA)
    # align by conserved (non-zero CA B-factor in the RFdiffusion model)
    con_mask: List[bool] = [resi.sole_atom('CA').b_iso > 0. for resi in rfA]
    con_ref_positions: List[gemmi.Position] = [resi.sole_atom('CA').pos
                                               for masked, resi in zip(con_mask, rfA) if masked]
    con_mob_positions: List[gemmi.Position] = [resi.sole_atom('CA').pos
                                               for masked, resi in zip(con_mask, afA) if masked]
    transformation: gemmi.SupResult = gemmi.superpose_positions(con_ref_positions, con_mob_positions)
    con_rmsd = transformation.rmsd
    af[0].transform_pos_and_adp(transformation.transform)
    # calculate rmsd of the designed span (w/o further superposition)
    hal_ref_positions: List[gemmi.Position] = [resi.sole_atom('CA').pos
                                               for masked, resi in zip(con_mask, rfA) if not masked]
    hal_mob_positions: List[gemmi.Position] = [resi.sole_atom('CA').pos
                                               for masked, resi in zip(con_mask, afA) if not masked]
    # Unfortunately, RFdiffusion models do not register as polymertype.PeptideL
    # sup = gemmi.calculate_current_rmsd(rfA.get_subchain('A'), afA.get_subchain('A'),
    #                                    gemmi.PolymerType.PeptideL, gemmi.SupSelect.CaP)
    distances = np.array([r.dist(m) for r, m in zip(hal_ref_positions, hal_mob_positions)])
    hal_rmsd = np.sqrt((distances**2).mean())
    # mod structure: fade the designed span via occupancy, colour by per-residue score
    df = extract_pose_table(path_af3)  # this is a pyrosetta minimised AF3 model
    min_score = df['total'].iloc[2:].min()
    max_score = df['total'].iloc[2:].max()
    for masked, resi, score in zip(con_mask, afA, df['total'].iloc[2:]):
        for atom in resi:
            atom.occ = 1 if masked else 0.5
            atom.b_iso = (float(score) - min_score) / (max_score - min_score) * 10
    return {'model': af.make_pdb_string(),
            'con_rmsd': con_rmsd,
            'hal_rmsd': hal_rmsd,
            'con_mask': con_mask}
Which can be run like this for example:
_rmsds = {}
for af3_path in Path('designs/af3_relaxed').glob('*.pdb'):
    group = '_'.join(af3_path.stem.split('_')[:-1])
    rf_folder = Path('designs') / group
    rf_path = rf_folder / (af3_path.stem[:-1] + '.pdb')
    if not rf_path.exists():
        print(rf_path)
        continue
    info = calc_rmsds(path_rf=rf_path, path_af3=af3_path)  # ['model', 'con_rmsd', 'hal_rmsd', 'con_mask']
    del info['model']
    _rmsds[af3_path.stem] = info

rmsds = pd.DataFrame(_rmsds).T.sort_values('hal_rmsd').reset_index(names='design_name').copy()
AlphaFold3 generates 5 structures per seed, and the best of all is chosen. For some cases, for example when state A is preferred over state B, these replicates can be helpful. This is the same as the above but run over each model and against more than one template.
Analysis — Secondary structure
As explained above, secondary structure assignment (DSSP-style) can be helpful for specific cases and can be done at the RFdiffusion step or the AlphaFold3 step.
PyRosetta
Minimising the AlphaFold3 structures in PyRosetta and getting the score provides a physics-based metric, whereas all prior scores are confidences: they do not say whether the model is physically okay, just that it is likely.
Minimisation script and energy table reading snippet
I run multiple instances of a script (on CPU nodes) that minimises, with due checks, all PDBs in a folder. Something along these lines:
#!/usr/bin/env python
INPUT_FOLDER = '👾👾👾'
OUTPUT_FOLDER = '👾👾👾'

import pandas as pd
from pathlib import Path
import json, os, random, traceback
from pebble import ProcessPool, ProcessFuture
import pyrosetta
import pyrosetta_help as ph
from types import ModuleType

# Better than star imports:
prc: ModuleType = pyrosetta.rosetta.core
prp: ModuleType = pyrosetta.rosetta.protocols
pru: ModuleType = pyrosetta.rosetta.utility
prn: ModuleType = pyrosetta.rosetta.numeric
prs: ModuleType = pyrosetta.rosetta.std
pr_conf: ModuleType = pyrosetta.rosetta.core.conformation
pr_scoring: ModuleType = pyrosetta.rosetta.core.scoring
pr_res: ModuleType = pyrosetta.rosetta.core.select.residue_selector
pr_options: ModuleType = pyrosetta.rosetta.basic.options

logger = ph.configure_logger()
pyrosetta.distributed.maybe_init(extra_options=ph.make_option_string(no_optH=False,
                                                                     ex1=None,
                                                                     ex2=None,
                                                                     ignore_unrecognized_res=False,
                                                                     load_PDB_components=True,
                                                                     ignore_waters=True,
                                                                     )
                                 )

def path2pose(path: Path, ranking_score=0.) -> pyrosetta.Pose:
    path = Path(path)
    if '_database_PDB_remark.text' not in path.read_text():
        with path.open('a') as fh:
            fh.write(f'\n_database_PDB_remark.text "AlphaFold3 rank={path.name} ranking_score={ranking_score}"\n')
    pose = pyrosetta.Pose()
    prc.import_pose.pose_from_file(pose, path.as_posix(), False,
                                   pyrosetta.rosetta.core.import_pose.FileType.CIF_file)
    return pose

from Bio.Align import PairwiseAligner, Alignment, substitution_matrices
from typing import TypeAlias, Dict

FTypeIdx: TypeAlias = int  # one-based index
CTypeIdx: TypeAlias = int  # zero-based index

def minimize(model_path):
    name = model_path.stem.replace('_model', '')
    out_path = Path(OUTPUT_FOLDER) / f'{name}.pdb'
    taken_path = Path(OUTPUT_FOLDER) / f'{name}.pdb.hold'
    if not out_path.parent.exists():
        os.makedirs(out_path.parent, exist_ok=True)
    if out_path.exists() or taken_path.exists():
        return True
    taken_path.write_text('working on it!')
    scorefxn = pyrosetta.get_fa_scorefxn()
    scorefxn.set_weight(pr_scoring.ScoreType.atom_pair_constraint, 1)
    relax = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, 5)
    pose = path2pose(model_path)
    # disabled multichain!
    # last_polymer_idx1 = pose.total_residue() - int(pose.sequence()[-1] == 'Z')
    # ph.add_stretch_constraint(pose, weight=1, residue_index_B=last_polymer_idx1)
    relax.apply(pose)
    pose.dump_pdb(out_path.as_posix())
    os.remove(taken_path)
    return True

def get_max_cores():
    """
    The number of cores to use. Called by main.
    """
    if os.environ.get('SLURM_JOB_CPUS_PER_NODE'):
        return int(os.environ['SLURM_JOB_CPUS_PER_NODE'])
    else:
        return os.cpu_count()

def main(timeout=60 * 60 * 5):
    futuredex = {}
    with ProcessPool(max_workers=get_max_cores() - 1, max_tasks=0) as pool:
        # queue jobs
        model_paths = list(Path(INPUT_FOLDER).glob('*/*_model.cif'))
        random.shuffle(model_paths)
        for model_path in model_paths:
            future: ProcessFuture = pool.schedule(minimize,
                                                  kwargs=dict(model_path=model_path),
                                                  timeout=timeout)
            futuredex[model_path.stem] = future
        print(f'Submitted {len(futuredex)} processes')
        # ## Get results
        for name, future in futuredex.items():
            try:
                print(name)
                future.result()  # blocks until results are ready
            except Exception as error:
                if isinstance(error, TimeoutError):
                    print(f'Processing {name} took longer than {timeout} seconds {error}')
                else:
                    print(f"Processing {name} raised {error}")
                    traceback.print_tb(error.__traceback__)  # traceback of the function
    print('Complete', flush=True)

if __name__ == '__main__':
    main()
The PDB files outputted will have PyRosetta scores in a final non-standard comment block. This can be read easily:
import re
import io

def extract_pose_table(path):
    block = path.read_text().strip()
    start = block.find('#BEGIN_POSE_ENERGIES_TABLE')
    assert start != -1, f'{path} is not a pyrosetta minimised structure'
    end = block.find('#END_POSE_ENERGIES_TABLE')
    block = block[start:end]
    return pd.read_csv(io.StringIO(block), skiprows=1, delimiter=' ')

_scores = {}
for path in Path('af3_relaxed_designs').glob('*.pdb'):
    # rosetta files have scores in comments.
    for line in path.read_text().split('\n'):
        if not line.strip():
            continue
        parts = line.split()
        if parts[0] == 'label':
            header = parts[1:]
        elif parts[0] == 'pose':
            _scores[path.stem] = dict(zip(header, map(float, parts[1:])))
            break

scores = pd.DataFrame(_scores).T.sort_values('total').reset_index(names='design_name').copy()
del _scores
The Rosetta-minimised pose can be used for total score, per-residue scores, and interactions; see above for an example with CloseContactResidueSelector.
A nice filter is eliminating poses with a residue whose score is greater than the one seen in the reference structure. Generally this is around the 6–7 kcal/mol mark (higher for catalytic residues). If I have lots of poses, I generally exclude any pose whose maximum designed-span residue score is greater than the worst quartile.
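A sketch of that per-residue filter (the reference maximum and per-residue totals below are invented for illustration):

```python
# per-residue total scores over the designed span (toy values)
designs = {
    'design_A': [1.2, -0.5, 3.0],
    'design_B': [0.8, 9.1, 2.0],
}
ref_max = 6.5  # worst per-residue total seen in the reference structure (assumed)

passing = [name for name, residue_scores in designs.items()
           if max(residue_scores) <= ref_max]
```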
However, PyRosetta minimisation has some issues:
- Small shifts
- Comparing absolute energies
- Conserved part variations
- Solubility ≠ thermostability
No RMSD
The PyRosetta minimisation can move some parts slightly, causing an increased RMSD when compared to the RFdiffusion models, so it is best to compare the RMSDs of the unminimised AlphaFold3 models. One would guess that the physics would make the two converge, but considering that minimisation can change deposited PDB structures, and that the DL methods are trained on deposited PDBs, the discrepancy makes sense, even if the RFdiffusion structures can have nasty torsions.
Absolute free energy
A Rosetta score is an attempt to predict an absolute free energy, which will be a sum of the contributions of each atom involved: the more atoms there are, the lower the score. If a variety of lengths exist, then the best bet is to compare length-normalised, i.e. per-residue, scores.
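A toy example of why normalising by length matters when comparing totals (the values are invented):

```python
# Total Rosetta scores and residue counts for two hypothetical designs
totals = {'short_insert': -310.0, 'long_insert': -355.0}
lengths = {'short_insert': 150, 'long_insert': 180}

# the longer design wins on raw total simply by having more residues/atoms,
# but loses on a per-residue basis
per_residue = {name: totals[name] / lengths[name] for name in totals}
best_raw = min(totals, key=totals.get)
best_normalised = min(per_residue, key=per_residue.get)
```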
Solubility — fa_sol
Proteins are soluble because they are thermostable at that temperature and do not aggregate and precipitate. Given a protein species in a particular solvent and conditions, there will be a concentration at which it is supersaturated and falls out of solution. Precipitation happens when parts of a protein are sticky, possibly sticking to another unravelled protein, especially when there is a surface patch of exposed hydrophobic residues. Therefore assessing this can be useful.
For my problems, solubility is a double-edged sword: I want the proteins to express, but I also want them to crystallise …and precipitation vs. crystallisation are both processes where the protein comes out of solution when supersaturated, the latter differing in that it is orderly (cf. the metastable zone graph in any crystallography 101 slide deck). I effectively do not want the protein to be overly soluble, but I do not want it to aggregate. Therefore, I am not overly zealous with solubility as a ranking tool, only as a filter.
One Rosetta term (fa_sol) assesses the solvation of the residues and can be used to exclude possibly problematic structures, where the mean per-residue solvation term is overly bad. A single bad residue is not a problem (as the term maxes out at 10 kcal/mol), so the mean fa_sol, or the fraction of residues with the term greater than 5 kcal/mol, is a better value.
Per-residue terms
In addition to reading the terms one can score the residue. It is a bit convoluted so here I will simply use my own package:
import pyrosetta_help as ph  # my package, also seen in the init snippet etc.
...  # init pyrosetta
scores = ph.pose2pandas(pose, pyrosetta.get_fa_scorefxn())
scores[['residue', 'total_score', 'fa_atr', 'fa_elec', 'fa_rep', 'fa_sol']]
Parenthetically, solvation models are a topic of continual discussion and criticism, and this is especially true for Rosetta. The implicit solvation model used is a Lazaridis–Karplus potential, which is quicker but less accurate than the Generalised Born model, and has been noted to under-penalise buried polar residues (see Liu et al. 2017 for a nice discussion), so it is far from perfect. To complicate matters, there is the ref term, a statistically derived constant scalar for each residue type, which makes some predictions trickier. This empirical term exists to make ∆∆G calculations balanced, but it appears to disfavour apolar residues (the ref values for ref2015 are GLU=-2.7, ASP=-2.1, PRO=-1.6, GLN=-1.5, ASN=-1.3, LYS=-0.7, HIS=-0.3, SER=-0.3, ARG=-0.1, TYR=0.6, GLY=0.8, THR=1.2, PHE=1.2, ALA=1.3, MET=1.7, LEU=1.7, TRP=2.3, ILE=2.3, VAL=2.6, CYS=3.3), and thus is likely to go against the solvation term, which is likely scaled to counterbalance it. I have not accounted for this in any design ranking/filtering: I only mention it to note that the system is not perfect.
Solubility — hydrophobic SASA
The solvent accessible surface area (SASA) is a common and useful metric: a measure of how much surface area there is. It is a measure of compactness (less surface, more compactness), but not necessarily of globularity; for that see the gemmi shape discussion above.
Arguably, the most useful SASA calculation is the SASA of hydrophobic atoms only, hSASA in PyRosetta. SASA and hSASA do not correlate with the fa_sol term.
SASA
sasa = pyrosetta.rosetta.core.scoring.sasa.SasaCalc()
sasa.calculate(pose)
res_scores: pd.DataFrame = ...  # see above for per-residue scores
res_scores['sasa'] = sasa.get_residue_sasa()
res_scores['hsasa'] = sasa.get_residue_hsasa()
Solubility — inspection
Given a shortlist of under 20 designs, manually checking them is not much effort. For solubility, using the APBS plugin in PyMOL to check for large white patches is easy. One can also use one of a few servers to assess solubility, although note should be taken that the most popular ones, such as CamSol, are sequence based, not structure based.
Solubility — isoelectric point
The isoelectric point (pI) of a protein is a classic metric that correlates with solubility: when the isoelectric point is close to the solvent pH, the protein will tend to precipitate. Cytosolic proteins tend to have a pI of about 5, while nuclear proteins will be about 9. For de novo region design this metric does not make much sense, especially as the majority of the protein is conserved and ProteinMPNN tends to give unanimously low isoelectric points, a consequence of the PDB as the training set.
Calculate pI
from Bio.SeqUtils.IsoelectricPoint import IsoelectricPoint

sequence = pose.sequence()
analyser = IsoelectricPoint(sequence)
pI = analyser.pi()
Multi-factor selection
The final aim is to get a list of two or more designs for testing that are more likely to work and are diverse. There are multiple metrics in play, some can be used as filters and some for ranking. When there are multiple factors deemed beneficial there are a few options. Clustering by interactions is a sure approach for diversity, but often this is not relevant, e.g. when bridging a gap.
One can do a weighted multi-term score, where the weights are effectively chosen by whim. When wearing my cheminformatics hat, I do this for compound enumeration, as it is easier to tweak the weights and inspect the nature of the resulting poses and compounds (generally bad). For designs, the weighting is far from clear, so I have not done it. The values of each term ideally need to be normalised and tapered to avoid disruptive outliers:
Tanh-tapered z-score
def normalize(series: pd.Series, sigma_taper: float = 2) -> pd.Series:
    zscore = (series - series.mean()) / series.std()
    return zscore.apply(lambda val: np.tanh(val / sigma_taper) * sigma_taper)
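A sketch of how such a whim-weighted combination might be used (the normalize function is repeated so the snippet stands alone; the column names and weights are illustrative):

```python
import numpy as np
import pandas as pd

def normalize(series: pd.Series, sigma_taper: float = 2) -> pd.Series:
    zscore = (series - series.mean()) / series.std()
    return zscore.apply(lambda val: np.tanh(val / sigma_taper) * sigma_taper)

# toy metrics where lower is better for both columns
df = pd.DataFrame({'global_score': [0.9, 1.1, 1.4],
                   'hal_rmsd': [0.8, 2.0, 1.1]})
weights = {'global_score': 1.0, 'hal_rmsd': 0.5}  # chosen by whim
df['combined'] = sum(normalize(df[col]) * w for col, w in weights.items())
ranked = df.sort_values('combined')  # best design first
```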
For designs, I have preferred taking the Pareto frontier points.
Pareto frontier — pandas
# given a dataframe
df: pd.DataFrame = ...
# given two column names
colA_name: str = ...
colA_ascending: bool = True
colB_name: str = ...
colB_ascending: bool = True
# get the rows that are okay and sort by the columns
mask: pd.Series = ...
# e.g. mask = (df.AF3_has_clash == 0.) & (df.AF3_fraction_disordered == 0.) & (df.AF3_ptm >= 0.80)
filtered_df = df.loc[mask].sort_values(by=[colA_name, colB_name],
                                       ascending=[colA_ascending, colB_ascending])
# get the best for the first column
pareto = [filtered_df.iloc[0]]
# iterate finding all cases where the second column is better
for i, row in filtered_df.iloc[1:].iterrows():
    if row[colB_name] >= pareto[-1][colB_name]:
        pareto.append(row)
frontier = pd.DataFrame(pareto)
Refinements
The sequence can be refined via PyRosetta FastRelax with design, as discussed in a previous post. This increases the workload, however, which may well be overkill for an insert meant to stabilise a state.
Another refinement could be to pick the best designs and run partial diffusion on those to expand the diversity. This seems to be what is done in several papers for binder design. For my applications, I have not gone down this refinement route, as I am designing a linker or a small subdomain.
Acknowledgements
I do not do lab work and instead collaborate with wet-lab scientists. Therefore, I would like to thank my collaborators, especially Mike Fairhead for being patient with my endless designs and mistakes, and Mathew Martin for an insightful discussion on kinase states.