Detecting intermolecular interactions is often one of the first steps when assessing the binding mode of a ligand. This usually involves the human researcher opening up a molecular viewer and checking the orientations of the ligand and protein functional groups, sometimes aided by the viewer’s own interaction detecting functionality. For looking at single digit numbers of structures, this approach works fairly well, especially as more experienced researchers can spot cases where the automated interaction detection has failed. When analysing tens or hundreds of binding sites, however, an automated way of detecting and recording interaction information for downstream processing is needed. When I had to do this recently, I used an open-source Python module called ODDT (Open Drug Discovery Toolkit, its full documentation can be found here).
My use case was fairly standard: starting with a list of holo protein structures as pdb files and their corresponding ligands in .sdf format, I wanted to detect any hydrogen bonds between a ligand and its native protein crystal structure. Specifically, I needed the number and name of the the interacting residue, its chain ID, and the name of the protein atom involved in the interaction. A general example on how to do this can be found in the ODDT documentation. Below, I show how I have used the code on PDB structure 1a9u.
from pathlib import Path import oddt from oddt import interactions protein_path = Path('blopig_example', '1a9u_apo.pdb') ligand_path = Path('blopig_example', '1a9u_lig.sdf') oddt_prot = [x for x in oddt.toolkit.readfile('pdb', str(protein_path))][0] oddt_prot.protein=True # protein flag needs to be on for the molecule to be treated as a protein oddt_ligand = next(oddt.toolkit.readfile('sdf', str(ligand_path))) # Now, to find the interacting atoms on the protein and ligand prot_atoms, lig_atoms, strict = oddt.interactions.hbonds(oddt_prot, oddt_ligand)
Protein and ligand atoms interacting atoms are stored as Numpy dictionaries. These are great for downstream analysis, as they feed directly into the workhorse number-crunching Python modules (numpy, pandas, etc.…). The exposed properties for the atom and residue dictionaries can be found here. In the prot_atoms list above, there are 4 entries, as 4 H-bonds were detected (the lig_atoms list also has 4 entries, holding the corresponding interacting atoms on the ligand side). Let’s look at the protein atom involved in the first interaction.
# Get the protein atom involved in the first H-bond p_at = prot_atoms[0] p_at_resname = p_at['resname'] p_at_resnum = p_at['resnum']
Printing the p_at_resname and p_at_resnum variables reveals that the interacting protein residue is Met 109 – one of the hinge residues in this kinase structure, which is expected to interact with the ligand. So far so good.
The next piece of information that I need is the chain ID of the parent residue, as well as the name of the interacting protein atom in the PDB file. Looking at the documentation, these do not seem to be exposed as parent residue properties in the atom_dict, but there are other ways to get them. The one I used involved mapping back to the underlying RDKit atom object in the main oddt_prot structure, which stores a vast amount of information on both the atom and its parent residue.
The ODDT package can use either RDKit or OpenBabel as a backend toolkit, and so at least one of these needs to be installed in the current Python environment. I only had RDKit in my environment, but underlying OpenBabel objects (if using the OpenBabel backend) can be accessed in a similar way.
# Map back to the atom in the protein object main_struct_at = oddt_prot.atoms[p_at['id']] # Get the underlying rdkit Atom object: main_struct_at_rdk = main_struct_at.Atom # Get the chain ID and PDB-format atom name chain_id = main_struct_at_rdk.GetPDBResidueInfo().GetChainId()r atom_name = main_struct_at_rdk.GetPDBResidueInfo().GetName()
This reveals that the backbone N atom of Met 109 interacts with the ligand. Now, let’s look at all of the protein interacting atoms.
import pandas as pd import numpy as np resis = [] resns = [] chains = [] atomns = [] for p_at in prot_atoms: resis.append(p_at['resnum']) resns.append(p_at['resname']) # Map back to the atom in the protein object main_struct_at = oddt_prot.atoms[p_at['id']] # Get the underlying rdkit Atom object: main_struct_at_rdk = main_struct_at.Atom # Get the chain ID and PDB-format atom name chains.append(main_struct_at_rdk.GetPDBResidueInfo().GetChainId()) atomns.append(main_struct_at_rdk.GetPDBResidueInfo().GetName()) colnames = ['residue_number', 'residue_name', 'chain', 'atom'] prot_df = pd.DataFrame(data=np.array([resis, resns, chains, atomns]).T, columns = colnames) prot_df
This is consistent with what is known about the binding of this ligand from the structure. This type of calculation can now be done on the full list of protein structures, and output the dataframe as a csv file, or plugged directly into downstream analysis.