In this post I will cover how to calculate sequence identity and Tanimoto similarity between any pairs of complexes in PDBbind 2020. I used RDKit in python for Tanimoto similarity and the MMseqs2 software for sequence identity calculations.
A few weeks back I wanted to cluster the protein-ligand complexes in PDBbind 2020, but to achieve this I first needed to precompute the sequence identity between all pairs sequences in PDBbind, and Tanimoto similarity between all pairs of ligands. PDBbind 2020 includes 19.443 complexes but there are much fewer distinct ligands and proteins than that. However, I kept things simple and calculated the similarities for all 19.443*19.443 pairs. Calculating the Tanimoto similarity is relatively easy thanks to the BulkTanimotoSimilarity function in RDKit. The following code should do the trick:
from rdkit.Chem import AllChem, MolFromMol2File from rdkit.DataStructs import BulkTanimotoSimilarity import numpy as np import os fps = [] for pdb in pdbs: mol = MolFromMol2File(os.path.join('data', pdb, f'{pdb}_ligand.mol2')) fps.append(AllChem.GetMorganFingerprint(mol, 3)) sims = [] for i in range(len(fps)): sims.append(BulkTanimotoSimilarity(fps[i],fps)) arr = np.array(sims) np.savez_compressed('data/tanimoto_similarity.npz', arr)
Sequence identity calculations in python with Biopandas turned out to be too slow for this amount of data so I used the ultra fast MMseqs2. The first step to running MMseqs2 is to create a .fasta file of all the sequences, which I call QUERY.fasta. This is what the first few lines look like:
>3zzf_A NGFSATRSTVIQLLNNISTKREVEQYLKYFTSVSQQQFAVIKVGGAIISDNLHELASCLA FLYHVGLYPIVLHGTGPQVNGRLEAQGIEPDYIDGIRITDEHTMAVVRKCFLEQNLKLVT ALEQLGVRARPITSGVFTADYLDKDKYKLVGNIKSVTKEPIEASIKAGALPILTSLAETA SGQMLNVNADVAAGELARVFEPLKIVYLNEKGGIINGSTGEKISMINLDEEYDDLMKQSW VKYGTKLKIREIKELLDYLPRSSSVAIINVQDLQKELFTDSGAGTMIRRGY >3gww_A REHWATRLGLILAMAGNAVGLGNFLRFPVQAAENGGGAFMIPYIIAFLLVGIPLMWIEWA MGRYGGAQGHGTTPAIFYLLWRNRFAKILGVFGLWIPLVVAIYYVYIESWTLGFAIKFLV GLVPEPPPDSILRPFKEFLYSYIGVPKGDEPILKPSLFAYIVFLITMFINVSILIRGISK GIERFAKIAMPTLFILAVFLVIRVFLLETPNGTAADGLNFLWTPDFEKLKDPGVWIAAVG QIFFTLSLGFGAIITYASYVRKDQDIVLSGLTAATLNEKAEVILGGSISIPAAVAFFGVA NAVAIAKAGAFNLGFITLPAIFSQTAGGTFLGFLWFFLLFFAGLTSSIAIMQPMIAFLED ELKLSRKHAVLWTAAIVFFSAHLVMFLNKSLDEMDFWAGTIGVVFFGLTELIIFFWIFGA DKAWEEINRGGIIKVPRIYYYVMRYITPAFLAVLLVVWAREYIPKIMEETHWTVWITRFY IIGLFLFLTFLVFLAERRRNH >1w8l_A MVNPTVFFDIAVDGEPLGRVSFELFADKVPKTAENFRALSTGEKGFGYKGSCFHRIIPGF MCQGGDFTRHNGTGGKSIYGEKFEDENFILKHTGPGILSMANAGPNTNGSQFFICTAKTE WLDGKHVVFGKVKEGMNIVEAMERFGSRNGKTSKKITIADCGQLE
Most of the of the speed gains that MMseqs2 achieves in many-against-many sequence searching comes from its prefiltering stage, where it removes most target sequences for a given query that don’t fulfil certain requirements. Since we want to calculate the sequence identity for every pair, we have to do a fake prefiltering step to enable all-vs-all alignments and sequence identity calculations. As described in the user guide, we can create a shell function that performs this step:
fake_pref() { QDB="$1" TDB="$2" RES="$3" # create link to data file which contains a list of all targets that should be ↪ aligned ln -s "${TDB}.index" "${RES}" # create new index repeatedly pointing to same entry INDEX_SIZE="$(echo $(wc -c < "${TDB}.index"))" awk -v size=$INDEX_SIZE '{ print $1"\t0\t"size; }' "${QDB}.index" > "${RES}.index" # create dbtype (7) awk 'BEGIN { printf("%c%c%c%c",7,0,0,0); exit; }' > "${RES}.dbtype" }
Now we have all we need to calculate the sequence identity. We first create query and target databases, which are the same for us, followed by the fake prefiltering step. In the alignment step we prevent any sequences from being ignored with the -e inf option. Also, make sure to give the option –alignment-mode 3, in order to get the sequence identity. By default the sequence identity is calculated as the number of identical residues divided by the alignment length, however, the length of the shorter sequence for instance can be used intead with the option –seq-id-mode 1. Finally, the results are combined into results.m8 with the convertalis command.
mmseqs createdb QUERY.fasta targetDB --shuffle 0 mmseqs createdb QUERY.fasta queryDB --shuffle 0 fake_pref queryDB targetDB allvsallpref mmseqs align queryDB targetDB allvsallpref allvsallaln -e inf --alignment-mode 3 --seq-id-mode 1 mmseqs convertalis queryDB targetDB allvsallaln results.m8