Pairwise sequence identity and Tanimoto similarity in PDBbind

In this post I will cover how to calculate sequence identity and Tanimoto similarity between any pairs of complexes in PDBbind 2020. I used RDKit in python for Tanimoto similarity and the MMseqs2 software for sequence identity calculations.

A few weeks back I wanted to cluster the protein-ligand complexes in PDBbind 2020, but to achieve this I first needed to precompute the sequence identity between all pairs sequences in PDBbind, and Tanimoto similarity between all pairs of ligands. PDBbind 2020 includes 19.443 complexes but there are much fewer distinct ligands and proteins than that. However, I kept things simple and calculated the similarities for all 19.443*19.443 pairs. Calculating the Tanimoto similarity is relatively easy thanks to the BulkTanimotoSimilarity function in RDKit. The following code should do the trick:

from rdkit.Chem import AllChem, MolFromMol2File
from rdkit.DataStructs import BulkTanimotoSimilarity
import numpy as np
import os

fps = []
for pdb in pdbs:
    mol = MolFromMol2File(os.path.join('data', pdb, f'{pdb}_ligand.mol2'))
    fps.append(AllChem.GetMorganFingerprint(mol, 3))

sims = []
for i in range(len(fps)):
    sims.append(BulkTanimotoSimilarity(fps[i],fps))

arr = np.array(sims)
np.savez_compressed('data/tanimoto_similarity.npz', arr)

Sequence identity calculations in python with Biopandas turned out to be too slow for this amount of data so I used the ultra fast MMseqs2. The first step to running MMseqs2 is to create a .fasta file of all the sequences, which I call QUERY.fasta. This is what the first few lines look like:

>3zzf_A
NGFSATRSTVIQLLNNISTKREVEQYLKYFTSVSQQQFAVIKVGGAIISDNLHELASCLA
FLYHVGLYPIVLHGTGPQVNGRLEAQGIEPDYIDGIRITDEHTMAVVRKCFLEQNLKLVT
ALEQLGVRARPITSGVFTADYLDKDKYKLVGNIKSVTKEPIEASIKAGALPILTSLAETA
SGQMLNVNADVAAGELARVFEPLKIVYLNEKGGIINGSTGEKISMINLDEEYDDLMKQSW
VKYGTKLKIREIKELLDYLPRSSSVAIINVQDLQKELFTDSGAGTMIRRGY
>3gww_A
REHWATRLGLILAMAGNAVGLGNFLRFPVQAAENGGGAFMIPYIIAFLLVGIPLMWIEWA
MGRYGGAQGHGTTPAIFYLLWRNRFAKILGVFGLWIPLVVAIYYVYIESWTLGFAIKFLV
GLVPEPPPDSILRPFKEFLYSYIGVPKGDEPILKPSLFAYIVFLITMFINVSILIRGISK
GIERFAKIAMPTLFILAVFLVIRVFLLETPNGTAADGLNFLWTPDFEKLKDPGVWIAAVG
QIFFTLSLGFGAIITYASYVRKDQDIVLSGLTAATLNEKAEVILGGSISIPAAVAFFGVA
NAVAIAKAGAFNLGFITLPAIFSQTAGGTFLGFLWFFLLFFAGLTSSIAIMQPMIAFLED
ELKLSRKHAVLWTAAIVFFSAHLVMFLNKSLDEMDFWAGTIGVVFFGLTELIIFFWIFGA
DKAWEEINRGGIIKVPRIYYYVMRYITPAFLAVLLVVWAREYIPKIMEETHWTVWITRFY
IIGLFLFLTFLVFLAERRRNH
>1w8l_A
MVNPTVFFDIAVDGEPLGRVSFELFADKVPKTAENFRALSTGEKGFGYKGSCFHRIIPGF
MCQGGDFTRHNGTGGKSIYGEKFEDENFILKHTGPGILSMANAGPNTNGSQFFICTAKTE
WLDGKHVVFGKVKEGMNIVEAMERFGSRNGKTSKKITIADCGQLE

Most of the of the speed gains that MMseqs2 achieves in many-against-many sequence searching comes from its prefiltering stage, where it removes most target sequences for a given query that don’t fulfil certain requirements. Since we want to calculate the sequence identity for every pair, we have to do a fake prefiltering step to enable all-vs-all alignments and sequence identity calculations. As described in the user guide, we can create a shell function that performs this step:

fake_pref() {
QDB="$1"
TDB="$2"
RES="$3"
# create link to data file which contains a list of all targets that should be
↪ aligned
ln -s "${TDB}.index" "${RES}"
# create new index repeatedly pointing to same entry
INDEX_SIZE="$(echo $(wc -c &lt; "${TDB}.index"))"
awk -v size=$INDEX_SIZE '{ print $1"\t0\t"size; }' "${QDB}.index" &gt; "${RES}.index"
# create dbtype (7)
awk 'BEGIN { printf("%c%c%c%c",7,0,0,0); exit; }' &gt; "${RES}.dbtype"
}

Now we have all we need to calculate the sequence identity. We first create query and target databases, which are the same for us, followed by the fake prefiltering step. In the alignment step we prevent any sequences from being ignored with the -e inf option. Also, make sure to give the option –alignment-mode 3, in order to get the sequence identity. By default the sequence identity is calculated as the number of identical residues divided by the alignment length, however, the length of the shorter sequence for instance can be used intead with the option –seq-id-mode 1. Finally, the results are combined into results.m8 with the convertalis command.

mmseqs createdb QUERY.fasta targetDB --shuffle 0 
mmseqs createdb QUERY.fasta queryDB --shuffle 0
fake_pref queryDB targetDB allvsallpref
mmseqs align queryDB targetDB allvsallpref allvsallaln -e inf --alignment-mode 3 --seq-id-mode 1
mmseqs convertalis queryDB targetDB allvsallaln results.m8

Author

Ísak Valsson

View all posts

Oxford Protein Informatics Group

or "OPIG" to friends

Pairwise sequence identity and Tanimoto similarity in PDBbind

Author