In bioinformatics, we often deal with distance matrices, such as those quantifying:
- Pairwise similarity between sequences
- Structural similarity between proteins (e.g. by RMSD)
- etc.
The next step is to study the groupings within the distance matrix using an appropriate clustering scheme. The obvious issue with most clustering methods is that you need to specify the number of clusters beforehand (as in K-Means). Assuming that you do not know very much about the data and plotting it is not an option, you might try a non-parametric hierarchical approach such as linkage clustering. The main difference between the two approaches is that with linkage you specify the maximal distance allowed within each cluster, and the number of clusters is adjusted accordingly. In contrast, with K-Means you have no such distance guarantee within each cluster, since the number of groups is predefined.
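To make that contrast concrete, here is a minimal sketch (not part of the script described below) that clusters the same random points both ways; scikit-learn's KMeans and scipy's fcluster are my own assumptions here, and the cluster count of 3 and distance cutoff of 4.0 are purely illustrative:

import numpy
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

points = numpy.random.rand(20, 2) * 10                       #20 random 2D points in (0,10)

#K-Means: the number of clusters is fixed up front; within-cluster distances are not bounded
kmeans_labels = KMeans(n_clusters=3).fit_predict(points)

#Linkage: the distance cutoff is fixed up front; the number of clusters follows from it
z = linkage(pdist(points, 'euclidean'), method='average')
linkage_labels = fcluster(z, t=4.0, criterion='distance')

print "K-Means labels: ", kmeans_labels
print "Linkage labels: ", linkage_labels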
Here I will provide a short piece of Python code that employs the hcluster library to perform linkage clustering.
Installation
Download hcluster, unpack it and inside the unpacked folder type:
python setup.py install
Alternatively, if you are not an admin on your machine, type:
python setup.py install --user
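If you have pip available, installing from PyPI may also work (assuming the package is still published there):
pip install --user hcluster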
Example Code
The example code below generates a random set of 2D points within (0,10) and clusters them according to a user-supplied Euclidean distance cutoff.
import matplotlib.pyplot as plt
from matplotlib.pyplot import show
from hcluster import pdist, linkage, dendrogram
import numpy
import random
import sys

#Input: link_mat = linkage matrix, thresh = the distance threshold to split at, n = number of original points
def split_into_clusters(link_mat, thresh, n):
    c_ts = n
    clusters = {}
    for row in link_mat:
        if row[2] < thresh:
            n_1 = int(row[0])
            n_2 = int(row[1])
            if n_1 >= n:
                link_1 = clusters[n_1]
                del(clusters[n_1])
            else:
                link_1 = [n_1]
            if n_2 >= n:
                link_2 = clusters[n_2]
                del(clusters[n_2])
            else:
                link_2 = [n_2]
            link_1.extend(link_2)
            clusters[c_ts] = link_1
            c_ts += 1
        else:
            return clusters

#Size of the point matrix
rows = 10     #number of points
columns = 2   #number of dimensions - 2=2D, 3=3D etc.
samples = numpy.empty((rows, columns))

#Initialize a random points matrix with values between 0 and 10 (all points in the upper right (0,10) quadrant)
for i in xrange(0, rows):
    for j in xrange(0, columns):
        samples[i][j] = random.randint(0, 10)

#Show the points we have randomly generated
print "Samples:\n ", samples

#Create the distance matrix for the array of sample vectors.
#Look up 'squareform' if you want to submit your own distance matrices, as they need to be translated into reduced (condensed) matrices
dist_mat = pdist(samples, 'euclidean')

#Perform linkage clustering - here we use 'average', but other options are possible, which you can explore here:
#http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.cluster.hierarchy.linkage.html
z = linkage(dist_mat, method='average', metric='euclidean')

#Specify a cutoff that will define the clustering - command line argument:
#python ClusterExample.py 3.0
cutoff = float(sys.argv[1])
clustering = split_into_clusters(z, cutoff, rows)
if clustering is None:
    print "No clusters! Most likely your distance cutoff is too high - everything falls into the same bag!"
    sys.exit()

#Plot the potential singletons - in magenta
for i in xrange(0, rows):
    plt.plot(samples[i][0], samples[i][1], marker='o', color='m', ls='')
    plt.text(samples[i][0], samples[i][1], str(i), fontsize=12)

#If there are more clusters than there are colors here, the code will fail!
colors = ['b', 'r', 'g', 'y', 'c', 'k']
cluster_num = 0
for cluster in clustering:
    print "Cluster: ", cluster_num
    cluster_num += 1
    for i in clustering[cluster]:
        print "-->", i
        plt.plot(samples[i][0], samples[i][1], marker='o', color=colors[cluster_num], ls='')

#Set the axis limits
plt.xlim([-1, 12])
plt.ylim([-1, 12])
show()

#Alternatively, plot the dendrogram to see where your distance cutoff performed the tree cut
dendrogram(z)
show()
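As the comment in the code mentions, if you already have your own square (n x n) distance matrix rather than raw points, it needs to be converted to the condensed form that linkage expects. A small sketch of that, using a made-up 4x4 matrix as a stand-in for your own data (and assuming squareform is available from hcluster, as the comment suggests):

import numpy
from hcluster import squareform, linkage

#Hypothetical 4x4 symmetric distance matrix with a zero diagonal, standing in for your own distances
my_dist = numpy.array([[0.0, 1.0, 4.0, 5.0],
                       [1.0, 0.0, 3.0, 6.0],
                       [4.0, 3.0, 0.0, 2.0],
                       [5.0, 6.0, 2.0, 0.0]])

condensed = squareform(my_dist)             #square (redundant) form -> condensed (reduced) form
z = linkage(condensed, method='average')    #linkage then works exactly as in the example above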
When I ran the code above (python [whatever you call the script].py 2.0), this is what I got (colors correspond to clusters, with magenta marking singletons):
And there is a dendrogram command at the bottom of the script that shows what the clustering has actually done and where the tree was cut according to your specified cutoff (colors here DO NOT correspond to the clusters above):
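If you want to see the cutoff directly on the dendrogram, a small addition along the following lines should work (this is my own tweak, not part of the original script; it reuses the z, cutoff and plt names from the example above):

dendrogram(z)
plt.axhline(y=cutoff, color='k', ls='--')   #horizontal line at the cut height
show()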
The hcluster library mirrors the hierarchical clustering functionality found in scipy (scipy.cluster.hierarchy), which offers very useful methods for data analysis. You can modify the above code to use a variety of other hierarchical clustering methods, which you can explore further in the scipy documentation linked in the code above.
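For reference, the hcluster names used above map directly onto scipy, so a sketch of the equivalent imports (plus fcluster, which with criterion='distance' does roughly what the hand-rolled split_into_clusters helper does) would be:

from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

#e.g. labels = fcluster(z, t=cutoff, criterion='distance') assigns a cluster id to every point at the chosen cutoff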