In cheminformatics, there are many possible ways to encode chemical data represented by small molecules and proteins, such as SMILES, fingerprints, chemical descriptors etc. Recently, utilising graph-based methods for machine learning have become more prominent. In this post, we will explore why representing molecules as graphs is a natural and suitable encoding.
What is a graph?
First we should discuss what we mean when we use the term graph. We are not referring to a plot of a function, but instead a graph, , consists of two sets: a set of vertices, , and edges, (I shall try to explain this using limited/no mathematical notation). Each vertex, or node, in is a point in the graph, while each element of consists of two elements from that are linked in some manner.
This representation can be made more complex by allowing vertices to have different types, different types of edges, edges that only point in one direction (i.e. does not mean ), and so on and so forth. In particular, one can label vertices (and edges) with any additional information you want (this will prove useful later).
Representing molecules as graphs
Molecules are simply atoms joined together by bonds. These atoms may well be of different types, and the bonds might also be different, but this sounds a lot like a graph where the atoms are the vertices and the bonds are the edges of our graph!
Indeed, Figure 1 provides a simple visual example of the graph representation of a molecule, toluene (or methlybenzene). Since all atoms are carbon, it is possible to encode this molecule fully with a single atom type and two bond types (either single and aromatic, or single and double if kekulized). However, alternate atom typing is possible, for example taking separate atom types for aromatic carbons and aliphatic carbons. This highlights both the flexibility of a graph encoding but also the need to choose a representation, some of which may be more or less useful for the specific task.
What does a computer see?
Computationally, a graph is represented as two matrices: one for vertices, , and either one or two for edges, . The matrix for the vertices is dimensional, and the adjacency matrix for the edges (capturing the connections between the vertices) is dimensional, where is the number of vertices, is the length of the label associated with the vertex, and is the number of edge types (often vertices are one-hot encoded and there are multiple edge types, hence . If not, then an adjacency matrix fully describes the graph).
Using the example in Figure 1, taking two vertex types (aromatic & aliphatic carbon) and two edge types (single & aromatic), we can represent the graph by the matrices shown in Figure 2.
Who cares?
We have described a simple, machine-readable format, that captures all basic features of a molecule, and is readily extendable to include any number of additional, user-defined vertex and edge features. Now that we have our molecular graph in matrix format, we can apply various graph-based machine learning methods on this. These are beyond the scope of this discussion, but could form the topic of a future blog post.