Even within well-studied organisms, many genes lack useful functional annotations. One way to generate such functional information is to infer biological relationships between genes/proteins, using a network of gene coexpression data that includes functional annotations. However, the lack of trustworthy functional annotations can impede the validation of such networks. Hence, there is a need for a principled method to construct gene coexpression networks that capture biological information and are structurally stable even in the absence of functional information.
In my latest paper, we introduce the concept of signed distance correlation as a measure of dependency between two variables and apply it to generate gene coexpression networks. Distance correlation offers a more intuitive approach to network construction than commonly used methods such as Pearson correlation. We propose a framework to generate self-consistent networks using signed distance correlation purely from gene expression data, with no additional information. We analyse data from three different organisms to illustrate that networks generated with our method are more stable and capture more biological information than networks obtained from Pearson or Spearman correlations.
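To make the measure concrete, here is a minimal sketch in Python. It assumes that signed distance correlation is the sample distance correlation multiplied by the sign of the Pearson correlation; the paper has the exact definition. The function names are illustrative and are not taken from the paper's software.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation of two 1-D arrays (biased V-statistic form)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute differences between observations.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-centre each distance matrix.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (A * B).mean()    # squared distance covariance
    dvar2_x = (A * A).mean()  # squared distance variance of x
    dvar2_y = (B * B).mean()  # squared distance variance of y
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        return 0.0  # at least one variable is constant
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def signed_distance_correlation(x, y):
    """Distance correlation carrying the sign of the Pearson correlation (assumed definition)."""
    d = distance_correlation(x, y)
    if d == 0.0:
        return 0.0
    r = np.corrcoef(x, y)[0, 1]
    return float(np.sign(r)) * d


x = np.linspace(-1.0, 1.0, 21)
print(signed_distance_correlation(x, -2.0 * x + 1.0))  # decreasing linear relation
print(distance_correlation(x, x ** 2))                 # nonlinear relation, Pearson is ~0
```

The second call illustrates why distance correlation is attractive here: for a symmetric parabola the Pearson correlation is essentially zero, while the distance correlation is clearly positive.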
To evaluate the stability of the networks, we use COGENT. COGENT evaluates the internal consistency of a method to generate networks from a specific dataset by iteratively splitting the dataset into possibly overlapping sets, and constructing a network from each of them. The more similar the constructed networks are, the more stable the network construction method. We find that networks obtained using signed distance correlation are more stable than those obtained using Pearson correlation.
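The splitting idea can be sketched in a few lines of Python. This is not the COGENT API: the helper names, the fixed correlation threshold, the 50% sample overlap and the edge Jaccard index used as the similarity measure are all assumptions made for illustration, and Pearson correlation stands in for whichever measure is being tested.

```python
import numpy as np


def build_network(expr, threshold):
    """Boolean adjacency matrix from a genes x samples array, using |Pearson| >= threshold."""
    corr = np.corrcoef(expr)
    adj = np.abs(corr) >= threshold
    np.fill_diagonal(adj, False)  # no self-loops
    return adj


def edge_jaccard(adj_a, adj_b):
    """Jaccard index of the edge sets of two networks on the same genes."""
    iu = np.triu_indices_from(adj_a, k=1)
    a, b = adj_a[iu], adj_b[iu]
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty networks are identical
    return float(np.logical_and(a, b).sum() / union)


def split_stability(expr, threshold, n_splits=20, overlap=0.5, seed=0):
    """Mean edge similarity between networks built from overlapping sample subsets."""
    rng = np.random.default_rng(seed)
    n_samples = expr.shape[1]
    n_shared = int(round(overlap * n_samples))
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(n_samples)
        shared, rest = perm[:n_shared], perm[n_shared:]
        half = len(rest) // 2
        # Both subsets contain the shared samples plus one half of the rest.
        idx_a = np.concatenate([shared, rest[:half]])
        idx_b = np.concatenate([shared, rest[half:]])
        net_a = build_network(expr[:, idx_a], threshold)
        net_b = build_network(expr[:, idx_b], threshold)
        scores.append(edge_jaccard(net_a, net_b))
    return float(np.mean(scores))


# Toy data: five tightly co-expressed genes plus five pure-noise genes.
rng = np.random.default_rng(1)
base = rng.normal(size=(1, 60))
expr = np.vstack([base + 0.05 * rng.normal(size=(5, 60)), rng.normal(size=(5, 60))])
print(split_stability(expr, threshold=0.8))
```

A score close to 1 means the two half-networks agree on almost every edge; repeating this for each candidate correlation measure gives a like-for-like comparison of stability.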
Using COGENT, we also identify the edge density at which the stability reaches its maximum. We then choose the correlation threshold that yields a network with that edge density, and take the resulting network as the optimal network for the analysed dataset.
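As a rough illustration of this selection step, the Python sketch below converts a target edge density into a correlation threshold and picks the density with the highest stability score. The function names are hypothetical, and thresholding on absolute correlation values is an assumption; the paper describes the actual procedure.

```python
import numpy as np


def threshold_for_density(corr, density):
    """Smallest |correlation| kept when retaining the given fraction of gene pairs."""
    iu = np.triu_indices_from(corr, k=1)
    weights = np.sort(np.abs(corr[iu]))[::-1]  # strongest pairs first
    n_edges = int(round(density * weights.size))
    n_edges = min(max(n_edges, 1), weights.size)
    return float(weights[n_edges - 1])


def network_at_density(corr, density):
    """Boolean adjacency matrix with (approximately) the requested edge density."""
    adj = np.abs(corr) >= threshold_for_density(corr, density)
    np.fill_diagonal(adj, False)  # no self-loops
    return adj


def best_density(densities, stabilities):
    """Density at which the stability score is highest."""
    return float(densities[int(np.argmax(stabilities))])


# Toy symmetric correlation matrix for four genes (six gene pairs).
corr = np.eye(4)
values = {(0, 1): 0.9, (0, 2): 0.1, (0, 3): -0.2, (1, 2): 0.8, (1, 3): 0.3, (2, 3): -0.7}
for (i, j), v in values.items():
    corr[i, j] = corr[j, i] = v

density = best_density([0.25, 0.5, 0.75], [0.6, 0.9, 0.7])  # made-up stability scores
print(density, threshold_for_density(corr, density))
print(network_at_density(corr, density).astype(int))
```

Ties in the correlation values can make the realised density slightly higher than requested, which is why the docstring says "approximately".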
We also analyse the amount of biological information captured in the networks. To do so, we use the information contained in STRING, a database of protein-protein interactions (PPI). The results show that the networks obtained using signed distance correlation capture more biological information than those based on Pearson correlation.
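One simple way to picture such a comparison is the fraction of network edges that also appear in a reference set of interactions. The Python sketch below is only that: a hypothetical overlap score on made-up gene pairs, not the scoring used in the paper and not a call to any STRING API.

```python
def edge_overlap(network_edges, reference_edges):
    """Fraction of network edges that also occur in the reference edge set.

    Edges are treated as undirected, so ("a", "b") matches ("b", "a").
    """
    net = {frozenset(edge) for edge in network_edges}
    ref = {frozenset(edge) for edge in reference_edges}
    if not net:
        return 0.0
    return len(net & ref) / len(net)


# Made-up gene pairs, for illustration only.
network = [("geneA", "geneB"), ("geneB", "geneC"), ("geneC", "geneD")]
reference = [("geneB", "geneA"), ("geneC", "geneD"), ("geneX", "geneY")]
print(edge_overlap(network, reference))
```

Computing this score for networks built with different correlation measures, at the same edge density, gives a like-for-like view of how much known biology each one recovers.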
For a full description of the methodology and the results, please have a look at our paper: https://www.biorxiv.org/content/10.1101/2020.06.21.163543v1