Journal Club: Is our data biased, and should it be?

Jia, X., Lynch, A., Huang, Y. et al. Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis. Nature 573, 251–255 (2019). https://doi.org/10.1038/s41586-019-1540-5

Last week I presented the above paper at group meeting. While a little different from a typical OPIG journal club paper, the data we have access to almost certainly suffers from the same range of (possible) biases explored in this paper.

The authors examined the available data for amine-templated metal oxide synthesis, in particular the choice of reactants and reaction conditions. They found the data to be heavily biased; after exploring several possible sources of this bias, they concluded it was anthropogenic ("originating in human activity", i.e. a human bias).

To help isolate the source of the bias, the authors performed 548 randomly generated experiments. These demonstrated that the popularity of reactants and the choices of reaction conditions were uncorrelated with the success of the reaction. In fact, the randomly generated experiments better illustrated the range of parameter choices that are compatible with crystal formation.

Why does any of this matter? The authors then showed that machine-learning models trained on a smaller randomised reaction dataset outperformed models trained on larger human-selected reaction datasets, demonstrating the importance of identifying and addressing anthropogenic biases in scientific data.
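To see why biased sampling hurts a downstream model, here is a toy sketch (not the authors' actual setup; the parameter, success window, and "model" are all hypothetical). A reaction is assumed to succeed only within an unknown window of a single condition; "human" sampling clusters around a popular sub-range of known-good conditions, while randomised sampling covers the whole space. A simple interval model trained on fewer random experiments generalises better than one trained on more human-selected ones:

```python
import random

# Toy reaction space: a single "temperature" parameter in [0, 1].
# Hypothetical ground truth: crystals form only for t in [0.3, 0.8].
def reaction_succeeds(t):
    return 0.3 <= t <= 0.8

# "Anthropogenic" sampling: chemists favour a narrow, popular range
# of previously successful conditions.
def human_sample(rng):
    return rng.uniform(0.45, 0.55)

# Randomised sampling: draw conditions uniformly over the whole space.
def random_sample(rng):
    return rng.uniform(0.0, 1.0)

# A deliberately simple "model": predict success inside the interval
# spanned by the successful training examples.
def fit_interval(samples):
    hits = [t for t in samples if reaction_succeeds(t)]
    return (min(hits), max(hits))

def accuracy(interval, test_points):
    lo, hi = interval
    correct = sum((lo <= t <= hi) == reaction_succeeds(t)
                  for t in test_points)
    return correct / len(test_points)

rng = random.Random(0)
test = [i / 999 for i in range(1000)]  # uniform grid of test conditions

# 200 human-selected experiments vs only 50 randomised ones.
human_model = fit_interval([human_sample(rng) for _ in range(200)])
random_model = fit_interval([random_sample(rng) for _ in range(50)])

print(f"human-selected data (n=200): {accuracy(human_model, test):.2f}")
print(f"randomised data (n=50):      {accuracy(random_model, test):.2f}")
```

The human-trained model never sees the edges of the viable window, so it mislabels much of the truly reactive region; the smaller randomised set recovers the window almost exactly, mirroring the paper's result.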

An interesting paper, well worth a read!
