The Rectified Linear Unit (ReLU) activation function was first used in 1975, but its use exploded after Nair & Hinton employed it in their 2010 paper on Restricted Boltzmann Machines. ReLU and its derivative are fast to compute, and it has dominated deep neural networks for years. The main problem with the activation function is the so-called dead ReLU problem: a neuron whose pre-activation is consistently negative receives a gradient of zero and stops learning. To rectify this (har har), modified versions have been proposed, including leaky ReLU, GELU and SiLU, wherein the gradient for x < 0 is not always zero.
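As a minimal NumPy sketch (not taken from any of the papers mentioned), the snippet below compares the gradient of ReLU with leaky ReLU and SiLU on mostly negative inputs; the function names and the 0.01 leak slope are illustrative choices, not a prescribed implementation.

```python
import numpy as np

def relu_grad(x):
    # ReLU = max(0, x); its gradient is exactly 0 for every x < 0
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    # Leaky ReLU keeps a small slope alpha for x < 0, so the gradient never vanishes
    return np.where(x > 0, 1.0, alpha)

def silu_grad(x):
    # SiLU (swish) = x * sigmoid(x); smooth, with nonzero gradient for most negative x
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 + x * (1.0 - s))

x = np.linspace(-3, 1, 5)                 # mostly negative inputs
print("relu grad: ", relu_grad(x))        # all zeros on the negative side
print("leaky grad:", leaky_relu_grad(x))  # small but nonzero
print("silu grad: ", silu_grad(x))        # smooth and nonzero
```

A neuron stuck in the flat region of plain ReLU gets zero gradient for every input, which is exactly the failure mode the variants are designed to avoid.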
A 2020 paper by Naitzat et al., which builds upon ideas set out in a 2014 Google Brain blog post, seeks to explain why ReLU and its variants generally seem to work better for classification problems than sigmoidal functions such as tanh and sigmoid.