Activation Functions: Comparison of Trends in Practice and Research For Deep Learning

Metadata
- Author: Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, Stephen Marshall
- Full Title: Activation Functions: Comparison of Trends in Practice and Research For Deep Learning
- Category: articles
- Summary: The paper surveys activation functions used in deep learning and compares trends in how they are applied in practice and studied in research.
- URL: https://arxiv.org/pdf/1811.03378.pdf
Highlights
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting (View Highlight)
New highlights added April 19, 2024 at 4:55 PM
- A common problem for most learning-based systems is how the gradient flows within the network: some gradients are sharp in specific directions and slow or even zero in other directions, which creates a problem for the optimal selection of learning parameters. (View Highlight)
- The Sigmoid is a non-linear AF used mostly in feedforward neural networks. It is a bounded differentiable real function, defined for real input values, with positive derivatives everywhere and some degree of smoothness [48]. (View Highlight)
- The main advantages of the sigmoid function are that it is easy to understand and that it is used mostly in shallow networks. (View Highlight) (see the sigmoid sketch after this list)
- The Sigmoid AF suffers major drawbacks (View Highlight)
- Other forms of AF, including the hyperbolic tangent function, were proposed to remedy some of the drawbacks suffered by the Sigmoid AF. (View Highlight)
- The tanh function became preferred over the sigmoid function because it gives better training performance for multi-layer neural networks [46], [49]. However, the tanh function could not solve the vanishing gradient problem suffered by the sigmoid function either. The main advantage of the function is that it produces zero-centred output, thereby aiding the back-propagation process. (View Highlight) (see the tanh sketch after this list)
- This makes the tanh function produce some dead neurons during computation. (View Highlight)
- This limitation of the tanh function spurred further research in activation functions to resolve the problem, and it birthed the rectified linear unit (ReLU) activation function. (View Highlight)
- The main difference between the Sigmoid and Softmax AF is that the Sigmoid is used in binary classification while the Softmax is used for multiclass classification tasks. (View Highlight) (see the softmax sketch after this list)
- Note: This is a specious claim. Sigmoid was the original activation function for all units in Rumelhart 1986. Softmax is used to convert logits to weights and comes from statistical mechanics. I would argue that Softmax is not actually an activation function at all.
- The main difference between the Softsign function and the tanh function is that the Softsign converges polynomially, unlike the tanh function, which converges exponentially. (View Highlight) (see the softsign sketch after this list)
- The ReLU represents a nearly linear function and therefore preserves the properties of linear models that made them easy to optimize with gradient-descent methods [5]. (View Highlight)
- The main advantage of using rectified linear units is that they guarantee faster computation, since they do not require computing exponentials or divisions, so the overall speed of computation is enhanced [58]. Another property of the ReLU is that it introduces sparsity in the hidden units, as it squashes values into the range from zero to the maximum. However, the ReLU has the limitation that it overfits more easily than the sigmoid function, although the dropout technique has been adopted to reduce the effect of overfitting in ReLUs, and the rectified networks improved the performance of deep neural networks [20]. (View Highlight) (see the ReLU sketch after this list)
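
The sketches below correspond to the highlighted activation functions. First, a minimal sigmoid sketch, assuming NumPy; the helper names and example inputs are illustrative, not from the paper. It shows the bounded (0, 1) output and everywhere-positive derivative the highlight describes, and how that derivative shrinks toward zero for large inputs, which is the saturation behind the vanishing-gradient drawback.

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)): bounded in (0, 1), smooth, strictly increasing
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative sigmoid(x) * (1 - sigmoid(x)) is positive everywhere but
    # approaches 0 for large |x|, which is the vanishing-gradient drawback
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))  # derivative is ~0 at the tails, 0.25 at x = 0
```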
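
A companion tanh sketch under the same assumptions (NumPy, illustrative inputs). It shows the zero-centred output that aids back-propagation, and that the gradient still saturates at the tails, so the vanishing-gradient problem is not solved.

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; it still vanishes for large |x|,
    # so tanh does not solve the vanishing-gradient problem either
    return 1.0 - np.tanh(x) ** 2

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(x))    # outputs in (-1, 1), centred around 0, unlike the sigmoid's (0, 1)
print(tanh_grad(x))  # near-zero gradients at the saturated ends
```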
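
A softmax sketch, again assuming NumPy and using illustrative logits rather than anything from the paper. It shows softmax turning a logit vector into a probability distribution over classes, and that with two classes it reduces to the sigmoid of the logit difference, which is the relationship behind the binary-vs-multiclass contrast in the highlight and the pushback in the note.

```python
import numpy as np

def softmax(logits):
    # shift by the max for numerical stability, then normalise the exponentials
    # so the outputs form a probability distribution over the classes
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())  # probabilities that sum to 1

# with two classes, softmax reduces to the sigmoid of the logit difference
two = softmax(np.array([2.0, 0.0]))
print(two[0], 1.0 / (1.0 + np.exp(-2.0)))  # same value
```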
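
A softsign sketch, assuming NumPy and illustrative inputs, contrasting the rates of convergence in the highlight: softsign's gap to its asymptote shrinks polynomially, roughly like 1/(1+x), while tanh's gap shrinks exponentially.

```python
import numpy as np

def softsign(x):
    # softsign(x) = x / (1 + |x|); approaches its +/-1 asymptotes polynomially
    return x / (1.0 + np.abs(x))

x = np.array([1.0, 5.0, 20.0, 100.0])
print(1.0 - softsign(x))  # gap to 1 shrinks like 1/(1+x)
print(1.0 - np.tanh(x))   # gap to 1 shrinks exponentially, far faster
```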
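
Finally, a ReLU sketch, assuming NumPy; the inverted-dropout lines are a generic illustration of the technique mentioned in the highlight, not the specific setup of [20] or [58]. It shows the piecewise-linear form that avoids exponentials and divisions, the sparsity from zeroing negative inputs, and how dropout randomly silences units to curb overfitting.

```python
import numpy as np

def relu(x):
    # relu(x) = max(0, x): piecewise linear, so no exponentials or divisions,
    # and gradients pass through unchanged for positive inputs
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
out = relu(x)
print(out)                  # negative inputs are zeroed out
print(np.mean(out == 0.0))  # fraction of inactive (sparse) units

# dropout is often combined with ReLU to curb overfitting: randomly zero units
rng = np.random.default_rng(0)
keep = rng.random(out.shape) > 0.5  # keep each unit with probability 0.5
print(out * keep / 0.5)             # inverted-dropout scaling at training time
```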