Activation Functions: Comparison of Trends in Practice and Research For Deep Learning

Metadata
- Author: Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, Stephen Marshall
- Full Title: Activation Functions: Comparison of Trends in Practice and Research For Deep Learning
- Category: articles
- Summary: The paper surveys activation functions used in deep learning and compares trends in how they are applied in practice and studied in research.
- URL: https://arxiv.org/pdf/1811.03378.pdf
Highlights
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting (View Highlight)
New highlights added April 19, 2024 at 4:55 PM
- A common problem for most learning-based systems is how the gradient flows within the network: some gradients are sharp in specific directions and slow or even zero in other directions, which creates a problem for the optimal selection of learning parameters. (View Highlight)
- The Sigmoid is a non-linear AF used mostly in feedforward neural networks. It is a bounded differentiable real function, defined for real input values, with positive derivatives everywhere and some degree of smoothness [48]. (View Highlight)
- The main advantages of the sigmoid function are that it is easy to understand and that it is used mostly in shallow networks. (View Highlight) (see the sigmoid sketch after this list)
- The Sigmoid AF suffers major drawbacks (View Highlight)
- Other forms of AF, including the hyperbolic tangent function, were proposed to remedy some of the drawbacks suffered by the Sigmoid AF. (View Highlight)
- The tanh function became preferred over the sigmoid function because it gives better training performance for multi-layer neural networks [46], [49]. However, the tanh function could not solve the vanishing gradient problem suffered by the sigmoid function either. The main advantage of the function is that it produces zero-centred output, thereby aiding the back-propagation process. (View Highlight) (see the tanh sketch after this list)
- This makes the tanh function produce some dead neurons during computation. (View Highlight)
- This limitation of the tanh function spurred further research in activation functions to resolve the problem, and it birthed the rectified linear unit (ReLU) activation function. (View Highlight)
- The main difference between the Sigmoid and Softmax AF is that the Sigmoid is used in binary classification while the Softmax is used for multiclass classification tasks. (View Highlight) (see the softmax sketch after this list)
- Note: This is a specious claim. Sigmoid was the original activation function for all units in Rumelhart 1986. Softmax is used to convert logits to weights and comes from statistical mechanics. I would argue that Softmax is not actually an activation function at all.
- The main difference between the Softsign function and the tanh function is that the Softsign converges polynomially, unlike the tanh function, which converges exponentially. (View Highlight) (see the softsign sketch after this list)
- The ReLU represents a nearly linear function and therefore preserves the properties of linear models that made them easy to optimize with gradient-descent methods [5]. (View Highlight)
- The main advantage of using rectified linear units is that they guarantee faster computation, since they do not require computing exponentials or divisions, so the overall speed of computation is enhanced [58]. Another property of the ReLU is that it introduces sparsity in the hidden units, as it squashes values into the range from zero to the maximum. However, the ReLU has the limitation that it overfits more easily than the sigmoid function, although the dropout technique has been adopted to reduce the effect of overfitting in ReLUs, and the rectified networks improved the performance of deep neural networks [20]. (View Highlight) (see the ReLU sketch after this list)
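
The sketches below correspond to the highlighted activation functions. First, a minimal sigmoid sketch, assuming NumPy; the helper names and example inputs are illustrative, not from the paper. It shows the bounded (0, 1) output and everywhere-positive derivative the highlight describes, and how that derivative shrinks toward zero for large inputs, which is the saturation behind the vanishing-gradient drawback.

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)): bounded in (0, 1), smooth, strictly increasing
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative sigmoid(x) * (1 - sigmoid(x)) is positive everywhere but
    # approaches 0 for large |x|, which is the vanishing-gradient drawback
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))  # derivative is ~0 at the tails, 0.25 at x = 0
```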
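
A companion tanh sketch under the same assumptions (NumPy, illustrative inputs). It shows the zero-centred output that aids back-propagation, and that the gradient still saturates at the tails, so the vanishing-gradient problem is not solved.

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; it still vanishes for large |x|,
    # so tanh does not solve the vanishing-gradient problem either
    return 1.0 - np.tanh(x) ** 2

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(x))    # outputs in (-1, 1), centred around 0, unlike the sigmoid's (0, 1)
print(tanh_grad(x))  # near-zero gradients at the saturated ends
```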
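
A softmax sketch, again assuming NumPy and using illustrative logits rather than anything from the paper. It shows softmax turning a logit vector into a probability distribution over classes, and that with two classes it reduces to the sigmoid of the logit difference, which is the relationship behind the binary-vs-multiclass contrast in the highlight and the pushback in the note.

```python
import numpy as np

def softmax(logits):
    # shift by the max for numerical stability, then normalise the exponentials
    # so the outputs form a probability distribution over the classes
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())  # probabilities that sum to 1

# with two classes, softmax reduces to the sigmoid of the logit difference
two = softmax(np.array([2.0, 0.0]))
print(two[0], 1.0 / (1.0 + np.exp(-2.0)))  # same value
```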
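
A softsign sketch, assuming NumPy and illustrative inputs, contrasting the rates of convergence in the highlight: softsign's gap to its asymptote shrinks polynomially, roughly like 1/(1+x), while tanh's gap shrinks exponentially.

```python
import numpy as np

def softsign(x):
    # softsign(x) = x / (1 + |x|); approaches its +/-1 asymptotes polynomially
    return x / (1.0 + np.abs(x))

x = np.array([1.0, 5.0, 20.0, 100.0])
print(1.0 - softsign(x))  # gap to 1 shrinks like 1/(1+x)
print(1.0 - np.tanh(x))   # gap to 1 shrinks exponentially, far faster
```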
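
Finally, a ReLU sketch, assuming NumPy; the inverted-dropout lines are a generic illustration of the technique mentioned in the highlight, not the specific setup of [20] or [58]. It shows the piecewise-linear form that avoids exponentials and divisions, the sparsity from zeroing negative inputs, and how dropout randomly silences units to curb overfitting.

```python
import numpy as np

def relu(x):
    # relu(x) = max(0, x): piecewise linear, so no exponentials or divisions,
    # and gradients pass through unchanged for positive inputs
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
out = relu(x)
print(out)                  # negative inputs are zeroed out
print(np.mean(out == 0.0))  # fraction of inactive (sparse) units

# dropout is often combined with ReLU to curb overfitting: randomly zero units
rng = np.random.default_rng(0)
keep = rng.random(out.shape) > 0.5  # keep each unit with probability 0.5
print(out * keep / 0.5)             # inverted-dropout scaling at training time
```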