An activation function is used to introduce non-linearity into an artificial neural network. It allows us to model a class label or score that varies non-linearly with the independent variables. Non-linearity means that the output cannot be reproduced by a linear combination of the inputs; this lets the model learn complex mappings from the available data, making the network a universal approximator. In contrast, a model that uses only linear functions (i.e. no activation function) cannot make sense of complicated data such as speech or video, and no matter how many layers it has, it is equivalent to a single linear layer.
To allow backpropagation through the network, the chosen activation function must be differentiable; this property is required to compute the gradients used to tune the network weights. The non-linear functions discussed here are continuous and squash the input (normally zero-centered, although these values can move well beyond their original scale once multiplied by the weights) into a range such as $(0, 1)$ or $(-1, 1)$. In a neural network, some neurons may have linear activation functions, but they must be accompanied by neurons with non-linear activation functions elsewhere in the same network.
Although any non-linear function can be used as an activation function, in practice only a small number are used. The sections below describe various activation functions, each accompanied by a Python snippet that plots it using NumPy and Matplotlib:
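All of the snippets below assume that NumPy and Matplotlib have been imported and that an input array has been defined; a minimal setup sketch (the variable name x and the plotting range are assumptions) would be:
import numpy as np
import matplotlib.pyplot as plt
# Example input range for the plots; any reasonably wide interval works
x = np.linspace(-5, 5, 200)
Call plt.show() after each snippet to display the figure.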
Binary step
$$a^i_j = f(z^i_j) = \begin{cases} 0 & \text{if } z^i_j < 0 \\ 1 & \text{if } z^i_j \geq 0 \end{cases}$$
A binary step function is generally used in the Perceptron linear classifier. It thresholds the input: values greater than or equal to zero are mapped to $1$, and negative values are mapped to $0$.
This activation function is useful when the input pattern can only belong to one of two groups, that is, binary classification.
plt.step(x, np.where(x < 0, 0, 1))
$\tanh$
$$a^i_j = f(x^i_j) = \tanh(x^i_j)$$
The $\tanh$ non-linearity squashes the input into the range $(-1, 1)$ and produces a zero-centered output: large negative values are mapped to outputs near $-1$, large positive values to outputs near $1$, and inputs near zero to outputs near zero.
The gradients of $\tanh$ are steeper than those of the sigmoid, but it still suffers from the vanishing gradient problem. $\tanh$ is commonly described as a scaled version of the sigmoid, and the two are related by the identity $\tanh(x) = 2 \sigma(2x) - 1$.
An alternative equation for the $\tanh$ activation function is:
$$a^i_j = f(x^i_j) = \frac{2}{1+\exp(-2x^i_j)} - 1$$
plt.plot(x, np.tanh(x))
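The relationship to the sigmoid can be checked numerically; a quick sketch, reusing the x array from the setup above:
# Verify tanh(x) == 2 * sigmoid(2x) - 1 elementwise
sigmoid = lambda z: 1 / (1 + np.exp(-z))
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True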
ArcTan
$$a^i_j = f(x^i_j) = \tan^{-1}(x^i_j)$$
This activation function maps the input values into the range $(-\pi/2, \pi/2)$. Its derivative, $1/(1+x^2)$, decays only quadratically towards $0$ for large input values, whereas the derivative of the sigmoid decays exponentially towards $0$, which can cause problems during back-propagation.
Its graph is also slightly flatter than that of $\tanh$, so it has a better tendency to differentiate between similar input values.
plt.plot(x, np.arctan(x))
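The difference in how the two derivatives decay can be illustrated with a small sketch (using the closed-form derivatives $1/(1+x^2)$ for arctan and $\sigma(x)(1-\sigma(x))$ for the sigmoid):
# Compare derivative magnitudes far from the origin, e.g. at x = 10
z = 10.0
arctan_grad = 1 / (1 + z**2)      # ~1e-2: decays quadratically
sig = 1 / (1 + np.exp(-z))
sigmoid_grad = sig * (1 - sig)    # ~4.5e-5: decays exponentially
print(arctan_grad, sigmoid_grad)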
LeCun’s Tanh
$$a^i_j = f(x^i_j) = 1.7159 \tanh\left( \frac{2}{3} x^i_j\right)$$
This activation function was first introduced in Yann LeCun’s paper Efficient BackProp. The constants in the above equation have been chosen to keep the variance of the output close to $1$, because the gain of the sigmoid is roughly $1$ over its useful range.
plt.plot(x, 1.7159 * np.tanh(2/3 * x))
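One property often cited for these constants is that $f(\pm 1) = \pm 1$, since $1.7159 \approx 1/\tanh(2/3)$; this can be checked directly:
# f(1) should come out very close to 1 (and f(-1) to -1)
print(1.7159 * np.tanh(2/3 * 1.0))   # ~ 1.0
print(1.7159 * np.tanh(2/3 * -1.0))  # ~ -1.0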
Hard Tanh
$$a^i_j = f(x^i_j) = \max(-1, \min(1, x^i_j))$$
Compared to $\tanh$, the hard $\tanh$ activation function is computationally cheaper. It also saturates for magnitudes of $x$ greater than $1$.
plt.plot(x, np.maximum(-1, np.minimum(1, x)))
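Equivalently, the clamping can be written with NumPy's np.clip:
plt.plot(x, np.clip(x, -1, 1))  # same curve as the max/min formulation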
Sigmoid
$$a^i_j = f(x^i_j) = \frac{1}{1+\exp(-x^i_j)}$$
The sigmoid or logistic activation function maps the input values into the range $(0, 1)$, which can be interpreted as the probability of belonging to a class. It is therefore mostly used for binary classification.
However, like $\tanh$, it also suffers from the vanishing gradient problem. Also, its output is not zero-centered, which causes difficulties during the optimization step. It also has a low convergence rate.
plt.plot(x, 1 / (1 + np.exp(-x)))
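The vanishing gradient issue can be seen directly from the sigmoid's derivative, $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, which peaks at only $0.25$ and shrinks rapidly away from the origin; a short sketch:
# Plot the sigmoid's derivative: its maximum is 0.25 at x = 0
# and it approaches 0 quickly for large |x|
sig = 1 / (1 + np.exp(-x))
plt.plot(x, sig * (1 - sig))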
Bipolar Sigmoid
$$a^i_j = f(x^i_j) = \frac{1-\exp(-x^i_j)}{1+\exp(-x^i_j)}$$
The sigmoid function can be scaled to have any range of output values, depending upon the problem. When the range is from $-1$ to $1$, it is called a bipolar sigmoid.
plt.plot(x, (1 - np.exp(-x)) / (1 + np.exp(-x)))
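Algebraically, the bipolar sigmoid is the same curve as $\tanh(x/2)$, which can be confirmed numerically:
# (1 - exp(-x)) / (1 + exp(-x)) == tanh(x / 2) elementwise
print(np.allclose((1 - np.exp(-x)) / (1 + np.exp(-x)), np.tanh(x / 2)))  # True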
ReLU (Rectified Linear Unit)
$$a^i_j = f(x^i_j) = \max(0, x^i_j)$$
A rectified linear unit outputs $0$ if its input is less than or equal to $0$; otherwise, its output equals its input. This activation function is also considered more biologically plausible, and it has been widely used in convolutional neural networks. It is superior to the sigmoid and $\tanh$ activation functions in that it does not saturate for positive inputs and so does not suffer from the vanishing gradient problem there, allowing faster and more effective training of deep neural architectures.
However, the gradient of ReLU is exactly zero for negative inputs, so a neuron pushed into that regime (for example by a high learning rate) stops receiving updates and becomes inactive for all inputs, that is, it tends to die out. This reduces the model’s learning capacity and is commonly referred to as the “Dying ReLU” problem. ReLU is also non-differentiable at $0$, where a sub-gradient is used in practice.
plt.plot(x, np.maximum(0, x))
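The dying-ReLU behaviour follows from the shape of the gradient, which is $1$ for positive inputs and exactly $0$ for negative ones; a small sketch of the (sub)gradient:
# ReLU (sub)gradient: 1 where the unit is active, 0 where it is not.
# A neuron whose inputs stay negative receives zero gradient and stops learning.
relu_grad = np.where(x > 0, 1.0, 0.0)
plt.plot(x, relu_grad)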
Leaky ReLU
$$a^i_j = f(x^i_j) = \max(0.01 x^i_j, x^i_j)$$
Leaky ReLU allows a small, non-zero value (here $0.01\,x^i_j$) to flow when the input is less than or equal to $0$, so the gradient never becomes exactly zero and the “Dying ReLU” problem is avoided. It has been shown to give better results for some problems.
plt.plot(x, np.maximum(0.01 * x, x))
Smooth ReLU
$$a^i_j = f(x^i_j) = \log\big(1+\exp(x^i_j)\big)$$
Also known as the softplus function, this activation is a smooth approximation to ReLU: it is differentiable everywhere, its gradient is never exactly zero (so it avoids the “Dying ReLU” problem), and it saturates less abruptly.
plt.plot(x, np.log(1 + np.exp(x)))
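A useful property of softplus is that its derivative is exactly the sigmoid, $\frac{d}{dx}\log(1 + e^x) = \sigma(x)$; a quick numerical check with np.gradient:
# Compare the numerical derivative of softplus with the sigmoid
softplus = np.log(1 + np.exp(x))
sig = 1 / (1 + np.exp(-x))
print(np.allclose(np.gradient(softplus, x), sig, atol=1e-2))  # True (up to finite-difference error)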
Logit
$$a^i_j = f(x^i_j) = \log\left(\frac{x^i_j}{1 - x^i_j}\right)$$
This activation function performs the inverse operation of the sigmoid, that is, given probabilities in the range $(0, 1)$, it maps them onto the full range of real numbers. The value of the logit function approaches $+\infty$ as the probability approaches $1$ and $-\infty$ as it approaches $0$.
It is mostly used in binary classification models, where we want to transform probabilities into real-valued (log-odds) quantities.
p = np.linspace(0.01, 0.99, 100)  # the logit is only defined on (0, 1)
plt.plot(p, np.log(p / (1 - p)))
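Since the logit is the inverse of the sigmoid, composing the two should return the original input; a quick sketch of that check:
# logit(sigmoid(z)) should recover z
z = np.linspace(-5, 5, 100)
p_from_z = 1 / (1 + np.exp(-z))
print(np.allclose(np.log(p_from_z / (1 - p_from_z)), z))  # True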
Softmax
$$a^i_j = f(z^i_j) = \frac{\exp(z^i_j)}{\sum\limits_k \exp(z^i_k)}$$
The softmax function gives the probability of each class being the correct one. It produces values in the range $(0, 1)$ that always sum to $1$, emphasizing the largest input while suppressing values well below the maximum. It is widely used in multi-class logistic regression models.
plt.plot(x, np.exp(x) / np.sum(np.exp(x)))
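In practice the raw exponentials can overflow for large inputs; a common numerically stable sketch subtracts the maximum first, which leaves the result unchanged:
def softmax(z):
    # exp(z - c) / sum(exp(z - c)) equals softmax(z) for any constant c,
    # so shifting by the maximum avoids overflow without changing the output
    e = np.exp(z - np.max(z))
    return e / np.sum(e)
print(softmax(np.array([1000.0, 1001.0, 1002.0])))  # naive np.exp(1000) would overflow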
A Jupyter notebook containing all the above plots is hosted on GitHub.