An overview of activation functions used in neural networks

An activation function introduces non-linearity into a network, allowing us to model a class label or score that varies non-linearly with the independent variables. Non-linear means the output cannot be reproduced by a linear combination of the inputs. This lets the model learn complex mappings from the available data, making the network a universal approximator. By contrast, a model that uses only linear functions (i.e. no activation function) cannot make sense of complicated data such as speech or video, and is no more expressive than a single layer.
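
As a quick illustration of the last point, stacking linear layers without an activation in between collapses into a single linear layer. Below is a minimal NumPy sketch; the weight shapes and random values are arbitrary, chosen only for the demonstration:

import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
v = rng.normal(size=3)

two_layers = W2 @ (W1 @ v)        # two "layers" with no activation in between
one_layer = (W2 @ W1) @ v         # exactly one linear layer with weights W2 @ W1
print(np.allclose(two_layers, one_layer))  # True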

Another important property of an activation function is that it should be differentiable, since we need its gradient when backpropagating through the network to tune the weights. Most non-linear activation functions are continuous and squash their input (which is normally zero-centered, although the values can move well beyond their original scale once multiplied by the weights) into a range such as \((0, 1)\) or \((-1, 1)\). In a neural network, some neurons may use linear activation functions, but they must be accompanied by neurons with non-linear activations elsewhere in the same network.

Although any non-linear function can be used as an activation function, in practice, only a small fraction of these are used. Listed below are some commonly used activation functions along with a Python snippet to create their plot using NumPy and Matplotlib:
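
All the snippets below assume the following setup; the input range is an arbitrary choice made here for plotting:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 1000)   # input values used by the plots below (range chosen arbitrarily)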

Binary step

$$a^i_j = f(x^i_j) = \begin{cases} 0 \hspace{1em} \text{if} \hspace{0.3em} x^i_j < 0 \\ 1 \hspace{1em} \text{if} \hspace{0.3em} x^i_j \geq 0 \end{cases}$$

The binary step function is typically used in the perceptron linear classifier. It thresholds the input: values below zero map to \(0\), and values of zero or above map to \(1\).

This activation function is useful when the input pattern can only belong to one of two groups i.e. binary classification.

plt.step(x, np.where(x < 0, 0, 1))

[Plot: binary step]
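
For example, a perceptron applies this step function to a weighted sum of its inputs. The weights and bias below are made-up values chosen only to sketch the idea (on binary inputs they behave like an AND gate):

def perceptron(inputs, weights, bias):
    z = np.dot(weights, inputs) + bias   # weighted sum of the inputs
    return 1 if z >= 0 else 0            # binary step

print(perceptron([1, 1], weights=[0.5, 0.5], bias=-0.7))   # 1
print(perceptron([1, 0], weights=[0.5, 0.5], bias=-0.7))   # 0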

\(\tanh\)

$$a^i_j = f(x^i_j) = \tanh(x^i_j)$$

The \(\tanh\) non-linearity squashes the input into the range \((-1, 1)\) and produces a zero-centered output: large negative values map to strongly negative outputs, while inputs near zero map to outputs near zero.

The gradients of \(\tanh\) are steeper than those of the sigmoid, but it still suffers from the vanishing gradient problem. \(\tanh\) is often described as a scaled version of the sigmoid, since the identity \(\tanh(x) = 2 \sigma(2x) - 1\) holds.

An alternative equation for the \(\tanh\) activation function is:

$$a^i_j = f(x^i_j) = \frac{2}{1+\exp(-2x^i_j)} - 1$$
plt.plot(x, np.tanh(x))

[Plot: tanh]
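
The relationship to the sigmoid can be checked numerically, using the x defined in the setup above:

print(np.allclose(np.tanh(x), 2 / (1 + np.exp(-2 * x)) - 1))   # True: tanh(x) = 2*sigmoid(2x) - 1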

ArcTan

$$a^i_j = f(x^i_j) = \tan^{-1}(x^i_j)$$

This activation function maps the input into the range \((-\pi/2, \pi/2)\). Its derivative converges quadratically to zero for large inputs, whereas the derivative of the sigmoid converges exponentially to zero, which can cause problems during back-propagation.

Its graph is slightly flatter than \(\tanh\), so it has a better tendency to differentiate between similar input values.

plt.plot(x, np.arctan(x))

[Plot: arctan]
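
The difference in how quickly the derivatives vanish can be seen at a single large input, say \(x = 10\) (an arbitrary choice). The derivative of arctan is \(1/(1+x^2)\), while the sigmoid's is \(\sigma(x)(1-\sigma(x))\):

z = 10.0
s = 1 / (1 + np.exp(-z))
print(1 / (1 + z**2))   # ~1e-2: arctan's derivative decays only quadratically
print(s * (1 - s))      # ~4.5e-5: the sigmoid's derivative decays exponentially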

LeCun’s Tanh

$$a^i_j = f(x^i_j) = 1.7159 \tanh\!\left( \frac{2}{3} x^i_j\right)$$

This activation function was first introduced in Yann LeCun's paper Efficient BackProp. The constants in the above equation have been chosen to keep the variance of the output close to \(1\), because the gain of the sigmoid is roughly \(1\) over its useful range.

plt.plot(x, 1.7159 * np.tanh(2/3 * x))

[Plot: LeCun's tanh]
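
In particular, these constants make \(f(\pm 1) = \pm 1\), which is easy to verify:

print(1.7159 * np.tanh(2 / 3))   # ≈ 1.0, so f(1) ≈ 1 (and by symmetry f(-1) ≈ -1)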

Hard Tanh

$$a^i_j = f(x^i_j) = \max(-1, \min(1, x^i_j))$$

Compared to \(\tanh\), the hard \(\tanh\) activation function is computationally cheaper. However, it saturates completely for magnitudes of \(x\) greater than \(1\).

plt.plot(x, np.maximum(-1, np.minimum(1, x)))

[Plot: hard tanh]
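
Equivalently, hard \(\tanh\) just clips its input to \([-1, 1]\), so NumPy's clip gives the same values:

print(np.allclose(np.clip(x, -1, 1), np.maximum(-1, np.minimum(1, x))))   # True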

Sigmoid

$$a^i_j = f(x^i_j) = \frac{1}{1+\exp(-x^i_j)}$$

The sigmoid or logistic activation function maps the input into the range \((0, 1)\), and its output can be interpreted as the probability of belonging to a class. It is therefore mostly used for binary classification.

However, like \(\tanh\), it suffers from the vanishing gradient problem. Its output is also not zero-centered, which makes optimization harder, and networks using it tend to converge slowly.

plt.plot(x, 1 / (1 + np.exp(-x)))

[Plot: sigmoid]
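
The vanishing gradient is visible directly in the derivative, \(\sigma'(x) = \sigma(x)(1-\sigma(x))\), which peaks at \(0.25\) and is essentially zero for large \(|x|\):

s = 1 / (1 + np.exp(-x))
d = s * (1 - s)                 # derivative of the sigmoid
print(d.max())                  # ≈ 0.25, the largest gradient, at x = 0
print(d[np.abs(x) > 9].max())   # ~1e-4, the gradient has almost vanished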

Bipolar Sigmoid

$$a^i_j = f(x^i_j) = \frac{1-\exp(-x^i_j)}{1+\exp(-x^i_j)}$$

The sigmoid function can be scaled to have any range of output values, depending upon the problem. When the range is from \(-1\) to \(1\), it is called a bipolar sigmoid.

plt.plot(x, (1 - np.exp(-x)) / (1 + np.exp(-x)))

[Plot: bipolar sigmoid]
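
A little algebra shows that the bipolar sigmoid is exactly \(\tanh(x/2)\), which can be confirmed numerically:

print(np.allclose((1 - np.exp(-x)) / (1 + np.exp(-x)), np.tanh(x / 2)))   # True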

ReLU (Rectified Linear Unit)

$$a^i_j = f(x^i_j) = \max(0, x^i_j)$$

A rectified linear unit outputs \(0\) if its input is less than or equal to \(0\); otherwise, its output equals its input. It is also considered more biologically plausible, and it has been widely used in convolutional neural networks. It is superior to the sigmoid and \(\tanh\) activation functions in that its gradient does not vanish for positive inputs, which allows faster and more effective training of deep architectures.

However, ReLU is not differentiable at \(0\), and because its gradient is zero for all negative inputs, its neurons can become permanently inactive, i.e. they die out. This can be caused by high learning rates and reduces the model's learning capacity. It is commonly referred to as the “Dying ReLU” problem.

plt.plot(x, np.maximum(0, x))

[Plot: ReLU]
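
The dying behaviour follows from the gradient of ReLU, which is \(1\) for positive inputs and exactly \(0\) for negative ones, so a neuron whose pre-activations stay negative receives no updates at all:

relu_grad = (x > 0).astype(float)   # derivative of max(0, x), taken as 0 at x = 0
print(relu_grad[x < 0].sum())       # 0.0: no gradient flows for negative inputs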

Leaky ReLU

$$a^i_j = f(x^i_j) = \max(0.01 x^i_j, x^i_j)$$

The “Dying ReLU” problem can be addressed by allowing a small, non-zero slope when the input is less than or equal to \(0\), so that some gradient always flows. This has been shown to give better results for some problems.

plt.plot(x, np.maximum(0.01 * x, x))

[Plot: leaky ReLU]
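
The difference is visible in the gradient: negative inputs now get a small constant slope (\(0.01\)) instead of zero, so the neuron keeps learning:

leaky_grad = np.where(x > 0, 1.0, 0.01)   # derivative of max(0.01*x, x)
print(leaky_grad[x < 0].min())            # 0.01: a small but non-zero gradient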

Smooth ReLU

$$a^i_j = f(x^i_j) = \log\!\big(1+\exp(x^i_j)\big)$$

Also known as the softplus unit, this activation function avoids the “Dying ReLU” problem: it is differentiable everywhere, its gradient is never exactly zero, and it saturates less abruptly.

plt.plot(x, np.log(1 + np.exp(x)))

[Plot: softplus (smooth ReLU)]
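
A useful property is that the derivative of softplus is exactly the sigmoid, so its gradient is smooth and never exactly zero; a quick finite-difference check:

h = 1e-6
numeric = (np.log(1 + np.exp(x + h)) - np.log(1 + np.exp(x))) / h   # finite-difference derivative
print(np.allclose(numeric, 1 / (1 + np.exp(-x)), atol=1e-4))        # True: softplus' = sigmoid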

Logit

$$a^i_j = f(x^i_j) = \log\!\left(\frac{x^i_j}{1 - x^i_j}\right)$$

This function performs the inverse operation of the sigmoid: given probabilities in the range \((0, 1)\), it maps them to the full range of real numbers. The value of the logit function approaches infinity as the probability approaches \(1\), and minus infinity as it approaches \(0\).

It is mostly used in binary classification models, where we want to transform predicted probabilities into real-valued quantities.

p = np.linspace(0.01, 0.99, 100)  # the logit is only defined for inputs in (0, 1)
plt.plot(p, np.log(p / (1 - p)))

[Plot: logit]
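
Since the logit inverts the sigmoid, applying the sigmoid to the logit recovers the original probabilities:

p = np.linspace(0.01, 0.99, 99)
logits = np.log(p / (1 - p))
print(np.allclose(1 / (1 + np.exp(-logits)), p))   # True: sigmoid(logit(p)) == p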

Softmax

$$a^i_j = f(x^i_j) = \frac{\exp(x^i_j)}{\sum\limits_k \exp(x^i_k)}$$

The softmax function outputs, for each class, the probability of that class being the correct one, so it produces values in the range \((0, 1)\) that always sum to \(1\). It emphasizes the largest values and suppresses those well below the maximum. It is widely used in multi-class logistic regression models.

plt.plot(x, np.exp(x) / np.sum(np.exp(x)))

[Plot: softmax]
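
In practice the softmax is applied to a vector of scores per example, and the exponentials are usually computed after subtracting the maximum score, which leaves the result unchanged but avoids overflow. A minimal sketch (the scores below are made up):

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # shifting by the max does not change the result
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))   # made-up scores for one example
print(probs, probs.sum())                    # three values in (0, 1) summing to 1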

A Jupyter notebook containing all the above plots is hosted on GitHub.
