Artificial Intelligence (AI) is one of the fastest-growing fields of 2018, and it is changing our world. If you know about AI, you have almost certainly heard of Neural Networks, one of the most popular and widely used algorithms in the field. In this article, I will talk about Activation Functions, a very important component of Neural Networks.
Note — When counting the layers of a Neural Network, we don’t count the input layer.
In the simplest form, we calculate the “weighted sum” as z = w1·x1 + w2·x2 + … + wn·xn + b, where the w’s are the weights, the x’s are the inputs, and b is the bias.
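As a sketch, the weighted sum for a single neuron can be computed like this (the input, weight, and bias values are made up for illustration):

```python
import numpy as np

# Hypothetical inputs, weights, and bias for a single neuron
x = np.array([0.5, 0.2, 0.1])   # inputs
w = np.array([0.4, 0.3, 0.2])   # weights
b = 0.1                         # bias

# Weighted sum: z = w1*x1 + w2*x2 + w3*x3 + b
z = np.dot(w, x) + b
print(z)  # ≈ 0.38
```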
The final output value is the prediction, and we use it to calculate the error by comparing it with the label. We then use the error to calculate the partial derivatives with respect to the weights, and update the weights with those values. We keep repeating this process until the error becomes very small. This process is known as Backward Propagation.
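A minimal sketch of a single weight update under gradient descent (the weight, gradient, and learning-rate values here are made up for illustration):

```python
# One gradient-descent step on a single weight (illustrative values)
w = 0.5        # current weight
grad = 0.2     # partial derivative of the error w.r.t. this weight
lr = 0.1       # learning rate

# Update rule: w := w - lr * dE/dw
w = w - lr * grad
print(w)
```

Repeating this step over all weights, again and again, is what drives the error down during training.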
This is how a Neural Network works. Now that we understand Neural Networks, we can jump into Activation Functions.
Note — I’m not going to go deep into Forward Propagation and Back Propagation. If you don’t have any idea about these topics, please learn them first and then follow this post.
So, why do we use Activation Functions? What is their importance in a Neural Network?
Activation Functions are actually very important for Neural Networks: they add the crucial property of Non-Linearity. Without Activation Functions, Neural Networks are purely linear.
But why is Non-Linearity so important?
A linear function is just a polynomial of degree one, such as the function y = x. A graph of the function is given below.
As you can see in the graph, the function forms a straight line. This is a property of Linear Functions: they always form straight lines. If we add more dimensions to the function, it forms planes or hyperplanes. Linear Functions are incapable of forming curves.
On the other hand, Non-Linear Functions can form almost any shape, including curves. Non-Linear Functions include polynomials of degree greater than one, for example the function y = x² or the function y = 2x³.
In the graph, the functions y = x² and y = sin(x) form curves, not straight lines. Linear Functions are easy to solve, but they cannot learn complex patterns. Non-Linear Functions, on the other hand, can learn very complex patterns because they can form complex shapes. And this is very important in Neural Networks.
Also, Linear Functions cause another problem. In Neural Networks, adding more layers or more nodes lets us learn more complex functions. But if we don’t use Activation Functions, adding more layers or more nodes does not help, because the composition of two Linear Functions is still a Linear Function.
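This collapse is easy to check numerically: stacking two linear layers (with randomly chosen weights, no activation in between) gives exactly the same output as a single layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two "layers" with no activation function between them
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
two_layer = W2 @ (W1 @ x)

# A single layer with the combined weight matrix gives the same output
W_combined = W2 @ W1
one_layer = W_combined @ x

print(np.allclose(two_layer, one_layer))  # True
```

So no matter how many linear layers we stack, the whole network is equivalent to one linear layer — which is exactly why we need a non-linear Activation Function between layers.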
At this point, you should understand the importance of Activation Functions and Non-Linearity in Neural Networks.
Now, let’s talk about the types of Activation Functions.
There are many different Activation Functions out there. We will not discuss all of them in this post, only the Activation Functions that are generally used in Neural Networks.
In this post, we’ll discuss 4 major Activation Functions. And these are
1. Sigmoid Activation Function
2. TanH Activation Function
3. ReLU and
4. Leaky ReLU.
There are more Activation Functions out there, but these 4 are the major and most used ones.
Sigmoid or Logistic Activation Function
Sigmoid is one of the most popular and heavily used Activation Functions. It is a very simple mathematical function that squashes its input into the range between 0 and 1, which means we can use it to decide whether to fire a neuron or not. The mathematical formula is sigmoid(x) = 1 / (1 + e^(-x)).
And the graph of the equation is also given below.
Although the Sigmoid Activation Function was very popular, these days, in the era of Deep Learning, we don’t use it that much because it suffers from several problems. Here is a list of common problems with the Sigmoid Activation Function:
- Vanishing gradient problem.
- Secondly, its output isn’t zero-centered: the output always lies between 0 and 1 (0 < output < 1). This makes the gradient updates go too far in different directions and makes optimization harder.
- Sigmoids saturate and kill gradients.
- Sigmoids have slow convergence.
Here is the Python 3.x code for the Sigmoid Function

import numpy as np

def sigmoid(x, deriv=False):
    # With deriv=True, x is assumed to already hold sigmoid(x)
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))
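The saturation and Vanishing Gradient problems from the list above are easy to see numerically: the derivative of the sigmoid is sigmoid(x) * (1 - sigmoid(x)), which is nearly zero for large inputs, so almost no gradient flows back through a saturated neuron.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

for x in [0.0, 5.0, 10.0]:
    s = sigmoid(x)
    grad = s * (1 - s)  # derivative of sigmoid at x
    print(x, grad)
# At x = 0 the gradient is 0.25 (its maximum);
# at x = 10 it is about 4.5e-05, effectively killing the gradient.
```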
TanH Activation Function
The TanH Activation Function is another very popular Activation Function, widely used in Machine Learning. Unlike Sigmoid, the tanH function’s output is always zero-centered because its range is between -1 and 1. This makes optimization easier, so in practice it is usually preferred over the Sigmoid function. But it still suffers from the Vanishing Gradient problem.
The mathematical equation and the graph of that equation is given below.
Even with the tanH Activation Function, we still have the Vanishing Gradient problem. Here is the Python 3 code for the tanH Activation Function
import numpy as np

def tanh(x, deriv=False):
    # With deriv=True, x is assumed to already hold tanh(x)
    if deriv:
        return 1 - x**2
    return np.tanh(x)
ReLU Activation Function
The ReLU Activation Function is very popular, especially with Deep Neural Networks. Many researchers have shown that ReLU performs better than other Activation Functions most of the time. The biggest advantage of ReLU is that it learns faster and avoids the Vanishing Gradient problem. ReLU is an incredibly simple function: f(x) = max(0, x).
Here is the graph of ReLU.
But ReLU has one limitation: it should only be used in the hidden layers of a Neural Network. Another problem with ReLU is that some gradients can be fragile during training and can die: a weight update can make a neuron never activate again on any data point. Simply put, ReLU can result in Dead Neurons.
Python 3 code for ReLU

import numpy as np

def relu(x, deriv=False):
    if deriv:
        return 1. * (x > 0)
    return x * (x > 0)
Leaky ReLU Activation Function
The Leaky ReLU Activation Function is another very popular Activation Function, generally used with Deep Neural Networks. It is a modified version of the ReLU Activation Function, and it solves the problem of dying neurons: instead of outputting zero for negative inputs, it introduces a slight slope, so some gradient always flows and neurons never die completely.
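Here is a minimal sketch of Leaky ReLU in the same style as the earlier functions (the slope alpha = 0.01 is a common but arbitrary choice, and the function name is my own):

```python
import numpy as np

def leaky_relu(x, alpha=0.01, deriv=False):
    # alpha is the small slope used for negative inputs
    if deriv:
        return np.where(x > 0, 1.0, alpha)
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 3.0])))  # a negative input becomes -0.02, not 0
```

Because the gradient for negative inputs is alpha instead of 0, a weight update can always move a “dead” neuron back into the active region.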
Which one should we use?
Honestly speaking, “it depends”. It depends on what type of task you are performing, what type of data you are using, and what model you are using. Generally speaking, in most cases the ReLU Activation Function works better than any other Activation Function. So try ReLU first, and if you are not getting the performance you want from the model, try some other Activation Function. Remember one thing: don’t use ReLU or Leaky ReLU in the output layer; use these Activation Functions only in the hidden layers.
What about other Activation Functions?
There are so many other Activation Functions. I cannot cover all Activation Functions here. The 5 Activation Functions are mentioned here are very popular and generally used in Machine Learning.