You have already read the article's heading, so without wasting any more time, let's dive straight into understanding the ACTIVATION FUNCTION.
While building a neural network, one of the mandatory choices to make is which activation function to use. In fact, it is an unavoidable choice, because activation functions are what enable a neural network to learn and approximate any kind of complex relationship between variables.
What is "Activation Function"?
The activation function decides whether a neuron should be activated or not by calculating the weighted sum of its inputs and adding a bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron.

The roles and responsibilities of the activation function are to normalize, restrict, non-linearize, or filter the data passing through the network.
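As a rough sketch of the idea (using NumPy, with made-up weights, bias, and a ReLU-style activation purely for illustration), a single neuron might look like this:

import numpy as np

def neuron(x, W, b, activation):
    z = np.dot(W, x) + b        # weighted sum of inputs plus bias
    return activation(z)        # the activation decides the neuron's output

# toy example with arbitrary numbers
out = neuron(np.array([0.5, -1.2, 3.0]),
             np.array([0.4, 0.7, -0.2]),
             0.1,
             lambda z: max(z, 0.0))
print(out)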
Can we do without an Activation Function?
We understand that using an activation function introduces an additional step at each layer during the forward propagation. Now the question is – if the activation function increases the complexity so much, can we do without an activation function?
Imagine a neural network without activation functions. In that case, every neuron would only be performing a linear transformation on the inputs using the weights and biases. Although linear transformations make the neural network simpler, such a network would be less powerful and would not be able to learn complex patterns from the data.
A neural network without an activation function is essentially just a linear regression model.
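A quick sketch of why this happens (NumPy, arbitrary numbers): stacking two purely linear layers collapses into a single linear layer, so depth adds no expressive power.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# two "layers" with no activation function
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
two_layer = W2 @ (W1 @ x + b1) + b2

# the exact same mapping expressed as one linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))   # True: no extra expressive power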
Activation Function Types -
Linear Function -


y = mx + c (in the line equation, m corresponds to W (the weights) and c corresponds to b (the bias) in neural nets, so the equation can be rewritten as y = Wx + b)
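As a minimal sketch, the linear activation simply passes the weighted sum through unchanged (with m = 1 and c = 0 it is the identity):

def linear(z, m=1.0, c=0.0):
    # y = mx + c; with the defaults this is the identity activation
    return m * z + c

print(linear(2.5))        # 2.5
print(linear(2.5, m=3))   # 7.5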
Pros
Cons
Binary Step Function -

The Binary Step Function is widely known as the "Threshold Function".

This activation function is best suited to classifying inputs into two categories, such as differentiating between pictures of cats (so fluffy!) and birds. However, it should only be used at the output nodes of the neural network, not in the hidden layers.
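A minimal sketch of the binary step (threshold) function, assuming a threshold of 0:

def binary_step(z, threshold=0.0):
    # fires (returns 1) only when the input reaches the threshold
    return 1 if z >= threshold else 0

print(binary_step(-0.7))  # 0 -> neuron stays off
print(binary_step(2.3))   # 1 -> neuron activates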
Pros
Cons
Non-Linear Functions -
The graph of a linear function is a line; thus, the graph of a nonlinear function is not a line. Linear functions have a constant slope, whereas nonlinear functions have a slope that varies between points. In general, nonlinear functions have the opposite properties of linear functions.
Different types of Non-Linear Functions
1. Sigmoid (Logistic) Activation Function


This S-shaped function has proven to work well in two- and three-layer neural networks, particularly for classification problems. Notice the hill-shaped derivative of the function, which pushes the network to "move down the hill" to either side, giving the network more distinction when classifying.
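A small sketch of the sigmoid and its hill-shaped derivative (NumPy):

import numpy as np

def sigmoid(z):
    # S-shaped curve squashing any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # hill-shaped: largest near z = 0, vanishing for large |z|
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))             # [~0.007, 0.5, ~0.993]
print(sigmoid_derivative(z))  # peaks at 0.25 when z = 0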
Pros
Cons
2. Tanh Activation Function


Tanh is a modified version of the sigmoid activation function and has similar properties to it.

Pros
Cons
Tanh is preferred over the sigmoid function since it is zero-centered, so the gradients are not restricted to move in a single direction.
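A short sketch of tanh, showing its zero-centered output range (-1, 1):

import numpy as np

def tanh(z):
    # zero-centered S-curve; equivalent to 2 * sigmoid(2z) - 1
    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])
print(tanh(z))   # [~-0.96, 0.0, ~0.96]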
3. ReLU Activation Function (ReLU - Rectified Linear Unit)
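ReLU outputs the input unchanged if it is positive and 0 otherwise. A minimal sketch:

import numpy as np

def relu(z):
    # f(z) = max(0, z): identity for positive inputs, 0 for negative inputs
    return np.maximum(0.0, z)

print(relu(np.array([-3.0, 0.0, 4.2])))   # [0.  0.  4.2]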


Pros
Cons
4. Leaky ReLU Activation Function


Leaky ReLU is an attempt to fix the dying ReLU problem: for negative inputs, the gradient becomes a small value α instead of 0.
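A small sketch, assuming the commonly used default slope α = 0.01 for negative inputs:

import numpy as np

def leaky_relu(z, alpha=0.01):
    # negative inputs keep a small slope alpha instead of being zeroed out
    return np.where(z > 0, z, alpha * z)

print(leaky_relu(np.array([-3.0, 0.0, 4.2])))   # [-0.03  0.    4.2 ]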
Pros
Cons
5. ELU (Exponential Linear Units) Activation Function
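A small sketch of ELU, assuming α = 1.0: positive inputs pass through unchanged, while negative inputs saturate smoothly towards -α.

import numpy as np

def elu(z, alpha=1.0):
    # identity for positive inputs; smooth exponential saturation for negative inputs
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

print(elu(np.array([-3.0, 0.0, 4.2])))   # [~-0.95  0.    4.2 ]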


Pros
Cons
6. PReLU (Parametric ReLU) Activation Function


Instead of multiplying x by a constant term, we can multiply it by a trainable parameter α, which seems to work better than leaky ReLU. This extension of leaky ReLU is known as Parametric ReLU.
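The shape is the same as leaky ReLU, except that α is learned during training. A minimal sketch (the training loop that actually updates α is omitted):

import numpy as np

def prelu(z, alpha):
    # same form as leaky ReLU, but alpha is a trainable parameter, not a fixed constant
    return np.where(z > 0, z, alpha * z)

alpha = 0.25   # initial value; in practice it is updated by backpropagation
print(prelu(np.array([-3.0, 0.0, 4.2]), alpha))   # [-0.75  0.    4.2 ]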
Pros
Cons
7. Swish Activation Function


Experiments show that Swish tends to work better than ReLU on deeper models across a number of challenging datasets.
The curve of the Swish function is smooth and the function is differentiable at all points. This is helpful during model optimization and is considered one of the reasons Swish outperforms ReLU.
It works well for inputs of both positive and negative values.
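A small sketch of Swish, f(x) = x * sigmoid(x):

import numpy as np

def swish(z):
    # smooth, non-monotonic: x * sigmoid(x)
    return z / (1.0 + np.exp(-z))

print(swish(np.array([-3.0, 0.0, 4.2])))   # small negative dip, 0, ~4.14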
8. Softmax / Normalized Exponential Function

Softmax can be described as a combination of multiple sigmoid functions.
The softmax function is also a type of sigmoid function, but it is very useful for handling multi-class classification problems.
"The softmax function returns the probability of a data point belonging to each individual class."
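A minimal sketch of softmax over a vector of class scores (shifting by the maximum score is a standard trick for numerical stability):

import numpy as np

def softmax(z):
    # exponentiate and normalize so the outputs sum to 1 (one probability per class)
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))          # [~0.66, ~0.24, ~0.10]
print(softmax(scores).sum())    # 1.0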
9. Softplus Activation Function

The softplus function is similar to the ReLU function, but it is relatively smoother. The softplus (or SmoothReLU) function is f(x) = ln(1 + exp(x)).
The derivative of the softplus function is the logistic (sigmoid) function: f'(x) = 1 / (1 + exp(-x)).
The function's values range over (0, +inf).
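A small sketch of softplus and its derivative (the logistic sigmoid):

import numpy as np

def softplus(z):
    # smooth approximation of ReLU: always positive, never exactly zero
    return np.log1p(np.exp(z))

def softplus_derivative(z):
    # the logistic sigmoid, 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

print(softplus(np.array([-3.0, 0.0, 4.2])))   # [~0.049, ~0.693, ~4.215]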
Cons
10. Maxout Activation Function

The Maxout activation is a generalization of the ReLU and the leaky ReLU functions.
It is a learnable activation function.
It is a piecewise linear function that returns the maximum of several linear transformations of its input.
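A rough sketch of a maxout unit with two linear pieces (the weights and biases here are arbitrary; in practice all of them are learned):

import numpy as np

def maxout(x, Ws, bs):
    # returns the maximum over several learned linear transformations of the input
    return np.max([W @ x + b for W, b in zip(Ws, bs)], axis=0)

x = np.array([0.5, -1.2])
Ws = [np.array([[1.0, 0.0]]), np.array([[-1.0, 0.0]])]   # with these weights the unit behaves like |x[0]|
bs = [np.array([0.0]), np.array([0.0])]
print(maxout(x, Ws, bs))   # [0.5]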
Which one is better to use? How to choose the right one?
To be honest, there is no hard and fast rule for choosing an activation function.
Each activation function has its own pros and cons.
What works well and what doesn't will be decided based on your own trials.
If you have any concerns or want to contact me, you can comment down below or reach out to me on LinkedIn.