#deeplearning #neuralnetwork #ai

Jay Singh Sept 18 2020 · 6 min read

You have already read the heading, so without wasting any more time, let's dive into understanding the activation function.

When building a neural network, one mandatory choice is which activation function to use. The choice is unavoidable because activation functions are what allow a neural network to learn and approximate complex, non-linear relationships between variables.

What is "Activation Function"?

An activation function is applied to the weighted sum of a neuron's inputs plus its bias, and its output decides whether, and how strongly, the neuron is activated. Its main purpose is to introduce non-linearity into the output of a neuron.

Depending on the function chosen, the activation can also normalize, restrict, or filter the values passed on to the next layer.

Can we do without an Activation Function?

We understand that using an activation function introduces an additional step at each layer during the forward propagation. Now the question is – if the activation function increases the complexity so much, can we do without an activation function?

Imagine a neural network without activation functions. In that case, every neuron only performs a linear transformation on its inputs using the weights and biases. Although this makes the network simpler, a composition of linear transformations is itself just one linear transformation, so the network would be far less powerful and would not be able to learn complex patterns from the data, no matter how many layers it has.

   A neural network without an activation function is essentially just a linear regression model.
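To make the point above concrete, here is a minimal sketch in plain Python. The weights and biases are made-up numbers chosen only for illustration, and each "layer" is a single scalar neuron. It shows that two linear layers with no activation in between compute exactly the same thing as one linear layer.

```python
def linear_layer(x, w, b):
    """Apply y = w*x + b to a single scalar input."""
    return w * x + b

# Hypothetical weights and biases for two stacked layers (not from the article).
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_layers(x):
    """Two linear layers applied back to back, with no activation in between."""
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# The same mapping written as ONE linear layer:
# w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2)
w_combined = w2 * w1
b_combined = w2 * b1 + b2

def one_layer(x):
    """A single linear layer that reproduces two_layers exactly."""
    return linear_layer(x, w_combined, b_combined)

for x in (-2.0, 0.0, 1.5):
    print(x, two_layers(x), one_layer(x))
```

Both columns print the same values for every input, which is why extra layers add nothing until a non-linear activation is placed between them.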

Activation Function Types

  • Linear Function
  • Binary Step Function
  • Non-Linear Function
Linear Function

    y = mx + c. In a neural network the slope m corresponds to the weights W and the intercept c to the bias b, so the equation can be written as y = Wx + b.


    Pros

  • It gives a range of activations, so it is not a binary activation.
  • We can connect several neurons together and, if more than one fires, take the max (or softmax) of their outputs and decide based on that.

    Cons

  • The derivative of this function is a constant, so the gradient has no relationship with x.
  • Because the gradient is constant, gradient descent takes steps along the same slope regardless of the input.
  • If there is an error in the prediction, the change made by backpropagation is constant and does not depend on the change in input, Δx.
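The constant-gradient point can be checked numerically. This is a small sketch, assuming a scalar linear activation with made-up values w = 0.5 and b = 1.0 and a central-difference estimate of the derivative.

```python
def linear(x, w=0.5, b=1.0):
    """Linear activation y = w*x + b (w and b are illustrative values)."""
    return w * x + b

def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Estimate the slope at three very different inputs.
slopes = [numerical_derivative(linear, x) for x in (-10.0, 0.0, 10.0)]
print(slopes)
```

Up to floating-point noise, the estimated slope equals w at every input, so the gradient carries no information about where on the line the input sits.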
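Binary Step Function

    The binary step function is also widely known as the "threshold function": it outputs 1 when the input reaches a threshold and 0 otherwise.

    This activation function is best used for two-class decisions, such as telling pictures of cats (so fluffy!) apart from pictures of birds. However, it should only be used at the output nodes of the neural network, not in the hidden layers, because its gradient is zero everywhere except at the threshold, which leaves backpropagation nothing to work with.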


    Pros

  • Best with binary classification schemes (“yes” or “no”, 1 or 0).

    Cons

  • Real life isn’t that black and white, and most situations are not this binary.
  • Can only be used at the end of a network.
  • Suited mainly to perceptrons and single-layer networks.
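As a rough illustration of the threshold idea, the sketch below wires a binary step into a single perceptron. The threshold of 0 and the weights, which happen to reproduce a logical AND, are assumptions picked for the example.

```python
def binary_step(x, threshold=0.0):
    """Return 1 when the input reaches the threshold, otherwise 0."""
    return 1 if x >= threshold else 0

def perceptron(inputs, weights, bias):
    """Weighted sum plus bias, followed by the binary step."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return binary_step(weighted_sum)

# Hypothetical weights that make the perceptron behave like logical AND.
and_weights, and_bias = [1.0, 1.0], -1.5

outputs = [perceptron([a, b], and_weights, and_bias)
           for a in (0, 1) for b in (0, 1)]
print(outputs)  # inputs (0,0), (0,1), (1,0), (1,1) -> [0, 0, 0, 1]
```

The output is always exactly 0 or 1, which is convenient for a yes/no answer but gives no sense of how close the input was to the threshold.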
Non-Linear Function

    The graph of a linear function is a line, so the graph of a non-linear function is anything that is not a line. Linear functions have a constant slope, whereas the slope of a non-linear function varies from point to point. As a result, stacking non-linear functions does not collapse into a single linear transformation, which is what lets a network model complex relationships.

    Different Types of Non-Linear Functions

    1. Sigmoid (Logistic) Activation Function

    The sigmoid is an S-shaped function, f(x) = 1 / (1 + exp(−x)), that squashes any input into the range (0, 1). It has worked well in shallow networks of two or three layers, particularly on classification problems. Notice the hill-shaped derivative of the function, largest around x = 0 and falling off on both sides, which pushes the network to “move down the hill” to either side and gives it more distinction when classifying.


    Pros

  • Works well for classification problems.
  • Its non-linearity lets the network model more complex relationships, so it can be used for more difficult tasks.

    Cons

  • The sigmoid function is not zero-centered: its output always satisfies 0 < output < 1. Because every activation is positive, the gradients of all the incoming weights of a neuron in the next layer share the same sign, so updates zig-zag and optimization becomes harder.
  • If a neuron’s input is very negative, the output after the activation function is close to 0, where the curve is flat and the gradient is almost nothing (the same happens near 1 for very positive inputs). The neuron is then saturated and barely learns.
  • If you graph f′(x), you can see that it lies between 0 and 0.25. In an n-layer neural network that uses the sigmoid in every layer, the gradient is multiplied by one such factor per layer as the signal is backpropagated, so it gets smaller and smaller. This is the vanishing gradient problem.
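The saturation and vanishing-gradient points can be seen with a few lines of plain Python. This is only a sketch: it looks at the sigmoid factor alone and ignores the weights, which also scale the gradient in a real network. The helper name max_gradient_factor is made up for this example.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_derivative(x):
    """f'(x) = f(x) * (1 - f(x)); its largest value is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def max_gradient_factor(n_layers):
    """Upper bound on the product of n sigmoid derivatives (weights ignored)."""
    return sigmoid_derivative(0.0) ** n_layers

print(sigmoid_derivative(0.0))    # the peak of the "hill"
print(sigmoid_derivative(-10.0))  # saturated neuron: gradient is almost nothing
print(max_gradient_factor(10))    # best case after 10 sigmoid layers
```

Even in the best case, where every neuron sits exactly at the peak of the derivative, ten sigmoid layers shrink the gradient by a factor of 0.25 to the power 10, which is below one in a million.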
    2. Tanh Activation Function

    Tanh is a rescaled and shifted version of the sigmoid, tanh(x) = 2·sigmoid(2x) − 1, so it has a similar S shape and similar properties, but its output lies in (−1, 1).


    Pros

  • The output is zero-centered.
  • Optimization is easier than with the sigmoid. Tanh is generally preferred over the sigmoid for this reason: because it is zero-centered, the gradients are not restricted to move in a single direction.
  • The derivative of the tanh function, f′(x), lies between 0 and 1.

    Cons

  • It is computationally heavier than piecewise-linear functions because it uses the exponential function, which can slow training.
  • Like the sigmoid, tanh saturates for large positive or negative inputs, so it also suffers from the vanishing gradient problem.
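The relationship between tanh and the sigmoid can be verified directly. The sketch below relies on the standard identity tanh(x) = 2·sigmoid(2x) − 1 and on the derivative 1 − tanh(x)²; the function names are just illustrative.

```python
import math

def sigmoid(x):
    """Logistic sigmoid (fine for the moderate inputs used here)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    """tanh written as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_derivative(x):
    """f'(x) = 1 - tanh(x)^2, which lies in (0, 1]."""
    return 1.0 - math.tanh(x) ** 2

for x in (-2.0, 0.0, 0.7):
    print(x, math.tanh(x), tanh_from_sigmoid(x), tanh_derivative(x))
```

The two tanh columns agree, the output is centered on 0, and the derivative is 1 at x = 0 but falls off quickly as the input moves away from zero, which is the saturation behind the vanishing gradient.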

    3. ReLU Activation Function (ReLU: Rectified Linear Unit)

    ReLU passes positive inputs through unchanged and outputs 0 for everything else: f(x) = max(0, x).

    Pros

  • It does not activate all the neurons at the same time: neurons with a negative input output 0, so the activations are sparse.
  • Computationally efficient.
  • Convergence is very fast.

    Cons

  • ReLU is not zero-centered. Its output is never negative (0 ≤ output < ∞), so the gradient updates for a neuron's incoming weights all push in the same direction, which makes optimization harder.
  • If a neuron's input is negative for every example, its output is always 0 and, as such, its gradient is always 0. This is a massive problem because it essentially “kills” the neuron and prevents it from learning.
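A minimal sketch of ReLU and its derivative, assuming the usual definition f(x) = max(0, x) and the common convention of treating the derivative at exactly x = 0 as 0:

```python
def relu(x):
    """Rectified Linear Unit: pass positive inputs, zero out the rest."""
    return max(0.0, x)

def relu_derivative(x):
    """1 for positive inputs, 0 otherwise (value at x = 0 is a convention)."""
    return 1.0 if x > 0 else 0.0

inputs = [-3.0, -0.5, 0.0, 2.0]
outputs = [relu(x) for x in inputs]
grads = [relu_derivative(x) for x in inputs]

print(outputs)  # [0.0, 0.0, 0.0, 2.0]
print(grads)    # [0.0, 0.0, 0.0, 1.0]
```

Every non-positive input gets both a zero output and a zero gradient. A neuron whose inputs all fall in that region receives no weight updates at all, which is the "dying" behaviour described above.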
    4. Leaky ReLU Activation Function

    This is an attempt to fix the dying ReLU problem: for negative inputs the gradient becomes a small value α instead of 0, so f(x) = x for x > 0 and f(x) = αx otherwise.

    Pros

  • Leaky ReLU is defined to address the problem of dying (dead) neurons.
  • It lets a small, non-zero gradient flow for negative inputs during backpropagation.
  • It is efficient and easy to compute.

    Cons

  • Leaky ReLU does not provide consistent predictions for negative input values.
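Here is a small sketch of Leaky ReLU. The slope α = 0.01 is a commonly used default and is an assumption here, since the article does not fix a value.

```python
def leaky_relu(x, alpha=0.01):
    """Identity for positive inputs, a small slope alpha for the rest."""
    return x if x > 0 else alpha * x

def leaky_relu_derivative(x, alpha=0.01):
    """Gradient is 1 for positive inputs and alpha (not 0) otherwise."""
    return 1.0 if x > 0 else alpha

for x in (-2.0, 0.5):
    print(x, leaky_relu(x), leaky_relu_derivative(x))
```

Because the gradient for negative inputs is α rather than 0, a neuron that lands in the negative region can still receive updates and recover.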
    5. ELU (Exponential Linear Unit) Activation Function

    Pros

  • ELU was also proposed to solve the problem of dying neurons.
  • Its outputs can be negative, which pushes the mean activation closer to zero.

    Cons

  • Slower to compute because of the exponential function.
  • Similar to Leaky ReLU, although theoretically better than ReLU, there is currently no good evidence in practice that ELU is always better than ReLU.
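The text above does not spell out the formula, so this sketch assumes the usual definition of ELU: f(x) = x for x > 0 and α(exp(x) − 1) otherwise, with α = 1.0 as an illustrative default.

```python
import math

def elu(x, alpha=1.0):
    """Identity for positive inputs, a smooth curve toward -alpha otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_derivative(x, alpha=1.0):
    """1 for positive inputs, alpha * exp(x) otherwise (never exactly 0)."""
    return 1.0 if x > 0 else alpha * math.exp(x)

for x in (-10.0, -1.0, 0.0, 2.0):
    print(x, elu(x), elu_derivative(x))
```

Negative inputs produce negative outputs that level off near −α, which is what pulls the average activation toward zero. The call to exp is also where the extra computational cost comes from.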
    6. PReLU (Parametric ReLU) Activation Function

    Instead of multiplying negative inputs by a fixed constant, we can multiply them by a parameter α that is learned during training, which seems to work better than Leaky ReLU. This extension of Leaky ReLU is known as Parametric ReLU.

    Pros

  • Here α is a learnable parameter rather than a hand-set hyperparameter.
  • Has a slight advantage over Leaky ReLU due to the trainable parameter.
  • Handles the problem of dying neurons.

    Cons

  • It does not provide consistent predictions for negative input values.
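To show what "learnable α" means in practice, here is a toy sketch of a single gradient-descent step on α. The starting value, learning rate, input, target, and squared-error loss are all assumptions made up for the example.

```python
def prelu(x, alpha):
    """Identity for positive inputs, slope alpha for the rest."""
    return x if x > 0 else alpha * x

def prelu_grad_alpha(x):
    """Derivative of the output with respect to alpha."""
    return 0.0 if x > 0 else x

# Hypothetical training sample and settings.
alpha_before, learning_rate = 0.25, 0.1
x, target = -2.0, -1.0

# Loss = 0.5 * (output - target)**2, so dLoss/dalpha = (output - target) * d(output)/dalpha.
output = prelu(x, alpha_before)
loss_grad = (output - target) * prelu_grad_alpha(x)
alpha_after = alpha_before - learning_rate * loss_grad

print(alpha_before, alpha_after)
print(prelu(x, alpha_before), prelu(x, alpha_after))
```

After the step, α has moved so that the output for this negative input is closer to the target. That adjustment is exactly what Leaky ReLU, with its fixed slope, cannot make.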
    7. Swish Activation Function

    Swish is defined as f(x) = x · sigmoid(x). The experiments show that Swish tends to work better than ReLU on deeper models across a number of challenging data sets.

    The curve of the Swish function is smooth and the function is differentiable at all points. This is helpful during model optimization and is considered to be one of the reasons that Swish outperforms ReLU.

    It handles both positive and negative inputs: unlike ReLU, it returns small non-zero values for negative inputs rather than cutting them off at 0.
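A short sketch of Swish, assuming the common form f(x) = x · sigmoid(βx) with β = 1 (the β parameter is included only for illustration):

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def swish(x, beta=1.0):
    """Swish: the input scaled by its own sigmoid."""
    return x * sigmoid(beta * x)

for x in (-5.0, -1.0, 0.0, 1.0, 10.0):
    print(x, swish(x))
```

For large positive inputs Swish behaves almost like the identity, as ReLU does. For negative inputs it dips slightly below zero instead of being clipped, and the curve has no kink at x = 0.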

    8. Softmax / Normalized Exponential Function

    Softmax can be described as a generalization of the sigmoid function to more than two classes.

    Where the sigmoid handles a two-class decision, softmax is very useful for multi-class classification problems: it turns a vector of scores into values between 0 and 1 that sum to 1.

    “Softmax function returns the probability for a data point belonging to each individual class.”
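The description above can be sketched in a few lines. The scores are made-up numbers, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)                            # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# With two classes and one score fixed at 0, softmax reduces to the sigmoid.
two_class = softmax([1.5, 0.0])[0]
print(two_class, 1.0 / (1.0 + math.exp(-1.5)))
```

The largest score gets the largest probability, and the two-class case matches the sigmoid, which is the sense in which softmax extends it.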

    9. Softplus Activation Function

    The softplus function is similar to the ReLU function, but it is smoother. Softplus, also called SmoothReLU, is defined as f(x) = ln(1 + exp(x)).

    The derivative of the softplus function is the logistic (sigmoid) function, f′(x) = 1 / (1 + exp(−x)).

    The function value ranges over (0, +∞).

    Cons

  • For very negative inputs the derivative approaches 0, so it has a vanishing gradient problem.
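As a quick check of the formulas above, the sketch below compares a numerical derivative of softplus with the logistic function. The rewritten form of ln(1 + exp(x)) is an equivalent expression chosen to avoid overflow, and the helper names are illustrative.

```python
import math

def softplus(x):
    """ln(1 + exp(x)), computed as max(x, 0) + ln(1 + exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_derivative(x):
    """The logistic sigmoid, 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def numerical_derivative(f, x, h=1e-5):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-20.0, -1.0, 0.0, 1.0):
    print(x, softplus(x), softplus_derivative(x), numerical_derivative(softplus, x))
```

The last two columns agree closely, and for very negative inputs both are close to 0, which is where the vanishing gradient shows up.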
    10. Maxout Activation Function

    The Maxout activation is a generalization of the ReLU and Leaky ReLU functions.

    It is a learnable activation function.

    It is a piecewise linear function that returns the maximum of several learned linear functions of its input.
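To see why Maxout generalizes ReLU and Leaky ReLU, here is a simplified sketch for a single scalar input. Real Maxout units take vector inputs and learn their weights; the fixed weights below are assumptions chosen to reproduce the two special cases.

```python
def maxout(x, weights, biases):
    """Maximum over several linear pieces w_i * x + b_i of a scalar input."""
    return max(w * x + b for w, b in zip(weights, biases))

def relu_like(x):
    """Pieces y = x and y = 0 reproduce ReLU."""
    return maxout(x, [1.0, 0.0], [0.0, 0.0])

def leaky_relu_like(x):
    """Pieces y = x and y = 0.01 * x reproduce Leaky ReLU with alpha = 0.01."""
    return maxout(x, [1.0, 0.01], [0.0, 0.0])

for x in (-3.0, 2.0):
    print(x, relu_like(x), leaky_relu_like(x))
```

With other weights and more pieces, the same unit can approximate other convex piecewise-linear shapes, which is what makes the activation itself learnable.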

Which one is better to use? How to choose the right one?

    To be honest, there is no hard and fast rule for choosing an activation function.

    Each activation function has its own pros and cons.

    What works best is decided by trials on your own problem.

    If you have any concerns or want to contact me, you can comment below or reach me on LinkedIn.
