AlexNet

Ashutosh Kumbhare Jan 07 2021 · 3 min read

History:  AlexNet was designed by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, who was Krizhevsky's Ph.D. advisor.

AlexNet competed in the ImageNet Large Scale Visual Recognition Challenge on September 30, 2012. The network achieved a top-5 error of 15.3%, more than 10.8 percentage points lower than that of the runner-up. It was also after that year that ever deeper neural networks were proposed, such as the excellent VGG and GoogLeNet.

Before AlexNet, the tanh and sigmoid activation functions were the common choices in CNNs; this was the first time the ReLU activation function was used in a major CNN architecture. The model was also trained on GPUs, which had rarely been used for deep learning before then.

Problems in LeNet: LeNet was the landmark model that came before AlexNet. Many models certainly appeared in between, but none of them was a game changer. LeNet used a fixed set of kernel sizes (5x5 convolutions and 2x2 pooling) and the tanh activation function, which can cause the vanishing gradient problem. LeNet takes a 32x32 input and has only about 60k parameters.
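To make the vanishing gradient point concrete, here is a minimal sketch (assuming PyTorch is available) comparing the gradients of tanh and ReLU at a few arbitrary positive inputs: the tanh gradient shrinks toward zero as the input grows, while the ReLU gradient stays at 1.

```python
import torch

# d/dx tanh(x) = 1 - tanh(x)^2 shrinks toward 0 as |x| grows,
# while d/dx relu(x) = 1 for any positive input.
for value in [0.5, 2.0, 5.0, 10.0]:
    x = torch.tensor(value, requires_grad=True)
    torch.tanh(x).backward()

    y = torch.tensor(value, requires_grad=True)
    torch.relu(y).backward()

    print(f"x={value:5.1f}  tanh grad={x.grad.item():.6f}  relu grad={y.grad.item():.1f}")
```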

The Dataset: For AlexNet, the authors chose the ImageNet dataset. It is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. ILSVRC uses a subset of ImageNet with roughly 1,000 images in each of 1,000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images. ImageNet consists of variable-resolution images, so the images were down-sampled to a fixed resolution of 256×256: given a rectangular image, the shorter side is rescaled to 256 and the central 256×256 patch is cropped from the resulting image.
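As a rough illustration of this down-sampling step, here is a sketch using torchvision transforms (rescale the shorter side to 256, then take the central 256×256 crop); the file name is a placeholder, and the exact resampling used in the original pipeline may differ.

```python
from PIL import Image
from torchvision import transforms

# Rescale the shorter side to 256 pixels, keeping the aspect ratio,
# then crop the central 256x256 patch, mirroring the description above.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
])

image = Image.open("example.jpg")   # placeholder input image
print(preprocess(image).size)       # (256, 256)
```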

The Architecture:

The architecture of AlexNet, shown in the figure above, consists of eight layers: five convolutional layers and three fully-connected layers. As we can see in the figure, the input image is of size 227x227x3 (i.e. a colour image is expected), and on top of it 96 kernels of size 11x11 with a stride of 4 are applied. The output of this first layer can be seen in the 2nd box of the figure: 55x55 with 96 feature maps.
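For readers who prefer code, below is a minimal PyTorch sketch of the layer stack described here (five convolutional layers followed by three fully-connected layers). It follows the commonly used single-GPU formulation and omits LRN and the original two-GPU split, so treat it as an approximation rather than a faithful reproduction of the paper.

```python
import torch
import torch.nn as nn

# Approximate AlexNet layer stack: 5 conv layers + 3 fully-connected layers.
# Channel counts and kernel sizes follow the paper; LRN and the two-GPU split
# are omitted, so this is a sketch rather than an exact replica.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),     # conv1: 227 -> 55
    nn.MaxPool2d(kernel_size=3, stride=2),                     # 55 -> 27
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),   # conv2
    nn.MaxPool2d(kernel_size=3, stride=2),                     # 27 -> 13
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),  # conv3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),  # conv4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),  # conv5
    nn.MaxPool2d(kernel_size=3, stride=2),                     # 13 -> 6
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(0.5),  # fc6
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),         # fc7
    nn.Linear(4096, 1000),                                     # fc8: 1000 classes
)

x = torch.randn(1, 3, 227, 227)      # one dummy 227x227 colour image
print(alexnet[0](x).shape)           # torch.Size([1, 96, 55, 55])
print(alexnet(x).shape)              # torch.Size([1, 1000])
```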

How to calculate output size?

            Formula: output size = (input size - kernel size + 2 × padding) / stride + 1

                  = (227 - 11 + 0) / 4 + 1 = 55
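A tiny helper function makes this calculation easy to reuse for the remaining layers; the values plugged in below are the first-layer numbers quoted above.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution: (input - kernel + 2*padding) // stride + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# First AlexNet convolution: 227x227 input, 11x11 kernel, stride 4, no padding.
print(conv_output_size(227, 11, stride=4))   # 55
```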

While extracting features, LRN was applied in between some of the convolutional layers.

Why LRN? What is Local Response Normalization (LRN)?

In this model the ReLU activation function is used in the forward pass. As we studied, ReLU clips negative inputs to zero, while for positive inputs it never saturates, so it does not cause a vanishing gradient problem. But that is also where a problem arises: positive activations are unbounded, so the data being passed forward is not normalized.

So, LRN was introduced.

LRN was first introduced in this architecture; the reason for using it was to encourage lateral inhibition. Lateral inhibition is a concept in neurobiology that refers to the capacity of a neuron to reduce the activity of its neighbors.

In DNNs, the purpose of this lateral inhibition is to carry out local contrast enhancement, so that the most strongly activated values are passed on as excitation to the next layers. LRN is a non-trainable layer that square-normalizes the pixel values in a feature map within a local neighborhood.

There are two types of LRN:

1.    Inter-Channel LRN

2.    Intra-Channel LRN

Inter-Channel LRN: This is what was originally used in AlexNet; we can even see it in the 3rd block of the architecture. The neighborhood is defined across the channels: for each (x, y) position, normalization is carried out along the depth dimension.

So, to address this problem with ReLU, normalization of the data using LRN was proposed. The idea was: whatever data we are passing through the ReLU function, why not pass it in normalized form? Passing normalized data may even reduce the computation time.
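As a concrete sketch, PyTorch ships a built-in inter-channel LRN layer, nn.LocalResponseNorm; the hyperparameters used below (size=5, alpha=1e-4, beta=0.75, k=2) are the values reported in the AlexNet paper.

```python
import torch
import torch.nn as nn

# Inter-channel LRN: each activation is divided by a term computed from the
# squared activations of neighbouring channels at the same (x, y) position.
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

feature_map = torch.randn(1, 96, 55, 55)   # e.g. the output of the first conv layer
normalized = lrn(feature_map)
print(normalized.shape)                    # torch.Size([1, 96, 55, 55])
```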

Now, let’s come back to AlexNet.

Number of Parameters: Of the roughly 62.3 million parameters, only about 6% are found in the convolutional layers, i.e. 6% of the learning is done in the convolutional layers and 94% in the fully-connected layers. The convolutional layers are essentially there to extract features.

Forward Computation: About 95% of the forward computation happens in the convolutional layers, because of the many kernels and filter operations, and only about 5% in the final fully-connected layers.
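As a rough sanity check of this parameter split, the sketch below counts convolutional versus fully-connected parameters in torchvision's AlexNet implementation; note that this variant differs slightly from the original (no LRN, different channel counts), so the printed percentages are only approximate.

```python
import torch.nn as nn
from torchvision.models import alexnet

model = alexnet()   # randomly initialised weights, torchvision's AlexNet variant

# Sum parameter counts separately for Conv2d and Linear layers.
conv_params = sum(p.numel() for m in model.modules()
                  if isinstance(m, nn.Conv2d) for p in m.parameters())
fc_params = sum(p.numel() for m in model.modules()
                if isinstance(m, nn.Linear) for p in m.parameters())
total = conv_params + fc_params

print(f"conv: {conv_params:,} ({100 * conv_params / total:.1f}%)")
print(f"fc:   {fc_params:,} ({100 * fc_params / total:.1f}%)")
```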

The Overfitting Problem. AlexNet had 60 million parameters, a major issue in terms of overfitting. Two methods were employed to reduce overfitting:

  • Data Augmentation. The authors used label-preserving transformations to make their data more varied. Specifically, they generated image translations and horizontal reflections, which increased the training set by a factor of 2048. They also performed Principal Component Analysis (PCA) on the RGB pixel values to change the intensities of the RGB channels, which reduced the top-1 error rate by more than 1%. (A minimal sketch of the translation-and-reflection part follows this list.)
  • Dropout. This technique consists of “turning off” neurons with a predetermined probability (e.g. 50%). This means that every iteration uses a different sample of the model’s parameters, which forces each neuron to learn more robust features that can be used with other random neurons. However, dropout also increases the training time needed for the model’s convergence.
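Below is a minimal sketch of the translation-and-reflection part of this augmentation using torchvision transforms; the PCA-based colour augmentation is omitted, and the 227×227 crop size is chosen to match the input size quoted earlier (the paper's factor of 2048 corresponds to 224×224 crops).

```python
from torchvision import transforms

# Label-preserving augmentation in the spirit of the description above:
# random translations (random crops from the 256x256 image) plus random
# horizontal reflections. The PCA-based colour augmentation is omitted.
augment = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.RandomCrop(227),              # random translation
    transforms.RandomHorizontalFlip(p=0.5),  # random reflection
    transforms.ToTensor(),
])
```

Dropout with p = 0.5, as described in the second bullet, is available as nn.Dropout(0.5) and already appears in the architecture sketch earlier.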
Result: The network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%. The best performance achieved during the ILSVRC-2010 competition was 47.1% and 28.2% with an approach that averages the predictions produced from six sparse-coding models trained on different features; since then, the best published results had been 45.7% and 25.7% with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features.

The results on ILSVRC-2010 are summarized in Table 1 of the paper.

The Problem: The main limitation of AlexNet was that it did not perform well on higher-resolution images. At the time, a 227x227 image was considered good quality, and AlexNet was very popular around 2012-2013, but as image resolutions increased, AlexNet started losing its grip.
