The Convolutional Neural Network (CNN) is a popular type of neural network that is widely used for image classification. When an image is fed to a CNN, its convolutional layers identify different features of the image. This ability to extract feature information accurately from images is what makes CNN popular.
A neural network is a network of neural layers. Each layer contains units called neurons. How we connect these neurons makes up the configuration of a neural network.

CNN is a special type of neural network. In this article, we will learn the concepts that make a neural network a CNN.
A convolutional neural network consists of an input layer, convolutional layers and pooling (subsampling) layers, followed by a fully connected feed-forward network.

Before we understand CNN, let us understand why we need this special kind of neural network for processing images. In other words, why doesn’t a simple fully connected neural network work for image processing? Okay, now a step back… what is a fully connected neural network? A fully connected neural network is a network in which each neuron in one layer is connected to all neurons in its adjacent layers.

And why is a fully connected neural network not used for image processing? Because of what is called ‘Parameter Explosion’.
Consider a simple 100×100 pixel image. And let’s say we have 20 hidden layers with 50 neurons in each layer. The first hidden layer alone needs 100×100×50 = 500,000 weights, because every pixel is connected to every neuron, and the complete network has roughly 550,000 weights and biases to train. The count grows with the number of pixels: for a 1000×1000 image, the first layer alone needs 50,000,000 weights. Omg! So yeah, this is rightly known as ‘Parameter Explosion’. Thus we need a mechanism that works well on images but uses far fewer parameters than a fully connected feed-forward neural network. Yes… CNN works with fewer parameters and is hence also less prone to overfitting!
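To make the count concrete, here is a rough sketch that adds up weights and biases layer by layer for a fully connected stack. The function name `dense_parameter_count` is made up for this illustration, and the output layer is left out because the text does not specify one.

```python
def dense_parameter_count(input_size, hidden_layers, neurons_per_layer):
    """Count weights and biases in a stack of fully connected hidden layers."""
    total = 0
    fan_in = input_size  # number of values feeding the current layer
    for _ in range(hidden_layers):
        total += fan_in * neurons_per_layer  # one weight per connection
        total += neurons_per_layer           # one bias per neuron
        fan_in = neurons_per_layer           # next layer sees this layer's outputs
    return total

# 100x100 image, 20 hidden layers, 50 neurons each
small = dense_parameter_count(100 * 100, 20, 50)
# Same network on a 1000x1000 image
large = dense_parameter_count(1000 * 1000, 20, 50)

print(small)  # 548500
print(large)  # 50048500
```

Almost all of the parameters sit in the first layer, which is the layer that touches every pixel.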
Let’s dive into the components of the CNN.
I. Convolutional Layers:
CNN uses the mathematical operation called convolution, hence the name. A convolutional layer is formed by sliding a window function over the matrix representing the image. This sliding window is known as a kernel or filter, and it is nothing but another, smaller matrix. In the example below, a 3 x 3 kernel is slid across a 10 x 10 input matrix. At each position, the 3 x 3 kernel matrix (K) is multiplied element by element with the 3 x 3 patch of the input matrix (I) under it, and the products are summed into a single value.

In formula form, S(i, j) = Σ_m Σ_n I(i + m, j + n) × K(m, n), where S is the resultant matrix (an 8 x 8 matrix in the above example). The resultant matrix is also known as the ‘feature map’. This operation of multiplying the kernel element-wise with each image section and summing the products is known as convolution.
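The sliding-window operation can be sketched in a few lines of NumPy. This is a minimal illustration, not an efficient implementation: `convolve2d_valid` is a name chosen here, and the kernel values are an arbitrary vertical-edge filter. Like most deep learning libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            # Element-wise product of window and kernel, summed to one value
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(100, dtype=float).reshape(10, 10)   # stand-in 10 x 10 image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)          # example 3 x 3 kernel

feature_map = convolve2d_valid(image, kernel)
print(feature_map.shape)  # (8, 8)
```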

In CNN, each convolutional layer uses not just one filter but multiple filters (kernels), so we get multiple convolved matrices. Each convolved matrix (feature map) gives information on a different aspect of the image. In the picture below, 4 kernels are used.

Because each value in a feature map is calculated only from a small neighbourhood of pixels, CNN preserves the spatial structure of the image. Also, note that the same kernel matrix is reused at every position to produce a feature map. This weight sharing is why a CNN has so many fewer parameters than a feed-forward network processing the same image. In addition, the image is transformed into much smaller dimensions before it is fed into the fully connected layers of the CNN.
If you are still not sure about convolution and how its results carry the information in the image, the following examples from Wikipedia can help you understand the different effects produced when different kernels are convolved with the same image.

Do not worry. When designing the CNN, we will not be deciding on the kernel values. The kernel values are learned automatically by the network using backpropagation. The parameters that we have to decide are the number of kernels and the size of the kernel matrix to be used on each layer in the network. In general, a 3×3 kernel is used on small images, and 5×5, 7×7 or larger kernels on larger images. Generally, fewer filters (16, 32) are used at the input layer, and more filters are used at the deeper layers.
Now we know the output of the convolution is a feature matrix that is smaller than the input matrix. In the example below, convolving a 3 x 3 kernel with a 10 x 10 image produces an 8 x 8 feature map. Also, in the convolution process, the pixels at the corners and edges of the image matrix are covered fewer times than the inner pixels. This reduces the quality of the information extracted from the borders of the image.
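The uneven coverage can be checked by counting how many kernel positions touch each pixel. The sketch below assumes no padding and a stride of 1; `coverage_counts` is a helper name invented for this illustration.

```python
import numpy as np

def coverage_counts(input_size, kernel_size):
    """For each pixel, count how many kernel positions include it."""
    counts = np.zeros((input_size, input_size), dtype=int)
    positions = input_size - kernel_size + 1
    for i in range(positions):
        for j in range(positions):
            counts[i:i + kernel_size, j:j + kernel_size] += 1
    return counts

counts = coverage_counts(10, 3)
print(counts[0, 0])  # corner pixel: 1
print(counts[0, 5])  # edge pixel: 3
print(counts[5, 5])  # inner pixel: 9
```

A corner pixel contributes to a single output value, while an inner pixel contributes to nine.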

To make sure that the corners of the image are processed as well as the inner parts, one solution is to surround the original image matrix with a new border, like below. This is called padding.
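As a small sketch of the idea, NumPy's `np.pad` can add such a border. Filling the border with zeros is an assumption here (it is the common choice, usually called zero padding); the image is a stand-in filled with ones so that the border is easy to see.

```python
import numpy as np

image = np.ones((10, 10))  # stand-in 10 x 10 image

# Add a one-pixel border of zeros on every side
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)  # (12, 12)
print(padded[:3, :3])
```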

So how many padding layers do we need to add, and how do we arrive at this number? Yes, this is also one of the parameters that we have to decide on.
From the examples above we see
output size = input size - filter size + 2 * padding size + 1.
Example: For a 10 x 10 input and a 3 x 3 filter with 0 padding, the output is 10 - 3 + 0 + 1 = 8.
So, to make the output the same size as the input image, we solve this equation for the padding size P.
Example: For a 10 x 10 input image, a 10 x 10 output image and a 3 x 3 kernel, 10 = 10 - 3 + 2P + 1, so the padding has to be P = (3 - 1)/2 = 1.
Hint: The larger the kernel, the larger the padding. 🙂
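The same arithmetic as a short sketch, assuming a stride of 1 and an odd kernel size. Both function names are illustrative and not taken from any library.

```python
def conv_output_size(input_size, filter_size, padding=0):
    """Output width/height of a convolution with stride 1."""
    return input_size - filter_size + 2 * padding + 1

def same_padding(filter_size):
    """Padding that keeps the output the same size as the input (odd kernels)."""
    if filter_size % 2 == 0:
        raise ValueError("this sketch only handles odd kernel sizes")
    return (filter_size - 1) // 2

for k in (3, 5, 7):
    p = same_padding(k)
    print(k, p, conv_output_size(10, k, p))
```

For a 10 x 10 input the loop prints paddings of 1, 2 and 3, and the output size stays at 10 in every case.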
Another parameter for convolutional layers is the stride size. The stride is the interval at which the kernel is applied. The default stride length is 1. Sometimes a stride of 2 or more is used for larger images.
The output size of the current layer is given by
output size = (input size - filter size + 2 * padding size) / stride size + 1
and the depth of the output is the same as the number of kernels applied. This helps to decide the number of neurons required for the next layer.
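A hedged sketch of the output-shape calculation with stride included. It assumes square inputs and kernels, and it rounds the division down when the kernel does not fit an exact number of times, which is a common convention rather than something stated in this article. `conv_output_shape` is a hypothetical helper name.

```python
def conv_output_shape(input_size, filter_size, padding, stride, num_kernels):
    """Return (height, width, depth) of a convolutional layer's output."""
    # Floor division: positions where the kernel would overhang are dropped
    size = (input_size - filter_size + 2 * padding) // stride + 1
    return (size, size, num_kernels)

print(conv_output_shape(10, 3, padding=0, stride=1, num_kernels=4))   # (8, 8, 4)
print(conv_output_shape(10, 3, padding=1, stride=2, num_kernels=16))  # (5, 5, 16)
```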
Thus the convolution operation helps CNN find feature patterns regardless of where the feature is located in the image.
ReLU:
As in a fully connected neural network, we need to pass the output of the convolution through an activation function. ReLU (rectified linear unit), which replaces negative values with zero, is the most commonly used activation function for the convolution layer.
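ReLU itself is a one-liner. A minimal NumPy sketch, with an arbitrary example feature map:

```python
import numpy as np

def relu(x):
    """Keep positive values, replace negative values with zero."""
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 1.5],
                        [0.0, -0.5]])
rectified = relu(feature_map)
print(rectified)
```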
II. Pooling Layer
The rectified output of the convolution layer is then passed to the pooling layer. Pooling is used to reduce the size of the input matrix to the subsequent layer. Max pooling is the most commonly used pooling method.
The feature map is scanned in 2 x 2 windows and the maximum of the four pixel values in each window is chosen. 2 x 2 is the most common size chosen for pooling.

Pooling is done to keep the most prominent features of the feature map. Other pooling techniques include average pooling and min pooling, but max pooling is the one most commonly used in CNN.
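A small sketch of 2 x 2 max pooling on a single feature map. It assumes non-overlapping windows (a stride of 2) and drops a trailing odd row or column; `max_pool_2x2` is a name chosen for this example.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Take the maximum of every non-overlapping 2 x 2 block."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2  # drop an odd trailing row/column
    # Split rows and columns into (block index, position inside block)
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 5],
                        [7, 0, 3, 3],
                        [1, 2, 8, 4]])
pooled = max_pool_2x2(feature_map)
print(pooled)
# [[4 5]
#  [7 8]]
```

The 4 x 4 feature map shrinks to 2 x 2, keeping only the strongest response in each window.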
Dropout:
Like the feed-forward network, CNN also uses dropout, which randomly disables a fraction of the neurons during training. Dropout is used after pooling layers to avoid overfitting.
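One common way to implement dropout is the "inverted" variant sketched below, where the surviving values are scaled up during training so that nothing needs to change at prediction time. The article does not say which variant is meant, so treat this as an assumption. The function name and the rate of 0.25 are illustrative.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations (inverted dropout)."""
    if not training or rate == 0.0:
        return activations  # dropout is only active during training
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = keep this value
    # Scale the survivors so the expected activation stays the same
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
pooled = np.ones((4, 4))  # stand-in for the output of a pooling layer
dropped = dropout(pooled, rate=0.25, rng=rng)
print(dropped)
```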
III. Fully Connected Layer
CNN can contain multiple convolution and pooling layers. Finally, the output of the last pooling layer of the network is flattened and given to the fully connected layers. The number of hidden layers and the number of neurons in each hidden layer are parameters that need to be defined. A sigmoid (for two classes) or softmax (for multiple classes) activation function is used at the output of these layers to produce the class probabilities.
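As an illustration of the final step, here is a minimal softmax that turns raw output scores into class probabilities. The scores are made-up numbers, and subtracting the maximum is a standard trick for numerical stability that does not change the result.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])  # hypothetical scores for 3 classes
probs = softmax(scores)
print(probs)
print(probs.sum())
```

The class with the highest score gets the highest probability, and all probabilities are positive.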
Example:
Here is an example network that is used to classify the MNIST data. The network uses a rectified convolutional layer with 32 ‘3×3’ kernels, followed by another rectified convolutional layer with 64 ‘3×3’ kernels, the output of which is given to a max-pooling layer with a ‘2×2’ pool size, followed by a dropout layer and a feed-forward network. Softmax is used to output the probability of the classes.
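The shapes flowing through this network can be traced with plain arithmetic. The sketch below assumes 28 x 28 grayscale MNIST images, no padding and a stride of 1 in both convolutional layers; the size of the feed-forward part is not given in the text, so the trace stops at the flattened vector. All helper names are made up for this example.

```python
def conv_shape(shape, num_kernels, kernel_size):
    """Output shape of a convolution with no padding and stride 1."""
    h, w, _ = shape
    return (h - kernel_size + 1, w - kernel_size + 1, num_kernels)

def pool_shape(shape, pool_size):
    """Output shape of non-overlapping pooling."""
    h, w, c = shape
    return (h // pool_size, w // pool_size, c)

def conv_params(in_channels, num_kernels, kernel_size):
    """Weights plus one bias per kernel."""
    return (kernel_size * kernel_size * in_channels + 1) * num_kernels

input_shape = (28, 28, 1)                # one grayscale MNIST image
conv1 = conv_shape(input_shape, 32, 3)   # (26, 26, 32)
conv2 = conv_shape(conv1, 64, 3)         # (24, 24, 64)
pooled = pool_shape(conv2, 2)            # (12, 12, 64)
flattened = pooled[0] * pooled[1] * pooled[2]

print(conv1, conv_params(1, 32, 3))      # (26, 26, 32) 320
print(conv2, conv_params(32, 64, 3))     # (24, 24, 64) 18496
print(pooled, flattened)                 # (12, 12, 64) 9216
```

Dropout does not change the shape, so the feed-forward network receives a vector of 9216 values.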

Summary
We learned the different components of a convolutional neural network, what each of them is used for, and how they work both independently and together.