Well, who hasn’t heard of the terms Data Science and Deep Learning before? I assume that since you are on this platform, you have 😉 Modern deep learning and computer vision rely heavily on convolutions. In fact, convolutional neural networks are the building blocks of most modern computer vision pipelines, whether it’s object detection, classification or segmentation. But have you ever wondered how these mathematical operations make sense of images? How do they know whether an image contains a particular feature? If yes, then you are in for a treat.

In this blog post, we deal only with convolution and how stacking convolutional layers helps us in feature extraction. This tutorial is divided into two parts:

• Convolution – The mathematics
• An example of feature extraction using convolutional filters

## Convolution – a little mathematical perspective

Those of you who have taken some mathematics, physics or signal processing courses will be quite familiar with the term convolution. Wikipedia states it as: “In mathematics (in particular, functional analysis) convolution is a mathematical operation on two functions (f and g) that produces a third function expressing how the shape of one is modified by the other.” Convolution is the process of integrating the product of two functions after one of them has been flipped and shifted. Suppose we have two functions f and g; then the convolution of f and g is defined as

Fig 1. The convolution integral (taken from Convolution – Wikipedia)
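
In case the figure does not render, the standard textbook form of the convolution integral can be written as below. The symbols t (the shift) and τ (the integration variable) are the usual notation and are assumed here, not taken from the figure:

$$
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
$$

The term g(t − τ) is where the flip (the minus sign on τ) and the shift (by t) both show up.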

This seems fairly complicated at first look, but convolution on computers doesn’t actually work this way, because we don’t have continuous functions f and g. What we have are discrete sequences of numbers representing two signals or functions. So for discrete sequences, the convolution integral changes to a sum:

Fig 2. Discrete convolution (taken from Convolution – Wikipedia)
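
As a reference alongside the figure, the standard discrete form replaces the integral with a sum. The index names n and m are conventional choices assumed here:

$$
(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]
$$

For finite sequences, the sum only runs over the indices where both f[m] and g[n − m] exist.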

This seems fairly simple, right? The crux lies in multiplying two sequences term by term, adding the results, then moving one sequence one step ahead and repeating.

Fig 3. Convolution integral – a visual approach (taken from Convolution – Wikipedia)

The above math may have been tricky for some, but if you remember the crux, you are good to go.
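
The crux can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the function name `convolve_1d` and the two example sequences are made up for this post, and `np.convolve` is used only as a cross-check:

```python
import numpy as np

def convolve_1d(f, g):
    """Full discrete convolution: out[n] = sum over m of f[m] * g[n - m]."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    out = np.zeros(len(f) + len(g) - 1)
    for n in range(len(out)):
        for m in range(len(f)):
            # Only add terms where the index n - m falls inside g.
            if 0 <= n - m < len(g):
                out[n] += f[m] * g[n - m]
    return out

f = [1, 2, 3]
g = [0, 1, 0.5]

manual = convolve_1d(f, g)
reference = np.convolve(f, g)  # NumPy's built-in full convolution

print(manual)     # [0.  1.  2.5 4.  1.5]
print(reference)  # same values
```

The double loop is slow but mirrors the definition directly, which is the point here.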

Now you may be wondering: we are talking about one-dimensional sequences, so how do we scale this to two-dimensional images? If so, you are on the right track, but first we need to clear one thing out of the way. Convolution, as defined in mathematics, is not exactly what deep learning uses. The reversal (flipping) of one of the sequences is not done; strictly speaking, the operation is cross-correlation. In deep learning this difference can be ignored for practical purposes, because the filter weights are learned anyway, so a flipped filter is just a different set of weights. For easy understanding, just take convolution as a sliding sum of products of two sequences, in which one is shifted at every step.
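
To see the difference concretely, here is a small sketch using NumPy’s `np.convolve` (which flips the kernel) and `np.correlate` (which does not). The signal and kernel values are arbitrary examples:

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])

# True convolution: the kernel is flipped before sliding.
true_convolution = np.convolve(signal, kernel, mode="valid")

# Cross-correlation: the kernel slides as-is (what deep learning calls "convolution").
cross_correlation = np.correlate(signal, kernel, mode="valid")

# Convolving with the flipped kernel reproduces the cross-correlation.
flipped = np.convolve(signal, kernel[::-1], mode="valid")

print(true_convolution)   # [2. 2.]
print(cross_correlation)  # [-2. -2.]
print(flipped)            # [-2. -2.]
```

For a symmetric kernel the two operations give identical results, which is one more reason the distinction rarely matters in practice.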

### Convolution of matrices

Suppose we have a 4×4 matrix and a 3×3 matrix. To find their convolution, the smaller 3×3 matrix is placed over the 4×4 matrix so that their top-left corners coincide. All the overlapping elements are multiplied pairwise and the products are added together to give the first element of our convolution matrix. Next, the 3×3 matrix is shifted one step to the right and the same process is repeated, then it is moved down a row, and so on, until every position has been covered. Here that gives a 2×2 result.

Fig 4. Convolution of two matrices. The blue one is the 4×4 matrix, the grey one is the 3×3 matrix and the green is the final convolution result matrix. (taken from Convolution arithmetic)
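
The sliding procedure can be sketched as follows. This is a minimal version with no padding and a step of one; the helper name `correlate2d_valid` and the example matrices are assumptions for illustration, and the values in Fig 4 may differ:

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no flip, no padding, step of 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the overlapping elements and add them up.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

big = np.arange(16, dtype=float).reshape(4, 4)  # example 4x4 matrix: 0..15
small = np.ones((3, 3))                         # example 3x3 matrix of ones

result = correlate2d_valid(big, small)
print(result.shape)  # (2, 2)
print(result)
# [[45. 54.]
#  [81. 90.]]
```

With a kernel of ones, each output element is simply the sum of the 3×3 patch under it, which makes the result easy to check by hand.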

We all know that an image is represented by a 2D matrix (grayscale) or a 3D matrix (colour, with one 2D matrix per channel), where each value denotes the intensity of a pixel in that channel. So the above convolution method can be applied to images too.

Fig 5. A numerical representation of images
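
As a rough sketch of what this looks like in code, assuming NumPy arrays with 8-bit intensities (0 to 255); both tiny images here are made up for illustration:

```python
import numpy as np

# A hypothetical 5x5 grayscale image: one intensity value per pixel.
gray = np.zeros((5, 5), dtype=np.uint8)
gray[2, :] = 255  # a bright horizontal line in the middle row

# A hypothetical 5x5 colour image: height x width x 3 channels (R, G, B).
rgb = np.zeros((5, 5, 3), dtype=np.uint8)
rgb[..., 0] = gray  # put the line in the red channel only

print(gray.shape)  # (5, 5)
print(rgb.shape)   # (5, 5, 3)
```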

So if you have stuck with the blog till now, congratulations!! We have covered all the technical details required for the upcoming parts.

## Convolution in Image Processing

Deep Learning really became popular in 2012, when Alex Krizhevsky, along with Ilya Sutskever and Geoffrey Hinton, demonstrated AlexNet, a convolution-based deep neural network architecture that outperformed the then state of the art in image recognition. But what about before 2012? Weren’t there any image processing algorithms? Of course there were!! There were tons of algorithms suited to different tasks, but the general feature of most of them was that they used convolution with different manually defined matrices to extract particular features present in the image.

I will present it more clearly using an example. Suppose we have a 5×5 input image as shown in Fig 6, and we want to detect the cross in this image. Owing to the simplicity of the image, we can clearly see there is a cross, but how do we automate this?

Fig 6. An example input image

We know that a cross contains two perpendicular lines, which are highlighted in Fig 7. Detecting these lines in an image suggests the possible presence of a cross. These horizontal and vertical lines, on which the presence of an object in an image depends, are called features.

Fig 7. The features of an image

#### Line Detection

For the time being, assume we have a 3×3 matrix (usually known as a filter or kernel) as shown in the figure below. The values of the elements of such filters were originally derived empirically. This filter is known as the Vertical Line Detector.

Fig 7. Vertical Line Detector

Convolving it with the input image, we get the output shown below:

Fig 7. Vertical Line Detector Output
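
Here is a sketch of this step in code. The figures are not reproduced as numbers in the text, so both the 5×5 cross image and the filter values below are assumptions: the filter is a commonly used vertical line kernel, not necessarily the exact one in the figure.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no flip, no padding, step of 1)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Assumed 5x5 input: a cross made of the middle row and the middle column.
cross_image = np.array([
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
], dtype=float)

# Assumed vertical line detector: rewards a bright centre column.
vertical_filter = np.array([
    [-1, 2, -1],
    [-1, 2, -1],
    [-1, 2, -1],
], dtype=float)

vertical_output = correlate2d_valid(cross_image, vertical_filter)
print(vertical_output)
# [[-2.  4. -2.]
#  [-2.  4. -2.]
#  [-2.  4. -2.]]
```

The middle column stands out with the value 4, while the positions on either side are negative.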

Now we can clearly see that the convolution process yielded an image where only the vertical line in the middle was highlighted.

Similarly, if we transpose the above filter, we get a horizontal line detector, which under convolution will detect horizontal lines in an image.

a)

b)

Fig 8. a) The horizontal line detector filter. b) The convolution output
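
The transpose trick can be sketched like this. As before, the cross image and the filter values are assumptions chosen for illustration and may not match the figures exactly:

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no flip, no padding, step of 1)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Assumed 5x5 cross image (middle row and middle column set to 1).
cross_image = np.zeros((5, 5))
cross_image[2, :] = 1
cross_image[:, 2] = 1

# Assumed vertical line detector, then its transpose.
vertical_filter = np.array([[-1, 2, -1]] * 3, dtype=float)
horizontal_filter = vertical_filter.T

horizontal_output = correlate2d_valid(cross_image, horizontal_filter)
print(horizontal_filter)
# [[-1. -1. -1.]
#  [ 2.  2.  2.]
#  [-1. -1. -1.]]
print(horizontal_output)
# [[-2. -2. -2.]
#  [ 4.  4.  4.]
#  [-2. -2. -2.]]
```

This time the middle row lights up, since the transposed filter responds to horizontal structure.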

The detection of both horizontal and vertical lines in an image suggests the presence of a cross, but not every image having such lines contains a cross. We need to detect more complex features in an image to make an accurate assessment of the object present.

So what is the feature that tells us there is definitely a cross in the image?

Fig 9. The most vital feature in detecting a cross: the point where two lines intersect perpendicularly

It is evident that we need to detect the above pattern rather than just a set of horizontal and vertical lines. This can be done if we use a more complex feature detector or filter. So far we have seen that convolution can help us extract various features from an image, and with more complex convolution filters we can extract more complex features and detect more complex things in an image.

a)

b)

Fig 10. a) The cross detection filter. b) The convolution output
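
A sketch of this idea follows. The filter below is a hypothetical cross-shaped kernel picked for this example (positive on the cross pattern, negative in the corners); the actual filter in Fig 10 may use different values.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no flip, no padding, step of 1)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Assumed 5x5 cross image (middle row and middle column set to 1).
cross_image = np.zeros((5, 5))
cross_image[2, :] = 1
cross_image[:, 2] = 1

# Hypothetical cross detector: matches a "+" shape, penalises bright corners.
cross_filter = np.array([
    [-1, 1, -1],
    [ 1, 1,  1],
    [-1, 1, -1],
], dtype=float)

cross_output = correlate2d_valid(cross_image, cross_filter)
print(cross_output)
# [[-1.  1. -1.]
#  [ 1.  5.  1.]
#  [-1.  1. -1.]]

# The strongest response sits where the two lines intersect.
peak = np.unravel_index(np.argmax(cross_output), cross_output.shape)
print(peak)
```

The centre value of 5 is well above every other response, so the intersection is easy to pick out.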

It eventually boils down to how we choose the filter weights, that is, the elements of the matrices. We will discuss this a little later. For the time being, let’s talk about activation functions.

#### Activation Functions

Activation functions are used to add non-linear behaviour to our models, and they also help in better feature extraction. There are quite a few activation functions, like ReLU, Tanh, Leaky ReLU, etc., each with its own pros and cons. For example, one of the most widely used activation functions is ReLU. It makes the final neural network function piecewise linear and is not differentiable at zero, which is not ideal. However, it is cheap to compute and helps mitigate the vanishing gradient problem, which reduces training time, so it is widely used to build high-performance neural networks.

Fig 11. Various activation functions (taken from Ruosen Lee’s blog)
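
For reference, the three activation functions named above can be sketched in NumPy as follows. The slope `alpha=0.01` for Leaky ReLU is a common default and is an assumption here, not a value taken from the figure:

```python
import numpy as np

def relu(x):
    """ReLU: keep positive values, replace negatives with zero."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: like ReLU, but negatives are scaled by a small slope."""
    return np.where(x > 0, x, alpha * x)

def tanh(x):
    """Tanh: squashes values smoothly into the range (-1, 1)."""
    return np.tanh(x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # [0. 0. 3.]
print(leaky_relu(x))  # [-0.02  0.    3.  ]
print(tanh(x))
```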

As already seen above, the convolution outputs of all the different filters mark the main feature with high-magnitude pixels, but there are other pixels with different values (in our case they are negative, but that is not always so). So for accurate feature detection, we need to deal with such stray pixels too, and activation functions help us do that.

Let’s take a look at what happens if we pass our convolution outputs in all three cases through a ReLU activation function:

Fig 12. The effect of ReLU activation on convolution outputs

The above images clearly demonstrate the effect of ReLU activation in feature detection. It highlights the prominent part of each image, which helps in further classification and identification.
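
The same effect can be reproduced numerically. This sketch uses an assumed 5×5 cross image and three assumed filters (vertical, horizontal and a hypothetical cross detector), so the numbers may differ from those in the figures:

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no flip, no padding, step of 1)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Replace every negative value with zero."""
    return np.maximum(0.0, x)

# Assumed 5x5 cross image (middle row and middle column set to 1).
cross_image = np.zeros((5, 5))
cross_image[2, :] = 1
cross_image[:, 2] = 1

# Assumed filters; the ones in the figures may use other values.
filters = {
    "vertical": np.array([[-1, 2, -1]] * 3, dtype=float),
    "horizontal": np.array([[-1, 2, -1]] * 3, dtype=float).T,
    "cross": np.array([[-1, 1, -1], [1, 1, 1], [-1, 1, -1]], dtype=float),
}

activated = {name: relu(correlate2d_valid(cross_image, k))
             for name, k in filters.items()}

for name, out in activated.items():
    print(name)
    print(out)
```

After ReLU, the negative stray values are zeroed out, leaving only the positions where each filter responded positively.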

So we have finally understood how convolution and activation functions play a major role in the working of convolutional neural networks. These filters are stacked one upon the other to capture more complex features. It is generally the case that the initial convolution layers in a network learn simple features in an image, like horizontal or vertical lines, and the deeper layers use these extracted features to learn more complex features like eyes, ears, etc. Finally, the last layers combine all this information to predict whether there is a face in an image or not.

Fig 13. The visualization of convolution filters of VGG-16. More complex features are learned in deeper layers, as seen clearly. (taken from ResearchGate)

But you may have noticed that the weights (or elements) of the filters used in the above examples were chosen by hand. For more complex features and larger images, choosing them manually quickly becomes intractable. This is where the power of Deep Learning comes in.

#### Conclusion:

“Learning” in Deep Learning actually refers to learning the weights of the filters automatically using gradient descent. I won’t go deep into gradient descent in this post, but be on the lookout for it in the future! The general scheme of things in deep learning is to randomly initialize the weights of the network and make predictions using those random weights. Each prediction is then evaluated against a true label, that feedback is routed back through the network (backpropagation), and the weights of the network are adjusted to make more accurate predictions.

Fig 14. The normal Machine Learning flow (taken from Deep Learning Italia)
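
To make this loop a little more tangible, here is a toy sketch that learns a single 3×3 filter with plain gradient descent. Everything in it is an assumption made for illustration: the “true” filter, the random training images, the squared-error loss, the learning rate and the number of steps. Real networks use frameworks with automatic differentiation rather than a hand-written gradient.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no flip, no padding, step of 1)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)

# The filter we pretend not to know (assumed vertical line detector).
true_filter = np.array([[-1, 2, -1]] * 3, dtype=float)

# Random 5x5 training images and their "true labels" (the true filter's output).
images = rng.normal(size=(20, 5, 5))
targets = np.array([correlate2d_valid(img, true_filter) for img in images])

# Step 1: random initialisation of the weights.
weights = rng.normal(scale=0.1, size=(3, 3))
learning_rate = 0.05

for step in range(300):
    grad = np.zeros_like(weights)
    for img, target in zip(images, targets):
        # Step 2: predict with the current weights and compare to the label.
        error = correlate2d_valid(img, weights) - target
        # Step 3: gradient of 0.5 * sum(error**2) with respect to the weights.
        grad += correlate2d_valid(img, error)
    # Step 4: nudge the weights against the average gradient.
    weights -= learning_rate * grad / len(images)

print(np.round(weights, 3))
```

Starting from random numbers, the weights end up very close to the filter that generated the labels, which is the same idea a convolutional network applies to millions of weights at once.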

With this, I would like to wrap up the current blog post. This post only scratches the surface of Deep Learning, and many more details could be covered. I hope you enjoyed reading it; feel free to make suggestions so that I can further improve the quality of these posts. Hoping to see you again next time. Till then, take care and keep learning!!!