Convolutional Neural Network (CNN): An Idiotic Guide

Akash Manna
16 min read · May 18, 2021


Cover Photo

Introduction:

Have you ever wondered how Facebook can automatically recognise who is present in a photo you uploaded, by drawing boxes around their faces?!

No?

Then you are in the right place to discover how Facebook accomplishes that.

In one word, this has been made possible by the Artificial Intelligence running on the supercomputers deployed by Facebook.

Artificial Intelligence (AI) has been witnessing monumental growth in bridging the gap between the capabilities of humans and machines.

Computer Vision is the go-to field for bridging that gap.

Computer Vision combined with Deep Learning is a revolutionary blend that enables machines to view the world as humans do, perceive it in a similar manner, and even use that knowledge for a multitude of tasks such as image and video recognition, image analysis and classification, media recreation, recommendation systems, natural language processing, etc.

>> And this 'revolutionary blend' is called the Convolutional Neural Network (CNN).

Convolutional networks (LeCun, 1989), also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like topology. An example is image data, which can be interpreted as a 2-D grid of pixel values.

Convolutional neural networks, also called ConvNets, were first introduced in the 1980s by Yann LeCun, a postdoctoral computer science researcher.

Before I dive into CNNs in one stroke, let me start with what the name suggests and why we need it in the first place.

So, as the name suggests, 'Convolution' is added to a vanilla Neural Network (NN):

2 = 1 + 1 (SIMPLE!)

But wait… what is an NN??!! And what does Convolution stand for??

Okkkh… don't freak out… we will answer all these questions one by one, step by step, in this blog.

So, stay tuned!!!

Brief Introduction to Neural Networks (NN):

In short, an NN is a subset of both Machine Learning (ML) and Deep Learning (DL),

NN is included in both ML and DL.

where the machine (here, our computer) implements the following steps:

(1) Extracts and creates new features by combining the existing features or dropping some of them.

(2) Learns patterns among the features in our data set.

(3) Gives the predicted output once training is finished and the network is fed with unseen data.

And all of this is accomplished by our machine by mimicking the neural structure of our brain.

A typical NN looks like this:

A typical (machine) NN

Look: here, every node in the current layer is connected to each and every node in the very next layer, mimicking our brain's neural network, where neurons are connected to the neurons in the next layer.

But although each node in the current layer is connected to each and every node in the next layer, all the neurons are not activated at the same time. Specific neurons are activated only when required, and that 'requirement' is decided by the input and the parameters associated with those neurons.

Scratching your head? ☺

Let's understand this through our brain's neural network:

Suppose you accidentally touch a hot object with your hand… what will happen? You will feel pain in your hand, not in your leg. Although all the neurons in your body are connected to each other, the neurons in your hand get the input (sensation of heat = 1), but the neurons in your leg do not (no sensation of heat = 0); that's why the leg neurons are not activated.

Exactly the same thing happens in the case of an NN.

Some Technical Stuff:

According to the Universal Approximation Theorem (UAT) (in layman's language: no matter how complex the decision boundary (for classification problems) or the best-fit line (for regression problems) is, it can be approximated by a finite number of nodes in an ANN), you can feed the pixel values of an image to a vanilla NN, train it, and get the output.

In particular, for each pixel in the input image, we encode the pixel's intensity as the value of a corresponding neuron in the input layer. For instance, if you are using 28×28-pixel images, the NN should have 784 (=28×28) neurons in the input layer. We then train the network's weights and biases so that the network's output will, we hope, correctly identify the input image.
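To make this concrete, here is a minimal sketch (my own illustration, not code from this article) of such a vanilla fully connected NN in Keras, fed with flattened 28×28 images; the layer sizes are arbitrary example choices:

import tensorflow as tf

# Vanilla fully connected NN: 784 input values -> hidden layer -> output classes
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # 28x28 pixels -> 784 input neurons
    tf.keras.layers.Dense(128, activation="relu"),    # hidden layer (size chosen arbitrarily)
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g. 10 digit classes
])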

Then why do we need a CNN for image-related tasks in the first place?

Because, intuitively, it is unfair to use a fully connected NN for image classification: such a model does not take into account the spatial structure of the image. For instance, pixels that are close together and pixels that are far apart are treated on exactly the same footing. That spatial information is a defining factor for image classification.

One of the model architectures that takes care of the spatial structure of an image is the 'Convolutional Neural Network'.

So, now WHAT!!!

Somehow, we have to extract features from an unstructured data set (viz. an image) and make the image's pixel values resemble the feature values of a structured data set.

That 'somehow' is what we call 'Convolution', in technical terms.

Once you understand what the convolution layer is meant for, you are ready to dive into the sea of CNNs.

Happy 'Deep' diving!!!

Convolutional Neural Networks (CNN):

A formal definition of a CNN would be:

Convolutional Neural Networks are good at pattern recognition and feature detection, which is especially useful in image classification. You can improve the performance of a CNN through hyper-parameter tuning, adding more convolution layers, adding more fully connected layers, or providing more correctly labeled data to the algorithm.

A CNN is built from the following layers:

(1) Convolution Operation Layer

(2) Pooling Layer

(3) Flattening Layer

(4) Fully connected NN Layer

All the layers of a CNN are depicted here: Convolution, Pooling, Flatten, and Fully Connected layers.
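As a quick preview, here is a minimal Keras sketch (an assumption of mine, not the author's exact model) that stacks these four building blocks in order:

import tensorflow as tf

cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                           input_shape=(64, 64, 3)),   # (1) convolution layer (+ ReLU activation)
    tf.keras.layers.MaxPooling2D((2, 2)),              # (2) pooling layer
    tf.keras.layers.Flatten(),                         # (3) flattening layer
    tf.keras.layers.Dense(64, activation="relu"),      # (4) fully connected layer
    tf.keras.layers.Dense(1, activation="sigmoid"),    # output: e.g. cat vs dog
])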

(1) Convolution Operation Layer:

This is the very first layer of the CNN, where the raw image is fed in.

This is the layer where features (i.e., information about the types of edges and the different shapes present in an image) are extracted; that's why this layer is sometimes called the feature-extractor layer.

But how does feature extraction happen? That follows below.

Suppose this is my given image:

A raw/unstructured image of a smiley, with its corresponding pixel values (0 or 1).

Convolution is a special kind of mathematical operation in which two functions are convolved, returning a number that measures how the shape of one is modified by the other.

The 'input image' is modified into a 'Feature Map' by the 'Feature Detector'. Before convolution, the input image has shape (7×7). After convolution with a feature detector of shape (3×3), the convolved/output map has shape (5×5), so the size is reduced.

For image recognition, we convolve the input image with a Feature Detector (also known as a Kernel or Filter) to generate a Feature Map (also known as a Convolved Map or Output Feature Map). This reveals patterns in the image and also compresses it for easier processing. The Feature Map is generated by element-wise multiplication of the kernel with the corresponding image patch, followed by summation.

Example of the convolution operation: the moving (3×3) matrix is called the Kernel/Filter; it performs element-wise multiplication and adds the results to get a single number.
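Here is a rough NumPy sketch of that "slide, multiply element-wise, then add" operation (strictly speaking it is cross-correlation, which is what deep-learning libraries implement under the name convolution); the function name is my own:

import numpy as np

def convolve2d_valid(image, kernel):
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))       # e.g. 7x7 image, 3x3 kernel -> 5x5 map
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]      # current local receptive field
            out[i, j] = np.sum(patch * kernel)     # element-wise multiply, then add
    return out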

If we want to detect n different features in an image, n filters are applied to the image, and the resulting maps are stacked channel-wise to form the output feature map / convolved volume (as done in the last image).

After applying n kernels to the image, we get n feature maps, which are stacked to form the final output volume.

Local receptive fields:

As usual, we'll connect the input pixels to a layer of hidden neurons. But we won't connect every input pixel to every hidden neuron. Instead, we only make connections in small, localized regions of the input image: each pixel in the output feature map is connected to a (3×3) patch of pixels in the input image.

One output neuron in the hidden layer is connected to (3×3) neurons in the input image.

The region that the kernel covers at a given moment on the input image is called the Local Receptive Field of the corresponding hidden neuron. Here the size of the receptive field is (3×3).

Look at the input image: it has a 10×10 pixel size, but the output feature map has dimensions 8×8. This is because we can only move the local receptive field 7 neurons across (or 7 neurons down) before colliding with the right-hand side (or bottom) of the input image, which gives 10 - 3 + 1 = 8 positions in each direction.

Shared Weights and Biases:

So far we have created hidden neurons, and each hidden neuron has a bias and (3×3) weights. But the values of those weights and that bias are going to be the same for all the (8×8) hidden-layer neurons.

Since the (8×8) hidden-layer neurons use the same weights and bias, we can think of the weights and bias as being shared among all the (8×8) hidden-layer neurons.

This means that all the neurons in the hidden layer detect exactly the same feature, just at different locations in the input image.

To visualize this, suppose your weights and bias are such that they detect a slanted line in the local receptive region. Since the same weights and bias are applied over all the local receptive fields, the (8×8) hidden-layer neurons report whether a slanted line is present at different locations of the input image.

To put it in slightly more abstract terms, convolutional networks are well adapted to the translation invariance of images: move a picture of a cat (say) a little ways, and it’s still an image of a cat .

For this reason, we call these 64 (=8×8) hidden-layer neurons a feature map, and the weights used are called shared weights.

One of the big advantages of using shared weights and biases is a great reduction in the number of parameters in the model.

If you have a (10×10) input image, you need (3×3) = 9 weights and 1 bias, so the total number of parameters is 10 (=9+1). If you have 30 such output feature maps, each representing a unique feature, then you need a total of 300 (=10×30) parameters during training of your model.

By comparison, suppose we had a fully connected first layer with 100 = 10×10 input neurons and a relatively modest 30 hidden neurons. Then you would have 100×30 = 3000 weights, and each hidden neuron has 1 bias, so there are 30 biases in total. The total number of parameters is therefore 3030 (=3000+30).

In other words, the fully-connected layer would have more than 10 times as many parameters as the convolutional layer.
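The arithmetic above can be checked in a couple of lines (illustrative only, using the same numbers as in the text):

conv_params = (3 * 3 + 1) * 30      # 9 shared weights + 1 shared bias, for 30 feature maps
fc_params   = (10 * 10) * 30 + 30   # 3000 weights + 30 biases

print(conv_params, fc_params)       # 300 3030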

Activation Layer:

A Convolution Layer is always followed by an ACTIVATION LAYER.

But why is it necessary to apply an ACTIVATION LAYER just after the Convolution Layer?

There are many reasons; some of them are:

(1) Once the output feature maps are generated, not all the pixel values in all the feature maps may be informative. So, to reduce (or even discard) the values of some pixels, an ACTIVATION LAYER is applied just after the Convolution Layer.

(2) Because a convolution followed by a convolution is still just a single convolution; without an activation in between, the stack is fundamentally equivalent to a convolutional network with only one layer. To see this, suppose the output of a convolution operation is given as:

y = (m2*x + b2)
x=input
m2 = weight
b2 = bias

Now, if we don't apply any activation function, the output of the next convolution layer will be:

y' = m1*(m2*x + b2) + b1

   = m1*m2*x + m1*b2 + b1

So y' is again just a linear function of x, of exactly the same form as y.

The whole stack of convolution layers will act like a single convolution layer.

To avoid this collapse, we must apply an activation function so that non-linearity is introduced.

The Activation Layer (where the activation function is applied) decides which pixel values in the output feature map will be kept and which will be reduced or discarded.

There are many activation functions (viz. Sigmoid, Softmax, tanh, etc.); the most famous one is ReLU.

ReLU:

The formal definition of ReLU is:

ReLU(x) = max(0, x): it returns any positive input unchanged and maps any negative input to 0.

Before ReLU:

After ReLU:

In the above example, the ReLU operation removed the black pixels, so there are fewer white-to-gray-to-black transitions; borders now have more abrupt pixel changes.

>> It discards all the negative pixel values in the input feature map.

!!!

But hey, wait!!! Why is ReLU so famous among all the other activation functions?

>> One of the main reasons is that it does not suffer from the vanishing-gradient problem during back-propagation (simply because the gradient of ReLU is unity for positive inputs).

Leaky ReLU: a modified ReLU that has a small non-zero output for negative input arguments.
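A tiny NumPy sketch of both (the 0.01 slope for Leaky ReLU is just a common example value):

import numpy as np

def relu(x):
    return np.maximum(0, x)               # negatives -> 0, positives unchanged

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # negatives get a small non-zero output

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))        # [0. 0. 0. 2.]
print(leaky_relu(x))  # [-0.03 -0.005 0. 2.]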

Other Activation Functions:

Sigmoid: the formula for the sigmoid is:

sigmoid(z) = 1 / (1 + e^(-z))

The sigmoid function returns a value in [0, 1], i.e., it is a bounded function. For any negative input it returns a value < 0.5, and for any positive input it returns a value > 0.5. It has a probabilistic interpretation: in binary classification, its output can be read as the probability of belonging to the positive class.

Tanh: the formula for tanh is the following:

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

Tanh returns a value in [-1, 1]. It is a kind of scaled sigmoid function. It does not have a probabilistic interpretation, so it is usually used in the hidden layers.

But both tanh and sigmoid suffer from the vanishing-gradient problem (where the gradients become smaller and smaller as we go deeper into the network, so the earlier layers barely get updated).

As we go deeper, the value of the gradient keeps decreasing.
A comparison among different activation functions.
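To see why the gradients vanish, here is a small NumPy sketch (my own illustration) of sigmoid and tanh gradients; the sigmoid gradient never exceeds 0.25, and both shrink rapidly away from zero, so multiplying them through many layers drives the overall gradient towards zero:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)              # peaks at 0.25 when z = 0

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2      # peaks at 1 when z = 0

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid_grad(z))              # roughly [0.0066, 0.25, 0.0066]
print(tanh_grad(z))                 # roughly [0.00018, 1.0, 0.00018]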

(2) Pooling Layer:

In the third stage of a typical convolutional block (after convolution and activation), we use a pooling function to modify the output of the layer further.

But why do we need this in a CNN at all?

The POOLING LAYER further distills the feature maps (reduces their size) while preserving the spatial relationships of the pixels. Removing unnecessary information also helps prevent over-fitting.

>> In simple words, this layer is needed because it further reduces the size of the output feature map, which eventually makes computation more efficient while keeping the important information about the image.

There are different kinds of pooling functions; two of the most popular are described below.

(1) Max Pooling takes the largest value of each small grid in the feature map, creating a Pooled Feature Map. Max Pooling provides resilience against shifted or rotated features.

In all cases, pooling helps the output feature map to be invariant to small local translations. Local translation invariance is very useful when we care more about whether a feature is present at all than about exactly which pixel it is at. For example, when a CNN has to determine whether an image contains a face or not, it is more important to know whether there are eyes on both sides of the nose than to know their positions pixel-perfectly.

Example of max pooling. Kernel size (f×f) = (2×2).

(2) Average Pooling (sub-sampling) takes the average value of each small grid. It also helps give your neural network spatial invariance (the ability to find learned features in new images that are slightly shifted or distorted). A small code sketch of both pooling types follows below.

At position (2,1), the average value is shown as 5 instead of 5.3.
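Here is a rough NumPy sketch of (2×2) max pooling and average pooling with stride 2, matching the pictured examples (the function is my own illustration):

import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out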

Besides these two, there are other pooling functions too, such as Global Average Pooling and Sum Pooling.

(3) Flattening:

After the final feature map is generated, it must be fed to the fully connected neural network. To do so, the feature-map pixel values are flattened into a one-dimensional array, where each value in the array represents a feature value of the input image.

Flattening puts the values of the pooled feature-map matrix into a 1-D vector. This makes it easy for the image data to pass through an artificial neural network.
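In NumPy terms, flattening is just a reshape (a toy example of my own):

import numpy as np

pooled = np.array([[1, 2],
                   [3, 4]])
flattened = pooled.reshape(-1)   # -> [1 2 3 4], ready for the fully connected layers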

(4) Fully Connected NN:

This is where the flattened output of the convolutional part is fed through a classic artificial neural network. It's important to note that only this final part of the CNN is fully connected, whereas a regular ANN is fully connected from the very first layer.

Fully Connected NN.

Now, the wrap-up:

Image 1: belongs to the dog class.
Image 2: belongs to the cat class.

You are provided with a dataset consisting of 5,000 cat images and 5,000 dog images. We are going to train a machine-learning model to learn the differences between the two categories. The model will then predict whether a new, unseen image is a cat or a dog. The architecture is robust and can be used to recognize any number of image categories, if provided with enough data.

First, one of the 5,000 cat images is fed forward through the whole CNN, and the corresponding value of the loss function is calculated.

Once the loss is calculated, back-propagation begins. CNN back-propagation adjusts the weights of the neurons and, at the same time, the kernels that produce the feature maps. Both the kernel weights and the weights of the fully connected NN are adjusted in each epoch until convergence is achieved (i.e., the value of the loss function becomes close to 0).
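In Keras, this whole forward-pass / loss / back-propagation loop is typically driven by compile() and fit(); the following is a hedged sketch where cnn is the model sketched earlier and train_images / train_labels are placeholder names:

cnn.compile(optimizer="adam",
            loss="binary_crossentropy",   # cat vs dog is a binary problem
            metrics=["accuracy"])

cnn.fit(train_images, train_labels,
        epochs=25,                        # kernel and dense-layer weights are updated every epoch
        validation_split=0.1)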

Parameter Updates:

The weight updates are kicked off from the top layer, i.e., the prediction layer of the CNN.

In the fully connected layers, the weight update is done in the following way:

(1) The current weight is updated according to:

θ_j := θ_j - α · ∂L/∂θ_j

The θ_j's are the weights of the current layer, α is the learning rate, and each weight is updated by subtracting the gradient of the loss function with respect to it.

The back-propagation for the kernels of the convolution layer is done in a slightly different way:

It also uses a convolution operation, this time between the input to that layer and the gradient of the loss with respect to the layer's output, in the following way:

The x_{i,j} are the inputs to this particular layer, and dL/dO is the gradient of the loss flowing back from the layer above; convolving the two gives the gradient with respect to the kernel.
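As a sketch (reusing the convolve2d_valid function from the convolution section, with made-up shapes): for a 5×5 input and a 3×3 kernel, the layer output is 3×3, and convolving the input with the 3×3 output gradient gives a 3×3 array, exactly the shape of the kernel gradient:

import numpy as np

layer_input = np.random.rand(5, 5)              # the x_{i,j} values
dL_dO       = np.random.rand(3, 3)              # gradient flowing back from the layer above
dL_dK = convolve2d_valid(layer_input, dL_dO)    # shape (3, 3): one gradient per kernel weight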

Suppose now that the parameters have been updated to their optimal values by running back-propagation in every epoch.

Decision Making:

When it is time for the CNN to make a decision between cat and dog, the final-layer neurons give the probability of the image being a cat or a dog (or any other categories you show it). The network weighs these votes according to the best weights it has determined through back-propagation.

In this case, a cat image is fed to the CNN, and the final output layer gives a probability of 0.79 for the cat category and 0.21 for the dog category. So the image is assigned to the cat class, which is a correct prediction.

ADD-ON:

Data Pre-processing:

This step is applied to the training images while feeding them to the model built through the CNN pipeline.

Data augmentation can generate tons of new images by applying random modifications to existing images: shearing, zooming, rotating, stretching, etc.

This step modifies the images to prevent over-fitting; data augmentation makes our model more robust and better at generalizing.

Through data augmentation, tons of new images can be generated. In this image, the original butterfly is randomly augmented into 4 other versions of the original image.
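One common way to do this in Keras is ImageDataGenerator; the parameter values below are arbitrary example settings, not the author's:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # pixel scaling
    shear_range=0.2,         # shearing
    zoom_range=0.2,          # zooming
    rotation_range=30,       # rotating
    horizontal_flip=True,    # flipping
)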

Limitations:

(1) It can struggle to recognise the same object under different angles and lighting conditions:

Although these are images of the same object under different lighting conditions and from different angles (to our human vision), a CNN may treat the above 5 images as 5 different objects if they are provided in 5 different folders.

(2) The CNN largely loses the information about the composition and relative positions of the components, and passes what remains on to neurons that may not be able to classify the image correctly.

A CNN mainly detects whether a feature is present in the image or not; much of the spatial information about the composition is lost during the feed-forward pass.

These limitations become more evident in practical applications. CNNs are used to moderate content, and although they are trained on vast numbers of images, they still cannot detect and block all inappropriate images. As it turns out, Facebook once flagged a 30,000-year-old statue as nudity.

(3) Since it has to perform heavy tensor computations, it demands a great deal of computational power, GPU resources, and cost.

!!!

Does this mean that CNNs are useless?

Simply, NO!!

Despite its disadvantages, the CNN has brought a revolution to AI and computer-vision problems such as facial recognition, image search, etc.

!!!

To address the above-mentioned disadvantages of CNNs, there are many extensions; Transfer Learning (TL) is one of them.

Some examples of pre-trained models used for TL are:

VGG-16, VGG-19, ResNet, Inception, Xception, etc. These are the most commonly used pre-trained models; they reduce the required computational power and increase the efficiency of the model.

To know more about them, go to:

https://www.tensorflow.org/api_docs/python/tf/keras/applications
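As an illustration, here is a minimal transfer-learning sketch with a pre-trained VGG16 base from tf.keras.applications; the input size and the new head layers are my own example choices:

import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet",
                                   include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                               # freeze the pre-trained convolutional features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new task-specific head
])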

Conclusion:

As is evident from the above discussion:

(1) CNNs are primarily used for image-type data.

(2) A vanilla NN is an integral part of a CNN.

(3) There are many loss functions that can be used by the optimizer; one of the most popular is Categorical Cross-Entropy.

Ref for CCE-Loss:

https://manna-phys.medium.com/cross-entropy-loss-binary-and-categorical-59db1f0cc87c

To know more about CNNs:

Tata-bye-bye!!☺

Future Update:

(1) What is the Universal Approximation Theorem?

(2) How Transfer Learning overcomes the disadvantages of convolutional NNs will be discussed in my future blogs.

Have a look at this blog and let me know your thoughts in the comment section. Constructive criticism will be appreciated.



Akash Manna

An aspiring Data Scientist. Completed a postgrad in Physics. Believes 'the Universe is made of Data.'