Convolutional Neural Networks (CNNs)

Understanding The Basic CNN Structure

Halil İbrahim Hatun
6 min read · Jun 16, 2023

What is an image?

As almost everybody knows, an image is a grid of pixels, each carrying a color value. In a color image, every pixel is a weighted combination of three primary color channels (RGB, or BGR depending on the library's channel ordering).

Figure 1. RGB images

The image you see above has a size of [10, 5, 3] (x, y, number of channels). If this image were grayscale, the channel value would be 1, since there is only a single intensity channel.

What is the Convolutional Neural Network (CNN)?

A CNN (Convolutional Neural Network) is a type of artificial neural network that is specifically designed for processing grid-like data, such as images or sequences. CNNs are widely used in computer vision tasks, including image classification, object detection, image segmentation, and more.

CNN image classification takes an input image, processes it, and classifies it under certain categories (e.g., dog, cat, tiger, lion). A computer sees the input image as an array of pixels whose size depends on the image resolution.

CNNs are structured with layers of interconnected artificial neurons, including convolutional layers, pooling layers, and fully connected layers.

Figure 2. CNN Sample

The purpose of performing convolution is to extract features or patterns from input data. Convolution is a fundamental operation in various domains, including image processing, signal processing, and deep learning.

In image processing, convolution is used to apply filters or kernels to an image. These filters can enhance certain features of an image, such as edges or textures, or perform tasks like blurring or sharpening. By convolving an image with different filters, we can highlight specific characteristics and extract relevant information.

Let’s dive into the structures of CNN

Convolutional Layer

Convolution is performed by sliding the kernel over the input image and computing the element-wise multiplication and summation at each position. This process is repeated for each position in the image, resulting in a new output image.

Here’s the formula for convolution:

Figure 3 Convolution demonstration
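In discrete form, the operation most deep learning libraries actually implement (technically cross-correlation, though it is conventionally called convolution) can be written as:

```latex
(I \ast K)(i, j) = \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} I(i + m,\; j + n)\, K(m, n)
```

where $I$ is the input image, $K$ is a $k_h \times k_w$ kernel, and $(i, j)$ indexes a position in the output.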

And in the example below, performing convolution with a 3x3 kernel

Figure 4. Convolution GIF

After all convolution processes, the obtained result is called a “Feature Map”.
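The sliding-window process described above can be sketched directly in NumPy. This is a minimal illustrative implementation (the image and kernel values are made up for the example), not an optimized one:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution: slide the kernel over the image,
    multiply element-wise, and sum at each position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 2, 3, 0],
                  [4, 5, 6, 1],
                  [7, 8, 9, 2],
                  [1, 0, 1, 3]], dtype=float)
kernel = np.ones((3, 3)) / 9.0  # simple 3x3 averaging (blur) kernel

feature_map = conv2d(image, kernel)
print(feature_map.shape)  # (2, 2) -- a 4x4 image shrinks under a 3x3 kernel
```

The resulting array is exactly the "feature map" the text refers to.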

Figure 5. Convolution Neural Network GIF

Convolving an image with different filters (kernels) can perform operations such as edge detection, blurring, and sharpening. The example below shows the results of applying several different kernels.

Let’s examine the filtering processes on the picture of Tarkan Gözübüyük, the beloved bass guitarist of the Pentagram (Mezarkabul) band.

Figure 6. Filtering samples
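A few classic 3x3 kernels of the kind shown in the figure can be written out explicitly. These particular matrices are standard textbook examples, chosen here for illustration:

```python
import numpy as np

# Common 3x3 filter kernels
edge_detect = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)
blur = np.ones((3, 3)) / 9.0
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

# On a flat (constant-intensity) patch, the edge detector responds with 0,
# while blur and sharpen reproduce the original intensity unchanged.
flat_patch = np.full((3, 3), 7.0)
print(np.sum(flat_patch * edge_detect))  # 0.0
print(np.sum(flat_patch * blur))         # 7.0
print(np.sum(flat_patch * sharpen))      # 7.0
```

This is why edge-detection output is dark everywhere except at intensity changes: regions without variation produce zero response.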

Strides

Stride is the number of pixels the kernel shifts over the input matrix at each step. When the stride is 1, we move the filter 1 pixel at a time; when the stride is 2, we move it 2 pixels at a time, and so on. The figure below shows how convolution works with a stride of 2.

Figure 7. Strides
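The effect of stride on output size follows a standard formula, sketched here for a hypothetical 7x7 input:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial size after convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

# 7x7 input, 3x3 kernel
print(conv_output_size(7, 3, stride=1))  # 5
print(conv_output_size(7, 3, stride=2))  # 3
```

Doubling the stride roughly halves the output resolution, which is sometimes used as a cheaper alternative to pooling.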

Padding

Padding is a technique used to control the spatial dimensions of the feature map produced by a convolution. It involves adding extra pixels around the border of the input feature map before the convolution is applied.

This can be done in two ways:

  • Valid padding: no padding is added to the input feature map, so the output feature map is smaller than the input. This is useful when we want to reduce the spatial dimensions of the feature maps.
  • Same padding: enough padding is added so that the output feature map has the same size as the input. This is useful when we want to preserve the spatial dimensions of the feature maps.

The number of pixels to add can be calculated from the kernel size and the desired output size. The most common scheme is zero-padding, which adds zeros around the borders of the input feature map.
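That calculation can be made concrete. For stride 1 and an odd kernel size, the "same" padding amount is (k - 1) / 2; the example below assumes a hypothetical 28x28 input:

```python
def same_padding(k):
    """Zero-padding that keeps output size equal to input (stride 1, odd k)."""
    return (k - 1) // 2

def conv_output_size(n, k, stride=1, padding=0):
    return (n + 2 * padding - k) // stride + 1

n, k = 28, 3
p = same_padding(k)                        # 1 pixel on each border
print(conv_output_size(n, k, padding=p))   # 28 -- "same" padding
print(conv_output_size(n, k, padding=0))   # 26 -- "valid" padding
```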

Padding helps reduce the loss of information at the borders of the input feature map and can improve the performance of the model, though it also increases the computational cost of the convolution operation.

Figure 8. Padding

Pooling

Pooling in convolutional neural networks is a technique for downsampling the feature maps produced by convolutional filters. It generalizes the extracted features and helps the network recognize them independent of their exact location in the image.

Figure 9. Pooling
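Max pooling, the most common variant, can be sketched in a few lines; the 4x4 input below is a made-up example:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling: keep the maximum value in each size x size window."""
    h, w = x.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool2d(x))
# [[6. 8.]
#  [3. 4.]]
```

Each 2x2 window collapses to its largest value, halving the spatial resolution while keeping the strongest activations.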

Activation Functions

Activation functions are mathematical functions applied to the output of a neuron or a neural network layer to introduce non-linearity into the network. These functions determine the output of a neuron or a layer based on its weighted inputs and provide the capability for neural networks to model complex relationships between inputs and outputs.

Sigmoid: squashes its input into the range (0, 1), which makes it a natural choice for binary classification in a CNN model.

Figure 10. Sigmoid

tanh: The tanh function is very similar to the sigmoid function. The only difference is that it is symmetric around the origin. The range of values, in this case, is from -1 to 1.

Figure 11. tanh

Softmax: It is used in multinomial logistic regression and is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes.

Figure 12. Softmax

ReLU: defined as max(0, x). Its main advantage over other activation functions is that it does not activate all the neurons at the same time: negative inputs are mapped to zero, so only a subset of neurons fires, making the network's activations sparse and cheap to compute.

Figure 13. ReLU
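The four activation functions above can be written out in NumPy as a quick reference:

```python
import numpy as np

def sigmoid(x):
    """Squashes input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Like sigmoid but symmetric around the origin, range (-1, 1)."""
    return np.tanh(x)

def softmax(x):
    """Normalizes a vector into a probability distribution."""
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

def relu(x):
    """Zeroes out negative inputs, passes positives through."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(0.0))      # 0.5
print(tanh(0.0))         # 0.0
print(relu(x))           # [0. 0. 3.]
print(softmax(x).sum())  # 1.0 -- a valid probability distribution
```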

Flatten Layer

The flatten layer reshapes its input tensor into a one-dimensional vector, collapsing all the dimensions except the batch dimension. Its output has the shape (batch_size, flattened_size), where flattened_size is the product of the collapsed dimensions.

Figure 14. Flattening
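In NumPy terms, flattening is a single reshape; the batch and feature-map sizes below are made up for illustration:

```python
import numpy as np

# A batch of 2 feature maps, each 4x4 with 3 channels
batch = np.zeros((2, 4, 4, 3))

# Collapse every dimension except the batch dimension
flattened = batch.reshape(batch.shape[0], -1)
print(flattened.shape)  # (2, 48), since 4 * 4 * 3 = 48
```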

Fully Connected Layer

A fully connected layer, also known as a dense layer or a fully connected neural layer, is a type of layer in a neural network where each neuron or node is connected to every neuron in the previous layer. In a fully connected layer, all the outputs from the previous layer serve as inputs to each neuron in the current layer.

Figure 15. Fully Connected Layer
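Because every output neuron connects to every input, a fully connected layer is just a matrix multiplication plus a bias. A minimal sketch, with illustrative sizes (48 inputs, 10 output classes):

```python
import numpy as np

def dense(x, W, b):
    """Fully connected layer: each output is a weighted sum of all inputs."""
    return x @ W + b

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 48))   # batch of 2 flattened feature vectors
W = rng.standard_normal((48, 10))  # 48 inputs fully connected to 10 outputs
b = np.zeros(10)

print(dense(x, W, b).shape)  # (2, 10) -- one score per class, per sample
```

In a classifier, these 10 scores would typically be passed through softmax to produce class probabilities.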

Conclusion

  • CNNs possess spatial invariance properties, meaning they can recognize patterns and objects regardless of their location in an image. This is achieved through the use of shared weights in the convolutional layers, allowing the network to detect similar patterns at different positions, making CNNs robust to translation and small variations in the input data.
  • CNNs can capture spatial relationships and exploit local patterns effectively, enhancing their ability to learn intricate structures in images.

Thank you for reading. I hope the blog has been useful to you, and I look forward to your feedback, positive or negative.

Stay well.
