Convolutional neural networks are built by concatenating individual blocks that perform different tasks. These building blocks are often referred to as the layers of a convolutional neural network. In this section, some of the most common types of these layers are explained in terms of their structure, functionality, benefits and drawbacks.
Convolutional Layer
The main task of the convolutional layer is to detect local conjunctions of features from the previous layer and to map their appearance to a feature map. As a result of convolution in neural networks, the image is split into perceptrons, creating local receptive fields, and finally the perceptrons are compressed into feature maps of size m_2 \times m_3. Thus, this map stores the information where the feature occurs in the image and how well it corresponds to the filter. Hence, each filter is trained spatially with regard to the position in the volume it is applied to.
In each layer, there is a bank of m_1 filters. The number of filters applied in one stage is equivalent to the depth of the volume of output feature maps. Each filter detects a particular feature at every location on the input. The output Y^{(l)} of layer l consists of m_1^{(l)} feature maps of size m_2^{(l)} \times m_3^{(l)}. The i^{th} feature map, denoted Y_i^{(l)}, is computed as
Y_i^{(l)} = B_i^{(l)} + \sum_{j=1}^{m_1^{(l-1)}} K_{i,j}^{(l)} \ast Y_j^{(l-1)} \qquad (1)
where B_i^{(l)} is a bias matrix and K_{i,j}^{(l)} is the filter of size \left(2h_1^{(l)} + 1\right) \times \left(2h_2^{(l)} + 1\right) connecting the j^{th} feature map in layer (l-1) with the i^{th} feature map in layer l.
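To make equation (1) concrete, the following is a minimal NumPy sketch of the forward pass of a single convolutional layer. The function name, array shapes and the use of scipy.signal.correlate2d are illustrative assumptions, not taken from the text; note that deep learning frameworks typically implement \ast as cross-correlation, which differs from true convolution only by a flip of the filters.

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer_forward(Y_prev, K, B):
    """Sketch of equation (1).
    Y_prev: (m1_prev, n2, n3)            input feature maps from layer (l-1)
    K:      (m1, m1_prev, 2h1+1, 2h2+1)  filter bank K_{i,j}
    B:      (m1, m2, m3)                 one bias matrix B_i per output map
    returns (m1, m2, m3)                 output feature maps Y_i ("valid" mode)."""
    m1, m1_prev = K.shape[0], K.shape[1]
    Y = np.copy(B)                        # start from the bias matrix B_i
    for i in range(m1):                   # one output feature map per filter
        for j in range(m1_prev):          # sum over all input feature maps
            Y[i] += correlate2d(Y_prev[j], K[i, j], mode="valid")
    return Y

# Example: 3 input maps of size 8x8 and 4 filters of size 3x3 give 4 output maps of size 6x6.
Y_prev = np.random.randn(3, 8, 8)
K = np.random.randn(4, 3, 3, 3)
B = np.zeros((4, 6, 6))
print(conv_layer_forward(Y_prev, K, B).shape)   # (4, 6, 6)
```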
The result of stacking these convolutional layers in conjunction with the following layers is that the information of the image is assembled hierarchically, similar to human vision. That means that the pixels are assembled into edglets, edglets into motifs, motifs into parts, parts into objects, and objects into scenes.
Non-Linearity Layer
Rectification Layer
A rectification layer in a convolutional neural network performs an element-wise absolute value operation on the input volume (generally the activation volume). Let layer l be a rectification layer; it takes the activation volume Y_i^{(l-1)} from the non-linearity layer (l-1) and generates the rectified activation volume Y_i^{(l)}:
Y_i^{(l)} = \left| Y_i^{(l-1)} \right|
Similar to the non-linearity layer, the rectification operates element-wise and does not change the size of the input volume; therefore, the two operations can be (and in many cases (1), including AlexNet (4) and GoogLeNet (5), are) merged into a single layer:
Y_i^{(l)} = \left| f\left( Y_i^{(l-1)} \right) \right|
Despite the general simplicity of the operation, it plays a key role in the performance of the convolutional neural network by eliminating cancellation effects in subsequent layers. Particularly when an average pooling method is utilized, the negative values within the activation volume tend to cancel out the positive activations, degrading the accuracy of the network significantly. Therefore, rectification has been called a "crucial component" (2).
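As a toy illustration of this cancellation effect (the numbers and variable names below are assumptions for illustration, not from the text), consider average pooling over a tanh activation with strong responses of opposite sign: without rectification the pooled value is close to zero and the feature appears absent, while the rectified activations preserve it.

```python
import numpy as np

# Strong responses of opposite sign after a zero-mean non-linearity such as tanh.
activation = np.tanh(np.array([[ 2.0, -2.0],
                               [-2.0,  2.0]]))

avg_without_rect = activation.mean()           # ~0.0  -> the feature seems absent
avg_with_rect    = np.abs(activation).mean()   # ~0.96 -> the feature is clearly detected

print(avg_without_rect, avg_with_rect)
```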
Rectified Linear Units (ReLU)
Pooling Layer
Fully Connected Layer
The goal of the complete fully connected structure is to tune the weight parameters w_{i,j}^{(l)} or w_{i,j,r,s}^{(l)} to create a stochastic likelihood representation of each class based on the activation maps generated by the concatenation of convolutional, non-linearity, rectification and pooling layers. Individual fully connected layers operate identically to the layers of a multilayer perceptron, with the only exception being the input layer.
It is noteworthy that the function f once again represents the non-linearity; however, in a fully connected structure the non-linearity is built into the neurons and is not a separate layer.
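A minimal sketch of such a fully connected layer is given below, assuming a flattened input vector and a tanh non-linearity built into the neurons; the function name, shapes and the choice of tanh are illustrative assumptions rather than details from the text.

```python
import numpy as np

def fully_connected(y_prev, W, b, f=np.tanh):
    """y_prev: flattened activation vector from the previous layer
    W, b:   trainable weights w_{i,j} and biases of this layer
    f:      non-linearity built into the neurons (tanh chosen as an example)"""
    return f(W @ y_prev + b)

# Example with the 512x7x7 activation volume and 4096 neurons mentioned below.
y_prev = np.random.randn(512 * 7 * 7)           # flattened activation volume
W = np.random.randn(4096, 512 * 7 * 7) * 0.01   # weight matrix of the first FC layer
b = np.zeros(4096)
print(fully_connected(y_prev, W, b).shape)      # (4096,)
```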
In contrast, according to Yann LeCun, there are no fully connected layers in a convolutional neural network; fully connected layers are in fact convolutional layers with 1\times 1 convolution kernels (14). This is indeed true, and a fully connected structure can be realized with convolutional layers, which is becoming a rising trend in research.
As an example, AlexNet (4) generates an activation volume of 512 \times 7 \times 7 prior to its fully connected layers, which have 4096, 4096 and 1000 neurons, respectively. The first layer can be replaced with a convolutional layer consisting of 4096 filters, each with a size of m_1 \times m_2 \times m_3 = 512 \times 7 \times 7, resulting in a 4096 \times 1 \times 1 output, which is in fact a one-dimensional vector of size 4096.
The architecture of AlexNet, depicting its dimensions, with the fully connected structure as its last three layers. (Image source: (4))
Subsequently, the second layer can be replaced with a convolutional layer consisting of 4096 filters again, each with a size of m_1 \times m_2 \times m_3 = 4096 \times 1\times 1 resulting in a 4096 \times 1 \times 1 output once again. Ultimately, the output layer can be replaced with a convolutional layer consisting of 1000 filters, each with a size of m_1 \times m_2 \times m_3 = 4096 \times 1\times 1 resulting in a 1000 \times 1 \times 1 output, which yields the classification result of the image among 1000 classes (13).
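The equivalence described above can be verified numerically. The following sketch (shapes and variable names are assumptions based on the 512 \times 7 \times 7 example) reshapes the weight matrix of the first fully connected layer into 4096 filters of size 512 \times 7 \times 7 and shows that the fully connected and the convolutional view produce the same output.

```python
import numpy as np

volume = np.random.randn(512, 7, 7)                 # activation volume before the FC layers
W_fc = np.random.randn(4096, 512 * 7 * 7) * 0.01    # weight matrix of the first FC layer

# Fully connected view: flatten the volume, then multiply by the weight matrix.
fc_out = W_fc @ volume.reshape(-1)                  # shape (4096,)

# Convolutional view: 4096 filters of size 512x7x7, each applied at a single position.
K = W_fc.reshape(4096, 512, 7, 7)                   # one filter per output neuron
conv_out = np.einsum('ochw,chw->o', K, volume)      # "valid" convolution at one position

print(np.allclose(fc_out, conv_out))                # True: both views are identical
```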
Literature
[1] Convolutional networks and applications in vision (2010, Yann LeCun, Koray Kavukcuoglu and Clement Farabet)
[2] What is the best multi-stage architecture for object recognition? (2009, Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun)
[3] Rectified Linear Units Improve Restricted Boltzmann Machines (2010, Vinod Nair and Geoffrey E. Hinton)
[4] ImageNet Classification with Deep Convolutional Neural Networks (2012, Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton)
[5] Going Deeper with Convolutions (2015, Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet et al.)
[6] Deep Sparse Rectifier Neural Networks (2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio)
[7] The CIFAR-10 dataset (2014, Alex Krizhevsky, Vinod Nair and Geoffrey E. Hinton)
[8] Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (2015, Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun)
[9] Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition (2010, Dominik Scherer, Andreas Mueller and Sven Behnke)
[10] Fractional Max-Pooling (2014, Benjamin Graham)
[11] Striving for Simplicity: The All Convolutional Net (2014, Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox and Martin Riedmiller)
Weblinks
[12] http://ufldl.stanford.edu/tutorial/supervised/Pooling/ (Last visited: 21.01.2017)
[13] http://cs231n.github.io/convolutional-networks/#fc/ (Last visited: 21.01.2017)
[14] https://www.facebook.com/yann.lecun/posts/10152820758292143 (Last visited: 21.01.2017)
[15] http://iamaaditya.github.io/2016/03/one-by-one-convolution/ (Last visited: 21.01.2017)
3 Comments
Unknown user (ga29mit) says:
25 January 2017: All the comments are SUGGESTIONS and are obviously highly subjective!
Form comments:
Wording comments:
Corrections:
"image and mapping their appearance to a feature map" → "image and map their appearance to a feature map"
"neuronal network" → "neural network"
"can be (and in many cases(1) including AlexNet(4) and GoogLeNet(5) is)" → "can be (and in many cases(1) including AlexNet(4) and GoogLeNet(5) are)"
"Even though the general simplicity of the operation, it plays a key role in the performance" → "Regardless of the general simplicity of the operation, it plays a key role in the performance"
"Therefore the rectification is named as a 'crucial component' (2)."
Confusion:
Individual fully connected layers function identically (???) to the layers of the multilayer perceptron with the only exception being the input layer. (a verb is missing here)
Final remark:
Unknown user (ga69taq) says:
26 January 2017: Thank you very much for the amazing contribution in such a short time. I agree with almost all of your comments and they will be fixed; on some points, though, I am unsure which way would work better, so I would like to share them with you:
Form comments:
Wording comments:
Confusion:
Individual fully connected layers function identically (???) to the layers of the multilayer perceptron with the only exception being the input layer. (a verb is missing here): "Function" is the verb in that sentence, not the noun, as in "to function". I will switch the wording to "operate" in order to avoid confusion.
I agree with everything else and now have it on my to-do list. Again, thank you very much for such a thorough and fast review. The biggest problem is that now I am under pressure that my reviews for other people will not be even half as thorough as yours.
As a last remark, when you select text in viewing mode (i.e., not editing mode), a small message bubble appears which you can use to add an inline comment to the page, which I believe is extremely useful for reviewing.
Martin Knoche says:
31 January 2017: Hi, first of all, it is really hard to find suggestions as the second reviewer!
Regarding the language, it sounds very good to me. I did not find any passage which I could correct or misunderstand.
The only thing I would suggest for your article is that you could mention that more layers exist than the ones you describe, for example dropout layers or batch normalization layers. This is just a suggestion, but in my opinion it will help beginners. They will see that this article covers the common layers, but that more layers exist in common toolkits. You could also link these two layers to the sections in the advanced level.
All in all a very good article!