Convolutional Neural Network Architectures

Traditional Convolutional Neural Network Architectures

In the 1990s, Yann LeCun developed the first applications of Convolutional Networks. His paper ''Gradient-based learning applied to document recognition'' documents LeNet-5, the first applied Convolutional Neural Network.

This paper is historically important for Convolutional Neural Networks. In it he states:

''Multilayer Neural Networks trained with the backpropagation algorithm constitute the best example of a successful Gradient-Based Learning technique. Given an appropriate network architecture, Gradient-Based Learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns such as handwritten characters, with minimal preprocessing.''

In this paper, while reviewing methods for handwritten character recognition, Yann LeCun demonstrated that Convolutional Neural Networks outperform the other methods. This is because Convolutional Neural Networks are designed to deal with 2D shapes. ⁽¹⁾ In the course of this research he created LeNet, which is the first Convolutional Neural Network architecture. In Traditional CNN Architectures we first look at how modules are combined into CNN architectures. These combinations are based on ''What is the Best Multi-Stage Architecture for Object Recognition?'', another paper co-authored by Yann LeCun, which was published in 2009. The next step is a look at the LeNet architecture.

Layers in Traditional Convolutional Neural Network Architectures

Generally, the architecture aims to build a hierarchical structure for fast feature extraction and classification. This hierarchical structure consists of several layers: filter bank layer, non-linear transformation layer, and a pooling layer. The pooling layer averages or takes the maximum value of filter responses over local neighborhoods to combine them. This process achieves invariance to small distortions.⁽²⁾

The traditional architecture is different from the modern ones. Here is a list, with short descriptions, of the layers used in building models for traditional CNNs.

Filter Bank Layer- $\begin{array}{l}{F}_{{CSG}}\end{array}$ : This layer acts as a special form of convolutional layer. The only addition is that the output of the convolution is put through the $\begin{array}{l}\tanh\end{array}$ operation and scaled by a trainable gain $\begin{array}{l}{g}_{{j}}\end{array}$ . This layer calculates the output feature map $\begin{array}{l}\mathrm{{y}_{{j}}}\end{array}$ from the input feature maps $\begin{array}{l}{x}_{{i}}\end{array}$ and the trainable filters $\begin{array}{l}{k}_{{ij}}\end{array}$ , where $\begin{array}{l}*\end{array}$ denotes convolution:

(1)	$\begin{array}{l}\displaystyle {y}_{{j}}={g}_{{j}}\tanh({\sum _{i }}{k}_{{ij}} * {x}_{{i}} )\end{array}$

Rectification Layer- $\begin{array}{l}\mathrm{{R}_{{abs}}}\end{array}$
Local Contrast Normalization Layer- $\begin{array}{l}N\end{array}$ : This layer performs local subtractive and divisive normalizations. It enforces local competition between adjacent features in a feature map and between features at the same spatial location in different feature maps.
Average Pooling and Subsampling Layer- $\begin{array}{l}{P}_{{A}}\end{array}$
Max-Pooling and Subsampling Layer- $\begin{array}{l}{P}_{{M}}\end{array}$
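To make the filter bank layer concrete, here is a minimal sketch in plain Python of how one output map can be computed from several input maps. The function names `correlate2d_valid` and `filter_bank_output` and all numeric values are hypothetical, chosen only for illustration. The sketch assumes no padding and, as is common in CNN code, implements the convolution as a cross-correlation (the filter is not flipped).

```python
import math

def correlate2d_valid(x, k):
    """Slide filter k over map x without padding (cross-correlation, no flip)."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [
            sum(k[a][b] * x[r + a][c + b] for a in range(kh) for b in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def filter_bank_output(inputs, kernels, gain):
    """One output map: gain * tanh(sum over input maps of filter response)."""
    responses = [correlate2d_valid(x, k) for x, k in zip(inputs, kernels)]
    rows, cols = len(responses[0]), len(responses[0][0])
    return [
        [gain * math.tanh(sum(resp[r][c] for resp in responses)) for c in range(cols)]
        for r in range(rows)
    ]

# Two 3x3 input maps, one 2x2 filter per input map, gain 1.0 (made-up values).
x1 = [[0.0, 0.1, 0.2], [0.1, 0.2, 0.3], [0.2, 0.3, 0.4]]
x2 = [[0.5, 0.0, 0.5], [0.0, 0.5, 0.0], [0.5, 0.0, 0.5]]
k1 = [[1.0, 0.0], [0.0, -1.0]]
k2 = [[0.5, 0.5], [0.5, 0.5]]
y = filter_bank_output([x1, x2], [k1, k2], gain=1.0)  # a 2x2 output map
```

Because of the tanh, every value of the output map lies between minus and plus the gain, whatever the magnitude of the filter responses.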

Information on Convolutional, Pooling and Rectification Layer can be found here.

Combination of Modules in Traditional Architecture:

We can build different modules by using these layers. A feature extraction stage is formed by a filtering layer followed by different combinations of rectification, normalization and pooling layers. Most of the time, one or two stages of feature extraction and a classifier are enough to make an architecture for recognition.⁽³⁾

$\begin{array}{l}{F}_{{CSG}}\end{array}$ - $\begin{array}{l}{P}_{{A}}\end{array}$ : This combination is one of the most common blocks for building traditional convolutional networks. Several sequences of $\begin{array}{l}{F}_{{CSG}}\end{array}$ - $\begin{array}{l}{P}_{{A}}\end{array}$ followed by a linear classifier add up to a complete traditional network.

Figure 1: the structure of $\begin{array}{l}{F}_{{CSG}}\end{array}$ - $\begin{array}{l}{P}_{{A}}\end{array}$

$\begin{array}{l}{F}_{{CSG}}\end{array}$ - $\begin{array}{l}\mathrm{{R}_{{abs}}}\end{array}$ - $\begin{array}{l}{P}_{{A}}\end{array}$ : In this module the filter bank layer is followed by a rectification layer and an average pooling layer. The filtered input values are squashed by $\begin{array}{l}\tanh\end{array}$ , then the non-linear absolute value is calculated, and finally the result is averaged and down-sampled.

Figure 2: the structure of $\begin{array}{l}{F}_{{CSG}}\end{array}$ - $\begin{array}{l}\mathrm{{R}_{{abs}}}\end{array}$ - $\begin{array}{l}{P}_{{A}}\end{array}$

$\begin{array}{l}{F}_{{CSG}}\end{array}$ - $\begin{array}{l}\mathrm{{R}_{{abs}}}\end{array}$ - $\begin{array}{l}N\end{array}$ - $\begin{array}{l}{P}_{{A}}\end{array}$ : This module is very similar to the previous module. The only difference is that a local contrast normalization layer is added between the rectification layer and the average pooling layer. In comparison to the previous module, after the non-linear absolute value is calculated, the values are normalized and sent to the pooling layer, where they are averaged and down-sampled.

Figure 3: the structure of $\begin{array}{l}{F}_{{CSG}}\end{array}$ - $\begin{array}{l}\mathrm{{R}_{{abs}}}\end{array}$ - $\begin{array}{l}N\end{array}$ - $\begin{array}{l}{P}_{{A}}\end{array}$ (Image source⁽⁴⁾)

$\begin{array}{l}{F}_{{CSG}}\end{array}$ - $\begin{array}{l}{P}_{{M}}\end{array}$ : This module is another common module for convolutional networks. It forms the basis of the HMAX architecture.

Figure 4: the structure of $\begin{array}{l}{F}_{{CSG}}\end{array}$ - $\begin{array}{l}{P}_{{M}}\end{array}$
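The difference between these modules can be seen on a small example. The sketch below is an illustration under assumptions, not a reference implementation: the helpers `rectify_abs`, `pool` and `average` are hypothetical names, the pooling windows do not overlap, and the 4×4 map merely stands in for the output of a filter bank layer with made-up values.

```python
def rectify_abs(fmap):
    """Rectification layer: element-wise absolute value."""
    return [[abs(v) for v in row] for row in fmap]

def pool(fmap, size, reduce_fn):
    """Non-overlapping pooling: apply reduce_fn to each size x size window."""
    rows, cols = len(fmap) // size, len(fmap[0]) // size
    return [
        [
            reduce_fn([fmap[r * size + a][c * size + b]
                       for a in range(size) for b in range(size)])
            for c in range(cols)
        ]
        for r in range(rows)
    ]

def average(values):
    return sum(values) / len(values)

# A 4x4 map standing in for the output of a filter bank layer (made-up values).
fmap = [
    [0.5, -0.5, 0.2, 0.4],
    [-1.0, 1.0, -0.6, 0.0],
    [0.3, 0.3, -0.9, 0.1],
    [-0.3, -0.3, 0.5, -0.5],
]

avg_pooled = pool(fmap, 2, average)                    # filter bank, then average pooling
abs_avg_pooled = pool(rectify_abs(fmap), 2, average)   # with rectification in between
max_pooled = pool(fmap, 2, max)                        # filter bank, then max pooling
```

In this example, responses of opposite sign cancel in three of the four windows under plain averaging, while the rectified version keeps a non-zero response in every window.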

LeNet-5

LeNet-5 is the first applied Convolutional Neural Network and was used in Yann LeCun's experiments. Not counting the input, LeNet-5 consists of 7 layers, all of which contain trainable weights.

The input, which some consider a part of the architecture, is a $\begin{array}{l}32\times32\end{array}$ pixel image.

The convolutional layer C1 has 6 feature maps, 156 trainable parameters and 122,304 connections. In this layer each feature map has a size of $\begin{array}{l}28\times28\end{array}$ , which prevents connections from the input from falling off the boundary. Every unit in these feature maps is connected to a $\begin{array}{l}5\times5\end{array}$ neighborhood in the input (or previous layer).

The sub-sampling layer S2 has 6 feature maps. Each feature map has a size of $\begin{array}{l}14\times14\end{array}$ . Every unit in these feature maps is connected to a $\begin{array}{l}2\times2\end{array}$ neighborhood in the corresponding feature map of the previous layer C1. The 4 inputs coming from C1 into a unit of S2 are added, multiplied by a trainable weight, and then added to a trainable bias; the result is passed through a sigmoidal function.⁽⁷⁾ Because the receptive fields of size $\begin{array}{l}2\times2\end{array}$ do not overlap, the feature maps in S2 have half as many rows and columns as the feature maps in C1. In total this layer has 5,880 connections and 12 trainable weights.

Layer C3 is another convolutional layer with 16 feature maps, and it is similar to the previous convolutional layer. Each unit in each feature map is connected to $\begin{array}{l}5\times5\end{array}$ neighborhoods in feature maps of the previous layer. The C3 feature maps are not connected to every feature map of S2; the connection scheme is shown in Figure 6. The aim of this scheme is to break symmetry and decrease the number of connections. In total this layer has 151,600 connections and 1,516 trainable weights.

The sub-sampling layer S4 has 16 feature maps. Each of them has a size of $\begin{array}{l}5\times5\end{array}$ . Each unit in the feature maps is connected to a $\begin{array}{l}2\times2\end{array}$ neighborhood in the corresponding feature map of the previous layer. Layer S4 has 2,000 connections and 32 trainable weights.

The convolutional layer C5 has 120 feature maps and 48,120 trainable connections. Every unit in these feature maps is connected to a $\begin{array}{l}5\times5\end{array}$ neighborhood in all 16 feature maps of layer S4. The feature maps of S4 also have a size of $\begin{array}{l}5\times5\end{array}$ . As a result, S4 and C5 are fully connected, and the size of the C5 feature maps is $\begin{array}{l}1\times1\end{array}$ . C5 is nevertheless labeled as a convolutional layer and not a fully connected layer, because if the input of LeNet-5 were made bigger with everything else kept constant, the feature maps of C5 would be bigger than $\begin{array}{l}1\times1\end{array}$ .

Layer F6 is a fully connected layer with 84 units and 10,164 trainable weights.
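The parameter and connection counts quoted above can be reproduced with a few lines of arithmetic. The sketch below is a sanity check under stated assumptions, not code from the paper. The helper `conv_params` is a hypothetical name, the C3 map size of 10×10 follows from applying a 5×5 filter to a 14×14 map without padding, and the number of S2 maps feeding each C3 map (six maps with 3 inputs, nine with 4, one with all 6) is taken from the connection table in the LeNet-5 paper.

```python
def conv_params(inputs_per_output_map, kernel=5):
    """Trainable parameters of a convolutional layer: for every output map,
    one kernel x kernel filter per connected input map plus one bias."""
    return sum(n_in * kernel * kernel + 1 for n_in in inputs_per_output_map)

# C1: 6 maps of 28x28, each connected to the single input plane.
c1_params = conv_params([1] * 6)
c1_connections = c1_params * 28 * 28

# S2: 6 maps of 14x14; each map has one weight and one bias,
# each unit has 4 inputs plus the bias.
s2_params = 6 * 2
s2_connections = 6 * 14 * 14 * (2 * 2 + 1)

# C3: 16 maps of 10x10 (ASSUMPTION: connection scheme from the LeNet-5 paper).
c3_inputs = [3] * 6 + [4] * 9 + [6] * 1
c3_params = conv_params(c3_inputs)
c3_connections = c3_params * 10 * 10

# S4: 16 maps of 5x5, built like S2.
s4_params = 16 * 2
s4_connections = 16 * 5 * 5 * (2 * 2 + 1)

# C5: 120 maps of 1x1, each connected to all 16 maps of S4.
c5_params = conv_params([16] * 120)

# F6: 84 units, each connected to the 120 outputs of C5 plus a bias.
f6_params = 84 * (120 + 1)
```

In C1 and C3 every unit of a map uses all of that map's parameters once, so the number of connections is the parameter count times the number of units per map.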

The output layer is composed of Euclidean Radial Basis Function (RBF) units, one for each class, with 84 inputs each. Each RBF unit calculates the Euclidean distance between its input vector and its parameter vector. The output can be read as a measure of the mismatch between the input pattern and the model of the class associated with the RBF unit: the bigger the difference between these two vectors, the bigger the RBF output.⁽⁸⁾ The output for the correct class should be as small as possible. Therefore, layer F6 is configured so that this difference is minimized, which brings the output of F6 close to the parameter vector.

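Each RBF unit calculates the output $\begin{array}{l}\mathrm{{y}_{{i}}}\end{array}$ from its inputs $\begin{array}{l}{x}_{{j}}\end{array}$ and its parameter vector $\begin{array}{l}{w}_{{ij}}\end{array}$ :

(2)	$\begin{array}{l}\displaystyle {y}_{{i}}={\sum _{j }}({x}_{{j}} - {w}_{{ij}} )^2\end{array}$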
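As a small illustration of the RBF output layer, the sketch below computes the squared Euclidean distance between an input vector and the parameter vector of each class, and picks the class with the smallest output. The function name `rbf_outputs` and the 3-dimensional toy vectors are assumptions made for brevity; in LeNet-5 the input vector has 84 components.

```python
def rbf_outputs(x, weights):
    """One output per class: sum of squared differences between the input
    vector x and that class's parameter vector."""
    return [sum((xj - wij) ** 2 for xj, wij in zip(x, w_i)) for w_i in weights]

# Toy example with 3 inputs and 2 classes (made-up values).
x = [1.0, -1.0, 1.0]
weights = [
    [1.0, -1.0, 1.0],   # parameter vector of class 0
    [1.0, 1.0, -1.0],   # parameter vector of class 1
]
y = rbf_outputs(x, weights)

# The smallest output marks the best-matching class.
predicted_class = min(range(len(y)), key=y.__getitem__)
```

Here the input equals the parameter vector of class 0, so that unit outputs zero and class 0 is selected.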

Figure 5: the Architecture of LeNet-5 (Image source⁽⁵⁾)

Figure 6: each column in this figure indicates which feature maps in S2 are combined by the units in a particular feature map of C3 (Image source⁽⁶⁾)

Modern Convolutional Neural Network Architecture:

This chapter offers basic knowledge on how to build reliable simple modern architectures and demonstrates certain known examples from literature.

Layers used in Modern Convolutional Neural Networks:

Layers in modern architectures are very similar to the traditional layers, yet there are certain differences: RELU is a special implementation of the Rectification Layer. You can find more information about RELU and the Fully Connected Layer here.

For a simple Convolutional Network, the following layers are used:

Input Layer
Convolutional Layer
RELU Layer
Pooling Layer
Fully Connected Layer

The main idea is that the neural network architecture takes an input image of size $\begin{array}{l}[A \times B \times C]\end{array}$ at the start and produces the class scores of this image at the output. Convolutional Layers and RELU (Rectification) Layers are stacked together and then followed by Pooling Layers. This structure is commonly used and repeated until the input (image) has been merged spatially to a small size. After that it is sent to Fully Connected Layers. The output of the last Fully Connected Layer, which is at the end of the architecture, produces the class scores of the input image.⁽⁹⁾

A few examples of network architectures:

Only a single Fully Connected Layer: This is just a linear classifier
Convolutional → RELU → Fully Connected
Convolutional → RELU → Pooling → Convolutional → RELU → Pooling → Fully Connected → RELU → Fully Connected: There is a single Convolutional Layer between every Pooling Layer
Convolutional → RELU → Convolutional → RELU → Pooling → Convolutional → RELU → Convolutional → RELU → Pooling → Convolutional → RELU → Convolutional → RELU → Pooling → Fully Connected → RELU → Fully Connected → RELU → Fully Connected: This architectural form has 2 Convolutional Layers before each Pooling Layer. It is useful when building large and deep networks, because multiple stacked Convolutional Layers can develop more detailed and complex features of the input before it is sent to the Pooling Layer, where some portion of the information is lost.
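Layer sequences of this kind follow one template: blocks of Convolutional and RELU Layers, optionally closed by a Pooling Layer, followed by Fully Connected Layers. The sketch below only generates the layer names for such a template. The function `build_pattern` and its parameters are hypothetical names introduced for illustration; no actual network is built.

```python
def build_pattern(convs_per_block, blocks, hidden_fc, pool=True):
    """Return the list of layer names for the template
    [[Convolutional, RELU] * convs_per_block, Pooling?] * blocks,
    [Fully Connected, RELU] * hidden_fc, Fully Connected."""
    layers = []
    for _ in range(blocks):
        layers += ["Convolutional", "RELU"] * convs_per_block
        if pool:
            layers.append("Pooling")
    layers += ["Fully Connected", "RELU"] * hidden_fc
    layers.append("Fully Connected")
    return layers

linear_classifier = build_pattern(0, 0, 0)         # only a Fully Connected Layer
small_net = build_pattern(1, 1, 0, pool=False)     # Convolutional, RELU, Fully Connected
deep_net = build_pattern(2, 3, 2)                  # 2 Convolutional Layers before each Pooling
deep_net_text = " → ".join(deep_net)
```

With `convs_per_block=2`, `blocks=3` and `hidden_fc=2`, the result is the last and deepest pattern of the list above.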

How to build the layers:

Convolutional Layer: Generally, we want to use small filters. When building layers, a stack of smaller convolutional filters is preferred over a single layer with a large filter. Assume that we have three stacked $\begin{array}{l}3\times3\end{array}$ convolutional layers. In that formation, neurons of the first layer have a $\begin{array}{l}3\times3\end{array}$ view of the input. Neurons of the second layer have a $\begin{array}{l}3\times3\end{array}$ view of the first layer, which means they have a $\begin{array}{l}5\times5\end{array}$ view of the input. Neurons of the third layer have a $\begin{array}{l}3\times3\end{array}$ view of the second layer and therefore a $\begin{array}{l}7\times7\end{array}$ view of the input. Parameter-wise, if every layer has $\begin{array}{l}C\end{array}$ channels, this structure has $\begin{array}{l}3\times (C \times (3 \times 3 \times C))=27C^2\end{array}$ parameters compared to $\begin{array}{l}C \times (7 \times 7\times C)=49C^2\end{array}$ , which would be the case if a single $\begin{array}{l}7\times7\end{array}$ convolutional layer were used.
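The receptive field and parameter arithmetic above can be checked with a short sketch. The function names are hypothetical, and the sketch assumes stride-1 convolutions, the same number of channels `C` in every layer, and no bias terms, as in the count above.

```python
def stacked_receptive_field(kernel_sizes):
    """Side length of the input region seen by one neuron on top of a stack
    of stride-1 convolutions: each layer adds (kernel - 1) pixels."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field

def stacked_weight_count(kernel_sizes, channels):
    """Weights of the stack: each layer has `channels` filters of size
    kernel x kernel x channels (biases ignored)."""
    return sum(channels * (k * k * channels) for k in kernel_sizes)

views = [stacked_receptive_field([3] * depth) for depth in (1, 2, 3)]  # [3, 5, 7]

C = 64  # example channel count, chosen arbitrarily
three_small = stacked_weight_count([3, 3, 3], C)  # 27 * C**2
one_large = stacked_weight_count([7], C)          # 49 * C**2
```

For `C = 64` the stack of three small filters needs 110,592 weights against 200,704 for the single large filter, while both see the same 7×7 region of the input.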

Pooling Layer: Max-pooling with $\begin{array}{l}2\times2\end{array}$ receptive fields eliminates 75% of the input information, because the input is down-sampled by 2 in both height and width. Rarely, $\begin{array}{l}3\times3\end{array}$ receptive fields are used, but in general receptive fields bigger than $\begin{array}{l}3\times3\end{array}$ are not practical because they cause a high loss of input data.
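The 75% figure follows directly from the output size of the pooling layer. In the sketch below, `pooled_size` and `fraction_discarded` are hypothetical helper names, and the stride is assumed to equal the window size (non-overlapping windows), which the text above does not state explicitly for the 3×3 case.

```python
def pooled_size(height, width, window, stride):
    """Output height and width of a pooling layer without padding."""
    return (height - window) // stride + 1, (width - window) // stride + 1

def fraction_discarded(height, width, window, stride):
    """Share of the input activations that do not survive the pooling."""
    out_h, out_w = pooled_size(height, width, window, stride)
    return 1 - (out_h * out_w) / (height * width)

size_2x2 = pooled_size(24, 24, window=2, stride=2)         # (12, 12)
lost_2x2 = fraction_discarded(24, 24, window=2, stride=2)  # 0.75
lost_3x3 = fraction_discarded(24, 24, window=3, stride=3)  # 8/9, about 0.89
```

Under these assumptions a 3×3 window already discards about 89% of the activations, which illustrates why larger windows are rarely practical.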

Specific Architectures:

AlexNet

AlexNet made Convolutional Networks popular in Computer Vision. AlexNet was developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton and won the ImageNet ILSVRC challenge in 2012. On the ILSVRC-2010 test data it achieved top-1 and top-5 error rates of 37.5% and 17.0%, and a variant of the model won ILSVRC-2012 with a top-5 test error rate of 15.3%. ⁽¹⁰⁾

Machine learning methods play a critical role in object recognition. Generally, performance is improved by collecting larger data sets, learning more powerful models, and using better techniques. The purpose of AlexNet is to learn about thousands of objects from millions of images. Because the object recognition task is so complex, even a very large data set cannot specify it completely, so the model should be designed to have prior knowledge that compensates for the missing data. This is the main reason AlexNet was designed using Convolutional Neural Networks.

AlexNet Architecture

As can be seen in Figure 7, AlexNet consists of eight layers: the first five are convolutional and the remaining three are fully connected. The first and second convolutional layers are each followed by a response-normalization layer, which is in turn followed by a max-pooling layer. In addition, the fifth convolutional layer is followed by a max-pooling layer.

The output of every convolutional layer and fully connected layer is put through the RELU non-linearity. The output of the last fully connected layer is sent to a 1000-way softmax layer, which produces 1000 probability values for the 1000 class labels, where a higher value corresponds to a higher probability. The network is trained to maximize the average across the training cases of the log-probability of the correct label under this prediction distribution.⁽¹¹⁾

As you can see in Figure 7, AlexNet is split into 2 separate pieces (in other words, across 2 separate GPUs). The filters of the second, fourth and fifth convolutional layers are connected only to the feature maps of the previous layer that reside on the same piece. The exception is the third convolutional layer, which is connected to all feature maps of the second layer. Every neuron in a fully connected layer is connected to all neurons of the previous layer.
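The softmax output and the training objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not AlexNet's implementation: the function names are hypothetical, and the example uses 3 classes instead of 1000.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1.
    Subtracting the maximum first avoids overflow in exp."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average_log_probability(batch_scores, labels):
    """Average over training cases of log p(correct label); training
    tries to make this value as large (close to 0) as possible."""
    total = 0.0
    for scores, label in zip(batch_scores, labels):
        total += math.log(softmax(scores)[label])
    return total / len(labels)

probs = softmax([1.0, 3.0, 2.0])  # the largest score gets the largest probability
objective = average_log_probability([[1.0, 3.0, 2.0], [0.0, 0.0, 0.0]], [1, 0])
```

The objective is always at most zero, and it reaches zero only when the network assigns probability 1 to the correct label of every training case.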

Layer Size for each Layer:

Input: $\begin{array}{l}224\times 224 \times 3\end{array}$

First convolutional layer: $\begin{array}{l}96\end{array}$ filters of size $\begin{array}{l}11\times 11 \times 3\end{array}$

Second convolutional layer: $\begin{array}{l}256\end{array}$ filters of size $\begin{array}{l}5\times5\times 48\end{array}$

Third convolutional layer: $\begin{array}{l}384\end{array}$ filters of size $\begin{array}{l}3 \times 3\times 256\end{array}$

Fourth convolutional layer: $\begin{array}{l}384\end{array}$ filters of size $\begin{array}{l}3 \times 3\times 192\end{array}$

Fifth convolutional layer: $\begin{array}{l}256\end{array}$ filters of size $\begin{array}{l}3 \times 3 \times 192\end{array}$

Every fully-connected Layer has $\begin{array}{l}4096\end{array}$ neurons
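As a rough check on these sizes, the number of filter weights in each convolutional layer is the filter count times the filter volume. The sketch below uses the filter shapes reported in the AlexNet paper, which is an assumption relative to this text: a depth of 48 or 192 where a layer sees only the feature maps on its own GPU, and 256 for the third layer, which sees the feature maps of both GPUs. Biases are ignored, and the layer names `conv1` to `conv5` are labels chosen for illustration.

```python
# (name, number of filters, filter height, filter width, filter depth)
conv_layers = [
    ("conv1", 96, 11, 11, 3),
    ("conv2", 256, 5, 5, 48),
    ("conv3", 384, 3, 3, 256),  # connected to all maps of the second layer
    ("conv4", 384, 3, 3, 192),
    ("conv5", 256, 3, 3, 192),
]

# Filter weights per layer: count * height * width * depth.
conv_weights = {name: n * h * w * d for name, n, h, w, d in conv_layers}
total_conv_weights = sum(conv_weights.values())
```

The third layer has the most convolutional weights because its filters span the feature maps of both GPUs.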

Figure 7: the structure of AlexNet (Image source⁽¹²⁾)

Multi-Column Deep Neural Networks Architecture

The Multi-Column Deep Neural Networks are modeled after the neural layers that reside between the retina and the visual cortex of mammals. In comparison to the traditional methods commonly used in computer vision and machine learning, these deep artificial neural network architectures offer near-human performance on tasks such as recognizing handwritten digits or traffic signs. The convolutional winner-take-all neurons have small receptive fields, which yields a large network depth. As a result, the number of sparsely connected neural layers in this deep network is roughly equal to the number of layers found between the retina and the visual cortex of mammals. This architecture is the first to achieve near-human performance on the MNIST handwriting benchmark, and on a traffic sign recognition benchmark it outperforms humans by a factor of two.⁽¹³⁾

Multi-Column Deep Neural Networks

This architecture contains hundreds of maps per layer and many layers of non-linear neurons stacked on top of each other, which is why it is called ''deep''. Originally, this architectural design was inspired by the Neocognitron, an artificial neural network proposed by Kunihiko Fukushima in 1980. The Neocognitron consists of many layers of stacked non-linear neurons and has been used for handwritten character recognition and other pattern recognition tasks. The aim of this architecture is to iteratively minimize the classification error on sets of labeled training images, starting from initially random weights. The main problem with multi-layered Deep Neural Networks was that they are hard to train and that the computational power needed to use such an architecture properly was not available. However, recent advancements in computing hardware have changed that.

As can be seen in Figure 8, each DNN consists of 2-dimensional convolutional layers with shared weights. These layers employ a winner-take-all approach: the output of such a layer is sent to a max-pooling layer, where the winning neurons are determined. After repeated down-sampling, the output of the last pooling layer is fed into 1-dimensional fully connected layers.

The aim of the DNN architecture is to train only the winner neurons, while the other neurons do not forget what they have learned. That decreases the number of changes per time interval, which is similar to reducing energy consumption when considered from a biological point of view. Weight updates happen after each gradient computation step, which makes the training algorithm fully online.

In the final step, several DNN columns are combined into a Multi-column DNN (MCDNN). In the MCDNN the predictions from each column are averaged. The weights of each column are randomly initialized. The columns can be trained on the same inputs, or on inputs that are preprocessed in different ways.

Taking the average of predictions from each column:

$\begin{array}{l}\displaystyle {y}_{{MCDNN}}^{{i}}=\frac{1}{N}{\sum _{j}^{columns}} {y}_{{{DNN}_{{j}}}}^{{i}}\end{array}$

where i corresponds to the i-th class, j runs over all DNN columns, and N is the number of columns
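The averaging step can be sketched as follows. The function name `mcdnn_average` and the prediction values are made up for illustration; each inner list stands for the class probabilities predicted by one DNN column for the same input.

```python
def mcdnn_average(column_predictions):
    """Average the per-class predictions of all columns."""
    n_columns = len(column_predictions)
    n_classes = len(column_predictions[0])
    return [
        sum(column[i] for column in column_predictions) / n_columns
        for i in range(n_classes)
    ]

# Predictions of 3 columns for 3 classes (made-up values).
columns = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.6, 0.3, 0.1],
]
averaged = mcdnn_average(columns)

# The class with the highest averaged prediction is the MCDNN's answer.
predicted_class = max(range(len(averaged)), key=averaged.__getitem__)
```

In this example the second column alone would choose class 1, but the averaged prediction follows the other two columns and selects class 0.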