Traditional Convolutional Neural Network Architectures
In the 1990s Yann LeCun developed the first applications of Convolutional Networks. His paper ''Gradient-based learning applied to document recognition'' documents the first applied Convolutional Neural Network, LeNet-5.
This paper is historically important for Convolutional Neural Networks. In it he states:
''Multilayer Neural Networks trained with the backpropagation algorithm constitute the best example of a successful Gradient-Based Learning technique. Given an appropriate network architecture, Gradient-Based Learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns such as handwritten characters, with minimal preprocessing.''
In this paper, while reviewing methods for handwritten character recognition, LeCun demonstrated that Convolutional Neural Networks outperform other methods, because they are designed to deal with 2D shapes.(1) During this research he created LeNet, the first Convolutional Neural Network architecture. In Traditional CNN Architectures we will take a look at combining modules into CNN architectures. These combinations are based on ''What is the Best Multi-Stage Architecture for Object Recognition?'', another paper published by Yann LeCun in 2009. The next step will be a look at the LeNet architecture.
Layers in Traditional Convolutional Neural Network Architectures
Generally, the architecture aims to build a hierarchical structure for fast feature extraction and classification. This hierarchical structure consists of several layers: a filter bank layer, a non-linear transformation layer, and a pooling layer. The pooling layer combines filter responses over local neighborhoods by averaging them or taking their maximum value. This process achieves invariance to small distortions.(2)
The traditional architecture differs from modern ones. Here is a list with short descriptions of the layers used to build Traditional CNN models.
- Filter Bank Layer - {F}_{{CSG}}: This layer acts as a special form of convolutional layer. The only addition is that the convolution output is passed through the \tanh operation and scaled by a trainable gain. The layer calculates the output {y}_{{j}} as follows (a short code sketch of this computation is given after this list):
(1) {y}_{{j}}={g}_{{j}}\tanh\left(\sum_{i}{k}_{{ij}} \ast {x}_{{i}}\right)
- Rectification Layer - {R}_{{abs}}: This layer computes the pointwise absolute value of its input.
- Local Contrast Normalization Layer - N: This layer performs local subtractive and divisive normalizations. It enforces local competition between features in a feature map and between features at the same spatial location in different feature maps.
- Average Pooling and Subsampling Layer - {P}_{{A}}
- Max-Pooling and Subsampling Layer - {P}_{{M}}
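To make equation (1) concrete, the following is a minimal Python/NumPy sketch of the filter bank layer {F}_{{CSG}}. It is only an illustration under assumed conventions ('valid' convolutions, randomly generated kernels, illustrative variable names), not the implementation from the paper.

```python
import numpy as np
from scipy.signal import convolve2d

def fcsg_layer(x, k, g):
    """Filter bank layer F_CSG, eq. (1): y_j = g_j * tanh(sum_i k_ij * x_i).

    x : input feature maps, shape (n_in, H, W)
    k : kernels, shape (n_in, n_out, kh, kw)
    g : trainable gains, one scalar per output map, shape (n_out,)
    """
    n_in, n_out, kh, kw = k.shape
    h_out, w_out = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    y = np.zeros((n_out, h_out, w_out))
    for j in range(n_out):
        s = np.zeros((h_out, w_out))
        for i in range(n_in):
            # convolve input map i with kernel k_ij and accumulate the responses
            s += convolve2d(x[i], k[i, j], mode="valid")
        y[j] = g[j] * np.tanh(s)   # squash with tanh, scale by the gain g_j
    return y

# toy example: 3 input maps of size 8x8, 4 output maps, 3x3 kernels
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
k = rng.standard_normal((3, 4, 3, 3))
g = np.ones(4)
print(fcsg_layer(x, k, g).shape)   # (4, 6, 6)
```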
Information on the Convolutional, Pooling and Rectification Layers can be found here.
Combination of Modules in Traditional Architecture:
We can build different modules by combining layers. A feature-extraction stage is formed by a filter bank layer followed by different combinations of rectification, normalization and pooling layers. Most of the time, one or two stages of feature extraction followed by a classifier are enough to make an architecture for recognition.(3)
- {F}_{{CSG}}-{P}_{{A}}: This combination is one of the most common blocks for building traditional convolutional networks. Stacking several {F}_{{CSG}}-{P}_{{A}} sequences and adding a linear classifier yields a complete traditional network.
Figure 1: the structure of {F}_{{CSG}}-{P}_{{A}}
- {F}_{{CSG}}-{R}_{{abs}}-{P}_{{A}}: In this module the filter bank layer is followed by a rectification layer and an average pooling layer. The input values are squashed by \tanh, then the non-linear absolute value is calculated, and finally the average is taken and down-sampled (a short code sketch of such a stage is given after Figure 4 below).
Figure 2: the structure of {F}_{{CSG}}-{R}_{{abs}}-{P}_{{A}}
- {F}_{{CSG}}-{R}_{{abs}}-N-{P}_{{A}}: This module is very similar to the previous one; the only difference is that a local contrast normalization layer is added between the rectification layer and the average pooling layer. Compared to the previous module, after the non-linear absolute value is calculated, the responses are normalized and sent to the pooling layer, where their average is taken and down-sampled.
Figure 3: the structure of {F}_{{CSG}}-{R}_{{abs}}-N-{P}_{{A}} (Image source(4))
- {F}_{{CSG}}-{P}_{{M}}: This module is another common module for convolutional networks. It forms the basis of the HMAX architecture.
Figure 4: the structure of {F}_{{CSG}}-{P}_{{M}}
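To illustrate how these modules compose, here is a short sketch of an {F}_{{CSG}}-{R}_{{abs}}-{P}_{{A}} stage. It reuses the fcsg_layer function and the toy inputs x, k, g from the sketch above; the 2\times2 non-overlapping average pooling is an assumption of this sketch, not a value fixed by the paper.

```python
import numpy as np

def rectification_abs(y):
    """R_abs: pointwise absolute value of the filter responses."""
    return np.abs(y)

def average_pool(y, size=2):
    """P_A: average pooling over non-overlapping size x size neighborhoods."""
    n, h, w = y.shape
    h, w = h - h % size, w - w % size          # crop to a multiple of the pool size
    y = y[:, :h, :w].reshape(n, h // size, size, w // size, size)
    return y.mean(axis=(2, 4))

def stage_fcsg_rabs_pa(x, k, g):
    """One F_CSG - R_abs - P_A feature-extraction stage."""
    return average_pool(rectification_abs(fcsg_layer(x, k, g)))

# using the toy x, k, g defined earlier: (3, 8, 8) input -> (4, 6, 6) -> (4, 3, 3)
print(stage_fcsg_rabs_pa(x, k, g).shape)   # (4, 3, 3)
```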
Modern Convolutional Neural Network Architecture:
This chapter offers basic knowledge on how to build simple, reliable modern architectures and demonstrates some well-known examples from the literature.
Layers used in Modern Convolutional Neural Networks:
Layers in modern architectures are very similar to the traditional layers, yet there are certain differences; for example, RELU is a special implementation of the Rectification Layer. You can find more information about RELU and the Fully Connected Layer here.
For a simple Convolutional Network the following layers are used:
- Input Layer
- Convolutional Layer
- RELU Layer
- Pooling Layer
- Fully Connected Layer
The main idea is that the architecture takes as input an image of size [A \times B \times C] and produces the class scores of that image at its output. The Convolutional Layer and the RELU (Rectification) Layer are stacked together and followed by a Pooling Layer. This structure is commonly used and repeated until the input (image) has shrunk spatially to a small size. After that it is sent to the Fully Connected Layers. The output of the last fully connected layer, which sits at the end of the architecture, produces the class scores of the input image.(9)
A few examples of net architectures:
- only a single Fully Connected Layer: This is just a linear classifier
- Convolutional → RELU→ Fully Connected
- Convolutional → RELU → Pooling → Convolutional → RELU → Pooling → Fully Connected → RELU → Fully Connected: a single Convolutional Layer between every Pooling Layer (see the sketch after this list)
- Convolutional → RELU → Convolutional → RELU → Pooling → Convolutional → RELU → Convolutional → RELU → Pooling → Convolutional → RELU → Convolutional → RELU → Pooling → Fully Connected → RELU → Fully Connected → RELU → Fully Connected: This architectural form has two Convolutional Layers before each Pooling Layer. It is useful when building large and deep networks, because multiple convolutional layers can develop more detailed and complex features of the input before it is sent to the pooling layer, where some portion of the information is lost.
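A minimal sketch of the pattern with a single convolutional layer between every pooling layer, written with PyTorch as an assumed framework; the 32\times32\times3 input size, the channel counts and the 10 output classes are illustrative choices, not values from the text.

```python
import torch
import torch.nn as nn

# Sketch of: Conv -> RELU -> Pool -> Conv -> RELU -> Pool -> FC -> RELU -> FC.
# Channel counts (16, 32), hidden size 128 and 10 classes are illustrative.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3x32x32  -> 16x32x32
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x32x32 -> 16x16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 16x16x16 -> 32x16x16
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x16x16 -> 32x8x8
    nn.Flatten(),                                 # 32*8*8 = 2048 features
    nn.Linear(32 * 8 * 8, 128),
    nn.ReLU(),
    nn.Linear(128, 10),                           # class scores for 10 classes
)

scores = model(torch.randn(1, 3, 32, 32))         # one input image of size [32 x 32 x 3]
print(scores.shape)                               # torch.Size([1, 10])
```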
How to build the layers:
Convolutional Layer: Generally, we want to use small filters. When building layers, stacks of smaller convolutional filters are preferred over a single large layer. Assume that we have three stacked 3\times3 convolutional layers. In that formation, neurons of the first layer have a 3\times3 view of the input; neurons of the next layer have a 3\times3 view of the first layer, which means a 5\times5 view of the input; and neurons of the third layer have a 3\times3 view of the second layer and therefore a 7\times7 view of the input. Parameter-wise, this structure has 3\times (C \times (3 \times 3 \times C))=27C^2 parameters, compared to the C \times (7 \times 7 \times C)=49C^2 parameters of a single 7\times7 convolutional layer.
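A quick check of these counts and of the growing receptive field, with the channel count C chosen only for illustration:

```python
C = 64                                    # illustrative number of channels

# three stacked 3x3 layers vs. one 7x7 layer (biases ignored, as in the text)
stacked_3x3 = 3 * (C * (3 * 3 * C))       # 27 * C^2
single_7x7 = C * (7 * 7 * C)              # 49 * C^2
print(stacked_3x3, single_7x7)            # 110592 200704

# receptive field of three stacked stride-1 3x3 layers: 3 -> 5 -> 7
rf = 1
for _ in range(3):
    rf += 2
print(rf)                                 # 7, the same view as a single 7x7 layer
```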
Pooling Layer: Max-pooling with 2\times2 receptive fields discards 75% of the input information, because the input is down-sampled by 2 in height and width. Occasionally 3\times3 receptive fields are used, but in general receptive fields larger than 3\times3 are not practical, because they cause a high loss of input data.
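A tiny NumPy example of how much a 2\times2 max-pooling step discards (the 4\times4 input is purely illustrative):

```python
import numpy as np

x = np.arange(16).reshape(4, 4)                  # a 4x4 feature map
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))  # 2x2 max pooling with stride 2
print(pooled)                                    # [[ 5  7]
                                                 #  [13 15]]
print(pooled.size / x.size)                      # 0.25 -> 75% of the values are discarded
```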
Literature
[1] Gradient-based learning applied to document recognition, 1998 (Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner)
[2] What is the Best Multi-Stage Architecture for Object Recognition?, 2009 (Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, Yann LeCun)
[3] ImageNet Classification with Deep Convolutional Neural Networks, 2012 (Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton)
[4] Multi-column Deep Neural Networks for Image Classification, 2012 (Dan Ciresan, Ueli Meier, Jürgen Schmidhuber)