Dropout is a regularization technique that is not limited to convolutional neural networks but applies to neural networks in general. The basic idea is to randomly remove units from the network during training, which is intended to prevent co-adaption. This has been shown to reduce overfitting and improve the performance of neural networks.

Image 1: Visualization of the dropout during the training of a neural network. Image source [1].

Motivation

Overfitting

Overfitting is a problem that can occur whenever a model is fitted to existing data. This includes statistics as well as various supervised and unsupervised machine learning methods, such as linear regression, k-nearest neighbor regression, logistic regression, support vector machines and neural networks. Hawkins explains overfitting with the principle of parsimony.^[3] This principle states that a model should contain no more variables than are necessary to describe the data. Models of scientific relationships that violate the principle of parsimony are prone to overfitting. An example is fitting a high-order polynomial to a set of 2D points that have a linear relationship. Image 2 shows such an example.
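The polynomial example can be reproduced with a few lines of Python. The following is only a minimal sketch: the data points, the noise values and all function names are made up for illustration and are not taken from Image 2. A straight line and a degree-6 polynomial that passes through all seven training points are compared on the training points and on two unseen points that follow the same linear relationship.

```python
# Illustrative data (assumed, not from the article): y = 2x + 1 plus small noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
noise = [0.3, -0.4, 0.5, -0.3, 0.4, -0.5, 0.2]
train_y = [2.0 * x + 1.0 + e for x, e in zip(train_x, noise)]

# Unseen points that follow the same linear relationship (no noise).
test_x = [7.0, 8.0]
test_y = [2.0 * x + 1.0 for x in test_x]


def fit_line(xs, ys):
    """Least-squares straight line (two parameters)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolating_polynomial(xs, ys):
    """Polynomial of degree len(xs) - 1 through all points (Lagrange form)."""

    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total

    return predict


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)  # degree 6 for 7 points

for name, model in (("line", line), ("degree-6 polynomial", poly)):
    print(name,
          "| train MSE:", mean_squared_error(model, train_x, train_y),
          "| test MSE:", mean_squared_error(model, test_x, test_y))
```

The polynomial reproduces the training points exactly, yet its error on the unseen points is far larger than that of the line, because it has also fitted the noise.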

Deep neural networks can learn complex relationships between their input data and their output. Depending on the amount of training data, a network may develop a behavior that gives good results on the training data but fails as soon as unknown test data is fed into the network. A variety of methods exist to prevent overfitting in neural networks. The simplest approach is to feed more training data into the network. This makes it less likely that the network learns features that are merely coincidental regularities of the training data rather than general properties of the test data. However, it requires more training data and also increases the training time and computational cost, which are in general limiting factors for neural networks.

Another method that shows remarkable results is to train models on different subsets of the training data and to combine them into one model. This approach is called bootstrap aggregating or "bagging". It is not limited to neural networks, but can be applied to all forms of statistical classification and regression tasks.^[4] A further method to prevent overfitting is called "early stopping", which means that the training is stopped, ideally just before the validation error starts to rise. Finding the point in time at which the network has the best possible generalization and "price-performance ratio" is not trivial. For further reading, see Prechelt's work on finding an early stopping criterion for training neural networks.^[5]

The last approach is to create a network model that has the right capacity.⁽²⁾ If the capacity is too small, the network is not able to represent all features and regularities that define the training data. If the capacity is too large, the network may learn spurious regularities from the training data. Methods to limit the capacity of a neural network are weight decay (large weights are penalized or constrained), limiting the number of hidden layers and units, or injecting noise into the network.⁽²⁾ A successful way to prevent overfitting is dropout. Here, units are randomly removed from the neural network, which can also be seen as a form of adding noise to the network.
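The early stopping idea mentioned above can be sketched as follows. This is a simple "patience" rule chosen for illustration. It is not one of the specific criteria discussed by Prechelt, and the function name, the `patience` parameter and the error values are assumptions.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return (best_epoch, best_error) under a simple patience rule.

    Training is considered stopped once the validation error has not
    improved for `patience` consecutive epochs (hypothetical criterion).
    """
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_error


# Made-up validation errors: they fall first, then rise again due to overfitting.
errors = [0.90, 0.70, 0.50, 0.45, 0.47, 0.50, 0.56, 0.40]
print(early_stopping_epoch(errors, patience=3))
```

In this made-up run the last value is never evaluated, because training stops before it is reached. This illustrates why choosing the stopping point is not trivial: a rule that stops early may miss a later improvement.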

Co-adaption

Co-adaption is a term originating from biology and evolutionary theory.⁽³⁾ Among other things, it describes the process in which different species develop an interdependence. An example is the relationship between the plant Acacia hindsii and the ant species Pseudomyrmex ferruginea. Both species have developed habits that are unusual among closely related species. The ant is active 24 hours a day to protect the plant, while the plant grows leaves throughout the whole year in order to provide food. While co-adaption may be an evolutionary advantage in nature, it can be a problem in convolutional neural networks. Hinton et al. describe the co-adaption of feature detectors in neural networks.^[2] It means that a single feature detector is not able to describe a meaningful image feature on its own, but only in combination with other feature detectors. They found that randomly dropping units from the neural network prevents co-adaption between the feature detectors, as individual feature detectors start to detect specific, helpful features.

Image 2: The graph shows a set of 2D data points with an approximately linear relationship. While the red curve, a sixth-order polynomial, has a smaller modelling error on these points than the green line, the green line will be the more accurate model once more data is added.

Image 3: The vertical axis shows the error, the horizontal axis the training time. The training error of the neural network decreases steadily. The validation error is the error obtained when data that was not used for training is fed into the network. It rises after a certain point, due to overfitting. Image source [5].

Method

Let $\begin{array}{l}l \in \{1,...,L\}\end{array}$ index the hidden layers of a neural network, where $\begin{array}{l}\textbf{z}^{(l)}\end{array}$ is the input vector and $\begin{array}{l}\textbf{y}^{(l)}\end{array}$ is the output vector of hidden layer $\begin{array}{l}l\end{array}$ . For a hidden unit $\begin{array}{l}i\end{array}$ with weight vector $\begin{array}{l}\textbf{w}_{i}^{(l)}\end{array}$ and bias $\begin{array}{l}b_i^{(l)}\end{array}$ , the formulas for the feed-forward operation are listed under (a) below, where $\begin{array}{l}f()\end{array}$ can be any kind of activation function. The formulas under (b) show the same operation with dropout applied to the network. As illustrated in Image 4, a new vector $\begin{array}{l}\textbf{r}^{(l)}\end{array}$ is applied to $\begin{array}{l}\textbf{y}^{(l)}\end{array}$ . Each element of $\begin{array}{l}\textbf{r}^{(l)}\end{array}$ is independently $\begin{array}{l}1\end{array}$ with a probability of $\begin{array}{l}p\end{array}$ and $\begin{array}{l}0\end{array}$ with a probability of $\begin{array}{l}1-p\end{array}$ . While the vector $\begin{array}{l}\textbf{y}^{(l)}\end{array}$ is passed on unchanged in network (a), network (b) applies the element-wise product $\begin{array}{l}*\end{array}$ of $\begin{array}{l}\textbf{y}^{(l)}\end{array}$ and $\begin{array}{l}\textbf{r}^{(l)}\end{array}$ , which randomly sets elements of the input vector of layer $\begin{array}{l}l+1\end{array}$ to $\begin{array}{l}0\end{array}$ .^[1]

(a) Standard Neural Network

$\begin{array}{l}\displaystyle z_i^{(l+1)} = \textbf{w}_i^{(l+1)}\textbf{y}^{(l)}+b_i^{(l+1)}\end{array}$

$\begin{array}{l}\displaystyle y_i^{(l+1)}=f(z_i^{(l+1)})\end{array}$

(b) Neural Network with Dropout

$\begin{array}{l}\displaystyle r_j^{(l)}\sim Bernoulli(p)\end{array}$

$\begin{array}{l}\displaystyle \widetilde{\textbf{y}}^{(l)} =\textbf{r}^{(l)}*\textbf{y}^{(l)}\end{array}$

$\begin{array}{l}\displaystyle z_i^{(l+1)}= \textbf{w}_i^{(l+1)}\widetilde{\textbf{y}}^{(l)} + b_i^{(l+1)}\end{array}$

$\begin{array}{l}\displaystyle y_i^{(l+1)} = f(z_i^{(l+1)})\end{array}$

The above dropout stage with probability $\begin{array}{l}p\end{array}$ is repeated in every training step, each time with a newly sampled vector $\begin{array}{l}\textbf{r}^{(l)}\end{array}$ .
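The feed-forward formulas of (a) and (b) can be written down for a single layer as follows. This is a minimal sketch using only the Python standard library. The function names, the ReLU activation and the example weights are assumptions made for illustration and do not come from [1].

```python
import random


def relu(z):
    """Example activation function f (an assumption; any f can be used)."""
    return max(0.0, z)


def standard_forward(weights, biases, y, f=relu):
    """(a): z_i = w_i . y + b_i and y_i = f(z_i) for each unit i of layer l+1."""
    return [f(sum(w * v for w, v in zip(w_i, y)) + b_i)
            for w_i, b_i in zip(weights, biases)]


def sample_mask(n, p, rng):
    """r_j ~ Bernoulli(p): 1 keeps unit j, 0 removes it."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def dropout_forward(weights, biases, y, p, rng, f=relu):
    """(b): thin y with a freshly sampled mask r, then apply the layer as in (a)."""
    r = sample_mask(len(y), p, rng)
    y_thinned = [r_j * y_j for r_j, y_j in zip(r, y)]
    return standard_forward(weights, biases, y_thinned, f), r


rng = random.Random(0)  # fixed seed so that the sketch is repeatable
weights = [[0.5, -1.0, 2.0], [1.5, 0.5, -0.5]]  # two units, three inputs each
biases = [0.1, -0.2]
y = [1.0, 2.0, 3.0]

print("standard:", standard_forward(weights, biases, y))
for step in range(3):  # a new mask is drawn in every training step
    output, mask = dropout_forward(weights, biases, y, p=0.5, rng=rng)
    print("step", step, "| mask", mask, "| output", output)
```

With `p=1.0` every unit is kept and the dropout layer behaves exactly like the standard layer. With `p=0.0` all inputs are removed and only the biases reach the activation function.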

A successful way to reduce the error on the test set is to perform model averaging.^[2] Here, the outputs of many different networks are combined into the final result. To obtain the outputs of many different networks, all of these models have to be trained beforehand, which is very expensive. Hinton et al. describe their dropout method as an efficient way to train many different networks, as a randomly thinned subset of the network is trained in each step. For a network with $\begin{array}{l}n\end{array}$ units there exist $\begin{array}{l}2^n\end{array}$ possible thinned subnets. Therefore, a dropout network can also be seen as a combination of many different networks, each of which was trained either very rarely or not at all. Image 5 shows the difference between training and testing. During training, a unit is present in the network with a probability of $\begin{array}{l}p\end{array}$ in each single training step. This means that each unit is only trained on a subset of the training data. During testing, all units are active in the network, but each of their outgoing weights is multiplied by this probability, i.e. the weights are halved for $\begin{array}{l}p=0.5\end{array}$ , so that the total input to the following units is approximately the same at training and test time.^[1] This means that the expected output value of a unit during training is equal to its output at test time. During the test stage, all units are present in order to approximate the combination of all $\begin{array}{l}2^n\end{array}$ dropout networks. Hinton et al. state that this procedure gives approximately the same result as averaging over the outputs of a great number of dropout networks.^[2]
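The relationship between averaging over thinned networks and scaling the weights can be checked by brute force for one small unit. The sketch below enumerates all $2^n$ masks for the inputs of a single unit and compares the probability-weighted average of its pre-activation with the value obtained from weights multiplied by $p$. The function names and the example numbers are assumptions made for illustration.

```python
from itertools import product


def averaged_preactivation(w, y, b, p):
    """Average z = w . (r * y) + b over all 2^n masks r, weighted by P(r)."""
    n = len(y)
    total = 0.0
    for mask in product((0, 1), repeat=n):  # all 2^n thinned versions
        kept = sum(mask)
        probability = p ** kept * (1.0 - p) ** (n - kept)
        z = sum(w_j * r_j * y_j for w_j, r_j, y_j in zip(w, mask, y)) + b
        total += probability * z
    return total


def weight_scaled_preactivation(w, y, b, p):
    """Test-time rule: keep all units, multiply every weight by p."""
    return sum(p * w_j * y_j for w_j, y_j in zip(w, y)) + b


w = [0.5, -1.0, 2.0, 0.25]
y = [1.0, 2.0, 3.0, 4.0]
b = 0.1
p = 0.5

print("number of thinned networks:", 2 ** len(y))
print("average over all masks:   ", averaged_preactivation(w, y, b, p))
print("weights scaled by p:      ", weight_scaled_preactivation(w, y, b, p))
```

For this linear pre-activation both values agree up to rounding. Once a non-linear activation function and several layers are involved, weight scaling is only an approximation of the average, which matches the "approximately" in the statement above.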

Image 4: Comparison of a standard and a dropout NN. Image source [1].

Image 5: During training, each unit in the network is present with a probability of $\begin{array}{l}p\end{array}$ . During testing, this probability is not applied to the units but to their weights, to account for the fact that more units are active in the network than during training (twice as many for $\begin{array}{l}p=0.5\end{array}$ ). Image source [1].

Effect on Filters

Image 6 helps to develop a good intuition for why dropout is useful when training a neural network. It shows 256 feature detectors that were trained on the MNIST data set in a convolutional neural network with one hidden layer. In Image 6 (a) no dropout was applied, while in (b) units were removed from the network with a probability of $\begin{array}{l}p = 0.5\end{array}$ . While the feature detectors in (a) seem to contain mainly random noise, the filters in (b) can detect spots, edges and strokes inside an image. Hinton et al. explain this phenomenon as follows: a unit in a neural network adapts its parameters in a way that minimizes the final loss function of the network.^[2] The value of the loss function is not determined by this single unit alone, but depends on all other units as well. Therefore, a unit can adapt in a way that fixes the errors of other units in the network. This can lead to complex co-adaptions, which should be avoided. When units are randomly removed from the network, a single unit can rely less on the functioning of other units and therefore has to detect a meaningful feature by itself.

Image 6: Effects on a set of feature detectors taken from a convolutional neural network. While the features in (a) are mostly indistinguishable for humans and seem to contain a large portion of white noise, the features in (b) are the product of training with dropout. As can be seen, these detectors are able to filter meaningful features such as spots, corners and strokes in an image. Image source [1].

Effect on Performance

In their experiments, Srivastava et al. show that the best performance of the neural network, i.e. the lowest classification error, is achieved by removing units in hidden layers with a probability of $\begin{array}{l}0.5\end{array}$ and removing units in the input layer with a probability of $\begin{array}{l}0.2\end{array}$ .^[1] Image 7 shows the classification error of a set of different neural network architectures, each represented by a different color. The network models consist of 2 to 4 hidden layers and 1024 to 2048 units per layer. The image shows that the classification error decreases remarkably when dropout is applied during the training of a network. It further shows that the improvement is independent of the network architecture used, so the method is widely applicable. As of 2014, all neural networks that provide state-of-the-art results on one of the databases in the table below use dropout during their training.

Image 7: Comparison of test error with and without dropout. Each color represents a different network architecture, ranging from 2 to 4 hidden layers and 1024 to 2048 units. Image source [1].

Effect on Sparsity

Another important effect of dropout is that the activations (the output values of the neurons) tend to become sparse, even when no sparsity-inducing regularizer, such as a weight penalty, is applied.^[1] In a sparse model, the number of units with high activations is very low and most of the activations are close to zero. Furthermore, the mean activations should be low as well. Image 8 shows a random subset of units taken from a dropout network (b) and from the equivalent network without dropout (a). Image 8 (a) shows that most units in the network have a mean activation of about 2.0, and the actual activation values are widely distributed. When dropout is applied, the number of high activations decreases noticeably, and the average activation tends to be around 0.7, compared to 2.0 for the standard neural network.

Image 8: The graphs show histograms of the mean activation and of the activation values of a set of randomly selected units from the network. While most units have a mean activation of about 2.0 in (a), there is a shift towards smaller mean activations when dropout is performed. The right histograms of (a) and (b) show that with dropout most of the units have a low activation, while only a few neurons in the network retain a high activation value. Image source [1].

Choosing the right Parameter

Srivastava et al. found that the close-to-optimal value of the parameter $\begin{array}{l}p\end{array}$ for retaining a unit in a network is about $\begin{array}{l}p = 0.5\end{array}$ for hidden units and $\begin{array}{l}p = 0.8\end{array}$ for input units.^[1] A higher value for input units seems intuitive, as leaving out units in the input layer directly discards information from the input. The improvement is approximately the same throughout the range $\begin{array}{l}0.4 \leq p \leq 0.8\end{array}$ . Outside this range, the network becomes prone to underfitting when too many units are removed, or to overfitting when so few units are removed that the network changes only slightly during training.

Literature

[1] Dropout: A Simple Way to Prevent Neural Networks from Overfitting (2014, N. Srivastava et al.)

[2] Improving Neural Networks by Preventing Co-adaptation of Feature Detectors (2012, G. E. Hinton, N. Srivastava et al.)

[3] The Problem of Overfitting (2004, D. M. Hawkins)

[4] Bagging Predictors (1996, L. Breiman)

[5] Early Stopping - but when? (2012, L. Prechelt)

Weblinks

(1) http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/

(2) http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec9.pdf

(3) http://www.encyclopedia.com/earth-and-environment/ecology-and-environmentalism/environmental-studies/co-adaptation
