Supervised Learning

Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems[1]. The training data consists of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). For instance, a training example for a neural network could look like this:

(1)	$\begin{array}{l}\displaystyle (\vec{x}_i, \vec{y}_i), 0< i\leq n\end{array}$

, where $\begin{array}{l}\vec{x}_i\end{array}$ is the $\begin{array}{l}i^{th}\end{array}$ input vector and correspondingly $\begin{array}{l}\vec{y}_i\end{array}$ is the desired $\begin{array}{l}i^{th}\end{array}$ output vector.

Error Measures

What we'd like from the learning process is an algorithm that lets us find weights and biases so that the output of the network $\begin{array}{l}\vec{f}(\vec{x}_i)\end{array}$ approximates $\begin{array}{l}\vec{y}_i\end{array}$ for all training inputs $\begin{array}{l}\vec{x}_i\end{array}$. To quantify how well we're achieving this goal, we use an error measure (also called a cost function) that indicates the difference between $\begin{array}{l}\vec{f}(\vec{x}_i)\end{array}$ and $\begin{array}{l}\vec{y}_i\end{array}$. A widely used measure is the Mean Squared Error (MSE):

(2)	$\begin{array}{l}\displaystyle E(w,b) = \frac{1}{n} \sum_{\vec{x}}\|\vec{y}- \vec{f}(\vec{x}) \|^2\end{array}$

Here, $\begin{array}{l}w\end{array}$ denotes the collection of all weights in the network, $\begin{array}{l}b\end{array}$ all the biases, $\begin{array}{l}n\end{array}$ is the total number of training inputs, $\begin{array}{l}\vec{f}(\vec{x})\end{array}$ is the vector of outputs from the network when $\begin{array}{l}\vec{x}\end{array}$ is the input, $\begin{array}{l}\vec{y}\end{array}$ is the desired output, and the sum is over all training inputs, $\begin{array}{l}\vec{x}\end{array}$ .
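As a minimal sketch of equation (2), the following pure-Python function averages the squared Euclidean distance between desired and actual outputs. The function name and the toy data are assumptions made for illustration only.

```python
def mean_squared_error(targets, outputs):
    """Equation (2): mean over examples of ||y - f(x)||^2."""
    n = len(targets)
    total = 0.0
    for y, f in zip(targets, outputs):
        # Squared Euclidean distance between desired and actual output vector.
        total += sum((y_j - f_j) ** 2 for y_j, f_j in zip(y, f))
    return total / n


# Toy data (hypothetical): two examples with two outputs each.
targets = [[0.0, 1.0], [1.0, 0.0]]
outputs = [[0.0, 1.0], [0.5, 0.5]]
mse = mean_squared_error(targets, outputs)
print(mse)
```

The first example is reproduced exactly and contributes nothing; the second contributes 0.25 + 0.25, so the average over both examples is 0.25.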

Besides the MSE, the cross-entropy error measure is also used in neural networks:

(3)	$\begin{array}{l}\displaystyle E(w,b) = -\frac{1}{n} \sum_{\vec{x}}\left[\vec{y}\log\vec{f}(\vec{x}) +(1-\vec{y})\log(1-\vec{f}(\vec{x}))\right]\end{array}$

where $\begin{array}{l}n\end{array}$ is the total number of training examples, the sum is over all training inputs $\begin{array}{l}\vec{x}\end{array}$, and $\begin{array}{l}\vec{y}\end{array}$ is the corresponding desired output. The leading minus sign makes the cost non-negative, because both logarithms are non-positive for network outputs between 0 and 1.
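The cross-entropy measure can be sketched in the same style. The version below includes the leading minus sign, so the cost is non-negative, and sums over all output components of each example. The clipping constant `eps` is an assumption added here to avoid evaluating log(0); it is not part of the formula.

```python
import math


def cross_entropy(targets, outputs, eps=1e-12):
    """Cross-entropy cost averaged over examples, summed over output components."""
    n = len(targets)
    total = 0.0
    for y, f in zip(targets, outputs):
        for y_j, f_j in zip(y, f):
            # Keep the prediction strictly inside (0, 1) so both logs are defined.
            f_j = min(max(f_j, eps), 1.0 - eps)
            total += y_j * math.log(f_j) + (1.0 - y_j) * math.log(1.0 - f_j)
    return -total / n


# A maximally uncertain prediction of 0.5 for a target of 1 costs log(2).
ce = cross_entropy([[1.0]], [[0.5]])
print(ce)
```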

Training Protocols

Here, we mainly introduce four protocols:

Batch training: Batch training computes the gradient of the cost function with respect to the parameters $\begin{array}{l}(w,b)\end{array}$ over the entire training dataset.
Stochastic training: In contrast, stochastic training performs a parameter update for each training example $\begin{array}{l}\vec{x}_i\end{array}$ and desired output $\begin{array}{l}\vec{y}_i\end{array}$.
Online training: In this protocol, each training example is used only once to update the weights and biases.
Mini-batch training: Mini-batch training works by randomly picking a small number $\begin{array}{l}m\end{array}$ of training inputs. We refer to each such random subset as a mini-batch. It combines the advantages of the other three protocols and performs an update for every mini-batch of $\begin{array}{l}m\end{array}$ training examples.
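The protocols differ mainly in how many examples feed each update. A small sketch of the mini-batch case, assuming a helper named `minibatches` (hypothetical) that shuffles the data once per pass and cuts it into chunks of size m:

```python
import random


def minibatches(examples, m, rng):
    """Yield shuffled mini-batches of size m; the last one may be smaller."""
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), m):
        yield [examples[i] for i in indices[start:start + m]]


# Ten placeholder training examples, split into mini-batches of m = 4.
examples = list(range(10))
rng = random.Random(0)
batches = list(minibatches(examples, m=4, rng=rng))
print([len(batch) for batch in batches])
```

Under this sketch, choosing m equal to the dataset size gives one batch per pass (batch training), and m = 1 gives one update per example (stochastic training).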

Parameter Optimization

Recapping, our goal in training a neural network is to find weights and biases which minimize the cost function $\begin{array}{l}E(w,b)\end{array}$ , which we could also call Parameter Optimization.
Let's first consider the simplest situation: $\begin{array}{l}E(\theta_1,\theta_2) = g(\theta_1,\theta_2)\end{array}$ , where $\begin{array}{l}\theta_1\end{array}$ and $\begin{array}{l}\theta_2\end{array}$ are 1-dimensional variables. Assume that $\begin{array}{l}E(\theta_1,\theta_2)\end{array}$ has the form:

Fig. 1: Function $\begin{array}{l}E(\theta_1,\theta_2)\end{array}$ .(Source:2)

In Fig. 1, what we'd like is to find where $\begin{array}{l}E(\theta_1,\theta_2)\end{array}$ achieves its global minimum. For the function plotted above, we can eyeball the graph and find the minimum, but in general we do not know the exact point at which $\begin{array}{l}E(\theta_1,\theta_2)\end{array}$ reaches its minimum.
Theoretically, when the minimum of $\begin{array}{l}E(\theta_1,\theta_2)\end{array}$ is reached, then the following condition should be satisfied:

(4)	$\begin{array}{l}\displaystyle \nabla E(\theta_1,\theta_2) = 0\end{array}$

In practice, the cost function of a neural network is so complicated that it can be impossible to solve for the zero of the gradient of $\begin{array}{l}E(\theta_1,\theta_2)\end{array}$ analytically. Instead, we start by thinking of $\begin{array}{l}E(\theta_1,\theta_2)\end{array}$ as a kind of valley. We imagine a ball rolling down the slope of the valley until it eventually reaches the bottom. How can we use this idea in our neural network?

From calculus, we know that:

(5)	$\begin{array}{l}\displaystyle \Delta E \approx \nabla E \cdotp \Delta \vec{\theta}\end{array}$

Recall that our purpose is to minimize the cost function $\begin{array}{l}E(\theta_1,\theta_2)\end{array}$ , so if we choose

(6)	$\begin{array}{l}\displaystyle \Delta \vec{\theta} = - \eta \nabla E\end{array}$

where $\begin{array}{l}\eta\end{array}$ is the learning rate (a small, positive parameter), then substituting this choice into equation (5) gives

(7)	$\begin{array}{l}\displaystyle \Delta E \approx -\eta \|\nabla E\|^2\end{array}$

which guarantees $\begin{array}{l}\Delta E \leq 0\end{array}$ so that the "ball" could roll to the bottom of the "valley". This method is called Gradient Descent.
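To make the rolling-ball picture concrete, here is a small sketch of gradient descent that repeatedly moves the parameters by the negative gradient times the learning rate. The quadratic cost, the starting point and the step count are assumptions chosen only so that the gradient can be written by hand.

```python
def E(theta):
    """Hypothetical bowl-shaped cost: E = theta1^2 + 2 * theta2^2."""
    t1, t2 = theta
    return t1 ** 2 + 2.0 * t2 ** 2


def grad_E(theta):
    """Analytic gradient of the hypothetical cost above."""
    t1, t2 = theta
    return [2.0 * t1, 4.0 * t2]


def gradient_descent(theta, eta=0.1, steps=100):
    """Apply theta <- theta - eta * grad E and record the cost after each step."""
    history = [E(theta)]
    for _ in range(steps):
        g = grad_E(theta)
        theta = [t - eta * g_i for t, g_i in zip(theta, g)]
        history.append(E(theta))
    return theta, history


theta, history = gradient_descent([3.0, -2.0])
print(theta, history[0], history[-1])
```

With this small learning rate the recorded cost never increases and the parameters approach the minimum at (0, 0). A learning rate that is too large would break the linear approximation and could make the cost grow instead.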

Besides this, Newton's Method is also widely used for optimizing the parameters.
Because Newton's Method uses second-order information about the cost function, it typically converges in fewer iterations.

(8)	$\begin{array}{l}\displaystyle \Delta \vec{\theta} = -\eta(H_n(\vec{\theta}))^{-1}\nabla E(\vec{\theta})\end{array}$

where $\begin{array}{l}H_n\end{array}$ is the Hessian matrix of $\begin{array}{l}E(\vec{\theta})\end{array}$ . The main drawback of this method is that evaluating and inverting the Hessian matrix is computationally expensive.
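A sketch of a single Newton step for two parameters, assuming a hypothetical quadratic cost E = theta1^2 + 2 * theta2^2 whose gradient and Hessian are known in closed form. The 2x2 linear system is solved directly rather than forming the inverse explicitly.

```python
def grad(theta):
    """Gradient of the hypothetical cost E = theta1^2 + 2 * theta2^2."""
    t1, t2 = theta
    return [2.0 * t1, 4.0 * t2]


def hessian(theta):
    """Hessian of the same cost; constant because the cost is quadratic."""
    return [[2.0, 0.0], [0.0, 4.0]]


def solve_2x2(H, g):
    """Solve H * s = g for s using the closed-form 2x2 inverse."""
    (a, b), (c, d) = H
    det = a * d - b * c
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]


def newton_step(theta, eta=1.0):
    """One update theta <- theta - eta * H^{-1} * grad E."""
    step = solve_2x2(hessian(theta), grad(theta))
    return [t - eta * s for t, s in zip(theta, step)]


theta = newton_step([3.0, -2.0])
print(theta)
```

For a quadratic cost and a learning rate of 1, one step lands on the minimum. For a network with many parameters the Hessian has one row and column per parameter, which is where the cost of this method comes from.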

Weights Initialization

Now that we've learned how to optimize the parameters, this subsection looks at how to initialize them. First, we should clarify: can the parameters all be initialized to zero?
The answer is no, because if every neuron in the network computes the same output, then they will also all compute the same gradients during backpropagation and undergo the exact same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.

Small Random Numbers. As a solution, it is common to initialize the weights of the neurons to small random numbers, which is referred to as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.
Calibrating the variances with $\begin{array}{l}1/ \sqrt{n}\end{array}$ . One problem with the above suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that we can normalize the variance of each neuron's output to 1 by dividing its weight vector by the square root of its fan-in (i.e. its number of inputs).
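The effect of the calibration can be checked empirically. The sketch below draws weights from a standard normal distribution, divides them by the square root of the fan-in, and estimates the variance of the weighted input for unit-variance inputs. The function names, the fan-in of 100 and the number of trials are assumptions for illustration.

```python
import math
import random


def init_weights(fan_in, rng):
    """Draw fan_in weights from N(0, 1) and divide by sqrt(fan_in)."""
    return [rng.gauss(0.0, 1.0) / math.sqrt(fan_in) for _ in range(fan_in)]


def weighted_input_variance(fan_in, trials, rng):
    """Sample variance of z = sum_k w_k * x_k for unit-variance inputs x."""
    zs = []
    for _ in range(trials):
        w = init_weights(fan_in, rng)
        x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
        zs.append(sum(w_k * x_k for w_k, x_k in zip(w, x)))
    mean = sum(zs) / trials
    return sum((z - mean) ** 2 for z in zs) / trials


rng = random.Random(0)
variance = weighted_input_variance(fan_in=100, trials=2000, rng=rng)
print(variance)
```

The estimate should come out close to 1. Without the division by the square root of the fan-in, the same experiment would give a variance close to the fan-in itself.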

Error Backpropagation

In the previous sections, we discussed how to initialize and optimize the parameters. Now, let's think more about the optimization: how can each parameter in a multi-layered neural network be optimized?

Principle Part

To solve this problem, the Error Backpropagation Algorithm[2] is the most widely known method. The algorithm consists of four equations.
First, assume the weighted input to the $\begin{array}{l}j^{th}\end{array}$ neuron in layer $\begin{array}{l}l\end{array}$ is:

(9)	$\begin{array}{l}\displaystyle z^l_j = \Sigma_k w^l_{jk} a^{l-1}_k + b^l_j\end{array}$

where $\begin{array}{l}a^{l-1}_k\end{array}$ is the output of the k-th neuron in layer $\begin{array}{l}l-1\end{array}$ and $\begin{array}{l}w^l_{jk}\end{array}$ is the corresponding weight.
Next, we define the error $\begin{array}{l}\delta^l_j\end{array}$ of neuron j in layer $\begin{array}{l}l\end{array}$ by

(10)	$\begin{array}{l}\displaystyle \delta^l_j \equiv \frac{\partial E}{\partial z^l_j}\end{array}$

First Equation for Error in the Output Layer, $\begin{array}{l}\delta^L\end{array}$ :

(11)	$\begin{array}{l}\displaystyle \delta^L_j = \frac{\partial E}{\partial a^L_j}\sigma'(z^L_j)\end{array}$

where $\begin{array}{l}\sigma\end{array}$ is the activation function.

Second Equation for the Error $\begin{array}{l}\delta^l\end{array}$ in terms of the error in the next layer, $\begin{array}{l}\delta^{l+1}\end{array}$ :

(12) $\begin{array}{l}\displaystyle \delta^l = ((w^{l+1})^T \delta^{l+1})\bigodot\sigma'(z^l)\end{array}$

where $\begin{array}{l}\bigodot\end{array}$ is the Hadamard product.

By combining these two equations above we can compute the error $\begin{array}{l}\delta^l\end{array}$ for any layer in the network. We start by using First Equation for Error to compute $\begin{array}{l}\delta^L\end{array}$ , then apply the Second Equation for Error to compute $\begin{array}{l}\delta^{L-1}\end{array}$ . Then use the Second Equation for Error to compute $\begin{array}{l}\delta^{L-2}\end{array}$ , and so on, all the way back through the network.

Third Equation for the rate of change of the cost with respect to any bias in the network:

(13) $\begin{array}{l}\displaystyle \frac{\partial E}{\partial b^l_j} = \delta^l_j\end{array}$

That is, the error $\begin{array}{l}\delta^l_j\end{array}$ is exactly equal to the rate of change $\begin{array}{l}\frac{\partial E}{\partial b^l_j}\end{array}$ .
Fourth equation for the rate of change of the cost with respect to any weight in the network:

(14)	$\begin{array}{l}\displaystyle \frac{\partial E}{\partial w^l_{jk}} = a^{l-1}_k \delta ^l_j\end{array}$

This tells us how to compute the partial derivatives $\begin{array}{l}\frac{\partial E}{\partial w^l_{jk}}\end{array}$ in terms of the quantities $\begin{array}{l}\delta^l\end{array}$ and $\begin{array}{l}a^{l-1}\end{array}$ , which we already know how to compute.

With these four equations, we can "easily" backpropagate the error to each parameter. Here, we have only introduced the equations briefly. You may find more about the principle of error backpropagation here, if you are interested.
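The four equations can be combined into a short routine. The following is a sketch under stated assumptions, not a reference implementation: it assumes sigmoid activations in every layer (so the derivative of the activation is a * (1 - a)), the cost E = 1/2 * ||a^L - y||^2 for a single example, and weights stored as nested lists where `weights[l][j][k]` connects neuron k of one layer to neuron j of the next. The example network is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, biases, x):
    """Return the activations of every layer, starting with the input."""
    activations = [x]
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        # Equation (9): weighted input z_j = sum_k w_jk * a_k + b_j.
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
             for row, b_j in zip(W, b)]
        activations.append([sigmoid(z_j) for z_j in z])
    return activations


def backprop(weights, biases, x, y):
    """Gradients of E = 1/2 * ||a^L - y||^2 with respect to weights and biases."""
    activations = forward(weights, biases, x)
    a_out = activations[-1]
    # Equation (11): dE/da = (a - y), sigma'(z) = a * (1 - a).
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(a_out, y)]
    grad_w, grad_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        a_prev = activations[l]
        grad_b.insert(0, delta[:])                                # Equation (13)
        grad_w.insert(0, [[d_j * a_k for a_k in a_prev]
                          for d_j in delta])                      # Equation (14)
        if l > 0:
            W = weights[l]
            # Equation (12): (W^T delta) times sigma'(z) of the layer below.
            delta = [sum(W[j][k] * delta[j] for j in range(len(delta)))
                     * a_prev[k] * (1.0 - a_prev[k])
                     for k in range(len(a_prev))]
    return grad_w, grad_b


# Hypothetical 2-3-1 network, chosen only for illustration.
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [[0.3, -0.1, 0.2]]]
biases = [[0.0, 0.1, -0.1], [0.05]]
grad_w, grad_b = backprop(weights, biases, x=[0.5, -1.0], y=[1.0])
print(grad_w[1], grad_b[1])
```

A useful way to gain confidence in such a routine is to compare each returned gradient with a finite-difference estimate of the cost, obtained by nudging one weight at a time.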

Practical Part

Now, let us see how these four equations can be used in a neural network.

When neural networks are mentioned, a picture like this is probably the first thing that comes to mind. This is a typical neural network with 3 layers:

Layer 1: Input Layer
Layer 2: Hidden Layer
Layer 3: Output Layer

This is only a schematic view of a neural network, which is not enough to see how backpropagation works. So let's go into more detail.

Fig. 2: A simple example of Neural Network.

In this concrete model, we can clearly identify the inputs, weights, biases and outputs.

The first layer is the input layer, which includes 2 neurons $\begin{array}{l}i_1, i_2\end{array}$ and the bias $\begin{array}{l}b_1\end{array}$ . The second layer is the hidden layer with 2 neurons $\begin{array}{l}h_1, h_2\end{array}$ and the bias $\begin{array}{l}b_2\end{array}$ . The last layer is the output layer with 2 output neurons $\begin{array}{l}o_1, o_2\end{array}$ . The $\begin{array}{l}w_i\end{array}$ between the layers are the corresponding weights. Here, we assume the sigmoid function as the activation function.

To demonstrate how the backpropagation algorithm actually works, we assign each variable an initial value:

Input Data: $\begin{array}{l}i_1 = 0.05, i_2 = 0.10;\end{array}$
Output Data: $\begin{array}{l}out_1 =0.01,out_2=0.99;\end{array}$
Initial weights and biases: $\begin{array}{l}w_1 = 0.15,w_2=0.20,w_3=0.25,w_4=0.30;\end{array}$
$\begin{array}{l}w_5 = 0.40, w_6=0.45,w_7=0.50,w_8=0.55;\end{array}$
$\begin{array}{l}b_1 = 0.35, b_2=0.60;\end{array}$

Our goal is: given the input data, the output of the network should approximate the output data as closely as possible.

Fig. 3: Neural Network with 2 inputs and 2 outputs.(source: 3)

Fig. 4: Initial State of the Neural Network.(source: 3)

Step 1 Forward Propagation

Input Layer -----> Hidden Layer:
First, we calculate the weighted input to the $\begin{array}{l}h_1\end{array}$ neuron in the hidden layer:

$\begin{array}{l}\displaystyle z_{h1} = w_1 * i_1 + w_2 * i_2 + b_1 *1 = 0.3775\end{array}$

The output of the $\begin{array}{l}h_1\end{array}$ neuron (with the sigmoid activation function) is:

$\begin{array}{l}\displaystyle a_{h1} = \frac{1}{1 + e^{- z_{h1} }} = 0.593269992\end{array}$

Similarly, the output of the $\begin{array}{l}h_2\end{array}$ neuron is:

$\begin{array}{l}\displaystyle a_{h2} = 0.596884378\end{array}$

Hidden Layer -----> Output Layer:
Repeating the same procedure for the output layer gives:

$\begin{array}{l}\displaystyle a_{o1} = \frac{1}{1 + e^{- z_{o1} }} = 0.75136507\end{array}$

$\begin{array}{l}\displaystyle a_{o2} = \frac{1}{1 + e^{- z_{o2} }} = 0.772928465\end{array}$

Here, the forward propagation is finished, because we now have the output of the network for the given input and initial parameters. Evidently, the output value $\begin{array}{l}(0.75136507, 0.772928465)\end{array}$ is still far from the desired output value $\begin{array}{l}(0.01,0.99)\end{array}$ . So we need to apply the backpropagation algorithm to the network to update the parameters and re-calculate the output.
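The forward pass can be reproduced in a few lines. This sketch makes three assumptions that should be read as such: the biases are taken as b1 = 0.35 and b2 = 0.60, which are the values that reproduce the stated weighted input 0.3775 and the stated outputs; w8 is taken as 0.55, the value consistent with the stated output of the second output neuron; and the wiring is w1, w2 into h1, w3, w4 into h2, w5, w6 into o1, and w7, w8 into o2.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Inputs and initial parameters (biases and w8 as assumed in the lead-in).
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60

# Input layer -> hidden layer.
z_h1 = w1 * i1 + w2 * i2 + b1
z_h2 = w3 * i1 + w4 * i2 + b1
a_h1, a_h2 = sigmoid(z_h1), sigmoid(z_h2)

# Hidden layer -> output layer.
z_o1 = w5 * a_h1 + w6 * a_h2 + b2
z_o2 = w7 * a_h1 + w8 * a_h2 + b2
a_o1, a_o2 = sigmoid(z_o1), sigmoid(z_o2)

print(z_h1, a_h1, a_h2)
print(a_o1, a_o2)
```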

Step 2 Back propagation

1. Calculate the error (MSE):
From the first subsection, we know that:

$\begin{array}{l}\displaystyle E(w,b) = \frac{1}{n} \varSigma_{\vec{x}}||\vec{y}- \vec{f}(\vec{x}) ||^2\end{array}$

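Following the equation above for our single training example, and using the conventional factor $\begin{array}{l}\frac{1}{2}\end{array}$ on each squared output error (the same convention used in the derivatives below), the total error is:

$\begin{array}{l}\displaystyle E(w,b) = E_{o1} + E_{o2} = \frac{1}{2}(out_1 - a_{o1})^2 + \frac{1}{2}(out_2 - a_{o2})^2 = 0.298371109\end{array}$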

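2. Update the parameters between the hidden layer and the output layer:

Take $\begin{array}{l}w_5\end{array}$ as an example. By the chain rule, the influence of $\begin{array}{l}w_5\end{array}$ on the total error is:

$\begin{array}{l}\displaystyle \frac{\partial E}{\partial w_5} = \frac{\partial E}{\partial a_{o1}} * \frac{\partial a_{o1}}{\partial z_{o1}} * \frac{\partial z_{o1}}{\partial w_5}\end{array}$

Let's calculate each factor of this equation one by one: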

$\begin{array}{l}\frac{\partial E}{\partial a_{o1}}\end{array}$ :

$\begin{array}{l}\displaystyle E(w,b) = \frac{1}{2}(out_1 - a_{o1})^2 + \frac{1}{2}(out_2 - a_{o2})^2\end{array}$

$\begin{array}{l}\displaystyle \frac{\partial E}{\partial a_{o1}} = 2 * \frac{1}{2}*(out_1 - a_{o1})*(-1) + 0 =0.74136507\end{array}$

$\begin{array}{l}\frac{\partial a_{o1}}{\partial z_{o1}}\end{array}$ :

$\begin{array}{l}\displaystyle a_{o1} = \frac{1}{1 + e^{- z_{o1} }}\end{array}$

$\begin{array}{l}\displaystyle \frac{\partial a_{o1}}{\partial z_{o1}} = a_{o1}*(1-a_{o1}) = 0.186815602\end{array}$

$\begin{array}{l}\frac{\partial z_{o1}}{\partial w_5}\end{array}$ :

$\begin{array}{l}\displaystyle z_{o1} = w_5 * a_{h1} + w_6*a_{h2} + b_2 *1\end{array}$

$\begin{array}{l}\displaystyle \frac{\partial z_{o1}}{\partial w_5} = 1*a_{h1} * w_5^0 +0+0 = 0.593269992\end{array}$

Finally, multiplying the three factors gives $\begin{array}{l}\frac{\partial E}{\partial w_5} =0.082167041\end{array}$ .

You may notice that we did not explicitly use the error variable $\begin{array}{l}\delta\end{array}$ , so why define it?

In fact, we did use it. Combining the three factors calculated above, we get:

$\begin{array}{l}\displaystyle \frac{\partial E}{\partial w_5} = -(out_1-a_{o1}) * a_{o1}*(1-a_{o1})*a_{h1}\end{array}$

Then, $\begin{array}{l}\delta_{o1}\end{array}$ can be written as:

$\begin{array}{l}\displaystyle \delta_{o1} = \frac{\partial E}{\partial a_{o1}}*\frac{\partial a_{o1}}{\partial z_{o1}} = \frac{\partial E}{\partial z_{o1}}\end{array}$

$\begin{array}{l}\displaystyle \delta_{o1} = -(out_1-a_{o1}) * a_{o1}*(1-a_{o1})\end{array}$

So, the derivative of the error with respect to $\begin{array}{l}w_5\end{array}$ is:

$\begin{array}{l}\displaystyle \frac{\partial E}{\partial w_5} = \delta_{o1}*a_{h1}\end{array}$

In this part, the last step is to update the value of $\begin{array}{l}w_5\end{array}$ to make our network "better":

$\begin{array}{l}\displaystyle w_5' = w_5 - \eta * \frac{\partial E}{\partial w_5} = 0.35891648\end{array}$

where $\begin{array}{l}\eta\end{array}$ is the learning rate; we take its value as 0.5 here.

Similarly, $\begin{array}{l}w_6, w_7,w_8\end{array}$ could be updated to:

$\begin{array}{l}\displaystyle w_6' = 0.408666186\end{array}$

$\begin{array}{l}\displaystyle w_7' = 0.511301270\end{array}$

$\begin{array}{l}\displaystyle w_8' = 0.561370121\end{array}$
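The output-layer updates can be checked numerically. The sketch below starts from the rounded activations stated in the forward pass and assumes w8 = 0.55 (the value consistent with the updated weight above), the factor 1/2 in the error, and a learning rate of 0.5. All variable names are chosen here for illustration.

```python
eta = 0.5
out1, out2 = 0.01, 0.99

# Rounded activations taken from the forward pass.
a_h1, a_h2 = 0.593269992, 0.596884378
a_o1, a_o2 = 0.75136507, 0.772928465
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55

# Total error with the factor 1/2 on each squared output error.
E_total = 0.5 * (out1 - a_o1) ** 2 + 0.5 * (out2 - a_o2) ** 2

# Output-layer errors: delta = dE/da * sigma'(z), with sigma'(z) = a * (1 - a).
delta_o1 = -(out1 - a_o1) * a_o1 * (1.0 - a_o1)
delta_o2 = -(out2 - a_o2) * a_o2 * (1.0 - a_o2)

# Gradient for w5, then one gradient descent step for each weight.
dE_dw5 = delta_o1 * a_h1
w5_new = w5 - eta * delta_o1 * a_h1
w6_new = w6 - eta * delta_o1 * a_h2
w7_new = w7 - eta * delta_o2 * a_h1
w8_new = w8 - eta * delta_o2 * a_h2

print(E_total, dE_dw5)
print(w5_new, w6_new, w7_new, w8_new)
```

Note that delta_o2 is negative because the second output is below its target, so w7 and w8 increase while w5 and w6 decrease.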

3. Update the parameters between hidden layer and input layer:

The method differs little from the previous part. The only change is that, for instance, when we calculate the derivative of the total error with respect to $\begin{array}{l}a_{h1}\end{array}$ , the errors in both outputs $\begin{array}{l}o_1,o_2\end{array}$ must be considered, which means:

$\begin{array}{l}\displaystyle \frac{\partial E}{\partial a_{h1}} = \frac{\partial E_1}{\partial a_{h1}}+\frac{\partial E_2}{\partial a_{h1}} = 0.036350306\end{array}$

Based on this, and using the fourth equation $\begin{array}{l}\frac{\partial E}{\partial w^l_{jk}} = a^{l-1}_k \delta ^l_j\end{array}$ and the second equation $\begin{array}{l}\delta^l = ((w^{l+1})^T \delta^{l+1})\bigodot\sigma'(z^l)\end{array}$ , we obtain:

$\begin{array}{l}\displaystyle w_1' = w_1 - \eta * \frac{\partial E}{\partial w_1} = 0.149780716\end{array}$

Similarly:

$\begin{array}{l}\displaystyle w_2' = 0.19956143\end{array}$

$\begin{array}{l}\displaystyle w_3' = 0.24975114\end{array}$

$\begin{array}{l}\displaystyle w_4' = 0.29950229\end{array}$

Fig. 6: Back propagate the error for the hidden layer.(source:3)
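The hidden-layer updates can be verified in the same way. This sketch again uses the rounded activations from the forward pass and assumes w8 = 0.55, a learning rate of 0.5, and that the original (not yet updated) values of w5 to w8 are used when propagating the error back. The wiring assumed is w1, w2 into h1 and w3, w4 into h2.

```python
eta = 0.5
i1, i2 = 0.05, 0.10
out1, out2 = 0.01, 0.99

# Rounded activations taken from the forward pass.
a_h1, a_h2 = 0.593269992, 0.596884378
a_o1, a_o2 = 0.75136507, 0.772928465
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55  # original output-layer weights

# Output-layer errors (first equation).
delta_o1 = -(out1 - a_o1) * a_o1 * (1.0 - a_o1)
delta_o2 = -(out2 - a_o2) * a_o2 * (1.0 - a_o2)

# Both output errors contribute to each hidden activation.
dE_da_h1 = delta_o1 * w5 + delta_o2 * w7
dE_da_h2 = delta_o1 * w6 + delta_o2 * w8

# Hidden-layer errors (second equation), using sigma'(z) = a * (1 - a).
delta_h1 = dE_da_h1 * a_h1 * (1.0 - a_h1)
delta_h2 = dE_da_h2 * a_h2 * (1.0 - a_h2)

# Fourth equation plus one gradient descent step per weight.
w1_new = w1 - eta * delta_h1 * i1
w2_new = w2 - eta * delta_h1 * i2
w3_new = w3 - eta * delta_h2 * i1
w4_new = w4 - eta * delta_h2 * i2

print(dE_da_h1)
print(w1_new, w2_new, w3_new, w4_new)
```

The changes to w1 through w4 are much smaller than those to w5 through w8, since the error signal is scaled by two sigmoid derivatives and by the small input values on its way back.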
