Adaptive learning rate methods are an optimization of gradient descent methods with the goal of minimizing the objective function of a network by using the gradient of the function and the parameters of the network.

 

 

 

Author: Johannes Kuhn

 

Gradient descent

Before the adaptive learning rate methods were introduced, the gradient descent algorithms including Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD) and mini-Batch Gradient Descent (mini-BGD, the mixture of BGD ans SGD) were state-of-the-art. In essence, these methods try to update the weights \theta of the network, with the help of a learning rate \eta, the objective function J(\theta) and the gradient of it, \nabla J(\theta). What all gradient descent algorithms and its improvements have in common, is the goal of minimizing J(\theta) in order to find the optimal weights \theta.

The simplest of the three is the BGD.

\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta)

It tries to reach the minimum of J\theta), by subtracting from \theta the gradient of J(\theta) (refere to Figure 3 for a visualization). The algorithm always computes over the whole set of data, for each update. This makes the BGD the slowest and causes it to be unable to update online. Additionally, it performs redundant operates for big sets of data, computing similar examples at each update and it converges to the closeset minimum depending on the given data data, resulting in potential suboptimal results.

An often used algorithm is the SGD.

\theta = \theta - \eta \cdot \nabla_{\theta}J(\theta; x^{(i)}; y^{(i)})

Contrary to BGD, SGD updates for each training example (x^{(i)}; y^{(i)}), thus updating according to a single example step. Furthermore, this fluctuation enables the SGD to jump to minima farther away, potentially reaching a better minimum. But thanks to this fluctuation, SGD is also able to overshoot. This can be counteracted by slowly decreasing the learning rate. In the exemplary code shown in Figure 2, a shuffle function is additionally used in the SGD and mini-BGD algorithm, compared to the BGD. This is done, as it is often preferable to avoid meaningful order of the data and thereby avoid bias of optimization algorithm, although sometimes better results can be achieved with data in order. In this case the shuffle operation is to be removed.

Lastly, there is the mini-BGD.

\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta;x^{(i:i+n)};y^{(i:i+n)})

The mini-BGD updates for every mini-batch of n training examples. This leads to a more stable convergence, by reducing the variance of the parameters. When people talk abput a SGD algorithm, they often refer to this version.

 

batch gradient descent (BGD)stochastic gradient descent (SGD)mini-batch gradient descent (mini-BGD)

for i in range (nb_epoches):

 

 

       params_grad = evaluate_gradient(loss_function, data, params

       params = params - learning_rate * params_grad

for i in range(np_epochs):

          np.random.shuffle(data)

          for example in data:

                params_grad = evaluate_gradient(loss_function, example, params

                params = params - learning_rate * params_grad         

 for i in range(np_epochs):

           np.random.shuffle(data)

           for batch in get_batches(data, batch_size=50):

                params_grad = evaluate_gradient(loss_function, batch, params

                params = params - learning_rate * params_grad   

Figure 2: (1) Pseudo code of the three gradient descent algorithms

 

 

 

 

 

 

Figure 1:(5) Local minima may occure in  J(\theta) (here J(w)), which may result in suboptimal solution for some gradient descent methodes.

 

 

Adaptive Learning Rate Method

As an improvement to traditional gradient descent algorithms, the adaptive gradient descent optimization algorithms or adaptive learning rate methods can be utilized. Several versions of these algorithms are described below.

Momentum can be seen as an evolution of the SGD.

v_t = \gamma v_{t-1} + \eta \nabla_{\theta}J(\theta)\\\theta = \theta - v_t

While SGD has problems with data having steep curves in one direction of the gradient, Momentum circumvents that by adding the update vector of the time step before multiplying it with a \gamma, usually around 0.9 (1). As an analogy, one can think of a ball rolling down the gradient, gathering momentum (hence the name), while still being affected by the wind resistance (0< \gamma < 1).

 

Nesterov accelerated gradient can be seen as a further enhancement to momentum.

v_t = \gamma v_{t-1} + \eta \nabla_{\theta}J(\theta - \gamma v_{t-1})\\\theta = \theta - v_t

This algorithm adds a guess of the next step, in the form of the term \gamma v_{t-1}. A comparison for the first two steps of Momentum and Nesterov accelerated gradient can be found in Figure 3. The additional term results in a consideration of the error of the previous step, accelerating the progress in comparison to momentum.

 

Contrary to the nesterov accelerated gradient, Adagrad adapts its learning rate \eta during its run-time and it updates its parameters \theta_i separately during each time step t. It has to do that, since \eta adapts for every \theta_i on its own.

\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii}+ \epsilon}} \cdot\nabla_{\theta} J(\theta_i)

G_t is a matrix containing the squared sum of the past gradients with regards to all \theta along its diagonal.

\epsilon is correction term which is utilized to avoid dividing by 0 and is generally insignificantly small (~10^{-8}).

Due to the accumulation of the squared gradients in G_t the learning rate \frac{\eta}{\sqrt{G_{t,ii}+ \epsilon}} gets smaller over time, finally leading to a significantly small rate, which causes the algorithm to obtain no new knowledge.

 

 

 

 

Figure 3:(1)  Visualization of the analogy for Momentum using a \gamma = 0.9.

Figure 4:(1) Comparison of Momentum (blue) and Nesterov accelerated gradient (green). The brown arrow, is the first prediction of the nesterov accelerated gradient, before the gradient is calculated. The green arrow is the final result of the nesterov accelerated gradient, now with the gradient taken into account.

Literature

(1)Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv preprint arXiv:1609.04747 (2016).

(2)Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12.Jul (2011): 2121-2159.

  • No labels

2 Comments

  1. Unknown User (ga29mit)

    All the comments are SUGGESTIONS and are obviously highly subjective!

    Form comments: 

    • I think we agreed on linking to the Literature section by setting an Anchor and then using shortcuts to those sections. It should look like this (6)
    • Try to incorporate links to other pages
    • Check for redundant spaces
    • Give numbers to all figures and formulas and refer to them by number (e.g., Equation 1). Avoid "below" etc.

    Corrections:

    • Adaptive learning rate methodes are optimazation of gradient descent methodes, with the goal of minimizing the objective function of a network, by using the gradient of the function and the parameters of the network.

    • Adaptive learning rate methods are an optimization of gradient descent methods with the goal of minimizing the objective function of a network by using the gradient of the function and the parameters of the network.

    • Before the adaptive Learning Rate Methodes came the gradient descent algortihms, Batch Gradient Descent(BGD), Stochastic Gradient Descent (SGD) and the mixture of them both, mini-Batch Gradient Descent (mini-BGD).

    •  

      Before the adaptive Learning Rate Methods were introduced, the gradient descent algorithms Batch Gradient Descent(BGD), Stochastic Gradient Descent (SGD) and the mixture of them both, mini-Batch Gradient Descent (mini-BGD) were state-of-the-art

    • The simpelst of the three is the BGD.

    • The simplest of the three is the BGD.

    • data, for eeach update. 

    • data, for each update. 

    • Additionally it performes redundant operates for big sets of data, computing similar examples every update and it converges to the closeset minima the data is placed near, resulting in potential suboptimal local minima.

    • Additionally, it performs redundant operations for big sets of data, computing similar examples every update and it converges to the closest minimum in a neighborhood of the given, resulting in potentially suboptimal local minima.

    • Furthermore with this fluctuation, SGD is able to jump to minima farther away, potentialy reaching a better minimum

    • Furthermore, this fluctuation enables SGD to jump to minima further away, potentially reaching a better minimum.

    • This is done, as it is often preferable to avoid meaningful order of the data, to avoid bias of optimazation algorithm, although sometimes better results can be achieved with data in order. In this case the shuffle operaten is to be removed. Lastly there is the mini-BGD.

    • This is done, as it is often preferable to avoid meaningful order of the data and thereby avoid bias of the optimization algorithm, although sometimes better results can be achieved with data in order. In this case the shuffle operation has to be removed. Lastly, there is the mini-BGD.

    • Often, if it is talked about an SGD algorithm, this one is meaned instead.

    • When people talk about an SGD algorithm, they often actually refer to this version.

    • As the next step to gradient descent algorithms there are the adaptive gradient descent optimization algorithm or adaptive learning rate methodes. Below, a few popular algorithms of this kind are described.

    • As the next step to gradient descent algorithms there are the adaptive gradient descent optimization algorithms or adaptive learning rate methods. Below a few popular algorithm of this kind are described.

    • Momentum can be seen as a evolution of the SGD.

    • Momentum can be seen as an evolution of the SGD.

    • Momentum circumevents that with adding the update vectore of the time step before an multipliing it with a  γ

    • Momentum circumvents that with adding the update vector of the time step before an multiplying it with a  γ

    • Nesterov accelerated gradient can be seen as a further enhancment to Momentum.

    • Nesterov accelerated gradient can be seen as a further enhancement to Momentum.

    • This algortihm adds a guess of the next stepp, in the form of the term   −γvt−1 

    • This algorithm adds a guess of the next step, in the form of the term     −γvt−1 

    • Using the same  γ   γ = 0.9 a comparisen in the first two steps of Momentum and Nesterov accelerated gradient can be viewed in Figure 2. The additional term results in a consideration of the error of the previos step, accelarating the progress compared to Momentum.

    • Using the same  γ   γ = 0.9, a comparison of the first two steps of Momentum and Nesterov accelerated gradient can be found in Figure 2. The additional term results in a consideration of the error of the previous step, accelerating the progress compared to Momentum.

    • Figure 2:[1] Comparisson of Momentum (blue) and Nesterov accelerated gradient (green). The green arrow is the addition of the brown arrow, represented in the term γvt−1, plus the red arrow, represented by the term  η∇θJ(θ−γvt−1)
    • Figure 2:[1] Comparison of Momentum (blue) and Nesterov accelerated gradient (green). The green arrow corresponds to the vector addition of the brown arrow, represented in the term  γvt−1, and the red arrow, represented by the term  η∇θJ(θ−γvt−1)

    •  i   θi seperatly during time each time step t.

    •  i   θi separately during each time step t.
    •  since adapts  η for every  θi on its own.
    •  since η adapts for every  θi on its own.
    • Gt is a matirx containingthe  squares sum of the past gradients w.r.t all θ along its diagonal.
    • Gt is a matrix containing the squared sum of the past gradients w.r.t to all θ along its diagonal.
    • ϵ is correction term to avoid deviding by 0 and is gernerally insignivicly small (~ 10 8   10−8).
    • ϵ is a correction term to avoid dividing by 0 and is generally insignificantly small (~ 10−8).

     

    Final remark:
    • I was patient the whole time, but this cannot be your f**king serious???? How about running a spell-check and proof-read on your own work before you waste someone else's time?
  2. Unknown User (ga73fuj)

    General suggestions:

    • I'm not sure about the capitalization, but it's not consistent throughout the page.
    • There are no explanations about the notation of the mathematical formulas that you used. Since there are no links to other pages explaining this stuff, you need to explain what they represent.
    • There are no links from the text to the references. (no anchors)
    • Page formatting could be handled more carefully.

    Corrections:

    • Before the adaptive Learning Rate Methods were introduced, the gradient descent algorithms, Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD) and the mixture of them both, mini-Batch Gradient Descent (mini-BGD) were state-of-the-art.
    • Before the Adaptive Learning Rate Methods were introduced, the gradient descent algorithms including Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD) and the mini-Batch Gradient Descent (mini-BGD, a mixture of BGD and SGD), were state-of-the-art.
    • The algorithm computes always over the whole set of data, for each update.
    • The algorithm always computes over the whole set of data, for each update.
    • This makes the BGD the slowest and unable to update online.
    • This makes the BGD the slowest and causes it to be unable to update online.
    • Additionally it performs redundant operates for big sets of data, computing similar examples every update and it converges to the neighborhood of the given data data, resulting in potential suboptimal local minima.
    • Additionally, it performs redundant operations for big sets of data, computing similar examples at each update and it converges to the closest minumum depending on the given data, resulting in potential suboptimal results.
    • Contrary to the one above
    • Contrary to BGD
    • SGD updates for each training example x(i)x(i) and y(i)y(i), thus updating only one example at a time.
    • SGD updates for each training example x(i)x(i) and y(i)y(i), thus updating according to a single example at step.(are the examples the ones to be updated?)
    • This can be counteracted with slowly decreasing the learning rate.
    • This can be counteracted by slowly decreasing the learning rate.
    • In the exemplary code below, additionally to the second for-loop, a shuffle function is used in the SGD and mini-BGD algorithm.
    • As illustrated in the exemplary code (Figure ?), a shuffle function is used in the SGD and mini-BGD as an addition to BGD. (I'm not sure if this is what you meant.)
    • The mini-BGD updates for every mini-batch of n training examples.
    • The mini-BGD updates for every mini-batch of n (using MathInline would be better) training examples.
    • When people talk abput a SGD algorithm, they refer to this version.
    • When people talk about a SGD algorithm, they refer to this version. (I don't think this makes any sense. You explained about the concept of SGD before introducing mini-BGD. Now at the end of mini-BGD section, you wrote "When people talk about a SGD algorithm, they refer to this version." . Wouldn't that make SGD = mini-BGD?)
    • As the next step to gradient descent algorithms there are the adaptive gradient descent optimization algorithms or adaptive learning rate methods. Below a few popular algorithm of this kind are described.
    • As an improvement over to traditional gradient descent algorithms, the adaptive gradient descent optimization algorithms or adaptive learning rate methods can be utilized. Several versions of these algorithms are described below.
    • Momentum circumvents that with adding the update vector of the time step before multiplying it with a
    • Momentum circumvents that by adding the update vector at each time step before multiplying it with a
    • Nesterov accelerated gradient can be seen as a further enhancement to Momentum.
    • Nesterov accelerated gradient can be seen as a further enhancement to momentum. (your capitalization is not consistent in this sentence or in general. It's either "accelerated gradient - momentum" or "Accelerated Gradient - Momentum". I believe, the first one is more preferable when writing an article in english.)
    • in the form of the term −γvt−1−γvt−1 as an input to the Jacobian matric
    • in the form of the term −γvt−1−γvt−1 as an input to the Jacobian matrix
    • Using the same γγ = 0.9, a comparison of the first two steps of Momentum and Nesterov accelerated gradient can be found in Figure 2.
    • A comparison of the first two steps of Momentum and Nesterov accelerated gradient can be found in Figure 2. (I'd mention the 0.9 in the explanation of the figure. It does not fit here in my opinion.)
    • The additional term results in a consideration of the error of the previous step, accelerating the progress compared to Momentum
    • The additional term results in a consideration of the error from the previous step, accelerating the progress in comparison to momentum
    • separately during each time step t
    • separately during each time step t(MathInline)
    • a matrix containing the squared sum of the past gradients w.r.t all
    • a matrix containing the squared sum of the past gradients with regards to all (This abbreviation is not canon.)
    • is correction term to avoid dividing by 0
    • is a correction term which is utilized to avoid dividing by 0
    • gets smaller over time, finally leading to a rate too small, resulting in the algorithm gaining no new knowledge.
    • gets smaller over time, finally leading to a significantly small rate, which causes the algorithm to obtain no new knowledge.