SGDR: Stochastic Gradient Descent with Warm Restarts
Ilya Loshchilov, Frank Hutter, 2016
Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
Leslie N. Smith, Nicholay Topin, 2017
Research Topic
 Category (General): Deep Learning
 Category (Specific): Hyperparameter Tuning, Optimization
Paper summary
SGDR
 Partial warm restarts, often used in gradient-free optimization, improve the rate of convergence.
 Propose a warm restart technique for stochastic gradient descent.
 Study its performance on CIFAR10/100.
 Show that this technique improves anytime performance when training deep neural networks:
 SGD with warm restarts requires 2× to 4× fewer epochs than the currently-used learning rate schedule schemes to achieve comparable or even better results.
Superconvergence
 Introduce the term “super-convergence”: neural networks can be trained to convergence much faster than with standard training methods.
 Propose a way to achieve super-convergence: the one-cycle policy combined with a large learning rate.
 Use a Hessian-free optimization method to produce an estimate of the optimal learning rate.
 Study its performance on CIFAR10/100, MNIST, ImageNet with various models.
 Show that this phenomenon can also occur when the amount of labeled training data is limited, and that it still boosts model performance.
 Most of these points were covered in the previous recap, so they will not be discussed further in the sections below.
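The one-cycle policy mentioned above is available in PyTorch as torch.optim.lr_scheduler.OneCycleLR. A minimal sketch of the schedule it produces; the model, max_lr, and epoch counts here are illustrative assumptions, not values from the paper:

```python
import torch

# Tiny placeholder model/optimizer just to drive the scheduler (assumptions).
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

epochs, steps_per_epoch = 5, 100
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1.0, epochs=epochs, steps_per_epoch=steps_per_epoch
)

lrs = []
for _ in range(epochs * steps_per_epoch):
    optimizer.step()  # gradient computation omitted in this sketch
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

# The LR ramps up toward max_lr, then anneals far below its starting value.
```

By default the ramp-up takes the first 30% of the run (pct_start=0.3) and the annealing is cosine-shaped, which is the "large learning rate for part of training" idea behind super-convergence.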
Explain Like I’m 5 (ELI5) 👶👶👶
SGDR
 Just like a normal person's workday: you start the day with maximum effort, but as time goes by you get tired and productivity drops. You go home and rest. The next day, you're recharged and start the cycle once again.
Superconvergence
 As a result of CLR (cyclical learning rates), super-convergence is born.
Issues addressed by the paper
SGDR
 Despite the existence of advanced optimization methods like Adam and AdaDelta, SOTA results on CIFAR10/100 and ImageNet are still based on SGD with momentum, usually paired with ResNet models.
 The current ways to escape plateaus while using SGD are LR schedules and L2 regularization.
 They want to break through with a new approach to SGD.
Superconvergence
 They found a way to train DNNs faster while achieving better performance.
 A large LR regularizes training, so other regularization methods should be reduced to maintain an optimal balance.
 A Hessian-free optimization method estimates the optimal LR, demonstrating that large LRs find wide, flat minima.
Approach/Method
SGDR
 SGDR simulates a new warm-started run of SGD after every \(T_{i}\) epochs.
 The learning rate is calculated by a cosine annealing function:
\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{i}}\pi\right)\right)\]
 \(\eta\): learning rate.
 \(T_{cur}\): how many epochs have passed since the last restart.
 \(T_{i}\): how many epochs between restarts; it can be kept constant or increased over time.
 How the learning rate looks during training:
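The cosine annealing formula can be transcribed directly into a small function; the eta_min/eta_max values below are illustrative assumptions:

```python
import math

def sgdr_lr(eta_min, eta_max, t_cur, t_i):
    """LR after t_cur epochs of a cycle lasting t_i epochs (SGDR cosine annealing)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

# At the start of a cycle the LR equals eta_max; by the end it decays to eta_min.
start = sgdr_lr(0.001, 0.1, 0, 10)   # ~0.1, fresh restart
end = sgdr_lr(0.001, 0.1, 10, 10)    # ~0.001, end of cycle
```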
Best practice
SGDR
 Start with a small \(T_{i}\), then increase it by a factor of \(T_{mult}\) at every restart.
 Decreasing max_lr and min_lr at every restart may improve performance.
 Stop training when the current learning rate reaches min_lr.
 SGDR makes it practical to train larger networks.
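These best practices map directly onto PyTorch's built-in CosineAnnealingWarmRestarts scheduler: a small first cycle (T_0) that doubles at every restart (T_mult=2). The model and LR values here are illustrative assumptions:

```python
import torch

# Tiny placeholder model/optimizer just to drive the scheduler (assumptions).
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2, eta_min=0.001
)

lrs = []
for epoch in range(70):  # cycles of length 10, 20, 40 epochs
    optimizer.step()  # gradient computation omitted in this sketch
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

# Warm restarts occur after epochs 10 and 30: the LR jumps back to its maximum.
```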
Results
 The SGDR technique helped the authors surpass the SOTA at the time with much less computation time.
Conclusions
The author’s conclusions
 Our SGDR simulates warm restarts by scheduling the learning rate, achieving competitive results on CIFAR-10 and CIFAR-100 roughly two to four times faster, along with new state-of-the-art results.
 SGDR might also reduce the problem of learning rate selection, because the annealing and restarts of SGDR scan a range of learning rate values.
Rating
My Conclusion
 Another technique to consider when training in a new project; quickly test the candidates before sticking with one and going deeper.
Paper implementation
Cosine Annealing
SGDR
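The implementation code is not included in this copy; a minimal from-scratch sketch of the SGDR schedule with growing cycles, with all hyperparameter values being illustrative assumptions:

```python
import math

def sgdr_schedule(eta_min, eta_max, t0, t_mult, total_epochs):
    """Per-epoch learning rates of SGDR: cosine annealing with warm restarts.

    t0: length of the first cycle; t_mult: cycle-length growth factor.
    """
    lrs, t_cur, t_i = [], 0, t0
    for _ in range(total_epochs):
        lr = eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
        lrs.append(lr)
        t_cur += 1
        if t_cur >= t_i:          # warm restart: reset the cycle, grow its length
            t_cur, t_i = 0, t_i * t_mult
    return lrs

# Cycles last 10, 20, 40 epochs; each restart jumps the LR back to eta_max.
lrs = sgdr_schedule(eta_min=0.001, eta_max=0.1, t0=10, t_mult=2, total_epochs=70)
```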
Cited references and used images from:
 https://towardsdatascience.com/httpsmediumcomreinawangtwstochasticgradientdescentwithrestarts5f511975163
 https://arxiv.org/abs/1608.03983
 Pytorch library