- Higher training speed requires a larger mini-batch size.
- 8192 images per minibatch, trained across 256 GPUs
- Larger mini-batch size leads to lower accuracy
- Linear scaling rule for adjusting learning rates as a function of minibatch size
- Warmup scheme overcomes optimization challenges early in training
- mini-batch SGD
- Larger mini-batch sizes lead to lower accuracy.
- Iteration (in the Facebook paper): w_{t+1} = w_t − η·(1/n)·Σ_{x∈B} ∇l(x, w_t), where B is a minibatch of n samples and η is the learning rate.
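A minimal NumPy sketch of this update; the least-squares loss in the usage example is purely illustrative, not the paper's model:

```python
import numpy as np

def sgd_step(w, minibatch, grad_loss, lr):
    """One mini-batch SGD update: w <- w - lr * (1/n) * sum_{x in B} grad l(x, w)."""
    n = len(minibatch)
    g = sum(grad_loss(x, w) for x in minibatch) / n  # average per-sample gradient
    return w - lr * g

# Illustrative usage with a least-squares loss l(x, w) = 0.5 * (w·features - target)^2.
def grad_loss(x, w):
    features, target = x
    return (w @ features - target) * features

w = np.zeros(3)
batch = [(np.array([1.0, 2.0, 3.0]), 1.0), (np.array([0.5, 0.1, 0.2]), 0.0)]
w = sgd_step(w, batch, grad_loss, lr=0.1)
```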
- Convergence depends on M (minibatch size), K (number of iterations), and σ² (stochastic gradient variance).
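- For reference (a standard result, not from the paper itself, assuming a convex, L-smooth objective and unbiased stochastic gradients with variance σ²): mini-batch SGD with a well-tuned step size reaches expected suboptimality on the order of L·D²/K + σ·D/√(M·K) after K iterations with minibatch size M, where D bounds the distance from the initialization to an optimum. The σ·D/√(M·K) term is why increasing M can reduce the number of iterations K needed, up to a point.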
- Goal: use large minibatches to scale training to multiple workers while maintaining training and generalization accuracy.
- Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
- k iterations of SGD with learning rate η and minibatch size n: w_{t+k} = w_t − η·(1/n)·Σ_{j<k} Σ_{x∈B_j} ∇l(x, w_{t+j})
- 1 iteration with learning rate η̂ and minibatch size kn (the union of the same k minibatches): ŵ_{t+1} = w_t − η̂·(1/(kn))·Σ_{j<k} Σ_{x∈B_j} ∇l(x, w_t)
- Assume the gradients in the two formulas are approximately equal, i.e. ∇l(x, w_t) ≈ ∇l(x, w_{t+j}) for j < k.
- Then ŵ_{t+1} ≈ w_{t+k} only if we set the second learning rate to k times the first, i.e. η̂ = kη.
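A minimal sketch of applying the linear scaling rule when setting up training; the reference point of η = 0.1 for a 256-image minibatch matches the paper's ResNet-50 baseline:

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    """Linear scaling rule: multiply the reference learning rate by k = batch_size / base_batch_size."""
    k = batch_size / base_batch_size
    return base_lr * k

print(scaled_lr(8192))  # k = 32, so the learning rate becomes 0.1 * 32 = 3.2
```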
- The approximation breaks down during the initial training epochs, when the network is changing rapidly.
- Apart from that, results with the scaled learning rate are stable for a large range of minibatch sizes; the rule only breaks down beyond a certain size (around 8k images in the paper).
- Warmup: use a low learning rate at the start of training to cope with the rapid change of the initial network.
- Constant warmup: train with a low constant learning rate for the first few epochs; the sudden jump to the large learning rate afterward causes the training error to spike.
- Gradual warmup: ramp the learning rate up from a small to a large value, starting from η and incrementing it by a constant amount at each iteration so that it reaches η̂ = kη after 5 epochs (see the sketch below).
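A minimal sketch of this gradual warmup schedule; the warmup length in iterations and the example numbers are illustrative assumptions (the paper's full schedule also includes later step decays, omitted here):

```python
def warmup_lr(iteration, base_lr, k, warmup_iters):
    """Gradual warmup: increase the learning rate by a constant amount each
    iteration, from base_lr up to k * base_lr after warmup_iters iterations."""
    target_lr = k * base_lr
    if iteration >= warmup_iters:
        return target_lr
    return base_lr + (target_lr - base_lr) * iteration / warmup_iters

# Illustrative values: k = 32 (8192 / 256) and a made-up 5-epoch warmup of 500 iterations.
for it in (0, 250, 500, 1000):
    print(it, warmup_lr(it, base_lr=0.1, k=32, warmup_iters=500))
```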