-
Background
- Batch normalization for solving vanishing/exploding gradients
-
Deeper model => Higher error
- With the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.
- Residual Representations
-
Short Connections
- "highway networks” present shortcut connections with gating functions
-
Problem
- Is learning better networks as easy as stacking more layers? => Vanishing/Exploding Gradients
-
Contribution
- Deeper networks that are easier to optimize and do not suffer higher training error as depth grows
- Achieves 3.57% top-5 error on the ImageNet test set with an ensemble of residual nets
- 1st place of ILSVRC 2015 classification task
-
Dataset
-
ImageNet 1000 classes
- training: 1.28 million images
- validation: 50k images
- test: 100k images
- CIFAR-10
- CIFAR-100
-
Architecture
-
Residual learning: a building block.
Shortcut connections
- H(x) = F(x) + x: the stacked layers fit the residual F(x) = H(x) - x
If the identity mapping is optimal, driving F(x) to 0 is easier than fitting H(x) = x with a stack of nonlinear layers
At worst, the block falls back to passing x through unchanged
- No extra parameter
No extra computational complexity
- Addition is element-wise addition
- If Dimension(x) != Dimension(F(x)), a linear projection Ws matches the dimensions: H(x) = F(x) + Ws·x (when dimensions already match, the identity shortcut suffices)
- With two layers in the residual block, F(x) = W2·σ(W1·x)
The activation σ is ReLU (biases are omitted to simplify notation)
- trained end-to-end by SGD
- Bottleneck design: the three layers are 1x1, 3x3, and 1x1 convolutions, where the 1x1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3x3 layer a bottleneck with smaller input/output dimensions.
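A minimal sketch of the two-layer block's forward pass, in plain Python with toy dense weights (W1, W2) standing in for conv layers; the names and shapes here are illustrative, not the paper's code:

```python
# Sketch of a residual block: H(x) = F(x) + x, with
# F(x) = W2 · ReLU(W1 · x) and biases omitted, as in the notes above.

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    f = matvec(W2, relu(matvec(W1, x)))    # residual function F(x)
    h = [fi + xi for fi, xi in zip(f, x)]  # identity shortcut: F(x) + x
    return relu(h)                         # ReLU after the addition

# If W1 and W2 are zero, F(x) = 0 and the block reduces to ReLU(x):
x = [1.0, -2.0, 3.0]
Z = [[0.0] * 3 for _ in range(3)]
print(residual_block(x, Z, Z))  # → [1.0, 0.0, 3.0]
```

The zero-weight case illustrates the fallback argument: the block can pass x through almost unchanged without having to learn an identity mapping.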
-
Parameter Setting
-
Input 224x224 cropped image
- Randomly sampled from an image or its horizontal flip
- With per-pixel mean subtracted
- Standard color augmentation
-
Conv + BN + ReLU
- BN = BatchNormalization
- SGD: mini-batch-size = 256
- Learning rate = 0.1 (divided by 10 when the error plateaus)
- Use a weight decay of 0.0001 and a momentum of 0.9
- Do not use dropout
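A one-step sketch of the SGD update with these hyperparameters (lr = 0.1, momentum = 0.9, weight decay = 1e-4), on a scalar weight instead of a tensor; it follows the classic form where weight decay enters as an L2 gradient term:

```python
# One SGD step with momentum and weight decay (scalar stand-in for a
# weight tensor; variable names are illustrative).

def sgd_step(w, grad, velocity, lr=0.1, momentum=0.9, weight_decay=1e-4):
    g = grad + weight_decay * w             # weight decay as L2 gradient
    velocity = momentum * velocity - lr * g # momentum accumulates history
    return w + velocity, velocity

w, v = 1.0, 0.0
w, v = sgd_step(w, grad=0.5, velocity=v)
# first step: v = -0.1 * (0.5 + 1e-4) ≈ -0.05001, w ≈ 0.94999
```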
-
Experiment
- ResNet
- Plain Network
-
CIFAR-10 and Analysis
-
Augmentation
- 4 pixels are padded on each side, and a 32x32 crop is randomly sampled from the padded image or its horizontal flip.
- These models are trained with a mini-batch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split.
- Compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).
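The pad-then-crop-then-flip augmentation above can be sketched in plain Python, using a single-channel 2D list as a stand-in for a 32x32x3 CIFAR image (zero padding; real pipelines operate on arrays/tensors):

```python
import random

# CIFAR-10 augmentation sketch: pad 4 pixels per side, take a random
# 32x32 crop from the 40x40 padded image, flip horizontally with p=0.5.

def augment(img, pad=4, size=32):
    padded = [[0] * (size + 2 * pad) for _ in range(pad)]
    padded += [[0] * pad + row + [0] * pad for row in img]
    padded += [[0] * (size + 2 * pad) for _ in range(pad)]
    top = random.randint(0, 2 * pad)    # crop offset in [0, 8]
    left = random.randint(0, 2 * pad)
    crop = [row[left:left + size] for row in padded[top:top + size]]
    if random.random() < 0.5:           # horizontal flip
        crop = [row[::-1] for row in crop]
    return crop

img = [[r * 32 + c for c in range(32)] for r in range(32)]
out = augment(img)
assert len(out) == 32 and all(len(row) == 32 for row in out)
```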
-
Question
- Why does the ResNet model have fewer filters and lower complexity than VGG nets?
- ResNet downsamples early (stride-2 7x7 conv plus pooling), keeps fewer filters at each spatial resolution, and replaces VGG's large fully-connected layers with global average pooling; the 34-layer baseline has only ~18% of VGG-19's FLOPs
- What are vanishing/exploding gradients?
- During backpropagation, gradients are multiplied layer by layer; through many layers they can shrink toward zero (vanishing) or grow without bound (exploding), so early layers stop learning or training diverges
-
Why does the residual function F have at least 2 layers?
- With a single layer, y = W1·x + x = (W1 + I)·x, which is just another linear layer, so no advantage is observed
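A quick numeric check of this point, with a toy 2x2 weight matrix: a one-layer residual y = W1·x + x collapses to the single linear map (W1 + I)·x, so the shortcut adds no expressive power.

```python
# Verify that W1·x + x equals (W1 + I)·x for a toy example.

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

W1 = [[2.0, 0.0], [1.0, 3.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]

residual = [f + xi for f, xi in zip(matvec(W1, x), x)]   # W1·x + x
merged = matvec([[w + i for w, i in zip(rw, ri)]         # (W1 + I)·x
                 for rw, ri in zip(W1, I)], x)
assert residual == merged  # both are [3.0, 9.0]
```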
-
The learning rate starts from 0.1 and is divided by 10 when the error plateaus. How to know the error plateaus?
- LearningRateScheduler
-
ReduceLROnPlateau
- Reduce the learning rate when a monitored metric has stopped improving
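A stripped-down sketch of this plateau logic, assuming divide-by-10 with a patience of 3 epochs; library versions (e.g. PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau`) add thresholds, cooldown, and min-lr handling on top of this core test:

```python
# Divide the learning rate by `factor` once the monitored error has not
# improved for `patience` consecutive epochs.

class PlateauScheduler:
    def __init__(self, lr=0.1, factor=10.0, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, error):
        if error < self.best:           # improvement: reset the counter
            self.best = error
            self.bad_epochs = 0
        else:                           # plateau: count bad epochs
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= self.factor  # e.g. 0.1 -> 0.01
                self.bad_epochs = 0
        return self.lr

sched = PlateauScheduler()
for err in [0.9, 0.5, 0.5, 0.5, 0.5]:   # error stops improving after epoch 2
    lr = sched.step(err)
# lr has now been divided by 10 once (0.1 -> 0.01)
```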
-
Why do we need bottleneck network?
- To reduce computational cost (the number of multiplications)
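One way to see the saving, counting multiplications per spatial position for a 256-d input (the paper's actual comparison holds blocks at similar complexity; this just shows why the 1x1 reduce/restore layers are cheap):

```python
# Multiplications per output position for a k x k conv layer.
def conv_mults(k, c_in, c_out):
    return k * k * c_in * c_out

# Two plain 3x3 conv layers at 256 channels:
plain = 2 * conv_mults(3, 256, 256)

# Bottleneck 256 -> 64 -> 64 -> 256 (1x1, 3x3, 1x1):
bottleneck = (conv_mults(1, 256, 64)     # 1x1 reduces 256 -> 64
              + conv_mults(3, 64, 64)    # 3x3 on the narrow 64-d path
              + conv_mults(1, 64, 256))  # 1x1 restores 64 -> 256

print(plain, bottleneck)  # → 1179648 69632, roughly a 17x reduction
```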