1. Background
    1. Batch normalization (with normalized initialization) largely addresses vanishing/exploding gradients
    2. Deeper model => Higher error
      1. With the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.
    3. Residual Representations
    4. Shortcut Connections
      1. "highway networks” present shortcut connections with gating functions
  2. Problem
    1. Is learning better networks as easy as stacking more layers? => Vanishing/Exploding Gradients
  3. Contribution
    1. Deeper networks that are easier to optimize and do not suffer higher error as depth grows
    2. 3.57% top-5 error on the ImageNet test set with an ensemble of ResNets
    3. 1st place in the ILSVRC 2015 classification task
  4. Dataset
    1. ImageNet 1000 classes
      1. training: 1.28 million images
      2. validation: 50k images
      3. test: 100k images
    2. CIFAR-10
    3. CIFAR-100
  5. Architecture
    1. Residual learning: a building block. Shortcut connections
      1. H(x) = F(x) + x. If the identity mapping is optimal, it is easier to drive the residual F(x) to zero than to fit an identity mapping with a stack of nonlinear layers; at worst, the block still passes x through
      2. Identity shortcuts add no extra parameters and no extra computational complexity
      3. The addition is element-wise
      4. If Dimension(x) != Dimension(F(x)), a linear projection Ws matches the dimensions: y = F(x) + Ws x
      5. With two layers in the residual block, F = W2 σ(W1 x), where σ is ReLU (biases are omitted for simplified notation)
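The two-layer residual block above can be sketched in a few lines of numpy; the weight names and shapes here are illustrative, not the paper's actual layers, and batch norm/biases are omitted to match the simplified notation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x), with residual F(x) = W2 @ relu(W1 @ x).
    Biases and batch norm are omitted, as in the simplified notation above."""
    F = W2 @ relu(W1 @ x)
    return relu(F + x)

# If the weights are zero, F(x) = 0 and the block reduces to the identity
# (for non-negative x), which is the "at worst, pass x through" property:
x = np.array([1.0, 2.0, 3.0])
print(residual_block(x, np.zeros((3, 3)), np.zeros((3, 3))))  # → [1. 2. 3.]
```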
    2. trained end-to-end by SGD
    3. The three layers are 1x1, 3x3, and 1x1 convolutions, where the 1x1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3x3 layer a bottleneck with smaller input/output dimensions.
    4. Parameter Setting
      1. Input 224x224 cropped image
        1. Randomly sampled from an image or its horizontal flip
        2. With per-pixel mean subtracted
        3. Standard color augmentation
      2. Conv + BN + ReLU
        1. BN = BatchNormalization
      3. SGD: mini-batch-size = 256
      4. Learning rate = 0.1 (/10 when the error plateaus)
      5. Use a weight decay of 0.0001 and a momentum of 0.9
      6. Do not use dropout
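The optimizer settings above (lr 0.1, momentum 0.9, weight decay 1e-4) can be sketched as a single update step. The paper does not spell out the exact momentum variant, so this is the classic-momentum rule as an illustrative assumption:

```python
import numpy as np

def sgd_step(w, grad, velocity, lr=0.1, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and L2 weight decay, using the
    hyperparameters from the parameter settings above. Classic-momentum
    form; treat it as a sketch, not the paper's exact implementation."""
    g = grad + weight_decay * w          # weight decay folded into the gradient
    velocity = momentum * velocity - lr * g
    return w + velocity, velocity

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w, v = sgd_step(w, grad=np.array([0.5, -0.5]), velocity=v)
print(w)
```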
  6. Experiment
    1. ResNet
    2. Plain Network
    3. CIFAR-10 and Analysis
      1. Augmentation
        1. 4 pixels are padded on each side, and a 32x32 crop is randomly sampled from the padded image or its horizontal flip.
      2. These models are trained with a mini-batch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split.
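The step schedule described above can be written as a small function; the milestone values are the 32k/48k iterations from the CIFAR-10 setup:

```python
def lr_at(iteration, base_lr=0.1, milestones=(32000, 48000), gamma=0.1):
    """Step schedule from the CIFAR-10 setup: start at 0.1 and divide by 10
    at 32k and 48k iterations (training terminates at 64k)."""
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= gamma
    return lr

print(lr_at(0), lr_at(40000), lr_at(63000))
```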
    4. Compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).
  7. Question
    1. Why does ResNet model have fewer filters and lower complexity than VGG nets?
    2. What is vanishing/exploding gradients?
    3. Why does the residual function F have two layers instead of one?
      1. With a single layer, the block is equivalent to a linear layer plus identity (y = W1 x + x), which shows no advantage
    4. The learning rate starts at 0.1 and is divided by 10 when the error plateaus. How do we detect that the error has plateaued?
      1. LearningRateScheduler
      2. ReduceLROnPlateau
        1. Reduce the learning rate when a monitored metric stops improving
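A toy version of the plateau rule looks like this; it is illustrative only, not the Keras/PyTorch ReduceLROnPlateau internals, and the `patience` threshold is an assumed parameter:

```python
def reduce_lr_on_plateau(errors, lr=0.1, factor=0.1, patience=3):
    """Cut the learning rate by `factor` whenever the error has not improved
    for `patience` consecutive epochs (toy sketch of the plateau rule)."""
    best = float("inf")
    wait = 0
    for e in errors:
        if e < best:
            best, wait = e, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
    return lr

# Error stops improving after epoch 3, so the LR is cut once (0.1 -> 0.01):
print(reduce_lr_on_plateau([0.9, 0.8, 0.7, 0.7, 0.7, 0.7]))
```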
    5. Why do we need the bottleneck design?
      1. To reduce computational cost (the number of multiplications)
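The saving can be checked by counting multiplications per spatial position (proportional to the weight count, biases ignored), using the 256-d input and 64-d internal channels from the bottleneck design in the Architecture section:

```python
def conv_mults(k, c_in, c_out):
    """Multiplications per spatial position for a k x k convolution."""
    return k * k * c_in * c_out

# Two plain 3x3 layers on 256 channels:
plain = 2 * conv_mults(3, 256, 256)

# Bottleneck: 1x1 reduces 256 -> 64, 3x3 on 64 channels, 1x1 restores 64 -> 256:
bottleneck = conv_mults(1, 256, 64) + conv_mults(3, 64, 64) + conv_mults(1, 64, 256)

print(plain, bottleneck)  # → 1179648 69632
```

The bottleneck costs roughly 1/17 of the plain two-layer block, which is why deeper ResNets (50/101/152) use it.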