- Background
- ILSVRC-2013: the winner used a smaller receptive window size and a smaller stride in the first convolutional layer
- GoogLeNet: 22 weight layers and small convolution filters (apart from 3x3, it also uses 1x1 and 5x5 convolutions); first place in ILSVRC-2014 classification
- Train and test densely over the whole image and at multiple scales (Sermanet et al., 2014; Howard, 2014)
- Contribution
- Increase the depth of the architecture by using 3x3 conv filters => higher accuracy and second place in ILSVRC-2014 classification
- 3x3 is the smallest filter size that captures the notion of left/right, up/down, center
- Architecture
- Input: 224x224x3
- Training image size: rescale the image (training scale S), then take a random 224x224 crop
- Preprocessing: the only preprocessing is subtracting the mean RGB value, computed on the training set, from each pixel
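A minimal sketch of this preprocessing step, assuming NumPy and images stored as an (N, 224, 224, 3) RGB array; the variable names are illustrative, not from the paper:

```python
import numpy as np

# Illustrative training batch: float32 array of shape (N, 224, 224, 3), RGB order
train_images = np.random.rand(8, 224, 224, 3).astype(np.float32)

# Mean RGB value computed on the training set, one value per channel
mean_rgb = train_images.mean(axis=(0, 1, 2))      # shape (3,)

# The only preprocessing: subtract the training-set mean from each pixel
train_images_centered = train_images - mean_rgb   # broadcasts over H and W
```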
- Configuration
- Convolutional Layer
- filter_size = 3x3
- channels = 64, doubled after each max-pooling layer until reaching 512
- strides = 1
- activation = 'relu'
- Max Pooling (#=5)
- window_size = 2x2
- strides = 2
- FC Layers (#=3)
- First = 4096
- Second = 4096
- Third = 1000 (# of classes)
- Model with 16 weight layers (VGG-16)
- Model with 19 weight layers (VGG-19)
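A minimal sketch of the 16-layer configuration under the settings above, assuming PyTorch; the layer counts follow the paper, but the code itself is an illustrative reconstruction, not the authors' implementation:

```python
import torch
import torch.nn as nn

# Conv channels per block for the 16-layer model; 'M' marks a 2x2 / stride-2 max-pool
CFG_16 = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
          512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg, in_channels=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            # 3x3 conv, stride 1, padding 1 preserves spatial size; ReLU activation
            layers.append(nn.Conv2d(in_channels, v, kernel_size=3, stride=1, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_channels = v
    return nn.Sequential(*layers)

class VGG16Sketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = make_features(CFG_16)     # 13 conv layers
        self.classifier = nn.Sequential(          # 3 FC layers -> 16 weight layers total
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)                      # 224x224 -> 7x7 after 5 max-pools
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = VGG16Sketch()
print(model(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])
```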
- Discussion
- Same effective receptive field, but greater depth
- 7x7 receptive field: a stack of three 3x3 conv layers has the same effective receptive field as a single 7x7 layer, with more non-linearities and fewer parameters
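The receptive-field claim can be checked with a one-line recurrence: each additional stride-1 conv layer widens the effective receptive field by (kernel size - 1). A small sketch of that arithmetic:

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Effective receptive field of a stack of stride-1 conv layers with the same kernel."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1   # each stride-1 layer widens the field by (k - 1)
    return rf

print(stacked_receptive_field(3, 2))  # 5 -> two stacked 3x3 layers
print(stacked_receptive_field(3, 3))  # 7 -> three stacked 3x3 layers = one 7x7 layer
print(stacked_receptive_field(7, 1))  # 7 -> a single 7x7 layer
```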
- Dataset
- ILSVRC-2012
- 1000 classes
- Training
- Batch size = 256
- Momentum = 0.9
- L2 Regularization: 5x10^-4
- Dropout of 0.5 for the first two FC layers
- Learning rate: 10^-2, divided by 10 when the validation accuracy stops improving (3 times in total; see the optimizer sketch after this list)
- Initialization for the deeper nets: reuse the weights of the pre-trained 11-layer net for some layers, and initialize the intermediate layers randomly from a normal distribution (zero mean, 10^-2 variance)
- Pre-trained 11-layer model A: the first four convolutional layers and the last three fully connected layers of the deeper nets are initialized with the corresponding layers of net A
- #images: 1.3M for training / 50K for validation
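A minimal sketch of the optimizer settings above, assuming PyTorch and reusing the VGG16Sketch module from the configuration sketch; the plateau-based LR schedule is one plausible way to implement "divide by 10 when validation accuracy stops improving", not the authors' exact procedure:

```python
import torch

# VGG16Sketch from the configuration sketch above (any classification nn.Module works)
model = VGG16Sketch()

# SGD with momentum 0.9, L2 weight decay 5e-4, and starting learning rate 1e-2
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)

# Divide the learning rate by 10 when the tracked validation metric stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1)

criterion = torch.nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)   # dropout 0.5 is inside the classifier
    loss.backward()
    optimizer.step()
    return loss.item()

# After each validation pass: scheduler.step(val_top1_accuracy)
```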
- Test
- Evaluation: top-1 and top-5 error
- top-1 error: the proportion of incorrectly classified images
- top-5 error: the proportion of images for which the ground-truth category is outside the top-5 predicted categories
- #images 100K
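A small sketch of how top-1 / top-5 error can be computed from class scores, assuming PyTorch tensors and dummy data:

```python
import torch

def topk_error(logits, targets, k):
    """Fraction of samples whose ground-truth class is not among the top-k predictions."""
    topk = logits.topk(k, dim=1).indices                 # (N, k) predicted class indices
    correct = (topk == targets.unsqueeze(1)).any(dim=1)  # ground truth in the top k?
    return 1.0 - correct.float().mean().item()

logits = torch.randn(8, 1000)                 # dummy scores for 1000 classes
targets = torch.randint(0, 1000, (8,))
print(topk_error(logits, targets, 1))         # top-1 error
print(topk_error(logits, targets, 5))         # top-5 error
```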
- Result
- Depth ⬆️ => Error ⬇️
- Question
- Why are the first two FC layers 4096-dimensional?
- For efficient computation, channel counts are usually chosen as powers of two (2^n)
- It must be large enough to feed the 1000-class output
- An answer from Quora: but why 4096? There is no reasoning; it's just a choice. It could have been 8000, it could have been 20; it just depends on what works best for the network.
- Why does the deeper stack have a smaller number of weights?
- e.g. if the input and output both have C channels:
- single 7x7 layer: #weights = C x (7x7xC) = 49C^2
- stack of three 3x3 layers: #weights = 3 x C x (3x3xC) = 27C^2
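The same count as a quick sanity check in code (biases ignored, as above; C = 256 is just an illustrative channel count):

```python
def conv_weights(kernel, in_ch, out_ch):
    # Number of weights of one conv layer, ignoring biases
    return out_ch * (kernel * kernel * in_ch)

C = 256
single_7x7 = conv_weights(7, C, C)        # 49 * C^2
three_3x3 = 3 * conv_weights(3, C, C)     # 27 * C^2
print(single_7x7, three_3x3, three_3x3 / single_7x7)   # ratio ~= 0.55
```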
- What's the relationship between iteration and epoch?
- batchSize is the number of samples used for one SGD update, i.e. the parameters are updated once per batch of batchSize samples.
- One iteration = training the model on one batch of batchSize samples.
- One epoch = training the model on the whole dataset once.
- e.g. with 1000 samples in the dataset and batchSize = 10, one pass over the whole dataset takes 100 iterations = 1 epoch.
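The same relationship in one line (the ceiling accounts for a possible smaller final batch):

```python
import math

num_samples, batch_size = 1000, 10
iterations_per_epoch = math.ceil(num_samples / batch_size)
print(iterations_per_epoch)   # 100 iterations = 1 epoch over the whole dataset
```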
- How to know when the validation set accuracy stopped improving?
- What is normal distribution with the zero mean and 10^-2 variance?
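On the last question: a variance of 10^-2 corresponds to a standard deviation of sqrt(0.01) = 0.1. A small sketch of drawing such weights, assuming NumPy and an illustrative weight shape:

```python
import numpy as np

rng = np.random.default_rng(0)
# Normal distribution with zero mean and variance 1e-2, i.e. standard deviation 0.1
weights = rng.normal(loc=0.0, scale=np.sqrt(1e-2), size=(3, 3, 64, 64))
print(weights.mean(), weights.var())   # roughly 0.0 and 0.01
```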
- What is ConvNet Fusion?
- Combine the outputs of several models by averaging their soft-max class posteriors.
- This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014).
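A minimal sketch of this fusion for two models, assuming NumPy and dummy class scores; the softmax helper is written out to keep the snippet self-contained:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Class scores from two independently trained ConvNets for a batch of 4 images (dummy data)
logits_a = np.random.randn(4, 1000)
logits_b = np.random.randn(4, 1000)

# Fusion: average the soft-max class posteriors of the individual models
fused_posterior = (softmax(logits_a) + softmax(logits_b)) / 2
predictions = fused_posterior.argmax(axis=1)
```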
- What is k-fold Cross-Validation?
- If the dataset is big enough: training 50% / validation 25% / test 25%
- If the dataset is small: keep a small part as the test set and apply k-fold cross-validation to the remaining N samples
- Process of k-fold cross-validation:
- Shuffle the N samples and split them equally into k folds
- Repeat k times: use k-1 folds for training and the remaining fold for validation, then record the accuracy
- Final accuracy = mean of the k accuracies (their variance/standard deviation indicates how stable the estimate is)
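A minimal sketch of that procedure, assuming scikit-learn's KFold and a hypothetical train_and_evaluate routine standing in for whatever training/validation code is actually used:

```python
import numpy as np
from sklearn.model_selection import KFold

def train_and_evaluate(train_idx, val_idx, X, y):
    # Hypothetical stand-in: train on X[train_idx], y[train_idx],
    # then return the validation accuracy on X[val_idx], y[val_idx].
    return np.random.rand()

X = np.random.rand(100, 16)                  # N = 100 samples (dummy data)
y = np.random.randint(0, 2, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffle, then split into k folds
accuracies = [train_and_evaluate(tr, va, X, y) for tr, va in kf.split(X)]

print(np.mean(accuracies))   # final estimate = mean accuracy over the k folds
print(np.std(accuracies))    # spread shows how stable the estimate is
```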