- Background
- ILSVRC-2013: the winner used a smaller receptive window size and a smaller stride in the first convolutional layer
- GoogLeNet: 22 weight layers and small convolution filters (apart from 3x3, it also uses 1x1 and 5x5 convolutions); first place in ILSVRC-2014 classification
- Train and test densely over the whole image and at multiple scales (Sermanet et al., 2014; Howard, 2014)
- Contribution
- Increase the depth of the architecture by using 3x3 conv filters => higher accuracy and second place in ILSVRC-2014 classification
- 3x3 is the smallest filter size that captures the notion of left/right, up/down, center
- Architecture
- Input: 224x224x3
- Training image size: rescale the image (training scale S), then take a random 224x224 crop
- Preprocessing: the only preprocessing is subtracting the mean RGB value, computed on the training set, from each pixel
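A minimal sketch of this preprocessing step, assuming NumPy and images stored as an (N, 224, 224, 3) RGB array; the variable names are illustrative, not from the paper:

```python
import numpy as np

# Illustrative training batch: float32 array of shape (N, 224, 224, 3), RGB order
train_images = np.random.rand(8, 224, 224, 3).astype(np.float32)

# Mean RGB value computed on the training set, one value per channel
mean_rgb = train_images.mean(axis=(0, 1, 2))      # shape (3,)

# The only preprocessing: subtract the training-set mean from each pixel
train_images_centered = train_images - mean_rgb   # broadcasts over H and W
```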
- Configuration
- Convolutional Layer
- filter_size = 3x3
- channels = 64, doubled after each max-pooling layer until reaching 512
- strides = 1
- activation = 'relu'
- Max Pooling (#=5)
- window_size = 2x2
- strides = 2
- FC Layers (#=3)
- First = 4096
- Second = 4096
- Third = 1000 (# of classes)
- Model with 16 weight layers (VGG-16)
- Model with 19 weight layers (VGG-19)
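A minimal sketch of the 16-layer configuration under the settings above, assuming PyTorch; the layer counts follow the paper, but the code itself is an illustrative reconstruction, not the authors' implementation:

```python
import torch
import torch.nn as nn

# Conv channels per block for the 16-layer model; 'M' marks a 2x2 / stride-2 max-pool
CFG_16 = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
          512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg, in_channels=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            # 3x3 conv, stride 1, padding 1 preserves spatial size; ReLU activation
            layers.append(nn.Conv2d(in_channels, v, kernel_size=3, stride=1, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_channels = v
    return nn.Sequential(*layers)

class VGG16Sketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = make_features(CFG_16)     # 13 conv layers
        self.classifier = nn.Sequential(          # 3 FC layers -> 16 weight layers total
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)                      # 224x224 -> 7x7 after 5 max-pools
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = VGG16Sketch()
print(model(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])
```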
- Discussion
- Same effective receptive field, but greater depth
- 7x7 receptive field: a stack of three 3x3 conv layers has the same effective receptive field as a single 7x7 layer, with more non-linearities and fewer parameters
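The receptive-field claim can be checked with a one-line recurrence: each additional stride-1 conv layer widens the effective receptive field by (kernel size - 1). A small sketch of that arithmetic:

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Effective receptive field of a stack of stride-1 conv layers with the same kernel."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1   # each stride-1 layer widens the field by (k - 1)
    return rf

print(stacked_receptive_field(3, 2))  # 5 -> two stacked 3x3 layers
print(stacked_receptive_field(3, 3))  # 7 -> three stacked 3x3 layers = one 7x7 layer
print(stacked_receptive_field(7, 1))  # 7 -> a single 7x7 layer
```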
- Dataset
- ILSVRC-2012
- 1000 classes
- Training
- Batch size = 256
- Momentum = 0.9
- L2 Regularization: 5x10^-4
- Dropout of 0.5 for the first two FC layers
- Learning rate: 10^-2, divided by 10 when the validation accuracy stops improving (3 times in total; see the optimizer sketch after this list)
- Initialization for the deeper nets: reuse the weights of the pre-trained 11-layer net for some layers, and initialize the intermediate layers randomly from a normal distribution (zero mean, 10^-2 variance)
- Pre-trained 11-layer model A: the first four convolutional layers and the last three fully connected layers of the deeper nets are initialized with the corresponding layers of net A
- #images: 1.3M for training / 50K for validation
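A minimal sketch of the optimizer settings above, assuming PyTorch and reusing the VGG16Sketch module from the configuration sketch; the plateau-based LR schedule is one plausible way to implement "divide by 10 when validation accuracy stops improving", not the authors' exact procedure:

```python
import torch

# VGG16Sketch from the configuration sketch above (any classification nn.Module works)
model = VGG16Sketch()

# SGD with momentum 0.9, L2 weight decay 5e-4, and starting learning rate 1e-2
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)

# Divide the learning rate by 10 when the tracked validation metric stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1)

criterion = torch.nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)   # dropout 0.5 is inside the classifier
    loss.backward()
    optimizer.step()
    return loss.item()

# After each validation pass: scheduler.step(val_top1_accuracy)
```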
- Test
- Evaluation: top-1 and top-5 error
- top-1 error: the proportion of incorrectly classified images
- top-5 error: the proportion of images for which the ground-truth category is outside the top-5 predicted categories
- #images 100K
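A small sketch of how top-1 / top-5 error can be computed from class scores, assuming PyTorch tensors and dummy data:

```python
import torch

def topk_error(logits, targets, k):
    """Fraction of samples whose ground-truth class is not among the top-k predictions."""
    topk = logits.topk(k, dim=1).indices                 # (N, k) predicted class indices
    correct = (topk == targets.unsqueeze(1)).any(dim=1)  # ground truth in the top k?
    return 1.0 - correct.float().mean().item()

logits = torch.randn(8, 1000)                 # dummy scores for 1000 classes
targets = torch.randint(0, 1000, (8,))
print(topk_error(logits, targets, 1))         # top-1 error
print(topk_error(logits, targets, 5))         # top-5 error
```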
- Result
- Depth ⬆️ => Error ⬇️
- Question
- Why are the first two FC layers 4096-dimensional?
- For efficient computation, channel counts are usually chosen as powers of two (2^n)
- It must be large enough to feed the 1000-class output
- An answer from Quora: but why 4096? There is no reasoning; it's just a choice. It could have been 8000, it could have been 20; it just depends on what works best for the network.
- Why does the deeper stack have a smaller number of weights?
- e.g. if the input and output both have C channels:
- single 7x7 layer: #weights = C x (7x7xC) = 49C^2
- stack of three 3x3 layers: #weights = 3 x C x (3x3xC) = 27C^2
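The same count as a quick sanity check in code (biases ignored, as above; C = 256 is just an illustrative channel count):

```python
def conv_weights(kernel, in_ch, out_ch):
    # Number of weights of one conv layer, ignoring biases
    return out_ch * (kernel * kernel * in_ch)

C = 256
single_7x7 = conv_weights(7, C, C)        # 49 * C^2
three_3x3 = 3 * conv_weights(3, C, C)     # 27 * C^2
print(single_7x7, three_3x3, three_3x3 / single_7x7)   # ratio ~= 0.55
```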
- What's the relationship between iteration and epoch?
- batchSize is the number of samples used for one SGD update, i.e. the parameters are updated once per batch of batchSize samples.
- One iteration = training the model on one batch of batchSize samples.
- One epoch = training the model on the whole dataset once.
- e.g. with 1000 samples in the dataset and batchSize = 10, one pass over the whole dataset takes 100 iterations = 1 epoch.
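The same relationship in one line (the ceiling accounts for a possible smaller final batch):

```python
import math

num_samples, batch_size = 1000, 10
iterations_per_epoch = math.ceil(num_samples / batch_size)
print(iterations_per_epoch)   # 100 iterations = 1 epoch over the whole dataset
```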
- How to know when the validation set accuracy stopped improving?
- What is normal distribution with the zero mean and 10^-2 variance?
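On the last question: a variance of 10^-2 corresponds to a standard deviation of sqrt(0.01) = 0.1. A small sketch of drawing such weights, assuming NumPy and an illustrative weight shape:

```python
import numpy as np

rng = np.random.default_rng(0)
# Normal distribution with zero mean and variance 1e-2, i.e. standard deviation 0.1
weights = rng.normal(loc=0.0, scale=np.sqrt(1e-2), size=(3, 3, 64, 64))
print(weights.mean(), weights.var())   # roughly 0.0 and 0.01
```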
- What is ConvNet Fusion?
- Combine the outputs of several models by averaging their soft-max class posteriors.
- This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014).
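A minimal sketch of this fusion for two models, assuming NumPy and dummy class scores; the softmax helper is written out to keep the snippet self-contained:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Class scores from two independently trained ConvNets for a batch of 4 images (dummy data)
logits_a = np.random.randn(4, 1000)
logits_b = np.random.randn(4, 1000)

# Fusion: average the soft-max class posteriors of the individual models
fused_posterior = (softmax(logits_a) + softmax(logits_b)) / 2
predictions = fused_posterior.argmax(axis=1)
```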
- What is k-fold Cross-Validation?
- If the dataset is big enough: training 50% / validation 25% / test 25%
- If the dataset is small: keep a small part as the test set and apply k-fold cross-validation to the remaining N samples
- Process of k-fold cross-validation:
- Shuffle the N samples and split them equally into k folds
- Repeat k times: use k-1 folds for training and the remaining fold for validation, then record the accuracy
- Final accuracy = mean of the k accuracies (their variance/standard deviation indicates how stable the estimate is)
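A minimal sketch of that procedure, assuming scikit-learn's KFold and a hypothetical train_and_evaluate routine standing in for whatever training/validation code is actually used:

```python
import numpy as np
from sklearn.model_selection import KFold

def train_and_evaluate(train_idx, val_idx, X, y):
    # Hypothetical stand-in: train on X[train_idx], y[train_idx],
    # then return the validation accuracy on X[val_idx], y[val_idx].
    return np.random.rand()

X = np.random.rand(100, 16)                  # N = 100 samples (dummy data)
y = np.random.randint(0, 2, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffle, then split into k folds
accuracies = [train_and_evaluate(tr, va, X, y) for tr, va in kf.split(X)]

print(np.mean(accuracies))   # final estimate = mean accuracy over the k folds
print(np.std(accuracies))    # spread shows how stable the estimate is
```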