OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
The paper proposes an integrated approach to object recognition, localization, and detection with a single convolutional network. Its test-time evaluation procedure (dense sliding-window evaluation) is new.
Date: 2013
Preprocessing
The shortest side of each image is resized to 256 pixels
Horizontal flipping
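As a rough illustration of the resizing step, the sketch below computes the output size when the shorter side is scaled to 256 and the aspect ratio is kept. The function name and the rounding rule are assumptions made for this example; the paper does not specify them.

```python
def resize_shortest_side(width, height, target=256):
    """Return (new_width, new_height) so that the shorter side equals `target`.

    Hypothetical helper: the aspect ratio is preserved and the longer side
    is rounded to the nearest integer (an assumption, not from the paper).
    """
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


if __name__ == "__main__":
    # A 500x375 image: the 375-pixel side becomes 256.
    print(resize_shortest_side(500, 375))
```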
Data augmentation
Five random 221*221 crops are taken from each resized image (shortest side 256), together with their horizontal flips: 10 patches in total
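The augmentation above can be sketched with NumPy as follows. This is a minimal sketch, not the authors' code: the function name is hypothetical, the crop size is left as a parameter, and each random crop is paired with its own mirror image.

```python
import numpy as np


def random_crops_with_flips(image, crop, n_crops=5, rng=None):
    """Return 2 * n_crops patches: n_crops random crops plus their mirrors.

    `image` is an array of shape (height, width, channels).
    Hypothetical helper written for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    patches = []
    for _ in range(n_crops):
        # Upper bound is exclusive, so the crop always fits in the image.
        top = int(rng.integers(0, h - crop + 1))
        left = int(rng.integers(0, w - crop + 1))
        patch = image[top:top + crop, left:left + crop]
        patches.append(patch)
        patches.append(patch[:, ::-1])  # horizontal flip: reverse the width axis
    return patches


if __name__ == "__main__":
    toy = np.zeros((8, 10, 3))  # small stand-in for a resized image
    out = random_crops_with_flips(toy, crop=6)
    print(len(out), out[0].shape)
```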
Optimization
SGD: the learning rate is multiplied by 0.5 after (30, 50, 60, 70, 80) epochs
Initial learning rate: 0.05
Momentum: 0.6
Batch size: 128
Weight decay: 0.00001
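The hyperparameters above can be put together in a short sketch. Two things here are assumptions rather than facts from the paper: epochs are counted from 0 (so the first halving takes effect at epoch index 30), and the update rule is the common "momentum plus L2 weight decay" form. Both function names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.05, factor=0.5,
                  milestones=(30, 50, 60, 70, 80)):
    """Learning rate in effect during `epoch` (0-indexed, an assumption)."""
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** n_decays


def sgd_step(w, grad, velocity, lr, momentum=0.6, weight_decay=1e-5):
    """One SGD update with momentum and L2 weight decay.

    Assumed update form (not taken from the paper):
        v <- momentum * v - lr * (grad + weight_decay * w)
        w <- w + v
    Works on floats or NumPy arrays.
    """
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity


if __name__ == "__main__":
    for e in (0, 29, 30, 50, 80):
        print(e, learning_rate(e))
```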
Weight initialization
Weights: zero-mean Gaussian distribution with standard deviation 0.01
Biases: zero-mean Gaussian distribution with standard deviation 0.01
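A minimal NumPy sketch of this initialization, assuming a single layer described by a weight shape and a bias count. The function name and the example layer sizes are made up for illustration.

```python
import numpy as np


def init_layer(weight_shape, n_bias, std=0.01, rng=None):
    """Draw weights and biases from a zero-mean Gaussian with the given std.

    Hypothetical helper; the shapes passed in are up to the caller.
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = rng.normal(loc=0.0, scale=std, size=weight_shape)
    biases = rng.normal(loc=0.0, scale=std, size=n_bias)
    return weights, biases


if __name__ == "__main__":
    w, b = init_layer((256, 512), 256)  # illustrative sizes only
    print(w.shape, b.shape, float(w.std()))
```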
Dropout
Applied to the fully connected layers (6th and 7th) with a rate of 0.5
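For reference, dropout with rate 0.5 can be sketched as below. This uses the "inverted" variant, which rescales the kept activations during training so that nothing changes at test time; that choice is an assumption made here, and the paper's implementation may instead scale at test time.

```python
import numpy as np


def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each activation with probability p while training.

    Kept activations are divided by (1 - p) so the expected value is unchanged.
    At test time the input is returned as-is.
    """
    if not training or p == 0.0:
        return x
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)


if __name__ == "__main__":
    activations = np.ones(8)
    print(dropout(activations))                  # mix of 0.0 and 2.0
    print(dropout(activations, training=False))  # unchanged
```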
Test Evaluation
This test-time evaluation is often called the dense sliding-window method or dense evaluation.
The layer-5 output is pooled at several pixel offsets, each shifted version is fed to the later layers, and the results are averaged.
(1) For a single image, at a given scale, we start with the unpooled layer 5 feature maps.
(2) Each of the unpooled maps undergoes a 3*3 max pooling operation (non-overlapping regions), repeated 3*3 times for (∆x, ∆y) pixel offsets of {0, 1, 2}.
(3) This produces a set of pooled feature maps, replicated (3*3) times for different (∆x, ∆y) combinations.
(4) The classifier (layers 6, 7, 8) has a fixed input size of 5*5 and produces a C-dimensional output vector for each location within the pooled maps. The classifier is applied in sliding-window fashion to the pooled maps, yielding C-dimensional output maps (for a given (∆x, ∆y) combination).
(5) The output maps for different (∆x, ∆y) combinations are reshaped into a single 3D output map (two spatial dimensions x C classes).
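Steps (2) and (3) can be illustrated on a single 2-D feature map with NumPy. This is only a sketch under simplifying assumptions: one channel, rows and columns that do not fill a complete 3*3 window are dropped, and the classifier and the final interleaving of steps (4) and (5) are left out. The function name is hypothetical.

```python
import numpy as np


def offset_max_pool(fmap, k=3):
    """Non-overlapping k x k max pooling at every (dx, dy) offset in {0, ..., k-1}.

    Returns a dict mapping (dx, dy) to the pooled map for that offset,
    so k * k pooled maps are produced in total.
    """
    pooled = {}
    for dy in range(k):
        for dx in range(k):
            shifted = fmap[dy:, dx:]
            # Keep only the part that divides evenly into k x k blocks.
            rows = (shifted.shape[0] // k) * k
            cols = (shifted.shape[1] // k) * k
            blocks = shifted[:rows, :cols].reshape(rows // k, k, cols // k, k)
            pooled[(dx, dy)] = blocks.max(axis=(1, 3))
    return pooled


if __name__ == "__main__":
    fmap = np.arange(36).reshape(6, 6)  # toy unpooled map
    maps = offset_max_pool(fmap)
    print(len(maps))      # 9 offset combinations
    print(maps[(0, 0)])   # pooled map with no shift
```

With a 20-pixel side, every offset in {0, 1, 2} still yields 6 pooled positions, so the 3*3 shifted maps all have the same size and can be combined afterwards.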
Architecture
Result
ImageNet recognition