4. Object detection¶
Slides: pdf
4.1. Object detection¶
Contrary to object classification/recognition, which assigns a single label to an image, object detection requires both classifying the objects and reporting their position and size in the image (bounding box).
A naive and very expensive method is to use a trained CNN as a high-level filter: the CNN is trained on small images and applied convolutionally to bigger images. The output is a heatmap of the probability that a particular object is present at each location.
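A minimal sketch of this sliding-window procedure, assuming a hypothetical binary classifier cnn with a Keras-like predict() method, trained on 32x32 patches (all names and sizes are illustrative):

```python
import numpy as np

def detection_heatmap(image, cnn, patch=32, stride=8):
    """Slide a patch classifier over the image and record P(object) at each position."""
    H, W = image.shape[:2]
    heatmap = np.zeros(((H - patch) // stride + 1, (W - patch) // stride + 1))
    for i in range(heatmap.shape[0]):
        for j in range(heatmap.shape[1]):
            # Extract the current window and classify it
            window = image[i*stride : i*stride + patch, j*stride : j*stride + patch]
            heatmap[i, j] = cnn.predict(window[None, ...])[0]  # probability that the object is present
    return heatmap
```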
Object detection is both a:
Classification problem, as one has to recognize an object.
Regression problem, as one has to predict the coordinates \((x, y, w, h)\) of the bounding box.
The main datasets for object detection are the PASCAL Visual Object Classes Challenge (20 classes, ~10K images, ~25K annotated objects, http://host.robots.ox.ac.uk/pascal/VOC/voc2008/) and the MS COCO dataset (Common Objects in COntext, 330k images, 80 labels, http://cocodataset.org).
4.2. R-CNN : Regions with CNN features¶
R-CNN [Girshick et al., 2014] was one of the first CNN-based architectures allowing object detection.
It is a pipeline of 4 steps:
Bottom-up region proposals by searching bounding boxes based on pixel info (selective search https://ivi.fnwi.uva.nl/isis/publications/2013/UijlingsIJCV2013/UijlingsIJCV2013.pdf).
Feature extraction using a pre-trained CNN (AlexNet).
Classification using an SVM (is there an object and, if so, which class?).
If an object is found, linear regression on the region proposal to generate tighter bounding box coordinates.
Each region proposal is processed by the CNN, followed by a SVM and a bounding box regressor.
The CNN is pre-trained on ImageNet and fine-tuned on Pascal VOC (transfer learning).
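Schematically, the pipeline could be written as follows. This is only a sketch with hypothetical helpers (selective_search, warp, cnn, svm, regressor); none of these names come from the original code:

```python
def rcnn_detect(image):
    detections = []
    for box in selective_search(image):           # 1. ~2000 bottom-up region proposals
        crop = warp(image, box, size=(227, 227))  # each proposal is warped to the CNN input size
        features = cnn.extract(crop)              # 2. features from the pre-trained CNN (AlexNet)
        label, score = svm.classify(features)     # 3. SVM: object or not, and which class
        if label != "background":
            box = regressor.refine(features, box) # 4. tighter bounding box coordinates
            detections.append((label, score, box))
    return detections
```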
4.3. Fast R-CNN¶
The main drawback of R-CNN is that each of the ~2000 region proposals has to go through the CNN, which is extremely slow. The idea behind Fast R-CNN [Girshick, 2015] is to apply the CNN only once on the whole image and to extract the region proposals from the resulting higher-level feature maps, reusing the pre-trained features (transfer learning).
The network first processes the whole image with several convolutional and max pooling layers to produce a feature map. Each object proposal is projected to the feature map, where a region of interest (RoI) pooling layer extracts a fixed-length feature vector. Each feature vector is fed into a sequence of FC layers that finally branch into two sibling output layers:
a softmax probability estimate over the K classes plus a catch-all “background” class.
a regression layer that outputs four real-valued numbers for each class.
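RoI pooling is available off-the-shelf, for example in torchvision. A minimal sketch of extracting a fixed-size feature vector from a proposal projected onto the feature map (shapes and coordinates are illustrative):

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 50, 50)  # conv features for the whole image
# One proposal, projected to feature-map coordinates: (batch_index, x1, y1, x2, y2)
proposals = torch.tensor([[0., 10., 10., 40., 30.]])
roi = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0)
print(roi.shape)  # torch.Size([1, 512, 7, 7]) -> flattened into a fixed-length vector for the FC layers
```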
The loss function to minimize is a composition of different losses and penalty terms:
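With \(p\) the predicted class probabilities, \(u\) the true class, \(t^u\) the predicted box offsets and \(v\) the ground truth box, the multi-task loss of the original paper reads:

\[\mathcal{L}(p, u, t^u, v) = \mathcal{L}_\text{cls}(p, u) + \lambda \, [u \geq 1] \, \mathcal{L}_\text{loc}(t^u, v)\]

where \(\mathcal{L}_\text{cls}(p, u) = -\log p_u\) is the log loss over the \(K+1\) classes, \(\mathcal{L}_\text{loc}\) is a smooth L1 loss on the box coordinates, and the indicator \([u \geq 1]\) disables the localization term for the background class.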
4.4. Faster R-CNN¶
Both R-CNN and Fast R-CNN use selective search to find the region proposals, which is slow and time-consuming. Faster R-CNN [Ren et al., 2016] introduces an object detection algorithm that lets the network learn the region proposals itself. The image is passed through a pretrained CNN to obtain a convolutional feature map. A separate network, the Region Proposal Network (RPN), predicts the region proposals. The predicted region proposals are then reshaped using a RoI (region-of-interest) pooling layer and used to classify the object and predict the bounding box.
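A pre-trained Faster R-CNN is directly available in torchvision. A minimal inference sketch (the pretrained argument is named weights in recent torchvision versions):

```python
import torch
import torchvision

# Faster R-CNN with a ResNet-50 FPN backbone, pre-trained on MS COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)      # dummy RGB image, values in [0, 1]
with torch.no_grad():
    predictions = model([image])     # list with one dict per image
print(predictions[0].keys())         # dict_keys(['boxes', 'labels', 'scores'])
```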
4.5. YOLO¶
(Fast(er)) R-CNN performs classification for each region proposal sequentially, which is slow. YOLO (You Only Look Once) [Redmon et al., 2016] applies a single neural network to the full image to predict all possible boxes and the corresponding classes in a single pass. YOLO divides the image into an \(S \times S\) grid of cells.
Each grid cell predicts a single object, with the corresponding \(C\) class probabilities (softmax). It also predicts the coordinates of \(B\) possible bounding boxes \((x, y, w, h)\), as well as a box confidence score. The \(S \times S \times B\) predicted boxes are then pooled together to form the final prediction.
In the figure below, the yellow box predicts the presence of a person (the class) as well as a candidate bounding box (it may be bigger than the grid cell itself).
In the original YOLO implementation, each grid cell proposes 2 bounding boxes:
Each grid cell predicts a probability for each of the 20 classes, two bounding boxes (4 coordinates per bounding box) and their confidence scores. This makes \(C + 5 B = 30\) values to predict for each cell.
4.5.1. Architecture of the CNN¶
YOLO uses a CNN with 24 convolutional layers and 4 max-pooling layers to obtain a 7x7 grid. The last convolutional layer outputs a tensor of shape (7, 7, 1024). This tensor is flattened and passed through 2 fully connected layers. The output is a tensor of shape (7, 7, 30), i.e. 7x7 grid cells with 20 class probabilities and 2 bounding box predictions per cell.
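A minimal PyTorch sketch of this output head (layer sizes are illustrative, chosen to match the shapes above):

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20                          # grid size, boxes per cell, classes

head = nn.Sequential(
    nn.Flatten(),                           # (1024, 7, 7) -> 50176 values
    nn.Linear(7 * 7 * 1024, 4096),
    nn.LeakyReLU(0.1),
    nn.Linear(4096, S * S * (C + 5 * B)),   # 7*7*30 = 1470 values
)

conv_features = torch.randn(1, 1024, 7, 7)  # output of the last convolutional layer
output = head(conv_features).view(-1, S, S, C + 5 * B)
print(output.shape)                         # torch.Size([1, 7, 7, 30])
```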
4.5.2. Confidence score¶
The 7x7 grid cells predict 2 bounding boxes each, i.e. a maximum of 98 bounding boxes on the whole image. Only the bounding boxes with the highest class confidence scores are kept.
In practice, the class confidence score should be above 0.25.
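As a NumPy sketch of this filtering step (random values stand in for the actual network outputs):

```python
import numpy as np

box_confidences = np.random.rand(7, 7, 2)  # one box confidence score per predicted box
class_probs = np.random.rand(7, 7, 20)     # conditional class probabilities per cell

# class confidence score = box confidence * conditional class probability
class_conf = box_confidences[..., :, None] * class_probs[..., None, :]  # (7, 7, 2, 20)
keep = class_conf.max(axis=-1) > 0.25      # keep boxes whose best class score exceeds 0.25
print(keep.sum(), "of", 7 * 7 * 2, "boxes kept")
```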
4.5.3. Intersection over Union (IoU)¶
To ensure specialization, only one bounding box per grid cell should be responsible for detecting an object. During learning, we select the bounding box with the biggest overlap with the object, as measured by the Intersection over Union (IoU).
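A straightforward implementation for two axis-aligned boxes in \((x_1, y_1, x_2, y_2)\) format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```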
4.5.4. Loss functions¶
The output of the network is a 7x7x30 tensor, representing for each cell:
the probability that an object of a given class is present.
the position of two bounding boxes.
the confidence that the proposed bounding boxes correspond to a real object (the IoU).
We are going to combine three different loss functions:
The categorization loss: each cell should predict the correct class.
The localization loss: error between the predicted boundary box and the ground truth for each object.
The confidence loss: do the predicted bounding boxes correspond to real objects?
Classification loss
The classification loss is the mse between:
\(\hat{p}_i(c)\): the one-hot encoded class \(c\) of the object present under each cell \(i\), and
\(p_i(c)\): the predicted class probabilities of cell \(i\).
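Following the original paper, this gives:

\[\mathcal{L}_\text{classification} = \sum_{i=0}^{S^2} \mathbb{1}_i^\text{obj} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2\]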
where \(\mathbb{1}_i^\text{obj}\) is 1 when there actually is an object behind the cell \(i\), 0 otherwise (background).
They could also have used the cross-entropy loss, but the output layer is not a regular softmax layer. Using mse is also more compatible with the other losses.
Localization loss
For all bounding boxes matching a real object, we want to minimize the mse between:
\((\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i)\): the coordinates of the ground truth bounding box, and
\((x_i, y_i, w_i, h_i)\): the coordinates of the predicted bounding box.
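Following the original paper (with a weighting factor \(\lambda^\text{coord} = 5\)):

\[\mathcal{L}_\text{localization} = \lambda^\text{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{obj} \, \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right]\]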
where \(\mathbb{1}_{ij}^\text{obj}\) is 1 when the bounding box \(j\) of cell \(i\) “matches” the object (highest IoU). The square root of the width and height of the bounding boxes is used, so that errors on small boxes are penalized relatively more than errors on big boxes.
Confidence loss
Finally, we need to learn the confidence score of each bounding box, by minimizing the mse between:
\(C_i\): the predicted confidence score of cell \(i\), and
\(\hat{C}_i\): the IoU between the ground truth bounding box and the predicted one.
Two cases are considered:
There was a real object at that location (\(\mathbb{1}_{ij}^\text{obj} = 1\)): the confidences should be updated fully.
There was no real object (\(\mathbb{1}_{ij}^\text{noobj} = 1\)): the confidences should only be moderately updated (\(\lambda^\text{noobj} = 0.5\)).
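Following the original paper, the two cases combine into:

\[\mathcal{L}_\text{confidence} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{obj} \, (C_i - \hat{C}_i)^2 + \lambda^\text{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{noobj} \, (C_i - \hat{C}_i)^2\]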
This is to deal with class imbalance: there are much more cells on the background than on real objects.
Put together, the loss function to minimize is:
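\[\mathcal{L} = \mathcal{L}_\text{classification} + \mathcal{L}_\text{localization} + \mathcal{L}_\text{confidence}\]

(the weighting factors \(\lambda^\text{coord}\) and \(\lambda^\text{noobj}\) being included in the localization and confidence terms above).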
4.5.5. YOLO trained on PASCAL VOC¶
YOLO was trained on PASCAL VOC (natural images) but generalizes well to other datasets (paintings…). YOLO runs in real-time (60 fps) on a NVIDIA Titan X. Faster and more accurate versions of YOLO have since been developed: YOLO9000 [Redmon & Farhadi, 2016], YOLOv3 [Redmon & Farhadi, 2018], YOLOv5 (https://github.com/ultralytics/yolov5)…
Refer to the website of the authors for additional information: https://pjreddie.com/darknet/yolo/
4.6. SSD¶
The idea of SSD (Single-Shot Detector, [Liu et al., 2016]) is similar to YOLO, but:
faster
more accurate
not limited to 98 objects per scene
multi-scale
Contrary to YOLO, feature maps at several depths of the network are used to predict bounding boxes, not just the final tensor (skip connections). This allows detecting boxes at multiple scales (pyramid).
4.7. 3D object detection¶
It is also possible to use depth information (e.g. from a Kinect) as an additional channel of the R-CNN. The depth information provides more information on the structure of the object, helping to disambiguate certain situations (segmentation).
Lidar point clouds can also be used for detecting objects, for example VoxelNet [Zhou & Tuzel, 2017], trained on the KITTI dataset.
Additional resources on object detection
https://medium.com/comet-app/review-of-deep-learning-algorithms-for-object-detection-c1f3d437b852
https://medium.com/@smallfishbigsea/faster-r-cnn-explained-864d4fb7e3f8
https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088
https://towardsdatascience.com/lidar-3d-object-detection-methods-f34cf3227aea