4. Object detection

Slides: pdf

4.1. Object detection

Contrary to object classification/recognition, which assigns a single label to an image, object detection requires both classifying the objects and reporting their position and size in the image (bounding box).

../_images/dnn_classification_vs_detection.png

Fig. 4.10 Object recognition vs. detection. Source: https://blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4

A naive and very expensive method is to use a trained CNN as a high-level filter. The CNN is trained on small images and convolved over bigger images. The output is a heatmap of the probability that a particular object is present at each location.

../_images/objectdetection.png

Fig. 4.11 Using a pretrained CNN to generate heatmaps. Source: https://blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4
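As an illustration, a minimal sketch of this sliding-window approach (assuming a PyTorch classifier cnn already trained on small crops; all names and sizes are illustrative):

    import torch

    def detection_heatmap(cnn, image, window=32, stride=8):
        """Slide a classifier trained on small crops over a larger image."""
        _, H, W = image.shape                        # image: (3, H, W) tensor
        rows = (H - window) // stride + 1
        cols = (W - window) // stride + 1
        heatmap = torch.zeros(rows, cols)
        cnn.eval()
        with torch.no_grad():
            for i in range(rows):
                for j in range(cols):
                    crop = image[:, i*stride:i*stride+window, j*stride:j*stride+window]
                    probs = torch.softmax(cnn(crop.unsqueeze(0)), dim=1)
                    heatmap[i, j] = probs[0, 1]      # probability of the "object" class
        return heatmap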

Object detection is both a:

  • Classification problem, as one has to recognize an object.

  • Regression problem, as one has to predict the coordinates \((x, y, w, h)\) of the bounding box.

../_images/localization.png

Fig. 4.12 Object detection is both a classification and regression task. Source: https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e

The main datasets for object detection are the PASCAL Visual Object Classes Challenge (20 classes, ~10K images, ~25K annotated objects, http://host.robots.ox.ac.uk/pascal/VOC/voc2008/) and the MS COCO dataset (Common Objects in COntext, 330K images, 80 labels, http://cocodataset.org).

4.2. R-CNN: Regions with CNN features

R-CNN [Girshick et al., 2014] was one of the first CNN-based architectures allowing object detection.

../_images/rcnn.png

Fig. 4.13 R-CNN [Girshick et al., 2014].

It is a pipeline of 4 steps:

  1. Bottom-up region proposals obtained by searching for candidate bounding boxes based on low-level pixel information (selective search, https://ivi.fnwi.uva.nl/isis/publications/2013/UijlingsIJCV2013/UijlingsIJCV2013.pdf).

  2. Feature extraction using a pre-trained CNN (AlexNet).

  3. Classification using an SVM (is there an object and, if so, which class?).

  4. If an object is found, linear regression on the region proposal to generate tighter bounding box coordinates.

Each region proposal is processed by the CNN, followed by an SVM and a bounding box regressor.
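In pseudo-code, the pipeline could be summarized as follows (selective_search, warp, cnn_features, svm_classify and bbox_regress are hypothetical placeholders for the components described above):

    def rcnn_detect(image):
        """Schematic R-CNN pipeline; all helper functions are placeholders."""
        detections = []
        # 1. ~2000 class-agnostic region proposals (selective search)
        for box in selective_search(image):
            # 2. Warp the region to a fixed size and extract CNN features (AlexNet)
            features = cnn_features(warp(image, box, size=(227, 227)))
            # 3. One-vs-rest SVMs decide whether an object is present and which class
            label, score = svm_classify(features)
            if label != "background":
                # 4. Class-specific linear regression refines the bounding box
                detections.append((label, score, bbox_regress(features, box, label)))
        return detections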

../_images/rcnn-detail.png

Fig. 4.14 R-CNN [Girshick et al., 2014]. Source: https://courses.cs.washington.edu/courses/cse590v/14au/cse590v_wk1_rcnn.pdf

The CNN is pre-trained on ImageNet and fine-tuned on Pascal VOC (transfer learning).

../_images/rcnn-training.png

Fig. 4.15 R-CNN [Girshick et al., 2014]. Source: https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e

4.3. Fast R-CNN

The main drawback of R-CNN is that each of the ~2000 region proposals has to go through the CNN, which is extremely slow. The idea behind Fast R-CNN [Girshick, 2015] is to extract the region proposals on high-level feature maps computed once for the whole image, and to use transfer learning.

../_images/fast-rcnn.png

Fig. 4.16 Fast R-CNN [Girshick, 2015].

The network first processes the whole image with several convolutional and max pooling layers to produce a feature map. Each object proposal is projected to the feature map, where a region of interest (RoI) pooling layer extracts a fixed-length feature vector. Each feature vector is fed into a sequence of FC layers that finally branch into two sibling output layers:

  • a softmax probability estimate over the K classes plus a catch-all “background” class.

  • a regression layer that outputs four real-valued numbers for each class.
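RoI pooling itself is available in torchvision; a minimal sketch (the feature map and the two proposals below are made up for illustration):

    import torch
    from torchvision.ops import roi_pool

    # Feature map computed once for the whole image: (batch, channels, height, width)
    feature_map = torch.randn(1, 512, 32, 32)

    # Region proposals in feature-map coordinates: (batch_index, x1, y1, x2, y2)
    rois = torch.tensor([[0.,  4.,  4., 20., 24.],
                         [0., 10.,  2., 30., 16.]])

    # Each RoI is pooled to a fixed 7x7 grid, whatever its original size
    pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0)
    print(pooled.shape)   # torch.Size([2, 512, 7, 7]) -> fed to the FC layers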

The loss function to minimize is a composition of different losses and penalty terms:

\[ \mathcal{L}(\theta) = \lambda_1 \, \mathcal{L}_\text{classification}(\theta) + \lambda_2 \, \mathcal{L}_\text{regression}(\theta) + \lambda_3 \, \mathcal{L}_\text{regularization}(\theta) \]

4.4. Faster R-CNN

Both R-CNN and Fast R-CNN use selective search to find the region proposals, which is slow and time-consuming. Faster R-CNN [Ren et al., 2016] introduces an object detection algorithm that lets the network learn the region proposals. The image is passed through a pretrained CNN to obtain a convolutional feature map. A separate network is used to predict the region proposals. The predicted region proposals are then reshaped using a RoI (region-of-interest) pooling layer, whose output is used to classify the object and predict the bounding box.

../_images/faster-rcnn.png

Fig. 4.17 Faster R-CNN [Ren et al., 2016].
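The region proposal network (RPN) slides over the feature map and, at each position, scores a fixed set of anchor boxes of different scales and aspect ratios. A minimal sketch of the anchor generation step (the stride, scales and ratios below are typical values, not necessarily those of a given implementation):

    import numpy as np

    def generate_anchors(feature_size, stride=16, scales=(128, 256, 512),
                         ratios=(0.5, 1.0, 2.0)):
        """Anchor boxes (x_center, y_center, w, h) for each feature-map position."""
        anchors = []
        for i in range(feature_size):            # rows of the feature map
            for j in range(feature_size):        # columns of the feature map
                cx, cy = j * stride, i * stride  # center in image coordinates
                for scale in scales:
                    for ratio in ratios:
                        w = scale * np.sqrt(ratio)
                        h = scale / np.sqrt(ratio)
                        anchors.append([cx, cy, w, h])
        return np.array(anchors)

    anchors = generate_anchors(feature_size=40)
    print(anchors.shape)   # (14400, 4): the RPN predicts an objectness score
                           # and box offsets for each of these anchors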

4.5. YOLO

(Fast(er)) R-CNN performs classification for each region proposal sequentially, which is slow. YOLO (You Only Look Once) [Redmon & Farhadi, 2016] applies a single neural network to the full image to predict all possible boxes and the corresponding classes at once. YOLO divides the image into an SxS grid of cells.

../_images/yolo.png

Fig. 4.18 YOLO [Redmon & Farhadi, 2016].

Each grid cell predicts a single object, with the corresponding \(C\) class probabilities (softmax). It also predicts the coordinates of \(B\) possible bounding boxes (x, y, w, h) as well as a box confidence score. The SxSxB predicted boxes are then pooled together to form the final prediction.

In the figure below, the yellow box predicts the presence of a person (the class) as well as a candidate bounding box (it may be bigger than the grid cell itself).

../_images/yolo1.jpeg

Fig. 4.19 Each cell predicts a class (e.g. person) and the (x, y, w, h) coordinates of the bounding box. Source: https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088

In the original YOLO implementation, each grid cell proposes 2 bounding boxes:

../_images/yolo2.jpeg

Fig. 4.20 Each cell predicts two bounding boxes per object. Source: https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088

Each grid cell predicts a probability for each of the 20 classes, two bounding boxes (4 coordinates per bounding box) and their confidence scores. This makes C + B * 5 = 30 values to predict for each cell.

../_images/yolo3.jpeg

Fig. 4.21 Each cell outputs 30 values: 20 for the classes and 5 for each bounding box, including the confidence score. Source: https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088
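A sketch of how such a (7, 7, 30) tensor can be sliced (assuming the class probabilities come first, followed by the two boxes, each with (x, y, w, h) and a confidence score; the exact memory layout depends on the implementation):

    import torch

    S, B, C = 7, 2, 20
    output = torch.randn(S, S, C + B * 5)         # raw network output for one image

    class_probs = output[..., :C]                 # (7, 7, 20): class probabilities per cell
    boxes = output[..., C:].reshape(S, S, B, 5)   # (7, 7, 2, 5): two boxes per cell
    xywh = boxes[..., :4]                         # (x, y, w, h) for each box
    box_confidence = boxes[..., 4]                # (7, 7, 2): box confidence scores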

4.5.1. Architecture of the CNN

YOLO uses a CNN with 24 convolutional layers and 4 max-pooling layers to obtain a 7x7 grid. The last convolutional layer outputs a tensor of shape (7, 7, 1024). This tensor is then flattened and passed through 2 fully connected layers. The output is a tensor of shape (7, 7, 30), i.e. 7x7 grid cells, 20 classes and 2 bounding box predictions per cell.

../_images/yolo-cnn.png

Fig. 4.22 Architecture of the CNN used in YOLO [Redmon & Farhadi, 2016].

4.5.2. Confidence score

The 7x7 grid cells each predict 2 bounding boxes, i.e. a maximum of 98 bounding boxes for the whole image. Only the bounding boxes with the highest class confidence score are kept.

\[ \text{class confidence score} = \text{box confidence score} \times \text{class probability} \]

In practice, the class confidence score should be above 0.25.
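A minimal sketch of this filtering step (random tensors stand in for the network outputs):

    import torch

    S, B, C = 7, 2, 20
    class_probs = torch.rand(S, S, C)        # per-cell class probabilities
    box_confidence = torch.rand(S, S, B)     # per-box confidence scores

    # Class confidence score = box confidence score * class probability
    class_conf = box_confidence.unsqueeze(-1) * class_probs.unsqueeze(2)   # (7, 7, 2, 20)

    # Keep only the boxes whose best class confidence exceeds the threshold
    best_scores, best_classes = class_conf.max(dim=-1)                     # (7, 7, 2)
    keep = best_scores > 0.25
    print(keep.sum().item(), "boxes kept out of", S * S * B)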

../_images/yolo4.png

Fig. 4.23 Only the bounding boxes with the highest class confidence scores are kept among the 98 possible ones. Source: [Redmon & Farhadi, 2016].

4.5.3. Intersection over Union (IoU)

To ensure specialization, only one bounding box per grid cell should be responsible for detecting an object. During learning, we select the bounding box with the largest overlap with the object. This overlap is measured by the Intersection over Union (IoU).

../_images/iou1.jpg

Fig. 4.24 The Intersection over Union (IoU) measures the overlap between bounding boxes. Source: https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/

../_images/iou2.png

Fig. 4.25 The Intersection over Union (IoU) measures the overlap between bounding boxes. Source: https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
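The IoU between two boxes given as (x1, y1, x2, y2) corners can be computed directly:

    def iou(box_a, box_b):
        """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
        # Coordinates of the intersection rectangle
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        # Union = sum of the two areas minus the intersection
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return intersection / (area_a + area_b - intersection)

    print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.14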

4.5.4. Loss functions

The output of the network is a 7x7x30 tensor, representing for each cell:

  • the probability that an object of a given class is present.

  • the position of two bounding boxes.

  • the confidence that the proposed bounding boxes correspond to a real object (the IoU).

We are going to combine three different loss functions:

  1. The classification loss: each cell should predict the correct class.

  2. The localization loss: error between the predicted boundary box and the ground truth for each object.

  3. The confidence loss: do the predicted bounding boxes correspond to real objects?

Classification loss

The classification loss is the mean square error (mse) between:

  • \(\hat{p}_i(c)\): the one-hot encoded class \(c\) of the object present under each cell \(i\), and

  • \(p_i(c)\): the predicted class probabilities of cell \(i\).

\[ \mathcal{L}_\text{classification}(\theta) = \sum_{i=0}^{S^2} \mathbb{1}_i^\text{obj} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2 \]

where \(\mathbb{1}_i^\text{obj}\) is 1 when there actually is an object in cell \(i\), and 0 otherwise (background).

The cross-entropy loss could also have been used, but the output layer is not a regular softmax layer. Using the mse is also more consistent with the other losses.

Localization loss

For all bounding boxes matching a real object, we want to minimize the mse between:

  • \((\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i)\): the coordinates of the ground truth bounding box, and

  • \((x_i, y_i, w_i, h_i)\): the coordinates of the predicted bounding box.

\[\begin{split} \mathcal{L}_\text{localization}(\theta) = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{obj} [ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2] \\ \qquad\qquad\qquad\qquad + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{obj} [ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2] \end{split}\]

where \(\mathbb{1}_{ij}^\text{obj}\) is 1 when the bounding box \(j\) of cell \(i\) “matches” with an object (IoU). The square root of the width and height of the bounding boxes is used, so that errors on small boxes are penalized relatively more than errors on big boxes.
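A quick numerical check of this effect: a 5-pixel error on a 20-pixel-wide box costs roughly nine times more than the same absolute error on a 200-pixel-wide box:

    import numpy as np

    def sqrt_penalty(w_true, w_pred):
        return (np.sqrt(w_pred) - np.sqrt(w_true))**2

    print(sqrt_penalty(20, 25))     # ≈ 0.28: small box, 5-pixel error
    print(sqrt_penalty(200, 205))   # ≈ 0.03: large box, same 5-pixel error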

Confidence loss

Finally, we need to learn the confidence score of each bounding box, by minimizing the mse between:

  • \(C_{ij}\): the predicted confidence score of bounding box \(j\) in cell \(i\), and

  • \(\hat{C}_{ij}\): the IoU between the ground truth bounding box and the predicted one.

\[\begin{split} \mathcal{L}_\text{confidence}(\theta) = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{obj} (C_{ij} - \hat{C}_{ij})^2 \\ \qquad\qquad\qquad\qquad + \lambda^\text{noobj} \, \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{noobj} (C_{ij} - \hat{C}_{ij})^2 \end{split}\]

Two cases are considered:

  1. There was a real object at that location (\(\mathbb{1}_{ij}^\text{obj} = 1\)): the confidences should be updated fully.

  2. There was no real object (\(\mathbb{1}_{ij}^\text{noobj} = 1\)): the confidences should only be moderately updated (\(\lambda^\text{noobj} = 0.5\)).

This is to deal with class imbalance: there are many more cells containing only background than cells containing a real object.

Put together, the loss function to minimize is:

\[\begin{split} \begin{align} \mathcal{L}(\theta) & = \mathcal{L}_\text{classification}(\theta) + \lambda_\text{coord} \, \mathcal{L}_\text{localization}(\theta) + \mathcal{L}_\text{confidence}(\theta) \\ & = \sum_{i=0}^{S^2} \mathbb{1}_i^\text{obj} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2 \\ & + \lambda_\text{coord} \, \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{obj} [ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2] \\ & + \lambda_\text{coord} \, \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{obj} [ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2] \\ & + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{obj} (C_{ij} - \hat{C}_{ij})^2 \\ & + \lambda^\text{noobj} \, \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^\text{noobj} (C_{ij} - \hat{C}_{ij})^2 \\ \end{align} \end{split}\]
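A simplified sketch of this loss in PyTorch, assuming the targets have already been encoded in the same (S, S, ...) layout as the predictions and that the responsibility masks obj_mask / noobj_mask have been computed from the IoU matching described above (all names and shapes are illustrative; widths and heights are assumed positive):

    import torch

    def yolo_loss(pred_class, pred_box, pred_conf,     # network outputs
                  true_class, true_box, true_conf,     # encoded targets
                  obj_mask, noobj_mask,                 # responsibility masks (0/1)
                  lambda_coord=5.0, lambda_noobj=0.5):
        """Shapes: class (S, S, C), box (S, S, B, 4), conf and masks (S, S, B)."""
        # Classification loss: only for cells that contain an object
        cell_obj = obj_mask.max(dim=-1).values                  # (S, S)
        l_class = (cell_obj.unsqueeze(-1) * (pred_class - true_class)**2).sum()

        # Localization loss: only for the responsible boxes, sqrt on width/height
        l_xy = (obj_mask * ((pred_box[..., 0] - true_box[..., 0])**2 +
                            (pred_box[..., 1] - true_box[..., 1])**2)).sum()
        l_wh = (obj_mask * ((pred_box[..., 2].sqrt() - true_box[..., 2].sqrt())**2 +
                            (pred_box[..., 3].sqrt() - true_box[..., 3].sqrt())**2)).sum()

        # Confidence loss: full weight on objects, reduced weight on the background
        l_conf = (obj_mask * (pred_conf - true_conf)**2).sum() \
               + lambda_noobj * (noobj_mask * (pred_conf - true_conf)**2).sum()

        return l_class + lambda_coord * (l_xy + l_wh) + l_conf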

4.5.5. YOLO trained on PASCAL VOC

YOLO was trained on PASCAL VOC (natural images) but generalizes well to other datasets (paintings…). YOLO runs in real-time (60 fps) on an NVIDIA Titan X. Faster and more accurate versions of YOLO have been developed: YOLO9000 [Redmon et al., 2016], YOLOv3 [Redmon & Farhadi, 2018], YOLOv5 (https://github.com/ultralytics/yolov5)…

../_images/yolo-result2.png

Fig. 4.26 Performance of YOLO compared to the state of the art. Source: [Redmon & Farhadi, 2016].

Refer to the website of the authors for additional information: https://pjreddie.com/darknet/yolo/

4.6. SSD

The idea of SSD (Single-Shot Detector, [Liu et al., 2016]) is similar to YOLO, but:

  • faster

  • more accurate

  • not limited to 98 objects per scene

  • multi-scale

Contrary to YOLO, all convolutional layers are used to predict bounding boxes, not just the final tensor (skip connections). This allows boxes to be detected at multiple scales (pyramid).

../_images/ssd.png

Fig. 4.27 Single-Shot Detector, [Liu et al., 2016].
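Schematically, predictions are collected from several feature maps of decreasing resolution; a sketch of the idea (the channel counts, map sizes and number of anchors are illustrative, not the exact SSD configuration):

    import torch
    import torch.nn as nn

    num_classes, num_anchors = 21, 4     # illustrative values

    # One small convolutional head per feature map:
    # num_anchors * (4 box offsets + class scores) outputs per location
    def make_head(channels):
        return nn.Conv2d(channels, num_anchors * (4 + num_classes),
                         kernel_size=3, padding=1)

    # Feature maps taken at several depths of the backbone (decreasing resolution)
    feature_maps = [torch.randn(1, 512, 38, 38),
                    torch.randn(1, 1024, 19, 19),
                    torch.randn(1, 512, 10, 10)]
    heads = [make_head(512), make_head(1024), make_head(512)]

    # Small feature maps see large objects, large feature maps see small objects
    predictions = [head(fmap) for head, fmap in zip(heads, feature_maps)]
    total_boxes = sum(p.shape[2] * p.shape[3] * num_anchors for p in predictions)
    print(total_boxes)   # thousands of candidate boxes, far more than YOLO's 98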

4.7. 3D object detection

It is also possible to use depth information (e.g. from a Kinect) as an additional input channel of the R-CNN. The depth information provides additional cues about the structure of the object, which helps disambiguate certain situations (segmentation).

../_images/rcnn-rgbd.png

Fig. 4.28 Learning Rich Features from RGB-D Images for Object Detection, [Gupta et al., 2014].
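A minimal sketch of feeding depth as an extra input channel (shapes are illustrative):

    import torch

    rgb = torch.randn(1, 3, 224, 224)      # RGB image
    depth = torch.randn(1, 1, 224, 224)    # depth map (e.g. from a Kinect)

    # Stack depth as a fourth channel; the first convolutional layer of the
    # network must then accept 4 input channels instead of 3
    rgbd = torch.cat([rgb, depth], dim=1)  # (1, 4, 224, 224)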

Lidar point clouds can also be used for detecting objects, for example with VoxelNet [Zhou & Tuzel, 2017], trained on the KITTI dataset.

../_images/voxelnet.png

Fig. 4.29 VoxelNet [Zhou & Tuzel, 2017].

../_images/voxelnet-result.png

Fig. 4.30 VoxelNet [Zhou & Tuzel, 2017]. Source: https://medium.com/@SmartLabAI/3d-object-detection-from-lidar-data-with-deep-learning-95f6d400399a