Semantic segmentation
Professur für Künstliche Intelligenz - Fakultät für Informatik
Classical segmentation methods rely only on the similarity between neighboring pixels; they do not use any class information.
The output of semantic segmentation is another image, where each pixel encodes the class it belongs to.
The classes can be binary, for example foreground/background, person/not person, etc.
Semantic segmentation networks are used, for example, in YouTube Stories to add virtual backgrounds (background matting).
Wang B, Zheng H, Liang X, Chen Y, Lin L, Yang M (2018). Toward Characteristic-Preserving Image-based Virtual Try-On Network. arXiv:1807.07688.
Many datasets are freely available, but annotating such data is tedious, expensive and error-prone.
A fully convolutional network (FCN) has only convolutional layers (no fully connected layers) and learns to predict the output segmentation tensor directly.
The last layer has a pixel-wise softmax activation. We minimize the pixel-wise cross-entropy loss:
\mathcal{L}(\theta) = \mathbb{E}_\mathcal{D} [- \sum_\text{pixels} \sum_\text{classes} t_i \, \log y_i]
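A minimal sketch of this pixel-wise loss, assuming PyTorch; the tensor shapes and the number of classes are illustrative only, not part of any specific architecture.

```python
import torch
import torch.nn as nn

num_classes = 21                                          # illustrative class count
logits = torch.randn(4, num_classes, 128, 128,            # FCN output: (batch, classes, H, W)
                     requires_grad=True)
targets = torch.randint(0, num_classes, (4, 128, 128))    # ground-truth class index per pixel

# nn.CrossEntropyLoss applies the softmax internally and averages the
# negative log-likelihood over all pixels and the batch, which matches
# the pixel-wise cross-entropy loss above.
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets)
loss.backward()
```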
SegNet has an encoder-decoder architecture, with max-pooling to decrease the spatial resolution while increasing the number of features.
But what is the inverse of max-pooling? We need an upsampling operation.
Badrinarayanan, Handa and Cipolla (2015). “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling.” arXiv:1505.07293
Nearest-neighbor and bed-of-nails upsampling make fixed, arbitrary choices about where to place the values, ignoring where the maxima originally were.
In SegNet, max-unpooling uses the indices memorized by the corresponding max-pooling layer in the encoder to place each value back at its original location.
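A minimal sketch of this mechanism, assuming PyTorch (the feature map shapes are illustrative): the encoder's pooling layer returns the indices of the maxima, and the decoder uses them to undo the pooling.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 32, 32)        # an encoder feature map
pooled, indices = pool(x)             # 16x16 map + locations of the maxima
# ... decoder convolutions would process `pooled` here ...
upsampled = unpool(pooled, indices)   # 32x32 map, non-maxima positions filled with zeros
```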
In a transposed convolution (also called a deconvolution or fractionally-strided convolution), the original feature map is first upsampled by inserting zeros between the values.
A learned filter then performs a regular convolution to produce the upsampled feature map.
This works well when strided convolutions are used for downsampling in the encoder.
Quite expensive computationally.
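A minimal sketch of learned upsampling with a transposed convolution, assuming PyTorch; channel counts and resolutions are illustrative.

```python
import torch
import torch.nn as nn

# Stride 2 doubles the spatial resolution; the filter weights are
# learned like those of any other convolution.
upconv = nn.ConvTranspose2d(in_channels=128, out_channels=64,
                            kernel_size=2, stride=2)

x = torch.randn(1, 128, 16, 16)   # low-resolution feature map
y = upconv(x)                     # -> (1, 64, 32, 32)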
The problem with SegNet is that small details (fine scales) are lost because of the max-pooling: the segmentation is not precise.
The solution proposed by U-Net is to add skip connections between corresponding levels of the encoder and decoder (similar in spirit to ResNet, but concatenating feature maps instead of adding them); see the sketch below.
The final segmentation depends both on:
large-scale information computed in the middle of the encoder-decoder.
small-scale information processed in the early layers of the encoder.
Ronneberger, Fischer, Brox (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597
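A minimal sketch of a U-Net-style skip connection, assuming PyTorch: the decoder upsamples the coarse features and concatenates them with the matching encoder features before further convolutions. The layer sizes are illustrative, not the exact U-Net architecture.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_channels + skip_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                    # double the spatial resolution
        x = torch.cat([x, skip], dim=1)   # skip connection: concatenate encoder features
        return self.conv(x)

block = UpBlock(in_channels=256, skip_channels=128, out_channels=128)
decoder_features = torch.randn(1, 256, 16, 16)   # large-scale information (bottleneck)
encoder_features = torch.randn(1, 128, 32, 32)   # small-scale information (early layer)
out = block(decoder_features, encoder_features)  # -> (1, 128, 32, 32)
```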
For many applications, segmenting the background is unnecessary, and a two-stage approach can save computation.
Mask R-CNN uses Faster R-CNN to extract bounding boxes around objects of interest, followed by the prediction of a binary mask inside each box to segment the object.
He K, Gkioxari G, Dollár P, Girshick R (2018). Mask R-CNN. arXiv:1703.06870.
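A minimal sketch of this two-stage pipeline, assuming torchvision's pretrained Mask R-CNN is available: for each detected object the model returns a bounding box, a class label, a confidence score and a soft binary mask.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# COCO-pretrained weights (older torchvision versions use pretrained=True instead).
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)          # a dummy RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])[0]

boxes = predictions["boxes"]    # (N, 4) bounding boxes
labels = predictions["labels"]  # (N,) class indices
scores = predictions["scores"]  # (N,) confidences
masks = predictions["masks"]    # (N, 1, H, W) soft masks, thresholded e.g. at 0.5
```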