Fully Convolutional One-Stage Object Detection (FCOS)

Hello there. I took a break from my Internet activity, but here I am again. In this piece, I am writing about Fully Convolutional One-Stage Object Detection (FCOS), which was published after YOLOv3 but before YOLOv4; therefore, I feel it is important to look at it first. Also, I am kind of tired of YOLO, haha. I hope you find it useful, since it is surely a unique work.

FCOS completely drops the complicated computation related to anchor boxes, along with all of their associated hyper-parameters, which are often very sensitive to the final detection performance. It is built on an FPN with a very straightforward architecture, yet it remains accurate and fast.

FCOS: Approach

Taken from the original paper: https://arxiv.org/pdf/1904.01355.pdf

Please take your time and have a look at the architecture above. I believe that seeing and analyzing the architecture first helps to understand the concept much better.

Fully Convolutional One-Stage Object Detector

Taken from the original paper: https://arxiv.org/pdf/1904.01355.pdf

Let F_i be the feature map at layer i of the CNN backbone, and let s be the total stride up to that layer. Each location (x, y) on F_i can be mapped back onto the input image as (s//2 + xs, s//2 + ys), where ‘//’ denotes floor division. For example, s at P3, shown in Fig. 2 (above), is 8. A location (x, y) is treated as a positive sample if it falls within any ground-truth box, and its class label c is the class label of that ground-truth box. If it does not fall into any ground-truth box, the class label 0 is assigned to indicate background.
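
To make the mapping concrete, here is a minimal sketch (my own code, not the official implementation) that computes the image-plane coordinates of every location on a feature map with stride s:

```python
import torch

def feature_map_locations(height, width, stride):
    # Image-plane coordinates (s//2 + x*s, s//2 + y*s) for every
    # location (x, y) of a height x width feature map with stride s.
    xs = torch.arange(width) * stride + stride // 2
    ys = torch.arange(height) * stride + stride // 2
    y_grid, x_grid = torch.meshgrid(ys, xs, indexing="ij")
    # Shape: (height * width, 2); each row is an (x, y) point on the image.
    return torch.stack((x_grid.reshape(-1), y_grid.reshape(-1)), dim=1)

# Example: P3 has stride 8, so a 100x100 map covers an 800x800 input.
locations_p3 = feature_map_locations(100, 100, stride=8)
```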

In addition to classification, FCOS also predicts a 4D real vector t* = (l*, t*, r*, b*) as the regression target for the location. The four values are the distances from the location (x, y) to the four sides of the ground-truth box: left, top, right and bottom. They are calculated as follows:

Taken from the original paper: https://arxiv.org/pdf/1904.01355.pdf

where (x_0, y_0) and (x_1, y_1) denote the left-top and right-bottom corners of the ground-truth bounding box.
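
As a quick illustration, the targets can be computed as below; the function name and the example box are mine, chosen just for the sake of the example:

```python
def regression_targets(x, y, box):
    # (x_0, y_0, x_1, y_1) are the left-top and right-bottom box corners.
    x0, y0, x1, y1 = box
    l = x - x0   # distance to the left side
    t = y - y0   # distance to the top side
    r = x1 - x   # distance to the right side
    b = y1 - y   # distance to the bottom side
    return l, t, r, b

# A location (60, 50) inside the ground-truth box (20, 10, 200, 150):
print(regression_targets(60, 50, (20, 10, 200, 150)))  # (40, 40, 140, 100)
```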

Multi-level Prediction with FPN for FCOS

Following FPN, FCOS detects objects of different sizes on different levels of feature maps. There are five levels of feature maps, defined as {P3, P4, P5, P6, P7} and shown in Fig. 2. As we can see, P3, P4 and P5 are obtained by passing the backbone’s feature maps C3, C4 and C5 through 1 × 1 convolutions combined with top-down connections, while P6 and P7 are computed by applying one convolutional layer with stride 2 on P5 and P6, respectively. Therefore, the resulting strides for the feature levels P3, P4, P5, P6 and P7 are 8, 16, 32, 64 and 128, respectively.
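
Below is a rough PyTorch sketch of how such a pyramid could be wired up; the channel counts, module names and the ReLU before P7 are my assumptions rather than details taken from the official code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, c3_ch, c4_ch, c5_ch, out_ch=256):
        super().__init__()
        self.lat3 = nn.Conv2d(c3_ch, out_ch, kernel_size=1)
        self.lat4 = nn.Conv2d(c4_ch, out_ch, kernel_size=1)
        self.lat5 = nn.Conv2d(c5_ch, out_ch, kernel_size=1)
        self.p6 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.p7 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        # 1x1 lateral convolutions plus top-down upsampling.
        p5 = self.lat5(c5)
        p4 = self.lat4(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lat3(c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        # Extra levels from stride-2 convolutions on top of P5 and P6.
        p6 = self.p6(p5)
        p7 = self.p7(F.relu(p6))
        return p3, p4, p5, p6, p7  # strides 8, 16, 32, 64, 128
```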

In contrast to anchor-based detectors, where anchor boxes of different sizes are assigned to different feature levels, here m_i, the maximum distance that feature level i needs to regress, is defined for each level. If a location satisfies max(l*, t*, r*, b*) > m_i or max(l*, t*, r*, b*) < m_i−1, it is discarded and regarded as a negative sample for F_i. In the original paper, m2, m3, m4, m5, m6 and m7 are set to 0, 64, 128, 256, 512 and ∞, respectively. Thus, P3 is responsible for the range [0, 64], P4 for [64, 128], and so on.
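
A minimal sketch of this assignment rule is shown below; the boundary convention (which level gets a target that lands exactly on an m_i value) is my own choice:

```python
M = [0, 64, 128, 256, 512, float("inf")]  # m2 .. m7 from the paper

def assign_level(l, t, r, b):
    """Return the pyramid level (3..7) responsible for this target,
    or None if the location should be treated as a negative sample."""
    m = max(l, t, r, b)
    for i in range(1, len(M)):
        if M[i - 1] < m <= M[i]:
            return i + 2  # levels start at P3
    return None

print(assign_level(40, 40, 140, 100))  # max = 140 -> level 5 (P5)
```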

The shared regression head should be able to perform well for the different size ranges at different levels. To help it, a trainable scalar s_i is introduced for each level. Finally, since l*, t*, r*, b* are always positive, an exp function is applied to s_i * x, giving the final output exp(s_i * x), where x is the raw prediction.
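
As a tiny illustration, such a per-level scalar could look like this in PyTorch (a hedged sketch, not the official module):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Trainable scalar s_i followed by exp(), keeping distances positive."""
    def __init__(self, init_value=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_value))

    def forward(self, x):
        return torch.exp(self.scale * x)

scales = nn.ModuleList([Scale() for _ in range(5)])  # one per P3..P7
raw_regression = torch.randn(1, 4, 100, 100)         # head output at P3
distances = scales[0](raw_regression)                # always positive
```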

Loss Function

Taken from the original paper: https://arxiv.org/pdf/1904.01355.pdf

The authors employ the focal loss and the IOU loss for the classification and regression tasks, respectively. Note that negative samples (background locations) are excluded from the regression loss.
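
For reference, written out, the total loss (Eq. 2) has roughly the following form; I am reproducing it from my reading of the paper, so the exact symbols may differ slightly from the figure above:

```latex
L(\{p_{x,y}\}, \{t_{x,y}\}) =
    \frac{1}{N_{\mathrm{pos}}} \sum_{x,y} L_{\mathrm{cls}}\bigl(p_{x,y}, c^{*}_{x,y}\bigr)
  + \frac{\lambda}{N_{\mathrm{pos}}} \sum_{x,y}
      \mathbb{1}_{\{c^{*}_{x,y} > 0\}}\, L_{\mathrm{reg}}\bigl(t_{x,y}, t^{*}_{x,y}\bigr)
```

where L_cls is the focal loss, L_reg is the IOU loss, N_pos is the number of positive samples, and the indicator function keeps only the locations with a non-background label.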

Center-ness for FCOS

The authors observed that FCOS produced a lot of low-quality bounding box predictions at locations far away from the center of an object. Therefore, a center-ness branch is added to predict the corresponding score, while the ground-truth value is computed as follows:

Taken from the original paper: https://arxiv.org/pdf/1904.01355.pdf

The value lies within [0, 1] and is therefore trained with the binary cross-entropy (BCE) loss, which is added to the loss function above (Eq. 2).
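
In code, the center-ness target could be computed like this (a small sketch of my own, using the l*, t*, r*, b* targets defined earlier):

```python
import math

def centerness_target(l, t, r, b):
    # Square root of the product of the two side ratios: 1 at the box
    # center, approaching 0 as the location moves towards the box edges.
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

print(centerness_target(40, 40, 140, 100))  # off-center location -> < 1
print(centerness_target(90, 70, 90, 70))    # perfectly centered   -> 1.0
```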

Without it, there would still be a lot of locations on F_i with high classification scores but low-quality boxes. Therefore, during inference, we multiply the classification scores by the corresponding center-ness scores. Finally, NMS can filter out these low-quality predictions.
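
A hedged sketch of this inference step, with made-up tensor shapes just for illustration:

```python
import torch

cls_scores = torch.rand(1000, 80)       # per-location class probabilities
centerness = torch.rand(1000, 1)        # per-location center-ness in [0, 1]
final_scores = cls_scores * centerness  # down-weights off-center locations
# final_scores is then thresholded and the surviving boxes go through NMS.
```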
