Overview

This is the benchmark introduced in the CVPR 2019 paper Towards Universal Object Detection by Domain Attention[1]. The goal of this benchmark is to encourage the design of universal object detection systems capable of solving various detection tasks. To train and evaluate universal/multi-domain object detection systems, we established a new universal object detection benchmark (UODB) of 11 datasets:
1. Pascal VOC[2]
2. WiderFace[3]
3. KITTI[4]
4. LISA[5]
5. DOTA[6]
6. COCO[7]
7. Watercolor[8]
8. Clipart[8]
9. Comic[8]
10. Kitchen[9]
11. DeepLesion[10]
This set includes the popular VOC and COCO datasets, composed of images of everyday objects, e.g. bikes, humans, animals, etc. The 20 VOC categories are replicated on CrossDomain, which has three subsets, Watercolor, Clipart and Comic, with objects depicted in watercolor, clipart and comic styles, respectively. Kitchen consists of common kitchen objects, collected with a hand-held Kinect, while WiderFace contains human faces, collected on the web. Both KITTI and LISA depict traffic scenes, collected with cameras mounted on moving vehicles. KITTI covers the categories of vehicle, pedestrian and cyclist, while LISA is composed of traffic signs. DOTA is a surveillance-style dataset, containing objects such as vehicles, planes, ships, harbors, etc., imaged from aerial cameras. Finally, DeepLesion is a dataset of lesions on medical CT images. Altogether, UODB covers a wide range of variations in category, camera view, image style, etc., and thus establishes a good suite for the evaluation of universal/multi-domain object detection.

Dataset

This section lists the train/val/test split, domain and number of classes of each dataset in the UODB benchmark. For Watercolor, Clipart, Comic, Kitchen and DeepLesion, we trained on the official trainval sets and tested on the test sets. For Pascal VOC, we trained on the VOC2007 and VOC2012 trainval sets and tested on the VOC2007 test set. For WiderFace, we trained on the train set and tested on the val set. For KITTI, we followed the train/val split of [11] for development and trained on the trainval set for the final results on the test set. For LISA, we trained on the train set and tested on the val set. For DOTA, we followed the pre-processing of [6], trained on the train set and tested on the val set. For MS-COCO, we trained on COCO 2014 valminusminival and tested on minival, to shorten the experimental period.

Dataset	Training	Validation	Testing	Domain	Number of classes
KITTI	4K	4K	7K	traffic	3
WiderFace	13K	3K	6K	face	2
Pascal VOC	8K	8K	5K	natural	20
LISA	8K	~	2K	traffic	4
DOTA	14K	5K	10K	aerial	15
MS-COCO	35K	5K	~	natural	20
Watercolor	1K	~	1K	watercolor	6
Clipart	1K	~	1K	clipart	20
Comic	1K	~	1K	comic	6
Kitchen	5K	~	2K	indoor	11
DeepLesion	23K	5K	5K	medical	2

All datasets in the UODB benchmark are for academic use only, and any form of commercial use requires permission from each dataset's organizer.
Pre-processed annotation and images
Annotation: Annotations for all datasets except MS-COCO are converted to Pascal VOC format. MS-COCO keeps the COCO format.
Pre-processed images: For the DeepLesion dataset, the 12-bit CT intensity range was rescaled to floating-point numbers in [0, 255] using a single windowing (−1024 to 3071 HU) that covers the intensity ranges of lung, soft tissue and bone. Each 3-channel image is composed of 3 adjacent slices. The raw DeepLesion images are also available. The other datasets use the official images as provided.
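The DeepLesion windowing described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the pre-processing script used for UODB: the function names `window_hu` and `stack_adjacent_slices` are hypothetical, slices are plain nested lists of HU values, the output is channels-first, and the first and last slices of a volume are repeated at the boundary (the text does not say how edges are handled).

```python
def window_hu(hu, lo=-1024.0, hi=3071.0):
    """Map a Hounsfield-unit value to [0, 255] with a single linear window.

    Values outside [lo, hi] are clipped before scaling.
    """
    clipped = min(max(float(hu), lo), hi)
    return (clipped - lo) / (hi - lo) * 255.0


def stack_adjacent_slices(slices, index):
    """Build a 3-channel image from slice `index` and its two neighbours.

    `slices` is a list of 2-D lists of HU values. Repeating the edge slice
    at the volume boundary is an assumption made for this sketch.
    """
    last = len(slices) - 1
    neighbours = [max(index - 1, 0), index, min(index + 1, last)]
    return [[[window_hu(v) for v in row] for row in slices[i]]
            for i in neighbours]


if __name__ == "__main__":
    # Two tiny 1x2 "slices" of HU values.
    volume = [[[-1024, 0]], [[3071, 500]]]
    image = stack_adjacent_slices(volume, 0)
    print(len(image), image[0][0][0], image[2][0][0])
```

A real pipeline would more likely operate on arrays, but the arithmetic per pixel is the same linear mapping.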

Raw annotation and image of each dataset

Pascal VOC

MS-COCO

KITTI

WiderFace

DOTA

DeepLesion

Clipart

Comic

Watercolor

LISA

Kitchen

Annotation format
A sample XML annotation file based on Pascal VOC format will be as follows:
<annotation>
  <folder>GeneratedData_Train</folder>
  <filename>000001.png</filename>
  <path>/my/path/GeneratedData_Train/000001.png</path>
  <source>
    <database>Unknown</database>
  </source>
  <size>
    <width>224</width>
    <height>224</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>car</name>
    <pose>Frontal</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <occluded>0</occluded>
    <bndbox>
      <xmin>82</xmin>
      <xmax>172</xmax>
      <ymin>88</ymin>
      <ymax>146</ymax>
    </bndbox>
  </object>
</annotation>
Ground-truth bounding boxes are given in 1-based pixel coordinates, as the top-left (xmin, ymin) and bottom-right (xmax, ymax) corners. The file name, image path, source and object categories of the corresponding image are also provided.
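An annotation file in this format can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is illustrative only: the function name `parse_voc_annotation` and the layout of the returned dictionary are assumptions, and the embedded sample is a shortened copy of the annotation above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample annotation, for illustration.
SAMPLE = """<annotation>
  <filename>000001.png</filename>
  <size><width>224</width><height>224</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <difficult>0</difficult>
    <bndbox>
      <xmin>82</xmin><xmax>172</xmax><ymin>88</ymin><ymax>146</ymax>
    </bndbox>
  </object>
</annotation>"""


def parse_voc_annotation(xml_text):
    """Return the file name, image size and objects of one annotation.

    Boxes are returned as (xmin, ymin, xmax, ymax) in 1-based pixels.
    Fields are looked up by tag name, so their order in the file
    does not matter.
    """
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        bbox = tuple(int(box.findtext(tag))
                     for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append({
            "name": obj.findtext("name"),
            "difficult": int(obj.findtext("difficult", default="0")),
            "bbox": bbox,
        })
    return {
        "filename": root.findtext("filename"),
        "width": int(root.findtext("size/width")),
        "height": int(root.findtext("size/height")),
        "objects": objects,
    }


if __name__ == "__main__":
    print(parse_voc_annotation(SAMPLE))
```

To read a file from disk instead of a string, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.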

Evaluation

We use the Pascal VOC style evaluation metric[2]. The evaluation code provided here can be used to obtain results on the publicly available validation and test sets. (Official test set annotations are not available for WiderFace, DOTA, Pascal VOC 2012 (the VOC 2007 test set is available), MS-COCO and KITTI. To obtain official test results on these datasets, for which ground-truth annotations are hidden, the generated results must be uploaded to the evaluation server on each dataset's official website.)
Evaluation Metrics
For the detection task, participants submit a list of bounding boxes with associated confidence (rank). The provision of a confidence level allows results to be ranked such that the trade-off between false positives and false negatives can be evaluated, without defining arbitrary costs on each type of classification error. Detections are assigned to ground truth objects and judged to be true/false positives by measuring bounding box overlap. To be considered a correct detection, the overlap ratio $a_o$ between the predicted bounding box $B_p$ and the ground truth bounding box $B_{gt}$ must exceed 0.5 (50%), where $$a_{o} = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}$$ and $B_p \cap B_{gt}$ denotes the intersection of the predicted and ground truth bounding boxes and $B_p \cup B_{gt}$ their union.

The interpolated average precision (AP) is used to evaluate detection. For a given task and class, the precision/recall curve is computed from a method's ranked output. Recall is defined as the proportion of all positive examples ranked above a given rank. Precision is the proportion of all examples above that rank which are from the positive class. The AP summarises the shape of the precision/recall curve, and is defined as the mean precision at a set of eleven equally spaced recall levels [0, 0.1, ..., 1]: $$AP=\frac{1}{11}\sum_{r\in \{0, 0.1, \ldots, 1\}}p_{interp}(r)$$ The precision at each recall level $r$ is interpolated by taking the maximum precision measured for a method for which the corresponding recall is at least $r$: $$p_{interp}(r)=\max_{\tilde{r}:\tilde{r}\ge r} p(\tilde{r})$$ where $p(\tilde{r})$ is the measured precision at recall $\tilde{r}$.
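The overlap ratio and the 11-point interpolated AP can be sketched as follows. This is a minimal illustration of the formulas above, not the evaluation code distributed with the benchmark; the function names are hypothetical. Because boxes are given as 1-based inclusive pixel coordinates, the sketch assumes the width of a box is `xmax - xmin + 1`, following the Pascal VOC devkit convention.

```python
def iou(box_p, box_gt):
    """Overlap ratio a_o of two (xmin, ymin, xmax, ymax) boxes.

    Coordinates are treated as inclusive pixel indices (assumption),
    hence the +1 when computing widths and heights.
    """
    iw = min(box_p[2], box_gt[2]) - max(box_p[0], box_gt[0]) + 1
    ih = min(box_p[3], box_gt[3]) - max(box_p[1], box_gt[1]) + 1
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_p = (box_p[2] - box_p[0] + 1) * (box_p[3] - box_p[1] + 1)
    area_gt = (box_gt[2] - box_gt[0] + 1) * (box_gt[3] - box_gt[1] + 1)
    return inter / (area_p + area_gt - inter)


def precision_recall(is_tp, num_gt):
    """Precision and recall after each rank.

    `is_tp` lists true/false positive flags sorted by decreasing
    confidence; `num_gt` is the number of ground truth objects.
    """
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(is_tp, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    return precisions, recalls


def ap_11_point(precisions, recalls):
    """Mean of the interpolated precision at recall 0, 0.1, ..., 1."""
    total = 0.0
    for i in range(11):
        r = i / 10
        # p_interp(r): best precision among points with recall >= r.
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11


if __name__ == "__main__":
    # Three ranked detections (TP, FP, TP) against two ground truth boxes.
    p, r = precision_recall([True, False, True], num_gt=2)
    print(ap_11_point(p, r))
```

In the usage example, the interpolated precision is 1 for the six recall levels up to 0.5 and 2/3 for the remaining five, so the AP is (6 + 10/3) / 11.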

Reference

1. Wang X, Cai Z, Gao D, Vasconcelos N. Towards Universal Object Detection by Domain Attention, CVPR 2019. Github
2. Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A. The pascal visual object classes (voc) challenge. International journal of computer vision. 2010 Jun 1;88(2):303-38.
3. Yang S, Luo P, Loy CC, Tang X. Wider face: A face detection benchmark. In Proceedings of the IEEE conference on computer vision and pattern recognition 2016 (pp. 5525-5533).
4. Geiger A, Lenz P, Stiller C, Urtasun R. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research. 2013 Sep;32(11):1231-7.
5. Møgelmose A, Trivedi MM, Moeslund TB. Vision based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey. IEEE Transactions on Intelligent Transportation Systems. 2012.
6. Xia GS, Bai X, Ding J, Zhu Z, Belongie S, Luo J, Datcu M, Pelillo M, Zhang L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018 (pp. 3974-3983).
7. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL. Microsoft coco: Common objects in context. In European conference on computer vision 2014 Sep 6 (pp. 740-755). Springer, Cham.
8. Inoue N, Furuta R, Yamasaki T, Aizawa K. Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018 (pp. 5001-5009).
9. Georgakis G, Reza MA, Mousavian A, Le PH, Košecká J. Multiview RGB-D dataset for object instance detection. In 2016 Fourth International Conference on 3D Vision (3DV) 2016 Oct 25 (pp. 426-434). IEEE.
10. Yan K, Wang X, Lu L, Summers RM. DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. Journal of Medical Imaging. 2018 Jul;5(3):036501.
11. Cai Z, Fan Q, Feris RS, Vasconcelos N. A unified multi-scale deep convolutional neural network for fast object detection. In European conference on computer vision 2016 Oct 8 (pp. 354-370). Springer, Cham.
