Comparison of Semantic Image Annotation Algorithms

The adoption of a standard annotation database (Corel5k) has helped further research in semantic annotation by providing a benchmark for comparing algorithms. This page contains comparisons of different annotation algorithms on the standard databases. If you have new results that you would like to add to these tables, contact Nuno Vasconcelos with the reference.

Evaluation Metrics

Image annotation performance is evaluated by comparing the captions automatically generated for the test set with the human-produced ground truth. A common set of metrics, based on [2,3], is used to evaluate image annotation and ranked retrieval.

To evaluate semantic annotation, the automatic annotation of an image is defined as the top five semantic classes assigned by the algorithm, and the recall and precision of every word in the test set are computed. For a given semantic descriptor, assume that w_H images in the test set carry it in the human annotation and that the system annotates w_auto images with it, of which w_C are correct. The per-word recall and precision are then recall = w_C / w_H and precision = w_C / w_auto, respectively. Finally, recall and precision are averaged over the set of words that appear in the test set to obtain the average per-word precision (P) and the average per-word recall (R). We also consider the number of words with non-zero recall (NZR), i.e., words with w_C > 0, which indicates how many words the system has effectively learned.
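The annotation metrics above can be sketched in a few lines of Python. This is only an illustration, not the evaluation code used in the references: the function name `per_word_metrics` and the input layout (one set of words per test image) are assumptions, as is the choice to score precision as 0 for a word the system never emits.

```python
from collections import Counter


def per_word_metrics(ground_truth, predicted):
    """Average per-word precision (P), recall (R), and NZR count.

    ground_truth, predicted: lists of sets of words, one set per test image
    (hypothetical input layout chosen for this sketch).
    """
    w_h = Counter()     # images annotated with the word by a human
    w_auto = Counter()  # images annotated with the word by the system
    w_c = Counter()     # images where the system's word is correct
    for truth, auto in zip(ground_truth, predicted):
        w_h.update(truth)
        w_auto.update(auto)
        w_c.update(truth & auto)

    words = sorted(w_h)  # only words that appear in the test-set ground truth
    recalls = [w_c[w] / w_h[w] for w in words]
    # Assumption: a word the system never emits gets precision 0.
    precisions = [w_c[w] / w_auto[w] if w_auto[w] else 0.0 for w in words]
    return {
        "P": sum(precisions) / len(words),
        "R": sum(recalls) / len(words),
        "NZR": sum(1 for w in words if w_c[w] > 0),
    }


# Toy example with three test images.
truth = [{"sky", "water"}, {"sky", "tiger"}, {"water"}]
auto = [{"sky", "grass"}, {"sky", "water"}, {"water", "sky"}]
print(per_word_metrics(truth, auto))
```

In the toy example, "tiger" is never produced by the system, so it contributes zero recall and is not counted in NZR.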

The performance of semantic retrieval is also evaluated by measuring precision and recall. Given a query term and the top n image matches retrieved from the database, recall is the percentage of all relevant images contained in the retrieved set, and precision is the percentage of n which are relevant (where relevant means that the ground-truth annotation of the image contains the query term). Retrieval performance is evaluated with the mean-average precision (MAP), defined in [3], which is the average precision, over all queries, at the ranks where recall changes (i.e., where relevant items occur).
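As a rough sketch of the retrieval metric, the following Python computes average precision for one query and then averages over queries. The function names and the dictionary-based inputs are hypothetical. The sketch also assumes that each query ranks the whole database, so that dividing by the number of relevant images is the same as averaging over the ranks where recall changes.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k over the ranks k where a relevant image occurs."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for k, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant_ids:
            hits += 1
            total += hits / k  # precision at a rank where recall changes
    # Assumption: normalize by the number of relevant images for the query.
    return total / len(relevant_ids)


def mean_average_precision(rankings, relevance):
    """rankings: query word -> ranked list of image ids;
    relevance: query word -> set of image ids whose ground truth has the word."""
    scores = [average_precision(rankings[w], relevance[w]) for w in rankings]
    return sum(scores) / len(scores)


# Toy example: two query words over a four-image database.
rankings = {"sky": ["a", "b", "c", "d"], "tiger": ["b", "a", "d", "c"]}
relevance = {"sky": {"a", "c"}, "tiger": {"d"}}
print(mean_average_precision(rankings, relevance))
```

For "sky", relevant images occur at ranks 1 and 3, giving precisions 1 and 2/3; for "tiger", the single relevant image occurs at rank 3, giving 1/3.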

On the larger database of [7], where per-image annotations are not available and automatic annotation is based on image categorization, two other metrics are used for evaluation. The first is "image categorization accuracy", where an image is considered correctly categorized if any of the top r categories is the true category. Second, annotation performance is evaluated with the "mean coverage", which is defined as the percentage of ground-truth annotations that match the computer annotations.
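The two category-based metrics can be illustrated as follows. This is a minimal sketch with hypothetical function names and inputs. In particular, computing coverage per image and then averaging over images is an assumption made here; see [7] for the exact definition.

```python
def categorization_accuracy(ranked_categories, true_categories, r):
    """Fraction of images whose true category is among the top r predictions."""
    correct = sum(
        true in ranked[:r]
        for ranked, true in zip(ranked_categories, true_categories)
    )
    return correct / len(true_categories)


def mean_coverage(ground_truth, predicted):
    """Average fraction of ground-truth words matched by the system's words.

    Assumption: coverage is computed per image, then averaged over images.
    """
    fractions = [
        len(truth & auto) / len(truth)
        for truth, auto in zip(ground_truth, predicted)
    ]
    return sum(fractions) / len(fractions)


# Toy example: three images, three categories.
ranked = [["cat", "dog", "car"], ["dog", "car", "cat"], ["car", "cat", "dog"]]
true = ["cat", "car", "dog"]
for r in (1, 2, 3):
    print(r, categorization_accuracy(ranked, true, r))

print(mean_coverage([{"a", "b"}, {"c"}], [{"a", "x"}, {"c", "y"}]))
```

Accuracy can only stay the same or increase as r grows, which matches the pattern of the r=1 to r=5 columns in the categorization results.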

For more details, please see the references. The evaluation metrics are summarized in the following table:

Metric                          Ref.  Description
P                               [3]   average per-word precision
R                               [3]   average per-word recall
# words NZR                     [3]   number of words with non-zero recall
MAP                             [3]   mean-average precision (ranked retrieval)
MAP NZR                         [3]   mean-average precision for NZR (ranked retrieval)
Image Categorization Accuracy   [7]   percent of the images correctly categorized
Mean Coverage                   [7]   percent of the ground-truth annotations that match the computer annotations


The Corel5k database contains 5,000 Corel images (4,500 training and 500 testing) with annotations from 371 words. Only the 260 keywords that appear in the ground truth of the test set are evaluated. The protocol for evaluation is outlined in [3]. "Ref" is the reference paper for the algorithm, and "Src" is the paper where the evaluation was reported.

                                  Semantic Annotation              Ranked Retrieval
Method          Ref.    Src.      P (All)  R (All)  # words NZR    MAP (All)  MAP (NZR)
SML             [1]     [1]       0.23     0.29     137            0.31       0.49
MBRM            [3]     [2,3]     0.24     0.25     122            0.30       0.35
CRM             [2,3]   [2,3]     0.16     0.19     107            -          -
Translation     [4]     [2,3]     0.06     0.04     49             -          -
Co-occurrence   [5]     [2,3]     0.03     0.02     19             -          -


The Corel30k database [1] contains 31,695 Corel images (28,525 training and 3,170 testing) with annotations from 950 words. Algorithms are evaluated using the same metrics as Corel5k.

                          Semantic Annotation              Ranked Retrieval
Method   Ref.   Src.      P (All)  R (All)  # words NZR    MAP (All)  MAP (NZR)
SML      [1]    [6]       0.13     0.21     424            0.21       0.47


The PSU database [7] contains 59,695 Corel images (23,878 training and 35,817 testing) organized into 600 categories. All the images within a single category share the same annotations (from 442 common words). The algorithms are evaluated using the Corel5k metrics.

                               Semantic Annotation              Ranked Retrieval
Method        Ref.   Src.      P (All)  R (All)  # words NZR    MAP (All)  MAP (NZR)
SML-GMM-DCT   [1]    [6]       0.15     0.32     413            0.26       0.27

Performance on the PSU database can also be evaluated using supervised category-based learning (SCBL), which is based on classifiers for image categories. For an image, the annotations are selected from the words associated with the top 5 image categories, according to a statistical test (see [7,1] for more details). The algorithms are evaluated on the accuracy of image categorization, and the "mean coverage" of the annotations.

                           Image Categorization Accuracy            Mean Coverage
Method     Ref.   Src.     r=1     r=2     r=3     r=4     r=5      th=0.0649  no th
GMM-DCT    [1]    [1]      0.209   0.270   0.309   0.338   0.362    0.3420     0.6124
2D-MHMM    [7]    [7]      0.119   0.171   0.208   0.232   0.261    0.2163     0.4748


[1] G. Carneiro, A. B. Chan, P. J. Moreno, and N. Vasconcelos, "Supervised Learning of Semantic Classes for Image Annotation and Retrieval," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 394-410, March 2007.
[2] V. Lavrenko, R. Manmatha, and J. Jeon, "A Model for Learning the Semantics of Pictures," In Proc. Conf. Advances in Neural Information Processing Systems, 2003.
[3] S. Feng, R. Manmatha, and V. Lavrenko, "Multiple Bernoulli Relevance Models for Image and Video Annotation," In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2004.
[4] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth, "Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary," In Proc. European Conf. Computer Vision, 2002.
[5] Y. Mori, H. Takahashi, and R. Oka, "Image-to-Word Transformation Based on Dividing and Vector Quantizing Images with Words," In Proc. First Int'l Workshop Multimedia Intelligent Storage and Retrieval Management, 1999.
[6] A. B. Chan, P. J. Moreno, and N. Vasconcelos, "Using Statistics to Search and Annotate Pictures: An Evaluation of Semantic Image Annotation and Retrieval on Large Databases," In Joint Statistical Meetings (JSM), Seattle, 2006.
[7] J. Li and J. Wang, "Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1075-1088, Sept. 2003.
