Challenges: Detections | Captions | Keypoints
Begin by learning about the individual challenges. Compete to earn prizes and opportunities to present your work. Come to the workshops to learn about the state of the art!
Download the dataset, including the dataset tools, images, and annotations. Learn about the annotation format. See cocoDemo in either the Matlab or Python code.
Develop your algorithm. Run your algorithm on COCO and save the results using the format described on this page. See evalDemo in either the Matlab or Python code.
Evaluate: Detections | Captions
Evaluate results of your system on the validation set. The same code will run on the evaluation server when you upload your results. See evalDemo in either the Matlab or Python code and evalCapDemo in the Python code for detection and caption demo code, respectively.
Upload: Detections | Captions
Upload your results to the evaluation server.
Leaderboard: Detections | Captions
Check out the state-of-the-art! See what algorithms are best at the various tasks.
Tools
Matlab+Python+Lua APIs [Version 2.0]
V2.0 of the API was completed 07/2015 and includes detection evaluation code. The Lua API, added 05/2016, supports only load and view functionality (no eval code).
Images
2014 Training images [80K/13GB]
2014 Val. images [40K/6.2GB]
2014 Testing images [40K/6.2GB]
2015 Testing images [80K/12.4GB]
Annotations
2014 Train/Val object instances [158MB]
2014 Train/Val person keypoints [70MB]
2014 Train/Val image captions [18.8MB]
2014 Testing Image info [0.74MB]
2015 Testing Image info [1.83MB]
Note: annotations updated on 07/23/2015 with the addition of a "coco_url" field (allowing direct download of individual images).
1. Overview
The 2014 Testing Images are for the COCO Captioning Challenge, while the 2015 Testing Images are for the Detection and Keypoint Challenges. The train and val data are common to all challenges. Note also that as an alternative to downloading the large image zip files, individual images may be downloaded from the COCO website using the "coco_url" field specified in the image info struct (see details below).
Please follow the instructions in the README to download and setup the COCO data (annotations and images). By downloading this dataset, you agree to our Terms of Use.
2. COCO API
The COCO API assists in loading, parsing, and visualizing annotations in COCO. The API supports object instance, object keypoint, and image caption annotations (for captions not all functionality is defined). For additional details see: CocoApi.m, coco.py, and CocoApi.lua for Matlab, Python, and Lua code, respectively, and also the Python API demo.
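As a minimal sketch of typical usage (Python API shown; the annotation file path and category name below are illustrative assumptions), loading and querying annotations looks roughly like this:

```python
# Sketch: load instance annotations and fetch the annotations for one image.
# The path 'annotations/instances_val2014.json' is an assumed local location.
from pycocotools.coco import COCO

coco = COCO('annotations/instances_val2014.json')   # load and index the annotations
catIds = coco.getCatIds(catNms=['person'])          # look up category ids by name
imgIds = coco.getImgIds(catIds=catIds)              # images containing that category
img = coco.loadImgs(imgIds[0])[0]                   # image info struct (file_name, coco_url, ...)
annIds = coco.getAnnIds(imgIds=img['id'], catIds=catIds, iscrowd=None)
anns = coco.loadAnns(annIds)                        # annotation structs for this image
print(len(anns), 'annotations for', img['file_name'])
```

The Python API demo additionally shows visualization of the loaded annotations (e.g. via showAnns()).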
3. MASK API
COCO provides segmentation masks for every object instance. This creates two challenges: storing masks compactly and performing mask computations efficiently. We solve both challenges using a custom Run Length Encoding (RLE) scheme. The size of the RLE representation is proportional to the number of boundary pixels of a mask, and operations such as area, union, or intersection can be computed efficiently directly on the RLE. Specifically, assuming fairly simple shapes, the RLE representation is O(√n) where n is the number of pixels in the object, and common computations are likewise O(√n). Naively computing the same operations on the decoded masks (stored as an array) would be O(n).
The MASK API provides an interface for manipulating masks stored in RLE format. The API is defined below; for additional details see: MaskApi.m, mask.py, or MaskApi.lua. Finally, we note that the majority of ground truth masks are stored as polygons (which are quite compact); these polygons are converted to RLE when needed.
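As a minimal sketch of the Python MASK API (imported here as maskUtils; the random mask is just a stand-in for a real binary segmentation):

```python
# Sketch: round-trip a binary mask through RLE and compute area/bbox directly on the RLE.
import numpy as np
from pycocotools import mask as maskUtils

m = (np.random.rand(240, 320) > 0.5).astype(np.uint8)  # placeholder binary mask
rle = maskUtils.encode(np.asfortranarray(m))            # compress to RLE
print(maskUtils.area(rle), maskUtils.toBbox(rle))       # computed without decoding the mask
m2 = maskUtils.decode(rle)                              # decode back to a binary mask
assert (m == m2).all()
```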
4. Annotation format
COCO currently has three annotation types: object instances, object keypoints, and image captions. The annotations are stored using the JSON file format. All annotations share the basic data structure below:
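A hedged sketch of that shared layout (field names per the COCO format; sub-struct contents abbreviated and shown as a Python-style literal):

```python
dataset = {
    "info":        dict,    # year, version, description, contributor, url, date_created
    "images":      [dict],  # per image: id, width, height, file_name, license,
                            #            flickr_url, coco_url, date_captured
    "licenses":    [dict],  # per license: id, name, url
    "annotations": [dict],  # type-specific structs, described in 4.1-4.3 below
}
```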
The data structures specific to the various annotation types are described below.
4.1. Object Instance Annotations
Each instance annotation contains a series of fields, including the category id and segmentation mask of the object. The segmentation format depends on whether the instance represents a single object (iscrowd=0 in which case polygons are used) or a collection of objects (iscrowd=1 in which case RLE is used). Note that a single object (iscrowd=0) may require multiple polygons, for example if occluded. Crowd annotations (iscrowd=1) are used to label large groups of objects (e.g. a crowd of people). In addition, an enclosing bounding box is provided for each object (box coordinates are measured from the top left image corner and are 0-indexed). Finally, the categories field of the annotation structure stores the mapping of category id to category and supercategory names. See also the Detection Challenge.
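A hedged sketch of the instance annotation fields just described, plus a categories entry (placeholder types, not an exhaustive schema):

```python
annotation = {
    "id": int,
    "image_id": int,
    "category_id": int,
    "segmentation": "polygons if iscrowd=0, RLE if iscrowd=1",
    "area": float,
    "bbox": [float, float, float, float],  # [x, y, width, height], 0-indexed from the top left
    "iscrowd": 0,                          # 0 for single objects, 1 for crowd regions
}
category = {
    "id": int,
    "name": str,
    "supercategory": str,
}
```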
4.2. Object Keypoint Annotations
A keypoint annotation contains all the data of the object annotation (including id, bbox, etc.) and two additional fields. First, "keypoints" is a length 3k array where k is the total number of keypoints defined for the category. Each keypoint has a 0-indexed location x,y and a visibility flag v defined as v=0: not labeled (in which case x=y=0), v=1: labeled but not visible, and v=2: labeled and visible. A keypoint is considered visible if it falls inside the object segment. "num_keypoints" indicates the number of labeled keypoints (v>0) for a given object (many objects, e.g. crowds and small objects, will have num_keypoints=0). Finally, for each category, the categories struct has two additional fields: "keypoints," which is a length k array of keypoint names, and "skeleton", which defines connectivity via a list of keypoint edge pairs and is used for visualization. Currently keypoints are only labeled for the person category (for most medium/large non-crowd person instances). See also the Keypoint Challenge.
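A hedged sketch of the keypoint-specific additions (on top of the instance annotation fields above):

```python
annotation = {
    # ...all object instance fields (id, image_id, category_id, bbox, segmentation, ...), plus:
    "keypoints": [int],       # length 3k: [x1, y1, v1, ..., xk, yk, vk]
    "num_keypoints": int,     # number of keypoints with v > 0
}
category = {
    # ...id, name, supercategory, plus:
    "keypoints": [str],       # length k: keypoint names
    "skeleton": [[int, int]]  # keypoint index pairs defining edges (used for visualization)
}
```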
4.3. Image Caption Annotations
These annotations are used to store image captions. Each caption describes the specified image and each image has at least 5 captions (some images have more). See also the Captioning Challenge.
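A hedged sketch of a single caption annotation:

```python
annotation = {
    "id": int,
    "image_id": int,
    "caption": str,
}
```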
1. Results Format Overview
This page describes the results format used by COCO. The general structure of the results format is similar for all annotation types, covering both object detection (using either bounding boxes or object segments) and image caption generation. Submitting algorithm results on COCO for evaluation requires using the formats described below.
2. Results Format
The results format used by COCO closely mimics the format of the ground truth as described on the download page. We suggest reviewing the ground truth format before proceeding.
Each algorithmically produced result, such as an object bounding box, object segment, or image caption, is stored separately in its own result struct. This singleton result struct must contain the id of the image from which the result was generated (note that a single image will typically have multiple associated results). Results across the whole dataset are aggregated in an array of such result structs. Finally, the entire result struct array is stored to disk as a single JSON file (saved via gason in Matlab or json.dump in Python).
The data struct for each of the three result types is described below. The format of the individual fields below (category_id, bbox, segmentation, etc.) is the same as for the ground truth (for details see the download page).
2.1. Object detection (bounding boxes)
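A hedged sketch of one bounding-box detection result (results for the whole dataset form a flat array of such entries):

```python
result = {
    "image_id": int,
    "category_id": int,
    "bbox": [float, float, float, float],  # [x, y, width, height]
    "score": float,
}
```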
Note: box coordinates are floats measured from the top left image corner (and are 0-indexed). We recommend rounding coordinates to the nearest tenth of a pixel to reduce resulting JSON file size.
2.2. Object detection (segmentation)
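A hedged sketch of one segmentation result; the "segmentation" field holds the RLE produced by encode() (a dict with "size" and "counts"):

```python
result = {
    "image_id": int,
    "category_id": int,
    "segmentation": {"size": [int, int], "counts": str},  # RLE: [height, width] plus counts string
    "score": float,
}
```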
Note: a binary mask containing an object segment should be encoded to RLE using the MaskApi function encode(). For additional details see either MaskApi.m or mask.py. Note that the core RLE code is written in C (see maskApi.h), so it is possible to perform encoding without using Matlab or Python, but we do not provide support for this case.
2.3. Keypoint detection
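A hedged sketch of one keypoint detection result; "keypoints" uses the same [x1,y1,v1,...,xk,yk,vk] layout as the ground truth:

```python
result = {
    "image_id": int,
    "category_id": int,
    "keypoints": [float],  # length 3k; set each vi = 1 (see the note below)
    "score": float,
}
```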
Note: keypoint coordinates are floats measured from the top left image corner (and are 0-indexed). We recommend rounding coordinates to the nearest pixel to reduce file size. Note also that the visibility flags vi are not currently used (except for controlling visualization), we recommend simply setting vi=1.
2.4. Caption generation
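A hedged sketch of one caption generation result:

```python
result = {
    "image_id": int,
    "caption": str,
}
```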
3. Storing and Browsing Results
Example result JSON files are available in coco/results/ as part of the github package. Because the results format is similar to the ground truth annotation format, the CocoApi for accessing the ground truth can also be used to visualize and browse algorithm results. For details please see evalDemo (demo) and also loadRes() in the CocoApi.
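As a minimal sketch of that workflow (Python shown; both file names below are illustrative assumptions):

```python
# Sketch: load ground truth, attach algorithm results, and browse them via the same API.
from pycocotools.coco import COCO

cocoGt = COCO('annotations/instances_val2014.json')         # ground truth (assumed path)
cocoDt = cocoGt.loadRes('detections_val2014_results.json')  # results wrapped as a COCO object
imgId = cocoDt.getImgIds()[0]
anns = cocoDt.loadAnns(cocoDt.getAnnIds(imgIds=imgId))
print(len(anns), 'results for image', imgId)                # browse results just like annotations
```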
1. Detection Evaluation
This page describes the detection evaluation code used by COCO. The evaluation code provided here can be used to obtain results on the publicly available COCO validation set. It computes multiple metrics described below. To obtain results on the COCO test set, for which ground truth annotations are hidden, generated results must be submitted to the evaluation server. For instructions on submitting results to the evaluation server please see the upload page. The exact same evaluation code, described below, is used to evaluate detections on the test set.
2. Metrics
The following 12 metrics are used for characterizing the performance of an object detector on COCO:
3. Results Format
The results format used for storing generated detections is described on the results format page. For reference, here is a summary of the detection results format for boxes and segments, respectively:
Note: box coordinates are floats measured from the top left image corner (and are 0-indexed). We recommend rounding coordinates to the nearest tenth of a pixel to reduce resulting JSON file size.
Note: binary masks should be encoded via RLE using the MaskApi function encode().
4. Evaluation Code
Evaluation code is available on the COCO github. Specifically, see either CocoEval.m or cocoeval.py in the Matlab or Python code, respectively. Also see evalDemo in either the Matlab or Python code (demo).
The evaluation parameters are as follows (defaults in brackets, in general no need to change):
Running the evaluation code via calls to evaluate() and accumulate() produces two data structures that measure detection quality. The two structs are evalImgs and eval, which measure quality per-image and aggregated across the entire dataset, respectively. The evalImgs struct has KxA entries, one per evaluation setting, while the eval struct combines this information into precision and recall arrays. Details for the two structs are below (see also CocoEval.m or cocoeval.py):
Finally summarize() computes the 12 detection metrics defined earlier based on the eval struct.
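A minimal sketch of this evaluation loop in Python (the file paths and the iouType choice are assumptions to adapt to your setup):

```python
# Sketch: run the COCO detection evaluation on the validation set.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

cocoGt = COCO('annotations/instances_val2014.json')         # ground truth (assumed path)
cocoDt = cocoGt.loadRes('detections_val2014_results.json')  # your results (assumed path)
cocoEval = COCOeval(cocoGt, cocoDt, iouType='bbox')         # or iouType='segm' for segments
cocoEval.params.imgIds = sorted(cocoGt.getImgIds())         # optionally restrict the image set
cocoEval.evaluate()     # per-image, per-category evaluation -> cocoEval.evalImgs
cocoEval.accumulate()   # aggregate into precision/recall arrays -> cocoEval.eval
cocoEval.summarize()    # print the 12 detection metrics
```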
5. Analysis Code
In addition to the evaluation code, we also provide a function analyze() for performing a detailed breakdown of false positives. This was inspired by Diagnosing Error in Object Detectors by Derek Hoiem et al., but is quite different in implementation and details. The code generates plots like this:
Both plots show analysis of the ResNet (bbox) detector from Kaiming He et al., winner of the 2015 Detection Challenge. The first plot shows a breakdown of errors of ResNet for the person class; the second plot is an overall analysis of ResNet averaged over all categories.
Each plot is a series of precision recall curves where each PR curve is guaranteed to be strictly higher than the previous as the evaluation setting becomes more permissive. The curves are as follows:
The area under each curve is shown in brackets in the legend. In the case of the ResNet detector, overall AP at IoU=.75 is .399 and perfect localization would increase AP to .682. Interestingly, removing all class confusions (both within supercategory and across supercategories) would only raise AP slightly, to .713. Removing background false positives would bump performance to .870 AP; the remaining errors are missed detections (although presumably adding more detections would also introduce many false positives). In summary, ResNet's errors are dominated by imperfect localization and background confusions.
For a given detector, the code generates a total of 372 plots! There are 80 categories, 12 supercategories, and 1 overall result, for a total of 93 different settings, and the analysis is performed at 4 scales (all, small, medium, large, so 93*4=372 plots). The file naming is [supercategory]-[category]-[size].pdf for the 80*4 per-category results, overall-[supercategory]-[size].pdf for the 12*4 per supercategory results, and overall-all-[size].pdf for the 1*4 overall results. Of all the plots, typically the overall and supercategory results are of the most interest.
Note: analyze() can take significant time to run, please be patient. As such, we typically do not run this code on the evaluation server; you must run the code locally using the validation set. Finally, currently analyze() is only part of the Matlab API; Python code coming soon.
1. Detections Upload
This page describes the upload instructions for submitting results to the detection evaluation server. Before proceeding, please review the results format and evaluation details. Submitting results allows you to participate in the COCO Detection Challenge and compare results to the state-of-the-art on the detection leaderboard.
2. Competition Details
The COCO 2015 Test Set can be obtained on the download page. The recommended training data consists of the COCO 2014 Training and Validation sets. External data of any form is allowed (except of course any form of annotation on the COCO Test set is forbidden). Please specify any and all external data used for training in the "method description" when uploading results to the evaluation server.
There are two distinct detection challenges and associated leaderboards: for detectors that output bounding boxes and for detectors that output object segments. The bounding box challenge provides continuity with past challenges such as the PASCAL VOC; the detection by segmentation challenge encourages higher accuracy object localization. The evaluation code in the two cases computes IoU using boxes or segments, respectively, but is otherwise identical. Please see the evaluation details.
Please limit the number of entries to the evaluation server to a reasonable number, e.g. one entry per paper. To avoid overfitting, the number of submissions per user is limited to 2 uploads per day and a maximum of 5 submissions per user. It is not acceptable to create multiple accounts for a single project to circumvent this limit. The exception to this is if a group publishes two papers describing unrelated methods; in this case both sets of results can be submitted for evaluation.
2.1. Test Set Splits
The 2015 COCO Test set consists of ~80K test images. To limit overfitting while giving researchers more flexibility to test their systems, we have divided the test set into four roughly equally sized splits of ~20K images each: test-dev, test-standard, test-challenge, and test-reserve. Submission to the test set automatically results in submission on each split (the identities of the splits are not publicly revealed). In addition, to allow for debugging and validation experiments, we allow researchers unlimited submissions to test-dev. Each test split serves a distinct role; details below.
split | #imgs | submission | scores reported |
---|---|---|---|
Test-Dev | ~20K | unlimited | immediately |
Test-Standard | ~20K | limited | immediately |
Test-Challenge | ~20K | limited | challenge |
Test-Reserve | ~20K | limited | never |
Test-Dev: We place no limit on the number of submissions allowed to test-dev. In fact, we encourage use of the test-dev for performing validation experiments. Use test-dev to debug and finalize your method before submitting to the full test set.
Test-Standard: The test-standard split is the default test data for the detection competition. When comparing to the state of the art, results should be reported on test-standard.
Test-Challenge: The test-challenge split is used for the COCO Detection Challenge. Results will be revealed during the ImageNet and COCO Visual Recognition Challenges Workshop.
Test-Reserve: The test-reserve split is used to protect against possible overfitting. If there are substantial differences between a method's scores on test-standard and test-reserve this will raise a red-flag and prompt further investigation. Results on test-reserve will not be publicly revealed.
We emphasize that except for test-dev, results cannot be submitted to a single split and must instead be submitted on the full test set. A submission to the test set populates three leaderboards: test-dev, test-standard and test-challenge (the updated test-challenge leaderboard will not be revealed until the ECCV 2016 Workshop). It is not possible to submit to test-standard without submitting to test-challenge or vice-versa (however, it is possible to submit to the test set without making results public, see below). The identity of the images in each split is not revealed, except for test-dev.
2.2. Test-Dev Best Practices
The test-dev 2015 set is a subset of the 2015 Testing set. The specific images belonging to test-dev are listed in the "image_info_test-dev2015.json" file available on the download page as part of the "2015 Testing Image info" download. As discussed, we place no limit on the number of submissions allowed on test-dev. Note that while submitting to test-dev will produce evaluation results, doing so will not populate the public test-dev leaderboard. Instead, submitting to the full test set populates the test-dev leaderboard. This limits the number of results displayed on the test-dev leaderboard.
Test-dev should be used only for validation and debugging: in a publication it is not acceptable to report results on test-dev only. However, for validation experiments it is acceptable to report results of competing methods on test-dev (obtained from the public test-dev leaderboard). While test-dev is prone to some overfitting, we expect this may still be useful in practice. We emphasize that final comparisons should always be performed on test-standard.
The differences between the validation and test-dev sets are threefold: test-dev is evaluated consistently via the evaluation server, test-dev cannot be used for training (its annotations are private), and a leaderboard is provided for test-dev, allowing for comparison with the state of the art. We note that the continued popularity of the outdated PASCAL VOC 2007 dataset partially stems from the fact that it allows for simultaneous validation experiments and comparisons to the state of the art. Our goal with test-dev is to provide similar functionality (while keeping annotations private).
3. Enter The Competition
First you need to create an account on CodaLab. From your account you will be able to participate in all COCO challenges.
Before uploading your results to the evaluation server, you will need to create a JSON file containing your results in the correct format. The file should be named "detections_[testset]_[alg]_results.json". Replace [alg] with your algorithm name and [testset] with either "test-dev2015" or "test2015" depending on the test split you are using. Place the JSON file into a zip file named "detections_[testset]_[alg]_results.zip".
To submit your zipped result file to the COCO Detection Challenge, click on the “Participate” tab on the CodaLab evaluation server. Select the challenge type (bbox or segm) and test split (test-dev or test). When you select “Submit / View Results” you will be given the option to submit new results. Please fill in the required fields and click “Submit”. A pop-up will prompt you to select the results zip file for upload. After the file is uploaded the evaluation server will begin processing. To view the status of your submission please select “Refresh Status”. Please be patient; the evaluation may take quite some time to complete (~20min on test-dev and ~80min on the full test set). If the status of your submission is “Failed” please check that your file is named correctly and has the right format.
After you submit your results to the evaluation server, you can control whether your results are publicly posted to the CodaLab leaderboard. To toggle the public visibility of your results please select either “post to leaderboard” or “remove from leaderboard”. For now only one result can be published to the leaderboard at any time; we may change this in the future.
In addition to the CodaLab leaderboard, we also host our own more detailed leaderboard that includes additional results and method information (such as paper references). Note that the CodaLab leaderboard may contain results not yet migrated to our own leaderboard.
4. Download Evaluation Results
After evaluation is complete and the server shows a status of “Finished”, you will have the option to download your evaluation results by selecting “Download evaluation output from scoring step.” The zip file will contain three files:
The format of the eval file is described on the detection evaluation page.
1. Keypoint Evaluation
Note: Evaluation metrics were updated 09/05/2016. They are likely finalized, but are still subject to change if we discover any issues before the competition deadline. If you discover any flaws or pitfalls in the proposed metrics please contact us asap.
This page describes the keypoint evaluation metric used by COCO. The COCO keypoint task requires simultaneously detecting objects and localizing their keypoints (object locations are not given at test time). As the task of simultaneous detection and keypoint estimation is relatively new, we chose to adopt a novel metric inspired by object detection metrics. For simplicity, we refer to this task as keypoint detection and the prediction algorithm as the keypoint detector.
We suggest reviewing the evaluation metrics for object detection before proceeding. As in the other COCO tasks, the evaluation code can be used to evaluate results on the publicly available validation set. To obtain results on the test set, for which ground truth annotations are hidden, generated results must be submitted to the evaluation server. For instructions on submitting results to the evaluation server please see the upload page.
1.1. Evaluation Overview
The core idea behind evaluating keypoint detection is to mimic the evaluation metrics used for object detection, namely average precision (AP) and average recall (AR) and their variants. At the heart of these metrics is a similarity measure between ground truth objects and predicted objects. In the case of object detection, the IoU serves as this similarity measure (for both boxes and segments). Thresholding the IoU defines matches between the ground truth and predicted objects and allows computing precision-recall curves. To adopt AP/AR for keypoint detection, we thus only need to define an analogous similarity measure. We do so next by defining an object keypoint similarity (OKS) which plays the same role as the IoU.
1.2. Object Keypoint Similarity
For each object, ground truth keypoints have the form [x1,y1,v1,...,xk,yk,vk], where x,y are the keypoint locations and v is a visibility flag defined as v=0: not labeled, v=1: labeled but not visible, and v=2: labeled and visible. Each ground truth object also has a scale s which we define as the square root of the object segment area. For details on the ground truth format please see the download page.
For each object, the keypoint detector must output keypoint locations and an object-level confidence. Predicted keypoints for an object should have the same form as the ground truth: [x1,y1,v1,...,xk,yk,vk]. However, the detector's predicted vi are not currently used during evaluation; that is, the keypoint detector is not required to predict per-keypoint visibilities or confidences.
We define the object keypoint similarity (OKS) as:
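(Reconstructed in LaTeX from the verbal definition in the next paragraph; δ(vi>0) selects the labeled keypoints.)

$$ \mathrm{OKS} \;=\; \frac{\sum_i \exp\!\left(-d_i^2 / (2 s^2 \kappa_i^2)\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)} $$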
The di are the Euclidean distances between each corresponding ground truth and detected keypoint and the vi are the visibility flags of the ground truth (the detector's predicted vi are not used). To compute the OKS, we pass the di through an unnormalized Gaussian with standard deviation sκi, where s is the object scale and κi is a per-keypoint constant that controls falloff. For each keypoint this yields a keypoint similarity that ranges between 0 and 1. These similarities are averaged over all labeled keypoints (keypoints for which vi>0). Predicted keypoints that are not labeled (vi=0) do not affect the OKS. Perfect predictions will have OKS=1 and predictions for which all keypoints are off by more than a few standard deviations sκi will have OKS~0. The OKS is analogous to the IoU. Given the OKS, we can compute AP and AR just as the IoU allows us to compute these metrics for box/segment detection.
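A hedged Python sketch of this computation (illustrative only; the function name, argument layout, and handling of labeled keypoints are assumptions, not the official cocoeval implementation):

```python
import numpy as np

def oks(gt_kpts, dt_kpts, area, kappas):
    """Object keypoint similarity between one ground truth and one detection.

    gt_kpts, dt_kpts: flat [x1, y1, v1, ..., xk, yk, vk] lists; area: ground truth
    segment area (so s**2 == area); kappas: per-keypoint constants controlling falloff.
    """
    gt = np.asarray(gt_kpts, dtype=float).reshape(-1, 3)
    dt = np.asarray(dt_kpts, dtype=float).reshape(-1, 3)
    labeled = gt[:, 2] > 0                          # only labeled ground truth keypoints count
    if not labeled.any():
        return 0.0
    d2 = (gt[:, 0] - dt[:, 0])**2 + (gt[:, 1] - dt[:, 1])**2
    ks = np.exp(-d2 / (2 * area * np.asarray(kappas, dtype=float)**2))  # per-keypoint similarity
    return float(ks[labeled].mean())                # average over labeled keypoints
```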
1.3. Tuning OKS
We tune the κi such that the OKS is a perceptually meaningful and easy to interpret similarity measure. First, using 5000 redundantly annotated images in val, for each keypoint type i we measured the per-keypoint standard deviation σi with respect to object scale s. That is, we compute σi² = E[di²/s²]. σi varies substantially across keypoints: keypoints on a person's body (shoulders, knees, hips, etc.) tend to have a much larger σ than keypoints on a person's head (eyes, nose, ears).
To obtain a perceptually meaningful and interpretable similarity metric we set κi=2σi. With this setting of κi, at one, two, and three standard deviations of di/s the keypoint similarity exp(-di²/(2s²κi²)) takes on values of e^(-1/8)=.88, e^(-4/8)=.61, and e^(-9/8)=.32. As expected, human annotated keypoints are normally distributed (ignoring occasional outliers). Thus, recalling the 68–95–99.7 rule, setting κi=2σi means that 68%, 95%, and 99.7% of human annotated keypoints should have a keypoint similarity of .88, .61, or .32 or higher, respectively (in practice the percentages are 75%, 95% and 98.7%).
The OKS is the average keypoint similarity across all (labeled) object keypoints. Below we plot the predicted OKS distribution with κi=2σi assuming 10 independent keypoints per object (blue curve) and the actual distribution of human OKS scores on the dually annotated data (green curve):
The curves don't match exactly for a few reasons: (1) object keypoints are not independent, (2) the number of labeled keypoints per object varies, and (3) the real data contains 1-2% outliers (most of which are caused by annotators mistaking left for right or annotating the wrong person when two people are nearby). Nevertheless, the behavior is roughly as expected. We conclude with a few observations about human performance: (1) at an OKS of .50, human performance is nearly perfect (95%), (2) median human OKS is ~.91, (3) human performance drops rapidly after an OKS of .95. Note that this OKS distribution can be used to predict human AR (as AR doesn't depend on false positives).
2. Metrics
The following 10 metrics are used for characterizing the performance of a keypoint detector on COCO:
3. Results Format
The results format used for storing generated keypoints is described on the results format page. For reference, here is a summary of the keypoint results:
Note: keypoint coordinates are floats measured from the top left image corner (and are 0-indexed). We recommend rounding coordinates to the nearest pixel to reduce file size. Note also that the visibility flags vi are not currently used (except for controlling visualization), we recommend simply setting vi=1.
4. Evaluation Code
Evaluation code is available on the COCO github. Specifically, see either CocoEval.m or cocoeval.py in the Matlab or Python code, respectively. Also see evalDemo in either the Matlab or Python code (demo).
1. Keypoints Upload
This page describes the upload instructions for submitting results to the keypoint evaluation server. Before proceeding, please review the results format and evaluation details. Submitting results allows you to participate in the COCO Keypoints Challenge and compare results to the state-of-the-art on the keypoints leaderboard.
2. Competition Details
The COCO 2015 Test Set can be obtained on the download page. The recommended training data consists of the COCO 2014 Training and Validation sets. External data of any form is allowed (except of course any form of annotation on the COCO Test set is forbidden). Please specify any and all external data used for training in the "method description" when uploading results to the evaluation server.
Please limit the number of entries to the evaluation server to a reasonable number, e.g. one entry per paper. To avoid overfitting, the number of submissions per user is limited to 2 uploads per day and a maximum of 5 submissions per user. It is not acceptable to create multiple accounts for a single project to circumvent this limit. The exception to this is if a group publishes two papers describing unrelated methods; in this case both sets of results can be submitted for evaluation.
2.1. Test Set Splits
The 2015 COCO Test set consists of ~80K test images. To limit overfitting while giving researchers more flexibility to test their systems, we have divided the test set into four roughly equally sized splits of ~20K images each: test-dev, test-standard, test-challenge, and test-reserve. Submission to the test set automatically results in submission on each split (the identities of the splits are not publicly revealed). In addition, to allow for debugging and validation experiments, we allow researchers unlimited submissions to test-dev. Each test split serves a distinct role; details below.
split | #imgs | submission | scores reported |
---|---|---|---|
Test-Dev | ~20K | unlimited | immediately |
Test-Standard | ~20K | limited | immediately |
Test-Challenge | ~20K | limited | challenge |
Test-Reserve | ~20K | limited | never |
These are identical to the test splits used for the object detection challenge. To understand their role in more detail, and for best practices, please see the detection upload page (section 2).
3. Enter The Competition
First you need to create an account on CodaLab. From your account you will be able to participate in all COCO challenges.
Before uploading your results to the evaluation server, you will need to create a JSON file containing your results in the correct format. The file should be named "person_keypoints_[testset]_[alg]_results.json". Replace [alg] with your algorithm name and [testset] with either "test-dev2015" or "test2015" depending on the test split you are using. Place the JSON file into a zip file named "person_keypoints_[testset]_[alg]_results.zip".
To submit your zipped result file to the COCO Keypoint Challenge, click on the “Participate” tab on the CodaLab evaluation server. Select the test split (test-dev or test). When you select “Submit / View Results” you will be given the option to submit new results. Please fill in the required fields and click “Submit”. A pop-up will prompt you to select the results zip file for upload. After the file is uploaded the evaluation server will begin processing. To view the status of your submission please select “Refresh Status”. If the status of your submission is “Failed” please check that your file is named correctly and has the right format.
After you submit your results to the evaluation server, you can control whether your results are publicly posted to the CodaLab leaderboard. To toggle the public visibility of your results please select either “post to leaderboard” or “remove from leaderboard”. For now only one result can be published to the leaderboard at any time; we may change this in the future.
In addition to the CodaLab leaderboard, we also host our own more detailed leaderboard that includes additional results and method information (such as paper references). Note that the CodaLab leaderboard may contain results not yet migrated to our own leaderboard.
4. Download Evaluation Results
After evaluation is complete and the server shows a status of “Finished”, you will have the option to download your evaluation results by selecting “Download evaluation output from scoring step.” The zip file will contain three files:
The format of the eval file is described on the keypoints evaluation page.
1. Caption Evaluation
This page describes the caption evaluation code used by COCO. The evaluation code provided here can be used to obtain results on the publicly available COCO validation set. It computes multiple common metrics, including BLEU, METEOR, ROUGE-L, and CIDEr (the writeup below contains references and descriptions of each metric). If you use the captions, evaluation code, or server, we ask that you cite Microsoft COCO Captions: Data Collection and Evaluation Server:
To obtain results on the COCO test set, for which ground truth annotations are hidden, generated results must be submitted to the evaluation server. For instructions on submitting results to the evaluation server please see the upload page. The exact same evaluation code, described below, is used to evaluate generated captions on the test set.
2. Results Format
The results format used for storing generated captions is described on the results format page. For reference, here is a summary of the caption results format:
3. Evaluation Code
Evaluation code can be obtained on the coco-captions github page. Unlike the general COCO API, the COCO caption evaluation code is only available under Python.
Running the evaluation code produces two data structures that summarize caption quality. The two structs are evalImgs and eval, which summarize caption quality per-image and aggregated across the entire test set, respectively. Details for the two data structures are given below. We recommend running the Python caption evaluation demo for more details.
1. Captions Upload
This page describes the upload instructions for submitting results to the caption evaluation server. Before proceeding, please review the results format and evaluation details. Submitting results allows you to participate in the COCO Captioning Challenge 2015 and compare results to the state-of-the-art on the captioning leaderboard.
2. Competition Details
Training Data: The recommended training set for the captioning challenge is the COCO 2014 Training Set. The COCO 2014 Validation Set may also be used for training when submitting results on the test set. External data of any form is allowed (except any form of annotation on the COCO Testing set is forbidden). Please specify any and all external data used for training in the "method description" when uploading results to the evaluation server.
Please limit the number of entries to the captioning challenge to a reasonable number, e.g. one entry per paper. To avoid overfitting to the test data, the number of submissions per user is limited to 1 upload per day and a maximum of 5 submissions per user. It is not acceptable to create multiple accounts for a single project to circumvent this limit. The exception to this is if a group publishes two papers describing unrelated methods; in this case both sets of results can be submitted for evaluation.
3. Enter The Competition
First you need to create an account on CodaLab. From your account you will be able to participate in all COCO challenges.
Before uploading your results to the evaluation server, you will need to create two JSON files containing your captioning results in the correct results format. One file should correspond to your results on the 2014 validation dataset, and the other to the 2014 test dataset. Both sets of results are required for submission. Your files should be named as follows:
Replace [alg] with your algorithm name and place both files into a single zip file named "results.zip".
To submit your zipped result file to the COCO Captioning Challenge, click on the “Participate” tab on the CodaLab webpage. When you select “Submit / View Results” you will be given the option to submit new results. Please fill in the required fields and click “Submit”. A pop-up will prompt you to select the results zip file for upload. After the file is uploaded the evaluation server will begin processing. To view the status of your submission please select “Refresh Status”. Please be patient; the evaluation may take quite some time to complete. If the status of your submission is “Failed” please check that your files are named correctly, have the right format, and that your zip file contains the two files corresponding to the validation and test datasets.
After you submit your results to the evaluation server, you can control whether your results are publicly posted to the CodaLab leaderboard. To toggle the public visibility of your results please select either “post to leaderboard” or “remove from leaderboard”. For now only one result can be published to the leaderboard at any time; we may change this in the future. After your results are posted to the CodaLab leaderboard, your captions on the validation dataset will be publicly available. Your captions on the test set will not be publicly released.
In addition to the CodaLab leaderboard, we also host our own more detailed leaderboard that includes additional results and method information (such as paper references). Note that the CodaLab leaderboard may contain results not yet migrated to our own leaderboard.
4. Download Evaluation Results
After evaluation is complete and the server shows a status of “Finished”, you will have the option to download your evaluation results by selecting “Download evaluation output from scoring step.” The zip file will contain five files:
The format of the evaluation files is described on the caption evaluation page. Please note that the *_evalImgs.json file is only available for download on the validation dataset, and not the test set.
Welcome to the COCO Captioning Challenge!
Winners were announced at CVPR 2015
Caption evaluation server remains open!
1. Introduction
Update: The COCO caption evaluation server remains open. Please submit new results to compare to state-of-the-art methods using several automatic evaluation metrics. The COCO 2015 Captioning Challenge is now, however, complete. Results were presented as part of the CVPR 2015 Large-scale Scene Understanding (LSUN) workshop and are available to view on the leaderboard.
The COCO Captioning Challenge is designed to spur the development of algorithms producing image captions that are informative and accurate. Teams will be competing by training their algorithms on the COCO 2014 dataset and having their results scored by human judges.
2. Dates
This captioning challenge is part of the Large-scale Scene Understanding (LSUN) CVPR 2015 workshop organized by Princeton University. For further details please visit the LSUN website.
3. Organizers
Yin Cui (Cornell)
Matteo Ruggero Ronchi (Caltech)
Tsung-Yi Lin (Cornell)
Piotr Dollár (Facebook AI Research)
Larry Zitnick (Microsoft Research)
4. Challenge Guidelines
Participants are recommended but not restricted to train their algorithms on the COCO 2014 dataset. The results should contain a single caption for each validation and test image, and they must be submitted to and publicly posted on the CodaLab leaderboard. Please specify any and all external data used for training in the "method description" when uploading results to the evaluation server.
By the challenge deadline, results on both the validation and test sets must be submitted to the evaluation server. The results on validation will be public and used for performance diagnosis and visualization. The competitors' algorithms will be evaluated based on feedback from human judges, and the top performing teams will be awarded prizes. Two or three teams will also be invited to present at the LSUN workshop.
Please follow the instructions in the format, evaluate, and upload tabs, which describe the results format, evaluation code, and upload instructions, respectively. The COCO Caption Evaluation Toolkit is also available. The toolkit provides evaluation code for common caption analysis metrics, including BLEU, METEOR, ROUGE-L, and CIDEr. Note that for the competition, instead of automated metrics, human judges will evaluate algorithm results.
Welcome to the COCO 2015 Detection Challenge!
1st Place Detection and Segmentation: Team MSRA
2nd Place Detection and Segmentation: Team FAIRCNN
Best Student Entry: Team ION
Detection results and winners' methods were presented at the ICCV 2015 ImageNet and COCO Visual Recognition Challenges Joint Workshop (slides and recording of all talks are now available). Challenge winners along with up-to-date results are available to view on the leaderboard. The evaluation server remains open for upload of new results.
1. Overview
We are pleased to announce the COCO 2015 Detection Challenge. This competition is designed to push the state of the art in object detection forward. Teams are encouraged to compete in either (or both) of two object detection challenges: using bounding box output or object segmentation output.
The COCO train, validation, and test sets, containing more than 200,000 images and 80 object categories, are available on the download page. All object instances are annotated with a detailed segmentation mask. Annotations on the training and validation sets (with over 500,000 object instances segmented) are publicly available.
2. Dates
3. Organizers
4. Award Committee
5. Challenge Guidelines
The detection evaluation page lists detailed information regarding how submissions will be scored. Instructions for submitting results are available on the detection upload page.
To limit overfitting while giving researchers more flexibility to test their system, we have divided the test set into a number of splits, including test-dev, test-standard, and test-challenge. Test-dev is used for debugging and validation experiments and allows for unlimited submission to the evaluation server. Test-standard is used to maintain a public leaderboard that is updated upon submission. Finally, test-challenge is used for the workshop competition; results will be revealed during the workshop at ICCV 2015. A more thorough explanation is available on the upload page.
Competitors are recommended but not restricted to train their algorithms on COCO 2014 train and val sets. The download page contains links to all COCO 2014 train+val images and associated annotations as well as the 2015 test images. Please specify any and all external data used for training in the "method description" when uploading results to the evaluation server.
By the challenge deadline, results must be submitted to the evaluation server. Competitors' algorithms will be evaluated according to the rules described on the evaluation page. Challenge participants with the most successful and innovative methods will be invited to present.
After careful consideration, this challenge uses a more comprehensive comparison metric than the traditional AP at Intersection over Union (IoU) threshold of 0.5. Specifically, AP is averaged over multiple IoU values between 0.5 and 1.0; this rewards detectors with better localization. Please refer to the "Metrics" section of the evaluation page for a detailed explanation of the competition metrics.
6. Tools and Instructions
We provide extensive API support for the COCO images, annotations, and evaluation code. To download the COCO API, please visit our GitHub repository. For an overview of how to use the API, please visit the download page and consult the sections entitled COCO API and MASK API.
Due to the large size of the COCO dataset and the complexity of this challenge, the process of competing in this challenge may not seem simple. To help guide competitors to victory, we provide explanations and instructions for each step of the process on the download, format, evaluation, and upload pages. For additional questions, please contact cocodataset@outlook.com.
Welcome to the COCO 2016 Detection Challenge!
1. Overview
The COCO 2016 Detection Challenge is designed to push the state of the art in object detection forward. Teams are encouraged to compete in either (or both) of two object detection challenges: using bounding box output or object segmentation output.
This challenge is part of the ImageNet and COCO Visual Recognition workshop at ECCV 2016. For further details about the joint workshop please visit the workshop website. Participants are encouraged to participate in both the COCO and ImageNet detection challenges. Please also see the concurrent COCO 2016 Keypoint Challenge.
The COCO train, validation, and test sets, containing more than 200,000 images and 80 object categories, are available on the download page. All object instances are annotated with a detailed segmentation mask. Annotations on the training and validation sets (with over 500,000 object instances segmented) are publicly available.
This is the second COCO detection challenge and it closely follows the COCO 2015 Detection Challenge. In particular, the same data and metrics are being used for this year's challenge.
2. Dates
3. Organizers
4. Award Committee
5. Challenge Guidelines
The detection evaluation page lists detailed information regarding how submissions will be scored. Instructions for submitting results are available on the detection upload page.
To limit overfitting while giving researchers more flexibility to test their system, we have divided the test set into a number of splits, including test-dev, test-standard, and test-challenge. Test-dev is used for debugging and validation experiments and allows for unlimited submission to the evaluation server. Test-standard is used to maintain a public leaderboard that is updated upon submission. Finally, test-challenge is used for the workshop competition; results will be revealed during the workshop at ECCV 2016. A more thorough explanation is available on the upload page.
Competitors are recommended but not restricted to train their algorithms on COCO 2014 train and val sets. The download page contains links to all COCO 2014 train+val images and associated annotations as well as the 2015 test images. Please specify any and all external data used for training in the "method description" when uploading results to the evaluation server.
By the challenge deadline, results must be submitted to the evaluation server. Competitors' algorithms will be evaluated according to the rules described on the evaluation page. Challenge participants with the most successful and innovative methods will be invited to present.
After careful consideration, this challenge uses a more comprehensive comparison metric than the traditional AP at Intersection over Union (IoU) threshold of 0.5. Specifically, AP is averaged over multiple IoU values between 0.5 and 1.0; this rewards detectors with better localization. Please refer to the "Metrics" section of the evaluation page for a detailed explanation of the competition metrics.
6. Tools and Instructions
We provide extensive API support for the COCO images, annotations, and evaluation code. To download the COCO API, please visit our GitHub repository. For an overview of how to use the API, please visit the download page and consult the sections entitled COCO API and MASK API.
Due to the large size of the COCO dataset and the complexity of this challenge, the process of competing in this challenge may not seem simple. To help, we provide explanations and instructions for each step of the process on the download, format, evaluation, and upload pages. For additional questions, please contact cocodataset@outlook.com.
Welcome to the COCO 2016 Keypoint Challenge!
1. Overview
The deadline has been extended to 09/16. We apologize for the delay in releasing the evaluation code. The keypoint evaluation metrics are finalized and the keypoint evaluation server is open for test-dev evaluation. The full test set evaluation will open shortly. Thank you for your patience!
The COCO 2016 Keypoint Challenge requires localization of person keypoints in challenging, uncontrolled conditions. The keypoint challenge involves simultaneously detecting people and localizing their keypoints (person locations are not given at test time). For full details of this task please see the keypoint evaluation page.
This challenge is part of the ImageNet and COCO Visual Recognition workshop at ECCV 2016. For further details about the joint workshop please visit the workshop website. Please also see the concurrent COCO 2016 Detection Challenge.
Training and val data have now been released. The training set for this task consists of over 100K person instances labeled with keypoints (the majority of people in COCO at medium and large scales) and over 1 million total labeled keypoints. The val set has an additional 50K annotated people.
2. Dates
3. Organizers
4. Award Committee
5. Challenge Guidelines
The keypoint evaluation page lists detailed information regarding how submissions will be scored. Instructions for submitting results are available on the keypoint upload page. Note that the keypoint challenge follows the detection challenge quite closely. Specifically, the same challenge rules apply and the same COCO images sets are used. Details follow below.
To limit overfitting while giving researchers more flexibility to test their system, we have divided the test set into a number of splits, including test-dev, test-standard, and test-challenge. Test-dev is used for debugging and validation experiments and allows for unlimited submission to the evaluation server. Test-standard is used to maintain a public leaderboard that is updated upon submission. Finally, test-challenge is used for the workshop competition; results will be revealed during the workshop at ECCV 2016. A more thorough explanation is available on the upload page.
Competitors are recommended but not restricted to train their algorithms on COCO 2014 train and val sets. The download page contains links to all COCO 2014 train+val images and associated annotations as well as the 2015 test images. Please specify any and all external data used for training in the "method description" when uploading results to the evaluation server.
By the challenge deadline, results must be submitted to the evaluation server. Competitors' algorithms will be evaluated according to the rules described on the evaluation page. Challenge participants with the most successful and innovative methods will be invited to present.
As noted earlier, the keypoint challenge involves simultaneously detecting people and localizing their keypoints (person locations are not given at test time). As this is a fairly under-explored setting, we have carefully designed a new set of metrics for this task. Please refer to the "Metrics" section of the evaluation page for a detailed explanation of the competition metrics.
6. Tools and Instructions
We provide extensive API support for the COCO images, annotations, and evaluation code. To download the COCO API, please visit our GitHub repository. For an overview of how to use the API, please visit the download page.
Due to the large size of the COCO dataset and the complexity of this challenge, the process of competing in this challenge may not seem simple. To help, we provide explanations and instructions for each step of the process on the download, format, evaluation, and upload pages. For additional questions, please contact cocodataset@outlook.com.
Leaderboards (table contents are not reproduced here; only the reported columns are listed):
- Captioning leaderboard (automatic metrics): CIDEr-D, Meteor, ROUGE-L, BLEU-1, BLEU-2, BLEU-3, BLEU-4, SPICE (also displayed as SPICE x10), date.
- Captioning challenge (human judgments): M1, M2, M3, M4, M5, date; with a ranking table over M1, M2, TOTAL, and Ranking.
- Keypoint leaderboard: AP, AP50, AP75, APM, APL, AR, AR50, AR75, ARM, ARL, date. Please see the keypoint evaluation page for more detailed information about the metrics.
- Detection leaderboard: AP, AP50, AP75, APS, APM, APL, AR1, AR10, AR100, ARS, ARM, ARL, date. Please see the detection evaluation page for more detailed information about the metrics.