Benchmark
We provide scripts for evaluating and training models on task datasets. The following benchmark results are included for reference.
ALBEF
| Pretraining |
|---|
| COCO (download) |
| Visual Genome (download) |
| SBU (download) |
| CC3M (download) |
| CC12M (download) |

| Tasks | Retrieval | R1 | R5 | R10 | Training | Evaluation |
|---|---|---|---|---|---|---|
| TR | COCO (download) | 77.6 | 94.1 | 97.2 | | |
| IR | COCO (download) | 61.0 | 84.5 | 90.7 | | |
| TR | Flickr30k (download) | 77.6 | 94.1 | 97.2 | | |
| IR | Flickr30k (download) | 61.0 | 84.5 | 90.7 | | |

| VQA | test-dev | test-std/test | Training | Evaluation |
|---|---|---|---|---|
| VQAv2 (download) | 76.35 | 76.54 | | |
| OKVQA (download) | NA | 54.7 | NA | |
| AOKVQA (download) | 54.5 | NA | NA | |

| Multimodal Classification | val | test | Training | Evaluation |
|---|---|---|---|---|
| SNLI-VE (download) | 80.60 | 81.04 | | |
| NLVR2 (download) | 82.47 | 82.91 | | |

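
The R1, R5, and R10 columns in the retrieval tables are read here as recall at K: the percentage of queries for which a correct match appears among the top K ranked candidates. The sketch below illustrates that conventional definition on a toy similarity matrix. It is not the evaluation code that produced the numbers in this document; the function name `recall_at_k`, the `scores` and `relevant` input layout, and the toy values are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(
    scores: Sequence[Sequence[float]],
    relevant: Sequence[Set[int]],
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Hypothetical helper: recall at K, in percent, for a retrieval task.

    scores[q][c] is the similarity between query q and candidate c.
    relevant[q] is the non-empty set of candidate indices correct for query q.
    A query counts as a hit at K if any relevant candidate ranks in the top K.
    """
    if len(scores) != len(relevant):
        raise ValueError("scores and relevant must have one entry per query")

    hits = {k: 0 for k in ks}
    for row, gold in zip(scores, relevant):
        if not gold:
            raise ValueError("every query needs at least one relevant candidate")
        # Rank candidates from most to least similar (ties keep index order).
        ranked: List[int] = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        position = {cand: rank for rank, cand in enumerate(ranked)}
        best_rank = min(position[c] for c in gold)
        for k in ks:
            if best_rank < k:
                hits[k] += 1

    n_queries = len(scores)
    return {k: 100.0 * hits[k] / n_queries for k in ks}


# Toy example (made-up similarities): 3 queries, 3 candidates,
# where candidate i is the only correct match for query i.
toy_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
toy_relevant = [{0}, {1}, {2}]
toy_result = recall_at_k(toy_scores, toy_relevant, ks=(1, 2, 3))
print(toy_result)
```

For text retrieval (TR) the queries would be images and the candidates captions; for image retrieval (IR) the roles swap, which is why the two directions get separate rows. Passing `relevant` as sets lets one query have several correct candidates, as happens when an image has multiple reference captions.
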
BLIP
| Pretraining (14M) |
|---|
| COCO (download) |
| Visual Genome (download) |
| SBU (download) |
| CC3M (download) |
| CC12M (download) |

| Tasks | Retrieval | R1 | R5 | R10 | Training | Evaluation |
|---|---|---|---|---|---|---|
| TR | COCO (download) | 82.0 | 95.8 | 98.1 | | |
| IR | COCO (download) | 64.5 | 86.0 | 91.7 | | |
| TR | Flickr30k (download) | 96.9 | 99.9 | 100.0 | | |
| IR | Flickr30k (download) | 87.5 | 97.6 | 98.9 | | |

| VQA | test-dev | test-std/test | Training | Evaluation |
|---|---|---|---|---|
| VQAv2 (download) | 78.23 | 78.29 | | |
| OKVQA (download) | NA | 55.4 | | |
| AOKVQA (download) | 56.2 | 50.1 | | |

| Image Captioning | BLEU@4 | CIDEr | SPICE | Training | Evaluation |
|---|---|---|---|---|---|
| COCO (download) | 39.9 | 133.5 | 23.7 | | |
| NoCaps (download) | 31.9 | 109.1 | 14.7 | NA | |

| Multimodal Classification | val | test | Training | Evaluation |
|---|---|---|---|---|
| NLVR2 (download) | 82.48 | 83.25 | | |

CLIP
| Tasks | Retrieval (Zero-shot) | R1 | R5 | R10 | Evaluation |
|---|---|---|---|---|---|
| TR | COCO (download) | 57.2 | 80.5 | 87.8 | |
| IR | COCO (download) | 36.5 | 60.8 | 71.0 | |
| TR | Flickr30k (download) | 86.5 | 98.0 | 99.1 | |
| IR | Flickr30k (download) | 67.0 | 88.9 | 93.3 | |

| Multimodal Classification | val | Evaluation |
|---|---|---|
| ImageNet | 76.5 | |

ALPRO
| Tasks | Retrieval | R1 | R5 | R10 | Training | Evaluation |
|---|---|---|---|---|---|---|
| TR | MSRVTT (download) | 33.2 | 60.5 | 71.7 | | |
| VR | MSRVTT (download) | 33.8 | 61.4 | 72.7 | | |
| TR | DiDeMo (download) | 38.8 | 66.4 | 76.8 | | |
| VR | DiDeMo (download) | 36.6 | 67.5 | 77.9 | | |

| Video QA | test | Training | Evaluation |
|---|---|---|---|
| MSRVTT | 42.1 | | |
| MSVD | 46.0 | | |