Benchmark
############

We provide scripts for evaluating and training models on task datasets. The following benchmark results are included for reference.

ALBEF
*******

.. list-table::
   :widths: 30 80 20

   * - **Pretraining**
     - COCO (`download `__)
     - `script `__
   * -
     - Visual Genome (`download `__)
     -
   * -
     - SBU (`download `__)
     -
   * -
     - CC3M (`download `__)
     -
   * -
     - CC12M (`download `__)
     -

.. list-table::
   :widths: 30 40 20 20 20 30 30
   :header-rows: 1

   * -
     - **Retrieval**
     - **R1**
     - **R5**
     - **R10**
     - **Training**
     - **Evaluation**
   * - TR
     - COCO (`download `__)
     - 77.6
     - 94.1
     - 97.2
     - `script `__
     - `script `__
   * - IR
     - COCO (`download `__)
     - 61.0
     - 84.5
     - 90.7
     - `script `__
     - `script `__
   * - TR
     - Flickr30k (`download `__)
     - 77.6
     - 94.1
     - 97.2
     - `script `__
     - `script `__
   * - IR
     - Flickr30k (`download `__)
     - 61.0
     - 84.5
     - 90.7
     - `script `__
     - `script `__

.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - **VQA**
     - **test-dev**
     - **test-std/test**
     - **Training**
     - **Evaluation**
   * - VQAv2 (`download `__)
     - 76.35
     - 76.54
     - `script `__
     - `script `__
   * - OKVQA (`download `__)
     - NA
     - 54.7
     - `script `__
     - NA
   * - AOKVQA (`download `__)
     - 54.5
     - NA
     - `script `__
     - NA

.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - **Multimodal Classification**
     - **val**
     - **test**
     - **Training**
     - **Evaluation**
   * - SNLI-VE (`download `__)
     - 80.60
     - 81.04
     - `script `__
     - `script `__
   * - NLVR2 (`download `__)
     - 82.47
     - 82.91
     - `script `__
     - `script `__

BLIP
*******

.. list-table::
   :widths: 30 80 20

   * - **Pretraining (14M)**
     - COCO (`download `__)
     - `script `__
   * -
     - Visual Genome (`download `__)
     -
   * -
     - SBU (`download `__)
     -
   * -
     - CC3M (`download `__)
     -
   * -
     - CC12M (`download `__)
     -

.. list-table::
   :widths: 30 40 20 20 20 30 30
   :header-rows: 1

   * - **Tasks**
     - **Retrieval**
     - **R1**
     - **R5**
     - **R10**
     - **Training**
     - **Evaluation**
   * - TR
     - COCO (`download `__)
     - 82.0
     - 95.8
     - 98.1
     - `script `__
     - `script `__
   * - IR
     - COCO (`download `__)
     - 64.5
     - 86.0
     - 91.7
     - `script `__
     - `script `__
   * - TR
     - Flickr30k (`download `__)
     - 96.9
     - 99.9
     - 100.0
     - `script `__
     - `script `__
   * - IR
     - Flickr30k (`download `__)
     - 87.5
     - 97.6
     - 98.9
     - `script `__
     - `script `__

.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - **VQA**
     - **test-dev**
     - **test-std/test**
     - **Training**
     - **Evaluation**
   * - VQAv2 (`download `__)
     - 78.23
     - 78.29
     - `script `__
     - `script `__
   * - OKVQA (`download `__)
     - NA
     - 55.4
     - `script `__
     - `script `__
   * - AOKVQA (`download `__)
     - 56.2
     - 50.1
     - `script `__
     - `script `__

.. list-table::
   :widths: 20 20 20 20 20 20
   :header-rows: 1

   * - **Image Captioning**
     - **BLEU@4**
     - **CIDEr**
     - **SPICE**
     - **Training**
     - **Evaluation**
   * - COCO (`download `__)
     - 39.9
     - 133.5
     - 23.7
     - `script `__
     - `script `__
   * - NoCaps (`download `__)
     - 31.9
     - 109.1
     - 14.7
     - NA
     - `script `__

.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - **Multimodal Classification**
     - **val**
     - **test**
     - **Training**
     - **Evaluation**
   * - NLVR2 (`download `__)
     - 82.48
     - 83.25
     - `script `__
     - `script `__

CLIP
*******

.. list-table::
   :widths: 30 40 20 20 20 30
   :header-rows: 1

   * - **Tasks**
     - **Retrieval (Zero-shot)**
     - **R1**
     - **R5**
     - **R10**
     - **Evaluation**
   * - TR
     - COCO (`download `__)
     - 57.2
     - 80.5
     - 87.8
     - `script `__
   * - IR
     - COCO (`download `__)
     - 36.5
     - 60.8
     - 71.0
     - `script `__
   * - TR
     - Flickr30k (`download `__)
     - 86.5
     - 98.0
     - 99.1
     - `script `__
   * - IR
     - Flickr30k (`download `__)
     - 67.0
     - 88.9
     - 93.3
     - `script `__

.. list-table::
   :widths: 20 20 20
   :header-rows: 1

   * - **Multimodal Classification**
     - **val**
     - **Evaluation**
   * - ImageNet
     - 76.5
     - `script `__

ALPRO
*******

.. list-table::
   :widths: 30 40 20 20 20 20 30
   :header-rows: 1

   * - **Tasks**
     - **Retrieval**
     - **R1**
     - **R5**
     - **R10**
     - **Training**
     - **Evaluation**
   * - TR
     - MSRVTT (`download `__)
     - 33.2
     - 60.5
     - 71.7
     - `script `__
     - `script `__
   * - VR
     - MSRVTT (`download `__)
     - 33.8
     - 61.4
     - 72.7
     - `script `__
     - `script `__
   * - TR
     - DiDeMo (`download `__)
     - 38.8
     - 66.4
     - 76.8
     - `script `__
     - `script `__
   * - VR
     - DiDeMo (`download `__)
     - 36.6
     - 67.5
     - 77.9
     - `script `__
     - `script `__

.. list-table::
   :widths: 20 20 20 20
   :header-rows: 1

   * - **Video QA**
     - **test**
     - **Training**
     - **Evaluation**
   * - MSRVTT
     - 42.1
     - `script `__
     - `script `__
   * - MSVD
     - 46.0
     - `script `__
     - `script `__