Example on Finetuning BLIP on COCO-Captioning

To finetune the BLIP model on the COCO captioning dataset, first refer to Preparing Datasets to prepare the dataset if you have not already done so.

To finetune the model, we have prepared a run script for you, which can be run as follows:

bash run_scripts/blip/train/train_caption_coco_large.sh

This will finetune the pre-trained BLIP large model into a new model that can be used for captioning.

Deep Dive

Now let’s take a closer look at the script and see what it does.

python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip/train/caption_coco_large_ft.yaml

As can be seen, the script simply calls train.py with PyTorch distributed training enabled. The --cfg-path argument specifies the runtime config file to use. The config file is a YAML file that specifies the training parameters, shown as follows:

# Copyright (c) 2022, salesforce.com, inc.
# All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause

model:
  arch: blip_caption

  model_type: large_coco
  load_finetuned: False

datasets:
  coco_caption: # name of the dataset builder
    vis_processor:
        train:
          name: "blip_image_train"
        eval:
          name: "blip_image_eval"
    text_processor:
        train:
          name: "blip_caption"
          prompt: "a picture of "
        eval:
          name: "blip_caption"

run:
  task: captioning
  # optimizer
  lr_sched: "linear_warmup_cosine_lr"
  init_lr: 2e-6
  min_lr: 0
  weight_decay: 0.05
  max_epoch: 5
  batch_size_train: 16
  batch_size_eval: 64
  num_workers: 4

  max_len: 20
  min_len: 5
  num_beams: 3

  seed: 42
  output_dir: "output/BLIP/Caption_coco"

  amp: False
  resume_ckpt_path: null

  evaluate: False
  train_splits: ["train"]
  valid_splits: ["val"]
  test_splits: ["test"]

  device: "cuda"
  world_size: 1
  dist_url: "env://"
  distributed: True

The runtime config file is divided into 3 sections:
  • model: specifies the model architecture and type to use.

  • datasets: specifies the dataset(s) to use.

  • run: specifies the runner arguments, such as tasks, optimizer, learning rate scheduler, etc.

We describe each section in detail below.
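
If you want to sanity-check the parsed structure yourself, here is a minimal sketch that reads the config with PyYAML (LAVIS itself parses configs with OmegaConf; this is only for quick inspection):

import yaml  # PyYAML, used here purely for inspection

with open("lavis/projects/blip/train/caption_coco_large_ft.yaml") as f:
    cfg = yaml.safe_load(f)

# The three top-level sections described below.
print(list(cfg.keys()))    # ['model', 'datasets', 'run']
print(cfg["model"])        # {'arch': 'blip_caption', 'model_type': 'large_coco', 'load_finetuned': False}
print(cfg["run"]["task"])  # captioning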

Model configurations

model:
  arch: blip_caption

  model_type: large_coco
  load_finetuned: False

The arch argument specifies the model architecture to use. In this case, we use the blip_caption architecture. You can find available architectures by inspecting the model_zoo. Once the architecture is specified, the runner will look for the model class registered with the name and try to instantiate a model instance. In this case BlipCaption is the model registered with the name blip_caption.
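
A quick way to list the available architectures and their model types is to print the model zoo:

from lavis.models import model_zoo

# Prints a table of registered architectures and their model types,
# including blip_caption with base_coco and large_coco.
print(model_zoo)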

The registry maintains a mapping from the name string to the model class. This allows the runner to find the model class dynamically based on the name string from the config file. The following segment in lavis/models/blip_models/blip_caption.py shows how BlipCaption is registered with the name string blip_caption:

@registry.register_model("blip_caption")
class BlipCaption(BlipBase):
    """
    BLIP captioning model.

    Supported model types:
        - base_coco: fine-tuned BLIP base model on COCO caption dataset (Karpathy split).
        - large_coco: fine-tuned BLIP large model on COCO caption dataset (Karpathy split).

    Usage:
        >>> from lavis.models import load_model
        >>> model = load_model("blip_caption", "base_coco")
        >>> model = load_model("blip_caption", "large_coco")
    """

    PRETRAINED_MODEL_CONFIG_DICT = {
        "base_coco": "configs/models/blip_caption_base_coco.yaml",
        "large_coco": "configs/models/blip_caption_large_coco.yaml",
    }
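
Because of this registration, the class can be resolved by its name string at runtime, which is exactly what the runner does when it reads arch from the config. A minimal sketch, assuming the get_model_class accessor on the shared registry:

from lavis.common.registry import registry

# Resolve the class registered under "blip_caption".
model_cls = registry.get_model_class("blip_caption")
print(model_cls.__name__)  # BlipCaption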

The same model architecture may be pre-trained or finetuned on different datasets, or have different model configurations. For example, BlipCaption has:

  • base_coco: pre-trained base BLIP model adapted for COCO captioning finetuning.

  • large_coco: pre-trained large BLIP model adapted for COCO captioning finetuning.

Therefore, we also need to specify model_type; here we use large_coco. We set load_finetuned to False to indicate that we are finetuning the model from the pre-trained weights. If load_finetuned is set to True (the default), the model will instead load weights that have already been finetuned on COCO captioning.

Given the model architecture and type, the library will then look for the default model config for large_coco in lavis/models/blip_models/blip_caption.py. As can be seen in the above code snippet, the corresponding config path is stored in BlipCaption.PRETRAINED_MODEL_CONFIG_DICT. Then the library will load lavis/configs/models/blip_caption_large_coco.yaml as the configuration to build the model.
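
You can inspect this mapping directly; the snippet below only reads the dictionary shown in the class above (note that the stored path is relative to the lavis package directory):

from lavis.models.blip_models.blip_caption import BlipCaption

# Package-relative path to the default config for large_coco.
print(BlipCaption.PRETRAINED_MODEL_CONFIG_DICT["large_coco"])
# configs/models/blip_caption_large_coco.yaml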

Priority of Configs: Note that the run config has higher priority than the default model config, meaning that arguments in the run config override those in the default model config. For example, the default model config sets load_finetuned to True, while the run config sets it to False so that we finetune from the pre-trained weights only.
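
LAVIS merges these YAML files with OmegaConf, where values merged later take precedence. The following standalone sketch illustrates the override behaviour (it is not the exact code path LAVIS uses):

from omegaconf import OmegaConf

default_model_cfg = OmegaConf.create({"model": {"load_finetuned": True}})
run_cfg = OmegaConf.create({"model": {"load_finetuned": False}})

# Later configs override earlier ones, so the run config wins.
merged = OmegaConf.merge(default_model_cfg, run_cfg)
print(merged.model.load_finetuned)  # False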

Dataset configurations

The second section of the config file specifies the dataset(s) to use.

datasets:
  coco_caption: # name of the dataset builder
    vis_processor:
        train:
          name: "blip_image_train"
        eval:
          name: "blip_image_eval"
    text_processor:
        train:
          name: "blip_caption"
          prompt: "a picture of "
        eval:
          name: "blip_caption"

We associate each dataset with a vis_processor and a text_processor, responsible for processing the visual and textual input respectively. Here we again use the registry mechanism to dynamically load the processor class based on the name string. For example, blip_image_train is the name string for the BlipImageTrainProcessor class, which is registered in lavis/processors/blip_processors.py.
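
Processors can also be built by name outside of a training run, which is handy for quick experiments. Below is a sketch assuming the load_processor helper exposed by lavis.processors (it resolves names through the same registry); the image path is a placeholder:

from PIL import Image
from lavis.processors import load_processor

# Build processors from their registered name strings with default settings.
vis_processor = load_processor("blip_image_eval")
text_processor = load_processor("blip_caption")

raw_image = Image.open("path/to/your/image.jpg").convert("RGB")  # placeholder path
image_tensor = vis_processor(raw_image)
caption = text_processor("a dog running on the beach")
print(image_tensor.shape, caption)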

Similarly, the dataset name string is also registered in the registry, pointing to the dataset builder class COCOCapBuilder. By default, the builder loads the default dataset configuration defined in DATASET_CONFIG_DICT. You may also add new dataset types by adding new entries to that dictionary.
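
To see the builder in action, you can load the dataset directly with the load_dataset helper (as shown in the LAVIS README). The first call downloads the annotation files into the cache directory; the images must already be prepared as described in Preparing Datasets:

from lavis.datasets.builders import load_dataset

# Instantiates COCOCapBuilder via the registry and builds all splits
# using the default dataset configuration.
coco_dataset = load_dataset("coco_caption")

print(coco_dataset.keys())        # dict_keys(['train', 'val', 'test'])
print(len(coco_dataset["train"]))
print(coco_dataset["train"][0])   # sample with image, text_input and image_id fields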

The dataset configuration used here is:

datasets:
  coco_caption: # name of the dataset builder
    dataset_card: dataset_card/coco_caption.md
    # data_dir: ${env.data_dir}/datasets
    data_type: images # [images|videos|features]

    build_info:
      # Be careful not to append minus sign (-) before split to avoid itemizing
      annotations:
        train:
          url: https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_train.json
          md5: aa31ac474cf6250ebb81d18348a07ed8
          storage: coco/annotations/coco_karpathy_train.json
        val:
          url: https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_val.json
          md5: b273847456ef5580e33713b1f7de52a0
          storage: coco/annotations/coco_karpathy_val.json
        test:
          url: https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_test.json
          md5: 3ff34b0ef2db02d01c37399f6a2a6cd1
          storage: coco/annotations/coco_karpathy_test.json
      images:
        storage: coco/images/

In this configuration file, we specify the dataset name and, mainly, its build information. The build information is divided into two parts: annotations and images. The annotation files are automatically downloaded when the dataset is loaded for the first time. The images part specifies the image root directory, given as a path relative to the cache directory (cache by default). If you have a local copy of the dataset, you can point to it by overriding the images part in the runtime config file. For example, you may alter the run config as below to use your local dataset copy:

datasets:
    coco_caption: # name of the dataset builder
        vis_processor:
            train:
              name: "blip_image_train"
            eval:
              name: "blip_image_eval"
        text_processor:
            train:
              name: "blip_caption"
              prompt: "a picture of "
            eval:
              name: "blip_caption"
        images:
            YOUR_LOCAL_IMAGE_ROOT_DIR

LAVIS supports using multiple datasets for training. See an example in lavis/projects/blip/train/pretrain_14m.yaml.

Runner configurations

The last section of the config file specifies the arguments for the runner, shown below:

run:
  task: captioning
  # optimizer
  lr_sched: "linear_warmup_cosine_lr"
  init_lr: 2e-6
  min_lr: 0
  weight_decay: 0.05
  max_epoch: 5
  batch_size_train: 16
  batch_size_eval: 64
  num_workers: 4

  max_len: 20
  min_len: 5
  num_beams: 3

  seed: 42
  output_dir: "output/BLIP/Caption_coco"

  amp: False
  resume_ckpt_path: null

  evaluate: False
  train_splits: ["train"]
  valid_splits: ["val"]
  test_splits: ["test"]

  device: "cuda"
  world_size: 1
  dist_url: "env://"
  distributed: True

Here we specify runner-related arguments, including:
  • task-specific arguments, such as task, max_len, min_len, etc. (see the decoding sketch after this list);

  • the optimizer and learning rate scheduler;

  • distributed training settings;

  • logging and checkpointing settings.
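
To connect the decoding arguments back to the model: during evaluation the captioning task forwards them to the model's generate method. A minimal inference sketch, assuming load_model_and_preprocess and a local image of your own; the keyword names max_length/min_length/num_beams are the assumed counterparts of max_len/min_len/num_beams in the config:

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Loads the BLIP large captioning model with its COCO-finetuned weights.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="large_coco", is_eval=True, device=device
)

raw_image = Image.open("path/to/your/image.jpg").convert("RGB")  # placeholder path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Beam-search decoding with the same settings as the run config above.
captions = model.generate({"image": image}, num_beams=3, max_length=20, min_length=5)
print(captions)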

Available Configurations

See Training Models on Task Datasets (Commands and Configurations) for the full list of available configurations and their descriptions.