This is an official PyTorch Implementation of Honeybee: Locality-enhanced Projector for Multimodal LLM, Junbum Cha*, Wooyoung Kang*, Jonghwan Mun*, Byungseok Roh. [paper]
- [2024.04] 🔥🔥🔥 Honeybee was accepted to CVPR 2024 as a Highlight.
  - Of the 2719 accepted papers, 324 (11.9%) were selected as highlights.
- PyTorch 2.0.1
pip install -r requirements.txt
# additional requirements for demo
pip install -r requirements_demo.txt
We provide checkpoints from both the pre-training (PT) and finetuning (FT) stages.
- Comparison with other SoTA methods (Table 6)
Model | Checkpoints | MMB | MME | SEED-I | LLaVA-w | MM-Vet | MMMU | POPE |
---|---|---|---|---|---|---|---|---|
Honeybee-C-7B-M144 | PT / FT | 70.1 | 1891.3 | 64.5 | 67.1 | 34.9 | 35.3 | 83.2 |
Honeybee-D-7B-M144 | PT / FT | 70.8 | 1835.5 | 63.8 | 66.3 | - | - | - |
Honeybee-C-13B-M256 | PT / FT | 73.2 | 1944.0 | 68.2 | 75.7 | 35.6 | 36.4 | 84.3 |
Honeybee-D-13B-M256 | PT / FT | 73.5 | 1950.0 | 66.6 | 72.9 | - | - | - |
- Pushing the limits of Honeybee (Table 7)
Model | Checkpoints | MMB | MME | SEED-I | LLaVA-w | ScienceQA | MM-Vet | MMMU | POPE |
---|---|---|---|---|---|---|---|---|---|
Honeybee-C-7B-M256 | PT / FT | 71.0 | 1951.3 | 65.5 | 70.6 | 93.2 | 38.1 | 37.3 | 85.5 |
Honeybee-C-13B-M576 | PT / FT | 73.6 | 1976.5 | 68.6 | 77.5 | 94.4 | 42.2 | 36.2 | 85.6 |
After downloading all of the data listed below, organize it under `./data`.
Then modify the data-specific arguments, such as annotation and image root paths, in `configs/data_configs/train_dataset` and `configs/tasks` accordingly.
For the pretraining stage, we use the BlipCapFilt and COYO datasets. Given their large size, we recommend downloading them by following the guidelines provided here and storing them in the webdataset format.
Please note that we employ a filtered subset of the original COYO-700M dataset, specifically the COYO100M subset. This subset excludes image-text pairs with a CLIP similarity score below 0.3, as determined using the CLIP ViT-B/32.
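The filtering rule above can be illustrated with a minimal sketch. It assumes image and text embeddings have already been computed (for example with CLIP ViT-B/32) and stored alongside each pair. The function names, the dictionary keys, and the toy two-dimensional embeddings are hypothetical and only for illustration; this is not the pipeline used to build the COYO100M subset.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, threshold=0.3):
    """Drop image-text pairs whose similarity is below `threshold`."""
    kept = []
    for pair in pairs:
        score = cosine_similarity(pair["image_emb"], pair["text_emb"])
        if score >= threshold:
            kept.append({**pair, "clip_sim": score})
    return kept


# Toy data: real CLIP embeddings have hundreds of dimensions.
toy_pairs = [
    {"url": "a.jpg", "text": "a dog", "image_emb": [1.0, 0.0], "text_emb": [0.8, 0.6]},
    {"url": "b.jpg", "text": "a cat", "image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},
]
kept = filter_pairs(toy_pairs)
print([p["url"] for p in kept])
```

Only the first toy pair survives, since its similarity (0.8) is above the 0.3 threshold while the second pair's similarity is 0.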
Please download the datasets for finetuning from their official sources:
- VQA (open-ended): VQAv2, GQA, OCRVQA, VSR
- VQA (multiple choices): ScienceQA, A-OKVQA
- Referring expression comprehension: RefCOCO, RefCOCO+, RefCOCOg, VisualGenome
- Instruction: LLaVA150K, ShareGPT
Please follow the official guidelines to prepare benchmark datasets: MMB, MME, SEED-Bench, ScienceQA, LLaVABench, MMVet, MMMU, POPE, and OwlEval.
For GPT-based evaluation, including LLaVABench, MMVet, and MMB (gpt matcher), OpenAI API information should be filled in `tasks/llavabench/gpt_eval.py`, `tasks/mm_vet/mmbet_eval.py`, and `tasks/mmb/eval_mmb_gpt.py`, respectively.
### Pretraining
bash scripts/pt.sh {exp_name} ${args1} ${args2} ...
### Finetuning
bash scripts/ft.sh -p {pretrained_ckpt} {exp_name} ${args1} ${args2} ...
### Evaluation
bash scripts/eval_all.sh {ckpt path}
- Please carefully follow the quotation mark usage in the example below.
  - e.g., when defining `data_config/train_dataset`, you SHOULD wrap the whole argument in single quotation marks (`'`).
# Examples
pretrained_ckpt=<path_to_pretrained_ckpt>
ft_output_dir="output/ft/<path_to_output>"
mkdir -p ${ft_output_dir}
# 1st example: sampling_weights with single quotation marks
deepspeed ./train.py \
--config-name=finetune output_dir=${ft_output_dir} pretrained_ckpt=${pretrained_ckpt} \
'data_config/train_dataset=[llava150k,sqa,vicuna40k]' \
data_config.train_cfg.sampling_weights='[0.5, 0.2, 0.3]' \
2>&1 | tee ${ft_output_dir}/train.log
# 2nd example: sampling_weights without single quotation marks; there should be no spaces between values.
deepspeed ./train.py \
--config-name=finetune output_dir=${ft_output_dir} pretrained_ckpt=${pretrained_ckpt} \
'data_config/train_dataset=[llava150k,sqa,vicuna40k]' \
data_config.train_cfg.sampling_weights=[0.5,0.2,0.3] \
2>&1 | tee ${ft_output_dir}/train.log
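
When launching runs from a script, the quoting rules shown in the two examples can be handled programmatically. The following is a minimal sketch: `build_overrides` is a hypothetical helper, not part of this repository, and the check that there is one weight per dataset is our assumption. It emits the bracketed lists without spaces and relies on `shlex.quote` from the Python standard library to add single quotation marks where the shell needs them.

```python
import shlex


def build_overrides(datasets, sampling_weights):
    """Build the two dataset-related override arguments as plain strings."""
    # Assumption: one sampling weight per dataset.
    if len(datasets) != len(sampling_weights):
        raise ValueError("need exactly one sampling weight per dataset")
    dataset_list = ",".join(datasets)  # no spaces between values
    weight_list = ",".join(str(w) for w in sampling_weights)
    return [
        f"data_config/train_dataset=[{dataset_list}]",
        f"data_config.train_cfg.sampling_weights=[{weight_list}]",
    ]


overrides = build_overrides(["llava150k", "sqa", "vicuna40k"], [0.5, 0.2, 0.3])

# shlex.quote wraps each argument in single quotes because of the brackets.
quoted = " ".join(shlex.quote(arg) for arg in overrides)
print(quoted)
```

The printed string can be pasted after `deepspeed ./train.py --config-name=finetune ...` in a shell command. If the command is instead launched with `subprocess.run` and an argument list, the unquoted `overrides` list should be passed directly, since no shell parsing takes place.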
We used batch inference in our evaluation to speed up experiments. Batch inference does not significantly change average scores, but individual scores may vary slightly (about ±0.1 to 0.2). To strictly reproduce the official results, 8 devices (GPUs) are required, because the number of devices influences batch construction and therefore the final scores.
We used the default batch size specified in each task config, except for the largest model (`Honeybee-C-13B-M576`), where we used B=8 due to memory constraints.
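
To see why the device count changes batch construction, consider the following sketch. It assumes evaluation samples are assigned to devices round-robin (as common distributed samplers do) and then chunked into batches on each device; the actual sampler in this repository may differ, and `shard_and_batch` is a hypothetical helper written only for illustration.

```python
def shard_and_batch(num_samples, num_devices, batch_size):
    """Assign sample indices to devices round-robin, then chunk each shard into batches."""
    per_device = {}
    for rank in range(num_devices):
        shard = list(range(rank, num_samples, num_devices))
        per_device[rank] = [
            shard[i:i + batch_size] for i in range(0, len(shard), batch_size)
        ]
    return per_device


eight = shard_and_batch(num_samples=32, num_devices=8, batch_size=2)
four = shard_and_batch(num_samples=32, num_devices=4, batch_size=2)

print(eight[0])  # [[0, 8], [16, 24]]
print(four[0])   # [[0, 4], [8, 12], [16, 20], [24, 28]]
```

With 8 devices, sample 0 shares a batch with sample 8; with 4 devices, it shares a batch with sample 4. Under this assumption, the same sample is processed alongside different neighbors depending on the device count, which is consistent with the small score differences described above.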
Example inference code is provided in `inference_example.ipynb`.
The example images in `./examples` are adopted from mPLUG-Owl.
We also provide a Gradio demo:
python -m serve.web_server --bf16 --port {PORT} --base-model checkpoints/7B-C-Abs-M144/last
@inproceedings{cha2023honeybee,
title={Honeybee: Locality-enhanced Projector for Multimodal LLM},
author={Junbum Cha and Wooyoung Kang and Jonghwan Mun and Byungseok Roh},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}
The source code is licensed under Apache 2.0 License.
The pretrained weights are licensed under CC-BY-NC 4.0 License.
Acknowledgement: this project is developed based on mPLUG-Owl, which is also under the Apache 2.0 License.
Kakao Brain "Honeybee" is the name of the Multimodal Large Language Model (MLLM) open source project, not the customer service brand.