Emu1

Generative Pretraining in Multimodality

Quan Sun1*, Qiying Yu2,1*, Yufeng Cui1*, Fan Zhang1*, Xiaosong Zhang1*, Yueze Wang1, Hongcheng Gao1,
Jingjing Liu2, Tiejun Huang1,3, Xinlong Wang1

1 BAAI, 2 THU, 3 PKU
* Equal Contribution

| Paper | Demo |


Emu is a multimodal generalist that can seamlessly generate images and text in a multimodal context. Emu is trained with a unified autoregressive objective, i.e., predicting the next element in a sequence, whether that element is a visual embedding or a textual token. Trained under this objective, Emu can serve as a generalist interface for both image-to-text and text-to-image tasks.
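As a rough sketch of what such a unified objective looks like (an illustration only, not this repository's actual training code): positions holding text tokens are supervised with a classification loss, positions holding visual embeddings with a regression loss, and both are optimized jointly in a single next-element prediction pass. The function and tensor names, and the choice of l2 regression, are assumptions made for illustration.

import torch.nn.functional as F

def unified_autoregressive_loss(text_logits, text_targets, visual_pred, visual_targets):
    """Minimal sketch of a predict-the-next-element objective.

    text_logits: (N_text, vocab_size) next-token logits at text positions
    text_targets: (N_text,) ground-truth next token ids
    visual_pred: (N_vis, dim) predicted next visual embeddings
    visual_targets: (N_vis, dim) ground-truth next visual embeddings
    """
    # Classification loss on textual tokens.
    text_loss = F.cross_entropy(text_logits, text_targets)
    # Regression loss on visual embeddings (l2 here, an illustrative choice).
    visual_loss = F.mse_loss(visual_pred, visual_targets)
    return text_loss + visual_loss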

News

  • Oct 16, 2023: Emu-I achieves state-of-the-art performance on the MM-Vet benchmark (w/o external tools like GPT-4), which assesses large multimodal models in real-world, in-the-wild scenarios.
  • Oct 13, 2023: The code for the zero-shot evaluation of Emu-I has been released!
  • Sep 18, 2023: Tools for processing the YT-Storyboard-1B dataset have been released!

Generalist Interface

Emu serves as a generalist interface capable of diverse multimodal tasks, such as image captioning, image/video question answering, and text-to-image generation, together with new abilities like in-context text and image generation and image blending.

Setup

Clone this repository and install required packages:

git clone https://github.com/baaivision/Emu
cd Emu/Emu1

pip install -r requirements.txt

Model Weights

We release the pretrained and instruction-tuned weights of Emu. Our weights are subject to LLaMA-1's license.

| Model name | Weight |
| --- | --- |
| Emu w/ Decoder | 🤗 HF link (34GB) |
| Emu-I | 🤗 HF link (27GB) |
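If you prefer to fetch the weights from a script rather than through the browser, huggingface_hub's snapshot_download can do so. This is a sketch under the assumption that the weights live in a Hugging Face model repository; the repo_id and local_dir below are placeholders, so substitute the repository behind the HF link in the table above.

from huggingface_hub import snapshot_download

# Placeholder repo_id: replace with the repository behind the HF link above.
# If the repository is gated, authenticate first (e.g., `huggingface-cli login`).
ckpt_dir = snapshot_download(
    repo_id="BAAI/Emu",
    local_dir="./models/Emu",  # the pretrained weights are ~34GB
)
print(f"weights downloaded to {ckpt_dir}")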

Inference

At present, we provide inference code that accepts interleaved image-text and video as input and produces text and images as output.

For the instruction-tuned model, we provide examples of image captioning, visual question answering, and interleaved multi-image understanding:

python inference.py --instruct --ckpt-path ${INSTRUCT_CKPT_PATH}

For the pretrained model, we provide an example of in-context learning:

python inference.py --ckpt-path ${PRETRAIN_CKPT_DIR}/multimodal_encoder/pytorch_model.bin

For image generation, we provide examples of image blending, text-to-image generation, and in-context generation:

python image_inference.py --ckpt-path ${PRETRAIN_CKPT_DIR}
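For a sense of what the interleaved image-text input mentioned above looks like, the sketch below builds a prompt that alternates text with PIL images. It only illustrates the data layout; the file paths are hypothetical and the exact structure expected by inference.py may differ, so check the examples inside that script.

from PIL import Image

# Hypothetical example images; the list simply alternates text and images.
interleaved_prompt = [
    "There are two images:",
    Image.open("examples/dog.png").convert("RGB"),
    Image.open("examples/panda.png").convert("RGB"),
    "Describe the difference between the two animals.",
]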

Evaluation

We provide Emu-I's zero-shot evaluation code for the MM-Vet, COCO Caption, VQAv2, OKVQA, VizWiz, and VisDial benchmarks. For example, to evaluate COCO captioning on a node with 8 GPUs:

# --dataset_name can be one of: coco, mmvet, vqav2, okvqa, vizwiz, visdial
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env \
    eval.py \
    --instruct \
    --batch_size 4 \
    --ckpt_path ${INSTRUCT_CKPT_PATH} \
    --root_path /path/to/benchmark_root \
    --dataset_name coco \
    --output_path ./output/

where /path/to/benchmark_root should contain the following file structure:

benchmark_root/
    mm-vet/
        mm-vet.json
        images/
            v1_0.png
            ...
    coco/
        images/
            test2015/
                COCO_test2015_{...}.jpg
                ...
            val2014/
                COCO_val2014_{...}.jpg
                ...
        annotations/
            coco_karpathy_test.json
            coco_karpathy_test_gt.json
            coco_karpathy_val.json
            coco_karpathy_val_gt.json
            v2_OpenEnded_mscoco_val2014_questions.json
            v2_mscoco_val2014_annotations.json
            vqa_test.json
            vqa_val_eval.json
    okvqa/
        annotations/
            OpenEnded_mscoco_val2014_questions.json
            mscoco_val2014_annotations.json
            vqa_val_eval.json
    vizwiz/
        images/
            test/
                VizWiz_test_{...}.jpg
                ...
            val/
                VizWiz_val_{...}.jpg
                ...
        annotations/
            test.json
            val.json
    visdial/
        VisualDialog_test2018/
            VisualDialog_test2018_{...}.jpg
            ...
        VisualDialog_val2018/
            VisualDialog_val2018_{...}.jpg
            ...
        visdial_1.0_test.json
        visdial_1.0_val.json
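Before launching an evaluation job, you can sanity-check that benchmark_root matches the layout above. The helper below is a small convenience written for this README, not part of the repository; trim the list to the benchmarks you actually use.

from pathlib import Path

# A few representative paths from the layout above; extend as needed.
EXPECTED = [
    "mm-vet/mm-vet.json",
    "coco/annotations/coco_karpathy_test_gt.json",
    "coco/images/val2014",
    "okvqa/annotations/mscoco_val2014_annotations.json",
    "vizwiz/annotations/val.json",
    "visdial/visdial_1.0_val.json",
]

def check_benchmark_root(root: str) -> None:
    missing = [p for p in EXPECTED if not (Path(root) / p).exists()]
    if missing:
        print(f"missing entries under {root}:")
        for p in missing:
            print("  -", p)
    else:
        print("benchmark_root layout looks OK")

check_benchmark_root("/path/to/benchmark_root")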

You can also use your own file structure by modifying the corresponding data loading code; the loader for each dataset can be found in the mm_eval/datasets/ directory. All files can be downloaded from the official dataset websites or from LAVIS.
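To make the COCO captioning numbers concrete, a standard way to score a caption-results JSON against a ground-truth file such as coco_karpathy_test_gt.json is the pycocoevalcap toolkit, sketched below. The results file name is a hypothetical placeholder, and this is not necessarily how eval.py computes its metrics internally.

from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Ground truth in COCO caption format (one of the annotation files listed above);
# caption_results.json is a hypothetical [{"image_id": ..., "caption": ...}] file.
coco_gt = COCO("/path/to/benchmark_root/coco/annotations/coco_karpathy_test_gt.json")
coco_res = coco_gt.loadRes("./output/caption_results.json")

scorer = COCOEvalCap(coco_gt, coco_res)
scorer.params["image_id"] = coco_res.getImgIds()  # score only images with predictions
scorer.evaluate()
print(scorer.eval)  # BLEU-4, METEOR, ROUGE_L, CIDEr, ...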

Schedule

We are committed to open-sourcing all Emu-related materials, including:

  • The weights of Emu and Emu-I
  • Inference example for interleaved image-text as input, text as output
  • Video inference example
  • Weights of image decoder & image generation/blending example
  • YT-Storyboard-1B pretraining data
  • Pretraining code
  • Instruction tuning code
  • Evaluation code

We hope to foster the growth of our community through open-sourcing and promoting collaboration👬. Let's step towards multimodal intelligence together🍻.

Acknowledgement

We thank the great work from LLaMA, BLIP-2, Stable Diffusion, and FastChat.

Citation

If you find Emu useful for your research and applications, please consider starring this repository and citing:

@article{Emu,
  title={Generative Pretraining in Multimodality},
  author={Sun, Quan and Yu, Qiying and Cui, Yufeng and Zhang, Fan and Zhang, Xiaosong and Wang, Yueze and Gao, Hongcheng and Liu, Jingjing and Huang, Tiejun and Wang, Xinlong},
  journal={arXiv preprint arXiv:2307.05222},
  year={2023},
}