This repository provides a simple implementation of Libra in PyTorch, including pretraining, finetuning, and inference.
Please refer to the ICML 2024 paper:
Libra: Building Decoupled Vision System on Large Language Models
Yifan Xu, Xiaoshan Yang, Yaguang Song, Changsheng Xu
## Environment

Install the required dependencies:

```shell
pip install -r requirements.txt
```
## Data

The code supports data in the webdataset, COCO, and LLaVA-instruction formats, organized as follows:
```
DATASETS/
├── laion/
│   ├── 00000.tar
│   ├── 00001.tar
│   ├── ...
│   └── 07776.tar
├── instruction/
│   ├── llava_v1_5_mix665k.json
│   └── data/
│       ├── coco/
│       ├── gqa/
│       ├── ...
│       └── vg/
└── coco/
    ├── annotations/
    │   ├── coco_karpathy_train.json
    │   └── ...
    ├── train2017/
    ├── val2017/
    ├── train2014/
    └── ...
```
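Before launching training, it can save time to verify that the expected files are in place. The sketch below is a convenience check that mirrors the tree above; the entries listed are assumptions based on that tree, and which ones you actually need depends on the configs you run.

```python
from pathlib import Path

# Expected entries, mirroring the DATASETS/ tree above.
# Adjust this list to match the datasets your configs actually use.
EXPECTED = [
    "laion",
    "instruction/llava_v1_5_mix665k.json",
    "instruction/data",
    "coco/annotations/coco_karpathy_train.json",
    "coco/train2017",
    "coco/val2017",
    "coco/train2014",
]

def check_datasets(root: str) -> list[str]:
    """Return the expected entries that are missing under `root`."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).exists()]

# Usage:
#   missing = check_datasets("DATASETS")
#   if missing: print("Missing:", *missing, sep="\n  ")
```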
## Checkpoints

If you want to train Libra from scratch, several preparations are needed. Otherwise, you can skip this step.

- Prepare the Hugging Face version of the `llama-2-7b-chat-hf` model. Please refer to here. Then rename the folder to `llama-2-7b-chat-hf-libra`.
- Merge the vision tokenizer weights into the pretrained llama path. The pretrained vision tokenizer weights can be found here.
- Download the pretrained CLIP model from Hugging Face and merge it into the pretrained model paths. The CLIP model can be downloaded here.
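"Merging" the CLIP folder is assumed here to mean copying it into the checkpoint directory so that the final layout matches the tree shown below; the sketch uses placeholder paths and should be adapted to where your downloads actually live.

```python
import shutil
from pathlib import Path

def merge_clip(clip_dir: str, model_dir: str) -> Path:
    """Copy a downloaded CLIP folder into a pretrained model path.

    Assumption: "merge" means the CLIP folder ends up as a subdirectory
    of the checkpoint directory (see the layout below). Adjust if your
    setup differs.
    """
    src = Path(clip_dir)
    dst = Path(model_dir) / src.name
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst

# Example (placeholder paths for your local downloads):
# merge_clip("openai-clip-vit-large-patch14-336",
#            "CHECKPOINTS/llama-2-7b-chat-hf-libra")
```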
If you want to run the official Libra models, you need to download `libra-11b-chat` or `libra-11b-base`.
The final checkpoint path should look like:

```
CHECKPOINTS/
├── libra-11b-base/
│   ├── ...
│   └── openai-clip-vit-large-patch14-336/
│       └── ...
├── libra-11b-chat/
│   ├── ...
│   └── openai-clip-vit-large-patch14-336/
│       └── ...
└── llama-2-7b-chat-hf-libra/
    │
    │ # original llama files
    │
    ├── config.json
    ├── pytorch_model-00001-of-00002.bin
    ├── ...
    ├── tokenizer.model
    │
    │ # newly added vision tokenizer
    │
    ├── vision_tokenizer_config.yaml
    ├── vqgan.ckpt
    │
    │ # CLIP model
    │
    └── openai-clip-vit-large-patch14-336/
        └── ...
```
## Demo

We provide a simple Jupyter demo here.
## Pretraining

We use the LAION dataset for pretraining. Please refer to the config file for detailed usage. The training command is:

```shell
torchrun --nnodes=5 --nproc_per_node=8 train.py --cfg-path libra/configs/libra_pretrain.yaml
```
## Finetuning

The code supports finetuning data in the LLaVA instruction format. Please refer to LLaVA to organize the data. Alternatively, you can use customized data, as long as its annotation format is similar to `llava_v1_5_mix665k.json`.

```shell
torchrun --nnodes=1 --nproc_per_node=8 train.py --cfg-path libra/configs/libra_instruction.yaml
```
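When preparing customized data, a quick schema check can catch formatting mistakes before a training run. The sketch below is a rough sanity check based on the public `llava_v1_5_mix665k.json` layout (a list of records, each with an `id`, an optional `image` path, and a `conversations` list of `{"from": ..., "value": ...}` turns); it is an assumption about the expected schema, not part of the repository.

```python
import json

def check_annotations(path: str, max_errors: int = 10) -> list[str]:
    """Rough sanity check for LLaVA-style instruction annotations.

    Assumes the llava_v1_5_mix665k.json schema; adjust the checks if
    your customized data deviates from it.
    """
    with open(path) as f:
        records = json.load(f)
    errors = []
    for i, rec in enumerate(records):
        if "id" not in rec:
            errors.append(f"record {i}: missing 'id'")
        turns = rec.get("conversations")
        if not isinstance(turns, list) or not turns:
            errors.append(f"record {i}: missing or empty 'conversations'")
        else:
            for j, turn in enumerate(turns):
                if "from" not in turn or "value" not in turn:
                    errors.append(f"record {i}, turn {j}: missing 'from'/'value'")
        if len(errors) >= max_errors:
            break
    return errors

# Usage:
#   errors = check_annotations("DATASETS/instruction/llava_v1_5_mix665k.json")
#   if errors: print(*errors, sep="\n")
```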
## Models

We provide the pretrained base model (Libra-Base) and the model after instruction tuning (Libra-Chat).

| Model | URL |
|---|---|
| Libra-Base | HuggingFace |
| Libra-Chat | HuggingFace |
## Citation

If you find our work helpful, please consider citing:

```bibtex
@InProceedings{xu2024libra,
  title     = {Libra: Building Decoupled Vision System on Large Language Models},
  author    = {Xu, Yifan and Yang, Xiaoshan and Song, Yaguang and Xu, Changsheng},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {55371--55388},
  year      = {2024},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
}
```
## Acknowledgements

We'd like to thank Menghao Hu from Pengcheng Laboratory for data management and Chaoyou Fu from Tencent for early discussions. The code was built upon LAVIS, the Hugging Face Trainer, and DeepSpeed. Thanks for their great work.