update documentation and example scripts
i-gao committed Sep 16, 2023
1 parent b0ff9a4 commit 35e77f0
Showing 15 changed files with 258 additions and 24 deletions.
Binary file added docs/inputs.png
Binary file added docs/signature.png
Binary file added docs/xattn_langstream.png
10 changes: 9 additions & 1 deletion open_flamingo/eval/README.md
@@ -1,5 +1,4 @@
# OpenFlamingo Evaluation Suite

This is the evaluation module of OpenFlamingo. It contains a set of utilities for evaluating multimodal models on various benchmarking datasets.

*This module is a work in progress! We will be updating this README as it develops. In the meantime, if you notice an issue, please file a Bug Report or Feature Request [here](https://github.com/mlfoundations/open_flamingo/issues/new/choose).*
@@ -19,6 +18,15 @@ This is the evaluation module of OpenFlamingo. It contains a set of utilities fo

When evaluating a model using `num_shots` shots, we sample the exemplars from the training split. Performance is evaluated on a disjoint test split, subsampled to `--num_samples` examples (or using the full test split if `--num_samples=-1`).
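The split logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the module's actual sampling code: the function name `split_for_eval`, the `seed` parameter, and the choice to draw a fresh set of exemplars for each test example are all assumptions made for the example.

```python
import random

def split_for_eval(train_ids, test_ids, num_shots, num_samples, seed=0):
    """Pick the test subset, then draw num_shots exemplars per test example from train.

    Hypothetical helper: names and per-example sampling are assumptions.
    """
    rng = random.Random(seed)
    # num_samples == -1 means "use the full test split".
    if num_samples == -1 or num_samples >= len(test_ids):
        eval_ids = list(test_ids)
    else:
        eval_ids = rng.sample(list(test_ids), num_samples)
    # Exemplars come only from the training split, so they cannot overlap the test examples.
    shots = {t: rng.sample(list(train_ids), num_shots) for t in eval_ids}
    return eval_ids, shots

train = [f"train_{i}" for i in range(100)]
test = [f"test_{i}" for i in range(50)]
eval_ids, shots = split_for_eval(train, test, num_shots=4, num_samples=10)
full_ids, _ = split_for_eval(train, test, num_shots=4, num_samples=-1)
print(len(eval_ids), len(full_ids))
```

Seeding the generator keeps the subsample and the exemplars reproducible across runs, which matters when comparing models on the same subset.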

## Supported models
This evaluation module interfaces with models using the `EvalModel` class defined in `eval/eval_models/eval_model.py`. The `EvalModel` wrapper standardizes the generation and rank classification interfaces.

To help standardize VLM evaluations, we have implemented EvalModel wrappers for models from three code repositories:

* This open_flamingo repository, i.e. all models created using this repository's `src` code
* The pretrained [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) models. Note that these models can only take in one image per input sequence; they are not to be confused with the BLIP-like implementation in the open_flamingo repository, which can take in arbitrarily interleaved image/text sequences
* Huggingface's [IDEFICS](https://huggingface.co/blog/idefics) models

## Sample scripts
Our codebase uses DistributedDataParallel to parallelize evaluation by default, so please make sure to set the `MASTER_ADDR` and `MASTER_PORT` environment variables or use `torchrun`. We provide a sample Slurm evaluation script in `open_flamingo/open_flamingo/scripts/run_eval.sh`.
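For a quick single-node run without Slurm or `torchrun`, the two environment variables can also be set from Python before the process group is initialized. The sketch below is one possible approach under stated assumptions: `find_free_port` is a hypothetical helper, not part of this repository, and the port it returns could in principle be taken by another process between the check and its use.

```python
import os
import socket

def find_free_port():
    """Ask the OS for an unused TCP port by binding to port 0 (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

# setdefault leaves any values already exported by Slurm or torchrun untouched.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", str(find_free_port()))
print(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"])
```

On multiple nodes this is not sufficient by itself: every rank must agree on the same address and port, so the values would need to be chosen once on the head node and passed to the other ranks.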

2 changes: 1 addition & 1 deletion open_flamingo/scripts/fill_vqa_testdev_results.py
@@ -1,5 +1,5 @@
"""
Helper scripts to prepare a vqa test-dev evaluation for EvalAI submission.
Helper scripts to prepare a Vizwiz or VQAv2 test-dev evaluation for EvalAI submission.
Note: EvalAI requires VQAv2 submissions to have predictions for all the questions in the test2015 set, not just the test-dev set.
Given a json with a subset of the vqa questions, fill in the rest of the questions with an empty string as the model prediction.
"""
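The filling step that the docstring describes can be sketched as follows. This is an illustration only, not the script's actual code: the function name `fill_missing_predictions` is hypothetical, and the `question_id` / `answer` field names are assumed to follow the usual VQA results format.

```python
import json

def fill_missing_predictions(predictions, all_question_ids):
    """Return one entry per question id, with "" where the model made no prediction.

    Hypothetical helper; field names are assumed, not taken from the script.
    """
    answered = {p["question_id"]: p["answer"] for p in predictions}
    return [
        {"question_id": qid, "answer": answered.get(qid, "")}
        for qid in all_question_ids
    ]

preds = [{"question_id": 2, "answer": "yes"}]
filled = fill_missing_predictions(preds, all_question_ids=[1, 2, 3])
print(json.dumps(filled))
```

Questions 1 and 3 receive an empty string, so the submission covers the whole question set while only the test-dev predictions carry real answers.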
@@ -9,6 +9,7 @@ Notes:
- VQAv2 test-dev and test-std annotations are not publicly available.
To evaluate on these splits, please follow the VQAv2 instructions and submit to EvalAI.
This script will evaluate on the val split.
- Vizwiz test-dev annotations are also not publicly available; please go through EvalAI.
com

export PYTHONFAULTHANDLER=1
77 changes: 77 additions & 0 deletions open_flamingo/scripts/run_eval_deepspeed.sh
@@ -0,0 +1,77 @@
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-task=1

<<com
Example Slurm evaluation script.
Notes:
- VQAv2 test-dev and test-std annotations are not publicly available.
To evaluate on these splits, please follow the VQAv2 instructions and submit to EvalAI.
This script will evaluate on the val split.
- Vizwiz test-dev annotations are also not publicly available; please go through EvalAI.
com

export PYTHONFAULTHANDLER=1
export CUDA_LAUNCH_BLOCKING=0
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=$(shuf -i 0-65535 -n 1)
export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`

echo go $COUNT_NODE
echo $HOSTNAMES

export PYTHONPATH="$PYTHONPATH:open_flamingo"
srun --cpu_bind=v --accel-bind=gn python
deepspeed open_flamingo/open_flamingo/eval/evaluate.py \
--vision_encoder_path ViT-L-14 \
--vision_encoder_pretrained openai \
--lm_path anas-awadalla/mpt-1b-redpajama-200b \
--tokenizer_path anas-awadalla/mpt-1b-redpajama-200b \
--cross_attn_every_n_layers 1 \
--checkpoint_path "openflamingo/OpenFlamingo-3B-vitl-mpt1b/checkpoint.pt" \
--results_file "results.json" \
--precision fp32 \
--batch_size 8 \
--deepspeed \
--eval_coco \
--eval_vqav2 \
--eval_flickr30 \
--eval_ok_vqa \
--eval_textvqa \
--eval_vizwiz \
--eval_hateful_memes \
--coco_train_image_dir_path "/path/to/mscoco_karpathy/train2014" \
--coco_val_image_dir_path "/path/to/mscoco_karpathy/val2014" \
--coco_karpathy_json_path "/path/to/mscoco_karpathy/dataset_coco.json" \
--coco_annotations_json_path "/path/to/mscoco_karpathy/annotations/captions_val2014.json" \
--vqav2_train_image_dir_path "/path/to/vqav2/train2014" \
--vqav2_train_annotations_json_path "/path/to/vqav2/v2_mscoco_train2014_annotations.json" \
--vqav2_train_questions_json_path "/path/to/vqav2/v2_OpenEnded_mscoco_train2014_questions.json" \
--vqav2_test_image_dir_path "/path/to/vqav2/val2014" \
--vqav2_test_annotations_json_path "/path/to/vqav2/v2_mscoco_val2014_annotations.json" \
--vqav2_test_questions_json_path "/path/to/vqav2/v2_OpenEnded_mscoco_val2014_questions.json" \
--flickr_image_dir_path "/path/to/flickr30k/flickr30k-images" \
--flickr_karpathy_json_path "/path/to/flickr30k/dataset_flickr30k.json" \
--flickr_annotations_json_path "/path/to/flickr30k/dataset_flickr30k_coco_style.json" \
--ok_vqa_train_image_dir_path "/path/to/okvqa/train2014" \
--ok_vqa_train_annotations_json_path "/path/to/okvqa/mscoco_train2014_annotations.json" \
--ok_vqa_train_questions_json_path "/path/to/okvqa/OpenEnded_mscoco_train2014_questions.json" \
--ok_vqa_test_image_dir_path "/path/to/okvqa/val2014" \
--ok_vqa_test_annotations_json_path "/path/to/okvqa/mscoco_val2014_annotations.json" \
--ok_vqa_test_questions_json_path "/path/to/okvqa/OpenEnded_mscoco_val2014_questions.json" \
--textvqa_image_dir_path "/path/to/textvqa/train_images/" \
--textvqa_train_questions_json_path "/path/to/textvqa/train_questions_vqa_format.json" \
--textvqa_train_annotations_json_path "/path/to/textvqa/train_annotations_vqa_format.json" \
--textvqa_test_questions_json_path "/path/to/textvqa/val_questions_vqa_format.json" \
--textvqa_test_annotations_json_path "/path/to/textvqa/val_annotations_vqa_format.json" \
--vizwiz_train_image_dir_path "/path/to/v7w/train" \
--vizwiz_test_image_dir_path "/path/to/v7w/val" \
--vizwiz_train_questions_json_path "/path/to/v7w/train_questions_vqa_format.json" \
--vizwiz_train_annotations_json_path "/path/to/v7w/train_annotations_vqa_format.json" \
--vizwiz_test_questions_json_path "/path/to/v7w/val_questions_vqa_format.json" \
--vizwiz_test_annotations_json_path "/path/to/v7w/val_annotations_vqa_format.json" \
--hateful_memes_image_dir_path "/path/to/hateful_memes/img" \
--hateful_memes_train_annotations_json_path "/path/to/hateful_memes/train.json" \
--hateful_memes_test_annotations_json_path "/path/to/hateful_memes/dev.json"
34 changes: 34 additions & 0 deletions open_flamingo/scripts/run_train_ddp.sh
@@ -0,0 +1,34 @@
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1
#SBATCH --time=5-00:00:00
#SBATCH --job-name=openflamingo

export PYTHONFAULTHANDLER=1
export CUDA_LAUNCH_BLOCKING=0
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=$(shuf -i 0-65535 -n 1)
export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`

export PYTHONPATH="$PYTHONPATH:open_flamingo"
srun --cpu_bind=v --accel-bind=gn python open_flamingo/open_flamingo/train/train.py \
--lm_path meta-llama/Llama-2-13b \
--tokenizer_path meta-llama/Llama-2-13b \
--model_family flamingo \
--cross_attn_every_n_layers 4 \
--dataset_resampled \
--batch_size_mmc4 16 \
--batch_size_laion 32 \
--train_num_samples_mmc4 125000 \
--train_num_samples_laion 250000 \
--loss_multiplier_laion 0.2 \
--workers=4 \
--run_name "fsdp" \
--num_epochs 480 \
--warmup_steps 0 \
--mmc4_textsim_threshold 0.0 \
--laion_shards "/path/to/laion-samples/{000000..000001}.tar" \
--mmc4_shards "/path/to/mmc4-samples/{000000..000001}.tar" \
--report_to_wandb
@@ -1,26 +1,24 @@
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=6
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1
#SBATCH --account=efml
#SBATCH --partition=gpu
#SBATCH --time=48:00:00
#SBATCH --job-name=flamingo
#SBATCH --time=5-00:00:00
#SBATCH --job-name=openflamingo

export PYTHONFAULTHANDLER=1
export CUDA_LAUNCH_BLOCKING=0
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=15000
export MASTER_PORT=$(shuf -i 0-65535 -n 1)
export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`
export HF_DATASETS_CACHE="/gscratch/efml/anasa2/.huggingface" TRANSFORMERS_CACHE="/gscratch/efml/anasa2/.huggingface"

export PYTHONPATH="$PYTHONPATH:open_flamingo"
srun --cpu_bind=v --accel-bind=gn python

deepspeed open_flamingo/open_flamingo/train/train.py \
--lm_path meta-llama/Llama-2-13b \
--tokenizer_path meta-llama/Llama-2-13b \
--model_family flamingo \
--cross_attn_every_n_layers 4 \
--dataset_resampled \
--batch_size_mmc4 16 \
@@ -34,7 +32,6 @@ deepspeed open_flamingo/open_flamingo/train/train.py \
--num_epochs 480 \
--warmup_steps 0 \
--mmc4_textsim_threshold 0.0 \
--laion_shards "/mmfs1/gscratch/efml/anasa2/laion-samples/{000000..000001}.tar" \
--mmc4_shards "/mmfs1/gscratch/efml/anasa2/mmc4-samples/shard_{0..1}-000000000.tar" \
--gradient_checkpointing \
--report_to_wandb \
--laion_shards "/path/to/laion-samples/{000000..000001}.tar" \
--mmc4_shards "/path/to/mmc4-samples/{000000..000001}.tar" \
--report_to_wandb
40 changes: 40 additions & 0 deletions open_flamingo/scripts/run_train_fsdp.sh
@@ -0,0 +1,40 @@
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1
#SBATCH --time=5-00:00:00
#SBATCH --job-name=openflamingo

<<com
To use FSDP, please make sure to use Pytorch Nightly > 2.0.1!
com

export PYTHONFAULTHANDLER=1
export CUDA_LAUNCH_BLOCKING=0
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=$(shuf -i 0-65535 -n 1)
export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`

export PYTHONPATH="$PYTHONPATH:open_flamingo"
srun --cpu_bind=v --accel-bind=gn python open_flamingo/open_flamingo/train/train.py \
--lm_path meta-llama/Llama-2-13b \
--tokenizer_path meta-llama/Llama-2-13b \
--model_family flamingo \
--cross_attn_every_n_layers 4 \
--dataset_resampled \
--batch_size_mmc4 16 \
--batch_size_laion 32 \
--fsdp \
--fsdp_sharding_strategy hybrid \
--train_num_samples_mmc4 125000 \
--train_num_samples_laion 250000 \
--loss_multiplier_laion 0.2 \
--workers=4 \
--run_name "fsdp" \
--num_epochs 480 \
--warmup_steps 0 \
--mmc4_textsim_threshold 0.0 \
--laion_shards "/path/to/laion-samples/{000000..000001}.tar" \
--mmc4_shards "/path/to/mmc4-samples/{000000..000001}.tar" \
--report_to_wandb
56 changes: 56 additions & 0 deletions open_flamingo/src/README.md
@@ -0,0 +1,56 @@
# OpenFlamingo: Modeling
We provide modules to mix-and-match into several vision-language model architectures.

## What is a VLM?
A **vision-language model (VLM)** is a language model capable of processing a sequence of arbitrarily interleaved images/videos with text to output text.

![A VLM takes in a sequence of interleaved images/videos with text and outputs text.](../../docs/signature.png)

The forward signature of a VLM is as follows:

* `vision_x`: The batch of images / videos to process. This is a tensor of the shape `(B, T_img, F, C, H, W)`, where `B` is the batch dimension, `T_img` collates the images/videos within one input sequence, `F` collates frames within a video, and `(C, H, W)` are the channel, height, and width dimensions respectively.
* `lang_x`: The batch of input_ids (text) to process. This is a tensor of the shape `(B, T_txt)`, where `T_txt` is the number of text tokens within one input sequence.

To tell the model how to interleave the image/text elements within a sequence, `lang_x` should include `<image>` tokens ("media tokens") that specify where the images/videos are placed (see the figure below).

![Illustration of what the inputs to a VLM look like.](../../docs/inputs.png)
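As a rough illustration of how the two inputs line up, the sketch below derives the `vision_x` shape for a toy batch from the number of `<image>` tokens in each text sequence. The example strings, the single frame per image (`F = 1`), and the `3 x 224 x 224` image size are assumptions chosen for the example; real inputs would be tensors produced by an image processor and a tokenizer.

```python
MEDIA_TOKEN = "<image>"

def count_media_tokens(text):
    """Number of images/videos referenced by one text sequence."""
    return text.count(MEDIA_TOKEN)

# Toy batch: two sequences, each interleaving two images with text.
batch_text = [
    "<image>A photo of a cat.<image>A photo of a dog.",
    "<image>Two birds on a wire.<image>A red bus.",
]

B = len(batch_text)
T_img = count_media_tokens(batch_text[0])
# A dense tensor needs the same T_img for every sequence in the batch.
assert all(count_media_tokens(t) == T_img for t in batch_text)

# (B, T_img, F, C, H, W); F = 1 because these are still images (assumed sizes).
vision_x_shape = (B, T_img, 1, 3, 224, 224)
print(vision_x_shape)
```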


## VLM modeling with the open_flamingo repository
This repository provides modules for constructing various VLM architectures.

All models inherit from the `VLM` (vision-language model) class defined in `src/vlm.py`. As documented there, a VLM is defined by four component modules:
1. A **vision encoder** that extracts features from pixels (e.g. CLIP). This module should take in vision inputs of the shape `(B, T_img, F, C, H, W)` and output features of the shape `(B, T_img, F, v, d)`.
2. A **vision tokenizer** that converts features from the vision encoder into token-like embeddings (e.g. PerceiverResampler). This module should take in vision features of the shape `(B, T_img, F, v, d)` and output tokens of the shape `(B, T_img, n, d)`.
3. A fusion method that allows the language model to attend to these tokens, e.g. cross-attention (as done in [Flamingo](https://arxiv.org/abs/2204.14198)), or placing the tokens directly in the language model's input sequence (as done in [Kosmos](https://arxiv.org/abs/2306.14824)).
4. A language model.

This repository allows us to construct architectures by mixing-and-matching options for all four kinds of modules.
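The shape contract between the first two modules can be traced without any tensors. The following is a shape-only sketch: the function names and the concrete values of `v`, `d`, and `n` are hypothetical and are not taken from `src/vlm.py`.

```python
def vision_encoder_shape(vision_x_shape, v, d):
    """(B, T_img, F, C, H, W) -> (B, T_img, F, v, d): pixels become v features of width d."""
    B, T_img, F, C, H, W = vision_x_shape
    return (B, T_img, F, v, d)

def vision_tokenizer_shape(features_shape, n):
    """(B, T_img, F, v, d) -> (B, T_img, n, d): frames and features collapse to n tokens."""
    B, T_img, F, v, d = features_shape
    return (B, T_img, n, d)

vision_x_shape = (2, 3, 1, 3, 224, 224)          # assumed example input
features = vision_encoder_shape(vision_x_shape, v=256, d=1024)
tokens = vision_tokenizer_shape(features, n=64)
print(features)  # (2, 3, 1, 256, 1024)
print(tokens)    # (2, 3, 64, 1024)
```

Note that the tokenizer removes both the frame axis `F` and the per-frame feature axis `v`, so each image or video is represented by a fixed number `n` of tokens regardless of its length.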

### Supported vision encoders
All CLIP-style encoders from the [OpenCLIP](https://github.com/mlfoundations/open_clip) library are supported. This includes OpenAI's models.

### Supported vision tokenizers
* [Perceiver Resampler](https://arxiv.org/abs/2103.03206)
* [Q-former](https://arxiv.org/abs/2301.12597)
* Linear projection

### Supported fusion methods
Models are further split into those that inherit from `VLMWithCrossAttention` (dense cross attention to fuse vision + language, Flamingo-style) vs. `VLMWithLanguageStream` (insert vision tokens into the language stream, Kosmos-style).

![A VLM with cross attention and a VLM with language stream represent two methods for fusing the vision and language inputs.](../../docs/xattn_langstream.png)
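To make the language-stream idea concrete, here is a toy sketch that splices each image's vision tokens into the text sequence at the position of its media token. It uses plain Python lists in place of embeddings, and it is not the repository's implementation: the function name is hypothetical, and whether the media token itself is kept or replaced is an assumption (here it is replaced).

```python
def insert_vision_tokens(text_tokens, vision_tokens_per_image, media_token="<image>"):
    """Language-stream fusion sketch: replace each media token with that image's tokens."""
    images = iter(vision_tokens_per_image)
    fused = []
    for tok in text_tokens:
        if tok == media_token:
            fused.extend(next(images))  # consume the next image's tokens in order
        else:
            fused.append(tok)
    return fused

text = ["<image>", "a", "cat", "<image>", "a", "dog"]
vision = [["v0_0", "v0_1"], ["v1_0", "v1_1"]]
fused = insert_vision_tokens(text, vision)
print(fused)
```

The fused sequence is longer than the original text, which is the main practical difference from cross attention: there the language sequence keeps its length and the vision tokens are attended to from separate layers.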

### Supported language models
All autoregressive language models from [Huggingface Transformers](https://huggingface.co/models) are supported.

## Example architectures
Using these modules, the following architectures are implemented as examples.

|Model|Vision tokenizer|Fusion method|Trainable parameters|
|----|------------|------------|------------|
|[Flamingo](https://arxiv.org/abs/2204.14198)|Perceiver|Cross attention|Added language model embeddings, vision tokenizer|
|[Kosmos](https://arxiv.org/abs/2306.14824)|Perceiver|Language stream|Everything except the vision encoder|
|[BLIP](https://arxiv.org/abs/2301.12597)|Q-former|Language stream|Added language model embeddings, vision tokenizer|

We welcome contributions! If you'd like to add additional vision tokenizers, fusion methods, or model types, please open a PR.

36 changes: 29 additions & 7 deletions open_flamingo/train/README.md
@@ -1,9 +1,23 @@
# OpenFlamingo Training
To train OpenFlamingo, please ensure your environment matches that of `environment.yml`.
We provide efficient data loading and distributed training code.
To train with OpenFlamingo, please ensure your environment matches that of `environment.yml`.

Table of contents:

* [Data](#data)
* [Example commands](#example-training-command)
* [Distributed training](#distributed-training)

## Data
Our codebase uses [WebDataset](https://github.com/webdataset/webdataset) to efficiently load `.tar` files containing image and text sequences. We recommend resampling shards with replacement during training using the `--dataset_resampled` flag.
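Resampling with replacement means that each shard is drawn independently from the full shard list, so a shard can appear several times (or not at all) within a given number of draws. The sketch below illustrates only that sampling behavior in plain Python; it does not use WebDataset itself, and both helper names are hypothetical.

```python
import random

def expand_shards(prefix, start, end, width, suffix=".tar"):
    """Expand a zero-padded numeric range, e.g. {000000..000001}, into shard paths."""
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(start, end + 1)]

def resampled_shards(shards, num_draws, seed=0):
    """Draw shards independently with replacement (illustrative stand-in)."""
    rng = random.Random(seed)
    return [rng.choice(shards) for _ in range(num_draws)]

shards = expand_shards("/path/to/laion-samples/", 0, 1, width=6)
draws = resampled_shards(shards, num_draws=5)
print(shards)
print(draws)
```

Because draws are independent, an "epoch" under resampling is defined by a sample count rather than by one full pass over the shards.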

Supported pretraining datasets:
* LAION-2B
* Multimodal C4 (MMC4)
* ChatGPT-generated sequences from the OpenFlamingo [technical report](https://arxiv.org/abs/2308.01390)

We plan to add additional datasets in the future, and we welcome contributions! If you'd like to add support for a pretraining dataset, please open a PR.

### LAION-2B Dataset
[LAION-2B](https://arxiv.org/abs/2210.08402) contains 2B web-scraped (image, text) pairs.
We use [img2dataset](https://github.com/rom1504/img2dataset) to download this dataset into tar files.
@@ -27,7 +41,7 @@ Models trained with ChatGPT-generated sequences:
* OpenFlamingo-4B-vitl-rpj3b-langinstruct

## Example training command
We provide a sample Slurm training script in `scripts/`. You can also modify the following command:
We provide sample Slurm training scripts in `scripts/`. You can also modify the following command:

```
torchrun --nnodes=1 --nproc_per_node=4 train.py \
@@ -52,9 +66,17 @@ torchrun --nnodes=1 --nproc_per_node=4 train.py \
*Note: The MPT-1B [base](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b) and [instruct](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b-dolly) modeling code does not accept the `labels` kwarg or compute cross-entropy loss directly within `forward()`, as expected by our codebase. We suggest using a modified version of the MPT-1B models found [here](https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b) and [here](https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b-dolly).*

## Distributed training
Our codebase supports distributed training using three frameworks:

* Pytorch's [DistributedDataParallel](https://pytorch.org/docs/stable/torch.nn.parallel.DistributedDataParallel.html). This is the default method used by `train.py`.
* Pytorch's [FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html) (FSDP). Use the `--fsdp` flag.
* [DeepSpeed](https://github.com/microsoft/DeepSpeed) stages 1-3. Use the `--deepspeed` flag.

Note that you should use exactly one of these training methods.
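One way to enforce the "exactly one method" rule at the command line is an `argparse` mutually exclusive group, with DDP as the fallback when neither flag is given. This is a sketch of that idea, not a description of how `train.py` actually validates its arguments; the function name `select_method` is hypothetical.

```python
import argparse

def select_method(argv):
    """Return "ddp", "fsdp", or "deepspeed"; reject --fsdp combined with --deepspeed."""
    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--fsdp", action="store_true")
    group.add_argument("--deepspeed", action="store_true")
    args = parser.parse_args(argv)
    if args.fsdp:
        return "fsdp"
    if args.deepspeed:
        return "deepspeed"
    return "ddp"  # default when no flag is passed

print(select_method(["--fsdp"]))
```

Passing both flags makes `argparse` print a usage error and exit, so a misconfigured job fails before any model is loaded.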

`train/distributed.py` contains utilities to help with setting up distributed training using Slurm / `torchrun`. See example scripts in the `scripts` directory.

### FSDP notes
To use FSDP, make sure to use Pytorch Nightly (> 2.0.1).

By default, `train.py` uses Pytorch's [DistributedDataParallel](https://pytorch.org/docs/stable/torch.nn.parallel.DistributedDataParallel.html) for training.
To use [FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html), make sure to use Pytorch Nightly (> 2.0.1), and use the `--fsdp` flag.
To use [DeepSpeed](https://github.com/microsoft/DeepSpeed), use the `--deepspeed` flag.
(Note that you should use *either* FSDP or Deepspeed, not both.)
We also implement gradient checkpointing and mixed precision training. Use the `--gradient_checkpointing` and `--precision` arguments respectively.
We support two sharding strategies for FSDP: full sharding (model sharding across all nodes and GPUs) or hybrid sharding (model sharding across GPUs within nodes, data parallel between nodes). The former saves GPU memory; the latter saves on communication costs.
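The memory side of the trade-off between full and hybrid sharding can be made concrete with a little arithmetic. The sketch below counts only sharded model parameters per GPU and ignores gradients, optimizer state, activations, and any parameters left unsharded; the 13B parameter count and the 4 x 8 cluster layout are assumed example values, and the function name is hypothetical.

```python
def params_per_gpu(total_params, num_nodes, gpus_per_node, strategy):
    """Parameters held by each GPU under a given sharding strategy (parameters only)."""
    if strategy == "full":
        shard_group = num_nodes * gpus_per_node   # shard across every GPU in the job
    elif strategy == "hybrid":
        shard_group = gpus_per_node               # shard within a node, replicate across nodes
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return total_params // shard_group

total = 13_000_000_000
full = params_per_gpu(total, num_nodes=4, gpus_per_node=8, strategy="full")
hybrid = params_per_gpu(total, num_nodes=4, gpus_per_node=8, strategy="hybrid")
print(full)    # 406250000
print(hybrid)  # 1625000000
```

With four nodes, hybrid sharding holds four times as many parameters per GPU as full sharding, in exchange for keeping parameter gathering inside each node.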
4 changes: 1 addition & 3 deletions requirements-eval.txt
@@ -5,9 +5,7 @@ inflection
pycocoevalcap
pycocotools
tqdm

black
mypy
pylint
pytest
requests
requests
1 change: 1 addition & 0 deletions requirements-training.txt
@@ -3,3 +3,4 @@ braceexpand
webdataset
tqdm
wandb
deepspeed
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,7 +1,7 @@
einops
einops-exts
transformers>=4.28.1
torch==2.0.1
torch>=2.0.1
pillow
open_clip_torch>=2.16.0
sentencepiece==0.1.98
