[Benchmark] Add MMMU dataset (open-compass#18)
* Add MMMU dataset

* Add MMMU dataset

* update

* update

* update

* update

* update

* Delete GPT4V_INT directory

* Delete idefics_9b_instruct directory

* Delete idefics_9b_instruct_result directory

* Delete qwen_chat directory

* Delete images/MMMU directory

* update

* remove font

* revert_timeout

* update MMMU Eval

* update can_infer_text

* update idefics & llava

* update QwenVL

* update smp.py

* update xcomposer

* rename as evaluate

* update

* update

* support interleave_generate

* update

* update inference

* update MMMU

* update

* update MMMU md5

* update

* fix

* update

* update MMMU

---------

Co-authored-by: “llllIlllll” <“[email protected]”>
Co-authored-by: kennymckormick <[email protected]>
3 people authored Dec 26, 2023
1 parent 2a75d53 commit 2e70185
Showing 26 changed files with 269 additions and 186 deletions.
25 changes: 13 additions & 12 deletions Custom_Benchmark_and_Model.md
@@ -4,18 +4,19 @@

Currently, we organize each benchmark as a single TSV file. During inference, the data file is automatically downloaded to `$LMUData` (defaults to `$HOME/LMUData` if not set explicitly). All existing benchmark TSV files are handled by `TSVDataset`, implemented in `vlmeval/utils/data_util.py`.
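
As a minimal sketch, the snippet below shows how one might load and inspect such a benchmark TSV through `TSVDataset`; the import path, constructor argument, and `data` attribute are assumptions based on the description above, and the exact interface may differ in `vlmeval/utils/data_util.py`.

```python
# Minimal sketch, assuming TSVDataset takes a dataset name and exposes the
# parsed TSV as a pandas DataFrame (exact attribute names may differ).
from vlmeval.utils import TSVDataset

dataset = TSVDataset('MMMU_DEV_VAL')  # downloads the TSV to $LMUData if missing
print(len(dataset))                   # number of samples
print(dataset.data.iloc[0])           # first row: index, image, question, ...
```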

| Dataset Name \ Fields | index | image | image_path | question | hint | A | B | C | D | answer | category | l2-category | split |
| ---------------------- | ----- | ----- | ---------- | -------- | ---- | ---- | ---- | ---- | ---- | ------ | -------- | ----------- | ----- |
| MMBench_DEV_[CN/EN] ||| |||||||||||
| MMBench_TEST_[CN/EN] ||| ||||||| ||||
| CCBench ||| || ||||||| | |
| SEEDBench_IMG ||| || ||||||| | |
| MME ||| || | | | | ||| | |
| CORE_MM ||||| | | | | | || | |
| MMVet ||| || | | | | ||| | |
| COCO_VAL ||| | | | | | | || | | |
| OCRVQA_[TEST/TESTCORE] ||| || | | | | || | | |
| TextVQA_VAL ||| || | | | | || | | |
| Dataset Name \ Fields | index | image | image_path | question | hint | multi-choice<br>options | answer | category | l2-category | split |
| ---------------------- | ----- | ----- | ---------- | -------- | ---- | ----------------------- | ------ | -------- | ----------- | ----- |
| MMBench_DEV_[CN/EN] ||| ||||||||
| MMBench_TEST_[CN/EN] ||| |||| ||||
| CCBench ||| || |||| | |
| SEEDBench_IMG ||| || |||| | |
| MME ||| || | ||| | |
| CORE_MM ||||| | | || | |
| MMVet ||| || | ||| | |
| MMMU_DEV_VAL ||||| ||||||
| COCO_VAL ||| | | | || | | |
| OCRVQA_[TEST/TESTCORE] ||| || | || | | |
| TextVQA_VAL ||| || | || | | |

<div align="center"><b>Table 1. TSV fields of supported datasets.</b></div>

22 changes: 12 additions & 10 deletions README.md
@@ -13,6 +13,7 @@

## 🆕 News

- **[2023-12-26]** We now support MMMU (Dataset Name: MMMU_DEV_VAL). Evaluation results are available at [**MMMU**](results/MMMU.md). 🔥🔥🔥
- **[2023-12-24]** We support two VQA datasets: **OCRVQA** (Dataset Names: OCRVQA_TEST, OCRVQA_TESTCORE) and **TextVQA** (Dataset Name: TextVQA_VAL). Evaluation is in progress. 🔥🔥🔥
- **[2023-12-23]** We have updated the performance of GPT-4v and GeminiPro on all existing benchmarks; [**check the results**](https://opencompass.org.cn/leaderboard-multimodal). 🔥🔥🔥
- **[2023-12-20]** We support a new benchmark: **COCO Caption** (Dataset Name: COCO_VAL). Evaluation is in progress. 🔥🔥🔥
@@ -22,16 +22,17 @@

**Supported Datasets**

| Dataset | Inference | Evaluation | Results |
| ------------------------------------------------------------ | --------- | ---------- | ------------------------------------------------------------ |
| [**MMBench Series**](https://github.com/open-compass/mmbench/): MMBench, MMBench-CN, CCBench ||| [**MMBench Series**](https://mmbench.opencompass.org.cn/leaderboard) |
| [**MME**](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) ||| [**MME**](results/MME.md) |
| [**SEEDBench_IMG**](https://github.com/AILab-CVC/SEED-Bench) ||| [**SEEDBench_IMG**](results/SEEDBench_IMG.md) |
| [**MM-Vet**](https://github.com/yuweihao/MM-Vet) ||| [**MM-Vet**](results/MMVet.md) |
| [**COCO Caption**](https://cocodataset.org) ||| |
| [**OCRVQA**](https://ocr-vqa.github.io) ||| |
| [**TextVQA**](https://textvqa.org) ||| |
| [**Core-MM**](https://github.com/core-mm/core-mm) || | |
| Dataset | Dataset Names (for run.py) | Inference | Evaluation | Results |
| ------------------------------------------------------------ | ------------------------------------------------------ | --------- | ---------- | ------------------------------------------------------------ |
| [**MMBench Series**](https://github.com/open-compass/mmbench/): <br>MMBench, MMBench-CN, CCBench | MMBench-DEV-[EN/CN]<br>MMBench-TEST-[EN/CN]<br>CCBench ||| [**MMBench Series**](https://mmbench.opencompass.org.cn/leaderboard) |
| [**MME**](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) | MME ||| [**MME**](results/MME.md) |
| [**SEEDBench_IMG**](https://github.com/AILab-CVC/SEED-Bench) | SEEDBench_IMG ||| [**SEEDBench_IMG**](results/SEEDBench_IMG.md) |
| [**MM-Vet**](https://github.com/yuweihao/MM-Vet) | MMVet ||| [**MM-Vet**](results/MMVet.md) |
| [**MMMU**](https://mmmu-benchmark.github.io) | MMMU_DEV_VAL ||| [**MMMU**](results/MMMU.md) |
| [**COCO Caption**](https://cocodataset.org) | COCO_VAL ||| |
| [**OCRVQA**](https://ocr-vqa.github.io) | OCRVQA_TESTCORE, OCRVQA_TEST ||| |
| [**TextVQA**](https://textvqa.org) | TextVQA_VAL ||| |
| [**Core-MM**](https://github.com/core-mm/core-mm) | CORE_MM || | |

**Supported API Models**

34 changes: 34 additions & 0 deletions results/MMMU.md
@@ -0,0 +1,34 @@
# MMMU Evaluation Results

> - For MMMU, we evaluate the `dev` (150 samples) and `validation` (900 samples) sets.
> - **Answer Inference:**
>   - For models with an `interleave_generate` interface (which accepts interleaved images & texts as input), all test samples can be inferred. **`interleave_generate` is adopted for inference.**
>   - For models without an `interleave_generate` interface, samples with more than one image are skipped (42 out of 1050, counted directly as wrong). **`generate` is adopted for inference.**
> - **Evaluation**:
>   - MMMU includes two types of questions: **multi-choice questions** & **open-ended QA**.
>   - For **open-ended QA (62/1050)**, we reformulate each question as a multi-choice question: `{'question': 'QQQ', 'answer': 'AAA'} -> {'question': 'QQQ', 'A': 'AAA', 'B': 'Other Answers', 'answer': 'A'}` (see the sketch below), and then adopt the same evaluation paradigm as for **multi-choice questions**.
>   - For **multi-choice questions (988/1050)**, we use **GPT-3.5-Turbo-0613** to match the prediction against the options when heuristic matching does not work.
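
The following is an illustrative Python sketch of the open-ended reformulation described above; the function name and field handling are assumptions for clarity, not the exact code used in vlmeval.

```python
# Illustrative sketch (not the actual vlmeval implementation): turn an
# open-ended QA sample into a 2-way multi-choice sample, as described above.
def reformulate_open_ended(sample: dict) -> dict:
    return {
        'question': sample['question'],
        'A': sample['answer'],    # ground-truth answer becomes option A
        'B': 'Other Answers',     # catch-all distractor
        'answer': 'A',            # the correct option is always A
    }

# Example:
# reformulate_open_ended({'question': 'QQQ', 'answer': 'AAA'})
# -> {'question': 'QQQ', 'A': 'AAA', 'B': 'Other Answers', 'answer': 'A'}
```
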
### MMMU Scores

| Model | Overall<br>(Val) | Overall<br>(Dev) | Art & Design<br>(Val) | Business<br>(Val) | Science<br>(Val) | Health & Medicine<br>(Val) | Humanities & Social Science<br>(Val) | Tech & Engineering<br>(Val) |
| :------------------- | ---------------: | ---------------: | --------------------: | ----------------: | ---------------: | -------------------------: | -----------------------------------: | --------------------------: |
| qwen_chat | 37.6 | 30 | 49.2 | 36 | 28 | 32.7 | 55.8 | 31.9 |
| llava_v1.5_13b | 36.8 | 42 | 49.2 | 23.3 | 36 | 34 | 51.7 | 33.3 |
| sharegpt4v_7b | 36.7 | 30 | 50 | 27.3 | 26.7 | 37.3 | 50 | 34.8 |
| TransCore_M | 36.6 | 38.7 | 54.2 | 32 | 27.3 | 32 | 49.2 | 32.4 |
| llava_v1.5_7b | 36.1 | 38.7 | 45.8 | 25.3 | 34 | 32 | 48.3 | 35.7 |
| instructblip_13b | 32.9 | 30 | 37.5 | 29.3 | 32 | 28.7 | 37.5 | 33.8 |
| PandaGPT_13B | 32.7 | 26.7 | 42.5 | 35.3 | 30 | 29.3 | 45.8 | 21.9 |
| llava_v1_7b | 32.1 | 33.3 | 31.7 | 24.7 | 31.3 | 32 | 37.5 | 35.2 |
| instructblip_7b | 30.4 | 24 | 38.3 | 28 | 22 | 30.7 | 39.2 | 28.6 |
| VisualGLM_6b | 28.9 | 28.7 | 30 | 24 | 28 | 28 | 40.8 | 26.2 |
| qwen_base | 28.8 | 29.3 | 43.3 | 18.7 | 25.3 | 32.7 | 42.5 | 19.5 |
| flamingov2 | 28.2 | 21.3 | 27.5 | 30 | 28.7 | 28 | 33.3 | 24.3 |
| **Frequent Choice** | **26.8** | | | | | | | |
| MiniGPT-4-v1-13B | 26.2 | 23.3 | 33.3 | 19.3 | 28.7 | 26 | 34.2 | 21 |
| idefics_80b_instruct | 25.1 | 23.3 | 39.2 | 17.3 | 23.3 | 24 | 48.3 | 11.4 |
| MiniGPT-4-v2 | 24.6 | 32 | 27.5 | 22.7 | 21.3 | 28 | 33.3 | 19 |
| MiniGPT-4-v1-7B | 23 | 19.3 | 32.5 | 27.3 | 18.7 | 17.3 | 15 | 26.2 |
| **Random Choice** | **22.1** | | | | | | | |
| idefics_9b_instruct | 19.6 | 20 | 22.5 | 11.3 | 20.7 | 23.3 | 31.7 | 13.3 |
4 changes: 2 additions & 2 deletions run.py
@@ -1,7 +1,7 @@
import torch
import torch.distributed as dist
from vlmeval.smp import *
from vlmeval.eval import COCO_eval, MME_eval, MMVet_eval, multiple_choice_eval, MME_rating, VQAEval
from vlmeval.evaluate import COCO_eval, MME_eval, MMVet_eval, multiple_choice_eval, MME_rating, VQAEval
from vlmeval.inference import infer_data_job, prefetch_acc
from vlmeval.config import supported_VLM

@@ -65,7 +65,7 @@ def main():
dump(res, result_file.replace('.xlsx', '_prefetch.xlsx'))

if rank == 0 and args.mode == 'all':
if listinstr(['MMBench', 'CCBench', 'SEEDBench_IMG'], dataset_name):
if listinstr(['MMBench', 'CCBench', 'SEEDBench_IMG', 'MMMU'], dataset_name):
multiple_choice_eval(result_file, dataset=dataset_name, model='chatgpt-0613', nproc=args.nproc, verbose=args.verbose)
elif dataset_name == 'MME':
MME_eval(result_file, model='chatgpt-0613', nproc=args.nproc, verbose=args.verbose)
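
For context, a hedged stand-alone use of the dispatch shown above: the keyword arguments mirror the `multiple_choice_eval` call in this hunk, while the prediction-file name is a made-up example.

```python
# Illustrative only: evaluate an existing MMMU prediction file directly,
# mirroring the multiple_choice_eval call dispatched by run.py above.
from vlmeval.evaluate import multiple_choice_eval

multiple_choice_eval(
    'qwen_chat/qwen_chat_MMMU_DEV_VAL.xlsx',  # hypothetical prediction file
    dataset='MMMU_DEV_VAL',
    model='chatgpt-0613',  # GPT-3.5-Turbo-0613 as the fallback answer matcher
    nproc=4,
    verbose=True,
)
```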
1 change: 1 addition & 0 deletions setup.py
@@ -21,6 +21,7 @@
openpyxl
seaborn
tabulate
xlsxwriter
"""


2 changes: 1 addition & 1 deletion vlmeval/__init__.py
@@ -5,7 +5,7 @@

from .smp import *
from .api import *
from .eval import *
from .evaluate import *
from .utils import *
from .vlm import *
from .config import *
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
