[Benchmark] Add MMMU dataset (open-compass#18)
* Add MMMU dataset

* Add MMMU dataset

* update

* update

* update

* update

* update

* Delete GPT4V_INT directory

* Delete idefics_9b_instruct directory

* Delete idefics_9b_instruct_result directory

* Delete qwen_chat directory

* Delete images/MMMU directory

* update

* remove font

* revert_timeout

* update MMMU Eval

* update can_infer_text

* update idefics & llava

* update QwenVL

* update smp.py

* update xcomposer

* rename as evaluate

* update

* update

* support interleave_generate

* update

* update inference

* update MMMU

* update

* update MMMU md5

* update

* fix

* update

* update MMMU

---------

Co-authored-by: “llllIlllll” <“[email protected]”>
Co-authored-by: kennymckormick <[email protected]>
3 people authored Dec 26, 2023
1 parent 2a75d53 commit 2e70185
Showing 26 changed files with 269 additions and 186 deletions.
25 changes: 13 additions & 12 deletions Custom_Benchmark_and_Model.md
@@ -4,18 +4,19 @@

Currently, we organize each benchmark as a single TSV file. During inference, the data file is automatically downloaded to `$LMUData` (defaults to `$HOME/LMUData` if not set explicitly). All existing benchmark TSV files are handled by `TSVDataset`, implemented in `vlmeval/utils/data_util.py`.
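
As a minimal sketch, the snippet below shows how one might load and inspect such a benchmark TSV through `TSVDataset`; the import path, constructor argument, and `data` attribute are assumptions based on the description above, and the exact interface may differ in `vlmeval/utils/data_util.py`.

```python
# Minimal sketch, assuming TSVDataset takes a dataset name and exposes the
# parsed TSV as a pandas DataFrame (exact attribute names may differ).
from vlmeval.utils import TSVDataset

dataset = TSVDataset('MMMU_DEV_VAL')  # downloads the TSV to $LMUData if missing
print(len(dataset))                   # number of samples
print(dataset.data.iloc[0])           # first row: index, image, question, ...
```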

| Dataset Name \ Fields | index | image | image_path | question | hint | A | B | C | D | answer | category | l2-category | split |
| ---------------------- | ----- | ----- | ---------- | -------- | ---- | ---- | ---- | ---- | ---- | ------ | -------- | ----------- | ----- |
| MMBench_DEV_[CN/EN] ||| |||||||||||
| MMBench_TEST_[CN/EN] ||| ||||||| ||||
| CCBench ||| || ||||||| | |
| SEEDBench_IMG ||| || ||||||| | |
| MME ||| || | | | | ||| | |
| CORE_MM ||||| | | | | | || | |
| MMVet ||| || | | | | ||| | |
| COCO_VAL ||| | | | | | | || | | |
| OCRVQA_[TEST/TESTCORE] ||| || | | | | || | | |
| TextVQA_VAL ||| || | | | | || | | |
| Dataset Name \ Fields | index | image | image_path | question | hint | multi-choice<br>options | answer | category | l2-category | split |
| ---------------------- | ----- | ----- | ---------- | -------- | ---- | ----------------------- | ------ | -------- | ----------- | ----- |
| MMBench_DEV_[CN/EN] ||| ||||||||
| MMBench_TEST_[CN/EN] ||| |||| ||||
| CCBench ||| || |||| | |
| SEEDBench_IMG ||| || |||| | |
| MME ||| || | ||| | |
| CORE_MM ||||| | | || | |
| MMVet ||| || | ||| | |
| MMMU_DEV_VAL ||||| ||||||
| COCO_VAL ||| | | | || | | |
| OCRVQA_[TEST/TESTCORE] ||| || | || | | |
| TextVQA_VAL ||| || | || | | |

<div align="center"><b>Table 1. TSV fields of supported datasets.</b></div>

22 changes: 12 additions & 10 deletions README.md
@@ -13,6 +13,7 @@

## 🆕 News

- **[2023-12-26]** We now support MMMU (Dataset Name: MMMU_DEV_VAL). Evaluation results are available at [**MMMU**](results/MMMU.md). 🔥🔥🔥
- **[2023-12-24]** We support two VQA datasets: **OCRVQA** (Dataset Names: OCRVQA_TEST, OCRVQA_TESTCORE) and **TextVQA** (Dataset Name: TextVQA_VAL). Evaluation is in progress. 🔥🔥🔥
- **[2023-12-23]** We have updated the performance of GPT-4v and GeminiPro on all existing benchmarks; [**check the results**](https://opencompass.org.cn/leaderboard-multimodal). 🔥🔥🔥
- **[2023-12-20]** We support a new benchmark: **COCO Caption** (Dataset Name: COCO_VAL). Evaluation is in progress. 🔥🔥🔥
@@ -22,16 +22,17 @@

**Supported Datasets**

| Dataset | Inference | Evaluation | Results |
| ------------------------------------------------------------ | --------- | ---------- | ------------------------------------------------------------ |
| [**MMBench Series**](https://github.com/open-compass/mmbench/): MMBench, MMBench-CN, CCBench ||| [**MMBench Series**](https://mmbench.opencompass.org.cn/leaderboard) |
| [**MME**](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) ||| [**MME**](results/MME.md) |
| [**SEEDBench_IMG**](https://github.com/AILab-CVC/SEED-Bench) ||| [**SEEDBench_IMG**](results/SEEDBench_IMG.md) |
| [**MM-Vet**](https://github.com/yuweihao/MM-Vet) ||| [**MM-Vet**](results/MMVet.md) |
| [**COCO Caption**](https://cocodataset.org) ||| |
| [**OCRVQA**](https://ocr-vqa.github.io) ||| |
| [**TextVQA**](https://textvqa.org) ||| |
| [**Core-MM**](https://github.com/core-mm/core-mm) || | |
| Dataset | Dataset Names (for run.py) | Inference | Evaluation | Results |
| ------------------------------------------------------------ | ------------------------------------------------------ | --------- | ---------- | ------------------------------------------------------------ |
| [**MMBench Series**](https://github.com/open-compass/mmbench/): <br>MMBench, MMBench-CN, CCBench | MMBench-DEV-[EN/CN]<br>MMBench-TEST-[EN/CN]<br>CCBench ||| [**MMBench Series**](https://mmbench.opencompass.org.cn/leaderboard) |
| [**MME**](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) | MME ||| [**MME**](results/MME.md) |
| [**SEEDBench_IMG**](https://github.com/AILab-CVC/SEED-Bench) | SEEDBench_IMG ||| [**SEEDBench_IMG**](results/SEEDBench_IMG.md) |
| [**MM-Vet**](https://github.com/yuweihao/MM-Vet) | MMVet ||| [**MM-Vet**](results/MMVet.md) |
| [**MMMU**](https://mmmu-benchmark.github.io) | MMMU_DEV_VAL ||| [**MMMU**](results/MMMU.md) |
| [**COCO Caption**](https://cocodataset.org) | COCO_VAL ||| |
| [**OCRVQA**](https://ocr-vqa.github.io) | OCRVQA_TESTCORE, OCRVQA_TEST ||| |
| [**TextVQA**](https://textvqa.org) | TextVQA_VAL ||| |
| [**Core-MM**](https://github.com/core-mm/core-mm) | CORE_MM || | |

**Supported API Models**

34 changes: 34 additions & 0 deletions results/MMMU.md
@@ -0,0 +1,34 @@
# MMMU Evaluation Results

> - For MMMU, we evaluate the `dev` (150 samples) and `validation` (900 samples) sets.
> - **Answer Inference:**
>   - For models with an `interleave_generate` interface (which accepts interleaved images & texts as input), all test samples can be inferred. **`interleave_generate` is adopted for inference.**
>   - For models without an `interleave_generate` interface, samples with more than one image are skipped (42 out of 1050, counted directly as wrong). **`generate` is adopted for inference.**
> - **Evaluation**:
>   - MMMU includes two types of questions: **multi-choice questions** & **open-ended QA**.
>   - For **open-ended QA (62/1050)**, we reformulate each question as a multi-choice question: `{'question': 'QQQ', 'answer': 'AAA'} -> {'question': 'QQQ', 'A': 'AAA', 'B': 'Other Answers', 'answer': 'A'}` (see the sketch below), and then adopt the same evaluation paradigm as for **multi-choice questions**.
>   - For **multi-choice questions (988/1050)**, we use **GPT-3.5-Turbo-0613** to match the prediction against the options when heuristic matching does not work.
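
The following is an illustrative Python sketch of the open-ended reformulation described above; the function name and field handling are assumptions for clarity, not the exact code used in vlmeval.

```python
# Illustrative sketch (not the actual vlmeval implementation): turn an
# open-ended QA sample into a 2-way multi-choice sample, as described above.
def reformulate_open_ended(sample: dict) -> dict:
    return {
        'question': sample['question'],
        'A': sample['answer'],    # ground-truth answer becomes option A
        'B': 'Other Answers',     # catch-all distractor
        'answer': 'A',            # the correct option is always A
    }

# Example:
# reformulate_open_ended({'question': 'QQQ', 'answer': 'AAA'})
# -> {'question': 'QQQ', 'A': 'AAA', 'B': 'Other Answers', 'answer': 'A'}
```
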
### MMMU Scores

| Model | Overall<br>(Val) | Overall<br>(Dev) | Art & Design<br>(Val) | Business<br>(Val) | Science<br>(Val) | Health & Medicine<br>(Val) | Humanities & Social Science<br>(Val) | Tech & Engineering<br>(Val) |
| :------------------- | ---------------: | ---------------: | --------------------: | ----------------: | ---------------: | -------------------------: | -----------------------------------: | --------------------------: |
| qwen_chat | 37.6 | 30 | 49.2 | 36 | 28 | 32.7 | 55.8 | 31.9 |
| llava_v1.5_13b | 36.8 | 42 | 49.2 | 23.3 | 36 | 34 | 51.7 | 33.3 |
| sharegpt4v_7b | 36.7 | 30 | 50 | 27.3 | 26.7 | 37.3 | 50 | 34.8 |
| TransCore_M | 36.6 | 38.7 | 54.2 | 32 | 27.3 | 32 | 49.2 | 32.4 |
| llava_v1.5_7b | 36.1 | 38.7 | 45.8 | 25.3 | 34 | 32 | 48.3 | 35.7 |
| instructblip_13b | 32.9 | 30 | 37.5 | 29.3 | 32 | 28.7 | 37.5 | 33.8 |
| PandaGPT_13B | 32.7 | 26.7 | 42.5 | 35.3 | 30 | 29.3 | 45.8 | 21.9 |
| llava_v1_7b | 32.1 | 33.3 | 31.7 | 24.7 | 31.3 | 32 | 37.5 | 35.2 |
| instructblip_7b | 30.4 | 24 | 38.3 | 28 | 22 | 30.7 | 39.2 | 28.6 |
| VisualGLM_6b | 28.9 | 28.7 | 30 | 24 | 28 | 28 | 40.8 | 26.2 |
| qwen_base | 28.8 | 29.3 | 43.3 | 18.7 | 25.3 | 32.7 | 42.5 | 19.5 |
| flamingov2 | 28.2 | 21.3 | 27.5 | 30 | 28.7 | 28 | 33.3 | 24.3 |
| **Frequent Choice** | **26.8** | | | | | | | |
| MiniGPT-4-v1-13B | 26.2 | 23.3 | 33.3 | 19.3 | 28.7 | 26 | 34.2 | 21 |
| idefics_80b_instruct | 25.1 | 23.3 | 39.2 | 17.3 | 23.3 | 24 | 48.3 | 11.4 |
| MiniGPT-4-v2 | 24.6 | 32 | 27.5 | 22.7 | 21.3 | 28 | 33.3 | 19 |
| MiniGPT-4-v1-7B | 23 | 19.3 | 32.5 | 27.3 | 18.7 | 17.3 | 15 | 26.2 |
| **Random Choice** | **22.1** | | | | | | | |
| idefics_9b_instruct | 19.6 | 20 | 22.5 | 11.3 | 20.7 | 23.3 | 31.7 | 13.3 |
4 changes: 2 additions & 2 deletions run.py
@@ -1,7 +1,7 @@
import torch
import torch.distributed as dist
from vlmeval.smp import *
from vlmeval.eval import COCO_eval, MME_eval, MMVet_eval, multiple_choice_eval, MME_rating, VQAEval
from vlmeval.evaluate import COCO_eval, MME_eval, MMVet_eval, multiple_choice_eval, MME_rating, VQAEval
from vlmeval.inference import infer_data_job, prefetch_acc
from vlmeval.config import supported_VLM

@@ -65,7 +65,7 @@ def main():
dump(res, result_file.replace('.xlsx', '_prefetch.xlsx'))

if rank == 0 and args.mode == 'all':
if listinstr(['MMBench', 'CCBench', 'SEEDBench_IMG'], dataset_name):
if listinstr(['MMBench', 'CCBench', 'SEEDBench_IMG', 'MMMU'], dataset_name):
multiple_choice_eval(result_file, dataset=dataset_name, model='chatgpt-0613', nproc=args.nproc, verbose=args.verbose)
elif dataset_name == 'MME':
MME_eval(result_file, model='chatgpt-0613', nproc=args.nproc, verbose=args.verbose)
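
For context, a hedged stand-alone use of the dispatch shown above: the keyword arguments mirror the `multiple_choice_eval` call in this hunk, while the prediction-file name is a made-up example.

```python
# Illustrative only: evaluate an existing MMMU prediction file directly,
# mirroring the multiple_choice_eval call dispatched by run.py above.
from vlmeval.evaluate import multiple_choice_eval

multiple_choice_eval(
    'qwen_chat/qwen_chat_MMMU_DEV_VAL.xlsx',  # hypothetical prediction file
    dataset='MMMU_DEV_VAL',
    model='chatgpt-0613',  # GPT-3.5-Turbo-0613 as the fallback answer matcher
    nproc=4,
    verbose=True,
)
```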
1 change: 1 addition & 0 deletions setup.py
@@ -21,6 +21,7 @@
openpyxl
seaborn
tabulate
xlsxwriter
"""


2 changes: 1 addition & 1 deletion vlmeval/__init__.py
@@ -5,7 +5,7 @@

from .smp import *
from .api import *
from .eval import *
from .evaluate import *
from .utils import *
from .vlm import *
from .config import *
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
