
Commit

[Feature] Allow to use local judge llm (#132)
* Use local llm

Allow using a local judge LLM by setting the environment variable LOCAL_LLM

* Update Quickstart.md for local judge LLM

* run pre-commit

* Update misc.py

---------

Co-authored-by: Haodong Duan <[email protected]>
StarCycle and kennymckormick authored Mar 28, 2024
1 parent 86373b7 commit ee8cb93
Showing 2 changed files with 62 additions and 16 deletions.
54 changes: 48 additions & 6 deletions Quickstart.md
@@ -4,9 +4,9 @@ Before running the evaluation script, you need to **configure** the VLMs and set

After that, you can use a single script `run.py` to run inference and evaluation for multiple VLMs and benchmarks at the same time.

## Step0. Installation & Setup essential keys
## Step 0. Installation & Setup essential keys

**Installation. **
**Installation.**

```bash
git clone https://github.com/open-compass/VLMEvalKit.git
@@ -16,7 +16,8 @@ pip install -e .

**Setup Keys.**

- To infer with API models (GPT-4v, Gemini-Pro-V, etc.) or use LLM APIs as the **judge or choice extractor**, you need to first setup API keys. You can place the required keys in `$VLMEvalKit/.env` or directly set them as the environment variable. If you choose to create a `.env` file, its content will look like:
To infer with API models (GPT-4v, Gemini-Pro-V, etc.) or use LLM APIs as the **judge or choice extractor**, you need to first set up API keys. VLMEvalKit will first try the "exact matching" policy to extract choices from the output answers. If this step is not successful, VLMEvalKit uses an LLM to extract choices from the answers.
- You can place the required keys in `$VLMEvalKit/.env` or directly set them as the environment variable. If you choose to create a `.env` file, its content will look like:

```bash
# The .env file, place it under $VLMEvalKit
@@ -31,8 +32,7 @@ pip install -e .
```

- Fill the blanks with your API keys (if necessary). Those API keys will be automatically loaded when doing the inference and evaluation.

## Step1. Configuration
## Step 1. Configuration

**VLM Configuration**: All VLMs are configured in `vlmeval/config.py`. For some VLMs, you need to configure the code root (MiniGPT-4, PandaGPT, etc.) or the model_weight root (LLaVA-v1-7B, etc.) before conducting the evaluation. During evaluation, you should use the model name specified in `supported_VLM` in `vlmeval/config.py` to select the VLM. For MiniGPT-4 and InstructBLIP, you also need to modify the config files in `vlmeval/vlm/misc` to configure the LLM path and ckpt path.

@@ -42,7 +42,7 @@ Following VLMs require the configuration step:

**Manual Weight Preparation & Configuration**: InstructBLIP, LLaVA-v1-7B, MiniGPT-4, PandaGPT-13B

## Step2. Evaluation
## Step 2. Evaluation

We use `run.py` for evaluation. To use the script, you can use `$VLMEvalKit/run.py` or create a soft-link of the script (to use the script anywhere):

@@ -76,3 +76,45 @@ torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
```

The evaluation results will be printed as logs. Besides, **Result Files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.

## Deploy a local language model as the judge / choice extractor

The default setting mentioned above uses OpenAI's GPT as the judge LLM. However, you can also deploy a local judge LLM with [LMDeploy](https://github.com/InternLM/lmdeploy).

First, install LMDeploy and the OpenAI client:
```bash
pip install lmdeploy openai
```

Then deploy a local judge LLM with a single line of code. LMDeploy will automatically download the model from Hugging Face. Assume we use internlm2-chat-1_8b as the judge, port 23333, and the key sk-123456 (the key must start with "sk-" and can be followed by any number you like):
```bash
lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
```

You need to get the model name registered by LMDeploy with the following code:
```python
from openai import OpenAI

client = OpenAI(
    api_key='sk-123456',
    base_url="http://0.0.0.0:23333/v1"
)
model_name = client.models.list().data[0].id
```
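
Before pointing VLMEvalKit at the server, you can optionally sanity-check the endpoint with a plain chat request. This is an illustrative snippet, not part of VLMEvalKit; it assumes the LMDeploy server above is running on port 23333 with the key sk-123456:
```python
from openai import OpenAI

# Connect to the OpenAI-compatible endpoint exposed by LMDeploy.
client = OpenAI(api_key='sk-123456', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id

# Any coherent reply means the judge endpoint is reachable.
response = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Reply with one word: ready'}],
    temperature=0,
)
print(response.choices[0].message.content)
```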

Now set some environment variables to tell VLMEvalKit how to use the local judge LLM. To VLMEvalKit, the local judge LLM looks just like an online OpenAI model.
```bash
export OPENAI_API_KEY=sk-123456
export OPENAI_API_BASE=http://0.0.0.0:23333/v1/chat/completions
export LOCAL_LLM=<model_name you get>
```

Finally, run the commands in Step 2 to evaluate your VLM with the local judge LLM.
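
For instance, a complete run with the qwen_chat / MME combination from Step 2 might look like the following. This is an illustrative sketch: the single-process `python run.py` invocation and the registered model name `internlm2-chat-1_8b` are assumptions, so substitute the name returned by `client.models.list()`:
```bash
# Illustrative end-to-end run with the local judge LLM (values from the steps above).
export OPENAI_API_KEY=sk-123456
export OPENAI_API_BASE=http://0.0.0.0:23333/v1/chat/completions
export LOCAL_LLM=internlm2-chat-1_8b   # replace with the name returned by client.models.list()
python run.py --data MME --model qwen_chat --verbose
```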

Note that

- If you want to deploy the judge LLM on one GPU and evaluate your VLM on other GPUs because of limited GPU memory, restrict the visible devices with `CUDA_VISIBLE_DEVICES`, e.g.:
```bash
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc-per-node=3 run.py --data HallusionBench --model qwen_chat --verbose
```
- If the local judge LLM is not good enough at following instructions, the evaluation may fail. Please report such failures (e.g., by opening an issue).
- It's possible to deploy the judge LLM in other ways, e.g., using a private LLM (not from Hugging Face) or a quantized LLM. Please refer to the [LMDeploy docs](https://lmdeploy.readthedocs.io/en/latest/serving/api_server.html). You can also use any other deployment framework as long as it supports the OpenAI API.
24 changes: 14 additions & 10 deletions vlmeval/evaluate/misc.py
@@ -3,20 +3,24 @@
from vlmeval.smp import load_env

INTERNAL = os.environ.get('INTERNAL', 0)
LOCAL_LLM = os.environ.get('LOCAL_LLM', None)


def build_judge(version, **kwargs):
load_env()
model_map = {
'gpt-4-turbo': 'gpt-4-1106-preview',
'gpt-4-0613': 'gpt-4-0613',
'gpt-4-0314': 'gpt-4-0314',
'gpt-4-0125': 'gpt-4-0125-preview',
'chatgpt-1106': 'gpt-3.5-turbo-1106',
'chatgpt-0613': 'gpt-3.5-turbo-0613',
'chatgpt-0125': 'gpt-3.5-turbo-0125'
}
model_version = model_map[version]
if LOCAL_LLM is None:
model_map = {
'gpt-4-turbo': 'gpt-4-1106-preview',
'gpt-4-0613': 'gpt-4-0613',
'gpt-4-0314': 'gpt-4-0314',
'gpt-4-0125': 'gpt-4-0125-preview',
'chatgpt-1106': 'gpt-3.5-turbo-1106',
'chatgpt-0613': 'gpt-3.5-turbo-0613',
'chatgpt-0125': 'gpt-3.5-turbo-0125'
}
model_version = model_map[version]
else:
model_version = LOCAL_LLM
if INTERNAL:
model = OpenAIWrapperInternal(model_version, **kwargs)
else:
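
To make the new branch in `misc.py` concrete, here is a hedged sketch (not part of the patch) of how `build_judge` resolves the judge model once `LOCAL_LLM` is set; because the variable is read when the module is imported, it has to be in the environment beforehand:

```python
import os

# LOCAL_LLM is read at import time of vlmeval/evaluate/misc.py,
# so it must be set before the module is imported.
os.environ['LOCAL_LLM'] = 'internlm2-chat-1_8b'  # hypothetical registered name
os.environ['OPENAI_API_KEY'] = 'sk-123456'
os.environ['OPENAI_API_BASE'] = 'http://0.0.0.0:23333/v1/chat/completions'

from vlmeval.evaluate.misc import build_judge

# With LOCAL_LLM set, the `version` argument no longer picks the model:
# model_version is overridden with LOCAL_LLM, and the wrapper is expected to
# talk to the endpoint in OPENAI_API_BASE rather than api.openai.com.
judge = build_judge('gpt-4-turbo')
```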
