
Commit

[Feature] Allow to use local judge llm (#132)
* Use local llm

Allow using a local judge LLM by setting the environment variable LOCAL_LLM

* Update Quickstart.md for local judge LLM

* run pre-commit

* Update misc.py

---------

Co-authored-by: Haodong Duan <[email protected]>
StarCycle and kennymckormick authored Mar 28, 2024
1 parent 86373b7 commit ee8cb93
Showing 2 changed files with 62 additions and 16 deletions.
54 changes: 48 additions & 6 deletions Quickstart.md
@@ -4,9 +4,9 @@ Before running the evaluation script, you need to **configure** the VLMs and set

After that, you can use a single script `run.py` to run inference and evaluation for multiple VLMs and benchmarks at the same time.

## Step0. Installation & Setup essential keys
## Step 0. Installation & Setup essential keys

**Installation. **
**Installation.**

```bash
git clone https://github.com/open-compass/VLMEvalKit.git
@@ -16,7 +16,8 @@ pip install -e .

**Setup Keys.**

- To infer with API models (GPT-4v, Gemini-Pro-V, etc.) or use LLM APIs as the **judge or choice extractor**, you need to first setup API keys. You can place the required keys in `$VLMEvalKit/.env` or directly set them as the environment variable. If you choose to create a `.env` file, its content will look like:
To infer with API models (GPT-4v, Gemini-Pro-V, etc.) or use LLM APIs as the **judge or choice extractor**, you need to first set up API keys. VLMEvalKit will first try the "exact matching" policy to extract choices from the output answers. If this step is not successful, VLMEvalKit uses an LLM to extract choices from the answers.
- You can place the required keys in `$VLMEvalKit/.env` or directly set them as the environment variable. If you choose to create a `.env` file, its content will look like:

```bash
# The .env file, place it under $VLMEvalKit
@@ -31,8 +32,7 @@ pip install -e .
```

- Fill the blanks with your API keys (if necessary). Those API keys will be automatically loaded when doing the inference and evaluation.

## Step1. Configuration
## Step 1. Configuration

**VLM Configuration**: All VLMs are configured in `vlmeval/config.py`. For some VLMs, you need to configure the code root (MiniGPT-4, PandaGPT, etc.) or the model_weight root (LLaVA-v1-7B, etc.) before conducting the evaluation. During evaluation, you should use the model name specified in `supported_VLM` in `vlmeval/config.py` to select the VLM. For MiniGPT-4 and InstructBLIP, you also need to modify the config files in `vlmeval/vlm/misc` to configure the LLM path and ckpt path.

@@ -42,7 +42,7 @@ Following VLMs require the configuration step:

**Manual Weight Preparation & Configuration**: InstructBLIP, LLaVA-v1-7B, MiniGPT-4, PandaGPT-13B

## Step2. Evaluation
## Step 2. Evaluation

We use `run.py` for evaluation. To use the script, you can use `$VLMEvalKit/run.py` or create a soft-link of the script (to use the script anywhere):

@@ -76,3 +76,45 @@ torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
```

The evaluation results will be printed as logs. Besides, **Result Files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.

## Deploy a local language model as the judge / choice extractor

The default setting mentioned above uses OpenAI's GPT as the judge LLM. However, you can also deploy a local judge LLM with [LMDeploy](https://github.com/InternLM/lmdeploy).

First, install LMDeploy and the OpenAI client:
```bash
pip install lmdeploy openai
```

Then deploy a local judge LLM with a single line of code. LMDeploy will automatically download the model from Hugging Face. Assume we use internlm2-chat-1_8b as the judge, port 23333, and the key sk-123456 (the key must start with "sk-" and can be followed by any number you like):
```bash
lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
```

You need to get the model name registered by LMDeploy with the following code:
```python
from openai import OpenAI

client = OpenAI(
    api_key='sk-123456',
    base_url="http://0.0.0.0:23333/v1"
)
model_name = client.models.list().data[0].id
```
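
Before pointing VLMEvalKit at the server, you can optionally sanity-check the endpoint with a plain chat request. This is an illustrative snippet, not part of VLMEvalKit; it assumes the LMDeploy server above is running on port 23333 with the key sk-123456:
```python
from openai import OpenAI

# Connect to the OpenAI-compatible endpoint exposed by LMDeploy.
client = OpenAI(api_key='sk-123456', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id

# Any coherent reply means the judge endpoint is reachable.
response = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Reply with one word: ready'}],
    temperature=0,
)
print(response.choices[0].message.content)
```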

Now set some environment variables to tell VLMEvalKit how to use the local judge LLM. To VLMEvalKit, the local judge LLM looks just like an online OpenAI model.
```bash
export OPENAI_API_KEY=sk-123456
export OPENAI_API_BASE=http://0.0.0.0:23333/v1/chat/completions
export LOCAL_LLM=<model_name you get>
```

Finally, run the commands in Step 2 to evaluate your VLM with the local judge LLM.
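
For instance, a complete run with the qwen_chat / MME combination from Step 2 might look like the following. This is an illustrative sketch: the single-process `python run.py` invocation and the registered model name `internlm2-chat-1_8b` are assumptions, so substitute the name returned by `client.models.list()`:
```bash
# Illustrative end-to-end run with the local judge LLM (values from the steps above).
export OPENAI_API_KEY=sk-123456
export OPENAI_API_BASE=http://0.0.0.0:23333/v1/chat/completions
export LOCAL_LLM=internlm2-chat-1_8b   # replace with the name returned by client.models.list()
python run.py --data MME --model qwen_chat --verbose
```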

Note that

- If you want to deploy the judge LLM on one GPU and evaluate your VLM on other GPUs because of limited GPU memory, restrict the visible devices with `CUDA_VISIBLE_DEVICES`, e.g.:
```bash
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server internlm/internlm2-chat-1_8b --server-port 23333
CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc-per-node=3 run.py --data HallusionBench --model qwen_chat --verbose
```
- If the local judge LLM is not good enough at following instructions, the evaluation may fail. Please report such failures (e.g., by opening an issue).
- It's possible to deploy the judge LLM in other ways, e.g., using a private LLM (not from Hugging Face) or a quantized LLM. Please refer to the [LMDeploy docs](https://lmdeploy.readthedocs.io/en/latest/serving/api_server.html). You can also use any other deployment framework as long as it supports the OpenAI API.
24 changes: 14 additions & 10 deletions vlmeval/evaluate/misc.py
@@ -3,20 +3,24 @@
from vlmeval.smp import load_env

INTERNAL = os.environ.get('INTERNAL', 0)
LOCAL_LLM = os.environ.get('LOCAL_LLM', None)


def build_judge(version, **kwargs):
load_env()
model_map = {
'gpt-4-turbo': 'gpt-4-1106-preview',
'gpt-4-0613': 'gpt-4-0613',
'gpt-4-0314': 'gpt-4-0314',
'gpt-4-0125': 'gpt-4-0125-preview',
'chatgpt-1106': 'gpt-3.5-turbo-1106',
'chatgpt-0613': 'gpt-3.5-turbo-0613',
'chatgpt-0125': 'gpt-3.5-turbo-0125'
}
model_version = model_map[version]
if LOCAL_LLM is None:
model_map = {
'gpt-4-turbo': 'gpt-4-1106-preview',
'gpt-4-0613': 'gpt-4-0613',
'gpt-4-0314': 'gpt-4-0314',
'gpt-4-0125': 'gpt-4-0125-preview',
'chatgpt-1106': 'gpt-3.5-turbo-1106',
'chatgpt-0613': 'gpt-3.5-turbo-0613',
'chatgpt-0125': 'gpt-3.5-turbo-0125'
}
model_version = model_map[version]
else:
model_version = LOCAL_LLM
if INTERNAL:
model = OpenAIWrapperInternal(model_version, **kwargs)
else:
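
To make the new branch in `misc.py` concrete, here is a hedged sketch (not part of the patch) of how `build_judge` resolves the judge model once `LOCAL_LLM` is set; because the variable is read when the module is imported, it has to be in the environment beforehand:

```python
import os

# LOCAL_LLM is read at import time of vlmeval/evaluate/misc.py,
# so it must be set before the module is imported.
os.environ['LOCAL_LLM'] = 'internlm2-chat-1_8b'  # hypothetical registered name
os.environ['OPENAI_API_KEY'] = 'sk-123456'
os.environ['OPENAI_API_BASE'] = 'http://0.0.0.0:23333/v1/chat/completions'

from vlmeval.evaluate.misc import build_judge

# With LOCAL_LLM set, the `version` argument no longer picks the model:
# model_version is overridden with LOCAL_LLM, and the wrapper is expected to
# talk to the endpoint in OPENAI_API_BASE rather than api.openai.com.
judge = build_judge('gpt-4-turbo')
```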
