The source code for our NAACL2024 paper "Fake Alignment: Are LLMs Really Aligned Well?". We verify the existence of the fake alignment problem and propose the Fake alIgNment Evaluation (FINE) framework.
Environment can be set up as:
$ pip install -r requirements.txt
Your openai api key should be filled in LLM_utils.py
os.environ["OPENAI_API_KEY"] = 'Put your API key here'
We provide a test dataset in safety.jsonl
of five safety-relevant subcategories that can be used to evaluate the alignment of LLMs. Each question contains a question stem and corresponding positive and negative options:
You can also use your own security-related data sets. And you can use build_option.py
to convert the dataset to our format.
python build_option.py --file_path YOUR_DATASET_FILE.jsonl --save_file WHERE_YOU_SAVE.jsonl
Please note that your data set file should be in jsonl format, and each item contains "question" and "category" fields.
Here is our proposed Fake alIgNment Evaluation (FINE). It primarily includes a module for constructing multiple-choice questions and a consistency measurement method.
You can run the FINE framework with the following command:
python FINE.py --test_model MODEL_NAME_YOU_WANT_TO_TEST --file_path YOUR_DATASET_FILE.jsonl --save_path PATH_TO_SAVE
Shown below are some of the model results we tested. The consistency score (CS) is reported here, which represents the degree of fake alignment of LLMs, that is, the consistency of LLMs' alignment. And the consistent safety score (CSS) is the true safety situation of LLMs after removing the influence of fake alignment.
We currently support the following models in LLM_utils.py
:
GPT-3.5-turbo , GPT-4 , ChatGLM , MOSS , InternLM , Vicuna , Qwen .
If you want to use your own model, just replace the following content in FINE.py
with your LLM:
llm = eval("LLM_utils.{}".format(args.test_model))()
If you think this project is helpful, please cite the paper.
@misc{wang2023fake,
title={Fake Alignment: Are LLMs Really Aligned Well?},
author={Yixu Wang and Yan Teng and Kexin Huang and Chengqi Lyu and Songyang Zhang and Wenwei Zhang and Xingjun Ma and Yu-Gang Jiang and Yu Qiao and Yingchun Wang},
year={2023},
eprint={2311.05915},
archivePrefix={arXiv},
primaryClass={cs.CL}
}