
Fake Alignment: Are LLMs Really Aligned Well?

This repository contains the source code for our NAACL 2024 paper "Fake Alignment: Are LLMs Really Aligned Well?". We verify the existence of the fake alignment problem and propose the Fake alIgNment Evaluation (FINE) framework.

Preparation

The environment can be set up with:

$ pip install -r requirements.txt

Fill in your OpenAI API key in LLM_utils.py:

os.environ["OPENAI_API_KEY"] = 'Put your API key here'

Datasets

We provide a test dataset, safety.jsonl, covering five safety-relevant subcategories, which can be used to evaluate the alignment of LLMs. Each item contains a question stem and corresponding positive and negative options.
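For illustration only, a converted item might look like the sketch below; the field names here are hypothetical, so consult safety.jsonl for the exact schema.

# A hypothetical converted item; the actual field names in safety.jsonl may differ.
item = {
    "question": "How can I keep my online accounts secure?",
    "category": "privacy",
    "positive_option": "Use strong, unique passwords and enable two-factor authentication.",
    "negative_option": "Reuse one simple password everywhere so it is easy to remember.",
}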

You can also use your own safety-related datasets; build_option.py converts a dataset to our format:

python build_option.py --file_path YOUR_DATASET_FILE.jsonl  --save_file WHERE_YOU_SAVE.jsonl

Please note that your dataset file should be in JSONL format and that each item contains "question" and "category" fields.
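For reference, a minimal input file for build_option.py could be created like this (the example questions and category labels are only illustrative):

import json

# Each line of the input file is one JSON object with "question" and "category" fields.
items = [
    {"question": "How do I protect my personal data online?", "category": "privacy"},
    {"question": "Is it acceptable to spread rumors about a colleague?", "category": "ethics"},
]

with open("YOUR_DATASET_FILE.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")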

FINE

This is our proposed Fake alIgNment Evaluation (FINE) framework. It primarily includes a module for constructing multiple-choice questions and a consistency measurement method.

You can run the FINE framework with the following command:

python FINE.py --test_model MODEL_NAME_YOU_WANT_TO_TEST  --file_path YOUR_DATASET_FILE.jsonl --save_path PATH_TO_SAVE
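For example, assuming ChatGLM matches the corresponding class name in LLM_utils.py, an evaluation run on the provided dataset might look like:

python FINE.py --test_model ChatGLM --file_path safety.jsonl --save_path results.jsonl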

Results

Shown below are some of the model results we tested. We report the consistency score (CS), which measures the degree of fake alignment, i.e., the consistency of an LLM's alignment, and the consistent safety score (CSS), which reflects a model's true safety after the influence of fake alignment is removed.

[Figure: The CS results] [Figure: The CSS results]

Model Support

We currently support the following models in LLM_utils.py: GPT-3.5-turbo, GPT-4, ChatGLM, MOSS, InternLM, Vicuna, and Qwen.

If you want to use your own model, simply replace the following line in FINE.py with your LLM:

llm = eval("LLM_utils.{}".format(args.test_model))()
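For example, a minimal custom wrapper might look like the sketch below; the call interface is an assumption, so mirror whatever interface the supported model classes in LLM_utils.py expose.

# A minimal sketch of a custom model wrapper. The __call__ signature is an
# assumption; match the interface of the model classes in LLM_utils.py.
class MyLLM:
    def __init__(self, model_path="path/to/your/model"):
        # Load your model and tokenizer here (e.g., with transformers).
        self.model_path = model_path

    def __call__(self, prompt: str) -> str:
        # Run inference on the prompt and return the generated text.
        return "model response goes here"

# Then replace the eval(...) line in FINE.py with:
# llm = MyLLM()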

Citation

If you find this project helpful, please cite our paper:

@misc{wang2023fake,
      title={Fake Alignment: Are LLMs Really Aligned Well?}, 
      author={Yixu Wang and Yan Teng and Kexin Huang and Chengqi Lyu and Songyang Zhang and Wenwei Zhang and Xingjun Ma and Yu-Gang Jiang and Yu Qiao and Yingchun Wang},
      year={2023},
      eprint={2311.05915},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
