Fake Alignment: Are LLMs Really Aligned Well?

The source code for our NAACL2024 paper "Fake Alignment: Are LLMs Really Aligned Well?". We verify the existence of the fake alignment problem and propose the Fake alIgNment Evaluation (FINE) framework.

Preparation

Environment can be set up as:

$ pip install -r requirements.txt

Your openai api key should be filled in LLM_utils.py

os.environ["OPENAI_API_KEY"] = 'Put your API key here'

Datasets

We provide a test dataset in safety.jsonl of five safety-relevant subcategories that can be used to evaluate the alignment of LLMs. Each question contains a question stem and corresponding positive and negative options:

You can also use your own security-related data sets. And you can use build_option.py to convert the dataset to our format.

python build_option.py --file_path YOUR_DATASET_FILE.jsonl  --save_file WHERE_YOU_SAVE.jsonl

Please note that your data set file should be in jsonl format, and each item contains "question" and "category" fields.

FINE

Here is our proposed Fake alIgNment Evaluation (FINE). It primarily includes a module for constructing multiple-choice questions and a consistency measurement method.

You can run the FINE framework with the following command:

python FINE.py --test_model MODEL_NAME_YOU_WANT_TO_TEST  --file_path YOUR_DATASET_FILE.jsonl --save_path PATH_TO_SAVE

Results

Shown below are some of the model results we tested. The consistency score (CS) is reported here, which represents the degree of fake alignment of LLMs, that is, the consistency of LLMs' alignment. And the consistent safety score (CSS) is the true safety situation of LLMs after removing the influence of fake alignment.

Model Support

We currently support the following models in LLM_utils.py: GPT-3.5-turbo , GPT-4 , ChatGLM , MOSS , InternLM , Vicuna , Qwen .

If you want to use your own model, just replace the following content in FINE.py with your LLM:

llm = eval("LLM_utils.{}".format(args.test_model))()

Citation

If you think this project is helpful, please cite the paper.

@misc{wang2023fake,
      title={Fake Alignment: Are LLMs Really Aligned Well?}, 
      author={Yixu Wang and Yan Teng and Kexin Huang and Chengqi Lyu and Songyang Zhang and Wenwei Zhang and Xingjun Ma and Yu-Gang Jiang and Yu Qiao and Yingchun Wang},
      year={2023},
      eprint={2311.05915},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
images		images
FINE.py		FINE.py
LICENSE		LICENSE
LLM_utils.py		LLM_utils.py
README.md		README.md
build_option.py		build_option.py
dna_training_set.jsonl		dna_training_set.jsonl
prompts.py		prompts.py
requirements.txt		requirements.txt
safety.jsonl		safety.jsonl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fake Alignment: Are LLMs Really Aligned Well?

Preparation

Datasets

FINE

Results

Model Support

Citation

About

Releases

Packages

Languages

License

AIFlames/Fake-Alignment

Folders and files

Latest commit

History

Repository files navigation

Fake Alignment: Are LLMs Really Aligned Well?

Preparation

Datasets

FINE

Results

Model Support

Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages