
AGLA: Assembly of Global and Local Attention

The official repository for the paper "AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention".

Authors

Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Guang Dai, Ping Chen, Shijian Lu.

Abstract

Large Vision-Language Models (LVLMs) often suffer from object hallucinations, where the generated textual responses do not align with the actual objects in the image. This paper identifies attention deficiency towards discriminative local image features as a key cause of this issue. We introduce the Assembly of Global and Local Attention (AGLA), a training-free, plug-and-play method designed to reduce object hallucinations by combining global features for response generation and local features for visual discrimination. Our extensive experiments demonstrate that AGLA consistently mitigates object hallucinations and enhances the overall perception capabilities of LVLMs across various discriminative and generative benchmarks.
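
As a rough sketch of the core idea (the masked local view, the function name, and the weight alpha below are illustrative assumptions, not the repository's actual API): at each decoding step the LVLM scores the next token once on the original image (global view) and once on a masked image that keeps only prompt-relevant regions (local view), and the two sets of logits are assembled before sampling.

def assemble_logits(logits_global, logits_local, alpha=1.0):
    """Assemble next-token logits from the full (global) image view with
    logits from a masked (local) view that retains only prompt-relevant
    regions, so discriminative local evidence can correct tokens the
    global view alone might hallucinate. Works on any array-like logits
    (e.g., torch tensors). Illustrative sketch only."""
    # Weighted sum in logit space; alpha is a hypothetical knob controlling
    # how strongly the local view contributes.
    return logits_global + alpha * logits_local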

Content

1. Data

2. Model

3. Requirements

4. Running

5. Results

6. Thanks

7. Citation

Data

We conducted experiments using four public datasets:

1. POPE

2. MME

3. CHAIR

4. LLaVA-Bench-Wild

The question data for POPE and CHAIR is included in this repository; you will need to download the COCO_val2014 image files separately. For the MME dataset, you can request access through the provided link. The LLaVA-Bench-Wild dataset can be downloaded from Hugging Face via the provided link.

Model

We experimented with two LVLMs: LLaVA and InstructBLIP. An overview of our model is shown below.

Requirements

The environment is based on Python 3.9. Detailed dependencies are listed in requirements.txt.
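
A typical setup (the environment name agla below is just an example; adapt to your own tooling):

conda create -n agla python=3.9
conda activate agla
pip install -r requirements.txt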

Running

To run experiments on POPE with LLaVA 1.5 or InstructBLIP, use the following commands in the eval folder:

sh llava1.5_pope.bash
sh instructblip_pope.bash

To evaluate model performance on POPE, use eval_pope.py.
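
POPE consists of binary "Is there a <object> in the image?" questions, so evaluation reduces to standard binary-classification metrics over yes/no answers. The snippet below is an illustrative sketch of that computation, not the exact code in eval_pope.py:

def pope_metrics(predictions, labels):
    # predictions, labels: lists of strings, each "yes" or "no"
    tp = sum(p == "yes" and l == "yes" for p, l in zip(predictions, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(predictions, labels))
    tn = sum(p == "no" and l == "no" for p, l in zip(predictions, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(predictions, labels))

    total = max(tp + fp + tn + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / max(precision + recall, 1e-8),
        "yes_ratio": (tp + fp) / total,
    }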

For the other datasets, modify the file paths and prompts in run_llava.py and run_instructblip.py to generate results, then evaluate model performance following the guidance in the datasets' original repositories.

Results

Thanks

The logit adjustment framework (i.e., sample.py) is based on VCD.

Citation

If our paper or code is helpful to you, please consider citing our work:

@article{an2024agla,
  title={AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention},
  author={An, Wenbin and Tian, Feng and Leng, Sicong and Nie, Jiahao and Lin, Haonan and Wang, QianYing and Dai, Guang and Chen, Ping and Lu, Shijian},
  journal={arXiv preprint arXiv:2406.12718},
  year={2024}
}
