- HuggingFace demo and model checkpoint, see here
- The CODE dataset for evaluation, see here
- ContextDET training scripts, see here (waiting to be cleaned up)
Recent Multimodal Large Language Models (MLLMs) are remarkable at vision-language tasks such as image captioning and question answering, but they lack an essential perception ability: object detection. In this work, we address this limitation by introducing a novel research problem, contextual object detection, i.e., understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated: the language cloze test, visual captioning, and question answering.
| Task | Language Input | Output(s) | Remark |
|---|---|---|---|
| Object Detection | ✗ | box, class label | pre-defined class labels |
| Open-Vocabulary Object Detection | (optional) class names for CLIP | box, class label | pre-defined class labels |
| Referring Expression Comprehension | complete referring expression | box that the expression refers to | / |
| Contextual Cloze Test (ours) | incomplete expression with object names masked | {box, name} to complete the mask | name can be any valid English word |
| Image Captioning | ✗ | language caption | / |
| Contextual Captioning (ours) | ✗ | language caption, box | / |
| Visual Question Answering | language question | language answer | / |
| Contextual QA (ours) | language question | language answer, box | / |
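To make the new task formats concrete, here is a minimal, hypothetical example of what a contextual cloze test sample could look like. The field names and values below are illustrative only and are not the actual CODE dataset schema.

```python
# Hypothetical illustration of the contextual cloze test format.
# Field names below are made up for clarity, not the real CODE dataset schema.
cloze_sample = {
    # Incomplete expression with object names masked out:
    "masked_caption": "a [MASK] sitting next to a [MASK] on the grass",
    # Expected outputs: for each mask, a predicted name plus its bounding box.
    "targets": [
        {"name": "dog",     "box": [120.0, 45.0, 310.0, 400.0]},   # [x1, y1, x2, y2]
        {"name": "frisbee", "box": [330.0, 200.0, 420.0, 260.0]},
    ],
}
```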
We present ContextDET, a novel generate-then-detect framework, specialized for contextual object detection. ContextDET is end-to-end and consists of three key architectural components:
- a visual encoder that extracts high-level image representations and computes visual tokens,
- a pre-trained LLM that decodes multimodal contextual tokens with a task-related multimodal prefix, and
- a visual decoder that predicts matching scores and bounding boxes for conditional queries linked to contextual object words.
This new generate-then-detect framework enables us to detect objects described by any word in the human vocabulary, rather than a fixed set of pre-defined class labels.
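For intuition, the sketch below walks through the generate-then-detect flow with toy stand-ins for each component. Module choices, dimensions, and the fusion step are illustrative assumptions only and do not reflect the actual ContextDET implementation.

```python
# Minimal sketch of the generate-then-detect flow. All modules here are toy
# stand-ins (the real ContextDET uses a pre-trained LLM and a dedicated visual
# decoder); shapes and fusion are illustrative only.
import torch
import torch.nn as nn

class ToyContextDET(nn.Module):
    def __init__(self, vis_dim=256, lm_dim=512, vocab_size=32000, num_queries=100):
        super().__init__()
        self.visual_encoder = nn.Conv2d(3, vis_dim, kernel_size=16, stride=16)  # stand-in backbone
        self.vis_to_lm = nn.Linear(vis_dim, lm_dim)         # project visual tokens into LLM space
        self.llm = nn.TransformerEncoder(                   # stand-in for the pre-trained LLM
            nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(lm_dim, vocab_size)        # generates contextual object words
        self.query_embed = nn.Embedding(num_queries, lm_dim)
        self.box_head = nn.Linear(lm_dim, 4)                # visual decoder: box per conditional query
        self.match_head = nn.Linear(lm_dim, 1)              # matching score per conditional query

    def forward(self, image, prefix_embeds):
        # 1) Visual encoder: image -> visual tokens.
        feat = self.visual_encoder(image)                           # (B, C, H/16, W/16)
        vis_tokens = self.vis_to_lm(feat.flatten(2).transpose(1, 2))
        # 2) LLM decodes multimodal contextual tokens (task-related prefix +
        #    visual tokens) and generates object words ("generate" step).
        tokens = torch.cat([prefix_embeds, vis_tokens], dim=1)
        ctx = self.llm(tokens)
        word_logits = self.lm_head(ctx)
        # 3) Visual decoder: conditional queries linked to the generated object
        #    words predict matching scores and boxes ("detect" step).
        q = self.query_embed.weight.unsqueeze(0).expand(image.size(0), -1, -1)
        q = q + ctx.mean(dim=1, keepdim=True)                       # toy conditioning on context
        return word_logits, self.match_head(q).sigmoid(), self.box_head(q).sigmoid()

model = ToyContextDET()
img = torch.randn(1, 3, 224, 224)
prefix = torch.randn(1, 8, 512)   # e.g. embedded task-instruction tokens
word_logits, match_scores, boxes = model(img, prefix)
```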
🤗 You can try our demo on HuggingFace Spaces. To avoid waiting in the queue and speed up inference, consider duplicating the space and using GPU resources.
🤗 If you want to run the demo on your own computer with a GPU, follow these steps:
- Install the required Python packages:
```bash
pip install -r requirements.txt
```
- Download the checkpoint file from the following URL and save it in your local directory.
- Now you're ready to run the demo. Execute the following command:
```bash
python app.py
```
You are expected to see the following web page:
If you find our work useful, please consider citing it:
```bibtex
@article{zang2023contextual,
  author  = {Zang, Yuhang and Li, Wei and Han, Jun and Zhou, Kaiyang and Loy, Chen Change},
  title   = {Contextual Object Detection with Multimodal Large Language Models},
  journal = {arXiv preprint arXiv:2305.18279},
  year    = {2023}
}
```
This project is licensed under S-Lab License 1.0. Redistribution and use for non-commercial purposes should follow this license.
We acknowledge the use of the following public code in this project: DETR, Deformable DETR, DETA, OV-DETR, BLIP-2.
If you have any questions, please feel free to contact Yuhang Zang (zang0012 AT ntu.edu.sg).