LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning

Code for the LOVEU@CVPR2023 Workshop Generic Event Boundary Captioning (GEBC) Challenge. Our proposed method achieved a score of 76.14 on the test set and won 1st place in the challenge. The technical report can be found here.

Introduction

We propose an effective model, LLMVA-GEBC (Large Language Model with Video Adapter for Generic Event Boundary Captioning): (1) we utilize a pretrained LLM to generate high-quality, human-like captions; (2) to adapt the model to the GEBC task, we use a video Q-Former as an adapter and train it while keeping the visual feature extractors and the LLM frozen.

[Figure: overview of the LLMVA-GEBC framework]
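To make the adapter idea concrete, below is a minimal, self-contained sketch (not the actual repository code): learnable query tokens cross-attend to frozen visual features and are then projected into the frozen LLM's embedding space. All module names and dimensions are illustrative assumptions, and `nn.TransformerDecoder` merely stands in for the video Q-Former.

```python
# Illustrative sketch of a video-adapter module: learnable queries attend to
# frozen visual features, and the outputs are projected to the LLM token space.
import torch
import torch.nn as nn

class VideoQFormerAdapter(nn.Module):
    def __init__(self, vis_dim=1408, hidden_dim=768, llm_dim=4096, num_queries=32):
        super().__init__()
        # Learnable query tokens (hypothetical count / dimensions).
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim) * 0.02)
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)       # map frozen visual features
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=2)  # stand-in for the video Q-Former
        self.llm_proj = nn.Linear(hidden_dim, llm_dim)        # project into the frozen LLM's embedding space

    def forward(self, frozen_visual_feats):                   # (B, T, vis_dim)
        memory = self.vis_proj(frozen_visual_feats)
        queries = self.queries.expand(frozen_visual_feats.size(0), -1, -1)
        video_tokens = self.qformer(tgt=queries, memory=memory)
        return self.llm_proj(video_tokens)                    # (B, num_queries, llm_dim)

adapter = VideoQFormerAdapter()
dummy_feats = torch.randn(2, 100, 1408)                       # e.g. frame features from a frozen extractor
print(adapter(dummy_feats).shape)                             # torch.Size([2, 32, 4096])
```

Only the adapter (and, in practice, the caption head) receives gradients; the visual feature extractors and the LLM stay frozen during training.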

Environment Preparation

First, create and activate a conda environment:

conda env create -f environment.yml
conda activate llmvagebc

Prerequisite Checkpoints

Before using the repository, make sure you have obtained the following checkpoints:

Remember to change the checkpoint path (ckpt) in the config file.
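A hypothetical excerpt of such a config, assuming the checkpoint path lives under a `ckpt` key as mentioned above (the surrounding structure and the path are placeholders):

```yaml
# Hypothetical config excerpt -- only the ckpt key matters here; replace the
# path with the location of the checkpoint you downloaded.
model:
  ckpt: "/path/to/downloaded/checkpoint.pth"
```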

Data

Download the Kinetic-GEBC dataset from https://sites.google.com/view/loveucvpr23/track2.

For primary visual features: we use BLIP-2 to extract them via feature_extraction.py. Remember to change video_dir and save_dir in train_configs/blip2_feature_extract.yaml before you run:

python feature_extraction.py --cfg-path train_configs/blip2_feature_extract.yaml
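For reference, a hypothetical excerpt of train_configs/blip2_feature_extract.yaml showing the two fields to edit (only video_dir and save_dir are named in this README; the nesting and example paths are assumptions):

```yaml
# Hypothetical excerpt -- adjust both paths to your setup.
run:
  video_dir: "/path/to/kinetic_gebc/videos"                 # raw Kinetic-GEBC videos
  save_dir: "data/features/eva_vit_g_q_former_tokens_12"    # where BLIP-2 features are written
```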

For other visual features: we use CLIP to extract frame-level features and Omnivore to extract clip-level features, following this pipeline.

Then, put the extracted features under these three folders:

data/features/eva_vit_g_q_former_tokens_12
data/features/clip_fps_15_stride_1_rename
data/features/omnivore_fps_15_len_16_stride_1_rename

You can also directly download the officially provided features here. But remember to change q_former_feature_folder, other_feat_total_size, other_feature_names, and other_feature_folders in the config file.
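A hypothetical excerpt showing how these keys could be set to match the folder layout above (the key names come from this README; the list contents and the size value are assumptions to verify against your own config and feature dimensions):

```yaml
# Hypothetical excerpt -- key names are from this README, values are examples.
q_former_feature_folder: "data/features/eva_vit_g_q_former_tokens_12"
other_feature_names: ["clip", "omnivore"]
other_feature_folders:
  - "data/features/clip_fps_15_stride_1_rename"
  - "data/features/omnivore_fps_15_len_16_stride_1_rename"
other_feat_total_size: 2304   # placeholder: set to the summed dimensionality of the CLIP and Omnivore features
```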

For region-level features: we use VinVL to extract them. The region features of a video are saved as multiple .npy files, each containing the region features of one sampled frame. Merge the feature file paths into video_to_frame_index.json in the following format:

{
    "video_id": [
        "frame_1_feat.npy",
        "frame_2_feat.npy",
        ...     
    ],
    ...
}

Then put this file under data/features/.
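If you need to build video_to_frame_index.json yourself, a small helper sketch like the following can generate it, assuming each video's per-frame .npy files sit in their own subdirectory (the directory layout and paths are assumptions; adapt the globbing to wherever your VinVL features live):

```python
# Helper sketch (not part of the repository): build video_to_frame_index.json
# from a layout like <region_feat_dir>/<video_id>/<frame>_feat.npy.
import json
from pathlib import Path

region_feat_dir = Path("data/features/vinvl_region_features")  # hypothetical location

index = {
    video_dir.name: sorted(str(f) for f in video_dir.glob("*.npy"))
    for video_dir in sorted(region_feat_dir.iterdir())
    if video_dir.is_dir()
}

with open("data/features/video_to_frame_index.json", "w") as f:
    json.dump(index, f, indent=2)
```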

Training and Validation

First, set the configs in train_configs/${NAME_OF_YOUR_CONFIG_FILE}.yaml. Then run the script:

CUDA_VISIBLE_DEVICES=${YOUR_GPU_ID} python train.py \
    --cfg-path train_configs/${NAME_OF_YOUR_CONFIG_FILE}.yaml
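For example, to train on GPU 0 with a config named gebc_train.yaml (the config filename here is just a placeholder):

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --cfg-path train_configs/gebc_train.yaml
```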

The results can be found in video_llama/output/.

Acknowledgement

We are grateful to the following awesome projects that LLMVA-GEBC arises from:

Citation

If you find our code useful, please cite the repo as follows:

@article{tang2023llmva,
  title={LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning},
  author={Tang, Yunlong and Zhang, Jinrui and Wang, Xiangchen and Wang, Teng and Zheng, Feng},
  journal={arXiv preprint arXiv:2306.10354},
  year={2023}
}
