We use the filtered Conceptual Captions, SBU, and LAION datasets as the image-text data. You can refer to MiniGPT4 to prepare these datasets.
We use the WebVid-2.5M dataset as the video-text data. You can refer to video2dataset to prepare this dataset.
We use the WavCaps dataset as the audio-text data. You can refer to WavCaps to prepare this dataset.
We recommend re-organizing all stage-one pretraining datasets in one of two ways.
The first way is to use custom_datasets/valor_data/data.py in our code.
You need to re-organize the pretraining datasets as follows:
├── datasets
│   ├── dataset_name
│   │   ├── images(optional)
│   │   │   ├── image0.jpg
│   │   │   └── image1.jpg
│   │   ├── videos(optional)
│   │   │   ├── video0.mp4
│   │   │   └── video1.mp4
│   │   ├── frames(optional)
│   │   │   ├── video0
│   │   │   │   ├── img_0001.jpg
│   │   │   │   └── img_0002.jpg
│   │   │   └── video1
│   │   │       ├── img_0001.jpg
│   │   │       └── img_0002.jpg
│   │   ├── audios(optional)
│   │   │   ├── video0.wav
│   │   │   └── video1.wav
│   │   └── pretrain_txt_mapper.json
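For reference, pretrain_txt_mapper.json holds the text annotations for the raw files above. A minimal sketch of its contents, assuming a simple id-to-caption mapping (the actual schema is whatever custom_datasets/valor_data/data.py parses; field layout here is illustrative):

```json
{
  "video0": "a dog runs across a grassy field",
  "video1": "a person plays an acoustic guitar on stage"
}
```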
We also provide an independent config file train_configs/audio%cc16m%webvid2m%laion_v4a2.json to manage the pretraining datasets and their sample rates/batch sizes during training. You can set a different data type, sample rate, batch size, and number of workers for each dataloader by changing task, steps, batch_size, and n_workers.
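As a rough sketch, one plausible shape for the per-dataloader entries in that file (only the keys task, steps, batch_size, and n_workers are fixed by the description above; the list structure, the name key, the task strings, and all values are illustrative):

```json
[
  {"name": "webvid2m", "task": "video_caption", "steps": 2, "batch_size": 64,  "n_workers": 8},
  {"name": "laion",    "task": "image_caption", "steps": 1, "batch_size": 128, "n_workers": 8}
]
```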
For more details, please refer to VALOR.
You can also use the default datasets provided by LAVIS and MiniGPT4, which organize the datasets as follows:
.
├── ${MINIGPT4_DATASET}
│   ├── cc_sbu
│   │   ├── convert_cc_sbu.py
│   │   ├── download_cc_sbu.sh
│   │   ├── ccs_synthetic_filtered_large.json
│   │   ├── ccs_synthetic_filtered_large.tsv
│   │   └── cc_sbu_dataset
│   │       ├── 00000.tar
│   │       ├── 00000.parquet
│   │       ...
│   ├── laion
│   │   ├── convert_laion.py
│   │   ├── download_laion.sh
│   │   ├── laion_synthetic_filtered_large.json
│   │   ├── laion_synthetic_filtered_large.tsv
│   │   └── laion_dataset
│   │       ├── 00000.tar
│   │       ├── 00000.parquet
│   │       ...
...
We provide the re-organized text annotation files for the MULTIS datasets at this googledrive_link or baiduyun_link (key: 2gt3).
Additionally, you need to download the raw images/videos/audios yourself. These raw data come from MSCOCO, MSRVTT, and AudioSet.
The stage-two datasets are re-organized in the following format:
├── datasets
│   ├── MULTIS
│   │   ├── images(optional)
│   │   │   ├── image0.jpg
│   │   │   └── image1.jpg
│   │   ├── videos(optional)
│   │   │   ├── video0.mp4
│   │   │   └── video1.mp4
│   │   ├── frames(optional)
│   │   │   ├── video0
│   │   │   │   ├── img_0001.jpg
│   │   │   │   └── img_0002.jpg
│   │   │   └── video1
│   │   │       ├── img_0001.jpg
│   │   │       └── img_0002.jpg
│   │   ├── audios(optional)
│   │   │   ├── video0.wav
│   │   │   └── video1.wav
│   │   └── MULTIS_annotation
│   │       ├── annotation0.json
│   │       └── annotation1.json
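The annotation files define the instruction-tuning samples; their exact schema comes with the download above. A hedged sketch of what a single entry might look like (every field name and value here is illustrative, not the real format):

```json
[
  {
    "video": "videos/video0.mp4",
    "audio": "audios/video0.wav",
    "instruction": "Describe what is happening in the video.",
    "answer": "A man plays an acoustic guitar on a small stage."
  }
]
```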
You can also change the dataloader settings in instructiontuning_configs/ivaav_inschat.json, and add or modify task-specific prompts in instructiontuning_configs/task_prompt.json.
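As a rough illustration, task_prompt.json plausibly maps each task to one or more prompt templates, along these lines (the task names and prompt strings are illustrative, not the shipped contents):

```json
{
  "video_caption": [
    "Describe the video concisely.",
    "What is happening in this video?"
  ],
  "audio_caption": [
    "Describe the sound in this clip."
  ]
}
```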