[CVPR 2024] Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
Zhixiang Wei1, Lin Chen2, et al.
1 University of Science and Technology of China 2 Shanghai AI Laboratory
Project page: https://zxwei.site/rein
Paper: https://arxiv.org/pdf/2312.04265.pdf
Rein is an efficient and robust fine-tuning method, specifically developed to harness Vision Foundation Models (VFMs) for Domain Generalized Semantic Segmentation (DGSS). It achieves state-of-the-art results on Cityscapes→ACDC and on GTAV→Cityscapes+Mapillary+BDD100K. Using only synthetic data, Rein reaches 78.4% mIoU on the Cityscapes validation set! Using only the Cityscapes training set, we achieve an average mIoU of 77.6% on the ACDC test set!
Trained on Cityscapes, Rein generalizes to unseen driving scenes and cities: Nighttime Shanghai, Foggy Countryside, and Rainy Hollywood.
night_shanghai.mp4
rain_chicago.mp4
fog_beijing.mp4
Setting | mIoU | Config | Log & Checkpoint |
---|---|---|---|
GTAV | 66.7 | config | log & checkpoint |
+Synthia | 72.2 | config | log & checkpoint |
+UrbanSyn | 78.4 | config | log & checkpoint |
+1/16 of Cityscapes training | 82.5 | config | log & checkpoint |
GTAV | 60.0 | config | log & checkpoint |
Cityscapes | 77.6 | config | log & checkpoint |
Cityscapes | 60.0 | config | log & checkpoint |
Setting | Pretraining | Citys. mIoU | Config | Log & Checkpoint |
---|---|---|---|---|
ResNet50 | ImageNet1k | 49.1 | config | log & checkpoint |
ResNet101 | ImageNet1k | 45.9 | config | log & checkpoint |
ConvNeXt-Large | ImageNet21k | 57.9 | config | log & checkpoint |
ViT-Small | DINOv2 | 55.3 | config | log & checkpoint |
ViT-Base | DINOv2 | 64.3 | config | log & checkpoint |
CLIP-Large | OPENAI | 58.1 | config | log & checkpoint |
SAM-Huge | SAM | 59.2 | config | log & checkpoint |
EVA02-Large | EVA02 | 67.8 | config | log & checkpoint |
If you find our code or data helpful, please cite our paper:
```bibtex
@InProceedings{Wei_2024_CVPR,
    author    = {Wei, Zhixiang and Chen, Lin and Jin, Yi and Ma, Xiaoxiao and Liu, Tianle and Ling, Pengyang and Wang, Ben and Chen, Huaian and Zheng, Jinjin},
    title     = {Stronger Fewer \& Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {28619-28630}
}
```
- 🔥 To make it easier to integrate Rein into your own projects, we provide a simplified version: simple_reins. With it, you can use Rein as a plain feature extractor (note: this version removes the Mask2Former-related parts). A minimal usage sketch follows this list.
- We have uploaded the configs for `ResNet` and `ConvNeXt`.
- 🔥 We have uploaded the checkpoint and config for the `+1/16 of Cityscapes` training setting, which reaches 82.5% mIoU on the Cityscapes validation set!
- Rein has been accepted to `CVPR 2024`!
- 🔥 Using only the data from the Cityscapes training set, we achieved an average mIoU of 77.56% on the ACDC test set! This result ranks first among DGSS methods on the ACDC benchmark. The checkpoint is available in the release.
- Using only synthetic data (UrbanSyn, GTAV, and Synthia), Rein achieved an mIoU of 78.4% on Cityscapes! The checkpoint is available in the release.
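For the simple_reins item above, the sketch below shows the intended usage pattern as a frozen feature extractor. The import path, class name, and constructor arguments here are assumptions made purely for illustration; refer to the simple_reins code for the actual interface.

```python
# Hypothetical sketch of using the simplified Rein backbone as a feature extractor.
# Import path, class name, and arguments are illustrative assumptions, not the
# actual simple_reins API.
import torch

from simple_reins import ReinsDinoVisionTransformer  # hypothetical import path

backbone = ReinsDinoVisionTransformer()              # hypothetical constructor
backbone.load_state_dict(
    torch.load("checkpoints/dinov2_rein_and_head.pth", map_location="cpu"),
    strict=False,
)
backbone.eval()

images = torch.randn(1, 3, 512, 512)                 # dummy batch
with torch.no_grad():
    feats = backbone(images)                         # features for your own head
```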
Experience the demo: open demo.ipynb in any Jupyter-compatible editor to explore our demonstration.
For testing on the Cityscapes dataset, refer to the 'Install' and 'Setup' sections below.
To set up your environment, execute the following commands:
```bash
conda create -n rein -y
conda activate rein
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia -y
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.0"
pip install "mmsegmentation>=1.0.0"
pip install "mmdet>=3.0.0"
pip install xformers=='0.0.20' # optional for DINOv2
pip install -r requirements.txt
pip install future tensorboard
```
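After installation, a quick sanity check (not part of the original instructions, just a suggestion) is to confirm that the core packages import and that CUDA is visible:

```python
# Optional sanity check for the environment set up above.
import torch
import mmengine
import mmcv
import mmseg
import mmdet

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mmengine:", mmengine.__version__)
print("mmcv:", mmcv.__version__)
print("mmseg:", mmseg.__version__)
print("mmdet:", mmdet.__version__)
```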
Dataset preparation is similar to that of DDB.
- **Cityscapes**: Download `leftImg8bit_trainvaltest.zip` and `gtFine_trainvaltest.zip` from the Cityscapes Dataset and extract them to `data/cityscapes`.
- **Mapillary**: Download MAPILLARY v1.2 from Mapillary Research and extract it to `data/mapillary`.
- **GTA**: Download all image and label packages from TU Darmstadt and extract them to `data/gta`.
Prepare datasets with these commands:
```bash
cd Rein
mkdir data
# Convert data for validation if preparing for the first time
python tools/convert_datasets/gta.py data/gta # Source domain
python tools/convert_datasets/cityscapes.py data/cityscapes
# Convert Mapillary to Cityscapes format and resize for validation
python tools/convert_datasets/mapillary2cityscape.py data/mapillary data/mapillary/cityscapes_trainIdLabel --train_id
python tools/convert_datasets/mapillary_resize.py data/mapillary/validation/images data/mapillary/cityscapes_trainIdLabel/val/label data/mapillary/half/val_img data/mapillary/half/val_label
```
- **(Optional) ACDC**: Download all image and label packages from ACDC and extract them to `data/acdc`.
- **(Optional) UrbanSyn**: Download all image and label packages from UrbanSyn and extract them to `data/urbansyn`.
The final folder structure should look like this:
```
Rein
├── ...
├── checkpoints
│   ├── dinov2_vitl14_pretrain.pth
│   ├── dinov2_rein_and_head.pth
├── data
│   ├── cityscapes
│   │   ├── leftImg8bit
│   │   │   ├── train
│   │   │   ├── val
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val
│   ├── bdd100k
│   │   ├── images
│   │   │   ├── 10k
│   │   │   │   ├── train
│   │   │   │   ├── val
│   │   ├── labels
│   │   │   ├── sem_seg
│   │   │   │   ├── masks
│   │   │   │   │   ├── train
│   │   │   │   │   ├── val
│   ├── mapillary
│   │   ├── training
│   │   ├── cityscapes_trainIdLabel
│   │   ├── half
│   │   │   ├── val_img
│   │   │   ├── val_label
│   ├── gta
│   │   ├── images
│   │   ├── labels
├── ...
```
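As a convenience (not part of the repository), a few lines of Python can verify that the directories from the layout above are in place before training:

```python
# Quick check that the dataset folders described above exist.
from pathlib import Path

root = Path("data")
expected = [
    "cityscapes/leftImg8bit/train", "cityscapes/leftImg8bit/val",
    "cityscapes/gtFine/train", "cityscapes/gtFine/val",
    "bdd100k/images/10k/val", "bdd100k/labels/sem_seg/masks/val",
    "mapillary/half/val_img", "mapillary/half/val_label",
    "gta/images", "gta/labels",
]
for rel in expected:
    path = root / rel
    print(f"{'OK      ' if path.is_dir() else 'MISSING '}{path}")
```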
- Download: Download pre-trained weights from facebookresearch for testing. Place them in the project directory without changing the file name.
- Convert: Convert pre-trained weights for training or evaluation.
```bash
python tools/convert_models/convert_dinov2.py checkpoints/dinov2_vitl14_pretrain.pth checkpoints/dinov2_converted.pth
# optional, for the 1024x1024 resolution configs
python tools/convert_models/convert_dinov2.py checkpoints/dinov2_vitl14_pretrain.pth checkpoints/dinov2_converted_1024x1024.pth --height 1024 --width 1024
```
Run the evaluation:
```bash
python tools/test.py configs/dinov2/rein_dinov2_mask2former_512x512_bs1x4.py checkpoints/dinov2_rein_and_head.pth --backbone dinov2_converted.pth
```
For most of the provided release checkpoints, you can run this command to evaluate:
```bash
python tools/test.py /path/to/cfg /path/to/checkpoint --backbone /path/to/dinov2_converted.pth # (or dinov2_converted_1024x1024.pth)
```
Start training on a single GPU:
```bash
python tools/train.py configs/dinov2/rein_dinov2_mask2former_512x512_bs1x4.py
```
Start training on multiple GPUs:
```bash
PORT=12345 CUDA_VISIBLE_DEVICES=1,2,3,4 bash tools/dist_train.sh configs/dinov2/rein_dinov2_mask2former_1024x1024_bs4x2.py NUM_GPUS
```
Because we only fine-tune and save the Rein and head weights, if you need a complete set of segmentor weights, use this script:
```bash
python generate_full_weights.py --segmentor_save_path SEGMENTOR_SAVE_PATH --backbone CONVERTED_BACKBONE --rein_head REIN_HEAD
```
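generate_full_weights.py is the supported way to do this; conceptually it just combines the two checkpoints into one segmentor state dict. The sketch below only illustrates that idea; the `backbone.` key prefix and the checkpoint layout are assumptions, not the script's actual behavior.

```python
# Conceptual sketch: merge a converted backbone checkpoint with the fine-tuned
# Rein + head checkpoint. Key prefix and state-dict layout are assumptions;
# use generate_full_weights.py for the authoritative behavior.
import torch

backbone = torch.load("checkpoints/dinov2_converted.pth", map_location="cpu")
rein_head = torch.load("checkpoints/dinov2_rein_and_head.pth", map_location="cpu")

# Unwrap {'state_dict': ...} containers if present.
backbone = backbone.get("state_dict", backbone)
rein_head = rein_head.get("state_dict", rein_head)

full = {f"backbone.{k}": v for k, v in backbone.items()}  # assumed prefix
full.update(rein_head)                                     # Rein + decode head weights

torch.save({"state_dict": full}, "checkpoints/full_segmentor.pth")
```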
- Q: What is the difference between ReinMask2FormerHead and the original Mask2FormerHead?
- Q: How do I integrate Rein into my existing backbone (without mmsegmentation)? A schematic sketch of the idea follows this list.
- Q: How do I visualize predictions?
  A: Use `tools/visualize.py`, for example:
  ```bash
  python tools/visualize.py /path/to/cfg /path/to/checkpoint /path/to/images --backbone /path/to/converted_backbone
  ```
  Here `/path/to/images` can be a single image file or an image folder.
- Q: Why do we need multiple weight files during testing?
  A: The weight files used during testing are:
  - Backbone: the pre-trained backbone weights. Since Rein is a parameter-efficient fine-tuning method, the backbone is not fine-tuned; for a given backbone we therefore only need to store one set of parameters, which significantly reduces storage space.
  - Rein_head: the fine-tuned Rein weights and decode-head weights.
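For the integration question above, here is a schematic, simplified sketch of the core Rein idea: small sets of learnable tokens produce a residual refinement of the features emitted by each frozen backbone layer. This is an illustration only, not the exact formulation from the paper or the module in this repository.

```python
# Schematic sketch of the Rein idea (simplified; not the exact paper formulation).
import torch
import torch.nn as nn


class SimpleRein(nn.Module):
    """Learnable tokens that produce a residual refinement of one layer's features."""

    def __init__(self, dim: int, num_tokens: int = 100):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, C) patch tokens from a frozen backbone layer.
        attn = torch.softmax(feats @ self.tokens.t() / feats.shape[-1] ** 0.5, dim=-1)
        delta = self.proj(attn @ self.tokens)  # (B, N, C) refinement
        return feats + delta                   # residual update


class ReinWrappedBackbone(nn.Module):
    """Freeze an existing stack of transformer blocks; refine features after each block."""

    def __init__(self, blocks: nn.ModuleList, dim: int):
        super().__init__()
        self.blocks = blocks
        for p in self.blocks.parameters():
            p.requires_grad = False            # the backbone itself is never fine-tuned
        self.reins = nn.ModuleList(SimpleRein(dim) for _ in range(len(blocks)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block, rein in zip(self.blocks, self.reins):
            x = rein(block(x))
        return x
```

In this scheme, only the `reins` parameters (plus your decode head) are trained and saved, which is also why the release checkpoints ship only the Rein and head weights alongside a shared converted backbone.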
Our implementation is mainly based on the following repositories; we thank their authors.