Table1 #3

Open
geek-APTX4869 opened this issue Oct 3, 2024 · 1 comment
Labels
question Further information is requested

Comments

@geek-APTX4869

Dear author:
How can I reproduce the results of Table 1?

@pablomm
Collaborator

pablomm commented Oct 28, 2024

Hi @geek-APTX4869! Thank you for your interest, and apologies for the delay in responding—I wasn’t receiving notifications about activity in the repository.

To reproduce the results in Table 1, we perform the following steps:

  • Dataset Preparation: You’ll need to create synthetic datasets using Stable Diffusion or a similar image generation model (see the generation sketch after this list). In our case:

    • For VOC-sim, we generated 600 images for each model, using prompts with the template "A photograph of a 〈class-name〉". To control visual variability, we used the same random seeds across all architectures.
    • For COCO-cap, we generated images based on complex captions from the COCO dataset, following a schema similar to previous literature.
  • Ground Truth: We manually annotated the ground truth masks using CVAT software to avoid bias introduced by other segmentation models. While some other works generate ground truth via model-based segmentation, we opted for human annotation for higher fidelity.

  • Mask Extraction: We generated masks using each method evaluated. The process generally involved:

    • Extracting the attentions associated with the class-related word in the prompt.
    • Applying preprocessing and thresholding to convert the attention maps into binary masks, following each of the different methods (see the thresholding sketch after this list).
    • For token optimization, instead of using the prompt-word embedding, we utilized a token optimized on a separate image (not included in the evaluation set). This allows us to extract attentions for words not included in the prompt used for image generation.

  • Evaluation: We computed the mIoU between the ground truth masks and the binary masks generated by each method (a minimal mIoU sketch follows the list).
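
For reference, here is a minimal sketch of the VOC-sim generation loop using the diffusers library. The checkpoint id, class subset, image count, and output paths are illustrative assumptions rather than the exact setup behind Table 1; the key point is reusing the same seed per image index across architectures.

```python
# Minimal sketch of the VOC-sim generation step (assumptions: diffusers,
# the runwayml/stable-diffusion-v1-5 checkpoint, and a toy class list).
import os
import torch
from diffusers import StableDiffusionPipeline

voc_classes = ["dog", "cat", "aeroplane"]  # hypothetical subset of the VOC classes
images_per_class = 30                      # illustrative; 600 images in total per model

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

os.makedirs("voc_sim", exist_ok=True)
for class_name in voc_classes:
    prompt = f"A photograph of a {class_name}"
    for seed in range(images_per_class):
        # Fixing the seed per image index keeps visual variability
        # comparable when the same loop is run for other architectures.
        generator = torch.Generator(device="cuda").manual_seed(seed)
        image = pipe(prompt, generator=generator).images[0]
        image.save(f"voc_sim/{class_name}_{seed:03d}.png")
```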
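The thresholding step is method-specific, but its generic shape looks like the sketch below; the min-max normalization, bilinear upsampling, and fixed 0.5 threshold are illustrative assumptions, not the exact preprocessing of any single evaluated method.

```python
# Sketch of turning a low-resolution cross-attention map into a binary mask.
import numpy as np
from PIL import Image

def attention_to_mask(attn: np.ndarray, out_size=(512, 512), threshold=0.5) -> np.ndarray:
    """Normalize, upsample, and threshold a 2D attention map."""
    # Min-max normalize to [0, 1] (epsilon guards against a constant map).
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    # Upsample from the attention resolution (e.g. 16x16 or 64x64) to image size.
    attn = Image.fromarray((attn * 255).astype(np.uint8)).resize(out_size, Image.BILINEAR)
    attn = np.asarray(attn, dtype=np.float32) / 255.0
    # Binarize into a foreground/background mask.
    return (attn > threshold).astype(np.uint8)
```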
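Finally, the evaluation itself reduces to a per-image IoU against the CVAT annotations, averaged over the set. The sketch below assumes single-foreground binary masks; how an empty union is scored is an assumption about the protocol.

```python
# Sketch of the mIoU computation over (prediction, ground-truth) mask pairs.
import numpy as np

def binary_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU of the foreground class for two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return np.logical_and(pred, gt).sum() / union

def mean_iou(pairs) -> float:
    """mIoU over an iterable of (pred_mask, gt_mask) pairs."""
    return float(np.mean([binary_iou(p, g) for p, g in pairs]))
```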

You can find a link to the dataset with generated images and annotations for evaluation in the README. Please feel free to reach out if you have further questions!

@pablomm added the question label Oct 28, 2024