[FIX] MM Eval Mask Sizes #1920
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1920.
✅ No failures as of commit e988093 with merge base d338066.
```diff
@@ -426,7 +426,8 @@ def padded_collate_tiled_images_and_mask(
     if pad_max_images is not None:
         _, _, img_seq = concat_masks.shape
         concat_masks = F.pad(
-            concat_masks, (0, pad_max_images * image_seq_len - img_seq)
+            concat_masks,
```
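For context on what this padding does: F.pad with a (left, right) pair pads only the last dimension of the mask, extending the image-sequence axis so it covers pad_max_images worth of positions. A minimal sketch of the original line's behavior, with hypothetical shapes that are not taken from the PR:

```python
import torch
import torch.nn.functional as F

# Hypothetical batch: 2 samples, 10 text tokens, and a concatenated
# image-sequence axis of 2 images x 3 tokens per image = 6 positions.
concat_masks = torch.ones(2, 10, 6, dtype=torch.bool)

pad_max_images = 4  # pretend every sample could hold up to 4 images
image_seq_len = 3   # positions contributed by a single image

_, _, img_seq = concat_masks.shape
# (0, n) right-pads the last dimension by n with the default value 0 (False),
# so the padded columns are masked out rather than attended to.
concat_masks = F.pad(concat_masks, (0, pad_max_images * image_seq_len - img_seq))

print(concat_masks.shape)  # torch.Size([2, 10, 12])
```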
Where does the pad to max images happen? This just pads the masks to the max number of images? And would pad direction affect image padding at all?
You don't actually need to add the padded images, or you'd waste compute on them. If you have a KV cache sized for 7 images, then you want to mask out the additional images. It's similar to how you mask extra token positions during inference but don't add padding tokens.
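To illustrate that reply with a hand-rolled sketch (not torchtune code, and all sizes are made up): with a cross-attention cache laid out for a maximum number of images, the mask columns for the missing images simply stay False, the same way extra token positions are masked at inference time instead of feeding real padding tokens through the model.

```python
import torch

# Hypothetical sizes: the cache is laid out for max_images images, each
# contributing image_seq_len key/value positions; this sample has only 2.
max_images, image_seq_len, text_len = 4, 3, 10
num_actual_images = 2

# Enable only the positions that belong to real images; columns for the
# unused image slots remain False, so text tokens never attend to them.
mask = torch.zeros(text_len, max_images * image_seq_len, dtype=torch.bool)
mask[:, : num_actual_images * image_seq_len] = True

print(mask[0])  # first 6 positions True, last 6 False
```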
Context
What is the purpose of this PR? It fixes a bug.
The issue first reported in #1874 is that eval fails for Llama 3.2 Vision 11B. The root cause is that the VisionCrossAttentionMask transform was padding the masks to 4 tiles during inference, while the padded_collate_tiled_images_and_mask function assumed the masks weren't padded and therefore inferred incorrect shape information.
The solution is to remove any inference-time padding logic from the mask transform and to pass pad_max_tiles=4 to the collate function during inference and eval, letting the collate function handle all the padding.
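A hedged sketch of that wiring (import path and usage are assumptions based on this description, not the PR's recipe code):

```python
from functools import partial

from torchtune.data import padded_collate_tiled_images_and_mask

# The mask transform no longer pads at inference time, so the collate function
# is told to pad every image's tiles, and the matching mask columns, out to 4.
collate_fn = partial(padded_collate_tiled_images_and_mask, pad_max_tiles=4)

# Hand collate_fn to the eval/generation DataLoader, e.g.
# torch.utils.data.DataLoader(eval_dataset, batch_size=4, collate_fn=collate_fn)
```

Keeping all padding in the collate function means it can infer shapes from genuinely unpadded masks coming out of the transform, which is exactly the assumption that was being violated before.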
During this investigation I also found that padded_collate_tiled_images_and_mask was using the image_seq_len variable from its last assignment inside a loop, so if there were multiple images with different sizes the variable held the wrong value by the end of the loop. This is updated as well.
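The bug pattern in isolation (variable names are illustrative, and the remedy shown is one possible fix rather than the PR's exact diff):

```python
# Two images with different tile counts, hypothetical tokens per tile.
tiles_per_image = [4, 2]
tokens_per_tile = 101

# Buggy pattern: after the loop, image_seq_len holds the value from the *last*
# image only, so any later use assumes every image is that size.
for n_tiles in tiles_per_image:
    image_seq_len = n_tiles * tokens_per_tile

print(image_seq_len)  # 202, even though the first image spans 404 positions

# One way to avoid it: derive the padding width from the largest image instead.
image_seq_len = max(n * tokens_per_tile for n in tiles_per_image)
print(image_seq_len)  # 404
```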
Changelog
What are the changes made in this PR?
- Remove the inference-time padding logic from the VisionCrossAttentionMask transform.
- Pass pad_max_tiles=4 to the collate function during inference and eval so that it handles all padding.
- Fix padded_collate_tiled_images_and_mask so it no longer reuses image_seq_len from the last loop iteration.
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- run pre-commit hooks and linters (make sure you've first installed them via pre-commit install)
- run unit tests via pytest tests
- run recipe tests via pytest tests -m integration_test
Ran as usual:
`tune run dev/generate_v2 --config llama3_2_vision/generation_v2`
Ran as usual:
`tune run full_finetune_single_device --config llama3_2_vision/11B_full_single_device`
Fixed by this PR:
`tune run eleuther_eval --config llama3_2_vision/evaluation`