Multimodal collater with interleaved image, cross-attention mask padding #1156

Merged: 21 commits into pytorch:main on Sep 11, 2024

Conversation

@RdoubleA (Contributor) commented Jul 9, 2024

Context

Add the batch collater for multimodal image + text datasets. The collater supports the following:

  • Tiled images
  • Multiple images per sample
  • Cross-attention masks

Inputs must be samples from the multimodal dataset, post-tiling and post-transform (a toy example follows the list):

  • "tokens": List[int] of length text_seq_len, varies across samples
  • "labels": List[int] of length text_seq_len, varies across samples
  • "images": List[Tensor], each with shape (n_tiles, c, h, w)
  • "encoder_mask": List[Tensor], each with shape (text_seq_len, image_seq_len)
  • "aspect_ratio": List[Tensor], each with shape (h_ratio, w_ratio)

It performs the following actions (a code sketch follows the list):
(1) Pad text sequence and encoder mask to the longest sequence length in the batch
(2) Pad image tensors in the tile dimension with zeros to the largest number of tiles in the batch
(3) Add empty images of zeros to samples up to max number of images in the batch
(4) Pad aspect ratios with (1,1) for all added padding images
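A minimal, unoptimized sketch of these four steps, written against the flat toy schema above. This is illustrative only, not the PR's code; the real implementation in torchtune/data/_collate.py nests image inputs under "encoder_input" and handles more edge cases:

import torch
import torch.nn.functional as F

PAD_ID = 0
IGNORE_ID = -100

def collate_sketch(batch):
    # (1) Longest text sequence in the batch.
    max_text_len = max(len(s["tokens"]) for s in batch)
    # (2)/(3) Largest tile count and image count in the batch.
    max_tiles = max(img.shape[0] for s in batch for img in s["images"])
    max_imgs = max(len(s["images"]) for s in batch)

    tokens, labels, images, masks, ratios = [], [], [], [], []
    for s in batch:
        n_pad = max_text_len - len(s["tokens"])
        tokens.append(torch.tensor(s["tokens"] + [PAD_ID] * n_pad))
        labels.append(torch.tensor(s["labels"] + [IGNORE_ID] * n_pad))

        img_list, mask_list, ratio_list = [], [], []
        for img, mask, ar in zip(s["images"], s["encoder_mask"], s["aspect_ratio"]):
            # Assumes image_seq_len = n_tiles * tokens_per_tile, as in the docstring.
            tokens_per_tile = mask.shape[1] // img.shape[0]
            # (2) Zero-pad the tile dim to max_tiles; (1) pad the mask's text
            # dim and grow its image dim to match the padded tile count.
            img_list.append(F.pad(img, (0, 0, 0, 0, 0, 0, 0, max_tiles - img.shape[0])))
            mask_list.append(
                F.pad(mask, (0, tokens_per_tile * max_tiles - mask.shape[1], 0, n_pad))
            )
            ratio_list.append(ar)
        # (3) All-zero padding images (and masks) up to max_imgs, and
        # (4) aspect ratio (1, 1) for every padding image.
        c, h, w = img_list[0].shape[1:]
        img_seq = mask_list[0].shape[1]
        while len(img_list) < max_imgs:
            img_list.append(torch.zeros(max_tiles, c, h, w))
            mask_list.append(torch.zeros(max_text_len, img_seq))
            ratio_list.append(torch.tensor([1, 1]))
        images.append(torch.stack(img_list))
        masks.append(torch.cat(mask_list, dim=1))
        ratios.append(torch.stack(ratio_list))

    return {
        "tokens": torch.stack(tokens),        # (bsz, max_text_len)
        "labels": torch.stack(labels),
        "images": torch.stack(images),        # (bsz, max_imgs, max_tiles, c, h, w)
        "encoder_mask": torch.stack(masks),   # (bsz, max_text_len, max_imgs * max_tiles * tokens_per_tile)
        "aspect_ratio": torch.stack(ratios),  # (bsz, max_imgs, 2)
    }

Note that the per-image masks are concatenated along the image-sequence dimension, which is where the (bsz, text_seq_len, image_seq_len) shape discussed in the review below comes from.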

Feedback requested:

  • The name is a bit verbose, but padded_collate_multimodal is too vague as well
  • We are padding numerous dimensions here, which requires nested for loops with runtime O(total number of images in the batch), run twice. It would be great to see if we can simplify/optimize/vectorize this further

Changelog

  • Added the collate function + unit test

Test plan

pytest tests/torchtune/utils/test_collate.py

Docs

[two screenshots of the rendered docs]

pytorch-bot commented Jul 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1156

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 85dbb95 with merge base 8451b0d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the "CLA Signed" label (authors need to sign the CLA before a PR can be reviewed) on Jul 9, 2024.
@codecov-commenter commented Sep 4, 2024

Codecov Report

Attention: Patch coverage is 98.59155% with 1 line in your changes missing coverage. Please review.

Project coverage is 73.42%. Comparing base (8451b0d) to head (6a0b462).
Report is 1 commit behind head on main.

Files with missing lines    | Patch % | Lines
torchtune/data/_collate.py  | 97.22%  | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1156      +/-   ##
==========================================
+ Coverage   71.21%   73.42%   +2.20%     
==========================================
  Files         287      287              
  Lines       14058    14128      +70     
==========================================
+ Hits        10011    10373     +362     
+ Misses       4047     3755     -292     


@pbontrager (Contributor) left a comment:

Really good way to handle a complicated topic. I think we should name this padded_collate_vision_text for now and pretend that it's general until it's not. My suspicion is that we'll have to move this into the model folder in the future as it's so model specific. After you address the comments I left, I'll test this for you on my script.

(Two resolved review threads on torchtune/data/_collate.py, outdated.)
... "encoder_mask": [torch.ones(2, 5 * 4)],
... },
... ]
>>> model_inputs = padded_collate_vision_text(batch=batch)
Contributor:

This name doesn't match the current name. I actually prefer padded_collate_vision_text as it's more straightforward, and we can either generalize this function or split and rename as we get more vision_text models in the future.

Collaborator:

+1

[8, 9, -100, -100]])
>>> print(model_inputs["encoder_input"]["images"].shape) # (bsz, max_num_images, max_num_tiles, c, h, w)
torch.Size([2, 2, 4, 1, 1, 1])
>>> print(model_inputs["encoder_mask"].shape) # (bsz, max_num_images, max_num_tiles, tokens_per_tile * max_num_tiles)
Contributor:

This should actually be [2, 4, 40], since cross-attention is the text sequence against the image sequence, and the image sequence length is num_images * num_tiles * tokens_per_tile.

@RdoubleA (author):

but what about batch size?
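(For reference, with the docstring's toy values the arithmetic works out as: bsz = 2, padded text_seq_len = 4, and image_seq_len = max_num_images * max_num_tiles * tokens_per_tile = 2 * 4 * 5 = 40, so batch size is the leading dimension of the proposed [2, 4, 40].)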

torch.Size([2, 2, 4, 1, 1, 1])
>>> print(model_inputs["encoder_mask"].shape) # (bsz, max_num_images, max_num_tiles, tokens_per_tile * max_num_tiles)
torch.Size([2, 2, 4, 20])
>>> print(model_inputs["encoder_input"]["aspect_ratio"].shape) # (bsz, max_num_images, 2)
Contributor:

I'm not sure if this should be [2, 2, 2] or [2, 4]. @felipemello1 ?

Contributor:

aspect_ratio should be (bsz, max_num_images, 2), and then in CLIP we reshape:

aspect_ratio = aspect_ratio.reshape(bsz_and_n_imgs, 2)

https://github.com/pytorch/torchtune/blob/82c232d0679ddef3fc419cdc18af758b98b4da05/torchtune/modules/vision_transformer.py#L354C9-L354C63

collated_text = padded_collate_sft(text_only, padding_idx, ignore_idx)
max_seq_len = collated_text["tokens"].shape[-1]

# TODO: Figure out how to make this more efficient or vectorized. Setting
@felipemello1 (Contributor) commented Sep 5, 2024:

I didn't think too much about it, but maybe:

  1. Do a first pass to find the max of each dimension.
  2. Create a tensor of all zeros; pre-allocating should simplify all the padding.
  3. Copy each input into the correct slice of the tensor, e.g. tensor[0] += sample (see the sketch below).
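A minimal sketch of that idea for the image tensor alone. This is hypothetical code reusing the toy flat schema from the description, not what the PR ships:

import torch

def preallocate_images(batch):
    # Pass 1: find the max of each padded dimension.
    max_imgs = max(len(s["images"]) for s in batch)
    max_tiles = max(img.shape[0] for s in batch for img in s["images"])
    c, h, w = batch[0]["images"][0].shape[1:]

    # Pass 2: one all-zeros tensor; the zeros double as padding for both
    # missing tiles and missing images, so no explicit pad calls are needed.
    out = torch.zeros(len(batch), max_imgs, max_tiles, c, h, w)

    # Pass 3: copy each real image into its slot.
    for i, s in enumerate(batch):
        for j, img in enumerate(s["images"]):
            out[i, j, : img.shape[0]] = img
    return out

The same trick would extend to the mask and aspect-ratio tensors, though as noted below you still loop over every image to do the copies.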

@RdoubleA (author):

Pre-allocating would definitely simplify the code. I would still need to loop through each individual image, though.

@RdoubleA (author):

I will leave this as a follow-up, though, in the interest of time.

@RdoubleA (author) commented Sep 5, 2024:

> I think we should name this padded_collate_vision_text for now and pretend that it's general until it's not.

I actually think we should go the other way and keep this overspecified until we have a concrete use case to generalize it; I'd rather be overspecific than mislead users into thinking this can be used for any multimodal model.

- "tokens": List[int] of length text_seq_len, varies across samples
- "labels": List[int] of length text_seq_len, varies across samples
- "encoder_input": Dict[str, List[torch.Tensor]]
- "images": List[torch.Tensor], each with shape (n_tiles, c, h, w)
Collaborator:

Can you say somewhere that c, h, w = channel, height, width?

@pbontrager (Contributor) left a comment:

I've tested this and everything seems to be working accurately. The only issue right now is that this can't be used for inference: we need to expose padded_direction for text and not expect "labels" when padded_direction == "left". Also, I'd like to propose the name "padded_collate_tiled_images_and_mask".
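For illustration, left padding of the text tokens might look like the sketch below. This is hypothetical code, not the PR's implementation; padded_direction is reused from the comment above as a parameter name:

import torch
import torch.nn.functional as F

def pad_text(token_lists, padding_idx=0, padded_direction="right"):
    # For inference we pad on the left so that the last position of every
    # row is a real token; "labels" would be skipped entirely in that case.
    max_len = max(len(t) for t in token_lists)
    rows = []
    for t in token_lists:
        n = max_len - len(t)
        pad = (n, 0) if padded_direction == "left" else (0, n)
        rows.append(F.pad(torch.tensor(t), pad, value=padding_idx))
    return torch.stack(rows)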

@pbontrager (Contributor) left a comment:

Thank you!

@pbontrager pbontrager merged commit 377abc0 into pytorch:main Sep 11, 2024
17 checks passed
@RdoubleA RdoubleA deleted the mm_collator branch September 11, 2024 22:04