Images in Messages #1504
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1504
✅ No failures as of commit 6fe061b with merge base 66590b4. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This would mean that the model transform just takes in a List[Message] directly instead of a sample dictionary, or do we still plan to pass in sample["messages"] to model transforms?
@@ -89,6 +89,8 @@ def __call__(self, sample: Mapping[str, Any]) -> Mapping[str, Any]:
                    role="system", content=self.new_system_prompt, masked=True, eot=True
                )
            )

+           # Add in image stuffs / load from file
I am leaning towards handling image loading in the model transform and keeping a URL here, so we don't need to pass in a heavy PIL image everywhere Messages is used. We only load in the image when it's absolutely needed. For tokenize_messages, messages would be more lightweight since it won't contain the actual image.
Re passing around PIL images, do you mean that there are memory implications to doing something like this? Because if we are just passing stuff by reference it shouldn't be an issue, right?
If there were a bunch of operations on messages that'd be more convenient with the file path, that'd be one thing. But it seems to me like most (nontrivial) functions we're gonna call on Messages containing images would be those that act on the image, not the path.
In that case I'd be inclined to keep things as a PIL image because it's the raw datatype of an image. It's analogous to the raw string we have in the content field of a Message with text type. I think {"text" -> str, "image" -> PIL} is a lot more natural for a user than {"text" -> str, "image" -> Path}. But lmk if I'm missing the point here.
See comment in FAQs. Generally, I agree, but I'm making the tradeoff for "weight" of dependencies.
Right now, it's possible to train and run inference without downloading extra things needed for multimodal like PIL and torchvision
I think Joe's point here is quite valid. I agree having it PIL image by default is more intuitive but as of now multimodal is not prevalent enough as a use case.
tests/assets/dog_on_skateboard.jpg
Outdated
totally tubular
"Requests changes"
😭
@@ -55,7 +55,8 @@ def test_label_no_masking(self, load_dataset, tokenizer):
            model_transform=tokenizer,
            train_on_input=True,
        )
-       input, labels, images = ds[0]["tokens"], ds[0]["labels"], ds[0]["images"]
+       input, labels = ds[0]["tokens"], ds[0]["labels"]
we should have DummyTokenizer load in images from the message (to replicate the model transform) and then check that the image tensor is returned here as expected. that way it would be a good e2e test of passing the images through the messages into the model transform
Not sure I follow. Our tokenizers (model transforms) take in images through the messages, but they do not return them afterwards.
Looking at Flamingo transform for this.
we're using DummyTokenizer as the substitute for a model transform here. Really it should be called DummyModelTransform. My point is we should test that images are passed through from the dataset to messages and are processed correctly. Right now with these changes we are not checking images at all, only the image special tokens.
-            if isinstance(content, str)
-            else content
-        )
+        self.content = self._convert_to_list_of_dict(content)
very nice
-        {"type": "image"},
+        {
+            "type": "image",
+            "content": sample[self._column_map["images"]],
this will be a PIL image. can image content still double as PIL image or path string?
I mean it can - but it should be consistent
Damn, if this is coming in as a PIL image, then it might make sense to go ahead with the changes to load image into Message as a PIL image
ha... thinking about it a bit more, I suppose there's no actual memory hit. if you're using multimodal datasets you will certainly need to load the image at some point. then again, changing my mind after five minutes hurts my pride just a little
I don't really want a message to be Union[Path, str, PIL.Image]. That seems like a recipe for confusion.
ALRIGHT F IT, here's the plan:
I was a big ol dumb dumb. This stuff is gunna go. Messages will now hold PIL images. This will be loosely typed in Messages so that we don't need to actually import PIL, and Messages can stay free of that dependency hell. Then the onus for providing a proper PIL image to the Message class is on the dataset builder and the generation recipe. For generation, we could just have a separate recipe for multimodal or require PIL for everything. For dataset builders, everything is in the multimodal folder, which protects us from imports (thanks for the forward thinking @RdoubleA).
This makes the most logical sense IMO.
Data entry points (datasets, generation) load in the PIL image -> feed to message in proper format -> message is used by model transform however it wants.
Okay? Okay.
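A loose sketch of what this plan implies for the Message class (field names are assumed from the snippets in this PR, not the actual implementation): Message stores whatever image object the caller loaded, without ever importing PIL itself.

```python
from typing import Any, Dict, List, Union


class Message:
    """Loosely-typed sketch: image content is whatever object the entry point
    loaded (e.g. a PIL image), so this module never imports PIL itself."""

    def __init__(self, role: str, content: Union[str, List[Dict[str, Any]]]):
        self.role = role
        # Normalize plain strings into the list-of-dict content format
        if isinstance(content, str):
            content = [{"type": "text", "content": content}]
        self.content = content


# The entry point (dataset builder / generation recipe) loads the image...
image = object()  # stand-in for a PIL.Image.Image loaded via some load_image()
# ...and hands it to the Message in the proper format:
m = Message(
    "user",
    [{"type": "image", "content": image}, {"type": "text", "content": "What is this?"}],
)
print(len(m.content))  # 2
```

The model transform can then consume `m.content` however it wants, and only multimodal code paths ever touch PIL.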
Idk what just happened here but I'm on board
ALRIGHT F IT
you right, I'm on board
I feel that
# TODO: point to Flamingo model transform as an example
def llava_instruct_dataset(
    model_transform: Transform,
    *,
    source: str = "liuhaotian/LLaVA-Instruct-150K",
    images_dir: str = "coco/",
I would love some input here. Right now this is a string b/c I'm anticipating usage from a config. However, the proper typing would be something like pathlib.Path b/c it would be much more tolerant of errant forward slashes and plays better when combining subpaths.
But then, I'll have to do some conversion on the backside to make that work.
Thoughts? @RdoubleA @ebsmothers
If we convert to Path on the backside but still maintain a string parameter, that should handle any trailing/incorrect slashes right? If so I would do that
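A minimal sketch of that suggestion, assuming a pared-down hypothetical helper (the real builder takes many more arguments): keep the config-facing parameter a str, and normalize to pathlib.Path internally so joins and stray slashes are handled.

```python
from pathlib import Path


def resolve_image_path(images_dir: str, filename: str) -> Path:
    # Path() absorbs a trailing slash, and the / operator joins cleanly,
    # so config strings like "coco" and "coco/" behave identically
    return Path(images_dir) / filename


print(resolve_image_path("coco", "0001.jpg"))   # coco/0001.jpg
print(resolve_image_path("coco/", "0001.jpg"))  # coco/0001.jpg
```

The config stays plain strings; only the internals pay the conversion cost.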
        PIL.Image.Image: The loaded image.
    """
    # Hackily import PIL to avoid burdensome import in the main module
    # TODO: Fix this
isn't this import already pretty prevalent throughout the library? or are you avoiding some other issue?
It's only prevalent if doing MM. Right now, you can completely run all our text stuff without the need to import torchvision, PIL, etc...
I'd love to keep it that way.
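The deferred-import pattern being defended here can be sketched like this (a generic sketch, not the actual torchtune code): PIL is only imported when an image is actually loaded, so text-only users never need it installed.

```python
def load_image(image_loc):
    """Load an image from a local path; requires Pillow only at call time."""
    try:
        # Deferred so the module imports cleanly without Pillow installed
        from PIL import Image
    except ImportError as e:
        raise ImportError(
            "Loading images requires Pillow: pip install Pillow"
        ) from e
    return Image.open(image_loc)
```

Text-only pipelines never hit this code path, so the top-level package keeps zero multimodal dependencies.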
What's the fix here? Honestly I think this is an OK solution
Would be putting multimodal stuff in a subfolder likely.
torchtune/data/_utils.py
Outdated
@@ -53,7 +90,7 @@ def split_text_by_image_tag(content: str, image_tag: str) -> List[Dict[str, str]]
    "role": "system" | "user" | "assistant",
    "content":
        [
-           {"type": "image"},
+           {"type": "image", "content": "path/to/image1.png"},
this needs to be PIL image
u need to be a pil image 😡
torchtune/data/_utils.py
Outdated
>>> content = format_content_with_media(
...     "<|image|>hello <|image|>world",
...     image_tag="<|image|>",
...     images=[<PIL.Image.Image>, <PIL.Image.Image>"]
-     ... images=[<PIL.Image.Image>, <PIL.Image.Image>"]
+     ... images=[<PIL.Image.Image>, <PIL.Image.Image>]
    return image


def format_content_with_images(
nit: maybe the name insert_images_in_content is clearer about what the function is doing
True, but it also does the splitting?
splitting so we can insert the images in the correct location :) but, your call
torchtune/data/_utils.py
Outdated
def format_content_with_images(
    content: str, *, image_tag: str, images: List["PIL.Image.Image"]
) -> List[Dict[str, str]]:
    """
    Given a raw text string, split by the specified ``image_tag``
    and form into list of dictionaries to be used in the ``Message`` content
nit: while you're here, could you describe what happens if image tag is not found in the content string? IIRC, it's just a no-op
So, if image_tag and images are both None, it's a no-op. Otherwise it'll error, b/c if there are images, it expects an image_tag.
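The behavior described above could be sketched as follows; the signature comes from the diff in this PR, but the body is an assumption rather than torchtune's actual implementation. The text is split on the tag, and each tag occurrence consumes one image in order.

```python
from typing import Any, Dict, List


def format_content_with_images(
    content: str, *, image_tag: str, images: List[Any]
) -> List[Dict[str, Any]]:
    # Each occurrence of image_tag marks where one image belongs
    parts = content.split(image_tag)
    if len(parts) - 1 != len(images):
        raise ValueError(
            f"Number of image tags ({len(parts) - 1}) does not match "
            f"number of images ({len(images)})"
        )
    formatted: List[Dict[str, Any]] = []
    for i, text in enumerate(parts):
        if text:  # skip empty text chunks (e.g. a leading tag)
            formatted.append({"type": "text", "content": text})
        if i < len(images):
            formatted.append({"type": "image", "content": images[i]})
    return formatted


result = format_content_with_images(
    "<|image|>hello <|image|>world", image_tag="<|image|>", images=["img1", "img2"]
)
print(result)
```

With no tags and no images, the split yields one text chunk and the function is effectively a no-op; a tag/image count mismatch raises.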
@@ -54,6 +55,8 @@ class LlavaInstructToMessages(Transform):
    new_system_prompt (Optional[str]): if specified, prepend a system message. This can
        serve as instructions to guide the model response. Setting this will OVERRIDE any system
        messages already present in the dataset. Default is None.
+   images_dir (str): path to the directory containing the images. User is expected to download the COCO dataset.
not sure if we should have a default. this should be explicitly defined by the user, otherwise they may use the builder/transform, forget to specify it, and get confused when it breaks
I like the default for something like this where it's for a very specific dataset.
pil_image = load_image(
    self.images_dir + sample[self._column_map["image"]]
)
content = format_content_with_images(
this reads really nicely now
Force-pushed from 47ff58f to 5a74b30
heroic effort, thanks for this. I have no other concerns
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:
@@             Coverage Diff             @@
##             main    #1504       +/-   ##
===========================================
+ Coverage   27.22%   71.10%   +43.88%
===========================================
  Files         286      286
  Lines       13828    13925      +97
===========================================
+ Hits         3764     9901    +6137
+ Misses      10064     4024    -6040

☔ View full report in Codecov by Sentry.
I left one comment on a potential change but otherwise it looks good to go.
from PIL import Image

# If pointing to remote source, try to load to local
if isinstance(image_loc, str) and image_loc.startswith("http"):
It might be more robust to use urllib.parse to check if the string is a url. I don't think http is required to use urlopen.
How? I can parse it, but urllib.parse parses it even if it's not a valid URL.
Limiting to http and https is restricting, sure, but we can relax when needed.
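A quick sketch of the tradeoff being discussed: urllib.parse.urlparse "succeeds" on almost any string, so the useful signal is whether it yields a recognized scheme, which lands close to the startswith("http") check anyway.

```python
from urllib.parse import urlparse


def is_http_url(s: str) -> bool:
    # urlparse never rejects a plain string; check the scheme instead
    return urlparse(s).scheme in ("http", "https")


print(is_http_url("http://example.com/dog.jpg"))  # True
print(is_http_url("coco/train2017/0001.jpg"))     # False
print(repr(urlparse("not a url").scheme))         # '' -- no error, no scheme
```

Restricting to http/https is exactly what the existing check does; the scheme test just tolerates leading whitespace and odd casing a bit differently.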
assert isinstance(
    content, list
), f"content must be of type List[Dict[str, Any]], got {content}"
I know we don't wanna do it here, but is there somewhere that we can/should validate that we have a PIL image when type="image"?
validate_messages?
Argh, I cannot do this without a nested import of PIL.
OK fine to leave it as is then
if role == "system" and self.new_system_prompt is not None:
    continue
content = split_text_by_image_tag(message["value"], "<image>")
if role == "user":
    image_path = sample[self._column_map["image"]]
So this will be present in every sample in the dataset?
yup
    raise ValueError(f"Failed to load image from {image_loc}") from e

# Open the local image as a PIL image
try:
    image = Image.open(image_loc)
except Exception as e:
    raise ValueError(f"Failed to open image as PIL Image from {image_loc}") from e
nit: are these really ValueErrors? (Or are you doing the thing I also do and lazily using ValueError as a catch-all for things that are not actually ValueErrors)
WRT the first one, no it could be several different errors. Trying to call a remote URL always has plenty of exception types. I could just leave the error as is and let the user deal with it, but that seems messy.
WRT the second value error, more often than not, it will be a ValueError as the provided value is unable to be opened by the PIL Image interface.
Do you want me to remove the exceptions with error messages and just let it ride all the way to the user?
OK thanks for the explanation, in that case I think what you have is fine
Co-authored-by: ebsmothers <[email protected]>
Put image content in Message (and other suspicious changes)

Background
No snappy intro, I'm just tired.
This all started in considering the inference UX for multimodal models. The ideal state would be something like specifying the image and prompt in the config, and then the user can simply call the generate recipe.
But, what is this getting at?
The concept is that a "Message" should contain all the information needed for an interaction with a model, INCLUDING images. Before, images were split off from Messages and a placeholder was put in the Message object so the model knew where to inject the content. After these changes, the model now has ALL the information it needs within the message content itself.
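Schematically (with made-up values, not the exact code from the PR), the before/after looks something like:

```python
# Illustrative sketch only -- values and shapes are made up to show the idea.

# Before: the Message held a placeholder, and images traveled separately.
before_content = [
    {"type": "image"},  # placeholder: the model knew *where*, not *what*
    {"type": "text", "content": "What is in this image?"},
]
separate_images = ["<PIL.Image.Image>"]  # carried alongside, not inside

# After: the image content rides inside the Message itself.
after_content = [
    {"type": "image", "content": "<PIL.Image.Image>"},
    {"type": "text", "content": "What is in this image?"},
]
```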
The core of the change is relatively simple; however, we already backed the split images work into all our multimodal stuff, so that all had to be reworked as well. See below for a full changelog.
Changelog

- Added a get_media function to easily get all the image content from a Message

Testing
FAQs
Why did you make it possible to load from remote or local sources? This is a feature usually demonstrated for inference, as in Phi3 Vision and Qwen2 Vision. We don't want to force users to download the image to the device first, so this handles both.
Why not convert to PIL image within Message?
I wanted to do this originally; however, Message is a class used for both text and multimodal models. Right now, it's possible to train and run inference without downloading extra things needed for multimodal like PIL and torchvision. This keeps the library lightweight. In keeping with that, if we make the specific multimodal model transforms load and use PIL, we keep this usage separate. I'd be open to changing this if PIL and multimodal become so important that they're used everywhere, but right now they're not.

I did it, sue me.