[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder #1084
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1084
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This is a great first pass at adding all the functionality needed for high-resolution CLIP image transforms. I've left a lot of comments on the structure of the transforms since you're adding the first ones to the library. After we update these, I'll take a pass over the tests and docs.
If False, pick the canvas that minimizes downscaling, including no downscaling at all.
Returns:
    Tuple[int, int]: The best resolution to fit the image into.
Examples:
Nit: These examples are not rendering correctly in docs.
I think you need to add newlines between Args, Returns, and Examples, or you need to add a second colon:
Suggested change:
-Examples:
+Examples::
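For reference, a minimal sketch of a docstring where the double colon (plus blank lines) makes the examples render as a literal block in Sphinx; the function name and body here are hypothetical:

def get_canvas_best_fit_sketch(image, possible_resolutions, resize_to_max_canvas):
    """
    Pick the canvas from possible_resolutions that best fits the image.

    Returns:
        Tuple[int, int]: The best resolution to fit the image into.

    Examples::

        >>> get_canvas_best_fit_sketch(image, [(224, 448)], False)
        (224, 448)
    """
    return possible_resolutions[0]  # placeholder body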
If None, will upscale up to target_size.
Returns:
    torch.Tensor: The resized and padded image tensor in the format [..., H, W].
Examples:
nit: These examples are not rendering correctly in docs.
Looks awesome overall, thanks for all the work you put into this. One overall question I had is the distinction between the CLIP transform, which lives in torchtune/models, and the other transforms in modules/transforms/vision. Will only the model transforms be classes with __call__, and all the transforms that live in modules/transforms/ be normal functions?
Also, what is the expected approach for a user to make a new model transform? I had originally thought there would be common image transforms that they can just Compose together in the model builder, but I see that CLIPImageTransform is a bit more involved than just piping transforms. So is this the overall user flow we want to enforce (sketched after the list below):
- all general transforms are standalone functions in modules
- to make a new model transform, user needs to define a class that uses the general transform functions in any way they want
- then they create a builder function that parametrizes the class and can be instantiated in the config
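If it helps, here is a minimal sketch of that flow; MyModelImageTransform, resize_with_pad_sketch, and my_model_transform are all hypothetical placeholder names:

import torch

# (1) General transforms are standalone functions in modules/transforms.
# This one is a stand-in; a real transform would actually resize and pad.
def resize_with_pad_sketch(image: torch.Tensor, target_size) -> torch.Tensor:
    return image  # placeholder body

# (2) A model transform is a class with __call__ that composes the general
# functions in whatever way the model needs.
class MyModelImageTransform:
    def __init__(self, target_size):
        self.target_size = target_size

    def __call__(self, image: torch.Tensor) -> torch.Tensor:
        return resize_with_pad_sketch(image, self.target_size)

# (3) A builder parametrizes the class so it can be instantiated from a config.
def my_model_transform() -> MyModelImageTransform:
    return MyModelImageTransform(target_size=(336, 336))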
from torchtune.models.clip._transforms import CLIPImageTransform

def clip_vit_336_transform():
Is there a paper ref for these numbers that we can add to the docstring?
>>> get_canvas_best_fit(image, possible_resolutions, resize_to_max_canvas=False)
(224, 448)

In the example above, we calculate the scaling factors for each possible resolution
Or maybe these lines in between the code need to be indented back.
No major concerns on my end - thanks for pushing this through. You might have to wait until recipe tests are fixed on main before landing this.
Also curious if we plan to add model transforms to api_ref_models.rst, cc @pbontrager. But this can be a follow-up.
Everything looks good here. I'll approve this, but I think you should update the check for possible_resolutions to explicitly check for None.
), f"Either possible_resolutions or max_num_tiles must be given. Got {possible_resolutions=} and {max_num_tiles=}" | ||
|
||
# If possible_resolutions are not given, then calculate possible ones based on max_num_tiles | ||
if not possible_resolutions and max_num_tiles: |
You have to explicitly check for None, otherwise 0 or empty tuples evaluate as False.
If possible_resolutions is None, or empty, or anything else that is not a list with items, we must activate this condition to find possible_resolutions.
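To make the difference under discussion concrete, a minimal illustration with placeholder values:

# An empty list is "not given" in the truthiness sense, but it is not None.
possible_resolutions = []
max_num_tiles = 4

# Truthiness check (as in the diff): fires for None, [], (), and 0.
if not possible_resolutions and max_num_tiles:
    print("truthiness check fired")

# Explicit check (as suggested): fires only when the argument was omitted.
if possible_resolutions is None and max_num_tiles is not None:
    print("explicit None check fired")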
    possible_resolutions = find_supported_resolutions(
        max_num_tiles=max_num_tiles, tile_size=tile_size
    )
else:
This else doesn't make sense to me.
) -> torch.Tensor:
    """
    Places the image at the top left of the canvas and pads with 0 the right and bottom
    to fit to the target resolution. If target_size < image_size, it will crop the image.
Regarding "if target_size < image_size, it will crop the image": from _get_max_res_without_distortion, it seems like we always resize such that the image remains within the bounds of target_size - is cropping still required?
    new_height = min(math.floor(original_height * scale_w), target_height)
else:
    new_height = target_height
    new_width = min(math.floor(original_width * scale_h), target_width)
Could this be simplified to:
new_width = min(math.floor(original_width * scale_h), target_width)
new_height = min(math.floor(original_height * scale_w), target_height)
Context
Added the image transforms for CLIP.
- Builder with the model-specific values in: torchtune/models/clip/_model_builders.py
- CLIP-specific use of transforms in: torchtune/models/clip/_transforms.py
- Vision transforms in: torchtune/modules/transforms/vision
Every transform is a function.
Algorithm TLDR:
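A hedged sketch of the flow, pieced together from the functions discussed in this review; the import path and exact call signatures are assumptions:

import torch

# Assumed import path, per "Vision transforms in: torchtune/modules/transforms/vision".
from torchtune.modules.transforms import (
    find_supported_resolutions,
    get_canvas_best_fit,
    resize_with_pad,
)

def clip_transform_sketch(image: torch.Tensor) -> torch.Tensor:
    # 1. Enumerate the canvas sizes supported by the tiling scheme.
    possible_resolutions = find_supported_resolutions(max_num_tiles=4, tile_size=336)
    # 2. Pick the canvas that best fits the image.
    best_resolution = get_canvas_best_fit(
        image, possible_resolutions, resize_to_max_canvas=False
    )
    # 3. Resize without distortion, then pad up to the chosen canvas.
    return resize_with_pad(image, target_size=best_resolution)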
Changelog
Added torchvision to the requirements
Test plan
Every function is covered, some indirectly. All tests pass.
- pre-commit install
- pytest tests
- pytest tests -m integration_test
- docs