
[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder #1084

Merged: 28 commits from image_transforms into main, Jul 5, 2024

Conversation

@felipemello1 (Contributor) commented Jun 13, 2024

Context

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Added the image transforms for CLIP.

Builder with the model specific values in:
torchtune/models/clip/_model_builders.py

Clip specific use of transforms in:
torchtune/models/clip/_transforms.py

Vision transforms in:
torchtune/modules/transforms/vision

Every transform is a function.

Algorithm TLDR (a minimal sketch of the pipeline follows this list):

  1. Get the list of possible resolutions based on tile_size and max_num_tiles;
  2. Find the resolution that best fits the image;
  3. Resize without distortion (preserving aspect ratio);
  4. Pad;
  5. Normalize;
  6. tile_crop -> output is of shape [num_tiles, 3, tile_size, tile_size]
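
A minimal sketch of that pipeline, assuming a [3, H, W] float tensor input. The helper logic here is an illustrative stand-in for the PR's find_supported_resolutions, get_canvas_best_fit, and tile_crop; exact signatures may differ:

    import torch
    import torchvision.transforms.functional as TF

    def clip_image_transform_sketch(image: torch.Tensor, tile_size: int = 224,
                                    max_num_tiles: int = 4) -> torch.Tensor:
        h, w = image.shape[-2], image.shape[-1]
        # 1. Possible canvases: (i*tile_size, j*tile_size) with i*j <= max_num_tiles
        possible_resolutions = [
            (i * tile_size, j * tile_size)
            for i in range(1, max_num_tiles + 1)
            for j in range(1, max_num_tiles + 1)
            if i * j <= max_num_tiles
        ]
        # 2. Best fit, simplified here to closest aspect ratio; the PR's
        #    get_canvas_best_fit compares scaling factors instead.
        target_h, target_w = min(
            possible_resolutions, key=lambda r: abs(r[0] / r[1] - h / w)
        )
        # 3. Resize without distortion: one scale shared by both dimensions.
        scale = min(target_h / h, target_w / w)
        new_h, new_w = max(1, int(h * scale)), max(1, int(w * scale))
        image = TF.resize(image, [new_h, new_w], antialias=True)
        # 4. Pad right/bottom with zeros up to the chosen canvas.
        image = torch.nn.functional.pad(image, (0, target_w - new_w, 0, target_h - new_h))
        # 5. Normalization (mean/std from the model builder) would go here.
        # 6. tile_crop: [3, target_h, target_w] -> [num_tiles, 3, tile_size, tile_size]
        tiles = image.unfold(1, tile_size, tile_size).unfold(2, tile_size, tile_size)
        return tiles.permute(1, 2, 0, 3, 4).reshape(-1, 3, tile_size, tile_size)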

Changelog

Added torchvision to the requirements

Test plan

Every function is covered, some indirectly. All tests pass.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
    • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

docs: [screenshots of the rendered documentation omitted]

pytorch-bot (bot) commented Jun 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1084

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the "CLA Signed" label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Jun 13, 2024
@felipemello1 felipemello1 changed the title [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder - Do not land Jun 13, 2024
@felipemello1 felipemello1 changed the title [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder - Do not land Do not land - [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder Jun 13, 2024
@felipemello1 felipemello1 changed the title Do not land - [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder Jun 13, 2024
@felipemello1 felipemello1 changed the title [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder wip [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder Jun 14, 2024
@felipemello1 felipemello1 marked this pull request as draft June 14, 2024 20:39
@pbontrager (Contributor) left a comment:

This is a great first pass over adding all the functionality needed for high resolution CLIP image transforms. I've left a lot of comments on the structure of the transforms as you're adding the first ones to the library. After we update these I'll take a pass over the tests and docs.

Resolved review threads: torchtune/models/clip/_model_builders.py (3), torchtune/modules/transforms/pipelines.py (2), torchtune/modules/transforms/transforms.py (2), torchtune/modules/transforms/utils.py (3)
@felipemello1 felipemello1 changed the title wip [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder Jun 25, 2024
If False, pick the canvas that minimizes downscaling, including no downscaling at all.
Returns:
Tuple[int, int]: The best resolution to fit the image into.
Examples:
Contributor:

Nit: These examples are not rendering correctly in docs.

Contributor:

I think you need to add newlines between Args, Returns, and Examples, or add a colon:

Suggested change
Examples:
Examples::
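
For reference, a minimal sketch of the Sphinx-friendly layout being suggested (blank lines between sections, or a double colon before the literal block; the docstring body below is illustrative):

    def get_canvas_best_fit(image, possible_resolutions, resize_to_max_canvas=False):
        """Choose the canvas that best fits the image.

        Args:
            image: Image tensor in [..., H, W] format.
            possible_resolutions: Candidate (height, width) canvases.
            resize_to_max_canvas: If False, pick the canvas that minimizes
                downscaling, including no downscaling at all.

        Returns:
            Tuple[int, int]: The best resolution to fit the image into.

        Examples::

            >>> get_canvas_best_fit(image, possible_resolutions, resize_to_max_canvas=False)
            (224, 448)
        """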

If None, will upscale up to target_size.
Returns:
torch.Tensor: The resized and padded image tensor in the format [..., H, W].
Examples:
Contributor:

nit: These examples are not rendering correctly in docs.

@felipemello1 felipemello1 marked this pull request as ready for review June 25, 2024 01:32
@RdoubleA (Contributor) left a comment:

Looks awesome overall, thanks for all the work you put into this. One overall question I had is the distinction between the CLIP transform which lives in torchtune/models and the other transforms in modules/transforms/vision. Will only the model transforms be classes with __call__ and all the transforms that live in modules/transforms/ be normal functions?

Also, what is the expected approach for a user to make a new model transform? I had originally thought there would be common image transforms that they can just Compose together in the model builder, but I see that CLIPImageTransform is a bit more involved than just piping transforms. So is this the overall user flow we want to enforce:

  • all general transforms are standalone functions in modules
  • to make a new model transform, user needs to define a class that uses the general transform functions in any way they want
  • then they create a builder function that parametrizes the class and can be instantiated in the config

from torchtune.models.clip._transforms import CLIPImageTransform

def clip_vit_336_transform():
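
A hedged sketch of how such a builder might be filled in; the constructor arguments below are assumptions for illustration, not the actual signature in _model_builders.py:

    from torchtune.models.clip._transforms import CLIPImageTransform

    def clip_vit_336_transform():
        # Hypothetical parametrization; the real values live in
        # torchtune/models/clip/_model_builders.py
        return CLIPImageTransform(
            tile_size=336,
            max_num_tiles=4,
            image_mean=[0.48145466, 0.4578275, 0.40821073],
            image_std=[0.26862954, 0.26130258, 0.27577711],
        )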

Contributor:

is there a paper ref for these numbers that we can add to the docstring?
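
For context, and assuming the numbers in question are the normalization mean/std: the constants most commonly used for CLIP preprocessing come from OpenAI's CLIP repository rather than a paper equation:

    # Widely used OpenAI CLIP normalization constants (for reference only;
    # confirm against the values actually used in this PR):
    CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
    CLIP_STD = (0.26862954, 0.26130258, 0.27577711)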

Resolved review threads: torchtune/models/clip/_transforms.py (4)

torchtune/modules/transforms/vision/get_canvas_best_fit.py:
>>> get_canvas_best_fit(image, possible_resolutions, resize_to_max_canvas=False)
(224, 448)

In the example above, we calculate the scaling factors for each possible resolution
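
A minimal sketch of the scale-based selection that docstring describes, assuming the behavior stated above (illustrative, not the library's exact internals):

    import torch

    def best_fit_sketch(image_h, image_w, possible_resolutions, resize_to_max_canvas=False):
        res = torch.tensor(possible_resolutions, dtype=torch.float)
        # One scaling factor per canvas: the factor that fits the whole image.
        scales = torch.minimum(res[:, 0] / image_h, res[:, 1] / image_w)
        upscaling = scales[scales >= 1]
        if resize_to_max_canvas:
            best = scales.max()
        elif upscaling.numel() > 0:
            best = upscaling.min()  # allow upscaling, but as little as possible
        else:
            best = scales.max()  # minimize downscaling
        # Among canvases sharing the chosen scale, pick the smallest area.
        candidates = res[scales == best]
        areas = candidates[:, 0] * candidates[:, 1]
        h, w = candidates[areas.argmin()]
        return int(h), int(w)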
Contributor:

or maybe these lines in between the code need to be tabbed back

Resolved review threads: torchtune/modules/transforms/vision/get_canvas_best_fit.py (2)
@RdoubleA (Contributor) left a comment:

No major concerns on my end - thanks for pushing this through. You might have to wait until recipe tests are fixed on main before landing this.

Also curious if we plan to add model transforms to the api_ref_models.rst, cc @pbontrager . But this can be a follow-up

@pbontrager (Contributor) left a comment:

Everything looks good here. I'll approve this, but I think you should update the check for possible_resolutions to explicitly check for None.

), f"Either possible_resolutions or max_num_tiles must be given. Got {possible_resolutions=} and {max_num_tiles=}"

# If possible_resolutions are not given, then calculate possible ones based on max_num_tiles
if not possible_resolutions and max_num_tiles:
Contributor:

You have to explicitly check for None, otherwise 0 or empty tuples resolve as false

Contributor (Author):

If possible_resolutions is None or is empty or anything that is not a list with items, we must activate this condition to find possible_resolutions

Contributor (Author):

[screenshot omitted]

    possible_resolutions = find_supported_resolutions(
        max_num_tiles=max_num_tiles, tile_size=tile_size
    )
else:
Contributor:

This else doesn't make sense to me
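
For reference, a sketch of the explicit form being suggested; the author argues above that the truthy check is intentional, so this is one option rather than the merged code:

    # Explicitly distinguish "not provided" (None) from other falsy values
    # such as an empty list:
    if possible_resolutions is None and max_num_tiles is not None:
        possible_resolutions = find_supported_resolutions(
            max_num_tiles=max_num_tiles, tile_size=tile_size
        )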

@felipemello1 felipemello1 merged commit 06a125e into main Jul 5, 2024
28 checks passed
@felipemello1 felipemello1 deleted the image_transforms branch July 5, 2024 19:57
@felipemello1 felipemello1 restored the image_transforms branch July 5, 2024 19:58
Aditya-dom added a commit to Aditya-dom/torchtune that referenced this pull request Jul 6, 2024
[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder (pytorch#1084)
) -> torch.Tensor:
    """
    Places the image at the top left of the canvas and pads the right and bottom
    with 0 to fit the target resolution. If target_size < image_size, it will crop the image.
Contributor:

if target_size < image_size, it will crop the image.

From _get_max_res_without_distortion, it seems like we always resize such that the image remains within bounds of target_size - is cropping still required?
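
For reference, a minimal sketch of top-left padding with torch.nn.functional.pad, assuming the [..., H, W] layout from the docstring; note that F.pad with negative values crops, which is how one call can cover both cases:

    import torch
    import torch.nn.functional as F

    def pad_image_top_left(image: torch.Tensor, target_h: int, target_w: int) -> torch.Tensor:
        # Positive padding adds zeros on the right/bottom; negative values crop.
        pad_h = target_h - image.shape[-2]
        pad_w = target_w - image.shape[-1]
        return F.pad(image, (0, pad_w, 0, pad_h), value=0)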

    new_height = min(math.floor(original_height * scale_w), target_height)
else:
    new_height = target_height
    new_width = min(math.floor(original_width * scale_h), target_width)
Contributor:

Could this be simplified to:

new_width = min(math.floor(original_width * scale_h), target_width)
new_height = min(math.floor(original_height * scale_w), target_height)
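
Sanity-checking the suggestion: the min() against the target dimension clamps whichever side is not the limiting one, so the branch on scale_w < scale_h becomes redundant. A hedged sketch of the simplified helper, with the surrounding code inferred from the quoted snippet:

    import math

    def get_max_res_without_distortion(image_size, target_size):
        original_height, original_width = image_size
        target_height, target_width = target_size
        scale_w = target_width / original_width
        scale_h = target_height / original_height
        # The smaller of the two scales always wins because of the min()
        # with the target dimension.
        new_width = min(math.floor(original_width * scale_h), target_width)
        new_height = min(math.floor(original_height * scale_w), target_height)
        return new_height, new_width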

maximegmd pushed a commit to maximegmd/torchtune that referenced this pull request Jul 13, 2024