
v0.24.0

@snarayan21 released this 26 Aug 14:48
020b0ef

What's New

1. Torch 2.4 Compatibility (#3542, #3549, #3553, #3552, #3565)

Composer now supports Torch 2.4! We are tracking a few checkpointing-related issues with the latest PyTorch that we have raised with the PyTorch team:

  • [PyTorch Issue] Distributed checkpointing using PyTorch DCP has issues with stateless optimizers, e.g. SGD. We recommend using composer.optim.DecoupledSGDW as a workaround (see the sketch after this list).
  • [PyTorch Issue] Distributed checkpointing using PyTorch DCP broke backwards compatibility. We have patched this with a compatibility planner, but the patch may break loading with custom planners.
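
A minimal sketch of the recommended workaround, assuming a toy model and illustrative hyperparameters; the only substantive change is replacing torch.optim.SGD with composer.optim.DecoupledSGDW:

```python
# Minimal sketch: swap torch.optim.SGD for DecoupledSGDW so that
# distributed checkpointing via PyTorch DCP behaves correctly.
# The toy model and hyperparameters below are illustrative placeholders.
import torch
from composer.optim import DecoupledSGDW

model = torch.nn.Linear(16, 2)  # stand-in for your ComposerModel's module
optimizer = DecoupledSGDW(model.parameters(), lr=0.1, momentum=0.9)
```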

2. New checkpointing APIs (#3447, #3474, #3488, #3452)

We've added new checkpointing APIs for downloading, uploading, saving, and loading checkpoints, so that checkpointing is usable outside of a Trainer object. We will fully migrate to these new APIs in the next minor release.
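
These notes don't spell out the new function names, so the sketch below only illustrates the intended shape of Trainer-free checkpointing; the save_checkpoint / load_checkpoint calls are hypothetical placeholders, not the confirmed API — see the linked PRs for the actual interfaces.

```python
# Illustrative shape only: save and load model/optimizer state without a
# Trainer. The commented calls use HYPOTHETICAL names; consult the
# checkpointing module added in the PRs above for the real API.
import torch

model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Hypothetical: persist state to disk (or remote storage) with no Trainer.
# save_checkpoint(model=model, optimizer=optimizer, path='ckpt/model.pt')

# Hypothetical: restore the state, again with no Trainer.
# load_checkpoint(model=model, optimizer=optimizer, path='ckpt/model.pt')
```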

3. Improved Auto-microbatching (#3510, #3522)

We've fixed deadlocks in auto-microbatching with FSDP, bringing throughput in line with manually setting the microbatch size. This is achieved by enabling sync hooks only at the points where a training run might OOM while searching for the correct microbatch size, and disabling them for the rest of training.
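
A minimal sketch of turning on auto-microbatching alongside FSDP, assuming a toy classifier and dataset as placeholders; the relevant knob is device_train_microbatch_size='auto' (the run shape assumes a multi-GPU launch):

```python
# Minimal sketch: auto-microbatching with FSDP. The toy model and dataset
# are placeholders; device_train_microbatch_size='auto' asks Composer to
# search for the largest microbatch size that fits in memory.
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import ComposerClassifier

model = ComposerClassifier(torch.nn.Linear(16, 2), num_classes=2)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))

trainer = Trainer(
    model=model,
    train_dataloader=DataLoader(dataset, batch_size=32),
    max_duration='1ep',
    device_train_microbatch_size='auto',  # find the microbatch size automatically
    parallelism_config={'fsdp': {}},      # enable FSDP wrapping
)
trainer.fit()
```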

Bug Fixes

1. Fix checkpoint symlink uploads (#3376)

Ensures that checkpoint files are uploaded before the symlink file, fixing errors with missing or incomplete checkpoints.

2. Optimizer tracks same parameters after FSDP wrapping (#3502)

When the optimizer should track only a subset of a model's parameters, FSDP wrapping no longer interferes with that selection (see the sketch below).
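
A minimal sketch of the scenario this fix covers, assuming a toy two-layer model where only the second layer's parameters should be optimized; the Trainer's FSDP wrapping now preserves that selection:

```python
# Minimal sketch: the optimizer intentionally tracks only a subset of the
# model's parameters (the second layer of a toy model). With this fix,
# FSDP wrapping applied by the Trainer keeps tracking the same subset.
import torch
from composer.optim import DecoupledAdamW

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Linear(16, 2))
head_params = [p for name, p in model.named_parameters() if name.startswith('1.')]
optimizer = DecoupledAdamW(head_params, lr=1e-4)
```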

What's Changed

New Contributors

Full Changelog: v0.23.5...v0.24.0