v0.24.0
What's New
1. Torch 2.4 Compatibility (#3542, #3549, #3553, #3552, #3565)
Composer now supports Torch 2.4! We are tracking a few checkpointing-related issues in the latest PyTorch, which we have raised with the PyTorch team:
- [PyTorch Issue] Distributed checkpointing using PyTorch DCP has issues with stateless optimizers, e.g. SGD. We recommend using composer.optim.DecoupledSGDW as a workaround.
- [PyTorch Issue] Distributed checkpointing using PyTorch DCP broke backwards compatibility. We have patched this using the following planner, but this may break custom planner loading.
2. New checkpointing APIs (#3447, #3474, #3488, #3452)
We've added new checkpointing APIs to download, upload, load, and save checkpoints, so that checkpointing is usable outside of a Trainer object. We will be fully migrating to these new APIs in the next minor release.
3. Improved Auto-microbatching (#3510, #3522)
We've fixed deadlocks that occurred when using auto-microbatching with FSDP, bringing throughput in line with manually setting the microbatch size. This is achieved by enabling sync hooks wherever a training run might OOM while the correct microbatch size is being found, and disabling these hooks for the rest of training.
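To illustrate why ranks need to agree on an OOM, here is a toy, single-process simulation of searching for a microbatch size. It is only a sketch and not Composer's implementation: the names find_microbatch_size, run_microbatch, and SimulatedOOM are hypothetical, the halving strategy is an assumption, and any() stands in for a real collective across ranks.

```python
# Toy simulation only: no GPUs, torch, or Composer involved.
# All names below are hypothetical and chosen for illustration.


class SimulatedOOM(RuntimeError):
    """Stands in for a CUDA out-of-memory error."""


def run_microbatch(rank_capacity: int, microbatch_size: int) -> None:
    # Pretend each rank can hold at most `rank_capacity` samples at once.
    if microbatch_size > rank_capacity:
        raise SimulatedOOM(
            f"microbatch of {microbatch_size} exceeds capacity {rank_capacity}"
        )


def find_microbatch_size(rank_capacities: list[int], batch_size: int) -> int:
    """Halve the candidate size until no simulated rank runs out of memory."""
    microbatch_size = batch_size
    while True:
        # Each rank tries the candidate size and records whether it hit OOM.
        oom_flags = []
        for capacity in rank_capacities:
            try:
                run_microbatch(capacity, microbatch_size)
                oom_flags.append(False)
            except SimulatedOOM:
                oom_flags.append(True)

        # Stand-in for a collective: every rank learns whether ANY rank hit
        # OOM, so all ranks retry together instead of one rank retrying
        # while the others wait on it (the deadlock scenario).
        any_oom = any(oom_flags)
        if not any_oom:
            return microbatch_size
        if microbatch_size == 1:
            raise SimulatedOOM("even a microbatch of 1 does not fit")
        microbatch_size //= 2
```

In this sketch the agreement step runs on every attempt. Once a size is found, no further agreement is needed, which mirrors the idea of disabling the sync hooks for the rest of training.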
Bug Fixes
1. Fix checkpoint symlink uploads (#3376)
Ensures that checkpoint files are uploaded before the symlink file, fixing errors with missing or incomplete checkpoints.
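The ordering guarantee can be sketched as follows. This is a minimal illustration under stated assumptions, not Composer's uploader: upload_checkpoint is a hypothetical helper, and a plain dict stands in for the remote object store.

```python
# Hypothetical helper for illustration; a dict stands in for remote storage.


def upload_checkpoint(
    store: dict[str, bytes],
    checkpoint_files: dict[str, bytes],
    symlink_name: str,
    symlink_target: str,
) -> list[str]:
    """Upload all checkpoint files, then the symlink; return the upload order."""
    upload_order = []

    # Step 1: upload every checkpoint file.
    for name, data in checkpoint_files.items():
        store[name] = data
        upload_order.append(name)

    # Step 2: confirm every file landed before publishing the symlink.
    missing = [name for name in checkpoint_files if name not in store]
    if missing:
        raise RuntimeError(f"checkpoint files missing from store: {missing}")

    # Step 3: only now upload the symlink, so a reader that follows it
    # always finds a complete checkpoint.
    store[symlink_name] = symlink_target.encode()
    upload_order.append(symlink_name)
    return upload_order
```

Because the symlink is written last, a reader that sees it can assume the files it points to are already in place.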
2. Optimizer tracks same parameters after FSDP wrapping (#3502)
When only a subset of parameters should be tracked by the optimizer, FSDP wrapping no longer interferes: the optimizer tracks the same parameters after wrapping.
What's Changed
- Bump ipykernel from 6.29.2 to 6.29.5 by @dependabot in #3459
- Update torchmetrics requirement from <1.3.3,>=0.10.0 to >=1.4.0.post0,<1.4.1 by @dependabot in #3460
- [Checkpoint] Fix symlink issue where symlink file uploaded before checkpoint files upload by @bigning in #3376
- Bump databricks-sdk from 0.28.0 to 0.29.0 by @dependabot in #3456
- Remove Log Exception by @jjanezhang in #3464
- Corrected docs for MFU in SpeedMonitor by @JackZ-db in #3469
- [checkpoint v2] Download api by @bigning in #3447
- Upload api by @bigning in #3474
- [Checkpoint V2] Upload API by @bigning in #3488
- Load api by @eracah in #3452
- Add helpful comment explaining HSDP initialization seeding by @mvpatel2000 in #3470
- Add fit start to mosaicmllogger by @ethanma-db in #3467
- Remove OOM-Driven FSDP Deadlocks and Increase Throughput of Automicrobatching by @JackZ-db in #3510
- Move hooks and fsdp modules onto state rather than trainer by @JackZ-db in #3522
- Bump coverage[toml] from 7.5.4 to 7.6.0 by @dependabot in #3471
- revert a wip PR by @bigning in #3475
- Change FP8 Eval to default to activation dtype by @j316chuck in #3454
- Get a shared file system safe signal file name by @dakinggg in #3485
- Bumping flash attention version to v2.6.2 by @ShashankMosaicML in #3489
- Bump to Pytorch 2.4 by @mvpatel2000 in #3542
- Add Torch 2.4 Tests by @mvpatel2000 in #3549
- Fix torch 2.4 images for tests by @snarayan21 in #3553
- Fix torch 2.4 tests by @mvpatel2000 in #3552
- Fix bug when subset of model parameters is passed into optimizer with FSDP by @sashaDoubov in #3502
- Correctly process parallelism_config['tp'] when it's a dict by @snarayan21 in #3434
- [torch2.4] Fix sharded checkpointing backward compatibility issue by @bigning in #3565
- [fix-daily] Use composer get_model_state_dict instead of torch's by @eracah in #3492
- Load Microbatches instead of Entire Batches to GPU by @JackZ-db in #3487
- Make Pytest log in color in Github Action by @eitanturok in #3505
- Revert "Load Microbatches instead of Entire Batches to GPU " by @JackZ-db in #3508
- Bump transformers version by @dakinggg in #3511
- Fix FSDP Config Validation by @mvpatel2000 in #3530
- Add FSDP input validation for use_orig_params and activation_cpu_offload flag by @j316chuck in #3515
- Fix checkpoint events by @b-chu in #3468
- Patch conf.py for readthedocs sphinx injection deprecation. by @mvpatel2000 in #3491
- save load path in state and pass to mosaicmllogger by @ethanma-db in #3506
- Disable gcs azure daily test by @bigning in #3514
- Update huggingface-hub requirement from <0.24,>=0.21.2 to >=0.21.2,<0.25 by @dependabot in #3481
- restore version on dev by @XiaohanZhangCMU in #3451
- Deprecate deepspeed by @dakinggg in #3512
- Update importlib-metadata requirement from <7,>=5.0.0 to >=5.0.0,<9 by @dependabot in #3519
- Update peft requirement from <0.12,>=0.10.0 to >=0.10.0,<0.13 by @dependabot in #3518
- Use gloo as part of DeviceGPU's process group backend by @snarayan21 in #3509
- Add a monitor of mlflow logger so that it sets run status as failed if main thread exits unexpectedly by @chenmoneygithub in #3449
- Revert "Use gloo as part of DeviceGPU's process group backend (#3509)" by @snarayan21 in #3523
- Fix autoresume docstring (save_overwrite) by @eracah in #3526
- Unpin pip by @dakinggg in #3524
- hasattr check for Wandb 0.17.6 by @mvpatel2000 in #3531
- Remove dev on github workflows by @mvpatel2000 in #3536
- Remove dev branch in GPU workflows by @mvpatel2000 in #3539
- restore google cloud object store test by @bigning in #3538
- Update moto[s3] requirement from <5,>=4.0.1 to >=4.0.1,<6 by @dependabot in #3516
- use s3 boto3 Adaptive retry as default retry mode by @bigning in #3543
- Use python 3.11 in GAs by @eitanturok in #3529
- Implement ruff rules enforcing pep 585 by @snarayan21 in #3551
- Update numpy requirement from <2.1.0,>=1.21.5 to >=1.21.5,<2.2.0 by @dependabot in #3556
- Bump databricks-sdk from 0.29.0 to 0.30.0 by @dependabot in #3559
- Update Optim to DecoupledSGD in Notebooks by @mvpatel2000 in #3554
- Remove lambda code eval testing by @mvpatel2000 in #3560
- Restore Azure Tests by @mvpatel2000 in #3561
- Remove tokens for to_next_epoch by @mvpatel2000 in #3562
- Change iteration timestamp for old checkpoints by @b-chu in #3563
- Fix typo in composer_collect_env by @dakinggg in #3566
- Add default value to get_device() by @coryMosaicML in #3568
- add ghcr and update build matrix generator by @KevDevSha in #3465
- Bump aws_ofi_nccl to 1.11.0 by @willgleich in #3569
- allow listed runners by @KevDevSha in #3486
- fix runner linux-ubuntu > ubuntu-latest by @KevDevSha in #3571
- Bump version to v0.24.0 + deprecations by @snarayan21 in #3570
New Contributors
- @ethanma-db made their first contribution in #3467
- @KevDevSha made their first contribution in #3465
Full Changelog: v0.23.5...v0.24.0