Release v0.24.0 · mosaicml/composer

What's New

1. Torch 2.4 Compatibility (#3542, #3549, #3553, #3552, #3565)

Composer now supports Torch 2.4! We are tracking a few checkpointing-related issues in the latest PyTorch, which we have raised with the PyTorch team:

[PyTorch Issue] Distributed checkpointing using PyTorch DCP has issues with stateless optimizers, e.g. SGD. We recommend using composer.optim.DecoupledSGDW as a workaround.
[PyTorch Issue] Distributed checkpointing using PyTorch DCP broke backwards compatibility. We have patched this using the following planner, but this may break custom planner loading.

2. New checkpointing APIs (#3447, #3474, #3488, #3452)

We've added new checkpointing APIs for downloading, uploading, loading, and saving checkpoints, so that checkpointing is usable outside of a Trainer object. We will be fully migrating to these new APIs in the next minor release.

3. Improved Auto-microbatching (#3510, #3522)

We've fixed deadlocks with auto-microbatching with FSDP, bringing throughput in line with manually setting the microbatch size. This is achieved by enabling sync hooks wherever a training run might OOM while searching for the correct microbatch size, then disabling these hooks for the rest of training.
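
For readers unfamiliar with the idea, here is a minimal single-process sketch of an OOM-driven microbatch search. It is not Composer's implementation: it assumes a simple halve-on-OOM strategy, it does not model FSDP or the sync hooks described above, and `SimulatedOOM`, `find_microbatch_size`, and `make_fake_step` are hypothetical names used only for illustration.

```python
class SimulatedOOM(RuntimeError):
    """Hypothetical stand-in for a CUDA out-of-memory error."""


def find_microbatch_size(train_step, batch_size):
    """Halve the microbatch size until one full batch trains without OOM.

    `train_step(num_samples)` is assumed to raise SimulatedOOM when
    `num_samples` does not fit in memory.
    """
    microbatch_size = batch_size
    while microbatch_size >= 1:
        try:
            # Process the whole batch in chunks of `microbatch_size` samples.
            for start in range(0, batch_size, microbatch_size):
                train_step(min(microbatch_size, batch_size - start))
            return microbatch_size
        except SimulatedOOM:
            # The chunk was too large: retry the batch with half the size.
            microbatch_size //= 2
    raise RuntimeError("Even a microbatch of one sample does not fit in memory.")


def make_fake_step(memory_limit):
    """Build a fake training step that 'OOMs' above `memory_limit` samples."""
    def step(num_samples):
        if num_samples > memory_limit:
            raise SimulatedOOM(f"{num_samples} samples exceed the limit")
    return step


chosen = find_microbatch_size(make_fake_step(memory_limit=5), batch_size=32)
print(chosen)  # 32, 16, and 8 fail; 4 fits, so this prints 4
```

In a multi-GPU FSDP run every rank has to agree on when to retry, which is where the deadlocks came from; the sketch deliberately leaves that coordination out.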

Bug Fixes

1. Fix checkpoint symlink uploads (#3376)

Ensures that checkpoint files are uploaded before the symlink file, fixing errors with missing or incomplete checkpoints.
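
The ordering guarantee can be illustrated with a small sketch. It is not the code from #3376: `upload_checkpoint`, the `upload` callable, and the file names are hypothetical, chosen only to show why the symlink file should go last.

```python
def upload_checkpoint(upload, checkpoint_files, symlink_file):
    """Upload every checkpoint file first, then the symlink that points at them.

    If any checkpoint upload raises, the symlink is never uploaded, so a
    reader following the symlink cannot land on an incomplete checkpoint.
    """
    for path in checkpoint_files:
        upload(path)
    # Only reached once all checkpoint files were uploaded successfully.
    upload(symlink_file)


# Record the upload order with a plain list instead of a real object store.
uploaded = []
upload_checkpoint(
    uploaded.append,
    ["ep1/rank0.pt", "ep1/rank1.pt"],
    "latest.symlink",
)
print(uploaded)  # ['ep1/rank0.pt', 'ep1/rank1.pt', 'latest.symlink']
```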

2. Optimizer tracks same parameters after FSDP wrapping (#3502)

When only a subset of the model's parameters is passed to the optimizer, FSDP wrapping no longer interferes: the optimizer tracks the same parameters after wrapping.

What's Changed

Bump ipykernel from 6.29.2 to 6.29.5 by @dependabot in #3459
Update torchmetrics requirement from <1.3.3,>=0.10.0 to >=1.4.0.post0,<1.4.1 by @dependabot in #3460
[Checkpoint] Fix symlink issue where symlink file uploaded before checkpoint files upload by @bigning in #3376
Bump databricks-sdk from 0.28.0 to 0.29.0 by @dependabot in #3456
Remove Log Exception by @jjanezhang in #3464
Corrected docs for MFU in SpeedMonitor by @JackZ-db in #3469
[checkpoint v2] Download api by @bigning in #3447
Upload api by @bigning in #3474
[Checkpoint V2] Upload API by @bigning in #3488
Load api by @eracah in #3452
Add helpful comment explaining HSDP initialization seeding by @mvpatel2000 in #3470
Add fit start to mosaicmllogger by @ethanma-db in #3467
Remove OOM-Driven FSDP Deadlocks and Increase Throughput of Automicrobatching by @JackZ-db in #3510
Move hooks and fsdp modules onto state rather than trainer by @JackZ-db in #3522
Bump coverage[toml] from 7.5.4 to 7.6.0 by @dependabot in #3471
revert a wip PR by @bigning in #3475
Change FP8 Eval to default to activation dtype by @j316chuck in #3454
Get a shared file system safe signal file name by @dakinggg in #3485
Bumping flash attention version to v2.6.2 by @ShashankMosaicML in #3489
Bump to Pytorch 2.4 by @mvpatel2000 in #3542
Add Torch 2.4 Tests by @mvpatel2000 in #3549
Fix torch 2.4 images for tests by @snarayan21 in #3553
Fix torch 2.4 tests by @mvpatel2000 in #3552
Fix bug when subset of model parameters is passed into optimizer with FSDP by @sashaDoubov in #3502
Correctly process parallelism_config['tp'] when it's a dict by @snarayan21 in #3434
[torch2.4] Fix sharded checkpointing backward compatibility issue by @bigning in #3565
[fix-daily] Use composer get_model_state_dict instead of torch's by @eracah in #3492
Load Microbatches instead of Entire Batches to GPU by @JackZ-db in #3487
Make Pytest log in color in Github Action by @eitanturok in #3505
Revert "Load Microbatches instead of Entire Batches to GPU " by @JackZ-db in #3508
Bump transformers version by @dakinggg in #3511
Fix FSDP Config Validation by @mvpatel2000 in #3530
Add FSDP input validation for use_orig_params and activation_cpu_offload flag by @j316chuck in #3515
Fix checkpoint events by @b-chu in #3468
Patch conf.py for readthedocs sphinx injection deprecation. by @mvpatel2000 in #3491
save load path in state and pass to mosaicmllogger by @ethanma-db in #3506
Disable gcs azure daily test by @bigning in #3514
Update huggingface-hub requirement from <0.24,>=0.21.2 to >=0.21.2,<0.25 by @dependabot in #3481
restore version on dev by @XiaohanZhangCMU in #3451
Deprecate deepspeed by @dakinggg in #3512
Update importlib-metadata requirement from <7,>=5.0.0 to >=5.0.0,<9 by @dependabot in #3519
Update peft requirement from <0.12,>=0.10.0 to >=0.10.0,<0.13 by @dependabot in #3518
Use gloo as part of DeviceGPU's process group backend by @snarayan21 in #3509
Add a monitor of mlflow logger so that it sets run status as failed if main thread exits unexpectedly by @chenmoneygithub in #3449
Revert "Use gloo as part of DeviceGPU's process group backend (#3509)" by @snarayan21 in #3523
Fix autoresume docstring (save_overwrite) by @eracah in #3526
Unpin pip by @dakinggg in #3524
hasattr check for Wandb 0.17.6 by @mvpatel2000 in #3531
Remove dev on github workflows by @mvpatel2000 in #3536
Remove dev branch in GPU workflows by @mvpatel2000 in #3539
restore google cloud object store test by @bigning in #3538
Update moto[s3] requirement from <5,>=4.0.1 to >=4.0.1,<6 by @dependabot in #3516
use s3 boto3 Adaptive retry as default retry mode by @bigning in #3543
Use python 3.11 in GAs by @eitanturok in #3529
Implement ruff rules enforcing pep 585 by @snarayan21 in #3551
Update numpy requirement from <2.1.0,>=1.21.5 to >=1.21.5,<2.2.0 by @dependabot in #3556
Bump databricks-sdk from 0.29.0 to 0.30.0 by @dependabot in #3559
Update Optim to DecoupledSGD in Notebooks by @mvpatel2000 in #3554
Remove lambda code eval testing by @mvpatel2000 in #3560
Restore Azure Tests by @mvpatel2000 in #3561
Remove tokens for to_next_epoch by @mvpatel2000 in #3562
Change iteration timestamp for old checkpoints by @b-chu in #3563
Fix typo in composer_collect_env by @dakinggg in #3566
Add default value to get_device() by @coryMosaicML in #3568
add ghcr and update build matrix generator by @KevDevSha in #3465
Bump aws_ofi_nccl to 1.11.0 by @willgleich in #3569
allow listed runners by @KevDevSha in #3486
fix runner linux-ubuntu > ubuntu-latest by @KevDevSha in #3571
Bump version to v0.24.0 + deprecations by @snarayan21 in #3570

New Contributors

@ethanma-db made their first contribution in #3467
@KevDevSha made their first contribution in #3465

Full Changelog: v0.23.5...v0.24.0
