[MLFlowObjectStore] [2/2] Support checkpointing with MLFlow #2810

Merged on Jan 12, 2024 (19 commits)

@jerrychen109 jerrychen109 commented Jan 2, 2024

What does this PR do?

Follow-up PR to #2802 that integrates MLFlowObjectStore with the Trainer.

  • Adds MLFlowObjectStore as a backend for RemoteUploaderDownloader
  • Updates MLFlowLogger to tag MLFlow runs with the Composer run name so autoresume continues the same MLFlow run.
  • Updates save folder and filenames after Event.INIT to avoid format placeholders in run logs
  • Logs mlflow_experiment_id and mlflow_run_id as hyperparams
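
The run-tagging bullet can be pictured with a minimal sketch. This is not the MLFlowLogger code: `FakeTracking` is an in-memory stand-in for a tracking server, and the tag key `composer_run_name` and the helper `get_or_create_run` are hypothetical names chosen for illustration. The only idea shown is that a resumed job looks up a run by its Composer run name before creating a new one.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional

RUN_NAME_TAG = 'composer_run_name'  # hypothetical tag key, not confirmed by the PR


@dataclass
class FakeRun:
    run_id: str
    tags: Dict[str, str] = field(default_factory=dict)


class FakeTracking:
    """In-memory stand-in for a tracking server (assumption for this sketch)."""

    def __init__(self) -> None:
        self.runs: List[FakeRun] = []

    def find_by_tag(self, key: str, value: str) -> Optional[FakeRun]:
        # Return the first run carrying the given tag value, if any.
        for run in self.runs:
            if run.tags.get(key) == value:
                return run
        return None

    def create_run(self, tags: Dict[str, str]) -> FakeRun:
        run = FakeRun(run_id=uuid.uuid4().hex, tags=dict(tags))
        self.runs.append(run)
        return run


def get_or_create_run(tracking: FakeTracking, composer_run_name: str) -> FakeRun:
    # On autoresume the Composer run name is unchanged, so the tag lookup
    # finds the earlier run instead of starting a fresh one.
    existing = tracking.find_by_tag(RUN_NAME_TAG, composer_run_name)
    if existing is not None:
        return existing
    return tracking.create_run({RUN_NAME_TAG: composer_run_name})


tracking = FakeTracking()
first = get_or_create_run(tracking, 'my-run')
resumed = get_or_create_run(tracking, 'my-run')
```

Here `first` and `resumed` refer to the same run, which is the behavior the tag is meant to provide.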

Tested manually with both unsharded and sharded checkpoints:

  • Full run with save_folder set to `dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts/`
  • Manually stopping run after a few checkpoints and resuming to verify that autoresume begins from latest checkpoint
  • Verifying HF checkpoints uploaded at end of run
  • With load_path set to the checkpoints path of a previous MLFlow run, loads the weights from that checkpoint

What issue(s) does this change relate to?

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@jerrychen109 jerrychen109 force-pushed the jerry/mlflow-objectstore-part2 branch 3 times, most recently from 55eb31f to 9d18239 Compare January 2, 2024 22:08
@jerrychen109 jerrychen109 marked this pull request as ready for review January 3, 2024 00:13
@jerrychen109 jerrychen109 requested review from eracah, dakinggg and a team as code owners January 3, 2024 00:13
@jerrychen109 jerrychen109 force-pushed the jerry/mlflow-objectstore-part2 branch from 02a3733 to 234a8ee Compare January 5, 2024 21:47

@dakinggg dakinggg left a comment

The logic looks right to me. I'm only wondering if it's better to stick with non-{} placeholders for the mlflow object store to avoid all of the partial format changes. I know we've gone back and forth on this; what do you think now, after having implemented it this way?

Base automatically changed from jerry/mlflow-objectstore-part1 to dev January 8, 2024 21:46
@jerrychen109 jerrychen109 requested a review from a team as a code owner January 8, 2024 21:46
Use MLFlow run tag for autoresume

Add MLFlowLogger test for existing composer run tag
Make MLFlow experiment and run ID available on all ranks

Fix path issue

Format mlflow placeholders in remote filenames
Add debug tracebacks

Bugfix

Add path to debug info

Try fixing RUD object store init

Pyright
pyright

No longer expect KeyError for format_with_dist using partial_format

Refactor partial_format for readability
@jerrychen109 jerrychen109 force-pushed the jerry/mlflow-objectstore-part2 branch from bb1434b to 8dd25f8 Compare January 8, 2024 21:50
@jerrychen109

I'm personally in favor of keeping the {} placeholders now. The logic won't be that much simpler without them - we still need to replace the mlflow placeholders; each instance of partial_format would just turn into two replace() calls, which I feel is actually a little messier.
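
For comparison, here is a rough sketch of the replace()-based alternative mentioned above. The token spellings `MLFLOW_EXPERIMENT_ID` and `MLFLOW_RUN_ID`, the helper name `fill_mlflow_tokens`, and the `ep{epoch}-rank{rank}.pt` filename are assumptions made up for illustration; the thread does not say which non-brace spelling was considered.

```python
# Hypothetical non-brace tokens; the thread does not give the actual spelling.
EXPERIMENT_TOKEN = 'MLFLOW_EXPERIMENT_ID'
RUN_TOKEN = 'MLFLOW_RUN_ID'


def fill_mlflow_tokens(path: str, experiment_id: str, run_id: str) -> str:
    # One replace() call per MLflow value; brace placeholders are not touched.
    return path.replace(EXPERIMENT_TOKEN, experiment_id).replace(RUN_TOKEN, run_id)


template = ('dbfs:/databricks/mlflow-tracking/MLFLOW_EXPERIMENT_ID/MLFLOW_RUN_ID/'
            'artifacts/ep{epoch}-rank{rank}.pt')

# Step 1: fill the MLflow values once they are known.
after_init = fill_mlflow_tokens(template, '123', 'abc')

# Step 2: the remaining brace fields are filled later with ordinary str.format.
final_name = after_init.format(epoch=2, rank=0)
```

Every call site that needs the MLflow values filled in would carry this pair of replace() calls, which is the extra clutter the comment refers to.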

Discussed with @dakinggg over Slack; we decided that the implementation of partial_format should be safe, and we're not too worried about infinite-loop behavior. I restructured the code slightly to make it clearer that if the try block ever completes or throws an unexpected exception (not KeyError or IndexError), the function will return or propagate the exception.
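
The control flow described here can be sketched as follows. This is a reconstruction from the description and the commit titles ("Max iters on partial_format"), not the merged code: the iteration cap of 1000, the error message, and the example filename are assumptions.

```python
def partial_format(s: str, *args: object, **kwargs: object) -> str:
    """Fill the fields that have values; leave the rest as literal placeholders.

    Sketch only: the merged implementation may differ in its details.
    """
    max_iters = 1000  # arbitrary cap for this sketch, guards against looping forever
    for _ in range(max_iters):
        try:
            # Success returns immediately; any exception other than the two
            # handled below propagates to the caller.
            return s.format(*args, **kwargs)
        except IndexError:
            # A positional field has no value: supply a literal '{}' for it.
            args += ('{}',)
        except KeyError as err:
            # A named field has no value: map the name back to '{name}'.
            key = err.args[0]
            kwargs[key] = '{' + key + '}'
    raise RuntimeError(f'Could not format {s!r} within {max_iters} iterations')


partly = partial_format('{mlflow_run_id}/ep{epoch}-rank{rank}.pt', mlflow_run_id='abc')
```

Each pass through the loop resolves one missing field, so the loop ends after at most one iteration per unfilled placeholder; the cap is only a safety net.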

Another benefit of keeping the curly braces + partial_format is that it should technically make format_name_with_dist and format_name_with_dist_and_time a bit more robust in the rare situation that someone actually wants to put curly braces in their path, but not as one of the supported format args.
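
To see why this matters, note that plain str.format raises KeyError for any brace field it is not given. The sketch below shows that failure and one standard-library way to tolerate unknown fields: a dict subclass with `__missing__` passed to `str.format_map`. This is a different technique from partial_format, shown only to illustrate the behavior; the `KeepMissing` class, the path, and the `{notes}` field are made up for the example.

```python
class KeepMissing(dict):
    """Mapping that renders unknown field names back as '{name}'."""

    def __missing__(self, key: str) -> str:
        return '{' + key + '}'


template = 'ckpt/{notes}/ep{epoch}-rank{rank}.pt'  # '{notes}' is not a supported arg

try:
    template.format(epoch=3, rank=0)
    strict_error = None
except KeyError as err:
    strict_error = err.args[0]  # name of the field str.format could not fill

# Unknown fields survive as literal text; known fields are filled as usual.
lenient = template.format_map(KeepMissing(epoch=3, rank=0))
```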

@jerrychen109 jerrychen109 requested a review from dakinggg January 8, 2024 21:59
@jerrychen109 jerrychen109 merged commit 56fa4bd into dev Jan 12, 2024
16 checks passed
@jerrychen109 jerrychen109 deleted the jerry/mlflow-objectstore-part2 branch January 12, 2024 22:39
abhi-mosaic pushed a commit to abhi-mosaic/composer that referenced this pull request Jan 15, 2024
…#2810)

* Support checkpoint uploads to MLFlow (untested)

Use MLFlow run tag for autoresume

Add MLFlowLogger test for existing composer run tag

* Try formatting mlflow save folder after INIT

Make MLFlow experiment and run ID available on all ranks

Fix path issue

Format mlflow placeholders in remote filenames

* Unit tests for partial_format

* Log mlflow info as hyperparams

* partial_format doc update

* Fix formatting

* Pull distributed logic out of MLFlowObjectStore

Add debug tracebacks

Bugfix

Add path to debug info

Try fixing RUD object store init

Pyright

* Partial format in format_name helpers

* Fix import

* Add extra partial_format test

* Fix mlflow RUD check

* Fix test

pyright

No longer expect KeyError for format_with_dist using partial_format

Refactor partial_format for readability

* Max iters on partial_format

* Fix partial_format

* Clean up

* fix test import

* Fix test
cli99 added a commit that referenced this pull request Jan 17, 2024
* Bump torch to 2.1.1 version (#2717)

* Add more info when run doesnt complete (#2751)

* Lower sequence generation length on code gen to be dependent on max canonical solution length  (#2682)

* sequentialize generations_per_sample

* fix bug

* lower generation length

* lower generation length

* lower generation length

* fix gen len

* restore

* restore

* restore

* fix tests

* fix test

* Remove flatten params (#2761)

* remove flatten params

* simplify tests

* simplify tests

* clean

* fix more tests

* rerun tests

* speed up icl

* fix tests

* fix cpu tests

* add more fixtures

* fix tests

* token count

* fix vocab size

* remove logger

* remove clears

* fix mosaicml logger

* change codeowners

* clean up codeowners

* rerun tests

* shrink dataset

* fix tests

* fix test

* rerun tests

* fix tests

* fix tests

* fix seed

* set to 0

* rerun tests

* rerun tests

* change threshold

* rerun tests

* rerun tests

* logs

* remove changes

* logs

* logs

* remove logs

* rerun tests

* rerun tests

* logs

* rerun

* logs

* rerun

* rerun

* rerun tests

* many more logs

* rerun tests

* strip logs

* enable tests

* remove opt

* rerun tests

* add test

* lint

* rerun tests

* fix lint

* lint

* filter warnings

* rerun tests

* fixture

* add fixture

* change

* logs

* rerun tests

* add logs

* rerun tests

* fixture

* lint

* lint

* rerun tests

* fix ignore warning

* logs

* regex

* regex

* regex

* fix

* logs

* reformat

* fix lint (#2767)

* lint (#2768)

* Use time.tokens for speedmonitor instead of dataset length (#2762)

* change token math

* tokens

* add test

* fix tests

* remove exception (#2759)

* time to clean up time parsing 😉 (#2770)

* time to clean up time parsing

* fix type error

* updates

* Upgrade RunConfig compute specification (#2772)

* Upgrade RunConfig compute specification

* extra cluster

* Use async logging in MLflowLogger (#2693)

* async mlflow logging

Signed-off-by: chenmoneygithub <[email protected]>

* small fix

Signed-off-by: chenmoneygithub <[email protected]>

* clean up

* fix test

* fix tests

* deflake

* pin mlflow

---------

Signed-off-by: chenmoneygithub <[email protected]>

* Fix FSDP _param_init_fn to not reinit parameters multiple times (#2765)

* Gate FSDP param init test on torch 2.1 (#2774)

* Parallelize OCI multipart download (#2750)

* [UCVolumes] Add support for list API (#2769)

* Add the memory timeline profiling support through the PyTorch profiler. (#2771)

* v1

* fix issues

* add logs

* change names

* comment

* add device

* uncomment original trace

* add custome plot

* fix pyright

* Update composer/profiler/torch_profiler.py

Co-authored-by: Charles Tang <[email protected]>

* address comments

* fix code check

* fix formatting

* address comments

* add unit test

* fix check

* fix check

* fix check

* fix check

* fix print

* add test comment

* add test comment

---------

Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Charles Tang <[email protected]>

* Improve torch memory profiling arguments processing (#2777)

* improve torch profile args

* improve torch profile args

* change default torch_prof_memory_filename

* add memory profiling arg test

* fix check

* fix check

* fix check

* fix check

* fix check

* fix check

* Add platform AWS and bump aws ofi nccl version (#2776)

* Extend checkpoint loading to accept a validation function (#2726)

* Fix checkpoint validation tests for torch 1.13 (#2779)

* fix checkpoint validation tests for torch 1.13

* Fix

* Bump version to 0.17.2 (#2780)

* bump version

* 0.17.2

* update matrix

* bump transformers version (#2781)

* Bump sphinxext-opengraph from 0.9.0 to 0.9.1 (#2784)

Bumps [sphinxext-opengraph](https://github.com/wpilibsuite/sphinxext-opengraph) from 0.9.0 to 0.9.1.
- [Release notes](https://github.com/wpilibsuite/sphinxext-opengraph/releases)
- [Commits](wpilibsuite/sphinxext-opengraph@v0.9.0...v0.9.1)

---
updated-dependencies:
- dependency-name: sphinxext-opengraph
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump coverage[toml] from 7.3.0 to 7.3.3 (#2783)

Bumps [coverage[toml]](https://github.com/nedbat/coveragepy) from 7.3.0 to 7.3.3.
- [Release notes](https://github.com/nedbat/coveragepy/releases)
- [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst)
- [Commits](nedbat/coveragepy@7.3.0...7.3.3)

---
updated-dependencies:
- dependency-name: coverage[toml]
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update torch requirement from <2.1.2,>=1.13.1 to >=1.13.1,<2.1.3 (#2785)

Updates the requirements on [torch](https://github.com/pytorch/pytorch) to permit the latest version.
- [Release notes](https://github.com/pytorch/pytorch/releases)
- [Changelog](https://github.com/pytorch/pytorch/blob/main/RELEASE.md)
- [Commits](pytorch/pytorch@v1.13.1...v2.1.2)

---
updated-dependencies:
- dependency-name: torch
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [UCVolumes] Rely on databricks-sdk auth for the right requirements (#2789)

* Enable system metrics in mosaic mlflow logger (#2775)

* Enable system metrics in mosaic mlflow logger

* remove fixture

* Update composer/loggers/mlflow_logger.py

Co-authored-by: Mihir Patel <[email protected]>

* Update composer/loggers/mlflow_logger.py

Co-authored-by: Mihir Patel <[email protected]>

* Update composer/loggers/mlflow_logger.py

Co-authored-by: Mihir Patel <[email protected]>

---------

Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Daniel King <[email protected]>

* Update parse_uri (#2787)

* default-no-memory-timeline (#2790)

* Add eot token to ICL generate kwargs (#2782)

* add custome gen kwargs and stopping on eos token

* modify test

* modify test

* finish

* finish

* finish

* finish

* Add nightly image for torch 2.2.0 12-20-23 (#2791)

* Add torch nightly 12-13 (#2792)

* Add process group as arg to FSDP (#2794)

* add test

* only cast if PG is specified

* add to docstring

* filter warning

* filter warning

* docs

* support lists

* remove warnings

* lint

* hsdp monkeypatch

* logs

* change log

* fix patch

* typo

* clean up logs

* Bump coverage[toml] from 7.3.3 to 7.3.4 (#2798)

Bumps [coverage[toml]](https://github.com/nedbat/coveragepy) from 7.3.3 to 7.3.4.
- [Release notes](https://github.com/nedbat/coveragepy/releases)
- [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst)
- [Commits](nedbat/coveragepy@7.3.3...7.3.4)

---
updated-dependencies:
- dependency-name: coverage[toml]
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix load_ignore_keys with rng (#2803)

* fix rng load

* lint

* Bump ipykernel from 6.26.0 to 6.28.0 (#2806)

Bumps [ipykernel](https://github.com/ipython/ipykernel) from 6.26.0 to 6.28.0.
- [Release notes](https://github.com/ipython/ipykernel/releases)
- [Changelog](https://github.com/ipython/ipykernel/blob/main/CHANGELOG.md)
- [Commits](ipython/ipykernel@v6.26.0...v6.28.0)

---
updated-dependencies:
- dependency-name: ipykernel
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump junitparser from 3.1.0 to 3.1.1 (#2805)

Bumps [junitparser](https://github.com/weiwei/junitparser) from 3.1.0 to 3.1.1.
- [Changelog](https://github.com/weiwei/junitparser/blob/master/CHANGELOG.md)
- [Commits](weiwei/junitparser@3.1.0...3.1.1)

---
updated-dependencies:
- dependency-name: junitparser
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump pytest from 7.4.3 to 7.4.4 (#2807)

Bumps [pytest](https://github.com/pytest-dev/pytest) from 7.4.3 to 7.4.4.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](pytest-dev/pytest@7.4.3...7.4.4)

---
updated-dependencies:
- dependency-name: pytest
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Avoid futures on close for MosaicML logger (#2804)

* avoid futures on close

* typo

* logs

* logs

* check (#2812)

* Better communication computation overlap (#2811)

* patched torch

* fixed torch imports

* fixed torch imports

* fixed torch imports

* patching through composer

* patching through composer

* patching typingr

* comment added

* don't patch torch 2.1.0

* patch torch 2.1.1 and 2.2.0

* linting fix

* Improve error message for speed monitor (#2801)

* fix flops

* stacklevel

* bump torch version (#2814)

* bump vision (#2815)

* fix rng load (#2816)

* Correct multi-unshard stream patching for torch 2.2.0dev, and stream waiting correctness. (#2817)

* patched torch

* fixed torch imports

* fixed torch imports

* fixed torch imports

* patching through composer

* patching through composer

* patching typingr

* comment added

* don't patch torch 2.1.0

* patch torch 2.1.1 and 2.2.0

* linting fix

* waiting on computation stream from unshard stream

* waiting on computation stream from unshard stream

* less waiting

* no waiting

* all unshard streams wait on computation stream now

* 2.2.0 dev change

* fix profiler (#2818)

* Bump traitlets from 5.13.0 to 5.14.1 (#2822)

Bumps [traitlets](https://github.com/ipython/traitlets) from 5.13.0 to 5.14.1.
- [Release notes](https://github.com/ipython/traitlets/releases)
- [Changelog](https://github.com/ipython/traitlets/blob/main/CHANGELOG.md)
- [Commits](ipython/traitlets@v5.13.0...v5.14.1)

---
updated-dependencies:
- dependency-name: traitlets
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* All unshard streams wait on computation every step (#2823)

* patched torch

* fixed torch imports

* fixed torch imports

* fixed torch imports

* patching through composer

* patching through composer

* patching typingr

* comment added

* don't patch torch 2.1.0

* patch torch 2.1.1 and 2.2.0

* linting fix

* waiting on computation stream from unshard stream

* waiting on computation stream from unshard stream

* less waiting

* no waiting

* all unshard streams wait on computation stream now

* 2.2.0 dev change

* correct waiting on computation stream

* fsdp state typiung

* patching root pre forward

* patching root pre forward

* fsdp state typing

* patch forward

* correct waiting

* linting

* Add encoding=utf-8 (#2824)

* Fix import for daily test (#2826)

* patched torch

* fixed torch imports

* fixed torch imports

* fixed torch imports

* patching through composer

* patching through composer

* patching typingr

* comment added

* don't patch torch 2.1.0

* patch torch 2.1.1 and 2.2.0

* linting fix

* waiting on computation stream from unshard stream

* waiting on computation stream from unshard stream

* less waiting

* no waiting

* all unshard streams wait on computation stream now

* 2.2.0 dev change

* correct waiting on computation stream

* fsdp state typiung

* patching root pre forward

* patching root pre forward

* fsdp state typing

* patch forward

* correct waiting

* linting

* daily test change

* daily test fix

* [MLFlowObjectStore] [1/2] Base implementation for MLFlowObjectStore (#2802)

* Implementation of MLFlowObjectStore

* Update object store test settings

* Import mlflow dependencies inline

* Fix tests and ignore some pyright

* Bugfix

* Enforce experiment and run in get_artifact_path

* Update placeholders

* Make logs debug instead of info

* Minor PR comments

* MLflow casing

* tracking_uri fixes

* Update comments

* Update placeholders

* Fix tests

* Fix pyright

* Use tempfile for temp dirs

* Read tracking uri env var directly

* Remove dist from MLFlowObjectStore

---------

Co-authored-by: Daniel King <[email protected]>

* Remove fused layernorm (already deprecated for 2 versions) (#2827)

* remove fused layernorm

* remove import

* remove import

* remove

* fix

* remove docs

* all

* fix

* filter warnings

* norm

* lint

* refactor

---------

Co-authored-by: Your Name <[email protected]>

* checkpoint saver tracks all checkpoints/intervals in state (#2819)

* checkpoint tracking state

* fix some tests

* Update tests/callbacks/test_checkpoint_saver.py

* Checkpoint itself should be included in state, dont pickle timestamp object

* patch the key error (doesnt fix the bug though :sad:)

* avoid slashes in state, adjust tests

* fix gpu test, probably

* formatting

* feedback

* add a comment

* Apply suggestions from code review

Co-authored-by: Mihir Patel <[email protected]>

---------

Co-authored-by: Mihir Patel <[email protected]>

* code-quality timeout update (#2830)

Timed out after 10 minutes here https://github.com/mosaicml/composer/actions/runs/7465107219/job/20313553654?pr=2819 

Bumps runtime up to 15min

* [S] Fix how single value tensors are logged (#2831)

Co-authored-by: Daniel King <[email protected]>

* Adds DTensor Support (#2821)

* fixes to get dtensor to work

* more fixes

* Change state dict materialization for new version of torch

* get load working for new set_state_dict api

* use device_mesh

* Add fsdp init monkeypatch for DTensor

* Add checkpoint profiling logs

* attempt

* working single node

* fix optimizer

* allow 3d device mesh

* attempt to use different pg during 3d mesh save

* undo 3d mesh changes

* load_state_dict -> load

* allow parent mesh in FSDP init

* allow override of force_sync_module_states

* remove unnecessary exit

* ignore _validate_and_get_shard_state()

* save/load hsdp-moe working

* remove prints

* v1

* v2

* lint

* add more tests

* switch to PRs

* ignore warning

* fix lint

* version error

* fix version

* fix state dict

* update versions

* lint

* lint

* disable lint for mosaic fsdp utils

* remove bad line

* move around for legacy

* device mesh

* ignore warning

* fix import

* always init

* fix error

* fix load planner

* remove

* fix lint

* lint

* delay state dict

* test checkpoint

* checkpoint

* fix cpu tests

* fix rotate tests

* fix precision

* lint

* fix alibi

* cleanup

* cleanup

* remove force sync

* fix type

* merge

* lint

* fix gpt

* comment

* fix test

* lint

* minor optimizations

* Update composer/core/state.py

Co-authored-by: Evan Racah <[email protected]>

* revert tests

---------

Co-authored-by: Evan Racah <[email protected]>
Co-authored-by: Abhinav Venigalla <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Abhinav Venigalla <[email protected]>
Co-authored-by: Your Name <[email protected]>
Co-authored-by: Evan Racah <[email protected]>

* Remove duplicate checkpoint verifications (#2828)

* Fix seed for FSDP wrap (#2833)

* first try

* add context

* lint

* more lint

* remove comment

---------

Co-authored-by: Daniel King <[email protected]>
Co-authored-by: Your Name <[email protected]>

* Remove fsdp patch for comm overlap (#2836)

* allow hsdp (#2838)

* Bump torch 2.1.2 (#2840)

* bump torch

* bump

* bump

* Upgrade pyright to 1.1.310 (#2841)

* [MLFlowObjectStore] [2/2] Support checkpointing with MLFlow (#2810)

* Support checkpoint uploads to MLFlow (untested)

Use MLFlow run tag for autoresume

Add MLFlowLogger test for existing composer run tag

* Try formatting mlflow save folder after INIT

Make MLFlow experiment and run ID available on all ranks

Fix path issue

Format mlflow placeholders in remote filenames

* Unit tests for partial_format

* Log mlflow info as hyperparams

* partial_format doc update

* Fix formatting

* Pull distributed logic out of MLFlowObjectStore

Add debug tracebacks

Bugfix

Add path to debug info

Try fixing RUD object store init

Pyright

* Partial format in format_name helpers

* Fix import

* Add extra partial_format test

* Fix mlflow RUD check

* Fix test

pyright

No longer expect KeyError for format_with_dist using partial_format

Refactor partial_format for readability

* Max iters on partial_format

* Fix partial_format

* Clean up

* fix test import

* Fix test

* update nightly to torch 2.3 (#2842)

* update nightly to torch 2.3

* tighten

---------

Co-authored-by: Mihir Patel <[email protected]>

* Pin sphinxcontrib applehelp (#2854)

* pin release

* bump

* break pypi

* tighter pin

* pin

* pin

* pin

* Update setup.py (#2855)

* Torch 2.3 patch (#2849)

* add monkeypatch for verify_options

* patch

* fix

* fix

* partial precommit

* bit of cleanup

* doc

* debug

* fix version pinning

* precommit

* checkdown

* lint

---------

Co-authored-by: Evan Racah <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>

* Update mosaicml-cli requirement from <0.6,>=0.5.25 to >=0.5.25,<0.7 (#2866)

Updates the requirements on [mosaicml-cli](https://github.com/mosaicml/mosaicml-cli) to permit the latest version.
- [Commits](https://github.com/mosaicml/mosaicml-cli/commits)

---
updated-dependencies:
- dependency-name: mosaicml-cli
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Rewrite to use individual state functions (#2860)

* checkdown

* checkdown

* lint

* fix

* load ignore keys

* fix

* resolve comments

* fix load ignore keys

* offload

* fix gate

* merge

* lint

* use flag

* force trye

* Add custom stopping criteria to ICL generate tasks (#2800)

* add custome gen kwargs and stopping on eos token

* modify test

* modify test

* finish

* finish

* finish

* finish

* finish pr

* implement early stop

* add tesT

* fix bug

* bug fix

* add keys

* diff split

* fix typo

* fix precommit

* fix precommit

* fix precommit

* fix precommit

* fix precommit

* fix precommit

* fix conditional import

* add nlp metrics

* remove code gen changes

* fix nits

---------

Co-authored-by: Daniel King <[email protected]>

* Add save_ignore_keys (#2868)

* comment

* add it

* debug

* add the keys

* debug

* debug

* remove print statement

* docs and tests

* fix tests

---------

Co-authored-by: Daniel King <[email protected]>

---------

Signed-off-by: chenmoneygithub <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Charles Tang <[email protected]>
Co-authored-by: Anna <[email protected]>
Co-authored-by: Jeremy D <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Chen Qian <[email protected]>
Co-authored-by: Daniel King <[email protected]>
Co-authored-by: coryMosaicML <[email protected]>
Co-authored-by: Harsh Panchal <[email protected]>
Co-authored-by: willgleich <[email protected]>
Co-authored-by: Irene Dea <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: snarayan21 <[email protected]>
Co-authored-by: Jerry Chen <[email protected]>
Co-authored-by: Your Name <[email protected]>
Co-authored-by: Evan Racah <[email protected]>
Co-authored-by: Abhinav Venigalla <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Abhinav Venigalla <[email protected]>
Co-authored-by: Evan Racah <[email protected]>
Co-authored-by: Daniel King <[email protected]>
3 participants