feat(benchmark): JungleGym WebArena #6691

Pwuts · 2024-01-09T17:38:57Z

Increase AGBenchmark's coverage of web browsing and interaction capabilities using the JungleGym WebArena dataset.

Web Navigation Challenge #3936

Changes 🏗️

Simplify models in report_types.py
Refactor & typefix report generation and handling logic
Refactor & rename functions in agent_interface.py and agent_api_interface.py
Set up structure to support more challenge providers
Add JungleGym WebArena challenges

PR Quality Scorecard ✨

Have you used the PR description template? +2 pts
Is your pull request atomic, focusing on a single change? +5 pts
Have you linked the GitHub issue(s) that this PR addresses? +5 pts
Have you documented your changes clearly and comprehensively? +5 pts
Have you changed or added a feature? -4 pts
- Have you added/updated corresponding documentation? +4 pts
- Have you added/updated corresponding integration tests? +5 pts
Have you changed the behavior of AutoGPT? -5 pts
- Have you also run agbenchmark to verify that these changes do not regress performance? +10 pts

- Removed ForbidOptionalMeta and BaseModelBenchmark classes.
- Changed model attributes to optional: `Metrics.difficulty`, `Metrics.success`, `Metrics.success_percentage`, `Metrics.run_time`, and `Test.reached_cutoff`.
- Added validator to `Metrics` model to require `success` and `run_time` fields if `attempted=True`.
- Added default values to all optional model fields.
- Removed duplicate imports.
- Added condition in process_report.py to prevent null lookups if `metrics.difficulty` is not set.
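The conditional requirement on `Metrics` can be illustrated with a small sketch. This is not the code from the PR: it swaps in a standard-library dataclass with `__post_init__` for the model validator so that it runs without third-party packages, and the field types and defaults shown here are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metrics:
    # Field names follow the commit message; types and defaults are assumed.
    attempted: bool
    difficulty: Optional[str] = None
    success: Optional[bool] = None
    success_percentage: Optional[float] = None
    run_time: Optional[str] = None

    def __post_init__(self) -> None:
        # Stand-in for the model validator: an attempted run must report
        # both its outcome and its run time.
        if self.attempted:
            missing = [
                name
                for name in ("success", "run_time")
                if getattr(self, name) is None
            ]
            if missing:
                raise ValueError(
                    f"attempted=True requires: {', '.join(missing)}"
                )
```

With this shape, `Metrics(attempted=False)` is valid with every optional field left unset, while `Metrics(attempted=True, success=True)` raises because `run_time` is missing.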

…g logic
- Rename functions in reports.py and ReportManager.py to better reflect what they do
  - `get_previous_test_results` -> `get_and_update_success_history`
  - `generate_single_call_report` -> `initialize_test_report`
  - `finalize_reports` -> `finalize_test_report`
  - `ReportManager.end_info_report` -> `SessionReportManager.finalize_session_report`
- Modify `pytest_runtest_makereport` hook in conftest.py to finalize the report immediately after the challenge finishes running instead of after teardown
- Move result processing logic from `initialize_test_report` to `finalize_test_report` in reports.py
- Use `Test` and `Report` types from report_types.py where possible instead of untyped dicts: reports.py, utils.py, ReportManager.py
- Differentiate `ReportManager` into `SessionReportManager`, `RegressionTestsTracker`, `SuccessRateTracker`
- Move filtering of optional challenge categories from challenge.py (`Challenge.skip_optional_categories`) to conftest.py (`pytest_collection_modifyitems`)
- Remove unused `scores` fixture in conftest.py
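Filtering at collection time, as opposed to skipping inside each challenge, might look roughly like the sketch below. The `CollectedChallenge` type, the helper name, and the filtering rule (drop an item if it carries any optional category that is not enabled) are all assumptions for illustration; the PR does not spell out the exact rule. In a real conftest.py, a `pytest_collection_modifyitems` hook would apply the same in-place mutation to its `items` list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectedChallenge:
    # Hypothetical stand-in for a collected pytest item.
    name: str
    categories: frozenset


def deselect_disabled_optional(items, optional, enabled):
    """Remove, in place, items tagged with an optional category
    that has not been enabled (assumed rule)."""
    blocked = set(optional) - set(enabled)
    # Slice assignment mutates the caller's list, which is how
    # collection hooks are expected to change the set of items.
    items[:] = [item for item in items if not (item.categories & blocked)]
```

Mutating the list in place matters here: rebinding a local name inside the hook would leave the collected items unchanged.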

…y and agent_api_interface.py
- `copy_artifacts_into_temp_folder` -> `copy_challenge_artifacts_into_workspace`
- `copy_agent_artifacts_into_folder` -> `download_agent_artifacts_into_folder`
- Reorder parameters of `run_api_agent`, `copy_challenge_artifacts_into_workspace`; use `Path` instead of `str`
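To show what the switch from `str` to `Path` buys, here is a minimal sketch of a helper with the renamed function's name. The parameter names, their order, and the copy behavior are assumptions, not the signature from the PR.

```python
import shutil
from pathlib import Path


def copy_challenge_artifacts_into_workspace(
    challenge_dir: Path, artifact_folder_name: str, workspace: Path
) -> None:
    # Parameter names and order are assumed for illustration.
    source = challenge_dir / artifact_folder_name
    if not source.is_dir():
        # Nothing to copy for challenges without this artifact folder.
        return
    workspace.mkdir(parents=True, exist_ok=True)
    for entry in source.iterdir():
        if entry.is_file():
            shutil.copy(entry, workspace / entry.name)
```

Using `Path` lets the function join and inspect paths with `/`, `is_dir()`, and `iterdir()` instead of string concatenation and `os.path` calls.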

…enge providers
- Move `Challenge`, `ChallengeData`, `load_challenges` to `challenges/builtin.py` and rename to `BuiltinChallenge`, `BuiltinChallengeSpec`, `load_builtin_challenges`
- Create `BaseChallenge` to serve as interface and base class for different challenge implementations
- Create `ChallengeInfo` model to serve as universal challenge info object
- Create `get_challenge_from_source_uri` function in `challenges/__init__.py`
- Replace `ChallengeData` by `ChallengeInfo` everywhere except in `BuiltinChallenge`
- Add strong typing to `task_informations` store in app.py
- Use `call.duration` in `finalize_test_report` and remove `timer` fixture
- Update docstring on `challenges/__init__.py:get_unique_categories`
- Add docstring to `generate_test.py`
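The provider structure described in this commit can be sketched as a base class plus a dispatch function. Only the class and function names come from the PR; the `ChallengeInfo` fields, the `SOURCE_URI_PREFIX` attribute, the URI format, and the `from_source_uri` method are hypothetical choices made for this example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class ChallengeInfo:
    # Assumed fields; the PR only calls this a universal info object.
    name: str
    source_uri: str


class BaseChallenge(ABC):
    # Hypothetical convention: each provider owns one URI prefix.
    SOURCE_URI_PREFIX: ClassVar[str]

    def __init__(self, info: ChallengeInfo) -> None:
        self.info = info

    @classmethod
    @abstractmethod
    def from_source_uri(cls, source_uri: str) -> "BaseChallenge":
        """Build a challenge from a provider-specific URI."""


class BuiltinChallenge(BaseChallenge):
    SOURCE_URI_PREFIX = "builtin"  # hypothetical prefix

    @classmethod
    def from_source_uri(cls, source_uri: str) -> "BuiltinChallenge":
        _, _, name = source_uri.partition("/")
        return cls(ChallengeInfo(name=name, source_uri=source_uri))


class WebArenaChallenge(BaseChallenge):
    SOURCE_URI_PREFIX = "webarena"  # hypothetical prefix

    @classmethod
    def from_source_uri(cls, source_uri: str) -> "WebArenaChallenge":
        _, _, task_id = source_uri.partition("/")
        return cls(ChallengeInfo(name=f"WebArenaTask_{task_id}", source_uri=source_uri))


def get_challenge_from_source_uri(source_uri: str) -> BaseChallenge:
    # Dispatch on the part of the URI before the first slash.
    prefix = source_uri.partition("/")[0]
    for provider in (BuiltinChallenge, WebArenaChallenge):
        if provider.SOURCE_URI_PREFIX == prefix:
            return provider.from_source_uri(source_uri)
    raise ValueError(f"Cannot resolve source_uri '{source_uri}'")
```

Under these assumptions, adding another provider means adding one subclass and listing it in the dispatch loop; callers keep working with `BaseChallenge` and `ChallengeInfo` only.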

- Add `WebArenaChallenge`, `WebArenaChallengeSpec`, and other logic to make these challenges work
- Add WebArena challenges to Pytest collection endpoint generate_test.py

github-actions · 2024-01-09T17:39:14Z

This PR exceeds the recommended size of 500 lines. Please make sure you are NOT addressing multiple issues with one PR.

netlify · 2024-01-09T17:39:42Z

✅ Deploy Preview for auto-gpt-docs ready!

Name	Link
🔨 Latest commit	`f722435`
🔍 Latest deploy log	https://app.netlify.com/sites/auto-gpt-docs/deploys/65aac582d4051d0008f53a88
😎 Deploy Preview	https://deploy-preview-6691--auto-gpt-docs.netlify.app

github-actions · 2024-01-16T15:25:00Z

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

github-actions · 2024-01-18T14:27:02Z

Conflicts have been resolved! 🎉 A maintainer will review the pull request shortly.

github-actions · 2024-01-19T14:41:26Z

This PR exceeds the recommended size of 500 lines. Please make sure you are NOT addressing multiple issues with one PR.

github-actions · 2024-01-19T14:53:00Z

This PR exceeds the recommended size of 500 lines. Please make sure you are NOT addressing multiple issues with one PR.

github-actions · 2024-01-19T16:33:29Z

This PR exceeds the recommended size of 500 lines. Please make sure you are NOT addressing multiple issues with one PR.

github-actions · 2024-01-19T18:55:13Z

This PR exceeds the recommended size of 500 lines. Please make sure you are NOT addressing multiple issues with one PR.

Pwuts added 5 commits January 9, 2024 15:21

feat(benchmark): Add JungleGym WebArena challenges

904dbd1

github-actions bot added the size/xl label Jan 9, 2024

Pwuts added the Classic Benchmark and code quality ⬆️ labels Jan 9, 2024

github-actions bot added the conflicts label Jan 16, 2024

Merge branch 'master' into benchmark/integrate-junglegym-webarena

4f54e21

github-actions bot added the size/l label and removed the conflicts and size/xl labels Jan 18, 2024

Pwuts removed the code quality ⬆️ label Jan 18, 2024

feat(benchmark/webarena): Add hand-picked selection of WebArena challenges

6b00ca6

github-actions bot added size/xl and removed size/l labels Jan 19, 2024

Merge branch 'master' into benchmark/integrate-junglegym-webarena

8dd4acb

Merge branch 'master' into benchmark/integrate-junglegym-webarena

3e34a34

Merge branch 'master' into benchmark/integrate-junglegym-webarena

f722435

Pwuts marked this pull request as ready for review January 19, 2024 19:29

Pwuts requested a review from a team January 19, 2024 19:29

Pwuts merged commit 488f40a into master Jan 19, 2024
10 of 13 checks passed

Pwuts deleted the benchmark/integrate-junglegym-webarena branch January 19, 2024 19:34
