feat(benchmark): JungleGym WebArena #6691

Merged: Pwuts merged 10 commits into master from benchmark/integrate-junglegym-webarena on Jan 19, 2024

Pwuts (Member) commented on Jan 9, 2024

Increase AGBenchmark's coverage of web browsing and interaction capabilities using the JungleGym WebArena dataset.

Related:

Changes 🏗️

  • Simplify models in report_types.py
  • Refactor & typefix report generation and handling logic
  • Refactor & rename functions in agent_interface.py and agent_api_interface.py
  • Set up structure to support more challenge providers
  • Add JungleGym WebArena challenges

PR Quality Scorecard ✨

  • Have you used the PR description template?   +2 pts
  • Is your pull request atomic, focusing on a single change?   +5 pts
  • Have you linked the GitHub issue(s) that this PR addresses?   +5 pts
  • Have you documented your changes clearly and comprehensively?   +5 pts
  • Have you changed or added a feature?   -4 pts
    • Have you added/updated corresponding documentation?   +4 pts
    • Have you added/updated corresponding integration tests?   +5 pts
  • Have you changed the behavior of AutoGPT?   -5 pts
    • Have you also run agbenchmark to verify that these changes do not regress performance?   +10 pts

Pwuts added 5 commits January 9, 2024 15:21
- Removed ForbidOptionalMeta and BaseModelBenchmark classes.
- Changed model attributes to optional: `Metrics.difficulty`, `Metrics.success`, `Metrics.success_percentage`, `Metrics.run_time`, and `Test.reached_cutoff`.
- Added validator to `Metrics` model to require `success` and `run_time` fields if `attempted=True`.
- Added default values to all optional model fields.
- Removed duplicate imports.
- Added condition in process_report.py to prevent null lookups if `metrics.difficulty` is not set.
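
The conditional requirement described in this commit can be sketched as follows. This is a minimal stand-in built on stdlib dataclasses rather than the project's actual model class; the field names come from the commit message, while the types, defaults, and error text are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metrics:
    # Field names follow the commit message; types and defaults are assumed.
    attempted: bool
    difficulty: Optional[str] = None
    success: Optional[bool] = None
    success_percentage: Optional[float] = None
    run_time: Optional[str] = None

    def __post_init__(self) -> None:
        # Mirror the described validator: an attempted test must report
        # both an outcome and a run time; an unattempted one may omit them.
        if self.attempted:
            missing = [
                name
                for name in ("success", "run_time")
                if getattr(self, name) is None
            ]
            if missing:
                raise ValueError(
                    f"attempted=True requires: {', '.join(missing)}"
                )
```

With this shape, `Metrics(attempted=False)` is valid with every other field left unset, while `Metrics(attempted=True, success=True)` raises because `run_time` is missing.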
…g logic

- Rename functions in reports.py and ReportManager.py to better reflect what they do
   - `get_previous_test_results` -> `get_and_update_success_history`
   - `generate_single_call_report` -> `initialize_test_report`
   - `finalize_reports` -> `finalize_test_report`
   - `ReportManager.end_info_report` -> `SessionReportManager.finalize_session_report`
- Modify `pytest_runtest_makereport` hook in conftest.py to finalize the report immediately after the challenge finishes running instead of after teardown
   - Move result processing logic from `initialize_test_report` to `finalize_test_report` in reports.py
- Use `Test` and `Report` types from report_types.py where possible instead of untyped dicts: reports.py, utils.py, ReportManager.py
- Differentiate `ReportManager` into `SessionReportManager`, `RegressionTestsTracker`, `SuccessRateTracker`
- Move filtering of optional challenge categories from challenge.py (`Challenge.skip_optional_categories`) to conftest.py (`pytest_collection_modifyitems`)
- Remove unused `scores` fixture in conftest.py
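
The hook change in this commit relies on pytest calling `pytest_runtest_makereport` once per test phase. A rough sketch of finalizing on the `call` phase instead of `teardown` might look like this; the `finalize_test_report` signature and the `finalized` list are hypothetical placeholders, not the code in reports.py.

```python
finalized = []  # hypothetical stand-in for the session report


def finalize_test_report(item, call) -> None:
    # Placeholder for the real function in reports.py; signature assumed.
    finalized.append(
        {
            "test": item.name,
            "success": call.excinfo is None,  # no exception means a pass
            "run_time": call.duration,  # seconds spent in this phase
        }
    )


def pytest_runtest_makereport(item, call) -> None:
    # pytest invokes this hook for the "setup", "call", and "teardown"
    # phases. Acting on "call" finalizes the report as soon as the
    # challenge itself has finished, before any teardown work runs.
    if call.when == "call":
        finalize_test_report(item, call)
```

`call.when`, `call.excinfo`, and `call.duration` are attributes of pytest's `CallInfo` object; everything else here is illustrative.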
…y and agent_api_interface.py

- `copy_artifacts_into_temp_folder` -> `copy_challenge_artifacts_into_workspace`
- `copy_agent_artifacts_into_folder` -> `download_agent_artifacts_into_folder`
- Reorder parameters of `run_api_agent`, `copy_challenge_artifacts_into_workspace`; use `Path` instead of `str`
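
To illustrate the move from `str` to `Path`, here is a small sketch of what a `Path`-based artifact copy could look like. Only the function name comes from the commit message; the parameter names, their order, and the flat-copy behavior are assumptions.

```python
import shutil
from pathlib import Path


def copy_challenge_artifacts_into_workspace(
    challenge_dir: Path, artifact_folder_name: str, workspace: Path
) -> None:
    # Parameter names and order are assumed for illustration.
    source = challenge_dir / artifact_folder_name
    if not source.is_dir():
        return  # this challenge has no artifacts of the requested kind
    workspace.mkdir(parents=True, exist_ok=True)
    for entry in source.iterdir():
        if entry.is_file():
            shutil.copy(entry, workspace / entry.name)
```

Taking `Path` arguments lets the function join and inspect paths with `/`, `is_dir()`, and `iterdir()` instead of string concatenation and `os.path` calls.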
…enge providers

- Move `Challenge`, `ChallengeData`, `load_challenges` to `challenges/builtin.py` and rename to `BuiltinChallenge`, `BuiltinChallengeSpec`, `load_builtin_challenges`
- Create `BaseChallenge` to serve as interface and base class for different challenge implementations
- Create `ChallengeInfo` model to serve as universal challenge info object
- Create `get_challenge_from_source_uri` function in `challenges/__init__.py`
- Replace `ChallengeData` by `ChallengeInfo` everywhere except in `BuiltinChallenge`
- Add strong typing to `task_informations` store in app.py
- Use `call.duration` in `finalize_test_report` and remove `timer` fixture
- Update docstring on `challenges/__init__.py:get_unique_categories`
- Add docstring to `generate_test.py`
- Add `WebArenaChallenge`, `WebArenaChallengeSpec`, and other logic to make these challenges work
- Add WebArena challenges to Pytest collection endpoint generate_test.py
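
The provider structure described in these commits (a shared `BaseChallenge` interface, a universal `ChallengeInfo` object, and lookup by source URI) could be sketched roughly as below. The class and function names come from the commit messages; the URI prefixes, the `SOURCE_URI_PREFIX` attribute, the fields of `ChallengeInfo`, and the choice to return an instance are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ChallengeInfo:
    # Provider-independent description; these fields are illustrative.
    name: str
    source_uri: str


class BaseChallenge(ABC):
    SOURCE_URI_PREFIX: str  # hypothetical: identifies the provider

    def __init__(self, info: ChallengeInfo) -> None:
        self.info = info

    @classmethod
    @abstractmethod
    def from_source_uri(cls, source_uri: str) -> "BaseChallenge":
        """Build a challenge from a provider-specific URI."""


class BuiltinChallenge(BaseChallenge):
    SOURCE_URI_PREFIX = "builtin"  # assumed prefix

    @classmethod
    def from_source_uri(cls, source_uri: str) -> "BuiltinChallenge":
        _, _, name = source_uri.partition("/")
        return cls(ChallengeInfo(name=name, source_uri=source_uri))


class WebArenaChallenge(BaseChallenge):
    SOURCE_URI_PREFIX = "webarena"  # assumed prefix

    @classmethod
    def from_source_uri(cls, source_uri: str) -> "WebArenaChallenge":
        _, _, task_id = source_uri.partition("/")
        return cls(ChallengeInfo(name=task_id, source_uri=source_uri))


PROVIDERS = (BuiltinChallenge, WebArenaChallenge)


def get_challenge_from_source_uri(source_uri: str) -> BaseChallenge:
    # Dispatch on the part of the URI before the first slash.
    prefix, _, _ = source_uri.partition("/")
    for provider in PROVIDERS:
        if provider.SOURCE_URI_PREFIX == prefix:
            return provider.from_source_uri(source_uri)
    raise ValueError(f"No challenge provider for source URI {source_uri!r}")
```

Under these assumptions, supporting another provider means adding one subclass and one entry in `PROVIDERS`; callers that only read `challenge.info` stay unchanged.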

github-actions bot (Contributor) commented on Jan 9, 2024

This PR exceeds the recommended size of 500 lines. Please make sure you are NOT addressing multiple issues with one PR.

netlify bot commented Jan 9, 2024

Deploy Preview for auto-gpt-docs ready!

🔨 Latest commit: f722435
🔍 Latest deploy log: https://app.netlify.com/sites/auto-gpt-docs/deploys/65aac582d4051d0008f53a88
😎 Deploy Preview: https://deploy-preview-6691--auto-gpt-docs.netlify.app

Pwuts added the Classic Benchmark and code quality ⬆️ labels on Jan 9, 2024
github-actions bot added the conflicts label on Jan 16, 2024

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

Conflicts have been resolved! 🎉 A maintainer will review the pull request shortly.

github-actions bot added the size/l label and removed the conflicts and size/xl labels on Jan 18, 2024
Pwuts removed the code quality ⬆️ label on Jan 18, 2024
github-actions bot added the size/xl label and removed the size/l label on Jan 19, 2024
Pwuts marked this pull request as ready for review on January 19, 2024 19:29
Pwuts requested a review from a team on January 19, 2024 19:29
Pwuts merged commit 488f40a into master on Jan 19, 2024
10 of 13 checks passed
Pwuts deleted the benchmark/integrate-junglegym-webarena branch on January 19, 2024 19:34