Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HTML output support to web browser tool #959

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

farrelmahaztra
Copy link

This PR contains:

  • New features
  • Changes to dev-tools e.g. CI config / github tooling
  • Docs
  • Bug fixes
  • Code refactor

What is the current behavior? (You can also link to an open issue here)

Output from the web browser tool is currently always in the form of accessibility trees. This is great for readability and minimizing token count but excludes important context from websites that aren't a11y-friendly.

For instance, I suspect issues like this #636 arise from elements that aren't actually e.g. <input type="checkbox"> (since attributes like checked seem to already be surfaced in the ATs) but rather things like divs styled to look like checkboxes without the appropriate ARIA attributes. (Not sure if that specific issue is because of that though)

What is the new behavior?

You can now pass an output format into the web browser tool, like:

 solver=[
            use_tools(web_browser(output_format=CrawlerOutputFormat.HTML)),
            generate(),
        ]

Which will output HTML instead:

Screenshot 2024-12-08 at 03 12 53

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

I don't believe so, as the output format still defaults to accessibility trees.

Other information:

This is my first contribution here so do let me know if I've missed something obvious. I'm also on the Inspect Slack with the same name.

@jjallaire
Copy link
Collaborator

Hi @farrelmahaztra, thanks for submitting this. If we do go ahead with it I'll leave some comments on the review but before we get to that I'd like to check with @MariaIzobava for her take on this.

One concern I have is that context-windows are limited, and HTML web pages can be substantially larger than accessibility trees. We also limit tool output to 16k which it seems like many pages could overflow? Depending on how complex the HTML is models also might have a much harder time understanding/navigating compared to accessibility trees. @MariaIzobava just wondering if you have in-house experience with this you'd want to share. I don't want to add a feature without a strong use case and I certainly don't want to add one that is a footgun!

@farrelmahaztra I'll do a review of the code separately but let's hold off on addressing until we have the discussion here about whether to proceed.

Copy link
Collaborator

@jjallaire jjallaire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments on the implementation, but the bigger picture is that I'm not entirely sold on whether we should do this. HTML will be harder for humans to do approvals on, spill more distracting stuff into the context window, and in many cases be less semantically obvious to the model in terms of understanding the page content.

I wonder if there are really two different scenarios: navigating the web for semantic content and navigating more interactive web experiences. I wonder for the latter if images might be better than HTML? While images would also put pressure on context windows, some adaptations could be made to be smarter about clipping all but the most recent images (that's what we've started working towards for desktop evals).

Appreciate the PR and we certainly might be convinced to take it, but I'd like to get input from @MariaIzobava and others first as they certainly have more experience than I in this domain!

@@ -11,43 +12,59 @@
from inspect_ai.util._store import store


def web_browser(interactive: bool = True) -> list[Tool]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this sort of thing our style is to use typed Literal, so something like format: Literal["at", "html"]

@@ -57,7 +74,9 @@ def web_browser_go() -> Tool:
async def execute(url: str) -> str:
"""Navigate the web browser to a URL.

Once you have navigated to a page, you will be presented with a web accessibilty tree of the elements on the page. Each element has an ID, which is displayed in brackets at the beginning of its line. For example:
Once you have navigated to a page, you will be presented with HTML or a web accessibility tree of the elements on the page.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we would need to do dynamic prompting based on the output format chosen (note that the API above also implies that DOM and Pixels are supported -- not sure if they are I'd lean against for now -- but we'd need to prompt for them as well if they were in play).

Note that dynamic prompting for interactive vs not interactive is done here:

# start with go tool (excluding interactive docs if necessary)
. We'd need to do something similar for the output format. At some point it will become easier to just use ToolDefs for this: https://inspect.ai-safety-institute.org.uk/tools.html#sec-dynamic-tools

def web_at_viewer(call: ToolCall) -> ToolCallView:
# get the web accessiblity tree, if we have it create a view from it
def web_viewer(call: ToolCall) -> ToolCallView:
web_html = store().get(WEB_BROWSER_HTML, "")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason I don't love HTML here is that we end up resenting the entire page, which will get truncated and make it very difficult for human reviewers to see what the model is doing (note below we can vector in on exactly what is being clicked and show a coherent snippet around it). It's quite important to have human oversight for browser tools and accessibility trees seem much better suited to this.

store().set(WEB_BROWSER_AT, web_at)
return web_at
else:
raise ValueError(f"Unknown output format: {output_format}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that enum up above implies support for DOM and Pixels but if the user tries to use these they'll get an error here (not saying we should support them here, just supporting the argument to narrow that enum).

@@ -285,10 +285,12 @@ def render(self, output_format: CrawlerOutputFormat) -> Any:
the currently active webpage rendered using given format.
"""
match output_format:
case CrawlerOutputFormat.HTML:
return self._page.content()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thought, we limit tool output to 16kb to go easy on context windows (if you have tool calls that frequently exceed this you will both fill up the context window and sometimes confuse the model with spurious content). One reason to favor accessibility trees is that they are much smaller and semantically clear so better as model input.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants