-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HTML output support to web browser tool #959
base: main
Are you sure you want to change the base?
Add HTML output support to web browser tool #959
Conversation
Hi @farrelmahaztra, thanks for submitting this. If we do go ahead with it I'll leave some comments on the review but before we get to that I'd like to check with @MariaIzobava for her take on this. One concern I have is that context-windows are limited, and HTML web pages can be substantially larger than accessibility trees. We also limit tool output to 16k which it seems like many pages could overflow? Depending on how complex the HTML is models also might have a much harder time understanding/navigating compared to accessibility trees. @MariaIzobava just wondering if you have in-house experience with this you'd want to share. I don't want to add a feature without a strong use case and I certainly don't want to add one that is a footgun! @farrelmahaztra I'll do a review of the code separately but let's hold off on addressing until we have the discussion here about whether to proceed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments on the implementation, but the bigger picture is that I'm not entirely sold on whether we should do this. HTML will be harder for humans to do approvals on, spill more distracting stuff into the context window, and in many cases be less semantically obvious to the model in terms of understanding the page content.
I wonder if there are really two different scenarios: navigating the web for semantic content and navigating more interactive web experiences. I wonder for the latter if images might be better than HTML? While images would also put pressure on context windows, some adaptations could be made to be smarter about clipping all but the most recent images (that's what we've started working towards for desktop evals).
Appreciate the PR and we certainly might be convinced to take it, but I'd like to get input from @MariaIzobava and others first as they certainly have more experience than I in this domain!
@@ -11,43 +12,59 @@ | |||
from inspect_ai.util._store import store | |||
|
|||
|
|||
def web_browser(interactive: bool = True) -> list[Tool]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this sort of thing our style is to use typed Literal
, so something like format: Literal["at", "html"]
@@ -57,7 +74,9 @@ def web_browser_go() -> Tool: | |||
async def execute(url: str) -> str: | |||
"""Navigate the web browser to a URL. | |||
|
|||
Once you have navigated to a page, you will be presented with a web accessibilty tree of the elements on the page. Each element has an ID, which is displayed in brackets at the beginning of its line. For example: | |||
Once you have navigated to a page, you will be presented with HTML or a web accessibility tree of the elements on the page. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we would need to do dynamic prompting based on the output format chosen (note that the API above also implies that DOM and Pixels are supported -- not sure if they are I'd lean against for now -- but we'd need to prompt for them as well if they were in play).
Note that dynamic prompting for interactive vs not interactive is done here:
# start with go tool (excluding interactive docs if necessary) |
def web_at_viewer(call: ToolCall) -> ToolCallView: | ||
# get the web accessiblity tree, if we have it create a view from it | ||
def web_viewer(call: ToolCall) -> ToolCallView: | ||
web_html = store().get(WEB_BROWSER_HTML, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason I don't love HTML here is that we end up resenting the entire page, which will get truncated and make it very difficult for human reviewers to see what the model is doing (note below we can vector in on exactly what is being clicked and show a coherent snippet around it). It's quite important to have human oversight for browser tools and accessibility trees seem much better suited to this.
store().set(WEB_BROWSER_AT, web_at) | ||
return web_at | ||
else: | ||
raise ValueError(f"Unknown output format: {output_format}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that enum up above implies support for DOM and Pixels but if the user tries to use these they'll get an error here (not saying we should support them here, just supporting the argument to narrow that enum).
@@ -285,10 +285,12 @@ def render(self, output_format: CrawlerOutputFormat) -> Any: | |||
the currently active webpage rendered using given format. | |||
""" | |||
match output_format: | |||
case CrawlerOutputFormat.HTML: | |||
return self._page.content() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other thought, we limit tool output to 16kb to go easy on context windows (if you have tool calls that frequently exceed this you will both fill up the context window and sometimes confuse the model with spurious content). One reason to favor accessibility trees is that they are much smaller and semantically clear so better as model input.
This PR contains:
What is the current behavior? (You can also link to an open issue here)
Output from the web browser tool is currently always in the form of accessibility trees. This is great for readability and minimizing token count but excludes important context from websites that aren't a11y-friendly.
For instance, I suspect issues like this #636 arise from elements that aren't actually e.g.
<input type="checkbox">
(since attributes likechecked
seem to already be surfaced in the ATs) but rather things like divs styled to look like checkboxes without the appropriate ARIA attributes. (Not sure if that specific issue is because of that though)What is the new behavior?
You can now pass an output format into the web browser tool, like:
Which will output HTML instead:
Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
I don't believe so, as the output format still defaults to accessibility trees.
Other information:
This is my first contribution here so do let me know if I've missed something obvious. I'm also on the Inspect Slack with the same name.