Commit f6a57c2

TLSDC, recursix, xhluca, gasse, and jardinetsouffleton authored
Release (#79)
* downgrading ubuntu version for github tests (#62) * Llm api update (#59) * getting rid of .invoke() * adding an AbstractChatModel * changing chat_api structure * Reproducibility again (#61) * core functions * switch to dask * removing joblib dependency and adding dask * fixing imports * handles multiple backends * ensure asyncio loop creation * more tests * setting dashboard address to None * minor * Finally found a way to make it work * initial reproducibility files * Seems to be superflus * adding a reproducibility journal * minor update * more robust * adding reproducibility tools * fix white listing * minor * minor * minor * minor * minor fix * more tests * more results yay * disabling this test * update * update * black * maybe fixing github workflow ? * make get_git_username great again * trigger change * new browsergym * GPT-4o result (and new comment column) * Seems like there was a change to 4o flags, trying these * minor comment * better xray * minor fix * addming a comment field * new agent * another test with GPT-4o * adding llama3 from openrouter * fix naming * unused import * new summary tools and remove "_args" from columns in results * add Llama * initial code for reproducibility agent * adjust inspect results * infer from benchmark * fix reproducibility agent * prevent the repro_dir to be an index variable * updating repro agent stats * Reproducibility agent * instructions to setup workarena * fixing tests * handles better a few edge cases * default progress function to None * minor formatting * minor * initial commit * refactoring with Study class * refactor to adapt for study class * minor * fix pricy test * fixing tests * tmp * print report * minor fix * refine little details about reproducibility * minor * no need for set_temp anymore * sanity check before running main * minor update * minor * new results with 4o on workarena.l1 * sharing is caring * add llama to main.py * new hournal entry * lamma 3 70B * minor * typo * black fix (wasn't 
configured) --------- Co-authored-by: Thibault Le Sellier de Chezelles <[email protected]> * version bump * Patching minor stuff (#69) * fixing sample_std for single experience * making gradio shared server non default * missing requirement for xray * Improve agent xray app (#70) * 0.2.2 Release (#67) * downgrading ubuntu version for github tests (#62) * Llm api update (#59) * getting rid of .invoke() * adding an AbstractChatModel * changing chat_api structure * Reproducibility again (#61) * core functions * switch to dask * removing joblib dependency and adding dask * fixing imports * handles multiple backends * ensure asyncio loop creation * more tests * setting dashboard address to None * minor * Finally found a way to make it work * initial reproducibility files * Seems to be superflus * adding a reproducibility journal * minor update * more robust * adding reproducibility tools * fix white listing * minor * minor * minor * minor * minor fix * more tests * more results yay * disabling this test * update * update * black * maybe fixing github workflow ? 
* make get_git_username great again * trigger change * new browsergym * GPT-4o result (and new comment column) * Seems like there was a change to 4o flags, trying these * minor comment * better xray * minor fix * addming a comment field * new agent * another test with GPT-4o * adding llama3 from openrouter * fix naming * unused import * new summary tools and remove "_args" from columns in results * add Llama * initial code for reproducibility agent * adjust inspect results * infer from benchmark * fix reproducibility agent * prevent the repro_dir to be an index variable * updating repro agent stats * Reproducibility agent * instructions to setup workarena * fixing tests * handles better a few edge cases * default progress function to None * minor formatting * minor * initial commit * refactoring with Study class * refactor to adapt for study class * minor * fix pricy test * fixing tests * tmp * print report * minor fix * refine little details about reproducibility * minor * no need for set_temp anymore * sanity check before running main * minor update * minor * new results with 4o on workarena.l1 * sharing is caring * add llama to main.py * new hournal entry * lamma 3 70B * minor * typo * black fix (wasn't configured) --------- Co-authored-by: Thibault Le Sellier de Chezelles <[email protected]> * version bump --------- Co-authored-by: Alexandre Lacoste <[email protected]> * Make share=TRue into a environment variable, disabled by default for security * fix floating point issue with std_reward in agent xray * Update src/agentlab/analyze/inspect_results.py * Update src/agentlab/analyze/agent_xray.py --------- Co-authored-by: Thibault LSDC <[email protected]> Co-authored-by: Alexandre Lacoste <[email protected]> * added tmlr definitive config (#71) * downgrading gradio version (#77) * Study refactor (#73) * adapting to new Benchmark class * fixing tests * fix tests * typo * not ready for gradio 5 * study id and a few fixes * fixing pricy tests --------- 
Co-authored-by: ThibaultLSDC <[email protected]> * adding message class and updating generic agent accordingly (#68) * adding message class and updating generic agent accordingly * updating tests * Reproducibility test before message class * Adding inspect_result.ipynb to reprod white list * Reproducibility test after message class * L1 before message class * L1 after message class * added append as method to the Discussion class, to make it totally similar to a list * changed to_markdown behavior * updated most_basic_agent * updated ReproAgent * Update src/agentlab/analyze/agent_xray.py * format * new journal entry * immutable as default kwarg * removing __add__ and __radd__ * added deprecation warning * updating tests * version bump * Updating generic_agent to fit use BGym's goal_object (#83) * updating generic agent to goal_object * fixing image markdown display * updating tests * fixing intruction BaseMessage * added merge text in discussion * added merge to discussion class * added tests * Minor revert (#86) * minor revert * revert tests too * Add tabs (#84) * add tabs * make sure it's not computed if not visible * Fix reproduce study (#87) * add tabs * this workaround is worst * bug fix * fix reproduce study * make sure it's not computed if not visible * upgrading gradio dependency (#88) * bgym update (#90) * Workarena TMLR experiments (#89) * new entry * adding llm configs * new journal entries * handling sequntial in VWA (#91) * handling sequntial in VWA * enable comments * format --------- Co-authored-by: ThibaultLSDC <[email protected]> * Tmlr workarena (#92) * adding llm configs * new L1 entries * tmp * reformat * adding assistantbench to reproducibility_util.py * gitignore (#97) * Vision fix (#105) * changing content name * Update src/agentlab/llm/llm_utils.py --------- Co-authored-by: Maxime Gasse <[email protected]> * L2 tmlr (#93) * adding llm configs * L2 entries * claude L3 * claude vision support * miniwob results * 405b L1 entry * Replacing Dask 
with Ray (#100) * dask-dependencies * minor * replace with ray * adjust tests and move a few things * markdown report * automatic relaunch * add dependencies * reformat * fix unit-test * catch timeout * fixing bugs and making things work * adress comments and black format * new dependencies viewer * Update benchmark to use visualwebarena instead of webarena * Fix import and uncomment code in get_ray_url.py * Add ignore_dependencies option to Study and _agents_on_benchmark functions * Update load_most_recent method to include contains parameter * Update load_most_recent method to accept contains parameter and add warning for ignored dependencies in _agents_on_benchmark * Refactor backend preparation in Study class and improve logging for ignored dependencies * finallly some results with claude on webarena * Add warnings for Windows timeouts and clarify parallel backend options; update get_results method to conditionally save outputs * black * ensure timeout is int (For the 3rd time?) * Refactor timeout handling in context manager; update test to reduce avg_step_timeout and rename test function * black * Change parallel backend from "joblib" to "ray" in run_experiments function * Update src/agentlab/experiments/study.py Co-authored-by: Maxime Gasse <[email protected]> * Update src/agentlab/analyze/inspect_results.py Co-authored-by: Maxime Gasse <[email protected]> * Refactor logging initialization and update layout configurations in dependency graph plotting; adjust node size and font size for better visualization --------- Co-authored-by: Maxime Gasse <[email protected]> * switching to 2 for loops in _agents_on_benchmark (#107) * yet another way to kill timedout jobs (#108) * Fix prompt formatting in Observation and add static method to Study class (#110) * Bug fix (#111) * Fix prompt formatting in Observation and add static method to Study class * Update gradio version to 5.5 to fix DataFrame scrolling issue * Fixing openrouter pricing rate limit (#112) * Update 
unit_tests.yml (#101) * request is done once and then reused * switched to caching original function bc it doesnt break to tests * added a catch for some openrouter under-the-hood error --------- Co-authored-by: Maxime Gasse <[email protected]> Co-authored-by: Xing Han Lu <[email protected]> Co-authored-by: Alexandre
Lacoste <[email protected]> * updating max prompt configs, vision support (#109) * Cross-product deepcopy fix (#106) Co-authored-by: Maxime Gasse <[email protected]> * slugify study_name (#114) * Improve timeout handling in task polling logic * Add method to override max_steps in Study class * add support for tab visibility in observation flags and update related components * fix tests * Fix sorting bug. improve directory content retrieval with summary statistics * fix test * black * Weblinx results (#104) * adding weblinx results * adding old weblinx results --------- Co-authored-by: ThibaultLSDC <[email protected]> * Max new tokens fix (#118) * Lower max_new_tokens for OpenAI models * updating configs --------- Co-authored-by: Thibault LSDC <[email protected]> Co-authored-by: ThibaultLSDC <[email protected]> * version bump (#119) * fix format (#120) * Clean pipeline (#117) * yet another way to kill timedout jobs * Improve timeout handling in task polling logic * Add method to override max_steps in Study class * add support for tab visibility in observation flags and update related components * fix tests * black * Improve timeout handling in task polling logic * yet another way to kill timedout jobs (#108) * Add method to override max_steps in Study class * add support for tab visibility in observation flags and update related components * fix tests * black * black * Fix sorting bug. improve directory content retrieval with summary statistics * fix test * black * tmp * add error report, add cum cost to summary and ray backend by default * black * fix test (chaing to joblib backend) * black --------- Co-authored-by: Maxime Gasse <[email protected]> --------- Co-authored-by: Alexandre Lacoste <[email protected]> Co-authored-by: Xing Han Lu <[email protected]> Co-authored-by: Maxime Gasse <[email protected]> Co-authored-by: Léo Boisvert <[email protected]>
1 parent f6f1680 commit f6a57c2

42 files changed: 2318 additions and 1335 deletions

.github/workflows/unit_tests.yml

Lines changed: 3 additions & 0 deletions
@@ -58,6 +58,9 @@ jobs:
       - name: Check MiniWob availability
         run: curl -I "http://localhost:8080/miniwob/" || echo "MiniWob not reachable"
 
+      - name: Pre-download nltk ressources
+        run: python -c "import nltk; nltk.download('punkt_tab')"
+
       - name: Run AgentLab Unit Tests
         env:
           MINIWOB_URL: "http://localhost:8080/miniwob/"

.gitignore

Lines changed: 2 additions & 27 deletions
@@ -161,35 +161,10 @@ cython_debug/
 **/.DS_Store
 
 .vscode
-allowed_selenium.json
 
-# Torchtune
-finetuning/torchtune
-
-# PyLLMD repo for finetuning
-pyllmd_tune/research-pyllmd/
-pyllmd_tune/data/
-
-
-datasets/*
 _sandbox.py
-node_modules/
-/test-results/
-/playwright-report/
-/blob-report/
-/playwright/.cache/
-/test-results/
-/playwright-report/
-/blob-report/
-/playwright/.cache/
-
 
 results/
 
-# personal (optimass)
-ICML_deadline/
-mass_utils/
-pyllmd_tune/
-
-# don't ignore the miniwob_tasks_all.csv file
-!miniwob_tasks_all.csv
+# gradio
+.gradio/

main.py

Lines changed: 16 additions & 12 deletions
@@ -7,29 +7,28 @@
 """
 
 import logging
-
 from agentlab.agents.generic_agent import (
     RANDOM_SEARCH_AGENT,
     AGENT_4o,
     AGENT_4o_MINI,
     AGENT_LLAMA3_70B,
     AGENT_LLAMA31_70B,
 )
-from agentlab.analyze.inspect_results import get_most_recent_folder
-from agentlab.experiments import study_generators
+from agentlab.experiments.study import Study
 
 logging.getLogger().setLevel(logging.INFO)
 
 # choose your agent or provide a new agent
 agent_args = [AGENT_4o_MINI]
 # agent_args = [AGENT_4o]
 
-## select the benchmark to run on
+
+# ## select the benchmark to run on
 benchmark = "miniwob_tiny_test"
 # benchmark = "miniwob"
-# benchmark = "workarena.l1"
-# benchmark = "workarena.l2"
-# benchmark = "workarena.l3"
+# benchmark = "workarena_l1"
+# benchmark = "workarena_l2"
+# benchmark = "workarena_l3"
 # benchmark = "webarena"
 
 # Set reproducibility_mode = True for reproducibility
@@ -53,13 +52,18 @@
 
 if relaunch:
     # relaunch an existing study
-    study_dir = get_most_recent_folder()
-    study = study_generators.make_relaunch_study(study_dir, relaunch_mode="incomplete_or_error")
+    study = Study.load_most_recent(contains=None)
+    study.find_incomplete(include_errors=True)
 
 else:
-    study = study_generators.run_agents_on_benchmark(agent_args, benchmark)
-
-study.run(n_jobs=n_jobs, parallel_backend="joblib", strict_reproducibility=reproducibility_mode)
+    study = Study(agent_args, benchmark, logging_level_stdout=logging.WARNING)
+
+study.run(
+    n_jobs=n_jobs,
+    parallel_backend="ray",
+    strict_reproducibility=reproducibility_mode,
+    n_relaunch=3,
+)
 
 if reproducibility_mode:
     study.append_to_journal(strict_reproducibility=True)
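
The relaunch path in this hunk reloads the most recent study, selects incomplete or errored experiments, and reruns them, with `n_relaunch=3` passed to `study.run`. As a rough illustration of that kind of bounded relaunch loop, the sketch below uses hypothetical stand-ins (`run_with_relaunch`, `run_once`, and a plain dict of task statuses). None of these names come from AgentLab, and the real `Study.run` may behave differently.

```python
from typing import Callable, Dict


def run_with_relaunch(
    statuses: Dict[str, str],
    run_once: Callable[[str], str],
    n_relaunch: int = 3,
) -> Dict[str, str]:
    """Run every task that is not "done", for at most n_relaunch passes.

    Hypothetical helper: `statuses` maps a task name to "done", "error" or
    "incomplete", and `run_once` returns the new status of a task.
    """
    for _ in range(n_relaunch):
        pending = [name for name, status in statuses.items() if status != "done"]
        if not pending:
            break  # nothing left to relaunch
        for name in pending:
            statuses[name] = run_once(name)
    return statuses
```

A task that keeps failing is simply left in its last status once the passes are exhausted, so the caller can still report it.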

reproducibility_journal.csv

Lines changed: 47 additions & 11 deletions
Large diffs are not rendered by default.

requirements.txt

Lines changed: 5 additions & 2 deletions
@@ -16,6 +16,9 @@ contexttimer
 ipython
 pyyaml>=6
 pandas
-gradio
+gradio>=5.5 # issue with DataFrame scrolling before 5.5
 gitpython # for the reproducibility script
-requests
+requests
+matplotlib
+ray[default]
+python-slugify

src/agentlab/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.2.2"
+__version__ = "0.3.0"

src/agentlab/agents/agent_args.py

Lines changed: 2 additions & 1 deletion
@@ -1,9 +1,10 @@
 from bgym import AbstractAgentArgs
+import bgym
 
 
 class AgentArgs(AbstractAgentArgs):
 
-    def set_benchmark(self, benchmark: str, demo_mode: bool):
+    def set_benchmark(self, benchmark: bgym.Benchmark, demo_mode: bool):
         """Optional method to set benchmark specific flags.
 
         This allows the agent to have minor adjustments based on the benchmark.

src/agentlab/agents/dynamic_prompting.py

Lines changed: 82 additions & 39 deletions
@@ -1,5 +1,4 @@
 import abc
-import difflib
 import logging
 import platform
 import time
@@ -9,12 +8,12 @@
 from typing import Literal
 from warnings import warn
 
+import bgym
 from browsergym.core.action.base import AbstractActionSet
-from browsergym.core.action.highlevel import HighLevelActionSet
-from browsergym.core.action.python import PythonActionSet
 from browsergym.utils.obs import flatten_axtree_to_str, flatten_dom_to_str, overlay_som, prune_html
 
 from agentlab.llm.llm_utils import (
+    BaseMessage,
     ParseError,
     count_tokens,
     extract_code_blocks,
@@ -70,6 +69,7 @@ class ObsFlags(Flags):
 
     use_html: bool = True
     use_ax_tree: bool = False
+    use_tabs: bool = False
     use_focused_element: bool = False
     use_error_logs: bool = False
     use_history: bool = False
@@ -94,13 +94,14 @@ class ObsFlags(Flags):
 
 @dataclass
 class ActionFlags(Flags):
-    multi_actions: bool = False
-    action_set: str = "bid"
-    is_strict: bool = False
-    demo_mode: Literal["off", "default", "all_blue", "only_visible_elements"] = "off"
+    action_set: bgym.HighLevelActionSetArgs = None # should be set by the set_benchmark method
     long_description: bool = True
     individual_examples: bool = False
 
+    # for backward compatibility
+    multi_actions: bool = None
+    is_strict: bool = None
+
 
 class PromptElement:
     """Base class for all prompt elements. Prompt elements can be hidden."""
@@ -121,7 +122,7 @@ def __init__(self, visible: bool = True) -> None:
         self._visible = visible
 
     @property
-    def prompt(self):
+    def prompt(self) -> str | BaseMessage:
         """Avoid overriding this method. Override _prompt instead."""
         if self.is_visible:
             return self._prompt
@@ -252,7 +253,14 @@ def fit_tokens(
         if isinstance(prompt, str):
             prompt_str = prompt
         elif isinstance(prompt, list):
+            # warn deprecated
+            warn(
+                "Using list of prompts is deprecated. Use a Discussion object instead.",
+                DeprecationWarning,
+            )
             prompt_str = "\n".join([p["text"] for p in prompt if p["type"] == "text"])
+        elif isinstance(prompt, BaseMessage):
+            prompt_str = str(prompt)
         else:
             raise ValueError(f"Unrecognized type for prompt: {type(prompt)}")
         n_token = count_tokens(prompt_str, model=model_name)
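
The `fit_tokens` hunk normalizes three prompt shapes to a string before counting tokens: a plain string, a deprecated list of content dicts, and a message object. The self-contained sketch below mirrors that dispatch so the deprecation path can be exercised in isolation. `StubMessage` and `prompt_to_str` are hypothetical names used here in place of AgentLab's `BaseMessage` and the body of `fit_tokens`.

```python
import warnings
from dataclasses import dataclass


@dataclass
class StubMessage:
    """Hypothetical stand-in for a message class; only str() is needed here."""

    text: str

    def __str__(self) -> str:
        return self.text


def prompt_to_str(prompt) -> str:
    """Flatten a prompt to text, mirroring the three branches of the hunk."""
    if isinstance(prompt, str):
        return prompt
    if isinstance(prompt, list):
        # Legacy format: a list of {"type": ..., ...} items. Still accepted, but flagged.
        warnings.warn(
            "Using list of prompts is deprecated. Use a Discussion object instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return "\n".join(p["text"] for p in prompt if p["type"] == "text")
    if isinstance(prompt, StubMessage):
        return str(prompt)
    raise ValueError(f"Unrecognized type for prompt: {type(prompt)}")
```

Note that Python hides `DeprecationWarning` by default outside of `__main__` and test runners, so callers of the legacy list format may only see the warning when warnings are enabled, for example with `python -W default`.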
@@ -357,6 +365,29 @@ def __init__(self, bid, visible: bool = True, prefix="") -> None:
 """
 
 
+class Tabs(PromptElement):
+    def __init__(self, obs, visible: bool = True, prefix="") -> None:
+        super().__init__(visible=visible)
+        self.obs = obs
+        self.prefix = prefix
+
+    @property
+    def _prompt(self) -> str:
+        # by implementing this as a property, it's only coputed if visible
+        prompt_pieces = [f"\n{self.prefix}Currently open tabs:"]
+        for page_index, (page_url, page_title) in enumerate(
+            zip(self.obs["open_pages_urls"], self.obs["open_pages_titles"])
+        ):
+            active_or_not = " (active tab)" if page_index == self.obs["active_page_index"] else ""
+            prompt_piece = f"""\
+Tab {page_index}{active_or_not}:
+Title: {page_title}
+URL: {page_url}
+"""
+            prompt_pieces.append(prompt_piece)
+        return "\n".join(prompt_pieces)
+
+
 class Observation(Shrinkable):
     """Observation of the current step.
 
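
To see what text the new `Tabs` element produces, the formatting logic can be lifted into a standalone function. This is a minimal sketch that assumes the observation is a plain dict with the three keys used in the hunk. `format_open_tabs` is a hypothetical name, and the `Title:`/`URL:` lines are left unindented because the page shown here does not preserve whitespace inside the template.

```python
from typing import Any, Dict


def format_open_tabs(obs: Dict[str, Any], prefix: str = "") -> str:
    """Render the open tabs of an observation as prompt text."""
    pieces = [f"\n{prefix}Currently open tabs:"]
    tabs = zip(obs["open_pages_urls"], obs["open_pages_titles"])
    for index, (url, title) in enumerate(tabs):
        # Only the tab whose index matches the active page gets the marker.
        marker = " (active tab)" if index == obs["active_page_index"] else ""
        pieces.append(f"Tab {index}{marker}:\nTitle: {title}\nURL: {url}\n")
    return "\n".join(pieces)


example_obs = {
    "open_pages_urls": ["https://a.example", "https://b.example"],
    "open_pages_titles": ["Inbox", "Search"],
    "active_page_index": 1,
}
print(format_open_tabs(example_obs, prefix="## "))
```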
@@ -367,6 +398,13 @@ def __init__(self, obs, flags: ObsFlags) -> None:
         super().__init__()
         self.flags = flags
         self.obs = obs
+
+        self.tabs = Tabs(
+            obs,
+            visible=lambda: flags.use_tabs,
+            prefix="## ",
+        )
+
         self.html = HTML(
             obs[flags.html_type],
             visible_elements_only=flags.filter_visible_elements_only,
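
Here `visible` is passed as a lambda rather than a boolean, and `Tabs._prompt` is a property, so the tab text is only built when the element is actually shown. The sketch below illustrates that lazy pattern in isolation. `LazyElement` is a hypothetical class, and returning an empty string for a hidden element is an assumption about `PromptElement`, whose hidden branch is not shown in this diff.

```python
from typing import Callable, Union


class LazyElement:
    """Hypothetical prompt element whose text is built only when visible."""

    def __init__(
        self,
        build: Callable[[], str],
        visible: Union[bool, Callable[[], bool]] = True,
    ) -> None:
        self._build = build
        self._visible = visible

    @property
    def is_visible(self) -> bool:
        # A callable is re-evaluated on every access, so later flag changes are seen.
        return self._visible() if callable(self._visible) else self._visible

    @property
    def prompt(self) -> str:
        # The (possibly expensive) builder runs only on the visible branch.
        return self._build() if self.is_visible else ""


flags = {"use_tabs": False}
element = LazyElement(lambda: "## Currently open tabs:", visible=lambda: flags["use_tabs"])
print(repr(element.prompt))  # hidden: the builder is not called
flags["use_tabs"] = True
print(repr(element.prompt))  # visible: the builder runs now
```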
@@ -400,25 +438,18 @@ def shrink(self):
     def _prompt(self) -> str:
         return f"""
 # Observation of current step:
-{self.html.prompt}{self.ax_tree.prompt}{self.focused_element.prompt}{self.error.prompt}
+{self.tabs.prompt}{self.html.prompt}{self.ax_tree.prompt}{self.focused_element.prompt}{self.error.prompt}
 
 """
 
-    def add_screenshot(self, prompt):
+    def add_screenshot(self, prompt: BaseMessage) -> BaseMessage:
         if self.flags.use_screenshot:
-            if isinstance(prompt, str):
-                prompt = [{"type": "text", "text": prompt}]
             if self.flags.use_som:
                 screenshot = self.obs["screenshot_som"]
             else:
                 screenshot = self.obs["screenshot"]
             img_url = image_to_jpg_base64_url(screenshot)
-            prompt.append(
-                {
-                    "type": "image_url",
-                    "image_url": {"url": img_url, "detail": self.flags.openai_vision_detail},
-                }
-            )
+            prompt.add_image(img_url, detail=self.flags.openai_vision_detail)
         return prompt
 
 
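
The rewritten `add_screenshot` no longer builds the `image_url` dict by hand and instead calls `prompt.add_image(...)`. As a sketch of what such a method could look like, the class below appends the same item shape that the removed lines constructed. `SketchMessage` is a hypothetical stand-in, and the actual `BaseMessage.add_image` in AgentLab may store content differently.

```python
from typing import Any, Dict, List


class SketchMessage:
    """Hypothetical stand-in for a chat message; not AgentLab's BaseMessage."""

    def __init__(self, role: str, text: str) -> None:
        self.role = role
        self.content: List[Dict[str, Any]] = [{"type": "text", "text": text}]

    def add_image(self, url: str, detail: str = "auto") -> None:
        # Same item shape that the removed code appended manually.
        self.content.append({"type": "image_url", "image_url": {"url": url, "detail": detail}})

    def __str__(self) -> str:
        # Text-only view, e.g. for token counting.
        return "\n".join(item["text"] for item in self.content if item["type"] == "text")


def add_screenshot(
    prompt: SketchMessage, img_url: str, use_screenshot: bool, detail: str = "auto"
) -> SketchMessage:
    """Attach the screenshot only when the flag is set, then return the message."""
    if use_screenshot:
        prompt.add_image(img_url, detail=detail)
    return prompt
```

Keeping the dict construction inside the message class means callers such as `add_screenshot` no longer need to special-case string prompts.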

@@ -441,24 +472,36 @@ def __init__(self, visible: bool = True) -> None:
 
 
 class GoalInstructions(PromptElement):
-    def __init__(self, goal, visible: bool = True, extra_instructions=None) -> None:
+    def __init__(self, goal_object, visible: bool = True, extra_instructions=None) -> None:
         super().__init__(visible)
-        self._prompt = f"""\
+        self._prompt = [
+            dict(
+                type="text",
+                text=f"""\
 # Instructions
 Review the current state of the page and all other information to find the best
 possible next action to accomplish your goal. Your answer will be interpreted
 and executed by a program, make sure to follow the formatting instructions.
 
 ## Goal:
-{goal}
-"""
+""",
+            )
+        ]
+
+        self._prompt += goal_object
+
         if extra_instructions:
-            self._prompt += f"""
+            self._prompt += [
+                dict(
+                    type="text",
+                    text=f"""
 
 ## Extra instructions:
 
 {extra_instructions}
-"""
+""",
+                )
+            ]
 
 
 class ChatInstructions(PromptElement):
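
With this change the goal prompt becomes a list of content items instead of a single string, which lets a goal carry images as well as text. A minimal sketch of the assembly follows, assuming `goal_object` is a sequence of dicts in the same `{"type": ...}` format. `build_goal_prompt` is a hypothetical helper, and the header text is shortened here.

```python
from typing import Any, Dict, List, Optional, Sequence

ContentItem = Dict[str, Any]


def build_goal_prompt(
    goal_object: Sequence[ContentItem],
    extra_instructions: Optional[str] = None,
) -> List[ContentItem]:
    """Concatenate a text header, the goal items, and optional extra instructions."""
    prompt: List[ContentItem] = [{"type": "text", "text": "# Instructions\n\n## Goal:\n"}]
    # Copy the goal items so the caller's sequence is never mutated.
    prompt += list(goal_object)
    if extra_instructions:
        prompt.append(
            {"type": "text", "text": f"\n## Extra instructions:\n\n{extra_instructions}\n"}
        )
    return prompt


goal = [
    {"type": "text", "text": "Find the cheapest flight."},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
]
for item in build_goal_prompt(goal, extra_instructions="Answer in English."):
    print(item["type"])
```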
@@ -592,24 +635,24 @@ def _parse_answer(self, text_answer):
         return ans_dict
 
 
-def make_action_set(action_flags: ActionFlags) -> AbstractActionSet:
+# def make_action_set(action_flags: ActionFlags) -> AbstractActionSet:
 
-    if action_flags.action_set == "python":
-        action_set = PythonActionSet(strict=action_flags.is_strict)
-        if action_flags.demo_mode != "off":
-            warn(
-                f'Action_set "python" is incompatible with demo_mode={repr(action_flags.demo_mode)}.'
-            )
-        return action_set
+# if action_flags.action_set == "python":
+# action_set = PythonActionSet(strict=action_flags.is_strict)
+# if action_flags.demo_mode != "off":
+# warn(
+# f'Action_set "python" is incompatible with demo_mode={repr(action_flags.demo_mode)}.'
+# )
+# return action_set
 
-    action_set = HighLevelActionSet(
-        subsets=list(set(["chat"] + ["infeas"] + action_flags.action_set.split("+"))),
-        multiaction=action_flags.multi_actions,
-        strict=action_flags.is_strict,
-        demo_mode=action_flags.demo_mode,
-    )
+# action_set = HighLevelActionSet(
+# subsets=list(set(["chat"] + ["infeas"] + action_flags.action_set.split("+"))),
+# multiaction=action_flags.multi_actions,
+# strict=action_flags.is_strict,
+# demo_mode=action_flags.demo_mode,
+# )
 
-    return action_set
+# return action_set
 
 
 class Think(PromptElement):
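
For anyone migrating old configs, it may help to recall how the now commented-out `make_action_set` expanded the legacy string form of `action_set` (for example `"bid+coord"`) into subsets. The sketch below reproduces only that expression. `legacy_subsets` is a hypothetical name, and the result is sorted for determinism, whereas the removed code used an unordered `list(set(...))`.

```python
from typing import List


def legacy_subsets(action_set: str) -> List[str]:
    """Expand a "+"-separated action-set string, always adding "chat" and "infeas"."""
    return sorted(set(["chat", "infeas"] + action_set.split("+")))


print(legacy_subsets("bid"))
print(legacy_subsets("bid+coord"))
```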
