251 changes: 251 additions & 0 deletions projects/sweet_rl/README.md
# SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

Official implementation for Collaborative Agent Bench and SWEET-RL.

<p align="center">
| <a href="xx"><b>Paper</b></a> | <a href="https://huggingface.co/datasets/facebook/collaborative_agent_bench"><b>Data</b></a> |
</p>

---

[Yifei Zhou](https://yifeizhou02.github.io/), [Song Jiang](https://songjiang0909.github.io/), [Yuandong Tian](https://yuandong-tian.com/), [Jason Weston](https://ai.meta.com/people/1163645124801199/jason-weston/), [Sergey Levine](https://people.eecs.berkeley.edu/~svlevine/), [Sainbayar Sukhbaatar*](https://tesatory.github.io/), [Xian Li*](https://ai.meta.com/people/1804676186610787/xian-li/)
<br>
UC Berkeley, FAIR
<br>
*Equal advising, alphabetical order
![paper_teaser](paper_teaser.png)

## Collaborative Agent Bench
### Quick Start
To set up the environment for Collaborative Agent Bench, run:
```bash
pip install -e .
git clone https://github.com/YifeiZhou02/collab_openrlhf
cd collab_openrlhf
pip install -e .
```
This sets up the environment for Backend Programming; it uses a custom fork of OpenRLHF to support multi-turn DPO and length normalization.
Optionally, if you also wish to run Frontend Design, you need to install GeckoDriver and Firefox on your system (e.g. from https://www.mozilla.org/en-US/firefox/all/desktop-release/ and with the commands below).
```bash
wget https://github.com/mozilla/geckodriver/releases/download/v0.35.0/geckodriver-v0.35.0-linux64.tar.gz
tar -xvzf geckodriver-v0.35.0-linux64.tar.gz
sudo mv geckodriver /usr/local/bin/
```
To verify installation, run:
```bash
geckodriver --version
```


Note that you can install Firefox and GeckoDriver without sudo access by adding the paths to their binaries to the `$PATH` variable on your system.
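If you install them without sudo, a quick way to check that both binaries are visible is `shutil.which`, which resolves names against `$PATH` the same way the shell does (a minimal sketch; the tool names are just the two binaries mentioned above):

```python
import shutil

def check_tools(tools=("geckodriver", "firefox")):
    """Map each required tool name to its resolved path on $PATH (None if missing)."""
    return {tool: shutil.which(tool) for tool in tools}

missing = [name for name, path in check_tools().items() if path is None]
if missing:
    print("Not found on $PATH: " + ", ".join(missing))
```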

To download data, run:
```bash
huggingface-cli download facebook/collaborative_agent_bench colbench_code.jsonl colbench_code_offline_15k_llama8b.jsonl
```

### Testing Your Model on CollaborativeAgentBench
#### Backend Programming

For testing on Backend Programming, you first need to set up a vLLM server to simulate the human collaborator. To do that, simply run:
```bash
python -m vllm.entrypoints.openai.api_server --model /path/to/llama3.1-70b-instruct --max-model-len 16384 --tensor-parallel-size 8 --gpu-memory-utilization=0.85 --max-num-seqs 16 --port 8000 --enforce-eager --trust-remote-code
```
Feel free to use llama3.1-8b-instruct as the simulator for the human collaborator to reduce GPU memory usage, but the results may differ from those reported in the paper.

After setting up the vLLM server for the human collaborator, you can now test your model. For coding, run:
```bash
python scripts/simulate_interactions.py --agent_model /path/to/Llama-3.1-8B-Instruct \
--hostname xxx or localhost \
--task_type code \
--num_tasks 1000 \
--input_path /path/to/backend_tasks/test.jsonl \
--output_path /path/for/output/temp_test.jsonl \
--env_model /path/to/llama3.1-70b-instruct
python scripts/evaluate_code.py /path/for/output/temp_test.jsonl
```
The success rate and the percentage of tests passed will be printed at the end. Note that LLM-generated code sometimes contains print statements, so parts of the output may be flooded with those messages.
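If the flooding becomes a problem in your own analysis, one generic workaround (not part of the provided scripts, and not a substitute for sandboxing untrusted code) is to capture stdout while executing the generated function:

```python
import contextlib
import io

def run_silently(code_str, func_name, *args):
    """Exec generated code with stdout captured, then call func_name(*args)."""
    namespace = {}
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code_str, namespace)             # top-level prints go to buf
        result = namespace[func_name](*args)  # prints inside the call, too
    return result, buf.getvalue()

code = "print('noise')\ndef add(a, b):\n    print('more noise')\n    return a + b"
result, captured = run_silently(code, "add", 2, 3)
# result is 5; all print output lands in `captured` instead of the console
```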
<br>
We also offer a script to visualize the trajectories:
```bash
python visualizers/visualize_dialogue_histories.py /path/for/output/temp_test.jsonl
```
#### Frontend Design
You can run the following script to download data from WebSight:
```python
from sweet_rl.utils.webpage_utils import replace_urls, render_full_html
import json
from tqdm import tqdm

from datasets import load_dataset

train_tasks_path = "/your/data/path/frontend_tasks/train.jsonl"
test_tasks_path = "/your/data/path/frontend_tasks/test.jsonl"

ds = load_dataset("HuggingFaceM4/WebSight", "v0.2")["train"]

# Keep the first 20k examples: task description plus ground-truth HTML
filtered_data = []
for i in tqdm(range(20000)):
    filtered_data.append({
        "problem_description": ds[i]["llm_generated_idea"],
        "ground_truth": replace_urls(ds[i]["text"]),
    })

# First 10k tasks for training, the rest for testing
with open(train_tasks_path, "w") as f:
    for d in filtered_data[:10000]:
        f.write(json.dumps(d) + "\n")

with open(test_tasks_path, "w") as f:
    for d in filtered_data[10000:]:
        f.write(json.dumps(d) + "\n")
```

For testing on Frontend Design, you first need to set up a vLLM server to simulate the human collaborator. To do that, simply run:
```bash
python -m vllm.entrypoints.openai.api_server --model /path/to/Qwen2-VL-72B-Instruct --max-model-len 16384 --tensor-parallel-size 8 --gpu-memory-utilization=0.85 --max-num-seqs 16 --port 8000 --enforce-eager --limit-mm-per-prompt image=2 --trust-remote-code
```
Feel free to use Qwen2-VL-7B-Instruct as the simulator for the human collaborator to reduce GPU memory usage, but the results may differ from those reported in the paper.


After setting up the vLLM server for the human collaborator, you can test your model on Frontend Design:
```bash
python scripts/simulate_interactions.py --agent_model /path/to/Llama-3.1-8B-Instruct \
--task_type html \
--num_tasks (100 for fast tests, 500 for paper results) \
--hostname xxx or localhost \
--output_path /path/for/output/temp_test_html.jsonl \
--input_path /path/to/webpage_tasks_all.jsonl \
--env_model /path/to/Qwen2-VL-72B-Instruct
python scripts/evaluate_html.py /path/for/output/temp_test_html.jsonl
```

The average cosine similarity will be printed at the end. We also offer a script to visualize the trajectories:
```bash
python visualizers/visualize_design_dialogue_histories.py /path/for/output/temp_test_html.jsonl
```
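The reported number is a cosine similarity between vector representations of the agent's final page and the reference page, averaged over tasks; the similarity function itself reduces to (sketched on plain Python lists):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors score ~1.0
```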

## SWEET-RL (**S**tep-**W**is**E** **E**valuation w/ Training-time information)
Now we provide an example script for running SWEET-RL on Backend Programming. This part assumes that you have set up the environment for Backend Programming.
First, set up the paths for loading data and saving intermediate results.
```bash
DATA_PATH=/your/data/path/colbench_code_offline_15k_llama8b.jsonl

OUTPUT_DIR=/your/output/dir
CHECKPOINT_DIR=/your/checkpoint/dir
```
The intermediate data and checkpoints will be saved to:
```bash
GROUND_TRUTH_PREFERENCES_PATH=$OUTPUT_DIR/temp_ground_truth_preferences.jsonl
REWARD_PATH=$CHECKPOINT_DIR/temp_rm
SAMPLED_PATH=$OUTPUT_DIR/temp_sampled.jsonl
RANKED_PATH=$OUTPUT_DIR/temp_ranked.jsonl
RANDOM_PAIRS_PATH=$OUTPUT_DIR/temp_random_pairs.jsonl
SAVE_PATH=$CHECKPOINT_DIR/temp_dpo
EVALUATION_PATH=$OUTPUT_DIR/temp_evaluation.jsonl
```
We will first train a step-level reward model:
```bash
# first train the step-level reward model with additional training-time information
python scripts/evaluate_code.py $DATA_PATH --k 3 --ground_truth_preference_path $GROUND_TRUTH_PREFERENCES_PATH

deepspeed --module openrlhf.cli.train_dpo \
--save_path $REWARD_PATH \
--save_steps -1 \
--logging_steps 1 \
--eval_steps -1 \
--train_batch_size 8 \
--micro_train_batch_size 1 \
--pretrain /PATH/TO/8BLLAMA \
--bf16 \
--max_epochs 4 \
--max_len 8192 \
--zero_stage 3 \
--learning_rate 2e-7 \
--beta 0.1 \
--dataset $GROUND_TRUTH_PREFERENCES_PATH \
--chosen_key chosen \
--rejected_key rejected \
--flash_attn \
--gradient_checkpointing \
--use_wandb WANDB_KEY \
--response_template "<|start_header_id|>assistant<|end_header_id|>" \
--wandb_run_name sweet_code_rm \
--mean_log_prob
```
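The `--mean_log_prob` flag provides the length normalization mentioned above: per-token log-probabilities are averaged rather than summed before entering the DPO objective. A minimal sketch of the resulting loss (illustrative only; the fork's exact implementation may differ):

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
             beta=0.1, mean_log_prob=True):
    """DPO loss from per-token log-probs of the chosen/rejected responses.

    With mean_log_prob=True, token log-probs are averaged (length
    normalization); otherwise they are summed as in standard DPO.
    """
    agg = (lambda lp: sum(lp) / len(lp)) if mean_log_prob else sum
    margin = ((agg(pol_chosen) - agg(ref_chosen))
              - (agg(pol_rejected) - agg(ref_rejected)))
    return math.log(1.0 + math.exp(-beta * margin))  # -log(sigmoid(beta * margin))
```

With averaging, a long response with mediocre per-token log-probs is no longer rewarded or penalized merely for its length.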
After that, we can use this step-level reward model to generate step-level preference pairs:
```bash
# These commands generate preference pairs using the step-level reward model
python scripts/sample_best_of_n.py $DATA_PATH $SAMPLED_PATH --data_fraction 0.1


python scripts/rank_best_of_n.py --model_id $REWARD_PATH \
--input_path $SAMPLED_PATH \
--output_path $RANKED_PATH


python scripts/generate_random_pairs_from_ranks.py $RANKED_PATH $RANDOM_PAIRS_PATH --no_prompt --num_pairs 4
```
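The exact sampling scheme inside `generate_random_pairs_from_ranks.py` may differ, but the idea can be sketched as follows: draw random pairs from the ranked list and label the better-ranked response as chosen (the function name and signature here are illustrative):

```python
import random

def pairs_from_ranks(ranked, num_pairs=4, seed=0):
    """ranked: responses ordered best-first. Returns (chosen, rejected) pairs."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < num_pairs:
        i, j = sorted(rng.sample(range(len(ranked)), 2))
        pairs.append((ranked[i], ranked[j]))  # lower index = better rank = chosen
    return pairs

pairs = pairs_from_ranks(["best", "good", "ok", "bad"], num_pairs=2)
# each pair is (better-ranked, worse-ranked)
```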
Finally we can train the model and perform evaluations:
```bash
# Train the model with step-level preference pairs
deepspeed --module openrlhf.cli.train_dpo \
--save_path $SAVE_PATH \
--save_steps -1 \
--logging_steps 1 \
--eval_steps -1 \
--train_batch_size 8 \
--micro_train_batch_size 1 \
--pretrain /PATH/TO/Meta-Llama-3.1-8B-Instruct \
--bf16 \
--max_epochs 1 \
--max_len 16384 \
--zero_stage 3 \
--learning_rate 2e-7 \
--beta 0.1 \
--dataset $RANDOM_PAIRS_PATH \
--chosen_key chosen \
--rejected_key rejected \
--flash_attn \
--gradient_checkpointing \
--nll_loss_coef 0.01 \
--use_wandb WANDB_KEY \
--wandb_run_name sweet_code_8b



# carry out evaluations
python scripts/simulate_interactions.py --agent_model $SAVE_PATH \
--hostname host-of-human-simulator \
--input_path /path/to/backend_tasks/test.jsonl \
--task_type code \
--num_tasks 1000 --output_path $EVALUATION_PATH

python scripts/evaluate_code.py $EVALUATION_PATH
```
You should see results similar to those reported in the paper, with a success rate of around 40%.

### Data on Frontend Design
We provide a command with which you can generate the offline data for Frontend Design yourself:
```bash
python scripts/simulate_interactions.py --agent_model /path/to/Llama-3.1-8B-Instruct \
--task_type html \
--num_tasks 1000 \
--best_of_n 6 \
--train \
--hostname xxx or localhost \
--output_path /path/for/output/temp_test_html.jsonl \
--input_path /your/data/path/frontend_tasks/train.jsonl \
--env_model /path/to/Qwen2-VL-72B-Instruct \
--to_continue
```


## Citation
If you find our benchmark or algorithm useful, please consider citing:






Binary file added projects/sweet_rl/paper_teaser.pdf
Binary file added projects/sweet_rl/paper_teaser.png
20 changes: 20 additions & 0 deletions projects/sweet_rl/prompts/generate_code_example.txt
You are a helpful agent. You will be given a piece of text scraped from the Internet.
You are going to help me make some synthetic data inspired from this piece of text to train a collaborative LLM Agent.
Your task is to synthesize a highly personalized and non-generic Python function task that the LLM agent to be trained must answer by interacting with a human user.
The dialogue starts with a high-level and vague problem description that the human user proposes to the agent.
In order to solve the task, the LLM agent needs to interact with the human to get clarifications so that its final answers can pass some hidden test cases.

The synthetic dialogue setting will need to have the following important components.
1) Ground Truth Answer: This is the goal that the human wants the agent to arrive at; the agent should answer the human according to this goal. The ground truth will be a Python function.
2) Problem high-level description: This is the initial problem description that the human will pose to the agent. Note that this description is likely high-level and ambiguous.
The agent needs to collaborate and interact with the human user to resolve the ambiguity to arrive at the final answer.
3) Test Cases: Some example function calls; these test cases will be executed to compare the outputs of the agent's answer against the ground-truth Python function. You should have 10 test cases in total.

You should format your response in json. It is important that you ONLY OUTPUT THIS JSON in your answer and nothing else:
{
"thought": "provide a thought on how you will come up with the synthetic dialogue as inspired from the web data",
"ground_truth": "directly output the python function in plain text, do not say anything else, e.g. def get_employee_performance(employee_monthly_stats, employee_names) xxx",
"problem_description": "a high-level and ambiguous request that the human proposes initially to the agent, explicitly mention that you want the agent to write a python function",
"test_cases": "directly output your test function calls in json format: e.g. {"test1": "get_employee_performance([xxx], [xxx])", xxx}"
}
16 changes: 16 additions & 0 deletions projects/sweet_rl/prompts/human_simulator_code_prompt.txt
Your task is to simulate a human user that interacts with an LLM agent in a dialogue.
You would like the LLM agent to help you with the following problem:
{problem_description}

Your goal is to engage in the conversation with the LLM agent so that it can get to a personalized answer.
You should make use of the following hidden information to answer the LLM agent.
YOU SHOULD BEHAVE LIKE A HUMAN THAT NEEDS THE HELP FROM AN AGENT.
You SHOULD ONLY ANSWER QUESTIONS WITH INFORMATION PROVIDED IN THE HIDDEN INFORMATION, AND SAY YOU DON'T KNOW IF THE ANSWER CANNOT BE FOUND IN THE HIDDEN INFORMATION.

{hidden_information}

Here is the dialogue so far:
{dialogue_history}


Now directly output your answer to the LLM agent IN TWO SENTENCES. DO NOT SAY ANYTHING ELSE.
9 changes: 9 additions & 0 deletions projects/sweet_rl/prompts/human_simulator_html_prompt.txt
Your task is to simulate a human user that interacts with an LLM agent in a dialogue.
Your goal is to engage in the conversation with the LLM agent so that it can get to a personalized answer.
YOU SHOULD BEHAVE LIKE A HUMAN THAT NEEDS THE HELP FROM AN AGENT.
The ultimate goal is to have the agent construct the EXACT DESIGN that you have in mind.
You will be given an image made by the agent and a ground-truth image that the human user wants.
Briefly describe how the image made by the agent mainly differs from the image that the human user wants.
You should PRIORITIZE THE MOST OUTSTANDING DIFFERENCES. DESCRIBE CONCRETELY HOW EACH COMPONENT IS DIFFERENT (e.g. the image has a larger size, the text alignment should be centered, etc.)
1) The first image will be the agent provided image.
2) The second image will be the image that the human user wants
16 changes: 16 additions & 0 deletions projects/sweet_rl/prompts/llm_agent_code_prompt.txt
You are a helpful LLM agent.
Your task is to help a human user resolve their problem, in particular Python programming.
1) Note that the problem is highly personalized so you need to explicitly gather information
by asking questions to the human user about some hidden information and implicit constraints.
YOU SHOULD TRY TO ASK CLARIFICATION QUESTIONS.
2) Note that you should not ask human users complicated questions as they will only answer questions briefly in two sentences.
3) When you have gathered enough information to answer, say "I WANT TO ANSWER:" in the beginning of your response and provide your final answer.
4) Note that you can only interact with the human users WITHIN 10 back-and-forth rounds and you have to provide your final answer before the conversation ends.
5) You should be as concise as possible in your response to human.


"I WANT TO ANSWER:" should be included in your response to human if you think that you have gathered enough information for addressing this problem.
Directly output the raw python code after "I WANT TO ANSWER:".

Complete only the immediate agent response in this dialogue:
{dialogue_history}
20 changes: 20 additions & 0 deletions projects/sweet_rl/prompts/llm_agent_html_prompt.txt
You are a helpful LLM agent.
Your task is to help a human user to code a complete website with a good design in HTML and Tailwind CSS.
Write the code inside a tag <html>.
Write real and long sentences about the business.
You don’t have to include images, but if you do, use only this source
https://picsum.photos/id/48/W/H, by replacing W and H with the width and height of the image.
Keep the id the same to only use id 48 image.

1) Note that the problem is highly personalized so you need to go through a few rounds of revisions.
2) When you have gathered enough information to answer, say "I WANT TO ANSWER:" in the beginning of your response and provide your final answer.
3) Note that you can only interact with the human users WITHIN 10 back-and-forth rounds and you have to provide your final answer before the conversation ends.
4) You will be judged both by the quality of the final answer and the efficiency of the conversation.
5) You can include ONLY ONE snippet of raw HTML and Tailwind CSS code (wrapped in an <html> tag) in your response to the human user, to ask how the proposed design differs from what the human user wants.
This snippet of raw HTML and Tailwind CSS code (WRAPPED IN AN <html> TAG) will be rendered so that the human sees a screenshot of the webpage.
The human user will respond by comparing your rendered webpage with the webpage that the human user has in mind.
6) You need to make sure that your html webpage looks exactly as the human user wants, including the overall layout, navigation bars, background color etc.
7) The human user can only see your rendered image and will provide suggestions based on it; they cannot answer text questions.

First output your thoughts on your remaining uncertainties about the problem and the user's design preferences.
Then say "OUTPUT:\n" followed by your proposed html.
4 changes: 4 additions & 0 deletions projects/sweet_rl/requirements.txt
fire
selenium
openai
gradio