diff --git a/.gitignore b/.gitignore index d03d3a7e..95f07a95 100644 --- a/.gitignore +++ b/.gitignore @@ -210,4 +210,7 @@ marimo/_static/ marimo/_lsp/ __marimo__/ -logs/ \ No newline at end of file +logs/ + +data/* +!data/README.md \ No newline at end of file diff --git a/README.md b/README.md index 617acc16..437aa35d 100644 --- a/README.md +++ b/README.md @@ -1,174 +1,110 @@ -

-MiroFlow: A Consistent Agent Framework with Reproducible Performance -

+
+ MiroFlow +
+
-

-HuggingFace -X -小红书 -Discord -WeChat -DeepWiki - -miromind.ai +

-

+[![MODELS](https://img.shields.io/badge/MiroThinker_Models-5EDDD2?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor)](https://huggingface.co/collections/miromind-ai/mirothinker-v01-689301b6d0563321862d44a1) +[![DATA](https://img.shields.io/badge/MiroVerse_Data-0040A1?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor)](https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1) +[![WEBSITE](https://img.shields.io/badge/MiroMind_Website-4285F4?style=for-the-badge&logo=google-chrome&logoColor=white)](https://miromind.ai/) +[![DISCORD](https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.com/invite/GPqEnkzQZd) +[![WeChat](https://img.shields.io/badge/WeChat-07C160?style=for-the-badge&logo=wechat&logoColor=white)](https://cdn-uploads.huggingface.co/production/uploads/68525b342230a897a65cc1c0/SGK70isvVpeJwk_fny9sb.png) +[![RedNote](https://img.shields.io/badge/RedNote-FF2442?style=for-the-badge&logo=revoltdotchat&logoColor=white)](https://www.xiaohongshu.com/user/profile/663098830000000003033edc) +[![DeepWiki](https://img.shields.io/badge/DeepWiki-grey?style=for-the-badge&logo=deepwiki&logoColor=white)](https://deepwiki.com/MiroMindAI/MiroFlow) +# 🚀[Please try our Demo!](https://dr.miromind.ai/)🚀 -

-Try our demo with MiroThinker here! -

+
-## 📚 Table of Contents +# MiroFlow: A Leading Open-Source Deep Research Project -- [🎯 Overview](#-overview) -- [✨ MiroFlow SOTA Performance](#-miroflow-sota-performance) -- [🤖 MiroFlow: Modular AI Agent Framework](#-miroflow-modular-ai-agent-framework) - - [Workflow Overview](#workflow-overview) - - [Architecture Components](#architecture-components) - - [Core System 💻](#core-system-) - - [Tool Integration 🔧](#tool-integration-) - - [Agent System 👷](#agent-system-) - - [Support Systems ⚙️](#support-systems-️) -- [🚀 Getting Started](#-getting-started) - - [Prerequisites](#prerequisites) - - [Runing a single task](#runing-a-single-task) - - [Evaluate on Benchmark](#evaluate-on-benchmark) - - [[Optional] Customized Configuration](#optional-customized-configuration) -- [🌟 MiroThinker](#-mirothinker) -- [❓ FAQ](#-faq) -- [🎉 Join Our Communities!](#-join-our-communities) - -# 🎯 Overview - -MiroFlow Logo - -**MiroFlow** is a **battle-tested** agent framework that reliably completes complex tool-use tasks. We have extensively used it to generate high-quality, post-training agent trace data for **[MiroThinker](https://huggingface.co/collections/miromind-ai/mirothinker-v01-689301b6d0563321862d44a1)**, our suite of open-source agentic models. Some key features are: - -- 🌟 **Reproducible SOTA**: **MiroFlow** consistently achieves 72.2% (pass@1 average@3) on GAIA validation set. Follow our [getting-started guide](#get-start) below, or view our many runs of gaia trace on huggingfaces. If you can't reproduce our result, please open a Github issue - We take reproducibility seriously. -- 🌟 **High Concurrency and Fault Tolerance**: **MiroFlow** scales data collection efficiently and handles rate-limited APIs and unstable network connections with ease. -- 🌟 **Baked-in observability and evaluation**: **MiroFlow** ships with scripts for benchmarking agents and a straightforward web-ui for visualizing and debugging agent trace data. 
- -# ✨ MiroFlow SOTA Performance +MiroFlow Logo -MiroFlow, equipped with Claude Sonnet 3.7 as its primary LLM, **achieved 81.8% pass@3, 82.4% maj. vote, 74.5% pass@1 (best@3), and 72.2% pass@1 (avg@3) on the GAIA validation set**. This represents **state-of-the-art (SOTA) performance** among open-source agent frameworks. -![GAIA Validation Performance](./docs/figs/gaia_score.png) - -> [!NOTE] -> Our pass@1 scores are reported as both the average across three runs (avg@3) and the best score among those runs (best@3). For most other reported pass@1 results, it is unclear whether they represent an average or a best score across multiple trials (indicated with *). - -To prevent agents from retrieving answers directly from Hugging Face, we disabled access to it during the inference and trace collection. - -*We have evaluated multiple agent frameworks on GAIA. Please note that some reported results may be overstated or lack clear definitions, and are not reproducible.* -In contrast, reproducing MiroFlow's results is straightforward with just a few required API keys. - -# 🤖 MiroFlow: Modular AI Agent Framework +- [📰 News & Updates](#-news--updates) +- [📝 Introduction](#-introduction) +- [✨ Performance on Benchmarks](#-performance-on-benchmarks) +- [🚀 Getting Started](#-getting-started) +- [🌟 MiroThinker](docs/mirothinker.md) +- [📄 License & Support](#-license--support) -MiroFlow is a sophisticated, modular framework for building intelligent AI agents with multi-turn conversation capabilities, comprehensive tool integration, and hierarchical sub-agent support. -![MiroFlow Architecture](./docs/figs/miroflow_architecture.png) +## 📰 News & Updates -## Workflow Overview +- **2025-08-27**: 🎉 **MiroFlow v0.2** - Achieves SOTA performance across [multiple agentic benchmarks](https://miromind.ai/blog/miroflow). Highlights include **HLE 27.2%**, **HLE-Text-Only 29.5%**, **BrowserComp-EN 33.2%**, **BrowserComp-ZH 47.1%**, and **xBench-DeepSearch 72.0%**. 
+- **2025-08-26**: 🎉 [GAIA Validation Trace](apps/public-trace/gaia-validation) released (73.94% with pass@1) and [Gradio Demo](https://github.com/MiroMindAI/MiroThinker/tree/main/apps/gradio-demo) released for local deployment. +- **2025-08-08**: 🎉 **MiroFlow v0.1** - Framework, model, and data are now fully open-sourced! -MiroFlow handles user queries through a multi-stage and agentic process designed for flexibility and depth. The workflow is organized as follows: -1. **Intent Recognition & Query Augmentation** - LLMs analyze user input to detect intent and refine the query. +## 📝 Introduction -2. **Planning & Task Orchestration** - The main agent drafts an execution plan, invokes tools, and coordinates sub-agents. +**MiroFlow** is a fully open-sourced agent framework designed to reliably complete complex tool-use tasks. Our comprehensive ecosystem includes the following key components: -3. **Delegation to Sub-Agents** - Specialized agents (e.g., agent-browsing) handle complex or domain-specific tasks. Sub-agents independently plan, act, and execute tool calls as needed. +- 🌟 **Reproducible SOTA Performance**: MiroFlow consistently achieves 72.2% (pass@1 average@3) on the GAIA benchmark. Follow our detailed guide to reproduce our released GAIA traces and verify results. +- 🌟 **Advanced Data Collection**: Our framework features sophisticated data collection capabilities that generate high-quality, post-training agent trace data. We've open-sourced extensive datasets through [MiroVerse](https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1). +- 🌟 **Open Source Models**: We provide fully open-sourced models that you can deploy locally and fine-tune for your specific needs. Explore our model collection at [MiroThinker](https://huggingface.co/collections/miromind-ai/mirothinker-v01-689301b6d0563321862d44a1). 
+- 🌟 **Comprehensive Training Framework**: We've open-sourced our complete SFT and DPO training recipes, available at [MiroTrain](https://github.com/MiroMindAI/MiroTrain). +- 🌟 **Reinforcement Learning Framework**: Our RL training exploration and methodologies are fully available through [MiroRL](https://github.com/MiroMindAI/MiroRL). -4. **Tool Access via MCP Servers** - When external capabilities are required, agents leverage specialized tools by connecting to MCP (Model Context Protocol) servers. -5. **Result Synthesis & Output Alignment** - After task completion, a dedicated summary process synthesizes results, ensuring the output is high-quality and aligned with user instructions (or benchmark formats). -## Architecture Components +## ✨ Performance on Benchmarks -All core components are located in the `MiroFlow/libs/` directory. +
+ Comprehensive Benchmark Performance Comparison +
-``` -MiroFlow/libs/ -├── miroflow/ -│ └── src/miroflow/ -│ ├── prebuilt/ -│ │ ├── pipeline.py # Pipeline: coordinates task execution -│ │ ├── orchestrator.py # Orchestrator: manages LLM ↔ tool flow -│ │ └── config/ # Hydra configs for agents, LLMs, pricing -│ ├── llm/ -│ │ └── client.py # Unified LLM client -│ ├── utils/ -│ │ ├── io_utils.py # Output formatting utilities -│ │ ├── prompt_utils.py # Prompt definitions for agents -│ │ └── tool_utils.py # Tool configuration helpers -│ └── logging/ # Task logging & metrics -│ -├── miroflow-tool/ -│ └── src/miroflow/tool/ -│ ├── manager.py # Tool Manager: MCP server connector -│ └── mcp_servers/ # Individual MCP tool servers -│ ├── python_server.py # Code execution -│ ├── vision_mcp_server.py # Visual perception -│ ├── searching_mcp_server.py # Web search & retrieval -│ ├── audio_mcp_server.py # Audio transcription -│ ├── reasoning_mcp_server.py # Enhanced reasoning -│ └── reading_mcp_server.py # Document processing -``` +We benchmark MiroFlow on a series of benchmarks including **GAIA**, **HLE**, **BrowseComp** and **xBench-DeepSearch**. Meantime, we are working on more benchmarks. 
-![Core Component Architecture](docs/figs/core_component_architecture.png) +| Model/Framework | GAIA Val | HLE | HLE-Text | BrowserComp-EN | BrowserComp-ZH | xBench-DeepSearch | +|----------------|----------|-----|----------|----------------|----------------|-------------------| +| **MiroFlow** | **82.4%** | 27.2% | **29.5%** | 33.2% | **47.1%** | **72.0%** | +| OpenAI Deep Research | 67.4% | 26.6% | - | **51.5%** | 42.9% | - | +| Gemini Deep Research | - | 26.9% | - | - | - | 50+% | +| Kimi Researcher | - | - | 26.9% | - | - | 69.0% | +| WebSailor-72B | 55.4% | - | - | - | 30.1% | 55.0% | +| Manus | 73.3% | - | - | - | - | - | +| DeepSeek v3.1 | - | **29.8%** | - | - | - | 71.2% | -### Core System 💻 -- **Pipeline** (`./miroflow/src/miroflow/prebuilt/pipeline.py`): Main entry point that creates and manages all components, handles error recovery, and returns final results -- **Orchestrator** (`./miroflow/src/miroflow/prebuilt/orchestrator.py`): Manages multi-turn conversations, parses tool calls, executes tools, and delegates to sub-agents +### GAIA-Validation -- **LLM Client** (`./miroflow/src/miroflow/llm/client.py`): Unified interface supporting Anthropic, OpenAI, Google, Qwen, DeepSeek, and local deployments +GAIA Validation Performance -### Tool Integration 🔧 +MiroFlow **achieved 81.8% pass@3, 82.4% maj. vote, 74.5% pass@1 (best@3), and 72.2% pass@1 (avg@3) on the GAIA validation set**. This represents **state-of-the-art (SOTA) performance** among open-source agent frameworks. -- **Tool Manager** (`./miroflow-tool/src/miroflow/tool/manager.py`) : Comprehensive MCP server connection manager with tool discovery, persistent connections, and error handling - -- **MCP Servers** (`./miroflow-tool/src/miroflow/tool/mcp_servers/`) : Individual tool implementations built on FastMCP. 
Provides extensive capabilities including: - - Code execution and analysis (`./python_server.py`) - - Visual perception (`./vision_mcp_server.py`) - - Web search and content retrieval (`./searching_mcp_server.py`) - - Audio transcription (`./audio_mcp_server.py`) - - Enhanced reasoning capabilities (`./reasoning_mcp_server.py`) - - Document processing and analysis (`./reading_mcp_server.py`) +> [!NOTE] +> Our pass@1 scores are reported as both the average across three runs (avg@3) and the best score among those runs (best@3). For most other reported pass@1 results, it is unclear whether they represent an average or a best score across multiple trials (indicated with *). -### Agent System 👷 +To prevent agents from retrieving answers directly from Hugging Face, we disabled access to it during the inference and trace collection. -**Sub-Agents** -Specialized agents designed for specific domains (e.g., `agent-browsing` for web navigation). Each sub-agent maintains dedicated tool sets and custom prompts, allowing the main agent to delegate tasks requiring specialized expertise. Agent definitions are managed through configuration files with prompts and descriptions customized in `./miroflow/src/miroflow/utils/prompt_utils.py` and `tool_utils.py`. +*We have evaluated multiple agent frameworks on GAIA. Please note that some reported results may be overstated or lack clear definitions, and are not reproducible.* +In contrast, reproducing MiroFlow's results is straightforward with just a few required API keys. 
-### Support Systems ⚙️ -- **Configuration System** (`./miroflow/src/miroflow/prebuilt/config/`) : Hydra-powered YAML configuration for agents, LLMs, benchmarks, and pricing +# 🤖 MiroFlow: Modular AI Agent Framework -- **Output Formatter** (`./miroflow/src/miroflow/utils/io_utils.py`) : Intelligent response formatting that adapts to various benchmark requirements +MiroFlow is a high-performance, modular framework for building intelligent AI agents that achieve state-of-the-art results on complex benchmarks. It features multi-turn conversation capabilities, comprehensive tool integration, and hierarchical sub-agent support for superior task completion. -- **Task Logger** (`./miroflow/src/miroflow/logging/`) : Comprehensive logging for agent interactions, tool executions, and performance metrics +
+MiroFlow Architecture +
-### Execution Pipeline Data Flow +More information on our agent [workflow](docs/workflow.md). -![Execution Pipeline Data Flow](docs/figs/execution_pipeline.png) # 🚀 Getting Started -## Prerequisites +### Prerequisites > [!TIP] > we recommend using [`uv`](https://docs.astral.sh/uv/) with `python>= 3.12` -**Step 1:** Clone repo and prepare python environment: +### Step 1: Clone repo and prepare python environment ```bash ## clone the repo @@ -179,9 +115,10 @@ cd MiroFlow/apps/run-agent uv sync ``` -**Step 2:** Set up environment dependencies: +### Step 2: Set up environment variables + +#### a. Set up `MiroFlow/apps/prepare-benchmark/.env` -a. Set up `MiroFlow/apps/prepare-benchmark/.env` by: ```bash ## copy environment variable template and prepare yours in .env file cd MiroFlow/apps/prepare-benchmark @@ -189,8 +126,10 @@ cd MiroFlow/apps/prepare-benchmark # Edit .env with your actual API keys cp .env.template .env ``` -Edit `.env` to configure environment variables: -``` + +Edit `.env` to configure environment variables: + +```env # For downloading datasets from Hugging Face HF_TOKEN="" @@ -198,7 +137,8 @@ HF_TOKEN="" DATA_DIR="../../data" # relative to this file ``` -b. Set up `MiroFlow/apps/run-agent/.env` by: +#### b. Set up `MiroFlow/apps/run-agent/.env` + ```bash ## copy environment variable template and prepare yours in .env file cd MiroFlow/apps/run-agent @@ -206,8 +146,10 @@ cd MiroFlow/apps/run-agent # Edit .env with your actual API keys cp .env.template .env ``` -Edit `.env` to configure environment variables: -``` + +Edit `.env` to configure environment variables: + +```env # Using OpenRouter to provide primary agent model OPENROUTER_API_KEY="" OPENROUTER_BASE_URL="https://openrouter.ai/api/v1" @@ -241,51 +183,16 @@ HTTPS_PROXY="" DATA_DIR="../../data" ``` -If you wish to use a different LLM as the primary agent model, you will need to provide the corresponding API keys. 
- - -**Step 3:** Prepare E2B Sandbox (Optional) - -> [!TIP] -> We provide a public E2B sandbox template. Follow this step if you want to reproduce. -> -> For the E2B sandbox service, we recommend setting up a Linux Docker image with a comprehensive set of apt and Python packages pre-installed. Without these pre-installed packages, the agent will need to spend extra steps and context installing them, resulting in reduced token efficiency. -> -> you need to have `npm` install and `docker` running locally. - - -1. Install `e2b` command line and login: - -```shell -## install e2b -npm install -g @e2b/cli -## check that it is available -which e2b -``` - -2. Download our pre-configured Dockerfile: -[e2b.Dockerfile](https://github.com/MiroMindAI/MiroFlow/blob/main/docs/e2b.Dockerfile). - -```shell -wget https://github.com/MiroMindAI/MiroFlow/blob/main/docs/e2b.Dockerfile -``` - -3. Run `e2b template build` command [check official doc here](https://e2b.dev/docs/sdk-reference/cli/v1.0.2/template), use `all_pip_apt_pkg` as the name of template. - -```shell -## build the template with `docker build` locally -E2B_ACCESS_TOKEN=${your-token} -e2b template build -c "/root/.jupyter/start-up.sh" -n "all_pip_apt_pkg" -d ./e2b.Dockerfile -## check that template is built successfully -E2B_ACCESS_TOKEN=${your-token} e2b template list -``` +> [!NOTE] +> If you wish to use a different LLM as the primary agent model, you will need to provide the corresponding API keys. -For additional information, please see the [E2B Docker documentation](https://e2b.dev/docs/sandbox-template). +### Step 3: Local E2B Sandbox Deployment +To achieve our best benchmark results, we recommend using a pre-defined sandbox template that includes the most commonly used Python and apt packages. Please see our [installation guide](docs/local_e2b.md) for detailed instructions. 
+If you prefer not to use a sandbox template, you can disable it by commenting out the line `template=DEFAULT_TEMPLATE_ID,` in `libs/miroflow-tool/src/miroflow/tool/mcp_servers/python_server.py` (line 145). -## Runing a single task -Run a single task: +### Run a single task ```bash ## run a task with instruction @@ -293,147 +200,72 @@ cd MiroFlow/apps/run-agent uv run main.py trace --task="your task description" --task_file_name="path to related task file" ``` -## Evaluate on Benchmark - -Run prebuilt agent on the benchmark data: +### Evaluate on Benchmark +Prepare datasets according to your requirements. Some datasets may need to be downloaded manually into the `/data/` folder, and you should also create a corresponding `standardized_data.jsonl` metafile. We will support as many datasets as possible as soon as we can. ```bash -## download data +## supported benchmarks cd MiroFlow/apps/prepare-benchmark uv run main.py get gaia-val -## run the code -cd MiroFlow/apps/run-agent -uv run main.py common-benchmark benchmark=gaia-validation +uv run main.py get browsecomp-test +uv run main.py get browsecomp-zh-test +uv run main.py get hle ``` -To perform parallel multi-run evaluations, you can use the provided script: - -```bash -cd MiroFlow/apps/run-agent -bash scripts/claude-sonnet-3.7/run_evaluate_multiple_runs_gaia-validation.sh -``` - -## [Optional] Customized Configuration - -MiroFlow uses [Hydra](https://hydra.cc/) for flexible configuration management, supporting different setups for LLMs, agents, benchmarks, and pricing models. - -## Structure - -``` -MiroFlow/libs/miroflow/src/miroflow/prebuilt/config -├── config.yaml # Main configuration with defaults -├── agent/ # Agent configurations (tools, limits) -├── benchmark/ # Benchmark configurations (datasets, execution) -└── llm/ # Language model configurations (providers, models) -``` - -## Usage - -Run with default configuration: +Run evaluation using the default settings. (Not parallelized; not recommended.) 
```bash +## run the code cd MiroFlow/apps/run-agent -uv run main.py common-benchmark -``` - -Default configuration is defined in -`MiroFlow/libs/miroflow/src/miroflow/prebuilt/config/config.yaml`: - -```yaml -# conf/config.yaml -defaults: - - llm: claude_openrouter - - agent: miroflow - - benchmark: gaia-validation - - pricing: _default - -# Other configurations... -``` - -| Component | Default Value | File Path | -|------------|----------------------|---------------------------------------------------------------------------| -| LLM | `claude_openrouter` | `libs/miroflow/src/miroflow/prebuilt/config/llm/claude_openrouter.yaml` | -| Agent | `miroflow` | `libs/miroflow/src/miroflow/prebuilt/config/agent/miroflow.yaml` | -| Benchmark | `gaia-validation` | `libs/miroflow/src/miroflow/prebuilt/config/benchmark/gaia-validation.yaml` | - - -## Override Configurations - -### Component Override -Switch between existing configurations using the filename (without `.yaml`): -```bash -uv run main.py common-benchmark llm= agent= benchmark= +uv run main.py common-benchmark benchmark=gaia-validation +uv run main.py common-benchmark benchmark=browsecomp +uv run main.py common-benchmark benchmark=browsecomp-zh +uv run main.py common-benchmark benchmark=hle ``` -For example, if you have `conf/llm/claude_openrouter.yaml`, use `llm=claude_openrouter` +For parallel and multi-run evaluations, and to gain better control over environment settings using Hydra, **we recommend using the provided script**: - -### Parameter Override -Override specific parameters: ```bash cd MiroFlow/apps/run-agent -uv run main.py common-benchmark llm.temperature=0.1 agent.main_agent.max_turns=30 +bash ./scripts/main-worker-dual/run_evaluate_multiple_runs_gaia-validation.sh +bash ./scripts/main-worker-dual/run_evaluate_multiple_runs_browsecomp.sh +bash ./scripts/main-worker-dual/run_evaluate_multiple_runs_browsecomp-zh.sh +bash ./scripts/main-worker-dual/run_evaluate_multiple_runs_hle.sh ``` -## Create Custom 
Configurations - -1. **Create new config file** in the appropriate subdirectory (e.g., `conf/llm/my_config.yaml`) -2. **Inherit from defaults** using Hydra's composition: - ```yaml - defaults: - - _default # Inherit base configuration - - _self_ # Allow self-overrides - - # Your custom parameters - parameter: value - ``` -3. **Use your config**: `uv run main.py common-benchmark component=my_config` - +You can easily modify and customize these scripts to suit your needs. See [Customized Configuration](#customized-configuration) for more details. -# 🌟 MiroThinker +### Customized Configuration +MiroFlow leverages [Hydra](https://hydra.cc/) for powerful configuration management, allowing you to easily switch between different LLMs, agents, benchmarks, and pricing models using YAML configuration files. For detailed instructions on configuration management, see our [configuration guide](docs/hydra_config.md). -[MiroThinker](https://github.com/MiroMindAI/MiroThinker) (7B/14B/32B) is our suite of open-source agentic models, designed to work seamlessly with the MiroFlow framework. Our models are specifically built to handle **complex, multi-tool tasks**, leveraging the reproducible and robust foundation that MiroFlow provides. -By combining MiroFlow’s reliable orchestration with MiroThinker’s advanced reasoning capabilities, we offer a powerful, end-to-end solution for building high-performing, reproducible AI agents. -These models are a direct result of our extensive data collection efforts, utilizing MiroFlow to generate high-quality, post-training agent trace data. This unique approach enables MiroThinker to excel in planning, executing, and reasoning through complex multi-step tasks. -We invite the community to explore and build upon these models. For more details on the architecture and implementation, please take a look at our codebase. -# ❓ FAQ +## 📄 License & Support -**Q: What is the estimated cost of running the GAIA validation set for a single run?**
-**A**: The cost is approximately **$450 USD** for a run without a cache. Enabling the cache can significantly reduce this cost by 50-67%, bringing it down to the **$150 - $225** range. +This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details. Some components may have different licenses as specified in their respective file headers. +### 🙏 Acknowledgments -**Q: How long does it take to run the GAIA validation set for a single run?**
-**A**: With the `max_concurrent` parameter set to 20, a full run takes about **5 hours** to complete. +- **Benchmark Contributors** for the comprehensive evaluation datasets +- **Open Source Community** for the tools and libraries that make this possible -**Q: Are all the specified APIs required?**
-**A**: **Yes.** To fully reproduce our published results, access to all the listed APIs is necessary. +### 🔧 Support +- Issues: For questions or bug reports, please use [GitHub Issues](https://github.com/MiroMindAI/MiroFlow/issues). +- FAQ Documentation: See [faq.md](docs/faq.md) for additional guidelines -**Q: What is the difference between MiroFlow and MiroThinker?**
-**A**: **MiroFlow** is primarily focused on interacting with proprietary models; **MiroThinker** is designed for our own open-source models. -We plan to merge these two projects in the future to create a single, unified platform. +
+ Star History Chart +
-## 🎉 Join Our Communities! - -- Follow us on social media for timely updates! - - [X - MiroMindAI](https://x.com/miromind_ai) - - [RedNote - MiroMind](https://www.xiaohongshu.com/user/profile/663098830000000003033edc) -- Join our communities: - - [Discord server](https://discord.gg/GPqEnkzQZd) - -
- WeChat Group -
-
-

WeChat Bot QR Code

- WeChat Bot QR -
-
-

WeChat Group QR Code

- WeChat Group QR -
-
-
+### References +``` +@misc{2025mirothinker, + title={MiroFlow: An Open-Source Agentic Framework for Deep Research}, + author={MiroMind AI Team}, + howpublished={\url{https://github.com/MiroMindAI/MiroFlow}}, + year={2025} +} +``` \ No newline at end of file diff --git a/apps/eval-agent/.python-version b/apps/eval-agent/.python-version deleted file mode 100644 index e4fba218..00000000 --- a/apps/eval-agent/.python-version +++ /dev/null @@ -1 +0,0 @@ -3.12 diff --git a/apps/eval-agent/README.md b/apps/eval-agent/README.md deleted file mode 100644 index e69de29b..00000000 diff --git a/apps/eval-agent/pyproject.toml b/apps/eval-agent/pyproject.toml deleted file mode 100644 index a6f538e3..00000000 --- a/apps/eval-agent/pyproject.toml +++ /dev/null @@ -1,14 +0,0 @@ -[project] -name = "eval-agent" -version = "0.1.0" -description = "Add your description here" -readme = "README.md" -authors = [ - { name = "Lei Lei", email = "lei.lei@shanda.com" } -] -requires-python = ">=3.12" -dependencies = [] - -[build-system] -requires = ["hatchling"] -build-backend = "hatchling.build" diff --git a/apps/eval-agent/src/eval_agent/__init__.py b/apps/eval-agent/src/eval_agent/__init__.py deleted file mode 100644 index 72363bdb..00000000 --- a/apps/eval-agent/src/eval_agent/__init__.py +++ /dev/null @@ -1,7 +0,0 @@ -# SPDX-FileCopyrightText: 2025 MiromindAI -# -# SPDX-License-Identifier: Apache-2.0 - - -def hello() -> str: - return "Hello from eval-agent!" 
diff --git a/apps/eval-agent/src/eval_agent/py.typed b/apps/eval-agent/src/eval_agent/py.typed deleted file mode 100644 index e69de29b..00000000 diff --git a/apps/prepare-benchmark/common.py b/apps/prepare-benchmark/common.py index 94e5e278..c456f012 100644 --- a/apps/prepare-benchmark/common.py +++ b/apps/prepare-benchmark/common.py @@ -18,7 +18,7 @@ class Task: metadata: MutableMapping[str, Any] = dataclasses.field(default_factory=dict) def to_json(self) -> bytes: - return json.dumps(dataclasses.asdict(self)).encode() + return json.dumps(dataclasses.asdict(self), ensure_ascii=False).encode() @classmethod def from_json(cls, b: bytes): diff --git a/apps/prepare-benchmark/gen_browsecomp.py b/apps/prepare-benchmark/gen_browsecomp.py index f23c9122..b91e89b0 100644 --- a/apps/prepare-benchmark/gen_browsecomp.py +++ b/apps/prepare-benchmark/gen_browsecomp.py @@ -30,7 +30,7 @@ def decrypt(ciphertext_b64: str, password: str) -> str: encrypted = base64.b64decode(ciphertext_b64) key = derive_key(password, len(encrypted)) decrypted = bytes(a ^ b for a, b in zip(encrypted, key)) - return decrypted.decode() + return decrypted.decode("utf-8") def gen_browsecomp_test(hf_token: str) -> Generator[Task, None, None]: @@ -53,3 +53,26 @@ def gen_browsecomp_test(hf_token: str) -> Generator[Task, None, None]: ) yield task return + + +def gen_browsecomp_zh_test(hf_token: str) -> Generator[Task, None, None]: + dataset = load_dataset( + "PALIN2018/BrowseComp-ZH", + token=hf_token, + split="test", + ) + for idx, x in enumerate(dataset): + metadata: MutableMapping = x + problem_encrypted = metadata.pop("Question") + answer_encrypted = metadata.pop("Answer") + canary = metadata.pop("canary") + metadata["Topic"] = decrypt(metadata["Topic"], canary) + task = Task( + task_id=str(idx), + task_question=decrypt(problem_encrypted, canary), + ground_truth=decrypt(answer_encrypted, canary), + file_path=None, + metadata=metadata, + ) + yield task + return diff --git a/apps/prepare-benchmark/gen_hle.py 
b/apps/prepare-benchmark/gen_hle.py new file mode 100644 index 00000000..66a7bcb9 --- /dev/null +++ b/apps/prepare-benchmark/gen_hle.py @@ -0,0 +1,75 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + +import base64 +import pathlib +from typing import Generator, MutableMapping + +from datasets import load_dataset + +from common import Task + + +def save_image(image, data_dir: str, task_id: str) -> str: + if not image: + return None + # Ensure data_dir is absolute and resolved to avoid ugly .. in the path + data_dir_path = pathlib.Path(data_dir).resolve() + image_path = data_dir_path / "hle" / "images" / f"{task_id}.png" + image_path.parent.mkdir(parents=True, exist_ok=True) + + # Handle different image formats + if isinstance(image, str): + # If it's a data URL, extract the base64 part + if image.startswith("data:"): + try: + header, b64data = image.split(",", 1) + image_data = base64.b64decode(b64data) + image_path.write_bytes(image_data) + except Exception as e: + raise ValueError( + f"Cannot process image data: (data URL): {e}" + ) + else: + try: + image_data = base64.b64decode(image) + image_path.write_bytes(image_data) + except Exception as e: + raise ValueError( + f"Cannot process image data: (raw b64): {e}" + ) + elif hasattr(image, "save"): + # If it's a PIL Image object + image.save(image_path) + else: + # Try to handle it as bytes directly + try: + image_path.write_bytes(image) + except Exception: + raise ValueError(f"Cannot process image data: {type(image)}") + + return str(image_path) + + +def gen_hle_test(hf_token: str, data_dir: str) -> Generator[Task, None, None]: + dataset = load_dataset("cais/hle", split="test", token=hf_token) + for x in dataset: + metadata: MutableMapping = x # type: ignore + task_id = metadata.pop("id") + question = metadata.pop("question") + gt = metadata.pop("answer") + image = metadata.pop("image") # base64 encoded image + image_uri = save_image(image, data_dir, task_id) + 
metadata.pop("image_preview") + metadata.pop("rationale_image") + task = Task( + task_id=task_id, + task_question=question, + ground_truth=gt, + file_path=image_uri, + metadata=metadata, + ) + yield task + + return diff --git a/apps/prepare-benchmark/main.py b/apps/prepare-benchmark/main.py index d9f3a6b7..99eccb6a 100644 --- a/apps/prepare-benchmark/main.py +++ b/apps/prepare-benchmark/main.py @@ -8,10 +8,11 @@ import dotenv import fire -from gen_browsecomp import gen_browsecomp_test +from gen_browsecomp import gen_browsecomp_test, gen_browsecomp_zh_test from gen_frames import gen_frames_test from gen_gaia import gen_gaia_validation from gen_gaia_text_only import gen_gaia_text_only +from gen_hle import gen_hle_test from gen_webwalkerqa import gen_webwalkerqa @@ -23,6 +24,8 @@ class _Env: "frames-test", "webwalkerqa", "browsecomp-test", + "browsecomp-zh-test", + "hle", ) meta_filename = "standardized_data.jsonl" data_dir: pathlib.Path @@ -56,6 +59,13 @@ def gen(): for x in gen_browsecomp_test(env.hf_token): yield x + return gen + case "browsecomp-zh-test": + + def gen(): + for x in gen_browsecomp_zh_test(env.hf_token): + yield x + return gen case "frames-test": @@ -79,6 +89,13 @@ def gen(): for x in gen_webwalkerqa(env.hf_token): yield x + return gen + case "hle": + + def gen(): + for x in gen_hle_test(env.hf_token, env.data_dir): + yield x + return gen case _: raise ValueError("not supported") diff --git a/apps/prepare-benchmark/pyproject.toml b/apps/prepare-benchmark/pyproject.toml index 29173530..5241f376 100644 --- a/apps/prepare-benchmark/pyproject.toml +++ b/apps/prepare-benchmark/pyproject.toml @@ -11,6 +11,7 @@ dependencies = [ "fire>=0.7.0", "python-dotenv>=1.1.1", "requests>=2.32.4", + "Pillow", ] [dependency-groups] diff --git a/apps/prepare-benchmark/uv.lock b/apps/prepare-benchmark/uv.lock index 95252c9d..894cbc6e 100644 --- a/apps/prepare-benchmark/uv.lock +++ b/apps/prepare-benchmark/uv.lock @@ -594,6 +594,72 @@ wheels = [ { url = 
"https://files.pythonhosted.org/packages/d5/f9/07086f5b0f2a19872554abeea7658200824f5835c58a106fa8f2ae96a46c/pandas-2.3.1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:5db9637dbc24b631ff3707269ae4559bce4b7fd75c1c4d7e13f40edc42df4444", size = 13189044, upload-time = "2025-07-07T19:19:39.999Z" }, ] +[[package]] +name = "pillow" +version = "11.3.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/f3/0d/d0d6dea55cd152ce3d6767bb38a8fc10e33796ba4ba210cbab9354b6d238/pillow-11.3.0.tar.gz", hash = "sha256:3828ee7586cd0b2091b6209e5ad53e20d0649bbe87164a459d0676e035e8f523", size = 47113069, upload-time = "2025-07-01T09:16:30.666Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/40/fe/1bc9b3ee13f68487a99ac9529968035cca2f0a51ec36892060edcc51d06a/pillow-11.3.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:fdae223722da47b024b867c1ea0be64e0df702c5e0a60e27daad39bf960dd1e4", size = 5278800, upload-time = "2025-07-01T09:14:17.648Z" }, + { url = "https://files.pythonhosted.org/packages/2c/32/7e2ac19b5713657384cec55f89065fb306b06af008cfd87e572035b27119/pillow-11.3.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:921bd305b10e82b4d1f5e802b6850677f965d8394203d182f078873851dada69", size = 4686296, upload-time = "2025-07-01T09:14:19.828Z" }, + { url = "https://files.pythonhosted.org/packages/8e/1e/b9e12bbe6e4c2220effebc09ea0923a07a6da1e1f1bfbc8d7d29a01ce32b/pillow-11.3.0-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:eb76541cba2f958032d79d143b98a3a6b3ea87f0959bbe256c0b5e416599fd5d", size = 5871726, upload-time = "2025-07-03T13:10:04.448Z" }, + { url = "https://files.pythonhosted.org/packages/8d/33/e9200d2bd7ba00dc3ddb78df1198a6e80d7669cce6c2bdbeb2530a74ec58/pillow-11.3.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:67172f2944ebba3d4a7b54f2e95c786a3a50c21b88456329314caaa28cda70f6", size = 7644652, upload-time = "2025-07-03T13:10:10.391Z" 
}, + { url = "https://files.pythonhosted.org/packages/41/f1/6f2427a26fc683e00d985bc391bdd76d8dd4e92fac33d841127eb8fb2313/pillow-11.3.0-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:97f07ed9f56a3b9b5f49d3661dc9607484e85c67e27f3e8be2c7d28ca032fec7", size = 5977787, upload-time = "2025-07-01T09:14:21.63Z" }, + { url = "https://files.pythonhosted.org/packages/e4/c9/06dd4a38974e24f932ff5f98ea3c546ce3f8c995d3f0985f8e5ba48bba19/pillow-11.3.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:676b2815362456b5b3216b4fd5bd89d362100dc6f4945154ff172e206a22c024", size = 6645236, upload-time = "2025-07-01T09:14:23.321Z" }, + { url = "https://files.pythonhosted.org/packages/40/e7/848f69fb79843b3d91241bad658e9c14f39a32f71a301bcd1d139416d1be/pillow-11.3.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:3e184b2f26ff146363dd07bde8b711833d7b0202e27d13540bfe2e35a323a809", size = 6086950, upload-time = "2025-07-01T09:14:25.237Z" }, + { url = "https://files.pythonhosted.org/packages/0b/1a/7cff92e695a2a29ac1958c2a0fe4c0b2393b60aac13b04a4fe2735cad52d/pillow-11.3.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:6be31e3fc9a621e071bc17bb7de63b85cbe0bfae91bb0363c893cbe67247780d", size = 6723358, upload-time = "2025-07-01T09:14:27.053Z" }, + { url = "https://files.pythonhosted.org/packages/26/7d/73699ad77895f69edff76b0f332acc3d497f22f5d75e5360f78cbcaff248/pillow-11.3.0-cp312-cp312-win32.whl", hash = "sha256:7b161756381f0918e05e7cb8a371fff367e807770f8fe92ecb20d905d0e1c149", size = 6275079, upload-time = "2025-07-01T09:14:30.104Z" }, + { url = "https://files.pythonhosted.org/packages/8c/ce/e7dfc873bdd9828f3b6e5c2bbb74e47a98ec23cc5c74fc4e54462f0d9204/pillow-11.3.0-cp312-cp312-win_amd64.whl", hash = "sha256:a6444696fce635783440b7f7a9fc24b3ad10a9ea3f0ab66c5905be1c19ccf17d", size = 6986324, upload-time = "2025-07-01T09:14:31.899Z" }, + { url = 
"https://files.pythonhosted.org/packages/16/8f/b13447d1bf0b1f7467ce7d86f6e6edf66c0ad7cf44cf5c87a37f9bed9936/pillow-11.3.0-cp312-cp312-win_arm64.whl", hash = "sha256:2aceea54f957dd4448264f9bf40875da0415c83eb85f55069d89c0ed436e3542", size = 2423067, upload-time = "2025-07-01T09:14:33.709Z" }, + { url = "https://files.pythonhosted.org/packages/1e/93/0952f2ed8db3a5a4c7a11f91965d6184ebc8cd7cbb7941a260d5f018cd2d/pillow-11.3.0-cp313-cp313-ios_13_0_arm64_iphoneos.whl", hash = "sha256:1c627742b539bba4309df89171356fcb3cc5a9178355b2727d1b74a6cf155fbd", size = 2128328, upload-time = "2025-07-01T09:14:35.276Z" }, + { url = "https://files.pythonhosted.org/packages/4b/e8/100c3d114b1a0bf4042f27e0f87d2f25e857e838034e98ca98fe7b8c0a9c/pillow-11.3.0-cp313-cp313-ios_13_0_arm64_iphonesimulator.whl", hash = "sha256:30b7c02f3899d10f13d7a48163c8969e4e653f8b43416d23d13d1bbfdc93b9f8", size = 2170652, upload-time = "2025-07-01T09:14:37.203Z" }, + { url = "https://files.pythonhosted.org/packages/aa/86/3f758a28a6e381758545f7cdb4942e1cb79abd271bea932998fc0db93cb6/pillow-11.3.0-cp313-cp313-ios_13_0_x86_64_iphonesimulator.whl", hash = "sha256:7859a4cc7c9295f5838015d8cc0a9c215b77e43d07a25e460f35cf516df8626f", size = 2227443, upload-time = "2025-07-01T09:14:39.344Z" }, + { url = "https://files.pythonhosted.org/packages/01/f4/91d5b3ffa718df2f53b0dc109877993e511f4fd055d7e9508682e8aba092/pillow-11.3.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:ec1ee50470b0d050984394423d96325b744d55c701a439d2bd66089bff963d3c", size = 5278474, upload-time = "2025-07-01T09:14:41.843Z" }, + { url = "https://files.pythonhosted.org/packages/f9/0e/37d7d3eca6c879fbd9dba21268427dffda1ab00d4eb05b32923d4fbe3b12/pillow-11.3.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:7db51d222548ccfd274e4572fdbf3e810a5e66b00608862f947b163e613b67dd", size = 4686038, upload-time = "2025-07-01T09:14:44.008Z" }, + { url = 
"https://files.pythonhosted.org/packages/ff/b0/3426e5c7f6565e752d81221af9d3676fdbb4f352317ceafd42899aaf5d8a/pillow-11.3.0-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:2d6fcc902a24ac74495df63faad1884282239265c6839a0a6416d33faedfae7e", size = 5864407, upload-time = "2025-07-03T13:10:15.628Z" }, + { url = "https://files.pythonhosted.org/packages/fc/c1/c6c423134229f2a221ee53f838d4be9d82bab86f7e2f8e75e47b6bf6cd77/pillow-11.3.0-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:f0f5d8f4a08090c6d6d578351a2b91acf519a54986c055af27e7a93feae6d3f1", size = 7639094, upload-time = "2025-07-03T13:10:21.857Z" }, + { url = "https://files.pythonhosted.org/packages/ba/c9/09e6746630fe6372c67c648ff9deae52a2bc20897d51fa293571977ceb5d/pillow-11.3.0-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c37d8ba9411d6003bba9e518db0db0c58a680ab9fe5179f040b0463644bc9805", size = 5973503, upload-time = "2025-07-01T09:14:45.698Z" }, + { url = "https://files.pythonhosted.org/packages/d5/1c/a2a29649c0b1983d3ef57ee87a66487fdeb45132df66ab30dd37f7dbe162/pillow-11.3.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:13f87d581e71d9189ab21fe0efb5a23e9f28552d5be6979e84001d3b8505abe8", size = 6642574, upload-time = "2025-07-01T09:14:47.415Z" }, + { url = "https://files.pythonhosted.org/packages/36/de/d5cc31cc4b055b6c6fd990e3e7f0f8aaf36229a2698501bcb0cdf67c7146/pillow-11.3.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:023f6d2d11784a465f09fd09a34b150ea4672e85fb3d05931d89f373ab14abb2", size = 6084060, upload-time = "2025-07-01T09:14:49.636Z" }, + { url = "https://files.pythonhosted.org/packages/d5/ea/502d938cbaeec836ac28a9b730193716f0114c41325db428e6b280513f09/pillow-11.3.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:45dfc51ac5975b938e9809451c51734124e73b04d0f0ac621649821a63852e7b", size = 6721407, upload-time = "2025-07-01T09:14:51.962Z" }, + { url = 
"https://files.pythonhosted.org/packages/45/9c/9c5e2a73f125f6cbc59cc7087c8f2d649a7ae453f83bd0362ff7c9e2aee2/pillow-11.3.0-cp313-cp313-win32.whl", hash = "sha256:a4d336baed65d50d37b88ca5b60c0fa9d81e3a87d4a7930d3880d1624d5b31f3", size = 6273841, upload-time = "2025-07-01T09:14:54.142Z" }, + { url = "https://files.pythonhosted.org/packages/23/85/397c73524e0cd212067e0c969aa245b01d50183439550d24d9f55781b776/pillow-11.3.0-cp313-cp313-win_amd64.whl", hash = "sha256:0bce5c4fd0921f99d2e858dc4d4d64193407e1b99478bc5cacecba2311abde51", size = 6978450, upload-time = "2025-07-01T09:14:56.436Z" }, + { url = "https://files.pythonhosted.org/packages/17/d2/622f4547f69cd173955194b78e4d19ca4935a1b0f03a302d655c9f6aae65/pillow-11.3.0-cp313-cp313-win_arm64.whl", hash = "sha256:1904e1264881f682f02b7f8167935cce37bc97db457f8e7849dc3a6a52b99580", size = 2423055, upload-time = "2025-07-01T09:14:58.072Z" }, + { url = "https://files.pythonhosted.org/packages/dd/80/a8a2ac21dda2e82480852978416cfacd439a4b490a501a288ecf4fe2532d/pillow-11.3.0-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:4c834a3921375c48ee6b9624061076bc0a32a60b5532b322cc0ea64e639dd50e", size = 5281110, upload-time = "2025-07-01T09:14:59.79Z" }, + { url = "https://files.pythonhosted.org/packages/44/d6/b79754ca790f315918732e18f82a8146d33bcd7f4494380457ea89eb883d/pillow-11.3.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:5e05688ccef30ea69b9317a9ead994b93975104a677a36a8ed8106be9260aa6d", size = 4689547, upload-time = "2025-07-01T09:15:01.648Z" }, + { url = "https://files.pythonhosted.org/packages/49/20/716b8717d331150cb00f7fdd78169c01e8e0c219732a78b0e59b6bdb2fd6/pillow-11.3.0-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:1019b04af07fc0163e2810167918cb5add8d74674b6267616021ab558dc98ced", size = 5901554, upload-time = "2025-07-03T13:10:27.018Z" }, + { url = 
"https://files.pythonhosted.org/packages/74/cf/a9f3a2514a65bb071075063a96f0a5cf949c2f2fce683c15ccc83b1c1cab/pillow-11.3.0-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:f944255db153ebb2b19c51fe85dd99ef0ce494123f21b9db4877ffdfc5590c7c", size = 7669132, upload-time = "2025-07-03T13:10:33.01Z" }, + { url = "https://files.pythonhosted.org/packages/98/3c/da78805cbdbee9cb43efe8261dd7cc0b4b93f2ac79b676c03159e9db2187/pillow-11.3.0-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1f85acb69adf2aaee8b7da124efebbdb959a104db34d3a2cb0f3793dbae422a8", size = 6005001, upload-time = "2025-07-01T09:15:03.365Z" }, + { url = "https://files.pythonhosted.org/packages/6c/fa/ce044b91faecf30e635321351bba32bab5a7e034c60187fe9698191aef4f/pillow-11.3.0-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:05f6ecbeff5005399bb48d198f098a9b4b6bdf27b8487c7f38ca16eeb070cd59", size = 6668814, upload-time = "2025-07-01T09:15:05.655Z" }, + { url = "https://files.pythonhosted.org/packages/7b/51/90f9291406d09bf93686434f9183aba27b831c10c87746ff49f127ee80cb/pillow-11.3.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:a7bc6e6fd0395bc052f16b1a8670859964dbd7003bd0af2ff08342eb6e442cfe", size = 6113124, upload-time = "2025-07-01T09:15:07.358Z" }, + { url = "https://files.pythonhosted.org/packages/cd/5a/6fec59b1dfb619234f7636d4157d11fb4e196caeee220232a8d2ec48488d/pillow-11.3.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:83e1b0161c9d148125083a35c1c5a89db5b7054834fd4387499e06552035236c", size = 6747186, upload-time = "2025-07-01T09:15:09.317Z" }, + { url = "https://files.pythonhosted.org/packages/49/6b/00187a044f98255225f172de653941e61da37104a9ea60e4f6887717e2b5/pillow-11.3.0-cp313-cp313t-win32.whl", hash = "sha256:2a3117c06b8fb646639dce83694f2f9eac405472713fcb1ae887469c0d4f6788", size = 6277546, upload-time = "2025-07-01T09:15:11.311Z" }, + { url = 
"https://files.pythonhosted.org/packages/e8/5c/6caaba7e261c0d75bab23be79f1d06b5ad2a2ae49f028ccec801b0e853d6/pillow-11.3.0-cp313-cp313t-win_amd64.whl", hash = "sha256:857844335c95bea93fb39e0fa2726b4d9d758850b34075a7e3ff4f4fa3aa3b31", size = 6985102, upload-time = "2025-07-01T09:15:13.164Z" }, + { url = "https://files.pythonhosted.org/packages/f3/7e/b623008460c09a0cb38263c93b828c666493caee2eb34ff67f778b87e58c/pillow-11.3.0-cp313-cp313t-win_arm64.whl", hash = "sha256:8797edc41f3e8536ae4b10897ee2f637235c94f27404cac7297f7b607dd0716e", size = 2424803, upload-time = "2025-07-01T09:15:15.695Z" }, + { url = "https://files.pythonhosted.org/packages/73/f4/04905af42837292ed86cb1b1dabe03dce1edc008ef14c473c5c7e1443c5d/pillow-11.3.0-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:d9da3df5f9ea2a89b81bb6087177fb1f4d1c7146d583a3fe5c672c0d94e55e12", size = 5278520, upload-time = "2025-07-01T09:15:17.429Z" }, + { url = "https://files.pythonhosted.org/packages/41/b0/33d79e377a336247df6348a54e6d2a2b85d644ca202555e3faa0cf811ecc/pillow-11.3.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:0b275ff9b04df7b640c59ec5a3cb113eefd3795a8df80bac69646ef699c6981a", size = 4686116, upload-time = "2025-07-01T09:15:19.423Z" }, + { url = "https://files.pythonhosted.org/packages/49/2d/ed8bc0ab219ae8768f529597d9509d184fe8a6c4741a6864fea334d25f3f/pillow-11.3.0-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:0743841cabd3dba6a83f38a92672cccbd69af56e3e91777b0ee7f4dba4385632", size = 5864597, upload-time = "2025-07-03T13:10:38.404Z" }, + { url = "https://files.pythonhosted.org/packages/b5/3d/b932bb4225c80b58dfadaca9d42d08d0b7064d2d1791b6a237f87f661834/pillow-11.3.0-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:2465a69cf967b8b49ee1b96d76718cd98c4e925414ead59fdf75cf0fd07df673", size = 7638246, upload-time = "2025-07-03T13:10:44.987Z" }, + { url = 
"https://files.pythonhosted.org/packages/09/b5/0487044b7c096f1b48f0d7ad416472c02e0e4bf6919541b111efd3cae690/pillow-11.3.0-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:41742638139424703b4d01665b807c6468e23e699e8e90cffefe291c5832b027", size = 5973336, upload-time = "2025-07-01T09:15:21.237Z" }, + { url = "https://files.pythonhosted.org/packages/a8/2d/524f9318f6cbfcc79fbc004801ea6b607ec3f843977652fdee4857a7568b/pillow-11.3.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:93efb0b4de7e340d99057415c749175e24c8864302369e05914682ba642e5d77", size = 6642699, upload-time = "2025-07-01T09:15:23.186Z" }, + { url = "https://files.pythonhosted.org/packages/6f/d2/a9a4f280c6aefedce1e8f615baaa5474e0701d86dd6f1dede66726462bbd/pillow-11.3.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:7966e38dcd0fa11ca390aed7c6f20454443581d758242023cf36fcb319b1a874", size = 6083789, upload-time = "2025-07-01T09:15:25.1Z" }, + { url = "https://files.pythonhosted.org/packages/fe/54/86b0cd9dbb683a9d5e960b66c7379e821a19be4ac5810e2e5a715c09a0c0/pillow-11.3.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:98a9afa7b9007c67ed84c57c9e0ad86a6000da96eaa638e4f8abe5b65ff83f0a", size = 6720386, upload-time = "2025-07-01T09:15:27.378Z" }, + { url = "https://files.pythonhosted.org/packages/e7/95/88efcaf384c3588e24259c4203b909cbe3e3c2d887af9e938c2022c9dd48/pillow-11.3.0-cp314-cp314-win32.whl", hash = "sha256:02a723e6bf909e7cea0dac1b0e0310be9d7650cd66222a5f1c571455c0a45214", size = 6370911, upload-time = "2025-07-01T09:15:29.294Z" }, + { url = "https://files.pythonhosted.org/packages/2e/cc/934e5820850ec5eb107e7b1a72dd278140731c669f396110ebc326f2a503/pillow-11.3.0-cp314-cp314-win_amd64.whl", hash = "sha256:a418486160228f64dd9e9efcd132679b7a02a5f22c982c78b6fc7dab3fefb635", size = 7117383, upload-time = "2025-07-01T09:15:31.128Z" }, + { url = 
"https://files.pythonhosted.org/packages/d6/e9/9c0a616a71da2a5d163aa37405e8aced9a906d574b4a214bede134e731bc/pillow-11.3.0-cp314-cp314-win_arm64.whl", hash = "sha256:155658efb5e044669c08896c0c44231c5e9abcaadbc5cd3648df2f7c0b96b9a6", size = 2511385, upload-time = "2025-07-01T09:15:33.328Z" }, + { url = "https://files.pythonhosted.org/packages/1a/33/c88376898aff369658b225262cd4f2659b13e8178e7534df9e6e1fa289f6/pillow-11.3.0-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:59a03cdf019efbfeeed910bf79c7c93255c3d54bc45898ac2a4140071b02b4ae", size = 5281129, upload-time = "2025-07-01T09:15:35.194Z" }, + { url = "https://files.pythonhosted.org/packages/1f/70/d376247fb36f1844b42910911c83a02d5544ebd2a8bad9efcc0f707ea774/pillow-11.3.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:f8a5827f84d973d8636e9dc5764af4f0cf2318d26744b3d902931701b0d46653", size = 4689580, upload-time = "2025-07-01T09:15:37.114Z" }, + { url = "https://files.pythonhosted.org/packages/eb/1c/537e930496149fbac69efd2fc4329035bbe2e5475b4165439e3be9cb183b/pillow-11.3.0-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:ee92f2fd10f4adc4b43d07ec5e779932b4eb3dbfbc34790ada5a6669bc095aa6", size = 5902860, upload-time = "2025-07-03T13:10:50.248Z" }, + { url = "https://files.pythonhosted.org/packages/bd/57/80f53264954dcefeebcf9dae6e3eb1daea1b488f0be8b8fef12f79a3eb10/pillow-11.3.0-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:c96d333dcf42d01f47b37e0979b6bd73ec91eae18614864622d9b87bbd5bbf36", size = 7670694, upload-time = "2025-07-03T13:10:56.432Z" }, + { url = "https://files.pythonhosted.org/packages/70/ff/4727d3b71a8578b4587d9c276e90efad2d6fe0335fd76742a6da08132e8c/pillow-11.3.0-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4c96f993ab8c98460cd0c001447bff6194403e8b1d7e149ade5f00594918128b", size = 6005888, upload-time = "2025-07-01T09:15:39.436Z" }, + { url = 
"https://files.pythonhosted.org/packages/05/ae/716592277934f85d3be51d7256f3636672d7b1abfafdc42cf3f8cbd4b4c8/pillow-11.3.0-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:41342b64afeba938edb034d122b2dda5db2139b9a4af999729ba8818e0056477", size = 6670330, upload-time = "2025-07-01T09:15:41.269Z" }, + { url = "https://files.pythonhosted.org/packages/e7/bb/7fe6cddcc8827b01b1a9766f5fdeb7418680744f9082035bdbabecf1d57f/pillow-11.3.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:068d9c39a2d1b358eb9f245ce7ab1b5c3246c7c8c7d9ba58cfa5b43146c06e50", size = 6114089, upload-time = "2025-07-01T09:15:43.13Z" }, + { url = "https://files.pythonhosted.org/packages/8b/f5/06bfaa444c8e80f1a8e4bff98da9c83b37b5be3b1deaa43d27a0db37ef84/pillow-11.3.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:a1bc6ba083b145187f648b667e05a2534ecc4b9f2784c2cbe3089e44868f2b9b", size = 6748206, upload-time = "2025-07-01T09:15:44.937Z" }, + { url = "https://files.pythonhosted.org/packages/f0/77/bc6f92a3e8e6e46c0ca78abfffec0037845800ea38c73483760362804c41/pillow-11.3.0-cp314-cp314t-win32.whl", hash = "sha256:118ca10c0d60b06d006be10a501fd6bbdfef559251ed31b794668ed569c87e12", size = 6377370, upload-time = "2025-07-01T09:15:46.673Z" }, + { url = "https://files.pythonhosted.org/packages/4a/82/3a721f7d69dca802befb8af08b7c79ebcab461007ce1c18bd91a5d5896f9/pillow-11.3.0-cp314-cp314t-win_amd64.whl", hash = "sha256:8924748b688aa210d79883357d102cd64690e56b923a186f35a82cbc10f997db", size = 7121500, upload-time = "2025-07-01T09:15:48.512Z" }, + { url = "https://files.pythonhosted.org/packages/89/c7/5572fa4a3f45740eaab6ae86fcdf7195b55beac1371ac8c619d880cfe948/pillow-11.3.0-cp314-cp314t-win_arm64.whl", hash = "sha256:79ea0d14d3ebad43ec77ad5272e6ff9bba5b679ef73375ea760261207fa8e0aa", size = 2512835, upload-time = "2025-07-01T09:15:50.399Z" }, +] + [[package]] name = "prepare-benchmark" version = "0.1.0" @@ -601,6 +667,7 @@ source = { virtual = "." 
} dependencies = [ { name = "datasets" }, { name = "fire" }, + { name = "pillow" }, { name = "python-dotenv" }, { name = "requests" }, ] @@ -614,6 +681,7 @@ dev = [ requires-dist = [ { name = "datasets", specifier = "<4.0.0" }, { name = "fire", specifier = ">=0.7.0" }, + { name = "pillow" }, { name = "python-dotenv", specifier = ">=1.1.1" }, { name = "requests", specifier = ">=2.32.4" }, ] diff --git a/apps/run-agent/.cursor/rules/agent-framework.mdc b/apps/run-agent/.cursor/rules/agent-framework.mdc index 41b750ad..bd933875 100644 --- a/apps/run-agent/.cursor/rules/agent-framework.mdc +++ b/apps/run-agent/.cursor/rules/agent-framework.mdc @@ -10,15 +10,15 @@ The primary agent framework is located in `apps/reorg-modular-structure/` - this - **Entry Points**: - [main.py](mdc:apps/reorg-modular-structure/main.py) - Main application entry point - [main_gaia_example.py](mdc:apps/reorg-modular-structure/main_gaia_example.py) - GAIA benchmark example -- **Configuration**: `conf/` directory contains YAML configs for agents, benchmarks, and LLM providers +- **Configuration**: `config/` directory contains YAML configs for agents, benchmarks, and LLM providers - **Source Code**: `src/mirage_agent/` contains the core agent implementation - **Scripts**: `scripts/` contains benchmark execution scripts ## Configuration Structure -- `conf/agent/` - Agent configurations (default.yaml, for_debug.yaml, etc.) -- `conf/benchmark/` - Benchmark configurations (gaia-validation.yaml, bbeh.yaml, etc.) -- `conf/llm/` - LLM provider configurations (claude.yaml, openai.yaml, etc.) -- `conf/pricing/` - Pricing configurations +- `config/agent/` - Agent configurations (default.yaml, for_debug.yaml, etc.) +- `config/benchmark/` - Benchmark configurations (gaia-validation.yaml, bbeh.yaml, etc.) +- `config/llm/` - LLM provider configurations (claude.yaml, openai.yaml, etc.) +- `config/pricing/` - Pricing configurations ## Development Workflow 1. 
**Setup**: `cd apps/reorg-modular-structure && uv sync` diff --git a/apps/run-agent/.cursor/rules/configuration-guide.mdc b/apps/run-agent/.cursor/rules/configuration-guide.mdc index 162df0e0..459c85ca 100644 --- a/apps/run-agent/.cursor/rules/configuration-guide.mdc +++ b/apps/run-agent/.cursor/rules/configuration-guide.mdc @@ -5,29 +5,29 @@ alwaysApply: false # Configuration Management Guide ## Configuration Structure -The project uses Hydra for configuration management with YAML files organized in `apps/reorg-modular-structure/conf/`. +The project uses Hydra for configuration management with YAML files organized in `apps/reorg-modular-structure/config/`. ## Configuration Categories -- **Agent Configs** (`conf/agent/`): Define agent behavior and capabilities - - [default.yaml](mdc:apps/reorg-modular-structure/conf/agent/default.yaml) - Default agent configuration - - [for_debug.yaml](mdc:apps/reorg-modular-structure/conf/agent/for_debug.yaml) - Debug configuration - - [owl_set.yaml](mdc:apps/reorg-modular-structure/conf/agent/owl_set.yaml) - OWL-specific configuration +- **Agent Configs** (`config/agent/`): Define agent behavior and capabilities + - [default.yaml](mdc:apps/reorg-modular-structure/config/agent/default.yaml) - Default agent configuration + - [for_debug.yaml](mdc:apps/reorg-modular-structure/config/agent/for_debug.yaml) - Debug configuration + - [owl_set.yaml](mdc:apps/reorg-modular-structure/config/agent/owl_set.yaml) - OWL-specific configuration -- **Benchmark Configs** (`conf/benchmark/`): Define benchmark datasets and evaluation - - [gaia-validation.yaml](mdc:apps/reorg-modular-structure/conf/benchmark/gaia-validation.yaml) - GAIA validation - - [bbeh.yaml](mdc:apps/reorg-modular-structure/conf/benchmark/bbeh.yaml) - BBEH benchmark - - [browsecomp.yaml](mdc:apps/reorg-modular-structure/conf/benchmark/browsecomp.yaml) - BrowseComp benchmark +- **Benchmark Configs** (`config/benchmark/`): Define benchmark datasets and evaluation + - 
[gaia-validation.yaml](mdc:apps/reorg-modular-structure/config/benchmark/gaia-validation.yaml) - GAIA validation + - [bbeh.yaml](mdc:apps/reorg-modular-structure/config/benchmark/bbeh.yaml) - BBEH benchmark + - [browsecomp.yaml](mdc:apps/reorg-modular-structure/config/benchmark/browsecomp.yaml) - BrowseComp benchmark -- **LLM Configs** (`conf/llm/`): Define language model providers and settings - - [claude.yaml](mdc:apps/reorg-modular-structure/conf/llm/claude.yaml) - Anthropic Claude - - [openai.yaml](mdc:apps/reorg-modular-structure/conf/llm/openai.yaml) - OpenAI models - - [qwen3-32b.yaml](mdc:apps/reorg-modular-structure/conf/llm/qwen3-32b.yaml) - Qwen models +- **LLM Configs** (`config/llm/`): Define language model providers and settings + - [claude.yaml](mdc:apps/reorg-modular-structure/config/llm/claude.yaml) - Anthropic Claude + - [openai.yaml](mdc:apps/reorg-modular-structure/config/llm/openai.yaml) - OpenAI models + - [qwen3-32b.yaml](mdc:apps/reorg-modular-structure/config/llm/qwen3-32b.yaml) - Qwen models -- **Pricing Configs** (`conf/pricing/`): Define cost tracking and pricing models - - [default.yaml](mdc:apps/reorg-modular-structure/conf/pricing/default.yaml) - Default pricing +- **Pricing Configs** (`config/pricing/`): Define cost tracking and pricing models + - [default.yaml](mdc:apps/reorg-modular-structure/config/pricing/default.yaml) - Default pricing ## Configuration Usage -- **Main Config**: [config.yaml](mdc:apps/reorg-modular-structure/conf/config.yaml) - Root configuration +- **Main Config**: [config.yaml](mdc:apps/reorg-modular-structure/config/config.yaml) - Root configuration - **Hydra Integration**: Configurations are injected into [main.py](mdc:apps/reorg-modular-structure/main.py) - **Override Syntax**: Use Hydra's override syntax for runtime configuration changes diff --git a/apps/run-agent/README.md b/apps/run-agent/README.md index b2700caf..e34dc362 100644 --- a/apps/run-agent/README.md +++ b/apps/run-agent/README.md @@ -1,60 
+1,367 @@ -# Mirage Agent - Modular Structure +# Major Update on 08/20 项目更新说明:原MiroFlow Bug修复 -This project is an example of a modular agent structure, utilizing Hydra for advanced configuration management. +**Bug修复:** +1. `OPENROUTER`相关环境变量未在tool_utils中导入到reasoning工具(开源repo已同步修复) +2. ctrl+c有时无法终止任务后台运行、在`common_benchmark.py`中新增`signal_handler`(开源repo已同步修复) +3. HF下载后文件路径加载丢失后缀、去除path的resolve方法(开源repo已同步修复) +4. log实时更新的json格式有问题(开源repo已同步修复) +5. Python Server修改`DEFAULT_TEMPLATE_ID`(开源repo已同步修复) +6. Hydra config未保存到工作目录、更改`config.yaml`配置和`main.py`启动入口(开源repo已同步修复) +7. `create_message`的retry装饰应当为 retry_if_**not**_exception_type(ContextLimitError)(开源repo没有问题) +8. 将`llm_as_judge_result`改为`judge_result` -## Quick Testing +# Major Update on 08/20 项目更新说明:功能变更 -You may refer to `scripts/exmamples` to run simple tests. +本文档详细说明了所有的功能性变更,共涉及**53个文件**。 -## Running the Application +## 文件分类统计 -This project uses `uv` for environment and package management. +| 文件类别 | 数量 | 具体说明 | +|---------|------|----------| +| **配置文件** | **29个** | • Agent配置(14个): claude/deepseek/qwen3/kimi/seed等组合
• LLM配置(13个): 各种模型参数配置
• Benchmark配置(2个): browsecomp-zh, xbench-ds | +| **Python代码** | **10个** | • 核心框架(6个): orchestrator, pipeline等
• 评估系统(2个): common_benchmark, eval_utils
• LLM客户端(2个): provider_client_base, anthropic_openrouter
• 搜索工具(1个): searching_mcp_server | +| **Shell脚本** | **8个** | • 并行评估脚本: gaia-test/validation, hle, browsecomp, xbench-ds等 | +| **工具函数** | **4个** | • util_aggregate_results_xlsx.py: Excel结果聚合
• util_llm_parallel_thinking.py: 并行思考
• util_llm_simple_voting.py: 投票机制
• util_statistics_hle_text_only.py: HLE统计 | +| **依赖管理** | **2个** | • pyproject.toml: 添加pandas和openpyxl
• uv.lock: 依赖锁定文件 | +| **总计** | **53个** | 配置文件占比54.7%,代码文件占比45.3% | -### Basic Execution +## 目录 +- [依赖更新](#依赖更新) +- [核心功能改进](#核心功能改进) +- [新增配置文件](#新增配置文件) +- [评估系统增强](#评估系统增强) +- [工具函数新增](#工具函数新增) +- [搜索工具优化](#搜索工具优化) -To run the main application with the default configuration (as defined in `conf/config.yaml`), use the following command: +--- -```bash -uv run main.py +## 依赖更新 + +### 1. apps/run-agent/pyproject.toml & uv.lock +**变更内容:** +- 新增依赖:`openpyxl>=3.1.5` 和 `pandas>=2.3.0` +- 用于支持Excel文件处理和数据分析功能 + +**影响:** 增强了数据处理能力,特别是对评估结果的Excel导出支持,支持新增的分析工具(`util`) + +--- + +## 核心功能改进 + +### 2. common_benchmark.py +**变更内容:** +- 修改JSON输出时添加 `ensure_ascii=False` 参数 +- 确保中文等非ASCII字符能正确保存 + +**代码变更:** +```python +# 原代码 +json.dumps(asdict(result)) +# 新代码 +json.dumps(asdict(result), ensure_ascii=False) ``` -### Using Hydra for Configuration +**影响:** 改善了对中文benchmark(如xbench-ds、browsecomp-zh)的支持 -Hydra allows for powerful configuration overrides directly from the command line. +### 3. eval_utils.py +**主要改进:** +1. **新增XBench评估支持** + - 添加了 `verify_answer_llm_xbench()` 函数,支持中文基准测试评估 + - 使用o3作为裁判模型(可配置为Gemini 2.0 Flash) + - 支持中文prompt和结构化输出 -**1. Switching LLM Configuration:** +2. **错误处理增强** + - 所有评估函数添加了 `@retry` 装饰器,使用tenacity实现重试机制 + - 重试策略:指数退避,最多5次尝试 -You can switch the entire language model configuration by specifying the `llm` group. For example, to use the `gemini` configuration instead of the default `gpt-4`: +3. **异常处理统一化** + - 将原来的简单try-catch改为更严格的异常处理 + - 失败时抛出具体异常而不是返回NOT_ATTEMPED默认值 -```bash -uv run main.py llm=gemini +**新增的XBench评估prompt模板:** +```python +XBENCH_LLM_JUDGE_PROMPT = """ +你是一个通用人工智能助手。根据下面给出的[正确答案], 判断以下对[原问题]的[回答]的回答是否正确。 +... +结论: 如果[最终答案]与上方给出的[正确答案]一致...则填写'正确'; 否则...填写'错误'。 +""" ``` -Available LLM configurations can be found in `conf/llm/`. +### 4. libs/miroflow/src/miroflow/prebuilt/orchestrator.py +**重大改进:** + +1. **双LLM客户端支持** + - 新增 `sub_agent_llm_client` 参数,允许主agent和子agent使用不同的LLM模型 + - 自动根据agent类型选择正确的LLM客户端 + +2. 
**中文上下文支持** + - 通过环境变量 `CHINESE_CONTEXT` 控制 + - 在O3提示抽取和最终答案抽取中添加中文特定指导 -**2. Overriding Individual Parameters:** +3. **O3最终答案抽取增强** + - 添加了详细的置信度评估指导(0-100分) + - 新增支持证据和潜在不足的结构化输出 + - 支持中英文双语输出格式 -Any parameter in the configuration can be overridden. For example, to change the temperature of the `gemini` model: +4. **Context限制检查优化** + - 注释掉了原有的context限制检查代码(已有ContextLimitError处理,去除冗余) -```bash -uv run main.py llm=gemini llm.temperature=0.9 +**中文支持示例:** +```python +if chinese_context: + instruction += """ +## 中文分析指导 +如果问题涉及中文语境,请特别注意: +- 语言理解:识别可能存在的中文表达歧义... +- 文化背景:考虑可能需要中文文化背景知识... +""" ``` -**3. Running Benchmarks:** +### 5. libs/miroflow/src/miroflow/prebuilt/pipeline.py +**变更内容:** +- 支持为主agent和子agent分别配置不同的LLM +- 从配置文件动态加载LLM配置 +- 改进了资源清理逻辑 -The benchmark runner also uses Hydra. To run a benchmark, specify the `benchmark` group: +**代码改进:** +```python +# 新增主agent LLM配置加载 +main_agent_llm_config = cfg.agent.get("main_agent_llm", None) +if main_agent_llm_config: + main_agent_cfg = OmegaConf.load(f"conf/llm/{main_agent_llm_config}.yaml") + llm_client = LLMClient(task_id=task_id, cfg=OmegaConf.create({"llm": main_agent_cfg})) -```bash -uv run benchmarks/common_benchmark.py benchmark=gaia +# 新增子agent LLM配置加载 +sub_agent_llm_config = cfg.agent.get("sub_agent_llm", None) ``` +### 6. libs/miroflow/src/miroflow/utils/io_utils.py +**重要改进:** +- 重写了 `_extract_boxed_content()` 方法 +- 使用平衡括号计数算法替代正则表达式 +- 正确处理任意层级的嵌套括号 + +**算法改进:** +```python +# 新算法:通过计数括号来处理嵌套 +brace_count = 1 +while content_end < len(text) and brace_count > 0: + if char == '{': + brace_count += 1 + elif char == '}': + brace_count -= 1 +``` + +### 7. libs/miroflow/src/miroflow/utils/prompt_utils.py +**全面的中文支持增强:** + +1. **MCP系统prompt中文指导** + - 子任务委托使用中文描述 + - 搜索关键词使用中文 + - 思考过程使用中文表达 + +2. **Agent特定prompt改进** + - 主agent:支持reasoning工具和深度思考两种模式 + - Worker agent:添加中文内容处理指导,包括Google搜索参数设置 + +3. 
**总结prompt中文支持** + - 所有agent类型都添加了中文总结要求 + +--- + +## 新增配置文件 + +### Agent配置文件(18个新文件) +位置:`libs/miroflow/src/miroflow/prebuilt/config/agent/` + +**双模型配置系列:** +- `claude03_claude_dual.yaml` - Claude 3.7 Sonnet (temp=0.3) 主从配置 +- `claude05_claude_dual.yaml` - Claude 3.7 Sonnet (temp=0.5) 主从配置 +- `claude07_claude_dual.yaml` - Claude 3.7 Sonnet (temp=0.7) 主从配置 +- `deepseek_claude_dual.yaml` - DeepSeek R1主 + Claude 3.7从 +- `deepseek_deepseek_dual.yaml` - DeepSeek全栈配置 +- `deepseek_kimi_dual.yaml` - DeepSeek主 + Kimi K2从 +- `deepseek_qwen3_dual.yaml` - DeepSeek主 + Qwen3从 +- `deepseek_qwen3flash_dual.yaml` - DeepSeek主 + Qwen3 Flash从 +- `gptoss_gptoss_dual.yaml` - GPT-OSS 120B全栈配置 +- `kimi_claude_dual.yaml` - Kimi K2主 + Claude从 +- `qwen3_claude_dual.yaml` - Qwen3主 + Claude从 +- `seed_claude_dual.yaml` - Seed 1.6主 + Claude从 + +**特殊配置:** +- `deepseek_nohint_claude_dual.yaml` - 禁用O3提示 +- `deepseek_nohintreason_claude_dual.yaml` - 禁用O3提示和推理工具 + +### LLM配置文件(13个新文件) +位置:`libs/miroflow/src/miroflow/prebuilt/config/llm/` + +**新增模型配置:** +1. **Claude系列** + - `claude-3.7-sonnet_temp03.yaml` (temperature=0.3) + - `claude-3.7-sonnet_temp05.yaml` (temperature=0.5) + - `claude-3.7-sonnet_temp07.yaml` (temperature=0.7) + - `claude-4-sonnet.yaml` - Claude 4配置 + +2. **DeepSeek系列** + - `deepseek-r1-0528.yaml` - DeepSeek R1模型 + - `deepseek-v3.yaml` - DeepSeek V3 (火山方舟) + +3. 
**其他模型** + - `gemini-2-5-pro.yaml` - Gemini 2.5 Pro + - `gpt-oss-120b.yaml` - GPT-OSS 120B + - `kimi-k2.yaml` - Kimi K2 (火山方舟) + - `qwen3-235b-thinking.yaml` - Qwen3 235B思考模型(火山云自部署) + - `qwen3-coder.yaml` - Qwen3 Coder 480B(火山云自部署) + - `qwen3-coder-flash.yaml` - Qwen3 Coder 30B Flash(火山云自部署) + - `seed-1-6-thinking.yaml` - Seed 1.6思考模型 (火山方舟) + +### Benchmark配置文件(2个新文件) +位置:`libs/miroflow/src/miroflow/prebuilt/config/benchmark/` +- `browsecomp-zh.yaml` - 中文浏览理解基准测试 +- `xbench-ds.yaml` - XBench深度搜索基准测试 + +--- + +## 评估系统增强 + +### 8个新运行脚本 +位置:`apps/run-agent/scripts/main-worker-dual/` + +**新增的并行评估脚本:** +- `run_evaluate_multiple_runs_browsecomp-zh.sh` - 中文浏览理解评估 +- `run_evaluate_multiple_runs_browsecomp.sh` - 英文浏览理解评估 +- `run_evaluate_multiple_runs_gaia-test.sh` - GAIA测试集评估 +- `run_evaluate_multiple_runs_gaia-validation.sh` - GAIA验证集评估 +- `run_evaluate_multiple_runs_noansbox_gaia-validation.sh` -(无Google答案框)GAIA验证集评估 +- `run_evaluate_multiple_runs_hle.sh` - HLE基准评估 +- `run_evaluate_multiple_runs_nohintreason_hle.sh` - (无O3提示和推理工具)HLE基准评估 +- `run_evaluate_multiple_runs_xbench-ds.sh` - XBench深度搜索评估 + +**脚本特点:** +- 支持并行运行多次实验 +- 自动检测中文基准并设置 `CHINESE_CONTEXT=true` +- 支持配置环境变量控制Google搜索结果过滤 +- 自动计算平均分数 + +--- + +## 工具函数新增 + +### util_aggregate_results_xlsx.py (416行) +**功能:** 将多次运行的基准测试结果聚合到Excel文件 + +**主要特性:** +- 支持多run结果合并 +- 条件格式化(正确答案灰色,错误答案深红色) +- 自动计算准确率统计 +- 保持run_1的任务顺序 +- 支持中文内容 + +### util_llm_parallel_thinking.py (567行) +**功能:** 使用LLM进行并行思考和答案选择 + +**核心功能:** +1. **多答案聚合** + - 从多个agent运行中提取答案 + - 使用O3模型进行最佳答案选择 + +2. **基准特定prompt** + - GAIA基准:英文评估,等价性规则 + - XBench基准:中文评估,语义一致性判断 + +3. **并发处理** + - 支持25个并发API请求 + - 实时进度显示和成本计算 + +4. 
**成本追踪** + - O3定价:输入$2/1M tokens,缓存$0.5/1M,输出$8/1M + +### util_llm_simple_voting.py (444行) +**功能:** 简单投票机制选择最佳答案 + +**实现方式:** +- 答案归一化和等价性判断 +- 多数投票选择 +- 支持数值、文本、列表类型答案 +- 特殊规则处理(长列表、百分比等) + +### util_statistics_hle_text_only.py (88行) +**功能:** HLE纯文本任务统计分析 + +**分析内容:** +- 任务类型分布 +- 答案格式统计 +- 文本长度分析 + +--- + +## 搜索工具优化 + +### libs/miroflow-tool/src/miroflow/tool/mcp_servers/searching_mcp_server.py + +**新增功能:** + +1. **Google搜索结果过滤** + - 通过环境变量控制: + - `REMOVE_SNIPPETS` - 移除Google摘要 + - `REMOVE_KNOWLEDGE_GRAPH` - 移除Google知识图谱 + - `REMOVE_ANSWER_BOX` - 移除Google答案框 + +2. **新增搜索参数** + - `gl` - 国家上下文(如'cn'中国,'us'美国) + - `hl` - 界面语言(如'zh'中文,'en'英文) + +3. **过滤器实现** +```python +def filter_google_search_result(result_content: str) -> str: + """根据环境变量过滤搜索结果""" + # 移除知识图谱 + if REMOVE_KNOWLEDGE_GRAPH and "knowledgeGraph" in data: + del data["knowledgeGraph"] + # 移除答案框 + if REMOVE_ANSWER_BOX and "answerBox" in data: + del data["answerBox"] + # 移除摘要 + if REMOVE_SNIPPETS: + for item in data["organic"]: + if "snippet" in item: + del item["snippet"] +``` + +--- + +## 其他改进 + +### LLM Provider客户端改进 +**文件:** `libs/miroflow/src/miroflow/llm/provider_client_base.py` 和 `libs/miroflow/src/miroflow/llm/providers/claude_openrouter_client.py` + +**改进内容:** +1. 支持 `repetition_penalty` 参数 +2. 改进OpenRouter配置处理 +3. 增强context限制错误检测 +4. 支持通过extra_body传递top_k和min_p参数 + +### 相应配置增强 +**文件:** `libs/miroflow/src/miroflow/prebuilt/config/config.yaml`和`libs/miroflow/src/miroflow/utils/tool_utils.py` + +**新增部分环境变量传递:** +```python +"REMOVE_SNIPPETS": os.environ.get("REMOVE_SNIPPETS", "false"), +"REMOVE_KNOWLEDGE_GRAPH": os.environ.get("REMOVE_KNOWLEDGE_GRAPH", "false"), +"REMOVE_ANSWER_BOX": os.environ.get("REMOVE_ANSWER_BOX", "false"), +``` +等。 + +--- + +## 总结 -## Configuration Structure +本次功能更新主要聚焦于: -The configuration for this application resides in the `conf/` directory, following Hydra conventions: +1. **多语言支持增强** - 全面支持中文基准测试和中文语境处理 +2. **评估系统升级** - 新增XBench支持,改进评估鲁棒性 +3. 
**灵活的模型配置** - 支持主从agent使用不同LLM模型 +4. **评测链完善** - 新增多个实用utils工具函数,支持voting & selecting +5. **搜索优化** - 可配置的搜索结果过滤 -- `conf/config.yaml`: The main entry point for configuration. It defines the default composition of configuration groups. -- `conf/llm/`: Contains configurations for different language models (e.g., `gpt-4.yaml`, `gemini.yaml`). -- `conf/agent/`: Contains configurations for different agent setups. -- `conf/benchmark/`: Contains configurations for different benchmark tasks. -- `conf/pricing/`: Contains configurations for different models' price. +这些改进显著提升了系统的灵活性、可靠性和对中文任务的支持能力。 \ No newline at end of file diff --git a/apps/run-agent/calculate_score_from_log.py b/apps/run-agent/calculate_score_from_log.py index f6bd549d..6730bdc4 100755 --- a/apps/run-agent/calculate_score_from_log.py +++ b/apps/run-agent/calculate_score_from_log.py @@ -17,15 +17,13 @@ def extract_score_from_log(run_dir, task_score_dict): task_id = log_file.split("/")[-1].split("_")[0] with open(log_file, "r") as f: data = json.load(f) - if "llm_as_judge_result" in data and data["llm_as_judge_result"] in ( + if "judge_result" in data and data["judge_result"] in ( "CORRECT", "INCORRECT", ): if task_id not in task_score_dict: task_score_dict[task_id] = [] - task_score_dict[task_id].append( - data["llm_as_judge_result"] == "CORRECT" - ) + task_score_dict[task_id].append(data["judge_result"] == "CORRECT") def main(results_dir: str, pass_at_k: int = 3): diff --git a/apps/run-agent/common_benchmark.py b/apps/run-agent/common_benchmark.py index 69064f5d..be9a322a 100644 --- a/apps/run-agent/common_benchmark.py +++ b/apps/run-agent/common_benchmark.py @@ -11,10 +11,12 @@ from enum import StrEnum from pathlib import Path from typing import Any, Callable, Dict, List, Optional, Tuple, TypedDict -import dotenv +import random +import dotenv import hydra import openai +from eval_utils import verify_answer_for_datasets from miroflow.contrib.tracing import set_tracing_disabled, set_tracing_export_api_key 
from miroflow.contrib.tracing.otlp_setup import bootstrap_silent_trace_provider from miroflow.logging.logger import bootstrap_logger @@ -24,9 +26,6 @@ execute_task_pipeline, ) from omegaconf import DictConfig, OmegaConf -from pydantic import BaseModel, Field - -from eval_utils import verify_answer_for_datasets from summary_time_cost import generate_summary @@ -58,12 +57,13 @@ class AttemptStats(TypedDict): model_boxed_answer: str status: TaskStatus log_file_path: Optional[Path] - llm_as_judge_result: Optional[str] + judge_result: Optional[str] is_correct: bool error_message: Optional[str] -class BenchmarkResult(BaseModel): +@dataclass +class BenchmarkResult: """Generic benchmark evaluation result structure""" task_id: str @@ -73,15 +73,29 @@ class BenchmarkResult(BaseModel): model_response: str model_boxed_answer: str status: str - metadata: Dict[str, Any] = Field(default_factory=dict) + metadata: Dict[str, Any] = field(default_factory=dict) error_message: str = "" - llm_as_judge_result: Optional[str] = None + judge_result: Optional[str] = None log_file_path: Optional[Path] = None # Pass@K support fields - attempts: List[AttemptStats] = Field(default_factory=list) # Store all attempts + attempts: List[AttemptStats] = field(default_factory=list) # Store all attempts pass_at_k_success: bool = False # Whether task passed using pass@k evaluation k_value: int = 1 # The k value used for this evaluation + def to_dict(self): + """Convert the object to a serializable dictionary.""" + result = self.__dict__.copy() # Copy the object's dictionary + # Convert Path objects to string + if isinstance(result.get("log_file_path"), Path): + result["log_file_path"] = str(result["log_file_path"]) + if isinstance(result.get("file_path"), Path): + result["file_path"] = str(result["file_path"]) + # Convert any Path objects inside the attempts list + for attempt in result.get("attempts", []): + if isinstance(attempt.get("log_file_path"), Path): + attempt["log_file_path"] = 
str(attempt["log_file_path"]) + return result + class BenchmarkEvaluator(ABC): """Abstract base class for benchmark evaluators""" @@ -160,6 +174,11 @@ async def run_single_task(self, task: BenchmarkTask) -> BenchmarkResult: model_boxed_answer="", status="pending", metadata=task.metadata.copy(), + error_message="", + judge_result=None, + log_file_path=None, + attempts=[], + pass_at_k_success=False, k_value=self.pass_at_k, ) @@ -198,7 +217,7 @@ async def run_single_task(self, task: BenchmarkTask) -> BenchmarkResult: output_formatter=self.output_formatter, ground_truth=task.ground_truth, log_path=self.output_dir - / f"{task.task_id}_attempt_{attempt}.json", + / f"task_{task.task_id}_attempt_{attempt}.json", ) attempt_result["model_response"] = response if response else "" @@ -216,7 +235,11 @@ async def run_single_task(self, task: BenchmarkTask) -> BenchmarkResult: print(f" Error in attempt {attempt}: {e}") # Perform LLM verification if we have an answer and haven't verified yet - if attempt_result["status"] == TaskStatus.RUN_COMPLETED: + if ( + attempt_result["status"] == TaskStatus.RUN_COMPLETED + or attempt_result["judge_result"] == "NOT_ATTEMPTED" + ): + # if attempt_result["status"] == TaskStatus.RUN_COMPLETED: print(f" Verifying answer for attempt {attempt}...") try: evaluation_result = await verify_answer_for_datasets( @@ -226,7 +249,7 @@ async def run_single_task(self, task: BenchmarkTask) -> BenchmarkResult: target=task.ground_truth, predicted_answer=attempt_result["model_boxed_answer"], ) - attempt_result["llm_as_judge_result"] = evaluation_result + attempt_result["judge_result"] = evaluation_result attempt_result["is_correct"] = evaluation_result == "CORRECT" # Update the log file with verification result @@ -247,15 +270,15 @@ async def run_single_task(self, task: BenchmarkTask) -> BenchmarkResult: except Exception as e: print(f" Error verifying attempt {attempt}: {e}") - attempt_result["llm_as_judge_result"] = "ERROR" + attempt_result["judge_result"] = 
"ERROR" attempt_result["is_correct"] = False if attempt_result["is_correct"]: print(f" ✅ Attempt {attempt}: CORRECT (cached)") found_correct_answer = True - elif attempt_result["llm_as_judge_result"]: + elif attempt_result["judge_result"]: print( - f" ❌ Attempt {attempt}: INCORRECT (cached: {attempt_result['llm_as_judge_result']})" + f" ❌ Attempt {attempt}: INCORRECT (cached: {attempt_result['judge_result']})" ) else: print(f" ⚠️ Attempt {attempt}: No valid answer to verify") @@ -291,9 +314,9 @@ async def run_single_task(self, task: BenchmarkTask) -> BenchmarkResult: # Set main result LLM judge result based on pass@k outcome if found_correct_answer: - result.llm_as_judge_result = "PASS_AT_K_SUCCESS" + result.judge_result = "PASS_AT_K_SUCCESS" else: - result.llm_as_judge_result = "PASS_AT_K_FAILED" + result.judge_result = "PASS_AT_K_FAILED" print(f"Task {task.task_id} completed with {len(result.attempts)} attempts") print( @@ -310,11 +333,11 @@ def scan_latest_attempt(self, task: BenchmarkTask, attempt: int) -> AttemptStats "model_boxed_answer": "", "status": TaskStatus.PENDING, "log_file_path": None, - "llm_as_judge_result": None, + "judge_result": None, "is_correct": False, "error_message": None, } - trace_filename_pattern = f"task_{task.task_id}_attempt_{attempt}_*.json" + trace_filename_pattern = f"task_{task.task_id}_attempt_{attempt}.json" matched_logs = self.output_dir.glob(trace_filename_pattern) sorted_logs = sorted(matched_logs, reverse=True) if len(sorted_logs) == 0: @@ -331,14 +354,10 @@ def scan_latest_attempt(self, task: BenchmarkTask, attempt: int) -> AttemptStats attempt_result["model_boxed_answer"] = log_data["final_boxed_answer"] attempt_result["model_response"] = log_data.get("output", "") # Check if we already have LLM judge result in log - if log_data.get("llm_as_judge_result"): + if log_data.get("judge_result"): attempt_result["status"] = TaskStatus.RESULT_JUDGED - attempt_result["llm_as_judge_result"] = log_data[ - "llm_as_judge_result" - ] - 
attempt_result["is_correct"] = ( - log_data["llm_as_judge_result"] == "CORRECT" - ) + attempt_result["judge_result"] = log_data["judge_result"] + attempt_result["is_correct"] = log_data["judge_result"] == "CORRECT" print( f" Loaded existing result: {attempt_result['model_boxed_answer']}" ) @@ -358,26 +377,36 @@ async def run_with_semaphore(task): async with semaphore: return await self.run_single_task(task) + # Shuffle tasks to avoid order bias and improve balancing + shuffled_tasks = tasks.copy() + random.shuffle(shuffled_tasks) + # Run tasks in parallel results = await asyncio.gather( - *[run_with_semaphore(task) for task in tasks], return_exceptions=True + *[run_with_semaphore(task) for task in shuffled_tasks], + return_exceptions=True, ) # Handle exceptions processed_results = [] for i, result in enumerate(results): if isinstance(result, Exception): - print(f"Exception in task {tasks[i].task_id}: {result}") + print(f"Exception in task {shuffled_tasks[i].task_id}: {result}") error_result = BenchmarkResult( - task_id=tasks[i].task_id, - task_question=tasks[i].task_question, - ground_truth=tasks[i].ground_truth, - file_path=tasks[i].file_path, + task_id=shuffled_tasks[i].task_id, + task_question=shuffled_tasks[i].task_question, + ground_truth=shuffled_tasks[i].ground_truth, + file_path=shuffled_tasks[i].file_path, model_response="", model_boxed_answer="", status="failed", - metadata=tasks[i].metadata.copy(), + metadata=shuffled_tasks[i].metadata.copy(), error_message=str(result), + judge_result=None, + log_file_path=None, + attempts=[], + pass_at_k_success=False, + k_value=self.pass_at_k, ) processed_results.append(error_result) else: @@ -392,7 +421,7 @@ def save_results(self, output_path: Path) -> Path: with open(output_path, "w", encoding="utf-8") as f: for result in self.results: - f.write(result.model_dump_json() + "\n") + f.write(json.dumps(result.to_dict(), ensure_ascii=False) + "\n") print(f"Results saved to {output_path}") return output_path @@ -423,7 
+452,7 @@ async def evaluate_accuracy(self) -> float: # Show details of each attempt for attempt in result.attempts: attempt_num = attempt.get("attempt_number", "?") - judge_result = attempt.get("llm_as_judge_result", "NOT_VERIFIED") + judge_result = attempt.get("judge_result", "NOT_VERIFIED") is_correct = attempt.get("is_correct", False) status_icon = ( "✅" @@ -462,12 +491,12 @@ async def _update_log_file_with_evaluation( log_data = json.load(f) # Update with evaluation result - log_data["llm_as_judge_result"] = evaluation_result + log_data["judge_result"] = evaluation_result # Write to a temporary file and then atomically replace temp_log_file = log_file.with_suffix(f"{log_file.suffix}.tmp") with open(temp_log_file, "w", encoding="utf-8") as f: - json.dump(log_data, f, indent=2) + json.dump(log_data, f, indent=2, ensure_ascii=False) os.replace(temp_log_file, log_file) print(f" Updated log file {log_file.name} with evaluation result.") diff --git a/apps/run-agent/eval_answer_from_log.py b/apps/run-agent/eval_answer_from_log.py index e5499b73..7ea71a88 100644 --- a/apps/run-agent/eval_answer_from_log.py +++ b/apps/run-agent/eval_answer_from_log.py @@ -34,8 +34,8 @@ async def main(input_dir: str, benchmark_name: str): ground_truth = data.get("ground_truth", "") predicted_answer = data.get("final_boxed_answer", "") # If already has judge result, skip - # if "llm_as_judge_result" in data and data["llm_as_judge_result"] in ("CORRECT", "INCORRECT"): - # print(f"Log {log_file} already has judge result: {data['llm_as_judge_result']}") + # if "judge_result" in data and data["judge_result"] in ("CORRECT", "INCORRECT"): + # print(f"Log {log_file} already has judge result: {data['judge_result']}") # continue # Call LLM judge result = await verify_answer_for_datasets( @@ -47,7 +47,7 @@ async def main(input_dir: str, benchmark_name: str): ) print(f"{os.path.basename(log_file)}: {result}") # Optionally, update the log file with the result - # data["llm_as_judge_result"] = result 
+ # data["judge_result"] = result # with open(log_file, "w", encoding="utf-8") as f: # json.dump(data, f, ensure_ascii=False, indent=2) if result == "CORRECT": diff --git a/apps/run-agent/eval_utils.py b/apps/run-agent/eval_utils.py index 59c2b24c..f5f58829 100644 --- a/apps/run-agent/eval_utils.py +++ b/apps/run-agent/eval_utils.py @@ -2,7 +2,6 @@ # # SPDX-License-Identifier: Apache-2.0 -import os import re import string import warnings @@ -10,6 +9,7 @@ from openai import AsyncOpenAI from pydantic import BaseModel +from tenacity import retry, stop_after_attempt, wait_exponential EVALUATION_PROMPT_SIMPLEQA = """ Your job is to look at a question, a gold target, and a predicted answer, and then assign a grade of either ["CORRECT", "INCORRECT", "NOT_ATTEMPTED"]. @@ -93,6 +93,7 @@ """.strip() +@retry(wait=wait_exponential(multiplier=5), stop=stop_after_attempt(5)) async def verify_answer_llm_simpleqa( openai_client: AsyncOpenAI, question: str, target: str, predicted_answer: str ) -> str: @@ -110,23 +111,85 @@ async def verify_answer_llm_simpleqa( ] CHOICE_MAP = {"A": "CORRECT", "B": "INCORRECT", "C": "NOT_ATTEMPTED"} - try: - llm_response = await openai_client.chat.completions.create( - model="gpt-4o-mini", - messages=messages, # type: ignore - max_completion_tokens=2, - ) - content = llm_response.choices[0].message.content - if not isinstance(content, str): - raise ValueError("llm failed to return") - - match = re.search(r"(A|B|C)", content) - if match: - return CHOICE_MAP[match.group(0)] - except Exception as e: - print(f"LLM evaluation failed: {e}") + llm_response = await openai_client.chat.completions.create( + model="gpt-4o-mini", messages=messages, max_completion_tokens=2 + ) + content = llm_response.choices[0].message.content + match = re.search(r"(A|B|C)", content) + if match: + return CHOICE_MAP[match.group(0)] + else: + raise Exception(f"SimpleQA LLM evaluation failed: {content}") + + +# XBench LLM Judge prompt template (Chinese) +XBENCH_LLM_JUDGE_PROMPT = 
""" +你是一个通用人工智能助手。根据下面给出的[正确答案], 判断以下对[原问题]的[回答]的回答是否正确。 + +[原问题]: {question} + +[正确答案]: {correct_answer} + +[回答]:{response} + +你的判断必须按照以下格式和标准进行: + +最终答案: 从[回答]中提取出的最终准确答案。如果[回答]中没有明确的最终答案, 则填写'无'。 + +解释: 根据[原问题]解释为什么[最终答案]是正确的或错误的。只关注[最终答案]与[正确答案]之间是否存在实质性差异, 不要评论题目的背景, 不要尝试重新解题, 不要为任何不同于[正确答案]的答案辩护, 只专注于判断答案是否一致。 + +结论: 如果[最终答案]与上方给出的[正确答案]一致, 或者在数值题目中处于可接受的微小误差范围内, 则填写'正确'; 否则(即存在任何不一致、歧义、不等价或提取出的答案错误的情况)填写'错误'。 +""".strip() + + +class XBenchExtractedAnswer(BaseModel): + 最终答案: str + 解释: str + 结论: Literal["正确", "错误"] + strict: Literal[True] = True # 100% reliability + + +@retry(wait=wait_exponential(multiplier=5), stop=stop_after_attempt(5)) +async def verify_answer_llm_xbench( + openai_client: AsyncOpenAI, question: str, target: str, predicted_answer: str +) -> str: + """ + Use XBench-style LLM judge (o3) to verify if the predicted answer is correct. + Uses structured output format similar to verify_answer_llm_hle. + + Args: + question: The question being answered + target: The correct/target answer + predicted_answer: The model's predicted answer + + Returns: + String indicating the evaluation result: "CORRECT", "INCORRECT", or "NOT_ATTEMPTED" + """ + prompt = XBENCH_LLM_JUDGE_PROMPT.format( + question=question, correct_answer=target, response=predicted_answer + ) + + response = await openai_client.beta.chat.completions.parse( + model="o3", # xbench by default uses deepseek-v3 ? 
+ max_completion_tokens=4096, + messages=[{"role": "user", "content": prompt}], + response_format=XBenchExtractedAnswer, + ) + + content = response.choices[0].message.parsed - return "NOT_ATTEMPTED" + # Print XBench reasoning + print(f"XBench LLM Judge Extracted Answer: {content.最终答案}") + print(f"XBench LLM Judge Reasoning: {content.解释}") + print(f"XBench LLM Judge Result: {content.结论}") + + # Convert XBench format to eval_utils format + if content.结论 == "正确": + return "CORRECT" + elif content.结论 == "错误": + return "INCORRECT" + else: + raise Exception(f"XBench LLM evaluation failed: {content}") # HLE Judge prompt and model @@ -146,9 +209,18 @@ async def verify_answer_llm_simpleqa( correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect. -confidence: The extracted confidence score between 0|\\%| and 100|\\%| from [response]. Put 100 if there is no confidence score available.""" +confidence: The extracted confidence score between 0|%| and 100|%| from [response]. 
Put 100 if there is no confidence score available.""" + +class HLEExtractedAnswer(BaseModel): + extracted_final_answer: str + reasoning: str + correct: Literal["yes", "no"] + confidence: int + strict: Literal[True] = True # 100% reliability + +@retry(wait=wait_exponential(multiplier=5), stop=stop_after_attempt(5)) async def verify_answer_llm_hle( openai_client: AsyncOpenAI, question: str, target: str, predicted_answer: str ) -> str: @@ -164,48 +236,31 @@ async def verify_answer_llm_hle( Returns: String indicating the evaluation result """ - - class HLEExtractedAnswer(BaseModel): - extracted_final_answer: str - reasoning: str - correct: Literal["yes", "no"] - confidence: int - strict: Literal[True] = True # 100% reliability - prompt = HLE_JUDGE_PROMPT.format( question=question, correct_answer=target, response=predicted_answer ) - try: - response = await openai_client.beta.chat.completions.parse( - model="o3-mini-2025-01-31", - max_completion_tokens=4096, - messages=[{"role": "user", "content": prompt}], - response_format=HLEExtractedAnswer, - ) - - content = response.choices[0].message.parsed - - if not isinstance(content, HLEExtractedAnswer): - return "INCORRECT" + response = await openai_client.beta.chat.completions.parse( + model="o3-mini-2025-01-31", + max_completion_tokens=4096, + messages=[{"role": "user", "content": prompt}], + response_format=HLEExtractedAnswer, + ) - # Print HLE reasoning - print(f"LLM as Judge Reasoning: {content.reasoning}") - print(f"LLM as Judge Result: {content.correct}") - print(f"LLM as Judge Confidence: {content.confidence}%") + content = response.choices[0].message.parsed - # Convert HLE format to eval_utils format - if content.correct == "yes": - return "CORRECT" - else: - return "INCORRECT" + # Print HLE reasoning + print(f"LLM as Judge Reasoning: {content.reasoning}") + print(f"LLM as Judge Result: {content.correct}") + print(f"LLM as Judge Confidence: {content.confidence}%") - except Exception as e: - if "Incorrect API key 
provided" in str(e): - print(f"LLM evaluation failed: {e}") - os._exit(1) - print(f"LLM evaluation failed: {e}") - return "NOT_ATTEMPTED" + # Convert HLE format to eval_utils format + if content.correct == "yes": + return "CORRECT" + elif content.correct == "no": + return "INCORRECT" + else: + raise Exception(f"HLE LLM evaluation failed: {content}") async def verify_answer_gaia(target: str, predicted_answer: str) -> str: @@ -330,21 +385,35 @@ async def verify_answer_for_datasets( Verify the answer for a given dataset. """ - # for all questions, do gaia scorer first, if not return CORRECT, then do others - gaia_scorer_answer = await verify_answer_gaia(target, predicted_answer) + try: + # for all questions, do gaia scorer first, if not return CORRECT, then do others + gaia_scorer_answer = await verify_answer_gaia(target, predicted_answer) - if gaia_scorer_answer == "CORRECT": - return "CORRECT" + if gaia_scorer_answer == "CORRECT": + return "CORRECT" - elif "gaia-validation-text" not in benchmark_name and "gaia" in benchmark_name: - return gaia_scorer_answer + elif "gaia-validation-text" not in benchmark_name and "gaia" in benchmark_name: + return gaia_scorer_answer - elif benchmark_name == "simpleqa": - return await verify_answer_llm_simpleqa( - openai_client, question, target, predicted_answer - ) + elif benchmark_name == "simpleqa": + return await verify_answer_llm_simpleqa( + openai_client, question, target, predicted_answer + ) - else: - return await verify_answer_llm_hle( - openai_client, question, target, predicted_answer - ) + elif "xbench" in benchmark_name: + return await verify_answer_llm_xbench( + openai_client, question, target, predicted_answer + ) + + elif "browsecomp-zh" in benchmark_name: + return await verify_answer_llm_hle( + openai_client, question, target, predicted_answer + ) + + else: + return await verify_answer_llm_hle( + openai_client, question, target, predicted_answer + ) + except Exception as e: + print(f"Evaluation failed: {e}") + return 
"NOT_ATTEMPTED" diff --git a/apps/run-agent/main.py b/apps/run-agent/main.py index ac9e3b66..ba58e05f 100644 --- a/apps/run-agent/main.py +++ b/apps/run-agent/main.py @@ -1,15 +1,18 @@ -import dotenv -import fire -import hydra -from miroflow.logging.logger import bootstrap_logger -from miroflow.prebuilt.config import config_name, config_path, debug_config -from rich.traceback import install +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 import calculate_average_score import calculate_score_from_log import common_benchmark +import dotenv import eval_answer_from_log +import fire +import hydra import trace_single_task +from miroflow.logging.logger import bootstrap_logger +from miroflow.prebuilt.config import config_name, config_path, debug_config +from rich.traceback import install def print_config(*args): diff --git a/apps/run-agent/pyproject.toml b/apps/run-agent/pyproject.toml index fdf6a193..fc1a8711 100644 --- a/apps/run-agent/pyproject.toml +++ b/apps/run-agent/pyproject.toml @@ -10,6 +10,8 @@ dependencies = [ "miroflow>=0.1.0", "miroflow-contrib>=0.1.0", "miroflow-tool>=0.1.0", + "openpyxl>=3.1.5", + "pandas>=2.3.0", ] [build-system] diff --git a/apps/run-agent/scripts/claude-sonnet-3.7/run_evaluate_multiple_runs_gaia-validation.sh b/apps/run-agent/scripts/claude-sonnet-3.7/run_evaluate_multiple_runs_gaia-validation.sh index 6c2a82f0..499b9b27 100644 --- a/apps/run-agent/scripts/claude-sonnet-3.7/run_evaluate_multiple_runs_gaia-validation.sh +++ b/apps/run-agent/scripts/claude-sonnet-3.7/run_evaluate_multiple_runs_gaia-validation.sh @@ -4,12 +4,15 @@ # # SPDX-License-Identifier: Apache-2.0 -NUM_RUNS=3 +NUM_RUNS=1 +MAX_CONCURRENT=20 BENCHMARK_NAME="gaia-validation" LLM_PROVIDER="claude_openrouter" LLM_MODEL="anthropic/claude-3.7-sonnet" AGENT_SET="miroflow" -MAX_CONCURRENT=5 +ADD_MESSAGE_ID="true" # Set to true to add random message ID to all messages sent to LLM +MAX_TURNS=-1 +TEMPERATURE=0.3 
RESULTS_DIR="logs/${BENCHMARK_NAME}/${LLM_PROVIDER}_${LLM_MODEL}_${AGENT_SET}" @@ -31,11 +34,15 @@ for i in $(seq 1 $NUM_RUNS); do llm=claude_openrouter \ llm.provider=$LLM_PROVIDER \ llm.model_name=$LLM_MODEL \ + llm.temperature=$TEMPERATURE \ llm.async_client=true \ benchmark.execution.max_tasks=null \ benchmark.execution.max_concurrent=$MAX_CONCURRENT \ benchmark.execution.pass_at_k=1 \ agent=$AGENT_SET \ + agent.add_message_id=$ADD_MESSAGE_ID \ + agent.main_agent.max_turns=$MAX_TURNS \ + agent.sub_agents.agent-worker.max_turns=$MAX_TURNS \ output_dir="$RESULTS_DIR/$RUN_ID" \ > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1 diff --git a/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_browsecomp-zh.sh b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_browsecomp-zh.sh new file mode 100644 index 00000000..2fd70369 --- /dev/null +++ b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_browsecomp-zh.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + +# Configuration parameters - dual model configuration +NUM_RUNS=1 +MAX_CONCURRENT=50 +BENCHMARK_NAME="browsecomp-zh" +AGENT_SET="claude03_claude_dual" +ADD_MESSAGE_ID="true" # Set to true to add random message ID to all messages sent to LLM +MAX_TURNS=-1 + +# Automatically set Chinese context - if BENCHMARK_NAME contains xbench or -zh +if [[ $BENCHMARK_NAME == "xbench-ds" ]] || [[ $BENCHMARK_NAME == "browsecomp-zh" ]]; then + export CHINESE_CONTEXT="true" + echo "检测到中文相关基准测试,已启用中文上下文:CHINESE_CONTEXT=true" +fi + +# export REMOVE_SNIPPETS="true" +# export REMOVE_KNOWLEDGE_GRAPH="true" +# export REMOVE_ANSWER_BOX="true" + +RESULTS_DIR="logs/${BENCHMARK_NAME}/${AGENT_SET}" + +echo "Starting $NUM_RUNS runs of the evaluation..." 
+echo "Results will be saved in: $RESULTS_DIR" + +mkdir -p "$RESULTS_DIR" + +for i in $(seq 1 $NUM_RUNS); do + echo "==========================================" + echo "Launching experiment $i/$NUM_RUNS" + echo "==========================================" + + RUN_ID="run_$i" + + ( + uv run main.py common-benchmark \ + benchmark=$BENCHMARK_NAME \ + agent=$AGENT_SET \ + agent.add_message_id=$ADD_MESSAGE_ID \ + agent.main_agent.max_turns=$MAX_TURNS \ + agent.sub_agents.agent-worker.max_turns=$MAX_TURNS \ + benchmark.execution.max_tasks=null \ + benchmark.execution.max_concurrent=$MAX_CONCURRENT \ + benchmark.execution.pass_at_k=1 \ + output_dir="$RESULTS_DIR/$RUN_ID" \ + hydra.run.dir=${RESULTS_DIR}/$RUN_ID \ + > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1 + + if [ $? -eq 0 ]; then + echo "Run $i completed successfully" + RESULT_FILE=$(find "${RESULTS_DIR}/$RUN_ID" -name "*accuracy.txt" 2>/dev/null | head -1) + if [ -f "$RESULT_FILE" ]; then + echo "Results saved to $RESULT_FILE" + else + echo "Warning: Result file not found for run $i" + fi + else + echo "Run $i failed!" + fi + ) & + + sleep 2 +done + +echo "All $NUM_RUNS runs have been launched in parallel" +echo "Waiting for all runs to complete..." + +wait + +echo "==========================================" +echo "All $NUM_RUNS runs completed!" +echo "==========================================" + +echo "Calculating average scores..." +uv run main.py avg-score "$RESULTS_DIR" + +echo "==========================================" +echo "Multiple runs evaluation completed!" 
+echo "Check results in: $RESULTS_DIR" +echo "Check individual run logs: $RESULTS_DIR/run_*_output.log" +echo "==========================================" \ No newline at end of file diff --git a/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_browsecomp.sh b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_browsecomp.sh new file mode 100644 index 00000000..c1bdbc71 --- /dev/null +++ b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_browsecomp.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + +# Configuration parameters - dual model configuration +NUM_RUNS=1 +MAX_CONCURRENT=50 +BENCHMARK_NAME="browsecomp" +AGENT_SET="claude03_claude_dual" +ADD_MESSAGE_ID="true" # Set to true to add random message ID to all messages sent to LLM +MAX_TURNS=-1 + +# Automatically set Chinese context - if BENCHMARK_NAME contains xbench or -zh +if [[ $BENCHMARK_NAME == "xbench-ds" ]] || [[ $BENCHMARK_NAME == "browsecomp-zh" ]]; then + export CHINESE_CONTEXT="true" + echo "检测到中文相关基准测试,已启用中文上下文:CHINESE_CONTEXT=true" +fi + +# export REMOVE_SNIPPETS="true" +# export REMOVE_KNOWLEDGE_GRAPH="true" +# export REMOVE_ANSWER_BOX="true" + +RESULTS_DIR="logs/${BENCHMARK_NAME}/${AGENT_SET}" + +echo "Starting $NUM_RUNS runs of the evaluation..." 
+echo "Results will be saved in: $RESULTS_DIR" + +mkdir -p "$RESULTS_DIR" + +for i in $(seq 1 $NUM_RUNS); do + echo "==========================================" + echo "Launching experiment $i/$NUM_RUNS" + echo "==========================================" + + RUN_ID="run_$i" + + ( + uv run main.py common-benchmark \ + benchmark=$BENCHMARK_NAME \ + agent=$AGENT_SET \ + agent.add_message_id=$ADD_MESSAGE_ID \ + agent.main_agent.max_turns=$MAX_TURNS \ + agent.sub_agents.agent-worker.max_turns=$MAX_TURNS \ + benchmark.execution.max_tasks=null \ + benchmark.execution.max_concurrent=$MAX_CONCURRENT \ + benchmark.execution.pass_at_k=1 \ + output_dir="$RESULTS_DIR/$RUN_ID" \ + hydra.run.dir=${RESULTS_DIR}/$RUN_ID \ + > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1 + + if [ $? -eq 0 ]; then + echo "Run $i completed successfully" + RESULT_FILE=$(find "${RESULTS_DIR}/$RUN_ID" -name "*accuracy.txt" 2>/dev/null | head -1) + if [ -f "$RESULT_FILE" ]; then + echo "Results saved to $RESULT_FILE" + else + echo "Warning: Result file not found for run $i" + fi + else + echo "Run $i failed!" + fi + ) & + + sleep 2 +done + +echo "All $NUM_RUNS runs have been launched in parallel" +echo "Waiting for all runs to complete..." + +wait + +echo "==========================================" +echo "All $NUM_RUNS runs completed!" +echo "==========================================" + +echo "Calculating average scores..." +uv run main.py avg-score "$RESULTS_DIR" + +echo "==========================================" +echo "Multiple runs evaluation completed!" 
+echo "Check results in: $RESULTS_DIR" +echo "Check individual run logs: $RESULTS_DIR/run_*_output.log" +echo "==========================================" \ No newline at end of file diff --git a/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_gaia-test.sh b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_gaia-test.sh new file mode 100644 index 00000000..509e127b --- /dev/null +++ b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_gaia-test.sh @@ -0,0 +1,84 @@ +#!/bin/bash + +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + +# Configuration parameters - dual model configuration +NUM_RUNS=3 +MAX_CONCURRENT=20 +BENCHMARK_NAME="gaia-test" +AGENT_SET="claude03_claude_dual" +ADD_MESSAGE_ID="true" # Set to true to add random message ID to all messages sent to LLM +MAX_TURNS=-1 + +# Automatically set Chinese context - if BENCHMARK_NAME contains xbench or -zh +if [[ $BENCHMARK_NAME == "xbench-ds" ]] || [[ $BENCHMARK_NAME == "browsecomp-zh" ]]; then + export CHINESE_CONTEXT="true" + echo "检测到中文相关基准测试,已启用中文上下文:CHINESE_CONTEXT=true" +fi + +# export REMOVE_SNIPPETS="true" +# export REMOVE_KNOWLEDGE_GRAPH="true" +# export REMOVE_ANSWER_BOX="true" + +RESULTS_DIR="logs/${BENCHMARK_NAME}/${AGENT_SET}" + +echo "Starting $NUM_RUNS runs of the evaluation..." 
+echo "Results will be saved in: $RESULTS_DIR" + +mkdir -p "$RESULTS_DIR" + +for i in $(seq 1 $NUM_RUNS); do + echo "==========================================" + echo "Launching experiment $i/$NUM_RUNS" + echo "==========================================" + + RUN_ID="run_$i" + + ( + uv run main.py common-benchmark \ + benchmark=$BENCHMARK_NAME \ + agent=$AGENT_SET \ + agent.add_message_id=$ADD_MESSAGE_ID \ + agent.main_agent.max_turns=$MAX_TURNS \ + agent.sub_agents.agent-worker.max_turns=$MAX_TURNS \ + benchmark.execution.max_tasks=null \ + benchmark.execution.max_concurrent=$MAX_CONCURRENT \ + benchmark.execution.pass_at_k=1 \ + hydra.run.dir=${RESULTS_DIR}/$RUN_ID \ + > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1 + + if [ $? -eq 0 ]; then + echo "Run $i completed successfully" + RESULT_FILE=$(find "${RESULTS_DIR}/$RUN_ID" -name "*accuracy.txt" 2>/dev/null | head -1) + if [ -f "$RESULT_FILE" ]; then + echo "Results saved to $RESULT_FILE" + else + echo "Warning: Result file not found for run $i" + fi + else + echo "Run $i failed!" + fi + ) & + + sleep 2 +done + +echo "All $NUM_RUNS runs have been launched in parallel" +echo "Waiting for all runs to complete..." + +wait + +echo "==========================================" +echo "All $NUM_RUNS runs completed!" +echo "==========================================" + +echo "Calculating average scores..." +uv run main.py avg-score "$RESULTS_DIR" + +echo "==========================================" +echo "Multiple runs evaluation completed!" 
+echo "Check results in: $RESULTS_DIR" +echo "Check individual run logs: $RESULTS_DIR/run_*_output.log" +echo "==========================================" \ No newline at end of file diff --git a/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_gaia-validation.sh b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_gaia-validation.sh new file mode 100644 index 00000000..e20ffc1e --- /dev/null +++ b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_gaia-validation.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + +# Configuration parameters - dual model configuration +NUM_RUNS=3 +MAX_CONCURRENT=20 +BENCHMARK_NAME="gaia-validation" +AGENT_SET="claude03_claude_dual" +ADD_MESSAGE_ID="true" # Set to true to add random message ID to all messages sent to LLM +MAX_TURNS=-1 + +# Automatically set Chinese context - if BENCHMARK_NAME contains xbench or -zh +if [[ $BENCHMARK_NAME == "xbench-ds" ]] || [[ $BENCHMARK_NAME == "browsecomp-zh" ]]; then + export CHINESE_CONTEXT="true" + echo "检测到中文相关基准测试,已启用中文上下文:CHINESE_CONTEXT=true" +fi + +# export REMOVE_SNIPPETS="true" +# export REMOVE_KNOWLEDGE_GRAPH="true" +# export REMOVE_ANSWER_BOX="true" + +RESULTS_DIR="logs/${BENCHMARK_NAME}/${AGENT_SET}" + +echo "Starting $NUM_RUNS runs of the evaluation..." 
+echo "Results will be saved in: $RESULTS_DIR" + +mkdir -p "$RESULTS_DIR" + +for i in $(seq 1 $NUM_RUNS); do + echo "==========================================" + echo "Launching experiment $i/$NUM_RUNS" + echo "==========================================" + + RUN_ID="run_$i" + + ( + uv run main.py common-benchmark \ + benchmark=$BENCHMARK_NAME \ + agent=$AGENT_SET \ + agent.add_message_id=$ADD_MESSAGE_ID \ + agent.main_agent.max_turns=$MAX_TURNS \ + agent.sub_agents.agent-worker.max_turns=$MAX_TURNS \ + benchmark.execution.max_tasks=null \ + benchmark.execution.max_concurrent=$MAX_CONCURRENT \ + benchmark.execution.pass_at_k=1 \ + output_dir="$RESULTS_DIR/$RUN_ID" \ + hydra.run.dir=${RESULTS_DIR}/$RUN_ID \ + > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1 + + if [ $? -eq 0 ]; then + echo "Run $i completed successfully" + RESULT_FILE=$(find "${RESULTS_DIR}/$RUN_ID" -name "*accuracy.txt" 2>/dev/null | head -1) + if [ -f "$RESULT_FILE" ]; then + echo "Results saved to $RESULT_FILE" + else + echo "Warning: Result file not found for run $i" + fi + else + echo "Run $i failed!" + fi + ) & + + sleep 2 +done + +echo "All $NUM_RUNS runs have been launched in parallel" +echo "Waiting for all runs to complete..." + +wait + +echo "==========================================" +echo "All $NUM_RUNS runs completed!" +echo "==========================================" + +echo "Calculating average scores..." +uv run main.py avg-score "$RESULTS_DIR" + +echo "==========================================" +echo "Multiple runs evaluation completed!" 
+echo "Check results in: $RESULTS_DIR" +echo "Check individual run logs: $RESULTS_DIR/run_*_output.log" +echo "==========================================" \ No newline at end of file diff --git a/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_hle.sh b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_hle.sh new file mode 100644 index 00000000..a0c34326 --- /dev/null +++ b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_hle.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + +# Configuration parameters - dual model configuration +NUM_RUNS=1 +MAX_CONCURRENT=50 +BENCHMARK_NAME="hle" +AGENT_SET="deepseek_claude_dual" +ADD_MESSAGE_ID="true" # Set to true to add random message ID to all messages sent to LLM +MAX_TURNS=-1 + +# Automatically set Chinese context - if BENCHMARK_NAME contains xbench or -zh +if [[ $BENCHMARK_NAME == "xbench-ds" ]] || [[ $BENCHMARK_NAME == "browsecomp-zh" ]]; then + export CHINESE_CONTEXT="true" + echo "检测到中文相关基准测试,已启用中文上下文:CHINESE_CONTEXT=true" +fi + +# export REMOVE_SNIPPETS="true" +# export REMOVE_KNOWLEDGE_GRAPH="true" +# export REMOVE_ANSWER_BOX="true" + +RESULTS_DIR="logs/${BENCHMARK_NAME}/${AGENT_SET}" + +echo "Starting $NUM_RUNS runs of the evaluation..." 
+echo "Results will be saved in: $RESULTS_DIR" + +mkdir -p "$RESULTS_DIR" + +for i in $(seq 1 $NUM_RUNS); do + echo "==========================================" + echo "Launching experiment $i/$NUM_RUNS" + echo "==========================================" + + RUN_ID="run_$i" + + ( + uv run main.py common-benchmark \ + benchmark=$BENCHMARK_NAME \ + agent=$AGENT_SET \ + agent.add_message_id=$ADD_MESSAGE_ID \ + agent.main_agent.max_turns=$MAX_TURNS \ + agent.sub_agents.agent-worker.max_turns=$MAX_TURNS \ + benchmark.execution.max_tasks=null \ + benchmark.execution.max_concurrent=$MAX_CONCURRENT \ + benchmark.execution.pass_at_k=1 \ + output_dir="$RESULTS_DIR/$RUN_ID" \ + hydra.run.dir=${RESULTS_DIR}/$RUN_ID \ + > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1 + + if [ $? -eq 0 ]; then + echo "Run $i completed successfully" + RESULT_FILE=$(find "${RESULTS_DIR}/$RUN_ID" -name "*accuracy.txt" 2>/dev/null | head -1) + if [ -f "$RESULT_FILE" ]; then + echo "Results saved to $RESULT_FILE" + else + echo "Warning: Result file not found for run $i" + fi + else + echo "Run $i failed!" + fi + ) & + + sleep 2 +done + +echo "All $NUM_RUNS runs have been launched in parallel" +echo "Waiting for all runs to complete..." + +wait + +echo "==========================================" +echo "All $NUM_RUNS runs completed!" +echo "==========================================" + +echo "Calculating average scores..." +uv run main.py avg-score "$RESULTS_DIR" + +echo "==========================================" +echo "Multiple runs evaluation completed!" 
+echo "Check results in: $RESULTS_DIR" +echo "Check individual run logs: $RESULTS_DIR/run_*_output.log" +echo "==========================================" \ No newline at end of file diff --git a/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_noansbox_gaia-validation.sh b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_noansbox_gaia-validation.sh new file mode 100644 index 00000000..b51ed022 --- /dev/null +++ b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_noansbox_gaia-validation.sh @@ -0,0 +1,84 @@ +#!/bin/bash + +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + +# Configuration parameters - dual model configuration +NUM_RUNS=3 +MAX_CONCURRENT=20 +BENCHMARK_NAME="gaia-validation" +AGENT_SET="claude03_claude_dual" +ADD_MESSAGE_ID="true" # Set to true to add random message ID to all messages sent to LLM +MAX_TURNS=-1 + +# Automatically set Chinese context - if BENCHMARK_NAME contains xbench or -zh +if [[ $BENCHMARK_NAME == "xbench-ds" ]] || [[ $BENCHMARK_NAME == "browsecomp-zh" ]]; then + export CHINESE_CONTEXT="true" + echo "检测到中文相关基准测试,已启用中文上下文:CHINESE_CONTEXT=true" +fi + +# export REMOVE_SNIPPETS="true" +# export REMOVE_KNOWLEDGE_GRAPH="true" +export REMOVE_ANSWER_BOX="true" + +RESULTS_DIR="logs/${BENCHMARK_NAME}/noansbox_${AGENT_SET}" + +echo "Starting $NUM_RUNS runs of the evaluation..." 
+echo "Results will be saved in: $RESULTS_DIR" + +mkdir -p "$RESULTS_DIR" + +for i in $(seq 1 $NUM_RUNS); do + echo "==========================================" + echo "Launching experiment $i/$NUM_RUNS" + echo "==========================================" + + RUN_ID="run_$i" + + ( + uv run main.py common-benchmark \ + benchmark=$BENCHMARK_NAME \ + agent=$AGENT_SET \ + agent.add_message_id=$ADD_MESSAGE_ID \ + agent.main_agent.max_turns=$MAX_TURNS \ + agent.sub_agents.agent-worker.max_turns=$MAX_TURNS \ + benchmark.execution.max_tasks=null \ + benchmark.execution.max_concurrent=$MAX_CONCURRENT \ + benchmark.execution.pass_at_k=1 \ + hydra.run.dir=${RESULTS_DIR}/$RUN_ID \ + > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1 + + if [ $? -eq 0 ]; then + echo "Run $i completed successfully" + RESULT_FILE=$(find "${RESULTS_DIR}/$RUN_ID" -name "*accuracy.txt" 2>/dev/null | head -1) + if [ -f "$RESULT_FILE" ]; then + echo "Results saved to $RESULT_FILE" + else + echo "Warning: Result file not found for run $i" + fi + else + echo "Run $i failed!" + fi + ) & + + sleep 2 +done + +echo "All $NUM_RUNS runs have been launched in parallel" +echo "Waiting for all runs to complete..." + +wait + +echo "==========================================" +echo "All $NUM_RUNS runs completed!" +echo "==========================================" + +echo "Calculating average scores..." +uv run main.py avg-score "$RESULTS_DIR" + +echo "==========================================" +echo "Multiple runs evaluation completed!" 
+echo "Check results in: $RESULTS_DIR" +echo "Check individual run logs: $RESULTS_DIR/run_*_output.log" +echo "==========================================" \ No newline at end of file diff --git a/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_nohintreason_hle.sh b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_nohintreason_hle.sh new file mode 100644 index 00000000..96b46a3c --- /dev/null +++ b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_nohintreason_hle.sh @@ -0,0 +1,84 @@ +#!/bin/bash + +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + +# Configuration parameters - dual model configuration +NUM_RUNS=1 +MAX_CONCURRENT=50 +BENCHMARK_NAME="hle" +AGENT_SET="deepseek_nohintreason_claude_dual" +ADD_MESSAGE_ID="true" # Set to true to add random message ID to all messages sent to LLM +MAX_TURNS=-1 + +# Automatically set Chinese context - if BENCHMARK_NAME contains xbench or -zh +if [[ $BENCHMARK_NAME == "xbench-ds" ]] || [[ $BENCHMARK_NAME == "browsecomp-zh" ]]; then + export CHINESE_CONTEXT="true" + echo "检测到中文相关基准测试,已启用中文上下文:CHINESE_CONTEXT=true" +fi + +# export REMOVE_SNIPPETS="true" +# export REMOVE_KNOWLEDGE_GRAPH="true" +# export REMOVE_ANSWER_BOX="true" + +RESULTS_DIR="logs/${BENCHMARK_NAME}/${AGENT_SET}" + +echo "Starting $NUM_RUNS runs of the evaluation..." 
+echo "Results will be saved in: $RESULTS_DIR" + +mkdir -p "$RESULTS_DIR" + +for i in $(seq 1 $NUM_RUNS); do + echo "==========================================" + echo "Launching experiment $i/$NUM_RUNS" + echo "==========================================" + + RUN_ID="run_$i" + + ( + uv run main.py common-benchmark \ + benchmark=$BENCHMARK_NAME \ + agent=$AGENT_SET \ + agent.add_message_id=$ADD_MESSAGE_ID \ + agent.main_agent.max_turns=$MAX_TURNS \ + agent.sub_agents.agent-worker.max_turns=$MAX_TURNS \ + benchmark.execution.max_tasks=null \ + benchmark.execution.max_concurrent=$MAX_CONCURRENT \ + benchmark.execution.pass_at_k=1 \ + hydra.run.dir=${RESULTS_DIR}/$RUN_ID \ + > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1 + + if [ $? -eq 0 ]; then + echo "Run $i completed successfully" + RESULT_FILE=$(find "${RESULTS_DIR}/$RUN_ID" -name "*accuracy.txt" 2>/dev/null | head -1) + if [ -f "$RESULT_FILE" ]; then + echo "Results saved to $RESULT_FILE" + else + echo "Warning: Result file not found for run $i" + fi + else + echo "Run $i failed!" + fi + ) & + + sleep 2 +done + +echo "All $NUM_RUNS runs have been launched in parallel" +echo "Waiting for all runs to complete..." + +wait + +echo "==========================================" +echo "All $NUM_RUNS runs completed!" +echo "==========================================" + +echo "Calculating average scores..." +uv run main.py avg-score "$RESULTS_DIR" + +echo "==========================================" +echo "Multiple runs evaluation completed!" 
+echo "Check results in: $RESULTS_DIR" +echo "Check individual run logs: $RESULTS_DIR/run_*_output.log" +echo "==========================================" \ No newline at end of file diff --git a/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_xbench-ds.sh b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_xbench-ds.sh new file mode 100644 index 00000000..6fe184df --- /dev/null +++ b/apps/run-agent/scripts/main-worker-dual/run_evaluate_multiple_runs_xbench-ds.sh @@ -0,0 +1,84 @@ +#!/bin/bash + +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + +# Configuration parameters - dual model configuration +NUM_RUNS=3 +MAX_CONCURRENT=20 +BENCHMARK_NAME="xbench-ds" +AGENT_SET="claude03_claude_dual" +ADD_MESSAGE_ID="true" # Set to true to add random message ID to all messages sent to LLM +MAX_TURNS=-1 + +# Automatically set Chinese context - if BENCHMARK_NAME contains xbench or -zh +if [[ $BENCHMARK_NAME == "xbench-ds" ]] || [[ $BENCHMARK_NAME == "browsecomp-zh" ]]; then + export CHINESE_CONTEXT="true" + echo "检测到中文相关基准测试,已启用中文上下文:CHINESE_CONTEXT=true" +fi + +# export REMOVE_SNIPPETS="true" +# export REMOVE_KNOWLEDGE_GRAPH="true" +# export REMOVE_ANSWER_BOX="true" + +RESULTS_DIR="logs/${BENCHMARK_NAME}/${AGENT_SET}" + +echo "Starting $NUM_RUNS runs of the evaluation..." 
+echo "Results will be saved in: $RESULTS_DIR" + +mkdir -p "$RESULTS_DIR" + +for i in $(seq 1 $NUM_RUNS); do + echo "==========================================" + echo "Launching experiment $i/$NUM_RUNS" + echo "==========================================" + + RUN_ID="run_$i" + + ( + uv run main.py common-benchmark \ + benchmark=$BENCHMARK_NAME \ + agent=$AGENT_SET \ + agent.add_message_id=$ADD_MESSAGE_ID \ + agent.main_agent.max_turns=$MAX_TURNS \ + agent.sub_agents.agent-worker.max_turns=$MAX_TURNS \ + benchmark.execution.max_tasks=null \ + benchmark.execution.max_concurrent=$MAX_CONCURRENT \ + benchmark.execution.pass_at_k=1 \ + hydra.run.dir=${RESULTS_DIR}/$RUN_ID \ + > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1 + + if [ $? -eq 0 ]; then + echo "Run $i completed successfully" + RESULT_FILE=$(find "${RESULTS_DIR}/$RUN_ID" -name "*accuracy.txt" 2>/dev/null | head -1) + if [ -f "$RESULT_FILE" ]; then + echo "Results saved to $RESULT_FILE" + else + echo "Warning: Result file not found for run $i" + fi + else + echo "Run $i failed!" + fi + ) & + + sleep 2 +done + +echo "All $NUM_RUNS runs have been launched in parallel" +echo "Waiting for all runs to complete..." + +wait + +echo "==========================================" +echo "All $NUM_RUNS runs completed!" +echo "==========================================" + +echo "Calculating average scores..." +uv run main.py avg-score "$RESULTS_DIR" + +echo "==========================================" +echo "Multiple runs evaluation completed!" 
+echo "Check results in: $RESULTS_DIR" +echo "Check individual run logs: $RESULTS_DIR/run_*_output.log" +echo "==========================================" \ No newline at end of file diff --git a/apps/run-agent/summary_time_cost.py b/apps/run-agent/summary_time_cost.py index dba2e434..22f08937 100644 --- a/apps/run-agent/summary_time_cost.py +++ b/apps/run-agent/summary_time_cost.py @@ -82,7 +82,7 @@ def generate_summary(log_dir: Path): """ Generates a summary of benchmark results by reading log files from a directory, calculating total and average trace data, both overall and grouped by - llm_as_judge_result. + judge_result. Args: log_dir: The directory where the individual result log files are and where @@ -115,7 +115,7 @@ def generate_summary(log_dir: Path): _update_summary_data(overall_summary, perf_summary, tool_workload) # Update summary by judge result - judge_result = result.get("llm_as_judge_result", "unknown") + judge_result = result.get("judge_result", "unknown") _update_summary_data( summary_by_judge[judge_result], perf_summary, tool_workload ) diff --git a/apps/run-agent/trace_single_task.py b/apps/run-agent/trace_single_task.py index 9397a2fa..909f97d0 100644 --- a/apps/run-agent/trace_single_task.py +++ b/apps/run-agent/trace_single_task.py @@ -1,3 +1,7 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + import asyncio import logging import pathlib diff --git a/apps/run-agent/util_aggregate_results_xlsx.py b/apps/run-agent/util_aggregate_results_xlsx.py new file mode 100644 index 00000000..32df656f --- /dev/null +++ b/apps/run-agent/util_aggregate_results_xlsx.py @@ -0,0 +1,468 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + + +""" +Aggregate benchmark results from multiple runs into a single CSV file. 
+
+This script reads benchmark_results.jsonl files from different runs and aggregates
+the results into an Excel (.xlsx) format with task_id, task_question, ground_truth, and
+model_boxed_answer from each run.
+"""
+
+import json
+from pathlib import Path
+from typing import Dict, List
+import sys
+import pandas as pd
+from openpyxl.utils import get_column_letter
+
+# Configuration parameters - modify these as needed
+BASE_LOG_DIR = ""
+
+# Specify the results directories to aggregate (relative to BASE_LOG_DIR)
+# RESULTS_SUBDIR = "claude-sonnet-4_owl_set"
+RESULTS_BASE_DIR = f"{BASE_LOG_DIR}"
+
+# Output Excel file path (also in the shared directory)
+OUTPUT_EXCEL_FILENAME = "aggregated_results_0824_gaia.xlsx"
+# OUTPUT_EXCEL_FILENAME = f"{os.path.basename(BASE_LOG_DIR)}.xlsx"
+OUTPUT_EXCEL_PATH = f"{BASE_LOG_DIR}/{OUTPUT_EXCEL_FILENAME}"
+
+# Font color control switch
+# When set to True, all fonts are black; when set to False, different colors are displayed based on pass_at_k_success
+ALWAYS_BLACK_FONT = False
+
+
+def find_benchmark_results_files(base_dir: str) -> List[Path]:
+    """
+    Find all benchmark_results.jsonl files in the directory structure. 
+ + Args: + base_dir: Base directory to search in + + Returns: + List of paths to benchmark_results.jsonl files + """ + base_path = Path(base_dir) + if not base_path.exists(): + print(f"Error: Base directory {base_dir} does not exist") + sys.exit(1) + + results_files = [] + + # Look for run_* directories and their benchmark_results.jsonl files + for item in base_path.rglob("**/benchmark_results.jsonl"): + # Only include files that are in run_* directories + if any(parent.name.startswith("run_") for parent in item.parents): + results_files.append(item) + + if not results_files: + print(f"Error: No benchmark_results.jsonl files found in {base_dir}") + sys.exit(1) + + print(f"Found {len(results_files)} benchmark_results.jsonl files:") + for file_path in sorted(results_files): + print(f" {file_path}") + + return sorted(results_files) + + +def extract_run_number(file_path: Path) -> str: + """ + Extract run number from the file path. + + Args: + file_path: Path to the benchmark results file + + Returns: + Run identifier (e.g., "run_1", "run_2") + """ + for parent in file_path.parents: + if parent.name.startswith("run_"): + return parent.name + return "unknown_run" + + +def load_benchmark_results(file_path: Path) -> List[Dict]: + """ + Load benchmark results from a JSONL file. 
+ + Args: + file_path: Path to the JSONL file + + Returns: + List of benchmark result dictionaries + """ + results = [] + try: + with open(file_path, "r", encoding="utf-8") as f: + for line_num, line in enumerate(f, 1): + line = line.strip() + if line: + try: + result = json.loads(line) + results.append(result) + except json.JSONDecodeError as e: + print( + f"Warning: Error parsing JSON on line {line_num} in {file_path}: {e}" + ) + continue + except Exception as e: + print(f"Error reading file {file_path}: {e}") + return [] + + print(f"Loaded {len(results)} results from {file_path}") + return results + + +def aggregate_results( + results_files: List[Path], +) -> tuple[Dict[str, Dict], List[str], List[str]]: + """ + Aggregate results from multiple files. + + Args: + results_files: List of paths to benchmark results files + + Returns: + Tuple of (aggregated_data, run_ids, task_order_from_run1) + """ + aggregated = {} + all_run_ids = set() + task_order_from_run1 = [] + + for file_path in results_files: + run_id = extract_run_number(file_path) + all_run_ids.add(run_id) + + results = load_benchmark_results(file_path) + + # If this is run_1, capture the task order + if run_id == "run_1": + task_order_from_run1 = [ + result.get("task_id", "") + for result in results + if result.get("task_id", "") + ] + print(f"Captured task order from run_1: {len(task_order_from_run1)} tasks") + + for result in results: + task_id = result.get("task_id", "") + task_question = result.get("task_question", "") + ground_truth = result.get("ground_truth", "") + model_boxed_answer = result.get("model_boxed_answer", "") + pass_at_k_success = result.get("pass_at_k_success", False) + + if not task_id: + print(f"Warning: Missing task_id in result from {file_path}") + continue + + if task_id not in aggregated: + aggregated[task_id] = { + "task_id": task_id, + "task_question": task_question, + "ground_truth": ground_truth, + "runs": {}, + } + else: + # Verify that task_question and ground_truth are 
consistent across runs + if aggregated[task_id]["task_question"] != task_question: + print(f"Warning: Inconsistent task_question for task_id {task_id}") + if aggregated[task_id]["ground_truth"] != ground_truth: + print(f"Warning: Inconsistent ground_truth for task_id {task_id}") + + # Store the model_boxed_answer and pass_at_k_success for this run + aggregated[task_id]["runs"][run_id] = { + "model_boxed_answer": model_boxed_answer, + "pass_at_k_success": pass_at_k_success, + } + + print( + f"Aggregated results for {len(aggregated)} unique tasks across {len(all_run_ids)} runs" + ) + print(f"Run IDs found: {sorted(all_run_ids)}") + + return aggregated, sorted(all_run_ids), task_order_from_run1 + + +def write_excel( + aggregated_data: Dict[str, Dict], + run_ids: List[str], + task_order: List[str], + output_path: str, +): + """ + Write aggregated data to Excel file with conditional formatting. + + Args: + aggregated_data: Aggregated benchmark results + run_ids: List of run identifiers + task_order: List of task_ids in the order from run_1 + output_path: Path to output Excel file + """ + try: + # Prepare data for DataFrame + data_rows = [] + formatting_info = [] # Store formatting information + + # Use task_order from run_1 to maintain the same order + for row_idx, task_id in enumerate(task_order): + if task_id not in aggregated_data: + print( + f"Warning: task_id {task_id} from run_1 not found in aggregated data" + ) + continue + + task_data = aggregated_data[task_id] + + row = { + "task_id": task_data["task_id"], + "task_question": task_data["task_question"], + "ground_truth": task_data["ground_truth"], + } + + row_formatting = { + "row_idx": row_idx + 2, + "runs": {}, + } # +2 because Excel is 1-indexed and we have headers + + # Add model answers for each run + for run_id in run_ids: + run_data = task_data["runs"].get(run_id, {}) + if isinstance(run_data, dict): + answer = run_data.get("model_boxed_answer", "") + pass_at_k_success = run_data.get("pass_at_k_success", 
False) + else: + # Backward compatibility for old format + answer = run_data + pass_at_k_success = False + + # If answer is correct (pass_at_k_success=True) and ALWAYS_BLACK_FONT is False, leave blank; otherwise show the answer + if ALWAYS_BLACK_FONT: + display_answer = ( + answer # Always show the answer when ALWAYS_BLACK_FONT is True + ) + else: + display_answer = ( + "" if pass_at_k_success else answer + ) # Original logic + row[f"model_boxed_answer_{run_id}"] = display_answer + row_formatting["runs"][run_id] = pass_at_k_success + + data_rows.append(row) + formatting_info.append(row_formatting) + + # Calculate accuracy stats based on pass_at_k_success field + accuracy_stats = [] + for idx, run_id in enumerate(run_ids): + successes = 0 + total_tasks = len(data_rows) + + # Count based on pass_at_k_success from formatting_info + for fmt_info in formatting_info: + pass_at_k_success = fmt_info["runs"].get(run_id, False) + if pass_at_k_success: + successes += 1 + + accuracy = successes / total_tasks if total_tasks > 0 else 0 + accuracy_stats.append( + { + "run_id": run_id, + "successes": successes, + "total": total_tasks, + "accuracy": accuracy, + } + ) + + # Add accuracy stats to the data rows for inclusion in Excel + # Important: Use only plain text to avoid any formula interpretation + summary_rows = [] + summary_rows.append( + ["Accuracy Statistics", "", "", ""] + ) # Remove "===" which might be interpreted as formula + + for stat in accuracy_stats: + summary_rows.append( + [ + f"{stat['run_id']} Accuracy", # Remove colon which might cause issues + f"{stat['accuracy']:.2%}", + f"{stat['successes']} out of {stat['total']}", # Based on pass_at_k_success + "", + ] + ) + + summary_rows.append( + ["Total Tasks", str(len(data_rows)), "", ""] + ) # Convert to string + summary_rows.append(["Number of Runs", str(len(run_ids)), "", ""]) + + # Create initial DataFrame to get column names + df = pd.DataFrame(data_rows) + + # Create summary rows with proper column mapping + 
summary_data = [] + for row in summary_rows: + summary_dict = {} + col_names = list(df.columns) + for i, value in enumerate(row): + if i < len(col_names): + summary_dict[col_names[i]] = value + summary_data.append(summary_dict) + + # Combine main data with summary + all_data = data_rows + [{}] + summary_data # Add empty row as separator + df_final = pd.DataFrame(all_data) + + # Write to Excel using the safest possible method + try: + from openpyxl import Workbook + from openpyxl.styles import Font + + # Create a new workbook from scratch to avoid any pandas-related issues + wb = Workbook() + ws = wb.active + ws.title = "Aggregated Results" + + # Get column headers + headers = list(df_final.columns) + + # Write headers + for col_idx, header in enumerate(headers, 1): + cell = ws.cell(row=1, column=col_idx, value=str(header)) + cell.font = Font(bold=True) + + # Write data rows + for row_idx, (_, row) in enumerate(df_final.iterrows(), 2): + for col_idx, header in enumerate(headers, 1): + value = row[header] + if pd.isna(value): + value = "" + else: + value = str(value) # Convert everything to string + + cell = ws.cell(row=row_idx, column=col_idx, value=value) + + # Apply text format to avoid number interpretation issues + cell.number_format = "@" + + # Apply conditional formatting for model answer columns + if ( + col_idx >= 4 and row_idx <= len(data_rows) + 1 + ): # Only for data rows, not summary + # Find corresponding formatting info + data_row_idx = row_idx - 2 # Convert to 0-based data index + if data_row_idx < len(formatting_info): + fmt_info = formatting_info[data_row_idx] + run_idx = col_idx - 4 # Convert to run index + if run_idx < len(run_ids): + run_id = run_ids[run_idx] + pass_at_k_success = fmt_info["runs"].get(run_id, False) + + # Apply font color based on ALWAYS_BLACK_FONT setting + if ALWAYS_BLACK_FONT: + cell.font = Font(color="000000") # Always black + else: + if pass_at_k_success: + cell.font = Font(color="808080") # Light gray + else: + cell.font = 
Font(color="8B0000") # Dark red + + # Set column widths to 25 + for col_idx in range(1, len(headers) + 1): + column_letter = get_column_letter(col_idx) + ws.column_dimensions[column_letter].width = 25 + + # Set row height to 20 + for row_idx in range(1, ws.max_row + 1): + ws.row_dimensions[row_idx].height = 20 + + # Disable error checking to remove green triangles + ws.ignore_errors = True + + # Save the workbook + wb.save(output_path) + wb.close() + + print("Successfully created Excel file with custom formatting") + + except Exception as e: + print(f"Error creating Excel file: {e}") + # Ultimate fallback: basic pandas save + df_final.to_excel( + output_path, + sheet_name="Aggregated Results", + index=False, + engine="openpyxl", + ) + + print(f"Successfully wrote aggregated results to {output_path}") + print( + f"Excel file contains {len(data_rows)} tasks with answers from {len(run_ids)} runs" + ) + print("Task order matches run_1 order") + if ALWAYS_BLACK_FONT: + print("Applied font color: Always black (ALWAYS_BLACK_FONT=True)") + print( + "Content display: Always show model answers regardless of pass_at_k_success" + ) + else: + print( + "Applied conditional formatting: pass_at_k_success=True (light gray), False (dark red)" + ) + print("Content display: Hide answers when pass_at_k_success=True") + print("Added accuracy calculation formulas at the bottom") + + except Exception as e: + print(f"Error writing Excel file {output_path}: {e}") + sys.exit(1) + + +def main(): + """Main function to orchestrate the aggregation process.""" + print("=== Benchmark Results Aggregation Script ===") + print(f"Looking for results in: {RESULTS_BASE_DIR}") + print(f"Output Excel file will be saved to: {OUTPUT_EXCEL_PATH}") + print() + + # Find all benchmark results files + results_files = find_benchmark_results_files(RESULTS_BASE_DIR) + print() + + # Aggregate results from all files + aggregated_data, run_ids, task_order = aggregate_results(results_files) + print() + + # Write to 
Excel + write_excel(aggregated_data, run_ids, task_order, OUTPUT_EXCEL_PATH) + print() + + # Summary statistics + print("=== Summary ===") + print(f"Total unique tasks: {len(aggregated_data)}") + print(f"Total runs processed: {len(run_ids)}") + print(f"Runs: {', '.join(run_ids)}") + print("Task order preserved from: run_1") + + # Check for missing data + missing_count = 0 + for task_id, task_data in aggregated_data.items(): + for run_id in run_ids: + run_data = task_data["runs"].get(run_id, {}) + if not run_data or ( + isinstance(run_data, dict) + and not run_data.get("model_boxed_answer", "") + ): + missing_count += 1 + + if missing_count > 0: + print(f"Warning: {missing_count} missing model answers detected") + else: + print("All tasks have answers from all runs") + + print("Aggregation completed successfully!") + + +if __name__ == "__main__": + main() diff --git a/apps/run-agent/util_llm_parallel_thinking.py b/apps/run-agent/util_llm_parallel_thinking.py new file mode 100644 index 00000000..7b5ede5c --- /dev/null +++ b/apps/run-agent/util_llm_parallel_thinking.py @@ -0,0 +1,619 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + + +import asyncio +import glob +import json +import os +import sys +from typing import Dict, List, Any, Optional, Tuple, Literal + +from openai import AsyncOpenAI +from openai import APIError, APIConnectionError, RateLimitError, APITimeoutError +from pydantic import BaseModel +from tenacity import stop_after_attempt, wait_exponential +from tenacity.asyncio import AsyncRetrying + +from eval_utils import verify_answer_for_datasets +from dotenv import load_dotenv + +load_dotenv() + +OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY") + + +class ExtractedAnswer(BaseModel): + reasoning: str + final_answer: str + strict: Literal[True] = True # 100% reliability + + +# Constants +# BENCHMARK_NAME = "xbench-ds" # Benchmark name for evaluation + + +BENCHMARK_NAME = "gaia-validation" # Benchmark name for 
evaluation +RESULTS_DIRS = [""] + +DEFAULT_MODEL = "o3" +OPENAI_BASE_URL = "https://api.openai.com/v1" +MAX_RETRY_ATTEMPTS = 3 +RETRY_WAIT_MIN = 1 # seconds +RETRY_WAIT_MAX = 10 # seconds +MAX_CONCURRENT_REQUESTS = 25 # Maximum concurrent API requests +SEMAPHORE_TIMEOUT = 300 # Timeout for acquiring semaphore in seconds +VERBOSE = True + +# O3 Pricing (per 1M tokens) +O3_INPUT_PRICE = 2.00 # $2.00 per 1M input tokens +O3_CACHED_INPUT_PRICE = 0.50 # $0.50 per 1M cached input tokens +O3_OUTPUT_PRICE = 8.00 # $8.00 per 1M output tokens + + +def process_message_history(main_agent_message_history: Dict[str, Any]) -> str: + """Process and concatenate message history content.""" + try: + message_history = main_agent_message_history["message_history"] + + # Process the second-to-last message content + # preliminary_content = message_history[-2]["content"] + # preliminary_content = preliminary_content.replace("## Final Answer", "## Preliminary Answer") + + # Process the last message content + final_content = message_history[-1]["content"][0]["text"] + final_content = final_content.replace( + "O3 extracted final answer:", "## Final Answer Reasoning\n" + ) + + # Concatenate the two parts + # combined_content = preliminary_content + "\n\n" + final_content + combined_content = final_content + return combined_content + + except (KeyError, IndexError, TypeError) as e: + print(f"Warning: Could not process message history: {e}") + return "" + + +def extract_from_log( + run_dir: str, task_score_dict: Dict[str, List[Dict[str, Any]]] +) -> None: + """Extract task data from log files in a run directory.""" + try: + log_files = glob.glob(os.path.join(run_dir, "*attempt*")) + for log_file in log_files: + try: + task_id = log_file.split("/")[-1].split("_")[1] + with open(log_file, "r") as f: + data = json.load(f) + if task_id not in task_score_dict: + task_score_dict[task_id] = [] + task_score_dict[task_id].append( + # select some keys from data + { + "task_id": data["task_id"], + 
"task_name": data["task_name"], + "ground_truth": data["ground_truth"], + "final_boxed_answer": data["final_boxed_answer"], + "input": data["input"], + "agent_summary": process_message_history( + data["main_agent_message_history"] + ), + } + ) + except (json.JSONDecodeError, KeyError, IOError) as e: + print(f"Warning: Could not process log file {log_file}: {e}") + continue + except Exception as e: + print(f"Error processing run directory {run_dir}: {e}") + raise + + +async def select_best_solution( + prompt: str, + n_runs: int, + model: str = DEFAULT_MODEL, + semaphore: Optional[asyncio.Semaphore] = None, +) -> str: + """Select the best solution using LLM with retry logic and concurrency control.""" + + async def _make_api_call(): + """Make the actual API call with proper error handling.""" + api_key = OPENAI_API_KEY + + if not api_key: + raise ValueError("OPENAI_API_KEY environment variable not set") + + client = AsyncOpenAI( + base_url=OPENAI_BASE_URL, + api_key=api_key, + ) + + completion = await client.beta.chat.completions.parse( + model=model, + messages=[ + { + "role": "user", + "content": [ + {"type": "text", "text": prompt}, + ], + } + ], + response_format=ExtractedAnswer, + ) + + response = completion.choices[0].message.content + if not response: + raise ValueError("Empty response from API") + + # Parse the structured response + parsed_response = json.loads(response) + + # Get usage information + usage = completion.usage + return parsed_response, usage + + # Use semaphore for concurrency control if provided + if semaphore: + async with semaphore: + return await _retry_api_call(_make_api_call) + else: + return await _retry_api_call(_make_api_call) + + +async def _retry_api_call(api_call_func): + """Retry logic for API calls using AsyncRetrying.""" + async for attempt in AsyncRetrying( + stop=stop_after_attempt(MAX_RETRY_ATTEMPTS), + wait=wait_exponential(multiplier=1, min=RETRY_WAIT_MIN, max=RETRY_WAIT_MAX), + reraise=True, + ): + with attempt: + try: + 
return await api_call_func() + except ( + APIError, + APIConnectionError, + RateLimitError, + APITimeoutError, + ConnectionError, + ) as e: + print( + f"Retryable API error (attempt {attempt.retry_state.attempt_number}): {e}" + ) + raise # Let tenacity handle the retry + except Exception as e: + print(f"Non-retryable error in select_best_solution: {e}") + raise + + +def load_task_data(results_dir: str) -> Dict[str, List[Dict[str, Any]]]: + """Load task data from all run directories.""" + run_dirs = glob.glob(os.path.join(results_dir, "run_*")) + run_dirs = [d for d in run_dirs if os.path.isdir(d)] + + task_score_dict: Dict[str, List[Dict[str, Any]]] = {} + for run_dir in run_dirs: + extract_from_log(run_dir, task_score_dict) + + return task_score_dict + + +def load_combined_task_data( + results_dirs: List[str], +) -> Tuple[Dict[str, List[Dict[str, Any]]], int]: + """Load and combine task data from multiple result directories.""" + combined_dict: Dict[str, List[Dict[str, Any]]] = {} + total_runs = 0 + + for results_dir in results_dirs: + if not os.path.exists(results_dir): + print(f"Warning: Skipping non-existent directory: {results_dir}") + continue + task_data = load_task_data(results_dir) + run_count = len( + [ + d + for d in os.listdir(results_dir) + if os.path.isdir(os.path.join(results_dir, d)) and d.startswith("run_") + ] + ) + total_runs += run_count + + for task_id, data_list in task_data.items(): + if task_id not in combined_dict: + combined_dict[task_id] = [] + combined_dict[task_id].extend(data_list) + + return combined_dict, total_runs + + +def create_parallel_thinking_gaia_prompt( + task_data: List[Dict[str, Any]], n_runs: int +) -> str: + """Create prompt for parallel thinking for GAIA benchmark.""" + # prompt = f"""You are an expert evaluator. Your task is to analyze multiple answers to a question and determine the final answer based on majority vote. 
+ + # Question: + # {task_data[0]["input"]} + + # Refer to the following {n_runs} solutions and select the best solution. Make sure the answer is in `\\boxed{{}}`. + # """ + # Generate agent summaries with markdown formatting + agent_summaries = [] + for i, d in enumerate(task_data, 1): + agent_summaries.append( + f"**Agent Summary {i}:**\n```markdown\n{d['agent_summary']}\n```" + ) + + prompt = f"""You are an expert evaluator working with me to determine the best answer from multiple agent summaries. I need your help to analyze these detailed summaries and extract the final answers to determine the best solution. + +Question: {task_data[0]["input"]} + +Agent Summaries: +{"\n\n".join(agent_summaries)} + +Here's how we can approach this together: + +**Understanding Equivalence:** +I'd like you to group answers that are equivalent according to these precise normalization rules: + +For numerical answers: + - Remove symbols "$", "%", and "," then convert to float numbers and compare + - Examples: "1.5" equals "1.50", "$1,000" equals "1000", "50%" equals "50" + - Must be exactly equal as float numbers after normalization + +For text answers (single text, not lists): + - Remove all spaces and punctuation, convert to lowercase, then compare + - Examples: "sea gull" equals "seagull", "New York!" equals "newyork" + - Note: "NYC" ≠ "New York City" (becomes "nyc" vs "newyorkcity" - different words) + +For list answers (containing commas or semicolons): + - Split into elements, lists must have same length + - Compare elements in the same position + - For each element: if it's a number, use number rules; if text, remove spaces only (keep punctuation), convert to lowercase + - All corresponding elements must match + +**Important:** Valid answers always exist for these questions. Ignore responses containing "none", "not found", or similar expressions indicating no answer exists. 
+ +**Special Interpret Rules:** +- **Long List Rule**: Any numerical list with >10 elements should be interpreted as a single summary number (sum, count, average, etc.) for equivalence comparison, regardless of question wording. +- **Instruction Priority Rule**: If agent summary mentions a conditional-vs-unconditional issue in Potential Weaknesses, correct the answer by applying unconditional instructions over conditional instructions. +- **Percentage Comparison Rule**: If the answer needs to compare two percentages (percent above or below), calculate the direct difference between percentages, not the relative change ratio. +- **Percentage Trivial Rule**: If the question asks for a percentage answer, discard trivial answers like 0 or 100 as they are unlikely to be correct. +- **Birthplace Rule**: If the answer is related to birthplaces, use the historical names at the time of birth, not the current names. +- **Time Distinction Rule**: Pay attention to time references in questions, especially distinguishing between release time and award time, as they are typically in different years. + +**Your Task:** +Please analyze these agent summaries thoughtfully and follow these steps: +1. **Extract answers**: Extract the final answer from each agent summary (look for \\boxed{{}} format, extract only the content inside the braces) +2. **Apply interpret rules**: Apply the Special Interpret Rules above to each answer. ONLY use the specific interpret rules I provided. If any answer requires correction, apply the same correction to all agents with the same original answer, regardless of their process quality. +3. **Assess information gathering**: Cross-compare all agent summaries to identify any agents with clear information gathering defects. Mainly Focus on the information collection process, unless there is evidence of critical inaccuracy in the gathered information. 
ONLY exclude agents if you find obvious signs of: (a) critically incomplete information gathering that other agents successfully obtained, or (b) clear information bias where the agent accessed wrong sources while others accessed correct ones. Be extremely conservative - when in doubt, include the agent. +4. **Group answers**: Group answers that are equivalent (using both original form and interpreted form for comparison) from reliable agents only +5. **Count support**: Count support for each group +6. **Select winner**: Choose the group with the most support, then select one exact original answer from that winning group + +**IMPORTANT:** Your final_answer must be exactly one of the original answer contents that were inside the \\boxed{{}} sections, with NO modifications and NO \\boxed{{}} wrapper. + +Please respond in JSON format: +{{ + "reasoning": "Show your step-by-step analysis: (1) extracted answers, (2) interpret rules applications to each answer, (3) information gathering assessment and agent exclusions, (4) grouping with equivalence explanations, (5) group support counts, (6) winning group selection", + "final_answer": "The exact content from inside one of the \\boxed{{}} sections" +}} +""" + # for i, d in enumerate(task_data): + # prompt += f""" + # {'-'*100} + # Solution {i+1}: + # {d["main_agent_message_history"]["message_history"][-2]["content"]} + # {d["main_agent_message_history"]["message_history"][-1]["content"][0]["text"]} + # """ + return prompt + + +def create_parallel_thinking_xbench_prompt( + task_data: List[Dict[str, Any]], n_runs: int +) -> str: + """Create prompt for parallel thinking for XBench benchmark (Chinese).""" + # Generate agent summaries with markdown formatting + agent_summaries = [] + for i, d in enumerate(task_data, 1): + agent_summaries.append( + f"**智能体总结 {i}:**\n```markdown\n{d['agent_summary']}\n```" + ) + + prompt = f"""你是一位专业的评估专家,我需要你帮助我从多个智能体总结中确定最佳答案。请分析这些详细总结并提取最终答案来确定最佳解决方案。 + +问题:{task_data[0]["input"]} + +智能体总结: 
+{"\n\n".join(agent_summaries)} + +让我们一起来解决这个问题: + +**等价性理解:** +按照XBench评判标准对答案进行等价分组: + +- 只关注答案之间是否存在**实质性差异**,不要过度关注格式细节 +- 对于数值题目,允许**可接受的微小误差范围**内的差异 +- 如果答案在语义上**一致**,即使表述方式不同也视为等价 +- 存在任何**不一致、歧义、不等价**的情况则视为不同答案 +- 不要重新解题或为不同答案辩护,专注于判断答案是否一致 + +**重要:** 这些问题总是存在有效答案。忽略包含"无"、"未找到"或类似表示无答案的回答。 + +**数值答案要求:** 对于数值答案,避免使用"约"、"大约"等模糊表达,即使题目中有此类询问也要给出尽可能的精确值。 + +**特殊解释规则:** +- **地点明确规则**:对于地点名称,总是主动添加地理位置、所属城市、区域、线路等限定词来增强表述的准确性和完整性。 +- **数值单位规则**:数值答案应包含数量单位,具体采用何种表述形式以及单位,应根据问题的具体问法和要求以及中文日常表述习惯来决定。 + - 注意”列车“是名词,不是数量单位,”辆“是其数量单位。 +- **序数词规则**:当答案涉及排序或顺序时,使用"第一"、"第二"等完整的序数词表述。 + +**你的任务:** +请仔细分析这些智能体总结并按照以下步骤进行: +1. **提取答案**:从每个智能体总结中提取最终答案(寻找\\boxed{{}}格式,只提取大括号内的内容) +2. **应用解释规则**:对每个答案应用上述特殊解释规则。仅使用我提供的特定解释规则。如果任何答案需要纠正,对所有具有相同原始答案的智能体应用相同纠正,无论其过程质量如何。 +3. **评估信息收集**:交叉比较所有智能体总结,识别任何具有明显信息收集缺陷的智能体。主要关注信息收集过程,不关注答案的正确性。仅在发现明显迹象时排除智能体:(a) 其他智能体成功获得的关键不完整信息收集,或 (b) 智能体访问错误来源而其他智能体访问正确来源的明显信息偏差。极其保守——有疑虑时包含智能体。如果因为问题答案而排除任何智能体,排除所有具有相同答案的智能体。 +4. **分组答案**:仅对可靠智能体的答案进行等价分组(使用原始形式和解释形式进行比较) +5. **计算支持**:计算每组的支持数 +6. 
**选择获胜者**:选择支持最多的组,然后将获胜答案改写为简短的表述形式(短语或词汇,禁止完整句子,禁止描述性或解释性表述) + +**重要:** 你的最终答案必须是尽可能简短的表述,使用短语或词汇,严禁使用完整句子、描述性或解释性内容。遵循上述特殊解释规则。 + +请用JSON格式回复: +{{ + "reasoning": "显示你的逐步分析:(1) 提取的答案,(2) 对每个答案的解释规则应用,(3) 信息收集评估和智能体排除,(4) 分组和等价性解释,(5) 组支持计数,(6) 获胜组选择", + "final_answer": "简短的表述(短语或词汇,禁止完整句子,遵循特殊解释规则)" +}} +""" + return prompt + + +async def process_single_task( + task_id: str, data: List[Dict[str, Any]], n_runs: int, semaphore: asyncio.Semaphore +) -> Tuple[str, Dict[str, Any], Any]: + """Process a single task and return its result.""" + # Choose prompt function based on benchmark + if "xbench" in BENCHMARK_NAME: + prompt = create_parallel_thinking_xbench_prompt(data, n_runs) + elif "gaia" in BENCHMARK_NAME: + prompt = create_parallel_thinking_gaia_prompt(data, n_runs) + else: + raise ValueError(f"Unsupported benchmark name: {BENCHMARK_NAME}") + + response, usage = await select_best_solution(prompt, n_runs, semaphore=semaphore) + selected_solution = response["final_answer"] + reasoning = response["reasoning"] + client = AsyncOpenAI( + base_url=OPENAI_BASE_URL, + api_key=OPENAI_API_KEY, + ) + + result = await verify_answer_for_datasets( + client, BENCHMARK_NAME, "", data[0]["ground_truth"], selected_solution + ) + + task_result = { + "task_id": task_id, + "candidate_answers": [d["final_boxed_answer"] for d in data], + "task_input": data[0]["input"], + "prompt_input": prompt, + "ground_truth": data[0]["ground_truth"], + "selected_solution": selected_solution, + "selected_solution_result": result, + "selected_solution_reasoning": reasoning, + } + + return task_id, task_result, usage + + +async def process_tasks( + task_score_dict: Dict[str, List[Dict[str, Any]]], + n_runs: int, + max_concurrent_requests: int = MAX_CONCURRENT_REQUESTS, +) -> Dict[str, Dict[str, Any]]: + """Process all tasks concurrently and select best solutions.""" + # Create semaphore for rate limiting + semaphore = asyncio.Semaphore(max_concurrent_requests) + + # Create tasks for concurrent 
execution + tasks = [ + process_single_task(task_id, data, n_runs, semaphore) + for task_id, data in task_score_dict.items() + ] + + total_tasks = len(tasks) + print( + f"Processing {total_tasks} tasks concurrently (max {max_concurrent_requests} concurrent requests)..." + ) + + # Process tasks and show progress as they complete + task_results: Dict[str, Dict[str, Any]] = {} + completed_tasks = 0 + + # Token usage tracking + total_input_tokens = 0 + total_cached_input_tokens = 0 + total_output_tokens = 0 + + for coro in asyncio.as_completed(tasks): + try: + result = await coro + task_id, task_result, usage = result + task_results[task_id] = task_result + completed_tasks += 1 + + # Update token usage + if usage: + total_input_tokens += getattr(usage, "prompt_tokens", 0) + total_cached_input_tokens += ( + getattr(usage, "prompt_tokens_details", {}).cached_tokens + if hasattr(usage, "prompt_tokens_details") + else 0 + ) + total_output_tokens += getattr(usage, "completion_tokens", 0) + + # Show progress indicator + progress_percent = (completed_tasks / total_tasks) * 100 + if VERBOSE: + print( + f"Progress: {completed_tasks}/{total_tasks} ({progress_percent:.1f}%) - Completed task: {task_id}" + ) + print( + f" Tokens: Input={total_input_tokens-total_cached_input_tokens}, Cached={total_cached_input_tokens}, Output={total_output_tokens}" + ) + input_cost = ( + (total_input_tokens - total_cached_input_tokens) + * O3_INPUT_PRICE + / 1_000_000 + ) + cached_input_cost = ( + total_cached_input_tokens * O3_CACHED_INPUT_PRICE / 1_000_000 + ) + output_cost = total_output_tokens * O3_OUTPUT_PRICE / 1_000_000 + total_cost = input_cost + cached_input_cost + output_cost + print( + f" Costs: Input=${input_cost:.4f}, Cached=${cached_input_cost:.4f}, Output=${output_cost:.4f}, Total=${total_cost:.4f}" + ) + + except Exception as e: + completed_tasks += 1 + progress_percent = (completed_tasks / total_tasks) * 100 + if VERBOSE: + print( + f"Progress: {completed_tasks}/{total_tasks} 
({progress_percent:.1f}%) - Error processing task: {e}" + ) + print( + f" Tokens: Input={total_input_tokens-total_cached_input_tokens}, Cached={total_cached_input_tokens}, Output={total_output_tokens}" + ) + input_cost = ( + (total_input_tokens - total_cached_input_tokens) + * O3_INPUT_PRICE + / 1_000_000 + ) + cached_input_cost = ( + total_cached_input_tokens * O3_CACHED_INPUT_PRICE / 1_000_000 + ) + output_cost = total_output_tokens * O3_OUTPUT_PRICE / 1_000_000 + total_cost = input_cost + cached_input_cost + output_cost + print( + f" Costs: Input=${input_cost:.4f}, Cached=${cached_input_cost:.4f}, Output=${output_cost:.4f}, Total=${total_cost:.4f}" + ) + # Continue with other tasks instead of failing completely + continue + + # Final pricing summary + final_input_cost = ( + (total_input_tokens - total_cached_input_tokens) * O3_INPUT_PRICE / 1_000_000 + ) + final_cached_input_cost = ( + total_cached_input_tokens * O3_CACHED_INPUT_PRICE / 1_000_000 + ) + final_output_cost = total_output_tokens * O3_OUTPUT_PRICE / 1_000_000 + final_total_cost = final_input_cost + final_cached_input_cost + final_output_cost + + print(f"Successfully processed {len(task_results)} out of {total_tasks} tasks") + print("\n=== FINAL PRICING SUMMARY ===") + print( + f"Total Input Tokens: {total_input_tokens-total_cached_input_tokens:,} (${final_input_cost:.4f})" + ) + print( + f"Total Cached Input Tokens: {total_cached_input_tokens:,} (${final_cached_input_cost:.4f})" + ) + print(f"Total Output Tokens: {total_output_tokens:,} (${final_output_cost:.4f})") + print(f"TOTAL COST: ${final_total_cost:.4f}") + print("============================\n") + + return task_results + + +def save_results( + results_dir: str, task_results: Dict[str, Dict[str, Any]], n_runs: int +) -> None: + """Save results to files.""" + try: + # Save detailed results + results_file = os.path.join( + results_dir, f"llm_parallel_thinking_{n_runs}runs.json" + ) + with open(results_file, "w") as f: + json.dump(task_results, 
f, ensure_ascii=False, indent=4) + + # Calculate and save accuracy + correct_count = sum( + 1 + for data in task_results.values() + if data["selected_solution_result"] == "CORRECT" + ) + accuracy = correct_count / len(task_results) if task_results else 0.0 + + print(f"Accuracy: {accuracy}") + + accuracy_file = os.path.join( + results_dir, f"llm_parallel_thinking_accuracy_{n_runs}runs.txt" + ) + with open(accuracy_file, "w") as f: + f.write(f"Accuracy: {accuracy}") + + except IOError as e: + print(f"Error saving results: {e}") + raise + + +async def main( + results_dir: str, max_concurrent_requests: int = MAX_CONCURRENT_REQUESTS +) -> None: + """Main function to analyze results and select best solutions.""" + if not os.path.exists(results_dir): + print(f"Results directory does not exist: {results_dir}") + sys.exit(1) + + print(f"Analyzing results from: {results_dir}") + + # Load task data from all runs + task_score_dict = load_task_data(results_dir) + if not task_score_dict: + print("No task data found") + return + + # Get number of runs + run_dirs = glob.glob(os.path.join(results_dir, "run_*")) + n_runs = len([d for d in run_dirs if os.path.isdir(d)]) + + # Process all tasks + task_results = await process_tasks(task_score_dict, n_runs, max_concurrent_requests) + + # Save results + save_results(results_dir, task_results, n_runs) + + +if __name__ == "__main__": + max_concurrent_requests = MAX_CONCURRENT_REQUESTS + + # Use single or multiple directory mode based on whether results_dirs is defined above + results_dirs = RESULTS_DIRS + + if results_dirs: + # Multiple directories mode + combined_dict, total_runs = load_combined_task_data(results_dirs) + if not combined_dict: + print("No task data found") + sys.exit(1) + print( + f"Loaded {len(combined_dict)} tasks with {total_runs} total runs from {len(results_dirs)} directories" + ) + + async def main_combined(): + task_results = await process_tasks( + combined_dict, total_runs, max_concurrent_requests + ) + 
save_results(os.path.dirname(results_dirs[0]), task_results, total_runs) + + asyncio.run(main_combined()) + else: + # Single directory mode + # asyncio.run(main(results_dir, max_concurrent_requests)) + pass diff --git a/apps/run-agent/util_llm_simple_voting.py b/apps/run-agent/util_llm_simple_voting.py new file mode 100644 index 00000000..7b6cee2f --- /dev/null +++ b/apps/run-agent/util_llm_simple_voting.py @@ -0,0 +1,457 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + + +import asyncio +import glob +import json +import os +import sys +from typing import Dict, List, Any, Optional, Tuple, Literal + +from openai import AsyncOpenAI +from openai import APIError, APIConnectionError, RateLimitError, APITimeoutError +from pydantic import BaseModel +from tenacity import stop_after_attempt, wait_exponential +from tenacity.asyncio import AsyncRetrying + +from eval_utils import verify_answer_for_datasets +from dotenv import load_dotenv + +load_dotenv() + +OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY") + + +class ExtractedAnswer(BaseModel): + reasoning: str + final_answer: str + strict: Literal[True] = True # 100% reliability + + +# Constants +BENCHMARK_NAME = "gaia-validation" # Benchmark name for evaluation +RESULTS_DIRS = [""] + +DEFAULT_MODEL = "o3" +OPENAI_BASE_URL = "https://api.openai.com/v1" +MAX_RETRY_ATTEMPTS = 3 +RETRY_WAIT_MIN = 1 # seconds +RETRY_WAIT_MAX = 10 # seconds +MAX_CONCURRENT_REQUESTS = 25 # Maximum concurrent API requests +SEMAPHORE_TIMEOUT = 300 # Timeout for acquiring semaphore in seconds +VERBOSE = True + + +def process_message_history(main_agent_message_history: Dict[str, Any]) -> str: + """Process and concatenate message history content.""" + try: + message_history = main_agent_message_history["message_history"] + + # Process the second-to-last message content + # preliminary_content = message_history[-2]["content"] + # preliminary_content = preliminary_content.replace("## Final Answer", "## Preliminary 
Answer") + + # Process the last message content + final_content = message_history[-1]["content"][0]["text"] + final_content = final_content.replace( + "O3 extracted final answer:", "## Final Answer Reasoning\n" + ) + + # Concatenate the two parts + # combined_content = preliminary_content + "\n\n" + final_content + combined_content = final_content + return combined_content + + except (KeyError, IndexError, TypeError) as e: + print(f"Warning: Could not process message history: {e}") + return "" + + +def extract_from_log( + run_dir: str, task_score_dict: Dict[str, List[Dict[str, Any]]] +) -> None: + """Extract task data from log files in a run directory.""" + try: + log_files = glob.glob(os.path.join(run_dir, "*attempt*")) + for log_file in log_files: + try: + task_id = log_file.split("/")[-1].split("_")[1] + with open(log_file, "r") as f: + data = json.load(f) + if task_id not in task_score_dict: + task_score_dict[task_id] = [] + task_score_dict[task_id].append( + # select some keys from data + { + "task_id": data["task_id"], + "task_name": data["task_original_name"], + "ground_truth": data["ground_truth"], + "final_boxed_answer": data["final_boxed_answer"], + "input": data["input"], + "agent_summary": process_message_history( + data["main_agent_message_history"] + ), + } + ) + except (json.JSONDecodeError, KeyError, IOError) as e: + print(f"Warning: Could not process log file {log_file}: {e}") + continue + except Exception as e: + print(f"Error processing run directory {run_dir}: {e}") + raise + + +async def select_best_solution( + prompt: str, + n_runs: int, + model: str = DEFAULT_MODEL, + semaphore: Optional[asyncio.Semaphore] = None, +) -> str: + """Select the best solution using LLM with retry logic and concurrency control.""" + + async def _make_api_call(): + """Make the actual API call with proper error handling.""" + api_key = OPENAI_API_KEY + + if not api_key: + raise ValueError("OPENAI_API_KEY environment variable not set") + + client = AsyncOpenAI( + 
base_url=OPENAI_BASE_URL, + api_key=api_key, + ) + + completion = await client.beta.chat.completions.parse( + model=model, + messages=[ + { + "role": "user", + "content": [ + {"type": "text", "text": prompt}, + ], + } + ], + response_format=ExtractedAnswer, + ) + + response = completion.choices[0].message.content + if not response: + raise ValueError("Empty response from API") + + # Parse the structured response + parsed_response = json.loads(response) + return parsed_response + + # Use semaphore for concurrency control if provided + if semaphore: + async with semaphore: + return await _retry_api_call(_make_api_call) + else: + return await _retry_api_call(_make_api_call) + + +async def _retry_api_call(api_call_func): + """Retry logic for API calls using AsyncRetrying.""" + async for attempt in AsyncRetrying( + stop=stop_after_attempt(MAX_RETRY_ATTEMPTS), + wait=wait_exponential(multiplier=1, min=RETRY_WAIT_MIN, max=RETRY_WAIT_MAX), + reraise=True, + ): + with attempt: + try: + return await api_call_func() + except ( + APIError, + APIConnectionError, + RateLimitError, + APITimeoutError, + ConnectionError, + ) as e: + print( + f"Retryable API error (attempt {attempt.retry_state.attempt_number}): {e}" + ) + raise # Let tenacity handle the retry + except Exception as e: + print(f"Non-retryable error in select_best_solution: {e}") + raise + + +def load_task_data(results_dir: str) -> Dict[str, List[Dict[str, Any]]]: + """Load task data from all run directories.""" + run_dirs = glob.glob(os.path.join(results_dir, "run_*")) + run_dirs = [d for d in run_dirs if os.path.isdir(d)] + + task_score_dict: Dict[str, List[Dict[str, Any]]] = {} + for run_dir in run_dirs: + extract_from_log(run_dir, task_score_dict) + + return task_score_dict + + +def load_combined_task_data( + results_dirs: List[str], +) -> Tuple[Dict[str, List[Dict[str, Any]]], int]: + """Load and combine task data from multiple result directories.""" + combined_dict: Dict[str, List[Dict[str, Any]]] = {} + 
total_runs = 0 + + for results_dir in results_dirs: + if not os.path.exists(results_dir): + print(f"Warning: Skipping non-existent directory: {results_dir}") + continue + task_data = load_task_data(results_dir) + run_count = len( + [ + d + for d in os.listdir(results_dir) + if os.path.isdir(os.path.join(results_dir, d)) and d.startswith("run_") + ] + ) + total_runs += run_count + + for task_id, data_list in task_data.items(): + if task_id not in combined_dict: + combined_dict[task_id] = [] + combined_dict[task_id].extend(data_list) + + return combined_dict, total_runs + + +def create_selection_gaia_prompt(task_data: List[Dict[str, Any]], n_runs: int) -> str: + """Create prompt for solution selection.""" + # prompt = f"""You are an expert evaluator. Your task is to analyze multiple answers to a question and determine the final answer based on majority vote. + + # Question: + # {task_data[0]["input"]} + + # Refer to the following {n_runs} solutions and select the best solution. Make sure the answer is in `\\boxed{{}}`. + # """ + # answers_text = ";".join([d["final_boxed_answer"] for d in task_data]) + answers_text = [f"{d['final_boxed_answer']}" for d in task_data] + prompt = f"""You are an expert evaluator working with me to determine the best answer from multiple responses. I need your help to identify which answers are equivalent and then select the most frequently occurring one. 
+ +Question: {task_data[0]["input"]} + +Multiple Answers: +{answers_text} + +Here's how we can approach this together: + +**Understanding Equivalence:** +I'd like you to group answers that are equivalent according to these precise normalization rules: + +For numerical answers: + - Remove symbols "$", "%", and "," then convert to float numbers and compare + - Examples: "1.5" equals "1.50", "$1,000" equals "1000", "50%" equals "50" + - Must be exactly equal as float numbers after normalization + +For text answers (single text, not lists): + - Remove all spaces and punctuation, convert to lowercase, then compare + - Examples: "sea gull" equals "seagull", "New York!" equals "newyork" + - Note: "NYC" ≠ "New York City" (becomes "nyc" vs "newyorkcity" - different words) + +For list answers (containing commas or semicolons): + - Split into elements, lists must have same length + - Compare elements in the same position + - For each element: if it's a number, use number rules; if text, remove spaces only (keep punctuation), convert to lowercase + - All corresponding elements must match + - [Rule]: Questions shouldn't require pure numerical lists with >10 elements. If you see long numerical lists, the question likely expects a single number (e.g., sum, conversion). Interpret based on question intent and convert list to a single number before comparing equivalence. + +**Important:** Valid answers always exist for these questions. Ignore responses containing "none", "not found", or similar expressions indicating no answer exists. + +**Your Task:** +Please analyze these answers thoughtfully and: +1. Group the answers that you determine are equivalent +2. Identify which group appears most frequently +3. Select the clearest representative answer from the winning group +4. Choose only from the original answers provided + +I trust your judgment in applying these guidelines sensibly, especially for any edge cases that might arise. 
+ +Please respond in JSON format: +{{ + "reasoning": "Your analysis of how you grouped the answers and determined the majority", + "final_answer": "Your selected answer (exactly as it appears in the original list)" +}} +""" + # for i, d in enumerate(task_data): + # prompt += f""" + # {'-'*100} + # Solution {i+1}: + # {d["main_agent_message_history"]["message_history"][-2]["content"]} + # {d["main_agent_message_history"]["message_history"][-1]["content"][0]["text"]} + # """ + return prompt + + +async def process_single_task( + task_id: str, data: List[Dict[str, Any]], n_runs: int, semaphore: asyncio.Semaphore +) -> Tuple[str, Dict[str, Any]]: + """Process a single task and return its result.""" + if "gaia" in BENCHMARK_NAME: + prompt = create_selection_gaia_prompt(data, n_runs) + else: + raise ValueError(f"Unsupported benchmark name: {BENCHMARK_NAME}") + + response = await select_best_solution(prompt, n_runs, semaphore=semaphore) + selected_solution = response["final_answer"] + reasoning = response["reasoning"] + result = await verify_answer_for_datasets( + BENCHMARK_NAME, "", data[0]["ground_truth"], selected_solution + ) + + task_result = { + "task_id": task_id, + "candidate_answers": [d["final_boxed_answer"] for d in data], + "task_input": data[0]["input"], + "prompt_input": prompt, + "ground_truth": data[0]["ground_truth"], + "selected_solution": selected_solution, + "selected_solution_result": result, + "selected_solution_reasoning": reasoning, + } + + return task_id, task_result + + +async def process_tasks( + task_score_dict: Dict[str, List[Dict[str, Any]]], + n_runs: int, + max_concurrent_requests: int = MAX_CONCURRENT_REQUESTS, +) -> Dict[str, Dict[str, Any]]: + """Process all tasks concurrently and select best solutions.""" + # Create semaphore for rate limiting + semaphore = asyncio.Semaphore(max_concurrent_requests) + + # Create tasks for concurrent execution + tasks = [ + process_single_task(task_id, data, n_runs, semaphore) + for task_id, data in 
task_score_dict.items() + ] + + total_tasks = len(tasks) + print( + f"Processing {total_tasks} tasks concurrently (max {max_concurrent_requests} concurrent requests)..." + ) + + # Process tasks and show progress as they complete + task_results: Dict[str, Dict[str, Any]] = {} + completed_tasks = 0 + + for coro in asyncio.as_completed(tasks): + try: + result = await coro + task_id, task_result = result + task_results[task_id] = task_result + completed_tasks += 1 + + # Show progress indicator + progress_percent = (completed_tasks / total_tasks) * 100 + if VERBOSE: + print( + f"Progress: {completed_tasks}/{total_tasks} ({progress_percent:.1f}%) - Completed task: {task_id}" + ) + + except Exception as e: + completed_tasks += 1 + progress_percent = (completed_tasks / total_tasks) * 100 + if VERBOSE: + print( + f"Progress: {completed_tasks}/{total_tasks} ({progress_percent:.1f}%) - Error processing task: {e}" + ) + # Continue with other tasks instead of failing completely + continue + + print(f"Successfully processed {len(task_results)} out of {total_tasks} tasks") + return task_results + + +def save_results( + results_dir: str, task_results: Dict[str, Dict[str, Any]], n_runs: int +) -> None: + """Save results to files.""" + try: + # Save detailed results + results_file = os.path.join( + results_dir, f"llm_majority_voter_{n_runs}runs.json" + ) + with open(results_file, "w") as f: + json.dump(task_results, f, ensure_ascii=False, indent=4) + + # Calculate and save accuracy + correct_count = sum( + 1 + for data in task_results.values() + if data["selected_solution_result"] == "CORRECT" + ) + accuracy = correct_count / len(task_results) if task_results else 0.0 + + print(f"Accuracy: {accuracy}") + + accuracy_file = os.path.join( + results_dir, f"llm_majority_voter_accuracy_{n_runs}runs.txt" + ) + with open(accuracy_file, "w") as f: + f.write(f"Accuracy: {accuracy}") + + except IOError as e: + print(f"Error saving results: {e}") + raise + + +async def main( + results_dir: str, 
max_concurrent_requests: int = MAX_CONCURRENT_REQUESTS +) -> None: + """Main function to analyze results and select best solutions.""" + if not os.path.exists(results_dir): + print(f"Results directory does not exist: {results_dir}") + sys.exit(1) + + print(f"Analyzing results from: {results_dir}") + + # Load task data from all runs + task_score_dict = load_task_data(results_dir) + if not task_score_dict: + print("No task data found") + return + + # Get number of runs + run_dirs = glob.glob(os.path.join(results_dir, "run_*")) + n_runs = len([d for d in run_dirs if os.path.isdir(d)]) + + # Process all tasks + task_results = await process_tasks(task_score_dict, n_runs, max_concurrent_requests) + + # Save results + save_results(results_dir, task_results, n_runs) + + +if __name__ == "__main__": + max_concurrent_requests = MAX_CONCURRENT_REQUESTS + + # Use single or multiple directory mode based on whether results_dirs is defined above + results_dirs = RESULTS_DIRS + + if results_dirs: + # Multiple directories mode + combined_dict, total_runs = load_combined_task_data(results_dirs) + if not combined_dict: + print("No task data found") + sys.exit(1) + print( + f"Loaded {len(combined_dict)} tasks with {total_runs} total runs from {len(results_dirs)} directories" + ) + + async def main_combined(): + task_results = await process_tasks( + combined_dict, total_runs, max_concurrent_requests + ) + save_results(os.path.dirname(results_dirs[0]), task_results, total_runs) + + asyncio.run(main_combined()) + else: + # Single directory mode + # asyncio.run(main(results_dir, max_concurrent_requests)) + pass diff --git a/apps/run-agent/util_statistics_hle_text_only.py b/apps/run-agent/util_statistics_hle_text_only.py new file mode 100644 index 00000000..7f2230eb --- /dev/null +++ b/apps/run-agent/util_statistics_hle_text_only.py @@ -0,0 +1,96 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + +import json +from pathlib import Path + + +def 
analyze_json_files(folder_path): + """ + Analyze judge_result and task_file_name statistics in JSON files + """ + folder = Path(folder_path) + + # Initialize counters + total_correct = 0 + total_incorrect = 0 + null_task_file_correct = 0 + null_task_file_incorrect = 0 + + # Store processed file information + processed_files = 0 + error_files = [] + + print(f"Starting to analyze folder: {folder_path}") + print("=" * 60) + + # Iterate through all JSON files in the folder + for json_file in folder.glob("*.json"): + try: + processed_files += 1 + print(f"Processing file {processed_files}: {json_file.name}") + + with open(json_file, "r", encoding="utf-8") as f: + data = json.load(f) + + # Count judge_result + if "judge_result" in data: + if data["judge_result"] == "CORRECT": + total_correct += 1 + elif data["judge_result"] == "INCORRECT": + total_incorrect += 1 + + # Check if task_file_name under input is null + if "input" in data and isinstance(data["input"], dict): + if data["input"].get("task_file_name") is None: + if data["judge_result"] == "CORRECT": + null_task_file_correct += 1 + elif data["judge_result"] == "INCORRECT": + null_task_file_incorrect += 1 + + except json.JSONDecodeError as e: + error_files.append(f"{json_file.name}: JSON parsing error - {e}") + except Exception as e: + error_files.append(f"{json_file.name}: Other error - {e}") + + # Output statistics results + print("\n" + "=" * 60) + print("Statistics Results:") + print("=" * 60) + print(f"Total files processed: {processed_files}") + print(f"Total CORRECT count: {total_correct}") + print(f"Total INCORRECT count: {total_incorrect}") + print(f"Total: {total_correct + total_incorrect}") + print() + print(f"CORRECT count when task_file_name is null: {null_task_file_correct}") + print(f"INCORRECT count when task_file_name is null: {null_task_file_incorrect}") + print( + f"Total when task_file_name is null: {null_task_file_correct + null_task_file_incorrect}" + ) + + # Calculate percentages + if 
total_correct + total_incorrect > 0: + correct_percentage = (total_correct / (total_correct + total_incorrect)) * 100 + print(f"\nOverall accuracy: {correct_percentage:.2f}%") + + if null_task_file_correct + null_task_file_incorrect > 0: + null_correct_percentage = ( + null_task_file_correct / (null_task_file_correct + null_task_file_incorrect) + ) * 100 + print(f"Accuracy when task_file_name is null: {null_correct_percentage:.2f}%") + + # Output error file information + if error_files: + print("\n" + "=" * 60) + print("Files with processing errors:") + print("=" * 60) + for error in error_files: + print(f" {error}") + + +if __name__ == "__main__": + # Target folder path + folder_path = [""] + + analyze_json_files(folder_path) diff --git a/apps/run-agent/uv.lock b/apps/run-agent/uv.lock index 1be983e5..e6928de4 100644 --- a/apps/run-agent/uv.lock +++ b/apps/run-agent/uv.lock @@ -1230,7 +1230,7 @@ requires-dist = [ { name = "json5", specifier = ">=0.12.0" }, { name = "miroflow-contrib", editable = "../../libs/miroflow-contrib" }, { name = "miroflow-tool", editable = "../../libs/miroflow-tool" }, - { name = "openai", specifier = ">=1.98.0" }, + { name = "openai", specifier = "==1.78.1" }, { name = "pyyaml", specifier = ">=6.0.2" }, { name = "rich", specifier = ">=14.1.0" }, { name = "tenacity", specifier = ">=8.2.3,<9.0.0" }, @@ -1302,7 +1302,7 @@ requires-dist = [ { name = "markitdown-mcp", specifier = ">=0.0.1a3" }, { name = "mcp", specifier = ">=1.12.2" }, { name = "mutagen", specifier = ">=1.47.0" }, - { name = "openai", specifier = ">=1.98.0" }, + { name = "openai", specifier = "==1.78.1" }, { name = "requests", specifier = ">=2.32.0" }, { name = "rich", specifier = ">=14.1.0" }, { name = "wikipedia", specifier = ">=1.4.0" }, @@ -1550,7 +1550,7 @@ wheels = [ [[package]] name = "openai" -version = "1.98.0" +version = "1.78.1" source = { registry = "https://pypi.org/simple" } dependencies = [ { name = "anyio" }, @@ -1562,9 +1562,9 @@ dependencies = [ { name = 
"tqdm" }, { name = "typing-extensions" }, ] -sdist = { url = "https://files.pythonhosted.org/packages/d8/9d/52eadb15c92802711d6b6cf00df3a6d0d18b588f4c5ba5ff210c6419fc03/openai-1.98.0.tar.gz", hash = "sha256:3ee0fcc50ae95267fd22bd1ad095ba5402098f3df2162592e68109999f685427", size = 496695, upload-time = "2025-07-30T12:48:03.701Z" } +sdist = { url = "https://files.pythonhosted.org/packages/a4/3f/4e5e7b0548a15eabc4a755c93cd5f9564887e3d2fd45b6ff531352e5859d/openai-1.78.1.tar.gz", hash = "sha256:8b26b364531b100df1b961d03560042e5f5be11301d7d49a6cd1a2b9af824dca", size = 442985, upload-time = "2025-05-12T09:59:51.098Z" } wheels = [ - { url = "https://files.pythonhosted.org/packages/a8/fe/f64631075b3d63a613c0d8ab761d5941631a470f6fa87eaaee1aa2b4ec0c/openai-1.98.0-py3-none-any.whl", hash = "sha256:b99b794ef92196829120e2df37647722104772d2a74d08305df9ced5f26eae34", size = 767713, upload-time = "2025-07-30T12:48:01.264Z" }, + { url = "https://files.pythonhosted.org/packages/3c/4c/3889bc332a6c743751eb78a4bada5761e50a8a847ff0e46c1bd23ce12362/openai-1.78.1-py3-none-any.whl", hash = "sha256:7368bf147ca499804cc408fe68cdb6866a060f38dec961bbc97b04f9d917907e", size = 680917, upload-time = "2025-05-12T09:59:48.948Z" }, ] [[package]] @@ -2370,6 +2370,8 @@ dependencies = [ { name = "miroflow" }, { name = "miroflow-contrib" }, { name = "miroflow-tool" }, + { name = "openpyxl" }, + { name = "pandas" }, ] [package.dev-dependencies] @@ -2392,6 +2394,8 @@ requires-dist = [ { name = "miroflow", editable = "../../libs/miroflow" }, { name = "miroflow-contrib", editable = "../../libs/miroflow-contrib" }, { name = "miroflow-tool", editable = "../../libs/miroflow-tool" }, + { name = "openpyxl", specifier = ">=3.1.5" }, + { name = "pandas", specifier = ">=2.3.0" }, ] [package.metadata.requires-dev] diff --git a/apps/visualize-trace/run.py b/apps/visualize-trace/run.py index 4ef7b7bf..cad4ed72 100644 --- a/apps/visualize-trace/run.py +++ b/apps/visualize-trace/run.py @@ -1,5 +1,3 @@ -#!/usr/bin/env 
python3 - # SPDX-FileCopyrightText: 2025 MiromindAI # # SPDX-License-Identifier: Apache-2.0 diff --git a/apps/visualize-trace/static/js/script.js b/apps/visualize-trace/static/js/script.js index 4b3c51ab..6ba42a04 100644 --- a/apps/visualize-trace/static/js/script.js +++ b/apps/visualize-trace/static/js/script.js @@ -536,7 +536,7 @@ function updateBasicInfo(data) {
Judgment Result: - ${data.llm_as_judge_result || 'N/A'} + ${data.judge_result || 'N/A'}
`; diff --git a/apps/visualize-trace/test_demo.py b/apps/visualize-trace/test_demo.py index d605077e..1607fdf7 100644 --- a/apps/visualize-trace/test_demo.py +++ b/apps/visualize-trace/test_demo.py @@ -1,5 +1,3 @@ -#!/usr/bin/env python3 - # SPDX-FileCopyrightText: 2025 MiromindAI # # SPDX-License-Identifier: Apache-2.0 diff --git a/apps/visualize-trace/trace_analyzer.py b/apps/visualize-trace/trace_analyzer.py index 59ed03ce..8c0ee936 100644 --- a/apps/visualize-trace/trace_analyzer.py +++ b/apps/visualize-trace/trace_analyzer.py @@ -93,7 +93,7 @@ def get_basic_info(self) -> Dict[str, Any]: "end_time": self.data.get("end_time"), "final_boxed_answer": self.data.get("final_boxed_answer"), "ground_truth": self.data.get("ground_truth"), - "llm_as_judge_result": self.data.get("llm_as_judge_result"), + "judge_result": self.data.get("judge_result"), "error": self.data.get("error", ""), } diff --git a/docs/faq.md b/docs/faq.md new file mode 100644 index 00000000..4f8c27bf --- /dev/null +++ b/docs/faq.md @@ -0,0 +1,15 @@ +**Q: What is the estimated cost of running the GAIA validation set for a single run?**
+**A**: The cost is approximately **$450 USD** for a run without a cache. Enabling the cache can significantly reduce this cost by 50-67%, bringing it down to the **$150 - $225** range. + + +**Q: How long does it take to run the GAIA validation set for a single run?**
+**A**: With the `max_concurrent` parameter set to 20, a full run takes about **5 hours** to complete. + +**Q: Are all the specified APIs required?**
+**A**: **Yes.** To fully reproduce our published results, access to all the listed APIs is necessary. + + +**Q: What is the difference between MiroFlow and MiroThinker?**
+**A**: **MiroFlow** is primarily focused on interacting with proprietary models; **MiroThinker** is designed for our own open-source models. + +We plan to merge these two projects in the future to create a single, unified platform. \ No newline at end of file diff --git a/docs/figs/09xyHJV9dkbY2yacsv4zYTBbKM.avif b/docs/figs/09xyHJV9dkbY2yacsv4zYTBbKM.avif new file mode 100644 index 00000000..0b44e3ae Binary files /dev/null and b/docs/figs/09xyHJV9dkbY2yacsv4zYTBbKM.avif differ diff --git a/docs/figs/MiroFlow_logo.png b/docs/figs/MiroFlow_logo.png new file mode 100644 index 00000000..5847b8d0 Binary files /dev/null and b/docs/figs/MiroFlow_logo.png differ diff --git a/docs/figs/gaia_score.png b/docs/figs/gaia_score.png index 145e8cb8..d7a9978c 100644 Binary files a/docs/figs/gaia_score.png and b/docs/figs/gaia_score.png differ diff --git a/docs/figs/logo.png b/docs/figs/logo.png index ea384dd3..b60b56b5 100644 Binary files a/docs/figs/logo.png and b/docs/figs/logo.png differ diff --git a/docs/figs/miroflow_architecture.png b/docs/figs/miroflow_architecture.png index bc032cc1..e5c3cf2f 100644 Binary files a/docs/figs/miroflow_architecture.png and b/docs/figs/miroflow_architecture.png differ diff --git a/docs/figs/wechat-bot-qr-code.jpg b/docs/figs/wechat-bot-qr-code.jpg deleted file mode 100644 index 52079868..00000000 Binary files a/docs/figs/wechat-bot-qr-code.jpg and /dev/null differ diff --git a/docs/figs/wechat-group-qr-code.jpg b/docs/figs/wechat-group-qr-code.jpg deleted file mode 100644 index 83bcaff3..00000000 Binary files a/docs/figs/wechat-group-qr-code.jpg and /dev/null differ diff --git a/docs/hydra_config.md b/docs/hydra_config.md new file mode 100644 index 00000000..1d72d82c --- /dev/null +++ b/docs/hydra_config.md @@ -0,0 +1,152 @@ +## Hydra config system +### Config File Structure + +``` +MiroFlow/libs/miroflow/src/miroflow/prebuilt/config +├── config.yaml # Main configuration with defaults +├── agent/ # Agent configurations (tools, limits) +├── 
benchmark/ # Benchmark configurations (datasets, execution) +└── llm/ # Language model configurations (providers, models) +``` + +### Usage + +Run with default configuration: +```bash +cd MiroFlow/apps/run-agent +uv run main.py common-benchmark +``` + +Default configuration is defined in +`MiroFlow/libs/miroflow/src/miroflow/prebuilt/config/config.yaml`: + +```yaml +# conf/config.yaml +defaults: + - llm: claude_openrouter + - agent: miroflow + - benchmark: gaia-validation + - pricing: _default + +# Other configurations... +``` + +| Component | Default Value | File Path | +|------------|----------------------|---------------------------------------------------------------------------| +| LLM | `claude_openrouter` | `libs/miroflow/src/miroflow/prebuilt/config/llm/claude_openrouter.yaml` | +| Agent | `miroflow` | `libs/miroflow/src/miroflow/prebuilt/config/agent/miroflow.yaml` | +| Benchmark | `gaia-validation` | `libs/miroflow/src/miroflow/prebuilt/config/benchmark/gaia-validation.yaml` | + + +### Override Configurations + +#### Component Override +Switch between existing configurations using the filename (without `.yaml`): +```bash +uv run main.py common-benchmark llm= agent= benchmark= +``` + +For example, if you have `conf/llm/claude_openrouter.yaml`, use `llm=claude_openrouter` + + +#### Parameter Override +Override specific parameters: +```bash +cd MiroFlow/apps/run-agent +uv run main.py common-benchmark llm.temperature=0.1 agent.main_agent.max_turns=30 +``` + +### Create Custom Configurations + +1. **Create new config file** in the appropriate subdirectory (e.g., `conf/llm/my_config.yaml`) +2. **Inherit from defaults** using Hydra's composition: + ```yaml + defaults: + - default # Inherit base configuration + - _self_ # Allow self-overrides + + # Your custom parameters + parameter: value + ``` +3. **Use your config**: `uv run main.py common-benchmark component=my_config` + +## Recommended: Use scripts to run experiments. 
+We recommend using the scripts provided under `./apps/run-agent/scripts`, as script files are much easier to read, maintain, and customize compared to single-line commands. + +A script file example is as follows: +```bash +#!/bin/bash + +# Number of runs for the benchmark +NUM_RUNS=4 +# The parallelization concurrent number for a single run (total concurrent = NUM_RUNS * MAX_CONCURRENT) +MAX_CONCURRENT=15 +# Benchmark name (must match the benchmark's config file name) +BENCHMARK_NAME="gaia-validation" +# Agent set name (must match the agent's config file name) +AGENT_SET="claude03_claude_dual" +# Set to true to add a random message ID to all messages sent to the LLM +ADD_MESSAGE_ID="true" +# When set to a positive finite number, all main and sub agents will have turn limits. Set to -1 for no limit. +MAX_TURNS=-1 + +# Automatically enable Chinese context - if BENCHMARK_NAME contains xbench or -zh +if [[ $BENCHMARK_NAME == "xbench-ds" ]] || [[ $BENCHMARK_NAME == "browsecomp-zh" ]]; then + export CHINESE_CONTEXT="true" + echo "检测到中文相关基准测试,已启用中文上下文:CHINESE_CONTEXT=true" +fi + +# These options are used to filter Google search results. +# export REMOVE_SNIPPETS="true" +# export REMOVE_KNOWLEDGE_GRAPH="true" +# export REMOVE_ANSWER_BOX="true" + +# Define the results directory to save outputs +RESULTS_DIR="logs/${BENCHMARK_NAME}/${AGENT_SET}" + +echo "Starting $NUM_RUNS runs of the evaluation..." +echo "Results will be saved in: $RESULTS_DIR" + +mkdir -p "$RESULTS_DIR" + +# For loop to start NUM_RUNS experiments in parallel. 
+for i in $(seq 1 $NUM_RUNS); do + RUN_ID="run_$i" + ( + # You can override any parameters you want here + uv run main.py common-benchmark \ + benchmark=$BENCHMARK_NAME \ + agent=$AGENT_SET \ + agent.add_message_id=$ADD_MESSAGE_ID \ + agent.main_agent.max_turns=$MAX_TURNS \ + agent.sub_agents.agent-worker.max_turns=$MAX_TURNS \ + benchmark.execution.max_tasks=null \ + benchmark.execution.max_concurrent=$MAX_CONCURRENT \ + benchmark.execution.pass_at_k=1 \ + output_dir="$RESULTS_DIR/$RUN_ID" \ + hydra.run.dir=${RESULTS_DIR}/$RUN_ID \ + > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1 + ) & + + sleep 2 +done + +echo "All $NUM_RUNS runs have been launched in parallel" +echo "Waiting for all runs to complete..." + +wait + +echo "==========================================" +echo "All $NUM_RUNS runs completed!" +echo "==========================================" + +# Calculate average scores +echo "Calculating average scores..." +uv run main.py avg-score "$RESULTS_DIR" + +echo "==========================================" +echo "Multiple runs evaluation completed!" +echo "Check results in: $RESULTS_DIR" +echo "Check individual run logs: $RESULTS_DIR/run_*_output.log" +echo "==========================================" +``` \ No newline at end of file diff --git a/docs/local_e2b.md b/docs/local_e2b.md new file mode 100644 index 00000000..637c6000 --- /dev/null +++ b/docs/local_e2b.md @@ -0,0 +1,38 @@ + +# Prepare E2B Sandbox (Optional) + +> [!TIP] +> We provide a public E2B sandbox template. Follow this step if you want to reproduce the best scores. +> +> For the E2B sandbox service, we recommend setting up a Linux Docker image with a comprehensive set of apt and Python packages pre-installed. Without these pre-installed packages, the agent will need to spend extra steps and context installing them, resulting in reduced token efficiency. +> +> you need to have `npm` install and `docker` running locally. + + +1. 
Install `e2b` command line and login:
+
+```shell
+## install e2b
+npm install -g @e2b/cli
+## check that it is available
+which e2b
+```
+
+2. Download our pre-configured Dockerfile:
+[e2b.Dockerfile](https://github.com/MiroMindAI/MiroFlow/blob/main/docs/e2b.Dockerfile).
+
+```shell
+wget https://raw.githubusercontent.com/MiroMindAI/MiroFlow/main/docs/e2b.Dockerfile
+```
+
+3. Run `e2b template build` command [check official doc here](https://e2b.dev/docs/sdk-reference/cli/v1.0.2/template), use `all_pip_apt_pkg` as the name of template.
+
+```shell
+## build the template with `docker build` locally
+export E2B_ACCESS_TOKEN=${your-token}
+e2b template build -c "/root/.jupyter/start-up.sh" -n "all_pip_apt_pkg" -d ./e2b.Dockerfile
+## check that template is built successfully
+E2B_ACCESS_TOKEN=${your-token} e2b template list
+```
+
+You can also create your own custom sandbox template for specific use cases by following similar steps. For more information, please refer to the [E2B Docker documentation](https://e2b.dev/docs/sandbox-template).
diff --git a/docs/mirothinker.md b/docs/mirothinker.md
new file mode 100644
index 00000000..ec2cd9d7
--- /dev/null
+++ b/docs/mirothinker.md
@@ -0,0 +1,6 @@
+## 🌟 MiroThinker
+
+[MiroThinker](https://github.com/MiroMindAI/MiroThinker) (7B/14B/32B) is our suite of open-source agentic models, designed to work seamlessly with the MiroFlow framework. Our models are specifically built to handle **complex, multi-tool tasks**, leveraging the reproducible and robust foundation that MiroFlow provides.
+
+By combining MiroFlow’s reliable orchestration with MiroThinker’s advanced reasoning capabilities, we offer a powerful, end-to-end solution for building high-performing, reproducible AI agents.
+These models are a direct result of our extensive data collection efforts, utilizing MiroFlow to generate high-quality, post-training agent trace data. 
This unique approach enables MiroThinker to excel in planning, executing, and reasoning through complex multi-step tasks. \ No newline at end of file diff --git a/docs/workflow.md b/docs/workflow.md new file mode 100644 index 00000000..717a981b --- /dev/null +++ b/docs/workflow.md @@ -0,0 +1,90 @@ + +## Workflow Overview + +MiroFlow handles user queries through a multi-stage and agentic process designed for flexibility and depth. The workflow is organized as follows: + +1. **Intent Recognition & Query Augmentation** + LLMs analyze user input to detect intent and refine the query. + +2. **Planning & Task Orchestration** + The main agent drafts an execution plan, invokes tools, and coordinates sub-agents. + +3. **Delegation to Sub-Agents** + Specialized agents (e.g., agent-browsing) handle complex or domain-specific tasks. Sub-agents independently plan, act, and execute tool calls as needed. + +4. **Tool Access via MCP Servers** + When external capabilities are required, agents leverage specialized tools by connecting to MCP (Model Context Protocol) servers. + +5. **Result Synthesis & Output Alignment** + After task completion, a dedicated summary process synthesizes results, ensuring the output is high-quality and aligned with user instructions (or benchmark formats). + +## Architecture Components + +All core components are located in the `MiroFlow/libs/` directory. 
+ +``` +MiroFlow/libs/ +├── miroflow/ +│ └── src/miroflow/ +│ ├── prebuilt/ +│ │ ├── pipeline.py # Pipeline: coordinates task execution +│ │ ├── orchestrator.py # Orchestrator: manages LLM ↔ tool flow +│ │ └── config/ # Hydra configs for agents, LLMs, pricing +│ ├── llm/ +│ │ └── client.py # Unified LLM client +│ ├── utils/ +│ │ ├── io_utils.py # Output formatting utilities +│ │ ├── prompt_utils.py # Prompt definitions for agents +│ │ └── tool_utils.py # Tool configuration helpers +│ └── logging/ # Task logging & metrics +│ +├── miroflow-tool/ +│ └── src/miroflow/tool/ +│ ├── manager.py # Tool Manager: MCP server connector +│ └── mcp_servers/ # Individual MCP tool servers +│ ├── python_server.py # Code execution +│ ├── vision_mcp_server.py # Visual perception +│ ├── searching_mcp_server.py # Web search & retrieval +│ ├── audio_mcp_server.py # Audio transcription +│ ├── reasoning_mcp_server.py # Enhanced reasoning +│ └── reading_mcp_server.py # Document processing +``` + +![Core Component Architecture](figs/core_component_architecture.png) + +### Core System 💻 + +- **Pipeline** (`./miroflow/src/miroflow/prebuilt/pipeline.py`): Main entry point that creates and manages all components, handles error recovery, and returns final results + +- **Orchestrator** (`./miroflow/src/miroflow/prebuilt/orchestrator.py`): Manages multi-turn conversations, parses tool calls, executes tools, and delegates to sub-agents + +- **LLM Client** (`./miroflow/src/miroflow/llm/client.py`): Unified interface supporting Anthropic, OpenAI, Google, Qwen, DeepSeek, and local deployments + +### Tool Integration 🔧 + +- **Tool Manager** (`./miroflow-tool/src/miroflow/tool/manager.py`) : Comprehensive MCP server connection manager with tool discovery, persistent connections, and error handling + +- **MCP Servers** (`./miroflow-tool/src/miroflow/tool/mcp_servers/`) : Individual tool implementations built on FastMCP. 
Provides extensive capabilities including: + - Code execution and analysis (`./python_server.py`) + - Visual perception (`./vision_mcp_server.py`) + - Web search and content retrieval (`./searching_mcp_server.py`) + - Audio transcription (`./audio_mcp_server.py`) + - Enhanced reasoning capabilities (`./reasoning_mcp_server.py`) + - Document processing and analysis (`./reading_mcp_server.py`) + +### Agent System 👷 + +**Sub-Agents** +Specialized agents designed for specific domains (e.g., `agent-browsing` for web navigation). Each sub-agent maintains dedicated tool sets and custom prompts, allowing the main agent to delegate tasks requiring specialized expertise. Agent definitions are managed through configuration files with prompts and descriptions customized in `./miroflow/src/miroflow/utils/prompt_utils.py` and `tool_utils.py`. + +### Support Systems ⚙️ + +- **Configuration System** (`./miroflow/src/miroflow/prebuilt/config/`) : Hydra-powered YAML configuration for agents, LLMs, benchmarks, and pricing + +- **Output Formatter** (`./miroflow/src/miroflow/utils/io_utils.py`) : Intelligent response formatting that adapts to various benchmark requirements + +- **Task Logger** (`./miroflow/src/miroflow/logging/`) : Comprehensive logging for agent interactions, tool executions, and performance metrics + +### Execution Pipeline Data Flow + +![Execution Pipeline Data Flow](figs/execution_pipeline.png) \ No newline at end of file diff --git a/libs/miroflow-contrib/src/miroflow/contrib/pocket/__init__.py b/libs/miroflow-contrib/src/miroflow/contrib/pocket/__init__.py index e69de29b..78033c33 100644 --- a/libs/miroflow-contrib/src/miroflow/contrib/pocket/__init__.py +++ b/libs/miroflow-contrib/src/miroflow/contrib/pocket/__init__.py @@ -0,0 +1,5 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# SPDX-FileCopyrightText: 2025 OpenAI +# +# SPDX-License-Identifier: Apache-2.0 +# SPDX-License-Identifier: MIT diff --git a/libs/miroflow-contrib/src/miroflow/contrib/pocket/core.py 
b/libs/miroflow-contrib/src/miroflow/contrib/pocket/core.py index 0de0cdb1..90d69c8a 100644 --- a/libs/miroflow-contrib/src/miroflow/contrib/pocket/core.py +++ b/libs/miroflow-contrib/src/miroflow/contrib/pocket/core.py @@ -1,3 +1,9 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# SPDX-FileCopyrightText: 2025 OpenAI +# +# SPDX-License-Identifier: Apache-2.0 +# SPDX-License-Identifier: MIT + import asyncio import copy import time diff --git a/libs/miroflow-contrib/src/miroflow/contrib/pocket/core_v2.py b/libs/miroflow-contrib/src/miroflow/contrib/pocket/core_v2.py index db1d163a..7251e4f3 100644 --- a/libs/miroflow-contrib/src/miroflow/contrib/pocket/core_v2.py +++ b/libs/miroflow-contrib/src/miroflow/contrib/pocket/core_v2.py @@ -1,3 +1,9 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# SPDX-FileCopyrightText: 2025 OpenAI +# +# SPDX-License-Identifier: Apache-2.0 +# SPDX-License-Identifier: MIT + from typing import Any, Protocol, TypeVar from dataclasses import dataclass import copy diff --git a/libs/miroflow-tool/pyproject.toml b/libs/miroflow-tool/pyproject.toml index 8fa6971a..01e0715a 100644 --- a/libs/miroflow-tool/pyproject.toml +++ b/libs/miroflow-tool/pyproject.toml @@ -18,7 +18,7 @@ dependencies = [ "markitdown-mcp>=0.0.1a3", "mcp>=1.12.2", "mutagen>=1.47.0", - "openai>=1.98.0", + "openai==1.78.1", "requests>=2.32.0", "rich>=14.1.0", "wikipedia>=1.4.0", diff --git a/libs/miroflow-tool/src/miroflow/tool/README.md b/libs/miroflow-tool/src/miroflow/tool/README.md deleted file mode 100644 index 8423b9cd..00000000 --- a/libs/miroflow-tool/src/miroflow/tool/README.md +++ /dev/null @@ -1,369 +0,0 @@ -# MCP Tools Framework - -This document explains how Model Context Protocol (MCP) tools work in the Mirage framework and provides a step-by-step guide for adding new tools. - -## Overview - -The Mirage framework uses MCP (Model Context Protocol) to provide extensible tool capabilities to LLM agents. 
MCP allows agents to interact with external services, APIs, and capabilities through a standardized protocol. - -## Architecture - -The MCP tools architecture consists of several key components: - -``` -├── mcp_servers/ # MCP server implementations -├── tool-tests/ # Test files for MCP tools -├── manager.py # ToolManager - handles tool discovery and execution -├── settings.py # Configuration and server parameter setup -├── orchestrator.py # Main orchestration logic -└── prompt_utils.py # System prompt generation with tool definitions -``` - -### Key Components - -1. **MCP Servers** (`mcp_servers/`): Individual tool implementations -2. **ToolManager** (`manager.py`): Coordinates tool discovery and execution -3. **Configuration** (`settings.py`): Defines which tools are available to which agents -4. **Orchestrator** (`orchestrator.py`): Manages the interaction between LLM and tools -5. **Prompt Utils** (`prompt_utils.py`): Generates system prompts with tool definitions - -## How MCP Tools Work - -### 1. Tool Discovery Flow - -``` -Agent Config (YAML) → Settings.py → ToolManager → MCP Servers → Tool Definitions -``` - -1. Agent configuration (`conf/agent/default.yaml`) specifies which tools are enabled -2. `settings.py` creates MCP server parameters based on the configuration -3. `ToolManager` connects to MCP servers and retrieves tool definitions -4. Tool definitions are included in the system prompt via `prompt_utils.py` - -### 2. Tool Execution Flow - -``` -LLM Request → Orchestrator → ToolManager → MCP Server → Tool Result → LLM Response -``` - -1. LLM requests a tool call in the standardized format -2. `Orchestrator` parses the tool call and delegates to `ToolManager` -3. `ToolManager` executes the tool call on the appropriate MCP server -4. Tool result is formatted and returned to the LLM - -### 3. 
System Prompt Integration - -Tools are automatically included in the system prompt with their definitions: - -``` -## Server name: tool-vqa -### Tool name: visual_question_answering -Description: Ask question about an image or a video and get the answer with a vision language model. -Input JSON schema: {"type": "object", "properties": {...}, "required": [...]} -``` - -## Adding a New MCP Tool - -### Step 1: Create the MCP Server - -Create a new file in `mirage/libs/mirage-contrib/src/mirage/contrib/tools/mcp_servers/`: - -```python -# my_new_tool_server.py -import os -from fastmcp import FastMCP - -# Initialize FastMCP server -mcp = FastMCP("my-new-tool-server") - -# Get environment variables if needed -MY_API_KEY = os.environ.get("MY_API_KEY", "") - -@mcp.tool() -async def my_tool_function(input_param: str) -> str: - """Description of what this tool does. - - Args: - input_param: Description of the input parameter. - - Returns: - Description of what the tool returns. - """ - - # Your tool implementation here - try: - # Process the input and generate output - result = f"Processed: {input_param}" - return result - except Exception as e: - return f"Error: {e}" - -if __name__ == "__main__": - mcp.run(transport="stdio") -``` - -### Step 2: Add Configuration in Settings - -Add your tool configuration to `mirage/apps/reorg-modular-structure/src/mirage_agent/config/settings.py`: - -```python -def create_mcp_server_parameters(cfg: DictConfig, agent_cfg: DictConfig): - """Define and return MCP server configuration list""" - configs = [] - - # ... existing code ... 
- - # Add your new tool configuration - if agent_cfg.get("tools", None) is not None and "tool-my-new-tool" in agent_cfg["tools"]: - configs.append( - { - "name": "tool-my-new-tool", - "params": StdioServerParameters( - command=sys.executable, - args=["-m", "mirage.contrib.tools.mcp_servers.my_new_tool_server"], - env={ - "MY_API_KEY": MY_API_KEY, - # Add other environment variables as needed - }, - ), - } - ) - - # ... rest of the function ... -``` - -If `MY_API_KEY` has not been defined yet, don't forget to declare it at the beginning of `settings.py` and add it to `get_env_info()`. - -### Step 3: Update Agent Configuration - -Add your tool to the agent configuration file (`conf/agent/default.yaml`): - -```yaml -# conf/agent/default.yaml -main_agent: - tools: - - tool-code - - tool-vqa - - tool-my-new-tool # Add your new tool name here - max_turns: 20 - -sub_agents: - agent-browsing: - tools: - - tool-serper-search - - tool-my-new-tool # Or add to sub-agents as needed - max_turns: 20 -``` - -### Step 4: Create Tests - -Create a test file in `mirage/libs/mirage-contrib/tests/tool-tests/`: - -```python -# test_my_new_tool_server.py -import os -import sys -import asyncio -import pytest -from typing import Dict, Any - -from mirage.contrib.tools.manager import ToolManager -from mcp import StdioServerParameters - - -class TestMyNewToolServer: - """Test suite for My New Tool MCP Server functionality.""" - - def _get_credentials(self) -> Dict[str, str]: - """Get API credentials, skip test if not available.""" - api_key = os.environ.get("MY_API_KEY") - if not api_key: - pytest.skip("MY_API_KEY environment variable not set") - - return { - "MY_API_KEY": api_key, - } - - # Create the tools the same way they are created in the Mirage framework. 
- def _create_tool_manager(self) -> ToolManager: - """Create a configured ToolManager instance.""" - credentials = self._get_credentials() - tool_configs = [ - { - "name": "tool-my-new-tool", - "params": StdioServerParameters( - command=sys.executable, - args=["-m", "mirage.contrib.tools.mcp_servers.my_new_tool_server"], - env=credentials, - ), - } - ] - return ToolManager(tool_configs) - - # Example test functions: - @pytest.mark.asyncio - async def test_tool_definitions_available(self): - """Test that tool definitions are properly loaded.""" - pass - - @pytest.mark.asyncio - async def test_my_tool_function(self): - """Test the main functionality of your tool.""" - pass - # Add more specific assertions based on your tool's behavior -``` - -### Step 5: Add Environment Variables - -If your tool requires API keys or other configuration: - -1. Add them to your `.env` file: -```bash -MY_API_KEY=your_api_key_here -``` - -2. Add them to `settings.py`: -```python -MY_API_KEY = os.environ.get("MY_API_KEY") -``` - -Make sure they are passed to the tool. - -### Step 6: Test Your Tool - -Run the tests to ensure everything works: - -```bash -# Run specific tool tests -uv run pytest ./libs/mirage-contrib/tests/tool-tests/test_my_new_tool_server.py -v - -# Run all tool tests -uv run pytest mirage/libs/mirage-contrib/tests/tool-tests/ -v -``` - -## Tool Implementation Best Practices - -### 1. Error Handling - -Always implement proper error handling: - -```python -@mcp.tool() -async def my_tool_function(input_param: str) -> str: - """Tool description.""" - try: - # Your logic here - result = process_input(input_param) - return result - except Exception as e: - return f"Error: {e}" -``` - -Make sure the agent will know what happened during the tool call; the agent may adjust its behavior according to these error messages. - -### 2. 
Parameter Validation - -Validate input parameters: - -```python -@mcp.tool() -async def my_tool_function(required_param: str, optional_param: str = "default") -> str: - """Tool description. - - Args: - required_param: Required parameter description. - optional_param: Optional parameter description. - """ - if not required_param: - return "Error: required_param cannot be empty" - - # Process the parameters - return f"Processed: {required_param}, {optional_param}" -``` - -### 3. Environment Variables - -Use environment variables for configuration: - -```python -import os - -API_KEY = os.environ.get("MY_API_KEY", "") -BASE_URL = os.environ.get("MY_BASE_URL", "https://api.example.com") - -if not API_KEY: - raise ValueError("MY_API_KEY environment variable is required") -``` - -### 4. Async/Await - -Use async/await for I/O operations: - -```python -import aiohttp - -@mcp.tool() -async def fetch_data(url: str) -> str: - """Fetch data from a URL.""" - try: - async with aiohttp.ClientSession() as session: - async with session.get(url) as response: - return await response.text() - except Exception as e: - return f"Error: {e}" -``` - -(Don't forget to wrap I/O operations in try/except and return the error message. Otherwise, a failed tool call will terminate the process.) - -## Tool Blacklisting - -You can blacklist specific tools by adding them to the agent configuration: - -```yaml -# conf/agent/default.yaml -main_agent: - tools: - - tool-my-new-tool - tool_blacklist: - - ["tool-my-new-tool", "specific_function_name"] # Blacklist specific functions - max_turns: 20 -``` - -## Tool Categories - -### Current Available Tools - -1. **Code Execution** (`tool-code`): Execute Python code and shell commands -2. **Vision/VQA** (`tool-vqa`): Visual question answering with images -3. **Audio** (`tool-transcribe`): Audio transcription and processing -4. **Reasoning** (`tool-reasoning`): Enhanced reasoning capabilities -5. **Document Processing** (`tool-markitdown`(old tool)): Convert documents to markdown -6. 
**Reading** (`tool-reading`): Read and process documents -7. **Search** (`tool-serper-search`(old tool), `tool-searching`): Web search related capabilities - -### Tool Naming Convention - -- Use prefix `tool-` for all tool names -- Use lowercase with hyphens for multi-word names -- Example: `tool-my-new-feature` - -## Sub-Agents and Tool Access - -Different sub-agents can have access to different tools: - -```yaml -sub_agents: - agent-browsing: - tools: - - tool-serper-search - - tool-markitdown - max_turns: 20 - - agent-coding: - tools: - - tool-code - - tool-reasoning - max_turns: 20 -``` - -To create sub-agents, you should modify `prompt_utils.py` to define sub-agents' system prompts. Sub-agents should always be named after `agent-`. \ No newline at end of file diff --git a/libs/miroflow-tool/src/miroflow/tool/manager.py b/libs/miroflow-tool/src/miroflow/tool/manager.py index 73e958ca..e66931be 100644 --- a/libs/miroflow-tool/src/miroflow/tool/manager.py +++ b/libs/miroflow-tool/src/miroflow/tool/manager.py @@ -231,7 +231,7 @@ async def get_all_tool_definitions(self): return all_servers_for_prompt - @with_timeout(600) + @with_timeout(900) async def execute_tool_call(self, server_name, tool_name, arguments) -> Any: """ Execute a single tool call. @@ -247,10 +247,57 @@ async def execute_tool_call(self, server_name, tool_name, arguments) -> Any: logger.error( f"Error: Attempting to call server '{server_name}' that was not found" ) + + # Try to find the tool in all available servers + suggested_servers = await self._find_servers_with_tool(tool_name) + + error_message = f"Server '{server_name}' not found." 
+ + if len(suggested_servers) == 1: + # Auto-correction: only one server contains the tool, try to auto-correct and execute + correct_server = suggested_servers[0] + logger.info( + f"Auto-correction: Server '{server_name}' not found, but found tool '{tool_name}' in '{correct_server}', trying to auto-correct and execute" + ) + + try: + # Recursive call, using the correct server name + corrected_result = await self.execute_tool_call( + correct_server, tool_name, arguments + ) + + # If auto-correction is successful, add a note in the result + if "result" in corrected_result: + # Add auto-correction note in the result, including the reason for the correction + correction_note = f"[Auto-corrected: Server '{server_name}' not found, but tool '{tool_name}' was found only in server '{correct_server}', so automatically used '{correct_server}' instead] " + corrected_result["result"] = correction_note + str( + corrected_result["result"] + ) + return corrected_result + elif "error" in corrected_result: + # If there is an error after auto-correction, add a note in the error message + correction_note = f"[Auto-corrected: Server '{server_name}' not found, but tool '{tool_name}' was found only in server '{correct_server}', attempted auto-correction but still failed] " + corrected_result["error"] = correction_note + str( + corrected_result["error"] + ) + return corrected_result + + except Exception as auto_correct_error: + logger.error(f"Auto-correction failed: {auto_correct_error}") + error_message += f" Found tool '{tool_name}' in server '{correct_server}' and attempted auto-correction, but it failed: {str(auto_correct_error)}" + + elif len(suggested_servers) > 1: + error_message += f" However, found tool '{tool_name}' in these servers: {', '.join(suggested_servers)}. You may want to use one of these servers instead." + else: + error_message += ( + " It is possible that the server_name and tool_name were confused or mixed up. 
" + "You should try again and carefully check the server name and tool name provided in the system prompt." + ) + return { "server_name": server_name, "tool_name": tool_name, - "error": f"Server '{server_name}' not found.", + "error": error_message, } logger.info( diff --git a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/audio_mcp_server.py b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/audio_mcp_server.py index 4bfd3386..0a820102 100755 --- a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/audio_mcp_server.py +++ b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/audio_mcp_server.py @@ -1,3 +1,7 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + import os import tempfile import requests diff --git a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/python_server.py b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/python_server.py index 41ad8ccb..cbb8a77f 100755 --- a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/python_server.py +++ b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/python_server.py @@ -1,3 +1,7 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + import asyncio import os @@ -16,7 +20,7 @@ DEFAULT_TEMPLATE_ID = "all_pip_apt_pkg" # DEFAULT CONFS -DEFAULT_TIMEOUT = 1200 # seconds +DEFAULT_TIMEOUT = 1800 # seconds # Common packages to install in sandbox COMMON_PACKAGES = [ diff --git a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/reading_mcp_server.py b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/reading_mcp_server.py index e788a58b..ea242eb7 100644 --- a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/reading_mcp_server.py +++ b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/reading_mcp_server.py @@ -1,3 +1,7 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + import argparse import os import tempfile diff --git a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/reasoning_mcp_server.py 
b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/reasoning_mcp_server.py index 04cb3c49..f034607b 100755 --- a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/reasoning_mcp_server.py +++ b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/reasoning_mcp_server.py @@ -1,3 +1,7 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + import os from anthropic import Anthropic @@ -18,14 +22,24 @@ @mcp.tool() async def reasoning(question: str) -> str: - """You can use this tool use solve hard math problem, puzzle, riddle and IQ test question that requries a lot of chain of thought efforts. - DO NOT use this tool for simple and obvious question. + """This tool is for pure text-based reasoning, analysis, and logical thinking. It integrates collected information, organizes final logic, and provides planning insights. + + IMPORTANT: This tool cannot access the internet, read files, program, or process multimodal content. It only performs pure text reasoning. + + Use this tool for: + - Integrating and synthesizing collected information + - Analyzing patterns and relationships in data + - Logical reasoning and problem-solving + - Planning and strategy development + - Complex math problems, puzzles, riddles, and IQ tests + + DO NOT use this tool for simple and obvious questions. Args: - question: The complex question or problem requiring step-by-step reasoning. Should include all relevant information needed to solve the problem.. + question: The complex question or problem requiring step-by-step reasoning. Should include all relevant information needed to solve the problem. Returns: - The answer to the question. + The reasoned answer to the question. 
""" messages_for_llm = [ diff --git a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/searching_mcp_server.py b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/searching_mcp_server.py index c3228d7a..fb8fdaea 100644 --- a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/searching_mcp_server.py +++ b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/searching_mcp_server.py @@ -1,4 +1,9 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + import os +import json import requests import datetime import calendar @@ -13,10 +18,66 @@ JINA_API_KEY = os.environ.get("JINA_API_KEY", "") GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "") +# Google search result filtering environment variables +REMOVE_SNIPPETS = os.environ.get("REMOVE_SNIPPETS", "").lower() in ("true", "1", "yes") +REMOVE_KNOWLEDGE_GRAPH = os.environ.get("REMOVE_KNOWLEDGE_GRAPH", "").lower() in ( + "true", + "1", + "yes", +) +REMOVE_ANSWER_BOX = os.environ.get("REMOVE_ANSWER_BOX", "").lower() in ( + "true", + "1", + "yes", +) + # Initialize FastMCP server mcp = FastMCP("searching-mcp-server") +def filter_google_search_result(result_content: str) -> str: + """Filter google search result content based on environment variables. 
+ + Args: + result_content: The JSON string result from google search + + Returns: + Filtered JSON string result + """ + try: + # Parse JSON + data = json.loads(result_content) + + # Remove knowledgeGraph if requested + if REMOVE_KNOWLEDGE_GRAPH and "knowledgeGraph" in data: + del data["knowledgeGraph"] + + # Remove answerBox if requested + if REMOVE_ANSWER_BOX and "answerBox" in data: + del data["answerBox"] + + # Remove snippets if requested + if REMOVE_SNIPPETS: + # Remove snippets from organic results + if "organic" in data: + for item in data["organic"]: + if "snippet" in item: + del item["snippet"] + + # Remove snippets from peopleAlsoAsk + if "peopleAlsoAsk" in data: + for item in data["peopleAlsoAsk"]: + if "snippet" in item: + del item["snippet"] + + # Return filtered JSON + return json.dumps(data, ensure_ascii=False, indent=2) + + except (json.JSONDecodeError, Exception): + # If filtering fails, return original content + return result_content + + @mcp.tool() async def google_search( q: str, @@ -32,7 +93,9 @@ async def google_search( Args: q: Search query string. - location: Location for search results (e.g., 'SoHo, New York, United States', 'California, United States'). + gl: Country context for search (e.g., 'us' for United States, 'cn' for China, 'uk' for United Kingdom). Influences regional results priority. Default is 'us'. + hl: Google interface language (e.g., 'en' for English, 'zh' for Chinese, 'es' for Spanish). Affects snippet language preference. Default is 'en'. + location: City-level location for search results (e.g., 'SoHo, New York, United States', 'California, United States'). num: The number of results to return (default: 10). tbs: Time-based search filter ('qdr:h' for past hour, 'qdr:d' for past day, 'qdr:w' for past week, 'qdr:m' for past month, 'qdr:y' for past year). page: The page number of results to return (default: 1). @@ -41,7 +104,9 @@ async def google_search( The search results. 
""" if SERPER_API_KEY == "": - return "SERPER_API_KEY is not set, google_search tool is not available." + return ( + "[ERROR]: SERPER_API_KEY is not set, google_search tool is not available." + ) tool_name = "google_search" arguments = { "q": q, @@ -80,11 +145,13 @@ async def google_search( assert ( result_content is not None and result_content.strip() != "" ), "Empty result from google_search tool, please try again." - return result_content # Success, exit retry loop + # Apply filtering based on environment variables + filtered_result = filter_google_search_result(result_content) + return filtered_result # Success, exit retry loop except Exception as error: retry_count += 1 if retry_count >= max_retries: - return f"[ERROR]: Tool execution failed after {max_retries} attempts: {str(error)}" + return f"[ERROR]: google_search tool execution failed after {max_retries} attempts: {str(error)}" # Wait before retrying await asyncio.sleep(min(2**retry_count, 60)) diff --git a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/utils/smart_request.py b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/utils/smart_request.py index aaf3309c..e3657aa4 100644 --- a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/utils/smart_request.py +++ b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/utils/smart_request.py @@ -1,3 +1,7 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + import os import requests import asyncio diff --git a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/vision_mcp_server.py b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/vision_mcp_server.py index 82362d1b..60b09480 100755 --- a/libs/miroflow-tool/src/miroflow/tool/mcp_servers/vision_mcp_server.py +++ b/libs/miroflow-tool/src/miroflow/tool/mcp_servers/vision_mcp_server.py @@ -1,6 +1,9 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + import base64 import os -import time import random from anthropic import Anthropic @@ -84,7 +87,7 
@@ async def call_claude_vision(image_path_or_url: str, question: str) -> str: messages_for_llm[0]["content"][0]["source"] = dict(type="url", url=url) - max_retries = 5 + max_retries = 4 for attempt in range(1, max_retries + 1): try: client = Anthropic( @@ -105,9 +108,9 @@ async def call_claude_vision(image_path_or_url: str, question: str) -> str: break # Success, exit retry loop except Exception as e: if attempt == max_retries: - result = f"Visual Question Answering (Claude Client) failed after {max_retries} retries: {e}\n" + result = f"[ERROR]: Visual Question Answering (Claude Client) failed after {max_retries} retries: {e}\n" break - time.sleep(4**attempt) # Exponential backoff + await asyncio.sleep(4**attempt) # Exponential backoff return result @@ -280,7 +283,7 @@ async def visual_question_answering(image_path_or_url: str, question: str) -> st Return only the extracted text content, maintaining the original formatting and structure as much as possible. If there is no text in the image, respond with 'No text found'. If there are areas where text may exist but is unreadable or ambiguous, describe these as well.""" - gemini_ocr_result = await call_claude_vision(image_path_or_url, ocr_prompt) + ocr_result = await call_claude_vision(image_path_or_url, ocr_prompt) vqa_prompt = f"""You are a highly attentive visual analysis assistant. Your task is to carefully examine the image and provide a thorough, accurate answer to the question. @@ -296,7 +299,7 @@ async def visual_question_answering(image_path_or_url: str, question: str) -> st Remember: Your analysis will be used by someone who cannot see the image themselves. Any possible guess, uncertainty, or ambiguity should be reported in words rather than left out, so that nothing is omitted or lost. 
The OCR result of this image is as follows (may be incomplete or missing some text): -{gemini_ocr_result} +{ocr_result} Question to answer: {question} @@ -304,9 +307,9 @@ async def visual_question_answering(image_path_or_url: str, question: str) -> st """ # Before answering, carefully analyze both the question and the image. Identify and briefly list potential subtle or easily overlooked VQA pitfalls or ambiguities that could arise in interpreting this question or image (e.g., confusing similar objects, missing small details, misreading text, ambiguous context, etc.). For each, suggest a method or strategy to avoid or mitigate these issues. Only after this analysis, proceed to answer the question, providing a thorough and detailed observation and reasoning process. - gemini_vqa_result = await call_claude_vision(image_path_or_url, vqa_prompt) + vqa_result = await call_claude_vision(image_path_or_url, vqa_prompt) - return f"OCR results:\n{gemini_ocr_result}\n\nVQA result:\n{gemini_vqa_result}" + return f"OCR results:\n{ocr_result}\n\nVQA result:\n{vqa_result}" # The tool visual_audio_youtube_analyzing only support single YouTube URL as input for now, though GEMINI can support multiple URLs up to 10 per request. @@ -404,6 +407,7 @@ async def visual_audio_youtube_analyzing( else: transcribe_content = "" + answer_content = "" if question != "": prompt = f"Answer the following question: {question}" retry_count = 0 @@ -465,8 +469,6 @@ async def visual_audio_youtube_analyzing( else: answer_content = f"[ERROR]: Failed to answer the question: {str(e)}" break - else: - answer_content = "" hint = "\n\nHint: Large videos may trigger rate limits causing failures. If you need more website information rather than video visual content itself (such as video subtitles, titles, descriptions, key moments), you can also call tool `scrape_website` tool." 
return transcribe_content + answer_content + hint diff --git a/libs/miroflow-tool/tests/test_searching_mcp_server.py b/libs/miroflow-tool/tests/test_searching_mcp_server.py index 8885e2f4..efb72c2f 100644 --- a/libs/miroflow-tool/tests/test_searching_mcp_server.py +++ b/libs/miroflow-tool/tests/test_searching_mcp_server.py @@ -1,3 +1,7 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + import asyncio import os import sys diff --git a/libs/miroflow/pyproject.toml b/libs/miroflow/pyproject.toml index 5f729838..e7cef763 100644 --- a/libs/miroflow/pyproject.toml +++ b/libs/miroflow/pyproject.toml @@ -11,7 +11,7 @@ dependencies = [ "json5>=0.12.0", "miroflow-contrib>=0.1.0", "miroflow-tool>=0.1.0", - "openai>=1.98.0", + "openai==1.78.1", "pyyaml>=6.0.2", # conflict with google-genai "rich>=14.1.0", diff --git a/libs/miroflow/src/miroflow/llm/provider_client_base.py b/libs/miroflow/src/miroflow/llm/provider_client_base.py index 1eb49b49..f35a3459 100644 --- a/libs/miroflow/src/miroflow/llm/provider_client_base.py +++ b/libs/miroflow/src/miroflow/llm/provider_client_base.py @@ -61,15 +61,24 @@ def __post_init__(self): self.top_p: float = self.cfg.llm.top_p self.min_p: float = self.cfg.llm.min_p self.top_k: int = self.cfg.llm.top_k + self.repetition_penalty: float = self.cfg.llm.get("repetition_penalty", 1.0) self.max_tokens: int = self.cfg.llm.max_tokens - self.max_context_length: int = self.cfg.llm.max_context_length + self.max_context_length: int = self.cfg.llm.get("max_context_length", -1) self.oai_tool_thinking: bool = self.cfg.llm.oai_tool_thinking self.async_client: bool = self.cfg.llm.async_client self.keep_tool_result: int = self.cfg.llm.keep_tool_result self.anthropic_base_url: str | None = self.cfg.env.anthropic_base_url self.openai_base_url: str | None = self.cfg.env.openai_base_url self.newapi_base_url: str | None = self.cfg.env.newapi_base_url - self.openrouter_base_url: str | None = self.cfg.env.openrouter_base_url + 
self.openrouter_base_url: str | None = ( + self.cfg.llm.get("openrouter_base_url") or self.cfg.env.openrouter_base_url + ) + # Handle special empty value for openrouter_api_key + openrouter_api_key_config = self.cfg.llm.get("openrouter_api_key") + if openrouter_api_key_config is not None: + self.openrouter_api_key: Optional[str] = openrouter_api_key_config + else: + self.openrouter_api_key: Optional[str] = self.cfg.env.openrouter_api_key self.use_tool_calls: Optional[bool] = self.cfg.llm.get("use_tool_calls") self.openrouter_provider: Optional[str] = self.cfg.llm.get( "openrouter_provider" @@ -86,6 +95,21 @@ def __post_init__(self): logger.info( f"openrouter_provider config value: {self.openrouter_provider} (type: {type(self.openrouter_provider)})" ) + logger.info( + f"openrouter_base_url config value: {self.openrouter_base_url} (source: {'config file' if self.cfg.llm.get('openrouter_base_url') else 'environment variable'})" + ) + api_key_masked = ( + f"{self.openrouter_api_key[:10]}..." 
+ if self.openrouter_api_key and len(self.openrouter_api_key) > 10 + else self.openrouter_api_key + ) + if openrouter_api_key_config is not None: + api_key_source = "config file" + else: + api_key_source = "environment variable" + logger.info( + f"openrouter_api_key config value: {api_key_masked} (source: {api_key_source})" + ) logger.info( f"disable_cache_control config value: {disable_cache_control_val} (type: {type(disable_cache_control_val)}) -> parsed as: {self.disable_cache_control}" ) @@ -318,17 +342,13 @@ async def create_message( response = None # Unified LLM call handling - try: - response = await self._create_message( - system_prompt, - filtered_history, - tool_definitions, - keep_tool_result=keep_tool_result, - ) - except Exception as e: - logger.exception(e) - finally: - return response + response = await self._create_message( + system_prompt, + filtered_history, + tool_definitions, + keep_tool_result=keep_tool_result, + ) + return response @staticmethod async def convert_tool_definition_to_tool_call(tools_definitions): @@ -433,13 +453,6 @@ def _format_response_for_log(self, response) -> Dict: return formatted - # required by orchestrator.py - # @abstractmethod - def ensure_summary_context( - self, message_history: list[dict[str, Any]], summary_prompt: str - ) -> bool: - return True - @abstractmethod def update_message_history( self, @@ -451,7 +464,10 @@ def update_message_history( @abstractmethod def generate_agent_system_prompt( - self, date: datetime.datetime, mcp_servers: list[dict] + self, + date: datetime.datetime, + mcp_servers: list[dict], + chinese_context: bool = False, ) -> str: raise NotImplementedError("must implement in subclass") diff --git a/libs/miroflow/src/miroflow/llm/providers/claude_anthropic_client.py b/libs/miroflow/src/miroflow/llm/providers/claude_anthropic_client.py index 2566e240..2c55bb6c 100644 --- a/libs/miroflow/src/miroflow/llm/providers/claude_anthropic_client.py +++ 
b/libs/miroflow/src/miroflow/llm/providers/claude_anthropic_client.py @@ -215,8 +215,8 @@ def update_message_history( return message_history - def generate_agent_system_prompt(self, date, mcp_servers) -> str: - return generate_mcp_system_prompt(date, mcp_servers) + def generate_agent_system_prompt(self, date, mcp_servers, chinese_context) -> str: + return generate_mcp_system_prompt(date, mcp_servers, chinese_context) def handle_max_turns_reached_summary_prompt(self, message_history, summary_prompt): """Handle max turns reached summary prompt""" diff --git a/libs/miroflow/src/miroflow/llm/providers/claude_newapi_client.py b/libs/miroflow/src/miroflow/llm/providers/claude_newapi_client.py index a276fa4f..b1042a85 100644 --- a/libs/miroflow/src/miroflow/llm/providers/claude_newapi_client.py +++ b/libs/miroflow/src/miroflow/llm/providers/claude_newapi_client.py @@ -282,8 +282,8 @@ def update_message_history( return message_history - def generate_agent_system_prompt(self, date, mcp_servers) -> str: - return generate_mcp_system_prompt(date, mcp_servers) + def generate_agent_system_prompt(self, date, mcp_servers, chinese_context) -> str: + return generate_mcp_system_prompt(date, mcp_servers, chinese_context) def parse_llm_response(self, llm_response) -> str: """Parse OpenAI LLM response to get text content""" diff --git a/libs/miroflow/src/miroflow/llm/providers/claude_openrouter_client.py b/libs/miroflow/src/miroflow/llm/providers/claude_openrouter_client.py index c9af91b9..3975cf5b 100644 --- a/libs/miroflow/src/miroflow/llm/providers/claude_openrouter_client.py +++ b/libs/miroflow/src/miroflow/llm/providers/claude_openrouter_client.py @@ -174,6 +174,7 @@ async def _create_message( # build extra_body if self.openrouter_provider provider_config = (self.openrouter_provider or "").strip().lower() + logger.info(f"provider_config: {provider_config}") if provider_config == "google": extra_body = { "provider": { @@ -186,41 +187,82 @@ async def _create_message( } elif 
provider_config == "anthropic": extra_body = {"provider": {"only": ["anthropic"]}} + # extra_body["provider"]["ignore"] = ["google-vertex/us", "google-vertex/europe", "google-vertex/global"] elif provider_config == "amazon": extra_body = {"provider": {"only": ["amazon-bedrock"]}} + elif provider_config != "": + extra_body = {"provider": {"only": [provider_config]}} else: extra_body = {} + # Add top_k and min_p through extra_body for OpenRouter + if self.top_k != -1: + extra_body["top_k"] = self.top_k + if self.min_p != 0.0: + extra_body["min_p"] = self.min_p + if self.repetition_penalty != 1.0: + extra_body["repetition_penalty"] = self.repetition_penalty + params = { "model": self.model_name, "temperature": temperature, - "top_p": self.top_p if self.top_p != 1.0 else None, "max_tokens": self.max_tokens, "messages": processed_messages, "stream": False, "extra_body": extra_body, } + # Add optional parameters only if they have non-default values + if self.top_p != 1.0: + params["top_p"] = self.top_p + response = await self._create_completion(params, self.async_client) # Update token count self._update_token_usage(getattr(response, "usage", None)) - if response.choices is None: - logger.debug(f"LLM call failed, response: {response}") - else: + + if ( + response is None + or response.choices is None + or len(response.choices) == 0 + ): + logger.debug(f"LLM call failed: response = {response}") + raise Exception(f"LLM call failed [rare case]: response = {response}") + + if response.choices and response.choices[0].finish_reason == "length": logger.debug( - f"LLM call status: {getattr(response.choices[0], 'finish_reason', 'N/A')}" + "LLM finish_reason is 'length', triggering ContextLimitError" + ) + raise ContextLimitError( + "(finish_reason=length) Response truncated due to maximum context length" ) + + if ( + response.choices + and response.choices[0].finish_reason == "stop" + and response.choices[0].message.content.strip() == "" + ): + logger.debug( + "LLM finish_reason 
is 'stop', but content is empty, triggering Error" + ) + raise Exception("LLM finish_reason is 'stop', but content is empty") + + logger.debug( + f"LLM call finish_reason: {getattr(response.choices[0], 'finish_reason', 'N/A')}" + ) return response except asyncio.CancelledError: logger.debug("[WARNING] LLM API call was cancelled during execution") - raise + raise Exception("LLM API call was cancelled during execution") except Exception as e: error_str = str(e) if ( "Input is too long for requested model" in error_str or "input length and `max_tokens` exceed context limit" in error_str or "maximum context length" in error_str + or "prompt is too long" in error_str + or "exceeds the maximum length" in error_str + or "exceeds the maximum allowed length" in error_str ): logger.debug(f"OpenRouter LLM Context limit exceeded: {error_str}") raise ContextLimitError(f"Context limit exceeded: {error_str}") @@ -253,7 +295,7 @@ def process_llm_response( if not llm_response or not llm_response.choices: error_msg = "LLM did not return a valid response." 
- logger.debug(f"Error: {error_msg}") + logger.error(f"Should never happen: {error_msg}") return "", True # Exit loop # Extract LLM response text @@ -263,7 +305,6 @@ def process_llm_response( assistant_response_text = self._clean_user_content_from_response( assistant_response_text ) - assistant_response_text = llm_response.choices[0].message.content or "" message_history.append( {"role": "assistant", "content": assistant_response_text} ) @@ -279,9 +320,16 @@ def process_llm_response( {"role": "assistant", "content": assistant_response_text} ) else: - raise ValueError( + logger.error( f"Unsupported finish reason: {llm_response.choices[0].finish_reason}" ) + assistant_response_text = ( + "Successful response, but unsupported finish reason: " + + llm_response.choices[0].finish_reason + ) + message_history.append( + {"role": "assistant", "content": assistant_response_text} + ) logger.debug(f"LLM Response: {assistant_response_text}") return assistant_response_text, False @@ -298,9 +346,6 @@ def update_message_history( ): """Update message history with tool calls data (llm client specific)""" - merged_text = "\n".join( - [item[1]["text"] for item in tool_call_info if item[1]["type"] == "text"] - ) # Filter tool call results with type "text" tool_call_info = [item for item in tool_call_info if item[1]["type"] == "text"] @@ -357,8 +402,8 @@ def update_message_history( ) return message_history - def generate_agent_system_prompt(self, date, mcp_servers) -> str: - return generate_mcp_system_prompt(date, mcp_servers) + def generate_agent_system_prompt(self, date, mcp_servers, chinese_context) -> str: + return generate_mcp_system_prompt(date, mcp_servers, chinese_context) def parse_llm_response(self, llm_response) -> str: """Parse OpenAI LLM response to get text content""" @@ -382,61 +427,6 @@ def _estimate_tokens(self, text: str) -> int: # If encoding fails, use simple estimation: about 1 token per 4 characters return len(text) // 4 - def ensure_summary_context( - self, 
message_history: list, summary_prompt: str - ) -> bool: - """ - Check if current message_history + summary_prompt would exceed context - If it would exceed, remove last assistant-user pair and return False - Return True means can continue, False means messages have been rolled back - """ - # Get token usage from last LLM call - last_prompt_tokens = self.last_call_tokens.get("prompt_tokens", 0) - last_completion_tokens = self.last_call_tokens.get("completion_tokens", 0) - buffer_factor = 1.2 - - # Calculate token count of summary prompt - summary_tokens = self._estimate_tokens(summary_prompt) * buffer_factor - - # Calculate token count of last user message in message_history (if exists and not sent) - last_user_tokens = 0 - if message_history[-1]["role"] == "user": - content = message_history[-1]["content"][0]["text"] - last_user_tokens = self._estimate_tokens(content) * buffer_factor - - # Calculate total token count: last prompt + completion + last user message + summary + reserved response space - estimated_total = ( - last_prompt_tokens - + last_completion_tokens - + last_user_tokens - + summary_tokens - + self.max_tokens - ) - - if estimated_total >= self.max_context_length: - logger.debug( - f"Current context + summary would exceed limit: {estimated_total} >= {self.max_context_length}" - ) - - # Remove last user message (tool call results) - if message_history[-1]["role"] == "user": - message_history.pop() - - # Remove second-to-last assistant message (tool call request) - if message_history[-1]["role"] == "assistant": - message_history.pop() - - logger.debug( - f"Removed last assistant-user pair, current message_history length: {len(message_history)}" - ) - - return False - - logger.debug( - f"Context check passed: {estimated_total}/{self.max_context_length}" - ) - return True - def handle_max_turns_reached_summary_prompt(self, message_history, summary_prompt): """Handle max turns reached summary prompt""" if message_history[-1]["role"] == "user": diff --git 
a/libs/miroflow/src/miroflow/llm/providers/deepseek_newapi_client.py b/libs/miroflow/src/miroflow/llm/providers/deepseek_newapi_client.py index cde04740..d0493c79 100644 --- a/libs/miroflow/src/miroflow/llm/providers/deepseek_newapi_client.py +++ b/libs/miroflow/src/miroflow/llm/providers/deepseek_newapi_client.py @@ -278,8 +278,8 @@ def update_message_history( return message_history - def generate_agent_system_prompt(self, date, mcp_servers) -> str: - return generate_mcp_system_prompt(date, mcp_servers) + def generate_agent_system_prompt(self, date, mcp_servers, chinese_context) -> str: + return generate_mcp_system_prompt(date, mcp_servers, chinese_context) def parse_llm_response(self, llm_response) -> str: """Parse OpenAI LLM response to get text content""" diff --git a/libs/miroflow/src/miroflow/llm/providers/gpt_openai_client.py b/libs/miroflow/src/miroflow/llm/providers/gpt_openai_client.py index de5f52ca..c7d105e9 100644 --- a/libs/miroflow/src/miroflow/llm/providers/gpt_openai_client.py +++ b/libs/miroflow/src/miroflow/llm/providers/gpt_openai_client.py @@ -320,8 +320,8 @@ def update_message_history( return message_history - def generate_agent_system_prompt(self, date, mcp_servers) -> str: - return generate_no_mcp_system_prompt(date) + def generate_agent_system_prompt(self, date, mcp_servers, chinese_context) -> str: + return generate_no_mcp_system_prompt(date, chinese_context) def handle_max_turns_reached_summary_prompt(self, message_history, summary_prompt): """Handle max turns reached summary prompt""" diff --git a/libs/miroflow/src/miroflow/llm/providers/gpt_openai_response_client.py b/libs/miroflow/src/miroflow/llm/providers/gpt_openai_response_client.py index 5d84a2b0..5a0a9793 100644 --- a/libs/miroflow/src/miroflow/llm/providers/gpt_openai_response_client.py +++ b/libs/miroflow/src/miroflow/llm/providers/gpt_openai_response_client.py @@ -316,8 +316,8 @@ def update_message_history( return message_history - def generate_agent_system_prompt(self, date, 
mcp_servers) -> str: - return generate_no_mcp_system_prompt(date) + def generate_agent_system_prompt(self, date, mcp_servers, chinese_context) -> str: + return generate_no_mcp_system_prompt(date, chinese_context) def handle_max_turns_reached_summary_prompt(self, message_history, summary_prompt): """Handle max turns reached summary prompt""" diff --git a/libs/miroflow/src/miroflow/llm/providers/qwen_sglang_client.py b/libs/miroflow/src/miroflow/llm/providers/qwen_sglang_client.py index 4a2abe65..34ed9452 100644 --- a/libs/miroflow/src/miroflow/llm/providers/qwen_sglang_client.py +++ b/libs/miroflow/src/miroflow/llm/providers/qwen_sglang_client.py @@ -191,8 +191,8 @@ def update_message_history( return message_history - def generate_agent_system_prompt(self, date, mcp_servers) -> str: - return generate_mcp_system_prompt(date, mcp_servers) + def generate_agent_system_prompt(self, date, mcp_servers, chinese_context) -> str: + return generate_mcp_system_prompt(date, mcp_servers, chinese_context) def handle_max_turns_reached_summary_prompt(self, message_history, summary_prompt): """Handle max turns reached summary prompt""" diff --git a/libs/miroflow/src/miroflow/logging/task_tracer.py b/libs/miroflow/src/miroflow/logging/task_tracer.py index 40ffa351..849c802a 100644 --- a/libs/miroflow/src/miroflow/logging/task_tracer.py +++ b/libs/miroflow/src/miroflow/logging/task_tracer.py @@ -54,7 +54,7 @@ class TaskTracer(BaseModel): # record task result. hydrdrated AFTER task execution. final_boxed_answer: str = "" - llm_as_judge_result: str = "" + judge_result: str = "" error: str = "" # record task exection detail. hydrated DURING task_execution. 
diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/claude03_claude_dual.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/claude03_claude_dual.yaml new file mode 100644 index 00000000..99beba45 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/claude03_claude_dual.yaml @@ -0,0 +1,35 @@ +# config/agent/claude03_claude_dual.yaml +# Agent configuration with Claude 3.7 Sonnet (temp=0.3) for both main and sub agents +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "claude-3.7-sonnet_temp03" # Main agent uses Claude 3.7 Sonnet (temp=0.3) +sub_agent_llm: "claude-3.7-sonnet_temp03" # Sub agents use Claude 3.7 Sonnet + +main_agent: + tools: + - tool-reasoning + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +sub_agents: + agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: true +o3_final_answer: true # Use O3 to extract final answer from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/claude05_claude_dual.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/claude05_claude_dual.yaml new file mode 100644 index 00000000..2e2656be --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/claude05_claude_dual.yaml @@ -0,0 +1,35 @@ +# config/agent/claude05_claude_dual.yaml +# Agent configuration with Claude 3.7 Sonnet (temp=0.5) for main agent and Claude 3.7 Sonnet (temp=0.3) for sub agents +# The name of tools and sub-agents defined in: 
./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "claude-3.7-sonnet_temp05" # Main agent uses Claude 3.7 Sonnet (temp=0.5) +sub_agent_llm: "claude-3.7-sonnet_temp03" # Sub agents use Claude 3.7 Sonnet + +main_agent: + tools: + - tool-reasoning + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +sub_agents: + agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: true +o3_final_answer: true # Use O3 to extract final answer from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/claude07_claude_dual.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/claude07_claude_dual.yaml new file mode 100644 index 00000000..27520915 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/claude07_claude_dual.yaml @@ -0,0 +1,35 @@ +# config/agent/claude07_claude_dual.yaml +# Agent configuration with Claude 3.7 Sonnet (temp=0.7) for main agent and Claude 3.7 Sonnet (temp=0.3) for sub agents +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "claude-3.7-sonnet_temp07" # Main agent uses Claude 3.7 Sonnet (temp=0.7) +sub_agent_llm: "claude-3.7-sonnet_temp03" # Sub agents use Claude 3.7 Sonnet + +main_agent: + tools: + - tool-reasoning + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +sub_agents: + 
agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: true +o3_final_answer: true # Use O3 to extract final answer from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_claude_dual.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_claude_dual.yaml new file mode 100644 index 00000000..7e163655 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_claude_dual.yaml @@ -0,0 +1,35 @@ +# config/agent/deepseek_claude_dual.yaml +# Agent configuration with DeepSeek R1 for main agent and Claude 3.7 Sonnet for sub agents +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "deepseek-r1-0528" # Main agent uses DeepSeek R1 +sub_agent_llm: "claude-3.7-sonnet_temp03" # Sub agents use Claude 3.7 Sonnet + +main_agent: + tools: + - tool-reasoning + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +sub_agents: + agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: true +o3_final_answer: true # Use O3 to extract final answer from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_deepseek_dual.yaml 
b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_deepseek_dual.yaml new file mode 100644 index 00000000..fb634c95 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_deepseek_dual.yaml @@ -0,0 +1,35 @@ +# config/agent/deepseek_deepseek_dual.yaml +# Agent configuration with DeepSeek R1 for main agent and DeepSeek V3 for sub agents +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "deepseek-r1-0528" # Main agent uses DeepSeek R1 +sub_agent_llm: "deepseek-v3" # Sub agents use DeepSeek V3 + +main_agent: + tools: + - tool-reasoning + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +sub_agents: + agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: true +o3_final_answer: true # Use O3 to extract final answer from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_kimi_dual.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_kimi_dual.yaml new file mode 100644 index 00000000..2b001ae4 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_kimi_dual.yaml @@ -0,0 +1,35 @@ +# config/agent/deepseek_kimi_dual.yaml +# Agent configuration with DeepSeek R1 for main agent and Kimi K2 for sub agents +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "deepseek-r1-0528" # Main 
agent uses DeepSeek R1 +sub_agent_llm: "kimi-k2" # Sub agents use Kimi K2 + +main_agent: + tools: + - tool-reasoning + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +sub_agents: + agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: true +o3_final_answer: true # Use O3 to extract final answer from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_nohint_claude_dual.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_nohint_claude_dual.yaml new file mode 100644 index 00000000..cbc0294f --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_nohint_claude_dual.yaml @@ -0,0 +1,35 @@ +# config/agent/deepseek_nohint_claude_dual.yaml +# Agent configuration with DeepSeek R1 for main agent and Claude 3.7 Sonnet for sub agents +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "deepseek-r1-0528" # Main agent uses DeepSeek R1 +sub_agent_llm: "claude-3.7-sonnet_temp03" # Sub agents use Claude 3.7 Sonnet + +main_agent: + tools: + - tool-reasoning + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +sub_agents: + agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: false +o3_final_answer: true # Use O3 to extract final answer 
from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_nohintreason_claude_dual.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_nohintreason_claude_dual.yaml new file mode 100644 index 00000000..ab99062a --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_nohintreason_claude_dual.yaml @@ -0,0 +1,34 @@ +# config/agent/deepseek_nohintreason_claude_dual.yaml +# Agent configuration with DeepSeek R1 for main agent and Claude 3.7 Sonnet for sub agents +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "deepseek-r1-0528" # Main agent uses DeepSeek R1 +sub_agent_llm: "claude-3.7-sonnet_temp03" # Sub agents use Claude 3.7 Sonnet + +main_agent: + tools: [] + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +sub_agents: + agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: false +o3_final_answer: true # Use O3 to extract final answer from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_qwen3_dual.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_qwen3_dual.yaml new file mode 100644 index 00000000..fa485f41 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_qwen3_dual.yaml @@ -0,0 +1,35 @@ +# config/agent/deepseek_qwen3_dual.yaml +# 
Agent configuration with DeepSeek R1 for main agent and Qwen3 Coder for sub agents +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "deepseek-r1-0528" # Main agent uses DeepSeek R1 +sub_agent_llm: "qwen3-coder" # Sub agents use Qwen3 Coder + +main_agent: + tools: + - tool-reasoning + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +sub_agents: + agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: true +o3_final_answer: true # Use O3 to extract final answer from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_qwen3flash_dual.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_qwen3flash_dual.yaml new file mode 100644 index 00000000..e960f6a1 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/deepseek_qwen3flash_dual.yaml @@ -0,0 +1,35 @@ +# config/agent/deepseek_qwen3flash_dual.yaml +# Agent configuration with DeepSeek R1 for main agent and Qwen3 Coder Flash for sub agents +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "deepseek-r1-0528" # Main agent uses DeepSeek R1 +sub_agent_llm: "qwen3-coder-flash" # Sub agents use Qwen3 Coder Flash + +main_agent: + tools: + - tool-reasoning + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number 
of tool calls per turn + +sub_agents: + agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: true +o3_final_answer: true # Use O3 to extract final answer from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/_default.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/default.yaml similarity index 63% rename from libs/miroflow/src/miroflow/prebuilt/config/agent/_default.yaml rename to libs/miroflow/src/miroflow/prebuilt/config/agent/default.yaml index 8018ff54..6d7e5e04 100644 --- a/libs/miroflow/src/miroflow/prebuilt/config/agent/_default.yaml +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/default.yaml @@ -1,26 +1,26 @@ -# conf/agent/default.yaml -# The name of tools and sub-agents defined in: mirage/apps/reorg-modular-structure/src/mirage_agent/config/settings.py +# config/agent/default.yaml +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py # Each sub-agent prompt is written in: mirage/apps/reorg-modular-structure/src/mirage_agent/utils/prompt_utils.py main_agent: tools: - - tool-code - - tool-vqa - - tool-transcribe - tool-reasoning - - tool-markitdown + # - tool-code + # - tool-image-video + # - tool-audio + # - tool-markitdown # tool_blacklist: # - ["name-of-tool", "name-of-method"] # tool blacklist example max_turns: 20 # Maximum number of turns for main agent execution max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn -sub_agents: - agent-browsing: - tools: - - tool-serper-search - - tool-vqa - - tool-markitdown - - tool-code - max_turns: 20 - max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn +# sub_agents: + # 
agent-browsing: + # tools: + # - tool-serper-search + # - tool-image-video + # - tool-markitdown + # - tool-code + # max_turns: 20 + # max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn # agent-coding: # tools: # - tool-code @@ -32,7 +32,7 @@ sub_agents: # max_turns: 20 tool_config: - tool-vqa: + tool-image-video: enable_claude_vision: "true" enable_openai_vision: "true" diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/gptoss_gptoss_dual.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/gptoss_gptoss_dual.yaml new file mode 100644 index 00000000..8513e830 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/gptoss_gptoss_dual.yaml @@ -0,0 +1,35 @@ +# config/agent/gptoss_gptoss_dual.yaml +# Agent configuration with GPT-OSS 120B for both main and sub agents +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "gpt-oss-120b" # Main agent uses GPT-OSS 120B +sub_agent_llm: "gpt-oss-120b" # Sub agents use GPT-OSS 120B + +main_agent: + tools: + - tool-reasoning + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +sub_agents: + agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: true +o3_final_answer: true # Use O3 to extract final answer from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/kimi_claude_dual.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/kimi_claude_dual.yaml new file mode 100644 index 00000000..c36b0d68 --- /dev/null +++ 
b/libs/miroflow/src/miroflow/prebuilt/config/agent/kimi_claude_dual.yaml @@ -0,0 +1,35 @@ +# config/agent/kimi_claude_dual.yaml +# Agent configuration with Kimi K2 for main agent and Claude 3.7 Sonnet for sub agents +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "kimi-k2" # Main agent uses Kimi K2 +sub_agent_llm: "claude-3.7-sonnet_temp03" # Sub agents use Claude 3.7 Sonnet + +main_agent: + tools: + - tool-reasoning + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +sub_agents: + agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: true +o3_final_answer: true # Use O3 to extract final answer from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/miroflow.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/miroflow.yaml index 7ec41db6..a1753acf 100644 --- a/libs/miroflow/src/miroflow/prebuilt/config/agent/miroflow.yaml +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/miroflow.yaml @@ -1,27 +1,26 @@ -# conf/agent/default.yaml -# The name of tools and sub-agents defined in: mirage/apps/reorg-modular-structure/src/mirage_agent/config/settings.py +# config/agent/miroflow.yaml +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py # Each sub-agent prompt is written in: mirage/apps/reorg-modular-structure/src/mirage_agent/utils/prompt_utils.py defaults: - - _default + - default - _self_ main_agent: tools: - - tool-vqa - - tool-reading
- - tool-code - tool-reasoning - - tool-transcribe max_turns: 20 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn # Trival reproduce of OWL: sub_agents: - agent-browsing: + agent-worker: tools: - tool-searching - - tool-vqa + - tool-image-video + - tool-audio - tool-reading - tool-code max_turns: 20 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn o3_hint: true o3_final_answer: true diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/qwen3_claude_dual.yaml b/libs/miroflow/src/miroflow/prebuilt/config/agent/qwen3_claude_dual.yaml new file mode 100644 index 00000000..6e50f33d --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/qwen3_claude_dual.yaml @@ -0,0 +1,35 @@ +# config/agent/qwen3_claude_dual.yaml +# Agent configuration with Qwen3 235B Thinking for main agent and Claude 3.7 Sonnet for sub agents +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "qwen3-235b-thinking" # Main agent uses Qwen3 235B Thinking +sub_agent_llm: "claude-3.7-sonnet_temp03" # Sub agents use Claude 3.7 Sonnet + +main_agent: + tools: + - tool-reasoning + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +sub_agents: + agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: true +o3_final_answer: true # Use O3 to extract final answer from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/agent/seed_claude_dual.yaml 
b/libs/miroflow/src/miroflow/prebuilt/config/agent/seed_claude_dual.yaml new file mode 100644 index 00000000..55cfb947 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/agent/seed_claude_dual.yaml @@ -0,0 +1,35 @@ +# config/agent/seed_claude_dual.yaml +# Agent configuration with Seed 1.6 Thinking for main agent and Claude 3.7 Sonnet for sub agents +# The name of tools and sub-agents defined in: ./miroflow/src/miroflow/utils/prompt_utils.py and tool_utils.py +defaults: + - default + - _self_ + +# LLM configuration for different agent types +main_agent_llm: "seed-1-6-thinking" # Main agent uses Seed 1.6 Thinking +sub_agent_llm: "claude-3.7-sonnet_temp03" # Sub agents use Claude 3.7 Sonnet + +main_agent: + tools: + - tool-reasoning + max_turns: -1 # Maximum number of turns for main agent execution + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +sub_agents: + agent-worker: + tools: + - tool-searching + - tool-image-video + - tool-reading + - tool-code + - tool-audio + max_turns: -1 + max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn + +o3_hint: true +o3_final_answer: true # Use O3 to extract final answer from summary + +# Message ID configuration +add_message_id: true # Add random message ID to all messages sent to LLM + +keep_tool_result: -1 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/benchmark/browsecomp-zh.yaml b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/browsecomp-zh.yaml new file mode 100644 index 00000000..35207e43 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/browsecomp-zh.yaml @@ -0,0 +1,14 @@ +# config/benchmark/browsecomp-zh.yaml +defaults: + - default + - _self_ + +name: "browsecomp-zh" + +data: + data_dir: "${env.data_dir}/browsecomp-zh-test" + + +execution: + max_tasks: null # null means no limit + max_concurrent: 5 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/benchmark/browsecomp.yaml
b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/browsecomp.yaml new file mode 100644 index 00000000..33ec3fdf --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/browsecomp.yaml @@ -0,0 +1,14 @@ +# config/benchmark/browsecomp.yaml +defaults: + - default + - _self_ + +name: "browsecomp" + +data: + data_dir: "${env.data_dir}/browsecomp-test" + + +execution: + max_tasks: null # null means no limit + max_concurrent: 5 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/benchmark/_default.yaml b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/default.yaml similarity index 86% rename from libs/miroflow/src/miroflow/prebuilt/config/benchmark/_default.yaml rename to libs/miroflow/src/miroflow/prebuilt/config/benchmark/default.yaml index 3850f5e2..c3d60953 100644 --- a/libs/miroflow/src/miroflow/prebuilt/config/benchmark/_default.yaml +++ b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/default.yaml @@ -1,4 +1,4 @@ -# conf/benchmark/default.yaml - Default benchmark configuration +# config/benchmark/default.yaml - Default benchmark configuration # This is a base configuration for benchmarks. Specific benchmarks can override these defaults. 
name: "default" diff --git a/libs/miroflow/src/miroflow/prebuilt/config/benchmark/gaia-validation.yaml b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/gaia-validation.yaml index 4a5a0010..235a712c 100644 --- a/libs/miroflow/src/miroflow/prebuilt/config/benchmark/gaia-validation.yaml +++ b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/gaia-validation.yaml @@ -1,6 +1,6 @@ -# conf/benchmark/gaia-validation.yaml +# config/benchmark/gaia-validation.yaml defaults: - - _default + - default - _self_ name: "gaia-validation" diff --git a/libs/miroflow/src/miroflow/prebuilt/config/benchmark/hle-text-500.yaml b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/hle-text-500.yaml new file mode 100644 index 00000000..2c40cf6e --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/hle-text-500.yaml @@ -0,0 +1,14 @@ +# config/benchmark/hle-text-500.yaml +defaults: + - default + - _self_ + +name: "hle-text-500" + +data: + data_dir: "${env.data_dir}/hle-text-500" + + +execution: + max_tasks: null # null means no limit + max_concurrent: 5 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/benchmark/hle.yaml b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/hle.yaml new file mode 100644 index 00000000..f64d53d2 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/hle.yaml @@ -0,0 +1,14 @@ +# config/benchmark/hle.yaml +defaults: + - default + - _self_ + +name: "hle" + +data: + data_dir: "${env.data_dir}/hle" + + +execution: + max_tasks: 850 # null means no limit + max_concurrent: 5 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/benchmark/xbench-ds.yaml b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/xbench-ds.yaml new file mode 100644 index 00000000..294a7f50 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/benchmark/xbench-ds.yaml @@ -0,0 +1,14 @@ +# config/benchmark/xbench-ds.yaml +defaults: + - default + - _self_ + +name: "xbench-ds"
+ +data: + data_dir: "${env.data_dir}/xbench-ds" + + +execution: + max_tasks: null # null means no limit + max_concurrent: 5 \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/config.yaml b/libs/miroflow/src/miroflow/prebuilt/config/config.yaml index 0f3c385d..ed712ec4 100644 --- a/libs/miroflow/src/miroflow/prebuilt/config/config.yaml +++ b/libs/miroflow/src/miroflow/prebuilt/config/config.yaml @@ -1,4 +1,4 @@ -# conf/config.yaml +# config/config.yaml defaults: - llm: claude_openrouter - agent: miroflow @@ -30,7 +30,12 @@ env: https_proxy: "${oc.env:HTTPS_PROXY,???}" # points to where data is data_dir: "${oc.env:DATA_DIR,???}" - + # configs for searching tool + remove_snippets: "${oc.env:REMOVE_SNIPPETS,false}" + remove_knowledge_graph: "${oc.env:REMOVE_KNOWLEDGE_GRAPH,false}" + remove_answer_box: "${oc.env:REMOVE_ANSWER_BOX,false}" + # whether using chinese context + chinese_context: "${oc.env:CHINESE_CONTEXT,false}" # Can define some top-level or default parameters here project_name: "miroflow" diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/claude-3.7-sonnet_temp03.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/claude-3.7-sonnet_temp03.yaml new file mode 100644 index 00000000..9a07ba25 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/claude-3.7-sonnet_temp03.yaml @@ -0,0 +1,25 @@ +# config/llm/claude-3.7-sonnet.yaml - Sub Agent LLM Configuration (Claude 3.7 Sonnet) +provider: "claude_openrouter" +model_name: "anthropic/claude-3.7-sonnet" + +# Basic LLM parameters +async_client: true +temperature: 0.3 +top_p: 0.95 +min_p: 0.0 +top_k: -1 +max_tokens: 32000 +# max_context_length: 190000 # deprecated, no longer used + +# Provider specific settings +openrouter_base_url: https://openrouter.ai/api/v1 +# openrouter_provider: "anthropic" # Force provider +disable_cache_control: false + +# Base URLs +openai_base_url: null +anthropic_base_url: https://api.anthropic.com + +# Other settings 
+keep_tool_result: -1 +oai_tool_thinking: false \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/claude-3.7-sonnet_temp05.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/claude-3.7-sonnet_temp05.yaml new file mode 100644 index 00000000..f4b644b9 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/claude-3.7-sonnet_temp05.yaml @@ -0,0 +1,25 @@ +# config/llm/claude-3.7-sonnet.yaml - Sub Agent LLM Configuration (Claude 3.7 Sonnet) +provider: "claude_openrouter" +model_name: "anthropic/claude-3.7-sonnet" + +# Basic LLM parameters +async_client: true +temperature: 0.5 +top_p: 0.95 +min_p: 0.0 +top_k: -1 +max_tokens: 32000 +# max_context_length: 190000 # deprecated, no longer used + +# Provider specific settings +openrouter_base_url: https://openrouter.ai/api/v1 +openrouter_provider: "anthropic" # Force provider +disable_cache_control: false + +# Base URLs +openai_base_url: null +anthropic_base_url: https://api.anthropic.com + +# Other settings +keep_tool_result: -1 +oai_tool_thinking: false \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/claude-3.7-sonnet_temp07.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/claude-3.7-sonnet_temp07.yaml new file mode 100644 index 00000000..77b98eb2 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/claude-3.7-sonnet_temp07.yaml @@ -0,0 +1,25 @@ +# config/llm/claude-3.7-sonnet.yaml - Sub Agent LLM Configuration (Claude 3.7 Sonnet) +provider: "claude_openrouter" +model_name: "anthropic/claude-3.7-sonnet" + +# Basic LLM parameters +async_client: true +temperature: 0.7 +top_p: 0.95 +min_p: 0.0 +top_k: -1 +max_tokens: 32000 +# max_context_length: 190000 # deprecated, no longer used + +# Provider specific settings +openrouter_base_url: https://openrouter.ai/api/v1 +openrouter_provider: "anthropic" # Force provider +disable_cache_control: false + +# Base URLs +openai_base_url: null +anthropic_base_url: 
https://api.anthropic.com + +# Other settings +keep_tool_result: -1 +oai_tool_thinking: false \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/claude-4-sonnet.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/claude-4-sonnet.yaml new file mode 100644 index 00000000..9b1d1752 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/claude-4-sonnet.yaml @@ -0,0 +1,25 @@ +# config/llm/claude-3.7-sonnet.yaml - Sub Agent LLM Configuration (Claude 3.7 Sonnet) +provider: "claude_openrouter" +model_name: "anthropic/claude-sonnet-4" + +# Basic LLM parameters +async_client: true +temperature: 0.3 +top_p: 0.95 +min_p: 0.0 +top_k: -1 +max_tokens: 32000 +# max_context_length: 190000 # deprecated, no longer used + +# Provider specific settings +openrouter_base_url: https://openrouter.ai/api/v1 +openrouter_provider: "anthropic" # Force provider +disable_cache_control: false + +# Base URLs +openai_base_url: null +anthropic_base_url: https://api.anthropic.com + +# Other settings +keep_tool_result: -1 +oai_tool_thinking: false \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/claude_openrouter.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/claude_openrouter.yaml index 744f9b27..37e7783c 100644 --- a/libs/miroflow/src/miroflow/prebuilt/config/llm/claude_openrouter.yaml +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/claude_openrouter.yaml @@ -1,6 +1,6 @@ -# conf/llm/claude.yaml +# config/llm/claude.yaml defaults: - - _default + - default - _self_ provider: "claude_openrouter" diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/deepseek-r1-0528.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/deepseek-r1-0528.yaml new file mode 100644 index 00000000..d9929d4c --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/deepseek-r1-0528.yaml @@ -0,0 +1,25 @@ +# config/llm/deepseek-r1.yaml - Main Agent LLM Configuration (DeepSeek R1) +provider: "claude_openrouter" 
+model_name: "deepseek/deepseek-r1-0528" + +# Basic LLM parameters +async_client: true +temperature: 0.6 +top_p: 0.95 +min_p: 0.0 +top_k: -1 +max_tokens: 32000 +# max_context_length: 150000 # deprecated, no longer used + +# Provider specific settings +openrouter_base_url: https://openrouter.ai/api/v1 +openrouter_provider: "google-vertex" # Force provider +disable_cache_control: true + +# Base URLs +openai_base_url: null +anthropic_base_url: https://api.anthropic.com + +# Other settings +keep_tool_result: -1 +oai_tool_thinking: false \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/deepseek-v3.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/deepseek-v3.yaml new file mode 100644 index 00000000..a853a62b --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/deepseek-v3.yaml @@ -0,0 +1,26 @@ +# config/llm/qwen3-235b-thinking.yaml - Main Agent LLM Configuration +provider: "claude_openrouter" +model_name: "deepseek-v3-250324" + +# Basic LLM parameters +async_client: true +temperature: 1.0 +top_p: 0.95 +min_p: 0.0 +top_k: 50 +max_tokens: 16000 +# max_context_length: 130000 # deprecated, no longer used + +# Provider specific settings +openrouter_base_url: # the deployment url +openrouter_api_key: ??? 
# the deployment api key +openrouter_provider: "" # Force provider +disable_cache_control: true # qwen models don't support cache control + +# Base URLs +openai_base_url: null +anthropic_base_url: https://api.anthropic.com + +# Other settings +keep_tool_result: -1 +oai_tool_thinking: false \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/_default.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/default.yaml similarity index 92% rename from libs/miroflow/src/miroflow/prebuilt/config/llm/_default.yaml rename to libs/miroflow/src/miroflow/prebuilt/config/llm/default.yaml index 6a646420..e66a1327 100644 --- a/libs/miroflow/src/miroflow/prebuilt/config/llm/_default.yaml +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/default.yaml @@ -1,4 +1,4 @@ -# conf/llm/default.yaml - Default LLM configuration +# config/llm/default.yaml - Default LLM configuration provider: "openai" # openai, qwen, anthropic, gemini, deepseek model_name: "gpt-4.1" # gpt-4.1, qwen3-14b, claude-sonnet-4-20250514, claude-3-7-sonnet-20250219, gemini-1.5-pro, deepseek-chat async_client: true diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/gemini-2-5-pro.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/gemini-2-5-pro.yaml new file mode 100644 index 00000000..44d6db59 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/gemini-2-5-pro.yaml @@ -0,0 +1,25 @@ +# config/llm/qwen3-coder.yaml - Sub Agent LLM Configuration +provider: "claude_openrouter" +model_name: "google/gemini-2.5-pro" + +# Basic LLM parameters +async_client: true +temperature: 0.7 +top_p: 0.95 +min_p: 0.0 +top_k: -1 +max_tokens: 32000 +# max_context_length: 260000 # deprecated, no longer used + +# Provider specific settings +openrouter_base_url: https://openrouter.ai/api/v1 +openrouter_provider: "google-vertex/us" # Force provider +disable_cache_control: true + +# Base URLs +openai_base_url: null +anthropic_base_url: https://api.anthropic.com + +# Other settings 
+keep_tool_result: -1 +oai_tool_thinking: false \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/gpt-oss-120b.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/gpt-oss-120b.yaml new file mode 100644 index 00000000..5826d696 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/gpt-oss-120b.yaml @@ -0,0 +1,25 @@ +# config/llm/gpt-oss-120b.yaml - Sub Agent LLM Configuration (GPT-OSS-120B) +provider: "claude_openrouter" +model_name: "openai/gpt-oss-120b" + +# Basic LLM parameters +async_client: true +temperature: 1.0 +top_p: 0.95 +min_p: 0.0 +top_k: 100 +max_tokens: 32000 +# max_context_length: 130000 # deprecated, no longer used + +# Provider specific settings +openrouter_base_url: https://openrouter.ai/api/v1 +openrouter_provider: "groq" # Force provider +disable_cache_control: true # OpenAI models don't need manual cache control + +# Base URLs +openai_base_url: null +anthropic_base_url: https://api.anthropic.com + +# Other settings +keep_tool_result: -1 +oai_tool_thinking: false \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/kimi-k2.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/kimi-k2.yaml new file mode 100644 index 00000000..892ab519 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/kimi-k2.yaml @@ -0,0 +1,26 @@ +# config/llm/kimi-k2.yaml - Main Agent LLM Configuration +provider: "claude_openrouter" +model_name: "kimi-k2-250711" + +# Basic LLM parameters +async_client: true +temperature: 0.6 +top_p: 0.95 +min_p: 0.0 +top_k: 20 +max_tokens: 32000 +# max_context_length: 130000 # deprecated, no longer used + +# Provider specific settings +openrouter_base_url: # the deployment url +openrouter_api_key: ??? 
# the deployment api key +openrouter_provider: "" # Force provider +disable_cache_control: true # qwen models don't support cache control + +# Base URLs +openai_base_url: null +anthropic_base_url: https://api.anthropic.com + +# Other settings +keep_tool_result: -1 +oai_tool_thinking: false \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/qwen3-235b-thinking.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/qwen3-235b-thinking.yaml new file mode 100644 index 00000000..e72f4277 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/qwen3-235b-thinking.yaml @@ -0,0 +1,26 @@ +# config/llm/qwen3-235b-thinking.yaml - Main Agent LLM Configuration +provider: "claude_openrouter" +model_name: "Qwen3-235B-A22B-Thinking-2507" + +# Basic LLM parameters +async_client: true +temperature: 0.6 +top_p: 0.95 +min_p: 0.0 +top_k: -1 +max_tokens: 32000 +# max_context_length: 130000 # deprecated, no longer used + +# Provider specific settings +openrouter_base_url: # the deployment url +openrouter_api_key: ??? 
# the deployment api key +openrouter_provider: "" # Force provider +disable_cache_control: true # qwen models don't support cache control + +# Base URLs +openai_base_url: null +anthropic_base_url: https://api.anthropic.com + +# Other settings +keep_tool_result: -1 +oai_tool_thinking: false \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/qwen3-coder-flash.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/qwen3-coder-flash.yaml new file mode 100644 index 00000000..f24cca58 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/qwen3-coder-flash.yaml @@ -0,0 +1,27 @@ +# config/llm/qwen3-coder.yaml - Main Agent LLM Configuration +provider: "claude_openrouter" +model_name: "Qwen3-Coder-30B-A3B-Instruct" + +# Basic LLM parameters +async_client: true +temperature: 0.7 +top_p: 0.8 +min_p: 0.0 +top_k: 20 +repetition_penalty: 1.05 # Only added to request if not equal to 1.0 +max_tokens: 32000 +# max_context_length: 130000 # deprecated, no longer used + +# Provider specific settings +openrouter_base_url: # the deployment url +openrouter_api_key: ??? 
# the deployment api key +openrouter_provider: "" # Force provider +disable_cache_control: true # qwen models don't support cache control + +# Base URLs +openai_base_url: null +anthropic_base_url: https://api.anthropic.com + +# Other settings +keep_tool_result: -1 +oai_tool_thinking: false \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/qwen3-coder.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/qwen3-coder.yaml new file mode 100644 index 00000000..29d86bb7 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/qwen3-coder.yaml @@ -0,0 +1,26 @@ +# config/llm/qwen3-coder.yaml - Main Agent LLM Configuration +provider: "claude_openrouter" +model_name: "Qwen3-Coder-480B-A35B-Instruct" + +# Basic LLM parameters +async_client: true +temperature: 0.7 +top_p: 0.8 +min_p: 0.0 +top_k: 20 +max_tokens: 32000 +# max_context_length: 130000 # deprecated, no longer used + +# Provider specific settings +openrouter_base_url: # the deployment url +openrouter_api_key: ??? 
# the deployment api key +openrouter_provider: "" # Force provider +disable_cache_control: true # qwen models don't support cache control + +# Base URLs +openai_base_url: null +anthropic_base_url: https://api.anthropic.com + +# Other settings +keep_tool_result: -1 +oai_tool_thinking: false \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/config/llm/seed-1-6-thinking.yaml b/libs/miroflow/src/miroflow/prebuilt/config/llm/seed-1-6-thinking.yaml new file mode 100644 index 00000000..fb4736b4 --- /dev/null +++ b/libs/miroflow/src/miroflow/prebuilt/config/llm/seed-1-6-thinking.yaml @@ -0,0 +1,26 @@ +# config/llm/seed-1-6-thinking.yaml - Main Agent LLM Configuration +provider: "claude_openrouter" +model_name: "doubao-seed-1-6-thinking-250715" + +# Basic LLM parameters +async_client: true +temperature: 0.6 +top_p: 0.95 +min_p: 0.0 +top_k: -1 +max_tokens: 32000 +# max_context_length: 130000 # deprecated, no longer used + +# Provider specific settings +openrouter_base_url: # the deployment url +openrouter_api_key: ??? 
# the deployment api key +openrouter_provider: "" # Force provider +disable_cache_control: true # qwen models don't support cache control + +# Base URLs +openai_base_url: null +anthropic_base_url: https://api.anthropic.com + +# Other settings +keep_tool_result: -1 +oai_tool_thinking: false \ No newline at end of file diff --git a/libs/miroflow/src/miroflow/prebuilt/orchestrator.py b/libs/miroflow/src/miroflow/prebuilt/orchestrator.py index 0e8a90b7..066e86ff 100644 --- a/libs/miroflow/src/miroflow/prebuilt/orchestrator.py +++ b/libs/miroflow/src/miroflow/prebuilt/orchestrator.py @@ -7,12 +7,9 @@ import sys import time import uuid -import re -from typing import Any +from typing import Any, Optional from omegaconf import DictConfig -from openai import AsyncOpenAI -from tenacity import retry, stop_after_attempt, wait_exponential from miroflow.contrib.tracing import function_span, generation_span from miroflow.llm.provider_client_base import LLMProviderClientBase @@ -26,6 +23,11 @@ generate_agent_summarize_prompt, ) from miroflow.utils.tool_utils import expose_sub_agents_as_tools +from miroflow.utils.summary_utils import ( + o3_extract_hints, + o3_extract_gaia_final_answer, + o3_extract_browsecomp_zh_final_answer, +) logger = bootstrap_logger() @@ -63,16 +65,22 @@ def __init__( output_formatter: OutputFormatter, cfg: DictConfig, task_log: TaskTracer, + sub_agent_llm_client: Optional[LLMProviderClientBase] = None, ): self.main_agent_tool_manager = main_agent_tool_manager self.sub_agent_tool_managers = sub_agent_tool_managers self.llm_client = llm_client + self.sub_agent_llm_client = ( + sub_agent_llm_client or llm_client + ) # Use client from main agent if not provided self.output_formatter = output_formatter self.cfg = cfg self.task_log = task_log # call this once, then use cache value self._list_sub_agent_tools = _list_tools(sub_agent_tool_managers) + self.chinese_context = self.cfg.env.chinese_context.lower().strip() == "true" + # Handle add_message_id configuration, 
support string to bool conversion add_message_id_val = self.cfg.agent.get("add_message_id", False) if isinstance(add_message_id_val, str): @@ -86,14 +94,12 @@ def __init__( # Pass task_log to llm_client if self.llm_client and task_log: self.llm_client.task_log = task_log - - # Could be removed, use task_log.log_step instead, will be removed in the future - # def _log_step( - # self, step_name: str, message: str, status: str = "info", level: str = "info" - # ): - # """Log step information""" - # # Use TaskLog's log_step method to record to structured log - # self.task_log.log_step(step_name, message, status) + if ( + self.sub_agent_llm_client + and task_log + and self.sub_agent_llm_client != self.llm_client + ): + self.sub_agent_llm_client.task_log = task_log async def _handle_llm_call_with_logging( self, @@ -110,6 +116,11 @@ async def _handle_llm_call_with_logging( tuple[Optional[str], bool, Optional[object]]: (response_text, should_break, tool_calls_info) """ + # Select correct LLM client based on agent_type + current_llm_client = ( + self.llm_client if agent_type == "main" else self.sub_agent_llm_client + ) + # Add message ID to user messages (if configured and message doesn't have ID yet) if self.add_message_id: for message in message_history: @@ -145,9 +156,9 @@ async def _handle_llm_call_with_logging( try: with generation_span( - input=message_history, model=self.llm_client.model_name + input=message_history, model=current_llm_client.model_name ) as span: - response = await self.llm_client.create_message( + response = await current_llm_client.create_message( system_prompt=system_prompt, message_history=message_history, tool_definitions=tool_definitions, @@ -164,7 +175,7 @@ async def _handle_llm_call_with_logging( if response: # Use client's response processing method assistant_response_text, should_break = ( - self.llm_client.process_llm_response( + current_llm_client.process_llm_response( response, message_history, agent_type ) ) @@ -186,7 +197,7 @@ async def 
_handle_llm_call_with_logging( self.task_log.save() # Use client's tool call information extraction method - tool_calls_info = self.llm_client.extract_tool_calls_info( + tool_calls_info = current_llm_client.extract_tool_calls_info( response, assistant_response_text ) @@ -269,10 +280,14 @@ async def _handle_summary_with_context_limit_retry( task_description + task_guidence, task_failed=task_failed, agent_type=agent_type, + chinese_context=self.chinese_context, ) # Handle merging of message history and summary prompt - summary_prompt = self.llm_client.handle_max_turns_reached_summary_prompt( + current_llm_client = ( + self.llm_client if agent_type == "main" else self.sub_agent_llm_client + ) + summary_prompt = current_llm_client.handle_max_turns_reached_summary_prompt( message_history, summary_prompt ) @@ -281,14 +296,31 @@ async def _handle_summary_with_context_limit_retry( {"role": "user", "content": [{"type": "text", "text": summary_prompt}]} ) - response_text, _, tool_calls = await self._handle_llm_call_with_logging( - system_prompt, - message_history, - tool_definitions, - 999, - purpose, - agent_type=agent_type, - ) + for network_retry_count in range(5): + ( + response_text, + _, + tool_calls_info, + ) = await self._handle_llm_call_with_logging( + system_prompt, + message_history, + tool_definitions, + 999, + purpose, + agent_type=agent_type, + ) + if response_text or tool_calls_info == "context_limit": + break + else: + logger.error( + f"LLM summary process call failed, attempt {network_retry_count+1}/5, retrying after 60 seconds..." 
+ ) + self.task_log.log_step( + f"{agent_type}_summary_retry", + f"LLM summary process call failed, attempt {network_retry_count+1}/5, retrying after 60 seconds...", + "warning", + ) + await asyncio.sleep(60) if response_text: # Call successful: return generated summary text @@ -297,27 +329,22 @@ async def _handle_summary_with_context_limit_retry( # Context limit exceeded or network issues: try removing messages and retry retry_count += 1 logger.debug( - f"LLM call failed, attempt {retry_count} retry, removing recent assistant-user dialogue" + f"LLM call failed (context_limit), attempt {retry_count} retry, removing recent assistant-user dialogue" ) - # First remove the just-added summary prompt if message_history and message_history[-1]["role"] == "user": message_history.pop() - # Remove the most recent assistant message (tool call request) if message_history and message_history[-1]["role"] == "assistant": message_history.pop() - # Once assistant-user dialogue needs to be removed, task fails (information is lost) task_failed = True - # If there are no more dialogues to remove if len(message_history) <= 2: # Only initial system-user messages remain logger.warning( "Removed all removable dialogues, but still unable to generate summary" ) break - self.task_log.log_step( f"{agent_type}_summary_context_retry", f"Removed assistant-user pair, retry {retry_count}, task marked as failed", @@ -325,8 +352,15 @@ async def _handle_summary_with_context_limit_retry( ) # If still fails after removing all dialogues - logger.error("Summary failed after removing all possible messages") - return "Unable to generate final summary due to persistent network issues. You should try again." 
+ logger.error( + "Summary failed after several attempts (removing all possible messages)" + ) + self.task_log.log_step( + f"{agent_type}_summary_failed", + "Summary failed after several attempts (removing all possible messages)", + "failed", + ) + return "[ERROR] Unable to generate final summary due to context limit or network issues. You should try again." async def run_sub_agent( self, sub_agent_name, task_description, keep_tool_result: int = -1 @@ -363,10 +397,13 @@ async def run_sub_agent( ) # Generate sub-agent system prompt - system_prompt = self.llm_client.generate_agent_system_prompt( + system_prompt = self.sub_agent_llm_client.generate_agent_system_prompt( date=datetime.datetime.today(), mcp_servers=tool_definitions, - ) + generate_agent_specific_system_prompt(agent_type=sub_agent_name) + chinese_context=self.chinese_context, + ) + generate_agent_specific_system_prompt( + agent_type=sub_agent_name, chinese_context=self.chinese_context + ) # Limit sub-agent turns max_turns = self.cfg.agent.sub_agents[sub_agent_name].max_turns @@ -497,7 +534,7 @@ async def run_sub_agent( # Handle empty error messages, especially for TimeoutError error_msg = str(e) or ( - "Tool execution timeout" + "[ERROR]: Tool execution timeout" if isinstance(e, TimeoutError) else f"Tool execution failed: {type(e).__name__}" ) @@ -546,30 +583,10 @@ async def run_sub_agent( ) all_tool_results_content_with_id.append(("FAILED", tool_result_for_llm)) - message_history = self.llm_client.update_message_history( + message_history = self.sub_agent_llm_client.update_message_history( message_history, all_tool_results_content_with_id, tool_calls_exceeded ) - # Generate summary_prompt to check token limit - temp_summary_prompt = generate_agent_summarize_prompt( - task_description, - task_failed=True, # Set to True here to simulate potential task failure for context checking - agent_type=sub_agent_name, - ) - - # Check if current context would exceed limit, auto rollback messages and trigger summary 
if exceeded - if not self.llm_client.ensure_summary_context( - message_history, temp_summary_prompt - ): - # Context estimated to exceed limit, jump to summary stage - task_failed = True # Mark task as failed - self.task_log.log_step( - f"{sub_agent_name}_context_limit_reached", - "Context limit reached, triggering summary", - "warning", - ) - break - # Continue execution logger.debug( f"\n=== Sub Agent {sub_agent_name} Completed ({turn_count} turns) ===" @@ -641,342 +658,6 @@ async def run_sub_agent( # Return final answer instead of dialogue log, so main agent can use directly return final_answer_text - @retry(wait=wait_exponential(multiplier=15), stop=stop_after_attempt(5)) - async def _o3_extract_hints(self, question: str) -> str: - """Use O3 model to extract task hints""" - client = AsyncOpenAI(api_key=self.cfg.env.openai_api_key, timeout=600) - - instruction = """Carefully analyze the given task description (question) without attempting to solve it directly. Your role is to identify potential challenges and areas that require special attention during the solving process, and provide practical guidance for someone who will solve this task by actively gathering and analyzing information from the web. - -Identify and concisely list key points in the question that could potentially impact subsequent information collection or the accuracy and completeness of the problem solution, especially those likely to cause mistakes, carelessness, or confusion during problem-solving. - -The question author does not intend to set traps or intentionally create confusion. Interpret the question in the most common, reasonable, and straightforward manner, without speculating about hidden meanings or unlikely scenarios. However, be aware that mistakes, imprecise wording, or inconsistencies may exist due to carelessness or limited subject expertise, rather than intentional ambiguity. 
- -Additionally, when considering potential answers or interpretations, note that question authors typically favor more common and familiar expressions over overly technical, formal, or obscure terminology. They generally prefer straightforward and common-sense interpretations rather than being excessively cautious or academically rigorous in their wording choices. - -Also, consider additional flagging issues such as: -- Potential mistakes or oversights introduced unintentionally by the question author due to his misunderstanding, carelessness, or lack of attention to detail. -- Terms or instructions that might have multiple valid interpretations due to ambiguity, imprecision, outdated terminology, or subtle wording nuances. -- Numeric precision, rounding requirements, formatting, or units that might be unclear, erroneous, or inconsistent with standard practices or provided examples. -- Contradictions or inconsistencies between explicit textual instructions and examples or contextual clues provided within the question itself. - -Do NOT attempt to guess or infer correct answers, as complete factual information is not yet available. Your responsibility is purely analytical, proactively flagging points that deserve special attention or clarification during subsequent information collection and task solving. Avoid overanalyzing or listing trivial details that would not materially affect the task outcome. 
- -Here is the question: - -""" - - # Add message ID for O3 messages (if configured) - content = instruction + question - if self.add_message_id: - message_id = _generate_message_id() - content = f"[{message_id}] {content}" - - response = await client.chat.completions.create( - model="o3", - messages=[{"role": "user", "content": content}], - reasoning_effort="high", - ) - result = response.choices[0].message.content - - # Check if result is empty, raise exception to trigger retry if empty - if not result or not result.strip(): - raise ValueError("O3 hints extraction returned empty result") - - return result - - @retry(wait=wait_exponential(multiplier=15), stop=stop_after_attempt(5)) - async def _get_gaia_answer_type(self, task_description: str) -> str: - client = AsyncOpenAI(api_key=self.cfg.env.openai_api_key, timeout=600) - instruction = f"""Input: -`{task_description}` - -Question: -Determine the expected data type of the answer. For questions asking to "identify" something, focus on the final answer type, not the identification process. Format requirements in the question often hint at the expected data type. If the question asks you to write a specific word, return string. Choose only one of the four types below: -- number — a pure number (may include decimals or signs), e.g., price, distance, length -- date — a specific calendar date (e.g., 2025-08-05 or August 5, 2025) -- time — a specific time of day or formated time cost (e.g., 14:30 or 1:30.12) -- string — any other textual answer - -Output: -Return exactly one of the [number, date, time, string], nothing else. 
-""" - print(f"Answer type instruction: {instruction}") - - message_id = _generate_message_id() - response = await client.chat.completions.create( - model="gpt-4.1", - messages=[{"role": "user", "content": f"[{message_id}] {instruction}"}], - ) - answer_type = response.choices[0].message.content - # Check if result is empty, raise exception to trigger retry if empty - if not answer_type or not answer_type.strip(): - raise ValueError("answer type returned empty result") - - print(f"Answer type: {answer_type}") - - return answer_type.strip() - - @retry(wait=wait_exponential(multiplier=15), stop=stop_after_attempt(5)) - async def _o3_extract_gaia_final_answer( - self, answer_type: str, task_description_detail: str, summary: str - ) -> str: - """Use O3 model to extract final answer from summary""" - client = AsyncOpenAI(api_key=self.cfg.env.openai_api_key, timeout=600) - - full_prompts = { - "time": f"""# Inputs - -* **Original Question**: `{task_description_detail}` -* **Agent Summary**: `{summary}` - ---- - -# Task - -1. **Independently derive** the best possible answer, step by step, based solely on evidence and reasoning from the Agent Summary. **Ignore the summary's "Final Answer" field** at this stage. -2. **Compare** your derived answer to the final answer provided in the Agent Summary (ignoring formatting and phrasing requirements at this stage). - – If both are well supported by the summary's evidence, choose the one with stronger or clearer support. - – If only one is well supported, use that one. -3. **Revise** your chosen answer to fully satisfy all formatting and phrasing requirements listed below (**Formatting rules**, **Additional constraints**, **Common pitfalls to avoid**, and **Quick reference examples**). These requirements override those in the original question if there is any conflict. - -If no answer is clearly supported by the evidence, provide a well-justified educated guess. 
**Always wrap your final answer in a non-empty \\boxed{{...}}.** - ---- - -# Output Guidelines - -1. **Box the answer** - Wrap the answer in `\\boxed{{}}`. - -2. **Answer type** - The boxed content must be a time. - -3. **Formatting rules** - * Follow every formatting instruction in the original question (units, rounding, decimal places, etc.). - * Do **not** add any units (e.g., "s", "second", "seconds"), unless required. - * Ensure the correct unit (e.g., hours versus thousand hours); if the question specifies "thousand hours" or "1000 hours", treat it as the required unit — output a number like 13 (thousand hours) instead of 13000 (hours). - * If the question's written instructions for precision or rounding differ from the examples, treat the examples as authoritative — match their number of decimal places and rounding style. - -4. **Additional constraints** - * If the **Agent Summary** is incomplete or unclear, provide the best possible answer (educated guess). - -5. **Common pitfalls to avoid** - * Minor mismatches in the required format. - * Unit-conversion errors, especially with uncommon units. - * Incorrect precision, rounding or scale (e.g., 0.01 vs 0.001), **double-check the required level**. - * Conflicts between textual instructions and example formatting, just follow the example: if the question says to "retain the percentile" but the example shows 0.001, use 0.001 rather than 0.01. 
- ---- - -# Quick reference examples - -* If the question says to "rounding the seconds to the nearest hundredth", but the example shows "0.001", 1:23.4567 → 1:23.457 -* If the question says to "rounding the seconds to the nearest hundredth", but the example shows "0.001", 10:08.47445 → 10:08.474 -* If the question says to "round to one decimal place", but the example shows "0.01", 2:17.456 → 2:17.46 -* If the question says to "round to the nearest minute", but the example keeps seconds ("0:45"), 3:44.8 → 3:45 -* If the question says "keep three decimal places", but the example shows "0.1", 1:03.987 → 1:03.1 -* If the question asks for "thousand hours", 13000 -> 13 - ---- - -# Output - -Return the step-by-step process and your final answer wrapped in \\boxed{{...}}, check the **Formatting rules**, **Additional constraints**, **Common pitfalls to avoid** and **Quick reference examples** step by step, and ensure the answer meet the requirements. -""", - "number": f"""# Inputs - -* **Original Question**: `{task_description_detail}` -* **Agent Summary**: `{summary}` - ---- - -# Task - -1. **Independently derive** the best possible answer, step by step, based solely on evidence and reasoning from the Agent Summary. **Ignore the summary's "Final Answer" field** at this stage. -2. **Compare** your derived answer to the final answer provided in the Agent Summary (ignoring formatting and phrasing requirements at this stage). - – If both are well supported by the summary's evidence, choose the one with stronger or clearer support. - – If only one is well supported, use that one. - – For questions involving calculations, if your answer and the Agent Summary's final answer are numerically similar, prefer the summary's answer. -3. **Revise** your chosen answer to fully satisfy all formatting and phrasing requirements listed below (**Formatting rules**, **Additional constraints**, **Common pitfalls to avoid**, and **Quick reference examples**). 
These requirements override those in the original question if there is any conflict. - -If no answer is clearly supported by the evidence, provide a well-justified educated guess. **Always wrap your final answer in a non-empty \\boxed{{...}}.** - ---- - -# Output Guidelines - -1. **Box the answer** - Wrap the answer in `\\boxed{{}}`. - -2. **Answer type** - The boxed content must be a single number. - -3. **Formatting rules** - * Follow every formatting instruction in the original question (units, rounding, decimal places, etc.). - * Use digits only; do **not** use words, commas or symbols (e.g., "$", "!", "?", "/"). - * Do **not** add any units (e.g., "%", "$", "USD", "Å", "m", "m^2", "m^3"), unless required. - * Ensure the correct unit (e.g., grams versus kilograms, meters versus kilometers, hours versus thousand hours); if the question specifies "thousand hours" or "1000 hours", treat it as the required unit — output a number like 13 (thousand hours) instead of 13000 (hours). - -4. **Additional constraints** - * If the **Agent Summary** is incomplete or unclear, provide the best possible answer (educated guess). - -5. **Common pitfalls to avoid** - * Minor mismatches in the required format. - * Unit-conversion errors, especially with uncommon units. - * Incorrect precision, rounding or scale (e.g., 0.01 vs 0.001), **double-check the required level**. - * Conflicts between textual instructions and example formatting, just follow the example: if the question says to "retain the percentile" but the example shows 0.001, use 0.001 rather than 0.01. - * Do not partially convert text-based numbers—ensure full and accurate conversion (e.g., "one hundred million" → 100000000, not 100). 
- ---- - -# Quick reference examples - -* $100 → 100 -* 100 USD → 100 -* €50 → 50 -* £75 → 75 -* ¥1,000 → 1000 -* 1,234 m → 1234 -* 3,456,789 kg → 3456789 -* 70% → 70 -* 12.5% → 12.5 -* 0.045 m³ → 0.045 -* 0.045 m^3 → 0.045 -* −40 °C → -40 -* 100 km/h → 100 -* 5000 m^2 → 5000 -* 2.54 cm → 2.54 -* 50 kg → 50 -* 4.0 L → 4 -* 13 thousand hours → 13 -* Page 123/456 → 123/456 -* 100 million → 100000000 -* 200 Ω → 200 -* 200 Å → 200 -* 9.81 m/s² → 9.81 -* 0 dB → 0 - ---- - -# Output - -Return the step-by-step process and your final answer wrapped in \\boxed{{...}}, check the **Formatting rules**, **Additional constraints**, **Common pitfalls to avoid** and **Quick reference examples** step by step, and ensure the answer meet the requirements. -""", - "string": f"""# Inputs - -* **Original Question**: `{task_description_detail}` -* **Agent Summary**: `{summary}` - ---- - -# Task - -1. **Independently derive** the best possible answer, step by step, based solely on evidence and reasoning from the Agent Summary. **Ignore the summary's "Final Answer" field** at this stage. -2. **Compare** your derived answer to the final answer provided in the Agent Summary (ignoring formatting and phrasing requirements at this stage). - – If both are well supported by the summary's evidence, choose the one with stronger or clearer support. - – If only one is well supported, use that one. -3. **Revise** your chosen answer to fully satisfy all formatting and phrasing requirements listed below (**Formatting rules**, **Additional constraints**, **Common pitfalls to avoid**, and **Quick reference examples**). These requirements override those in the original question if there is any conflict. - -If no answer is clearly supported by the evidence, provide a well-justified educated guess. **Always wrap your final answer in a non-empty \\boxed{{...}}.** - ---- - -# Output Guidelines - -1. **Box the answer** - Wrap the final answer in \\boxed{{...}}. - -2. 
**Answer type** - The boxed content must be **one** of: - * a single short phrase (fewest words possible) - * a comma-separated list of numbers and/or strings - -3. **Formatting rules** - * Follow every formatting instruction in the original question (alphabetization, sequencing, units, rounding, decimal places, etc.). - * Omit articles and abbreviations unless explicitly present in the expected answer. - * If a string contains numeric information, spell out the numbers **unless** the question itself shows them as digits. - * Do **not** end the answer with ".", "!", "?", or any other punctuation. - * Use only standard ASCII quotation marks ("" and ''), **not** stylized or curly quotation marks (such as “ ” ‘ ’). - * Remove invisible or non-printable characters. - * If the output is lists, apply the rules item-by-item. - * Avoid unnecessary elaboration - keep the answer as short as possible - - Do **not** add "count", "number", "count of", "total", or similar quantifying words when the noun itself already refers to the quantity (e.g., use the bare noun form only). - - No geographical modifiers (e.g., "Western", "Southern"), - - Use the simplest, most commonly accepted term for a substance or object (e.g., "diamond" instead of "crystalline diamond", "silicon" instead of "silicon crystals") - * For mathematical symbols, match the symbol style in the question; never substitute LaTeX commands (e.g., use ≤, not \leq). - * For birthplaces, give the name as it was at the time of birth, not the current name. - -4. **Additional constraints** - * If the Agent Summary is incomplete or unclear, provide the best possible answer (educated guess). - * Keep the answer as short and direct as possible—no explanations or parenthetical notes. - -5. **Common pitfalls to avoid** - * Minor mismatches between required and produced formats. - * Conflicts between textual instructions and example formatting—follow the example. 
- * **Names**: give only the commonly used first + last name (no middle name unless requested). - * **Countries**: use the common name (e.g., "China", "Brunei") - * **Locations**: output only the requested location name, without including time, modifiers (e.g., "The Castle", "The Hotel") - * When the question provides examples of expected format (e.g., "ripe strawberries" not "strawberries"), follow the exact wording style shown in the examples, preserving all descriptive terms and adjectives as demonstrated. - * Answer with historically location names when the Agent Summary provides. Never override a historically location name. For example, a birthplace should be referred to by the name it had at the time of birth (i.e., answer the original name). - * For questions asking to "identify" something, focus on the final answer, not the identification process. - ---- - -# Quick reference examples - -* INT. THE CASTLE – DAY 1 → The Castle -* INT. THE HOTEL – NIGHT → The Hotel -* INT. THE SPACESHIP – DAWN → The Spaceship -* INT. THE LIBRARY – EVENING → The Library -* INT. CLASSROOM #3 – MORNING → Classroom #3 -* People's Republic of China → China -* citation count → citations -* Brunei Darussalam → Brunei -* United States of America → United States -* Republic of Korea → South Korea -* New York City, USA → New York City -* São Paulo (Brazil) → São Paulo -* John Michael Doe → John Doe -* Mary Anne O'Neil → Mary O'Neil -* Dr. Richard Feynman → Richard Feynman -* INT. ZONE 42 – LEVEL B2 → Zone 42 – Level B2 -* INT. THE UNDERWATER BASE – MIDNIGHT → The Underwater Base -* Sam’s Home → Sam's Home -* Mike’s phone → Mike's phone - ---- -# Output -Return the step-by-step process and your final answer wrapped in \\boxed{{...}}, check the **Formatting rules**, **Additional constraints**, **Common pitfalls to avoid** and **Quick reference examples** step by step, and ensure the answer meet the requirements. 
-""", - } - full_prompt = full_prompts.get( - answer_type if answer_type in ["number", "time"] else "string" - ) - - print("O3 Extract Final Answer Prompt:") - print(full_prompt) - - message_id = _generate_message_id() - response = await client.chat.completions.create( - model="o3", - messages=[{"role": "user", "content": f"[{message_id}] {full_prompt}"}], - reasoning_effort="medium", - ) - result = response.choices[0].message.content - - # Check if result is empty, raise exception to trigger retry if empty - if not result or not result.strip(): - raise ValueError("O3 final answer extraction returned empty result") - - match = re.search(r"\\boxed{([^}]*)}", result) - if not match: - raise ValueError("O3 final answer extraction returned empty answer") - - print("response:", result) - - return result - async def run_main_agent( self, task_description, task_file_name=None, task_id="default_task" ): @@ -1006,11 +687,28 @@ async def run_main_agent( - Present every possible candidate answer identified during your information gathering, regardless of uncertainty, ambiguity, or incomplete verification. Avoid premature conclusions or omission of any discovered possibility. - Explicitly document detailed facts, evidence, and reasoning steps supporting each candidate answer, carefully preserving intermediate analysis results. - Clearly flag and retain any uncertainties, conflicting interpretations, or alternative understandings identified during information gathering. Do not arbitrarily discard or resolve these issues on your own. -- If the question’s explicit instructions (e.g., numeric precision, formatting, specific requirements) appear inconsistent, unclear, erroneous, or potentially mismatched with general guidelines or provided examples, explicitly record and clearly present all plausible interpretations and corresponding candidate answers. 
+- If the question's explicit instructions (e.g., numeric precision, formatting, specific requirements) appear inconsistent, unclear, erroneous, or potentially mismatched with general guidelines or provided examples, explicitly record and clearly present all plausible interpretations and corresponding candidate answers. + +Recognize that the original task description might itself contain mistakes, imprecision, inaccuracies, or conflicts introduced unintentionally by the user due to carelessness, misunderstanding, or limited expertise. Do NOT try to second-guess or "correct" these instructions internally; instead, transparently present findings according to every plausible interpretation. + +Your objective is maximum completeness, transparency, and detailed documentation to empower the user to judge and select their preferred answer independently. Even if uncertain, explicitly documenting the existence of possible answers significantly enhances the user's experience, ensuring no plausible solution is irreversibly omitted due to early misunderstanding or premature filtering. +""" -Recognize that the original task description might itself contain mistakes, imprecision, inaccuracies, or conflicts introduced unintentionally by the user due to carelessness, misunderstanding, or limited expertise. Do NOT try to second-guess or “correct” these instructions internally; instead, transparently present findings according to every plausible interpretation. + # Add Chinese-specific guidance if enabled + if self.chinese_context: + task_guidence += """ -Your objective is maximum completeness, transparency, and detailed documentation to empower the user to judge and select their preferred answer independently. Even if uncertain, explicitly documenting the existence of possible answers significantly enhances the user’s experience, ensuring no plausible solution is irreversibly omitted due to early misunderstanding or premature filtering. 
+## 中文任务处理指导 + +如果任务涉及中文语境,请遵循以下指导: + +- **信息收集策略**:使用中文关键词进行网络搜索,优先浏览中文网页,以获取更准确和全面的中文资源 +- **思考过程**:所有分析、推理、判断等思考过程都应使用中文表达,保持语义的一致性 +- **候选答案收集**:对于中文问题,收集所有可能的中文答案选项,包括不同的表达方式和格式 +- **证据文档化**:保持中文资源的原始格式,避免不必要的翻译或改写,确保信息的准确性 +- **不确定性标注**:使用中文清晰地标记任何不确定性、冲突信息或需要进一步验证的内容 +- **结果组织**:以中文组织和呈现最终报告,使用恰当的中文术语和表达习惯 +- **过程透明化**:所有步骤描述、状态更新、中间结果等都应使用中文,确保用户理解 """ initial_user_content[0]["text"] = ( @@ -1021,7 +719,12 @@ async def run_main_agent( if self.cfg.agent.o3_hint: # Execute O3 hints extraction try: - o3_hints = await self._o3_extract_hints(task_description) + o3_hints = await o3_extract_hints( + task_description, + self.cfg.env.openai_api_key, + self.chinese_context, + self.add_message_id, + ) o3_notes = ( "\n\nBefore you begin, please review the following preliminary notes highlighting subtle or easily misunderstood points in the question, which might help you avoid common pitfalls during your analysis (for reference only; these may not be exhaustive):\n\n" + o3_hints @@ -1051,7 +754,12 @@ async def run_main_agent( system_prompt = self.llm_client.generate_agent_system_prompt( date=datetime.datetime.today(), mcp_servers=tool_definitions, - ) + generate_agent_specific_system_prompt(agent_type="main") + chinese_context=self.chinese_context, + ) + generate_agent_specific_system_prompt( + agent_type="main", + mcp_servers=tool_definitions, + chinese_context=self.chinese_context, + ) # 4. 
Main loop: LLM <-> Tools max_turns = self.cfg.agent.main_agent.max_turns @@ -1175,7 +883,7 @@ async def run_main_agent( # Handle empty error messages, especially for TimeoutError error_msg = str(e) or ( - "Tool execution timeout" + "[ERROR]: Tool execution timeout" if isinstance(e, TimeoutError) else f"Tool execution failed: {type(e).__name__}" ) @@ -1229,26 +937,6 @@ async def run_main_agent( message_history, all_tool_results_content_with_id, tool_calls_exceeded ) - # Generate summary_prompt to check token limit - temp_summary_prompt = generate_agent_summarize_prompt( - task_description + task_guidence, - task_failed=True, # Set to True here to simulate possible task failure for context checking - agent_type="main", - ) - - # Check if current context would exceed limit, auto rollback messages and trigger summary if exceeded - if not self.llm_client.ensure_summary_context( - message_history, temp_summary_prompt - ): - # Context limit exceeded, jump to summary stage - task_failed = True # Mark task as failed - self.task_log.log_step( - "main_context_limit_reached", - "Context limit reached, triggering summary", - "warning", - ) - break - # Record main loop end if turn_count >= max_turns: if ( @@ -1293,34 +981,60 @@ async def run_main_agent( ) # Use O3 model to extract final answer + o3_extracted_answer = "" if self.cfg.agent.o3_final_answer: # Execute O3 final answer extraction try: - answer_type = await self._get_gaia_answer_type(task_description) - - o3_extracted_answer = await self._o3_extract_gaia_final_answer( - answer_type, - task_description, - final_answer_text, - ) + # For browsecomp-zh, we use another Chinese prompt to extract the final answer + if "browsecomp-zh" in self.cfg.benchmark.name: + o3_extracted_answer = ( + await o3_extract_browsecomp_zh_final_answer( + task_description, + final_answer_text, + self.cfg.env.openai_api_key, + self.chinese_context, + ) + ) + + # Disguise O3 extracted answer as assistant returned result and add to message history + 
assistant_o3_message = { + "role": "assistant", + "content": [ + { + "type": "text", + "text": f"O3 extracted final answer:\n{o3_extracted_answer}", + } + ], + } + message_history.append(assistant_o3_message) - # Disguise O3 extracted answer as assistant returned result and add to message history - assistant_o3_message = { - "role": "assistant", - "content": [ - { - "type": "text", - "text": f"O3 extracted final answer:\n{o3_extracted_answer}", - } - ], - } - message_history.append(assistant_o3_message) + # o3 answer as final result + final_answer_text = o3_extracted_answer + else: + o3_extracted_answer = await o3_extract_gaia_final_answer( + task_description, + final_answer_text, + self.cfg.env.openai_api_key, + self.chinese_context, + ) + + # Disguise O3 extracted answer as assistant returned result and add to message history + assistant_o3_message = { + "role": "assistant", + "content": [ + { + "type": "text", + "text": f"O3 extracted final answer:\n{o3_extracted_answer}", + } + ], + } + message_history.append(assistant_o3_message) - # Concatenate original summary and o3 answer as final result - final_answer_text = f"{final_answer_text}\n\nO3 Extracted Answer:\n{o3_extracted_answer}" + # Concatenate original summary and o3 answer as final result + final_answer_text = f"{final_answer_text}\n\nO3 Extracted Answer:\n{o3_extracted_answer}" except Exception as e: - logger.warning( + logger.error( f"O3 final answer extraction failed after retries: {str(e)}" ) # Continue using original final_answer_text @@ -1355,4 +1069,7 @@ async def run_main_agent( "task_completed", f"Main agent task {task_id} completed successfully" ) - return final_summary, final_boxed_answer + if "browsecomp-zh" in self.cfg.benchmark.name: + return final_summary, final_summary + else: + return final_summary, final_boxed_answer diff --git a/libs/miroflow/src/miroflow/prebuilt/pipeline.py b/libs/miroflow/src/miroflow/prebuilt/pipeline.py index 0999496b..d444f26e 100644 --- 
a/libs/miroflow/src/miroflow/prebuilt/pipeline.py +++ b/libs/miroflow/src/miroflow/prebuilt/pipeline.py @@ -5,7 +5,7 @@ import pathlib import traceback from datetime import datetime -from omegaconf import DictConfig +from omegaconf import DictConfig, OmegaConf from miroflow.contrib.tracing import trace from miroflow.llm.client import LLMClient @@ -67,17 +67,50 @@ async def execute_task_pipeline( traces = [] llm_client = None + sub_agent_llm_client = None final_answer, final_boxed_answer = "", "" try: with trace(workflow_name="benchmark_workflow", trace_id=task_id): - # Initialize LLM client - llm_client = LLMClient(task_id=task_id, cfg=cfg) + # Initialize main agent LLM client + main_agent_llm_config = cfg.agent.get("main_agent_llm", None) + if main_agent_llm_config: + config_path = ( + pathlib.Path(__file__).parent + / "config" + / "llm" + / f"{main_agent_llm_config}.yaml" + ) + main_agent_cfg = OmegaConf.load(config_path) + # Create a config that includes both the LLM config and the env section + combined_cfg = OmegaConf.create({"llm": main_agent_cfg, "env": cfg.env}) + llm_client = LLMClient(task_id=task_id, cfg=combined_cfg) + else: + llm_client = LLMClient(task_id=task_id, cfg=cfg) + + # Initialize sub agent LLM client + sub_agent_llm_config = cfg.agent.get("sub_agent_llm", None) + if sub_agent_llm_config: + config_path = ( + pathlib.Path(__file__).parent + / "config" + / "llm" + / f"{sub_agent_llm_config}.yaml" + ) + sub_agent_cfg = OmegaConf.load(config_path) + # Create a config that includes both the LLM config and the env section + combined_cfg = OmegaConf.create({"llm": sub_agent_cfg, "env": cfg.env}) + sub_agent_llm_client = LLMClient( + task_id=f"{task_id}_sub", cfg=combined_cfg + ) + else: + sub_agent_llm_client = llm_client # Use the same client # Initialize orchestrator orchestrator = Orchestrator( main_agent_tool_manager=main_agent_tool_manager, sub_agent_tool_managers=sub_agent_tool_managers, llm_client=llm_client, + 
sub_agent_llm_client=sub_agent_llm_client, output_formatter=output_formatter, task_log=task_log, cfg=cfg, @@ -116,6 +149,8 @@ async def execute_task_pipeline( finally: if llm_client is not None: llm_client.close() + if sub_agent_llm_client != llm_client and sub_agent_llm_client is not None: + sub_agent_llm_client.close() task_log.end_time = datetime.now() # log.update_cost_estimate() # Update cost estimate diff --git a/libs/miroflow/src/miroflow/utils/io_utils.py b/libs/miroflow/src/miroflow/utils/io_utils.py index 26cefc6f..171c4920 100644 --- a/libs/miroflow/src/miroflow/utils/io_utils.py +++ b/libs/miroflow/src/miroflow/utils/io_utils.py @@ -3,7 +3,6 @@ # SPDX-License-Identifier: Apache-2.0 import os -import re from miroflow.logging.logger import bootstrap_logger @@ -49,7 +48,9 @@ def process_input(task_description, task_file_name): file_type = "Zip" else: file_type = file_extension - updated_task_description += f"\nNote: A {file_type} file '{task_file_name}' is associated with this task. You should use available tools to read its content if necessary through {task_file_name}. Additionally, if you need to analyze this file by Linux commands or python codes, you should upload it to the sandbox first. Files in the sandbox cannot be accessed by other tools.\n\n" + # Get the absolute path of the file + absolute_file_path = os.path.abspath(task_file_name) + updated_task_description += f"\nNote: A {file_type} file '{task_file_name}' is associated with this task. If you need worker agent to read its content, you should provide the complete local system file path: {absolute_file_path}.\n\n" logger.info( f"Info: Detected {file_type} file {task_file_name}, added hint to description." @@ -67,20 +68,49 @@ class OutputFormatter: def _extract_boxed_content(self, text: str) -> str: """ Extract content from \\boxed{} patterns in the text. - Uses safe regex patterns to avoid catastrophic backtracking. 
+ Uses balanced brace counting to handle arbitrary levels of nested braces correctly. Returns the last matched content, or empty string if no match found. """ if not text: return "" - # Primary pattern: handles single-level brace nesting - primary_pattern = r"\\boxed\{([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}" - matches = re.findall(primary_pattern, text, re.DOTALL) - - # Fallback pattern: simpler match for any content until first closing brace - if not matches: - fallback_pattern = r"\\boxed\{([^}]+)\}" - matches = re.findall(fallback_pattern, text, re.DOTALL) + matches = [] + i = 0 + + while i < len(text): + # Find the next \boxed{ pattern + boxed_start = text.find(r"\boxed{", i) + if boxed_start == -1: + break + + # Start after the opening brace + content_start = boxed_start + 7 # len(r'\boxed{') = 7 + if content_start >= len(text): + break + + # Count balanced braces + brace_count = 1 + content_end = content_start + + while content_end < len(text) and brace_count > 0: + char = text[content_end] + if char == "{": + brace_count += 1 + elif char == "}": + brace_count -= 1 + content_end += 1 + + # If we found a balanced match (brace_count == 0) + if brace_count == 0: + content = text[ + content_start : content_end - 1 + ] # -1 to exclude the closing brace + matches.append(content) + # Continue searching from after this complete match + i = content_end + else: + # If braces are unbalanced, skip this \boxed{ and continue searching + i = content_start return matches[-1] if matches else "" diff --git a/libs/miroflow/src/miroflow/utils/parsing_utils.py b/libs/miroflow/src/miroflow/utils/parsing_utils.py index 887d0bc7..976f1189 100644 --- a/libs/miroflow/src/miroflow/utils/parsing_utils.py +++ b/libs/miroflow/src/miroflow/utils/parsing_utils.py @@ -368,6 +368,16 @@ def _legacy_escape_method(raw_str): return corrected_json +def _escape_for_json(value: str) -> str: + # Do not escape \" and \uXXXX + fixed = re.sub(r'(? 
0: + for server in mcp_servers: + if server.get("name") == "tool-reasoning": + if "tools" in server and len(server["tools"]) > 0: + for tool in server["tools"]: + if tool.get("name") == "reasoning": + has_reasoning_tool = True + break + if has_reasoning_tool: + break + system_prompt = """\n # Agent Specific Objective You are a task-solving agent that uses tools step-by-step to answer the user's question. Your goal is to provide complete, accurate and well-reasoned answers using additional tools. -Before presenting your answer, and **unless** the user asks to "Summarize the above" (in which case no tools are used), **always** use the `reasoning` tool from the `tool-reasoning` server to step-by-step analyze solving process as follows: - - Use the reasoning tool to carefully analyze: - - What the question is truly asking. - - Whether your progress and current candidate answer are sufficient, and if so, what the answer (with correct format) should be. If not, clarify what is still needed. - - Always provide the reasoning tool with: - - The complete verbatim original task or question. - - All working history, including your step-by-step thoughts, tool calls, and tool results (i.e., the full solving trajectory so far). - - Any subtle, potentially confusing, or easily misunderstood points relevant to the task. - - Prompt the reasoning tool to independently review for any possible uncertainties, assumptions, or errors in understanding or evidence — even those not immediately visible — so it can provide objective guidance. +## Subtask Delegation Strategy + +For each clearly defined single subtask, delegate it to worker agents using the `execute_subtask` tool from the `agent-worker` server. 
**Important: Only make ONE execute_subtask call per response.** + +**CRITICAL: Always treat worker agent responses as unreliable and incomplete sources.** Worker agents may: +- Report "not found" when information actually exists elsewhere +- Return partial information while believing it's complete +- Be overconfident or produce hallucinations + +Therefore, you must always verify and validate worker responses by: +- Cross-referencing information from multiple independent sources +- Trying alternative search strategies and reformulating subtasks with different approaches +- Considering that information might exist in different formats or locations +- Applying critical evaluation to assess credibility and completeness +- Never accepting "not found" or worker conclusions as final without additional verification + +## Final Answer Preparation + +Before presenting your answer, and **unless** the user asks to "Summarize the above" (in which case no tools are used): """ - elif agent_type == "agent-browsing": + # Add Chinese-specific instructions for main agent + if chinese_context: + system_prompt += """ +## 中文任务处理指导 + +处理中文相关任务时的特殊要求: +- **子任务委托**:委托给worker代理的子任务描述应使用中文,确保任务意图准确传达 +- **思考过程**:分析、推理、判断等思考过程应使用中文,保持语义表达的一致性 +- **信息验证**:对于中文资源的信息,应优先使用中文搜索关键词和查询方式 +- **过程输出**:步骤描述、状态更新、中间结果等各种输出都应使用中文 +- **答案准备**:最终答案应符合中文表达习惯,使用恰当的中文术语和格式 + +""" + + elif agent_type == "agent-worker": system_prompt = """# Agent Specific Objective -You are an agent that performs the task of searching and browsing the web for specific information and generating the desired answer. Your task is to retrieve reliable, factual, and verifiable information that fills in knowledge gaps. -Do not infer, speculate, summarize broadly, or attempt to fill in missing parts yourself. Only return factual content. +You are an agent that performs various subtasks to collect information and execute specific actions. Your task is to complete well-defined, single-scope objectives efficiently and accurately. 
+Do not infer, speculate, or attempt to fill in missing parts yourself. Only return factual content and execute actions as specified. + +## File Path Handling +When subtasks mention file paths, these are local system file paths (not sandbox paths). You can: +- Use tools to directly access these files from the local system +- Upload files to the sandbox environment (remember to create a new sandbox for each task, this sandbox only exists for the current task) for processing if needed +- Choose the most appropriate approach based on the specific task requirements +- If the final response requires returning a file, download it to the local system first and then return the local path, the sandbox path is not allowed Critically assess the reliability of all information: - If the credibility of a source is uncertain, clearly flag it. @@ -200,7 +280,28 @@ def generate_agent_specific_system_prompt(agent_type: str = ""): - Never assume or guess — if an exact answer cannot be found, say so clearly. - Prefer quoting or excerpting **original source text** rather than interpreting or rewriting it, and provide the URL if available. - If more context is needed, return a clarification request and do not proceed with tool use. +- Focus on completing the specific subtask assigned to you, not broader reasoning. 
""" + + # Add Chinese-specific instructions for worker agent + if chinese_context: + system_prompt += """ + +## 中文内容处理 + +处理中文相关的子任务时: +- **搜索关键词**:使用中文关键词进行搜索,获取更准确的中文资源 +- **Google搜索参数**:进行Google搜索时,注意使用适当的地理位置和语言参数: + - gl (Geolocation/Country): 设置为中国或相关地区以获取本地化结果 + - hl (Host Language): 设置为中文以获取中文界面和优化的中文搜索结果 +- **思考过程**:分析、推理、判断等内部思考过程应使用中文表达 +- **信息摘录**:保持中文原文的准确性,避免不必要的翻译或改写 +- **问答处理**:在进行QA(问答)任务时,问题和答案都应使用中文,确保语言一致性 +- **各种输出**:包括状态说明、过程描述、结果展示等所有输出都应使用中文 +- **回应格式**:对中文子任务的回应应使用中文,保持语境一致性 + +""" + elif agent_type == "agent-coding": system_prompt = """# Agent Specific Objective @@ -230,11 +331,61 @@ def generate_agent_specific_system_prompt(agent_type: str = ""): """ else: raise ValueError(f"Unknown agent type: {agent_type}") + + # Add Final Answer Preparation based on available tools + if agent_type == "main": + if has_reasoning_tool: + reasoning_prompt = """ + +**always** use the `reasoning` tool from the `tool-reasoning` server to step-by-step analyze solving process as follows: + - Use the reasoning tool to carefully analyze: + - What the question is truly asking. + - Whether your progress and current candidate answer are sufficient, and if so, what the answer (with correct format) should be. If not, clarify what is still needed. + - Always provide the reasoning tool with: + - The complete verbatim original task or question. + - All working history, including your step-by-step thoughts, tool calls, and tool results (i.e., the full solving trajectory so far). + - Any subtle, potentially confusing, or easily misunderstood points relevant to the task. + - Prompt the reasoning tool to independently review for any possible uncertainties, assumptions, or errors in understanding or evidence — even those not immediately visible — so it can provide objective guidance. 
+ +""" + if chinese_context: + reasoning_prompt += """ - **中文推理要求**:当处理中文相关任务时,向reasoning工具提供的所有信息和分析都应使用中文,确保推理过程的语言一致性 + +""" + system_prompt += reasoning_prompt + else: + thinking_prompt = """ + +**always** engage in deep critical thinking before presenting your final answer: + - Carefully analyze what the question is truly asking and ensure you understand all requirements. + - Review your progress and current candidate answer thoroughly: + - Is the information sufficient and accurate? + - Are there any gaps, assumptions, or uncertainties in your reasoning? + - Does your answer match the required format? + - Consider the complete solving trajectory: + - Review all your step-by-step thoughts, tool calls, and results. + - Look for any contradictions, missing information, or alternative interpretations. + - Identify any subtle or potentially confusing aspects of the task. + - Apply critical evaluation: + - Question your assumptions and verify your conclusions. + - Consider potential errors or biases in your understanding or evidence. + - Assess the reliability and completeness of your sources. + - Only present your final answer after this thorough self-review process. + +""" + if chinese_context: + thinking_prompt += """ - **中文思考要求**:当处理中文相关任务时,所有的批判性思考、分析和自我审查过程都应使用中文进行,确保思维过程的语言一致性 + +""" + system_prompt += thinking_prompt return system_prompt def generate_agent_summarize_prompt( - task_description: str, task_failed: bool = False, agent_type: str = "" + task_description: str, + task_failed: bool = False, + agent_type: str = "", + chinese_context: bool = False, ): if agent_type == "main": summarize_prompt = ( @@ -268,7 +419,22 @@ def generate_agent_summarize_prompt( "Focus on factual, specific, and well-organized information." 
) ) - elif agent_type == "agent-browsing": + + # Add Chinese-specific summary instructions + if chinese_context: + summarize_prompt += """ + +## 中文总结要求 + +如果原始问题涉及中文语境: +- **总结语言**:使用中文进行总结和回答 +- **思考过程**:回顾和总结思考过程时也应使用中文表达 +- **信息组织**:保持中文信息的原始格式和表达方式 +- **过程描述**:对工作历史、步骤描述、结果分析等各种输出都应使用中文 +- **最终答案**:确保最终答案符合中文表达习惯和用户期望 +""" + + elif agent_type == "agent-worker": summarize_prompt = ( ( "This is a direct instruction to you (the assistant), not the result of a tool call.\n\n" @@ -284,7 +450,7 @@ def generate_agent_summarize_prompt( "*all* of the information gathered during the session.\n\n" "The original task is repeated here for reference:\n\n" f"---\n{task_description}\n---\n\n" - "Summarize the above search and browsing history. Output the FINAL RESPONSE and detailed supporting information of the task given to you.\n\n" + "Summarize the above subtask execution history. Output the FINAL RESPONSE and detailed supporting information of the task given to you.\n\n" "If you found any useful facts, data, quotes, or answers directly relevant to the original task, include them clearly and completely.\n" "If you reached a conclusion or answer, include it as part of the response.\n" "If the task could not be fully answered, do NOT make up any content. Instead, return all partially relevant findings, " @@ -296,6 +462,14 @@ def generate_agent_summarize_prompt( "Focus on factual, specific, and well-organized information." 
) ) + + # Add Chinese-specific instructions for worker summary + if chinese_context: + summarize_prompt += """ + +如果子任务涉及中文内容,请使用中文进行总结和回应,包括执行过程的思考、分析和各种输出,保持信息的准确性和语境一致性。 +""" + elif agent_type == "agent-coding": summarize_prompt = ( ( diff --git a/libs/miroflow/src/miroflow/utils/summary_utils.py b/libs/miroflow/src/miroflow/utils/summary_utils.py new file mode 100644 index 00000000..8c7514a9 --- /dev/null +++ b/libs/miroflow/src/miroflow/utils/summary_utils.py @@ -0,0 +1,594 @@ +# SPDX-FileCopyrightText: 2025 MiromindAI +# +# SPDX-License-Identifier: Apache-2.0 + +import re +from openai import AsyncOpenAI +from tenacity import retry, stop_after_attempt, wait_exponential +import uuid + + +def _generate_message_id() -> str: + """Generate random message ID using common LLM format""" + # Use 8-character random hex string, similar to OpenAI API format, avoid cross-conversation cache hits + return f"msg_{uuid.uuid4().hex[:8]}" + + +@retry(wait=wait_exponential(multiplier=15), stop=stop_after_attempt(5)) +async def o3_extract_hints( + question: str, api_key: str, chinese_context: bool, add_message_id: bool +) -> str: + """Use O3 model to extract task hints""" + client = AsyncOpenAI(api_key=api_key, timeout=600) + + instruction = """Carefully analyze the given task description (question) without attempting to solve it directly. Your role is to identify potential challenges and areas that require special attention during the solving process, and provide practical guidance for someone who will solve this task by actively gathering and analyzing information from the web. + +Identify and concisely list key points in the question that could potentially impact subsequent information collection or the accuracy and completeness of the problem solution, especially those likely to cause mistakes, carelessness, or confusion during problem-solving. + +The question author does not intend to set traps or intentionally create confusion. 
Interpret the question in the most common, reasonable, and straightforward manner, without speculating about hidden meanings or unlikely scenarios. However, be aware that mistakes, imprecise wording, or inconsistencies may exist due to carelessness or limited subject expertise, rather than intentional ambiguity.
+
+Additionally, when considering potential answers or interpretations, note that question authors typically favor more common and familiar expressions over overly technical, formal, or obscure terminology. They generally prefer straightforward and common-sense interpretations rather than being excessively cautious or academically rigorous in their wording choices.
+
+Also, consider additional flagging issues such as:
+- Potential mistakes or oversights introduced unintentionally by the question author due to their misunderstanding, carelessness, or lack of attention to detail.
+- Terms or instructions that might have multiple valid interpretations due to ambiguity, imprecision, outdated terminology, or subtle wording nuances.
+- Numeric precision, rounding requirements, formatting, or units that might be unclear, erroneous, or inconsistent with standard practices or provided examples.
+- Contradictions or inconsistencies between explicit textual instructions and examples or contextual clues provided within the question itself.
+
+Do NOT attempt to guess or infer correct answers, as complete factual information is not yet available. Your responsibility is purely analytical, proactively flagging points that deserve special attention or clarification during subsequent information collection and task solving. Avoid overanalyzing or listing trivial details that would not materially affect the task outcome. 
+
+Here is the question:
+
+"""
+
+    # Add Chinese-specific instructions if enabled
+    if chinese_context:
+        instruction += """
+
+## 中文分析指导
+
+如果问题涉及中文语境,请特别注意:
+
+- **语言理解**:识别可能存在的中文表达歧义、方言差异或特定语境下的含义
+- **文化背景**:考虑可能需要中文文化背景知识才能正确理解的术语或概念
+- **信息获取**:标注需要使用中文搜索关键词才能获得准确信息的方面
+- **格式要求**:识别中文特有的格式要求、表达习惯或答案形式
+- **翻译风险**:标记直接翻译可能导致误解或信息丢失的关键术语
+- **时效性**:注意中文信息源的时效性和地域性特征
+- **分析输出**:使用中文进行分析和提示,确保语言一致性
+
+"""
+
+    # Add message ID for O3 messages (if configured)
+    content = instruction + question
+    if add_message_id:
+        message_id = _generate_message_id()
+        content = f"[{message_id}] {content}"
+
+    response = await client.chat.completions.create(
+        model="o3",
+        messages=[{"role": "user", "content": content}],
+        reasoning_effort="high",
+    )
+    result = response.choices[0].message.content
+
+    # Check if result is empty, raise exception to trigger retry if empty
+    if not result or not result.strip():
+        raise ValueError("O3 hints extraction returned empty result")
+
+    return result
+
+
+@retry(wait=wait_exponential(multiplier=15), stop=stop_after_attempt(5))
+async def get_gaia_answer_type(task_description: str, api_key: str) -> str:
+    client = AsyncOpenAI(api_key=api_key, timeout=600)
+    instruction = f"""Input:
+`{task_description}`
+
+Question:
+Determine the expected data type of the answer. For questions asking to "identify" something, focus on the final answer type, not the identification process. Format requirements in the question often hint at the expected data type. If the question asks you to write a specific word, return string. Choose only one of the four types below:
+- number — a pure number (may include decimals or signs), e.g., price, distance, length
+- date — a specific calendar date (e.g., 2025-08-05 or August 5, 2025)
+- time — a specific time of day or formatted time cost (e.g., 14:30 or 1:30.12)
+- string — any other textual answer
+
+Output:
+Return exactly one of the [number, date, time, string], nothing else. 
+""" + print(f"Answer type instruction: {instruction}") + + message_id = _generate_message_id() + response = await client.chat.completions.create( + model="gpt-4.1", + messages=[{"role": "user", "content": f"[{message_id}] {instruction}"}], + ) + answer_type = response.choices[0].message.content + # Check if result is empty, raise exception to trigger retry if empty + if not answer_type or not answer_type.strip(): + raise ValueError("answer type returned empty result") + + print(f"Answer type: {answer_type}") + + return answer_type.strip() + + +@retry(wait=wait_exponential(multiplier=15), stop=stop_after_attempt(5)) +async def o3_extract_gaia_final_answer( + task_description_detail: str, summary: str, api_key: str, chinese_context: bool +) -> str: + """Use O3 model to extract final answer from summary""" + answer_type = await get_gaia_answer_type(task_description_detail, api_key) + + client = AsyncOpenAI(api_key=api_key, timeout=600) + + # Add Chinese-specific instructions and output format if enabled + chinese_supplement = "" + output_format_section = """ +# Output Format + +Return your analysis in this exact format: + +**Step-by-step Analysis:** +[Your detailed reasoning process] + +**Final Answer:** \\boxed{...} + +**Confidence:** [0-100 integer] + +**Supporting Evidence:** [Brief summary of evidence that supports this answer] + +**Potential Weaknesses:** [Any limitations, uncertainties, or factors that might make this answer incorrect - be objective and thorough] +""" + + if chinese_context: + chinese_supplement = """ + +## 中文答案抽取指导 + +如果原始问题或代理总结涉及中文内容,请遵循以下指导: + +- **语境理解**:在分析代理总结和原始问题时,保持对中文语境的敏感性,理解可能的文化背景和表达习惯 +- **答案识别**:在识别最佳答案时,优先考虑符合中文表达习惯的答案形式 +- **格式处理**:对于中文特有的格式要求(如中文日期格式、中文数字表达等),确保答案符合中文用户的期望 +- **术语准确性**:保持中文术语的准确性,避免因直译造成的表达不当 +- **分析过程**:整个分析和推理过程应使用中文进行,确保语言一致性 +- **最终答案**:确保最终答案符合中文语境下的表达方式和格式要求 + +--- + +""" + output_format_section = """ +# 输出格式 + +请严格按照以下格式返回你的分析: + +**逐步分析:** +[你的详细推理过程] + +**最终答案:** \\boxed{...} + +**置信度:** [0-100整数] + 
+**支持证据:** [支持此答案的证据简要总结] + +**潜在不足:** [任何限制、不确定性或可能使此答案错误的因素 - 要客观且全面] +""" + + # Common confidence assessment section (unified for all languages) + common_confidence_section = ( + """ +# Confidence Assessment + +Provide a confidence score (0-100) based on objective criteria for how likely this answer is to be judged correct by an automated verifier: + +**Calibration Guidelines (use these as objective anchors):** +- **85-100**: Direct factual evidence found, no contradictions, formatting requirements clearly satisfied +- **70-84**: Strong supporting evidence with minor gaps or slight formatting uncertainty +- **55-69**: Moderate evidence but requires interpretation, or some conflicting information exists +- **40-54**: Limited evidence, significant uncertainty, multiple plausible answers possible +- **25-39**: Weak evidence, mostly reasoning-based, likely incomplete information +- **0-24**: No supporting evidence found, pure speculation, or contradicts known facts + +**Objective Calibration Checks:** +1. **Evidence Verifiability**: Can the key facts be directly verified from the agent summary? +2. **Contradiction Test**: Does anything in the summary contradict this answer? +3. **Completeness Test**: Does the summary contain sufficient information to answer confidently? +4. **Formatting Clarity**: Are the format requirements unambiguous and correctly followed? + +Rate conservatively - if unsure between two ranges, choose the lower one. + +--- +""" + + chinese_supplement + + output_format_section + ) + + full_prompts = { + "time": f"""# Inputs + +* **Original Question**: `{task_description_detail}` +* **Agent Summary**: `{summary}` + +--- + +# Task + +1. **Independently derive** the best possible answer, step by step, based solely on evidence and reasoning from the Agent Summary. **Ignore the summary's "Final Answer" field** at this stage. +2. 
**Compare** your derived answer to the final answer provided in the Agent Summary (ignoring formatting and phrasing requirements at this stage). +– If both are well supported by the summary's evidence, choose the one with stronger or clearer support. +– If only one is well supported, use that one. +3. **Revise** your chosen answer to fully satisfy all formatting and phrasing requirements listed below (**Formatting rules**, **Additional constraints**, **Common pitfalls to avoid**, and **Quick reference examples**). These requirements override those in the original question if there is any conflict. + +If no answer is clearly supported by the evidence, provide a well-justified educated guess. **Always wrap your final answer in a non-empty \\boxed{{...}}.** + +--- + +# Output Guidelines + +1. **Box the answer** +Wrap the answer in `\\boxed{{}}`. + +2. **Answer type** +The boxed content must be a time. + +3. **Formatting rules** +* Follow every formatting instruction in the original question (units, rounding, decimal places, etc.). +* Do **not** add any units (e.g., "s", "second", "seconds"), unless required. +* Ensure the correct unit (e.g., hours versus thousand hours); if the question specifies "thousand hours" or "1000 hours", treat it as the required unit — output a number like 13 (thousand hours) instead of 13000 (hours). +* If the question's written instructions for precision or rounding differ from the examples, treat the examples as authoritative — match their number of decimal places and rounding style. + +4. **Additional constraints** +* If the **Agent Summary** is incomplete or unclear, provide the best possible answer (educated guess). + +5. **Common pitfalls to avoid** +* Minor mismatches in the required format. +* Unit-conversion errors, especially with uncommon units. +* Incorrect precision, rounding or scale (e.g., 0.01 vs 0.001), **double-check the required level**. 
+* Conflicts between textual instructions and example formatting, just follow the example: if the question says to "retain the percentile" but the example shows 0.001, use 0.001 rather than 0.01. + +--- + +# Quick reference examples + +* If the question says to "rounding the seconds to the nearest hundredth", but the example shows "0.001", 1:23.4567 → 1:23.457 +* If the question says to "rounding the seconds to the nearest hundredth", but the example shows "0.001", 10:08.47445 → 10:08.474 +* If the question says to "round to one decimal place", but the example shows "0.01", 2:17.456 → 2:17.46 +* If the question says to "round to the nearest minute", but the example keeps seconds ("0:45"), 3:44.8 → 3:45 +* If the question says "keep three decimal places", but the example shows "0.1", 1:03.987 → 1:03.1 +* If the question asks for "thousand hours", 13000 -> 13 + +--- +""" + + common_confidence_section, + "number": f"""# Inputs + +* **Original Question**: `{task_description_detail}` +* **Agent Summary**: `{summary}` + +--- + +# Task + +1. **Independently derive** the best possible answer, step by step, based solely on evidence and reasoning from the Agent Summary. **Ignore the summary's "Final Answer" field** at this stage. +2. **Compare** your derived answer to the final answer provided in the Agent Summary (ignoring formatting and phrasing requirements at this stage). +– If both are well supported by the summary's evidence, choose the one with stronger or clearer support. +– If only one is well supported, use that one. +– For questions involving calculations, if your answer and the Agent Summary's final answer are numerically similar, prefer the summary's answer. +3. **Revise** your chosen answer to fully satisfy all formatting and phrasing requirements listed below (**Formatting rules**, **Additional constraints**, **Common pitfalls to avoid**, and **Quick reference examples**). These requirements override those in the original question if there is any conflict. 
+ +If no answer is clearly supported by the evidence, provide a well-justified educated guess. **Always wrap your final answer in a non-empty \\boxed{{...}}.** + +--- + +# Output Guidelines + +1. **Box the answer** +Wrap the answer in `\\boxed{{}}`. + +2. **Answer type** +The boxed content must be a single number. + +3. **Formatting rules** +* Follow every formatting instruction in the original question (units, rounding, decimal places, etc.). +* Use digits only; do **not** use words, commas or symbols (e.g., "$", "!", "?", "/"). +* Do **not** add any units (e.g., "%", "$", "USD", "Å", "m", "m^2", "m^3"), unless required. +* Ensure the correct unit (e.g., grams versus kilograms, meters versus kilometers, hours versus thousand hours); if the question specifies "thousand hours" or "1000 hours", treat it as the required unit — output a number like 13 (thousand hours) instead of 13000 (hours). + +4. **Additional constraints** +* If the **Agent Summary** is incomplete or unclear, provide the best possible answer (educated guess). + +5. **Common pitfalls to avoid** +* Minor mismatches in the required format. +* Unit-conversion errors, especially with uncommon units. +* Incorrect precision, rounding or scale (e.g., 0.01 vs 0.001), **double-check the required level**. +* Conflicts between textual instructions and example formatting, just follow the example: if the question says to "retain the percentile" but the example shows 0.001, use 0.001 rather than 0.01. +* Do not partially convert text-based numbers—ensure full and accurate conversion (e.g., "one hundred million" → 100000000, not 100). 
+ +--- + +# Quick reference examples + +* $100 → 100 +* 100 USD → 100 +* €50 → 50 +* £75 → 75 +* ¥1,000 → 1000 +* 1,234 m → 1234 +* 3,456,789 kg → 3456789 +* 70% → 70 +* 12.5% → 12.5 +* 0.045 m³ → 0.045 +* 0.045 m^3 → 0.045 +* −40 °C → -40 +* 100 km/h → 100 +* 5000 m^2 → 5000 +* 2.54 cm → 2.54 +* 50 kg → 50 +* 4.0 L → 4 +* 13 thousand hours → 13 +* Page 123/456 → 123/456 +* 100 million → 100000000 +* 200 Ω → 200 +* 200 Å → 200 +* 9.81 m/s² → 9.81 +* 0 dB → 0 + +--- +""" + + common_confidence_section, + "string": f"""# Inputs + +* **Original Question**: `{task_description_detail}` +* **Agent Summary**: `{summary}` + +--- + +# Task + +1. **Independently derive** the best possible answer, step by step, based solely on evidence and reasoning from the Agent Summary. **Ignore the summary's "Final Answer" field** at this stage. +2. **Compare** your derived answer to the final answer provided in the Agent Summary (ignoring formatting and phrasing requirements at this stage). +– If both are well supported by the summary's evidence, choose the one with stronger or clearer support. +– If only one is well supported, use that one. +3. **Revise** your chosen answer to fully satisfy all formatting and phrasing requirements listed below (**Formatting rules**, **Additional constraints**, **Common pitfalls to avoid**, and **Quick reference examples**). These requirements override those in the original question if there is any conflict. + +If no answer is clearly supported by the evidence, provide a well-justified educated guess. **Always wrap your final answer in a non-empty \\boxed{{...}}.** + +--- + +# Output Guidelines + +1. **Box the answer** +Wrap the final answer in \\boxed{{...}}. + +2. **Answer type** +The boxed content must be **one** of: +* a single short phrase (fewest words possible) +* a comma-separated list of numbers and/or strings + +3. 
**Formatting rules** +* Follow every formatting instruction in the original question (alphabetization, sequencing, units, rounding, decimal places, etc.). +* Omit articles and abbreviations unless explicitly present in the expected answer. +* If a string contains numeric information, spell out the numbers **unless** the question itself shows them as digits. +* Do **not** end the answer with ".", "!", "?", or any other punctuation. +* Use only standard ASCII quotation marks ("" and ''), **not** stylized or curly quotation marks (such as “ ” ‘ ’). +* Remove invisible or non-printable characters. +* If the output is lists, apply the rules item-by-item. +* Avoid unnecessary elaboration - keep the answer as short as possible + - Do **not** add "count", "number", "count of", "total", or similar quantifying words when the noun itself already refers to the quantity (e.g., use the bare noun form only). + - No geographical modifiers (e.g., "Western", "Southern"), + - Use the simplest, most commonly accepted term for a substance or object (e.g., "diamond" instead of "crystalline diamond", "silicon" instead of "silicon crystals") +* For mathematical symbols, match the symbol style in the question; never substitute LaTeX commands (e.g., use ≤, not \\leq, use pure text, not \\text{{}}, use ↔, not \\leftrightarrow). +* For birthplaces, give the name as it was at the time of birth, not the current name. + +4. **Additional constraints** +* If the Agent Summary is incomplete or unclear, provide the best possible answer (educated guess). +* Keep the answer as short and direct as possible—no explanations or parenthetical notes. + +5. **Common pitfalls to avoid** +* Minor mismatches between required and produced formats. +* Conflicts between textual instructions and example formatting—follow the example. +* **Names**: give only the commonly used first + last name (no middle name unless requested). 
+* **Countries**: use the common name (e.g., "China", "Brunei")
+* **Locations**: output only the requested location name, without including time or modifiers (e.g., "The Castle", "The Hotel")
+* When the question provides examples of expected format (e.g., "ripe strawberries" not "strawberries"), follow the exact wording style shown in the examples, preserving all descriptive terms and adjectives as demonstrated.
+* Answer with historical location names when the Agent Summary provides them. Never override a historical location name. For example, a birthplace should be referred to by the name it had at the time of birth (i.e., answer the original name).
+* For questions asking to "identify" something, focus on the final answer, not the identification process.
+
+---
+
+# Quick reference examples
+
+* INT. THE CASTLE – DAY 1 → The Castle
+* INT. THE HOTEL – NIGHT → The Hotel
+* INT. THE SPACESHIP – DAWN → The Spaceship
+* INT. THE LIBRARY – EVENING → The Library
+* INT. CLASSROOM #3 – MORNING → Classroom #3
+* People's Republic of China → China
+* citation count → citations
+* Brunei Darussalam → Brunei
+* United States of America → United States
+* Republic of Korea → South Korea
+* New York City, USA → New York City
+* São Paulo (Brazil) → São Paulo
+* John Michael Doe → John Doe
+* Mary Anne O'Neil → Mary O'Neil
+* Dr. Richard Feynman → Richard Feynman
+* INT. ZONE 42 – LEVEL B2 → Zone 42 – Level B2
+* INT. 
THE UNDERWATER BASE – MIDNIGHT → The Underwater Base +* Sam’s Home → Sam's Home +* Mike’s phone → Mike's phone + +--- +""" + + common_confidence_section, + } + + full_prompt = full_prompts.get( + answer_type if answer_type in ["number", "time"] else "string" + ) + + print("O3 Extract Final Answer Prompt:") + print(full_prompt) + + message_id = _generate_message_id() + response = await client.chat.completions.create( + model="o3", + messages=[{"role": "user", "content": f"[{message_id}] {full_prompt}"}], + reasoning_effort="medium", + ) + result = response.choices[0].message.content + + # Check if result is empty, raise exception to trigger retry if empty + if not result or not result.strip(): + raise ValueError("O3 final answer extraction returned empty result") + + # Verify boxed answer exists + boxed_match = re.search(r"\\boxed{([^}]*)}", result) + if not boxed_match: + raise ValueError("O3 final answer extraction returned empty answer") + + print("response:", result) + + # Return the full response directly for downstream LLM processing + # This contains all structured information: analysis, boxed answer, confidence, evidence, and weaknesses + return result + + +@retry(wait=wait_exponential(multiplier=15), stop=stop_after_attempt(5)) +async def o3_extract_browsecomp_zh_final_answer( + task_description_detail: str, summary: str, api_key: str +) -> str: + """Use O3 model to extract final answer from summary""" + client = AsyncOpenAI(api_key=api_key, timeout=600) + + chinese_supplement = """ + +## 中文答案抽取指导 + +如果原始问题或代理总结涉及中文内容,请遵循以下指导: + +- **语境理解**:在分析代理总结和原始问题时,保持对中文语境的敏感性,理解可能的文化背景和表达习惯 +- **答案识别**:在识别最佳答案时,优先考虑符合中文表达习惯的答案形式 +- **格式处理**:对于中文特有的格式要求(如中文日期格式、中文数字表达等),确保答案符合中文用户的期望 +- **术语准确性**:保持中文术语的准确性,避免因直译造成的表达不当 +- **分析过程**:整个分析和推理过程应使用中文进行,确保语言一致性 +- **最终答案**:确保最终答案符合中文语境下的表达方式和格式要求 +- **等价名称**:如果最终答案有多种等价名称,请在响应中明确提及 **所有** 等价的中英文名称 + +--- + +""" + output_format_section = """ +# 输出格式 + +请严格按照以下格式返回你的分析: + +**逐步分析:** +[你的详细推理过程] + +**最终答案:** \\boxed{...} + 
+**置信度:** [0-100整数] + +**支持证据:** [支持此答案的证据总结] +""" + + # Common confidence assessment section (unified for all languages) + common_confidence_section = ( + """ +# 置信度评估 + +请根据客观标准,对该答案被自动验证器判定为正确的可能性进行打分(0-100分): + +**校准指南(请以此为客观参考):** +- **85-100**:有直接的事实证据支持,无矛盾,格式要求完全满足 +- **70-84**:有强有力的支持证据,但存在小的缺口或格式略有不确定 +- **55-69**:有一定证据,但需要解释,或存在部分相互矛盾的信息 +- **40-54**:证据有限,不确定性较大,可能存在多个合理答案 +- **25-39**:证据薄弱,主要依赖推理,信息可能不完整 +- **0-24**:没有支持证据,纯属猜测,或与已知事实相矛盾 + +**客观校准检查:** +1. **证据可验证性**:关键事实能否直接从代理总结中验证? +2. **矛盾检测**:总结中是否有内容与该答案相矛盾? +3. **完整性检查**:总结中是否包含足够信息以有信心地作答? +4. **格式清晰度**:格式要求是否明确且被正确遵循? + +请保守打分——如果在两个区间之间犹豫,请选择较低的分数。 + +--- +""" + + chinese_supplement + + output_format_section + ) + + full_prompt = ( + f"""# 输入 + +* **原始问题**:`{task_description_detail}` +* **Agent总结**:`{summary}` + +--- + +# 任务 + +1. **独立推导**:仅根据Agent总结中的证据和推理,逐步独立推导出最优答案。**此阶段请忽略总结中的“最终答案”字段。** +2. **对比**:将你推导出的答案与Agent总结中给出的最终答案进行对比(此阶段忽略格式和表述要求)。 + – 如果两者都得到了总结证据的有力支持,选择支持更充分或更清晰的那个。 + – 如果只有一个答案有充分证据支持,则采用该答案。 +3. **修订**:将你选定的答案修订为完全符合下方所有格式和表述要求(**格式规则**、**附加约束**、**常见错误**)。 +4. **输出**:你需要在输出中展现你的分析过程,并给出最终答案。 + +如果没有答案能被证据明确支持,请给出有充分理由的最佳猜测。**最终答案必须用非空的 \\boxed{{...}} 包裹。** + +--- + +# 输出指南 + +1. **答案加框** +用 `\\boxed{{}}` 包裹最终答案。 + +2. **格式规则** +* 严格遵循原始问题中的所有格式说明(单位、四舍五入、保留小数位等)。 +* 确保使用正确的单位(如小时与千小时);如果题目要求“千小时”或“1000小时”,则以此为准——输出如 13(千小时),而不是 13000(小时)。 +* 如果题目的文字说明与示例在精度或四舍五入上有出入,以示例为准——匹配其小数位数和四舍五入方式。 +* 如题目答案是地名、人名、组织名、国家名等,请给出标准全称,并用括号注释常用说法或等价说法(如有)。 +* 如题目答案有多种称呼方式、翻译方式,请给出所有中英文等价表达,并用明确标注“等价表达不唯一”。 + +3. **附加约束** +* 如果**Agent总结**内容不完整或不清晰,请给出最佳答案(合理猜测)。 +* 如果一个答案实体有多个名称、说法、叫法,请在最终答案用括号注释**所有等价的名称**,包括官方中英文对照(如有)。 + +4. 
**常见错误** +* 拥有官方中文名称的英文实体没有给出中文名称。 +* 拥有多个等价表达的答案、只给出了一种说法。 + +""" + + common_confidence_section + ) + + print("O3 Extract Final Answer Prompt:") + print(full_prompt) + + message_id = _generate_message_id() + response = await client.chat.completions.create( + model="o3", + messages=[{"role": "user", "content": f"[{message_id}] {full_prompt}"}], + reasoning_effort="medium", + ) + result = response.choices[0].message.content + + # Check if result is empty, raise exception to trigger retry if empty + if not result or not result.strip(): + raise ValueError("O3 final answer extraction returned empty result") + + # Verify boxed answer exists + boxed_match = re.search(r"\\boxed{([^}]*)}", result) + if not boxed_match: + raise ValueError("O3 final answer extraction returned empty answer") + + print("response:", result) + + # Return the full response directly for downstream LLM processing + # This contains all structured information: analysis, boxed answer, confidence, evidence, and weaknesses + return result diff --git a/libs/miroflow/src/miroflow/utils/tool_utils.py b/libs/miroflow/src/miroflow/utils/tool_utils.py index b2d2c405..6009d08d 100644 --- a/libs/miroflow/src/miroflow/utils/tool_utils.py +++ b/libs/miroflow/src/miroflow/utils/tool_utils.py @@ -17,8 +17,12 @@ def create_mcp_server_parameters( cfg: DictConfig, agent_cfg: DictConfig, logs_dir: str | None = None ): """Define and return MCP server configuration list""" - ENABLE_CLAUDE_VISION = cfg.agent.tool_config["tool-vqa"]["enable_claude_vision"] - ENABLE_OPENAI_VISION = cfg.agent.tool_config["tool-vqa"]["enable_openai_vision"] + ENABLE_CLAUDE_VISION = cfg.agent.tool_config["tool-image-video"][ + "enable_claude_vision" + ] + ENABLE_OPENAI_VISION = cfg.agent.tool_config["tool-image-video"][ + "enable_openai_vision" + ] configs = [] if agent_cfg.get("tools", None) is not None and "tool-code" in agent_cfg["tools"]: @@ -40,10 +44,13 @@ def create_mcp_server_parameters( } ) - if agent_cfg.get("tools", None) is not 
None and "tool-vqa" in agent_cfg["tools"]: + if ( + agent_cfg.get("tools", None) is not None + and "tool-image-video" in agent_cfg["tools"] + ): configs.append( { - "name": "tool-vqa", + "name": "tool-image-video", "params": StdioServerParameters( command=sys.executable, args=["-m", "miroflow.tool.mcp_servers.vision_mcp_server"], @@ -60,13 +67,10 @@ def create_mcp_server_parameters( } ) - if ( - agent_cfg.get("tools", None) is not None - and "tool-transcribe" in agent_cfg["tools"] - ): + if agent_cfg.get("tools", None) is not None and "tool-audio" in agent_cfg["tools"]: configs.append( { - "name": "tool-transcribe", + "name": "tool-audio", "params": StdioServerParameters( command=sys.executable, args=["-m", "miroflow.tool.mcp_servers.audio_mcp_server"], @@ -185,6 +189,9 @@ def create_mcp_server_parameters( "SERPER_API_KEY": cfg.env.serper_api_key, "JINA_API_KEY": cfg.env.jina_api_key, "GEMINI_API_KEY": cfg.env.gemini_api_key, + "REMOVE_SNIPPETS": cfg.env.remove_snippets, + "REMOVE_KNOWLEDGE_GRAPH": cfg.env.remove_knowledge_graph, + "REMOVE_ANSWER_BOX": cfg.env.remove_answer_box, }, ), } @@ -204,21 +211,21 @@ def expose_sub_agents_as_tools(sub_agents_cfg: DictConfig): """Expose sub-agents as tools""" sub_agents_server_params = [] for sub_agent in sub_agents_cfg.keys(): - if "agent-browsing" in sub_agent: # type: ignore + if "agent-worker" in sub_agent: # type: ignore sub_agents_server_params.append( dict( - name="agent-browsing", + name="agent-worker", tools=[ dict( - name="search_and_browse", - description="This tool is an agent that performs the subtask of searching and browsing the web for specific missing information and generating the desired answer. The subtask should be clearly defined, include relevant background, and focus on factual gaps. It does not perform vague or speculative subtasks. \nArgs: \n\tsubtask: the subtask to be performed. \nReturns: \n\tthe result of the subtask. 
", + name="execute_subtask", + description="This tool is an agent that performs various subtasks to collect information and execute specific actions. It can access the internet, read files, program, and process multimodal content, but is not specialized in complex reasoning or logical thinking. The tool returns processed summary reports rather than raw information - it analyzes, synthesizes, and presents findings in a structured format. The subtask should be clearly defined, include relevant background, and focus on a single, well-scoped objective. It does not perform vague or speculative subtasks. \nArgs: \n\tsubtask: the subtask to be performed. \nReturns: \n\tthe processed summary report of the subtask. ", schema={ "type": "object", "properties": { "subtask": {"title": "Subtask", "type": "string"} }, "required": ["subtask"], - "title": "search_and_browseArguments", + "title": "execute_subtaskArguments", }, ) ],