Improve nested handoff conversation history

jhills20 · jhills20 · commit 98d154c4724e · 2025-10-27T10:12:00.000-04:00
diff --git a/docs/handoffs.md b/docs/handoffs.md
@@ -82,7 +82,7 @@ handoff_obj = handoff(
 
 When a handoff occurs, it's as though the new agent takes over the conversation, and gets to see the entire previous conversation history. If you want to change this, you can set an [`input_filter`][agents.handoffs.Handoff.input_filter]. An input filter is a function that receives the existing input via a [`HandoffInputData`][agents.handoffs.HandoffInputData], and must return a new `HandoffInputData`.
 
-By default the runner now wraps the prior transcript inside a developer-role summary message (see [`RunConfig.nest_handoff_history`][agents.run.RunConfig.nest_handoff_history]). That default only applies when neither the handoff nor the run supplies an explicit `input_filter`, so existing code that already customizes the payload (including the examples in this repository) keeps its current behavior without changes.
+By default the runner now wraps the prior transcript inside a developer-role summary message (see [`RunConfig.nest_handoff_history`][agents.run.RunConfig.nest_handoff_history]). The summary appears inside a `<CONVERSATION HISTORY>` block that keeps appending new turns when multiple handoffs happen during the same run. That default only applies when neither the handoff nor the run supplies an explicit `input_filter`, so existing code that already customizes the payload (including the examples in this repository) keeps its current behavior without changes.
 
 There are some common patterns (for example removing all tool calls from the history), which are implemented for you in [`agents.extensions.handoff_filters`][]
 
@@ -127,7 +127,7 @@ router = Agent(
 )
 ```
 
-The new [examples/handoffs/log_handoff_history.py](https://github.com/openai/openai-agents-python/tree/main/examples/handoffs/log_handoff_history.py) script contains a complete runnable sample that prints the nested transcript every time a handoff occurs.
+The new [examples/handoffs/log_handoff_history.py](https://github.com/openai/openai-agents-python/tree/main/examples/handoffs/log_handoff_history.py) script contains a complete runnable sample that prints the nested transcript every time a handoff occurs so you can see the `<CONVERSATION HISTORY>` block that will be passed to the next agent.
 
 ## Recommended prompts
 
diff --git a/docs/running_agents.md b/docs/running_agents.md
@@ -51,7 +51,7 @@ The `run_config` parameter lets you configure some global settings for the agent
 -   [`model_settings`][agents.run.RunConfig.model_settings]: Overrides agent-specific settings. For example, you can set a global `temperature` or `top_p`.
 -   [`input_guardrails`][agents.run.RunConfig.input_guardrails], [`output_guardrails`][agents.run.RunConfig.output_guardrails]: A list of input or output guardrails to include on all runs.
 -   [`handoff_input_filter`][agents.run.RunConfig.handoff_input_filter]: A global input filter to apply to all handoffs, if the handoff doesn't already have one. The input filter allows you to edit the inputs that are sent to the new agent. See the documentation in [`Handoff.input_filter`][agents.handoffs.Handoff.input_filter] for more details.
--   [`nest_handoff_history`][agents.run.RunConfig.nest_handoff_history]: When `True` (the default) the runner wraps the prior transcript in a developer-role summary message and keeps the latest user turn separate before invoking the next agent. Set this to `False` or provide a custom handoff filter if you prefer to pass through the raw transcript. You can also call [`nest_handoff_history`](agents.extensions.handoff_filters.nest_handoff_history) from your own filters to reuse the default behavior. All [`Runner` methods](agents.run.Runner) automatically create a `RunConfig` when you do not pass one, so the quickstarts and examples pick up this default automatically, and any explicit [`Handoff.input_filter`][agents.handoffs.Handoff.input_filter] callbacks continue to override it.
+-   [`nest_handoff_history`][agents.run.RunConfig.nest_handoff_history]: When `True` (the default) the runner wraps the prior transcript in a developer-role summary message, placing the content inside a `<CONVERSATION HISTORY>` block while keeping the latest user turn separate before invoking the next agent. The block automatically appends new turns as subsequent handoffs occur. Set this to `False` or provide a custom handoff filter if you prefer to pass through the raw transcript. You can also call [`nest_handoff_history`](agents.extensions.handoff_filters.nest_handoff_history) from your own filters to reuse the default behavior. All [`Runner` methods](agents.run.Runner) automatically create a `RunConfig` when you do not pass one, so the quickstarts and examples pick up this default automatically, and any explicit [`Handoff.input_filter`][agents.handoffs.Handoff.input_filter] callbacks continue to override it.
 -   [`tracing_disabled`][agents.run.RunConfig.tracing_disabled]: Allows you to disable [tracing](tracing.md) for the entire run.
 -   [`trace_include_sensitive_data`][agents.run.RunConfig.trace_include_sensitive_data]: Configures whether traces will include potentially sensitive data, such as LLM and tool call inputs/outputs.
 -   [`workflow_name`][agents.run.RunConfig.workflow_name], [`trace_id`][agents.run.RunConfig.trace_id], [`group_id`][agents.run.RunConfig.group_id]: Sets the tracing workflow name, trace ID and trace group ID for the run. We recommend at least setting `workflow_name`. The group ID is an optional field that lets you link traces across multiple runs.
diff --git a/src/agents/extensions/handoff_filters.py b/src/agents/extensions/handoff_filters.py
@@ -39,15 +39,22 @@ def remove_all_tools(handoff_input_data: HandoffInputData) -> HandoffInputData:
     )
 
 
+_CONVERSATION_HISTORY_START = "<CONVERSATION HISTORY>"
+_CONVERSATION_HISTORY_END = "</CONVERSATION HISTORY>"
+_NEST_HISTORY_METADATA_KEY = "nest_handoff_history"
+_NEST_HISTORY_TRANSCRIPT_KEY = "transcript"
+
+
 def nest_handoff_history(handoff_input_data: HandoffInputData) -> HandoffInputData:
     """Summarizes the previous transcript into a developer message for the next agent."""
 
     normalized_history = _normalize_input_history(handoff_input_data.input_history)
+    flattened_history = _flatten_nested_history_messages(normalized_history)
     pre_items_as_inputs = [
         _run_item_to_plain_input(item) for item in handoff_input_data.pre_handoff_items
     ]
     new_items_as_inputs = [_run_item_to_plain_input(item) for item in handoff_input_data.new_items]
-    transcript = normalized_history + pre_items_as_inputs + new_items_as_inputs
+    transcript = flattened_history + pre_items_as_inputs + new_items_as_inputs
 
     developer_message = _build_developer_message(transcript)
     latest_user = _find_latest_user_turn(transcript)
@@ -80,15 +87,23 @@ def _run_item_to_plain_input(run_item: RunItem) -> TResponseInputItem:
 
 
 def _build_developer_message(transcript: list[TResponseInputItem]) -> TResponseInputItem:
-    if transcript:
+    transcript_copy = [deepcopy(item) for item in transcript]
+    if transcript_copy:
         summary_lines = [
-            f"{idx + 1}. {_format_transcript_item(item)}" for idx, item in enumerate(transcript)
+            f"{idx + 1}. {_format_transcript_item(item)}" for idx, item in enumerate(transcript_copy)
         ]
     else:
         summary_lines = ["(no previous turns recorded)"]
 
-    content = "Previous conversation before this handoff:\n" + "\n".join(summary_lines)
-    return {"role": "developer", "content": content}
+    content_lines = [_CONVERSATION_HISTORY_START, *summary_lines, _CONVERSATION_HISTORY_END]
+    content = "\n".join(content_lines)
+    return {
+        "role": "developer",
+        "content": content,
+        "metadata": {
+            _NEST_HISTORY_METADATA_KEY: {_NEST_HISTORY_TRANSCRIPT_KEY: transcript_copy}
+        },
+    }
 
 
 def _format_transcript_item(item: TResponseInputItem) -> str:
@@ -130,6 +145,40 @@ def _find_latest_user_turn(
     return None
 
 
+def _flatten_nested_history_messages(
+    items: list[TResponseInputItem],
+) -> list[TResponseInputItem]:
+    flattened: list[TResponseInputItem] = []
+    for item in items:
+        nested_transcript = _extract_nested_history_transcript(item)
+        if nested_transcript is not None:
+            flattened.extend(nested_transcript)
+            continue
+        flattened.append(deepcopy(item))
+    return flattened
+
+
+def _extract_nested_history_transcript(
+    item: TResponseInputItem,
+) -> list[TResponseInputItem] | None:
+    if item.get("role") != "developer":
+        return None
+    metadata = item.get("metadata")
+    if not isinstance(metadata, dict):
+        return None
+    payload = metadata.get(_NEST_HISTORY_METADATA_KEY)
+    if not isinstance(payload, dict):
+        return None
+    transcript = payload.get(_NEST_HISTORY_TRANSCRIPT_KEY)
+    if not isinstance(transcript, list):
+        return None
+    normalized: list[TResponseInputItem] = []
+    for entry in transcript:
+        if isinstance(entry, dict):
+            normalized.append(deepcopy(entry))
+    return normalized if normalized else []
+
+
 def _get_run_item_role(run_item: RunItem) -> str | None:
     role_candidate = run_item.to_input_item().get("role")
     return role_candidate if isinstance(role_candidate, str) else None
diff --git a/tests/test_agent_runner.py b/tests/test_agent_runner.py
@@ -294,7 +294,7 @@ async def test_default_handoff_history_nested_and_filters_respected():
 
     assert isinstance(result.input, list)
     assert result.input[0]["role"] == "developer"
-    assert "Previous conversation" in result.input[0]["content"]
+    assert "<CONVERSATION HISTORY>" in result.input[0]["content"]
     assert "triage summary" in result.input[0]["content"]
     assert result.input[1]["role"] == "user"
     assert result.input[1]["content"] == "user_message"
@@ -324,6 +324,39 @@ def passthrough_filter(data: HandoffInputData) -> HandoffInputData:
     assert filtered_result.input == "user_message"
 
 
+@pytest.mark.asyncio
+async def test_default_handoff_history_accumulates_across_multiple_handoffs():
+    triage_model = FakeModel()
+    delegate_model = FakeModel()
+    closer_model = FakeModel()
+
+    closer = Agent(name="closer", model=closer_model)
+    delegate = Agent(name="delegate", model=delegate_model, handoffs=[closer])
+    triage = Agent(name="triage", model=triage_model, handoffs=[delegate])
+
+    triage_model.add_multiple_turn_outputs(
+        [[get_text_message("triage summary"), get_handoff_tool_call(delegate)]]
+    )
+    delegate_model.add_multiple_turn_outputs(
+        [[get_text_message("delegate update"), get_handoff_tool_call(closer)]]
+    )
+    closer_model.add_multiple_turn_outputs([[get_text_message("resolution")]])
+
+    result = await Runner.run(triage, input="user_question")
+
+    assert result.final_output == "resolution"
+    assert closer_model.first_turn_args is not None
+    closer_input = closer_model.first_turn_args["input"]
+    assert isinstance(closer_input, list)
+    assert closer_input[0]["role"] == "developer"
+    developer_content = closer_input[0]["content"]
+    assert developer_content.count("<CONVERSATION HISTORY>") == 1
+    assert "triage summary" in developer_content
+    assert "delegate update" in developer_content
+    assert closer_input[1]["role"] == "user"
+    assert closer_input[1]["content"] == "user_question"
+
+
 @pytest.mark.asyncio
 async def test_async_input_filter_supported():
     # DO NOT rename this without updating pyproject.toml
diff --git a/tests/test_extension_filters.py b/tests/test_extension_filters.py
@@ -243,7 +243,19 @@ def test_nest_handoff_history_wraps_transcript() -> None:
 
     assert isinstance(nested.input_history, tuple)
     assert nested.input_history[0]["role"] == "developer"
-    assert "Assist reply" in nested.input_history[0]["content"]
+    developer_content = nested.input_history[0]["content"]
+    assert "<CONVERSATION HISTORY>" in developer_content
+    assert "</CONVERSATION HISTORY>" in developer_content
+    assert "Assist reply" in developer_content
+    metadata = nested.input_history[0].get("metadata")
+    assert isinstance(metadata, dict)
+    history_payload = metadata.get("nest_handoff_history")
+    assert isinstance(history_payload, dict)
+    transcript = history_payload.get("transcript")
+    assert isinstance(transcript, list)
+    assert len(transcript) == 4
+    assert transcript[0]["role"] == "user"
+    assert transcript[1]["role"] == "assistant"
     assert nested.input_history[1]["role"] == "user"
     assert nested.input_history[1]["content"] == "Hello"
     assert len(nested.pre_handoff_items) == 0
@@ -264,3 +276,39 @@ def test_nest_handoff_history_handles_missing_user() -> None:
     assert len(nested.input_history) == 1
     assert nested.input_history[0]["role"] == "developer"
     assert "reasoning" in nested.input_history[0]["content"].lower()
+
+
+def test_nest_handoff_history_appends_existing_history() -> None:
+    first = HandoffInputData(
+        input_history=(_get_user_input_item("Hello"),),
+        pre_handoff_items=(_get_message_output_run_item("First reply"),),
+        new_items=(),
+        run_context=RunContextWrapper(context=()),
+    )
+
+    first_nested = nest_handoff_history(first)
+    developer_message = first_nested.input_history[0]
+
+    follow_up_history = (
+        developer_message,
+        _get_user_input_item("Another question"),
+    )
+
+    second = HandoffInputData(
+        input_history=follow_up_history,
+        pre_handoff_items=(_get_message_output_run_item("Second reply"),),
+        new_items=(_get_handoff_output_run_item("transfer"),),
+        run_context=RunContextWrapper(context=()),
+    )
+
+    second_nested = nest_handoff_history(second)
+
+    assert isinstance(second_nested.input_history, tuple)
+    developer = second_nested.input_history[0]
+    assert developer["role"] == "developer"
+    content = developer["content"]
+    assert content.count("<CONVERSATION HISTORY>") == 1
+    assert content.count("</CONVERSATION HISTORY>") == 1
+    assert "First reply" in content
+    assert "Second reply" in content
+    assert "Another question" in content