<!--
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
-->
# Conversation Context Mode

Conversation context mode controls how prior turns are accumulated when building multi-turn chat requests. Different dataset formats imply different accumulation strategies, and AIPerf automatically selects the right one based on your data.

## Modes

### `accumulate_all`

Standard multi-turn chat. The live inference response is stored and included in subsequent requests.

**Dataset:**
```
Turn 1: {"role": "user", "content": "What is ML?"}
Turn 2: {"role": "user", "content": "Give an example"}
Turn 3: {"role": "user", "content": "How does it differ from traditional programming?"}
```

**Replay:**
```
Request 1: [User "What is ML?"]
  → Server responds with A1

Request 2: [User "What is ML?", Assistant A1, User "Give an example"]
  → Server responds with A2

Request 3: [User "What is ML?", Assistant A1, User "Give an example", Assistant A2, User "How does it differ..."]
  → Server responds with A3
```

Default for:
- Synthetic datasets
- Multi-turn JSONL
- ShareGPT
- Mooncake traces with `hash_ids`

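The accumulation loop above can be sketched in a few lines of Python. This is a hypothetical illustration, not AIPerf's actual implementation; `send` stands in for the live inference call.

```python
# Hypothetical sketch of accumulate_all (not AIPerf's actual code):
# each live response is appended to the history before the next user turn.
def accumulate_all(user_turns, send):
    history = []
    for content in user_turns:
        history.append({"role": "user", "content": content})
        reply = send(history)  # live inference call → A1, A2, ...
        history.append({"role": "assistant", "content": reply})
    return history

# Simulated server: replies A1, A2, A3 in order.
replies = iter(["A1", "A2", "A3"])
history = accumulate_all(
    ["What is ML?", "Give an example", "How does it differ..."],
    lambda msgs: next(replies),
)
# Request 3 carried 5 messages: three user turns plus A1 and A2.
```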
### `drop_responses`

Delta-compressed prompts. Each dataset turn contains only the *new* messages since the previous turn, and AIPerf accumulates these deltas to reconstruct the full conversation. The live inference response is used only for measurement and then discarded -- the pre-canned assistant responses in the dataset are used instead.

**Dataset (each turn is a delta):**
```
Turn 1: [{"role": "user", "content": "What is ML?"}]
Turn 2: [{"role": "assistant", "content": "ML is..."}, {"role": "user", "content": "Give an example"}]
Turn 3: [{"role": "assistant", "content": "Sure..."}, {"role": "user", "content": "How does it differ..."}]
```

**Replay (deltas accumulated):**
```
Request 1: [User "What is ML?"]
  → Live response discarded

Request 2: [User "What is ML?"] + [Assistant "ML is...", User "Give an example"]
  → Live response discarded

Request 3: [User "What is ML?"] + [Assistant "ML is...", User "Give an example"] + [Assistant "Sure...", User "How does it differ..."]
  → Live response discarded
```

Default for:
- N/A (no built-in loader defaults to this mode yet)

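The delta accumulation can be sketched similarly. Again a hypothetical illustration, not AIPerf's code; `send` stands in for the live call whose response is measured and dropped.

```python
# Hypothetical sketch of drop_responses (not AIPerf's actual code):
# dataset deltas are concatenated; live responses are measured, then dropped.
def drop_responses(delta_turns, send):
    history = []
    for delta in delta_turns:
        history.extend(delta)  # deltas include pre-canned assistant messages
        _ = send(history)      # live response used for measurement only
    return history

deltas = [
    [{"role": "user", "content": "What is ML?"}],
    [{"role": "assistant", "content": "ML is..."},
     {"role": "user", "content": "Give an example"}],
]
history = drop_responses(deltas, lambda msgs: "live reply, ignored")
# Request 2 carried 3 messages, with the dataset's "ML is..." as the
# assistant turn rather than the live response.
```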
### `standalone`

Self-contained prompts. Each turn already contains its full context. No session accumulation.

**Dataset:**
```
Turn 1: [{"role": "user", "content": "What is ML?"}]
Turn 2: [{"role": "user", "content": "What is ML?"}, {"role": "assistant", "content": "ML is..."}, {"role": "user", "content": "Give an example"}]
Turn 3: [{"role": "user", "content": "What is ML?"}, {"role": "assistant", "content": "ML is..."}, {"role": "user", "content": "Give an example"}, {"role": "assistant", "content": "Sure..."}, {"role": "user", "content": "How does it differ..."}]
```

**Replay:**
```
Request 1: sends Turn 1 as-is
Request 2: sends Turn 2 as-is
Request 3: sends Turn 3 as-is
```

Each turn is sent exactly as it appears in the dataset.

Default for:
- Mooncake traces with pre-built `messages` arrays

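In sketch form (hypothetical, not AIPerf's code), standalone mode is simply a pass-through:

```python
# Hypothetical sketch of standalone mode (not AIPerf's actual code):
# every turn already carries its full context, so nothing is accumulated.
def standalone(turns, send):
    for messages in turns:
        send(messages)  # each request is the dataset turn, verbatim

sent = []
standalone(
    [
        [{"role": "user", "content": "What is ML?"}],
        [{"role": "user", "content": "What is ML?"},
         {"role": "assistant", "content": "ML is..."},
         {"role": "user", "content": "Give an example"}],
    ],
    sent.append,
)
# sent[1] is the second dataset turn unchanged, three messages long.
```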
## How It Works

Context mode is resolved through a priority chain:

1. **Per-conversation override** -- A conversation in the dataset can specify its own `context_mode`.
2. **Loader default** -- The dataset loader can declare a default based on dataset format semantics.
3. **Global fallback** -- `accumulate_all`.

This means most users never need to think about context mode. The loader picks the right default, and individual conversations can override it when needed.
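The priority chain amounts to a first-non-empty lookup, sketched below. The field and parameter names here are illustrative assumptions, not AIPerf's actual API.

```python
# Hypothetical sketch of the resolution chain (names are illustrative,
# not AIPerf's actual API).
def resolve_context_mode(conversation, loader_default=None):
    # 1. Per-conversation override wins if present.
    # 2. Otherwise the loader's declared default.
    # 3. Otherwise the global fallback, accumulate_all.
    return conversation.get("context_mode") or loader_default or "accumulate_all"

mode = resolve_context_mode({"context_mode": "standalone"}, "drop_responses")
# A per-conversation override beats the loader default: mode == "standalone".
```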