Summary:
- Refactored io_manager into five distinct components:
  - DecoderRunner: module wrapper class.
  - PromptProcessor: handles prompt processing using the decoder and key-value manager.
  - TokenGenerator: generates tokens using the decoder and key-value manager.
  - KVManager: manages the key-value cache with kv_updater, including data buffer allocation, cache updates, and buffer updates in TensorImpl.
  - IBufferAlloc: allocates data buffers from RPC memory or the client buffer.
- Support the multi-turn use case; validated on story llama.
  - To simulate the scenario, decode mode is forced to generate 5 tokens per round, and tokens of random length are inserted after each prefill->decode round finishes.
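The simulated multi-turn loop described above can be sketched as a toy Python program. All names here (KVCache, prefill, decode, run_multi_turn) are illustrative stand-ins, not the actual runner or KVManager API:

```python
import random

# Forced decode length per turn, matching the test scenario in the summary.
DECODE_TOKENS_PER_TURN = 5

class KVCache:
    """Toy stand-in for KVManager: only tracks which positions are filled."""
    def __init__(self):
        self.positions = []

    def append(self, token):
        self.positions.append(token)

def prefill(cache, prompt_tokens):
    # Prefill writes every prompt token into the KV cache in one pass.
    for tok in prompt_tokens:
        cache.append(tok)

def decode(cache, n_tokens):
    # Decode emits tokens one at a time, extending the cache each step.
    generated = []
    for _ in range(n_tokens):
        tok = random.randrange(1000)  # placeholder for real sampling
        cache.append(tok)
        generated.append(tok)
    return generated

def run_multi_turn(num_turns):
    cache = KVCache()
    for _ in range(num_turns):
        # A prompt of random length is inserted after the previous
        # prefill->decode round has finished.
        prompt = [random.randrange(1000) for _ in range(random.randint(1, 8))]
        prefill(cache, prompt)
        decode(cache, DECODE_TOKENS_PER_TURN)
    return len(cache.positions)
```

Each turn grows the same cache, so the test exercises exactly the case the refactor targets: KVManager must keep cache state consistent across repeated prefill/decode rounds rather than for a single conversation.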
     help="User prompts for Llama. When multiple prompts are entered, a multi-turn conversation will be initiated. Note that this feature is currently for testing purposes only.",
     required=True,
     type=str,
+    nargs="+",
 )

 parser.add_argument(
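The effect of adding `nargs="+"` can be sketched with a minimal argparse example; the `--prompt` flag name is assumed from context:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--prompt",
    help="User prompts for Llama. Multiple prompts start a multi-turn conversation.",
    required=True,
    type=str,
    nargs="+",  # collect one or more values into a list
)

# Two prompts on the command line become a two-element list,
# one prompt per conversation turn.
args = parser.parse_args(["--prompt", "hello", "tell me a story"])
print(args.prompt)
```

With `nargs="+"` the attribute is always a list, even when a single prompt is given, so downstream code can iterate over turns uniformly.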
@@ -1090,7 +1090,7 @@ def _build_parser():
 def export_llama(args) -> None:
     if args.compile_only and args.pre_gen_pte:
-        exit("Cannot set both compile_only and pre_gen_pte as true")
+        raise RuntimeError("Cannot set both compile_only and pre_gen_pte as true")
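The swap from `exit(...)` to `raise RuntimeError(...)` matters because `exit()` raises `SystemExit`, which subclasses `BaseException` and therefore slips past ordinary `except Exception` handlers, while `RuntimeError` is a normal, catchable error with a full traceback. A minimal illustration (the check is reduced to a hypothetical standalone function):

```python
def export_llama_check(compile_only, pre_gen_pte):
    # Mirrors the guard in the diff: the two flags are mutually exclusive.
    if compile_only and pre_gen_pte:
        raise RuntimeError("Cannot set both compile_only and pre_gen_pte as true")

try:
    export_llama_check(True, True)
except RuntimeError as err:
    # Callers (and test harnesses) can catch and report the error
    # instead of the whole process silently exiting.
    print(f"caught: {err}")
```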
-DEFINE_string(prompt, "The answer to the ultimate question is", "Prompt.");
+DEFINE_string(
+    prompt,
+    "The answer to the ultimate question is",
+    "User prompts for Llama. When multiple prompts are entered, a multi-turn conversation will be initiated. Note that this feature is currently for testing purposes only.");
 DEFINE_string(
     system_prompt,
     "",
@@ -49,10 +52,8 @@ DEFINE_int32(
     "Total number of tokens to generate (prompt + output).");