- """Generate with budget forcing using the completions APIs. This relies on raw autocompletion and assumes the model's output is structued in the following form: '<think> ... </think> summary answer'
+ """Generate with budget forcing using the completions APIs. This relies on raw autocompletion and assumes the model's output is structured in the following form: '<think> ... </think> summary answer'
The budget forcing method is proposed in the paper: https://arxiv.org/abs/2501.19393
This implementation tries to follow the key outlines in the paper while ensuring stable and fail-safe operation.
- This is performed via multi-step generation. The model will be called multiple times until requirements are met, in other words, the response will be assembeled conditionally.
+ This is performed via multi-step generation. The model will be called multiple times until requirements are met, in other words, the response will be assembled conditionally.
Args:
think_max_tokens: Budget in number of tokens allocated for the think block
end_response_token: Used by certain models, string indicating end of response block, e.g. "</response>", default None
think_wait_suffix: String to append to force continued thinking, e.g. "\nWait". If set to None, no additional thinking is forced; use None for the upper-bound budget case
answer_suffix: String to append to force a final answer
- answer_token: Token that indicates an answer is generated
+ answer_regex: Answer regex which indicates an answer is generated
Assumptions:
- The chat template is applied on prompt, with think mode enabled
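For readers unfamiliar with budget forcing, the multi-step generation this docstring describes might look roughly like the sketch below. It is not the code from this commit. The parameter names `think_max_tokens`, `think_wait_suffix`, `answer_suffix`, and `answer_regex` come from the docstring; the `complete` callable, `max_waits`, `answer_max_tokens`, and all default values are hypothetical and exist only for illustration.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical completion hook: (prompt, max_tokens) -> (text, tokens_used).
CompleteFn = Callable[[str, int], Tuple[str, int]]


def budget_forcing_sketch(
    complete: CompleteFn,
    prompt: str,
    think_max_tokens: int,
    think_wait_suffix: Optional[str] = "\nWait",
    answer_suffix: str = "\nFinal answer:",
    answer_regex: str = r"Final answer:\s*\S+",
    max_waits: int = 1,            # assumption: cap on forced continuations
    answer_max_tokens: int = 64,   # assumption: budget for the answer step
) -> str:
    response = ""
    remaining = think_max_tokens
    waits = 0

    # Phase 1: keep calling the model until the think block closes
    # or the thinking budget is spent.
    while remaining > 0:
        text, used = complete(prompt + response, remaining)
        remaining -= max(used, 1)  # always make progress, even on empty output
        before, closed, after = text.partition("</think>")
        if not closed:
            response += text
            continue
        if think_wait_suffix is not None and waits < max_waits and remaining > 0:
            # The model tried to stop thinking early: drop the closing tag
            # and anything after it, then nudge the model to continue.
            response += before + think_wait_suffix
            waits += 1
            continue
        response += before + closed + after
        break

    # Budget exhausted without a closing tag: close the think block ourselves.
    if "</think>" not in response:
        response += "\n</think>"

    # Phase 2: if no answer is detected after the think block, force one.
    if not re.search(answer_regex, response.partition("</think>")[2]):
        response += answer_suffix
        text, _ = complete(prompt + response, answer_max_tokens)
        response += text
    return response
```

Passing `think_wait_suffix=None` makes `think_max_tokens` act purely as an upper bound: the loop never extends thinking and only truncates it when the budget runs out.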