Modern large language models like GLM-4.6 rely on multiple distinct EOS tokens to signal different types of generation boundaries, but llguidance currently only supports a single EOS token — creating critical compatibility issues in structured output workflows.
Current Behavior:
Currently, llguidance only supports a single end-of-sequence (EOS) token to determine when text generation should terminate. This is hardcoded or configured via a single token ID, which is insufficient for models like GLM-4.6 that use multiple distinct EOS tokens — specifically, 151329, 151336, and 151338 — to signal different types of termination (e.g., final output, tool call completion, or structured format end).
Problem:
When llguidance is used as a structured output backend in systems like SGLang or vLLM, generation often terminates prematurely or never terminates because only the first EOS token (typically 151329) is checked. However, in real-world usage, GLM-4.6 frequently emits one of the other two EOS tokens (151336 or 151338) to mark the end of:
- Tool call responses,
- Nested structured outputs,
- Other generation boundaries.
This mismatch causes the generator to enter infinite loops or produce malformed outputs, since llguidance’s regex-based or grammar-guided generation cannot recognize these alternative EOS tokens as valid stop conditions.
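The mismatch can be illustrated with a minimal sketch (function names and defaults here are hypothetical, not llguidance's actual API; the GLM-4.6 token IDs are those listed above):

```python
# GLM-4.6 uses three distinct EOS token IDs.
GLM_46_EOS_TOKENS = frozenset({151329, 151336, 151338})

def should_stop_single(token_id: int, eos_token: int = 151329) -> bool:
    # Current behavior: only one EOS token is recognized as a stop condition.
    return token_id == eos_token

def should_stop_multi(token_id: int, eos_tokens: frozenset = GLM_46_EOS_TOKENS) -> bool:
    # Proposed behavior: any configured EOS token terminates generation.
    return token_id in eos_tokens

# A tool-call completion ends with token 151336:
assert not should_stop_single(151336)  # generation runs past the endpoint (the bug)
assert should_stop_multi(151336)       # generation stops correctly
```

With the single-token check, tokens 151336 and 151338 pass through as ordinary vocabulary, so the grammar engine keeps expecting more input.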
Requested Feature:
Extend llguidance to support a configurable list (or set) of EOS tokens, rather than a single token. This should be integrated into the grammar engine and the generation termination logic so that generation stops as soon as ANY of the configured EOS tokens is generated.
Implementation suggestion:
- Add a new parameter `eos_tokens: List[int]` to the generator or grammar configuration.
- Update the token-wise stopping condition to check against the full set.
- Maintain backward compatibility by deprecating `eos_token: int` in favor of `eos_tokens = [eos_token]` when only one token is provided.
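The backward-compatibility step could look roughly like this (a sketch only; `resolve_eos_tokens` is a hypothetical helper name, not an existing llguidance function):

```python
import warnings
from typing import List, Optional

def resolve_eos_tokens(
    eos_token: Optional[int] = None,
    eos_tokens: Optional[List[int]] = None,
) -> List[int]:
    """Normalize the legacy single-token config into a list of EOS tokens."""
    if eos_tokens is not None:
        return list(eos_tokens)
    if eos_token is not None:
        # Keep old callers working while steering them to the new parameter.
        warnings.warn(
            "eos_token is deprecated; pass eos_tokens=[...] instead.",
            DeprecationWarning,
        )
        return [eos_token]
    return []

# Legacy callers keep working:
assert resolve_eos_tokens(eos_token=151329) == [151329]
# New callers can pass the full GLM-4.6 set:
assert resolve_eos_tokens(eos_tokens=[151329, 151336, 151338]) == [151329, 151336, 151338]
```

Downstream, the termination check then becomes a set-membership test rather than an equality comparison.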
Use Case:
In a tool-use scenario with GLM-4.6:
- User asks: “What’s the weather in Tokyo? Use the weather tool.”
- Model generates: `{"action": "weather", "action_input": "Tokyo"}` + EOS token 151336.
- System calls the tool, receives the response, and feeds back: "The weather in Tokyo is sunny."
- Model generates final response + EOS token 151338.
If llguidance only checks for 151329, it will not stop at either 151336 or 151338, leading to:
- The model continuing generation past the valid endpoint,
- Potential corruption of structured output (e.g., appending garbage after tool response),
- Or hanging due to non-termination.
Benefits:
- Enables seamless integration of llguidance with GLM-4.6 and other multi-EOS models.
- Eliminates infinite loops and malformed outputs in tool calling and structured JSON generation.
- Brings llguidance parity with xgrammar and other modern structured output libraries.
- Improves reliability and accuracy in production workflows using LLMs with complex termination semantics.
- Increases adoption in enterprise and API-driven contexts where precise generation control is critical.
This feature is essential for production-grade structured output systems, and supporting multiple EOS tokens is a minimal but critical change to match the reality of modern model architectures.