Description
When speculative decoding and guided decoding (JSON schema / regex / grammar) are both enabled, guided constraints are silently ignored. The generated output may violate the specified schema, and there is no warning to the user.
There is an explicit TODO in the code: spec_agent.py:332 — # TODO: guided decoding not supported yet for spec decoding.
Root Cause
The GuidedDecodingManager (which wraps xgrammar for constraint enforcement) is created in model_agent/agent.py and passed to FusedLogitsProcessor for the main decode path, but it is never propagated into the speculative decoding path. Specifically:
- SpecModelAgent has no access to GuidedDecodingManager: build_spec_agent does not accept or forward it.
- FusedLogitsProcessor in _rejection_sampling (spec_agent.py:327-333) is instantiated without guided_decoding_manager, so target logits verification applies no constraints.
- Draft model proposers (e.g., DeepseekMTP, Eagle) generate tokens via raw logits.argmax() without any logits masking, so draft tokens can freely violate constraints.
- Grammar state is not incrementally updated during the draft verification loop: accept_token is only called for the final bonus token, not for each accepted draft token.
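The draft-side gap can be illustrated with a toy sketch that is independent of lmdeploy and xgrammar. The helper names (raw_argmax, masked_argmax), the four-token vocabulary, and the allowed-token set are all hypothetical; a real implementation would apply the grammar bitmask to a logits tensor rather than loop over a list.

```python
import math

def raw_argmax(logits):
    """Pick the highest-scoring token id with no constraint applied."""
    return max(range(len(logits)), key=logits.__getitem__)

def masked_argmax(logits, allowed_token_ids):
    """Pick the highest-scoring token id among those the grammar allows."""
    best_id, best_score = None, -math.inf
    for token_id, score in enumerate(logits):
        if token_id in allowed_token_ids and score > best_score:
            best_id, best_score = token_id, score
    if best_id is None:
        raise ValueError("grammar mask allows no token")
    return best_id

# Hypothetical toy vocabulary: 0='{', 1='"', 2='hello', 3='}'
draft_logits = [0.1, 0.3, 2.5, 0.2]
allowed_at_start = {0}  # a JSON object has to open with '{'

unconstrained = raw_argmax(draft_logits)                     # token 2 ('hello')
constrained = masked_argmax(draft_logits, allowed_at_start)  # token 0 ('{')
```

Without the mask, the draft proposes a token that the grammar would reject at the very first position.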
Impact
- Users who set response_format (JSON schema) or regex/grammar constraints with speculative decoding enabled will get unconstrained output.
- This is a silent correctness issue: no error, no warning, just wrong output.
Reproduction
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
pipe = pipeline('your-model', backend_config=TurbomindEngineConfig(speculative_decoding=True, ...))
# JSON schema constraint + spec decode → constraint is ignored
result = pipe(
    messages='Generate a person',
    gen_config=GenerationConfig(
        response_format={"type": "json_schema", "json_schema": {"name": "person", "schema": {"type": "object", "properties": {"name": {"type": "string"}}}}}
    )
)
# Output may not conform to the schema
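
To turn "may not conform" into a pass/fail signal, the generated text can be checked against the schema by hand. The sketch below uses only the standard library and hard-codes the person schema from the reproduction. The helper name conforms_to_person_schema is hypothetical, and it assumes the generated text has already been extracted from the result as a plain string.

```python
import json

def conforms_to_person_schema(text):
    """Return True if text is a JSON object whose optional 'name' is a string."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return False  # not JSON at all
    if not isinstance(value, dict):
        return False  # schema requires type "object"
    # The schema lists no required fields, so 'name' may be absent.
    return "name" not in value or isinstance(value["name"], str)

# Illustrative strings, not real model output.
valid_output = '{"name": "Ada"}'
free_text_output = "Sure! Here is a person: Ada"
wrong_type_output = '{"name": 42}'

valid_ok = conforms_to_person_schema(valid_output)
free_text_ok = conforms_to_person_schema(free_text_output)
wrong_type_ok = conforms_to_person_schema(wrong_type_output)
```

Running the reproduction with and without speculative decoding and comparing the result of this check is one way to confirm the regression.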
Proposed Fix
- Pass guided_decoding_manager from ModelAgent → build_spec_agent → SpecModelAgent
- Pass it to the FusedLogitsProcessor instantiated inside _rejection_sampling
- Apply FusedLogitsProcessor (with constraints) to draft model logits before argmax in proposers, or at minimum filter draft tokens against the grammar mask
- Call accept_token incrementally for each accepted draft token during verification to keep the GrammarMatcher state consistent
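
The last point can be sketched with a toy greedy verification loop. ToyMatcher is a hypothetical stand-in for a grammar matcher (a small DFA over token ids), and verify_draft is not lmdeploy's _rejection_sampling. The sketch only shows the order of operations: mask the target logits, pick a token, advance the matcher, and then decide whether to keep going.

```python
class ToyMatcher:
    """Hypothetical stand-in for a grammar matcher: a DFA over token ids."""

    def __init__(self, transitions, start=0):
        self.transitions = transitions  # state -> {token_id: next_state}
        self.state = start

    def allowed(self):
        return set(self.transitions.get(self.state, {}))

    def accept_token(self, token_id):
        next_state = self.transitions.get(self.state, {}).get(token_id)
        if next_state is None:
            return False
        self.state = next_state
        return True


def verify_draft(draft_tokens, target_logits_per_step, matcher):
    """Greedy verification with the grammar state advanced on every emitted token.

    target_logits_per_step has len(draft_tokens) + 1 rows; the last row
    supplies the bonus token when every draft token is accepted.
    """
    output = []
    for step, logits in enumerate(target_logits_per_step):
        allowed = matcher.allowed()
        if not allowed:
            break  # grammar has nothing left to emit
        target_choice = max(allowed, key=lambda t: logits[t])
        matcher.accept_token(target_choice)  # advance state per token, not only at the end
        output.append(target_choice)
        if step >= len(draft_tokens) or draft_tokens[step] != target_choice:
            break  # bonus token emitted, or draft rejected and replaced
    return output


# Hypothetical toy vocabulary: 0='{', 1='"name"', 2=':', 3='"x"', 4='}'
transitions = {0: {0: 1}, 1: {1: 2}, 2: {2: 3}, 3: {3: 4}, 4: {4: 5}}
matcher = ToyMatcher(transitions)

draft_tokens = [0, 1, 3]  # third draft token skips the ':' the grammar requires
target_logits = [
    [3.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 0.0],  # unconstrained argmax would be 3, agreeing with the bad draft
    [0.0, 0.0, 0.0, 3.0, 0.0],
]
accepted = verify_draft(draft_tokens, target_logits, matcher)
```

In this toy run the first two draft tokens are kept, the third is replaced by the grammar-valid token 2, and the matcher ends in the state that follows ':'. If the mask were skipped at the third step, the target's raw argmax would agree with the invalid draft token and the violation would pass verification.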
Key Files
- lmdeploy/pytorch/spec_decode/spec_agent.py: TODO at line 332, _rejection_sampling, SpecModelAgent
- lmdeploy/pytorch/spec_decode/proposers/deepseek_mtp.py: draft generation via raw argmax
- lmdeploy/pytorch/engine/model_agent/agent.py: GuidedDecodingManager creation, build_spec_agent call
- lmdeploy/pytorch/engine/guided_process.py: GuidedDecodingManager implementation
- lmdeploy/pytorch/engine/logits_process.py: FusedLogitsProcessor with bitmask logic