
[Bug] Speculative decoding does not support guided decoding #4551

@ZhijunLStudio

Description


When speculative decoding and guided decoding (JSON schema / regex / grammar) are both enabled, guided constraints are silently ignored. The generated output may violate the specified schema, and there is no warning to the user.

There is an explicit TODO in the code at spec_agent.py:332: "# TODO: guided decoding not supported yet for spec decoding."

Root Cause

The GuidedDecodingManager (which wraps xgrammar for constraint enforcement) is created in model_agent/agent.py and passed to FusedLogitsProcessor for the main decode path, but it is never propagated into the speculative decoding path. Specifically:

  1. SpecModelAgent has no access to GuidedDecodingManager; build_spec_agent does not accept or forward it.
  2. FusedLogitsProcessor in _rejection_sampling (spec_agent.py:327-333) is instantiated without guided_decoding_manager, so target logits verification applies no constraints.
  3. Draft model proposers (e.g., DeepseekMTP, Eagle) generate tokens via raw logits.argmax() without any logits masking, so draft tokens can freely violate constraints.
  4. Grammar state is not incrementally updated during the draft verification loop — accept_token is only called for the final bonus token, not for each accepted draft token.
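A minimal, self-contained sketch (toy names, not the actual lmdeploy code) of why points 2 and 3 matter: the main decode path selects tokens from grammar-masked logits, while a draft proposer that calls raw argmax ignores the mask entirely and can emit disallowed tokens.

```python
# Toy illustration of grammar-masked vs. unmasked token selection.
# `masked_argmax` models the constrained main path; `raw_argmax` models
# the draft-proposer behaviour described in the issue.

def masked_argmax(logits, allowed):
    """Pick the highest-scoring token among those the grammar permits."""
    best = None
    for tok, score in enumerate(logits):
        if tok in allowed and (best is None or score > logits[best]):
            best = tok
    return best

def raw_argmax(logits):
    """Pick the highest-scoring token with no grammar masking at all."""
    return max(range(len(logits)), key=logits.__getitem__)

logits = [0.1, 0.9, 0.3, 0.2]   # token 1 has the highest score
allowed = {0, 2}                # grammar permits only tokens 0 and 2

print(masked_argmax(logits, allowed))  # 2 -- respects the grammar
print(raw_argmax(logits))              # 1 -- violates the grammar
```

Because rejection sampling verifies draft tokens against target logits, an unmasked target-side FusedLogitsProcessor (point 2) means even the verification step cannot catch these violations.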

Impact

  • Users who set response_format (JSON schema) or regex/grammar constraints with speculative decoding enabled will get unconstrained output.
  • This is a silent correctness issue — no error, no warning, just wrong output.

Reproduction

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

pipe = pipeline('your-model', backend_config=TurbomindEngineConfig(speculative_decoding=True, ...))

# JSON schema constraint + spec decode → constraint is ignored
result = pipe(
    messages='Generate a person',
    gen_config=GenerationConfig(
        response_format={"type": "json_schema", "json_schema": {"name": "person", "schema": {"type": "object", "properties": {"name": {"type": "string"}}}}}
    )
)
# Output may not conform to the schema
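One way to confirm the constraint is being ignored is to validate the returned text against the requested schema. Below is a stdlib-only check specialized to the toy schema above (a real test would use a full JSON Schema validator); the function name and its simplifications are mine, not part of lmdeploy.

```python
import json

def conforms(text):
    """Simplified check for the example schema: the output must be a JSON
    object, and its "name" field (if present) must be a string."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and isinstance(obj.get("name", ""), str)

print(conforms('{"name": "Alice"}'))               # True  -- constrained output
print(conforms('Sure! Here is a person: ...'))     # False -- unconstrained output
```

With speculative decoding disabled the same request reliably passes this check, which isolates the regression to the spec-decode path.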

Proposed Fix

  1. Pass guided_decoding_manager from ModelAgent through build_spec_agent into SpecModelAgent
  2. Pass it to the FusedLogitsProcessor instantiated inside _rejection_sampling
  3. Apply FusedLogitsProcessor (with constraints) to draft model logits before argmax in proposers, or at minimum filter draft tokens against the grammar mask
  4. Call accept_token incrementally for each accepted draft token during verification to keep the GrammarMatcher state consistent
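Fix item 4 can be sketched with a toy matcher (the interface is modeled loosely on xgrammar's GrammarMatcher; the verification loop and bonus-token handling are simplified assumptions, not the actual _rejection_sampling code): the grammar state must advance once per accepted draft token so the bitmask for the next step reflects everything emitted so far.

```python
class ToyMatcher:
    """Stand-in for a grammar matcher that records each consumed token."""
    def __init__(self):
        self.consumed = []
    def accept_token(self, tok):
        self.consumed.append(tok)

def verify(matcher, draft_tokens, accepted_mask, bonus_token):
    """Advance grammar state for every accepted draft token, then the
    final (bonus/recovery) token -- not only the final token."""
    accepted = []
    for tok, ok in zip(draft_tokens, accepted_mask):
        if not ok:                   # stop at the first rejected draft token
            break
        matcher.accept_token(tok)    # keep grammar state in sync per token
        accepted.append(tok)
    matcher.accept_token(bonus_token)
    return accepted + [bonus_token]

m = ToyMatcher()
out = verify(m, draft_tokens=[5, 7, 9], accepted_mask=[True, True, False],
             bonus_token=11)
print(out)          # [5, 7, 11]
print(m.consumed)   # [5, 7, 11] -- grammar state saw every emitted token
```

The current code corresponds to calling accept_token only for the bonus token, which leaves the matcher's state stale and makes every subsequent bitmask wrong.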

Key Files

  • lmdeploy/pytorch/spec_decode/spec_agent.py — TODO at line 332, _rejection_sampling, SpecModelAgent
  • lmdeploy/pytorch/spec_decode/proposers/deepseek_mtp.py — draft generation via raw argmax
  • lmdeploy/pytorch/engine/model_agent/agent.py — GuidedDecodingManager creation, build_spec_agent call
  • lmdeploy/pytorch/engine/guided_process.py — GuidedDecodingManager implementation
  • lmdeploy/pytorch/engine/logits_process.py — FusedLogitsProcessor with bitmask logic
