[Bug] Speculative decoding does not support guided decoding

## Description

When speculative decoding and guided decoding (JSON schema / regex / grammar) are both enabled, guided constraints are silently ignored. The generated output may violate the specified schema, and there is no warning to the user.

The code carries an explicit TODO at `spec_agent.py:332`: `# TODO: guided decoding not supported yet for spec decoding`.

## Root Cause

The `GuidedDecodingManager` (which wraps xgrammar for constraint enforcement) is created in `model_agent/agent.py` and passed to `FusedLogitsProcessor` for the main decode path, but it is **never propagated** into the speculative decoding path. Specifically:

1. **`SpecModelAgent` has no access to `GuidedDecodingManager`** — `build_spec_agent` does not accept or forward it.
2. **`FusedLogitsProcessor` in `_rejection_sampling`** (`spec_agent.py:327-333`) is instantiated without `guided_decoding_manager`, so verification of the target logits applies no constraints.
3. **Draft model proposers** (e.g., `DeepseekMTP`, `Eagle`) generate tokens via raw `logits.argmax()` without any logits masking, so draft tokens can freely violate constraints.
4. **Grammar state is not incrementally updated** during the draft verification loop — `accept_token` is only called for the final bonus token, not for each accepted draft token.
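To illustrate points 2 and 3, here is a minimal sketch of what a grammar mask changes about token selection. It is not lmdeploy code: the three-token vocabulary, the `apply_mask` helper, and the hard-coded `allowed` mask are all hypothetical, and the real implementation would apply an xgrammar bitmask to tensors rather than Python lists.

```python
from typing import List

NEG_INF = float("-inf")


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def apply_mask(logits: List[float], allowed: List[bool]) -> List[float]:
    """Set the logit of every grammar-illegal token to -inf."""
    return [x if ok else NEG_INF for x, ok in zip(logits, allowed)]


# Hypothetical toy vocabulary: 0 = '{', 1 = '}', 2 = 'hello'
draft_logits = [0.1, 0.3, 2.5]   # the draft model prefers 'hello'
allowed = [True, False, False]   # a JSON object must start with '{'

# Raw argmax, as the draft proposers are described as doing
unconstrained = argmax(draft_logits)

# Argmax after masking, which is what guided decoding requires
constrained = argmax(apply_mask(draft_logits, allowed))
```

Without the mask the illegal token `hello` wins; with it, the only legal token `{` is selected. The same applies on the target side: if verification runs over unmasked target logits, an illegal draft token can be accepted.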

## Impact

- Users who set `response_format` (JSON schema) or `regex`/`grammar` constraints with speculative decoding enabled will get unconstrained output.
- This is a silent correctness issue: no error, no warning, just output that may violate the requested constraint.

## Reproduction

```python
from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

# The affected code is under lmdeploy/pytorch, so this uses the PyTorch engine.
# `speculative_decoding=True` is kept from the original report as a placeholder
# for enabling spec decode; the remaining engine options are elided.
pipe = pipeline(
    'your-model',
    backend_config=PytorchEngineConfig(speculative_decoding=True),  # ...
)

schema = {'type': 'object', 'properties': {'name': {'type': 'string'}}}

# JSON schema constraint + spec decode → constraint is ignored
result = pipe(
    'Generate a person',
    gen_config=GenerationConfig(
        response_format={
            'type': 'json_schema',
            'json_schema': {'name': 'person', 'schema': schema},
        },
    ),
)
# Output may not conform to the schema
```
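Because the failure is silent, it helps to check the output explicitly when reproducing. The helper below is a minimal sketch using only the standard library; `conforms_to_person_schema` is a hypothetical name, and it hand-codes the one rule in the schema above (an object whose optional `name` must be a string) rather than using a general JSON Schema validator.

```python
import json


def conforms_to_person_schema(text: str) -> bool:
    """Check text against the 'person' schema used in the reproduction."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False  # not JSON at all
    if not isinstance(obj, dict):
        return False  # schema requires type "object"
    # The schema has no "required" list, so "name" may be absent,
    # but if present it must be a string.
    return "name" not in obj or isinstance(obj["name"], str)
```

Assuming the response object exposes the generated string as `result.text`, `conforms_to_person_schema(result.text)` returning `False` with speculative decoding enabled would confirm that the constraint was ignored.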

## Proposed Fix

1. Pass `guided_decoding_manager` from `ModelAgent` → `build_spec_agent` → `SpecModelAgent`
2. Pass it to the `FusedLogitsProcessor` instantiated inside `_rejection_sampling`
3. Apply `FusedLogitsProcessor` (with constraints) to draft model logits before `argmax` in proposers, or at minimum filter draft tokens against the grammar mask
4. Call `accept_token` incrementally for each accepted draft token during verification to keep the `GrammarMatcher` state consistent
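Steps 2 and 4 can be sketched with a toy verification loop. Everything here is an illustrative assumption: `ToyMatcher` is a hypothetical stand-in for xgrammar's `GrammarMatcher`, the grammar (`{`, any number of `x`, then `}`) is invented, and verification is simplified to greedy matching, which may differ from what `_rejection_sampling` actually does.

```python
from typing import List

OPEN, X, CLOSE = 0, 1, 2  # toy vocabulary: '{', 'x', '}'


class ToyMatcher:
    """Hypothetical stand-in for a grammar matcher; accepts '{' 'x'* '}'."""

    def __init__(self) -> None:
        self.opened = False
        self.closed = False

    def allowed(self) -> List[bool]:
        """Mask over the vocabulary for the current grammar state."""
        if self.closed:
            return [False, False, False]
        if not self.opened:
            return [True, False, False]
        return [False, True, True]

    def accept_token(self, token: int) -> None:
        """Advance the grammar state by one emitted token."""
        if not self.allowed()[token]:
            raise ValueError(f"token {token} violates the grammar")
        if token == OPEN:
            self.opened = True
        elif token == CLOSE:
            self.closed = True


def masked_argmax(logits: List[float], allowed: List[bool]) -> int:
    """Index of the largest logit among grammar-legal tokens."""
    legal = [i for i, ok in enumerate(allowed) if ok]
    return max(legal, key=lambda i: logits[i])


def verify(draft: List[int], target_logits: List[List[float]],
           matcher: ToyMatcher) -> List[int]:
    """Greedy verification that keeps the matcher in sync at every step."""
    emitted = []
    for step, proposed in enumerate(draft):
        choice = masked_argmax(target_logits[step], matcher.allowed())
        matcher.accept_token(choice)  # advance state for every emitted token
        emitted.append(choice)
        if choice != proposed:
            return emitted  # draft rejected: keep the target's token, stop
    # All draft tokens accepted: emit the bonus token under the same rules.
    bonus = masked_argmax(target_logits[len(draft)], matcher.allowed())
    matcher.accept_token(bonus)
    emitted.append(bonus)
    return emitted


draft = [OPEN, X, OPEN]  # the third draft token is illegal after '{'
target_logits = [
    [0.2, 0.9, 0.1],
    [0.5, 1.0, 0.3],
    [2.0, 0.1, 0.4],
    [0.0, 0.0, 0.0],
]
matcher = ToyMatcher()
emitted = verify(draft, target_logits, matcher)
```

In this run the first two draft tokens are accepted, the illegal third one is replaced by the target's constrained choice `}`, and the matcher ends in its closed state. If `accept_token` were called only for the final token, the mask at the second step would still be the initial one and would wrongly permit only `{`.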

## Key Files

- `lmdeploy/pytorch/spec_decode/spec_agent.py` — TODO at line 332, `_rejection_sampling`, `SpecModelAgent`
- `lmdeploy/pytorch/spec_decode/proposers/deepseek_mtp.py` — draft generation via raw argmax
- `lmdeploy/pytorch/engine/model_agent/agent.py` — `GuidedDecodingManager` creation, `build_spec_agent` call
- `lmdeploy/pytorch/engine/guided_process.py` — `GuidedDecodingManager` implementation
- `lmdeploy/pytorch/engine/logits_process.py` — `FusedLogitsProcessor` with bitmask logic

[Bug] Speculative decoding does not support guided decoding #4551
