[FEATURE] LLM Token-Level Generation Supervision #370

@iwr-redmond

Description

Feature Description

Rescued from #368:

You may wish to consider implementing one of the token-level supervision options for LlamaCPP to deliver superior adherence during structured generation. It's the difference between asking "pretty please" and guaranteeing a correctly structured response.

As currently implemented by @xsxszab in nexa_inference_text.py, generation will simply fail if the model does not return a valid JSON response or does not follow the requested schema.

Options

LM Format Enforcer (Python)

LM Format Enforcer's llama-cpp-python integration code should be easy to adapt. The package is already used in Red Hat/IBM's enterprise-focused vLLM project (reference).

A demonstration workbook is available here. You may be able to run this workbook as-is by merely changing the imports, e.g.:

```diff
-from llama_cpp import LogitsProcessorList
+from nexa.gguf.llama import LogitsProcessorList
```
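
For orientation, the core of that integration looks roughly like the sketch below. It follows LM Format Enforcer's published llama-cpp-python sample; the `nexa.gguf.llama` import path is taken from the diff above, and the model path and schema are placeholders, so treat this as an illustration rather than tested SDK code.

```python
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.llamacpp import (
    build_llamacpp_logits_processor,
    build_token_enforcer_tokenizer_data,
)

# Assumption: the SDK re-exports llama-cpp-python's classes under nexa.gguf.llama
from nexa.gguf.llama import Llama, LogitsProcessorList

# Any JSON schema works; this one is a placeholder for illustration.
schema = {
    "type": "object",
    "properties": {
        "first_name": {"type": "string"},
        "last_name": {"type": "string"},
        "year_of_birth": {"type": "integer"},
    },
    "required": ["first_name", "last_name", "year_of_birth"],
}

llm = Llama(model_path="model.gguf")  # hypothetical local model path

# Build the tokenizer data once per model, then one logits processor per schema.
tokenizer_data = build_token_enforcer_tokenizer_data(llm)
logits_processors = LogitsProcessorList(
    [build_llamacpp_logits_processor(tokenizer_data, JsonSchemaParser(schema))]
)

prompt = "Please give me information about Michael Jordan as JSON: "
# The processor masks any token that would violate the schema at each decoding
# step, so the output parses against the schema instead of being validated
# (and potentially rejected) after generation.
output = llm(prompt, logits_processor=logits_processors, max_tokens=100)
print(output["choices"][0]["text"])
```

Because the constraint is applied while sampling, this delivers the "guarantee" described above rather than post-hoc validation.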

LLGuidance (upstream)

The LLGuidance Rust crate has recently been added to upstream llama.cpp.

Enabling this feature at compile time requires some fiddling with a Rust toolchain, and there are still some bug fixes to be finalized (pull 11644). However, these are transitional problems, and adopting this approach would probably make it easier for end-users to use structured generation through the SDK.

Labels: deprecated (Issues for nexaSDK v1, the version before July 23, 2025)
