Conversation

@Ki-Seki (Contributor) commented Dec 2, 2025

Fixes #1784

@Ki-Seki (Contributor, Author) commented Dec 4, 2025

@RobinPicard Since this PR turned out a bit larger than expected, I’ve put together a detailed change log to help streamline your review. 🤗

1. Best-Effort Use of Chat Completion

For the four local model backends (llamacpp, mlxlm, transformers, and vllm_offline), this PR implements the following logic proposed in issue #1784:

If a local model provides a chat template, we assume it expects us to use it — so we do. If not, we fall back to plain completion mode. If a backend does not support chat mode at all, we also fall back to plain completion mode.

Key changes:

  • Added a helper function _check_hf_chat_template in outlines/models/tokenizer.py to centralize Hugging Face chat template checks, since the logic is shared.
  • Introduced a new property has_chat_template in the TypeAdapters of the local models, typically determined by _check_hf_chat_template.
  • When has_chat_template is True, string-based model_input is converted to chat format whenever possible, usually via format_chat_input (see the sketch after this list).
  • Updated the common/stream/batch generate functions across all four local models accordingly.
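
To make this concrete, here is a minimal sketch of the pattern. The names _check_hf_chat_template, has_chat_template, and format_chat_input are the ones introduced by this PR; the ExampleTypeAdapter class and its format_input method are simplified placeholders rather than the actual adapter code.

# Minimal sketch of the best-effort chat logic; illustrative, not the exact implementation.
from typing import Union

def _check_hf_chat_template(tokenizer) -> bool:
    # A Hugging Face tokenizer advertises chat support via its
    # `chat_template` attribute (None when absent).
    return getattr(tokenizer, "chat_template", None) is not None

class ExampleTypeAdapter:  # placeholder, stands in for a model's TypeAdapter
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    @property
    def has_chat_template(self) -> bool:
        return _check_hf_chat_template(self.tokenizer)

    def format_chat_input(self, text: str) -> list:
        # Wrap a plain string as a single-turn user message.
        return [{"role": "user", "content": text}]

    def format_input(self, model_input: Union[str, list]):
        # Best effort: use chat formatting when a template exists,
        # otherwise fall back to plain completion.
        if isinstance(model_input, str) and self.has_chat_template:
            return self.format_chat_input(model_input)
        return model_input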

2. Special Case: vLLM Offline

  • Since vLLM internally uses TokenizerBase (a non-Hugging Face tokenizer class), additional checks were added for compatibility.
  • Previously, generate_batch was assumed not to support chat mode, and the following code was present:
if any(isinstance(item, Chat) for item in model_input):
    raise TypeError(
        "Batch generation is not available for the `Chat` input type."
    )
  • After re-verification, chat mode is supported (see vLLM documentation), so this restriction has been removed.
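
For illustration only, the extra compatibility check might look roughly like the sketch below. The attribute lookups used to reach an underlying Hugging Face tokenizer are assumptions for the example, not vLLM's documented API.

def _vllm_tokenizer_has_chat_template(tokenizer) -> bool:
    # Handle both a plain Hugging Face tokenizer and a TokenizerBase-style
    # wrapper that may hold one internally (illustrative sketch only).
    if getattr(tokenizer, "chat_template", None) is not None:
        return True
    # Assumed inner attribute name for illustration; the real wrapper layout
    # may differ between vLLM versions.
    inner = getattr(tokenizer, "tokenizer", None)
    return getattr(inner, "chat_template", None) is not None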

3. Special Case: LlamaCPP

  • LlamaCPP does not automatically check for chat templates; instead, it relies on a user-provided parameter.
  • This is because llama-cpp-python provides a default fallback chat template even when the user has not explicitly configured one (source).
  • To align with this behavior, users can now explicitly set the chat_mode parameter (default: True).

Note:
This is the only interface-level change in the PR. Compared to the previous implementation, where string inputs defaulted to plain text completion, they now default to chat completion whenever possible.
I believe this is reasonable: since llama-cpp-python itself encourages a fallback chat template, this change should feel natural to users familiar with LlamaCPP.
For strict backward compatibility, however, we could consider setting the default to chat_mode=False.
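
For context, usage would look roughly like the sketch below; it assumes the flag is forwarded through outlines.from_llamacpp, and the exact signature may differ from the final code.

import llama_cpp
import outlines

llm = llama_cpp.Llama("path/to/model.gguf")  # placeholder path

# Default after this PR: string prompts are rendered through the
# (possibly fallback) chat template provided by llama-cpp-python.
chat_model = outlines.from_llamacpp(llm)

# Opt out to restore the previous plain-completion behavior.
completion_model = outlines.from_llamacpp(llm, chat_mode=False)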

4. Other Changes

  • Updated documentation to reflect the LlamaCPP interface change.
  • Added comprehensive unit tests for all new functionality, all of which pass and maintain 100% coverage.
  • Ensured consistent code style throughout.

@Ki-Seki Ki-Seki marked this pull request as ready for review December 4, 2025 07:55
Copilot AI review requested due to automatic review settings December 4, 2025 07:55
Copilot finished reviewing on behalf of Ki-Seki December 4, 2025 08:00
Copilot AI left a comment

Pull request overview

This PR implements best-effort chat completion support across multiple model adapters (vLLM, Transformers, MLX-LM, and LlamaCpp) by automatically detecting whether a model's tokenizer has a chat template and conditionally formatting string inputs as chat messages.

Key changes:

  • Added automatic chat template detection that wraps plain string inputs as user messages when a chat template is available
  • Introduced a chat_mode parameter for LlamaCpp to allow users to explicitly disable chat-style formatting
  • Implemented _check_hf_chat_template() helper function to check for HuggingFace chat template availability

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

Summary per file:

  • outlines/models/tokenizer.py: Added _check_hf_chat_template() helper function to detect chat template availability
  • outlines/models/vllm_offline.py: Updated VLLMOfflineTypeAdapter to conditionally format string inputs as chat messages based on template availability
  • outlines/models/transformers.py: Modified TransformersTypeAdapter to support chat template detection and conditional formatting
  • outlines/models/mlxlm.py: Updated MLXLMTypeAdapter with chat template support and conditional string input formatting
  • outlines/models/llamacpp.py: Added chat_mode parameter to LlamaCpp model to allow explicit control over chat-style input formatting
  • docs/features/models/llamacpp.md: Updated documentation to describe the new chat_mode parameter and its usage
  • tests/models/test_tokenizer.py: Added tests for the new chat template detection function
  • tests/models/test_vllm_offline_type_adapter.py: Added tests for string input formatting with and without chat templates
  • tests/models/test_transformers_type_adapter.py: Updated tests to cover chat template conditional behavior
  • tests/models/test_mlxlm_type_adapter.py: Added tests for chat template support using mocks
  • tests/models/test_llamacpp_type_adapter.py: Added tests for chat template conditional formatting
  • tests/models/test_llamacpp.py: Added test fixture and tests for non-chat mode, updated streaming tests to handle empty tokens


@RobinPicard (Contributor) commented:

Thanks a lot for the great description! I'll review it in the coming days

@Ki-Seki (Contributor, Author) commented Dec 4, 2025

> Thanks a lot for the great description! I'll review it in the coming days

No worries at all, Robin — no rush! Really happy to be working with you. 🥳

Development

Successfully merging this pull request may close these issues.

[RFC] Inconsistent handling of str vs. Chat inputs across model backends; And we may need to unify them.
