
feat: Enhance sglang adapter and OpenAI api compatibility#158

Open
attafosu wants to merge 28 commits into main from feat/attafosu/sglang-openai-api-compatibility

Conversation

@attafosu
Collaborator

@attafosu attafosu commented Mar 9, 2026

What does this PR do?

  • Adds a unified dataset preset for CNN/DailyMail that supports both the OpenAI-compatible and native SGLang APIs.
  • Refactors the Harmonize transform in the sglang adapter so it skips tokenization when input_tokens are pre-generated by a dataset preset.
  • Adds unit tests for the preset datasets and the Harmonize transform.
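
The mode option and pre-tokenized guard described above can be sketched in a few lines. This is a self-contained illustration, not the repo's implementation: the toy `_encode` stands in for a real tokenizer, while the class, method, and column names (`Harmonize`, `process_row`, `prompt`, `input_tokens`) mirror the PR description.

```python
class Harmonize:
    """Toy sketch of the transform: validates mode, guards pre-tokenized rows."""

    def __init__(self, mode: str = "harmony"):
        if mode not in {"harmony", "plain"}:
            raise ValueError(f"Invalid harmonize mode: {mode}")
        self.mode = mode

    def _encode(self, text: str) -> list[int]:
        # Stand-in for a real tokenizer's encode(); purely illustrative.
        return [ord(c) for c in text]

    def process_row(self, row: dict) -> dict:
        # Guard: don't overwrite tokens a dataset preset already produced
        # (matters when row processors are fused into one pipeline).
        if row.get("input_tokens") is not None:
            return row
        row["input_tokens"] = self._encode(row["prompt"])
        return row

t = Harmonize(mode="plain")
pre = {"prompt": "hi", "input_tokens": [1, 2, 3]}
assert t.process_row(pre)["input_tokens"] == [1, 2, 3]   # left untouched
assert t.process_row({"prompt": "hi"})["input_tokens"] == [104, 105]
```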

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

attafosu added 2 commits March 4, 2026 11:35
Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
Signed-off-by: attafosu <thomas.atta-fosu@intel.com>

Committer: attafosu <thomas.atta-fosu@intel.com>
@attafosu attafosu requested a review from a team as a code owner March 9, 2026 21:33
Copilot AI review requested due to automatic review settings March 9, 2026 21:33
@github-actions

github-actions bot commented Mar 9, 2026

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions bot requested review from arekay-nv and nvzhihanj March 9, 2026 21:34
@gemini-code-assist

gemini-code-assist bot commented Mar 9, 2026

Summary of Changes

This pull request significantly enhances the data processing pipeline by integrating SGLang compatibility through a new dataset preset and refining the Harmonize transform for more flexible and robust tokenization. It also improves adherence to OpenAI API standards by making certain response fields optional. The changes are supported by comprehensive unit tests and updated documentation, ensuring reliability and ease of use for SGLang and OpenAI API interactions.

Highlights

  • SGLang Adapter and Dataset Preset: Introduced a new llama3_8b_sglang dataset preset for the CNN Dailymail dataset, enabling compatibility with both OpenAI-compatible and native SGLang APIs. This includes specific prompt formatting and tokenization steps tailored for SGLang.
  • Enhanced Harmonize Transform: Refactored the Harmonize data transform to include a mode parameter ('harmony' or 'plain') for flexible tokenization. It now also features a guard to prevent overwriting input_tokens if they are already pre-generated by a dataset preset, ensuring efficient processing in fused pipelines.
  • OpenAI API Compatibility Improvements: Adjusted the ChatCompletionResponseMessage and ChatCompletionResponse types to make refusal, usage, and system_fingerprint fields optional with a default value of None, aligning better with OpenAI API specifications.
  • Comprehensive Unit Testing: Added new unit tests for dataset preset transforms, covering instantiation, transform application, and output column verification. Expanded existing unit tests for the Harmonize transform to cover new mode functionality and the pre-tokenized row guard.
  • Updated Documentation: Updated the Llama3.1-8B example README to include detailed instructions for setting up and running benchmarks with SGLang endpoints, alongside existing vLLM instructions.
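
The optional-field change in the highlights can be illustrated with plain dataclasses; the repo uses msgspec structs, but the defaulting behavior is analogous. Field names (`refusal`, `usage`, `system_fingerprint`) match the PR; everything else here is an illustrative stand-in.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChatCompletionResponseMessage:
    role: str
    content: Optional[str] = None
    refusal: Optional[str] = None  # now optional per this PR

@dataclass
class ChatCompletionResponse:
    id: str
    choices: list
    usage: Optional[dict] = None              # now optional
    system_fingerprint: Optional[str] = None  # now optional

# A server response that omits these fields still constructs cleanly.
msg = ChatCompletionResponseMessage(role="assistant", content="hi")
assert msg.refusal is None
resp = ChatCompletionResponse(id="chatcmpl-1", choices=[])
assert resp.usage is None and resp.system_fingerprint is None
```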


Changelog
  • DATASET_PRESET_TESTING.md
    • Added new documentation detailing how to unit test dataset preset transforms, including quick start commands and scope of tests.
  • examples/05_Llama3.1-8B_Example/README.md
    • Updated dataset download instructions to clarify when it's needed for quantization.
    • Added a new section with instructions for launching an SGLang server.
    • Modified benchmark start instructions to differentiate between vLLM and SGLang endpoint targets.
  • examples/05_Llama3.1-8B_Example/offline_llama3_8b_cnn_sglang_api.yaml
    • Added a new YAML configuration file for running an offline throughput benchmark using the SGLang API.
  • src/inference_endpoint/dataset_manager/predefined/cnndailymail/presets.py
    • Imported the Harmonize transform.
    • Implemented a new llama3_8b_sglang preset function that formats prompts and tokenizes them using Harmonize in 'plain' mode for SGLang compatibility.
  • src/inference_endpoint/dataset_manager/transforms.py
    • Added a mode parameter to the Harmonize transform's constructor, defaulting to 'harmony'.
    • Implemented validation for the mode parameter to ensure it is either 'harmony' or 'plain'.
    • Modified the process_row method in Harmonize to conditionally tokenize based on the mode and to skip processing if input_tokens are already present in the row.
  • src/inference_endpoint/openai/types.py
    • Made the refusal field in ChatCompletionResponseMessage optional with a default value of None.
    • Made the usage and system_fingerprint fields in ChatCompletionResponse optional with a default value of None.
  • tests/unit/dataset_manager/test_dataset_presets.py
    • Added a new file containing unit tests for various dataset presets, including CNNDailyMail, AIME25, GPQA, LiveCodeBench, and OpenOrca.
    • Included specific tests for the llama3_8b_sglang preset to verify instantiation, transform application, and correct output of tokenized inputs.
  • tests/unit/dataset_manager/test_transforms.py
    • Updated the module docstring to reflect that Harmonize transform is now also tested.
    • Added unit tests for the Harmonize transform, covering invalid mode handling, skipping existing input_tokens, and behavior in 'plain' vs 'harmony' modes.
Activity
  • The MLCommons CLA bot confirmed that all contributors have signed the CLA.
  • Gemini Code Assist provided an initial summary of the changes.
  • attafosu requested reviews from @arekay-nv and @nvzhihanj.
  • arekay-nv requested changes, specifically asking for the llama3-8b example to be updated and for tests to be added.
  • attafosu requested a summary from Gemini Code Assist.
  • Copilot suggested that Harmonize should still compute harmonized_column even when input_tokens are pre-generated, or clarify this behavior in the docstring.
  • Copilot noted that new Harmonize behavior (mode parameter, validation, row-level guard) was not covered by unit tests and requested targeted tests.
  • Copilot pointed out an issue with the SGLang launch command in the README, specifically regarding bash -lc syntax and inline comments breaking line continuations.
  • Copilot reiterated the need for unit tests for Harmonize's new mode behavior and skip logic, suggesting mocking Harmonizer to avoid downloads.
  • Copilot again highlighted the incorrect docker exec command for SGLang, advising to quote the full command string passed to -lc.
  • Copilot identified an inline comment breaking a line continuation in the SGLang launch command example and suggested moving or reformatting it.
  • Copilot suggested tightening the SGLang preset test to assert the existence and content of input_tokens and the absence of harmonized_prompt.
  • Copilot recommended tightening an assertion in TestGPQAPresets to check for specific instruction text rather than just the letter 'A'.

@attafosu attafosu changed the title from "Feat/attafosu/sglang OpenAI api compatibility" to "feat: Enhance sglang adapter and OpenAI api compatibility" on Mar 9, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new dataset preset for SGLang compatibility, llama3_8b_sglang, which leverages an enhanced Harmonize transform. The Harmonize transform now supports a 'plain' mode for direct tokenization and includes a robust guard to prevent overwriting pre-tokenized input. Additionally, optional fields in OpenAI chat completion types (refusal, usage, system_fingerprint) have been updated to include default None values, improving constructor flexibility. The changes are well-implemented and align with the stated objective of improving OpenAI API compatibility and refactoring the Harmonizer.


Copilot AI left a comment


Pull request overview

Adds dataset preset and transform behavior needed to support SGLang usage alongside OpenAI-compatible flows, particularly for CNN/DailyMail, while improving OpenAI response type compatibility.

Changes:

  • Make certain OpenAI response fields optional by providing None defaults in msgspec structs.
  • Extend Harmonize transform with a mode option (harmony vs plain) and add a row-level guard to avoid overwriting pre-generated input_tokens when row processors are fused.
  • Add a CNN/DailyMail preset (llama3_8b_sglang) that formats a prompt and pre-tokenizes it for SGLang.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/inference_endpoint/openai/types.py Adds defaults for optional OpenAI response fields to improve (de)serialization compatibility.
src/inference_endpoint/dataset_manager/transforms.py Adds mode to Harmonize and prevents overwriting pre-generated input_tokens in fused row-processing.
src/inference_endpoint/dataset_manager/predefined/cnndailymail/presets.py Introduces llama3_8b_sglang preset to generate input_tokens directly (plain tokenization).


Comment on lines +185 to +186
if self.tokenized_column in row and row[self.tokenized_column] is not None:
    return row

Copilot AI Mar 9, 2026


Returning early when input_tokens is present skips populating harmonized_column even when it’s configured (non-None). If callers rely on the text harmonized prompt for debugging/logging, consider still computing harmonized_column from the existing tokens (without overwriting tokens), or update the docstring/behavior to make it explicit that the column may not be produced when tokens are pre-generated.

@attafosu
Collaborator Author

attafosu commented Mar 9, 2026

@arekay-nv @nvzhihanj please take a look when you get the chance

Copilot AI review requested due to automatic review settings March 11, 2026 00:29

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

src/inference_endpoint/dataset_manager/transforms.py:162

  • In mode="plain", Harmonize still instantiates Harmonizer, which unconditionally loads an OpenAI Harmony encoding (harmony.load_harmony_encoding(...)) and builds a Harmony system message. Since plain mode only needs tokenizer.encode(), consider avoiding/loading the Harmony encoding lazily (e.g., only create Harmonizer when mode=="harmony", and for plain mode just load/cache the HF tokenizer and encode directly) to prevent unnecessary startup cost and dependency on Harmony assets.
        self.mode = mode
        if self.mode not in {"harmony", "plain"}:
            raise ValueError(f"Invalid harmonize mode: {self.mode}")
        self.harmonizer = Harmonizer(
            tokenizer_name=tokenizer_name,
            encoding_name=encoding_name,
            reasoning_effort=reasoning_effort,
            conversation_start_date=conversation_start_date,
        )
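
The lazy-initialization idea Copilot raises here could look roughly like the following self-contained sketch. `HeavyHarmonizer` and `SimpleTokenizer` are illustrative stand-ins for the real `Harmonizer` and a HuggingFace tokenizer, not actual repo classes; the point is only that the expensive Harmony load happens on first use in "harmony" mode and never in "plain" mode.

```python
class SimpleTokenizer:
    def encode(self, text: str) -> list[int]:
        return [ord(c) for c in text]

class HeavyHarmonizer:
    def __init__(self):
        # Imagine the expensive harmony.load_harmony_encoding(...) here.
        self.loaded = True

class Harmonize:
    def __init__(self, mode: str = "harmony"):
        if mode not in {"harmony", "plain"}:
            raise ValueError(f"Invalid harmonize mode: {mode}")
        self.mode = mode
        self._harmonizer = None              # built lazily, only if needed
        self._tokenizer = SimpleTokenizer()  # cheap, always available

    @property
    def harmonizer(self):
        if self._harmonizer is None:
            self._harmonizer = HeavyHarmonizer()
        return self._harmonizer

    def tokenize(self, text: str) -> list[int]:
        if self.mode == "plain":
            return self._tokenizer.encode(text)  # no Harmony assets loaded
        _ = self.harmonizer  # harmony path: load encoding on first use
        return self._tokenizer.encode(text)  # stand-in for harmony rendering

plain = Harmonize(mode="plain")
assert plain.tokenize("ab") == [97, 98]
assert plain._harmonizer is None  # Harmony machinery never constructed
```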


@arekay-nv
Collaborator

@attafosu does this need the llama3-8b example to change? Can you also push those changes so the PR can be functionally verified. Also, would be nice to add tests for this.

Copilot AI review requested due to automatic review settings March 16, 2026 21:31

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.



Comment on lines 136 to +156
@@ -145,10 +146,14 @@ def __init__(
tokenized_column: The name of the column containing the tokenized prompt.
harmonized_column: The name of the column containing the harmonized prompt. If None,
the harmonized prompt will not be stored as text.
mode: "harmony" to render a Harmony conversation; "plain" to tokenize the raw prompt.
"""
        self.prompt_column = prompt_column
        self.tokenized_column = tokenized_column
        self.harmonized_column = harmonized_column
        self.mode = mode
        if self.mode not in {"harmony", "plain"}:
            raise ValueError(f"Invalid harmonize mode: {self.mode}")
Comment on lines +56 to +60
# Start sglang endpoint
# Note: --served-model-name is needed if `model-path` differs from `model` in the client config
docker exec -u root -w /workspace sglang-cpu-server /bin/bash -lc "python3 -m sglang.launch_server \
  --model-path $MODEL_NAME \
  --served-model-name meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16"
viraatc and others added 10 commits March 16, 2026 15:45
* Handle case with string response

Handles the case where the response is a single string, not a list - needed to handle AMD submission which wasn't calculating TPOT without the fix.
---------

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
* optimize msgspec implementation

* precommit

* drop old perf test

* add sglang tests

* updates
address pending comment in #162

Refactored Request Building: Refactored the build_request method to leverage these new pre-calculated prefixes, simplifying the logic and reducing string concatenations for common request paths.

Prefix Rebuilding Logic: Introduced a new private method _rebuild_prefixes to manage the construction and update of these prefixes, ensuring they are correctly updated when cached headers change.
* docs: add AGENTS.md with AI coding guidelines, restructure CLAUDE.md

Move repo guidelines from CLAUDE.md into AGENTS.md so they are
tool-agnostic and usable by any AI coding agent. CLAUDE.md now
contains only an @AGENTS.md include directive.

AGENTS.md covers architecture, code organization, development
standards, and adds two new sections: a policy requiring AGENTS.md
updates alongside significant refactors, and a catalog of common
AI coding pitfalls specific to this codebase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: fix AGENTS.md formatting for prettier compliance

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
* Add preset dataset unit tests and documentation

- Add test_dataset_presets.py with 20 test cases for 6 presets across 5 datasets
- Add comprehensive testing guide and schema reference documentation

Tests verify that transforms work correctly without end-to-end runs,
enabling fast regression detection when transform code changes.

Signed-off-by: attafosu <thomas.atta-fosu@intel.com>

* Cleanup local directory

Signed-off-by: attafosu <thomas.atta-fosu@intel.com>

* Sanitize documentation

Signed-off-by: attafosu <thomas.atta-fosu@intel.com>

* Cleanup

Signed-off-by: attafosu <thomas.atta-fosu@intel.com>

* Decorate slow tests

Signed-off-by: attafosu <thomas.atta-fosu@intel.com>

* Update DATASET_SCHEMA_REFERENCE.md

* Cleanup

Signed-off-by: attafosu <thomas.atta-fosu@intel.com>

* Remove redundant dataset schema

Signed-off-by: attafosu <thomas.atta-fosu@intel.com>

* Add fixtures to simplify unit tests

Signed-off-by: attafosu <thomas.atta-fosu@intel.com>

---------

Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
Copilot AI review requested due to automatic review settings March 20, 2026 06:10

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.



Comment on lines +100 to +101
assert llama3_8b_sglang_transformed is not None
assert "prompt" in llama3_8b_sglang_transformed.columns

Copilot AI Mar 20, 2026


The SGLang preset test only asserts that the prompt column exists, but this preset’s key output is input_tokens (and harmonized_column=None). To prevent regressions in the new adapter/preset interaction, assert that input_tokens exists and contains a non-empty list of ints (and optionally that harmonized_prompt is absent).

Suggested change
assert llama3_8b_sglang_transformed is not None
assert "prompt" in llama3_8b_sglang_transformed.columns
assert llama3_8b_sglang_transformed is not None
# SGLang preset should still provide a prompt column
assert "prompt" in llama3_8b_sglang_transformed.columns
# Key output for SGLang preset is tokenized input
assert "input_tokens" in llama3_8b_sglang_transformed.columns
input_tokens = llama3_8b_sglang_transformed["input_tokens"].iloc[0]
assert isinstance(input_tokens, list)
assert len(input_tokens) > 0
assert all(isinstance(token, int) for token in input_tokens)
# harmonized_column is expected to be None for this preset
assert "harmonized_prompt" not in llama3_8b_sglang_transformed.columns

assert "(C)" in prompt
assert "(D)" in prompt
# Should instruct to express answer as option letter
assert "A" in prompt or "option" in prompt.lower()

Copilot AI Mar 20, 2026


This assertion is effectively always true because the prompt always contains "(A)" in the choices, so the test can pass even if the instruction to answer with an option letter is removed. Tighten the check to look for the actual instruction text (e.g., "Express your final answer" / "corresponding option") rather than the letter "A".

Suggested change
assert "A" in prompt or "option" in prompt.lower()
lower_prompt = prompt.lower()
assert "express your final answer" in lower_prompt or "corresponding option" in lower_prompt

Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
@attafosu
Collaborator Author

/gemini summary

@gemini-code-assist

gemini-code-assist bot reposted its Summary of Changes in response to the /gemini summary command; the content is identical to the earlier summary above.


Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.



Comment on lines +154 to 159
        self.mode = mode
        if self.mode not in {"harmony", "plain"}:
            raise ValueError(f"Invalid harmonize mode: {self.mode}")
        self.harmonizer = Harmonizer(
            tokenizer_name=tokenizer_name,
            encoding_name=encoding_name,

Copilot AI Mar 20, 2026


In mode="plain", process_row() only needs a HuggingFace tokenizer, but __init__ still constructs a full Harmonizer, which eagerly loads the Harmony encoding and builds the system message. This adds unnecessary overhead/dependencies for the plain-tokenization path. Consider making Harmonizer lazily load the encoding/system message only when mode=="harmony", or use AutoTokenizer directly in plain mode.

Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
@attafosu
Collaborator Author

@attafosu does this need the llama3-8b example to change? Can you also push those changes so the PR can be functionally verified. Also, would be nice to add tests for this.

@arekay-nv Added an example for this change in the llama3-8b README. Also added unit tests that cover the preset datasets.

Comment on lines +69 to +76
        Harmonize(
            tokenizer_name=tokenizer_name,
            prompt_column="prompt",
            tokenized_column="input_tokens",
            harmonized_column=None,
            mode="plain",
        ),
    ]
Collaborator


Can you clarify why we need Harmonize with plain mode here for a Llama model? Harmonization only works with the gpt-oss models as far as I know, so using a Harmonize transform here is a bit confusing.

Collaborator Author


So here we just want to use the Harmonizer to generate the tokenized inputs (input_tokens, which the sglang api needs). The plain mode is introduced to ensure that no chat templates or other processing are applied to the input prompt (as would otherwise happen in "harmony" mode: src/inference_endpoint/dataset_manager/transforms.py::process_row() --> src/inference_endpoint/openai/harmony.py::harmony()).

I could also add a new transform, say a Tokenizer transform, to do just that (generate tokenized inputs), but I wanted to refactor existing implementations wherever possible. If that sounds more straightforward, I can leave the Harmonizer as is and add a tokenizing transform instead.
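
The dedicated tokenizing transform floated above might be as small as this hypothetical sketch; the `Tokenize` name and the `encode` callable (standing in for a HuggingFace tokenizer's `encode`) are assumptions, not existing repo code.

```python
class Tokenize:
    """Write encode(prompt) into tokenized_column; no chat template, nothing else."""

    def __init__(self, encode, prompt_column: str = "prompt",
                 tokenized_column: str = "input_tokens"):
        self.encode = encode
        self.prompt_column = prompt_column
        self.tokenized_column = tokenized_column

    def process_row(self, row: dict) -> dict:
        row[self.tokenized_column] = self.encode(row[self.prompt_column])
        return row

# Usage with a toy encoder; a real pipeline would pass tokenizer.encode.
t = Tokenize(encode=lambda s: [ord(c) for c in s])
row = t.process_row({"prompt": "ok"})
assert row["input_tokens"] == [111, 107]
```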

Copilot AI review requested due to automatic review settings March 23, 2026 23:41

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.



Comment on lines +14 to +15
# Exclude slow tests (Harmonize transform requires transformers)
pytest tests/unit/dataset_manager/test_dataset_presets.py -m "not slow" -v

Copilot AI Mar 23, 2026


The note says slow tests are excluded because Harmonize “requires transformers”, but transformers is already a core dependency in this repo; the main reason to mark these slow is usually that they can trigger tokenizer/model downloads and be network-dependent. Consider rewording to reflect that.


| Dataset | Presets | Tests |
| ------------- | ------------------------------- | ----- |
| CNNDailyMail | `llama3_8b`, `llama3_8b_sglang` | 6 |

Copilot AI Mar 23, 2026


The CNNDailyMail row lists 6 tests, but tests/unit/dataset_manager/test_dataset_presets.py currently defines 5 tests for CNNDailyMail (3 for llama3_8b and 2 for llama3_8b_sglang). Please update the count to match the actual test file.

Suggested change
| CNNDailyMail | `llama3_8b`, `llama3_8b_sglang` | 6 |
| CNNDailyMail | `llama3_8b`, `llama3_8b_sglang` | 5 |


Copilot AI commented Mar 24, 2026

@attafosu I've opened a new pull request, #198, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI review requested due to automatic review settings March 24, 2026 22:38

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

src/inference_endpoint/dataset_manager/transforms.py:199

  • When mode="plain", harmonized_column (defaulting to harmonized_prompt) is still populated via to_text() from the raw tokens, which is not actually a Harmony-formatted prompt. To avoid semantic confusion, consider either: (a) only writing harmonized_column when mode == "harmony", or (b) renaming/documenting the column semantics for plain mode (and/or default harmonized_column=None when mode="plain").
        if self.mode == "plain":
            tokens = self.harmonizer.to_tokens(row[self.prompt_column])
            row[self.tokenized_column] = tokens
        else:
            row[self.tokenized_column] = self.harmonizer(row[self.prompt_column])
        if self.harmonized_column is not None:
            row[self.harmonized_column] = self.harmonizer.to_text(
                row[self.tokenized_column]
            )
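Option (a) from the comment above can be sketched with a toy stand-in for the harmonizer (the `FakeHarmonizer` class, its byte-level tokenization, and the free-standing `process_row` signature are all hypothetical; only the mode/column semantics come from the transform under review):

```python
# Hypothetical sketch of option (a): only emit the harmonized column when the
# transform actually ran in harmony mode, so "plain" rows never carry a
# misleading harmonized_prompt value.
class FakeHarmonizer:
    def to_tokens(self, text):
        return list(text.encode())        # toy tokenization: raw bytes

    def to_text(self, tokens):
        return bytes(tokens).decode()     # toy detokenization

    def __call__(self, text):
        return self.to_tokens("<harmony>" + text)  # pretend Harmony render


def process_row(row, mode, harmonizer, prompt_column="prompt",
                tokenized_column="input_tokens",
                harmonized_column="harmonized_prompt"):
    if mode == "plain":
        row[tokenized_column] = harmonizer.to_tokens(row[prompt_column])
    else:
        row[tokenized_column] = harmonizer(row[prompt_column])
    # Option (a): skip the harmonized text entirely for plain mode.
    if harmonized_column is not None and mode == "harmony":
        row[harmonized_column] = harmonizer.to_text(row[tokenized_column])
    return row


plain = process_row({"prompt": "hi"}, "plain", FakeHarmonizer())
harmony = process_row({"prompt": "hi"}, "harmony", FakeHarmonizer())
```

With this gating, `plain` carries only `input_tokens`, while `harmony` additionally gets the round-tripped `harmonized_prompt`.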


Comment on lines +183 to 196
        # Guard pre-tokenized rows: the SGLang adapter adds a default Harmonize
        # (GPT-OSS tokenizer + harmony mode). When row processors are fused, the
        # dataframe-level skip is bypassed, so without this guard, adapter
        # Harmonize would overwrite input tokens. Alternative: remove Harmonize
        # from the adapter transforms and require each SGLang preset to add its
        # own Harmonize with the desired tokenizer/args.
        if self.tokenized_column in row and row[self.tokenized_column] is not None:
            return row
        if self.mode == "plain":
            tokens = self.harmonizer.to_tokens(row[self.prompt_column])
            row[self.tokenized_column] = tokens
        else:
            row[self.tokenized_column] = self.harmonizer(row[self.prompt_column])
        if self.harmonized_column is not None:

Copilot AI Mar 24, 2026


In process_row, the early return when input_tokens is already present also skips populating harmonized_column (when configured). This makes it impossible to keep preset-provided tokens while still emitting harmonized_prompt text. Consider skipping only the tokenization step (avoid overwriting input_tokens), but still fill harmonized_column if it’s set and missing (or validate it matches the existing tokens).
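That suggestion can be sketched as follows (a minimal toy, not the repo's implementation: the `to_text`/`tokenize` callables and the flat function signature are illustrative; only the column semantics mirror the transform):

```python
def process_row(row, to_text, tokenized_column="input_tokens",
                harmonized_column="harmonized_prompt", tokenize=None):
    # Preserve preset-provided tokens instead of returning early...
    pretokenized = row.get(tokenized_column) is not None
    if not pretokenized:
        row[tokenized_column] = tokenize(row["prompt"])
    # ...but still derive the harmonized text when requested and missing.
    if harmonized_column is not None and harmonized_column not in row:
        row[harmonized_column] = to_text(row[tokenized_column])
    return row


# A row whose tokens were pregenerated by a dataset preset: tokens survive
# untouched, yet harmonized_prompt is still populated from them.
row = {"prompt": "ignored", "input_tokens": [104, 105]}
out = process_row(row, to_text=lambda t: bytes(t).decode())
```

Here the pre-tokenized `input_tokens` are left as-is while `harmonized_prompt` is still filled, which is the behavior the comment asks for.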

Comment on lines +20 to +26
| Dataset | Presets | Tests |
| ------------- | ------------------------------- | ----- |
| CNNDailyMail | `llama3_8b`, `llama3_8b_sglang` | 6 |
| AIME25 | `gptoss` | 3 |
| GPQA | `gptoss` | 3 |
| LiveCodeBench | `gptoss` | 3 |
| OpenOrca | `llama2_70b` | 3 |

Copilot AI Mar 24, 2026


This table’s CNNDailyMail test count appears incorrect. tests/unit/dataset_manager/test_dataset_presets.py currently defines 5 tests under TestCNNDailyMailPresets (3 regular + 2 @pytest.mark.slow), not 6. Please update the count (or remove the numeric column) so the doc stays accurate.

Comment on lines +14 to +16
# Exclude slow tests (Harmonize transform requires transformers)
pytest tests/unit/dataset_manager/test_dataset_presets.py -m "not slow" -v
```

Copilot AI Mar 24, 2026


The note “Exclude slow tests (Harmonize transform requires transformers)” is a bit misleading since transformers is already a core dependency here; the main reason these tests are slow is typically tokenizer/model downloads and external network access. Consider rewording to reflect that the slow marker is about heavyweight downloads / network dependency.
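For context, the `slow` marker the `-m "not slow"` filter relies on looks like this in a test module (a sketch: the test name and body are hypothetical, only the marker name comes from the command above):

```python
import pytest


@pytest.mark.slow  # deselected by: pytest -m "not slow"
def test_harmonize_roundtrip():
    """Marked slow because it may trigger tokenizer/model downloads and
    depends on network access, not because of heavyweight imports."""
    pass
```

Registering the marker (e.g. under `markers` in `pytest.ini` or `pyproject.toml`) keeps pytest from warning about an unknown mark.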

Comment on lines +79 to +88
        assert len(llama3_8b_transformed["prompt"][0]) > 0

    def test_llama3_8b_prompt_format(self, llama3_8b_transformed, sample_cnn_data):
        """Test that llama3_8b produces properly formatted prompts."""
        prompt = llama3_8b_transformed["prompt"][0]
        assert "Summarize" in prompt
        assert "news article" in prompt
        assert "article" in sample_cnn_data.columns
        # The original article should be embedded in the prompt
        assert sample_cnn_data["article"][0] in prompt

Copilot AI Mar 24, 2026


These tests access DataFrame cells via df["col"][0], which is label-based and can break if the index isn’t 0 (or if transforms preserve a non-default index). Prefer .iloc[0] for positional access (and apply consistently throughout this test file).

Suggested change
        assert len(llama3_8b_transformed["prompt"][0]) > 0

    def test_llama3_8b_prompt_format(self, llama3_8b_transformed, sample_cnn_data):
        """Test that llama3_8b produces properly formatted prompts."""
        prompt = llama3_8b_transformed["prompt"][0]
        assert "Summarize" in prompt
        assert "news article" in prompt
        assert "article" in sample_cnn_data.columns
        # The original article should be embedded in the prompt
        assert sample_cnn_data["article"][0] in prompt

        assert len(llama3_8b_transformed["prompt"].iloc[0]) > 0

    def test_llama3_8b_prompt_format(self, llama3_8b_transformed, sample_cnn_data):
        """Test that llama3_8b produces properly formatted prompts."""
        prompt = llama3_8b_transformed["prompt"].iloc[0]
        assert "Summarize" in prompt
        assert "news article" in prompt
        assert "article" in sample_cnn_data.columns
        # The original article should be embedded in the prompt
        assert sample_cnn_data["article"].iloc[0] in prompt
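The failure mode behind this suggestion is easy to reproduce (a self-contained demo; the `prompt` column name matches the tests, the index values are illustrative):

```python
import pandas as pd

# df["col"][0] is label-based: with a non-default index there is no row
# labeled 0, so the lookup raises KeyError while .iloc[0] still works.
df = pd.DataFrame({"prompt": ["first", "second"]}, index=[7, 8])

try:
    df["prompt"][0]                 # no label 0 in the index -> KeyError
    label_lookup_worked = True
except KeyError:
    label_lookup_worked = False

positional = df["prompt"].iloc[0]   # positional: always the first row
```

With the default `RangeIndex` both spellings happen to agree, which is why such tests pass until a transform reindexes or filters the frame.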

        assert "(C)" in prompt
        assert "(D)" in prompt
        # Should instruct to express answer as option letter
        assert "A" in prompt or "option" in prompt.lower()

Copilot AI Mar 24, 2026


This assertion is redundant: earlier in the same test you already assert "(A)" in prompt, which implies "A" in prompt will always be true. Consider replacing this with a more specific check for the instruction text (e.g., that the prompt asks for an option letter), or drop it to avoid a non-signal assertion.

Suggested change
        assert "A" in prompt or "option" in prompt.lower()
        assert "option" in prompt.lower()

@attafosu attafosu requested a review from arekay-nv March 27, 2026 18:10