feat: Enhance sglang adapter and OpenAI API compatibility (#158)
Conversation
Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✅
Summary of Changes: This pull request enhances the data processing pipeline by integrating SGLang compatibility through a new dataset preset and refining the `Harmonize` transform.
Activity
Code Review
This pull request introduces a new dataset preset for SGLang compatibility, llama3_8b_sglang, which leverages an enhanced Harmonize transform. The Harmonize transform now supports a 'plain' mode for direct tokenization and includes a robust guard to prevent overwriting pre-tokenized input. Additionally, optional fields in OpenAI chat completion types (refusal, usage, system_fingerprint) have been updated to include default None values, improving constructor flexibility. The changes are well-implemented and align with the stated objective of improving OpenAI API compatibility and refactoring the Harmonizer.
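The mode dispatch described above can be sketched as a self-contained row function. This is a hedged illustration, not the repo's actual code: `tokenize_row`, `encode_plain`, and `render_harmony` are hypothetical names standing in for the real tokenizer and Harmony renderer.

```python
def tokenize_row(row, mode, encode_plain, render_harmony,
                 prompt_column="prompt", tokenized_column="input_tokens"):
    """Sketch of the Harmonize mode switch: "plain" tokenizes the raw prompt
    directly; "harmony" renders a Harmony conversation first."""
    if mode not in {"harmony", "plain"}:
        raise ValueError(f"Invalid harmonize mode: {mode}")
    text = row[prompt_column]
    row[tokenized_column] = (
        encode_plain(text) if mode == "plain" else render_harmony(text)
    )
    return row
```

The real transform stores the result under `input_tokens`, which is the column the SGLang preset pre-populates.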
Pull request overview
Adds dataset preset and transform behavior needed to support SGLang usage alongside OpenAI-compatible flows, particularly for CNN/DailyMail, while improving OpenAI response type compatibility.
Changes:
- Make certain OpenAI response fields optional by providing `None` defaults in msgspec structs.
- Extend the `Harmonize` transform with a `mode` option (`harmony` vs `plain`) and add a row-level guard to avoid overwriting pre-generated `input_tokens` when row processors are fused.
- Add a CNN/DailyMail preset (`llama3_8b_sglang`) that formats a prompt and pre-tokenizes it for SGLang.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/inference_endpoint/openai/types.py` | Adds defaults for optional OpenAI response fields to improve (de)serialization compatibility. |
| `src/inference_endpoint/dataset_manager/transforms.py` | Adds `mode` to `Harmonize` and prevents overwriting pre-generated `input_tokens` in fused row-processing. |
| `src/inference_endpoint/dataset_manager/predefined/cnndailymail/presets.py` | Introduces the `llama3_8b_sglang` preset to generate `input_tokens` directly (plain tokenization). |
if self.tokenized_column in row and row[self.tokenized_column] is not None:
    return row
Returning early when input_tokens is present skips populating harmonized_column even when it’s configured (non-None). If callers rely on the text harmonized prompt for debugging/logging, consider still computing harmonized_column from the existing tokens (without overwriting tokens), or update the docstring/behavior to make it explicit that the column may not be produced when tokens are pre-generated.
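One way to realize this suggestion, sketched as a standalone function with illustrative names; `to_text` stands in for the transform's detokenizer, and the normal tokenization path is elided:

```python
def guard_pretokenized(row, to_text, tokenized_column="input_tokens",
                       harmonized_column="harmonized_prompt"):
    """If tokens are pre-generated, keep them untouched but still derive the
    text column from them when it is configured and missing."""
    if row.get(tokenized_column) is not None:
        if harmonized_column is not None and row.get(harmonized_column) is None:
            row[harmonized_column] = to_text(row[tokenized_column])
        return row
    # Normal tokenization path would run here (elided in this sketch).
    raise NotImplementedError
```

This keeps the don't-overwrite guarantee while still emitting the text column for debugging/logging.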
@arekay-nv @nvzhihanj please take a look when you get the chance.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
src/inference_endpoint/dataset_manager/transforms.py:162
- In `mode="plain"`, `Harmonize` still instantiates `Harmonizer`, which unconditionally loads an OpenAI Harmony encoding (`harmony.load_harmony_encoding(...)`) and builds a Harmony system message. Since plain mode only needs `tokenizer.encode()`, consider loading the Harmony encoding lazily (e.g., only create `Harmonizer` when `mode == "harmony"`; for plain mode, just load/cache the HF tokenizer and encode directly) to avoid unnecessary startup cost and a dependency on Harmony assets.
self.mode = mode
if self.mode not in {"harmony", "plain"}:
raise ValueError(f"Invalid harmonize mode: {self.mode}")
self.harmonizer = Harmonizer(
tokenizer_name=tokenizer_name,
encoding_name=encoding_name,
reasoning_effort=reasoning_effort,
conversation_start_date=conversation_start_date,
)
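A minimal sketch of the lazy approach suggested above. The names are hypothetical, and the factory is injected purely to keep the sketch self-contained; the real code would construct `Harmonizer` or an HF `AutoTokenizer` inside the property.

```python
class LazyHarmonize:
    """Defer Harmony setup until first use, so plain mode never pays for it."""

    def __init__(self, tokenizer_name, mode="harmony", harmonizer_factory=None):
        if mode not in {"harmony", "plain"}:
            raise ValueError(f"Invalid harmonize mode: {mode}")
        self.tokenizer_name = tokenizer_name
        self.mode = mode
        self._factory = harmonizer_factory
        self._harmonizer = None

    @property
    def harmonizer(self):
        # Built on first access only; "plain" rows never trigger this.
        if self._harmonizer is None:
            self._harmonizer = self._factory(self.tokenizer_name)
        return self._harmonizer
```

With this shape, constructing the transform in plain mode never touches Harmony assets, which addresses the startup-cost concern.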
@attafosu does this need the llama3-8b example to change? Can you also push those changes so the PR can be functionally verified? It would also be nice to add tests for this.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
@@ -145,10 +146,14 @@ def __init__(
        tokenized_column: The name of the column containing the tokenized prompt.
        harmonized_column: The name of the column containing the harmonized prompt. If None,
            the harmonized prompt will not be stored as text.
        mode: "harmony" to render a Harmony conversation; "plain" to tokenize the raw prompt.
        """
        self.prompt_column = prompt_column
        self.tokenized_column = tokenized_column
        self.harmonized_column = harmonized_column
        self.mode = mode
        if self.mode not in {"harmony", "plain"}:
            raise ValueError(f"Invalid harmonize mode: {self.mode}")
# Start sglang endpoint
# --served-model-name is needed if `model-path` here differs from `model` in the client config
docker exec -u root -w /workspace sglang-cpu-server /bin/bash -lc "python3 -m sglang.launch_server \
  --model-path $MODEL_NAME \
  --served-model-name meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16"
* Handle case with string response
  Handles the case where the response is a single string, not a list; needed for the AMD submission, which wasn't calculating TPOT without the fix.
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
…159)
* drop orjson
* update
* trigger CI
* updates
* ci fix
* optimize msgspec implementation
* precommit
* drop old perf test
* add sglang tests
* updates
Address pending comment in #162.
* Refactored Request Building: refactored the build_request method to leverage these new pre-calculated prefixes, simplifying the logic and reducing string concatenations for common request paths.
* Prefix Rebuilding Logic: introduced a new private method _rebuild_prefixes to manage the construction and update of these prefixes, ensuring they are correctly updated when cached headers change.
* docs: add AGENTS.md with AI coding guidelines, restructure CLAUDE.md
  Move repo guidelines from CLAUDE.md into AGENTS.md so they are tool-agnostic and usable by any AI coding agent. CLAUDE.md now contains only an @AGENTS.md include directive. AGENTS.md covers architecture, code organization, development standards, and adds two new sections: a policy requiring AGENTS.md updates alongside significant refactors, and a catalog of common AI coding pitfalls specific to this codebase.
* docs: fix AGENTS.md formatting for prettier compliance
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
* Add preset dataset unit tests and documentation
  - Add test_dataset_presets.py with 20 test cases for 6 presets across 5 datasets
  - Add comprehensive testing guide and schema reference documentation
  Tests verify that transforms work correctly without end-to-end runs, enabling fast regression detection when transform code changes.
* Cleanup local directory
* Sanitize documentation
* Cleanup
* Decorate slow tests
* Update DATASET_SCHEMA_REFERENCE.md
* Cleanup
* Remove redundant dataset schema
* Add fixtures to simplify unit tests
Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
assert llama3_8b_sglang_transformed is not None
assert "prompt" in llama3_8b_sglang_transformed.columns

The SGLang preset test only asserts that the prompt column exists, but this preset's key output is input_tokens (and harmonized_column=None). To prevent regressions in the new adapter/preset interaction, assert that input_tokens exists and contains a non-empty list of ints (and optionally that harmonized_prompt is absent).

Suggested change:
assert llama3_8b_sglang_transformed is not None
# SGLang preset should still provide a prompt column
assert "prompt" in llama3_8b_sglang_transformed.columns
# Key output for SGLang preset is tokenized input
assert "input_tokens" in llama3_8b_sglang_transformed.columns
input_tokens = llama3_8b_sglang_transformed["input_tokens"].iloc[0]
assert isinstance(input_tokens, list)
assert len(input_tokens) > 0
assert all(isinstance(token, int) for token in input_tokens)
# harmonized_column is expected to be None for this preset
assert "harmonized_prompt" not in llama3_8b_sglang_transformed.columns
assert "(C)" in prompt
assert "(D)" in prompt
# Should instruct to express answer as option letter
assert "A" in prompt or "option" in prompt.lower()

This assertion is effectively always true because the prompt always contains "(A)" in the choices, so the test can pass even if the instruction to answer with an option letter is removed. Tighten the check to look for the actual instruction text (e.g., "Express your final answer" / "corresponding option") rather than the letter "A".

Suggested change:
lower_prompt = prompt.lower()
assert "express your final answer" in lower_prompt or "corresponding option" in lower_prompt
Signed-off-by: attafosu <thomas.atta-fosu@intel.com>
/gemini summary
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
self.mode = mode
if self.mode not in {"harmony", "plain"}:
    raise ValueError(f"Invalid harmonize mode: {self.mode}")
self.harmonizer = Harmonizer(
    tokenizer_name=tokenizer_name,
    encoding_name=encoding_name,

In mode="plain", process_row() only needs a HuggingFace tokenizer, but __init__ still constructs a full Harmonizer, which eagerly loads the Harmony encoding and builds the system message. This adds unnecessary overhead and dependencies for the plain-tokenization path. Consider making Harmonizer lazily load the encoding/system message only when mode == "harmony", or use AutoTokenizer directly in plain mode.
@arekay-nv Added an example for this change in the llama3-8b README. Also added unit tests that cover the preset datasets.
Harmonize(
    tokenizer_name=tokenizer_name,
    prompt_column="prompt",
    tokenized_column="input_tokens",
    harmonized_column=None,
    mode="plain",
),

Can you clarify why we need Harmonize with plain mode here for a Llama model? Harmonization only works with the gpt-oss models as far as I know, so using a Harmonize transform here is a bit confusing.
So here we just want to use the Harmonizer to generate the tokenized inputs (the input_tokens needed by the SGLang API). The plain mode is introduced to ensure no chat templates or other processing are applied to the input prompt, as would otherwise happen in "harmony" mode (src/inference_endpoint/dataset_manager/transforms.py::process_row() -> src/inference_endpoint/openai/harmony.py::harmony()).
I could also add a new transform, say a Tokenizer transform, to do just that (generating tokenized inputs), but I wanted to refactor existing implementations wherever possible. If that sounds more straightforward, I can leave the Harmonizer as is and add a tokenizing transform instead.
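The alternative floated here, a dedicated tokenizing transform, might look like the following. This is a sketch under assumed names: `encode` is an injected callable, whereas in the real repo it would be an HF tokenizer's encode method.

```python
class Tokenize:
    """Row transform that only generates tokenized inputs, with the same
    don't-overwrite guard the Harmonize transform uses."""

    def __init__(self, encode, prompt_column="prompt",
                 tokenized_column="input_tokens"):
        self.encode = encode
        self.prompt_column = prompt_column
        self.tokenized_column = tokenized_column

    def process_row(self, row):
        # Respect pre-generated tokens, mirroring the Harmonize guard.
        if row.get(self.tokenized_column) is None:
            row[self.tokenized_column] = self.encode(row[self.prompt_column])
        return row
```

A preset would then use `Tokenize` instead of `Harmonize(mode="plain")`, which avoids the naming confusion raised above for non-gpt-oss models.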
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
# Exclude slow tests (Harmonize transform requires transformers)
pytest tests/unit/dataset_manager/test_dataset_presets.py -m "not slow" -v

The note says slow tests are excluded because Harmonize "requires transformers", but transformers is already a core dependency in this repo; the main reason to mark these slow is that they can trigger tokenizer/model downloads and are network-dependent. Consider rewording to reflect that.
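If the `slow` marker is not already registered, a conftest sketch like this documents the actual rationale (possible downloads / network access); the wording is hypothetical, not taken from the repo:

```python
# conftest.py (sketch): register the `slow` marker so `-m "not slow"` is
# well-defined and its rationale is documented where pytest reports it.
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "slow: may download tokenizer assets or require network access",
    )
```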
| Dataset | Presets | Tests |
| ------------- | ------------------------------- | ----- |
| CNNDailyMail | `llama3_8b`, `llama3_8b_sglang` | 6 |

The CNNDailyMail row lists 6 tests, but tests/unit/dataset_manager/test_dataset_presets.py currently defines 5 tests for CNNDailyMail (3 for llama3_8b and 2 for llama3_8b_sglang). Please update the count to match the actual test file.

Suggested change:
| CNNDailyMail | `llama3_8b`, `llama3_8b_sglang` | 5 |
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
src/inference_endpoint/dataset_manager/transforms.py:199
- When `mode="plain"`, `harmonized_column` (defaulting to `harmonized_prompt`) is still populated via `to_text()` from the raw tokens, which is not actually a Harmony-formatted prompt. To avoid semantic confusion, consider either (a) only writing `harmonized_column` when `mode == "harmony"`, or (b) documenting the column's semantics for plain mode (and/or defaulting `harmonized_column=None` when `mode="plain"`).
if self.mode == "plain":
tokens = self.harmonizer.to_tokens(row[self.prompt_column])
row[self.tokenized_column] = tokens
else:
row[self.tokenized_column] = self.harmonizer(row[self.prompt_column])
if self.harmonized_column is not None:
row[self.harmonized_column] = self.harmonizer.to_text(
row[self.tokenized_column]
)
# Guard pre-tokenized rows: the SGLang adapter adds a default Harmonize
# (GPT-OSS tokenizer + harmony mode). When row processors are fused, the
# dataframe-level skip is bypassed, so without this guard, the adapter's
# Harmonize would overwrite input tokens. Alternative: remove Harmonize
# from the adapter transforms and require each SGLang preset to add its
# own Harmonize with the desired tokenizer/args.
if self.tokenized_column in row and row[self.tokenized_column] is not None:
    return row
if self.mode == "plain":
    tokens = self.harmonizer.to_tokens(row[self.prompt_column])
    row[self.tokenized_column] = tokens
else:
    row[self.tokenized_column] = self.harmonizer(row[self.prompt_column])
if self.harmonized_column is not None:
In process_row, the early return when input_tokens is already present also skips populating harmonized_column (when configured). This makes it impossible to keep preset-provided tokens while still emitting harmonized_prompt text. Consider skipping only the tokenization step (avoid overwriting input_tokens), but still fill harmonized_column if it’s set and missing (or validate it matches the existing tokens).
| Dataset | Presets | Tests |
| ------------- | ------------------------------- | ----- |
| CNNDailyMail | `llama3_8b`, `llama3_8b_sglang` | 6 |
| AIME25 | `gptoss` | 3 |
| GPQA | `gptoss` | 3 |
| LiveCodeBench | `gptoss` | 3 |
| OpenOrca | `llama2_70b` | 3 |
This table’s CNNDailyMail test count appears incorrect. tests/unit/dataset_manager/test_dataset_presets.py currently defines 5 tests under TestCNNDailyMailPresets (3 regular + 2 @pytest.mark.slow), not 6. Please update the count (or remove the numeric column) so the doc stays accurate.
# Exclude slow tests (Harmonize transform requires transformers)
pytest tests/unit/dataset_manager/test_dataset_presets.py -m "not slow" -v
The note “Exclude slow tests (Harmonize transform requires transformers)” is a bit misleading since transformers is already a core dependency here; the main reason these tests are slow is typically tokenizer/model downloads and external network access. Consider rewording to reflect that the slow marker is about heavyweight downloads / network dependency.
assert len(llama3_8b_transformed["prompt"][0]) > 0

def test_llama3_8b_prompt_format(self, llama3_8b_transformed, sample_cnn_data):
    """Test that llama3_8b produces properly formatted prompts."""
    prompt = llama3_8b_transformed["prompt"][0]
    assert "Summarize" in prompt
    assert "news article" in prompt
    assert "article" in sample_cnn_data.columns
    # The original article should be embedded in the prompt
    assert sample_cnn_data["article"][0] in prompt

These tests access DataFrame cells via df["col"][0], which is label-based and can break if the index isn't 0 (or if transforms preserve a non-default index). Prefer .iloc[0] for positional access (and apply it consistently throughout this test file).

Suggested change:
assert len(llama3_8b_transformed["prompt"].iloc[0]) > 0

def test_llama3_8b_prompt_format(self, llama3_8b_transformed, sample_cnn_data):
    """Test that llama3_8b produces properly formatted prompts."""
    prompt = llama3_8b_transformed["prompt"].iloc[0]
    assert "Summarize" in prompt
    assert "news article" in prompt
    assert "article" in sample_cnn_data.columns
    # The original article should be embedded in the prompt
    assert sample_cnn_data["article"].iloc[0] in prompt
assert "(C)" in prompt
assert "(D)" in prompt
# Should instruct to express answer as option letter
assert "A" in prompt or "option" in prompt.lower()

This assertion is redundant: earlier in the same test you already assert "(A)" in prompt, which implies "A" in prompt will always be true. Consider replacing this with a more specific check for the instruction text (e.g., that the prompt asks for an option letter), or drop it to avoid a non-signal assertion.

Suggested change:
assert "option" in prompt.lower()
What does this PR do?
- `Harmonizer` transform

Type of change

Related issues

Testing

Checklist