Reenable and improve preprocess dataset #472
base: main
Conversation
Also includes some new behavior
Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <[email protected]>
sjmonson left a comment
I think the reuse of SyntheticTextDatasetConfig is not a great idea, because prefix_buckets are defined differently here. For synthetic data, a prefix bucket of prefix_tokens=10,prefix_count=1 means you get one identical prefix for the entire dataset. As implemented here, prefix_tokens=10,prefix_count=1 only ensures that every row has a prefix of length 10; it does not guarantee any shared prefix between rows.
Rather than reuse SyntheticTextDatasetConfig, I think the best option is to create a new config format that is similar only where it makes sense. For example:
prompt_tokens:
prompt_tokens_stdev:
prompt_tokens_min:
prompt_tokens_max:
output_tokens:
output_tokens_stdev:
output_tokens_min:
output_tokens_max:
prefix_tokens:
prefix_tokens_stdev:
prefix_tokens_min:
prefix_tokens_max:
Treat prefix the same as prompt and output.
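For illustration, a minimal Pydantic-style sketch of the config format proposed above; the field names come from the comment, while the class name, types, and defaults are assumptions rather than the PR's actual code.

```python
from pydantic import BaseModel


class PreprocessDatasetConfig(BaseModel):
    """Hypothetical config that treats prefix the same as prompt and output."""

    prompt_tokens: int
    prompt_tokens_stdev: int | None = None
    prompt_tokens_min: int | None = None
    prompt_tokens_max: int | None = None

    output_tokens: int
    output_tokens_stdev: int | None = None
    output_tokens_min: int | None = None
    output_tokens_max: int | None = None

    prefix_tokens: int = 0
    prefix_tokens_stdev: int | None = None
    prefix_tokens_min: int | None = None
    prefix_tokens_max: int | None = None
```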
src/guidellm/data/entrypoints.py
Outdated
ERROR = "error"
...
def handle_ignore_strategy(
For better organization I think these handle_*_strategy methods should be defined on a static class.
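A minimal sketch of the suggested organization, assuming tokenizer-based length checks; only handle_ignore_strategy and the ERROR constant appear in the diff above, so the class name and the second handler are assumptions.

```python
class ShortPromptStrategy:
    """Hypothetical grouping of the handle_*_strategy methods on one static class."""

    @staticmethod
    def handle_ignore_strategy(prompt: str, min_prompt_tokens: int, tokenizer) -> str | None:
        # Drop rows whose prompt is shorter than the minimum token count.
        if len(tokenizer.encode(prompt)) < min_prompt_tokens:
            return None
        return prompt

    @staticmethod
    def handle_error_strategy(prompt: str, min_prompt_tokens: int, tokenizer) -> str:
        # Fail loudly instead of silently dropping short prompts.
        if len(tokenizer.encode(prompt)) < min_prompt_tokens:
            raise ValueError("prompt is shorter than the configured minimum")
        return prompt
```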
Done.
Use a separate config class for preprocess, but have it inherit several fields from a new class shared with the synthetic config. I did this so that the relevant fields are shared, lowering complexity.
Signed-off-by: Jared O'Connell <[email protected]>
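A rough sketch of the shared-fields approach this commit describes, assuming Pydantic models; the base-class name and the fields shown are illustrative, not the PR's actual definitions.

```python
from pydantic import BaseModel


class SharedTextDatasetConfig(BaseModel):
    """Hypothetical base class holding the fields common to both configs."""

    prompt_tokens: int
    output_tokens: int


class SyntheticTextDatasetConfig(SharedTextDatasetConfig):
    # Synthetic-data-only options, e.g. shared prefix buckets.
    prefix_count: int = 1


class PreprocessDatasetConfig(SharedTextDatasetConfig):
    # Preprocess-only options, e.g. how to handle too-short prompts.
    short_prompt_strategy: str = "ignore"
```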
Moved short prompt strategies to a static class
Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <[email protected]>
I moved it to its own class in a way that retains a single source of truth, so that we can use the same documentation. I simplified it to only have the option of trimming prefixes; I decided that randomized size sampling wouldn't make sense because that's typically not how prefixes work in real scenarios.
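As a sketch of what prefix trimming might look like, assuming a Hugging Face tokenizer and a hypothetical trim_prefix helper (neither is taken from the PR):

```python
from transformers import PreTrainedTokenizerBase


def trim_prefix(
    prefix: str, max_prefix_tokens: int, tokenizer: PreTrainedTokenizerBase
) -> str:
    """Trim a prompt prefix to at most max_prefix_tokens tokens (hypothetical helper)."""
    token_ids = tokenizer.encode(prefix, add_special_tokens=False)
    if len(token_ids) <= max_prefix_tokens:
        return prefix
    # Keep the leading tokens so rows that share a prefix keep sharing it.
    return tokenizer.decode(token_ids[:max_prefix_tokens], skip_special_tokens=True)
```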
Use separate class for preprocess config
Signed-off-by: Jared O'Connell <[email protected]>
sjmonson left a comment
Need to double-check that benchmark run is unaffected, but LGTM.
Summary
This PR re-enables, tests, and documents the preprocess dataset command. It also changes the format in which prompt and output sizes are specified, and makes the code aware of prefixes.
Details
The data config now uses the same format as benchmark run's synthetic data, to enable more features and make the command more cohesive with the rest of GuideLLM.
Test Plan
Use of AI
Includes AI-generated content (marked with ## WRITTEN BY AI ##)