Reenable and improve preprocess dataset #472
base: main
Conversation
Also includes some new behavior
Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <[email protected]>
sjmonson left a comment
I think the reuse of SyntheticTextDatasetConfig is not a great idea, because prefix_buckets are defined differently here. For synthetic data, a prefix bucket of prefix_tokens=10,prefix_count=1 means you get one identical prefix for the entire dataset. As implemented here, prefix_tokens=10,prefix_count=1 only ensures that every row has a prefix of length 10; it does not guarantee any shared prefix between rows.
Rather than reuse SyntheticTextDatasetConfig, I think the best option is to create a new config format that is similar only where it makes sense. For example:
prompt_tokens:
prompt_tokens_stdev:
prompt_tokens_min:
prompt_tokens_max:
output_tokens:
output_tokens_stdev:
output_tokens_min:
output_tokens_max:
prefix_tokens:
prefix_tokens_stdev:
prefix_tokens_min:
prefix_tokens_max:
Treat prefix the same as prompt and output.
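For illustration, a minimal Pydantic-style sketch of the config format proposed above; the field names come from the comment, while the class name, types, and defaults are assumptions rather than the PR's actual code.

```python
from pydantic import BaseModel


class PreprocessDatasetConfig(BaseModel):
    """Hypothetical config that treats prefix the same as prompt and output."""

    prompt_tokens: int
    prompt_tokens_stdev: int | None = None
    prompt_tokens_min: int | None = None
    prompt_tokens_max: int | None = None

    output_tokens: int
    output_tokens_stdev: int | None = None
    output_tokens_min: int | None = None
    output_tokens_max: int | None = None

    prefix_tokens: int = 0
    prefix_tokens_stdev: int | None = None
    prefix_tokens_min: int | None = None
    prefix_tokens_max: int | None = None
```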
src/guidellm/data/entrypoints.py
Outdated
ERROR = "error"
...
def handle_ignore_strategy(
For better organization I think these handle_*_strategy methods should be defined on a static class.
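A minimal sketch of the suggested organization, assuming tokenizer-based length checks; only handle_ignore_strategy and the ERROR constant appear in the diff above, so the class name and the second handler are assumptions.

```python
class ShortPromptStrategy:
    """Hypothetical grouping of the handle_*_strategy methods on one static class."""

    @staticmethod
    def handle_ignore_strategy(prompt: str, min_prompt_tokens: int, tokenizer) -> str | None:
        # Drop rows whose prompt is shorter than the minimum token count.
        if len(tokenizer.encode(prompt)) < min_prompt_tokens:
            return None
        return prompt

    @staticmethod
    def handle_error_strategy(prompt: str, min_prompt_tokens: int, tokenizer) -> str:
        # Fail loudly instead of silently dropping short prompts.
        if len(tokenizer.encode(prompt)) < min_prompt_tokens:
            raise ValueError("prompt is shorter than the configured minimum")
        return prompt
```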
Done.
Use a separate config class for preprocess, but have it inherit several fields from a new class shared with the synthetic config. I did this so that the relevant fields are shared, lowering complexity.
Signed-off-by: Jared O'Connell <[email protected]>
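A rough sketch of the shared-fields approach this commit describes, assuming Pydantic models; the base-class name and the fields shown are illustrative, not the PR's actual definitions.

```python
from pydantic import BaseModel


class SharedTextDatasetConfig(BaseModel):
    """Hypothetical base class holding the fields common to both configs."""

    prompt_tokens: int
    output_tokens: int


class SyntheticTextDatasetConfig(SharedTextDatasetConfig):
    # Synthetic-data-only options, e.g. shared prefix buckets.
    prefix_count: int = 1


class PreprocessDatasetConfig(SharedTextDatasetConfig):
    # Preprocess-only options, e.g. how to handle too-short prompts.
    short_prompt_strategy: str = "ignore"
```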
Moved short prompt strategies to a static class
Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <[email protected]>
I moved it to its own class in a way that retains a single source of truth, so that we can use the same documentation. I simplified it to only have the option of trimming prefixes; I decided that randomized size sampling wouldn't make sense because that's typically not how prefixes work in real scenarios.
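As a sketch of what prefix trimming might look like, assuming a Hugging Face tokenizer and a hypothetical trim_prefix helper (neither is taken from the PR):

```python
from transformers import PreTrainedTokenizerBase


def trim_prefix(
    prefix: str, max_prefix_tokens: int, tokenizer: PreTrainedTokenizerBase
) -> str:
    """Trim a prompt prefix to at most max_prefix_tokens tokens (hypothetical helper)."""
    token_ids = tokenizer.encode(prefix, add_special_tokens=False)
    if len(token_ids) <= max_prefix_tokens:
        return prefix
    # Keep the leading tokens so rows that share a prefix keep sharing it.
    return tokenizer.decode(token_ids[:max_prefix_tokens], skip_special_tokens=True)
```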
Use separate class for preprocess config
Signed-off-by: Jared O'Connell <[email protected]>
sjmonson left a comment
Need to double-check that benchmark run is unaffected, but LGTM.
Summary
This PR re-enables, tests, and documents the preprocess dataset command. It also changes the format in which prompt and output sizes are specified, and makes the code aware of prefixes.
Details
The data config now uses the same format as benchmark run's synthetic data, to enable more features and make the command more cohesive with the rest of GuideLLM.
Test Plan
Use of AI
Includes AI-generated content (marked with ## WRITTEN BY AI ##)