Description
Describe the bug
A clear and concise description of what the bug is.
When I run benchmarks with synthetic prompts, the generated prompts begin with tokens in a language that differs from the rest of the prompt. For example:
In this screenshot, the start tokens are in various languages and are followed by English text.
In this screenshot, the start tokens differ from the German text that follows them.
Prompts like these are unsuitable for benchmark testing.
I wrote a script to dig into the problem:
import json

from guidellm.dataset.synthetic import (
    SyntheticDatasetCreator,
)


def test_handle_create_basic():
    # Example: German
    # data = "prompt_tokens=128,output_tokens=56, samples=10, source=https://www.gutenberg.org/cache/epub/30793/pg30793.txt"
    # Example: English
    data = "prompt_tokens=128,output_tokens=56, samples=10, source=https://www.gutenberg.org/cache/epub/1342/pg1342.txt"
    synthetic_creator = SyntheticDatasetCreator()
    result = synthetic_creator.handle_create(
        data=data,
        data_args=None,
        processor="${local_path}/Qwen2.5-1.5B-Instruct",
        processor_args=None,
        random_seed=42,
    )
    # Collect the generated prompt strings and dump them for inspection.
    res_list = result[:]['prompt']
    with open('./synthetic_prompts.json', 'w', encoding='utf-8') as f:
        json.dump(res_list, f, ensure_ascii=False, indent=4)


if __name__ == "__main__":
    test_handle_create_basic()
Expected behavior
A clear and concise description of what you expected to happen.
The first word of each prompt should be in the same language as the rest of the prompt.
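To check this on the dumped prompts without reading them one by one, a rough heuristic could flag prompts whose first word uses a different script than the rest. The sketch below is only an illustration: `is_latin_letter` and `has_foreign_start` are hypothetical helper names, not part of guidellm, and the 0.9 threshold is an arbitrary assumption. It only catches script mismatches (for example a Cyrillic or CJK token in front of English or German text), not a mismatch between two Latin-script languages.

```python
import unicodedata


def is_latin_letter(ch: str) -> bool:
    """True if ch is a letter whose Unicode name marks it as Latin script."""
    return ch.isalpha() and "LATIN" in unicodedata.name(ch, "")


def has_foreign_start(prompt: str, latin_ratio: float = 0.9) -> bool:
    """Heuristic (hypothetical helper): flag a prompt whose first word
    contains a non-Latin letter while the rest is mostly Latin script."""
    words = prompt.split()
    if len(words) < 2:
        return False
    first_letters = [c for c in words[0] if c.isalpha()]
    rest_letters = [c for w in words[1:] for c in w if c.isalpha()]
    if not first_letters or not rest_letters:
        return False
    # Share of Latin letters in everything after the first word.
    rest_latin = sum(is_latin_letter(c) for c in rest_letters) / len(rest_letters)
    first_has_non_latin = any(not is_latin_letter(c) for c in first_letters)
    return rest_latin > latin_ratio and first_has_non_latin
```

Applied to the list written by the script above, something like `[p for p in res_list if has_foreign_start(p)]` would list the suspicious prompts.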
Environment
Include all relevant environment information:
- OS [e.g. Ubuntu 20.04]:
- Python version [e.g. 3.12.2]: Python 3.9.9
- guidellm version: 0.3.0
To Reproduce
Exact steps to reproduce the behavior:
Step 1. Copy the test script above into a local environment and save it as test_synthetic_prompt.py.
Step 2. Run python ./test_synthetic_prompt.py.
Errors
If applicable, add a full print-out of any errors or exceptions that are raised or include screenshots to help explain your problem.
Additional context
Add any other context about the problem here. Also include any relevant files.