start_tokens in synthetic prompt is not related to prompt_text #360

**Describe the bug**

When I run benchmarks with synthetic prompts, each generated prompt begins with tokens in a language that does not match the rest of the prompt text. For example:

In this screenshot, the start_tokens are in various other languages, while the text that follows is English.
<img width="1125" height="410" alt="Image" src="https://github.com/user-attachments/assets/371e5ec0-9169-46fe-b4e6-22934ca341c8" />


In this screenshot, the start_tokens do not match the German text that follows.
<img width="1344" height="400" alt="Image" src="https://github.com/user-attachments/assets/c1ac4ddd-bb82-4e9c-8ef1-e71d538120fa" />


Prompts that begin with such unrelated tokens are inappropriate for benchmark testing.

I wrote the following script to dig into the problem:

```python
import json

from guidellm.dataset.synthetic import SyntheticDatasetCreator


def test_handle_create_basic():
    # Example: German source text
    # data = "prompt_tokens=128,output_tokens=56, samples=10, source=https://www.gutenberg.org/cache/epub/30793/pg30793.txt"
    # Example: English source text
    data = "prompt_tokens=128,output_tokens=56, samples=10, source=https://www.gutenberg.org/cache/epub/1342/pg1342.txt"
    synthetic_creator = SyntheticDatasetCreator()
    result = synthetic_creator.handle_create(
        data=data,
        data_args=None,
        # Replace ${local_path} with the directory that holds the tokenizer.
        processor="${local_path}/Qwen2.5-1.5B-Instruct",
        processor_args=None,
        random_seed=42,
    )
    # Collect the generated prompt strings and dump them for inspection.
    res_list = result[:]["prompt"]
    with open("./synthetic_prompts.json", "w", encoding="utf-8") as f:
        json.dump(res_list, f, ensure_ascii=False, indent=4)


if __name__ == "__main__":
    test_handle_create_basic()
```

**Expected behavior**

The start tokens should be consistent with the rest of the prompt, for example in the same language as the text that follows.
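
One way to picture this expectation is sketched below. It assumes that the generator prepends a single prefix token to make each prompt unique, which is an assumption about guidellm's internals and is not confirmed in this report. All names (`build_prompt`, `vocab`, `source_ids`) are hypothetical, and a toy vocabulary stands in for a real tokenizer. The sketch contrasts drawing the prefix from the whole vocabulary with drawing it from token ids that actually occur in the source text.

```python
import random


def build_prompt(source_ids, length, start, prefix_id):
    """Prepend one prefix token, then fill the rest from a window of the source."""
    body = source_ids[start:start + length - 1]
    return [prefix_id] + body


# Toy vocabulary: ids 0-5 occur in the source, ids 6-9 stand for other languages.
vocab = {
    0: "It", 1: "is", 2: "a", 3: "truth", 4: "universally", 5: "acknowledged",
    6: "的", 7: "количество", 8: "und", 9: "été",
}
source_ids = [0, 1, 2, 3, 4, 5]

rng = random.Random(42)

# Sampling from the whole vocabulary can yield a token from any language.
vocab_prefix = rng.choice(sorted(vocab))

# Sampling from ids seen in the source keeps the prefix in the source's language.
source_prefix = rng.choice(sorted(set(source_ids)))

prompt = build_prompt(source_ids, length=4, start=1, prefix_id=source_prefix)
print(" ".join(vocab[i] for i in prompt))
```

With the second strategy every token in the prompt, including the first, comes from the source text, at the cost of a smaller pool of distinct prefixes.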

**Environment**
Include all relevant environment information:
1. OS [e.g. Ubuntu 20.04]: 
2. Python version [e.g. 3.12.2]: Python 3.9.9
3. guidellm version: 0.3.0

**To Reproduce**
Exact steps to reproduce the behavior:
1. Copy the test script above into a local file named `test_synthetic_prompt.py`.
2. Run `python ./test_synthetic_prompt.py`.


