
Commit 1eea713

Address review comments

Use a separate config class for preprocess, which inherits several fields from a new class shared with the synthetic config. I did this so that the relevant fields are shared, lowering complexity.

Signed-off-by: Jared O'Connell <[email protected]>
1 parent c60f1f2 commit 1eea713
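
The shared-class design described in the commit message might look roughly like the sketch below. This is an assumption for illustration only: `SharedTextDatasetConfigBase` is a hypothetical name, and the real classes in this PR derive from guidellm's `StandardBaseModel` rather than plain pydantic.

```python
# Hypothetical sketch of the inheritance described above; the base class
# name is an assumption, not taken from this commit.
from __future__ import annotations

from pydantic import BaseModel


class SharedTextDatasetConfigBase(BaseModel):
    """Token-count fields shared by the synthetic and preprocess configs."""

    prompt_tokens: int | None = None
    prompt_tokens_stdev: int | None = None
    prompt_tokens_min: int | None = None
    prompt_tokens_max: int | None = None
    output_tokens: int | None = None
    output_tokens_stdev: int | None = None
    output_tokens_min: int | None = None
    output_tokens_max: int | None = None


class PreprocessDatasetConfig(SharedTextDatasetConfigBase):
    """Preprocess-only config; adds the prefix trimming limit."""

    prefix_tokens_max: int | None = None
```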

9 files changed, +311 −430 lines changed

docs/guides/datasets.md

Lines changed: 31 additions & 8 deletions

@@ -251,12 +251,37 @@ guidellm preprocess dataset \
   "path/to/input_dataset.jsonl" \
   "path/to/processed_dataset.jsonl" \
   --processor "gpt2" \
-  --config "prompt_tokens=512,output_tokens=256"
+  --config "prompt_tokens=512,output_tokens=256,prefix_tokens_max=100"
 ```
 
 ### Configuration and Processor Options
 
-The `--config` parameter uses the same format as synthetic data configuration. It accepts a JSON string, key=value pairs, or a configuration file path. For detailed information about available configuration parameters (such as `prompt_tokens`, `output_tokens`, `prompt_tokens_stdev`, etc.), see the [Synthetic Data Configuration Options](../datasets.md#configuration-options) in the Dataset Configurations guide.
+The `--config` parameter accepts a `PreprocessDatasetConfig` as a JSON string, key=value pairs, or a configuration file path (.json, .yaml, .yml, .config). This configuration is similar to the synthetic data configuration but includes additional fields specific to preprocessing.
+
+**PreprocessDatasetConfig Options:**
+
+- `prompt_tokens`: Average number of tokens in prompts. If nothing else is specified, all prompts will be resized to this number of tokens.
+- `prompt_tokens_stdev`: Standard deviation for prompt tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied and min/max are specified, a uniform distribution is used.
+- `prompt_tokens_min`: Minimum number of tokens in prompts. If unset and `prompt_tokens_stdev` is set, the minimum is 1.
+- `prompt_tokens_max`: Maximum number of tokens in prompts. If unset and `prompt_tokens_stdev` is set, the maximum is 5 times the standard deviation.
+- `output_tokens`: Average number of tokens in outputs. If nothing else is specified, all outputs will have this number of tokens.
+- `output_tokens_stdev`: Standard deviation for output tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied and min/max are specified, a uniform distribution is used.
+- `output_tokens_min`: Minimum number of tokens in outputs. If unset and `output_tokens_stdev` is set, the minimum is 1.
+- `output_tokens_max`: Maximum number of tokens in outputs. If unset and `output_tokens_stdev` is set, the maximum is 5 times the standard deviation.
+- `prefix_tokens_max`: Maximum number of prefix tokens to keep. If set, prefixes will be trimmed to this maximum length. If not set, prefixes are kept as-is (unless `--include-prefix-in-token-count` is used, which disables prefix trimming).
+
+**Example configurations:**
+
+```bash
+# Using key=value pairs
+--config "prompt_tokens=512,output_tokens=256,prefix_tokens_max=100"
+
+# Using JSON string
+--config '{"prompt_tokens": 512, "output_tokens": 256, "prefix_tokens_max": 100}'
+
+# Using a configuration file
+--config "path/to/config.json"
+```
 
 The `--processor` argument specifies the tokenizer to use for calculating token counts. This is required because the preprocessing command needs to tokenize prompts to ensure they match the target token sizes. For information about using processors, including Hugging Face model IDs, local paths, and processor arguments, see the [Data Arguments Overview](../datasets.md#data-arguments-overview) section.
 
@@ -358,7 +383,6 @@ guidellm preprocess dataset \
 | Option | Description |
 | --------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
 | `--data-args <JSON>` | JSON string of arguments to pass to dataset loading. See [Data Arguments Overview](../datasets.md#data-arguments-overview) for details. |
-| `--prefix-tokens <NUMBER>` | Single prefix token count (alternative to `prefix_tokens` in config). |
 | `--include-prefix-in-token-count` | Include prefix tokens in prompt token count calculation (flag). When enabled, prefix trimming is disabled and the prefix is kept as-is. |
 | `--random-seed <NUMBER>` | Random seed for reproducible token sampling (default: 42). |
 | `--push-to-hub` | Push the processed dataset to Hugging Face Hub (flag). |
@@ -390,16 +414,15 @@ guidellm preprocess dataset \
   --random-seed 123
 ```
 
-**Example 3: Preprocessing with processor arguments and prefix tokens**
+**Example 3: Preprocessing with processor arguments and prefix token limits**
 
 ```bash
 guidellm preprocess dataset \
   "dataset.jsonl" \
   "processed.jsonl" \
   --processor "gpt2" \
   --processor-args '{"use_fast": false}' \
-  --config "prompt_tokens=512,output_tokens=256" \
-  --prefix-tokens 100 \
+  --config "prompt_tokens=512,output_tokens=256,prefix_tokens_max=100" \
   --include-prefix-in-token-count
 ```
 
@@ -417,9 +440,9 @@ guidellm preprocess dataset \
 
 ### Notes
 
-- The `--config` parameter uses the same format as synthetic data configuration. See the [Synthetic Data Configuration Options](../datasets.md#configuration-options) for all available parameters.
+- The `--config` parameter accepts a `PreprocessDatasetConfig` which includes all token count fields (prompt_tokens, output_tokens, etc.) plus `prefix_tokens_max` for controlling prefix length. See the [Configuration and Processor Options](#configuration-and-processor-options) section above for all available parameters.
 - The processor/tokenizer is required because the preprocessing command needs to tokenize prompts to ensure they match target token sizes. See the [Data Arguments Overview](../datasets.md#data-arguments-overview) for processor usage details.
 - Column mappings are only needed when your dataset uses non-standard column names. GuideLLM will automatically try common column names if no mapping is provided.
 - When using `--short-prompt-strategy concatenate`, ensure your dataset has enough samples to concatenate, or some prompts may be skipped.
 - The output format is determined by the file extension of `OUTPUT_PATH` (e.g., `.jsonl`, `.csv`, `.parquet`).
-- The prefix handling only trims prefixes. It doesn't expand them. Prefix buckets, if specified, only trim the given prefixes by bucket weighting. It doesn't generate unique prefixes for each bucket.
+- The prefix handling only trims prefixes. It doesn't expand them. Use `prefix_tokens_max` in the config to set a maximum prefix length, which will trim prefixes that exceed this limit.
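
Since the updated docs accept a configuration file path, here is a small sketch of generating one; the file names and values are illustrative, not taken from the repository:

```python
# Sketch: write the documented options to files usable with --config.
import json

import yaml  # PyYAML; the loader added in this commit already imports it

config = {
    "prompt_tokens": 512,
    "output_tokens": 256,
    "prefix_tokens_max": 100,
}

# JSON form: guidellm preprocess dataset ... --config "config.json"
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

# YAML form: guidellm preprocess dataset ... --config "config.yaml"
with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f)
```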

src/guidellm/__main__.py

Lines changed: 5 additions & 13 deletions

@@ -520,10 +520,11 @@ def preprocess():
     type=str,
     required=True,
     help=(
-        "SyntheticTextDatasetConfig as JSON string, key=value pairs, "
+        "PreprocessDatasetConfig as JSON string, key=value pairs, "
         "or file path (.json, .yaml, .yml, .config). "
-        "Example: 'prompt_tokens=100,output_tokens=50'"
-        " or '{\"prompt_tokens\": 100, \"output_tokens\": 50}'"
+        "Example: 'prompt_tokens=100,output_tokens=50,prefix_tokens_max=10'"
+        " or '{\"prompt_tokens\": 100, \"output_tokens\": 50, "
+        "\"prefix_tokens_max\": 10}'"
     ),
 )
 @click.option(
@@ -565,18 +566,11 @@ def preprocess():
         "Delimiter for concatenating short prompts (used with 'concatenate' strategy)."
     ),
 )
-@click.option(
-    "--prefix-tokens",
-    type=int,
-    default=None,
-    help="Single prefix token count (alternative to prefix_buckets in config).",
-)
 @click.option(
     "--include-prefix-in-token-count",
     is_flag=True,
     default=False,
-    help="Include prefix tokens in prompt token count calculation. When enabled, "
-    "prefix trimming is disabled and the prefix is kept as-is.",
+    help="Include prefix tokens in prompt token count calculation.",
 )
 @click.option(
     "--push-to-hub",
@@ -607,7 +601,6 @@ def dataset(
     short_prompt_strategy,
     pad_char,
     concat_delimiter,
-    prefix_tokens,
     include_prefix_in_token_count,
     push_to_hub,
     hub_dataset_id,
@@ -624,7 +617,6 @@ def dataset(
         short_prompt_strategy=short_prompt_strategy,
         pad_char=pad_char,
         concat_delimiter=concat_delimiter,
-        prefix_tokens=prefix_tokens,
         include_prefix_in_token_count=include_prefix_in_token_count,
         push_to_hub=push_to_hub,
         hub_dataset_id=hub_dataset_id,

src/guidellm/data/config.py

Lines changed: 121 additions & 0 deletions

@@ -0,0 +1,121 @@
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any, TypeVar
+
+import yaml
+from pydantic import ValidationError
+
+from guidellm.data.schemas import DataNotSupportedError
+from guidellm.schemas import StandardBaseModel
+
+ConfigT = TypeVar("ConfigT", bound=StandardBaseModel)
+
+
+def load_config(config: Any, config_class: type[ConfigT]) -> ConfigT | None:
+    # Try file path first
+    if (loaded_config := _load_config_file(config, config_class)) is not None:
+        return loaded_config
+
+    # Try dict parsing next
+    if (loaded_config := _load_config_dict(config, config_class)) is not None:
+        return loaded_config
+
+    # Try string parsing
+    if (loaded_config := _load_config_str(config, config_class)) is not None:
+        return loaded_config
+
+    return None
+
+
+def _load_config_dict(data: Any, config_class: type[ConfigT]) -> ConfigT | None:
+    if not isinstance(data, dict | list):
+        return None
+
+    try:
+        return config_class.model_validate(data)
+    except ValidationError:
+        return None
+
+
+def _load_config_file(data: Any, config_class: type[ConfigT]) -> ConfigT | None:
+    if (not isinstance(data, str) and not isinstance(data, Path)) or (
+        not Path(data).is_file()
+    ):
+        return None
+
+    data_path = Path(data) if isinstance(data, str) else data
+    error = None
+
+    if Path(data).is_file() and data_path.suffix.lower() == ".json":
+        try:
+            return config_class.model_validate_json(
+                data_path.read_text()
+            )
+        except Exception as err:  # noqa: BLE001
+            error = err
+
+    if Path(data).is_file() and data_path.suffix.lower() in {
+        ".yaml",
+        ".yml",
+        ".config",
+    }:
+        try:
+            return config_class.model_validate(
+                yaml.safe_load(data_path.read_text())
+            )
+        except Exception as err:  # noqa: BLE001
+            error = err
+
+    err_message = (
+        f"Unsupported file {data_path} for "
+        f"{config_class.__name__}, expected .json, "
+        f".yaml, .yml, or .config"
+    )
+
+    if error is not None:
+        err_message += f" with error: {error}"
+        raise DataNotSupportedError(err_message) from error
+    raise DataNotSupportedError(err_message)
+
+
+def _load_config_str(data: str, config_class: type[ConfigT]) -> ConfigT | None:
+    if not isinstance(data, str):
+        return None
+
+    data_str = data.strip()
+    error = None
+
+    if (data_str.startswith("{") and data_str.endswith("}")) or (
+        data_str.startswith("[") and data_str.endswith("]")
+    ):
+        try:
+            return config_class.model_validate_json(data_str)
+        except Exception as err:  # noqa: BLE001
+            error = err
+
+    if data_str.count("=") > 1:
+        # key=value pairs separated by commas
+        try:
+            config_dict = {}
+            items = data_str.split(",")
+            for item in items:
+                key, value = item.split("=")
+                config_dict[key.strip()] = (
+                    int(value.strip())
+                    if value.strip().isnumeric()
+                    else value.strip()
+                )
+
+            return config_class.model_validate(config_dict)
+        except Exception as err:  # noqa: BLE001
+            error = err
+
+    err_message = (
+        f"Unsupported string data for {config_class.__name__}, "
+        f"expected JSON or key-value pairs, got {data}"
+    )
+    if error is not None:
+        err_message += f" with error: {error}"
+        raise DataNotSupportedError(err_message) from error
+    raise DataNotSupportedError(err_message)
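
Given the fallthrough order above (file path, then dict, then string), usage might look like this sketch. `PreprocessDatasetConfig` itself is defined elsewhere in this PR, so a stand-in pydantic model is used here; the real config classes subclass guidellm's `StandardBaseModel`.

```python
# Sketch of calling the new loader with a stand-in config model.
from __future__ import annotations

from pydantic import BaseModel

from guidellm.data.config import load_config


class DemoConfig(BaseModel):  # stand-in; real classes use StandardBaseModel
    prompt_tokens: int | None = None
    output_tokens: int | None = None
    prefix_tokens_max: int | None = None


# key=value string: handled by _load_config_str (note it requires more
# than one "=" in the string before it tries to parse pairs)
kv = load_config("prompt_tokens=512,output_tokens=256,prefix_tokens_max=100", DemoConfig)

# JSON string: must start/end with braces to reach model_validate_json
js = load_config('{"prompt_tokens": 512, "output_tokens": 256}', DemoConfig)

# Plain dict: handled by _load_config_dict via model_validate
dc = load_config({"prompt_tokens": 512}, DemoConfig)

assert kv is not None and js is not None and dc is not None
```

One consequence of the `count("=") > 1` guard worth noting: as written, a single `key=value` pair (only one `=`) skips the pair-parsing branch and falls through to `DataNotSupportedError`.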

src/guidellm/data/deserializers/__init__.py

Lines changed: 0 additions & 4 deletions

@@ -23,9 +23,7 @@
 )
 from .synthetic import (
     SyntheticTextDataset,
-    SyntheticTextDatasetConfig,
     SyntheticTextDatasetDeserializer,
-    SyntheticTextPrefixBucketConfig,
 )
 
 __all__ = [
@@ -45,9 +43,7 @@
     "JSONFileDatasetDeserializer",
     "ParquetFileDatasetDeserializer",
    "SyntheticTextDataset",
-    "SyntheticTextDatasetConfig",
     "SyntheticTextDatasetDeserializer",
-    "SyntheticTextPrefixBucketConfig",
     "TarFileDatasetDeserializer",
     "TextFileDatasetDeserializer",
 ]

src/guidellm/data/deserializers/deserializer.py

Lines changed: 1 addition & 5 deletions

@@ -6,20 +6,16 @@
 from datasets import Dataset, DatasetDict, IterableDataset, IterableDatasetDict
 from transformers import PreTrainedTokenizerBase
 
+from guidellm.data.schemas import DataNotSupportedError
 from guidellm.data.utils import resolve_dataset_split
 from guidellm.utils import RegistryMixin
 
 __all__ = [
-    "DataNotSupportedError",
     "DatasetDeserializer",
     "DatasetDeserializerFactory",
 ]
 
 
-class DataNotSupportedError(Exception):
-    """Exception raised when data format is not supported by deserializer."""
-
-
 @runtime_checkable
 class DatasetDeserializer(Protocol):
     def __call__(