docs/guides/datasets.md — 211 additions, 0 deletions

@@ -212,3 +212,214 @@ benchmark_generative_text(data=data, ...)
- For lists of dictionaries, all items must have the same keys.
- For lists of items, all elements must be of the same type.
- A processor/tokenizer is only required if `GUIDELLM__PREFERRED_PROMPT_TOKENS_SOURCE="local"` or `GUIDELLM__PREFERRED_OUTPUT_TOKENS_SOURCE="local"` is set in the environment. In this case, the processor/tokenizer must be specified using the `--processor` argument. If not set, the processor/tokenizer will be set to the model passed in or retrieved from the server.

## Preprocessing Datasets

GuideLLM provides a preprocessing command that reshapes datasets to specific prompt and output token sizes. This is particularly useful when you need to standardize a dataset for benchmarking, or when its prompts don't match your target token requirements.

The preprocessing command can:

- Resize prompts to target token lengths
- Handle prompts that are shorter or longer than the target length using various strategies
- Map columns from your dataset to GuideLLM's expected column names
- Generate output token counts based on your configuration
- Save the processed dataset in various formats

### Basic Usage

```bash
guidellm preprocess dataset \
  <DATA> \
  <OUTPUT_PATH> \
  --processor <PROCESSOR> \
  --config <CONFIG>
```

### Required Arguments

| Argument | Description |
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `DATA` | Path to the input dataset or Hugging Face dataset ID. Supports all dataset formats documented in the [Dataset Configurations](../datasets.md). |
| `OUTPUT_PATH` | Path to save the processed dataset, including file suffix (e.g., `processed_dataset.jsonl`, `output.csv`). |
| `--processor` | **Required.** Processor or tokenizer name/path for calculating token counts. Can be a Hugging Face model ID or local path. |
| `--config` | **Required.** Configuration specifying target token sizes. Can be a JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config). |

### Example

```bash
guidellm preprocess dataset \
  "path/to/input_dataset.jsonl" \
  "path/to/processed_dataset.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256"
```

### Configuration and Processor Options

The `--config` parameter uses the same format as synthetic data configuration. It accepts a JSON string, key=value pairs, or a configuration file path. For detailed information about available configuration parameters (such as `prompt_tokens`, `output_tokens`, `prompt_tokens_stdev`, etc.), see the [Synthetic Data Configuration Options](../datasets.md#configuration-options) in the Dataset Configurations guide.
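
A configuration file works the same way; point `--config` at its path instead of passing inline pairs. A minimal sketch (the file name is illustrative):

```bash
# Write an example JSON config file, then reference it by path.
cat > token_config.json <<'EOF'
{
  "prompt_tokens": 512,
  "prompt_tokens_stdev": 50,
  "output_tokens": 256
}
EOF

guidellm preprocess dataset \
  "dataset.jsonl" \
  "processed.jsonl" \
  --processor "gpt2" \
  --config token_config.json
```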

The `--processor` argument specifies the tokenizer to use for calculating token counts. This is required because the preprocessing command needs to tokenize prompts to ensure they match the target token sizes. For information about using processors, including Hugging Face model IDs, local paths, and processor arguments, see the [Data Arguments Overview](../datasets.md#data-arguments-overview) section.

### Column Mapping

When your dataset uses non-standard column names, you can use `--data-column-mapper` to map your columns to GuideLLM's expected column names. This is particularly useful when:

1. **Your dataset uses column names that aren't auto-detected** (e.g., `user_query` instead of a recognized default such as `prompt` or `question`)
2. **You have multiple datasets** and need to specify which dataset's columns to use
3. **Your dataset has system prompts or prefixes** in a separate column

**Column mapping format:** The `--data-column-mapper` accepts a JSON string mapping column types to column names:

```json
{
  "text_column": "question",
  "prefix_column": "system_prompt",
  "prompt_tokens_count_column": "input_tokens",
  "output_tokens_count_column": "completion_tokens"
}
```

**Supported column types:**

- `text_column`: The main prompt text (defaults: `prompt`, `instruction`, `question`, `input`, `context`, `content`, `text`)
- `prefix_column`: System prompt or prefix (defaults: `system_prompt`, `system`, `prefix`)
- `prompt_tokens_count_column`: Column containing prompt token counts (defaults: `prompt_tokens_count`, `input_tokens_count`)
- `output_tokens_count_column`: Column containing output token counts (defaults: `output_tokens_count`, `completion_tokens_count`)
- `image_column`: Image data column
- `video_column`: Video data column
- `audio_column`: Audio data column

**Example: Mapping custom column names**

If your dataset is a CSV file with columns `user_query` and `system_message`:

```csv
user_query,system_message
"What is AI?","You are a helpful assistant."
"How does ML work?","You are a technical expert."
```

You would use:

```bash
guidellm preprocess dataset \
  "dataset.csv" \
  "processed.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256" \
  --data-column-mapper '{"text_column": "user_query", "prefix_column": "system_message"}'
```
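
**Example: Reusing precomputed token counts**

If your dataset already stores token counts under custom names, the count columns can be mapped as well. A sketch (the record layout is illustrative, and it assumes the mapped count columns are read rather than recomputed):

```bash
# dataset.jsonl records look like (illustrative):
# {"question": "What is AI?", "input_tokens": 9, "completion_tokens": 64}
guidellm preprocess dataset \
  "dataset.jsonl" \
  "processed.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256" \
  --data-column-mapper '{"text_column": "question", "prompt_tokens_count_column": "input_tokens", "output_tokens_count_column": "completion_tokens"}'
```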

**Example: Multiple datasets**

If you're working with multiple datasets and need to specify which dataset's columns to use, you can use the format `<dataset_index>.<column_name>` or `<dataset_name>.<column_name>`:

```bash
--data-column-mapper '{"text_column": "0.prompt", "prefix_column": "1.system"}'
```

### Handling Short Prompts

When prompts are shorter than the target token length, you can specify how to handle them using `--short-prompt-strategy`:

| Strategy | Description |
| ------------- | ------------------------------------------------------------------------------ |
| `ignore` | Skip prompts that are shorter than the target length (default) |
| `concatenate` | Concatenate multiple short prompts together until the target length is reached |
| `pad` | Pad short prompts with a specified character to reach the target length |
| `error` | Raise an error if a prompt is shorter than the target length |

**Example: Concatenating short prompts**

```bash
guidellm preprocess dataset \
  "dataset.jsonl" \
  "processed.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256" \
  --short-prompt-strategy "concatenate" \
  --concat-delimiter "\n\n"
```

**Example: Padding short prompts**

```bash
guidellm preprocess dataset \
  "dataset.jsonl" \
  "processed.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256" \
  --short-prompt-strategy "pad" \
  --pad-char " "
```

### Additional Options

| Option | Description |
| --------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| `--data-args <JSON>` | JSON string of arguments to pass to dataset loading. See [Data Arguments Overview](../datasets.md#data-arguments-overview) for details. |
| `--prefix-tokens <NUMBER>`        | Single prefix token count (alternative to `prefix_buckets` in config).                                                                   |
| `--include-prefix-in-token-count` | Include prefix tokens in prompt token count calculation (flag). When enabled, prefix trimming is disabled and the prefix is kept as-is. |
| `--random-seed <NUMBER>` | Random seed for reproducible token sampling (default: 42). |
| `--push-to-hub` | Push the processed dataset to Hugging Face Hub (flag). |
| `--hub-dataset-id <ID>` | Hugging Face Hub dataset ID for upload (required if `--push-to-hub` is set). |

### Complete Examples

**Example 1: Basic preprocessing with custom column names**

```bash
guidellm preprocess dataset \
  "my_dataset.csv" \
  "processed_dataset.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256" \
  --data-column-mapper '{"text_column": "user_question", "prefix_column": "system_instruction"}'
```

**Example 2: Preprocessing with distribution and short prompt handling**

```bash
guidellm preprocess dataset \
  "dataset.jsonl" \
  "processed.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,prompt_tokens_stdev=50,output_tokens=256,output_tokens_stdev=25" \
  --short-prompt-strategy "concatenate" \
  --concat-delimiter "\n\n" \
  --random-seed 123
```

**Example 3: Preprocessing with processor arguments and prefix tokens**

```bash
guidellm preprocess dataset \
  "dataset.jsonl" \
  "processed.jsonl" \
  --processor "gpt2" \
  --processor-args '{"use_fast": false}' \
  --config "prompt_tokens=512,output_tokens=256" \
  --prefix-tokens 100 \
  --include-prefix-in-token-count
```

**Example 4: Preprocessing and uploading to Hugging Face Hub**

```bash
guidellm preprocess dataset \
  "my_dataset.jsonl" \
  "processed.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256" \
  --push-to-hub \
  --hub-dataset-id "username/processed-dataset"
```
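
**Example 5: Loading a Hugging Face dataset split with `--data-args`**

A sketch combining dataset-loading arguments with preprocessing; the dataset ID is illustrative, and `split` is assumed to be a supported loading argument passed through `--data-args`:

```bash
guidellm preprocess dataset \
  "username/source-dataset" \
  "processed.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256" \
  --data-args '{"split": "train"}'
```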

### Notes

- The `--config` parameter uses the same format as synthetic data configuration. See the [Synthetic Data Configuration Options](../datasets.md#configuration-options) for all available parameters.
- The processor/tokenizer is required because the preprocessing command needs to tokenize prompts to ensure they match target token sizes. See the [Data Arguments Overview](../datasets.md#data-arguments-overview) for processor usage details.
- Column mappings are only needed when your dataset uses non-standard column names. GuideLLM will automatically try common column names if no mapping is provided.
- When using `--short-prompt-strategy concatenate`, ensure your dataset has enough samples to concatenate, or some prompts may be skipped.
- The output format is determined by the file extension of `OUTPUT_PATH` (e.g., `.jsonl`, `.csv`, `.parquet`).
- Prefix handling only trims prefixes; it doesn't expand them. Prefix buckets, if specified, only trim the given prefixes according to bucket weighting; they don't generate unique prefixes for each bucket.
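
### Programmatic Use

The preprocessing entrypoint is also importable from Python (`guidellm.data` exports `process_dataset` and `ShortPromptStrategy`). A minimal sketch, assuming `process_dataset` accepts the same keyword arguments the CLI forwards to it and that omitted keywords have defaults:

```python
# A sketch mirroring the CLI call in src/guidellm/__main__.py; the values are
# illustrative, and omitted keyword arguments are assumed to have defaults.
from guidellm.data import ShortPromptStrategy, process_dataset

process_dataset(
    data="dataset.jsonl",
    output_path="processed.jsonl",
    processor="gpt2",
    config="prompt_tokens=512,output_tokens=256",
    short_prompt_strategy=ShortPromptStrategy.IGNORE.value,
    random_seed=42,
)
```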

src/guidellm/__main__.py — 146 additions, 0 deletions

@@ -30,6 +30,8 @@
import click
from pydantic import ValidationError

from guidellm.data import ShortPromptStrategy, process_dataset

try:
    import uvloop
except ImportError:
@@ -486,6 +488,150 @@ def preprocess():
    """Dataset preprocessing utilities."""


@preprocess.command(
    "dataset",
    help=(
        "Process a dataset to have specific prompt and output token sizes. "
        "Supports multiple strategies for handling prompts and optional "
        "Hugging Face Hub upload.\n\n"
        "DATA: Path to the input dataset or dataset ID.\n\n"
        "OUTPUT_PATH: Path to save the processed dataset, including file suffix."
    ),
    context_settings={"auto_envvar_prefix": "GUIDELLM"},
)
@click.argument(
    "data",
    type=str,
    required=True,
)
@click.argument(
    "output_path",
    type=click.Path(file_okay=True, dir_okay=False, writable=True, resolve_path=True),
    required=True,
)
@click.option(
    "--processor",
    type=str,
    required=True,
    help="Processor or tokenizer name for calculating token counts.",
)
@click.option(
    "--config",
    type=str,
    required=True,
    help=(
        "SyntheticTextDatasetConfig as JSON string, key=value pairs, "
        "or file path (.json, .yaml, .yml, .config). "
        "Example: 'prompt_tokens=100,output_tokens=50'"
        " or '{\"prompt_tokens\": 100, \"output_tokens\": 50}'"
    ),
)
@click.option(
    "--processor-args",
    default=None,
    callback=cli_tools.parse_json,
    help="JSON string of arguments to pass to the processor constructor.",
)
@click.option(
    "--data-args",
    callback=cli_tools.parse_json,
    help="JSON string of arguments to pass to dataset creation.",
)
@click.option(
    "--data-column-mapper",
    default=None,
    callback=cli_tools.parse_json,
    help="JSON string of column mappings to apply to the dataset.",
)
@click.option(
    "--short-prompt-strategy",
    type=click.Choice([s.value for s in ShortPromptStrategy]),
    default=ShortPromptStrategy.IGNORE.value,
    show_default=True,
    help="Strategy for handling prompts shorter than target length.",
)
@click.option(
    "--pad-char",
    type=str,
    default="",
    callback=decode_escaped_str,
    help="Character to pad short prompts with when using 'pad' strategy.",
)
@click.option(
    "--concat-delimiter",
    type=str,
    default="",
    help=(
        "Delimiter for concatenating short prompts (used with 'concatenate' strategy)."
    ),
)
@click.option(
    "--prefix-tokens",
    type=int,
    default=None,
    help="Single prefix token count (alternative to prefix_buckets in config).",
)
@click.option(
    "--include-prefix-in-token-count",
    is_flag=True,
    default=False,
    help="Include prefix tokens in prompt token count calculation. When enabled, "
    "prefix trimming is disabled and the prefix is kept as-is.",
)
@click.option(
    "--push-to-hub",
    is_flag=True,
    help="Push the processed dataset to Hugging Face Hub.",
)
@click.option(
    "--hub-dataset-id",
    type=str,
    default=None,
    help=(
        "Hugging Face Hub dataset ID for upload (required if --push-to-hub is set)."
    ),
)
@click.option(
    "--random-seed",
    type=int,
    default=42,
    show_default=True,
    help="Random seed for reproducible token sampling.",
)
def dataset(
    data,
    output_path,
    processor,
    config,
    processor_args,
    data_args,
    data_column_mapper,
    short_prompt_strategy,
    pad_char,
    concat_delimiter,
    prefix_tokens,
    include_prefix_in_token_count,
    push_to_hub,
    hub_dataset_id,
    random_seed,
):
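    # Thin CLI wrapper: forward the parsed options to the data entrypoint.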
    process_dataset(
        data=data,
        output_path=output_path,
        processor=processor,
        config=config,
        processor_args=processor_args,
        data_args=data_args,
        data_column_mapper=data_column_mapper,
        short_prompt_strategy=short_prompt_strategy,
        pad_char=pad_char,
        concat_delimiter=concat_delimiter,
        prefix_tokens=prefix_tokens,
        include_prefix_in_token_count=include_prefix_in_token_count,
        push_to_hub=push_to_hub,
        hub_dataset_id=hub_dataset_id,
        random_seed=random_seed,
    )


@cli.command(
    "mock-server",
    help=(

src/guidellm/data/__init__.py — 3 additions, 0 deletions

@@ -4,6 +4,7 @@
    DatasetDeserializer,
    DatasetDeserializerFactory,
)
from .entrypoints import ShortPromptStrategy, process_dataset
from .loaders import DataLoader, DatasetsIterator
from .preprocessors import (
    DataDependentPreprocessor,
@@ -27,4 +28,6 @@
    "PreprocessorRegistry",
    "ProcessorFactory",
    "RequestFormatter",
    "ShortPromptStrategy",
    "process_dataset",
]