GuideLLM provides a preprocessing command that allows you to process datasets to have specific prompt and output token sizes. This is particularly useful when you need to standardize your dataset for benchmarking or when your dataset has prompts that don't match your target token requirements.
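
Based on the arguments documented on this page, the command's general shape is as follows (a sketch, not verbatim CLI help output):

```bash
guidellm preprocess dataset DATA OUTPUT_PATH \
  --processor <tokenizer-name-or-path> \
  --config <token-size-config> \
  [OPTIONS]
```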
The preprocessing command can:
- Resize prompts to target token lengths
- Handle prompts that are shorter or longer than the target length using various strategies
- Map columns from your dataset to GuideLLM's expected column names
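
To make the resizing behavior concrete, here is a rough pure-Python sketch of the trim/pad logic. It is not GuideLLM's actual implementation: the strategy names are illustrative, and word-level "tokens" stand in for real tokens (GuideLLM counts tokens with the tokenizer given by `--processor`).

```python
# Illustrative sketch only -- not GuideLLM's implementation.
def resize_prompt(tokens: list[str], target: int, short_strategy: str = "pad") -> list[str]:
    """Trim a too-long prompt; apply a strategy to a too-short one."""
    if len(tokens) >= target:
        # Longer than (or equal to) target: truncate to the target length.
        return tokens[:target]
    if short_strategy == "pad":
        # Repeat the prompt until it reaches the target length.
        padded = tokens * (target // len(tokens) + 1)
        return padded[:target]
    if short_strategy == "ignore":
        # Keep the short prompt as-is.
        return tokens
    raise ValueError(f"unknown strategy: {short_strategy}")

print(resize_prompt(["a", "b", "c", "d"], 2))  # → ['a', 'b']
print(resize_prompt(["a", "b"], 5))            # → ['a', 'b', 'a', 'b', 'a']
```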

| `DATA` | Path to the input dataset or Hugging Face dataset ID. Supports all dataset formats documented in the [Dataset Configurations](../datasets.md). |
| `OUTPUT_PATH` | Path to save the processed dataset, including file suffix (e.g., `processed_dataset.jsonl`, `output.csv`). |
| `--processor` | **Required.** Processor or tokenizer name/path for calculating token counts. Can be a Hugging Face model ID or local path. |
| `--config` | **Required.** Configuration specifying target token sizes. Can be a JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config). |

### Example
2. **You have multiple datasets** and need to specify which dataset's columns to use
3. **Your dataset has system prompts or prefixes** in a separate column

**Column mapping format:** The `--data-column-mapper` accepts a JSON string mapping column types to column names:

```json
{
  "text_column": "my_prompt",
  "prefix_column": "my_system_prompt"
}
```
**Supported column types:**
- `text_column`: The main prompt text (defaults: `prompt`, `instruction`, `question`, `input`, `context`, `content`, `text`)
- `prefix_column`: System prompt or prefix (defaults: `system_prompt`, `system`, `prefix`)
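
Because the mapping is plain JSON, it can also be composed programmatically. A small sketch (the source-dataset column names here are hypothetical):

```python
import json

# Hypothetical source-dataset column names mapped to GuideLLM column types.
mapping = {
    "text_column": "user_question",     # main prompt text
    "prefix_column": "system_message",  # system prompt / prefix
}
cli_arg = json.dumps(mapping)
# Pass the string as: --data-column-mapper "$cli_arg" (with shell quoting).
print(cli_arg)
```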

| `--data-args <JSON>` | JSON string of arguments to pass to dataset loading. See [Data Arguments Overview](../datasets.md#data-arguments-overview) for details. |
| `--prefix-tokens <NUMBER>` | Single prefix token count (alternative to `prefix_tokens` in config). |
| `--include-prefix-in-token-count` | Include prefix tokens in prompt token count calculation (flag). When enabled, prefix trimming is disabled and the prefix is kept as-is. |
| `--random-seed <NUMBER>` | Random seed for reproducible token sampling (default: 42). |
| `--push-to-hub` | Push the processed dataset to Hugging Face Hub (flag). |
| `--hub-dataset-id <ID>` | Hugging Face Hub dataset ID for upload (required if `--push-to-hub` is set). |

### Complete Examples
**Example 1: Basic preprocessing with custom column names**

```bash
# Option values below are illustrative; substitute your own.
guidellm preprocess dataset \
  "my_dataset.csv" \
  "processed_dataset.jsonl" \
  --processor "gpt2" \
  --config '{"prompt_tokens": 512, "output_tokens": 256}' \
  --data-column-mapper '{"text_column": "my_prompt", "prefix_column": "my_system_prompt"}'
```

**Example 2: Preprocessing with distribution and short prompt handling**

```bash
# Option values below are illustrative; the target token-size
# distribution and short-prompt handling settings live in config.yaml.
guidellm preprocess dataset \
  "dataset.jsonl" \
  "processed_dataset.jsonl" \
  --processor "gpt2" \
  --config "config.yaml" \
  --random-seed 42
```

**Example 3: Preprocessing with processor arguments and prefix tokens**

```bash
# Option values below are illustrative; processor-specific arguments
# from the original example are elided.
guidellm preprocess dataset \
  "dataset.jsonl" \
  "processed_dataset.jsonl" \
  --processor "gpt2" \
  --config '{"prompt_tokens": 512, "output_tokens": 256}' \
  --prefix-tokens 32
```

**Example 4: Preprocessing and uploading to Hugging Face Hub**
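
The original command body is not shown here; the following is a sketch using the documented `--push-to-hub` and `--hub-dataset-id` options, with all values as placeholders:

```bash
guidellm preprocess dataset \
  "dataset.jsonl" \
  "processed_dataset.jsonl" \
  --processor "gpt2" \
  --config '{"prompt_tokens": 512, "output_tokens": 256}' \
  --push-to-hub \
  --hub-dataset-id "your-username/processed-dataset"
```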