Commit f1857e0

Format docs

Signed-off-by: Jared O'Connell <[email protected]>

1 parent 08d9419 commit f1857e0

File tree

1 file changed: +29 -21 lines changed
docs/guides/datasets.md

Lines changed: 29 additions & 21 deletions
@@ -218,6 +218,7 @@ benchmark_generative_text(data=data, ...)
 GuideLLM provides a preprocessing command that allows you to process datasets to have specific prompt and output token sizes. This is particularly useful when you need to standardize your dataset for benchmarking or when your dataset has prompts that don't match your target token requirements.
 
 The preprocessing command can:
+
 - Resize prompts to target token lengths
 - Handle prompts that are shorter or longer than the target length using various strategies
 - Map columns from your dataset to GuideLLM's expected column names
@@ -236,12 +237,12 @@ guidellm preprocess dataset \
 
 ### Required Arguments
 
-| Argument | Description |
-| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------- |
-| `DATA` | Path to the input dataset or Hugging Face dataset ID. Supports all dataset formats documented in the [Dataset Configurations](../datasets.md). |
-| `OUTPUT_PATH` | Path to save the processed dataset, including file suffix (e.g., `processed_dataset.jsonl`, `output.csv`). |
-| `--processor` | **Required.** Processor or tokenizer name/path for calculating token counts. Can be a Hugging Face model ID or local path. |
-| `--config` | **Required.** Configuration specifying target token sizes. Can be a JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config). |
+| Argument      | Description |
+| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
+| `DATA`        | Path to the input dataset or Hugging Face dataset ID. Supports all dataset formats documented in the [Dataset Configurations](../datasets.md). |
+| `OUTPUT_PATH` | Path to save the processed dataset, including file suffix (e.g., `processed_dataset.jsonl`, `output.csv`). |
+| `--processor` | **Required.** Processor or tokenizer name/path for calculating token counts. Can be a Hugging Face model ID or local path. |
+| `--config`    | **Required.** Configuration specifying target token sizes. Can be a JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config). |
 
 ### Example
 
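The `--config` value itself is not shown in this hunk. As a hedged sketch only, a JSON config targeting fixed token sizes might look like the following; the key names `prompt_tokens` and `output_tokens` are assumptions based on GuideLLM's usual dataset parameter names and should be verified against the full guide:

```json
{
  "prompt_tokens": 512,
  "output_tokens": 256
}
```

Per the table above, the same settings could alternatively be passed as key=value pairs or a `.yaml`/`.config` file path.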
@@ -267,8 +268,7 @@ When your dataset uses non-standard column names, you can use `--data-column-map
 2. **You have multiple datasets** and need to specify which dataset's columns to use
 3. **Your dataset has system prompts or prefixes** in a separate column
 
-**Column mapping format:**
-The `--data-column-mapper` accepts a JSON string mapping column types to column names:
+**Column mapping format:** The `--data-column-mapper` accepts a JSON string mapping column types to column names:
 
 ```json
 {
@@ -280,6 +280,7 @@ The `--data-column-mapper` accepts a JSON string mapping column types to column
 ```
 
 **Supported column types:**
+
 - `text_column`: The main prompt text (defaults: `prompt`, `instruction`, `question`, `input`, `context`, `content`, `text`)
 - `prefix_column`: System prompt or prefix (defaults: `system_prompt`, `system`, `prefix`)
 - `prompt_tokens_count_column`: Column containing prompt token counts (defaults: `prompt_tokens_count`, `input_tokens_count`)
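Tying the column types above to concrete names, a minimal `--data-column-mapper` value for a dataset whose columns are `user_query` and `system_message` (the CSV example used elsewhere in this guide) would be a sketch like:

```json
{
  "text_column": "user_query",
  "prefix_column": "system_message"
}
```

Any column type omitted from the mapping falls back to the default names listed above.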
@@ -299,6 +300,7 @@ user_query,system_message
 ```
 
 You would use:
+
 ```bash
 guidellm preprocess dataset \
     "dataset.csv" \
@@ -320,14 +322,15 @@ If you're working with multiple datasets and need to specify which dataset's col
 
 When prompts are shorter than the target token length, you can specify how to handle them using `--short-prompt-strategy`:
 
-| Strategy | Description |
-| ------------- | ---------------------------------------------------------------------------------------------- |
-| `ignore` | Skip prompts that are shorter than the target length (default) |
-| `concatenate` | Concatenate multiple short prompts together until the target length is reached |
-| `pad` | Pad short prompts with a specified character to reach the target length |
-| `error` | Raise an error if a prompt is shorter than the target length |
+| Strategy      | Description                                                                    |
+| ------------- | ------------------------------------------------------------------------------ |
+| `ignore`      | Skip prompts that are shorter than the target length (default)                 |
+| `concatenate` | Concatenate multiple short prompts together until the target length is reached |
+| `pad`         | Pad short prompts with a specified character to reach the target length        |
+| `error`       | Raise an error if a prompt is shorter than the target length                   |
 
 **Example: Concatenating short prompts**
+
 ```bash
 guidellm preprocess dataset \
     "dataset.jsonl" \
@@ -339,6 +342,7 @@ guidellm preprocess dataset \
 ```
 
 **Example: Padding short prompts**
+
 ```bash
 guidellm preprocess dataset \
     "dataset.jsonl" \
@@ -351,18 +355,19 @@ guidellm preprocess dataset \
 
 ### Additional Options
 
-| Option | Description |
-| -------------------------------- |-----------------------------------------------------------------------------------------------------------------------------------------|
-| `--data-args <JSON>` | JSON string of arguments to pass to dataset loading. See [Data Arguments Overview](../datasets.md#data-arguments-overview) for details. |
-| `--prefix-tokens <NUMBER>` | Single prefix token count (alternative to `prefix_tokens` in config). |
+| Option                            | Description |
+| --------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
+| `--data-args <JSON>`              | JSON string of arguments to pass to dataset loading. See [Data Arguments Overview](../datasets.md#data-arguments-overview) for details. |
+| `--prefix-tokens <NUMBER>`        | Single prefix token count (alternative to `prefix_tokens` in config). |
 | `--include-prefix-in-token-count` | Include prefix tokens in prompt token count calculation (flag). When enabled, prefix trimming is disabled and the prefix is kept as-is. |
-| `--random-seed <NUMBER>` | Random seed for reproducible token sampling (default: 42). |
-| `--push-to-hub` | Push the processed dataset to Hugging Face Hub (flag). |
-| `--hub-dataset-id <ID>` | Hugging Face Hub dataset ID for upload (required if `--push-to-hub` is set). |
+| `--random-seed <NUMBER>`          | Random seed for reproducible token sampling (default: 42). |
+| `--push-to-hub`                   | Push the processed dataset to Hugging Face Hub (flag). |
+| `--hub-dataset-id <ID>`           | Hugging Face Hub dataset ID for upload (required if `--push-to-hub` is set). |
 
 ### Complete Examples
 
 **Example 1: Basic preprocessing with custom column names**
+
 ```bash
 guidellm preprocess dataset \
     "my_dataset.csv" \
@@ -373,6 +378,7 @@ guidellm preprocess dataset \
 ```
 
 **Example 2: Preprocessing with distribution and short prompt handling**
+
 ```bash
 guidellm preprocess dataset \
     "dataset.jsonl" \
@@ -385,6 +391,7 @@ guidellm preprocess dataset \
 ```
 
 **Example 3: Preprocessing with processor arguments and prefix tokens**
+
 ```bash
 guidellm preprocess dataset \
     "dataset.jsonl" \
@@ -397,6 +404,7 @@ guidellm preprocess dataset \
 ```
 
 **Example 4: Preprocessing and uploading to Hugging Face Hub**
+
 ```bash
 guidellm preprocess dataset \
     "my_dataset.jsonl" \
