
Commit e84afe8

Dataset preprocessing scripts for AutoPDL (#994)
* Data processing scripts
* Bump datasets version
* Address typing/pyright issues
* Add funcy dependency
* Fix impossible version constraint
* pyright ignores
* pyright ignore
* Add evalplus dep to examples
* Improve docs
* Improve docs & lint
* Update doc

Signed-off-by: Claudio Spiess <[email protected]>
1 parent 85ec7dd commit e84afe8

File tree

13 files changed: +890 −67 lines changed

docs/README.md

Lines changed: 9 additions & 9 deletions

(The paired deletion/addition lines in this file display identical text; the changes appear to be whitespace-only cleanup.)
@@ -50,13 +50,13 @@ pip install 'prompt-declaration-language[examples]'
 
 The Live Explorer can be installed as follows (MacOS):
 ```
-brew install pdl
+brew install pdl
 ```
 
 For other platforms, see installation notes.
 
 You can run PDL with LLM models in local using [Ollama](https://ollama.com), or other cloud service.
-See [here](https://ibm.github.io/prompt-declaration-language/tutorial/#using-ollama-models) for
+See [here](https://ibm.github.io/prompt-declaration-language/tutorial/#using-ollama-models) for
 instructions on how to install an Ollama model locally.
 
 Most examples in this repository use IBM Granite models on [Ollama](https://ollama.com) and some are on [Replicate](https://replicate.com/). In order to run these examples, you need to create a free account
@@ -172,7 +172,7 @@ text:
     temperature: 0
 ```
 
-Notice the syntactic differences. Model ids on watsonx start with `watsonx`.
+Notice the syntactic differences. Model ids on watsonx start with `watsonx`.
 
 Watsonx also provides a text completion endpoint as shown in the following example. A text completion endpoint does not take chat
 templates into account:
@@ -266,10 +266,10 @@ When we execute this program with the PDL interpreter, we obtain the following t
 @SuppressWarnings("unchecked")
 public static Map<String, String> deserializeOffsetMap(String lastSourceOffset) throws IOException {
     Map<String, String> offsetMap;
-    if (lastSourceOffset == null || lastSourceOffset.isEmpty()) {
-        offsetMap = new HashMap<>();
+    if (lastSourceOffset == null || lastSourceOffset.isEmpty()) {
+        offsetMap = new HashMap<>();
     } else {
-        offsetMap = JSON_MAPPER.readValue(lastSourceOffset, Map.class);
+        offsetMap = JSON_MAPPER.readValue(lastSourceOffset, Map.class);
     }
     return offsetMap;
 }
@@ -293,10 +293,10 @@ When we execute this new program, we obtain the following:
 @SuppressWarnings("unchecked")
 public static Map<String, String> deserializeOffsetMap(String lastSourceOffset) throws IOException {
     Map<String, String> offsetMap;
-    if (lastSourceOffset == null || lastSourceOffset.isEmpty()) {
-        offsetMap = new HashMap<>();
+    if (lastSourceOffset == null || lastSourceOffset.isEmpty()) {
+        offsetMap = new HashMap<>();
     } else {
-        offsetMap = JSON_MAPPER.readValue(lastSourceOffset, Map.class);
+        offsetMap = JSON_MAPPER.readValue(lastSourceOffset, Map.class);
     }
     return offsetMap;
 }

docs/autopdl.md

Lines changed: 110 additions & 36 deletions
@@ -7,7 +7,15 @@ hide:
 
 # AutoPDL Tutorial
 
-The following sections show how to use the AutoPDL optimizer to produce optimized PDL programs for specific tasks.
+The following sections show how to use the AutoPDL optimizer, introduced by [Spiess et al. (2025)](https://openreview.net/forum?id=CAeISyE3aR) in "AutoPDL: Automatic Prompt Optimization for LLM Agents" ([arXiv](https://arxiv.org/abs/2504.04365)), to produce optimized PDL programs for specific tasks. Please ensure PDL was installed with the extras, e.g.:
+
+``` { .bash .copy .annotate linenums="1" }
+pip install 'prompt-declaration-language[all]'
+# or from source
+git clone git@github.com:IBM/prompt-declaration-language.git
+cd prompt-declaration-language
+pip install -e '.[all]'
+```
 
 To optimize a PDL program, we need the program, an optimizer configuration, a dataset, and an _evaluator_. An evaluator is a Python subclass of `OptimizerEvaluator` that evaluates a candidate, which is a generated configuration instance consisting of e.g. fewshot examples. The evaluator class follows this structure:
 
@@ -52,41 +60,15 @@ class OptimizerEvaluator(Thread):
 
 Let's go through an example for `GSM8K`. Our PDL program uses different prompt patterns from the prompt library, and the variables `prompt_pattern`, `question`, `model`, and `demonstrations` are inserted at runtime by the evaluator.
 
-
 ```yaml title="examples/optimizer/gsm8k.pdl" linenums="1"
 --8<-- "./examples/optimizer/gsm8k.pdl"
 ```
 
-We write a configuration file for the optimizer, see `src/pdl/optimize/config_parser.py` for all fields:
-
-``` { .yaml .copy .annotate title="gsm8k_optimizer_config.yml" linenums="1" }
-benchmark: gsm8k # Name our benchmark
-budget: null # Set a budget, can be number of iterations, or a duration string e.g. "2h"
-budget_growth: double # double validation set size each iteration
-# or to_max: reach max_test_set_size by final iteration
-initial_test_set_size: 2 # size of test set in first iteration
-max_test_set_size: 10 # maximum test set size
-num_candidates: 100 # how many candidates to evaluate
-num_demonstrations: 5 # how many demonstrations to include per candidate
-parallelism: 1 # how many threads to run evaluations across
-shuffle_test: false # shuffling of test set
-test_set_name: test # name of test set
-train_set_name: train # name of train set
-validation_set_name: validation # name of validation set
-demonstrations_variable_name: demonstrations # variable name to insert demonstrations into
-variables: # define discrete options to sample from
-  model: # set ${ model } variable
-    - watsonx/meta-llama/llama-3-1-8b-instruct
-  prompt_pattern: # set ${ prompt_pattern } variable to one of these
-    - cot
-    - react
-    - rewoo
-  num_demonstrations: # overrides num demonstrations above
-    - 0
-    - 3
-    - 5
-```
+We write a configuration file for the optimizer and save it as `gsm8k_optimizer_config.yml`. See `src/pdl/optimize/config_parser.py` for all fields. Please note that this example uses the `watsonx` inference service, so an API key is required, although you can also use a local model or any other inference service.
 
+``` { .yaml .copy .annotate title="examples/optimizer/gsm8k_optimizer_config.yml" linenums="1" }
+--8<-- "./examples/optimizer/gsm8k_optimizer_config.yml"
+```
 
 ```python title="examples/optimizer/gsm8k_evaluator.py" linenums="1"
 --8<-- "./examples/optimizer/gsm8k_evaluator.py"
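
To make the evaluator contract above concrete, here is a minimal sketch of the shape such a subclass can take. Everything below except the `Thread` base class and the dataset columns is an illustrative assumption, not the library's exact API:

```python
from threading import Thread


class Gsm8kEvaluatorSketch(Thread):
    """Illustrative only: the real base class is OptimizerEvaluator."""

    def __init__(self, candidate: dict, example: dict):
        super().__init__()
        # candidate: one sampled configuration, e.g.
        # {"model": "...", "prompt_pattern": "cot", "demonstrations": [...]}
        self.candidate = candidate
        # example: one dataset row with "question", "reasoning", "answer"
        self.example = example
        self.correct: bool | None = None

    def run(self) -> None:
        # The real evaluator renders gsm8k.pdl with the candidate's variables
        # injected, runs the PDL interpreter, and scores the extracted answer.
        predicted = self.run_candidate(self.example["question"])  # hypothetical helper
        self.correct = predicted.strip() == self.example["answer"].strip()

    def run_candidate(self, question: str) -> str:
        raise NotImplementedError("execute gsm8k.pdl with candidate variables")
```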
@@ -95,20 +77,112 @@ variables: # define discrete options to sample from
 We can see an example of a script to run the optimization process in `examples/optimizer/optimize.py`.
 Usage:
 
-```
+```text
 python optimize.py optimize -h
 usage: optimize.py optimize [-h] --config CONFIG --dataset-path DATASET_PATH [--experiments-path EXPERIMENTS_PATH]
                             [--yield_output | --no-yield_output] [--dry | --no-dry]
                             pdl_file
 ```
 
-We also need a dataset to optimize against, with `train`, `test`, and `validation` splits. To produce such a dataset, we can use HuggingFace Datasets `load_dataset` and `save_to_disk`. This example requires the dataset to have columns `question`, `reasoning`, and `answer`, which can be created from the original `openai/gsm8k` dataset. Processing scripts are under development and will follow shortly.
+We also need a dataset to optimize against, with `train`, `test`, and `validation` splits. To produce such a dataset, we can use HuggingFace Datasets `load_dataset` and `save_to_disk`. This example requires the dataset to have columns `question`, `reasoning`, and `answer`, which can be created from the original `openai/gsm8k` dataset.
+
+We provide three scripts in `examples/optimizer` to create datasets, including the rule-based agentic trajectories: `process_gsm8k.py`, `process_fever.py`, and `process_mbpp.py`. They load the original datasets, process them, and save them to disk in the required format. Dataset-specific instructions may be found in the respective script files. Note that the scripts create a folder named `var` in the current directory, which contains the processed dataset in a format the optimizer can use; they should therefore be run from the root of the PDL repository. A simplified sketch of the core GSM8K transformation follows.
 
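
The row counts reported below suggest the validation split is carved out of the original GSM8K train split (6,449 + 1,024 = 7,473 rows). Here is a minimal sketch of that core transformation, assuming only that GSM8K separates the reasoning from the final answer with a `####` marker; the seed, output path, and exact column mapping are assumptions, and the real script additionally builds the trajectory columns:

```python
from datasets import DatasetDict, load_dataset


def split_answer(row: dict) -> dict:
    # GSM8K stores "reasoning #### final answer" in a single field.
    reasoning, _, final = row["answer"].partition("####")
    return {
        "reasoning": reasoning.strip(),
        "raw_answer": row["answer"],
        "answer_part": final.strip(),
    }


gsm8k = load_dataset("openai/gsm8k", "main")             # original train/test
carved = gsm8k["train"].train_test_split(test_size=1024, seed=42)
DatasetDict(
    {
        "train": carved["train"].map(split_answer),      # 6,449 rows
        "validation": carved["test"].map(split_answer),  # 1,024 rows
        "test": gsm8k["test"].map(split_answer),         # 1,319 rows
    }
).save_to_disk("var/gsm8k_sketch")                       # hypothetical path
```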
-We can run an example like so:
+Let's run the GSM8K dataset processing script:
+
+``` { .bash .copy .annotate linenums="1" }
+python examples/optimizer/process_gsm8k.py
+```
 
+Which should save the processed dataset in `var/gsm8k_trajectified` and output something like:
+
+```text
+Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████| 6449/6449 [00:00<00:00, 557195.73 examples/s]
+Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████| 1319/1319 [00:00<00:00, 363559.64 examples/s]
+Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████| 1024/1024 [00:00<00:00, 271472.56 examples/s]
+Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6449/6449 [00:00<00:00, 71242.31 examples/s]
+Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:00<00:00, 68826.30 examples/s]
+Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6449/6449 [00:00<00:00, 22520.85 examples/s]
+Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6449/6449 [00:00<00:00, 18186.53 examples/s]
+Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████| 6449/6449 [00:00<00:00, 698328.77 examples/s]
+Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████| 1319/1319 [00:00<00:00, 232468.57 examples/s]
+Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████| 1024/1024 [00:00<00:00, 413375.10 examples/s]
+DatasetDict({
+    train: Dataset({
+        features: ['question', 'answer', 'reasoning', 'raw_answer', 'answer_part', 'traj_keys', 'traj_values', 'rewoo_traj_keys', 'rewoo_traj_values'],
+        num_rows: 6449
+    })
+    test: Dataset({
+        features: ['question', 'answer', 'reasoning', 'raw_answer', 'answer_part'],
+        num_rows: 1319
+    })
+    validation: Dataset({
+        features: ['question', 'answer', 'reasoning', 'raw_answer', 'answer_part'],
+        num_rows: 1024
+    })
+})
 ```
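
Assuming the script finished, the processed dataset can be spot-checked with the same `datasets` API the optimizer itself uses (note the `load_from_disk` import changes further down):

```python
from datasets.load import load_from_disk

# Spot-check the output of process_gsm8k.py before running the optimizer.
ds = load_from_disk("var/gsm8k_trajectified")
print(ds)  # splits and row counts, as in the DatasetDict shown above
row = ds["train"][0]
print(row["question"])   # original GSM8K question
print(row["reasoning"])  # chain-of-thought portion of the gold answer
print(row["answer"])     # final answer the evaluator compares against
```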
+
+Finally, we can run the example like so:
+
+``` { .bash .copy .annotate linenums="1" }
 cd examples/optimizer
-python optimize.py optimize --config config.yml --dataset-path datasets/gsm8k gsm8k.pdl
+python optimize.py optimize --config gsm8k_optimizer_config.yml --dataset-path ../../var/gsm8k_trajectified gsm8k.pdl
+```
+
+This will report details about the optimization process, such as the number of candidates evaluated. The output will look something like this:
+
+```text
+                                PDL Optimizer                                    pdl_optimizer.py:336
+┌──────────────────────────────┬─────────────────────────────────────────────┐
+│ Config combinations          │ 9                                           │
+│ Max candidates               │ 100                                         │
+│ Num. candidates              │ 100                                         │
+│ Starting validation set size │ 2                                           │
+│ Max validation set size      │ 10                                          │
+│ Num. iterations              │ 7                                           │
+│ Total evaluations            │ 1,200                                       │
+│ Num. threads                 │ 1                                           │
+│ Validation set multiplier    │ 2                                           │
+│ Shuffle validation set       │ False                                       │
+│ Budget policy                │ None                                        │
+├──────────────────────────────┼─────────────────────────────────────────────┤
+│ model                        │ ['watsonx/meta-llama/llama-3-2-3b-instruct… │
+│ prompt_pattern               │ ['cot', 'react', 'rewoo']                   │
+│ num_demonstrations           │ [0, 3, 5]                                   │
+└──────────────────────────────┴─────────────────────────────────────────────┘
+                                  Iteration                                      pdl_optimizer.py:419
+┌─────────────────────┬─────┐
+│ Index               │ 0   │
+│ Validation set size │ 2   │
+│ Num. candidates     │ 100 │
+└─────────────────────┴─────┘
+                                  Evaluation                                     pdl_optimizer.py:601
+┌────────────────────────┬──────────────────────────────────────────┐
+│ Test set size          │ 2                                        │
+├────────────────────────┼──────────────────────────────────────────┤
+│ model                  │ watsonx/meta-llama/llama-3-2-3b-instruct │
+│ prompt_pattern         │ cot                                      │
+│ num_demonstrations     │ 0                                        │
+│ uuid                   │ enl0ertp                                 │
+│ demonstrations_indices │ 0                                        │
+│ demonstrations         │ 0                                        │
+└────────────────────────┴──────────────────────────────────────────┘
+Running without parallelism                                                              util.py:74
+  0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1,200 [ 0:00:01 < -:--:-- , ? it/s ]
 ```
 
-Once the process is complete, a file `optimized_gsm8k.pdl` is written. This file contains the optimal configuration and is directly executable by the standard PDL interpreter.
+Note that it is not unusual to observe PDL exceptions during the optimization process:
+
+```text
+[15:44:14] Type errors during spec checking:
+           ../../contrib/prompt_library/ReAct.pdl:0 - should be an object
+           ../../contrib/prompt_library/ReAct.pdl:0 - Type errors during spec checking:
+           ../../contrib/prompt_library/ReAct.pdl:0 - should be an object
+Retrying: False
+Runtime FAILED and took seconds: 10.21
+```
+
+Such exceptions, here for example in `ReAct.pdl`, are caused by the _typed_ model call in `ReAct.pdl:98`. If the model output does not parse as JSON matching the expected type `{ name: string, arguments: object }`, the PDL interpreter raises an exception.
+
+Once the process is complete, a file `optimized_gsm8k.pdl` is written in the same directory as the source PDL file. This file contains the optimal configuration and is directly executable by the standard PDL interpreter. A log of the optimization process is written to `experiments/` by default.
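
Since `optimized_gsm8k.pdl` is a plain PDL program, it can then be run with the standard interpreter; a minimal sketch, assuming the `pdl` command installed earlier is on the `PATH`:

```python
import subprocess

# Equivalent to running `pdl optimized_gsm8k.pdl` from a shell in
# examples/optimizer, where the optimized program was written.
subprocess.run(["pdl", "optimized_gsm8k.pdl"], check=True, cwd="examples/optimizer")
```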
examples/optimizer/gsm8k_optimizer_config.yml

Lines changed: 25 additions & 0 deletions

@@ -0,0 +1,25 @@
+benchmark: gsm8k # Name our benchmark
+budget: null # Set a budget, can be number of iterations, or a duration string e.g. "2h"
+budget_growth: double # double validation set size each iteration
+# or to_max: reach max_test_set_size by final iteration
+initial_test_set_size: 2 # size of test set in first iteration
+max_test_set_size: 10 # maximum test set size
+num_candidates: 100 # how many candidates to evaluate
+num_demonstrations: 5 # how many demonstrations to include per candidate
+parallelism: 1 # how many threads to run evaluations across
+shuffle_test: false # shuffling of test set
+test_set_name: test # name of test set
+train_set_name: train # name of train set
+validation_set_name: validation # name of validation set
+demonstrations_variable_name: demonstrations # variable name to insert demonstrations into
+variables: # define discrete options to sample from
+  model: # set ${ model } variable
+    - watsonx/meta-llama/llama-3-2-3b-instruct
+  prompt_pattern: # set ${ prompt_pattern } variable to one of these
+    - cot
+    - react
+    - rewoo
+  num_demonstrations: # overrides num demonstrations above
+    - 0
+    - 3
+    - 5
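
The `variables` block above defines the discrete search space the optimizer samples candidates from; enumerating its cross product explains the 9 config combinations reported in the run log (1 model × 3 prompt patterns × 3 demonstration counts). A quick sketch, assuming only PyYAML:

```python
from itertools import product

import yaml

# Enumerate the candidate space defined by the config's `variables` block.
with open("examples/optimizer/gsm8k_optimizer_config.yml") as f:
    config = yaml.safe_load(f)

variables = config["variables"]
combos = [dict(zip(variables, values)) for values in product(*variables.values())]
print(len(combos))  # 9 = 1 model * 3 prompt_pattern * 3 num_demonstrations
print(combos[0])    # e.g. {'model': '...', 'prompt_pattern': 'cot', 'num_demonstrations': 0}
```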

examples/optimizer/mbpp_dataset.py

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 
 from copy import deepcopy
 
-from datasets import load_from_disk
+from datasets.load import load_from_disk
 from evalplus.data import get_mbpp_plus, get_mbpp_plus_hash
 from evalplus.evaluate import MBPP_OUTPUT_NOT_NONE_TASKS, get_groundtruth
 
examples/optimizer/optimize.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 from typing import Any
 
 import yaml
-from datasets import load_from_disk
+from datasets.load import load_from_disk
 from fever_evaluator import FEVEREvaluator
 from gsm8k_evaluator import Gsm8kEvaluator
 from gsmhard_evaluator import GsmHardEvaluator
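
Both import spellings resolve to the same function at runtime; importing from the `datasets.load` submodule is presumably what satisfies the pyright checks mentioned in the commit message. A quick equivalence check:

```python
import datasets
from datasets.load import load_from_disk

# The top-level name is a re-export of the same submodule function.
assert datasets.load_from_disk is load_from_disk
```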
