Commit 83f24f9

Support generating reliability test programs and inputs (#1802)
* Commit
* fix
* fix
* fix
* fix
* fix

Signed-off-by: dbczumar <[email protected]>
1 parent 1f93bff commit 83f24f9

File tree

8 files changed: +888 −29 lines


tests/reliability/README.md

Lines changed: 25 additions & 0 deletions
@@ -47,6 +47,31 @@ Each test in this directory executes a DSPy program using various LLMs. By runni

This will execute all tests for the configured models and display detailed results for each model configuration. Tests are set up to mark expected failures for known challenging cases where a specific model might struggle, while actual (unexpected) DSPy reliability issues are flagged as failures (see below).
#### Running specific generated tests

You can run specific generated tests by using the `-k` flag with `pytest`. For example, to test the generated program located at `tests/reliability/complex_types/generated/test_nesting_1` against generated test input `input1.json`, you can run the following command from this directory:

```bash
pytest test_generated.py -k "test_nesting_1-input1"
```
### Test generation

You can generate test DSPy programs and test inputs from text descriptions using the `tests.reliability.generate` CLI, or the `tests.reliability.generate.generate_test_cases` API. For example, to generate a test classification program and 3 challenging test inputs in the `tests/reliability/classification/generated` directory, you can run the following command from the DSPy repository root directory:

```bash
python \
    -m tests.reliability.generate \
    -d tests/reliability/classification/generated/test_example \
    -p "Generate a program that performs a classification task involving objects with multiple properties. The task should be realistic" \
    -i "Based on the program description, generate a challenging example" \
    -n 3
```
The test program will be written to `tests/reliability/classification/generated/test_example/program.py`, and the test inputs will be written as JSON files to the `tests/reliability/classification/generated/test_example/inputs/` directory.

All generated tests should be located in directories with the structure `tests/reliability/<test_type>/generated/<test_name>`, where `<test_type>` is the type of test (e.g., `classification`, `complex_types`, `chat`, etc.), and `<test_name>` is a descriptive name for the test.
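The layout convention above can be made concrete with a small helper. This function is not part of the commit; it is a hypothetical illustration of the `tests/reliability/<test_type>/generated/<test_name>` structure the README describes:

```python
from pathlib import PurePosixPath

def follows_convention(path: str) -> bool:
    # Hypothetical checker (not in the commit): a generated-test directory
    # must look like tests/reliability/<test_type>/generated/<test_name>.
    parts = PurePosixPath(path).parts
    return (
        len(parts) == 5
        and parts[0] == "tests"
        and parts[1] == "reliability"
        and parts[3] == "generated"
    )

print(follows_convention("tests/reliability/classification/generated/test_example"))  # True
print(follows_convention("tests/reliability/classification/test_example"))            # False
```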
### Known Failing Models

Some tests may be expected to fail with certain models, especially in challenging cases. These known failures are logged but do not affect the overall test result. This setup allows us to keep track of model-specific limitations without obstructing general test outcomes. Models that are known to fail a particular test case are specified using the `@known_failing_models` decorator. For example:
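The README's example is outside the changed lines and is not captured in this diff. Based on the prose, a plausible sketch of the decorator's usage follows; `known_failing_models` here is a stand-in implementation with the signature the text implies (a list of model names), and the model name is illustrative:

```python
def known_failing_models(models):
    # Stand-in for the real decorator in tests/reliability (not shown in this
    # diff): it records which models are expected to fail the decorated test.
    def decorator(test_fn):
        test_fn._known_failing_models = models
        return test_fn
    return decorator

@known_failing_models(["llama-3.2-3b-instruct"])  # hypothetical model name
def test_nested_output():
    ...

print(test_nested_output._known_failing_models)
```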

tests/reliability/conftest.py

Lines changed: 5 additions & 9 deletions
```diff
@@ -4,7 +4,7 @@
 import dspy
 from tests.conftest import clear_settings
-from tests.reliability.utils import parse_reliability_conf_yaml
+from tests.reliability.utils import get_adapter, parse_reliability_conf_yaml

 # Standard list of models that should be used for periodic DSPy reliability testing
 MODEL_LIST = [
@@ -46,13 +46,7 @@ def configure_model(request):
     module_dir = os.path.dirname(os.path.abspath(__file__))
     conf_path = os.path.join(module_dir, "reliability_conf.yaml")
     reliability_conf = parse_reliability_conf_yaml(conf_path)
-
-    if reliability_conf.adapter.lower() == "chat":
-        adapter = dspy.ChatAdapter()
-    elif reliability_conf.adapter.lower() == "json":
-        adapter = dspy.JSONAdapter()
-    else:
-        raise ValueError(f"Unknown adapter specification '{adapter}' in reliability_conf.yaml")
+    adapter = get_adapter(reliability_conf)

     model_name, should_ignore_failure = request.param
     model_params = reliability_conf.models.get(model_name)
@@ -61,7 +55,9 @@ def configure_model(request):
         dspy.configure(lm=lm, adapter=adapter)
     else:
         pytest.skip(
-            f"Skipping test because no reliability testing YAML configuration was found" f" for model {model_name}."
+            f"Skipping test because no reliability testing YAML configuration was found"
+            f" for model {model_name}, or the YAML configuration is missing LiteLLM parameters"
+            f" for this model ('litellm_params' section of conf file is missing)."
         )

     # Store `should_ignore_failure` flag on the request node for use in post-test handling
```
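The diff moves the inline adapter selection into `get_adapter` in `tests/reliability/utils.py`, whose body is not shown here. Judging from the removed lines, it plausibly looks like the sketch below; `ChatAdapter` and `JSONAdapter` are stand-in classes for `dspy.ChatAdapter`/`dspy.JSONAdapter`, and `ReliabilityConf` is a minimal stub:

```python
class ChatAdapter:     # stand-in for dspy.ChatAdapter
    pass

class JSONAdapter:     # stand-in for dspy.JSONAdapter
    pass

class ReliabilityConf: # minimal stub of the parsed YAML config
    def __init__(self, adapter: str):
        self.adapter = adapter

def get_adapter(reliability_conf):
    # Sketch reconstructed from the removed branch logic: the adapter name in
    # reliability_conf.yaml selects the adapter, case-insensitively.
    adapter_name = reliability_conf.adapter.lower()
    if adapter_name == "chat":
        return ChatAdapter()
    if adapter_name == "json":
        return JSONAdapter()
    raise ValueError(f"Unknown adapter specification '{adapter_name}' in reliability_conf.yaml")

print(type(get_adapter(ReliabilityConf("chat"))).__name__)  # ChatAdapter
print(type(get_adapter(ReliabilityConf("JSON"))).__name__)  # JSONAdapter
```

A side benefit of the refactor: the removed `else` branch interpolated `adapter` before it was ever assigned, which would have raised a `NameError` instead of the intended `ValueError`; centralizing the logic removes that bug.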
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
1+
import os
2+
from typing import List, Optional
3+
4+
from tests.reliability.generate.utils import (
5+
GeneratedTestCase,
6+
generate_test_inputs,
7+
generate_test_program,
8+
load_generated_cases,
9+
load_generated_program,
10+
)
11+
12+
13+
def generate_test_cases(
14+
dst_path: str,
15+
num_inputs: int = 1,
16+
program_instructions: Optional[str] = None,
17+
input_instructions: Optional[str] = None,
18+
) -> List[GeneratedTestCase]:
19+
os.makedirs(dst_path, exist_ok=True)
20+
if _directory_contains_program(dst_path):
21+
print(f"Found an existing test program at path {dst_path}. Generating new" f" test inputs for this program.")
22+
else:
23+
print("Generating a new test program and test inputs")
24+
generate_test_program(
25+
dst_path=dst_path,
26+
additional_instructions=program_instructions,
27+
)
28+
generate_test_inputs(
29+
dst_path=os.path.join(dst_path, "inputs"),
30+
program_path=os.path.join(dst_path, "program.py"),
31+
num_inputs=num_inputs,
32+
additional_instructions=input_instructions,
33+
)
34+
return load_generated_cases(dir_path=dst_path)
35+
36+
37+
def _directory_contains_program(dir_path: str) -> bool:
38+
return any(file == "program.py" for file in os.listdir(dir_path))
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
import argparse
2+
3+
from tests.reliability.generate import generate_test_cases
4+
5+
if __name__ == "__main__":
6+
parser = argparse.ArgumentParser(
7+
description="Generate test cases by specifying configuration and input instructions."
8+
)
9+
parser.add_argument(
10+
"-d", "--dst_path", type=str, required=True, help="Destination path where generated test cases will be saved."
11+
)
12+
parser.add_argument(
13+
"-n", "--num_inputs", type=int, default=1, help="Number of input cases to generate (default: 1)."
14+
)
15+
parser.add_argument(
16+
"-p", "--program_instructions", type=str, help="Additional instructions for the generated test program."
17+
)
18+
parser.add_argument(
19+
"-i", "--input_instructions", type=str, help="Additional instructions for generating test inputs."
20+
)
21+
22+
args = parser.parse_args()
23+
24+
generate_test_cases(
25+
dst_path=args.dst_path,
26+
num_inputs=args.num_inputs,
27+
program_instructions=args.program_instructions,
28+
input_instructions=args.input_instructions,
29+
)
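How the CLI flags map onto `generate_test_cases` keyword arguments can be shown by rebuilding the same parser and feeding it a sample argument list; the destination path and prompt string below are illustrative only:

```python
import argparse

# Mirror of the CLI's argparse setup in the commit, run on an example argv.
parser = argparse.ArgumentParser(description="Generate test cases.")
parser.add_argument("-d", "--dst_path", type=str, required=True)
parser.add_argument("-n", "--num_inputs", type=int, default=1)
parser.add_argument("-p", "--program_instructions", type=str)
parser.add_argument("-i", "--input_instructions", type=str)

args = parser.parse_args([
    "-d", "tests/reliability/classification/generated/test_example",
    "-p", "Generate a classification program",
    "-n", "3",
])
print(args.num_inputs)          # 3 (parsed as int via type=int)
print(args.input_instructions)  # None: -i was omitted, so the default applies
```

Note that omitted optional flags arrive as `None`, which `generate_test_cases` accepts because its `program_instructions` and `input_instructions` parameters default to `None`.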
