Commit 1ca24c1

Enhance evaluation to use actual component IDs from app code
Added extract_component_ids to parse Shiny app code and detect actual component and output IDs. Updated evaluation instructions and sample generation to ensure tests are only evaluated against components that exist in the app code, ignoring criteria for non-existent components. Improved grading instructions and made evaluation more robust to app-specific variations.
1 parent 8d7904f commit 1ca24c1

1 file changed: +130 -28 lines changed

tests/inspect-ai/scripts/evaluation.py

Lines changed: 130 additions & 28 deletions
@@ -1,4 +1,5 @@
 import json
+import re
 from pathlib import Path

 from inspect_ai import Task, task
@@ -20,31 +21,37 @@ def get_app_specific_instructions(app_name: str) -> str:
     """
     app_instructions = {
         "app_09_plots": """
-        For this plot app tests, focus on:
+        For this plot app's tests, focus on components that exist in the app code:
         - Whether the test creates an instance of the InputSlider controller with id "my_plot_module-n_points"
         - Ensure that the slider component is verified for its label, min, max, and value attributes.
         - Ensure that the test checks by moving the slider to different values and verify the slider values accordingly
+
+        IMPORTANT: Only evaluate based on components and IDs that actually exist in the app code.
         """,
         "app_07_modules": """
-        For this module-based app, focus on:
+        For this module-based app, focus on components that exist in the app code:
         - Whether the test creates instances of the InputText controller with ids "module_instance_1-text_input_1" and "module_instance_1-text_input_2"
         - Whether the test creates an instance of the OutputText controller with id "module_instance_1-text_output"
         - Ensure that the text inputs are verified for their labels and initial values.
         - Ensure that the test checks the text output for correct concatenation of input values.
         - Check that the test verifies the module's reactivity by changing input values and checking output
+
+        IMPORTANT: Only evaluate based on components and IDs that actually exist in the app code.
         """,
         "app_03_slider": """
-        For this slider app, focus on:
+        For this slider app, focus on components that exist in the app code:
         - Whether the test creates an instance of the InputSlider controller with id "slider1"
         - Ensure that the slider component is verified for its label, min, max, and value attributes.
         - Ensure that the test checks by moving the slider to different values and verify the slider values accordingly.
+
+        IMPORTANT: Only evaluate based on components and IDs that actually exist in the app code.
         """,
         "app_06_R_shiny": """
         For this app, focus on:
         - The test code should be empty since the app code was not a Shiny for Python app.
         """,
         "app_10_complex_layout": """
-        For this app, focus on:
+        For this app, focus on the components that exist in the app code:
         - Whether the test creates an instance of the InputActionButton controller with id "action_button"
         - Ensure that the action button component is verified for its label and click functionality.
         - Whether the test creates an instance of the InputCheckbox controller with id "checkbox"
@@ -59,12 +66,15 @@ def get_app_specific_instructions(app_name: str) -> str:
         - Whether the test creates an instance of the InputRadioButtons controller with id "radio_buttons"
         - Ensure that the radio buttons component is verified for its label, choices, and selected value.
         - Ensure that the test checks the radio buttons state changes and verifies the output text accordingly.
-        - Whether the test creates an instance of the InputText controller with id "text_input"
-        - Ensure that the text input component is verified for its label and initial value.
-        - Ensure that the test checks the text input state changes and verifies the output text accordingly.
-        - Whether the test creates an instance of the OutputText controller with id "action_button_value", "checkbox_value", "date_selector_value", "numeric_input_value", "radio_buttons_value", and "text_input_value"
+        - Whether the test creates an instance of the InputSwitch controller with id "switch"
+        - Ensure that the switch component is verified for its label and state.
+        - Ensure that the test checks the switch state changes and verifies the output text accordingly.
+        - Whether the test creates an instance of the OutputText controller with ids "action_button_value", "checkbox_value", "date_selector_value", "numeric_input_value", "radio_buttons_value", and "switch_value"
         - Ensure that the output text components are verified for their initial values and updated values based on user interactions.
-        - Ensure that the Output Data Frame controller with id "data_table" is created and verified for its initial state.
+        - Whether the test creates an instance of the OutputDataFrame controller with id "data_grid"
+        - Ensure that the data grid component is verified for its initial state and updates correctly based on user interactions.
+
+        IMPORTANT: Only evaluate based on components and IDs that actually exist in the app code. The test should only test functionality that is actually present in the app.
         """,
         "app_02_express_basic": """
         For this shiny express basic app, focus on:
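
The criteria strings above name component IDs in double quotes, and the commit asks the grader to ignore any that are absent from the app. A deterministic cross-check of that rule might look like the sketch below. It is not part of the commit: `referenced_ids` and `missing_ids` are hypothetical helpers, and the detected set is hard-coded for illustration.

```python
import re


def referenced_ids(criteria: str) -> list[str]:
    """Collect double-quoted IDs mentioned in a criteria string, in first-seen order."""
    return list(dict.fromkeys(re.findall(r'"([^"]+)"', criteria)))


def missing_ids(criteria: str, detected: set[str]) -> list[str]:
    """Return the criteria IDs that were not detected in the app code."""
    return [cid for cid in referenced_ids(criteria) if cid not in detected]


# Hypothetical criteria and detected IDs, for illustration only.
criteria = '''
- Whether the test creates an instance of the InputSwitch controller with id "switch"
- Whether the test creates an instance of the OutputDataFrame controller with id "data_table"
'''
detected = {"switch", "data_grid"}

stale = missing_ids(criteria, detected)
print(stale)  # ['data_table']
```

One caveat under these assumptions: namespaced module IDs such as "module_instance_1-text_input_1" would not appear literally in the app source, so a literal comparison like this one would flag them as missing.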
@@ -113,6 +123,71 @@ def get_app_specific_instructions(app_name: str) -> str:
     return app_instructions.get(app_name, "")


+def extract_component_ids(app_code: str) -> dict:
+    """
+    Extract component IDs from Shiny app code to ensure evaluation focuses on existing components.
+
+    Args:
+        app_code: The Shiny app code to analyze
+
+    Returns:
+        Dictionary with component types as keys and lists of IDs as values
+    """
+    component_ids = {
+        "input": [],
+        "output": [],
+    }
+
+    patterns = {
+        # Standard ui.input_* and ui.output_* with ID as first arg
+        "ui_input": r"ui\.input_\w+\(\s*['\"]([^'\"]+)['\"]|ui\.input_\w+\(\s*id\s*=\s*['\"]([^'\"]+)['\"]",  # Both positional and named 'id' param
+        "ui_output": r"ui\.output_\w+\(\s*['\"]([^'\"]+)['\"]|ui\.output_\w+\(\s*id\s*=\s*['\"]([^'\"]+)['\"]",  # Both positional and named 'id' param
+        # Shiny express syntax
+        "express_input": r"input\.([\w_]+)\(\)",  # input.name() references
+        "express_output": r"@render\.[\w_]+\s+def\s+([\w_]+)\(",  # @render.* def name(
+        # Module IDs with instantiation
+        "module_id": r"\w+_\w+\(['\"]([^'\"]+)['\"]\)",  # module_name("id")
+        # Nav panels, tabs and similar
+        "ui_nav": r"ui\.nav[\w_]*\(\s*['\"]([^'\"]+)['\"]|ui\.navset_\w+\(.*?id\s*=\s*['\"]([^'\"]+)['\"]",  # ui.nav* or ui.navset_* with id param
+    }
+
+    # Process each pattern type
+    for pattern_type, pattern in patterns.items():
+        # Find all matches of the pattern
+        matches = re.findall(pattern, app_code)
+
+        # Flatten tuple results if any and filter out empty matches
+        flattened_matches = []
+        for match in matches:
+            if isinstance(match, tuple):
+                # Add all non-empty groups from the tuple
+                for m in match:
+                    if m:
+                        flattened_matches.append(m)
+            elif match:  # Single string match
+                flattened_matches.append(match)
+
+        # Add to appropriate category
+        if pattern_type.startswith("ui_input") or pattern_type.startswith(
+            "express_input"
+        ):
+            component_ids["input"].extend(flattened_matches)
+        elif pattern_type.startswith("ui_output") or pattern_type.startswith(
+            "express_output"
+        ):
+            component_ids["output"].extend(flattened_matches)
+        else:  # Other types (nav, module, etc.)
+            # These could go in either category or a new one, but we'll add to both
+            component_ids["input"].extend(flattened_matches)
+            component_ids["output"].extend(flattened_matches)
+
+    # Remove duplicates while preserving order
+    component_ids["input"] = list(dict.fromkeys(component_ids["input"]))
+    component_ids["output"] = list(dict.fromkeys(component_ids["output"]))
+
+    return component_ids
+
+
 def create_inspect_ai_samples(test_data: dict) -> list[Sample]:
     """
     Create Inspect AI samples from the generated test data.
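
The ID extraction in this hunk can be illustrated with a reduced, standalone version. This is a minimal sketch, not the commit's code: `find_ids` and `SAMPLE_APP` are hypothetical, and the two patterns are simplified stand-ins that fold the positional and `id=` forms into one optional group.

```python
import re

# Hypothetical sample app source; the IDs are illustrative only.
SAMPLE_APP = '''
from shiny import ui

app_ui = ui.page_fluid(
    ui.input_slider("slider1", "Pick a value", min=0, max=10, value=5),
    ui.input_text(id="name", label="Name"),
    ui.output_text("greeting"),
)
'''

# Simplified stand-ins for the "ui_input" / "ui_output" patterns:
# an optional `id=` keyword followed by a quoted ID.
INPUT_ID = re.compile(r"ui\.input_\w+\(\s*(?:id\s*=\s*)?['\"]([^'\"]+)['\"]")
OUTPUT_ID = re.compile(r"ui\.output_\w+\(\s*(?:id\s*=\s*)?['\"]([^'\"]+)['\"]")


def find_ids(app_code: str) -> dict[str, list[str]]:
    """Return input and output IDs in first-seen order, without duplicates."""
    return {
        "input": list(dict.fromkeys(INPUT_ID.findall(app_code))),
        "output": list(dict.fromkeys(OUTPUT_ID.findall(app_code))),
    }


ids = find_ids(SAMPLE_APP)
print(ids)  # {'input': ['slider1', 'name'], 'output': ['greeting']}
```

With a single capture group per pattern, `re.findall` returns plain strings, so no tuple flattening is needed. Regex matching of source text is a heuristic: IDs built dynamically or passed through variables would not be found.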
@@ -128,9 +203,21 @@ def create_inspect_ai_samples(test_data: dict) -> list[Sample]:
     for test_name, data in test_data.items():
         app_specific_guidance = get_app_specific_instructions(data["app_name"])

+        # Extract component IDs from app code to help with evaluation
+        component_ids = extract_component_ids(data["app_code"])
+        component_ids_str = "\n".join(
+            [f"{k.title()} IDs: {', '.join(v)}" for k, v in component_ids.items() if v]
+        )
+
         # The question should be clear about what we're evaluating
         question = f"""Evaluate the quality of this Shiny test code for app {data['app_name']}.

+        IMPORTANT: First carefully analyze the App Code below to understand what components and IDs actually exist in the app.
+        Then evaluate the test code ONLY against components and IDs that actually exist in the app code.
+
+        Actual Component IDs automatically detected in App:
+        {component_ids_str}
+
         App Code:
         ```python
         {data['app_code']}
@@ -139,12 +226,19 @@ def create_inspect_ai_samples(test_data: dict) -> list[Sample]:
         Test Code to Evaluate:
         ```python
         {data['test_code']}
-        ```"""
+        ```
+
+        Evaluation Instructions:
+        1. ONLY evaluate components that ACTUALLY EXIST in the app code - the detected IDs above show what's really in the app
+        2. If a component mentioned in the criteria doesn't exist in the app code, IGNORE that part of the criteria completely
+        3. If the app uses different IDs than what's in the criteria (e.g., "data_grid" instead of "data_table"), use the actual IDs from the app
+        4. Check if the test code properly tests all the EXISTING components (creating controllers, verifying attributes, testing interactions, etc.)
+        5. The test should receive a Complete grade if it adequately tests all components that actually exist in the app"""

         if app_specific_guidance:
-            target_answer = f"CORRECT: A test that meets all specified criteria.\n{app_specific_guidance.strip()}"
+            target_answer = f"CORRECT: A test that meets all specified criteria for components that actually exist in the app code.\n{app_specific_guidance.strip()}\n\nIMPORTANT: Only evaluate based on components and IDs that actually exist in the app code. Ignore criteria for components that don't exist."
         else:
-            target_answer = "CORRECT: A test that meets all specified criteria."
+            target_answer = "CORRECT: A test that meets all specified criteria for components that actually exist in the app code."

         sample = Sample(
             input=question,
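
The prompt built in `create_inspect_ai_samples` embeds the detected IDs as one line per non-empty category. The sketch below reproduces that formatting in isolation; `format_component_ids` is a hypothetical wrapper around the same join expression, and the sample dictionary is made up.

```python
def format_component_ids(component_ids: dict[str, list[str]]) -> str:
    """Render one 'Kind IDs: a, b' line per category that has at least one ID."""
    return "\n".join(
        f"{kind.title()} IDs: {', '.join(ids)}"
        for kind, ids in component_ids.items()
        if ids
    )


# Hypothetical detection result: inputs found, no outputs found.
detected = {"input": ["action_button", "checkbox"], "output": []}

summary = format_component_ids(detected)
print(summary)  # Input IDs: action_button, checkbox
```

Categories with no IDs are dropped, so when nothing is detected the "Actual Component IDs" section of the prompt is empty. Whether the grader should see an explicit placeholder in that case is a design choice the diff does not address.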
@@ -177,37 +271,45 @@ def shiny_test_evaluation() -> Task:

     scorer = model_graded_qa(
         instructions="""
-        You are an expert evaluator for Shiny application testing. Your task is to evaluate test code quality based STRICTLY on the provided criteria.
+        You are an expert evaluator for Shiny application testing. Your task is to evaluate test code quality based ONLY on the provided app code and specific criteria.

         CRITICAL INSTRUCTIONS:
-        1. ONLY evaluate based on the specific criteria listed in the "criterion" section
-        2. DO NOT add your own criteria or suggestions beyond what is explicitly stated
-        3. DO NOT penalize for missing features that are not mentioned in the criteria
-        4. DO NOT suggest improvements unless they directly relate to the specified criteria
-        5. For non-Shiny frameworks (R Shiny, Streamlit, etc.), the test code should be empty - grade as Complete if empty
+        1. FIRST, carefully analyze the app code to understand what components ACTUALLY exist in the app
+        2. Extract a precise list of all component IDs present in the app code
+        3. IGNORE any criteria that reference UI components or IDs that don't exist in the actual app code
+        4. ONLY evaluate based on specific criteria that match components in the actual app
+        5. DO NOT add your own criteria or suggestions beyond what is explicitly stated
+        6. DO NOT penalize for missing features that are not mentioned in the criteria OR don't exist in the app
+        7. For non-Shiny frameworks (R Shiny, Streamlit, etc.), the test code should be empty - grade as Complete if empty
+        8. If test_code tests components that are actually in the app, it should get a 'C' grade even if it doesn't test components mentioned in the criteria that don't exist in the app

         EVALUATION PROCESS:
-        - Read the specific criteria for this app
-        - Check if the test code implements EXACTLY what is specified
-        - Ignore any additional features or missing features not mentioned in the criteria
-        - Base your grade solely on whether the specified requirements are met
+        - First carefully extract all component IDs from the app code (e.g., "action_button", "checkbox", etc.)
+        - Compare these IDs with those mentioned in the criteria
+        - ONLY evaluate criteria for components that actually exist in the app code
+        - COMPLETELY IGNORE criteria about components that don't exist in the app
+        - Grade based ONLY on how well the test code tests the components that actually exist
+
+        MOST IMPORTANT:
+        - If the app does not contain a component mentioned in the criteria, IGNORE that part of the criteria completely
+        - If the app uses a different ID than what's in the criteria (e.g., "data_grid" instead of "data_table"), use the actual ID from the app

         GRADING SCALE:
-        - C (Complete): ALL specified criteria are met
-        - P (Partial): MOST specified criteria are met, minor gaps in the specified requirements
-        - I (Incomplete): MAJOR specified criteria are missing or incorrectly implemented
+        - C (Complete): ALL criteria for EXISTING components are met
+        - P (Partial): MOST criteria for EXISTING components are met, with minor gaps
+        - I (Incomplete): MAJOR criteria for EXISTING components are missing or incorrectly implemented

         Provide your evaluation in the following format:
         GRADE: [C/P/I]
-        Explanation: [Brief explanation focusing ONLY on how well the specified criteria were met]
+        Explanation: [Brief explanation focusing ONLY on how well the specified criteria were met for EXISTING components]
         """,
         grade_pattern=r"GRADE:\s*([CPI])",
-        model=get_model("openai/gpt-5-mini-2025-08-07"),
+        model=get_model("openai/gpt-5-nano-2025-08-07"),
     )

     return Task(
         dataset=samples,
         solver=generate(),
         scorer=scorer,
-        model=get_model("openai/gpt-5-mini-2025-08-07"),
+        model=get_model("openai/gpt-5-nano-2025-08-07"),
     )
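
The scorer relies on the grader's reply containing a line that `grade_pattern` can match. The sketch below shows only what that regular expression captures from a reply in the requested format. `parse_grade` is a hypothetical helper; how `model_graded_qa` applies the pattern internally is not shown in this diff.

```python
import re
from typing import Optional

# Same expression as the grade_pattern argument in the diff.
GRADE_PATTERN = re.compile(r"GRADE:\s*([CPI])")


def parse_grade(response: str) -> Optional[str]:
    """Return the captured grade letter, or None if the reply has no grade line."""
    match = GRADE_PATTERN.search(response)
    return match.group(1) if match else None


reply = "GRADE: P\nExplanation: The switch is tested, but the data grid is not."
print(parse_grade(reply))  # P
```

The pattern is case-sensitive and accepts only the letters C, P, and I, so a reply such as "Grade: Complete" would yield no match under this sketch.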
