[WIP] for eval sdk e2e test #44601
base: main
Conversation
Merge commits:
- …e-sdk-for-python into etwinter/nestfile
- …into etwinter/nestfile

Squashed commit messages:
- fix result converter for usage with null value
- resolve comments
- fix F1_Score metric
- Update changelog and version
- leverage metrics from evaluators api
- update
- run black
- address comments and refactor converter
- Update comments
- add more comments
- fix typo for test data
- remove dup extraction
- remove dup extraction
- remove dup extraction
- Address comments and update UT
- update UT
- Address comments and update method name

Co-authored-by: Waqas Javed <[email protected]>
Pull request overview
This WIP pull request adds end-to-end testing support for the evaluation SDK with a focus on fixing nested field handling for AOAI graders when using files as datasources. The changes introduce a normalization layer to handle rows that already contain item or sample wrapper structures, preserving nested dictionaries while ensuring data is properly mapped to the AOAI evaluation format.
Key changes:
- Added normalization logic to preserve reserved root-level keys (like `"sample"`) when flattening data (see the sketch after this list)
- Enhanced evaluator configuration support with inverse metric handling for boolean metrics with `"desirable_direction": "decrease"` (sketched after the file table below)
- Refactored evaluation result conversion to be more modular, with dedicated helper functions for metadata extraction and result transformation
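A minimal sketch of the normalization idea described above. The names (`normalize_row`, `RESERVED_ROOT_KEYS`) are hypothetical, chosen for illustration; the SDK's actual identifiers and wrapper handling may differ.

```python
from typing import Any, Dict

# Hypothetical names for illustration; not the SDK's actual identifiers.
RESERVED_ROOT_KEYS = {"item", "sample"}

def normalize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Pass rows that already carry a reserved wrapper key through
    unchanged, preserving nested dicts; wrap flat rows under "item"."""
    if any(key in RESERVED_ROOT_KEYS for key in row):
        return row
    return {"item": dict(row)}

# Flat rows get wrapped; pre-wrapped rows keep their nesting.
print(normalize_row({"query": "hi"}))                   # {'item': {'query': 'hi'}}
print(normalize_row({"sample": {"output_text": "x"}}))  # unchanged
```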
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_evaluate.py | Added evaluator_config parameter to test conversion function |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_aoai_data_source.py | Added 3 new test fixtures and 3 new test methods for nested item/sample handling, removed assertion for required fields |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/nested_item_sample_keyword.jsonl | New test data with nested item and sample structures |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/nested_item_keyword.jsonl | New test data with nested item and flat sample metadata |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/flat_sample_output.jsonl | New test data with dotted sample metadata keys |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_old_output_test.jsonl | Changed custom_result to custom_label for boolean evaluator |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_expected_output.json | Updated expected transformations for inverse metrics and reordered results |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_eval_config.json | New evaluator configuration file with metric definitions |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate_aoai.py | Added row normalization logic, removed required field population in schema builder, enhanced value conversion |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py | Major refactoring: extracted metadata extraction into helper functions, added inverse metric handling, improved modularity |
| sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | Added bug fix entry for nested field handling |
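The inverse-metric handling referenced in the rows for `_evaluate.py` and the expected-output fixture might look roughly like the following. `passed` is a hypothetical helper name, and the logic is inferred from the review comments rather than taken from the PR.

```python
def passed(label: bool, desirable_direction: str) -> bool:
    # For metrics whose desirable direction is "decrease", a True label
    # (e.g. a violation was detected) should count as a failure, so the
    # boolean's pass/fail sense is inverted.
    if desirable_direction == "decrease":
        return not label
    return label

# No violation (False) on a "decrease" metric counts as a pass.
assert passed(False, "decrease") is True
assert passed(True, "decrease") is False
```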
Comments suppressed due to low confidence (1)
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate_aoai.py:666
- The removal of the line that appends to the `required` array in the schema tree builder causes the schema to have empty `required` arrays at all levels. This breaks the schema validation contract: properties that exist should be marked as required in the schema. The original code was correctly populating the required fields based on which properties were present.
```python
required = []
for name, child in children.items():
    props[name] = to_schema(child)
    # The `required.append(name)` call was removed here, leaving
    # `required` empty at every level of the generated schema.
return {
    "type": "object",
    "properties": props,
    "required": required,
}
```
```python
except (TypeError, ValueError):
    # Fallback for unserializable objects
    return str(val)
if isinstance(val, (dict)):
```
Copilot AI · Jan 9, 2026
The parentheses around `dict` in the `isinstance` check are unnecessary. It should be `isinstance(val, dict)` rather than `isinstance(val, (dict))`. While functionally equivalent, the extra parentheses suggest a tuple but contain only a single element, which is confusing and non-idiomatic Python.
```diff
- if isinstance(val, (dict)):
+ if isinstance(val, dict):
```
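For context, a self-contained sketch of the kind of value conversion this excerpt appears to come from. `_convert_value` is a hypothetical name, and the overall shape is an assumption based only on the visible lines.

```python
import json
from typing import Any

def _convert_value(val: Any) -> Any:
    # Hypothetical reconstruction for illustration only.
    if isinstance(val, dict):
        # Preserve nested dictionaries instead of stringifying them.
        return {k: _convert_value(v) for k, v in val.items()}
    try:
        json.dumps(val)  # probe for JSON serializability
        return val
    except (TypeError, ValueError):
        # Fallback for unserializable objects
        return str(val)
```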
```python
    Example Output:
        (1.0, "pass", True) (no violation should result in pass)
    """
    if label is not None and isinstance(label, bool) and label == True:
```
Copilot AI · Jan 9, 2026
The comparison `label == True` should use `is True` instead of `== True` for boolean comparisons in Python. Using `is True` is more explicit and prevents unexpected behavior with truthy values. Alternatively, simply use `if label is not None and isinstance(label, bool) and label:`, which is more idiomatic.
```diff
- if label is not None and isinstance(label, bool) and label == True:
+ if label is not None and isinstance(label, bool) and label:
```
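A quick illustration of the truthy-value pitfall the reviewer is guarding against. This is standard Python behavior, not code from this PR.

```python
# Booleans are a subclass of int, so `== True` accepts truthy ints too:
label = 1
print(label == True)                       # True  (the int 1 compares equal to True)
print(label is True)                       # False (identity requires the bool True)
print(isinstance(label, bool) and label)   # False (the stricter guard in the PR)
```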
```markdown
### Bugs Fixed

- Updated CodeVulnerability and UngroundedAttributes evaluators for RedTeam to use the binary true/false scoring pattern so their results align with service responses.
- Fixed handling of nested fields for AOAI graders when using files as datasource
```
Copilot AI · Jan 9, 2026
The CHANGELOG entry has a minor spelling/grammar issue. The phrase "handling of nested fields for AOAI graders when using files as datasource" should include an article: "when using files as a datasource" or "when using files as datasources".
```diff
- - Fixed handling of nested fields for AOAI graders when using files as datasource
+ - Fixed handling of nested fields for AOAI graders when using files as a datasource
```
| assert "properties" in schema | ||
| assert "required" in schema | ||
| assert set(schema["properties"].keys()) == {"query", "response", "ground_truth"} | ||
| assert all(prop["type"] == "string" for prop in schema["properties"].values()) |
Copilot AI · Jan 9, 2026
The test assertion for the `required` field has been removed, but this is likely incorrect. The schema should still include a `required` array that lists all the properties. Removing this assertion means the test no longer validates that the required fields are properly set in the schema, which could mask bugs in the schema generation logic.
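A hedged sketch of the assertion that could restore this coverage, assuming `schema` is the object built earlier in the test:

```python
# Re-validate that every generated property is also listed as required.
assert set(schema["required"]) == {"query", "response", "ground_truth"}
```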
Description
Please add an informative description that covers the changes made by the pull request and link all relevant issues.
If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines