[WIP] for eval sdk e2e test #44601
base: main
Conversation
Merge commits:
- …e-sdk-for-python into etwinter/nestfile
- …into etwinter/nestfile

Squashed commit messages:
- fix result converter for usage with null value
- resolve comments
- fix F1_Score metric
- Update changelog and version
- leverage metrics from evaluators api
- update
- run black
- address comments and refactor converter
- Update comments
- add more comments
- fix typo for test data
- remove dup extraction
- remove dup extraction
- remove dup extraction
- Address comments and update UT
- update UT
- Address comments and update method name

Co-authored-by: Waqas Javed <[email protected]>
Pull request overview
This WIP pull request adds end-to-end testing support for the evaluation SDK with a focus on fixing nested field handling for AOAI graders when using files as datasources. The changes introduce a normalization layer to handle rows that already contain item or sample wrapper structures, preserving nested dictionaries while ensuring data is properly mapped to the AOAI evaluation format.
Key changes:
- Added normalization logic to preserve reserved root-level keys (like `"sample"`) when flattening data (see the sketch after this list)
- Enhanced evaluator configuration support with inverse metric handling for boolean metrics with `"desirable_direction": "decrease"` (sketched after the file table below)
- Refactored evaluation result conversion to be more modular, with dedicated helper functions for metadata extraction and result transformation
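A minimal sketch of the normalization idea described above. The names (`normalize_row`, `RESERVED_ROOT_KEYS`) are hypothetical, chosen for illustration; the SDK's actual identifiers and wrapper handling may differ.

```python
from typing import Any, Dict

# Hypothetical names for illustration; not the SDK's actual identifiers.
RESERVED_ROOT_KEYS = {"item", "sample"}

def normalize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Pass rows that already carry a reserved wrapper key through
    unchanged, preserving nested dicts; wrap flat rows under "item"."""
    if any(key in RESERVED_ROOT_KEYS for key in row):
        return row
    return {"item": dict(row)}

# Flat rows get wrapped; pre-wrapped rows keep their nesting.
print(normalize_row({"query": "hi"}))                   # {'item': {'query': 'hi'}}
print(normalize_row({"sample": {"output_text": "x"}}))  # unchanged
```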
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_evaluate.py | Added evaluator_config parameter to test conversion function |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_aoai_data_source.py | Added 3 new test fixtures and 3 new test methods for nested item/sample handling, removed assertion for required fields |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/nested_item_sample_keyword.jsonl | New test data with nested item and sample structures |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/nested_item_keyword.jsonl | New test data with nested item and flat sample metadata |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/flat_sample_output.jsonl | New test data with dotted sample metadata keys |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_old_output_test.jsonl | Changed custom_result to custom_label for boolean evaluator |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_expected_output.json | Updated expected transformations for inverse metrics and reordered results |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_eval_config.json | New evaluator configuration file with metric definitions |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate_aoai.py | Added row normalization logic, removed required field population in schema builder, enhanced value conversion |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py | Major refactoring: extracted metadata extraction into helper functions, added inverse metric handling, improved modularity |
| sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | Added bug fix entry for nested field handling |
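The inverse-metric handling referenced in the rows for `_evaluate.py` and the expected-output fixture might look roughly like the following. `passed` is a hypothetical helper name, and the logic is inferred from the review comments rather than taken from the PR.

```python
def passed(label: bool, desirable_direction: str) -> bool:
    # For metrics whose desirable direction is "decrease", a True label
    # (e.g. a violation was detected) should count as a failure, so the
    # boolean's pass/fail sense is inverted.
    if desirable_direction == "decrease":
        return not label
    return label

# No violation (False) on a "decrease" metric counts as a pass.
assert passed(False, "decrease") is True
assert passed(True, "decrease") is False
```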
Comments suppressed due to low confidence (1)
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate_aoai.py:666
- The removal of the line that appends to the `required` array in the schema tree builder causes the schema to have empty `required` arrays at all levels. This breaks the schema validation contract: properties that exist should be marked as required in the schema. The original code was correctly populating the required fields based on which properties were present.
```python
required = []
for name, child in children.items():
    props[name] = to_schema(child)
    # The `required.append(name)` call was removed here, leaving
    # `required` empty at every level of the generated schema.
return {
    "type": "object",
    "properties": props,
    "required": required,
}
```
```python
except (TypeError, ValueError):
    # Fallback for unserializable objects
    return str(val)
if isinstance(val, (dict)):
```
Copilot AI · Jan 9, 2026
The parentheses around `dict` in the `isinstance` check are unnecessary. It should be `isinstance(val, dict)` rather than `isinstance(val, (dict))`. While functionally equivalent, the extra parentheses suggest a tuple but contain only a single element, which is confusing and non-idiomatic Python.
```diff
- if isinstance(val, (dict)):
+ if isinstance(val, dict):
```
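For context, a self-contained sketch of the kind of value conversion this excerpt appears to come from. `_convert_value` is a hypothetical name, and the overall shape is an assumption based only on the visible lines.

```python
import json
from typing import Any

def _convert_value(val: Any) -> Any:
    # Hypothetical reconstruction for illustration only.
    if isinstance(val, dict):
        # Preserve nested dictionaries instead of stringifying them.
        return {k: _convert_value(v) for k, v in val.items()}
    try:
        json.dumps(val)  # probe for JSON serializability
        return val
    except (TypeError, ValueError):
        # Fallback for unserializable objects
        return str(val)
```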
```python
    Example Output:
        (1.0, "pass", True) (no violation should result in pass)
    """
    if label is not None and isinstance(label, bool) and label == True:
```
Copilot AI · Jan 9, 2026
The comparison `label == True` should use `is True` instead of `== True` for boolean comparisons in Python. Using `is True` is more explicit and prevents unexpected behavior with truthy values. Alternatively, simply use `if label is not None and isinstance(label, bool) and label:`, which is more idiomatic.
```diff
- if label is not None and isinstance(label, bool) and label == True:
+ if label is not None and isinstance(label, bool) and label:
```
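A quick illustration of the truthy-value pitfall the reviewer is guarding against. This is standard Python behavior, not code from this PR.

```python
# Booleans are a subclass of int, so `== True` accepts truthy ints too:
label = 1
print(label == True)                       # True  (the int 1 compares equal to True)
print(label is True)                       # False (identity requires the bool True)
print(isinstance(label, bool) and label)   # False (the stricter guard in the PR)
```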
```markdown
### Bugs Fixed

- Updated CodeVulnerability and UngroundedAttributes evaluators for RedTeam to use the binary true/false scoring pattern so their results align with service responses.
- Fixed handling of nested fields for AOAI graders when using files as datasource
```
Copilot AI · Jan 9, 2026
The CHANGELOG entry has a minor spelling/grammar issue. The phrase "handling of nested fields for AOAI graders when using files as datasource" should include an article: "when using files as a datasource" or "when using files as datasources".
```diff
- - Fixed handling of nested fields for AOAI graders when using files as datasource
+ - Fixed handling of nested fields for AOAI graders when using files as a datasource
```
| assert "properties" in schema | ||
| assert "required" in schema | ||
| assert set(schema["properties"].keys()) == {"query", "response", "ground_truth"} | ||
| assert all(prop["type"] == "string" for prop in schema["properties"].values()) |
Copilot AI · Jan 9, 2026
The test assertion for the `required` field has been removed, but this is likely incorrect. The schema should still include a `required` array that lists all the properties. Removing this assertion means the test no longer validates that the required fields are properly set in the schema, which could mask bugs in the schema generation logic.
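A hedged sketch of the assertion that could restore this coverage, assuming `schema` is the object built earlier in the test:

```python
# Re-validate that every generated property is also listed as required.
assert set(schema["required"]) == {"query", "response", "ground_truth"}
```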
Description
Please add an informative description that covers the changes made by the pull request and link all relevant issues.
If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines