Skip to content

Conversation

@YoYoJa
Copy link
Contributor

@YoYoJa YoYoJa commented Jan 9, 2026

Description

Please add an informative description that covers that changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

ebwinters and others added 17 commits December 5, 2025 16:08
* fix result converter for usage with null value

* resolve comments

* fix F1_Score metric

* Update changelog and version

* leverage metrics from evaluators api

* update

* run black

* address comments and refactor converter

* Update comments

* add more comments

* fix typo for test data

* remove dup extraction

* remove dup extraction

* remove dup extraction

* Address comments and update UT

* update UT

* Address comments and update method name

---------

Co-authored-by: Waqas Javed <[email protected]>
Copilot AI review requested due to automatic review settings January 9, 2026 19:12
@YoYoJa YoYoJa requested a review from a team as a code owner January 9, 2026 19:12
@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Jan 9, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This WIP pull request adds end-to-end testing support for the evaluation SDK with a focus on fixing nested field handling for AOAI graders when using files as datasources. The changes introduce a normalization layer to handle rows that already contain item or sample wrapper structures, preserving nested dictionaries while ensuring data is properly mapped to the AOAI evaluation format.

Key changes:

  • Added normalization logic to preserve reserved root-level keys (like "sample") when flattening data
  • Enhanced evaluator configuration support with inverse metric handling for boolean metrics with "desirable_direction": "decrease"
  • Refactored evaluation result conversion to be more modular with dedicated helper functions for metadata extraction and result transformation

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_evaluate.py Added evaluator_config parameter to test conversion function
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_aoai_data_source.py Added 3 new test fixtures and 3 new test methods for nested item/sample handling, removed assertion for required fields
sdk/evaluation/azure-ai-evaluation/tests/unittests/data/nested_item_sample_keyword.jsonl New test data with nested item and sample structures
sdk/evaluation/azure-ai-evaluation/tests/unittests/data/nested_item_keyword.jsonl New test data with nested item and flat sample metadata
sdk/evaluation/azure-ai-evaluation/tests/unittests/data/flat_sample_output.jsonl New test data with dotted sample metadata keys
sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_old_output_test.jsonl Changed custom_result to custom_label for boolean evaluator
sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_expected_output.json Updated expected transformations for inverse metrics and reordered results
sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_eval_config.json New evaluator configuration file with metric definitions
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate_aoai.py Added row normalization logic, removed required field population in schema builder, enhanced value conversion
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py Major refactoring: extracted metadata extraction into helper functions, added inverse metric handling, improved modularity
sdk/evaluation/azure-ai-evaluation/CHANGELOG.md Added bug fix entry for nested field handling
Comments suppressed due to low confidence (1)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate_aoai.py:666

  • The removal of the line that appends to the required array in the schema tree builder causes the schema to have empty required arrays at all levels. This breaks the schema validation contract - properties that exist should be marked as required in the schema. The original code was correctly populating the required fields based on which properties were present.
        required = []
        for name, child in children.items():
            props[name] = to_schema(child)
        return {
            "type": "object",
            "properties": props,
            "required": required,

except (TypeError, ValueError):
# Fallback for unserializable objects
return str(val)
if isinstance(val, (dict)):
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parentheses around dict in the isinstance check are unnecessary. It should be isinstance(val, dict) rather than isinstance(val, (dict)). While functionally equivalent, the extra parentheses suggest a tuple but only contain a single element, which is confusing and non-idiomatic Python.

Suggested change
if isinstance(val, (dict)):
if isinstance(val, dict):

Copilot uses AI. Check for mistakes.
Example Output:
(1.0, "pass", True) (no violation should result in pass)
"""
if label is not None and isinstance(label, bool) and label == True:
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comparison label == True should use is True instead of == True for boolean comparisons in Python. Using is True is more explicit and prevents unexpected behavior with truthy values. Alternatively, simply use if label is not None and isinstance(label, bool) and label: which is more idiomatic.

Suggested change
if label is not None and isinstance(label, bool) and label == True:
if label is not None and isinstance(label, bool) and label:

Copilot uses AI. Check for mistakes.
### Bugs Fixed

- Updated CodeVulnerability and UngroundedAttributes evaluators for RedTeam to use the binary true/false scoring pattern so their results align with service responses.
- Fixed handling of nested fields for AOAI graders when using files as datasource
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CHANGELOG entry has a minor spelling/grammar issue. The phrase "handling of nested fields for AOAI graders when using files as datasource" should include an article: "when using files as a datasource" or "when using files as datasources".

Suggested change
- Fixed handling of nested fields for AOAI graders when using files as datasource
- Fixed handling of nested fields for AOAI graders when using files as a datasource

Copilot uses AI. Check for mistakes.
assert "properties" in schema
assert "required" in schema
assert set(schema["properties"].keys()) == {"query", "response", "ground_truth"}
assert all(prop["type"] == "string" for prop in schema["properties"].values())
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test assertion for the required field has been removed, but this is likely incorrect. The schema should still include a required array that lists all the properties. Removing this assertion means the test no longer validates that the required fields are properly set in the schema, which could mask bugs in the schema generation logic.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Evaluation Issues related to the client library for Azure AI Evaluation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants