Add TruthfulQA benchmarkspec asset by Copilot · Pull Request #4781 · Azure/azureml-assets

Copilot · 2026-02-13T00:03:41Z

Adds a benchmarkspec for the TruthfulQA dataset (817 questions across 38 categories that measure model truthfulness on common misconceptions). Restructures the evaluator schema to support C# JObject deserialization.

Changes

New benchmarkspec: assets/benchmarkspecs/builtin/truthful_qa/
- spec.yaml: Full benchmark definition with string_check evaluator
- asset.yaml: Asset metadata

Simplified evaluator schema: Reduced to id + testingCriteria for JObject compatibility

evaluator:
  id: "azureml://registries/azureml/evaluators/builtin.string_check/versions/2"
  testingCriteria:
    type: "string_check"
    input: "{{sample.output_text}}"
    operation: "eq"
    reference: "{{item.Best_Answer}}"
    name: "TruthfulQA"
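
As a rough illustration of what the testingCriteria above asks for, the sketch below renders the two `{{...}}` placeholders against a sample/item pair and compares them with `eq`. The helper names `render_template` and `string_check` are hypothetical and this is not the builtin.string_check implementation; it only shows the exact-match semantics under those assumptions.

```python
import re

# Hypothetical helpers for illustration; NOT the builtin.string_check evaluator.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def render_template(template: str, context: dict) -> str:
    """Replace {{a.b}} placeholders with values looked up in nested dicts."""
    def lookup(match: re.Match) -> str:
        value = context
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if the path is missing
        return str(value)
    return _PLACEHOLDER.sub(lookup, template)


def string_check(criteria: dict, context: dict) -> bool:
    """Evaluate a string_check criterion; only the "eq" operation is sketched."""
    if criteria["operation"] != "eq":
        raise NotImplementedError(criteria["operation"])
    actual = render_template(criteria["input"], context)
    expected = render_template(criteria["reference"], context)
    return actual == expected


# Same fields as the testingCriteria in spec.yaml.
criteria = {
    "type": "string_check",
    "input": "{{sample.output_text}}",
    "operation": "eq",
    "reference": "{{item.Best_Answer}}",
    "name": "TruthfulQA",
}

# Placeholder values, not taken from the dataset.
context = {
    "sample": {"output_text": "placeholder answer"},
    "item": {"Best_Answer": "placeholder answer"},
}
print(string_check(criteria, context))  # True
```

Because the comparison is an exact match, any extra wrapper text in the model output (for example an answer prefix or a trailing period) makes the check fail, which is consistent with the commit that removes the $ANSWER format from the prompt.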

Dataset Configuration

Source: domenicrosati/TruthfulQA (HuggingFace)
Schema: 7 string columns (Type, Category, Question, Best_Answer, Correct_Answers, Incorrect_Answers, Source)
Evaluator: Exact string match against Best_Answer
License: Apache 2.0
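
A minimal sketch of checking a CSV against the seven-column schema listed above, using only the Python standard library. The function name `validate_columns` and the sample row are assumptions made for illustration; the PR does not describe how (or whether) the schema is validated.

```python
import csv
import io

# Column names as listed in the Dataset Configuration section.
EXPECTED_COLUMNS = [
    "Type", "Category", "Question", "Best_Answer",
    "Correct_Answers", "Incorrect_Answers", "Source",
]


def validate_columns(csv_text: str) -> int:
    """Check that the header matches the expected schema; return the data row count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    count = 0
    for row in reader:
        # DictReader fills missing trailing cells with None.
        if any(row[col] is None for col in EXPECTED_COLUMNS):
            raise ValueError(f"row {count + 1} is missing a column")
        count += 1
    return count


# Placeholder row for illustration only, not taken from the dataset.
sample = (
    "Type,Category,Question,Best_Answer,Correct_Answers,Incorrect_Answers,Source\n"
    "t,c,placeholder question,placeholder answer,a1; a2,b1; b2,https://example.com\n"
)
print(validate_columns(sample))  # 1
```

Run against the full TruthfulQA.csv, the returned count would be expected to equal the 817 questions mentioned in the PR description.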

Co-authored-by: AbdelmohsenMS <127798197+AbdelmohsenMS@users.noreply.github.com>

github-actions · 2026-02-13T15:37:32Z

Test Results for assets-test

0 tests, 0 passed ✅, 0s duration ⏱️
0 suites, 0 skipped 💤
0 files, 0 failed ❌

Results for commit 303953c.

♻️ This comment has been updated with latest results.

Initial plan

96e1122

Copilot AI assigned Copilot and AbdelmohsenMS Feb 13, 2026

Copilot started work on behalf of AbdelmohsenMS February 13, 2026 00:04

Copilot AI and others added 2 commits February 13, 2026 00:06

Add truthful_qa benchmarkspec with string_check evaluator

898cfc4

Co-authored-by: AbdelmohsenMS <127798197+AbdelmohsenMS@users.noreply.github.com>

Fix prompt to remove $ANSWER format for exact string matching

6c1f0db

Co-authored-by: AbdelmohsenMS <127798197+AbdelmohsenMS@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add benchmark specification for builtin.truthful_qa dataset~~ Add builtin.truthful_qa benchmarkspec with string_check evaluator Feb 13, 2026

Copilot AI requested a review from AbdelmohsenMS February 13, 2026 00:09

Copilot finished work on behalf of AbdelmohsenMS February 13, 2026 00:09

Copilot started work on behalf of AbdelmohsenMS February 13, 2026 00:17

Add asset.yaml for truthful_qa benchmarkspec

b351118

Co-authored-by: AbdelmohsenMS <127798197+AbdelmohsenMS@users.noreply.github.com>

Copilot AI changed the title ~~Add builtin.truthful_qa benchmarkspec with string_check evaluator~~ Add builtin.truthful_qa benchmarkspec Feb 13, 2026

Copilot finished work on behalf of AbdelmohsenMS February 13, 2026 00:19

Copilot started work on behalf of AbdelmohsenMS February 13, 2026 01:12

Update file name from truthful_qa.csv to TruthfulQA.csv

2c4d27a

Co-authored-by: AbdelmohsenMS <127798197+AbdelmohsenMS@users.noreply.github.com>

Copilot finished work on behalf of AbdelmohsenMS February 13, 2026 01:14

AbdelmohsenMS marked this pull request as ready for review February 13, 2026 06:32

AbdelmohsenMS requested a review from a team as a code owner February 13, 2026 06:32

Copilot started work on behalf of AbdelmohsenMS February 13, 2026 06:36

Restructure evaluator to use id and testingCriteria fields

8c46be2

Co-authored-by: AbdelmohsenMS <127798197+AbdelmohsenMS@users.noreply.github.com>

Copilot AI changed the title ~~Add builtin.truthful_qa benchmarkspec~~ Add TruthfulQA benchmarkspec with simplified evaluator schema Feb 13, 2026

Copilot finished work on behalf of AbdelmohsenMS February 13, 2026 06:38

AbdelmohsenMS changed the title ~~Add TruthfulQA benchmarkspec with simplified evaluator schema~~ Add TruthfulQA benchmarkspec asset Feb 13, 2026

Copilot AI temporarily deployed to Testing February 13, 2026 15:37 Inactive

Merge branch 'main' into copilot/add-benchmarkspec-builtin-truthful-qa

252571e

AbdelmohsenMS temporarily deployed to Testing February 13, 2026 15:39 — with GitHub Actions Inactive

vizhur added the "safe to publish" label (Pull request containing new asset has been tested properly) Feb 13, 2026

vizhur approved these changes Feb 13, 2026

vizhur temporarily deployed to Testing February 13, 2026 15:49 — with GitHub Actions Inactive

Merge branch 'main' into copilot/add-benchmarkspec-builtin-truthful-qa

303953c

AbdelmohsenMS temporarily deployed to Testing February 13, 2026 15:50 — with GitHub Actions Inactive

AbdelmohsenMS temporarily deployed to Testing February 13, 2026 15:51 — with GitHub Actions Inactive

AbdelmohsenMS merged commit 2a7a8ee into main Feb 13, 2026
36 checks passed

AbdelmohsenMS deleted the copilot/add-benchmarkspec-builtin-truthful-qa branch February 13, 2026 15:54

AbdelmohsenMS merged 8 commits into main from copilot/add-benchmarkspec-builtin-truthful-qa


3 participants
