Commit 1656c0a

updated onechat evaluation suite (#232728)
## Summary

This pull request updates the system prompt and example outputs for the correctness evaluator. The main changes include:

1. Moving the `summary` object to the end of each JSON output
2. Adding the `sequence_match` field to each claim analysis, and ensuring all example outputs follow the revised format
3. Adding all the examples (total 70; current 5) from the onechat evaluation dataset to the onechat test suite

Closes: elastic/search-team#10824

### Checklist

Check the PR satisfies following conditions. Reviewers should verify this PR satisfies this list as well.

- [ ] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/src/platform/packages/shared/kbn-i18n/README.md)
- [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials
- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
- [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations.
- [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed
- [ ] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
- [ ] Review the [backport guidelines](https://docs.google.com/document/d/1VyN5k91e5OVumlc0Gb9RPa3h1ewuPE705nRtioPiTvY/edit?usp=sharing) and apply applicable `backport:*` labels.

### Identify risks

Does this PR introduce any risks? For example, consider risks like hard to test bugs, performance regression, potential of data loss.

Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.

- [ ] [See some risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx)
- [ ] ...

---------

Co-authored-by: Srdjan Lulic <[email protected]>
1 parent 443f21b commit 1656c0a

2 files changed: +1094 -48 lines changed

x-pack/platform/packages/shared/kbn-evals/src/evaluators/correctness/system_prompt.text

Lines changed: 31 additions & 22 deletions
@@ -67,11 +67,6 @@ Your final output MUST be a single, valid JSON object. Do not include any text o
 
 ```json
 {
-  "summary": {
-    "factual_accuracy_summary": "ACCURATE | MINOR_INACCURACIES | MAJOR_INACCURACIES",
-    "relevance_summary": "RELEVANT | PARTIALLY_RELEVANT | IRRELEVANT",
-    "sequence_accuracy_summary": "MATCH | MISMATCH | NOT_APPLICABLE"
-  },
   "analysis": [
     {
       "claim": "The specific claim extracted from the agent's response.",
@@ -82,7 +77,12 @@ Your final output MUST be a single, valid JSON object. Do not include any text o
       "justification_snippet": "A direct snippet from the Ground Truth Response or null.",
       "explanation": "A brief explanation of the verdict reasoning."
     }
-  ]
+  ],
+  "summary": {
+    "factual_accuracy_summary": "ACCURATE | MINOR_INACCURACIES | MAJOR_INACCURACIES",
+    "relevance_summary": "RELEVANT | PARTIALLY_RELEVANT | IRRELEVANT",
+    "sequence_accuracy_summary": "MATCH | MISMATCH | NOT_APPLICABLE"
+  }
 }
 ```
 
@@ -95,16 +95,13 @@ EXAMPLE 1: Procedural HR Query
 
 ```json
 {
-  "summary": {
-    "factual_accuracy_summary": "MAJOR_INACCURACIES",
-    "relevance_summary": "RELEVANT"
-  },
   "analysis": [
     {
       "claim": "log into WorkDay and go to the 'Time Entry' panel.",
       "centrality": "central",
       "centrality_reason": "These are the first two essential steps in the procedure the user asked for.",
       "verdict": "FULLY_SUPPORTED",
+      "sequence_match": "MATCH",
       "justification_snippet": "Login to WorkDay. Navigate to the 'Time Entry' panel.",
       "explanation": "The claims directly match the instructions in the ground truth."
     },
@@ -113,6 +110,7 @@ EXAMPLE 1: Procedural HR Query
       "centrality": "central",
       "centrality_reason": "This is the final action in the procedure.",
       "verdict": "PARTIALLY_SUPPORTED",
+      "sequence_match": "MATCH",
       "justification_snippet": "click the 'Submit' button located at the top-right.",
       "explanation": "The core action 'press the Submit button' is correct, but the adjectives 'prominent green' are unverified embellishments not present in the ground truth."
     },
@@ -121,10 +119,16 @@ EXAMPLE 1: Procedural HR Query
       "centrality": "central",
       "centrality_reason": "The submission deadline is a critical detail of the process.",
       "verdict": "CONTRADICTED",
+      "sequence_match": "MATCH",
       "justification_snippet": "Timesheets are due by 5 PM Friday EST.",
       "explanation": "The agent's claim of 'local time' is explicitly contradicted by the ground truth's specific 'EST' timezone."
     }
-  ]
+  ],
+  "summary": {
+    "factual_accuracy_summary": "MAJOR_INACCURACIES",
+    "relevance_summary": "RELEVANT",
+    "sequence_accuracy_summary": "MATCH"
+  }
 }
 ```
 
@@ -136,16 +140,13 @@ EXAMPLE 2: Factual Sales Query
 
 ```json
 {
-  "summary": {
-    "factual_accuracy_summary": "MAJOR_INACCURACIES",
-    "relevance_summary": "PARTIALLY_RELEVANT"
-  },
   "analysis": [
     {
       "claim": "the total value of open deals is $250,000",
       "centrality": "central",
       "centrality_reason": "This directly answers the first part of the user's query.",
       "verdict": "FULLY_SUPPORTED",
+      "sequence_match": "NOT_APPLICABLE",
       "justification_snippet": "The total value of open deals for Acme Corp is $250,000.",
       "explanation": "The value given by the agent is an exact match with the ground truth."
     },
@@ -154,6 +155,7 @@ EXAMPLE 2: Factual Sales Query
       "centrality": "central",
       "centrality_reason": "This attempts to answer the second part of the user's query about the contact person.",
       "verdict": "CONTRADICTED",
+      "sequence_match": "NOT_APPLICABLE",
       "justification_snippet": "The primary contact is Jane Doe",
       "explanation": "The agent incorrectly identifies Jane Doe's role as 'account manager' when the ground truth specifies 'primary contact'. In a sales context, these are different roles."
     },
@@ -162,6 +164,7 @@ EXAMPLE 2: Factual Sales Query
       "centrality": "peripheral",
       "centrality_reason": "The user did not ask for the account creation date.",
       "verdict": "FULLY_SUPPORTED",
+      "sequence_match": "NOT_APPLICABLE",
       "justification_snippet": "The account was created on July 15, 2025.",
       "explanation": "This claim is factually correct per the ground truth but is not relevant to the user's question."
     },
@@ -170,10 +173,16 @@ EXAMPLE 2: Factual Sales Query
       "centrality": "peripheral",
       "centrality_reason": "This is conversational advice and was not requested by the user.",
       "verdict": "NOT_IN_GROUND_TRUTH",
+      "sequence_match": "NOT_APPLICABLE",
       "justification_snippet": null,
       "explanation": "This claim is a suggestion made by the agent and its core fact is not present anywhere in the ground truth data."
     }
-  ]
+  ],
+  "summary": {
+    "factual_accuracy_summary": "MAJOR_INACCURACIES",
+    "relevance_summary": "PARTIALLY_RELEVANT",
+    "sequence_accuracy_summary": "NOT_APPLICABLE"
+  }
 }
 ```
 
@@ -185,11 +194,6 @@ EXAMPLE 3: Accurate but Sequentially Incorrect Procedural Query
 
 ```json
 {
-  "summary": {
-    "factual_accuracy_summary": "ACCURATE",
-    "relevance_summary": "RELEVANT",
-    "sequence_accuracy_summary": "MISMATCH"
-  },
   "analysis": [
     {
       "claim": "Navigate to the 'Reports' dashboard.",
@@ -227,6 +231,11 @@ EXAMPLE 3: Accurate but Sequentially Incorrect Procedural Query
       "justification_snippet": "3. Add your desired data sources and visuals, then click the 'Save' button...",
       "explanation": "This action is factually correct but is presented as step 4, after the impossible 'Share' step. It should be step 3."
     }
-  ]
+  ],
+  "summary": {
+    "factual_accuracy_summary": "ACCURATE",
+    "relevance_summary": "RELEVANT",
+    "sequence_accuracy_summary": "MISMATCH"
+  }
 }
 ```
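
For reviewers who want to sanity-check evaluator outputs against the revised format, the two prompt changes in this diff (`summary` placed after `analysis`, and a `sequence_match` value on every claim) could be checked with something like the sketch below. This is not code from `kbn-evals`: the function name `checkEvaluatorOutput`, the `SEQUENCE_VALUES` constant, and the problem messages are hypothetical, and it assumes `sequence_match` accepts the same three values the prompt lists for `sequence_accuracy_summary`.

```typescript
// Hypothetical helper, not part of kbn-evals. Assumes sequence_match uses the
// same value set the prompt lists for sequence_accuracy_summary.
const SEQUENCE_VALUES: readonly string[] = ["MATCH", "MISMATCH", "NOT_APPLICABLE"];

function checkEvaluatorOutput(raw: string): string[] {
  const problems: string[] = [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["output is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return ["output is not a JSON object"];
  }

  const obj = parsed as Record<string, unknown>;

  // JSON.parse keeps string keys in source order, so key position reflects
  // where the model emitted each top-level field.
  const keys = Object.keys(obj);
  const analysisAt = keys.indexOf("analysis");
  const summaryAt = keys.indexOf("summary");
  if (analysisAt === -1 || summaryAt === -1) {
    problems.push("missing analysis or summary");
  } else if (summaryAt < analysisAt) {
    problems.push("summary should come after analysis");
  }

  // Every claim analysis needs a valid sequence_match.
  const analysis: unknown[] = Array.isArray(obj.analysis) ? obj.analysis : [];
  analysis.forEach((item: unknown, i: number) => {
    const value = (item as Record<string, unknown> | null)?.sequence_match;
    if (typeof value !== "string" || SEQUENCE_VALUES.indexOf(value) === -1) {
      problems.push(`analysis[${i}] has no valid sequence_match`);
    }
  });

  // The summary needs a valid sequence_accuracy_summary.
  const summary = obj.summary as Record<string, unknown> | null | undefined;
  const seq = summary?.sequence_accuracy_summary;
  if (typeof seq !== "string" || SEQUENCE_VALUES.indexOf(seq) === -1) {
    problems.push("summary has no valid sequence_accuracy_summary");
  }

  return problems;
}
```

An output in the new format yields an empty list, while an output in the old format (summary first, no per-claim `sequence_match`) yields one entry per deviation. The check works on the raw string rather than a parsed object passed around elsewhere, since key order is only meaningful relative to what the model actually emitted.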
