
Commit b28458d

add dataflow eval doc and taxonomy (#128)
1 parent af5c0bf commit b28458d

6 files changed: +319 −1 lines changed

docs/.vuepress/notes/en/guide.ts

Lines changed: 10 additions & 0 deletions
```diff
@@ -70,6 +70,16 @@ export const Guide: ThemeNote = defineNoteConfig({
         "FuncCallPipeline",
       ]
     },
+    {
+      text: "Model Evaluation",
+      collapsed: false,
+      icon: 'carbon:flow',
+      prefix: 'model_evaluation',
+      items: [
+        "command_eval",
+        "easy_evaluation",
+      ]
+    },
     {
       text: "General Operators",
       collapsed: false,
```

docs/.vuepress/notes/zh/guide.ts

Lines changed: 10 additions & 0 deletions
```diff
@@ -69,6 +69,16 @@ export const Guide: ThemeNote = defineNoteConfig({
         "FuncCallPipeline",
       ]
     },
+    {
+      text:"模型自动评估",
+      collapsed: false,
+      icon: 'carbon:flow',
+      prefix: 'model_evaluation',
+      items: [
+        "command_eval",
+        "easy_evaluation",
+      ]
+    },
     {
       text: "通用算子(移动到API)",
       collapsed: false,
```

docs/en/notes/guide/pipelines/EvalPipeline.md renamed to docs/en/notes/guide/model_evaluation/command_eval.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -1,3 +1,9 @@
+---
+title: Command Model Evaluation Pipeline
+icon: hugeicons:chart-evaluation
+createTime: 2025/10/17 15:00:50
+permalink: /en/guide/qi6ikv5s/
+---
 # **Evaluation Pipeline**
 
 Only supports QA pair format evaluation
```
docs/en/notes/guide/model_evaluation/easy_evaluation.md

Lines changed: 143 additions & 0 deletions

---
title: easy_evaluation
icon: hugeicons:chart-evaluation
createTime: 2025/10/17 15:20:10
permalink: /en/guide/97wq40d9/
---

# 📊 Model Evaluation Pipeline Guide

This guide explains how to use the **DataFlow** evaluation pipeline to assess model-generated answers against ground-truth answers using either **semantic** or **exact-match** comparison.
Two evaluation modes are supported:

1. **Direct Comparison Mode**: compare existing model outputs with ground-truth answers.
2. **Generate-and-Evaluate Mode**: first generate model answers, then compare them with ground-truth answers.

---

## 🧩 Step 1: Install the Evaluation Environment

```bash
cd DataFlow
pip install -e .
```

This installs DataFlow in editable mode, which makes local development and debugging easier.

---

## 📁 Step 2: Create and Enter the Workspace

```bash
mkdir workspace
cd workspace
```

All configuration files and cached evaluation data will be stored in this workspace directory.

---

## ⚙️ Step 3: Initialize the Evaluation Configuration

Run the following command to initialize the evaluation configuration:

```bash
dataflow init
```

After initialization, the directory structure will look like this:

```text
api_pipelines/
├── core_text_bencheval_semantic_pipeline.py                      # Evaluator for API models
├── core_text_bencheval_semantic_pipeline_question.py             # Evaluator for local models (requires question)
└── core_text_bencheval_semantic_pipeline_question_single_step.py # Evaluator for local models (generate + evaluate)
```

---

## 🚀 Step 4: Run the Evaluation

Navigate to the `api_pipelines` folder:

```bash
cd api_pipelines
```

Select the script that matches your evaluation mode:

<table>
  <thead>
    <tr>
      <th style="width: 22%">🧩 Task Type</th>
      <th style="width: 22%">❓ Requires Question</th>
      <th style="width: 22%">🧠 Generates Answers</th>
      <th style="width: 34%">▶️ Script to Run</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Compare existing answers (no Question required)</td>
      <td align="center">❌</td>
      <td align="center">❌</td>
      <td><code>core_text_bencheval_semantic_pipeline.py</code></td>
    </tr>
    <tr>
      <td>Compare existing answers (requires Question)</td>
      <td align="center">✅</td>
      <td align="center">❌</td>
      <td><code>core_text_bencheval_semantic_pipeline_question.py</code></td>
    </tr>
    <tr>
      <td>Generate answers, then compare (requires Question)</td>
      <td align="center">✅</td>
      <td align="center">✅</td>
      <td><code>core_text_bencheval_semantic_pipeline_question_single_step.py</code></td>
    </tr>
  </tbody>
</table>

Example:

```bash
python core_text_bencheval_semantic_pipeline_question_single_step.py
```

---

## 🗂️ Data Storage Configuration

Evaluation data paths are managed by `FileStorage`, which can be customized in the script:

```python
self.storage = FileStorage(
    first_entry_file_name="../example_data/chemistry/matched_sample_10.json",
    cache_path="./cache_all_17_24_gpt_5",
    file_name_prefix="math_QA",
    cache_type="json",
)
```

* **first_entry_file_name** — Path to the evaluation dataset (e.g., the example data)
* **cache_path** — Directory for caching intermediate evaluation results
* **file_name_prefix** — Prefix for cached files
* **cache_type** — File type for the cache (typically `json`)

---

## 🧠 Step 5: Define Evaluation Keys

Specify the field mappings between model outputs and ground-truth labels:

```python
self.evaluator_step.run(
    storage=self.storage.step(),
    input_test_answer_key="model_answer",
    input_gt_answer_key="golden_label",
)
```

* **input_test_answer_key** — Key name for model-generated answers
* **input_gt_answer_key** — Key name for ground-truth answers

Make sure these field names match the corresponding keys in your dataset; a sketch of an assumed record format follows below.
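The exact schema of the example dataset is not reproduced here. As a rough illustration only, an input file compatible with the keys above might look like the following minimal sketch; the `question` key name, the list-of-objects layout, and the file name `my_eval_sample.json` are assumptions for illustration, not taken from the DataFlow documentation:

```python
import json

# Minimal sketch of an assumed input format: a JSON list of QA records whose
# keys match input_test_answer_key / input_gt_answer_key configured above.
records = [
    {
        "question": "What is the molar mass of water?",  # assumed key; only needed by the *_question scripts
        "model_answer": "About 18 g/mol.",               # model-generated answer
        "golden_label": "18.015 g/mol",                  # ground-truth answer
    },
]

# Write the file wherever first_entry_file_name points, e.g. a local test file.
with open("my_eval_sample.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```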

docs/zh/notes/guide/pipelines/EvalPipeline.md renamed to docs/zh/notes/guide/model_evaluation/command_eval.md

Lines changed: 7 additions & 1 deletion
````diff
@@ -1,3 +1,9 @@
+---
+title: Command-Line Evaluation Pipeline
+icon: hugeicons:chart-evaluation
+createTime: 2025/10/17 15:00:50
+permalink: /zh/guide/enty5kqg/
+---
 # Evaluation Pipeline
 
 Only supports evaluation in QA-pair format
@@ -6,7 +12,7 @@
 
 ```
 cd DataFlow
-pip install -e .[llamafactory]
+pip install -e .[vllm]
 
 cd ..
 mkdir workspace
````
docs/zh/notes/guide/model_evaluation/easy_evaluation.md

Lines changed: 143 additions & 0 deletions

---
title: Model Evaluation Pipeline
icon: hugeicons:chart-evaluation
createTime: 2025/10/17 15:00:50
permalink: /zh/guide/enty5ksn/
---

# 📊 Model Evaluation Pipeline Guide

This guide describes how to use the **DataFlow** evaluation pipeline to compare model-generated answers against reference answers using semantic or exact matching.
Two modes are supported:

1. **Direct Comparison Mode**: compare existing generated results with the reference answers.
2. **Generate-and-Evaluate Mode**: have the model generate answers first, then compare them with the reference answers.

---

## 🧩 Step 1: Install the Evaluation Environment

```bash
cd DataFlow
pip install -e .
```

This installs DataFlow in editable mode, which is convenient for local development and debugging.

---

## 📁 Step 2: Create and Enter the Workspace

```bash
mkdir workspace
cd workspace
```

All evaluation-related configuration files and cached data are generated and stored in this directory.

---

## ⚙️ Step 3: Initialize the Evaluation Configuration

Initialize the evaluation configuration with:

```bash
dataflow init
```

After initialization, the project directory structure looks like this:

```text
api_pipelines/
├── core_text_bencheval_semantic_pipeline.py                      # Evaluator: API models
├── core_text_bencheval_semantic_pipeline_question.py             # Evaluator: local models (requires question)
└── core_text_bencheval_semantic_pipeline_question_single_step.py # Evaluator: local models (generate first, then evaluate)
```

---

## 🚀 Step 4: Run the Evaluation

Enter the `api_pipelines` folder:

```bash
cd api_pipelines
```

Choose the script that matches your task:

<table>
  <thead>
    <tr>
      <th style="width: 22%">🧩 Task Type</th>
      <th style="width: 22%">❓ Requires Question</th>
      <th style="width: 22%">🧠 Generates Answers</th>
      <th style="width: 34%">▶️ Script to Run</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Compare existing answers (no Question required)</td>
      <td align="center">❌</td>
      <td align="center">❌</td>
      <td><code>core_text_bencheval_semantic_pipeline.py</code></td>
    </tr>
    <tr>
      <td>Compare existing answers (requires Question)</td>
      <td align="center">✅</td>
      <td align="center">❌</td>
      <td><code>core_text_bencheval_semantic_pipeline_question.py</code></td>
    </tr>
    <tr>
      <td>Generate answers first, then compare (requires Question)</td>
      <td align="center">✅</td>
      <td align="center">✅</td>
      <td><code>core_text_bencheval_semantic_pipeline_question_single_step.py</code></td>
    </tr>
  </tbody>
</table>

Example:

```bash
python core_text_bencheval_semantic_pipeline_question_single_step.py
```

---

## 🗂️ Data Storage and Configuration

Evaluation data paths are managed by `FileStorage` and can be modified in the script:

```python
self.storage = FileStorage(
    first_entry_file_name="../example_data/chemistry/matched_sample_10.json",
    cache_path="./cache_all_17_24_gpt_5",
    file_name_prefix="math_QA",
    cache_type="json",
)
```

* **first_entry_file_name**: path to the evaluation data file (e.g., the example data)
* **cache_path**: cache directory for intermediate evaluation results
* **file_name_prefix**: prefix for cached file names
* **cache_type**: cache file type (typically `json`)

---

## 🧠 Step 5: Set the Evaluation Fields

Define which fields hold the model output and the reference answer:

```python
self.evaluator_step.run(
    storage=self.storage.step(),
    input_test_answer_key="model_answer",
    input_gt_answer_key="golden_label",
)
```

* **input_test_answer_key**: field name of the model-generated answer
* **input_gt_answer_key**: field name of the reference (golden label) answer

Make sure these field names exactly match the keys in your data file.
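After a run, it can be handy to peek at the cached intermediate results. The following is a minimal sketch, assuming `cache_type="json"` produces plain JSON files under `cache_path` whose names start with `file_name_prefix`; the actual naming scheme inside the cache directory may differ:

```python
import json
from pathlib import Path

# Minimal sketch: list and inspect cached evaluation results written by FileStorage.
# Assumes JSON cache files named with the configured file_name_prefix ("math_QA").
cache_dir = Path("./cache_all_17_24_gpt_5")

for cache_file in sorted(cache_dir.glob("math_QA*.json")):
    with cache_file.open(encoding="utf-8") as f:
        data = json.load(f)
    print(f"{cache_file.name}: {len(data)} records")
```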
