Commit b5ece5c

Merge pull request #38 from DataEval/dev
Dev to Main
2 parents: b216a7e + c8af4a8

File tree: 117 files changed (+1242, -808 lines)


.github/workflows/IntegrationTest.yml

Lines changed: 2 additions & 2 deletions
@@ -7,10 +7,10 @@ on:
   push:
     branches: [ "main", "dev" ]
   pull_request:
-    branches: [ "main" ]
+    branches: [ "main", "dev" ]
   workflow_dispatch:
 
-
+
 
 jobs:
   build:

.github/workflows/lint.yml

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+name: lint
+
+on: [push, pull_request]
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: [3.10.15]
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install pre-commit hook
+        run: |
+          pip install pre-commit==3.8.0
+          pre-commit install
+      - name: Linting
+        run: |
+          pre-commit sample-config > .pre-commit-config.yaml
+          pre-commit run --all-files

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+__pycache__/
+*.egg-info/

.owners.yml

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+assign:
+  strategy:
+    # random
+    daily-shift-based
+  schedule:
+    '*/1 * * * *'
+  assignees:
+    - e06084
+    - shijinpjlab

.pre-commit-config.yaml

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# See https://pre-commit.com for more information
+# See https://pre-commit.com/hooks.html for more hooks
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v5.0.0
+    hooks:
+      - id: trailing-whitespace
+      - id: end-of-file-fixer
+      - id: check-yaml
+      - id: check-added-large-files
+  - repo: https://github.com/PyCQA/isort
+    rev: 6.0.0
+    hooks:
+      - id: isort

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -198,4 +198,4 @@
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
-limitations under the License.
+limitations under the License.

README.md

Lines changed: 31 additions & 17 deletions
@@ -9,7 +9,17 @@
 
 </div>
 
-[English](README.md) | [简体中文](README_CN.md)
+[English](README.md) | [简体中文](README_zh-CN.md)
+
+<div align="center">
+  <a href="https://discord.gg/Jhgb2eKWh8" style="text-decoration:none;">
+  <img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="" /></a>
+  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
+  <a href="https://huggingface.co/spaces/DataEval/dingo" style="text-decoration:none;">
+  <img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" width="3%" alt="Hugging Face" /></a>
+  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
+</div>
+
 
 # Changelog
 
@@ -83,7 +93,7 @@ $ cat test/data/config_gpt.json
   "llm_config": {
     "openai": {
       "model": "gpt-4o",
-      "key": "xxxx",
+      "key": "xxxx",
       "api_url": "https://api.openai.com/v1/chat/completions"
     }
   }
@@ -99,7 +109,10 @@ If the user wants to manually start a frontend page, you need to enter the following command:
 python -m dingo.run.vsl --input xxx
 ```
 
-The input followed is the directory of the quality inspection results. Users need to ensure that there is a summary.json file when the directory is opened.
+The input followed is the directory of the quality inspection results. Users need to ensure that there is a summary.json file when the directory is opened. Frontend page of output looks like:![GUI output](docs/assets/dingo_gui.png)
+
+## Online Demo
+Try dingo on our online demo: [(Hugging Face)🤗](https://huggingface.co/spaces/DataEval/dingo)
 
 # Feature List
 
@@ -153,17 +166,17 @@ then you can refer to: [Install Dependencies](requirements)
 
 ## Register Rules/Prompts/Models
 
-If the heuristic rules inside the project do not meet the user's quality inspection requirements, users can also customize rules or models.
+If the heuristic rules inside the project do not meet the user's quality inspection requirements, users can also customize rules or models.
 
 ### Register Rules
 
-If the user wants to create a new rule `CommonPatternDemo`, then the first step is to add a decorator to the rule to inject the rule into the project.
-Secondly, the `metric_type` type, such as `QUALITY_BAD_RELEVANCE`, needs to be set for the rule, and `group` does not need to be set.
-Then the user needs to define the `DynamicRuleConfig` object, so that the properties of the rule can be configured dynamically.
-In addition, the method name of the rule must be `eval` and it needs to be a class method.
-The return value of the last step should be a `ModelRes` object.
+If the user wants to create a new rule `CommonPatternDemo`, then the first step is to add a decorator to the rule to inject the rule into the project.
+Secondly, the `metric_type` type, such as `QUALITY_BAD_RELEVANCE`, needs to be set for the rule, and `group` does not need to be set.
+Then the user needs to define the `DynamicRuleConfig` object, so that the properties of the rule can be configured dynamically.
+In addition, the method name of the rule must be `eval` and it needs to be a class method.
+The return value of the last step should be a `ModelRes` object.
 
-For example: [Register Rules](examples/register/sdk_register_rule.py)
+For example: [Register Rules](examples/register/sdk_register_rule.py)
 
 ### Register Prompts
 
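The registration flow the README hunk above describes (decorator injection, a `metric_type` such as `QUALITY_BAD_RELEVANCE`, a `DynamicRuleConfig`, a class-method `eval` returning `ModelRes`) can be sketched with self-contained stand-ins. Note the `register_rule` decorator, `RULE_REGISTRY`, and the simplified `ModelRes`/`DynamicRuleConfig` below are illustrative assumptions; the real classes live in the dingo package and differ in detail.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Type

# Hypothetical stand-ins for dingo's result and config objects.
@dataclass
class ModelRes:
    error_status: bool = False
    type: str = "QUALITY_GOOD"
    reason: List[str] = field(default_factory=list)

@dataclass
class DynamicRuleConfig:
    pattern: str = ""

RULE_REGISTRY: Dict[str, Type] = {}

def register_rule(metric_type: str):
    """Decorator that injects a rule class into the registry with its metric_type."""
    def wrap(cls):
        cls.metric_type = metric_type
        RULE_REGISTRY[cls.__name__] = cls
        return cls
    return wrap

@register_rule(metric_type="QUALITY_BAD_RELEVANCE")
class CommonPatternDemo:
    # Dynamically configurable rule properties.
    dynamic_config = DynamicRuleConfig(pattern=r"\d{4}")

    @classmethod
    def eval(cls, text: str) -> ModelRes:
        # The rule's entry point must be a class method named `eval`
        # and must return a ModelRes.
        res = ModelRes()
        if re.search(cls.dynamic_config.pattern, text):
            res.error_status = True
            res.type = cls.metric_type
            res.reason = ["matched pattern " + cls.dynamic_config.pattern]
        return res

print(CommonPatternDemo.eval("year 2024").error_status)  # prints True
```

Once registered this way, the class is discoverable by name through the registry, which is the point of the decorator step the hunk describes.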
@@ -173,8 +186,8 @@ For example: [Register Prompts](examples/register/sdk_register_prompt.py)
 
 ### Register Models
 
-The way to register models is slightly different, users need to implement a call_api method, accept MetaData type parameters, and return ModelRes type results.
-There are already implemented basic model classes [BaseOpenAI](dingo/model/llm/base_openai.py) in the project, users can directly inherit.
+The way to register models is slightly different, users need to implement a call_api method, accept MetaData type parameters, and return ModelRes type results.
+There are already implemented basic model classes [BaseOpenAI](dingo/model/llm/base_openai.py) in the project, users can directly inherit.
 If the user has special functions to implement, then you can rewrite the corresponding methods.
 
 For example: [Register Models](examples/register/sdk_register_llm.py)
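The `call_api` contract described in that hunk (accept a `MetaData`, return a `ModelRes`, inherit from a base class and override what you need) can be illustrated with a minimal self-contained sketch; `MetaData`, `ModelRes`, and `BaseOpenAI` here are simplified stand-ins for the real dingo classes, not their actual definitions.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-ins mirroring the shapes described above.
@dataclass
class MetaData:
    data_id: str
    content: str

@dataclass
class ModelRes:
    error_status: bool = False
    reason: List[str] = field(default_factory=list)

class BaseOpenAI:
    """Base model class; subclasses override call_api."""
    @classmethod
    def call_api(cls, data: MetaData) -> ModelRes:
        raise NotImplementedError

class MyModel(BaseOpenAI):
    @classmethod
    def call_api(cls, data: MetaData) -> ModelRes:
        # Overridden method implementing a special check locally
        # instead of calling a remote LLM endpoint: flag empty content.
        if not data.content.strip():
            return ModelRes(error_status=True, reason=["empty content"])
        return ModelRes()
```

The design point is that the executor only depends on the `call_api` signature, so any subclass that honours it can slot in.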
@@ -185,7 +198,7 @@ For example: [Register Models](examples/register/sdk_register_llm.py)
 
 ## Execution Engine
 
-`Dingo` can run locally or on a spark cluster.
+`Dingo` can run locally or on a spark cluster.
 Regardless of the choice of engine, the executor supports some common methods:
 
 | function name | description |
@@ -203,9 +216,9 @@ When choosing the spark engine, users can freely choose rules, models for quality inspection.
 
 ### Spark Mode
 
-When choosing the spark engine, users can only choose rules for quality inspection, and models cannot be used.
-And only `eval_group`,`save_data`,`save_correct`,`custom_config` in `InputArgs` are still valid.
-Therefore, the user needs to input `spark_session` to initialize spark, and input `spark_rdd` (composed of `MetaData` structure) as data for quality inspection.
+When choosing the spark engine, users can only choose rules for quality inspection, and models cannot be used.
+And only `eval_group`,`save_data`,`save_correct`,`custom_config` in `InputArgs` are still valid.
+Therefore, the user needs to input `spark_session` to initialize spark, and input `spark_rdd` (composed of `MetaData` structure) as data for quality inspection.
 It should be noted that if `save_data` is `False`, then the data in memory will be cleared immediately after the quality inspection is completed, and `spark_session` will also stop immediately.
 
 [Spark Example](examples/spark/sdk_spark.py)
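The spark-mode restriction in that hunk (only `eval_group`, `save_data`, `save_correct`, `custom_config` in `InputArgs` stay valid) can be sketched as a small argument filter. This helper and its field names are illustrative assumptions for the sketch, not part of dingo's API; dingo itself simply ignores the other fields.

```python
from typing import Any, Dict, List

# Fields of InputArgs that spark mode still honours, per the README.
SPARK_VALID_ARGS = {"eval_group", "save_data", "save_correct", "custom_config"}

def split_spark_args(input_args: Dict[str, Any]) -> Dict[str, Any]:
    """Split an InputArgs-like dict into fields spark mode honours vs. ignores."""
    honoured = {k: v for k, v in input_args.items() if k in SPARK_VALID_ARGS}
    ignored: List[str] = sorted(set(input_args) - SPARK_VALID_ARGS)
    return {"honoured": honoured, "ignored": ignored}

# Example: input_path is meaningless in spark mode (data comes from spark_rdd),
# so a caller can warn about it up front.
result = split_spark_args(
    {"eval_group": "default", "input_path": "data.jsonl", "save_data": False}
)
```

A wrapper like this makes the silent-ignore behaviour explicit before handing `spark_session` and `spark_rdd` to the executor.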
@@ -275,7 +288,8 @@ If you find this project useful, please consider citing our tool:
 ```
 @misc{dingo,
 title={Dingo: A Comprehensive Data Quality Evaluation Tool for Large Models},
+author={Dingo Contributors},
 howpublished={\url{https://github.com/DataEval/dingo}},
 year={2024}
 }
-```
+```

README_CN.md renamed to README_zh-CN.md

Lines changed: 21 additions & 15 deletions
@@ -82,7 +82,7 @@ $ cat test/data/config_gpt.json
   "llm_config": {
     "openai": {
       "model": "gpt-4o",
-      "key": "xxxx",
+      "key": "xxxx",
       "api_url": "https://api.openai.com/v1/chat/completions"
     }
   }
@@ -98,7 +98,12 @@ $ cat test/data/config_gpt.json
 python -m dingo.run.vsl --input xxx
 ```
 
-input之后跟随的是质检结果的目录,用户需要确保目录打开后其中有summary.json文件
+input之后跟随的是质检结果的目录,用户需要确保目录打开后其中有summary.json文件。
+前端页面输出效果如下:![GUI output](docs/assets/dingo_gui.png)
+
+## 5.在线demo
+
+尝试使用我们的在线demo: [(Hugging Face)🤗](https://huggingface.co/spaces/DataEval/dingo)
 
 # 三、功能列表
 

@@ -152,17 +157,17 @@ Dingo 支持输出7个Quality Metrics概况报告和异常数据追溯详情报告
 
 ## 2.注册规则/prompt/模型
 
-如果项目内部的启发式规则不满足用户的质检需求,用户还可以自定义规则或者模型。
+如果项目内部的启发式规则不满足用户的质检需求,用户还可以自定义规则或者模型。
 
 ### 2.1 注册规则
 
-如果用户想要创建一个新规则 `CommonPatternDemo`,那么首先要为规则添加装饰器,将规则注入项目中。
-其次还需要为规则设置 `metric_type` 类型,比如 `QUALITY_BAD_RELEVANCE`,`group` 可以不用设置。
-然后用户需要定义 `DynamicRuleConfig` 对象,这样可以动态的配置规则的属性。
-除此之外,规则的方法名称必须是 `eval` 且需要是类方法。
-最后一步的返回值应该是 `ModelRes` 对象。
+如果用户想要创建一个新规则 `CommonPatternDemo`,那么首先要为规则添加装饰器,将规则注入项目中。
+其次还需要为规则设置 `metric_type` 类型,比如 `QUALITY_BAD_RELEVANCE`,`group` 可以不用设置。
+然后用户需要定义 `DynamicRuleConfig` 对象,这样可以动态的配置规则的属性。
+除此之外,规则的方法名称必须是 `eval` 且需要是类方法。
+最后一步的返回值应该是 `ModelRes` 对象。
 
-例如:[注册规则](examples/register/sdk_register_rule.py)
+例如:[注册规则](examples/register/sdk_register_rule.py)
 
 ### 2.2 注册prompt
 
@@ -172,8 +177,8 @@ Dingo 支持输出7个Quality Metrics概况报告和异常数据追溯详情报告
 
 ### 2.3 注册模型
 
-注册模型的方式略有不同,用户需要实现一个call_api方法,接受MetaData类型参数,返回ModelRes类型结果。
-项目中有已经实现好的基础模型类[BaseOpenAI](dingo/model/llm/base_openai.py),用户可以直接继承。
+注册模型的方式略有不同,用户需要实现一个call_api方法,接受MetaData类型参数,返回ModelRes类型结果。
+项目中有已经实现好的基础模型类[BaseOpenAI](dingo/model/llm/base_openai.py),用户可以直接继承。
 如果用户有特殊的功能要实现,那么就可以重写对应的方法。
 
 例如:[注册模型](examples/register/sdk_register_llm.py)
@@ -184,7 +189,7 @@ Dingo 支持输出7个Quality Metrics概况报告和异常数据追溯详情报告
 
 ## 4.执行引擎
 
-`Dingo` 可以在本地运行,也可以在spark集群上运行。
+`Dingo` 可以在本地运行,也可以在spark集群上运行。
 无论选择何种引擎,executor都支持一些公共方法:
 
 | function name | description |
@@ -202,9 +207,9 @@ Dingo 支持输出7个Quality Metrics概况报告和异常数据追溯详情报告
 
 ### 4.2 Spark Mode
 
-选择spark引擎时,用户只能选择规则进行质检,模型无法使用。
-而且`InputArgs`中仅有`eval_group`,`save_data`,`save_correct`,`custom_config`依旧有效。
-因此,用户需要输入`spark_session`用来初始化spark,输入`spark_rdd`(由`MetaData`结构组成)作为数据用来质检。
+选择spark引擎时,用户只能选择规则进行质检,模型无法使用。
+而且`InputArgs`中仅有`eval_group`,`save_data`,`save_correct`,`custom_config`依旧有效。
+因此,用户需要输入`spark_session`用来初始化spark,输入`spark_rdd`(由`MetaData`结构组成)作为数据用来质检。
 需要注意,`save_data`如果为`False`,那么质检完成后会立刻清除内存中的数据,`spark_session`也立即停止。
 
 [spark示例](examples/spark/sdk_spark.py)
@@ -274,6 +279,7 @@ If you find this project useful, please consider citing our tool:
 ```
 @misc{dingo,
 title={Dingo: A Comprehensive Data Quality Evaluation Tool for Large Models},
+author={Dingo Contributors},
 howpublished={\url{https://github.com/DataEval/dingo}},
 year={2024}
 }

Todo.json

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-{"verion":"0.0.1","entries":[]}
+{"verion":"0.0.1","entries":[]}

app/.editorconfig

Lines changed: 1 addition & 1 deletion
@@ -6,4 +6,4 @@ indent_style = space
 indent_size = 2
 end_of_line = lf
 insert_final_newline = true
-trim_trailing_whitespace = true
+trim_trailing_whitespace = true
