Merged
71 commits
ce50ff3
add: lint pre-commit
shijinPJ Dec 31, 2024
255b659
add: issue pr notice
shijinPJ Dec 31, 2024
3bbbbca
update: change to en
shijinPJ Dec 31, 2024
6890aa0
update: file name in eval dir
shijinPJ Jan 3, 2025
37ea732
add: isort in pre commit
shijinPJ Jan 6, 2025
59d36f5
update: all files isort
shijinPJ Jan 6, 2025
c37b7a2
update: ExecProto and Executor
shijinPJ Jan 6, 2025
4246b45
feat: add ChatMLConvertor for chatml-jsonl format and enhance Executor
zihan-huang Jan 6, 2025
93d67cd
add prompt
imMid-Star Jan 6, 2025
b4d4b1f
update prompt
imMid-Star Jan 7, 2025
1cb72ed
update prompt
imMid-Star Jan 7, 2025
64ed023
add: prompt PromptUnreadIssue
dt-yy Jan 7, 2025
4be1bb9
add: ahocorasick matching way in RuleUnsafeWords
shijinpjlab Jan 7, 2025
527ca13
update: llm_register decorator use class
shijinpjlab Jan 8, 2025
fde6029
add: ci check when pull request into dev branch
shijinpjlab Jan 8, 2025
97e512b
fix: group contains rule and prompt
shijinpjlab Jan 8, 2025
10ed840
fix: group contains rule and prompt
shijinpjlab Jan 8, 2025
a4fb742
add duplicate info
Jan 8, 2025
f68fa0f
add duplicate info
Jan 8, 2025
dc41285
add duplicate info
Jan 8, 2025
892290c
update: delete print in RuleOnlyUrl
shijinpjlab Jan 9, 2025
e08c0a2
Merge branch 'dev' of github.com:DataEval/dingo into dev
shijinpjlab Jan 9, 2025
6c4cc82
remove value
Jan 9, 2025
e9c0b98
Merge pull request #12 from dt-yy/dev
shijinpjlab Jan 9, 2025
2b15e21
Merge pull request #13 from shijinpjlab/dev
e06084 Jan 9, 2025
aaa5711
fix: ProcessPoolExecutor leads summary not update
shijinpjlab Jan 9, 2025
b7f40d0
Merge branch 'dev' of github.com:DataEval/dingo into dev
shijinpjlab Jan 9, 2025
5630dd7
Merge pull request #14 from shijinpjlab/dev
e06084 Jan 9, 2025
cff29f4
update: pbar update number related to batch size
shijinpjlab Jan 10, 2025
6372e49
Merge pull request #16 from shijinpjlab/dev
e06084 Jan 10, 2025
051d8a4
fix: error_info write twice
shijinpjlab Jan 10, 2025
8ca0b2e
Merge branch 'dev' of github.com:DataEval/dingo into dev
shijinpjlab Jan 10, 2025
2e9ebcb
Merge pull request #17 from shijinpjlab/dev
e06084 Jan 10, 2025
71b3525
feat: google verify (#19)
e06084 Jan 13, 2025
a0eab52
update: google search index
shijinpjlab Jan 13, 2025
6a775b6
add: TEXT_QUALITY_V3_B3
shijinpjlab Jan 13, 2025
fdc8274
Merge pull request #20 from shijinpjlab/dev
e06084 Jan 13, 2025
90b26e4
update: prompt v3, v4
shijinpjlab Jan 14, 2025
f2f61fa
Merge branch 'dev' of github.com:DataEval/dingo into dev
shijinpjlab Jan 14, 2025
c581462
delete: prompt v4
shijinpjlab Jan 14, 2025
8bad559
Merge pull request #21 from shijinpjlab/dev
e06084 Jan 14, 2025
cc5b726
refactor: enhance LocalExecutor to support separate thread and proces…
zihan-huang Jan 15, 2025
8ffc56e
add: RuleAbnormalChar, RuleAbnormalHtml, RuleEnterAndSpace
shijinpjlab Jan 15, 2025
5c04f66
update: qa_standard_v1 delete some rule
shijinpjlab Jan 15, 2025
c0c2a4b
Merge pull request #23 from shijinpjlab/dev
e06084 Jan 15, 2025
3a03ccf
feat: clear bad_info_list and good_info_list according to size
shijinpjlab Jan 17, 2025
bc79bdf
Merge pull request #24 from shijinpjlab/dev
e06084 Jan 17, 2025
7319471
feat: delete wraps in prompt_register
shijinpjlab Jan 17, 2025
a53edf8
feat: when custom, group is not necessary to set.
shijinpjlab Jan 17, 2025
3608e7d
Merge branch 'dev' of github.com:DataEval/dingo into dev
shijinpjlab Jan 17, 2025
590f590
Merge pull request #25 from shijinpjlab/dev
e06084 Jan 17, 2025
32fd910
feat: support spark llm check.
shijinpjlab Jan 23, 2025
fb9375e
Merge branch 'dev' of github.com:DataEval/dingo into dev
shijinpjlab Jan 23, 2025
c401e6a
Merge pull request #26 from shijinpjlab/dev
e06084 Jan 23, 2025
1227a94
feat: add app html in huggingface
shijinpjlab Feb 7, 2025
5b01e52
Merge branch 'dev' of github.com:DataEval/dingo into dev
shijinpjlab Feb 7, 2025
2b41fa4
Merge pull request #28 from shijinpjlab/dev
e06084 Feb 7, 2025
76de536
[pre-commit.ci] pre-commit autoupdate
pre-commit-ci[bot] Feb 10, 2025
9153203
Merge pull request #29 from DataEval/pre-commit-ci-update-config
e06084 Feb 11, 2025
f61ec9d
feat: add end_index
shijinpjlab Feb 11, 2025
448e449
Merge pull request #30 from shijinpjlab/dev
e06084 Feb 11, 2025
3c8e9ea
[docs]: add hf spaces demo and GUI output (#32)
e06084 Feb 12, 2025
fa6b79a
feat: add header html in hf demo
shijinpjlab Feb 13, 2025
98868f0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 13, 2025
a31351a
Merge pull request #33 from shijinpjlab/dev
e06084 Feb 13, 2025
9783aeb
[docs]: add discord invite link (#34)
e06084 Feb 17, 2025
8f9c186
feat: add llm demo, including local and remote.
shijinpjlab Feb 25, 2025
9ce24d7
Merge pull request #36 from shijinpjlab/dev
e06084 Feb 25, 2025
4f307cd
feat: add v1.4
shijinpjlab Feb 28, 2025
ff8945d
feat: use v1.4.0
shijinpjlab Feb 28, 2025
c8af4a8
Merge pull request #37 from shijinpjlab/dev
e06084 Feb 28, 2025
4 changes: 2 additions & 2 deletions .github/workflows/IntegrationTest.yml
Original file line number Diff line number Diff line change
@@ -7,10 +7,10 @@ on:
  push:
    branches: [ "main", "dev" ]
  pull_request:
    branches: [ "main" ]
    branches: [ "main", "dev" ]
  workflow_dispatch:


jobs:
  build:

Expand Down
28 changes: 28 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,28 @@
name: lint

on: [push, pull_request]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.10.15]
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install pre-commit hook
        run: |
          pip install pre-commit==3.8.0
          pre-commit install
      - name: Linting
        run: |
          pre-commit sample-config > .pre-commit-config.yaml
          pre-commit run --all-files
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
__pycache__/
*.egg-info/
9 changes: 9 additions & 0 deletions .owners.yml
@@ -0,0 +1,9 @@
assign:
  strategy:
    # random
    daily-shift-based
  schedule:
    '*/1 * * * *'
  assignees:
    - e06084
    - shijinpjlab
14 changes: 14 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,14 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
  - repo: https://github.com/PyCQA/isort
    rev: 6.0.0
    hooks:
      - id: isort
2 changes: 1 addition & 1 deletion LICENSE
@@ -198,4 +198,4 @@
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
limitations under the License.
48 changes: 31 additions & 17 deletions README.md
@@ -9,7 +9,17 @@

</div>

[English](README.md) | [简体中文](README_CN.md)
[English](README.md) | [简体中文](README_zh-CN.md)

<div align="center">
<a href="https://discord.gg/Jhgb2eKWh8" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://huggingface.co/spaces/DataEval/dingo" style="text-decoration:none;">
<img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" width="3%" alt="Hugging Face" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
</div>


# Changelog

@@ -83,7 +93,7 @@ $ cat test/data/config_gpt.json
"llm_config": {
"openai": {
"model": "gpt-4o",
"key": "xxxx",
"key": "xxxx",
"api_url": "https://api.openai.com/v1/chat/completions"
}
}
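As a rough illustration of how the `llm_config` excerpt above might be checked before use, here is a minimal sketch. The helper `missing_llm_fields` and the list of required fields are assumptions made for this example and are not part of dingo; only the key names `llm_config`, `openai`, `model`, `key`, and `api_url` come from the excerpt.

```python
import json

# Hypothetical helper, not part of dingo: it checks only the fields that
# appear in the config excerpt above.
REQUIRED_LLM_FIELDS = ("model", "key", "api_url")


def missing_llm_fields(config: dict, provider: str = "openai") -> list[str]:
    """Return the required fields that are absent or empty for `provider`."""
    provider_cfg = config.get("llm_config", {}).get(provider, {})
    return [name for name in REQUIRED_LLM_FIELDS if not provider_cfg.get(name)]


raw = """
{
  "llm_config": {
    "openai": {
      "model": "gpt-4o",
      "key": "xxxx",
      "api_url": "https://api.openai.com/v1/chat/completions"
    }
  }
}
"""
config = json.loads(raw)
print(missing_llm_fields(config))  # prints []
```

Returning the list of missing names, rather than a boolean, lets a caller report exactly which field to fill in.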
@@ -99,7 +109,10 @@ If the user wants to manually start a frontend page, you need to enter the follo
python -m dingo.run.vsl --input xxx
```

The `--input` argument is the directory containing the quality inspection results. Users need to ensure that this directory contains a summary.json file.
The `--input` argument is the directory containing the quality inspection results. Users need to ensure that this directory contains a summary.json file. The frontend page looks like this: ![GUI output](docs/assets/dingo_gui.png)

## Online Demo
Try dingo on our online demo: [(Hugging Face)🤗](https://huggingface.co/spaces/DataEval/dingo)

# Feature List

@@ -153,17 +166,17 @@ then you can refer to: [Install Dependencies](requirements)

## Register Rules/Prompts/Models

If the heuristic rules inside the project do not meet the user's quality inspection requirements, users can also customize rules or models.
If the heuristic rules inside the project do not meet the user's quality inspection requirements, users can also customize rules or models.

### Register Rules

To create a new rule `CommonPatternDemo`, first add a decorator to the rule so that it is injected into the project.
Next, set the rule's `metric_type`, such as `QUALITY_BAD_RELEVANCE`; `group` does not need to be set.
Then define a `DynamicRuleConfig` object so that the rule's properties can be configured dynamically.
In addition, the rule's method must be named `eval`, and it must be a class method.
Finally, the return value should be a `ModelRes` object.
To create a new rule `CommonPatternDemo`, first add a decorator to the rule so that it is injected into the project.
Next, set the rule's `metric_type`, such as `QUALITY_BAD_RELEVANCE`; `group` does not need to be set.
Then define a `DynamicRuleConfig` object so that the rule's properties can be configured dynamically.
In addition, the rule's method must be named `eval`, and it must be a class method.
Finally, the return value should be a `ModelRes` object.

For example: [Register Rules](examples/register/sdk_register_rule.py)
For example: [Register Rules](examples/register/sdk_register_rule.py)
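
The registration steps described above can be sketched in a small self-contained example. None of the definitions below are taken from dingo: `rule_register`, `RULE_REGISTRY`, and the fields on `ModelRes` and `DynamicRuleConfig` are hypothetical stand-ins that only mimic the pattern, and `eval` takes a plain string here for simplicity. The linked example file shows the project's actual usage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Type

# Self-contained stand-ins; nothing here is imported from dingo, and the
# project's real signatures may differ.
RULE_REGISTRY: Dict[str, Type] = {}


@dataclass
class ModelRes:  # stand-in for the result object returned by a rule
    error_status: bool = False
    reason: List[str] = field(default_factory=list)


@dataclass
class DynamicRuleConfig:  # stand-in for the per-rule configuration object
    pattern: str = ""


def rule_register(metric_type: str) -> Callable[[Type], Type]:
    """Hypothetical decorator: tag the class and add it to the registry."""
    def decorator(cls: Type) -> Type:
        cls.metric_type = metric_type
        RULE_REGISTRY[cls.__name__] = cls
        return cls
    return decorator


@rule_register("QUALITY_BAD_RELEVANCE")
class CommonPatternDemo:
    # Properties of the rule live in a config object so they can be changed
    # without editing the rule logic.
    dynamic_config = DynamicRuleConfig(pattern="blue")

    @classmethod
    def eval(cls, content: str) -> ModelRes:
        res = ModelRes()
        if cls.dynamic_config.pattern in content:
            res.error_status = True
            res.reason.append(f"matched pattern: {cls.dynamic_config.pattern}")
        return res


print(CommonPatternDemo.eval("the sky is blue").error_status)  # prints True
```

Because the decorator returns the class unchanged after recording it, the rule can still be called directly, which keeps it easy to test in isolation.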

### Register Prompts

@@ -173,8 +186,8 @@ For example: [Register Prompts](examples/register/sdk_register_prompt.py)

### Register Models

Registering models works slightly differently: users need to implement a `call_api` method that accepts a `MetaData` parameter and returns a `ModelRes` result.
The project already provides a base model class, [BaseOpenAI](dingo/model/llm/base_openai.py), which users can inherit from directly.
Registering models works slightly differently: users need to implement a `call_api` method that accepts a `MetaData` parameter and returns a `ModelRes` result.
The project already provides a base model class, [BaseOpenAI](dingo/model/llm/base_openai.py), which users can inherit from directly.
Users who need special behavior can override the corresponding methods.

For example: [Register Models](examples/register/sdk_register_llm.py)
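
The inherit-and-override pattern described above might look roughly like the following sketch. It is not dingo's implementation: `BaseModelSketch`, `CannedModel`, `build_prompt`, `send`, and the fields on `MetaData` and `ModelRes` are hypothetical names chosen for illustration, and no network call is made. Only the `call_api` method name and the `MetaData` in, `ModelRes` out contract come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetaData:  # stand-in for the input record type
    data_id: str
    content: str


@dataclass
class ModelRes:  # stand-in for the result type
    error_status: bool = False
    reason: List[str] = field(default_factory=list)


class BaseModelSketch:
    """Hypothetical base class; plays the role the README gives BaseOpenAI."""

    @classmethod
    def build_prompt(cls, data: MetaData) -> str:
        return f"Judge the quality of this text: {data.content}"

    @classmethod
    def send(cls, prompt: str) -> str:
        raise NotImplementedError("subclasses decide how the model is reached")

    @classmethod
    def call_api(cls, data: MetaData) -> ModelRes:
        # Accept a MetaData record, query the model, return a ModelRes.
        reply = cls.send(cls.build_prompt(data))
        res = ModelRes()
        if reply != "good":
            res.error_status = True
            res.reason.append(reply)
        return res


class CannedModel(BaseModelSketch):
    """Overrides only `send`, returning a canned reply instead of a real call."""

    @classmethod
    def send(cls, prompt: str) -> str:
        return "bad: placeholder text" if "lorem ipsum" in prompt else "good"


result = CannedModel.call_api(MetaData(data_id="1", content="lorem ipsum dolor"))
print(result.error_status, result.reason)
```

Keeping the transport in one overridable method means a subclass can change how the model is reached without touching the result handling in `call_api`.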
@@ -185,7 +198,7 @@ For example: [Register Models](examples/register/sdk_register_llm.py)

## Execution Engine

`Dingo` can run locally or on a spark cluster.
`Dingo` can run locally or on a spark cluster.
Regardless of the choice of engine, the executor supports some common methods:

| function name | description |
@@ -203,9 +216,9 @@ When choosing the spark engine, users can freely choose rules, models for qualit

### Spark Mode

When the spark engine is chosen, users can only use rules for quality inspection; models cannot be used.
Only `eval_group`, `save_data`, `save_correct`, and `custom_config` in `InputArgs` remain valid.
Users therefore need to pass `spark_session` to initialize spark and `spark_rdd` (composed of `MetaData` structures) as the data for quality inspection.
When the spark engine is chosen, users can only use rules for quality inspection; models cannot be used.
Only `eval_group`, `save_data`, `save_correct`, and `custom_config` in `InputArgs` remain valid.
Users therefore need to pass `spark_session` to initialize spark and `spark_rdd` (composed of `MetaData` structures) as the data for quality inspection.
Note that if `save_data` is `False`, the data in memory is cleared immediately after the quality inspection completes, and `spark_session` is stopped immediately as well.

[Spark Example](examples/spark/sdk_spark.py)
@@ -275,7 +288,8 @@ If you find this project useful, please consider citing our tool:
```
@misc{dingo,
title={Dingo: A Comprehensive Data Quality Evaluation Tool for Large Models},
author={Dingo Contributors},
howpublished={\url{https://github.com/DataEval/dingo}},
year={2024}
}
```
```
36 changes: 21 additions & 15 deletions README_CN.md → README_zh-CN.md
@@ -82,7 +82,7 @@ $ cat test/data/config_gpt.json
"llm_config": {
"openai": {
"model": "gpt-4o",
"key": "xxxx",
"key": "xxxx",
"api_url": "https://api.openai.com/v1/chat/completions"
}
}
@@ -98,7 +98,12 @@ $ cat test/data/config_gpt.json
python -m dingo.run.vsl --input xxx
```

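The path after input is the directory of the quality inspection results; users need to make sure the directory contains a summary.json file
The path after input is the directory of the quality inspection results; users need to make sure the directory contains a summary.json file.
The frontend page output looks like this: ![GUI output](docs/assets/dingo_gui.png)

## 5. Online demo

Try our online demo: [(Hugging Face)🤗](https://huggingface.co/spaces/DataEval/dingo)

# 3. Feature List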

@@ -152,17 +157,17 @@ Dingo 支持输出7个Quality Metrics概况报告和异常数据追溯详情报

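## 2. Register Rules/Prompts/Models

If the project's built-in heuristic rules do not meet the user's quality inspection needs, users can also define custom rules or models.
If the project's built-in heuristic rules do not meet the user's quality inspection needs, users can also define custom rules or models.

### 2.1 Register Rules

To create a new rule `CommonPatternDemo`, first add a decorator to the rule so that it is injected into the project.
Next, set the rule's `metric_type`, for example `QUALITY_BAD_RELEVANCE`; `group` does not need to be set.
Then define a `DynamicRuleConfig` object so that the rule's properties can be configured dynamically.
In addition, the rule's method must be named `eval` and must be a class method.
Finally, the return value should be a `ModelRes` object.
To create a new rule `CommonPatternDemo`, first add a decorator to the rule so that it is injected into the project.
Next, set the rule's `metric_type`, for example `QUALITY_BAD_RELEVANCE`; `group` does not need to be set.
Then define a `DynamicRuleConfig` object so that the rule's properties can be configured dynamically.
In addition, the rule's method must be named `eval` and must be a class method.
Finally, the return value should be a `ModelRes` object.

For example: [Register Rules](examples/register/sdk_register_rule.py)
For example: [Register Rules](examples/register/sdk_register_rule.py)

### 2.2 Register Prompts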

@@ -172,8 +177,8 @@ Dingo 支持输出7个Quality Metrics概况报告和异常数据追溯详情报

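### 2.3 Register Models

Registering a model works slightly differently: users need to implement a call_api method that accepts a MetaData parameter and returns a ModelRes result.
The project already provides a base model class, [BaseOpenAI](dingo/model/llm/base_openai.py), which users can inherit from directly.
Registering a model works slightly differently: users need to implement a call_api method that accepts a MetaData parameter and returns a ModelRes result.
The project already provides a base model class, [BaseOpenAI](dingo/model/llm/base_openai.py), which users can inherit from directly.
If users need special behavior, they can override the corresponding methods.

For example: [Register Models](examples/register/sdk_register_llm.py)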
@@ -184,7 +189,7 @@ Dingo 支持输出7个Quality Metrics概况报告和异常数据追溯详情报

## 4. Execution Engine

`Dingo` can run locally or on a spark cluster.
`Dingo` can run locally or on a spark cluster.
Regardless of the engine chosen, the executor supports some common methods:

| function name | description |
@@ -202,9 +207,9 @@ Dingo 支持输出7个Quality Metrics概况报告和异常数据追溯详情报

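### 4.2 Spark Mode

When the spark engine is selected, users can only use rules for quality inspection; models cannot be used.
In addition, only `eval_group`, `save_data`, `save_correct`, and `custom_config` in `InputArgs` remain valid.
Therefore, users need to pass `spark_session` to initialize spark and pass `spark_rdd` (composed of `MetaData` structures) as the data to inspect.
When the spark engine is selected, users can only use rules for quality inspection; models cannot be used.
In addition, only `eval_group`, `save_data`, `save_correct`, and `custom_config` in `InputArgs` remain valid.
Therefore, users need to pass `spark_session` to initialize spark and pass `spark_rdd` (composed of `MetaData` structures) as the data to inspect.
Note that if `save_data` is `False`, the data in memory is cleared immediately after the quality inspection completes, and `spark_session` is stopped immediately as well.

[Spark example](examples/spark/sdk_spark.py)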
@@ -274,6 +279,7 @@ If you find this project useful, please consider citing our tool:
```
@misc{dingo,
title={Dingo: A Comprehensive Data Quality Evaluation Tool for Large Models},
author={Dingo Contributors},
howpublished={\url{https://github.com/DataEval/dingo}},
year={2024}
}
Expand Down
2 changes: 1 addition & 1 deletion Todo.json
@@ -1 +1 @@
{"verion":"0.0.1","entries":[]}
{"verion":"0.0.1","entries":[]}
2 changes: 1 addition & 1 deletion app/.editorconfig
@@ -6,4 +6,4 @@ indent_style = space
indent_size = 2
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true
trim_trailing_whitespace = true
7 changes: 4 additions & 3 deletions app/app-static.py
@@ -1,11 +1,12 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import json
import re
import argparse
import base64
import json
import os
import re


def get_folder_structure(root_path):
    structure = []
Expand Down
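The import reordering in this hunk is consistent with a plain alphabetical sort of the `import x` lines. The sketch below reproduces only that visible ordering; `sort_plain_imports` is a hypothetical helper written for this illustration, and it does not model isort's grouping of imports into sections or its handling of `from ... import ...` statements.

```python
def sort_plain_imports(lines):
    """Sort `import x` lines alphabetically by module name, ignoring case."""
    return sorted(lines, key=lambda line: line.split()[1].lower())


# The import lines as they appear before the change in app/app-static.py.
before = [
    "import os",
    "import json",
    "import re",
    "import argparse",
    "import base64",
]
after = sort_plain_imports(before)
print("\n".join(after))
```

Running the same helper on the three imports in `app/app.py` gives `argparse`, `subprocess`, `sys`, which matches the order in that file's hunk.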
5 changes: 3 additions & 2 deletions app/app.py
@@ -1,6 +1,7 @@
import sys
import subprocess
import argparse
import subprocess
import sys


def run_electron_app():
    parser = argparse.ArgumentParser(description="Run Electron app with optional input path")
Expand Down
2 changes: 1 addition & 1 deletion app/package.json
@@ -80,4 +80,4 @@
"typescript": "^5.5.2",
"vite": "^5.3.1"
}
}
}