
Commit fa0009e

Authored by shijinpjlab, seancoding-day, pre-commit-ci[bot], e06084

release v1.7 (#89)
* feat: add more mcp tools and mcp demo (#78)
  * 1. update mcp server 2. update mcp server docs 3. add mcp demo
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
  * Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* update mcp readme (#79)
  * 1. update mcp server 2. update mcp server docs 3. add mcp demo
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
  * Update README_mcp.md
  * Update README_mcp_zh-CN.md
  * Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat: add wechat
* x
* docs: update readme
* x
* update fasttext download (#82)
  * feat: change download_fasttext
  * feat: add os
  * feat: add package
  * feat: add md5 check and TestDownloadFasttext
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
  * Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* docs: update readme
* x
* optimize: change MetaData to Data (#85)
  * 1. update mcp server 2. update mcp server docs 3. add mcp demo
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
  * Update README_mcp.md
  * Update README_mcp_zh-CN.md
  * change MetaData to Data
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
  * Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* docs: add Japanese readme (#86)
  * docs: update MCP readme
  * docs: add ja readme
* x
* Dev continue (#87)
  * feat: add continue exec example
  * feat: fix lint
  * feat: add ci test
  * feat: fix lint
  * feat: align the test directory with the project layout and add it to CI
* feat: v1.7 (#88)

Co-authored-by: seanpjlab <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: chupei <[email protected]>
Co-authored-by: chupei <[email protected]>
1 parent 2de6b47 commit fa0009e


43 files changed (+2482 −557 lines)

.github/workflows/IntegrationTest.yml

Lines changed: 3 additions & 0 deletions
```diff
@@ -62,3 +62,6 @@ jobs:
       - name: Integration Test(custom config)
         run: |
           python -m dingo.run.cli --input_path test/data/test_local_json.json --dataset local -e test --data_format json --column_content prediction --custom_config test/config/config_rule.json --log_level=DEBUG
+      - name: Run unit tests with pytest
+        run: |
+          pytest test/scripts --ignore=test/scripts/data
```
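The new pytest step complements the existing CLI-driven integration test. As a rough sketch of how that CLI invocation's flags decompose, here is an argparse stand-in built only from the flags visible in the step above; it is not dingo's actual `dingo.run.cli` parser:

```python
import argparse

# Hypothetical mirror of the dingo CLI flags used in the CI step above;
# field names follow the flags, not dingo's internal argument parser.
parser = argparse.ArgumentParser(prog="dingo.run.cli")
parser.add_argument("--input_path")
parser.add_argument("--dataset")
parser.add_argument("-e", "--eval_group")
parser.add_argument("--data_format")
parser.add_argument("--column_content")
parser.add_argument("--custom_config")
parser.add_argument("--log_level")

# The exact argument vector used by the integration-test step.
args = parser.parse_args([
    "--input_path", "test/data/test_local_json.json",
    "--dataset", "local",
    "-e", "test",
    "--data_format", "json",
    "--column_content", "prediction",
    "--custom_config", "test/config/config_rule.json",
    "--log_level=DEBUG",
])
```

Each flag maps to one evaluation setting: the input file and its format, the rule group to apply, the column under test, and an optional custom config.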

.gitignore

Lines changed: 50 additions & 2 deletions
```diff
@@ -1,2 +1,50 @@
-__pycache__/
-*.egg-info/
+*.tar
+*.tar.gz
+*.zip
+venv*/
+envs/
+slurm_logs/
+local_tests/
+
+__pycache__
+*.log
+*.pyc
+.vscode
+debug/
+*.ipynb
+.idea
+.python-version
+
+# vscode history
+.history
+
+.DS_Store
+.env
+
+bad_words/
+bak/
+
+app/tests/*
+temp/
+tmp/
+tmp
+.vscode
+.vscode/
+ocr_demo
+.coveragerc
+
+
+# sphinx docs
+_build/
+
+
+output/
+**/temp.py
+
+# coverage file
+.coverage*
+coverage.xml
+
+llm_web_kit.egg-info/*
+.llm-web-kit.jsonc
+.llm-web-kit-pageclassify.jsonc
```
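Patterns like `venv*/` and `.coverage*` above use gitignore glob syntax. A rough illustration with Python's `fnmatch`, which only approximates real gitignore matching (directory semantics and `**` handling differ):

```python
from fnmatch import fnmatch

# A few of the newly added ignore patterns. Trailing slashes mark
# directory patterns; this sketch strips them and matches basenames only,
# which approximates (but does not equal) git's matching rules.
patterns = ["*.tar.gz", "venv*/", ".coverage*", "**/temp.py"]

def is_ignored(path: str) -> bool:
    name = path.rstrip("/").split("/")[-1]
    return any(fnmatch(name, pat.rstrip("/").split("/")[-1]) for pat in patterns)
```

For real gitignore semantics, git itself (or a dedicated library) should be consulted; this only shows the glob flavor of the entries.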

README.md

Lines changed: 46 additions & 111 deletions
```diff
@@ -21,17 +21,16 @@
 
 <div align="center">
 
-[English](README.md) · [简体中文](README_zh-CN.md)
+[English](README.md) · [简体中文](README_zh-CN.md) · [日本語](README_ja.md)
 
 </div>
 
 
-<div align="center">
-<a href="https://discord.gg/Jhgb2eKWh8" style="text-decoration:none;">
-<img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="Discord" /></a>
-<a href="https://huggingface.co/spaces/DataEval/dingo" style="text-decoration:none;">
-<img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" width="3%" alt="Hugging Face" /></a>
-</div>
+<!-- join us -->
+
+<p align="center">
+👋 join us on <a href="https://discord.gg/Jhgb2eKWh8" target="_blank">Discord</a> and <a href="./docs/assets/wechat.jpg" target="_blank">WeChat</a>
+</p>
 
 
 # Changelog
```
````diff
@@ -56,64 +55,36 @@ pip install dingo-python
 
 ## Example Use Cases
 
-### 1. Using Evaluate Core
+### 1. Evaluate LLM chat data
 
 ```python
 from dingo.config.config import DynamicLLMConfig
-from dingo.io.input.MetaData import MetaData
+from dingo.io.input.Data import Data
 from dingo.model.llm.llm_text_quality_model_base import LLMTextQualityModelBase
 from dingo.model.rule.rule_common import RuleEnterAndSpace
 
+data = Data(
+    data_id='123',
+    prompt="hello, introduce the world",
+    content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
+)
 
 def llm():
-    data = MetaData(
-        data_id='123',
-        prompt="hello, introduce the world",
-        content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
-    )
-
     LLMTextQualityModelBase.dynamic_config = DynamicLLMConfig(
-        key='',
-        api_url='',
-        # model='',
+        key='YOUR_API_KEY',
+        api_url='https://api.openai.com/v1/chat/completions',
+        model='gpt-4o',
     )
     res = LLMTextQualityModelBase.eval(data)
     print(res)
 
 
 def rule():
-    data = MetaData(
-        data_id='123',
-        prompt="hello, introduce the world",
-        content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
-    )
-
     res = RuleEnterAndSpace().eval(data)
     print(res)
 ```
 
-### 2. Evaluate Local Text File (Plaintext)
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate a plaintext file
-input_data = {
-    "eval_group": "sft",  # Rule set for SFT data
-    "input_path": "data.txt",  # Path to local text file
-    "dataset": "local",
-    "data_format": "plaintext",  # Format: plaintext
-    "save_data": True  # Save evaluation results
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
-### 3. Evaluate Hugging Face Dataset
+### 2. Evaluate Dataset
 
 ```python
 from dingo.io import InputArgs
````
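The `Data` container introduced above (renamed from `MetaData` in this release) carries an id, a prompt, and the content under evaluation. A minimal self-contained sketch of that shape, using a stand-in dataclass rather than dingo's actual `dingo.io.input.Data` implementation:

```python
from dataclasses import dataclass

# Stand-in mirroring only the fields used in the README example above;
# dingo's real dingo.io.input.Data class is richer than this sketch.
@dataclass
class Data:
    data_id: str
    prompt: str = ""
    content: str = ""

data = Data(
    data_id="123",
    prompt="hello, introduce the world",
    content="Hello! The world is a vast and diverse place.",
)
```

Both rule and LLM evaluators in the example consume this single container, so the rename is mechanical: replace `MetaData(...)` with `Data(...)` at call sites.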
````diff
@@ -133,58 +104,6 @@ result = executor.execute()
 print(result)
 ```
 
-### 4. Evaluate JSON/JSONL Format
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate a JSON file
-input_data = {
-    "eval_group": "default",  # Default rule set
-    "input_path": "data.json",  # Path to local JSON file
-    "dataset": "local",
-    "data_format": "json",  # Format: json
-    "column_content": "text",  # Column containing the text to evaluate
-    "save_data": True  # Save evaluation results
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
-### 5. Using LLM for Evaluation
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate using GPT model
-input_data = {
-    "input_path": "data.jsonl",  # Path to local JSONL file
-    "dataset": "local",
-    "data_format": "jsonl",
-    "column_content": "content",
-    "custom_config": {
-        "prompt_list": ["PromptRepeat"],  # Prompt to use
-        "llm_config": {
-            "detect_text_quality": {
-                "model": "gpt-4o",
-                "key": "YOUR_API_KEY",
-                "api_url": "https://api.openai.com/v1/chat/completions"
-            }
-        }
-    }
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
 ## Command Line Interface
 
 ### Evaluate with Rule Sets
````
```diff
@@ -227,6 +146,22 @@ Where `output_directory` contains the evaluation results with a `summary.json` file
 ## Online Demo
 Try Dingo on our online demo: [(Hugging Face)🤗](https://huggingface.co/spaces/DataEval/dingo)
 
+
+# MCP Server
+
+Dingo includes an experimental Model Context Protocol (MCP) server. For details on running the server and integrating it with clients like Cursor, please see the dedicated documentation:
+
+[English](README_mcp.md) · [简体中文](README_mcp_zh-CN.md) · [日本語](README_mcp_ja.md)
+
+## Video Demonstration
+
+To help you get started quickly with Dingo MCP, we've created a video walkthrough:
+
+https://github.com/user-attachments/assets/aca26f4c-3f2e-445e-9ef9-9331c4d7a37b
+
+This video demonstrates step-by-step how to use Dingo MCP server with Cursor.
+
+
 # Data Quality Metrics
 
 Dingo classifies data quality issues into 7 dimensions of Quality Metrics. Each dimension can be evaluated using both rule-based methods and LLM-based prompts:
```
```diff
@@ -364,7 +299,7 @@ If the built-in rules don't meet your requirements, you can create custom ones:
 from dingo.model import Model
 from dingo.model.rule.base import BaseRule
 from dingo.config.config import DynamicRuleConfig
-from dingo.io import MetaData
+from dingo.io import Data
 from dingo.model.modelres import ModelRes
 
 @Model.rule_register('QUALITY_BAD_RELEVANCE', ['default'])
```
```diff
@@ -374,7 +309,7 @@ class MyCustomRule(BaseRule):
     dynamic_config = DynamicRuleConfig(pattern=r'your_pattern_here')
 
     @classmethod
-    def eval(cls, input_data: MetaData) -> ModelRes:
+    def eval(cls, input_data: Data) -> ModelRes:
         res = ModelRes()
         # Your rule implementation here
         return res
```
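The custom-rule skeleton above leaves `eval` unimplemented. Below is a hedged, self-contained sketch of the same pattern using stand-in `Data` and `ModelRes` classes (dingo's real classes and the `@Model.rule_register` decorator are not reproduced here), with a hypothetical whitespace pattern standing in for `your_pattern_here`:

```python
import re
from dataclasses import dataclass, field

# Stand-ins for dingo.io.Data and dingo.model.modelres.ModelRes,
# reduced to the fields this sketch needs; the real classes differ.
@dataclass
class Data:
    data_id: str
    content: str = ""

@dataclass
class ModelRes:
    error_status: bool = False
    type: str = "QUALITY_GOOD"
    reason: list = field(default_factory=list)

class MyCustomRule:
    # Hypothetical pattern: runs of 3+ newlines count as a quality issue.
    pattern = r"\n{3,}"

    @classmethod
    def eval(cls, input_data: Data) -> ModelRes:
        res = ModelRes()
        matches = re.findall(cls.pattern, input_data.content)
        if matches:
            res.error_status = True
            res.type = "QUALITY_BAD_RELEVANCE"
            res.reason = [f"found {len(matches)} run(s) of excess blank lines"]
        return res

good = MyCustomRule.eval(Data(data_id="1", content="clean text"))
bad = MyCustomRule.eval(Data(data_id="2", content="para one\n\n\n\npara two"))
```

The shape matters more than the rule itself: `eval` receives one `Data` record and returns a `ModelRes` whose `error_status` and `type` drive the quality classification.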
```diff
@@ -424,7 +359,7 @@ from pyspark.sql import SparkSession
 
 # Initialize Spark
 spark = SparkSession.builder.appName("Dingo").getOrCreate()
-spark_rdd = spark.sparkContext.parallelize([...])  # Your data as MetaData objects
+spark_rdd = spark.sparkContext.parallelize([...])  # Your data as Data objects
 
 input_args = InputArgs(eval_group="default", save_data=True)
 executor = Executor.exec_map["spark"](input_args, spark_session=spark, spark_rdd=spark_rdd)
```
````diff
@@ -463,19 +398,17 @@ Example summary:
 ```
 
 
-# MCP Server (Experimental)
-
-Dingo includes an experimental Model Context Protocol (MCP) server. For details on running the server and integrating it with clients like Cursor, please see the dedicated documentation:
-
-[**Dingo MCP Server Documentation (README_mcp.md)**](README_mcp.md)
-
-
 # Research & Publications
 
-- **"Comprehensive Data Quality Assessment for Multilingual WebData"** : [WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
-- **"Pre-training data quality using the DataMan methodology"** : [DataMan: Data Manager for Pre-training Large Language Models](https://openreview.net/pdf?id=eNbA8Fqir4)
+## Research Powered by Dingo
+- **WanJuanSiLu**: [A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
+  *Uses Dingo for comprehensive data quality assessment of multilingual web data*
 
+## Methodologies Implemented in Dingo
+- **DataMan Methodology**: [DataMan: Data Manager for Pre-training Large Language Models](https://openreview.net/pdf?id=eNbA8Fqir4)
+  *Dingo implements the DataMan methodology for pre-training data quality assessment*
+- **RedPajama-Data-v2**: [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)
+  *Dingo implements parts of the RedPajama-Data-v2 methodology for web text quality assessment and filtering*
 
 # Future Plans
 
````
```diff
@@ -501,6 +434,8 @@ We appreciate all the contributors for their efforts to improve and enhance `Dingo`
 
 This project uses the [Apache 2.0 Open Source License](LICENSE).
 
+This project uses fasttext for some functionality including language detection. fasttext is licensed under the MIT License, which is compatible with our Apache 2.0 license and provides flexibility for various usage scenarios.
+
 # Citation
 
 If you find this project useful, please consider citing our tool:
```

0 commit comments
