
Commit 9e91aaa

Merge pull request #83 from e06084/dev
docs: update readme
2 parents e4c8edb + 35f2bba commit 9e91aaa

File tree (2 files changed: +36 −185 lines)

- README.md
- README_zh-CN.md

README.md

Lines changed: 18 additions & 93 deletions
````diff
@@ -55,64 +55,36 @@ pip install dingo-python
 
 ## Example Use Cases
 
-### 1. Using Evaluate Core
+### 1. Evaluate LLM chat data
 
 ```python
 from dingo.config.config import DynamicLLMConfig
 from dingo.io.input.MetaData import MetaData
 from dingo.model.llm.llm_text_quality_model_base import LLMTextQualityModelBase
 from dingo.model.rule.rule_common import RuleEnterAndSpace
 
+data = MetaData(
+    data_id='123',
+    prompt="hello, introduce the world",
+    content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
+)
 
 def llm():
-    data = MetaData(
-        data_id='123',
-        prompt="hello, introduce the world",
-        content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
-    )
-
     LLMTextQualityModelBase.dynamic_config = DynamicLLMConfig(
-        key='',
-        api_url='',
-        # model='',
+        key='YOUR_API_KEY',
+        api_url='https://api.openai.com/v1/chat/completions',
+        model='gpt-4o',
     )
     res = LLMTextQualityModelBase.eval(data)
     print(res)
 
 
 def rule():
-    data = MetaData(
-        data_id='123',
-        prompt="hello, introduce the world",
-        content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
-    )
-
     res = RuleEnterAndSpace().eval(data)
     print(res)
 ```
 
-### 2. Evaluate Local Text File (Plaintext)
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate a plaintext file
-input_data = {
-    "eval_group": "sft",         # Rule set for SFT data
-    "input_path": "data.txt",    # Path to local text file
-    "dataset": "local",
-    "data_format": "plaintext",  # Format: plaintext
-    "save_data": True            # Save evaluation results
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
-### 3. Evaluate Hugging Face Dataset
+### 2. Evaluate Dataset
 
 ```python
 from dingo.io import InputArgs
````
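The refactor in this hunk hoists the shared `MetaData` record to module scope so that `llm()` and `rule()` evaluate the same data instead of each constructing its own copy. A minimal stand-in sketch of that pattern, using a plain dataclass rather than the real dingo API (the `llm_check`/`rule_check` bodies are illustrative placeholders, not dingo's actual evaluators):

```python
from dataclasses import dataclass

# Stand-in for dingo's MetaData; field names mirror the diff above.
@dataclass
class MetaData:
    data_id: str
    prompt: str
    content: str

# One shared record at module scope, as in the updated README example.
data = MetaData(
    data_id='123',
    prompt="hello, introduce the world",
    content="Hello! The world is a vast and diverse place.",
)

def llm_check(record: MetaData) -> dict:
    # Placeholder for LLMTextQualityModelBase.eval(data); a real run
    # would call the configured LLM endpoint.
    return {"data_id": record.data_id, "checker": "llm", "ok": bool(record.content)}

def rule_check(record: MetaData) -> dict:
    # Placeholder in the spirit of RuleEnterAndSpace().eval(data):
    # flag runs of consecutive newlines or spaces in the content.
    noisy = "\n\n" in record.content or "  " in record.content
    return {"data_id": record.data_id, "checker": "rule", "ok": not noisy}

print(llm_check(data))
print(rule_check(data))
```

Sharing one record keeps the two checkers comparable: both report against the same `data_id`, so their results can be joined downstream.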
````diff
@@ -132,58 +104,6 @@ result = executor.execute()
 print(result)
 ```
 
-### 4. Evaluate JSON/JSONL Format
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate a JSON file
-input_data = {
-    "eval_group": "default",   # Default rule set
-    "input_path": "data.json", # Path to local JSON file
-    "dataset": "local",
-    "data_format": "json",     # Format: json
-    "column_content": "text",  # Column containing the text to evaluate
-    "save_data": True          # Save evaluation results
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
-### 5. Using LLM for Evaluation
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate using GPT model
-input_data = {
-    "input_path": "data.jsonl",  # Path to local JSONL file
-    "dataset": "local",
-    "data_format": "jsonl",
-    "column_content": "content",
-    "custom_config": {
-        "prompt_list": ["PromptRepeat"],  # Prompt to use
-        "llm_config": {
-            "detect_text_quality": {
-                "model": "gpt-4o",
-                "key": "YOUR_API_KEY",
-                "api_url": "https://api.openai.com/v1/chat/completions"
-            }
-        }
-    }
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
 ## Command Line Interface
 
 ### Evaluate with Rule Sets
````
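The `Executor.exec_map["local"](input_args)` call that recurs in these examples is a registry lookup: executor backends are stored in a dict keyed by backend name, and the caller picks one at runtime. A hedged stdlib-only sketch of that dispatch pattern (class names and the `register` decorator are illustrative, not dingo's actual implementation):

```python
from typing import Callable, Dict

class Executor:
    # Registry mapping a backend name ("local", "spark", ...) to a factory.
    exec_map: Dict[str, Callable[..., "Executor"]] = {}

    @classmethod
    def register(cls, name: str):
        # Decorator that records a subclass under the given backend name.
        def wrap(subclass):
            cls.exec_map[name] = subclass
            return subclass
        return wrap

    def __init__(self, input_args: dict):
        self.input_args = input_args

    def execute(self):
        raise NotImplementedError

@Executor.register("local")
class LocalExecutor(Executor):
    def execute(self):
        # A real executor would read input_path and run the eval_group rules;
        # here we just echo the configuration to show the dispatch worked.
        return {"backend": "local", "input_path": self.input_args.get("input_path")}

input_args = {"input_path": "data.jsonl", "dataset": "local"}
executor = Executor.exec_map["local"](input_args)
print(executor.execute())
```

The design keeps call sites backend-agnostic: adding a new executor means registering one more class, with no changes to the code that looks up `exec_map`.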
````diff
@@ -471,10 +391,15 @@ Dingo includes an experimental Model Context Protocol (MCP) server. For details
 
 # Research & Publications
 
-- **"Comprehensive Data Quality Assessment for Multilingual WebData"**: [WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
-- **"Pre-training data quality using the DataMan methodology"**: [DataMan: Data Manager for Pre-training Large Language Models](https://openreview.net/pdf?id=eNbA8Fqir4)
+## Research Powered by Dingo
+- **WanJuanSiLu**: [A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
+  *Uses Dingo for comprehensive data quality assessment of multilingual web data*
+
+## Methodologies Implemented in Dingo
+- **DataMan Methodology**: [DataMan: Data Manager for Pre-training Large Language Models](https://openreview.net/pdf?id=eNbA8Fqir4)
+  *Dingo implements the DataMan methodology for pre-training data quality assessment*
+- **RedPajama-Data-v2**: [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)
+  *Dingo implements parts of the RedPajama-Data-v2 methodology for web text quality assessment and filtering*
 
 # Future Plans
````
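The RedPajama-Data-v2 methodology referenced above annotates web text with simple rule-based quality signals and filters on them. A hedged illustration of that idea (the two signals echo RedPajama's word-count and symbol-to-word-ratio annotations, but the symbol set and thresholds here are made up for the sketch):

```python
def quality_signals(text: str) -> dict:
    # Two illustrative signals in the spirit of RedPajama-Data-v2's
    # rule-based annotations (symbol set and values are assumptions).
    words = text.split()
    n_words = len(words)
    symbol_ratio = sum(1 for ch in text if ch in "#…|") / max(n_words, 1)
    return {
        "num_words": n_words,
        "symbol_to_word_ratio": symbol_ratio,
    }

def passes_filter(text: str, min_words: int = 5, max_symbol_ratio: float = 0.1) -> bool:
    # Keep a document only if every signal clears its (illustrative) threshold.
    s = quality_signals(text)
    return s["num_words"] >= min_words and s["symbol_to_word_ratio"] <= max_symbol_ratio

print(passes_filter("Hello! The world is a vast and diverse place."))
```

Computing signals separately from the filtering decision, as sketched here, is what lets thresholds be tuned after annotation rather than during the crawl.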

README_zh-CN.md

Lines changed: 18 additions & 92 deletions
````diff
@@ -53,64 +53,36 @@ pip install dingo-python
 
 ## 2. Usage Examples
 
-### 2.1 Using the Core Evaluation Methods
+### 2.1 Evaluate LLM Chat Data
 
 ```python
 from dingo.config.config import DynamicLLMConfig
 from dingo.io.input.MetaData import MetaData
 from dingo.model.llm.llm_text_quality_model_base import LLMTextQualityModelBase
 from dingo.model.rule.rule_common import RuleEnterAndSpace
 
+data = MetaData(
+    data_id='123',
+    prompt="hello, introduce the world",
+    content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
+)
 
 def llm():
-    data = MetaData(
-        data_id='123',
-        prompt="hello, introduce the world",
-        content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
-    )
-
     LLMTextQualityModelBase.dynamic_config = DynamicLLMConfig(
-        key='',
-        api_url='',
-        # model='',
+        key='YOUR_API_KEY',
+        api_url='https://api.openai.com/v1/chat/completions',
+        model='gpt-4o',
     )
     res = LLMTextQualityModelBase.eval(data)
     print(res)
 
 
 def rule():
-    data = MetaData(
-        data_id='123',
-        prompt="hello, introduce the world",
-        content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
-    )
-
     res = RuleEnterAndSpace().eval(data)
     print(res)
 ```
 
-### 2.2 Evaluate a Local Text File (Plaintext)
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate a plaintext file
-input_data = {
-    "eval_group": "sft",         # Rule set for SFT data
-    "input_path": "data.txt",    # Path to local text file
-    "dataset": "local",
-    "data_format": "plaintext",  # Format: plaintext
-    "save_data": True            # Save evaluation results
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
-### 2.3 Evaluate a Hugging Face Dataset
+### 2.2 Evaluate a Dataset
 
 ```python
 from dingo.io import InputArgs
````
````diff
@@ -130,58 +102,6 @@ result = executor.execute()
 print(result)
 ```
 
-### 2.4 Evaluate JSON/JSONL Format
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate a JSON file
-input_data = {
-    "eval_group": "default",   # Default rule set
-    "input_path": "data.json", # Path to local JSON file
-    "dataset": "local",
-    "data_format": "json",     # Format: json
-    "column_content": "text",  # Column containing the text to evaluate
-    "save_data": True          # Save evaluation results
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
-### 2.5 Evaluate with an LLM
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate using a GPT model
-input_data = {
-    "input_path": "data.jsonl",  # Path to local JSONL file
-    "dataset": "local",
-    "data_format": "jsonl",
-    "column_content": "content",
-    "custom_config": {
-        "prompt_list": ["PromptRepeat"],  # Prompt to use
-        "llm_config": {
-            "detect_text_quality": {
-                "model": "gpt-4o",
-                "key": "YOUR_API_KEY",
-                "api_url": "https://api.openai.com/v1/chat/completions"
-            }
-        }
-    }
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
 ## 3. Command Line Interface
 
 ### 3.1 Evaluate with Rule Sets
````
````diff
@@ -470,9 +390,15 @@ Dingo includes an experimental Model Context Protocol (MCP) server. For details
 
 # Research & Publications
 
-- **"Data quality assessment for multilingual web data"**: [WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
-- **"Pre-training data quality assessment using the DataMan methodology"**: [DataMan: Data Manager for Pre-training Large Language Models](https://openreview.net/pdf?id=eNbA8Fqir4)
+## Research Powered by Dingo
+- **WanJuanSiLu**: [A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
+  *Uses Dingo for comprehensive data quality assessment of multilingual web data*
+
+## Methodologies Implemented in Dingo
+- **DataMan Methodology**: [DataMan: Data Manager for Pre-training Large Language Models](https://openreview.net/pdf?id=eNbA8Fqir4)
+  *Dingo implements the DataMan methodology for pre-training data quality assessment*
+- **RedPajama-Data-v2**: [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)
+  *Dingo implements parts of the RedPajama-Data-v2 methodology for web text quality assessment and filtering*
 
 # Future Plans
````
