Commit 42b24e8

docs: update readme
1 parent e4c8edb commit 42b24e8

2 files changed: +32 -121 lines changed

README.md

Lines changed: 16 additions & 61 deletions
@@ -55,22 +55,21 @@ pip install dingo-python
 
 ## Example Use Cases
 
-### 1. Using Evaluate Core
+### 1. Evaluate Stream Data
 
 ```python
 from dingo.config.config import DynamicLLMConfig
 from dingo.io.input.MetaData import MetaData
 from dingo.model.llm.llm_text_quality_model_base import LLMTextQualityModelBase
 from dingo.model.rule.rule_common import RuleEnterAndSpace
 
+data = MetaData(
+    data_id='123',
+    prompt="hello, introduce the world",
+    content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
+)
 
 def llm():
-    data = MetaData(
-        data_id='123',
-        prompt="hello, introduce the world",
-        content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
-    )
-
     LLMTextQualityModelBase.dynamic_config = DynamicLLMConfig(
         key='',
         api_url='',
@@ -81,38 +80,11 @@ def llm():
 
 
 def rule():
-    data = MetaData(
-        data_id='123',
-        prompt="hello, introduce the world",
-        content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
-    )
-
     res = RuleEnterAndSpace().eval(data)
     print(res)
 ```
 
-### 2. Evaluate Local Text File (Plaintext)
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate a plaintext file
-input_data = {
-    "eval_group": "sft",  # Rule set for SFT data
-    "input_path": "data.txt",  # Path to local text file
-    "dataset": "local",
-    "data_format": "plaintext",  # Format: plaintext
-    "save_data": True  # Save evaluation results
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
-### 3. Evaluate Hugging Face Dataset
+### 2. Evaluate Hugging Face Dataset
 
 ```python
 from dingo.io import InputArgs
@@ -132,29 +104,7 @@ result = executor.execute()
 print(result)
 ```
 
-### 4. Evaluate JSON/JSONL Format
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate a JSON file
-input_data = {
-    "eval_group": "default",  # Default rule set
-    "input_path": "data.json",  # Path to local JSON file
-    "dataset": "local",
-    "data_format": "json",  # Format: json
-    "column_content": "text",  # Column containing the text to evaluate
-    "save_data": True  # Save evaluation results
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
-### 5. Using LLM for Evaluation
+### 3. Using LLM for Evaluation
 
 ```python
 from dingo.io import InputArgs
@@ -471,10 +421,15 @@ Dingo includes an experimental Model Context Protocol (MCP) server. For details
 
 # Research & Publications
 
-- **"Comprehensive Data Quality Assessment for Multilingual WebData"** : [WanJuanSiLu: A High-Quality Open-Source Webtext
-Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
-- **"Pre-training data quality using the DataMan methodology"** : [DataMan: Data Manager for Pre-training Large Language Models](https://openreview.net/pdf?id=eNbA8Fqir4)
+## Research Powered by Dingo
+- **WanJuanSiLu**: [A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
+*Uses Dingo for comprehensive data quality assessment of multilingual web data*
 
+## Methodologies Implemented in Dingo
+- **DataMan Methodology**: [DataMan: Data Manager for Pre-training Large Language Models](https://openreview.net/pdf?id=eNbA8Fqir4)
+*Dingo implements the DataMan methodology for pre-training data quality assessment*
+- **RedPajama-Data-v2**: [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)
+*Dingo implements parts of the RedPajama-Data-v2 methodology for web text quality assessment and filtering*
 
 # Future Plans

README_zh-CN.md

Lines changed: 16 additions & 60 deletions
@@ -53,22 +53,21 @@ pip install dingo-python
 
 ## 2. Usage Examples
 
-### 2.1 Using the Core Evaluation Methods
+### 2.1 Evaluate Streaming Data
 
 ```python
 from dingo.config.config import DynamicLLMConfig
 from dingo.io.input.MetaData import MetaData
 from dingo.model.llm.llm_text_quality_model_base import LLMTextQualityModelBase
 from dingo.model.rule.rule_common import RuleEnterAndSpace
 
+data = MetaData(
+    data_id='123',
+    prompt="hello, introduce the world",
+    content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
+)
 
 def llm():
-    data = MetaData(
-        data_id='123',
-        prompt="hello, introduce the world",
-        content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
-    )
-
     LLMTextQualityModelBase.dynamic_config = DynamicLLMConfig(
         key='',
         api_url='',
@@ -79,38 +78,11 @@ def llm():
 
 
 def rule():
-    data = MetaData(
-        data_id='123',
-        prompt="hello, introduce the world",
-        content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
-    )
-
     res = RuleEnterAndSpace().eval(data)
     print(res)
 ```
 
-### 2.2 Evaluate a Local Text File (Plaintext)
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate a plaintext file
-input_data = {
-    "eval_group": "sft",  # Rule set for SFT data
-    "input_path": "data.txt",  # Path to local text file
-    "dataset": "local",
-    "data_format": "plaintext",  # Format: plaintext
-    "save_data": True  # Save evaluation results
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
-### 2.3 Evaluate a Hugging Face Dataset
+### 2.2 Evaluate a Hugging Face Dataset
 
 ```python
 from dingo.io import InputArgs
@@ -130,29 +102,7 @@ result = executor.execute()
 print(result)
 ```
 
-### 2.4 Evaluate JSON/JSONL Format
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate a JSON file
-input_data = {
-    "eval_group": "default",  # Default rule set
-    "input_path": "data.json",  # Path to local JSON file
-    "dataset": "local",
-    "data_format": "json",  # Format: json
-    "column_content": "text",  # Column containing the text to evaluate
-    "save_data": True  # Save evaluation results
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
-### 2.5 Using an LLM for Evaluation
+### 2.3 Using an LLM for Evaluation
 
 ```python
 from dingo.io import InputArgs
@@ -470,9 +420,15 @@ Dingo includes an experimental Model Context Protocol (MCP) server. For details on
 
 # Research & Publications
 
+## Research Powered by Dingo
+- **WanJuanSiLu**: [A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
+*Uses Dingo for comprehensive data quality assessment of multilingual web data*
 
-- **"Data Quality Assessment for Multilingual Web Data"** : [WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
-- **"Assessing Pre-training Data Quality with the DataMan Methodology"** : [DataMan: Data Manager for Pre-training Large Language Models](https://openreview.net/pdf?id=eNbA8Fqir4)
+## Methodologies Implemented in Dingo
+- **DataMan Methodology**: [DataMan: Data Manager for Pre-training Large Language Models](https://openreview.net/pdf?id=eNbA8Fqir4)
+*Dingo implements the DataMan methodology for pre-training data quality assessment*
+- **RedPajama-Data-v2**: [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)
+*Dingo implements parts of the RedPajama-Data-v2 methodology for web text quality assessment and filtering*
 
 # Future Plans
