content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
-)
-
 res = RuleEnterAndSpace().eval(data)
 print(res)
 ```
 
-### 2. Evaluate Local Text File (Plaintext)
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate a plaintext file
-input_data = {
-    "eval_group": "sft",  # Rule set for SFT data
-    "input_path": "data.txt",  # Path to local text file
-    "dataset": "local",
-    "data_format": "plaintext",  # Format: plaintext
-    "save_data": True  # Save evaluation results
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
-### 3. Evaluate Hugging Face Dataset
+### 2. Evaluate Hugging Face Dataset
 
 ```python
 from dingo.io import InputArgs
@@ -132,29 +104,7 @@ result = executor.execute()
 print(result)
 ```
 
-### 4. Evaluate JSON/JSONL Format
-
-```python
-from dingo.io import InputArgs
-from dingo.exec import Executor
-
-# Evaluate a JSON file
-input_data = {
-    "eval_group": "default",  # Default rule set
-    "input_path": "data.json",  # Path to local JSON file
-    "dataset": "local",
-    "data_format": "json",  # Format: json
-    "column_content": "text",  # Column containing the text to evaluate
-    "save_data": True  # Save evaluation results
-}
-
-input_args = InputArgs(**input_data)
-executor = Executor.exec_map["local"](input_args)
-result = executor.execute()
-print(result)
-```
-
-### 5. Using LLM for Evaluation
+### 3. Using LLM for Evaluation
 
 ```python
 from dingo.io import InputArgs
@@ -471,10 +421,15 @@ Dingo includes an experimental Model Context Protocol (MCP) server. For details
 
 # Research & Publications
 
-- **"Comprehensive Data Quality Assessment for Multilingual WebData"** : [WanJuanSiLu: A High-Quality Open-Source Webtext
-Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
-- **"Pre-training data quality using the DataMan methodology"** : [DataMan: Data Manager for Pre-training Large Language Models](https://openreview.net/pdf?id=eNbA8Fqir4)
+## Research Powered by Dingo
+- **WanJuanSiLu**: [A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
+*Uses Dingo for comprehensive data quality assessment of multilingual web data*
 
+## Methodologies Implemented in Dingo
+- **DataMan Methodology**: [DataMan: Data Manager for Pre-training Large Language Models](https://openreview.net/pdf?id=eNbA8Fqir4)
+*Dingo implements the DataMan methodology for pre-training data quality assessment*