Merge pull request #60 from e06084/dev

shijinpjlab · web-flow · commit 8996fc92b548 · 2025-04-17T10:55:15.000+08:00
docs: update readme
diff --git a/README.md b/README.md
@@ -21,11 +21,9 @@
 
 <div align="center">
   <a href="https://discord.gg/Jhgb2eKWh8" style="text-decoration:none;">
-    <img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="Discord" /></a>
-  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
+    <img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="8%" alt="Discord" /></a>
   <a href="https://huggingface.co/spaces/DataEval/dingo" style="text-decoration:none;">
-    <img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" width="3%" alt="Hugging Face" /></a>
-  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
+    <img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" width="8%" alt="Hugging Face" /></a>
 </div>
 
 
@@ -453,6 +451,12 @@ Example summary:
 }
 ```
 
+# Research & Publications
+
+- **"Comprehensive Data Quality Assessment for Multilingual WebData"** : [WanJuanSiLu: A High-Quality Open-Source Webtext
+Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
+
+
 # Future Plans
 
 - [ ] Richer graphic and text evaluation indicators
diff --git a/README_zh-CN.md b/README_zh-CN.md
@@ -15,13 +15,21 @@
   <a href="https://github.com/DataEval/dingo/issues"><img src="https://img.shields.io/github/issues/DataEval/dingo" alt="GitHub issues"></a>
 </p>
 
+
+<div align="center">
+  <a href="https://discord.gg/Jhgb2eKWh8" style="text-decoration:none;">
+    <img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="8%" alt="Discord" /></a>
+  <a href="https://huggingface.co/spaces/DataEval/dingo" style="text-decoration:none;">
+    <img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" width="8%" alt="Hugging Face" /></a>
+</div>
+
 </div>
 
 # Changelog
 
 - 2024/12/27: Project Initialization
 
-# I. Introduction
+# Introduction
 
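 Dingo is a data quality evaluation tool that helps you automatically detect data quality issues in your datasets. Dingo provides a variety of built-in rules and model evaluation methods, and also supports custom evaluation methods. Dingo supports commonly used text datasets and multimodal datasets, including pre-training datasets, fine-tuning datasets, and evaluation datasets. In addition, Dingo supports multiple usage modes, including a local CLI and an SDK, making it easy to integrate into evaluation platforms such as [OpenCompass](https://github.com/open-compass/opencompass).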
 
@@ -30,7 +38,7 @@ Dingo是一款数据质量评估工具，帮助你自动化检测数据集中的
 ![Architecture of dingo](./docs/assets/architeture.png)
 
 
-# II. Quick Start
+# Quick Start
 
 ## 1. Installation
 
@@ -208,7 +216,7 @@ python -m dingo.run.vsl --input 输出目录
 ## 5. Online Demo
 Try our online demo: [(Hugging Face)🤗](https://huggingface.co/spaces/DataEval/dingo)
 
-# III. Data Quality Metrics
+# Data Quality Metrics
 
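 Dingo classifies data quality issues into 7 dimensions of quality metrics. Each dimension can be evaluated with both rule-based methods and LLM-based prompts: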
 
@@ -289,7 +297,7 @@ input_data = {
 
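 Each rule checks a specific aspect of text quality and maps to one of these metrics. When you run an evaluation, Dingo provides a score for each dimension and identifies which rules were triggered.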
 
-# IV. Rule Groups
+# Rule Groups
 
 Dingo provides pre-configured rule groups for different types of datasets:
 
@@ -308,7 +316,7 @@ input_data = {
 }
 ```
 
-# V. Feature Highlights
+# Feature Highlights
 
 ## 1. Multi-source and Multimodal Support
 
@@ -334,7 +342,7 @@ input_data = {
 - **Quality metrics**: 7-dimensional quality assessment
 - **Traceability**: Detailed reports for anomaly tracking
 
-# VI. User Guide
+# User Guide
 
 ## 1. Custom Rules, Prompts, and Models
 
@@ -444,27 +452,32 @@ result = executor.execute()
 }
 ```
 
-# VII. Future Plans
+# Research & Publications
+
+
+- **"Data Quality Assessment for Multilingual Web Data"**: [WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
+
+# Future Plans
 
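 - [ ] Richer image-text evaluation metrics
 - [ ] Evaluation of audio and video data modalities
 - [ ] Small-model evaluation (e.g., fasttext, Qurating)
 - [ ] Data diversity evaluation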
 
-# VIII. Limitations
+# Limitations
 
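 The current built-in detection rules and model methods focus mainly on common data quality problems. For specialized evaluation needs, we recommend customizing the detection rules.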
 
-# IX. Acknowledgments
+# Acknowledgments
 
 - [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)
 - [mlflow](https://github.com/mlflow/mlflow)
 
-# X. Contribution
+# Contribution
 
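 We thank all contributors for their efforts to improve and enhance `Dingo`. Please refer to the [Contribution Guide](docs/en/CONTRIBUTING.md) for guidance on contributing to the project.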
 
-# XI. License
+# License
 
 This project is released under the [Apache 2.0 open-source license](LICENSE).