Commit 6890aa0

update: file name in eval dir
1 parent 3bbbbca commit 6890aa0

5 files changed

+17
-16
lines changed

docs/eval/dataset_multi_lan.md renamed to docs/eval/prompt/multi_language_data_evaluated_by_prompt.md

Lines changed: 12 additions & 12 deletions
@@ -1,7 +1,7 @@
 # Multi_Lan Dataset
 
 ## Dataset Introduction
-Multi_Lan Dataset aims to evaluate the ability of Dingo's built-in prompt to mine low-quality data in multi-language pre-training datasets. We extracted a portion of data from the Common Crawl (CC) dataset, which was then annotated by experts in these languages based on seven quality dimensions([quality_metrics](../metrics.md)). If any dimension has problems, the data will be marked as low-quality data.
+Multi_Lan Dataset aims to evaluate the ability of Dingo's built-in prompt to mine low-quality data in multi-language pre-training datasets. We extracted a portion of data from the Common Crawl (CC) dataset, which was then annotated by experts in these languages based on seven quality dimensions([quality_metrics](../../metrics.md)). If any dimension has problems, the data will be marked as low-quality data.
 
 | Field Name | Description |
 |--------------|------------------------------|
@@ -16,25 +16,25 @@ Multi_Lan Dataset aims to evaluate the ability of Dingo's built-in prompt to min
 ### Dataset Link
 The dataset is available for different languages through the following links:
 
-| Language | Dataset Link |
-|----------|----------------------------------------------|
-| Russian | https://huggingface.co/datasets/chupei/cc_ru |
+| Language | Dataset Link |
+|------------|----------------------------------------------|
+| Russian | https://huggingface.co/datasets/chupei/cc_ru |
 | Thai | https://huggingface.co/datasets/chupei/cc_th |
-| Vietnamese | https://huggingface.co/datasets/chupei/cc_vi |
-| Hungarian | https://huggingface.co/datasets/chupei/cc_hu |
+| Vietnamese | https://huggingface.co/datasets/chupei/cc_vi |
+| Hungarian | https://huggingface.co/datasets/chupei/cc_hu |
 | Serbian | https://huggingface.co/datasets/chupei/cc_sr |
 
 
 ### Dataset Composition
 The dataset includes five languages: Russian, Thai, Vietnamese, Hungarian, and Serbian. Below is a summary of each language's data:
 
 | Language | Number of dataset | Number of High-Quality Data | Number of Low-Quality Data |
-|------|-------------------|-----------------------------|----------------------------|
-| Russian | 154 | 71 | 83 |
-| Thai | 267 | 128 | 139 |
-| Vietnamese | 214 | 101 | 113 |
-| Hungarian | 225 | 99 | 126 |
-| Serbian | 144 | 38 | 76 |
+|------------|-------------------|-----------------------------|----------------------------|
+| Russian | 154 | 71 | 83 |
+| Thai | 267 | 128 | 139 |
+| Vietnamese | 214 | 101 | 113 |
+| Hungarian | 225 | 99 | 126 |
+| Serbian | 144 | 38 | 76 |
 
 
 
File renamed without changes.
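
A quick consistency check on the Dataset Composition table in multi_language_data_evaluated_by_prompt.md may be useful when reviewing this change. The sketch below assumes every sample is labelled either high-quality or low-quality, so the two counts should add up to the per-language total; the `ROWS` list and the `inconsistent_rows` helper are illustrative names, not part of Dingo.

```python
# Rows copied from the Dataset Composition table: (language, total, high, low).
ROWS = [
    ("Russian", 154, 71, 83),
    ("Thai", 267, 128, 139),
    ("Vietnamese", 214, 101, 113),
    ("Hungarian", 225, 99, 126),
    ("Serbian", 144, 38, 76),
]


def inconsistent_rows(rows):
    """Return (language, total, high + low) for rows whose counts do not add up.

    Assumption: each sample is either high-quality or low-quality.
    """
    return [
        (lang, total, high + low)
        for lang, total, high, low in rows
        if high + low != total
    ]


mismatches = inconsistent_rows(ROWS)
for lang, total, summed in mismatches:
    print(f"{lang}: table total {total}, high + low = {summed}")
```

Under that assumption only the Serbian row is flagged (144 versus 38 + 76 = 114); the table alone does not show which of the three figures is the wrong one.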

docs/eval/dataset_redpajama.md renamed to docs/eval/prompt/redpajama_data_evaluated_by_prompt.md

Lines changed: 3 additions & 2 deletions
@@ -29,8 +29,9 @@ https://huggingface.co/datasets/chupei/redpajama_bad_model
 | Negative Examples: irrelevance | 49 |
 
 ## Prompt Introduction
-The built-in **PromptTextQualityV2** is used as the prompt for this test. Specific content can be referred to: [Introduction to PromptTextQualityV2](../../dingo/model/prompt/prompt_text_quality_v2.py)<br>
-The built-in prompt collection can be referred to: [Prompt Collection](../../dingo/model/prompt)
+The built-in **PromptTextQualityV2** is used as the prompt for this test.<br>
+Specific content can be referred to: [Introduction to PromptTextQualityV2](../../../dingo/model/prompt/prompt_text_quality_v2.py)<br>
+The built-in prompt collection can be referred to: [Prompt Collection](../../../dingo/model/prompt)
 
 ## Evaluation Results
 ### Concept Introduction
File renamed without changes.

docs/eval/dataset_slimpajama.md renamed to docs/eval/rule/slimpajama_data_evaluated_by_rule.md

Lines changed: 2 additions & 2 deletions
@@ -40,8 +40,8 @@ https://huggingface.co/datasets/chupei/slimpajama_goodcase_rule
 | Negative examples: RuleWordNumber | 7 |
 
 ## Rules Introduction
-This test uses the built-in **pretrain** as the eval_group. For specific rules included, please refer to: [Group Introduction](../groups.md).<br>
-For rules within the group, please refer to: [Rules Introduction](../rules.md).
+This test uses the built-in **pretrain** as the eval_group. For specific rules included, please refer to: [Group Introduction](../../groups.md).<br>
+For rules within the group, please refer to: [Rules Introduction](../../rules.md).
 
 ## Evaluation Results
 ### Definitions
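
The link edits in all three files follow from the documents moving one directory deeper (from `docs/eval/` into `docs/eval/prompt/` or `docs/eval/rule/`), which adds one `../` to every relative link. A minimal sketch of how the new links could be double-checked is shown below; it assumes the targets live at `docs/metrics.md`, `docs/groups.md`, and `dingo/model/prompt/...` relative to the repository root, as the old links imply, and `relative_link` is a hypothetical helper.

```python
import posixpath


def relative_link(target, doc_dir):
    """Relative path from the directory containing a document to a target.

    Both arguments are repository-root-relative POSIX paths (assumed layout).
    """
    return posixpath.relpath(target, start=doc_dir)


# Before the move: documents sat directly in docs/eval/.
old_metrics = relative_link("docs/metrics.md", "docs/eval")
# After the move: documents sit in docs/eval/prompt/ or docs/eval/rule/.
new_metrics = relative_link("docs/metrics.md", "docs/eval/prompt")
new_groups = relative_link("docs/groups.md", "docs/eval/rule")
new_prompt = relative_link(
    "dingo/model/prompt/prompt_text_quality_v2.py", "docs/eval/prompt"
)

print(old_metrics)  # ../metrics.md
print(new_metrics)  # ../../metrics.md
print(new_groups)   # ../../groups.md
print(new_prompt)   # ../../../dingo/model/prompt/prompt_text_quality_v2.py
```

The computed values match the `+` lines in the hunks above, which suggests the updated links are consistent with the new file locations.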
