|
1 | | -# Dataset Slimpajama |
| 1 | +# Slimpajama Dataset |
2 | 2 |
|
3 | | -## 数据集介绍 |
4 | | -本数据集旨在评估dingo内置规则的准确性,因此选择开源数据集slimpajama,从中抽取数据构建测试集。 |
| 3 | +## Dataset Introduction |
| 4 | +This dataset is used to evaluate the accuracy of dingo's built-in rules. The test set was built by sampling data from the open-source Slimpajama dataset. |
5 | 5 |
|
6 | | -| 字段名 | 介绍 | |
7 | | -|--------------|------------------------------------------| |
8 | | -| data_id | 数据id,没有特殊含义,用户可根据自身需求修改 | |
9 | | -| content | 待测试数据 | |
10 | | -| language | 语言类型 | |
11 | | -| error_status | 数据状态,True为负例数据,False为正例数据 | |
12 | | -| type_list | 负例数据的负例类型,正例数据该字段则为空列表 | |
13 | | -| name_list | 负例数据的负例名称,正例数据该字段则为空列表 | |
14 | | -| reason_list | 负例数据的负例介绍,正例数据该字段则为空列表 | |
| 6 | +| Field Name | Description | |
| 7 | +|--------------|-------------------------------------------------------------------------------| |
| 8 | +| data_id | Data ID, without special meaning, can be modified according to user needs | |
| 9 | +| content | Data to be tested | |
| 10 | +| language | Language type | |
| 11 | +| error_status | Data status, True for negative examples, False for positive examples | |
| 12 | +| type_list | Negative example types for negative data, empty list for positive data | |
| 13 | +| name_list | Negative example names for negative data, empty list for positive data | |
| 14 | +| reason_list | Negative example descriptions for negative data, empty list for positive data | |
15 | 15 |
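The field table can be read as a small record schema. The sketch below is one possible consistency check, not part of dingo: the field names come from the table, while the expected Python types, the `check_record` helper, and both sample records are assumptions made for illustration.

```python
from typing import Any

# Field names are taken from the table; the Python types are assumed.
EXPECTED_FIELDS = {
    "data_id": (str, int),
    "content": str,
    "language": str,
    "error_status": bool,
    "type_list": list,
    "name_list": list,
    "reason_list": list,
}

LABEL_FIELDS = ("type_list", "name_list", "reason_list")


def check_record(record: dict[str, Any]) -> list[str]:
    """Return the problems found in one record (empty list if none)."""
    problems = []
    for field, expected in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"unexpected type for {field}")
    if problems:
        return problems
    # Per the table, positive examples (error_status False) carry empty lists.
    if not record["error_status"]:
        for field in LABEL_FIELDS:
            if record[field]:
                problems.append(f"positive example has a non-empty {field}")
    return problems


# Hypothetical records, invented for this example.
positive_record = {
    "data_id": "0",
    "content": "An ordinary paragraph of text.",
    "language": "en",
    "error_status": False,
    "type_list": [],
    "name_list": [],
    "reason_list": [],
}
inconsistent_record = dict(positive_record, name_list=["RuleNoPunc"])

print(check_record(positive_record))
print(check_record(inconsistent_record))
```

The check only enforces what the table states: positive examples have empty label lists. It does not require negative examples to fill all three lists, since the table does not say so explicitly.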
|
16 | | -链接: |
| 16 | +Links: |
17 | 17 | https://huggingface.co/datasets/chupei/slimpajama_badcase_rule |
18 | 18 | https://huggingface.co/datasets/chupei/slimpajama_goodcase_rule |
19 | 19 |
|
20 | | -### 数据集构成 |
21 | | -| 类型 | 数量 | |
22 | | -|-----------------------------------|----| |
23 | | -| 正例数据 | 82 | |
24 | | -| 负例数据:RuleAlphaWords | 27 | |
25 | | -| 负例数据:RuleCapitalWords | 26 | |
26 | | -| 负例数据:RuleCharNumber | 5 | |
27 | | -| 负例数据:RuleDocRepeat | 17 | |
28 | | -| 负例数据:RuleHtmlEntity | 3 | |
29 | | -| 负例数据:RuleLineEndWithEllipsis | 5 | |
30 | | -| 负例数据:RuleLineEndWithTerminal | 5 | |
31 | | -| 负例数据:RuleLineStartWithBulletpoint | 6 | |
32 | | -| 负例数据:RuleLoremIpsum | 5 | |
33 | | -| 负例数据:RuleMeanWordLength | 12 | |
34 | | -| 负例数据:RuleNoPunc | 7 | |
35 | | -| 负例数据:RuleSentenceNumber | 8 | |
36 | | -| 负例数据:RuleSpecialCharacter | 4 | |
37 | | -| 负例数据:RuleStopWord | 24 | |
38 | | -| 负例数据:RuleSymbolWordRatio | 5 | |
39 | | -| 负例数据:RuleUniqueWords | 7 | |
40 | | -| 负例数据:RuleWordNumber | 7 | |
| 20 | +### Dataset Composition |
| 21 | +| Type | Count | |
| 22 | +|-------------------------------------------------|-------| |
| 23 | +| Positive examples | 82 | |
| 24 | +| Negative examples: RuleAlphaWords | 27 | |
| 25 | +| Negative examples: RuleCapitalWords | 26 | |
| 26 | +| Negative examples: RuleCharNumber | 5 | |
| 27 | +| Negative examples: RuleDocRepeat | 17 | |
| 28 | +| Negative examples: RuleHtmlEntity | 3 | |
| 29 | +| Negative examples: RuleLineEndWithEllipsis | 5 | |
| 30 | +| Negative examples: RuleLineEndWithTerminal | 5 | |
| 31 | +| Negative examples: RuleLineStartWithBulletpoint | 6 | |
| 32 | +| Negative examples: RuleLoremIpsum | 5 | |
| 33 | +| Negative examples: RuleMeanWordLength | 12 | |
| 34 | +| Negative examples: RuleNoPunc | 7 | |
| 35 | +| Negative examples: RuleSentenceNumber | 8 | |
| 36 | +| Negative examples: RuleSpecialCharacter | 4 | |
| 37 | +| Negative examples: RuleStopWord | 24 | |
| 38 | +| Negative examples: RuleSymbolWordRatio | 5 | |
| 39 | +| Negative examples: RuleUniqueWords | 7 | |
| 40 | +| Negative examples: RuleWordNumber | 7 | |
41 | 41 |
|
42 | | -## 规则介绍 |
43 | | -本次测试使用内置的 **pretrain** 作为eval_group,具体包含的规则可以参考:[集合介绍](../groups.md) |
44 | | -集合内部的规则可以参考:[规则介绍](../rules.md) |
| 42 | +## Rules Introduction |
| 43 | +This test uses the built-in **pretrain** as the eval_group. For specific rules included, please refer to: [Group Introduction](../groups.md).<br> |
| 44 | +For rules within the group, please refer to: [Rules Introduction](../rules.md). |
45 | 45 |
|
46 | | -## 评测结果 |
47 | | -### 概念介绍 |
48 | | -正例数据与负例数据经过评测,均会生成对应的summary文件,因此需要对结果进行定义,明确概念。 |
| 46 | +## Evaluation Results |
| 47 | +### Definitions |
| 48 | +After evaluation, the positive and negative data each produce a summary file, so the terms used to report the results are defined below. |
49 | 49 |
|
50 | | -| 名称 | 介绍 | |
51 | | -|-----|-------------------------------| |
52 | | -| TP | True Positive:正例数据中被评测为正例的数量 | |
53 | | -| FP | False Positive:负例数据中被评测为正例的数量 | |
54 | | -| TN | True Negative:负例数据中被评测为负例的数量 | |
55 | | -| FN | False Negative:正例数据中被评测为负例的数量 | |
56 | | -| 准确率 | TP / (TP + FP) 被评测为正例中正例数据的比率 | |
57 | | -| 召回率 | TP / (TP + FN) 正例数据被评测为正例的比率 | |
58 | | -| F1 | (准确率 + 召回率) / 2 | |
| 50 | +| Term | Description | |
| 51 | +|----------|--------------------------------------------------------------------------------| |
| 52 | +| TP | True Positive: Number of positive examples correctly identified | |
| 53 | +| FP | False Positive: Number of negative examples incorrectly identified as positive | |
| 54 | +| TN | True Negative: Number of negative examples correctly identified | |
| 55 | +| FN | False Negative: Number of positive examples incorrectly identified as negative | |
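| 56 | +| Accuracy | TP / (TP + FP): share of examples identified as positive that are truly positive (conventionally called precision) | |
| 57 | +| Recall | TP / (TP + FN): share of positive examples identified as positive | |
| 58 | +| F1 | (Accuracy + Recall) / 2, the arithmetic mean (the conventional F1 is the harmonic mean) | |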
59 | 59 |
|
60 | | -### 结果展示 |
61 | | -| 数据集名称 | TP | FP | TN | FN | 准确率% | 召回率% | F1 | |
62 | | -|------------|----|----|-----|----|------|------|------| |
63 | | -| slimpajama | 78 | 5 | 103 | 4 | 94 | 95 | 94.5 | |
| 60 | +### Results Display |
| 61 | +| Dataset Name | TP | FP | TN | FN | Accuracy% | Recall% | F1 | |
| 62 | +|--------------|----|----|-----|----|-----------|---------|------| |
| 63 | +| slimpajama | 78 | 5 | 103 | 4 | 94 | 95 | 94.5 | |
64 | 64 |
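The reported percentages can be recomputed from the four counts in the results table. The sketch below applies the formulas from the definitions table; the `summarize` helper and its key names are made up for this example. Two things are worth noting: the quantity labelled "Accuracy" is TP / (TP + FP), which is more commonly called precision, and the document's F1 is an arithmetic mean, so the conventional harmonic-mean F1 is computed alongside it for comparison.

```python
def summarize(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute the metrics from the definitions table as fractions in [0, 1].

    tn is accepted for completeness but is not used by any of these formulas.
    """
    accuracy_as_defined = tp / (tp + fp)  # conventionally "precision"
    recall = tp / (tp + fn)
    return {
        "accuracy_as_defined": accuracy_as_defined,
        "recall": recall,
        # F1 as defined in this document: arithmetic mean.
        "f1_as_defined": (accuracy_as_defined + recall) / 2,
        # Conventional F1: harmonic mean of the same two quantities.
        "f1_harmonic": 2 * accuracy_as_defined * recall
        / (accuracy_as_defined + recall),
    }


# Counts from the slimpajama row of the results table.
metrics = summarize(tp=78, fp=5, tn=103, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f}%")
```

For these counts, both F1 variants round to 94.5, so the choice of mean does not change the reported figure here. The counts also agree with the composition table: TP + FN = 82 positive examples.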
|
65 | | -## 评测方式 |
| 65 | +## Evaluation Method |
| 66 | +Translate this markdown into English. |
66 | 67 |
|
67 | 68 | ```python |
68 | 69 | from dingo.io import InputArgs |
|