Slimpajama Dataset

Dataset Introduction

This dataset aims to evaluate the accuracy of the built-in rules in dingo. Therefore, the open-source dataset Slimpajama was selected, and data was extracted from it to construct the test set.

Field Name	Description
data_id	Data ID, without special meaning, can be modified according to user needs
content	Data to be tested
language	Language type
eval_status	Data status, True for negative examples, False for positive examples
type_list	Negative example types for negative data, empty list for positive data
name_list	Negative example names for negative data, empty list for positive data
reason_list	Negative example descriptions for negative data, empty list for positive data

Links: https://huggingface.co/datasets/chupei/slimpajama_badcase_rule https://huggingface.co/datasets/chupei/slimpajama_goodcase_rule

Dataset Composition

Type	Count
Positive examples	82
Negative examples: RuleAlphaWords	27
Negative examples: RuleCapitalWords	26
Negative examples: RuleCharNumber	5
Negative examples: RuleDocRepeat	17
Negative examples: RuleHtmlEntity	3
Negative examples: RuleLineEndWithEllipsis	5
Negative examples: RuleLineEndWithTerminal	5
Negative examples: RuleLineStartWithBulletpoint	6
Negative examples: RuleLoremIpsum	5
Negative examples: RuleMeanWordLength	12
Negative examples: RuleNoPunc	7
Negative examples: RuleSentenceNumber	8
Negative examples: RuleSpecialCharacter	4
Negative examples: RuleStopWord	24
Negative examples: RuleSymbolWordRatio	5
Negative examples: RuleUniqueWords	7
Negative examples: RuleWordNumber	7

Rules Introduction

This test uses the built-in pretrain as the eval_group. For specific rules included, please refer to: Group Introduction.
For rules within the group, please refer to: Rules Introduction.

Evaluation Results

Definitions

After evaluation, both positive and negative data will generate corresponding summary files. Therefore, the results need to be defined with clear concepts.

Term	Description
TP	True Positive: Number of positive examples correctly identified
FP	False Positive: Number of negative examples incorrectly identified as positive
TN	True Negative: Number of negative examples correctly identified
FN	False Negative: Number of positive examples incorrectly identified as negative
Accuracy	TP / (TP + FP) Ratio of positive examples in the identified positives
Recall	TP / (TP + FN) Ratio of positive examples correctly identified
F1	(Accuracy + Recall) / 2

Results Display

Dataset Name	TP	FP	TN	FN	Accuracy%	Recall%	F1
slimpajama	78	5	103	4	94	95	94.5

Evaluation Method

Translate this markdown into English.

from dingo.config import InputArgs
from dingo.exec import Executor

input_data = {
    "eval_group": "pretrain",
    "input_path": "chupei/slimpajama_badcase_rule",
    "save_data": True,
    "save_correct": True,
    "save_raw": True,
    "max_workers": 10,
    "batch_size": 10,
    "data_format": "jsonl",
    "column_content": "content",
}
input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
result = executor.execute()
print(result)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slimpajama Dataset

Dataset Introduction

Dataset Composition

Rules Introduction

Evaluation Results

Definitions

Results Display

Evaluation Method

FilesExpand file tree

slimpajama_data_evaluated_by_rule.md

Latest commit

History

slimpajama_data_evaluated_by_rule.md

File metadata and controls

Slimpajama Dataset

Dataset Introduction

Dataset Composition

Rules Introduction

Evaluation Results

Definitions

Results Display

Evaluation Method