Commit 90f2cf2

Update text_pt & text_sft docs (#151)
1 parent e2e87a5 commit 90f2cf2

64 files changed, +1816 -333 lines changed

docs/en/notes/api/operators/text_pt/eval/DebertaV3SampleEvaluator.md

Lines changed: 32 additions & 0 deletions
@@ -33,10 +33,42 @@ def run(self, storage: DataFlowStorage, input_key: str, output_key: str='Deberta
 | **output_key** | str | 'Debertav3Score' | The name of the output column to store the classification result. |
 
 ## 🧠 Example Usage
+```python
+from dataflow.operators.text_pt.eval import DebertaV3SampleEvaluator
+from dataflow.utils.storage import FileStorage
+
+# Prepare data and storage
+storage = FileStorage(first_entry_file_name="pt_input.jsonl")
+
+# Initialize and run the operator
+deberta_evaluator = DebertaV3SampleEvaluator(
+    model_name="Nvidia/deberta-v3-large-local"
+)
+deberta_evaluator.run(
+    storage.step(),
+    input_key='raw_content',
+    output_key='Debertav3Score'
+)
+```
 
 #### 🧾 Default Output Format
 
 | Field | Type | Description |
 | :--- | :--- | :--- |
 | ... | any | Original columns from the input DataFrame. |
 | Debertav3Score | str | The quality classification result from the model. The actual column name is determined by the `output_key`. |
+
+**Example Input:**
+```json
+{
+    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)\nFebruary 23, 2017 Alfred Eaker Leave a comment\nWith Dr. Terror's House of Horrors (1965, directed by Freddie Francis and written by Milton Subotsky) Amicus Productions (spearheaded by Subotsky and Max Rosenberg, who previously produced for Hammer and was a cousin to Doris Wishman) established itself as a vital competitor to Hammer Studios..."
+}
+```
+
+**Example Output:**
+```json
+{
+    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)\nFebruary 23, 2017 Alfred Eaker Leave a comment\nWith Dr. Terror's House of Horrors (1965, directed by Freddie Francis and written by Milton Subotsky) Amicus Productions (spearheaded by Subotsky and Max Rosenberg, who previously produced for Hammer and was a cousin to Doris Wishman) established itself as a vital competitor to Hammer Studios...",
+    "Debertav3Score": "High"
+}
+```
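
Review note: the example output above stores the classification as a string label under the `output_key` column. A minimal sketch of how a downstream step might keep only records with a wanted label follows. It uses only the standard library and does not call the operator; the helper name `keep_by_label` and the `"Low"` label in the sample data are hypothetical (only `"High"` appears in this diff).

```python
import json


def keep_by_label(jsonl_text, key="Debertav3Score", wanted=("High",)):
    """Return the JSONL records whose label under `key` is in `wanted`."""
    kept = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get(key) in wanted:
            kept.append(record)
    return kept


# Two illustrative records; the second label ("Low") is an assumed value.
sample = "\n".join([
    json.dumps({"raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)...",
                "Debertav3Score": "High"}),
    json.dumps({"raw_content": "click here click here",
                "Debertav3Score": "Low"}),
])

high = keep_by_label(sample)
print(len(high))
```

If a different `output_key` was passed to `run`, the same name would need to be passed as `key` here.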

docs/en/notes/api/operators/text_pt/eval/FineWebEduSampleEvaluator.md

Lines changed: 20 additions & 5 deletions
@@ -34,27 +34,42 @@ def run(self, storage: DataFlowStorage, input_key: str, output_key: str='FineWeb
 | **output\_key** | str | 'FineWebEduScore' | The name of the output column where the generated educational value scores will be stored. |
 
 ## 🧠 Example Usage
+```python
+from dataflow.operators.text_pt.eval import FineWebEduSampleEvaluator
+from dataflow.utils.storage import FileStorage
+
+# Prepare data and storage
+storage = FileStorage(first_entry_file_name="pt_input.jsonl")
+
+# Initialize and run the operator
+fineweb_evaluator = FineWebEduSampleEvaluator()
+fineweb_evaluator.run(
+    storage.step(),
+    input_key='raw_content',
+    output_key='FinewebEduScore'
+)
+```
 
 ## 🧾 Output Format
 
 The operator appends a new column (specified by `output_key`) to the input DataFrame, containing the educational score for the text in the `input_key` column.
 
 | Field | Type | Description |
 | :--- | :--- | :--- |
-| [input\_key] | str | The original input text from the source column. |
-| [output\_key] | float | The calculated educational value score, ranging from 0 to 1. |
+| ... | ... | Original columns from the input data. |
+| FinewebEduScore | float | The calculated educational value score. |
 
 **Example Input:**
 ```json
 {
-    "text": "Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy, through a process that uses sunlight, water, and carbon dioxide."
+    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)\nFebruary 23, 2017 Alfred Eaker Leave a comment\nWith Dr. Terror's House of Horrors (1965, directed by Freddie Francis and written by Milton Subotsky) Amicus Productions (spearheaded by Subotsky and Max Rosenberg, who previously produced for Hammer and was a cousin to Doris Wishman) established itself as a vital competitor to Hammer Studios..."
 }
 ```
 
 **Example Output:**
 ```json
 {
-    "text": "Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy, through a process that uses sunlight, water, and carbon dioxide.",
-    "FineWebEduScore": 0.98765
+    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)\nFebruary 23, 2017 Alfred Eaker Leave a comment\nWith Dr. Terror's House of Horrors (1965, directed by Freddie Francis and written by Milton Subotsky) Amicus Productions (spearheaded by Subotsky and Max Rosenberg, who previously produced for Hammer and was a cousin to Doris Wishman) established itself as a vital competitor to Hammer Studios...",
+    "FinewebEduScore": 1.5264956951
 }
 ```

docs/en/notes/api/operators/text_pt/eval/MetaSampleEvaluator.md

Lines changed: 29 additions & 11 deletions
@@ -46,39 +46,57 @@ def run(self, storage: DataFlowStorage, input_key: str):
 ## 🧠 Example Usage
 
 ```python
-
+from dataflow.operators.text_pt.eval import MetaSampleEvaluator
+from dataflow.utils.storage import FileStorage
+from dataflow.utils.llm_serving import APILLMServing_request
+
+# Prepare data and storage
+storage = FileStorage(first_entry_file_name="pt_input.jsonl")
+
+# Initialize LLM serving
+llm_serving = APILLMServing_request(
+    api_url="http://<your_llm_api_endpoint>",
+    model_name="<your_model_name>"
+)
+
+# Initialize and run the operator
+meta_evaluator = MetaSampleEvaluator(llm_serving=llm_serving)
+meta_evaluator.run(
+    storage.step(),
+    input_key='raw_content'
+)
 ```
 
 #### 🧾 Default Output Format
 
 | Field | Type | Description |
 | :--- | :--- | :--- |
-| *input_key* | str | The original input text from the specified input column. |
+| ... | ... | Original columns from the input data. |
 | Text Structure | float | Score for the text's structure. |
 | Diversity & Complexity | float | Score for the text's diversity and complexity. |
 | Fluency & Understandability | float | Score for the text's fluency and understandability. |
 | Safety | float | Score for the text's safety. |
 | Educational Value | float | Score for the text's educational value. |
 | Content Accuracy & Effectiveness | float | Score for the content's accuracy and effectiveness. |
 
-**Example Input (assuming `input_key="text"`):**
+**Example Input:**
 
 ```json
 {
-    "text": "The Pythagorean theorem states that in a right-angled triangle, the square of the length of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the lengths of the other two sides."
+    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)\nFebruary 23, 2017 Alfred Eaker Leave a comment\nWith Dr. Terror's House of Horrors (1965, directed by Freddie Francis and written by Milton Subotsky) Amicus Productions (spearheaded by Subotsky and Max Rosenberg, who previously produced for Hammer and was a cousin to Doris Wishman) established itself as a vital competitor to Hammer Studios..."
 }
 ```
 
 **Example Output:**
 
 ```json
 {
-    "text": "The Pythagorean theorem states that in a right-angled triangle, the square of the length of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the lengths of the other two sides.",
-    "Text Structure": 5.0,
-    "Diversity & Complexity": 4.0,
-    "Fluency & Understandability": 5.0,
-    "Safety": 5.0,
-    "Educational Value": 5.0,
-    "Content Accuracy & Effectiveness": 5.0
+    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)\nFebruary 23, 2017 Alfred Eaker Leave a comment\nWith Dr. Terror's House of Horrors (1965, directed by Freddie Francis and written by Milton Subotsky) Amicus Productions (spearheaded by Subotsky and Max Rosenberg, who previously produced for Hammer and was a cousin to Doris Wishman) established itself as a vital competitor to Hammer Studios...",
+    "Text Structure": 4.0,
+    "Diversity and Complexity": 5.0,
+    "Fluency and Understandability": 4.0,
+    "Safety": 5.0,
+    "Educational Value": 5.0,
+    "Content Accuracy and Effectiveness": 5.0
 }
 ```
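
Review note: in this file the output table names three dimensions with `&` (for example `Diversity & Complexity`), while the new example output spells them with `and`. The diff does not establish which spelling the operator actually emits. A minimal sketch of a consumer that tolerates both spellings is shown below; the helper names and the idea of averaging the six scores are assumptions for illustration, not part of `MetaSampleEvaluator`.

```python
# Dimension names as listed in the output table of this file.
DIMENSIONS = [
    "Text Structure",
    "Diversity & Complexity",
    "Fluency & Understandability",
    "Safety",
    "Educational Value",
    "Content Accuracy & Effectiveness",
]


def normalize_key(name):
    """Map the '&' spelling onto the 'and' spelling so both compare equal."""
    return name.replace(" & ", " and ")


def mean_meta_score(record, dimensions=DIMENSIONS):
    """Average the per-dimension scores, accepting either key spelling."""
    normalized = {normalize_key(k): v for k, v in record.items()}
    values = [float(normalized[normalize_key(d)]) for d in dimensions]
    return sum(values) / len(values)


# Scores copied from the example output in this diff.
record = {
    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)...",
    "Text Structure": 4.0,
    "Diversity and Complexity": 5.0,
    "Fluency and Understandability": 4.0,
    "Safety": 5.0,
    "Educational Value": 5.0,
    "Content Accuracy and Effectiveness": 5.0,
}

print(round(mean_meta_score(record), 4))
```

A missing dimension raises `KeyError` here, which seems preferable to silently averaging fewer scores.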

docs/en/notes/api/operators/text_pt/eval/PairQualSampleEvaluator.md

Lines changed: 21 additions & 7 deletions
@@ -35,23 +35,37 @@ def run(self, storage: DataFlowStorage, input_key: str, output_key: str='PairQua
 
 ## 🧠 Example Usage
 ```python
+from dataflow.operators.text_pt.eval import PairQualSampleEvaluator
+from dataflow.utils.storage import FileStorage
+
+# Prepare data and storage
+storage = FileStorage(first_entry_file_name="pt_input.jsonl")
+
+# Initialize and run the operator
+pairqual_evaluator = PairQualSampleEvaluator(lang='en')
+pairqual_evaluator.run(
+    storage.step(),
+    input_key='raw_content',
+    output_key='PairQualScore'
+)
 ```
 #### 🧾 Default Output Format
 | Field | Type | Description |
 | :-------------- | :---- | :------------------------------------------------------------ |
-| *input_key* | str | The original input text from the specified input column. |
-| PairQualScore | float | The calculated quality score, ranging from 0 to 1. |
+| ... | ... | Original columns from the input data. |
+| PairQualScore | float | The calculated quality score. |
 
-Example Input:
+**Example Input:**
 ```json
 {
-    "instruction":"This is a high-quality piece of text that should receive a good score."
+    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)\nFebruary 23, 2017 Alfred Eaker Leave a comment\nWith Dr. Terror's House of Horrors (1965, directed by Freddie Francis and written by Milton Subotsky) Amicus Productions (spearheaded by Subotsky and Max Rosenberg, who previously produced for Hammer and was a cousin to Doris Wishman) established itself as a vital competitor to Hammer Studios..."
 }
 ```
-Example Output:
+
+**Example Output:**
 ```json
 {
-    "instruction":"This is a high-quality piece of text that should receive a good score.",
-    "PairQualScore": 0.958
+    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)\nFebruary 23, 2017 Alfred Eaker Leave a comment\nWith Dr. Terror's House of Horrors (1965, directed by Freddie Francis and written by Milton Subotsky) Amicus Productions (spearheaded by Subotsky and Max Rosenberg, who previously produced for Hammer and was a cousin to Doris Wishman) established itself as a vital competitor to Hammer Studios...",
+    "PairQualScore": 3.2509903908
 }
 ```

docs/en/notes/api/operators/text_pt/eval/PerplexitySampleEvaluator.md

Lines changed: 17 additions & 5 deletions
@@ -40,29 +40,41 @@ def run(self, storage: DataFlowStorage, input_key: str = 'raw_content', output_k
 ## 🧠 Example Usage
 
 ```python
-
+from dataflow.operators.text_pt.eval import PerplexitySampleEvaluator
+from dataflow.utils.storage import FileStorage
+
+# Prepare data and storage
+storage = FileStorage(first_entry_file_name="pt_input.jsonl")
+
+# Initialize and run the operator
+perplexity_evaluator = PerplexitySampleEvaluator(model_name='gpt2')
+perplexity_evaluator.run(
+    storage.step(),
+    input_key='raw_content',
+    output_key='PerplexityScore'
+)
 ```
 
 #### 🧾 Default Output Format
 
 | Field | Type | Description |
 | :--- | :--- | :--- |
-| raw_content | str | The input text. |
+| ... | ... | Original columns from the input data. |
 | PerplexityScore | float | The calculated perplexity score for the input text. Lower is better. |
 
 **Example Input:**
 
 ```json
 {
-    "raw_content": "The quick brown fox jumps over the lazy dog."
+    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)\nFebruary 23, 2017 Alfred Eaker Leave a comment\nWith Dr. Terror's House of Horrors (1965, directed by Freddie Francis and written by Milton Subotsky) Amicus Productions (spearheaded by Subotsky and Max Rosenberg, who previously produced for Hammer and was a cousin to Doris Wishman) established itself as a vital competitor to Hammer Studios..."
 }
 ```
 
 **Example Output:**
 
 ```json
 {
-    "raw_content": "The quick brown fox jumps over the lazy dog.",
-    "PerplexityScore": 35.82
+    "raw_content": "AMICUS ANTHOLOGIES, PART ONE (1965-1972)\nFebruary 23, 2017 Alfred Eaker Leave a comment\nWith Dr. Terror's House of Horrors (1965, directed by Freddie Francis and written by Milton Subotsky) Amicus Productions (spearheaded by Subotsky and Max Rosenberg, who previously produced for Hammer and was a cousin to Doris Wishman) established itself as a vital competitor to Hammer Studios...",
+    "PerplexityScore": 49.2016410828
 }
 ```
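
Review note: the table above says a lower `PerplexityScore` is better. As a reminder of why, here is a minimal sketch of the textbook definition, the exponential of the mean negative log-likelihood per token. It takes precomputed natural-log token probabilities as input; whether the operator computes its score exactly this way (tokenization, windowing, log base) is not stated in this diff, and the function name `perplexity` is illustrative.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Defined as exp(-(1/N) * sum(log p_i)). A model that assigns higher
    probability to the observed tokens yields a lower value.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among 4 options, so its perplexity is about 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))

# More confident predictions give lower perplexity than less confident ones.
confident = [math.log(0.9)] * 5
unsure = [math.log(0.1)] * 5
print(perplexity(confident) < perplexity(unsure))
```

Under this definition the value is always at least 1, reached only when every token is predicted with probability 1.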
