docs/user-guide/distributeddataclassification.rst (90 additions & 0 deletions)
@@ -31,6 +31,10 @@ Here, we summarize why each is useful for training an LLM:
- The **FineWeb Educational Content Classifier** identifies and prioritizes educational material within datasets. It is especially useful for training LLMs on specialized educational content, which can improve their performance on knowledge-intensive tasks. Models trained on high-quality educational content perform better on academic benchmarks such as MMLU and ARC.
- The **FineWeb Mixtral Educational Classifier** scores the educational value of a document on a scale from 0 (low) to 5 (high). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but with annotations from Mixtral 8x22B-Instruct.
- The **FineWeb Nemotron-4 Educational Classifier** scores the educational value of a document on a scale from 0 (low) to 5 (high). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but with annotations from Nemotron-4-340B-Instruct.
- The **Content Type Classifier** is designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types.
- The **Prompt Task and Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions.
@@ -236,6 +240,92 @@ For example, to create a dataset with only highly educational content (scores 4
This classifier uses a model based on the `Snowflake Arctic-embed-m <https://huggingface.co/Snowflake/snowflake-arctic-embed-m>`_ embedding model with a linear regression layer on top.
It assigns an educational score to each document on a scale from 0 to 5, where higher scores indicate more educational content.
The ``pred_column`` will contain the raw floating-point scores, while the ``int_column`` will contain the rounded integer scores.
The ``quality_label_column`` identifies text as high quality if it scores higher than 2.5 and low quality otherwise.
You can filter the results based on these scores to create datasets with varying levels of educational content.
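
How the three output columns relate can be sketched with plain pandas. This is an illustration under stated assumptions, not the classifier's actual implementation: the column names ``edu_score``, ``edu_score_int``, and ``edu_quality_label``, the label strings, and the sample scores are all hypothetical, and the 2.5 threshold is assumed to apply to the raw floating-point score.

```python
import numpy as np
import pandas as pd

# Hypothetical raw scores standing in for the classifier's floating-point output.
df = pd.DataFrame({"edu_score": [0.3, 2.4, 2.6, 4.7]})

# Rounded integer score (the role played by ``int_column`` in the text).
df["edu_score_int"] = df["edu_score"].round().astype(int)

# Quality label: high quality above 2.5, low quality otherwise.
# Assumption: the threshold is applied to the raw score; the label strings are illustrative.
df["edu_quality_label"] = np.where(
    df["edu_score"] > 2.5, "high_quality", "low_quality"
)

print(df)
```
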
For example, to create a dataset with only highly educational content (scores 4 and 5):
This classifier uses a model based on the `Snowflake Arctic-embed-m <https://huggingface.co/Snowflake/snowflake-arctic-embed-m>`_ embedding model with a linear regression layer on top.
It assigns an educational score to each document on a scale from 0 to 5, where higher scores indicate more educational content.
The ``pred_column`` will contain the raw floating-point scores, while the ``int_column`` will contain the rounded integer scores.
The ``quality_label_column`` identifies text as high quality if it scores higher than 2.5 and low quality otherwise.
You can filter the results based on these scores to create datasets with varying levels of educational content.
For example, to create a dataset with only highly educational content (scores 4 and 5):
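
A minimal sketch of such a filter, using plain pandas rather than the library's own dataset class. The column name ``edu_score_int``, the sample documents, and their scores are hypothetical stand-ins for classifier output.

```python
import pandas as pd

# Hypothetical classifier output: one row per document with its rounded score.
scored = pd.DataFrame(
    {
        "text": [
            "intro to calculus",
            "celebrity gossip",
            "cell biology notes",
            "product listing",
        ],
        "edu_score_int": [5, 1, 4, 2],
    }
)

# Keep only documents whose rounded score is 4 or 5.
high_edu = scored[scored["edu_score_int"] >= 4].reset_index(drop=True)

print(high_edu)
```

Lowering the threshold, for example to ``>= 3``, would yield a larger but less selective dataset.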