
Commit a7fde15

Add support for Nemotron-CC EDU classifiers (#518)
* add fineweb mixtral classifier
* add more files
* run black
* create _FineWebBaseClassifier
* add more docs
* add notebooks and tests
* update classifier names
* fix label logic
* add Vibhu's suggestions
* skip pytests

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
1 parent c5a1c50 commit a7fde15

16 files changed: 1,358 additions, 26 deletions

docs/user-guide/api/classifiers.rst

Lines changed: 6 additions & 0 deletions
@@ -14,6 +14,12 @@ Classifiers
 .. autoclass:: nemo_curator.classifiers.FineWebEduClassifier
     :members:
 
+.. autoclass:: nemo_curator.classifiers.FineWebMixtralEduClassifier
+    :members:
+
+.. autoclass:: nemo_curator.classifiers.FineWebNemotronEduClassifier
+    :members:
+
 .. autoclass:: nemo_curator.classifiers.AegisClassifier
     :members:
 

docs/user-guide/cpuvsgpu.rst

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@ The following NeMo Curator modules are GPU based.
 * Quality Classification
 * AEGIS and Instruction Data Guard Safety Models
 * FineWeb Educational Content Classification
+* FineWeb Mixtral and FineWeb Nemotron-4 Educational Models
 * Content Type Classification
 * Prompt Task and Complexity Classification
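Like the other modules in this list, the FineWeb Mixtral and FineWeb Nemotron-4 educational models expect a GPU Dask client and cuDF-backed data. A minimal setup sketch, mirroring the example scripts added in this commit (paths are placeholders, not real defaults):

    from nemo_curator.classifiers import FineWebMixtralEduClassifier
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.utils.distributed_utils import get_client

    # Start a GPU-backed Dask client, as the example scripts in this commit do.
    client = get_client(cluster_type="gpu")

    # Read JSONL documents into a cuDF-backed dataset (placeholder path).
    input_dataset = DocumentDataset.read_json("/path/to/data", backend="cudf")

    # Score documents using the classifier's default column names.
    result_dataset = FineWebMixtralEduClassifier()(dataset=input_dataset)
    result_dataset.to_json("scored_documents/")

    client.close()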

docs/user-guide/distributeddataclassification.rst

Lines changed: 90 additions & 0 deletions
@@ -31,6 +31,10 @@ Here, we summarize why each is useful for training an LLM:
 
 - The **FineWeb Educational Content Classifier** focuses on identifying and prioritizing educational material within datasets. This classifier is especially useful for training LLMs on specialized educational content, which can improve their performance on knowledge-intensive tasks. Models trained on high-quality educational content demonstrate enhanced capabilities on academic benchmarks such as MMLU and ARC, showcasing the classifier's impact on improving the knowledge-intensive task performance of LLMs.
 
+- The **FineWeb Mixtral Educational Classifier** determines the educational value of a document, scored from 0 (low) to 5 (high). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct.
+
+- The **FineWeb Nemotron-4 Educational Classifier** determines the educational value of a document, scored from 0 (low) to 5 (high). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct.
+
 - The **Content Type Classifier** is designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types.
 
 - The **Prompt Task and Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions.
@@ -236,6 +240,92 @@ For example, to create a dataset with only highly educational content (scores 4 and 5):
     high_edu_dataset = result_dataset[result_dataset["fineweb-edu-score-int"] >= 4]
     high_edu_dataset.to_json("high_educational_content/")
 
+FineWeb Mixtral Edu Classifier
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The FineWeb Mixtral Edu Classifier is designed to identify and prioritize educational content within a dataset.
+It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct.
+In contrast, the original FineWeb-Edu classifier was trained using annotations from Llama 3 70B-Instruct.
+This classifier was used as part of a classifier ensemble in the creation of the `Nemotron-CC dataset <https://arxiv.org/abs/2412.02595>`_.
+These datasets can be used to train LLMs with a focus on educational content, potentially improving their performance on knowledge-intensive tasks.
+
+To use the FineWeb Mixtral Edu Classifier, you can follow this example:
+
+.. code-block:: python
+
+    from nemo_curator.classifiers import FineWebMixtralEduClassifier
+
+    files = get_all_files_paths_under("web_documents/")
+    input_dataset = DocumentDataset.read_json(files, backend="cudf")
+
+    classifier = FineWebMixtralEduClassifier(
+        batch_size=256,
+        text_field="text",
+        pred_column="fineweb-mixtral-edu-score",
+        int_column="fineweb-mixtral-edu-score-int",
+        quality_label_column="fineweb-mixtral-edu-score-label",
+    )
+    result_dataset = classifier(dataset=input_dataset)
+
+    result_dataset.to_json("educational_content/")
+
+This classifier uses a model based on the `Snowflake Arctic-embed-m <https://huggingface.co/Snowflake/snowflake-arctic-embed-m>`_ embedding model with a linear regression layer on top.
+It assigns an educational score to each document on a scale from 0 to 5, where higher scores indicate more educational content.
+
+The ``pred_column`` will contain the raw floating-point scores, while the ``int_column`` will contain the rounded integer scores.
+The ``quality_label_column`` identifies text as high quality if it scores higher than 2.5 and low quality otherwise.
+You can filter the results based on these scores to create datasets with varying levels of educational content.
+
+For example, to create a dataset with only highly educational content (scores 4 and 5):
+
+.. code-block:: python
+
+    high_edu_dataset = result_dataset[result_dataset["fineweb-mixtral-edu-score-int"] >= 4]
+    high_edu_dataset.to_json("high_educational_content/")
+
+FineWeb Nemotron-4 Edu Classifier
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The FineWeb Nemotron-4 Edu Classifier is designed to identify and prioritize educational content within a dataset.
+It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct.
+In contrast, the original FineWeb-Edu classifier was trained using annotations from Llama 3 70B-Instruct.
+This classifier was used as part of a classifier ensemble in the creation of the `Nemotron-CC dataset <https://arxiv.org/abs/2412.02595>`_.
+These datasets can be used to train LLMs with a focus on educational content, potentially improving their performance on knowledge-intensive tasks.
+
+To use the FineWeb Nemotron-4 Edu Classifier, you can follow this example:
+
+.. code-block:: python
+
+    from nemo_curator.classifiers import FineWebNemotronEduClassifier
+
+    files = get_all_files_paths_under("web_documents/")
+    input_dataset = DocumentDataset.read_json(files, backend="cudf")
+
+    classifier = FineWebNemotronEduClassifier(
+        batch_size=256,
+        text_field="text",
+        pred_column="fineweb-nemotron-edu-score",
+        int_column="fineweb-nemotron-edu-score-int",
+        quality_label_column="fineweb-nemotron-edu-score-label",
+    )
+    result_dataset = classifier(dataset=input_dataset)
+
+    result_dataset.to_json("educational_content/")
+
+This classifier uses a model based on the `Snowflake Arctic-embed-m <https://huggingface.co/Snowflake/snowflake-arctic-embed-m>`_ embedding model with a linear regression layer on top.
+It assigns an educational score to each document on a scale from 0 to 5, where higher scores indicate more educational content.
+
+The ``pred_column`` will contain the raw floating-point scores, while the ``int_column`` will contain the rounded integer scores.
+The ``quality_label_column`` identifies text as high quality if it scores higher than 2.5 and low quality otherwise.
+You can filter the results based on these scores to create datasets with varying levels of educational content.
+
+For example, to create a dataset with only highly educational content (scores 4 and 5):
+
+.. code-block:: python
+
+    high_edu_dataset = result_dataset[result_dataset["fineweb-nemotron-edu-score-int"] >= 4]
+    high_edu_dataset.to_json("high_educational_content/")
+
 Content Type Classifier DeBERTa
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
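The sections above show the two classifiers used individually, while noting that they served in a classifier ensemble for the Nemotron-CC dataset. A hedged sketch of one simple way to combine their scores follows; the AND rule over the integer scores is illustrative only and is not the published Nemotron-CC ensembling recipe, and it assumes combined boolean masks work the same way as the single-score filters shown above. Column names match those passed explicitly in the examples above; the input path is a placeholder.

    from nemo_curator.classifiers import (
        FineWebMixtralEduClassifier,
        FineWebNemotronEduClassifier,
    )
    from nemo_curator.datasets import DocumentDataset

    # Placeholder path; requires a GPU-backed Dask client, as in the examples above.
    dataset = DocumentDataset.read_json("web_documents/", backend="cudf")

    # Score the same documents with both classifiers; each adds its own columns.
    dataset = FineWebMixtralEduClassifier(
        pred_column="fineweb-mixtral-edu-score",
        int_column="fineweb-mixtral-edu-score-int",
    )(dataset=dataset)
    dataset = FineWebNemotronEduClassifier(
        pred_column="fineweb-nemotron-edu-score",
        int_column="fineweb-nemotron-edu-score-int",
    )(dataset=dataset)

    # Keep documents that both models rate as highly educational (scores 4 and 5).
    # This simple AND rule is an illustration, not the Nemotron-CC recipe.
    high_edu_dataset = dataset[
        (dataset["fineweb-mixtral-edu-score-int"] >= 4)
        & (dataset["fineweb-nemotron-edu-score-int"] >= 4)
    ]
    high_edu_dataset.to_json("high_educational_content/")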

examples/classifiers/README.md

Lines changed: 2 additions & 0 deletions
@@ -8,6 +8,8 @@ The Python scripts in this directory demonstrate how to run classification on yo
 - AEGIS Safety Models
 - Instruction Data Guard Model
 - FineWeb Educational Content Classifier
+- FineWeb Mixtral Educational Classifier
+- FineWeb Nemotron-4 Educational Classifier
 - Content Type Classifier
 - Prompt Task and Complexity Classifier

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import time
+
+from nemo_curator.classifiers import FineWebMixtralEduClassifier
+from nemo_curator.datasets import DocumentDataset
+from nemo_curator.utils.distributed_utils import get_client
+from nemo_curator.utils.script_utils import ArgumentHelper
+
+
+def main(args):
+    global_st = time.time()
+
+    # Input can be a string or list
+    input_file_path = "/path/to/data"
+    output_file_path = "./"
+
+    client_args = ArgumentHelper.parse_client_args(args)
+    client_args["cluster_type"] = "gpu"
+    client = get_client(**client_args)
+
+    input_dataset = DocumentDataset.read_json(
+        input_file_path, backend="cudf", add_filename=True
+    )
+
+    fineweb_mixtral_edu_classifier = FineWebMixtralEduClassifier()
+    result_dataset = fineweb_mixtral_edu_classifier(dataset=input_dataset)
+    result_dataset.to_json(output_path=output_file_path, write_to_filename=True)
+
+    global_et = time.time()
+    print(
+        f"Total time taken for FineWeb Mixtral Edu Classifier inference: {global_et-global_st} s",
+        flush=True,
+    )
+
+    client.close()
+
+
+def attach_args(
+    parser=argparse.ArgumentParser(
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    ),
+):
+    argumentHelper = ArgumentHelper(parser)
+    argumentHelper.add_distributed_classifier_cluster_args()
+
+    return argumentHelper.parser
+
+
+if __name__ == "__main__":
+    main(attach_args().parse_args())
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import time
+
+from nemo_curator.classifiers import FineWebNemotronEduClassifier
+from nemo_curator.datasets import DocumentDataset
+from nemo_curator.utils.distributed_utils import get_client
+from nemo_curator.utils.script_utils import ArgumentHelper
+
+
+def main(args):
+    global_st = time.time()
+
+    # Input can be a string or list
+    input_file_path = "/path/to/data"
+    output_file_path = "./"
+
+    client_args = ArgumentHelper.parse_client_args(args)
+    client_args["cluster_type"] = "gpu"
+    client = get_client(**client_args)
+
+    input_dataset = DocumentDataset.read_json(
+        input_file_path, backend="cudf", add_filename=True
+    )
+
+    fineweb_nemotron_edu_classifier = FineWebNemotronEduClassifier()
+    result_dataset = fineweb_nemotron_edu_classifier(dataset=input_dataset)
+    result_dataset.to_json(output_path=output_file_path, write_to_filename=True)
+
+    global_et = time.time()
+    print(
+        f"Total time taken for FineWeb Nemotron-4 Edu Classifier inference: {global_et-global_st} s",
+        flush=True,
+    )
+
+    client.close()
+
+
+def attach_args(
+    parser=argparse.ArgumentParser(
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    ),
+):
+    argumentHelper = ArgumentHelper(parser)
+    argumentHelper.add_distributed_classifier_cluster_args()
+
+    return argumentHelper.parser
+
+
+if __name__ == "__main__":
+    main(attach_args().parse_args())

nemo_curator/classifiers/__init__.py

Lines changed: 8 additions & 2 deletions
@@ -1,4 +1,4 @@
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -18,7 +18,11 @@
 from .aegis import AegisClassifier, InstructionDataGuardClassifier
 from .content_type import ContentTypeClassifier
 from .domain import DomainClassifier, MultilingualDomainClassifier
-from .fineweb_edu import FineWebEduClassifier
+from .fineweb_edu import (
+    FineWebEduClassifier,
+    FineWebMixtralEduClassifier,
+    FineWebNemotronEduClassifier,
+)
 from .prompt_task_complexity import PromptTaskComplexityClassifier
 from .quality import QualityClassifier
@@ -29,6 +33,8 @@
     "AegisClassifier",
     "InstructionDataGuardClassifier",
     "FineWebEduClassifier",
+    "FineWebMixtralEduClassifier",
+    "FineWebNemotronEduClassifier",
     "ContentTypeClassifier",
     "PromptTaskComplexityClassifier",
 ]
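With the updated __init__.py, both new classifiers are importable from the package root alongside the existing FineWebEduClassifier. A brief illustrative check of the new exports; actually running the models still requires the GPU environment used by the example scripts above.

    # All three FineWeb-style classifiers now share the top-level namespace.
    from nemo_curator.classifiers import (
        FineWebEduClassifier,
        FineWebMixtralEduClassifier,
        FineWebNemotronEduClassifier,
    )

    # Default construction mirrors the example scripts added in this commit.
    classifiers = [
        FineWebEduClassifier(),
        FineWebMixtralEduClassifier(),
        FineWebNemotronEduClassifier(),
    ]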
