* initial commit
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* update readmes
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* edit readmes
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* working scripts
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* run isort
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* modify base
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* change to output_path
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* Apply suggestions from code review
Co-authored-by: Vibhu Jawa <vibhujawa@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
* change to notimplementederror
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
---------
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Co-authored-by: Vibhu Jawa <vibhujawa@gmail.com>
-- GPU-Accelerated models: [Domain (English and multilingual), Quality, Safety, Educational Content, and Content Type Classification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html)
+- GPU-Accelerated models: [Domain (English and multilingual), Quality, Safety, Educational Content, Content Type, and Prompt Task/Complexity Classification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html)
docs/user-guide/distributeddataclassification.rst
24 additions & 1 deletion
@@ -15,7 +15,7 @@ NeMo Curator provides a module to help users run inference with pre-trained mode
 This is achieved by chunking the datasets across multiple computing nodes, each equipped with multiple GPUs, to accelerate the classification task in a distributed manner.
 Since the classification of a single text document is independent of other documents within the dataset, we can distribute the workload across multiple nodes and GPUs to perform parallel processing.
 
-Domain (English and multilingual), quality, content safety, educational content, and content type models are tasks we include as examples within our module.
+Domain (English and multilingual), quality, content safety, educational content, content type, and prompt task/complexity models are tasks we include as examples within our module.
 
 Here, we summarize why each is useful for training an LLM:
 
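Note on the hunk above: the documentation relies on each document being classified independently, which is what makes chunking across workers safe. A minimal sketch of that idea follows, using only the Python standard library. It is not NeMo Curator code: the `classify` rule is a toy stand-in for a model, threads stand in for GPU workers, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def classify(text: str) -> str:
    """Toy stand-in for a model: label a document by word count (hypothetical rule)."""
    return "long" if len(text.split()) > 5 else "short"


def chunk(items: list, n_chunks: int) -> list:
    """Split items into at most n_chunks contiguous partitions of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    parts, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty partitions when there are few items
            parts.append(items[start:end])
        start = end
    return parts


def classify_partition(docs: list) -> list:
    # Each partition is processed on its own; no state is shared between partitions.
    return [classify(d) for d in docs]


def classify_distributed(docs: list, n_workers: int = 4) -> list:
    """Classify docs partition by partition; threads stand in for GPU workers."""
    parts = chunk(docs, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves partition order, so labels line up with the input documents.
        results = list(pool.map(classify_partition, parts))
    return [label for part in results for label in part]
```

Because no document's label depends on another document, the partitioned result matches a plain sequential pass over the same input.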
@@ -33,6 +33,8 @@ Here, we summarize why each is useful for training an LLM:
 
 - The **Content Type Classifier** is designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types.
 
+- The **Prompt Task/Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions.
+
 -----------------------------------------
 Usage
 -----------------------------------------
@@ -256,6 +258,27 @@ Let's see how ``ContentTypeClassifier`` works in a small excerpt taken from ``ex
 In this example, the content type classifier is obtained directly from `Hugging Face <https://huggingface.co/nvidia/content-type-classifier-deberta>`_.
 It filters the input dataset to include only documents classified as "Blogs" or "News".
 
+Prompt Task/Complexity Classifier
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The Prompt Task/Complexity Classifier is a multi-headed model which classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions and ensembled to create an overall complexity score.
+
+Here's an example of how to use the ``PromptTaskComplexityClassifier``:
+
+.. code-block:: python
+
+    from nemo_curator.classifiers import PromptTaskComplexityClassifier
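Note on the added section above: it says six complexity dimensions are "ensembled" into an overall score but does not say how. One plausible reading is a weighted average, sketched below. The dimension names and weights are placeholders chosen for this sketch and are not taken from this PR or confirmed against the model; the helper `overall_complexity` is likewise hypothetical.

```python
# Placeholder dimension names and weights (assumptions for illustration only).
WEIGHTS = {
    "creativity": 0.35,
    "reasoning": 0.25,
    "constraints": 0.15,
    "domain_knowledge": 0.15,
    "contextual_knowledge": 0.05,
    "few_shots": 0.05,
}


def overall_complexity(scores: dict, weights: dict = WEIGHTS) -> float:
    """Combine per-dimension scores into one number via a weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Normalize by the weight sum so the result stays in the same range as the inputs.
    return sum(weights[name] * scores[name] for name in weights) / total_weight
```

With per-dimension scores in [0, 1], the combined score also falls in [0, 1], and dimensions with larger weights move it more.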
examples/classifiers/README.md
1 addition & 0 deletions
@@ -9,6 +9,7 @@ The Python scripts in this directory demonstrate how to run classification on yo
 - Instruction-Data-Guard Model
 - FineWeb Educational Content Classifier
 - Content Type Classifier
+- Prompt Task/Complexity Classifier
 
For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html).