Prompt Task/Complexity Classifier (#364)

sarahyurick · VibhuJawa · web-flow · commit 27dd211c5468 · 2024-12-18T12:52:50.000-08:00
* initial commit

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* update readmes

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* edit readmes

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* working scripts

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* run isort

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* modify base

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* change to output_path

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

* Apply suggestions from code review

Co-authored-by: Vibhu Jawa <vibhujawa@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* change to notimplementederror

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Co-authored-by: Vibhu Jawa <vibhujawa@gmail.com>
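
Note (illustrative sketch, not part of this change): the documentation added in this commit says that complexity is evaluated across 6 dimensions and ensembled into an overall complexity score. One plausible reading of "ensembled" is a weighted sum, sketched below. The dimension names, the weights, and the `complexity_score` helper are all assumptions made for illustration; this commit does not state the actual weighting, so the Hugging Face model card remains the authority on the real formula.

```python
from typing import Dict

# Hypothetical weights for six complexity dimensions. Both the names and
# the values are assumptions for illustration; they are not taken from
# this commit.
ASSUMED_WEIGHTS: Dict[str, float] = {
    "creativity_scope": 0.35,
    "reasoning": 0.25,
    "constraint_ct": 0.15,
    "domain_knowledge": 0.15,
    "contextual_knowledge": 0.05,
    "number_of_few_shots": 0.05,
}


def complexity_score(dimensions: Dict[str, float]) -> float:
    """Combine per-dimension scores into a single weighted sum."""
    missing = ASSUMED_WEIGHTS.keys() - dimensions.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    # Extra keys in `dimensions` are ignored; only the six weighted
    # dimensions contribute to the score.
    return sum(
        weight * dimensions[name] for name, weight in ASSUMED_WEIGHTS.items()
    )


if __name__ == "__main__":
    # Made-up per-dimension scores for a single prompt.
    example = {
        "creativity_scope": 0.8,
        "reasoning": 0.6,
        "constraint_ct": 0.2,
        "domain_knowledge": 0.5,
        "contextual_knowledge": 0.1,
        "number_of_few_shots": 0.0,
    }
    print(f"overall complexity: {complexity_score(example):.3f}")
```

Because the assumed weights sum to 1, per-dimension scores in [0, 1] yield an overall score in the same range, which keeps the result comparable across prompts.
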
diff --git a/README.md b/README.md
@@ -28,7 +28,7 @@ All of our text pipelines have great multilingual support.
 - [Heuristic Filtering](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html)
 - Classifier Filtering
   - [fastText](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html)
-  - GPU-Accelerated models: [Domain (English and multilingual), Quality, Safety, Educational Content, and Content Type Classification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html)
+  - GPU-Accelerated models: [Domain (English and multilingual), Quality, Safety, Educational Content, Content Type, and Prompt Task/Complexity Classification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html)
 - **GPU-Accelerated Deduplication**
   - [Exact Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html)
   - [Fuzzy Deduplication](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html) via MinHash Locality Sensitive Hashing
diff --git a/docs/user-guide/api/classifiers.rst b/docs/user-guide/api/classifiers.rst
@@ -22,3 +22,6 @@ Classifiers
 
 .. autoclass:: nemo_curator.classifiers.ContentTypeClassifier
     :members:
+
+.. autoclass:: nemo_curator.classifiers.PromptTaskComplexityClassifier
+    :members:
diff --git a/docs/user-guide/cpuvsgpu.rst b/docs/user-guide/cpuvsgpu.rst
@@ -72,6 +72,7 @@ The following NeMo Curator modules are GPU based.
   * AEGIS and Instruction-Data-Guard Safety Models
   * FineWeb Educational Content Classification
   * Content Type Classification
+  * Prompt Task/Complexity Classification
 
 GPU modules store the ``DocumentDataset`` using a ``cudf`` backend instead of a ``pandas`` one.
 To read a dataset into GPU memory, one could use the following function call.
diff --git a/docs/user-guide/distributeddataclassification.rst b/docs/user-guide/distributeddataclassification.rst
@@ -15,7 +15,7 @@ NeMo Curator provides a module to help users run inference with pre-trained mode
 This is achieved by chunking the datasets across multiple computing nodes, each equipped with multiple GPUs, to accelerate the classification task in a distributed manner.
 Since the classification of a single text document is independent of other documents within the dataset, we can distribute the workload across multiple nodes and GPUs to perform parallel processing.
 
-Domain (English and multilingual), quality, content safety, educational content, and content type models are tasks we include as examples within our module.
+Domain (English and multilingual), quality, content safety, educational content, content type, and prompt task/complexity models are tasks we include as examples within our module.
 
 Here, we summarize why each is useful for training an LLM:
 
@@ -33,6 +33,8 @@ Here, we summarize why each is useful for training an LLM:
 
 - The **Content Type Classifier** is designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types.
 
+- The **Prompt Task/Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions.
+
 -----------------------------------------
 Usage
 -----------------------------------------
@@ -256,6 +258,27 @@ Let's see how ``ContentTypeClassifier`` works in a small excerpt taken from ``ex
 In this example, the content type classifier is obtained directly from `Hugging Face <https://huggingface.co/nvidia/content-type-classifier-deberta>`_.
 It filters the input dataset to include only documents classified as "Blogs" or "News".
 
+Prompt Task/Complexity Classifier
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The Prompt Task/Complexity Classifier is a multi-headed model which classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions and ensembled to create an overall complexity score.
+
+Here's an example of how to use the ``PromptTaskComplexityClassifier``:
+
+.. code-block:: python
+
+    from nemo_curator.classifiers import PromptTaskComplexityClassifier
+
+    files = get_all_files_paths_under("my_dataset/")
+    input_dataset = DocumentDataset.read_json(files, backend="cudf")
+
+    classifier = PromptTaskComplexityClassifier()
+    result_dataset = classifier(dataset=input_dataset)
+
+    result_dataset.to_json("labeled_dataset/")
+
+The prompt task and complexity classifier is obtained from `Hugging Face <https://huggingface.co/nvidia/prompt-task-and-complexity-classifier>`_.
+
 -----------------------------------------
 CrossFit Integration
 -----------------------------------------
diff --git a/examples/classifiers/README.md b/examples/classifiers/README.md
@@ -9,6 +9,7 @@ The Python scripts in this directory demonstrate how to run classification on yo
 - Instruction-Data-Guard Model
 - FineWeb Educational Content Classifier
 - Content Type Classifier
+- Prompt Task/Complexity Classifier
 
 For more information about these classifiers, please see NeMo Curator's [Distributed Data Classification documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html).
 
diff --git a/examples/classifiers/prompt_task_complexity_example.py b/examples/classifiers/prompt_task_complexity_example.py
@@ -0,0 +1,65 @@
+# Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import time
+
+from nemo_curator.classifiers import PromptTaskComplexityClassifier
+from nemo_curator.datasets import DocumentDataset
+from nemo_curator.utils.distributed_utils import get_client
+from nemo_curator.utils.script_utils import ArgumentHelper
+
+
+def main(args):
+    global_st = time.time()
+
+    # Input can be a string or list
+    input_file_path = "/path/to/data"
+    output_file_path = "./"
+
+    client_args = ArgumentHelper.parse_client_args(args)
+    client_args["cluster_type"] = "gpu"
+    client = get_client(**client_args)
+
+    input_dataset = DocumentDataset.read_json(
+        input_file_path, backend="cudf", add_filename=True
+    )
+
+    prompt_task_complexity_classifier = PromptTaskComplexityClassifier()
+    result_dataset = prompt_task_complexity_classifier(dataset=input_dataset)
+
+    result_dataset.to_json(output_path=output_file_path, write_to_filename=True)
+
+    global_et = time.time()
+    print(
+        f"Total time taken for prompt task and complexity classifier inference: {global_et-global_st} s",
+        flush=True,
+    )
+
+    client.close()
+
+
+def attach_args(
+    parser=argparse.ArgumentParser(
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter
+    ),
+):
+    argumentHelper = ArgumentHelper(parser)
+    argumentHelper.add_distributed_classifier_cluster_args()
+
+    return argumentHelper.parser
+
+
+if __name__ == "__main__":
+    main(attach_args().parse_args())
diff --git a/nemo_curator/classifiers/__init__.py b/nemo_curator/classifiers/__init__.py
@@ -19,6 +19,7 @@
 from .content_type import ContentTypeClassifier
 from .domain import DomainClassifier, MultilingualDomainClassifier
 from .fineweb_edu import FineWebEduClassifier
+from .prompt_task_complexity import PromptTaskComplexityClassifier
 from .quality import QualityClassifier
 
 __all__ = [
@@ -29,4 +30,5 @@
     "InstructionDataGuardClassifier",
     "FineWebEduClassifier",
     "ContentTypeClassifier",
+    "PromptTaskComplexityClassifier",
 ]
diff --git a/nemo_curator/classifiers/base.py b/nemo_curator/classifiers/base.py
@@ -16,7 +16,7 @@
 
 os.environ["RAPIDS_NO_INITIALIZE"] = "1"
 from abc import ABC, abstractmethod
-from typing import List, Optional
+from typing import List, Optional, Union
 
 import torch
 import torch.nn as nn
@@ -37,8 +37,8 @@ def __init__(
         labels: Optional[List[str]],
         filter_by: Optional[List[str]],
         batch_size: int,
-        out_dim: int,
-        pred_column: str,
+        out_dim: Optional[int],
+        pred_column: Union[str, List[str]],
         max_chars: int,
         device_type: str,
         autocast: bool,
diff --git a/nemo_curator/classifiers/prompt_task_complexity.py b/nemo_curator/classifiers/prompt_task_complexity.py
diff --git a/nemo_curator/scripts/classifiers/README.md b/nemo_curator/scripts/classifiers/README.md
diff --git a/nemo_curator/scripts/classifiers/prompt_task_complexity_classifier_inference.py b/nemo_curator/scripts/classifiers/prompt_task_complexity_classifier_inference.py
diff --git a/pyproject.toml b/pyproject.toml