
Commit 57f9666

Download Hugging Face datasets for text classification tutorials (#1288)
* use real dataset for domain tutorial
* ruff
* update content type classifier notebook
* update fineweb notebook
* update fineweb mixtral notebook
* update fineweb nemotron notebook
* update prompt task complexity tutorial
* update multilingual domain tutorial
* update content type with text_field=text_field
* update domain with text_field=text_field
* update fineweb with text_field=text_field
* update fineweb mixtral with text_field=text_field
* update fineweb nemotron with text_field=text_field
* update quality notebook

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
1 parent c60fa3b commit 57f9666

10 files changed: +1610 additions, −622 deletions

tutorials/text/distributed-data-classification/aegis-classification.ipynb

Lines changed: 3 additions & 3 deletions
@@ -263,7 +263,7 @@
     "\n",
     "```python\n",
     "pipeline = Pipeline(name=\"full_pipeline\")\n",
-    "pipeline.add_stage(read_stage)        # reader (JSONL/S3/etc.)\n",
+    "pipeline.add_stage(read_stage)        # reader (JSONL/Parquet)\n",
     "pipeline.add_stage(lang_id_stage)     # optional: language filter\n",
     "pipeline.add_stage(classifier_stage)  # classifier\n",
     "pipeline.add_stage(write_stage)       # writer (JSONL/Parquet)\n",
@@ -288,7 +288,7 @@
     "cell_type": "markdown",
     "metadata": {},
     "source": [
-     "Since the pipeline ran to completion and the result was written to a JSONL file, we can shut down the Ray cluster with:"
+     "Since the pipeline ran to completion and the result was written to disk, we can shut down the Ray cluster with:"
     ]
    },
    {
@@ -388,7 +388,7 @@
    ],
    "source": [
     "# For simplicity, we take the first written file from the writer stage\n",
-    "# In real pipelines, the writer may return multiple files (shards) or objects\n",
+    "# In real pipelines, adjust as needed\n",
     "result_file = result[0].data[0]\n",
     "\n",
     "result_df = pd.read_json(result_file, lines=True)\n",

tutorials/text/distributed-data-classification/content-type-classification.ipynb
Lines changed: 205 additions & 91 deletions (large diff not rendered)

tutorials/text/distributed-data-classification/domain-classification.ipynb
Lines changed: 208 additions & 65 deletions (large diff not rendered)

tutorials/text/distributed-data-classification/fineweb-edu-classification.ipynb
Lines changed: 188 additions & 58 deletions (large diff not rendered)

tutorials/text/distributed-data-classification/fineweb-mixtral-edu-classification.ipynb
Lines changed: 196 additions & 64 deletions (large diff not rendered)

tutorials/text/distributed-data-classification/fineweb-nemotron-edu-classification.ipynb
Lines changed: 193 additions & 62 deletions (large diff not rendered)

tutorials/text/distributed-data-classification/instruction-data-guard-classification.ipynb

Lines changed: 3 additions & 3 deletions
@@ -249,7 +249,7 @@
     "\n",
     "```python\n",
     "pipeline = Pipeline(name=\"full_pipeline\")\n",
-    "pipeline.add_stage(read_stage)        # reader (JSONL/S3/etc.)\n",
+    "pipeline.add_stage(read_stage)        # reader (JSONL/Parquet)\n",
     "pipeline.add_stage(lang_id_stage)     # optional: language filter\n",
     "pipeline.add_stage(classifier_stage)  # classifier\n",
     "pipeline.add_stage(write_stage)       # writer (JSONL/Parquet)\n",
@@ -274,7 +274,7 @@
     "cell_type": "markdown",
     "metadata": {},
     "source": [
-     "Since the pipeline ran to completion and the result was written to a JSONL file, we can shut down the Ray cluster with:"
+     "Since the pipeline ran to completion and the result was written to disk, we can shut down the Ray cluster with:"
     ]
    },
    {
@@ -355,7 +355,7 @@
    ],
    "source": [
     "# For simplicity, we take the first written file from the writer stage\n",
-    "# In real pipelines, the writer may return multiple files (shards) or objects\n",
+    "# In real pipelines, adjust as needed\n",
     "result_file = result[0].data[0]\n",
     "\n",
     "result_df = pd.read_json(result_file, lines=True)\n",

tutorials/text/distributed-data-classification/multilingual-domain-classification.ipynb
Lines changed: 203 additions & 75 deletions (large diff not rendered)

tutorials/text/distributed-data-classification/prompt-task-complexity-classification.ipynb
Lines changed: 212 additions & 107 deletions (large diff not rendered)
