Skip to content

Conversation

@sarahyurick
Copy link
Contributor

Closes #1095.

@sarahyurick sarahyurick added the documentation Improvements or additions to documentation label Oct 20, 2025
@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 20, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@sarahyurick sarahyurick marked this pull request as ready for review October 20, 2025 23:07
@sarahyurick sarahyurick requested a review from lbliii October 20, 2025 23:07
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

15 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@sarahyurick sarahyurick requested a review from arhamm1 October 21, 2025 16:59
Copy link
Contributor

@lbliii lbliii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Minor nit: remove "please" throughout. Typically we reserve please for scenarios where there's a known bug and a very inconvenient workaround they must use.

@sarahyurick
Copy link
Contributor Author

@lbliii thanks, I will modify.

Copy link
Contributor Author

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding suggestions for nits...

" - CUDA 12.x\n",
"\n",
"Before running this notebook, please see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies."
"Before running this notebook, please see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Before running this notebook, please see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",

"Before running this notebook, please see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies."
"Before running this notebook, please see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
"\n",
"For more information about the classifiers, please refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page. Please refer to the [Classifier-Based Filtering](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/classifier.html) page for more information about quality classification in NeMo Curator."
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"For more information about the classifiers, please refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page. Please refer to the [Classifier-Based Filtering](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/classifier.html) page for more information about quality classification in NeMo Curator."
"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page. Refer to the [Classifier-Based Filtering](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/classifier.html) page for more information about quality classification in NeMo Curator."

"```"
"```\n",
"\n",
"Please refer to the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.classifiers.quality.html#stages.text.classifiers.quality.QualityClassifier) for more information about the `QualityClassifier` class."
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Please refer to the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.classifiers.quality.html#stages.text.classifiers.quality.QualityClassifier) for more information about the `QualityClassifier` class."
"See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.classifiers.quality.html#stages.text.classifiers.quality.QualityClassifier) for more information about the `QualityClassifier` class."


This Jupyter notebook tutorial demonstrates how to use NeMo Curator to download text data from [Common Crawl](https://commoncrawl.org/), [Wikipedia](https://dumps.wikimedia.org/backup-index.html), and [ArXiv](https://info.arxiv.org/help/bulk_data_s3.html), respectively.

For more information about downloading and extracting data with NeMo Curator, please refer to the [Download Data](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html) and [Data Acquisition Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/text/data-acquisition-concepts.html) documentation pages.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For more information about downloading and extracting data with NeMo Curator, please refer to the [Download Data](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html) and [Data Acquisition Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/text/data-acquisition-concepts.html) documentation pages.
For more information about downloading and extracting data with NeMo Curator, refer to the [Download Data](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html) and [Data Acquisition Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/text/data-acquisition-concepts.html) documentation pages.

"NeMo Curator has pre-built download and extract pipelines for **Common Crawl**, **Wikipedia** and **ArXiv** datasets. In this tutorial, we will introduce how to execute these pipelines."
"NeMo Curator has pre-built download and extract pipelines for **Common Crawl**, **Wikipedia** and **ArXiv** datasets. In this tutorial, we will introduce how to execute these pipelines.\n",
"\n",
"For more information about downloading and extracting data with NeMo Curator, please refer to the [Download Data](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html) and [Data Acquisition Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/text/data-acquisition-concepts.html) documentation pages."
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"For more information about downloading and extracting data with NeMo Curator, please refer to the [Download Data](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html) and [Data Acquisition Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/text/data-acquisition-concepts.html) documentation pages."
"For more information about downloading and extracting data with NeMo Curator, refer to the [Download Data](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html) and [Data Acquisition Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/text/data-acquisition-concepts.html) documentation pages."

Copy link
Contributor Author

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding suggestions for nits...

@sarahyurick sarahyurick merged commit 6697e59 into NVIDIA-NeMo:main Oct 22, 2025
11 checks passed
lbliii pushed a commit to lbliii/NeMo-Curator that referenced this pull request Oct 22, 2025
* link to docs in classifier tutorials

Signed-off-by: Sarah Yurick <[email protected]>

* add links to semdedup tutorials

Signed-off-by: Sarah Yurick <[email protected]>

* download and extract links

Signed-off-by: Sarah Yurick <[email protected]>

* Apply suggestions from code review

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
jnke2016 pushed a commit to jnke2016/Curator that referenced this pull request Nov 12, 2025
* link to docs in classifier tutorials

Signed-off-by: Sarah Yurick <[email protected]>

* add links to semdedup tutorials

Signed-off-by: Sarah Yurick <[email protected]>

* download and extract links

Signed-off-by: Sarah Yurick <[email protected]>

* Apply suggestions from code review

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick deleted the text_tutorial_links branch February 9, 2026 18:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add more documentation links to text tutorials

3 participants