Commit 70278d1

Add improved cleaning methods from Nemotron-CC (#517)
* Add improved cleaning features
* Fix cleaning tests
* Update documentation and CLI scripts
* Address Sarah and Lawrence's reviews

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
1 parent ca30808 commit 70278d1

File tree

13 files changed: +355 -126 lines changed


README.md

Lines changed: 2 additions & 2 deletions
@@ -23,8 +23,8 @@ All of our text pipelines have great multilingual support.
 - [Download and Extraction](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/download.html)
   - Default implementations for Common Crawl, Wikipedia, and ArXiv sources
   - Easily customize and extend to other sources
-- [Language Identification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentificationunicodeformatting.html)
-- [Unicode Reformatting](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentificationunicodeformatting.html)
+- [Language Identification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentification.html)
+- [Text Cleaning](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/textcleaning.html)
 - [Heuristic Filtering](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html)
 - Classifier Filtering
   - [fastText](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html)

docs/user-guide/index.rst

Lines changed: 5 additions & 2 deletions
@@ -16,8 +16,11 @@ Text Curation
 :ref:`Document Filtering <data-curator-qualityfiltering>`
    This section describes how to use the 30+ heuristic and classifier filters available within the NeMo Curator and implement custom filters to apply to the documents within the corpora.

-:ref:`Language Identification and Unicode Fixing <data-curator-languageidentification>`
-   Large, unlabeled text corpora often contain a variety of languages. The NeMo Curator provides utilities to identify languages and fix improperly decoded Unicode characters.
+:ref:`Language Identification <data-curator-languageidentification>`
+   Large, unlabeled text corpora often contain a variety of languages. NeMo Curator provides utilities to identify languages.
+
+:ref:`Text Cleaning <data-curator-text-cleaning>`
+   Many parts of the Internet contain malformed or poorly formatted text. NeMo Curator can fix many of these issues.

 :ref:`GPU Accelerated Exact and Fuzzy Deduplication <data-curator-gpu-deduplication>`
    Both exact and fuzzy deduplication functionalities are supported in NeMo Curator and accelerated using RAPIDS cuDF.

docs/user-guide/languageidentificationunicodeformatting.rst renamed to docs/user-guide/languageidentification.rst

Lines changed: 3 additions & 37 deletions
@@ -11,40 +11,17 @@ Background
 Large unlabeled text corpora often contain a variety of languages.
 However, data curation usually includes steps that are language specific (e.g. using language-tuned heuristics for quality filtering)
 and many curators are only interested in curating a monolingual dataset.
-Datasets also may have improperly decoded unicode characters (e.g. "The Mona Lisa doesn't have eyebrows." decoding as "The Mona Lisa doesnâ€™t have eyebrows.").

-NeMo Curator provides utilities to identify languages and fix improperly decoded unicode characters.
-The language identification is performed using `fastText <https://fasttext.cc/docs/en/language-identification.html>`_ and unicode fixing is performed using `ftfy <https://ftfy.readthedocs.io/en/latest/>`_.
+NeMo Curator provides utilities to identify languages using `fastText <https://fasttext.cc/docs/en/language-identification.html>`_.
 Even though a preliminary language identification may have been performed on the unextracted text (as is the case in our Common Crawl pipeline
 using pyCLD2), `fastText <https://fasttext.cc/docs/en/language-identification.html>`_ is more accurate so it can be used for a second pass.

 -----------------------------------------
 Usage
 -----------------------------------------

-We provide an example of how to use the language identification and unicode reformatting utility at ``examples/identify_languages_and_fix_unicode.py``.
+We provide an example of how to use the language identification utility at ``examples/identify_languages.py``.
 At a high level, the module first identifies the languages of the documents and removes any documents for which it has high uncertainty about the language.
-Notably, this line uses one of the ``DocumentModifiers`` that NeMo Curator provides:
-
-.. code-block:: python
-
-    cleaner = nc.Modify(UnicodeReformatter())
-    cleaned_data = cleaner(lang_data)
-
-``DocumentModifier``s like ``UnicodeReformatter`` are very similar to ``DocumentFilter``s.
-They implement a single ``modify_document`` function that takes in a document and outputs a modified document.
-Here is the implementation of the ``UnicodeReformatter`` modifier:
-
-.. code-block:: python
-
-    class UnicodeReformatter(DocumentModifier):
-        def __init__(self):
-            super().__init__()
-
-        def modify_document(self, text: str) -> str:
-            return ftfy.fix_text(text)
-
-Also like the ``DocumentFilter`` functions, ``modify_document`` can be annotated with ``batched`` to take in a pandas series of documents instead of a single document.

 -----------------------------------------
 Related Scripts
@@ -79,15 +56,4 @@ within that file. Below is an example run command for :code:`separate_by_metadat
    --output-metadata-distribution=./data/lang_distro.json

 After running this module, the output directory will consist of one directory per language present within the corpus and all documents
-within those directories will contain text that originates from the same language. Finally, the text within a specific language can have
-its unicode fixed using the :code:`text_cleaning` module
-
-.. code-block:: bash
-
-    text_cleaning \
-        --input-data-dir=<Output directory containing sub-directories>/EN \
-        --output-clean-dir=<Output directory to which cleaned english documents will be written>
-
-The above :code:`text_cleaning` module uses the heuristics defined within the :code:`ftfy` package that is commonly used for fixing
-improperly decoded unicode.
+within those directories will contain text that originates from the same language.
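Conceptually, the grouping that ``separate_by_metadata`` performs can be sketched in a few lines of plain Python. This is a simplified stand-in, not the real utility (which operates on files on disk and writes one output directory per metadata value); the records and field names below are illustrative:

```python
import json
from collections import defaultdict

# Toy records standing in for a JSONL corpus with a "language" field.
records = [
    {"text": "Hello world", "language": "EN"},
    {"text": "Bonjour le monde", "language": "FR"},
    {"text": "Good morning", "language": "EN"},
]

# Bucket documents by the metadata field (one bucket per language).
by_language = defaultdict(list)
for record in records:
    by_language[record["language"]].append(record)

# Analogue of --output-metadata-distribution: document counts per language.
distribution = {lang: len(docs) for lang, docs in by_language.items()}
print(json.dumps(distribution, sort_keys=True))  # {"EN": 2, "FR": 1}
```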

docs/user-guide/text-cleaning.rst

Lines changed: 98 additions & 0 deletions
+.. _data-curator-text-cleaning:
+
+=========================
+Text Cleaning
+=========================
+
+--------------------
+Overview
+--------------------
+Use NeMo Curator's text cleaning modules to remove undesirable text such as improperly decoded unicode characters, inconsistent line spacing, or excessive URLs from documents being pre-processed for a dataset.
+
+For example, the input sentence `"The Mona Lisa doesn't have eyebrows."` from a given document may not have included a properly encoded apostrophe (`'`), resulting in the sentence decoding as `"The Mona Lisa doesnâ€™t have eyebrows."` NeMo Curator enables you to easily run this document through the default `UnicodeReformatter()` module to detect and remove the unwanted text, or you can define your own custom unicode text cleaner tailored to your needs.
+
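The broken apostrophe above is classic mojibake: UTF-8 bytes reinterpreted under a legacy codec such as cp1252. A stdlib-only sketch of how the error arises (``ftfy``, which ``UnicodeReformatter`` wraps, reverses exactly this class of corruption):

```python
# Encode the correct text as UTF-8, then misread the bytes as cp1252.
# U+2019 (right single quote) -> bytes E2 80 99 -> the three cp1252
# characters "â", "€", "™".
good = "The Mona Lisa doesn\u2019t have eyebrows."
mojibake = good.encode("utf-8").decode("cp1252")
print(mojibake)  # The Mona Lisa doesnâ€™t have eyebrows.
```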
+--------------------
+Use Cases
+--------------------
+* Fix improperly decoded Unicode characters from webpages.
+* Standardize document layout by removing excessive newlines.
+* Remove URLs in documents.
+
+--------------------
+Modules
+--------------------
+NeMo Curator provides the following modules for cleaning text:
+
+- ``UnicodeReformatter()``: Uses `ftfy <https://ftfy.readthedocs.io/en/latest/>`_ to fix broken Unicode characters. Modifies the "text" field of the dataset by default.
+- ``NewlineNormalizer()``: Uses regex to replace 3 or more consecutive newline characters in each document with only 2 newline characters.
+- ``UrlRemover()``: Uses regex to remove all URLs in each document.
+
+You can use these modules individually or sequentially in a cleaning pipeline.
+
+Consider the following example, which loads a dataset (``books.jsonl``), steps through each module in a cleaning pipeline, and outputs the processed dataset as ``cleaned_books.jsonl``:
+
+.. code-block:: python
+
+    from nemo_curator import Sequential, Modify, get_client
+    from nemo_curator.datasets import DocumentDataset
+    from nemo_curator.modifiers import UnicodeReformatter, UrlRemover, NewlineNormalizer
+
+    def main():
+        client = get_client(cluster_type="cpu")
+
+        dataset = DocumentDataset.read_json("books.jsonl")
+        cleaning_pipeline = Sequential([
+            Modify(UnicodeReformatter()),
+            Modify(NewlineNormalizer()),
+            Modify(UrlRemover()),
+        ])
+
+        cleaned_dataset = cleaning_pipeline(dataset)
+
+        cleaned_dataset.to_json("cleaned_books.jsonl")
+
+    if __name__ == "__main__":
+        main()
+
+You can also perform text cleaning operations using the CLI by running the ``text_cleaning`` command:
+
+.. code-block:: bash
+
+    text_cleaning \
+        --input-data-dir=/path/to/input/ \
+        --output-clean-dir=/path/to/output/ \
+        --normalize-newlines \
+        --remove-urls
+
+By default, the CLI only performs unicode reformatting. Adding the ``--normalize-newlines`` and ``--remove-urls`` options enables the other text cleaning operations.
+
+------------------------
+Custom Text Cleaner
+------------------------
+It's easy to write your own custom text cleaner. The implementation of ``UnicodeReformatter`` can be used as an example.
+
+.. code-block:: python
+
+    import ftfy
+
+    from nemo_curator.modifiers import DocumentModifier
+
+
+    class UnicodeReformatter(DocumentModifier):
+        def __init__(self):
+            super().__init__()
+
+        def modify_document(self, text: str) -> str:
+            return ftfy.fix_text(text)
+
+Simply define a new class that inherits from ``DocumentModifier`` and define the constructor and ``modify_document`` method.
+Also, like the ``DocumentFilter`` class, ``modify_document`` can be annotated with ``batched`` to take in a pandas series of documents instead of a single document.
+See the :ref:`document filtering page <data-curator-qualityfiltering>` for more information.
+
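The batched pattern mentioned above can be sketched without NeMo Curator installed. The stub base class and the whitespace modifier below are hypothetical stand-ins for illustration only; the real batched variant receives a pandas ``Series`` rather than a plain list:

```python
import re

# Minimal stand-in for nemo_curator.modifiers.DocumentModifier.
class DocumentModifier:
    def modify_document(self, text):
        raise NotImplementedError

class BatchedWhitespaceCollapser(DocumentModifier):
    """Hypothetical batched modifier: collapses runs of spaces/tabs
    across a whole batch of documents in one call."""

    _RUNS = re.compile(r"[ \t]{2,}")

    def modify_document(self, texts):
        # Batched signature: a sequence of documents in, a sequence out.
        return [self._RUNS.sub(" ", t) for t in texts]

batch = ["hello   world", "tabs\t\there"]
print(BatchedWhitespaceCollapser().modify_document(batch))  # ['hello world', 'tabs here']
```

Processing a whole batch per call amortizes per-document overhead, which is why the batched form exists.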
+---------------------------
+Additional Resources
+---------------------------
+* `Single GPU Tutorial <https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb>`_
+* `ftfy <https://ftfy.readthedocs.io/en/latest/>`_
+* `Refined Web Paper <https://arxiv.org/abs/2306.01116>`_
+* `Nemotron-CC Paper <https://arxiv.org/abs/2412.02595>`_

docs/user-guide/text-curation.rst

Lines changed: 7 additions & 3 deletions
@@ -13,8 +13,11 @@ Text Curation
 :ref:`Document Filtering <data-curator-qualityfiltering>`
    This section describes how to use the 30+ heuristic and classifier filters available within the NeMo Curator and implement custom filters to apply to the documents within the corpora.

-:ref:`Language Identification and Unicode Fixing <data-curator-languageidentification>`
-   Large, unlabeled text corpora often contain a variety of languages. The NeMo Curator provides utilities to identify languages and fix improperly decoded Unicode characters.
+:ref:`Language Identification <data-curator-languageidentification>`
+   Large, unlabeled text corpora often contain a variety of languages. NeMo Curator provides utilities to identify languages.
+
+:ref:`Text Cleaning <data-curator-text-cleaning>`
+   Many parts of the Internet contain malformed or poorly formatted text. NeMo Curator can fix many of these issues.

 :ref:`GPU Accelerated Exact and Fuzzy Deduplication <data-curator-gpu-deduplication>`
    Both exact and fuzzy deduplication functionalities are supported in NeMo Curator and accelerated using RAPIDS cuDF.
@@ -43,7 +46,8 @@ Text Curation
    documentdataset.rst
    cpuvsgpu.rst
    qualityfiltering.rst
-   languageidentificationunicodeformatting.rst
+   languageidentification.rst
+   textcleaning.rst
    gpudeduplication.rst
    semdedup.rst
    syntheticdata.rst

examples/README.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ These include:
 | exact_deduplication.py | Use the `ExactDuplicates` class to perform exact deduplication on text data. |
 | find_pii_and_deidentify.py | Use the `PiiModifier` and `Modify` classes to remove personally identifiable information from text data. |
 | fuzzy_deduplication.py | Use the `FuzzyDuplicatesConfig` and `FuzzyDuplicates` classes to perform fuzzy deduplication on text data. |
-| identify_languages_and_fix_unicode.py | Use `FastTextLangId` to filter data by language, then fix the unicode in it. |
+| identify_languages.py | Use `FastTextLangId` to filter data by language. |
 | raw_download_common_crawl.py | Download the raw compressed WARC files from Common Crawl without extracting them. |
 | semdedup_example.py | Use the `SemDedup` class to perform semantic deduplication on text data. |
 | task_decontamination.py | Remove segments of downstream evaluation tasks from a dataset. |

examples/identify_languages_and_fix_unicode.py renamed to examples/identify_languages.py

Lines changed: 2 additions & 17 deletions
@@ -1,4 +1,4 @@
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -13,13 +13,11 @@
 # limitations under the License.

 import argparse
-import os

 import nemo_curator as nc
 from nemo_curator.datasets import DocumentDataset
 from nemo_curator.filters import FastTextLangId
-from nemo_curator.modifiers import UnicodeReformatter
-from nemo_curator.utils.distributed_utils import get_client, read_data, write_to_disk
+from nemo_curator.utils.distributed_utils import get_client, read_data
 from nemo_curator.utils.file_utils import (
     get_all_files_paths_under,
     separate_by_metadata,
@@ -45,7 +43,6 @@ def main(args):
     # and see a list of supported languages here:
     # https://fasttext.cc/docs/en/language-identification.html
     model_path = "/path/to/model.bin"
-    target_language = "EN"
     language_field = "language"

     # Prepare samples for the classifier
@@ -70,18 +67,6 @@ def main(args):
         metadata_field=language_field,
     ).compute()

-    # Read the language specific data and fix the unicode in it
-    lang_data_path = os.path.join(language_separated_output_path, target_language)
-    if not os.path.exists(lang_data_path):
-        raise RuntimeError(f"Dataset did not have language: {target_language}")
-    lang_data = load_dataset(lang_data_path)
-
-    cleaner = nc.Modify(UnicodeReformatter())
-    cleaned_data = cleaner(lang_data)
-
-    # Write the cleaned_data
-    write_to_disk(cleaned_data.df, cleaned_data_output_path, write_to_filename=True)


 def attach_args(
     parser=argparse.ArgumentParser(

nemo_curator/modifiers/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -15,13 +15,17 @@
 from .c4 import BoilerPlateStringModifier
 from .doc_modifier import DocumentModifier
 from .fasttext import FastTextLabelModifier
+from .newline_normalizer import NewlineNormalizer
 from .pii_modifier import PiiModifier
 from .unicode_reformatter import UnicodeReformatter
+from .url_remover import UrlRemover

 __all__ = [
     "DocumentModifier",
     "BoilerPlateStringModifier",
     "FastTextLabelModifier",
     "UnicodeReformatter",
     "PiiModifier",
+    "NewlineNormalizer",
+    "UrlRemover",
 ]
nemo_curator/modifiers/newline_normalizer.py

Lines changed: 33 additions & 0 deletions

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import re
+
+from nemo_curator.modifiers import DocumentModifier
+
+THREE_OR_MORE_NEWLINES_REGEX = re.compile(r"(\n){3,}")
+THREE_OR_MORE_WINDOWS_NEWLINES_REGEX = re.compile(r"(\r\n){3,}")
+
+
+class NewlineNormalizer(DocumentModifier):
+    """
+    Replaces 3 or more consecutive newline characters with only 2 newline characters.
+    """
+
+    def __init__(self):
+        super().__init__()
+
+    def modify_document(self, text):
+        text = THREE_OR_MORE_NEWLINES_REGEX.sub("\n\n", text)
+        text = THREE_OR_MORE_WINDOWS_NEWLINES_REGEX.sub("\r\n\r\n", text)
+        return text
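A standalone check of the two regexes the new module compiles (reproduced here with the same patterns so the snippet runs without nemo_curator installed):

```python
import re

# Same patterns as NewlineNormalizer above.
THREE_OR_MORE_NEWLINES_REGEX = re.compile(r"(\n){3,}")
THREE_OR_MORE_WINDOWS_NEWLINES_REGEX = re.compile(r"(\r\n){3,}")

def normalize_newlines(text):
    # Collapse runs of 3+ Unix newlines, then 3+ Windows newlines, to exactly 2.
    text = THREE_OR_MORE_NEWLINES_REGEX.sub("\n\n", text)
    text = THREE_OR_MORE_WINDOWS_NEWLINES_REGEX.sub("\r\n\r\n", text)
    return text

print(repr(normalize_newlines("a\n\n\n\n\nb")))   # 'a\n\nb'
print(repr(normalize_newlines("a\r\n\r\n\r\nb")))  # 'a\r\n\r\nb'
```

Note that the two patterns are applied independently, so a run mixing bare ``\n`` with ``\r\n`` pairs is not collapsed by either one.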
nemo_curator/modifiers/url_remover.py

Lines changed: 30 additions & 0 deletions

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import re
+
+from nemo_curator.modifiers import DocumentModifier
+
+URL_REGEX = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)
+
+
+class UrlRemover(DocumentModifier):
+    """
+    Removes all URLs in a document.
+    """
+
+    def __init__(self):
+        super().__init__()
+
+    def modify_document(self, text):
+        return URL_REGEX.sub("", text)
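The URL pattern can likewise be exercised standalone (same regex as in the module above):

```python
import re

# Same pattern as UrlRemover above: http(s) URLs or bare www. hosts,
# each matched greedily up to the next whitespace.
URL_REGEX = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)

text = "Docs at https://example.com/guide and WWW.EXAMPLE.ORG mirror."
print(URL_REGEX.sub("", text))  # Docs at  and  mirror.
```

Because ``\S+`` runs to the next whitespace, punctuation attached to a URL (e.g. a trailing period) is removed with it, and the substitution leaves the surrounding spaces behind, as the doubled spaces in the output show.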
