
Commit 9a2bd42

Add support for Chinese and Japanese stop words (#507)
* add zh and ja stopwords
* run isort
* edit doc
* indent?
* rst file
* rst?
* more indents?
* fix todos and add pytests
* run black
* add Ryan's suggestions
* run isort
* edit rst file
* add trafilatura support

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
1 parent 1f8e5ad commit 9a2bd42

6 files changed: +1300 -73 lines changed


docs/user-guide/download.rst

Lines changed: 24 additions & 4 deletions
@@ -80,7 +80,7 @@ By "extraction", we typically mean the process of converting a data format from
 * ``"2021-04"`` is the last common crawl snapshot that will be included in the download.
 * ``output_type="jsonl"`` is the file format that will be used for storing the data on disk. Currently ``"jsonl"`` and ``"parquet"`` are supported.
 
-You can choose to modify the HTML text extraction algorithm used in ``download_common_crawl``. See an example below.
+You can choose to modify the HTML text extraction algorithm used in ``download_common_crawl``. See an example below.
 
 .. code-block:: python
 
@@ -133,13 +133,33 @@ You can choose to modify the HTML text extraction algorithm used in ``download_c
 
 Above, we changed the extraction algorithm from the default ``JusTextExtractor``. **Note:** The JusTextExtractor, ResiliparseExtractor, and TrafilaturaExtractor classes each have their own unique parameters which are specific to their extraction algorithms. Please see the docstrings for each class for more details.
 
+You can set your own dictionary of stop words by language to be used when extracting text:
+
+.. code-block:: python
+
+    from nemo_curator.download import download_common_crawl
+
+    # Change the default stop list used
+    stop_lists = {"ENGLISH": frozenset(["the", "and", "is", "in", "for", "where", "when", "to", "at"])}
+
+    common_crawl = download_common_crawl(
+        "/extracted/output/folder",
+        "2020-50",
+        "2021-04",
+        output_type="jsonl",
+        stop_lists=stop_lists,
+    )
+
+This may be desirable to further customize your text extraction pipeline, or to enable text extraction support for languages not included by jusText and NeMo Curator.
+
 The return value ``common_crawl`` will be in NeMo Curator's standard ``DocumentDataset`` format. Check out the function's docstring for more parameters you can use.
 
 NeMo Curator's Common Crawl extraction process looks like this under the hood:
 
-1. Decode the HTML within the record from binary to text.
-2. If the HTML can be properly decoded, then with `pyCLD2 <https://github.com/aboSamoor/pycld2>`_, perform language detection on the input HTML.
-3. Finally, the extract the relevant text with `jusText <https://github.com/miso-belica/jusText>`_, `Resiliparse <https://github.com/chatnoir-eu/chatnoir-resiliparse>`_, or `Trafilatura <https://trafilatura.readthedocs.io/en/latest/>`_ from the HTML and write it out as a single string within the 'text' field of a json entry within a `.jsonl` file.
+1. Decode the HTML within the record from binary to text.
+2. If the HTML can be properly decoded, then with `pyCLD2 <https://github.com/aboSamoor/pycld2>`_, perform language detection on the input HTML.
+3. Finally, extract the relevant text with `jusText <https://github.com/miso-belica/jusText>`_, `Resiliparse <https://github.com/chatnoir-eu/chatnoir-resiliparse>`_, or `Trafilatura <https://trafilatura.readthedocs.io/en/latest/>`_ from the HTML and write it out as a single string within the "text" field of a JSON entry within a ``.jsonl`` file.
+
 * ``download_wikipedia`` will download and extract the latest wikipedia dump. Files are downloaded using ``wget``. Wikipedia might download slower than the other datasets. This is because they limit the number of downloads that can occur per-ip address.
 
 .. code-block:: python
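
For illustration, a minimal, self-contained sketch of those three steps outside of NeMo Curator (assuming the ``pycld2`` and ``justext`` packages are installed) might look like the following. It is only a sketch of the pipeline described above, not the library's actual implementation, which also supports the Resiliparse and Trafilatura backends and maps pyCLD2 language names onto its stop lists:

    import justext
    import pycld2 as cld2

    def extract_record_text(html_bytes):
        # 1. Decode the HTML within the record from binary to text.
        try:
            html = html_bytes.decode("utf-8")
        except UnicodeDecodeError:
            return None

        # 2. Perform language detection on the decoded HTML with pyCLD2.
        is_reliable, _, details = cld2.detect(html)
        if not is_reliable:
            return None
        language = details[0][0]  # e.g. "ENGLISH"

        # 3. Extract the relevant text (here with jusText only) and join the
        #    paragraphs into the single string that would be written to the
        #    "text" field of a .jsonl entry.
        paragraphs = justext.justext(html, justext.get_stoplist(language.title()))
        text = "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)
        return text or None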

nemo_curator/download/commoncrawl.py

Lines changed: 114 additions & 40 deletions
@@ -16,6 +16,7 @@
 import os
 import subprocess
 import unicodedata
+import warnings
 from abc import ABC, abstractmethod
 from copy import deepcopy
 from typing import Literal, Optional
@@ -40,6 +41,8 @@
 from nemo_curator.utils.download_utils import get_common_crawl_urls
 from nemo_curator.utils.file_utils import expand_outdir_and_mkdir
 
+NON_SPACED_LANGUAGES = ["THAI", "CHINESE", "JAPANESE", "KOREAN"]
+
 
 def decode_html(html_bytes):
     # Convert from bytes to text using utf-8 encoding
@@ -76,7 +79,7 @@ def lang_detect(decoded_html):
 
 class HTMLExtractorAlgorithm(ABC):
     @abstractmethod
-    def extract_text(self, html, stop_words):
+    def extract_text(self, html, stop_words, language):
         pass
 
 
@@ -90,6 +93,7 @@ def __init__(
         max_link_density=0.2,
         max_heading_distance=200,
         no_headings=False,
+        is_boilerplate=None,
         logger=None,
     ):
         """
@@ -123,6 +127,9 @@ def __init__(
             max_link_density: Maximum allowed link density in the text.
             max_heading_distance: Maximum distance from a heading to consider text for extraction.
             no_headings: If True, text extraction will ignore headings.
+            is_boilerplate: If True, text extraction will ignore boilerplate content.
+                Default is True for space-separated languages and False for non-space-separated languages
+                (Thai, Chinese, Japanese, and Korean).
             logger: Optional logger instance for logging messages.
 
         """
@@ -133,9 +140,10 @@ def __init__(
         self.max_link_density = max_link_density
         self.max_heading_distance = max_heading_distance
         self.no_headings = no_headings
+        self.is_boilerplate = is_boilerplate
         self.logger = logger
 
-    def extract_text(self, html, stop_words):
+    def extract_text(self, html, stop_words, language):
         # Segment the HTML into paragraphs
         try:
             # Form the DOM tree
@@ -149,6 +157,7 @@ def extract_text(self, html, stop_words):
             if self.logger is not None:
                 self.logger.info("Could not segment paragaphs in the document")
             return
+
         paragraphs = handler.paragraphs
 
         # Context free classification
@@ -175,7 +184,21 @@ def extract_text(self, html, stop_words):
             self.max_heading_distance,
         )
 
-        return [p.text for p in paragraphs if not p.is_boilerplate]
+        if self.is_boilerplate is None:
+            if language in NON_SPACED_LANGUAGES:
+                warnings.warn("Disabling is_boilerplate check for jusText extraction.")
+                is_boilerplate = False
+            else:
+                is_boilerplate = True
+
+        else:
+            is_boilerplate = self.is_boilerplate
+
+        if is_boilerplate:
+            return [p.text for p in paragraphs if not p.is_boilerplate]
+
+        else:
+            return [p.text for p in paragraphs]
 
 
 class ResiliparseExtractor(HTMLExtractorAlgorithm):
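
A brief usage sketch of the new ``is_boilerplate`` knob. The ``download_common_crawl`` import follows the documentation above; ``JusTextExtractor`` is imported here straight from ``nemo_curator.download.commoncrawl`` because that is the module shown in this diff (the package may also re-export it), so treat the import path as an assumption:

    from nemo_curator.download import download_common_crawl
    from nemo_curator.download.commoncrawl import JusTextExtractor

    # Leaving is_boilerplate=None keeps the default: the boilerplate filter stays
    # on for space-separated languages and is turned off (with a warning) for
    # Thai, Chinese, Japanese, and Korean. Passing an explicit value overrides that.
    extractor = JusTextExtractor(is_boilerplate=True)

    common_crawl = download_common_crawl(
        "/extracted/output/folder",
        "2020-50",
        "2021-04",
        output_type="jsonl",
        algorithm=extractor,
    )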
@@ -212,26 +235,34 @@ def __init__(
         self.main_content = main_content
         self.alt_texts = alt_texts
 
-    def extract_text(self, html, stop_words):
+    def extract_text(self, html, stop_words, language):
         text = extract_plain_text(
             html, main_content=self.main_content, alt_texts=self.alt_texts
         )
 
         paragraphs = list(filter(None, text.split("\n")))
-        result = []
-        for paragraph in paragraphs:
-            words = paragraph.split()
-            length = len(words)
-            if length == 0:
-                continue
-            stopwords = [word for word in words if word in stop_words]
-            stopword_density = len(stopwords) / length
 
-            if stopword_density >= self.required_stopword_density:
-                result.append(paragraph)
+        if language in NON_SPACED_LANGUAGES:
+            warnings.warn(
+                "stopword_density is ignored for non-space-separated languages."
+            )
+            result = paragraphs
+        else:
+            result = []
+
+            for paragraph in paragraphs:
+                words = paragraph.split()
+                length = len(words)
+
+                if length == 0:
+                    continue
+
+                stopwords = [word for word in words if word in stop_words]
+                stopword_density = len(stopwords) / length
+
+                if stopword_density >= self.required_stopword_density:
+                    result.append(paragraph)
 
-        if len(result) == 0:
-            return None
         return result
 
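
For context, the stop-word density heuristic that this hunk now bypasses for non-space-separated languages can be shown in isolation. This is a standalone sketch of the same idea (the 0.32 threshold is just an illustrative value), not the library code itself:

    def filter_by_stopword_density(paragraphs, stop_words, required_stopword_density=0.32):
        # Keep only paragraphs whose fraction of stop words meets the threshold.
        # Splitting on whitespace is what makes this unusable for Thai, Chinese,
        # Japanese, and Korean, which is why those languages skip the filter.
        result = []
        for paragraph in paragraphs:
            words = paragraph.split()
            if not words:
                continue
            density = sum(word in stop_words for word in words) / len(words)
            if density >= required_stopword_density:
                result.append(paragraph)
        return result

    kept = filter_by_stopword_density(
        ["the cat sat on the mat", "lorem ipsum dolor"],
        stop_words=frozenset(["the", "on", "at", "a"]),
    )
    print(kept)  # ['the cat sat on the mat']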

@@ -300,7 +331,7 @@ def __init__(
         self.max_repetitions = max_repetitions
         self.extract_kwargs = extract_kwargs
 
-    def extract_text(self, html, stop_words):
+    def extract_text(self, html, stop_words, language):
         trafilatura_config = deepcopy(TRAFILATURA_DEFAULT_CONFIG)
         trafilatura_config["DEFAULT"]["MIN_EXTRACTED_SIZE"] = str(
             self.min_extracted_size
@@ -328,17 +359,29 @@ def extract_text(self, html, stop_words):
 
         if text is not None:
             paragraphs = list(filter(None, text.split("\n")))
-            result = []
-            for paragraph in paragraphs:
-                words = paragraph.split()
-                length = len(words)
-                if length == 0:
-                    continue
-                stopwords = [word for word in words if word in stop_words]
-                stopword_density = len(stopwords) / length
 
-                if stopword_density >= self.required_stopword_density:
-                    result.append(paragraph)
+            if language in NON_SPACED_LANGUAGES:
+                warnings.warn(
+                    "stopword_density is ignored for non-space-separated languages."
+                )
+                result = paragraphs
+
+            else:
+                result = []
+
+                for paragraph in paragraphs:
+                    words = paragraph.split()
+                    length = len(words)
+
+                    if length == 0:
+                        continue
+
+                    stopwords = [word for word in words if word in stop_words]
+                    stopword_density = len(stopwords) / length
+
+                    if stopword_density >= self.required_stopword_density:
+                        result.append(paragraph)
+
         else:
             return None
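
The Trafilatura path applies the same density filter. Its tuning attributes are visible in the context lines above (``min_extracted_size``, ``max_repetitions``, ``required_stopword_density``, ``extract_kwargs``); the constructor keywords in this sketch are inferred from those attribute names and are not a documented signature, so check the class docstring before relying on them:

    from nemo_curator.download.commoncrawl import TrafilaturaExtractor

    # Keyword names inferred from the attributes used in extract_text above;
    # the values are placeholders, not recommended settings.
    extractor = TrafilaturaExtractor(
        min_extracted_size=250,
        required_stopword_density=0.3,
        extract_kwargs={"deduplicate": True},
    )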

@@ -357,25 +400,47 @@ def get_stop_list_dict(languages=[]):
         "Norwegian_Nynorsk": "NORWEGIAN_N",
         "Waray_Waray": "WARAY_PHILIPPINES",
     }
+
+    # List obtained from https://github.com/stopwords-iso/stopwords-ja
+    from .ja_stopwords import ja_stopwords
+
+    # List obtained from https://github.com/stopwords-iso/stopwords-th
+    from .th_stopwords import th_stopwords
+
+    # List obtained from https://github.com/stopwords-iso/stopwords-zh
+    from .zh_stopwords import zh_stopwords
+
+    custom_stopwords = {
+        "THAI": th_stopwords,
+        "CHINESE": zh_stopwords,
+        "JAPANESE": ja_stopwords,
+    }
+
     if len(languages) == 0:
         languages = justext.get_stoplists()
-        # Remove latin as it yields a lot of low quality documents
-        languages_no_latin = list(languages)
-        languages_no_latin.remove("Latin")
-        languages = frozenset(languages_no_latin)
+
+        # Remove Latin as it yields a lot of low quality documents
+        languages = list(languages)
+        languages.remove("Latin")
+
+        # Manually add Thai, Chinese, and Japanese
+        languages.append("THAI")
+        languages.append("CHINESE")
+        languages.append("JAPANESE")
+
+        languages = frozenset(languages)
 
     stop_list_dict = {}
     for language in languages:
         if language in lang_map:
             lang_key = lang_map[language]
         else:
             lang_key = language.upper()
-        stop_list_dict[lang_key] = justext.get_stoplist(language)
-
-    # List obtained from https://github.com/stopwords-iso/stopwords-th
-    from .thai_stopwords import thai_stopwords
 
-    stop_list_dict["THAI"] = thai_stopwords
+        if lang_key in custom_stopwords:
+            stop_list_dict[lang_key] = custom_stopwords[lang_key]
+        else:
+            stop_list_dict[lang_key] = justext.get_stoplist(language)
 
     return stop_list_dict
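
To inspect what the refactored helper now returns, or to build a custom mapping on top of it, something like the following works as a sketch. ``get_stop_list_dict`` and the new stop-word modules live under ``nemo_curator/download/``, so the absolute import paths below mirror the relative imports in this hunk and are assumptions to verify against your installation:

    from nemo_curator.download.commoncrawl import get_stop_list_dict
    from nemo_curator.download.zh_stopwords import zh_stopwords

    # Default behaviour: every jusText stop list except Latin, plus the new
    # Thai, Chinese, and Japanese lists, keyed by upper-case language name.
    stop_lists = get_stop_list_dict()
    print("CHINESE" in stop_lists, "JAPANESE" in stop_lists, "THAI" in stop_lists)

    # A hand-built mapping that reuses the bundled Chinese list and restricts
    # extraction to just two languages.
    custom_stop_lists = {
        "CHINESE": zh_stopwords,
        "ENGLISH": frozenset(["the", "and", "is", "in", "to", "of"]),
    }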

@@ -484,8 +549,12 @@ def iterate(self, file_path):
 
 class CommonCrawlWARCExtractor(DocumentExtractor):
 
-    def __init__(self, algorithm=JusTextExtractor()):
-        self._stop_lists = get_stop_list_dict()
+    def __init__(self, algorithm=JusTextExtractor(), stop_lists=None):
+        if stop_lists is not None:
+            self._stop_lists = stop_lists
+        else:
+            self._stop_lists = get_stop_list_dict()
+
         self.algorithm = algorithm
         super().__init__()
 
@@ -496,7 +565,7 @@ def extract(self, content):
         lang = lang_detect(html)
         text = None
         if lang in self._stop_lists:
-            text = self.algorithm.extract_text(html, self._stop_lists[lang])
+            text = self.algorithm.extract_text(html, self._stop_lists[lang], lang)
         if text is not None:
             if len(text) > 0:
                 text = "\n\n".join(text)
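
A short sketch of how the new constructor argument interacts with this ``extract`` method: because extraction only runs when the detected language is a key of ``self._stop_lists``, a user-supplied ``stop_lists`` also acts as a language filter. The import paths follow the module shown in this diff and are assumptions:

    from nemo_curator.download.commoncrawl import (
        CommonCrawlWARCExtractor,
        ResiliparseExtractor,
    )
    from nemo_curator.download.ja_stopwords import ja_stopwords

    # Only records whose detected language is "JAPANESE" will be extracted;
    # all other languages fall through with text left as None.
    extractor = CommonCrawlWARCExtractor(
        algorithm=ResiliparseExtractor(),
        stop_lists={"JAPANESE": ja_stopwords},
    )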
@@ -512,6 +581,7 @@ def download_common_crawl(
     end_snapshot: str,
     output_type: Literal["jsonl", "parquet"] = "jsonl",
     algorithm=JusTextExtractor(),
+    stop_lists=None,
     news: bool = False,
     aws: bool = False,
     raw_download_dir: Optional[str] = None,
@@ -536,6 +606,10 @@ def download_common_crawl(
             • This is not used for the output file, but is used to check if an extracted output already exists.
         algorithm: The text extraction algorithm instance to use for HTML processing.
             • This can be a JusTextExtractor (default), ResiliparseExtractor, or TrafilaturaExtractor object.
+        stop_lists: A dictionary of stop lists, where the keys are languages (e.g., "ENGLISH")
+            and the values are Python frozensets denoting the list of stop words for that language.
+            If None, it defaults to jusText's stop lists: https://github.com/miso-belica/jusText/tree/main/justext/stoplists,
+            with added Thai, Chinese, and Japanese support.
         news (bool): When True, indicates that URLs should be retrieved from the CC-NEWS dataset.
             • This also means snapshot identifiers should follow the 'YYYY-MM' format.
         aws (bool): If True, downloads are sourced from Common Crawl's S3 bucket using s5cmd;
@@ -577,7 +651,7 @@ def download_common_crawl(
     expand_outdir_and_mkdir(raw_download_dir)
     downloader = CommonCrawlWARCDownloader(raw_download_dir, aws=aws)
     iterator = CommonCrawlWARCIterator()
-    extractor = CommonCrawlWARCExtractor(algorithm=algorithm)
+    extractor = CommonCrawlWARCExtractor(algorithm=algorithm, stop_lists=stop_lists)
 
     output_format = {
         "text": str,
