
Commit 1f8e5ad

Add TrafilaturaExtractor class (#431)
* first commit
* add implementation and pytest
* allow editing trafilatura config params
* add ryan's suggestions
* add to docstring

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
1 parent add5f1a commit 1f8e5ad

6 files changed: +237 −55 lines changed

docs/user-guide/api/download.rst

Lines changed: 3 additions & 0 deletions
@@ -55,6 +55,9 @@ Common Crawl
 .. autoclass:: nemo_curator.download.ResiliparseExtractor
     :members:
 
+.. autoclass:: nemo_curator.download.TrafilaturaExtractor
+    :members:
+
 ------------------------------
 Wikipedia
 ------------------------------

docs/user-guide/download.rst

Lines changed: 7 additions & 4 deletions
@@ -18,7 +18,7 @@ the extraction step to limit the amount of documents that undergo this heavy com
 NeMo Curator provides example utilities for downloading and extracting Common Crawl, ArXiv, and Wikipedia data.
 In addition, it provides a flexible interface to extend the utility to other datasets.
 Our Common Crawl example demonstrates how to process a crawl by downloading the data from S3, doing preliminary language filtering with pyCLD2,
-and extracting the relevant text with jusText or Resiliparse to output :code:`.jsonl` files.
+and extracting the relevant text with jusText, Resiliparse, or Trafilatura to output :code:`.jsonl` files.
 
 NeMo Curator currently does not provide out-of-the-box support for web-crawling or web-scraping.
 It provides utilities for downloading and extracting data from the preexisting online sources given above.
@@ -88,6 +88,7 @@ You can choose to modify the HTML text extraction algorithm used in ``download_c
     from nemo_curator import get_client
     from nemo_curator.download import (
         ResiliparseExtractor,
+        TrafilaturaExtractor,
         download_common_crawl,
     )
     from nemo_curator.datasets import DocumentDataset
@@ -106,8 +107,10 @@ You can choose to modify the HTML text extraction algorithm used in ``download_c
         output_type = "jsonl"
         os.makedirs(output_folder, exist_ok=True)
 
-        # Change the extraction algorithm to use ResiliparseExtractor
+        # Change the extraction algorithm to Resiliparse
         extraction_algorithm = ResiliparseExtractor()
+        # Alternatively, change the extraction algorithm to Trafilatura
+        # extraction_algorithm = TrafilaturaExtractor()
 
         # Download and extract the Common Crawl data using the Resiliparse extraction algorithm.
         # The function returns a DocumentDataset that contains the extracted documents.
@@ -128,15 +131,15 @@ You can choose to modify the HTML text extraction algorithm used in ``download_c
     if __name__ == "__main__":
         main()
 
-Above, we changed the extraction algorithm from the default ``JusTextExtractor``.
+Above, we changed the extraction algorithm from the default ``JusTextExtractor``. **Note:** The JusTextExtractor, ResiliparseExtractor, and TrafilaturaExtractor classes each have their own unique parameters which are specific to their extraction algorithms. Please see the docstrings for each class for more details.
 
 The return value ``common_crawl`` will be in NeMo Curator's standard ``DocumentDataset`` format. Check out the function's docstring for more parameters you can use.
 
 NeMo Curator's Common Crawl extraction process looks like this under the hood:
 
 1. Decode the HTML within the record from binary to text.
 2. If the HTML can be properly decoded, then with `pyCLD2 <https://github.com/aboSamoor/pycld2>`_, perform language detection on the input HTML.
-3. Finally, the extract the relevant text with `jusText <https://github.com/miso-belica/jusText>`_ or `Resiliparse <https://github.com/chatnoir-eu/chatnoir-resiliparse>`_ from the HTML and write it out as a single string within the 'text' field of a json entry within a `.jsonl` file.
+3. Finally, extract the relevant text with `jusText <https://github.com/miso-belica/jusText>`_, `Resiliparse <https://github.com/chatnoir-eu/chatnoir-resiliparse>`_, or `Trafilatura <https://trafilatura.readthedocs.io/en/latest/>`_ from the HTML and write it out as a single string within the 'text' field of a json entry within a `.jsonl` file.
 * ``download_wikipedia`` will download and extract the latest wikipedia dump. Files are downloaded using ``wget``. Wikipedia might download slower than the other datasets. This is because they limit the number of downloads that can occur per-ip address.
 
 .. code-block:: python
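The documentation diff above swaps in the new extractor; for reference, here is a minimal end-to-end sketch of that usage. The output folder and snapshot identifiers are placeholders, get_client() is assumed to work with its defaults, and the download_common_crawl call follows the argument pattern of the documentation example above, with unlisted arguments keeping their defaults.

import os

from nemo_curator import get_client
from nemo_curator.download import TrafilaturaExtractor, download_common_crawl


def main():
    client = get_client()  # assumed: a local Dask client started with default settings

    output_folder = "/extracted/output/folder"  # placeholder path
    os.makedirs(output_folder, exist_ok=True)

    # Swap the default jusText extractor for Trafilatura; arguments not listed
    # here keep the defaults described in the TrafilaturaExtractor docstring.
    extraction_algorithm = TrafilaturaExtractor(required_stopword_density=0.32)

    # Placeholder snapshot identifiers; the return value is a DocumentDataset.
    common_crawl = download_common_crawl(
        output_folder,
        "2020-50",
        "2021-04",
        output_type="jsonl",
        algorithm=extraction_algorithm,
    )


if __name__ == "__main__":
    main()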

nemo_curator/download/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -20,6 +20,7 @@
     CommonCrawlWARCIterator,
     JusTextExtractor,
     ResiliparseExtractor,
+    TrafilaturaExtractor,
     download_common_crawl,
 )
 from .doc_builder import (
@@ -54,6 +55,7 @@
     "CommonCrawlWARCDownloaderExtractOnly",
     "JusTextExtractor",
     "ResiliparseExtractor",
+    "TrafilaturaExtractor",
     "download_wikipedia",
     "WikipediaDownloader",
     "WikipediaIterator",

nemo_curator/download/commoncrawl.py

Lines changed: 150 additions & 2 deletions
@@ -1,4 +1,4 @@
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -17,6 +17,7 @@
 import subprocess
 import unicodedata
 from abc import ABC, abstractmethod
+from copy import deepcopy
 from typing import Literal, Optional
 from urllib.parse import urlparse
 
@@ -25,6 +26,8 @@
 import pycld2 as cld2
 from charset_normalizer import detect
 from resiliparse.extract.html2text import extract_plain_text
+from trafilatura import extract as extract_with_trafilatura
+from trafilatura.settings import DEFAULT_CONFIG as TRAFILATURA_DEFAULT_CONFIG
 from warcio.archiveiterator import ArchiveIterator
 
 from nemo_curator.datasets import DocumentDataset
@@ -92,6 +95,26 @@ def __init__(
         """
         Initialize the jusText text extraction algorithm with specified parameters.
 
+        jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages.
+        It is designed to preserve mainly text containing full sentences and it is therefore well suited for creating linguistic resources such as Web corpora.
+        The key idea is that long blocks can often be classified with high confidence, while shorter blocks require context-based adjustments.
+
+        Here is an overview of the jusText algorithm:
+        • Segmentation: The document is split into textual blocks based on HTML tags that typically define separate sections (e.g., <div>, <p>, <table>).
+        • Preprocessing: Contents of <header>, <style>, and <script> tags are removed.
+            Certain elements (e.g., <select>, copyright symbols) are immediately classified as boilerplate.
+        • Context-Free Classification: Each block is classified as:
+            - Bad (boilerplate) if it has high link density.
+            - Short if it is too small to be classified reliably.
+            - Near-Good if it has a moderate density of stopwords.
+            - Good (main content) if it is long and contains many stopwords.
+        • Context-Sensitive Classification: Blocks that were classified as short or near-good are reclassified based on surrounding blocks.
+            The assumption is that main content clusters together, as does boilerplate.
+        • Headings Processing: Header elements (e.g., <h1>, <h2>) are treated separately to ensure useful headings are preserved.
+            Short headers near good content may be reclassified as near-good or good.
+
+        Please refer to the jusText documentation for more details: https://corpus.tools/wiki/Justext/Algorithm
+
         Args:
             length_low: Minimum length of text to be considered for extraction.
             length_high: Maximum length of text to be considered for extraction.
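The docstring added above summarizes the jusText heuristics and its tunable thresholds. As a brief illustration, the two length parameters listed under Args could be tightened so that more short, low-confidence blocks end up treated as boilerplate; the values below are illustrative only, and every other constructor argument is assumed to keep its default.

from nemo_curator.download import JusTextExtractor

# Illustrative thresholds only: raise the block-length cutoffs so fewer short
# blocks are classified with high confidence as main content.
strict_justext = JusTextExtractor(length_low=100, length_high=300)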
@@ -165,6 +188,18 @@ def __init__(
         """
         Initialize the Resiliparse text extraction algorithm with specified parameters.
 
+        The Resiliparse algorithm extracts structural or semantic information from noisy raw web data for further processing,
+        such as (main) content extraction / boilerplate removal, schema extraction, general web data cleansing, and more.
+
+        It is implemented via the `extract_plain_text` function in the `resiliparse.extract.html2text` module.
+        Resiliparse HTML2Text is a very fast and rule-based plain text extractor for HTML pages which uses the Resiliparse DOM parser.
+        The `extract_plain_text` function extracts all visible text nodes inside the HTML document's <body>.
+        Only <script>, <style> and a few other (generally) invisible elements are skipped and very basic ASCII formatting is applied.
+
+        Please refer to the Resiliparse documentation for more details: https://resiliparse.chatnoir.eu/en/latest/man/extract/html2text.html
+
+        NeMo Curator has added a stopword density filter to the Resiliparse extraction process, which requires that a paragraph contains a certain proportion of stopwords.
+
         Args:
             required_stopword_density: Proportion of stopwords required preserve an extracted paragraph.
                 Studies on stopword lists and their distribution in various text corpora often
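The stopword-density filter mentioned above is a NeMo Curator addition on top of Resiliparse's extract_plain_text. A minimal tuning sketch follows, with an illustrative value and all other constructor arguments assumed to keep their defaults.

from nemo_curator.download import ResiliparseExtractor

# Illustrative: require half of each paragraph's tokens to be stopwords,
# stricter than the ~30-40% guideline quoted in the docstring.
strict_resiliparse = ResiliparseExtractor(required_stopword_density=0.5)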
@@ -200,6 +235,118 @@ def extract_text(self, html, stop_words):
         return result
 
 
+class TrafilaturaExtractor(HTMLExtractorAlgorithm):
+    def __init__(
+        self,
+        required_stopword_density=0.32,
+        min_extracted_size=250,
+        min_extracted_comm_size=1,
+        min_output_size=1,
+        min_output_comm_size=1,
+        max_tree_size=None,
+        min_duplcheck_size=100,
+        max_repetitions=2,
+        **extract_kwargs,
+    ):
+        """
+        Initialize the Trafilatura text extraction algorithm with specified parameters.
+
+        The Trafilatura extraction process combines readability-lxml and jusText as fallbacks to ensure robustness.
+        Trafilatura's own algorithm follows a cascade of rule-based filters and content heuristics:
+        • Content Delimitation: Uses XPath expressions to exclude unwanted HTML elements (e.g., navigation bars) and focus on relevant content (e.g., article body).
+            Extracted HTML nodes are analyzed for relevance based on element type, text length, and link density.
+        • Fallback Mechanism: If extraction seems faulty, alternative algorithms are run as backups.
+            These use heuristics like line length, text-to-markup ratio, and HTML depth to improve extraction.
+            Outputs are compared, prioritizing longer extractions with fewer impurities.
+        • Baseline Extraction: If all else fails, it searches for text elements that might have been missed, discarding irrelevant content.
+
+        The system balances precision and recall, extracting main text, comments, and metadata (title, site name, author, date, categories, tags).
+
+        Please refer to the Trafilatura documentation for more details:
+        https://trafilatura.readthedocs.io/en/latest/ and https://aclanthology.org/2021.acl-demo.15/
+
+        NeMo Curator has added a stopword density filter to the Trafilatura extraction process, which requires that a paragraph contains a certain proportion of stopwords.
+
+        Args:
+            required_stopword_density: Proportion of stopwords required to preserve an extracted paragraph.
+                Studies on stopword lists and their distribution in various text corpora often
+                suggest that around 30-40% of a typical English text consists of stopwords.
+            min_extracted_size: Acceptable size in characters (used to trigger fallbacks).
+                Defaults to 250. See Trafilatura documentation: https://trafilatura.readthedocs.io/en/latest/settings.html.
+            min_extracted_comm_size: Works the same as min_extracted_size for comment extraction.
+                Defaults to 1. See Trafilatura documentation: https://trafilatura.readthedocs.io/en/latest/settings.html.
+            min_output_size: Absolute acceptable minimum for main text output.
+                Defaults to 1. See Trafilatura documentation: https://trafilatura.readthedocs.io/en/latest/settings.html.
+            min_output_comm_size: Works the same as min_output_size for comment extraction.
+                Defaults to 1. See Trafilatura documentation: https://trafilatura.readthedocs.io/en/latest/settings.html.
+            max_tree_size: Used to discard documents with too many elements. Defaults to None.
+            min_duplcheck_size: Minimum size in characters to run deduplication on.
+                Defaults to 100. See Trafilatura documentation: https://trafilatura.readthedocs.io/en/latest/settings.html.
+            max_repetitions: Maximum number of duplicates allowed.
+                Defaults to 2. See Trafilatura documentation: https://trafilatura.readthedocs.io/en/latest/settings.html.
+            extract_kwargs: Additional keyword arguments for the Trafilatura extract function.
+                See API documentation https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract
+                for a list of possible parameters.
+                All arguments are set to their default values, except for deduplicate (bool) which is set to True.
+
+        """
+        self.required_stopword_density = required_stopword_density
+        self.min_extracted_size = min_extracted_size
+        self.min_extracted_comm_size = min_extracted_comm_size
+        self.min_output_size = min_output_size
+        self.min_output_comm_size = min_output_comm_size
+        self.max_tree_size = max_tree_size
+        self.min_duplcheck_size = min_duplcheck_size
+        self.max_repetitions = max_repetitions
+        self.extract_kwargs = extract_kwargs
+
+    def extract_text(self, html, stop_words):
+        trafilatura_config = deepcopy(TRAFILATURA_DEFAULT_CONFIG)
+        trafilatura_config["DEFAULT"]["MIN_EXTRACTED_SIZE"] = str(
+            self.min_extracted_size
+        )
+        trafilatura_config["DEFAULT"]["MIN_EXTRACTED_COMM_SIZE"] = str(
+            self.min_extracted_comm_size
+        )
+        trafilatura_config["DEFAULT"]["MIN_OUTPUT_SIZE"] = str(self.min_output_size)
+        trafilatura_config["DEFAULT"]["MIN_OUTPUT_COMM_SIZE"] = str(
+            self.min_output_comm_size
+        )
+        if self.max_tree_size:
+            trafilatura_config["DEFAULT"]["MAX_TREE_SIZE"] = str(self.max_tree_size)
+        trafilatura_config["DEFAULT"]["MIN_DUPLCHECK_SIZE"] = str(
+            self.min_duplcheck_size
+        )
+        trafilatura_config["DEFAULT"]["MAX_REPETITIONS"] = str(self.max_repetitions)
+
+        # Recommended to set deduplicate=True
+        self.extract_kwargs.setdefault("deduplicate", True)
+
+        text = extract_with_trafilatura(
+            html, config=trafilatura_config, **self.extract_kwargs
+        )
+
+        if text is not None:
+            paragraphs = list(filter(None, text.split("\n")))
+            result = []
+            for paragraph in paragraphs:
+                words = paragraph.split()
+                length = len(words)
+                if length == 0:
+                    continue
+                stopwords = [word for word in words if word in stop_words]
+                stopword_density = len(stopwords) / length
+
+                if stopword_density >= self.required_stopword_density:
+                    result.append(paragraph)
+        else:
+            return None
+
+        if len(result) == 0:
+            return None
+        return result
+
+
 def get_stop_list_dict(languages=[]):
 
     # Name mapping for language names from CLD2 (values)
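Because the extract_kwargs keyword arguments are forwarded directly to trafilatura.extract, the new class can be tuned both through the config-backed parameters above and through Trafilatura's own keyword options. A hedged usage sketch follows; the HTML string and stop-word set are placeholders, and favor_precision is a standard trafilatura.extract option rather than something defined in this commit.

from nemo_curator.download import TrafilaturaExtractor

extractor = TrafilaturaExtractor(
    min_extracted_size=300,  # raise the size that triggers fallback extraction
    max_repetitions=1,       # be stricter about repeated segments
    favor_precision=True,    # forwarded to trafilatura.extract via **extract_kwargs
)

html = "<html><body><p>Example page text goes here ...</p></body></html>"  # placeholder
stop_words = frozenset({"the", "and", "of", "to", "a", "in", "is"})  # toy stop-word set

# Returns the paragraphs that pass the stopword-density filter,
# or None if Trafilatura extracts nothing usable.
paragraphs = extractor.extract_text(html, stop_words)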
@@ -387,7 +534,8 @@ def download_common_crawl(
         end_snapshot (str): Identifier for the latest snapshot to process, which must be chronologically after start_snapshot.
         output_type (Literal["jsonl", "parquet"]): The file format for the extracted output. Must be either "jsonl" or "parquet".
             • This is not used for the output file, but is used to check if an extracted output already exists.
-        algorithm: The text extraction algorithm instance (e.g., JusTextExtractor or ResiliparseExtractor) to use for HTML processing.
+        algorithm: The text extraction algorithm instance to use for HTML processing.
+            • This can be a JusTextExtractor (default), ResiliparseExtractor, or TrafilaturaExtractor object.
         news (bool): When True, indicates that URLs should be retrieved from the CC-NEWS dataset.
             • This also means snapshot identifiers should follow the 'YYYY-MM' format.
         aws (bool): If True, downloads are sourced from Common Crawl's S3 bucket using s5cmd;
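The news and aws flags documented here combine naturally with the new extractor. A short sketch, assuming the positional arguments follow the documentation example earlier in this commit and using placeholder 'YYYY-MM' snapshot identifiers as the news docstring requires:

from nemo_curator.download import TrafilaturaExtractor, download_common_crawl

# CC-NEWS snapshots use the 'YYYY-MM' format; aws=True pulls from Common Crawl's S3 bucket via s5cmd.
news_dataset = download_common_crawl(
    "/extracted/news/folder",  # placeholder output path
    "2024-01",
    "2024-03",
    output_type="jsonl",
    algorithm=TrafilaturaExtractor(),
    news=True,
    aws=True,
)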

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -66,6 +66,7 @@ dependencies = [
     "resiliparse",
     "sentencepiece",
     "spacy>=3.6.0, <3.8.0",
+    "trafilatura",
     "transformers>=4.48.0",
     "unidic-lite==1.0.8",
     "usaddress==0.5.10",
