Add support for Chinese and Japanese stop words (#507)
* add zh and ja stopwords
* run isort
* edit doc
* indent?
* rst file
* rst?
* more indents?
* fix todos and add pytests
* run black
* add Ryan's suggestions
* run isort
* edit rst file
* add trafilatura support
---------
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
docs/user-guide/download.rst (+24 -4)
@@ -80,7 +80,7 @@ By "extraction", we typically mean the process of converting a data format from
 * ``"2021-04"`` is the last common crawl snapshot that will be included in the download.
 * ``output_type="jsonl"`` is the file format that will be used for storing the data on disk. Currently ``"jsonl"`` and ``"parquet"`` are supported.

-You can choose to modify the HTML text extraction algorithm used in ``download_common_crawl``. See an example below.
+You can choose to modify the HTML text extraction algorithm used in ``download_common_crawl``. See an example below.

 .. code-block:: python

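The body of this code block falls outside the hunk's context, so it is not shown above. For reference, a minimal sketch of what such an example could look like, assuming the ``ResiliparseExtractor`` class named in this diff and an ``algorithm`` keyword argument on ``download_common_crawl`` (treat the exact signature as an assumption):

.. code-block:: python

    from nemo_curator.download import ResiliparseExtractor, download_common_crawl

    # Swap the default JusTextExtractor for Resiliparse-based extraction.
    # The "algorithm" parameter name is an assumption, not confirmed by this hunk.
    extraction_algorithm = ResiliparseExtractor()

    common_crawl = download_common_crawl(
        "/extracted/output/folder",
        "2020-50",
        "2021-04",
        output_type="jsonl",
        algorithm=extraction_algorithm,
    )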
@@ -133,13 +133,33 @@ You can choose to modify the HTML text extraction algorithm used in ``download_c
 Above, we changed the extraction algorithm from the default ``JusTextExtractor``. **Note:** The JusTextExtractor, ResiliparseExtractor, and TrafilaturaExtractor classes each have their own unique parameters, which are specific to their extraction algorithms. Please see the docstrings for each class for more details.

+You can set your own dictionary of stop words by language to be used when extracting text:
+
+.. code-block:: python
+
+    from nemo_curator.download import download_common_crawl
+
+    # (new lines 141-153, the rest of this example, are not visible in this view;
+    # see the sketch after this diff)
+
+This may be desirable to further customize your text extraction pipeline, or to enable text extraction support for languages not included by jusText and NeMo Curator.
+
 The return value ``common_crawl`` will be in NeMo Curator's standard ``DocumentDataset`` format. Check out the function's docstring for more parameters you can use.

 NeMo Curator's Common Crawl extraction process looks like this under the hood:

-1. Decode the HTML within the record from binary to text.
-2. If the HTML can be properly decoded, then with `pyCLD2 <https://github.com/aboSamoor/pycld2>`_, perform language detection on the input HTML.
-3. Finally, the extract the relevant text with `jusText <https://github.com/miso-belica/jusText>`_, `Resiliparse <https://github.com/chatnoir-eu/chatnoir-resiliparse>`_, or `Trafilatura <https://trafilatura.readthedocs.io/en/latest/>`_ from the HTML and write it out as a single string within the 'text' field of a json entry within a `.jsonl` file.
+1. Decode the HTML within the record from binary to text.
+2. If the HTML can be properly decoded, then with `pyCLD2 <https://github.com/aboSamoor/pycld2>`_, perform language detection on the input HTML.
+3. Finally, extract the relevant text with `jusText <https://github.com/miso-belica/jusText>`_, `Resiliparse <https://github.com/chatnoir-eu/chatnoir-resiliparse>`_, or `Trafilatura <https://trafilatura.readthedocs.io/en/latest/>`_ from the HTML and write it out as a single string within the "text" field of a JSON entry within a ``.jsonl`` file.
+
 * ``download_wikipedia`` will download and extract the latest Wikipedia dump. Files are downloaded using ``wget``. Wikipedia may download more slowly than the other datasets because it limits the number of downloads that can occur per IP address.
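The stop-words example above is truncated in the diff view. As a rough, hedged reconstruction, and given the PR's stated goal of Chinese and Japanese support, the example presumably passes a per-language dictionary of stop word sets to ``download_common_crawl``. The ``stop_lists`` parameter name and the sample stop words below are assumptions:

.. code-block:: python

    from nemo_curator.download import download_common_crawl

    # Hypothetical reconstruction: map language names to custom stop word sets.
    # The "stop_lists" parameter name is an assumption, not confirmed by this diff.
    stop_lists = {
        "ENGLISH": frozenset(["the", "and", "is", "in", "for", "to", "at"]),
        "CHINESE": frozenset(["的", "了", "是", "在"]),
    }

    common_crawl = download_common_crawl(
        "/extracted/output/folder",
        "2020-50",
        "2021-04",
        output_type="jsonl",
        stop_lists=stop_lists,
    )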
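To make the three-step extraction process above concrete, here is a minimal, self-contained sketch of one record's journey using pyCLD2 and jusText directly. This illustrates the steps only and is not NeMo Curator's actual implementation; the ``extract_record`` helper is hypothetical:

.. code-block:: python

    from typing import Optional

    import justext
    import pycld2 as cld2


    def extract_record(html_bytes: bytes) -> Optional[str]:
        # 1. Decode the HTML within the record from binary to text.
        try:
            html = html_bytes.decode("utf-8")
        except UnicodeDecodeError:
            return None

        # 2. Perform language detection on the decoded HTML with pyCLD2.
        is_reliable, _, details = cld2.detect(html)
        if not is_reliable:
            return None
        language = details[0][0].title()  # e.g., "ENGLISH" -> "English"

        # 3. Extract the relevant text with jusText and join the
        # non-boilerplate paragraphs into a single string. jusText only
        # ships stop lists for some languages, which is why this PR adds
        # custom zh/ja stop word support.
        try:
            stoplist = justext.get_stoplist(language)
        except ValueError:
            return None
        paragraphs = justext.justext(html, stoplist)
        text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
        return text or None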