
Commit 3d14b0d

shuoyangd and nverma1 authored
Add support for parallel data curation (#193)
* add data interface to read simple bitext
* add ParallelScoreFilter
* add test for ParallelScoreFilter, small style change for ParallelDataset test, fix a few data and import bugs
* allow ParallelScoreFilter to take different filters for source and target
* add JointScoreFilter and LengthRatioFilter
* [WIP] add heuristic filter w/o test
* merge with main
* add test for histogram filter, fix a few bugs
* length ratio, joint score filter testing
* fix typing in joint test
* add a fake comet qe filter as an initial step
* [WIP] add bitext cleaning tutorial
* [WIP] fix example
* fix slow histogram filter, fix faulty bitext loading
* tutorial running
* [WIP] documentation of bitext tutorial
* add tested version of comet-qe filter
* fix ParallelDataset bug where a single file name is not accepted and the dataset is sometimes turned into its parent class by mistake; add write-to-simple-bitext functionality; update bitext tutorial
* add docstring to explain the simple bitext format, fix a bug where file extensions are removed twice before writing
* remove debug print line
* add comet filter to tutorial
* refactor COMET QE filter to decouple model from filter, make sure JointScoreFilter can take more than one field for source and target
* use refactored qe filter
* wrap_qe_input should be a static method
* use conditional import for comet, formatting changes
* [WIP] add cometoid
* [WIP] attempt to resolve device conflict but is failing
* [WIP] playing with cometoid arguments
* [WIP] -d 0 doesn't look necessary
* tested arguments for Cometoid
* use proper safe import, make sure test doesn't crash sans comet/pymarian
* fall back to comet for tutorial since that's easier to set up, update README
* give credit to original fairseq implementation of histogram filtering, run black formatter
* fix pre-commit complaint
* fix small bug
* fix another occurrence of the same bug
* introduce shard limit to a single PyMarian API call to avoid memory leakage
* repartition after reading simple bitext data
* -d 0 is actually needed for pymarian
* remove duplicate LengthRatioFilter definition
* refactor repeated code segment in file writing, change classifier to accommodate custom field names, pause doc repartition since it causes problems
* [WIP] addressed comments in #193 apart from resolving .iloc pattern, test currently failing
* refactor to resolve .loc pattern, test passing
* add missing file
* revert changes in setup.py
* fix a small bug in parallel dataset, explain why repartition is disabled, fix tutorial
* add api guide, small change on bitext/parallel score filter docstring
* fix read_simple_bitext test issues
* reinstate dependencies lost during merging
* re-enable multiple partitions for simple bitext, add parallel write
* take care of the case where filename is not supplied in the dataframe, make logic clearer
* address other minor comments in the PR, fix segment order scrambling
* fix test errors, add bitext dependencies
* add back more missing imports
* add bitext to [all] in .toml, add platformdirs as dependency
* merge upstream, remove old bitext requirement list
* delete requirement file again

---------

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Co-authored-by: nverma1 <neha.verma2017@gmail.com>
1 parent b15b08a commit 3d14b0d

File tree

23 files changed: +1490 −30 lines changed

docs/user-guide/api/datasets.rst

Lines changed: 3 additions & 1 deletion

@@ -9,10 +9,12 @@ DocumentDataset
 .. autoclass:: nemo_curator.datasets.DocumentDataset
     :members:
 
+.. autoclass:: nemo_curator.datasets.ParallelDataset
+    :members:
 
 -------------------------------
 ImageTextPairDataset
 -------------------------------
 
 .. autoclass:: nemo_curator.datasets.ImageTextPairDataset
-    :members:
+    :members:

docs/user-guide/api/filters.rst

Lines changed: 20 additions & 0 deletions

@@ -10,6 +10,10 @@ Base Class
     :members:
     :member-order: bysource
 
+.. autoclass:: nemo_curator.filters.BitextFilter
+    :members:
+    :member-order: bysource
+
 .. autofunction:: nemo_curator.filters.import_filter
 
 ------------------------------
@@ -40,6 +44,14 @@ FastText Filters
     :members:
     :member-order: bysource
 
+------------------------------
+Quality Estimation Filters
+------------------------------
+
+.. autoclass:: nemo_curator.filters.QualityEstimationFilter
+    :members:
+    :member-order: bysource
+
 ------------------------------
 Heuristic Filters
 ------------------------------
@@ -132,6 +144,14 @@ Heuristic Filters
     :members:
     :member-order: bysource
 
+.. autoclass:: nemo_curator.filters.HistogramFilter
+    :members:
+    :member-order: bysource
+
+.. autoclass:: nemo_curator.filters.LengthRatioFilter
+    :members:
+    :member-order: bysource
+
 ------------------------------
 Code Filters
 ------------------------------

nemo_curator/datasets/__init__.py

Lines changed: 2 additions & 1 deletion

@@ -15,9 +15,10 @@
 from nemo_curator.utils.import_utils import image_only_import_from
 
 from .doc_dataset import DocumentDataset
+from .parallel_dataset import ParallelDataset
 
 ImageTextPairDataset = image_only_import_from(
     "nemo_curator.datasets.image_text_pair_dataset", "ImageTextPairDataset"
 )
 
-__all__ = ["DocumentDataset", "ImageTextPairDataset"]
+__all__ = ["DocumentDataset", "ImageTextPairDataset", "ParallelDataset"]
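
The diff above pulls ImageTextPairDataset in through `image_only_import_from`, and the commit message mentions using a "proper safe import" for optional dependencies like comet/pymarian. A minimal self-contained sketch of that guarded-import idea (this is an assumption-labeled illustration; `optional_import` is a hypothetical helper, not NeMo Curator's actual `gpu_only_import`):

```python
# Sketch of the guarded-import pattern: return the real module when installed,
# otherwise a placeholder that raises only when the module is actually used.
# `optional_import` is a hypothetical name for illustration.
import importlib


def optional_import(module_name):
    try:
        return importlib.import_module(module_name)
    except ImportError:
        class _MissingModule:
            def __getattr__(self, attr):
                raise ImportError(
                    f"{module_name} is not installed but is required for this feature"
                )
        return _MissingModule()


cudf = optional_import("cudf")  # safe to run even on machines without cudf
```

Attribute access on the placeholder fails lazily, so importing the package stays cheap and an optional dependency only errors at the point of use.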
nemo_curator/datasets/parallel_dataset.py (new file; the path is inferred from the `from .parallel_dataset import ParallelDataset` line in the `__init__.py` diff)

Lines changed: 167 additions & 0 deletions

import csv
from typing import List, Optional, Tuple, Union

import dask.dataframe as dd
import pandas as pd

from nemo_curator.datasets.doc_dataset import DocumentDataset
from nemo_curator.utils.distributed_utils import write_to_disk
from nemo_curator.utils.file_utils import remove_path_extension
from nemo_curator.utils.import_utils import gpu_only_import

cudf = gpu_only_import("cudf")


class ParallelDataset(DocumentDataset):
    """
    An extension of the standard `DocumentDataset` with a special method that loads simple bitext.

    For data with more complicated metadata, please convert your data into jsonl/parquet/pickle format
    and use interfaces defined in `DocumentDataset`.
    """

    def persist(self):
        return ParallelDataset(self.df.persist())

    @classmethod
    def read_simple_bitext(
        cls,
        src_input_files: Union[str, List[str]],
        tgt_input_files: Union[str, List[str]],
        src_lang: str,
        tgt_lang: str,
        backend: str = "pandas",
        add_filename: bool = False,
        npartitions: int = 16,
    ):
        """See the `read_single_simple_bitext_file_pair` docstring for what "simple bitext" means and for usage of the other parameters.

        Args:
            src_input_files (Union[str, List[str]]): one or several input files, in source language
            tgt_input_files (Union[str, List[str]]): one or several input files, in target language

        Raises:
            TypeError: If the types of `src_input_files` and `tgt_input_files` don't agree.

        Returns:
            ParallelDataset: A `ParallelDataset` object with `self.df` holding the ingested simple bitext.
        """

        if isinstance(src_input_files, str) and isinstance(tgt_input_files, str):
            src_input_files = [src_input_files]
            tgt_input_files = [tgt_input_files]
        elif not isinstance(src_input_files, list) or not isinstance(
            tgt_input_files, list
        ):
            raise TypeError("Both file inputs must be strings or lists.")

        # use default doc id for now
        # but in the future it might be useful to allow customizing doc id by passing a prefix
        df_files = []
        # We do not use `dd.from_map` because an individual file could be pretty large,
        # hence it's not appropriate to partition based on individual files.
        # Instead, we concatenate all the individual files and repartition.
        for src_input_file, tgt_input_file in zip(src_input_files, tgt_input_files):
            df_file = ParallelDataset.read_single_simple_bitext_file_pair(
                (src_input_file, tgt_input_file),
                src_lang=src_lang,
                tgt_lang=tgt_lang,
                backend=backend,
                add_filename=add_filename,
            )
            df_files.append(df_file)

        if backend == "cudf":
            df = cudf
        else:
            df = pd

        data = dd.from_pandas(df.concat(df_files), npartitions=npartitions)
        return cls(data)

    def to_bitext(
        self,
        output_file_dir,
        write_to_filename=False,
    ):
        """See the `nemo_curator.utils.distributed_utils.write_to_disk` docstring for parameter usage."""
        write_to_disk(
            df=self.df,
            output_file_dir=output_file_dir,
            write_to_filename=write_to_filename,
            output_type="bitext",
        )

    @staticmethod
    def read_single_simple_bitext_file_pair(
        input_file_pair: Tuple[str],
        src_lang: str,
        tgt_lang: str,
        doc_id: str = None,
        backend: str = "cudf",
        add_filename: bool = False,
    ) -> Union[dd.DataFrame, "dask_cudf.DataFrame"]:
        """This function reads a pair of "simple bitext" files into a pandas DataFrame.
        A simple bitext is a common data format in machine translation.
        It consists of two plain text files with the same number of lines, each line pair being translations of each other. For example:

        data.de:

        ```
        Wir besitzen keine Reisetaschen aus Leder.
        Die Firma produziert Computer für den deutschen Markt.
        ...
        ```

        data.en:

        ```
        We don't own duffel bags made of leather.
        The company produces computers for the German market.
        ...
        ```

        For simplicity, we also assume that the names of the two text files share the same prefix, differing only in the language code at the end as the file extension.

        Args:
            input_file_pair (Tuple[str]): A pair of file paths pointing to the input files
            src_lang (str): Source language, in ISO-639-1 (two character) format (e.g. 'en')
            tgt_lang (str): Target language, in ISO-639-1 (two character) format (e.g. 'en')
            doc_id (str, optional): A string document id to assign to every segment in the file. Defaults to None.
            backend (str, optional): Backend of the data frame. Defaults to "cudf".
            add_filename (bool, optional): Add filename as an extra field to every segment in the file. Defaults to False.

        Returns:
            Union[dd.DataFrame, dask_cudf.DataFrame]
        """
        src_input_file, tgt_input_file = input_file_pair
        assert remove_path_extension(src_input_file) == remove_path_extension(
            tgt_input_file
        ), f"Assuming source and target filenames would have common prefix before language code, but got {src_input_file} and {tgt_input_file}."

        if not doc_id:
            doc_id = "▁".join([src_input_file, tgt_input_file])

        if backend == "cudf":
            df = cudf
        else:
            df = pd

        df_src = df.read_csv(
            src_input_file, names=["src"], sep="\t", quoting=csv.QUOTE_NONE
        )
        df_tgt = df.read_csv(
            tgt_input_file, names=["tgt"], sep="\t", quoting=csv.QUOTE_NONE
        )
        assert len(df_src) == len(
            df_tgt
        ), f"We assume the source and target file would have the same number of lines, but got {len(df_src)} and {len(df_tgt)}."
        df_combined = df.concat([df_src, df_tgt], axis=1)
        df_combined["doc_id"] = doc_id
        df_combined["src_lang"] = src_lang
        df_combined["tgt_lang"] = tgt_lang

        if add_filename:
            df_combined["filename"] = remove_path_extension(src_input_file)

        return df_combined
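
The reading logic above can be exercised standalone with plain pandas (a sketch assuming only pandas is installed; file names and contents follow the docstring's own example, with temporary paths standing in for real data):

```python
# Standalone sketch of the simple-bitext reading logic, using only pandas.
import csv
import os
import tempfile

import pandas as pd

tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "data.de")
tgt_path = os.path.join(tmp_dir, "data.en")

with open(src_path, "w", encoding="utf-8") as f:
    f.write("Wir besitzen keine Reisetaschen aus Leder.\n"
            "Die Firma produziert Computer für den deutschen Markt.\n")
with open(tgt_path, "w", encoding="utf-8") as f:
    f.write("We don't own duffel bags made of leather.\n"
            "The company produces computers for the German market.\n")

# QUOTE_NONE plus a tab separator keeps each raw line intact as a single field.
df_src = pd.read_csv(src_path, names=["src"], sep="\t", quoting=csv.QUOTE_NONE)
df_tgt = pd.read_csv(tgt_path, names=["tgt"], sep="\t", quoting=csv.QUOTE_NONE)
assert len(df_src) == len(df_tgt), "bitext sides must align line-by-line"

pair_df = pd.concat([df_src, df_tgt], axis=1)
pair_df["doc_id"] = "▁".join([src_path, tgt_path])
pair_df["src_lang"] = "de"
pair_df["tgt_lang"] = "en"
print(pair_df.shape)  # (2, 5)
```

Each output row is one aligned segment pair carrying its language codes and a document id, which is the shape the downstream bitext filters operate on.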

nemo_curator/filters/__init__.py

Lines changed: 12 additions & 1 deletion

@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-from .classifier_filter import FastTextLangId, FastTextQualityFilter
+from .bitext_filter import BitextFilter
+from .classifier_filter import (
+    FastTextLangId,
+    FastTextQualityFilter,
+    QualityEstimationFilter,
+)
 from .code import (
     AlphaFilter,
     GeneralCommentToCodeFilter,
@@ -29,6 +34,8 @@
     BulletsFilter,
     CommonEnglishWordsFilter,
     EllipsisFilter,
+    HistogramFilter,
+    LengthRatioFilter,
     LongWordFilter,
     MeanWordLengthFilter,
     NonAlphaNumericFilter,
@@ -51,6 +58,7 @@
 from .synthetic import AnswerabilityFilter, EasinessFilter
 
 __all__ = [
+    "BitextFilter",
     "DocumentFilter",
     "import_filter",
     "FastTextLangId",
@@ -85,6 +93,9 @@
     "AlphaFilter",
     "HTMLBoilerplateFilter",
     "PerExtensionFilter",
+    "LengthRatioFilter",
+    "HistogramFilter",
+    "QualityEstimationFilter",
     "AnswerabilityFilter",
     "EasinessFilter",
 ]
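
The exact interface of the newly exported LengthRatioFilter isn't shown in this diff. As a rough illustration of what a length-ratio heuristic does on bitext (a toy sketch with assumed semantics and a hypothetical `length_ratio_ok` helper, not the library's implementation):

```python
# Toy length-ratio heuristic for bitext: drop segment pairs whose word counts
# diverge too much, a common sign of misalignment. Assumed semantics only;
# not the actual nemo_curator.filters.LengthRatioFilter.
def length_ratio_ok(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Keep a pair only if neither side is more than max_ratio times longer in words."""
    n_src = max(len(src.split()), 1)
    n_tgt = max(len(tgt.split()), 1)
    return max(n_src / n_tgt, n_tgt / n_src) <= max_ratio


pairs = [
    ("Die Firma produziert Computer.", "The company produces computers."),
    ("Hallo", "Hello there my good old friend from secondary school"),
]
kept = [(s, t) for s, t in pairs if length_ratio_ok(s, t)]
print(len(kept))  # 1: the second pair's length ratio exceeds 3.0
```

Heuristics like this are cheap pre-filters; model-based quality estimation (the QualityEstimationFilter exported above) then scores the surviving pairs.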
