Commit 3d14b0d
Add support for parallel data curation (#193)
* add data interface to read simple bitext
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* adding ParallelScoreFilter
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* add test for ParallelScoreFilter, small style change for ParallelDataset test, fix a few data and import bugs
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* allow ParallelScoreFilter to take different filters for source and target
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* add JointScoreFilter and LengthRatioFilter
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* [WIP] add heuristic filter w/o test
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* merge with main
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* add test for histogram filter, fix a few bugs
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* length ratio, joint score filter testing
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* fix typing in joint test
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* add a fake comet qe filter as an initial step
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* [WIP] adding bitext cleaning tutorial
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* [WIP] fixing example
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* fix slow histogram filter, fix faulty bitext loading
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* tutorial running
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* [WIP] documentation of bitext tutorial
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* add tested version of comet-qe filter
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* fix ParallelDataset bug where single file name is not accepted, and dataset is sometimes turned into its parent class by mistake, add write to simple bitext functionality, update bitext tutorial
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* add docstring to explain simple bitext format, fix a bug where file extensions are removed twice before writing
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* remove print line for debug
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* add comet filter to tutorial
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* refactor COMET QE filter to decouple model from filter, make sure JointScoreFilter can take more than one field for source and target
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* use refactored qe filter
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* wrap_qe_input should be a static method
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* use conditional import for comet, formatting changes
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* [WIP] add cometoid
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* [WIP] attempt to resolve device conflict but is failing
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* [WIP] playing with cometoid arguments
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* [WIP] -d 0 doesn't look necessary
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* tested arguments for Cometoid
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* use proper safe import, make sure test doesn't crash sans comet/pymarian
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* falling back to comet for tutorial since that's easier to set up, update README
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* give credit to original fairseq implementation of histogram filtering, run black formatter
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* fix pre-commit complaint
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* fix small bug
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* fix another occurrence of the same bug
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* introduce shard limit to a single PyMarian API call to avoid memory leakage
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* repartition after reading simple bitext data
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* -d 0 is actually needed for pymarian
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* remove duplicate LengthRatioFilter definition
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* refactor repeated code segment in file writing, change classifier to accommodate custom field names, pause doc repartition since it causes problems
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* [WIP] addressed comments in #193 apart from resolving .iloc pattern, test currently failing
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* refactor to resolve .loc pattern, test passing
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* add missing file
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* revert changes in setup.py
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* fix a small bug in parallel dataset, explain why repartition is disabled, fix tutorial
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* add api guide, small change on bitext/parallel score filter docstring
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* fix read_simple_bitext test issues
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* reinstate dependencies lost during merging
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* re-enable multiple partitions for simple bitext, add parallel write
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* take care of the case where filename is not supplied in dataframe, make logic clearer
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* address other minor comments in the PR, fix segment order scrambling
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* fix test errors, add bitext dependencies
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* add back more missing imports
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* add bitext to [all] in .toml, add platformdirs as dependency
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* merge upstream, remove old bitext requirement list
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
* delete requirement file again
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
---------
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Co-authored-by: nverma1 <neha.verma2017@gmail.com>
1 parent b15b08a commit 3d14b0d
File tree
23 files changed: +1490 -30 lines changed
- docs/user-guide/api
- nemo_curator
- datasets
- filters
- models
- modules
- utils
- tests
- bitext_data
- tutorials/bitext_cleaning
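
To illustrate the kind of check a length-ratio filter such as the LengthRatioFilter named in the commit message performs, here is a minimal standalone sketch. It does not use the NeMo Curator API: the function names, the whitespace tokenization, and the `max_ratio` default are assumptions made for illustration only, and the actual class added in this commit may differ in all of them.

```python
from typing import Iterable, List, Tuple


def length_ratio(src: str, tgt: str) -> float:
    """Return longer/shorter length of a segment pair, in whitespace tokens.

    Hypothetical helper: whitespace tokenization is an assumption.
    """
    src_len = len(src.split())
    tgt_len = len(tgt.split())
    shorter = min(src_len, tgt_len)
    if shorter == 0:
        # An empty side can never be a valid translation pair.
        return float("inf")
    return max(src_len, tgt_len) / shorter


def filter_by_length_ratio(
    pairs: Iterable[Tuple[str, str]], max_ratio: float = 3.0
) -> List[Tuple[str, str]]:
    """Keep only pairs whose length ratio does not exceed max_ratio."""
    return [(s, t) for s, t in pairs if length_ratio(s, t) <= max_ratio]


# Line-aligned source/target segments, as in a simple bitext.
pairs = [
    ("Hello world", "Hallo Welt"),
    ("Yes", "Ja , das ist wirklich eine sehr lange Antwort"),
    ("", "Leer"),
]
kept = filter_by_length_ratio(pairs, max_ratio=3.0)
print(kept)
```

Taking the ratio of the longer side to the shorter side makes the check symmetric, so a single threshold covers both overly long sources and overly long targets.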