* update script name
* update example
* Added cross deduplicate script
* create separate dir for deduplicate scripts
* Update dedup config
* add readme
* update comments
* update dependencies and deduplicate output
* add arabic config
* add deduplication results and report
* update en results
* final touch
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
ac_dc/README.md: 1 addition & 62 deletions
@@ -33,68 +33,7 @@ Run the filtering with the file [main_filtering.py](https://github.com/bigscienc

#### 5. Do the deduplication

-
-Do the deduplication, which is detailed in the following section, with the file [deduplicate.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/deduplicate.py).
-
-
-### Deduplication
-
-Runnable script example at `ac_dc/examples/dedup.sh`
-
-#### 0. Sharding a dataset
-
-We want to shard a dataset into multiple shards so that each node on the HPC cluster can take one shard, and each shard can be further parallelized across CPU cores.
-The above commands add an additional column `hash` to the data and output two datasets at `cache/en_hashes_00001` and `cache/en_hashes_00002`. This is useful for a large dataset, since each node/worker can hash some shards of the data in parallel.
-This removes all duplicates from the given datasets and outputs `cache/en_hashes_0000{1,2,3}_deduplicated`; it is only partially parallelized because there is a step that finds connected components of duplicates, which takes O(n) time.
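The removed README text refers to commands ("The above commands") that this diff view does not preserve. As a hedged illustration of the workflow those lines describe (shard a dataset so each HPC node takes one shard, add a `hash` column in parallel, and write outputs like `cache/en_hashes_00001`), here is a minimal sketch using the Hugging Face `datasets` library. This is not the repository's deduplicate.py: the OSCAR dataset name, the MD5 hash, and the `num_proc` value are assumptions; only the `hash` column and the output paths come from the removed text.

```python
# Minimal sketch, NOT the repository's deduplicate.py: the dataset name,
# MD5 hashing, and num_proc are illustrative assumptions; the `hash` column
# and cache/en_hashes_0000N output paths come from the removed README text.
import hashlib

from datasets import load_dataset

NUM_SHARDS = 2  # e.g. one shard per HPC node


def add_hash(example):
    # Content hash used later to group duplicate documents.
    return {"hash": hashlib.md5(example["text"].encode("utf-8")).hexdigest()}


dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train")

for index in range(NUM_SHARDS):
    shard = dataset.shard(num_shards=NUM_SHARDS, index=index)
    # Each node/worker hashes its own shard; num_proc parallelizes over cores.
    shard = shard.map(add_hash, num_proc=8)
    shard.save_to_disk(f"cache/en_hashes_{index + 1:05d}")
```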
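The last removed line attributes the sequential step to finding connected components of duplicates. A common way to implement that step is union-find; the sketch below is a hedged simplification that links records sharing an exact `hash` value (the actual pipeline may instead link near-duplicate candidate pairs, e.g. from MinHash/LSH), and none of these names come from the repository's code.

```python
# Sketch of the sequential connected-components step described in the removed
# text: records sharing a hash are linked, and one representative per
# component is kept. Union-find here is illustrative, not the repo's code.
def deduplicate(records):
    parent = {}

    def find(x):
        # Path-halving find: follow parents to the root, compressing as we go.
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # hash value -> index of first record with that hash
    for i, rec in enumerate(records):
        h = rec["hash"]
        if h in first_seen:
            union(i, first_seen[h])  # same hash => same duplicate component
        else:
            first_seen[h] = i

    # Keep one representative per connected component.
    kept_roots = set()
    deduplicated = []
    for i, rec in enumerate(records):
        root = find(i)
        if root not in kept_roots:
            kept_roots.add(root)
            deduplicated.append(rec)
    return deduplicated


# Hypothetical usage:
# records = [{"hash": "a", "text": "..."}, {"hash": "a", "text": "..."}]
# unique = deduplicate(records)
```

Since every record must be visited to resolve its component root, this pass is O(n) and cannot be sharded across nodes the way hashing can, which matches the removed text's caveat about partial parallelism.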