
Commit 8128aa4

Improved Deduplication (#412)
* update script name
* update example
* Added cross deduplicate script
* create separate dir for deduplicate scripts
* Update dedup config
* add readme
* update comments
* update dependencies and deduplicate output
* add arabic config
* add deduplication results and report
* update en results
* final touch
* [pre-commit.ci] auto fixes from pre-commit.com hooks — for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 2077ca7 commit 8128aa4

28 files changed (+1342 −779 lines)

.gitignore

Lines changed: 5 additions & 0 deletions
```diff
@@ -149,3 +149,8 @@ lid.189.bin

 # Visualization json files
 **/*examples_with_stats*
+
+# Deduplication temp files
+**/outputs
+**/cache
+**/.vscode
```

ac_dc/README.md

Lines changed: 1 addition & 62 deletions
@@ -33,68 +33,7 @@

#### 5. Do the deduplication

The detailed deduplication walkthrough below was removed from this README and replaced by a one-line pointer to the new `ac_dc/deduplicate` sub folder (the replacement line is shown at the end of the hunk). The removed content:

Do the deduplication, which is detailed in the following section, with the file [deduplicate.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/deduplicate.py).

### Deduplication

A runnable script example is at `ac_dc/examples/dedup.sh`.
#### 0. Sharding a dataset

We want to shard a dataset into multiple shards so that each node on the HPC cluster can take one shard, and each shard can be further parallelized across CPU cores.

```bash
python ac_dc/deduplicate.py create-shards "cache/sharded" 5 --path "oscar-corpus/OSCAR-2109" --name "deduplicated_af" --split "train"
# or
python ac_dc/deduplicate.py create-shards "cache/sharded" 5 --path "oscar-corpus/OSCAR-2109" --name "deduplicated_af" --data-dir "local path to data directory" --split "train"
```

This loads the dataset (from the Hub, or from a local data directory with `--data-dir`) and segments its `train` split into 5 shards/sub-datasets under `cache/sharded`. This gives you

```
cache/sharded
├── sharded_00000.jsonl
├── sharded_00001.jsonl
├── sharded_00002.jsonl
├── sharded_00003.jsonl
└── sharded_00004.jsonl
```
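For orientation, here is a minimal sketch of what a sharding step like `create-shards` might boil down to with the `datasets` library; it is an assumption for illustration, not the actual code in `ac_dc/deduplicate.py`:

```python
# Sketch: split the "train" split into 5 contiguous JSONL shards.
# Mirrors the CLI example above; not the actual create-shards implementation.
import os

from datasets import load_dataset

NUM_SHARDS = 5
OUT_DIR = "cache/sharded"
os.makedirs(OUT_DIR, exist_ok=True)

ds = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_af", split="train")

for i in range(NUM_SHARDS):
    # contiguous=True keeps each shard as one consecutive block of rows.
    shard = ds.shard(num_shards=NUM_SHARDS, index=i, contiguous=True)
    shard.to_json(os.path.join(OUT_DIR, f"sharded_{i:05d}.jsonl"))
```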
#### 1. Create Simhashes

```bash
# run each command on a separate node
python ac_dc/deduplicate.py build-hashes "cache/deduplicated_af_hashes_00001" --data-files "sharded_00000.jsonl" --data-files "sharded_00001.jsonl" --path "cache/sharded" --split "train"
python ac_dc/deduplicate.py build-hashes "cache/deduplicated_af_hashes_00002" --data-files "sharded_00002.jsonl" --data-files "sharded_00003.jsonl" --path "cache/sharded" --split "train"
python ac_dc/deduplicate.py build-hashes "cache/deduplicated_af_hashes_00003" --data-files "sharded_00004.jsonl" --path "cache/sharded" --split "train"
```

The above commands add an additional column `hash` to the data and output three datasets at `cache/deduplicated_af_hashes_0000{1,2,3}`. This is useful for a large dataset, since each node/worker can hash some shards of the data in parallel.
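The `hash` column holds a Simhash fingerprint per document. As a self-contained illustration of the idea (the tokenization and hash function actually used by `deduplicate.py` may differ), a 64-bit Simhash can be computed like this:

```python
# Sketch: 64-bit Simhash of a text from word unigrams.
# Illustrative only; the script's tokenization/hashing may differ.
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            # Every token votes +1/-1 on each bit of the fingerprint.
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bit i is set if the overall vote for it is positive.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


print(hex(simhash("the quick brown fox jumps over the lazy dog")))
```

Near-duplicate documents then end up with fingerprints that differ in only a few bits.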
#### 2. Create a Simhash Index

```bash
python ac_dc/deduplicate.py build-index "cache/deduplicated_af_simhash_index.ann" "cache/deduplicated_af_hashes_00001" "cache/deduplicated_af_hashes_00002" "cache/deduplicated_af_hashes_00003" --split "train"
```

This creates the index file based on ALL the hashed datasets. This is a merge step and takes O(n) time.
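One way such an index can work, sketched here under the assumption of 64-bit Simhashes and the `--threshold 3` used in the next step (the structure actually built by `build-index` may differ), is to bucket each fingerprint by bit-bands so that near-duplicates are guaranteed to collide in at least one bucket:

```python
# Sketch: bucket 64-bit Simhashes into 4 bands of 16 bits.
# If two hashes differ in at most 3 bits, those bits touch at most 3 of the
# 4 bands, so the hashes agree exactly on at least one band (pigeonhole).
# Illustrative only; not necessarily what build-index constructs.
from collections import defaultdict

BANDS, BAND_BITS = 4, 16
MASK = (1 << BAND_BITS) - 1
index = defaultdict(set)  # (band number, band value) -> document ids


def band_keys(fingerprint: int):
    return [(b, (fingerprint >> (b * BAND_BITS)) & MASK) for b in range(BANDS)]


def add(doc_id: int, fingerprint: int) -> None:
    for key in band_keys(fingerprint):
        index[key].add(doc_id)


def candidates(fingerprint: int) -> set:
    # Union of all documents sharing at least one band with the query.
    return set().union(*(index[key] for key in band_keys(fingerprint)))
```

Candidates are then confirmed or rejected with an exact Hamming-distance check, which is what the next step does.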
#### 3. Find Duplicates

```bash
# run each command on a separate node
LOG_LEVEL="INFO" python ac_dc/deduplicate.py find-duplicates "cache/deduplicated_af_hashes_00001" "cache/deduplicated_af_simhash_index.pkl" --split "train" --k 100 --threshold 3
LOG_LEVEL="INFO" python ac_dc/deduplicate.py find-duplicates "cache/deduplicated_af_hashes_00002" "cache/deduplicated_af_simhash_index.pkl" --split "train" --k 100 --threshold 3
LOG_LEVEL="INFO" python ac_dc/deduplicate.py find-duplicates "cache/deduplicated_af_hashes_00003" "cache/deduplicated_af_simhash_index.pkl" --split "train" --k 100 --threshold 3
```

This uses the index to add another column `duplicates` to the data and outputs the results to `cache/deduplicated_af_hashes_0000{1,2,3}_duplicates`.
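Here `--threshold 3` reads as the maximum Hamming distance between two Simhashes for them to be flagged as duplicates, and `--k 100` presumably caps the number of candidates considered per document (both are my reading of the CLI above). The distance criterion itself is a one-liner:

```python
# Sketch: the near-duplicate criterion implied by --threshold 3.
def is_near_duplicate(hash_a: int, hash_b: int, threshold: int = 3) -> bool:
    # Hamming distance = number of set bits in the XOR of the two hashes.
    return bin(hash_a ^ hash_b).count("1") <= threshold


assert is_near_duplicate(0b10110, 0b10111)       # distance 1
assert not is_near_duplicate(0xFFFF, 0x0000)     # distance 16
```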
#### 4. Remove Duplicates

```bash
python ac_dc/deduplicate.py remove-duplicates "cache/deduplicated_af_hashes_00001_duplicates" "cache/deduplicated_af_hashes_00002_duplicates" "cache/deduplicated_af_hashes_00003_duplicates" --split "train"
```

This removes all duplicates from the given datasets and outputs `cache/deduplicated_af_hashes_0000{1,2,3}_deduplicated`. It is only partially parallelized, because there is a step that finds connected components of duplicates, and that step takes O(n) time.
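The connected-components step groups together every document that is linked, directly or transitively, through the duplicate pairs, so that one representative per group can be kept. A self-contained union-find sketch of that idea (the representative choice, the smallest id here, is an assumption for illustration):

```python
# Sketch: group documents connected by duplicate pairs and keep one per group.
# The choice of representative (min id) is an assumption for illustration.
def connected_components(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


pairs = [(0, 3), (3, 7), (5, 6)]        # e.g. read from the `duplicates` column
groups = connected_components(pairs)
keep = {min(g) for g in groups}
drop = set().union(*groups) - keep
print(sorted(keep), sorted(drop))       # [0, 5] [3, 6, 7]
```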
#### 5. Merge Shards

```bash
python ac_dc/deduplicate.py merge-shards "cache/simhash_deduplicated_af" "cache/deduplicated_af_hashes_00001_deduplicated" "cache/deduplicated_af_hashes_00002_deduplicated" "cache/deduplicated_af_hashes_00003_deduplicated" --split "train"
```

This merges all shards back into one dataset.
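A hedged sketch of the merge using the `datasets` library, assuming each per-shard output was saved with `save_to_disk` (the actual `merge-shards` implementation may load and write the shards differently):

```python
# Sketch: concatenate the deduplicated shards back into one dataset.
# Assumes each directory was written with Dataset.save_to_disk.
from datasets import concatenate_datasets, load_from_disk

shard_dirs = [
    "cache/deduplicated_af_hashes_00001_deduplicated",
    "cache/deduplicated_af_hashes_00002_deduplicated",
    "cache/deduplicated_af_hashes_00003_deduplicated",
]
merged = concatenate_datasets([load_from_disk(d) for d in shard_dirs])
merged.save_to_disk("cache/simhash_deduplicated_af")
```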
Replaced with:

Do the deduplication, which is detailed in the sub folder `ac_dc/deduplicate`.


### Merge metadata from OSCAR 21.09 to OSCAR

ac_dc/anonymization.py

Lines changed: 2 additions & 1 deletion
A blank line was added after `trannum` and the `tag_type` default was reformatted by the pre-commit auto-fixes (quote style and spacing):

```diff
@@ -3,12 +3,13 @@

 trannum = str.maketrans("0123456789", "1111111111")

+
 def apply_regex_anonymization(
     sentence: str,
     lang_id: str,
     context_window: int = 20,
     anonymize_condition=None,
-    tag_type= {'IP_ADDRESS', 'KEY', 'ID', 'PHONE', 'USER', 'EMAIL', 'LICENSE_PLATE'},
+    tag_type={"IP_ADDRESS", "KEY", "ID", "PHONE", "USER", "EMAIL", "LICENSE_PLATE"},
 ) -> str:
     """
     Params:
```
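For context, a hypothetical call against this signature, assuming `ac_dc` is importable as a package; only the parameter names and the `str` return annotation come from the hunk above, while the argument values are made up for illustration:

```python
# Hypothetical usage; parameter names are taken from the signature above,
# the argument values are assumptions for illustration.
from ac_dc.anonymization import apply_regex_anonymization

result = apply_regex_anonymization(
    sentence="Contact me at jane.doe@example.com or +1 555 0100.",
    lang_id="en",
    context_window=20,
    anonymize_condition=True,
    tag_type={"EMAIL", "PHONE"},
)
print(result)
```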
