
Commit 8128aa4

Improved Deduplication (#412)
* update script name
* update example
* Added cross deduplicate script
* create separate dir for deduplicate scripts
* Update dedup config
* add readme
* update comments
* update dependencies and deduplicate output
* add arabic config
* add deduplication results and report
* update en results
* final touch
* [pre-commit.ci] auto fixes from pre-commit.com hooks — for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 2077ca7 commit 8128aa4

28 files changed (+1342 −779 lines)

.gitignore

Lines changed: 5 additions & 0 deletions
```diff
@@ -149,3 +149,8 @@ lid.189.bin

 # Visualization json files
 **/*examples_with_stats*
+
+# Deduplication temp files
+**/outputs
+**/cache
+**/.vscode
```

ac_dc/README.md

Lines changed: 1 addition & 62 deletions
@@ -33,68 +33,7 @@

#### 5. Do the deduplication

The detailed deduplication walkthrough below was removed from this README and replaced by a one-line pointer to the new `ac_dc/deduplicate` sub folder (the replacement line is shown at the end of the hunk). The removed content:

Do the deduplication, which is detailed in the following section, with the file [deduplicate.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/deduplicate.py).

### Deduplication

A runnable script example is at `ac_dc/examples/dedup.sh`.
#### 0. Sharding a dataset

We want to shard a dataset into multiple shards so that each node on the HPC cluster can take one shard, and each shard can be further parallelized across CPU cores.

```bash
python ac_dc/deduplicate.py create-shards "cache/sharded" 5 --path "oscar-corpus/OSCAR-2109" --name "deduplicated_af" --split "train"
# or
python ac_dc/deduplicate.py create-shards "cache/sharded" 5 --path "oscar-corpus/OSCAR-2109" --name "deduplicated_af" --data-dir "local path to data directory" --split "train"
```

This loads the dataset (from the Hub, or from a local data directory with `--data-dir`) and segments its `train` split into 5 shards/sub-datasets under `cache/sharded`. This gives you

```
cache/sharded
├── sharded_00000.jsonl
├── sharded_00001.jsonl
├── sharded_00002.jsonl
├── sharded_00003.jsonl
└── sharded_00004.jsonl
```
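For orientation, here is a minimal sketch of what a sharding step like `create-shards` might boil down to with the `datasets` library; it is an assumption for illustration, not the actual code in `ac_dc/deduplicate.py`:

```python
# Sketch: split the "train" split into 5 contiguous JSONL shards.
# Mirrors the CLI example above; not the actual create-shards implementation.
import os

from datasets import load_dataset

NUM_SHARDS = 5
OUT_DIR = "cache/sharded"
os.makedirs(OUT_DIR, exist_ok=True)

ds = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_af", split="train")

for i in range(NUM_SHARDS):
    # contiguous=True keeps each shard as one consecutive block of rows.
    shard = ds.shard(num_shards=NUM_SHARDS, index=i, contiguous=True)
    shard.to_json(os.path.join(OUT_DIR, f"sharded_{i:05d}.jsonl"))
```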
#### 1. Create Simhashes

```bash
# run each command on a separate node
python ac_dc/deduplicate.py build-hashes "cache/deduplicated_af_hashes_00001" --data-files "sharded_00000.jsonl" --data-files "sharded_00001.jsonl" --path "cache/sharded" --split "train"
python ac_dc/deduplicate.py build-hashes "cache/deduplicated_af_hashes_00002" --data-files "sharded_00002.jsonl" --data-files "sharded_00003.jsonl" --path "cache/sharded" --split "train"
python ac_dc/deduplicate.py build-hashes "cache/deduplicated_af_hashes_00003" --data-files "sharded_00004.jsonl" --path "cache/sharded" --split "train"
```

The above commands add an additional column `hash` to the data and output three datasets at `cache/deduplicated_af_hashes_0000{1,2,3}`. This is useful for a large dataset, since each node/worker can hash some shards of the data in parallel.
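The `hash` column holds a Simhash fingerprint per document. As a self-contained illustration of the idea (the tokenization and hash function actually used by `deduplicate.py` may differ), a 64-bit Simhash can be computed like this:

```python
# Sketch: 64-bit Simhash of a text from word unigrams.
# Illustrative only; the script's tokenization/hashing may differ.
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            # Every token votes +1/-1 on each bit of the fingerprint.
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bit i is set if the overall vote for it is positive.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


print(hex(simhash("the quick brown fox jumps over the lazy dog")))
```

Near-duplicate documents then end up with fingerprints that differ in only a few bits.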
#### 2. Create a Simhash Index

```bash
python ac_dc/deduplicate.py build-index "cache/deduplicated_af_simhash_index.ann" "cache/deduplicated_af_hashes_00001" "cache/deduplicated_af_hashes_00002" "cache/deduplicated_af_hashes_00003" --split "train"
```

This creates the index file based on ALL the hashed datasets. This is a merge step and takes O(n) time.
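One way such an index can work, sketched here under the assumption of 64-bit Simhashes and the `--threshold 3` used in the next step (the structure actually built by `build-index` may differ), is to bucket each fingerprint by bit-bands so that near-duplicates are guaranteed to collide in at least one bucket:

```python
# Sketch: bucket 64-bit Simhashes into 4 bands of 16 bits.
# If two hashes differ in at most 3 bits, those bits touch at most 3 of the
# 4 bands, so the hashes agree exactly on at least one band (pigeonhole).
# Illustrative only; not necessarily what build-index constructs.
from collections import defaultdict

BANDS, BAND_BITS = 4, 16
MASK = (1 << BAND_BITS) - 1
index = defaultdict(set)  # (band number, band value) -> document ids


def band_keys(fingerprint: int):
    return [(b, (fingerprint >> (b * BAND_BITS)) & MASK) for b in range(BANDS)]


def add(doc_id: int, fingerprint: int) -> None:
    for key in band_keys(fingerprint):
        index[key].add(doc_id)


def candidates(fingerprint: int) -> set:
    # Union of all documents sharing at least one band with the query.
    return set().union(*(index[key] for key in band_keys(fingerprint)))
```

Candidates are then confirmed or rejected with an exact Hamming-distance check, which is what the next step does.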
#### 3. Find Duplicates

```bash
# run each command on a separate node
LOG_LEVEL="INFO" python ac_dc/deduplicate.py find-duplicates "cache/deduplicated_af_hashes_00001" "cache/deduplicated_af_simhash_index.pkl" --split "train" --k 100 --threshold 3
LOG_LEVEL="INFO" python ac_dc/deduplicate.py find-duplicates "cache/deduplicated_af_hashes_00002" "cache/deduplicated_af_simhash_index.pkl" --split "train" --k 100 --threshold 3
LOG_LEVEL="INFO" python ac_dc/deduplicate.py find-duplicates "cache/deduplicated_af_hashes_00003" "cache/deduplicated_af_simhash_index.pkl" --split "train" --k 100 --threshold 3
```

This uses the index to add another column `duplicates` to the data and outputs the results to `cache/deduplicated_af_hashes_0000{1,2,3}_duplicates`.
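Here `--threshold 3` reads as the maximum Hamming distance between two Simhashes for them to be flagged as duplicates, and `--k 100` presumably caps the number of candidates considered per document (both are my reading of the CLI above). The distance criterion itself is a one-liner:

```python
# Sketch: the near-duplicate criterion implied by --threshold 3.
def is_near_duplicate(hash_a: int, hash_b: int, threshold: int = 3) -> bool:
    # Hamming distance = number of set bits in the XOR of the two hashes.
    return bin(hash_a ^ hash_b).count("1") <= threshold


assert is_near_duplicate(0b10110, 0b10111)       # distance 1
assert not is_near_duplicate(0xFFFF, 0x0000)     # distance 16
```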
#### 4. Remove Duplicates

```bash
python ac_dc/deduplicate.py remove-duplicates "cache/deduplicated_af_hashes_00001_duplicates" "cache/deduplicated_af_hashes_00002_duplicates" "cache/deduplicated_af_hashes_00003_duplicates" --split "train"
```

This removes all duplicates from the given datasets and outputs `cache/deduplicated_af_hashes_0000{1,2,3}_deduplicated`. It is only partially parallelized, because there is a step that finds connected components of duplicates, and that step takes O(n) time.
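The connected-components step groups together every document that is linked, directly or transitively, through the duplicate pairs, so that one representative per group can be kept. A self-contained union-find sketch of that idea (the representative choice, the smallest id here, is an assumption for illustration):

```python
# Sketch: group documents connected by duplicate pairs and keep one per group.
# The choice of representative (min id) is an assumption for illustration.
def connected_components(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


pairs = [(0, 3), (3, 7), (5, 6)]        # e.g. read from the `duplicates` column
groups = connected_components(pairs)
keep = {min(g) for g in groups}
drop = set().union(*groups) - keep
print(sorted(keep), sorted(drop))       # [0, 5] [3, 6, 7]
```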
#### 5. Merge Shards

```bash
python ac_dc/deduplicate.py merge-shards "cache/simhash_deduplicated_af" "cache/deduplicated_af_hashes_00001_deduplicated" "cache/deduplicated_af_hashes_00002_deduplicated" "cache/deduplicated_af_hashes_00003_deduplicated" --split "train"
```

This merges all shards back into one dataset.
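A hedged sketch of the merge using the `datasets` library, assuming each per-shard output was saved with `save_to_disk` (the actual `merge-shards` implementation may load and write the shards differently):

```python
# Sketch: concatenate the deduplicated shards back into one dataset.
# Assumes each directory was written with Dataset.save_to_disk.
from datasets import concatenate_datasets, load_from_disk

shard_dirs = [
    "cache/deduplicated_af_hashes_00001_deduplicated",
    "cache/deduplicated_af_hashes_00002_deduplicated",
    "cache/deduplicated_af_hashes_00003_deduplicated",
]
merged = concatenate_datasets([load_from_disk(d) for d in shard_dirs])
merged.save_to_disk("cache/simhash_deduplicated_af")
```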
Replaced with:

Do the deduplication, which is detailed in the sub folder `ac_dc/deduplicate`.


### Merge metadata from OSCAR 21.09 to OSCAR

ac_dc/anonymization.py

Lines changed: 2 additions & 1 deletion
A blank line was added after `trannum` and the `tag_type` default was reformatted by the pre-commit auto-fixes (quote style and spacing):

```diff
@@ -3,12 +3,13 @@

 trannum = str.maketrans("0123456789", "1111111111")

+
 def apply_regex_anonymization(
     sentence: str,
     lang_id: str,
     context_window: int = 20,
     anonymize_condition=None,
-    tag_type= {'IP_ADDRESS', 'KEY', 'ID', 'PHONE', 'USER', 'EMAIL', 'LICENSE_PLATE'},
+    tag_type={"IP_ADDRESS", "KEY", "ID", "PHONE", "USER", "EMAIL", "LICENSE_PLATE"},
 ) -> str:
     """
     Params:
```
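For context, a hypothetical call against this signature, assuming `ac_dc` is importable as a package; only the parameter names and the `str` return annotation come from the hunk above, while the argument values are made up for illustration:

```python
# Hypothetical usage; parameter names are taken from the signature above,
# the argument values are assumptions for illustration.
from ac_dc.anonymization import apply_regex_anonymization

result = apply_regex_anonymization(
    sentence="Contact me at jane.doe@example.com or +1 555 0100.",
    lang_id="en",
    context_window=20,
    anonymize_condition=True,
    tag_type={"EMAIL", "PHONE"},
)
print(result)
```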
