docs/user-guide/semdedup.rst
Lines changed: 3 additions & 1 deletion
@@ -57,6 +57,7 @@ Semantic deduplication in NeMo Curator can be configured using a YAML file. Here
 random_state: 1234
 sim_metric: "cosine"
 which_to_keep: "hard"
+batched_cosine_similarity: 1024
 sort_clusters: true
 kmeans_with_cos_dist: false
 clustering_input_partition_size: "2gb"
@@ -209,6 +210,7 @@ Use Individual Components
 id_column="doc_id",
 id_column_type="str",
 which_to_keep="hard",
+batched_cosine_similarity=1024,
 output_dir="path/to/output/deduped",
 logger="path/to/log/dir"
 )
@@ -257,7 +259,7 @@ Key parameters in the configuration file include:
 - ``n_clusters``: Number of clusters for k-means clustering.
 - ``eps_to_extract``: Deduplication threshold. Higher values result in more aggressive deduplication.
 - ``which_to_keep``: Strategy for choosing which duplicate to keep ("hard" or "soft").
-
+- ``batched_cosine_similarity``: Whether to use batched cosine similarity (has less memory usage, O(N*B) where B is the batch size) or vanilla cosine similarity (O(N^2) memory usage).
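The memory trade-off described in the added bullet can be made concrete with a rough back-of-the-envelope sketch. It assumes 4 bytes (one float32) per similarity entry and counts only the pairwise similarity matrix for a single cluster. `similarity_matrix_bytes` is a hypothetical helper for illustration, not part of NeMo Curator.

```python
def similarity_matrix_bytes(n_items: int, batch_size: int, bytes_per_entry: int = 4) -> int:
    """Estimate the peak size of the pairwise similarity matrix for one cluster.

    batch_size > 0  -> one (batch_size x n_items) block at a time, O(N*B)
    batch_size <= 0 -> the full (n_items x n_items) matrix, O(N^2)

    Assumption: 4 bytes per entry (float32); other buffers are ignored.
    """
    if batch_size > 0:
        rows = min(batch_size, n_items)  # a block never has more rows than items
    else:
        rows = n_items
    return rows * n_items * bytes_per_entry


n = 100_000  # hypothetical cluster size
print(f"batched (B=1024): {similarity_matrix_bytes(n, 1024) / 2**20:.1f} MiB")
print(f"unbatched:        {similarity_matrix_bytes(n, 0) / 2**30:.1f} GiB")
```

Under these assumptions the batched block stays in the hundreds of MiB for a cluster of 100,000 items, while the full matrix needs tens of GiB.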
 embedding_column (str): The column name that stores the embeddings.
 Default is "embeddings".
+batched_cosine_similarity (int): Whether to use batched cosine similarity (has less memory usage).
+Default is 1024. When greater than 0, batching is used and memory requirements are O(N*B) where N is the number of items in the cluster and B is the batch size.
+When less than or equal to 0, no batching is used and memory requirements are O(N^2) where N is the number of items in the cluster.
 logger (Union[logging.Logger, str]): Existing logger to log to, or a path to a log directory.
 Default is "./".
 profile_dir (Optional[str]): If specified, directory to write Dask profile.
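The batching behavior described in the docstring above can be sketched in plain Python. This is a minimal illustration of the idea only, not NeMo Curator's implementation (which is not shown in this diff). The function names are hypothetical, and the "maximum similarity to any earlier item" reduction is an assumed stand-in for whatever per-item statistic the deduplication step actually computes.

```python
import math
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def max_similarity_to_earlier(
    embeddings: Sequence[Sequence[float]], batch_size: int = 1024
) -> List[float]:
    """For each item i, return its largest cosine similarity to any item j < i.

    Rows are processed batch_size at a time, so only a (batch_size x N) block
    of similarities exists at once. With batch_size <= 0 all rows go into a
    single block, i.e. the full N x N matrix.
    """
    unit = [normalize(v) for v in embeddings]
    n = len(unit)
    step = batch_size if batch_size > 0 else max(n, 1)
    result: List[float] = []
    for start in range(0, n, step):
        stop = min(start + step, n)
        # Similarity block: rows start..stop-1 against all N items.
        block = [
            [sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(n)]
            for i in range(start, stop)
        ]
        for offset, row in enumerate(block):
            i = start + offset
            # Only earlier items count; the first item has none, so use 0.0.
            result.append(max(row[:i], default=0.0))
    return result


# Hypothetical toy cluster: items 0 and 1 are exact duplicates.
cluster = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(max_similarity_to_earlier(cluster, batch_size=2))
```

The batched and unbatched paths return the same values. Only the size of the intermediate block changes, which is where the O(N*B) versus O(N^2) difference comes from.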