Commit ebf6565

Authored by psriramsnc, vipul-mittal, and zephyrzilla
[Enhancement] Add Support for Semantic Deduplication (#104)
* Added Semantic Dedup Graph Post Processor
* Added ANN dedup, test cases and documentation
* Added Documentation page to mkdocs

---------

Co-authored-by: Vipul Mittal
Co-authored-by: Surajit Dasgupta
1 parent 427e8cf commit ebf6565

10 files changed: +851 -52 lines changed

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
# Semantic Deduplication

> **Remove near-duplicate generated records using embedding-based similarity as a graph post-processor**

## Overview

SyGra supports semantic deduplication as a **graph post-processing** step via:

`sygra.core.graph.graph_postprocessor.SemanticDedupPostProcessor`

It embeds a configured output field (e.g., `answer`, `description`) and removes items whose **cosine similarity** to a previously kept item is above a configurable threshold.

This is useful when:

- Your generation workflow tends to repeat the same or very similar answers.
- You are generating multiple records and want to reduce redundant samples.
- You want a report of duplicate pairs to inspect or tune dedup behavior.

## Quick Start

Add the post processor under `graph_post_process` in your task `graph_config.yaml`.

Example (dedup over `answer`, see `tasks/examples/semantic_dedup/graph_config.yaml`):

```yaml
graph_post_process:
  - processor: sygra.core.graph.graph_postprocessor.SemanticDedupPostProcessor
    params:
      field: answer
      similarity_threshold: 0.92
      id_field: id
      embedding_backend: sentence_transformers
      embedding_model: all-MiniLM-L6-v2
      dedup_mode: nearest_neighbor
      vectorstore_k: 20
      keep: first
      max_pairs_in_report: 1000
```

Example (dedup over `description`, see `tasks/examples/semantic_dedup_no_seed/graph_config.yaml`):

```yaml
graph_post_process:
  - processor: sygra.core.graph.graph_postprocessor.SemanticDedupPostProcessor
    params:
      field: description
      similarity_threshold: 0.85
      id_field: id
      embedding_backend: sentence_transformers
      embedding_model: all-MiniLM-L6-v2
      keep: first
      max_pairs_in_report: 1000
```

## Configuration Reference

### Parameters

All parameters are provided under `params:`.

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `field` | string | Field to embed and compare for similarity. If the field value is a list/tuple, values are joined with newlines. | `text` |
| `similarity_threshold` | float | Cosine similarity threshold. Higher values drop fewer items. | `0.9` |
| `id_field` | string | Optional ID field used in the report for readability. If missing, indices are used. | `id` |
| `embedding_backend` | string | Embedding backend. Currently only `sentence_transformers` is supported. | `sentence_transformers` |
| `embedding_model` | string | SentenceTransformers model name to use for embeddings. | `all-MiniLM-L6-v2` |
| `report_filename` | string | Optional report JSON filename. If relative, it is written next to the graph output file. If omitted, the report name is derived from the output file name. | (derived) |
| `keep` | string | Which item to keep when duplicates are found: `first` or `last`. | `first` |
| `max_pairs_in_report` | int | Max number of duplicate pairs written to the report. | `2000` |
| `dedup_mode` | string | Dedup implementation to use: `nearest_neighbor` (default) or `all_pairs`. Any other value is unsupported and will raise an error. `nearest_neighbor` avoids building a full similarity matrix by only comparing against nearest neighbors / kept items. `all_pairs` computes a full similarity matrix (exact, but O(n^2)). | `nearest_neighbor` |
| `vectorstore_k` | int | Number of nearest neighbors to retrieve/consider when `dedup_mode: nearest_neighbor`. | `20` |

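As a rough illustration of the `field` row above, the sketch below shows one way a field value could be turned into the text that gets embedded. `field_to_text` is a hypothetical helper written for this page, not part of SyGra's API, and its handling of missing values is an assumption.

```python
from typing import Any


def field_to_text(value: Any) -> str:
    """Hypothetical helper: turn a record's field value into the text to embed."""
    # Documented behavior: list/tuple values are joined with newlines.
    if isinstance(value, (list, tuple)):
        return "\n".join(str(v) for v in value)
    # Assumption (not stated in the docs): a missing value becomes an empty string.
    if value is None:
        return ""
    return str(value)


record = {"id": 7, "answer": ["Paris is the capital.", "It is in France."]}
text = field_to_text(record["answer"])
```

With this sketch, a multi-part answer is compared as one newline-separated string rather than part by part.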

### How dedup is applied

- A greedy pass keeps an item only if its similarity to every previously kept item does not exceed `similarity_threshold`.
- Similarity is computed via cosine similarity over normalized embeddings.
- `keep: first` keeps the earlier item, `keep: last` prefers the later item.

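The greedy pass described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not SyGra's implementation: `greedy_dedup` is a hypothetical name, it compares each item against every kept item (closer in spirit to an exact pass than to `nearest_neighbor` mode), it models `keep: last` by scanning from the end, and it treats a similarity exactly equal to the threshold as not a duplicate.

```python
import math
from typing import Dict, List, Sequence, Tuple


def _unit(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else [0.0 for _ in vec]


def greedy_dedup(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,
    keep: str = "first",
) -> Tuple[List[int], List[Dict[str, float]]]:
    """Hypothetical sketch: return (kept indices, duplicate pairs)."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    unit = [_unit(v) for v in embeddings]
    n = len(unit)
    # Assumption: keep="last" is modeled by visiting items from the end.
    order = range(n) if keep == "first" else range(n - 1, -1, -1)
    kept: List[int] = []
    pairs: List[Dict[str, float]] = []
    for i in order:
        # Find the most similar item among those already kept.
        best_j, best_sim = -1, -2.0
        for j in kept:
            sim = sum(a * b for a, b in zip(unit[i], unit[j]))
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j >= 0 and best_sim > threshold:
            # Too similar to a kept item: drop it and record the pair.
            pairs.append(
                {"kept_index": best_j, "dropped_index": i, "similarity": best_sim}
            )
        else:
            kept.append(i)
    return sorted(kept), pairs


# Items 0 and 1 point in nearly the same direction; item 2 is orthogonal to item 0.
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
kept_first, pairs_first = greedy_dedup(vectors, threshold=0.9, keep="first")
kept_last, pairs_last = greedy_dedup(vectors, threshold=0.9, keep="last")
```

In this toy input, `keep="first"` retains item 0 and drops item 1, while `keep="last"` retains item 1 and drops item 0; item 2 survives in both cases.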

## Output report

If SyGra provides `metadata["output_file"]` at runtime, the post processor writes a JSON report next to the output file.

### Report naming

- If `report_filename` is provided:
  - absolute paths are used as-is
  - relative paths are resolved relative to the output directory
- Otherwise, the report filename is derived from the output filename:
  - `output_*.json` -> `semantic_dedup_report_*.json`

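The naming rules above can be sketched as a small path helper. `resolve_report_path` is a hypothetical function for illustration only; the fallback for output files that do not start with `output_` is an assumption, since this page only documents the `output_*.json` case. POSIX-style paths are used to keep the example deterministic.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_report_path(
    metadata: dict, report_filename: Optional[str] = None
) -> Optional[PurePosixPath]:
    """Hypothetical sketch of where the dedup report would be written."""
    output_file = metadata.get("output_file")
    if not output_file:
        # Without an output file there is nowhere to put the report.
        return None
    output_path = PurePosixPath(output_file)
    if report_filename:
        candidate = PurePosixPath(report_filename)
        # Absolute paths are used as-is; relative ones go next to the output.
        if candidate.is_absolute():
            return candidate
        return output_path.parent / candidate
    name = output_path.name
    if name.startswith("output_"):
        derived = "semantic_dedup_report_" + name[len("output_"):]
    else:
        # Assumption: the docs do not cover this case; prefix the name.
        derived = "semantic_dedup_report_" + name
    return output_path.parent / derived


meta = {"output_file": "/data/run/output_2024.json"}
derived_path = resolve_report_path(meta)
relative_path = resolve_report_path(meta, "dedup.json")
absolute_path = resolve_report_path(meta, "/tmp/dedup.json")
missing = resolve_report_path({})
```

The last call returns `None`, mirroring the case where no `output_file` is available and the report is not persisted.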

### Report format (high level)

The report includes:

- `input_count`, `output_count`, `dropped_count`
- configuration (`field`, `similarity_threshold`, `embedding_model`, etc.)
- a bounded list of duplicate pairs under `duplicates`

Each entry in `duplicates` contains:

- `kept_index`, `dropped_index`
- `kept_id`, `dropped_id`
- `similarity`

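To make the field list above concrete, here is a sketch that assembles a report-shaped dictionary and serializes it. `build_report` is hypothetical, the sample values are invented for illustration, and placing the configuration keys at the top level is an assumption; the actual report may nest or order them differently.

```python
import json
from typing import Any, Dict, List


def build_report(
    input_count: int,
    output_count: int,
    config: Dict[str, Any],
    pairs: List[Dict[str, Any]],
    max_pairs_in_report: int = 2000,
) -> Dict[str, Any]:
    """Hypothetical sketch of a report with the documented keys."""
    report: Dict[str, Any] = {
        "input_count": input_count,
        "output_count": output_count,
        "dropped_count": input_count - output_count,
    }
    # Assumption: configuration values sit at the top level of the report.
    report.update(config)
    # The duplicate list is bounded by max_pairs_in_report.
    report["duplicates"] = pairs[:max_pairs_in_report]
    return report


sample_pairs = [
    {
        "kept_index": 0,
        "dropped_index": 1,
        "kept_id": "a1",
        "dropped_id": "a2",
        "similarity": 0.95,
    }
]
sample_config = {
    "field": "answer",
    "similarity_threshold": 0.92,
    "embedding_model": "all-MiniLM-L6-v2",
}
report = build_report(3, 2, sample_config, sample_pairs)
report_json = json.dumps(report, indent=2)
```

Loading `report_json` back with `json.loads` yields the same structure, which makes the report easy to inspect when tuning `similarity_threshold`.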

## Dependencies

When using `embedding_backend: sentence_transformers`, this feature requires the `sentence-transformers` package to be available in your environment.

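One way to check for the dependency before running a task is to look up the module with the standard library. This is a sketch of a pre-flight check, not something SyGra is documented to do; `sentence_transformers_available` is a hypothetical helper name.

```python
import importlib.util


def sentence_transformers_available() -> bool:
    """Return True if the `sentence_transformers` module can be located."""
    return importlib.util.find_spec("sentence_transformers") is not None


available = sentence_transformers_available()
if not available:
    # The PyPI package name uses a hyphen; the import name uses an underscore.
    print("Missing dependency; try: pip install sentence-transformers")
```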

## Performance considerations

When `dedup_mode: nearest_neighbor` (default), dedup runs incrementally and does not build a full similarity matrix. This is typically faster and uses less memory for larger outputs.

When `dedup_mode: all_pairs`, the implementation computes a full similarity matrix (**O(n^2)** time/memory), so it is intended for **relatively small** output lists.

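To get a feel for the quadratic memory cost of a full similarity matrix, the arithmetic below estimates its size for a few output sizes. It assumes one 4-byte (float32) value per pair, which is an assumption; this page does not state the dtype used.

```python
def full_matrix_bytes(n_items: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for a dense n x n similarity matrix."""
    return n_items * n_items * bytes_per_value


for n in (1_000, 10_000, 100_000):
    gib = full_matrix_bytes(n) / 2**30
    print(f"{n:>7} items -> {gib:,.2f} GiB")
```

Under that assumption, 10,000 items need roughly 0.37 GiB, while 100,000 items need about 37 GiB, which is why `all_pairs` suits only small outputs.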

If you plan to deduplicate very large outputs, consider:

- generating in smaller batches
- using a higher threshold to reduce comparisons
- implementing an approximate/streaming dedup strategy


## Troubleshooting

### Unsupported embedding backend

If you set `embedding_backend` to anything other than `sentence_transformers`, SyGra will raise:

`ValueError: Unsupported embedding_backend: ...`

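To catch this kind of misconfiguration before a run starts, a small pre-flight check over `params` can help. The sketch below is hypothetical and independent of SyGra: `check_dedup_params` and its messages are invented for illustration, and the allowed values and defaults are taken from the parameter table on this page.

```python
from typing import Any, Dict, List

SUPPORTED_BACKENDS = {"sentence_transformers"}
SUPPORTED_DEDUP_MODES = {"nearest_neighbor", "all_pairs"}
SUPPORTED_KEEP = {"first", "last"}


def check_dedup_params(params: Dict[str, Any]) -> List[str]:
    """Hypothetical pre-flight check: return a list of problems found."""
    problems: List[str] = []
    # Missing keys fall back to the defaults listed in the parameter table.
    backend = params.get("embedding_backend", "sentence_transformers")
    if backend not in SUPPORTED_BACKENDS:
        problems.append(f"embedding_backend {backend!r} is not supported")
    mode = params.get("dedup_mode", "nearest_neighbor")
    if mode not in SUPPORTED_DEDUP_MODES:
        problems.append(f"dedup_mode {mode!r} is not supported")
    keep = params.get("keep", "first")
    if keep not in SUPPORTED_KEEP:
        problems.append(f"keep {keep!r} must be 'first' or 'last'")
    return problems


ok = check_dedup_params({"field": "answer", "dedup_mode": "all_pairs"})
bad = check_dedup_params({"embedding_backend": "openai", "keep": "newest"})
```

Here `ok` is empty, while `bad` lists one problem for the backend and one for `keep`.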

### No report is written

A report is only written if `metadata["output_file"]` is present. If you are running in a context where SyGra does not set it, the post processor will still deduplicate in-memory but will not persist the report.
