| 1 | +# Semantic Deduplication |
| 2 | + |
| 3 | +> **Remove near-duplicate generated records with an embedding-based similarity check that runs as a graph post-processor** |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +SyGra supports semantic deduplication as a **graph post-processing** step via: |
| 8 | + |
| 9 | +`sygra.core.graph.graph_postprocessor.SemanticDedupPostProcessor` |
| 10 | + |
| 11 | +It embeds a configured output field (e.g., `answer` or `description`) and removes any item whose **cosine similarity** to an already kept item is above a configurable threshold. |
| 12 | + |
| 13 | +This is useful when: |
| 14 | + |
| 15 | +- Your generation workflow tends to repeat identical or very similar answers. |
| 16 | +- You are generating multiple records and want to reduce redundant samples. |
| 17 | +- You want a report of duplicate pairs to inspect or tune dedup behavior. |
| 18 | + |
| 19 | +## Quick Start |
| 20 | + |
| 21 | +Add the post-processor under `graph_post_process` in your task's `graph_config.yaml`. |
| 22 | + |
| 23 | +Example (dedup over `answer`, see `tasks/examples/semantic_dedup/graph_config.yaml`): |
| 24 | + |
| 25 | +```yaml |
| 26 | +graph_post_process: |
| 27 | +  - processor: sygra.core.graph.graph_postprocessor.SemanticDedupPostProcessor |
| 28 | +    params: |
| 29 | +      field: answer |
| 30 | +      similarity_threshold: 0.92 |
| 31 | +      id_field: id |
| 32 | +      embedding_backend: sentence_transformers |
| 33 | +      embedding_model: all-MiniLM-L6-v2 |
| 34 | +      dedup_mode: nearest_neighbor |
| 35 | +      vectorstore_k: 20 |
| 36 | +      keep: first |
| 37 | +      max_pairs_in_report: 1000 |
| 38 | +``` |
| 39 | + |
| 40 | +Example (dedup over `description`, see `tasks/examples/semantic_dedup_no_seed/graph_config.yaml`): |
| 41 | + |
| 42 | +```yaml |
| 43 | +graph_post_process: |
| 44 | +  - processor: sygra.core.graph.graph_postprocessor.SemanticDedupPostProcessor |
| 45 | +    params: |
| 46 | +      field: description |
| 47 | +      similarity_threshold: 0.85 |
| 48 | +      id_field: id |
| 49 | +      embedding_backend: sentence_transformers |
| 50 | +      embedding_model: all-MiniLM-L6-v2 |
| 51 | +      keep: first |
| 52 | +      max_pairs_in_report: 1000 |
| 53 | +``` |
| 54 | + |
| 55 | +## Configuration Reference |
| 56 | + |
| 57 | +### Parameters |
| 58 | + |
| 59 | +All parameters are provided under `params:`. |
| 60 | + |
| 61 | +| Parameter | Type | Description | Default | |
| 62 | +|-----------|------|-------------|---------| |
| 63 | +| `field` | string | Field to embed and compare for similarity. If the field value is a list/tuple, values are joined with newlines. | `text` | |
| 64 | +| `similarity_threshold` | float | Cosine similarity threshold. An item is dropped when its similarity to a kept item is above this value, so higher values drop fewer items. | `0.9` | |
| 65 | +| `id_field` | string | Optional ID field used in the report for readability. If missing, indices are used. | `id` | |
| 66 | +| `embedding_backend` | string | Embedding backend. Currently only `sentence_transformers` is supported. | `sentence_transformers` | |
| 67 | +| `embedding_model` | string | SentenceTransformers model name to use for embeddings. | `all-MiniLM-L6-v2` | |
| 68 | +| `report_filename` | string | Optional report JSON filename. If relative, it is written next to the graph output file. If omitted, the report name is derived from the output file name. | (derived) | |
| 69 | +| `keep` | string | Which item to keep when duplicates are found: `first` or `last`. | `first` | |
| 70 | +| `max_pairs_in_report` | int | Max number of duplicate pairs written to the report. | `2000` | |
| 71 | +| `dedup_mode` | string | Dedup implementation to use: `nearest_neighbor` (default) or `all_pairs`. Any other value is unsupported and will raise an error. `nearest_neighbor` avoids building a full similarity matrix by only comparing against nearest neighbors / kept items. `all_pairs` computes a full similarity matrix (exact, but O(n^2)). | `nearest_neighbor` | |
| 72 | +| `vectorstore_k` | int | Number of nearest neighbors to retrieve/consider when `dedup_mode: nearest_neighbor`. | `20` | |
| 73 | + |
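The list/tuple handling described for `field` can be pictured with a short sketch. The helper name `field_to_text` is hypothetical, and the treatment of missing or non-string values is an assumption made for illustration, not confirmed SyGra behavior.

```python
from typing import Any, Mapping


def field_to_text(record: Mapping[str, Any], field: str = "text") -> str:
    """Return the text that would be embedded for one record (illustrative only)."""
    value = record.get(field)
    if value is None:
        # Assumption: a missing field is treated as empty text.
        return ""
    if isinstance(value, (list, tuple)):
        # Documented behavior: list/tuple values are joined with newlines.
        return "\n".join(str(v) for v in value)
    # Assumption: any other value is converted with str().
    return str(value)


print(repr(field_to_text({"answer": ["Paris", "France"]}, "answer")))
```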
| 74 | +### How dedup is applied |
| 75 | + |
| 76 | +- A greedy pass keeps an item if it is not too similar to a previously kept one. |
| 77 | +- Similarity is computed via cosine similarity over normalized embeddings. |
| 78 | +- `keep: first` keeps the earlier item; `keep: last` prefers the later item. |
| 79 | + |
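The greedy pass described above can be sketched in plain Python. This is a minimal illustration, not SyGra's implementation: the function names are hypothetical, the strict `>` comparison and the way `keep: last` is modeled (scanning from the end) are assumptions, and real embeddings would come from the configured model rather than hand-written vectors.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def greedy_dedup(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,
    keep: str = "first",
) -> Tuple[List[int], List[Tuple[int, int, float]]]:
    """Return (kept indices, duplicate pairs as (kept, dropped, similarity))."""
    unit = [normalize(v) for v in embeddings]
    order = list(range(len(unit)))
    if keep == "last":
        # Assumption: "last" is modeled by scanning items from the end.
        order.reverse()

    kept: List[int] = []
    pairs: List[Tuple[int, int, float]] = []
    for i in order:
        # Find the most similar item among those already kept.
        best_j, best_sim = -1, -1.0
        for j in kept:
            sim = sum(a * b for a, b in zip(unit[i], unit[j]))
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j >= 0 and best_sim > threshold:
            pairs.append((best_j, i, best_sim))  # item i is dropped
        else:
            kept.append(i)
    return sorted(kept), pairs


# Toy 2-D "embeddings": the second vector is nearly parallel to the first.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept, pairs = greedy_dedup(vectors, threshold=0.9)
print(kept)  # [0, 2]
```

This sketch compares each item against every kept item; the `nearest_neighbor` mode described in the parameter table limits that work to the `vectorstore_k` nearest candidates instead.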
| 80 | +## Output report |
| 81 | + |
| 82 | +If SyGra provides `metadata["output_file"]` at runtime, the post-processor writes a JSON report, by default next to the output file. |
| 83 | + |
| 84 | +### Report naming |
| 85 | + |
| 86 | +- If `report_filename` is provided: |
| 87 | + - absolute paths are used as-is |
| 88 | + - relative paths are resolved relative to the output directory |
| 89 | +- Otherwise, the report filename is derived from the output filename: |
| 90 | + - `output_*.json` -> `semantic_dedup_report_*.json` |
| 91 | + |
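The naming rules above can be expressed as a small path helper. This is a sketch under stated assumptions: `resolve_report_path` is a hypothetical name, and the fallback for output files that do not start with `output_` is not documented, so it is only a guess.

```python
from pathlib import Path
from typing import Optional


def resolve_report_path(output_file: str, report_filename: Optional[str] = None) -> Path:
    """Work out where the dedup report would be written (illustrative only)."""
    output_path = Path(output_file)
    if report_filename:
        candidate = Path(report_filename)
        # Absolute paths are used as-is; relative ones resolve against the output directory.
        return candidate if candidate.is_absolute() else output_path.parent / candidate

    name = output_path.name
    if name.startswith("output_"):
        # Documented rule: output_*.json -> semantic_dedup_report_*.json
        name = "semantic_dedup_report_" + name[len("output_"):]
    else:
        # Assumption: the behavior for other file names is not documented.
        name = "semantic_dedup_report_" + name
    return output_path.parent / name


print(resolve_report_path("runs/output_2024_01_01.json"))
print(resolve_report_path("runs/output_2024_01_01.json", "dedup.json"))
```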
| 92 | +### Report format (high level) |
| 93 | + |
| 94 | +The report includes: |
| 95 | + |
| 96 | +- `input_count`, `output_count`, `dropped_count` |
| 97 | +- configuration (`field`, `similarity_threshold`, `embedding_model`, etc.) |
| 98 | +- a bounded list of duplicate pairs under `duplicates` |
| 99 | + |
| 100 | +Each entry in `duplicates` contains: |
| 101 | + |
| 102 | +- `kept_index`, `dropped_index` |
| 103 | +- `kept_id`, `dropped_id` |
| 104 | +- `similarity` |
| 105 | + |
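When tuning `similarity_threshold`, it can help to look at the lowest-similarity pairs in the report first, since those are the most likely false positives. The sketch below assumes only the keys listed above; the `sample` dictionary is made-up data, and a real report would be loaded with `json.load()`.

```python
from typing import Any, Dict, List


def summarize_report(report: Dict[str, Any], top_n: int = 3) -> List[str]:
    """Build a short text summary from a dedup report dictionary."""
    lines = [
        f"{report['input_count']} in, {report['output_count']} out, "
        f"{report['dropped_count']} dropped"
    ]
    # Sort ascending so borderline pairs (closest to the threshold) come first.
    pairs = sorted(report.get("duplicates", []), key=lambda p: p["similarity"])
    for pair in pairs[:top_n]:
        lines.append(
            f"kept {pair['kept_id']} / dropped {pair['dropped_id']} "
            f"(similarity {pair['similarity']:.3f})"
        )
    return lines


# Made-up sample that uses the documented keys.
sample = {
    "input_count": 3,
    "output_count": 2,
    "dropped_count": 1,
    "duplicates": [
        {"kept_index": 0, "dropped_index": 1,
         "kept_id": "a", "dropped_id": "b", "similarity": 0.95},
    ],
}
for line in summarize_report(sample):
    print(line)
```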
| 106 | +## Dependencies |
| 107 | + |
| 108 | +When using `embedding_backend: sentence_transformers`, this feature requires the `sentence-transformers` package to be available in your environment. |
| 109 | + |
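One way to check for the dependency before running a task is a standard-library lookup that does not import the package. The helper name is hypothetical, and the install hint assumes a plain `pip` environment; your project may manage dependencies differently.

```python
import importlib.util


def sentence_transformers_available() -> bool:
    """Return True if the sentence_transformers module can be found."""
    return importlib.util.find_spec("sentence_transformers") is not None


if not sentence_transformers_available():
    print("sentence-transformers not found; try: pip install sentence-transformers")
```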
| 110 | +## Performance considerations |
| 111 | + |
| 112 | +When `dedup_mode: nearest_neighbor` (default), dedup runs incrementally and does not build a full similarity matrix. This is typically faster and uses less memory for larger outputs. |
| 113 | + |
| 114 | +When `dedup_mode: all_pairs`, the implementation computes a full similarity matrix (**O(n^2)** time/memory), so it is intended for **relatively small** output lists. |
| 115 | + |
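To get a feel for the quadratic cost of `all_pairs`, the size of a dense similarity matrix can be estimated directly. The calculation assumes 4-byte (float32) values, which is an assumption about storage rather than a documented detail, and it ignores the embeddings themselves.

```python
def similarity_matrix_bytes(n_items: int, bytes_per_value: int = 4) -> int:
    """Approximate memory for a dense n x n similarity matrix."""
    return n_items * n_items * bytes_per_value


for n in (1_000, 10_000, 100_000):
    gib = similarity_matrix_bytes(n) / 2**30
    print(f"{n:>7} items -> {gib:.3f} GiB")
```

Under that assumption, 10,000 items need about 0.37 GiB, while 100,000 items need roughly 37 GiB, which is why `all_pairs` suits only small outputs.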
| 116 | +If you plan to deduplicate very large outputs, consider: |
| 117 | + |
| 118 | +- generating in smaller batches |
| 119 | +- using a higher threshold to reduce comparisons |
| 120 | +- implementing an approximate/streaming dedup strategy |
| 121 | + |
| 122 | +## Troubleshooting |
| 123 | + |
| 124 | +### Unsupported embedding backend |
| 125 | + |
| 126 | +If you set `embedding_backend` to anything other than `sentence_transformers`, SyGra will raise: |
| 127 | + |
| 128 | +`ValueError: Unsupported embedding_backend: ...` |
| 129 | + |
| 130 | +### No report is written |
| 131 | + |
| 132 | +A report is only written if `metadata["output_file"]` is present. If you are running in a context where SyGra does not set it, the post-processor still deduplicates in memory but does not persist the report. |