
Commit 78bb225

Updated docs and citation info
1 parent fc2ba26 commit 78bb225


2 files changed, with 11 additions and 8 deletions

CITATION.cff

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 cff-version: 1.2.0
 message: "If you use SemHash in your research, please cite it as below."
-title: "SemHash: Fast Semantic Text Deduplication & Filtering"
+title: "SemHash: Fast Multimodal Semantic Deduplication & Filtering"
 authors:
   - family-names: "van Dongen"
     given-names: "Thomas"
@@ -14,7 +14,7 @@ date-released: "2025-01-05"
 
 preferred-citation:
   type: software
-  title: "SemHash: Fast Semantic Text Deduplication & Filtering"
+  title: "SemHash: Fast Multimodal Semantic Deduplication & Filtering"
   authors:
     - family-names: "van Dongen"
       given-names: "Thomas"
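
Because this change updates the title in two places within `CITATION.cff` (the top-level `title` and the one under `preferred-citation`), a small consistency check can catch the case where only one of them is edited. The following is a minimal sketch, not part of the commit: the helper names `cff_titles` and `titles_consistent` and the `SAMPLE` text are hypothetical, and the check is a plain regex scan rather than a full YAML or CFF parser, so it assumes each title sits on one line in double quotes.

```python
import re

# Matches a line of the form:   title: "Some Title"
# (optionally indented). This is an assumption about the file layout,
# not a general YAML parser.
_TITLE_RE = re.compile(r'^[ \t]*title:[ \t]*"(.*)"[ \t]*$', re.MULTILINE)


def cff_titles(cff_text: str) -> list[str]:
    """Return every double-quoted `title:` value found in the text."""
    return _TITLE_RE.findall(cff_text)


def titles_consistent(cff_text: str) -> bool:
    """True when at least one title exists and all titles are identical."""
    titles = cff_titles(cff_text)
    return len(titles) > 0 and len(set(titles)) == 1


# Hypothetical, shortened stand-in for the updated CITATION.cff.
SAMPLE = '''cff-version: 1.2.0
message: "If you use SemHash in your research, please cite it as below."
title: "SemHash: Fast Multimodal Semantic Deduplication & Filtering"
preferred-citation:
  type: software
  title: "SemHash: Fast Multimodal Semantic Deduplication & Filtering"
'''

if __name__ == "__main__":
    print(cff_titles(SAMPLE))
    print(titles_consistent(SAMPLE))
```

A check like this could run in CI against the real file by reading it with `pathlib.Path("CITATION.cff").read_text()`. The same scan would need a different pattern to also cover the BibTeX entry in `README.md`.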

README.md

Lines changed: 9 additions & 6 deletions
@@ -73,7 +73,7 @@ filtered_texts = semhash.self_filter_outliers().selected
 representative_texts = semhash.self_find_representative().selected
 ```
 
-### Image Deduplication
+### Image Deduplication, Filtering & Representative Sampling
 
 Deduplicate an image dataset using a vision model (requires `pip install sentence-transformers`):
 
@@ -91,6 +91,12 @@ semhash = SemHash.from_records(list(dataset), columns=["img"], model=model)
 
 # Deduplicate the images
 deduplicated_images = semhash.self_deduplicate().selected
+
+# Filter outliers
+filtered_images = semhash.self_filter_outliers().selected
+
+# Find representative images
+representative_images = semhash.self_find_representative().selected
 ```
 
 ### Cross-Dataset Deduplication, Filtering & Representative Sampling
@@ -229,14 +235,11 @@ The following code snippet shows how to deduplicate across two datasets, filter
 from datasets import load_dataset
 from semhash import SemHash
 
-# Initialize a SemHash instance
-semhash = SemHash()
-
 # Load two datasets to deduplicate
 train_texts = load_dataset("ag_news", split="train")["text"]
 test_texts = load_dataset("ag_news", split="test")["text"]
 
-# Initialize a SemHash instance
+# Initialize a SemHash instance with the training data
 semhash = SemHash.from_records(records=train_texts)
 
 # Deduplicate the test data against the training data
@@ -524,7 +527,7 @@ If you use SemHash in your research, please cite the following:
 ```bibtex
 @software{minishlab2025semhash,
 author = {{van Dongen}, Thomas and Stephan Tulkens},
-title = {SemHash: Fast Semantic Text Deduplication \& Filtering},
+title = {SemHash: Fast Multimodal Semantic Deduplication \& Filtering},
 year = {2025},
 publisher = {Zenodo},
 doi = {10.5281/zenodo.17265942},
