@@ -73,7 +73,7 @@ filtered_texts = semhash.self_filter_outliers().selected
 representative_texts = semhash.self_find_representative().selected
 ```

-### Image Deduplication
+### Image Deduplication, Filtering & Representative Sampling

 Deduplicate an image dataset using a vision model (requires `pip install sentence-transformers`):

@@ -91,6 +91,12 @@ semhash = SemHash.from_records(list(dataset), columns=["img"], model=model)

 # Deduplicate the images
 deduplicated_images = semhash.self_deduplicate().selected
+
+# Filter outliers
+filtered_images = semhash.self_filter_outliers().selected
+
+# Find representative images
+representative_images = semhash.self_find_representative().selected
 ```

 ### Cross-Dataset Deduplication, Filtering & Representative Sampling
@@ -229,14 +235,11 @@ The following code snippet shows how to deduplicate across two datasets, filter
 from datasets import load_dataset
 from semhash import SemHash

-# Initialize a SemHash instance
-semhash = SemHash()
-
 # Load two datasets to deduplicate
 train_texts = load_dataset("ag_news", split="train")["text"]
 test_texts = load_dataset("ag_news", split="test")["text"]

-# Initialize a SemHash instance
+# Initialize a SemHash instance with the training data
 semhash = SemHash.from_records(records=train_texts)

 # Deduplicate the test data against the training data
@@ -524,7 +527,7 @@ If you use SemHash in your research, please cite the following:
 ```bibtex
 @software{minishlab2025semhash,
   author = {{van Dongen}, Thomas and Stephan Tulkens},
-  title = {SemHash: Fast Semantic Text Deduplication \& Filtering},
+  title = {SemHash: Fast Multimodal Semantic Deduplication \& Filtering},
   year = {2025},
   publisher = {Zenodo},
   doi = {10.5281/zenodo.17265942},