Changes from all commits (32 commits)
b96b82d Added from_dataset class method (Pringled, Jan 11, 2026)
e0af5f7 Added from_dataset class method (Pringled, Jan 11, 2026)
2c01486 Optimized from_dataset class method (Pringled, Jan 11, 2026)
3ace1bd Optimized from_dataset class method (Pringled, Jan 11, 2026)
f8e5e54 Simplified from_dataset class method (Pringled, Jan 11, 2026)
a8282c8 Simplified from_dataset class method (Pringled, Jan 11, 2026)
e7820ab Added DatasetLike protocol (Pringled, Jan 11, 2026)
e685309 Updated tests (Pringled, Jan 11, 2026)
56e5ab3 Updated tests (Pringled, Jan 11, 2026)
f9e182e Improved code (Pringled, Jan 11, 2026)
801de63 Improved code (Pringled, Jan 11, 2026)
edbee51 Added optional datasets dependency (Pringled, Jan 11, 2026)
b153cea Updated tests (Pringled, Jan 11, 2026)
861edf0 Renamed testfile (Pringled, Jan 12, 2026)
2a103ec Improved code, refactored utils (Pringled, Jan 14, 2026)
5a6a440 Updated tests (Pringled, Jan 14, 2026)
44ce337 Simplified tests (Pringled, Jan 14, 2026)
b8e497a Simplified tests (Pringled, Jan 14, 2026)
2969dcf Improved coverage (Pringled, Jan 14, 2026)
db8bef7 Consolidated tests (Pringled, Jan 14, 2026)
fdba077 Consolidated tests (Pringled, Jan 14, 2026)
23e7506 Simplified tests (Pringled, Jan 14, 2026)
b101fdb Simplified tests (Pringled, Jan 14, 2026)
e113ac8 Generalized hashing functions to support complex types (Pringled, Jan 15, 2026)
653dccc Removed complex method (Pringled, Jan 15, 2026)
95cc5e6 Updated docstrings (Pringled, Jan 15, 2026)
e808922 Moved functions to records (Pringled, Jan 15, 2026)
c64a72e Removed from_dataset integration (Pringled, Jan 15, 2026)
b3db57a Updated docs and tagline (Pringled, Jan 15, 2026)
fc2ba26 Updated docs (Pringled, Jan 15, 2026)
78bb225 Updated docs and citation info (Pringled, Jan 15, 2026)
2a7badf Updated docs (Pringled, Jan 15, 2026)
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -1,6 +1,6 @@
cff-version: 1.2.0
message: "If you use SemHash in your research, please cite it as below."
title: "SemHash: Fast Semantic Text Deduplication & Filtering"
title: "SemHash: Fast Multimodal Semantic Deduplication & Filtering"
authors:
- family-names: "van Dongen"
given-names: "Thomas"
@@ -14,7 +14,7 @@ date-released: "2025-01-05"

preferred-citation:
type: software
title: "SemHash: Fast Semantic Text Deduplication & Filtering"
title: "SemHash: Fast Multimodal Semantic Deduplication & Filtering"
authors:
- family-names: "van Dongen"
given-names: "Thomas"
2 changes: 1 addition & 1 deletion Makefile
@@ -9,7 +9,7 @@ install: venv
uv run pre-commit install

install-no-pre-commit:
uv pip install ".[dev]"
uv pip install ".[dev,all]"

fix:
uv run pre-commit run --all-files
150 changes: 129 additions & 21 deletions README.md
@@ -2,7 +2,7 @@

<h2 align="center">
<img width="30%" alt="SemHash logo" src="assets/images/semhash_logo_v2.png"><br/>
Fast Semantic Text Deduplication & Filtering
Fast Multimodal Semantic Deduplication & Filtering
</h2>


@@ -38,9 +38,9 @@
</div>


SemHash is a lightweight and flexible tool for deduplicating datasets, filtering outliers, and finding representative samples using semantic similarity. It combines fast embedding generation from [Model2Vec](https://github.com/MinishLab/model2vec) with efficient ANN-based similarity search through [Vicinity](https://github.com/MinishLab/vicinity).
SemHash is a lightweight library for semantic deduplication, outlier filtering, and representative sample selection. It's fully multimodal: text works out of the box with fast Model2Vec embeddings, and you can bring your own encoder for images, audio, or any other modality.

SemHash supports both single-dataset deduplication & filtering (e.g., cleaning up a train set by removing duplicates and outliers) and multi-dataset deduplication & filtering (e.g., ensuring no overlap between a test set and a train set). It works with simple datasets, such as text lists, and more complex ones, like multi-column QA datasets. Additionally, it includes functions to inspect deduplication results, making it easier to understand and refine your data cleaning process.
SemHash supports both single-dataset operations (clean a training set) and cross-dataset operations (deduplicate test against train). It works with simple lists and complex multi-column datasets, and includes inspection tools to help you understand and refine results. All operations use Vicinity for efficient similarity search.

## Quickstart

@@ -49,6 +49,8 @@ Install the package with:
pip install semhash
```

### Text Deduplication, Filtering & Representative Sampling

Deduplicate a single dataset, filter outliers, and find representative samples with the following code (note: the examples assume you have `datasets` installed, which you can install with `pip install datasets`):

```python
@@ -71,7 +73,35 @@ filtered_texts = semhash.self_filter_outliers().selected
representative_texts = semhash.self_find_representative().selected
```

Or, deduplicate across two datasets, filter outliers, and find representative samples with the following code (e.g., eliminating train/test leakage):
### Image Deduplication, Filtering & Representative Sampling

Deduplicate an image dataset, filter outliers, and find representative samples using a vision model (requires `pip install sentence-transformers`):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from semhash import SemHash

# Load an image dataset and vision model
model = SentenceTransformer('clip-ViT-B-32')
dataset = load_dataset("uoft-cs/cifar10", split="test")

# Initialize a SemHash instance with the 'img' column
semhash = SemHash.from_records(list(dataset), columns=["img"], model=model)

# Deduplicate the images
deduplicated_images = semhash.self_deduplicate().selected

# Filter outliers
filtered_images = semhash.self_filter_outliers().selected

# Find representative images
representative_images = semhash.self_find_representative().selected
```

### Cross-Dataset Deduplication, Filtering & Representative Sampling

Deduplicate across two datasets, filter outliers, and find representative samples (e.g., eliminating train/test leakage):

```python
from datasets import load_dataset
@@ -93,13 +123,12 @@ filtered_test_texts = semhash.filter_outliers(records=test_texts, outlier_percen

# Find representative texts in the test data against the training data,
# optionally with a specific selection size
representative_test_texts = semhash.find_representative(
records=test_texts, selection_size=10).selected


representative_test_texts = semhash.find_representative(records=test_texts, selection_size=10).selected
```

Or, deduplicate multi-column dataset, filter outliers, and find representative samples with the following code (e.g., deduplicating a QA dataset):
### Multi-Column Deduplication

Deduplicate multi-column datasets (e.g., deduplicating a QA dataset):

```python
from datasets import load_dataset
@@ -116,15 +145,9 @@ semhash = SemHash.from_records(records=records, columns=["question", "context"])

# Deduplicate the records
deduplicated_records = semhash.self_deduplicate().selected

# Filter outliers from the records
filtered_texts = semhash.self_filter_outliers().selected

# Find representative texts in the records
representative_texts = semhash.self_find_representative().selected
```

The `deduplicate` and `self_deduplicate` functions return a [DeduplicationResult](https://github.com/MinishLab/semhash/blob/main/semhash/datamodels.py#L58). This object stores the deduplicated corpus, a set of duplicate object (along with the objects that caused duplication), and several useful functions to further inspect the deduplication result.
The `deduplicate` and `self_deduplicate` functions return a [DeduplicationResult](https://github.com/MinishLab/semhash/blob/main/semhash/datamodels.py#L58). This object stores the deduplicated corpus, a set of duplicate objects (along with the objects that caused duplication), and several useful functions to further inspect the deduplication result.
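For example, here is a quick way to inspect a deduplication result (a sketch based on the attributes documented in the SemHash README; verify the exact names against your installed version):

```python
result = semhash.self_deduplicate()

# How much was removed, and how much of that was an exact match
print(result.duplicate_ratio)
print(result.exact_duplicate_ratio)

# The least similar record that was still marked as a duplicate,
# useful for sanity-checking the threshold
print(result.get_least_similar_from_duplicates())

# Re-apply a stricter threshold without re-embedding
result.rethreshold(0.95)
```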

The `filter_outliers`, `self_filter_outliers`, `find_representative`, and `self_find_representative` functions return a [FilterResult](https://github.com/MinishLab/semhash/blob/main/semhash/datamodels.py#L179). This object stores the found outliers/representative samples.
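A minimal sketch of working with a FilterResult (assuming, per the linked datamodels, that the complement of `selected` is exposed as `filtered`):

```python
outlier_result = semhash.self_filter_outliers()

kept = outlier_result.selected     # records kept after outlier filtering
dropped = outlier_result.filtered  # records flagged as outliers (assumed attribute name)
```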

@@ -212,14 +235,11 @@ The following code snippet shows how to deduplicate across two datasets, filter
from datasets import load_dataset
from semhash import SemHash

# Initialize a SemHash instance
semhash = SemHash()

# Load two datasets to deduplicate
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Initialize a SemHash instance
# Initialize a SemHash instance with the training data
semhash = SemHash.from_records(records=train_texts)

# Deduplicate the test data against the training data
@@ -265,6 +285,56 @@ representative_records = semhash.self_find_representative().selected

</details>

<details>
<summary> Deduplicate, filter outliers, and find representative samples on image datasets </summary>
<br>

You can bring your own encoder for any modality by implementing the Encoder protocol. Here's an example using a vision model from timm for image deduplication:

```python
from datasets import load_dataset
import timm
import torch
from semhash import SemHash

# Requires: pip install timm torch datasets

# Create a custom image encoder
class VisionEncoder:
"""Custom encoder using timm models. Implements the Encoder protocol."""

def __init__(self, model_name: str = "mobilenetv3_small_100"):
self.model = timm.create_model(model_name, pretrained=True, num_classes=0).eval()
self.transform = timm.data.create_transform(**timm.data.resolve_model_data_config(self.model))

def encode(self, inputs):
"""Encode a batch of PIL images into embeddings."""
with torch.no_grad():
return self.model(torch.stack([self.transform(img) for img in inputs])).numpy()

# Load image dataset
dataset = load_dataset("uoft-cs/cifar10", split="test")
train_data = [{"img": img, "id": i} for i, img in enumerate(dataset["img"][:100])]
test_data = [{"img": img, "id": i} for i, img in enumerate(dataset["img"][100:150])]

# Initialize SemHash with the custom vision encoder
semhash = SemHash.from_records(train_data, columns=["img"], model=VisionEncoder())

# Single-dataset operations
deduplicated = semhash.self_deduplicate().selected
outliers = semhash.self_filter_outliers().selected
representatives = semhash.self_find_representative().selected

# Cross-dataset operations
test_deduplicated = semhash.deduplicate(test_data).selected
test_outliers = semhash.filter_outliers(test_data).selected
test_representatives = semhash.find_representative(test_data, selection_size=10).selected
```

The Encoder protocol requires only an `encode(inputs, **kwargs)` method that returns a numpy array. This makes it easy to integrate any embedding model for any modality.
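For reference, a minimal sketch of what such a protocol can look like as a `typing.Protocol` (an illustration, not the library's exact definition):

```python
from typing import Any, Protocol

import numpy as np


class Encoder(Protocol):
    """Anything with an `encode` method mapping a batch of inputs to a numpy array."""

    def encode(self, inputs: Any, **kwargs: Any) -> np.ndarray: ...
```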

</details>

<details>
<summary> Using custom encoders </summary>
<br>
@@ -400,6 +470,44 @@ representative_texts = semhash.self_find_representative().selected
```
</details>

<details>
<summary> Initializing from a HuggingFace Dataset </summary>
<br>
You can easily use SemHash with HuggingFace Datasets by converting them to a list:

```python
from datasets import load_dataset
from semhash import SemHash

# Load a HuggingFace dataset
dataset = load_dataset("ag_news", split="train")

# Convert to list and initialize SemHash
semhash = SemHash.from_records(records=list(dataset), columns=["text"])

# Deduplicate, filter outliers, and find representative samples
deduplicated_texts = semhash.self_deduplicate().selected
filtered_texts = semhash.self_filter_outliers().selected
representative_texts = semhash.self_find_representative().selected
```

This also works with multi-column datasets:

```python
from datasets import load_dataset
from semhash import SemHash

# Load a multi-column dataset
dataset = load_dataset("squad_v2", split="train")

# Convert to list and initialize with multiple columns
semhash = SemHash.from_records(records=list(dataset), columns=["question", "context"])

# Deduplicate the records
deduplicated_records = semhash.self_deduplicate().selected
```
</details>




@@ -419,7 +527,7 @@ If you use SemHash in your research, please cite the following:
```bibtex
@software{minishlab2025semhash,
author = {{van Dongen}, Thomas and Stephan Tulkens},
title = {SemHash: Fast Semantic Text Deduplication \& Filtering},
title = {SemHash: Fast Multimodal Semantic Deduplication \& Filtering},
year = {2025},
publisher = {Zenodo},
doi = {10.5281/zenodo.17265942},
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "semhash"
description = "Fast Semantic Text Deduplication & Filtering"
description = "Fast Multimodal Semantic Deduplication & Filtering"
authors = [{name = "Thomas van Dongen", email = "[email protected]"}, { name = "Stéphan Tulkens", email = "[email protected]"}]
readme = { file = "README.md", content-type = "text/markdown" }
dynamic = ["version"]
@@ -43,6 +43,7 @@ dev = [
"ruff",
]


[project.urls]
"Homepage" = "https://github.com/MinishLab"
"Bug Reports" = "https://github.com/MinishLab/semhash/issues"