SemHash is extremely fast and scales to large datasets with millions of records. We've benchmarked both text and image deduplication across a variety of datasets. For example, deduplicating 1.8M text records takes only ~83 seconds on CPU.
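As a rough sanity check on that figure, the implied throughput is in the low tens of thousands of records per second. This is a back-of-the-envelope calculation derived from the quoted numbers, not a separately measured result:

```python
# Throughput implied by the quoted benchmark:
# 1.8M text records deduplicated in ~83 seconds on CPU.
n_records = 1_800_000
seconds = 83
throughput = n_records / seconds
print(f"~{throughput:,.0f} records/second")  # ~21,687 records/second
```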
For detailed benchmark results and analysis, see the [benchmarks directory](benchmarks/README.md).
### Running Benchmarks
---

The following is from `benchmarks/README.md` (77 additions, 9 deletions):
# SemHash Benchmarks
This directory contains the benchmarking code and results for SemHash. The benchmarks measure deduplication performance and speed across a variety of text and image datasets.
## Text Benchmarks
### Setup
All text benchmarks were run with the following configuration:
- **CPU-only**: All benchmarks run on CPU (no GPU acceleration)

### Image Benchmark Findings

- **Fashion-MNIST high deduplication**: Fashion-MNIST shows very high duplication rates (72% train, 79% test) due to the simple nature of the dataset (10 clothing categories with similar items)
- **CIFAR-10 moderate deduplication**: CIFAR-10 shows lower duplication (3.45% train, 6.03% test) as it contains more diverse natural images
- **Speed**: Image deduplication is fast even for large datasets (60k images in ~87 seconds on MPS)
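For reference, the percentages above are duplicate ratios: the fraction of records removed relative to the original split size. A small illustrative sketch (the selected count below is made up to match the quoted 72% figure, not taken from actual benchmark output):

```python
def duplicate_ratio(n_original: int, n_selected: int) -> float:
    """Fraction of records flagged as duplicates and removed."""
    return (n_original - n_selected) / n_original

# Fashion-MNIST's train split has 60,000 images; a 72% duplication
# rate corresponds to keeping ~16,800 of them (illustrative numbers).
print(duplicate_ratio(60_000, 16_800))  # -> 0.72
```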
### Running Image Benchmarks
To run the image benchmarks yourself:
```bash
# Install dependencies
pip install timm torch datasets

# Run benchmarks
python -m benchmarks.run_image_benchmarks

# Or using make
make benchmark-image
```
The image datasets can be customized by editing `benchmarks/data.py` (see `IMAGE_DATASET_DICT`).
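As a rough illustration, an entry in that dict might look like the following. The key names and structure here are assumptions for the sketch, not the actual schema; consult `IMAGE_DATASET_DICT` in `benchmarks/data.py` for the real layout:

```python
# Hypothetical structure; the real IMAGE_DATASET_DICT in
# benchmarks/data.py may use different keys and values.
IMAGE_DATASET_DICT = {
    "fashion_mnist": {
        "path": "fashion_mnist",    # Hugging Face dataset id (assumed)
        "image_column": "image",    # column holding the images (assumed)
        "splits": ("train", "test"),
    },
}

for name, cfg in IMAGE_DATASET_DICT.items():
    print(name, cfg["path"], cfg["splits"])
```

Adding a new benchmark dataset would then amount to appending another entry with the same shape.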