11
2- # FastDup Manual
2+ # FastDup
33
4- FastDup is a tool for fast detection of duplicate and near duplicate images.
4+ FastDup is a tool for fast detection of duplicate and near-duplicate images. It scales to millions of images while running on CPU only.
55
66![alt text](https://github.com/visualdatabase/fastdup/blob/main/gallery/git_main-min.png)
77
8- # FastDup is FAST
8+ ## Quick Installation
9+ Supported on Python 3.7 and 3.8:
10+ ``` bash
11+ pip install fastdup
12+ ```
13+
14+ [Install from stable release](INSTALL.md)
915
10- Experiments on a 32 core Google cloud machine, with 128GB RAM (no GPU required).
16+
17+ ## Running the code
18+
19+ ### Python
20+ ``` python
21+ # Run inside a python3 interpreter or script
22+ import fastdup
23+ fastdup.run(input_dir="/path/to/your/folder", work_dir="/path/to/your/folder")  # main running function
24+ ```
25+
26+ ### C++
27+ ``` bash
28+ /usr/bin/fastdup /path/to/your/folder --work_dir="/tmp/fastdup_files"
29+ ```
30+
31+ [Detailed running instructions](RUN.md)
32+
33+
34+
35+ ### Support for S3 / Google Cloud Storage
36+ [Detailed instructions](CLOUD.md)
37+
38+
39+ ## Results on Key Datasets
40+ We have thoroughly tested FastDup across a range of well-known computer-vision datasets, from academic datasets to Kaggle competitions. A key finding is that FastDup identifies ~1.2M (!) duplicate images in the ImageNet21K dataset, a previously unknown result! Full results are below.
41+
42+ ### FastDup is FAST
1143
1244| Dataset | Total Images | Owner | Image Res | cost [$] | spot cost [$] | processing [sec] | throughput [images/sec] |
1345| -----------------------| ---------------| -----------------------| --------------| --------| -------| -------| -----|
@@ -21,9 +53,11 @@ Experiments on a 32 core Google cloud machine, with 128GB RAM (no GPU required).
2153| [visualgenome](https://visualgenome.org/) | 108,079 | stanford | 334x500 | 0.05 | 0.01 | 124 | 872|
2254| [sku110k](https://github.com/eg4000/SKU110K_CVPR19) | 11,743 | trax | 4160x2340 | 0.03 | 0.01 | 77 | 153|
2355
24- We run on the full ImageNet dataset (11.5M images) to compare all pairs of images in less than 3 hours WITHOUT a GPU (with Google cloud cost of 5$ ).
56+ * Experiments were run on a 32-core Google Cloud machine with 128GB RAM (no GPU required).
2557
26- # FastDup is ACCURATE
58+ * We ran FastDup on the full ImageNet dataset (11.5M images), comparing all pairs of images in less than 3 hours WITHOUT a GPU (at a Google Cloud cost of $5).
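
The throughput column can be sanity-checked against the other columns. The sketch below assumes throughput is simply total images divided by processing seconds, rounded to the nearest integer; the README does not define it, so this is an assumption. The helper name `throughput` is hypothetical, and the figures are copied from the two table rows shown above.

``` python
def throughput(total_images: int, processing_sec: float) -> int:
    """Images processed per second, rounded to the nearest integer.

    Assumption: throughput = total images / processing time.
    """
    return round(total_images / processing_sec)


# (total images, processing seconds) taken from the table rows above
rows = {
    "visualgenome": (108_079, 124),
    "sku110k": (11_743, 77),
}

results = {name: throughput(n, sec) for name, (n, sec) in rows.items()}
for name, value in results.items():
    print(f"{name}: {value} images/sec")
```

Under that assumption, both rows reproduce the throughput values listed in the table (872 and 153).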
59+
60+ ### FastDup is ACCURATE
2761
2862
2963Dataset| Identical Pairs| Near-Identical Pairs
@@ -42,39 +76,3 @@ Dataset| Identical Pairs| Near-Identical Pairs
4276[snakeclef2022-fgvc9](https://www.kaggle.com/competitions/snakeclef2022/data) |6,953 |33,128
4377[fungiclef2022-fgvc9](https://www.kaggle.com/competitions/fungiclef2022/data) |2,205 |75
4478[hotel-id-to-combat-human-trafficking-2022-fgvc9](https://www.kaggle.com/competitions/hotel-id-to-combat-human-trafficking-2022-fgvc9/data) | 3,544 |2,704
45-
46-
47- FastDup identifies 1,200,000 duplicate images on the ImageNet dataset, a new unknown resut!
48-
49-
50- # Installing the code
51- For Python 3.7 and 3.8
52- ``` python
53- pip install fastdup
54- ```
55-
56- [ Install from stable release] ( INSTALL.md )
57-
58-
59- # Running the code
60-
61- ## Python
62- ``` python
63- python3
64- import fastdup
65- fastdup.run(input_dir = " /path/to/your/folder" , work_dir = " /path/to/your/folder" ) # main running function
66- ```
67-
68- ## C++
69- ``` bash
70- /usr/bin/fastdup /path/to/your/folder --work_dir=" /tmp/fastdup_files"
71- ```
72-
73- [ Detailed running instructions] ( RUN.md )
74-
75-
76-
77- # Support for s3 cloud/ google storage
78- [ Detailed instructions] ( CLOUD.md )
79-
80-