|
8 | 8 | "tags": [] |
9 | 9 | }, |
10 | 10 | "source": [ |
11 | | - "# Clustering Images with DINOv2\n", |
12 | | - "This notebook shows how to extract feature vectors efficiently using [DINOv2](https://github.com/facebookresearch/dinov2) by Meta AI Research - No GPU needed!\n", |
| 11 | + "# Extract Dataset Feature Vectors with DINOv2\n", |
| 12 | + "This notebook shows how to extract feature vectors efficiently using [DINOv2](https://github.com/facebookresearch/dinov2) by Meta AI Research \n", |
13 | 13 | "\n", |
14 | | - "This notebook includes artifacts from the DINOv2 repository, which is licensed under the\n", |
| 14 | + "Yes! Using only CPU, no GPU needed!\n", |
| 15 | + "\n", |
| 16 | + "> **NOTE**: This notebook includes artifacts from the DINOv2 repository, which is licensed under the\n", |
15 | 17 | "Creative Commons Attribution-NonCommercial 4.0 International License.\n", |
16 | 18 | "You are free to use this code as long as you provide attribution to the original author and use it in accordance with the terms of the license.\n", |
17 | 19 | "\n", |
|
23 | 25 | "id": "5d461982-b6b8-49fe-bf5e-ccb932f31ef7", |
24 | 26 | "metadata": {}, |
25 | 27 | "source": [ |
26 | | - "## Installation" |
| 28 | + "## Installation\n", |
| 29 | + "\n", |
| 30 | + "We will be using a free tool [fastdup](https://github.com/visual-layer/fastdup) to efficiently extract the feature vectors on CPU." |
27 | 31 | ] |
28 | 32 | }, |
29 | 33 | { |
|
47 | 51 | "source": [ |
48 | 52 | "## Download Oxford Pets Dataset\n", |
49 | 53 | "\n", |
50 | | - "For demonstration, we will use a widely available and well curated dataset. For that reason we might not find a lot of issues here. \n", |
| 54 | + "For demonstration, we will use a widely available and well curated dataset.\n", |
51 | 55 | "\n", |
52 | | - "!wget https://thor.robots.ox.ac.uk/~vgg/data/pets/images.tar.gz -O images.tar.gz\n", |
53 | | - "!tar xf images.tar.gzFeel free to swap this dataset with your own." |
| 56 | + "Feel free to swap this dataset with your own." |
54 | 57 | ] |
55 | 58 | }, |
56 | 59 | { |
|
72 | 75 | "tags": [] |
73 | 76 | }, |
74 | 77 | "source": [ |
75 | | - "## Import and Run fastdup to create embedding\n", |
76 | | - "> **Note** - Runtime on a 2-core Google Colab node may be slow, it is recommended to run on 32-core machine to get x16 speedup. " |
| 78 | + "## Import and Run\n", |
| 79 | + "\n", |
| 80 | + "Now we import fastdup and run it to extract the embeddings of the dataset."
77 | 81 | ] |
78 | 82 | }, |
79 | 83 | { |
|
85 | 89 | { |
86 | 90 | "data": { |
87 | 91 | "text/plain": [ |
88 | | - "'0.921'" |
| 92 | + "'0.922'" |
89 | 93 | ] |
90 | 94 | }, |
91 | 95 | "execution_count": 1, |
|
98 | 102 | "fastdup.__version__" |
99 | 103 | ] |
100 | 104 | }, |
| 105 | + { |
| 106 | + "cell_type": "markdown", |
| 107 | + "id": "d60024b5-8089-466e-910b-e3bcebf31636", |
| 108 | + "metadata": {}, |
| 109 | + "source": [ |
| 110 | + "Now specify the `input_dir` and `work_dir` for the run.\n", |
| 111 | + "\n", |
| 112 | + "- `input_dir` - A folder that stores your image dataset.\n", |
| 113 | + "\n", |
| 114 | + "- `work_dir` - A folder to save all artifacts from the run." |
| 115 | + ] |
| 116 | + }, |
101 | 117 | { |
102 | 118 | "cell_type": "code", |
103 | 119 | "execution_count": 2, |
|
107 | 123 | }, |
108 | 124 | "outputs": [], |
109 | 125 | "source": [ |
110 | | - "# The input for creating the embedding is a folder with images\n", |
111 | | - "# The output is saved in fastdup_work_dir.\n", |
112 | 126 | "fd = fastdup.create(input_dir=\"images/\", work_dir=\"fastdup_work_dir/\")" |
113 | 127 | ] |
114 | 128 | }, |
| 129 | + { |
| 130 | + "cell_type": "markdown", |
| 131 | + "id": "b9189af7-545e-48d0-b220-3c9daf9ea5cf", |
| 132 | + "metadata": {}, |
| 133 | + "source": [ |
| 134 | + "We are ready to run now.\n", |
| 135 | + "\n", |
| 136 | + "> **Note** - Runtime on a 2-core Google Colab node may be slow; we recommend running on a 32-core machine for up to a 16x speedup."
| 137 | + ] |
| 138 | + }, |
| 139 | + { |
| 140 | + "cell_type": "markdown", |
| 141 | + "id": "233d2522-d62e-45b5-a7ac-7c30d89d161a", |
| 142 | + "metadata": {}, |
| 143 | + "source": [ |
| 144 | + "Arguments - \n", |
| 145 | + "\n", |
| 146 | + "- `model_path` - The model to use for the run. Choose `dinov2s` or `dinov2b`.\n", |
| 147 | + "- `cc_threshold` - Connected component threshold. Read more [here](https://visual-layer.readme.io/docs/dataset-cleanup) to set an appropriate value for your dataset."
| 148 | + ] |
| 149 | + }, |
| 150 | + { |
| 151 | + "cell_type": "markdown", |
| 152 | + "id": "e8bef62f-208b-432b-9701-237151381d3f", |
| 153 | + "metadata": {}, |
| 154 | + "source": [ |
| 155 | + "You can optionally use your own ONNX model for the extraction. \n", |
| 156 | + "\n", |
| 157 | + "```python\n", |
| 158 | + "fd.run(model_path='your-model.onnx', cc_threshold=0.8, d=384)\n", |
| 159 | + "```\n", |
| 160 | + "\n", |
| 161 | + "If you use your own ONNX model, you need to specify `d`, the output dimension of the model.\n",
| 162 | + "\n", |
| 163 | + "> **NOTE**: For `dinov2s`, `d` is `384`, and for `dinov2b`, `d` is `768`."
| 164 | + ] |
| 165 | + }, |
115 | 166 | { |
116 | 167 | "cell_type": "code", |
117 | 168 | "execution_count": 3, |
|
135 | 186 | " and use it in accordance with the terms of the license.\n", |
136 | 187 | " For more information, please see: https://github.com/facebookresearch/dinov2/blob/main/LICENSE\n", |
137 | 188 | "FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.\n", |
138 | | - "2023-04-24 13:47:19 [INFO] Found resent/efficientnet/dinov2 model, setting up normalization\n", |
139 | | - "2023-04-24 13:47:19 [INFO] Going to loop over dir images\n", |
140 | | - "2023-04-24 13:47:19 [INFO] Found total 7390 images to run on, 7390 train, 0 test, name list 7390, counter 7390 \n", |
141 | | - "2023-04-24 13:47:26 [ERROR] Failed to read image images/Abyssinian_34.jpg\n", |
142 | | - "2023-04-24 13:49:13 [ERROR] Failed to read image images/Egyptian_Mau_139.jpg\n", |
143 | | - "2023-04-24 13:49:13 [ERROR] Failed to read image images/Egyptian_Mau_145.jpg\n", |
144 | | - "2023-04-24 13:49:15 [ERROR] Failed to read image images/Egyptian_Mau_167.jpg\n", |
145 | | - "2023-04-24 13:49:15 [ERROR] Failed to read image images/Egyptian_Mau_177.jpg\n", |
146 | | - "2023-04-24 13:49:16 [ERROR] Failed to read image images/Egyptian_Mau_191.jpg\n", |
147 | | - "2023-04-24 13:54:04 [INFO] Found total 7390 images to run on\n", |
148 | | - "Finished histogram 1.135\n", |
149 | | - "Finished bucket sort 1.155\n", |
150 | | - "2023-04-24 13:54:04 [INFO] 79) Finished write_index() NN model\n", |
151 | | - "2023-04-24 13:54:04 [INFO] Stored nn model index file fastdup_work_dir/nnf.index\n", |
152 | | - "2023-04-24 13:54:04 [INFO] Total time took 405302 ms\n", |
153 | | - "2023-04-24 13:54:04 [INFO] Found a total of 118 fully identical images (d>0.990), which are 0.53 %\n", |
154 | | - "2023-04-24 13:54:04 [INFO] Found a total of 14 nearly identical images(d>0.980), which are 0.06 %\n", |
155 | | - "2023-04-24 13:54:04 [INFO] Found a total of 511 above threshold images (d>0.900), which are 2.30 %\n", |
156 | | - "2023-04-24 13:54:04 [INFO] Found a total of 739 outlier images (d<0.050), which are 3.33 %\n", |
157 | | - "2023-04-24 13:54:04 [INFO] Min distance found 0.203 max distance 1.000\n", |
158 | | - "2023-04-24 13:54:04 [INFO] Running connected components for ccthreshold 0.800000 \n", |
| 189 | + "2023-05-02 12:09:03 [INFO] Found resent/efficientnet/dinov2 model, setting up normalization\n", |
| 190 | + "2023-05-02 12:09:03 [INFO] Going to loop over dir images\n", |
| 191 | + "2023-05-02 12:09:03 [INFO] Found total 7390 images to run on, 7390 train, 0 test, name list 7390, counter 7390 \n", |
| 192 | + "2023-05-02 12:09:10 [ERROR] Failed to read image images/Abyssinian_34.jpg\n", |
| 193 | + "2023-05-02 12:10:57 [ERROR] Failed to read image images/Egyptian_Mau_139.jpg\n", |
| 194 | + "2023-05-02 12:10:57 [ERROR] Failed to read image images/Egyptian_Mau_145.jpg\n", |
| 195 | + "2023-05-02 12:10:58 [ERROR] Failed to read image images/Egyptian_Mau_167.jpg\n", |
| 196 | + "2023-05-02 12:10:59 [ERROR] Failed to read image images/Egyptian_Mau_177.jpg\n", |
| 197 | + "2023-05-02 12:11:00 [ERROR] Failed to read image images/Egyptian_Mau_191.jpg\n", |
| 198 | + "2023-05-02 12:15:19 [INFO] Found total 7390 images to run on\n", |
| 199 | + "Finished histogram 1.164\n", |
| 200 | + "Finished bucket sort 1.183\n", |
| 201 | + "2023-05-02 12:15:20 [INFO] 90) Finished write_index() NN model\n", |
| 202 | + "2023-05-02 12:15:20 [INFO] Stored nn model index file fastdup_work_dir/nnf.index\n", |
| 203 | + "2023-05-02 12:15:20 [INFO] Total time took 376277 ms\n", |
| 204 | + "2023-05-02 12:15:20 [INFO] Found a total of 118 fully identical images (d>0.990), which are 0.53 %\n", |
| 205 | + "2023-05-02 12:15:20 [INFO] Found a total of 14 nearly identical images(d>0.980), which are 0.06 %\n", |
| 206 | + "2023-05-02 12:15:20 [INFO] Found a total of 511 above threshold images (d>0.900), which are 2.30 %\n", |
| 207 | + "2023-05-02 12:15:20 [INFO] Found a total of 739 outlier images (d<0.050), which are 3.33 %\n", |
| 208 | + "2023-05-02 12:15:20 [INFO] Min distance found 0.229 max distance 1.000\n", |
| 209 | + "2023-05-02 12:15:20 [INFO] Running connected components for ccthreshold 0.800000 \n", |
159 | 210 | ".0\n", |
160 | 211 | " ########################################################################################\n", |
161 | 212 | "\n", |
|
171 | 222 | " For a detailed analysis, use `.connected_components()`\n", |
172 | 223 | "(similarity threshold used is 0.9, connected component threshold used is 0.8).\n", |
173 | 224 | "\n", |
174 | | - " Outliers: 6.31% (466) of images are possible outliers, and fall in the bottom 5.00% of similarity values.\n", |
| 225 | + " Outliers: 6.29% (465) of images are possible outliers, and fall in the bottom 5.00% of similarity values.\n", |
175 | 226 | " For a detailed list of outliers, use `.outliers()`.\n" |
176 | 227 | ] |
177 | 228 | } |
178 | 229 | ], |
179 | 230 | "source": [ |
180 | | - "# Use dinov2s for the smaller dino model (d=384), or dinov2b for the bigger model (d=786)\n", |
181 | 231 | "fd.run(model_path='dinov2s', cc_threshold=0.8)" |
182 | 232 | ] |
183 | 233 | }, |
|
191 | 241 | "source": [ |
192 | 242 | "## Image Clusters\n", |
193 | 243 | "\n", |
194 | | - "Let's debug the embedding quality by clustering group of similar images and visualizing them." |
| 244 | + "Let's debug the embedding quality by clustering groups of similar images and visualizing them.\n",
| 245 | + "\n", |
| 246 | + "In the visualization below, `component` refers to the cluster number. For example -\n", |
| 247 | + "\n", |
| 248 | + "- `component` `933` refers to cluster `933` found in the dataset.\n", |
| 249 | + "\n", |
| 250 | + "- `num_images` refers to the number of images in the cluster (`component`).\n", |
| 251 | + "\n", |
| 252 | + "- `mean_distance` refers to the mean distance of all the images in the cluster (`component`)." |
195 | 253 | ] |
196 | 254 | }, |
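Conceptually, grouping images with a `cc_threshold` amounts to running connected components over a similarity graph: any pair of images whose similarity exceeds the threshold is linked, and each linked group becomes one `component`. The union-find sketch below illustrates that idea on toy data; it is not fastdup's actual implementation, and the similarity scores are made up.

```python
# Illustrative sketch of threshold-based clustering (NOT fastdup's internals):
# link pairs whose similarity exceeds the threshold, then each connected
# component of the resulting graph becomes one cluster.

def find(parent, i):
    # Find the root of i with path compression.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def components(n, similarities, threshold):
    """n: number of images; similarities: dict mapping (i, j) -> score."""
    parent = list(range(n))
    for (i, j), sim in similarities.items():
        if sim > threshold:
            ri, rj = find(parent, i), find(parent, j)
            parent[ri] = rj
    # Group image indices by their component root.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(parent, i), []).append(i)
    return list(clusters.values())

# Toy example: images 0-1 and 2-3 are near-duplicates, 1-2 are unrelated.
sims = {(0, 1): 0.95, (1, 2): 0.40, (2, 3): 0.91}
print(components(4, sims, threshold=0.8))  # [[0, 1], [2, 3]]
```

With a higher threshold (say `0.94`), only the `(0, 1)` edge survives and image 2 and 3 fall into singleton components, which is why the choice of `cc_threshold` depends on your dataset.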
197 | 255 | { |
|
212 | 270 | "name": "stderr", |
213 | 271 | "output_type": "stream", |
214 | 272 | "text": [ |
215 | | - "100%|███████████████████████████████████| 20/20 [00:03<00:00, 5.50it/s]\n" |
| 273 | + "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:03<00:00, 5.42it/s]\n" |
216 | 274 | ] |
217 | 275 | }, |
218 | 276 | { |
|
221 | 279 | "text": [ |
222 | 280 | "Finished OK. Components are stored as image files fastdup_work_dir/galleries/components_[index].jpg\n", |
223 | 281 | "Stored components visual view in fastdup_work_dir/galleries/components.html\n", |
224 | | - "Execution time in seconds 5.9\n" |
| 282 | + "Execution time in seconds 6.0\n" |
225 | 283 | ] |
226 | 284 | }, |
227 | 285 | { |
|
1316 | 1374 | "fd.vis.component_gallery()" |
1317 | 1375 | ] |
1318 | 1376 | }, |
| 1377 | + { |
| 1378 | + "cell_type": "markdown", |
| 1379 | + "id": "2e7af64e-5eac-4faf-9aba-79d6b0be4b6c", |
| 1380 | + "metadata": {}, |
| 1381 | + "source": [ |
| 1382 | + "## Feature Vector Dimensions"
| 1383 | + ] |
| 1384 | + }, |
1319 | 1385 | { |
1320 | 1386 | "cell_type": "code", |
1321 | 1387 | "execution_count": 5, |
|
1364 | 1430 | "print(\"Feature vector matrix dimensions\", feature_vec.shape)" |
1365 | 1431 | ] |
1366 | 1432 | }, |
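Once the feature matrix is loaded, a natural first use is comparing two embeddings directly. The sketch below computes cosine similarity between two rows with plain NumPy; the random matrix is only a stand-in for the real `feature_vec` (7390 rows for the Oxford Pets images, 384 columns matching `dinov2s`).

```python
import numpy as np

# Stand-in for the real feature matrix extracted by fastdup:
# one 384-dim row per image (dinov2s output size).
rng = np.random.default_rng(0)
feature_vec = rng.normal(size=(7390, 384)).astype(np.float32)

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the two vectors after L2 normalization.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine_similarity(feature_vec[0], feature_vec[1])
print(f"similarity between image 0 and image 1: {sim:.3f}")
```

Nearest-neighbor search, duplicate detection, and outlier scoring all build on this same pairwise comparison.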
| 1433 | + { |
| 1434 | + "cell_type": "markdown", |
| 1435 | + "id": "f3f75164-a790-436f-9ff5-ae95927b7f20", |
| 1436 | + "metadata": {}, |
| 1437 | + "source": [ |
| 1438 | + "## Wrap up\n", |
| 1439 | + "\n", |
| 1440 | + "With the embeddings extracted, how you use them is up to you: find anomalies, detect duplicates, or visualize them.\n",
| 1441 | + "\n", |
| 1442 | + "Feel free to check out our notebooks on how to use fastdup to find issues in your visual datasets. In combination with DINOv2 models, this could significantly improve the quality of your image dataset!\n", |
| 1443 | + "\n", |
| 1444 | + "We recommend checking out -\n", |
| 1445 | + "\n", |
| 1446 | + "- [**Quick Dataset Analysis**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb) - Learn how to quickly analyze a dataset for potential issues. Identify duplicates, outliers, dark/bright/blurry images, and cluster similar images with only a few lines of code.\n", |
| 1447 | + "\n", |
| 1448 | + "- [**Cleaning Image Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb) - Learn how to clean a dataset from broken images, duplicates, outliers, and identify dark/bright/blurry images.\n", |
| 1449 | + "\n", |
| 1450 | + "As usual, feedback is welcome! Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) if you have questions!\n", |
| 1451 | + "Happy learning 😀" |
| 1452 | + ] |
| 1453 | + }, |
1367 | 1454 | { |
1368 | 1455 | "cell_type": "code", |
1369 | 1456 | "execution_count": null, |
1370 | | - "id": "2JbfqfSPTuSC", |
1371 | | - "metadata": { |
1372 | | - "id": "2JbfqfSPTuSC" |
1373 | | - }, |
| 1457 | + "id": "ab6f39a3-603e-4914-9a14-9c8ee148fd9c", |
| 1458 | + "metadata": {}, |
1374 | 1459 | "outputs": [], |
1375 | 1460 | "source": [] |
1376 | 1461 | } |
|
1394 | 1479 | "name": "python", |
1395 | 1480 | "nbconvert_exporter": "python", |
1396 | 1481 | "pygments_lexer": "ipython3", |
1397 | | - "version": "3.10.10" |
| 1482 | + "version": "3.10.11" |
1398 | 1483 | } |
1399 | 1484 | }, |
1400 | 1485 | "nbformat": 4, |
|