Python support
- `Python 3.8` (via pip) or `Python 3.7` (via pip or conda) or a `debian package` (Python is not required)

Hardware support
- CPU (GPU not needed!)
# Running the code

```
> python3
> import fastdup
> fastdup.__version__  # prints the version number
> fastdup.run("/path/to/your/folder")  # main running function
```

## Detailed Python API documentation

```
Run the fastdup tool for finding duplicate and near-duplicate images in a corpus of images.
The only mandatory argument is input_dir. Given an image directory it will compare all pairs of images and store the most similar ones in the output file output_similarity.

Parameters:
    input_dir (str): Location of the images directory (or videos).
        Alternatively, it is also possible to give a location of a file listing images' full paths, one image per row.

    work_dir (str): Working directory for saving intermediate results and outputs.

    compute (str): Compute type [cpu|gpu]. Default is cpu.

    verbose (boolean): Verbosity. Default is False.

    num_threads (int): Number of threads. Default is -1, which is auto-configured by the number of cores.

    num_images (int): Number of images to run on. Default is -1, which means run on all the images in the input_dir folder.

    nnmodel (str): Nearest neighbor model for clustering the features together, when using turi (has no effect when using faiss). Supported options are brute_force (exact), ball_tree and lsh (both approximate). Default is brute_force.

    distance (str): Distance metric for the nearest neighbors algorithm. Default is cosine. Other distances are euclidean, squared_euclidean, manhattan.

    threshold (float): Similarity measure in the range 0->1, where 1 is totally identical, 0.98 and above is almost identical, and 0.85 and above is very similar. Default is 0.85, which means that only image pairs with similarity larger than 0.85 are stored.

    lower_threshold (float): Similarity measure used to outline images that are far away (outliers) vs. the total distribution. Default is 0.3.

    model_path (str): Optional location of an ONNX model file; should not be used.

    version (bool): Print out the version number. This function takes no argument.

    nearest_neighbors_k (int): For each image, how many similar images to look for. Default is 2.

    run_mode (int): This software can run for feature vector extraction and similarity measurement (0), just feature vector extraction (1), or just similarity measure computation (2).

    nn_provider (string): Provider of the nearest neighbor algorithm. Allowed values are turi|faiss.

    min_offset (int): Optional min offset to start iterating on the full file list. Default is -1.

    max_offset (int): Optional max offset to stop iterating on the full file list. Default is -1.

    faiss_mode (str): When nn_provider='faiss', selects the faiss mode. Supported options are HNSW32 and any other faiss index string.

    faiss_param (str): When nn_provider='faiss', assigns optional faiss parameters. For example efSearch=175. Multiple params are supported, for example 'efSearch=175,nprobes=200'.

Returns:
    Status code 0 = success, 1 = error.
```
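
A minimal usage sketch combining several of the documented parameters (the paths are placeholders):

```
import fastdup

# run_mode 0: extract feature vectors and compute similarities in one pass,
# using faiss as the nearest-neighbor provider with the HNSW32 index.
ret = fastdup.run(input_dir="/path/to/your/folder",
                  work_dir="/path/to/work_dir",
                  run_mode=0,
                  threshold=0.9,
                  nearest_neighbors_k=2,
                  nn_provider="faiss",
                  faiss_mode="HNSW32")
assert ret == 0  # 0 = success, 1 = error
```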
## Input / output formats

The input to the fastdup tool is given in the command line argument input_dir. There are a few options:
- Location of a local folder. In that case all images in this folder are searched recursively.
- Location of an s3 path. Again all images in the path will be used recursively.
- A file containing image locations (either local or full s3 paths), each image in its own row.
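
For the third option, a file list mixing local and full s3 paths might look like this (paths are illustrative):

```
/home/user/images/img_001.jpg
/home/user/images/subdir/img_002.jpg
s3://mybucket/myfolder/img_003.jpg
```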

The intermediate outputs and final outputs are stored in the folder work_dir.

Feature extraction related files:
- A binary numpy array containing n rows of 576 columns with the feature vectors (default filename is features.dat).
- An additional csv file containing the full paths to the image names corresponding to the feature vectors (default filename is features.dat.csv). This is needed for two reasons:
  - The order of extraction may change depending on the file system listing.
  - In case of a corrupted image, its feature vector is skipped and not generated. In that case an additional output file is provided (features.bad.csv).

Similarity pair list:
The output of the fastdup tool is a similarity file (filename is similarity.csv), which is a csv file with 3 columns: from, to, distance. The file is sorted from the closest matching images to less similar images.
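
A quick way to inspect the resulting pair list, assuming pandas is installed (the work_dir path is a placeholder):

```
import pandas as pd

# Each row is a pair of similar images; rows are sorted from most to least similar.
df = pd.read_csv("/path/to/work_dir/similarity.csv")
print(df.columns.tolist())  # ['from', 'to', 'distance']
print(df.head())
```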

Note: for exploiting the binary features we provide the following function in Python:

```
def load_binary_feature(filename):

    Example Python function for loading the stored binary features and their matching filenames.

    Parameters:
        filename (str): The binary feature file location.

    Returns:
        A list with all image file names, of length X.
        An np matrix of shape X rows x 576 cols. Each row conforms to the feature vector of a single image.
```

When using faiss an additional intermediate results file is created: faiss.index.
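
If you prefer to load the features yourself, a minimal sketch along the following lines should work. It assumes the binary file is a flat array of little-endian float32 values and that features.dat.csv has a filename column; both are assumptions, not confirmed by this document:

```
import numpy as np
import pandas as pd

def load_binary_feature_sketch(filename, d=576):
    # Assumption: features.dat.csv lists one image path per row in a 'filename' column.
    files = pd.read_csv(filename + ".csv")["filename"].tolist()
    # Assumption: features are stored as little-endian float32, one 576-dim row per image.
    feats = np.fromfile(filename, dtype="<f4").reshape(len(files), d)
    return files, feats

files, feats = load_binary_feature_sketch("/path/to/work_dir/features.dat")
```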

## Support for cloud storage

FastDup supports two types of cloud storage:
- Amazon s3 aws cli
- Min.io cloud storage api

## Amazon s3 aws cli support

### Preliminaries
- Install aws cli using the command
`sudo apt install awscli`
- Configure your aws using the command
`aws configure`
- Make sure you can access your bucket using
`aws s3 ls s3://<your bucket name>`

### How to run

There are two options to run.

In the input_dir command line argument, put the full path to your bucket, for example: `s3://mybucket/myfolder/myother_folder/`
This option is useful for testing, but it is not recommended for large corpora of images, as listing files in s3 is a slow operation. In this mode, all the images in the recursive subfolders of the given folder will be used.

Alternatively (and recommended), create a file with the list of all your images in the following format:
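
A plausible files.txt, following the one-image-per-row format described in the input/output section (paths are illustrative):

```
s3://mybucket/myfolder/myother_folder/image1.jpg
s3://mybucket/myfolder/myother_folder/image2.jpg
s3://mybucket/myfolder/myother_folder/subfolder/image3.jpg
```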

Assuming the filename is files.txt you can run with input_dir='/path/to/files.txt'

Notes:
- Currently we support a single cloud provider and a single bucket.
- It is OK to have images with the same name, assuming they are nested in different subfolders.
- In terms of performance, it is better to first copy the full bucket to the local node, in case the local disk is large enough, and then give input_dir as the local folder location of the copied data. The explanation above is for the case where the dataset is larger than the local disk (and potentially multiple nodes run in parallel).

## Min.io cloud storage support

Configure the client to point to the cloud provider:

```
mc alias set myminio/ http://MINIO-SERVER MYUSER MYPASSWORD
```

For example, for google cloud:

```
/usr/bin/mc alias set google https://storage.googleapis.com/ <access_key> <secret_key>
```

Make sure the bucket is accessible using the command:

```
/usr/bin/mc ls google/mybucket/myfolder/myotherfolder/
```

### How to run

There are two options to run.

In the input_dir command line argument, put the full path to your cloud storage provider as defined by the minio alias, for example: `minio://google/mybucket/myfolder/myother_folder/`
(Note that google is the alias set for google cloud, and the path has to start with the `minio://` prefix.)
This option is useful for testing, but it is not recommended for large corpora of images, as listing files in the bucket is a slow operation. In this mode, all the images in the recursive subfolders of the given folder will be used.

Alternatively (and recommended), create a file with the list of all your images in the following format:
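
A plausible files.txt for the minio case, following the same one-image-per-row format (paths are illustrative):

```
minio://google/mybucket/myfolder/myother_folder/image1.jpg
minio://google/mybucket/myfolder/myother_folder/image2.jpg
```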

Assuming the filename is `files.txt` you can run with `input_dir='/path/to/files.txt'`

[Detailed running instructions](RUN.md)
## Error handling
When bad images are encountered, namely corrupted images that can not be read, an additional csv output file is generated, called features.dat.bad. The bad images' filenames are stored there. In addition, there is a printout that states the number of good and bad images encountered. The good images' filenames are stored in the features.dat.csv file; namely, the bad images are excluded from the total images listing. The function fastdup.load_binary_features() reads the features corresponding to the good images and returns a list of all the good images and a numpy array of all their corresponding features.
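
A hedged sketch of using this after a run; it assumes load_binary_features takes the feature file location, like the load_binary_feature function documented above, and that the default filenames are used:

```
import fastdup

# Returns the good images' filenames and their feature vectors;
# corrupted images listed in features.dat.bad are already excluded.
files, feats = fastdup.load_binary_features("/path/to/work_dir/features.dat")
print(f"loaded {len(files)} good images, feature matrix shape: {feats.shape}")
```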