
Commit 1063c18

Author: dbickson
Committed: cleaning documentation
1 parent 9cab0bc, commit 1063c18

File tree

4 files changed: +207 −201 lines


CLOUD.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
# Support for cloud storage

FastDup supports two types of cloud storage:
- Amazon s3 via the aws cli
- Min.io cloud storage api

## Amazon s3 aws cli support
### Preliminaries:
- Install the aws cli using the command
`sudo apt install awscli`
- Configure your aws credentials using the command
`aws configure`
- Make sure you can access your bucket using
`aws s3 ls s3://<your bucket name>`

## How to run
There are two ways to run.
In the input_dir command line argument, put the full path to your bucket, for example: `s3://mybucket/myfolder/myother_folder/`
This option is useful for testing, but it is not recommended for large corpora of images, as listing files in s3 is a slow operation. In this mode, all the images in the recursive subfolders of the given folder will be used.
Alternatively (and recommended), create a file with the list of all your images in the following format:
```
s3://mybucket/myfolder/myother_folder/image1.jpg
s3://mybucket/myfolder2/myother_folder4/image2.jpg
s3://mybucket/myfolder3/myother_folder5/image3.jpg
```
Assuming the filename is `files.txt`, you can run with `input_dir='/path/to/files.txt'`
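
For example, the list file can be built once with the aws cli and handed to fastdup from Python. This is a minimal sketch, not official tooling: the bucket and folder names, the image-extension filter, and the work_dir value are placeholder assumptions.
```
import subprocess
import fastdup

# List the bucket once up front (faster than letting the tool list s3 itself).
listing = subprocess.run(
    ["aws", "s3", "ls", "--recursive", "s3://mybucket/myfolder/"],
    capture_output=True, text=True, check=True,
).stdout

# Each listing row is "date time size key"; keep only image keys.
with open("files.txt", "w") as f:
    for line in listing.splitlines():
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[3].lower().endswith((".jpg", ".jpeg", ".png")):
            f.write("s3://mybucket/" + parts[3] + "\n")

fastdup.run(input_dir="files.txt", work_dir="fastdup_workdir")
```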

Notes:
- Currently we support a single cloud provider and a single bucket.
- It is OK to have images with the same name as long as they are nested in different subfolders.
- In terms of performance, it is better to first copy the full bucket to the local node, if the local disk is large enough, and then give input_dir as the local folder location of the copied data. The instructions above are for the case where the dataset is larger than the local disk (and potentially multiple nodes run in parallel).

## Min.io support
### Preliminaries
Install the min.io client using the command
```
wget https://dl.min.io/client/mc/release/linux-amd64/mc
sudo mv mc /usr/bin/
sudo chmod +x /usr/bin/mc
```
Configure the client to point to the cloud provider:

```
mc alias set myminio/ http://MINIO-SERVER MYUSER MYPASSWORD
```
For example, for google cloud:
```
/usr/bin/mc alias set google https://storage.googleapis.com/ <access_key> <secret_key>
```
Make sure the bucket is accessible using the command:
```
/usr/bin/mc ls google/mybucket/myfolder/myotherfolder/
```

## How to run
There are two ways to run.
In the input_dir command line argument, put the full path to your cloud storage provider as defined by the minio alias, for example: `minio://google/mybucket/myfolder/myother_folder/`
(Note that google is the alias set for google cloud, and the path has to start with the `minio://` prefix.)
This option is useful for testing, but it is not recommended for large corpora of images, as listing files in the cloud is a slow operation. In this mode, all the images in the recursive subfolders of the given folder will be used.
Alternatively (and recommended), create a file with the list of all your images in the following format:
```
minio://google/mybucket/myfolder/myother_folder/image1.jpg
minio://google/mybucket/myfolder/myother_folder/image2.jpg
minio://google/mybucket/myfolder/myother_folder/image3.jpg
```
Assuming the filename is `files.txt`, you can run with `input_dir='/path/to/files.txt'`
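
From Python, the call then looks the same as in the s3 case. A minimal sketch, assuming the `google` alias configured above (the bucket path and work_dir are placeholders):
```
import fastdup

# input_dir may be the minio:// path itself or a file listing such paths,
# one per row, as shown above.
fastdup.run(input_dir="minio://google/mybucket/myfolder/myother_folder/",
            work_dir="fastdup_workdir")
```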

INSTALL.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Installation
## Ubuntu 20.04 LTS Machine Setup
Required setup:
- `sudo apt update`
- `sudo apt -y install software-properties-common`
- `sudo add-apt-repository -y ppa:deadsnakes/ppa`
- `sudo apt update`
- `sudo apt -y install python3.8`
- `sudo apt -y install python3-pip`
- `pip install --upgrade pip`

# Pip Package setup
Download the latest FastDup wheel from the following shared folder: `s3://visualdb`

Latest version: 0.25

## For pip (python 3.8) install using
```
pip install fastdup-<VERSION>-cp38-cp38-linux_x86_64.whl
```

## For conda (python 3.7.11) install using
```
conda install -y pandas tqdm opencv numpy
conda install fastdup-<VERSION>-py37_0.tar.bz
```
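
Either way, a quick post-install sanity check (a minimal sketch; the version attribute is the one used in the README's running example):
```
import fastdup

# Should print the installed version number, e.g. 0.25
print(fastdup.__version__)
```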

# Currently supported software/hardware

Operating system
- `Ubuntu 20.04 LTS`

Software versions
- `Python 3.8` (via pip) or `Python 3.7` (via pip or conda) or a `debian package` (Python is not required)

Hardware support
- CPU (GPU not needed!)

README.md

Lines changed: 1 addition & 201 deletions
@@ -22,215 +22,15 @@ FastDup identifies 1,200,000 duplicate images on the ImageNet dataset.
# Running the code
```
> python3
> import fastdup
> fastdup.__version__  # prints the version number
> fastdup.run("/path/to/your/folder")  # main running function
```

Detailed Python API documentation

```
Run the fastdup tool to find duplicate and near-duplicate images in a corpus of images.
The only mandatory argument is input_dir. Given an image directory it will compare all pairs of images and store the most similar ones in the output file output_similarity.

Parameters:
input_dir (str): Location of the images directory (or videos).
Alternatively, it is also possible to give the location of a file listing the full paths of images, one image per row.

work_dir (str): Working directory for saving intermediate results and outputs.

compute (str): Compute type [cpu|gpu]. Default is cpu.

verbose (boolean): Verbosity. Default is False.

num_threads (int): Number of threads. Default is -1, which auto-configures by the number of cores.

num_images (int): Number of images to run on. Default is -1, which means run on all the images in the input_dir folder.

nnmodel (str): Nearest neighbor model for clustering the features together, when using turi (has no effect when using faiss). Supported options are brute_force (exact), ball_tree and lsh (both approximate). Default is brute_force.

distance (str): Distance metric for the nearest neighbors algorithm. Default is cosine. Other distances are euclidean, squared_euclidean, manhattan.

threshold (float): Similarity measure in the range 0->1, where 1 is totally identical, 0.98 and above is almost identical, and 0.85 and above is very similar. Default is 0.85, which means that only image pairs with similarity larger than 0.85 are stored.

lower_threshold (float): Similarity measure used to flag images that are far away (outliers) from the total distribution. Default value is 0.3.

model_path (str): Optional location of an ONNX model file; should not be used.

version (bool): Print out the version number. This option takes no further arguments.

nearest_neighbors_k (int): For each image, how many similar images to look for. Default is 2.

run_mode (int): Run both feature vector extraction and similarity measurement (0), just feature vector extraction (1), or just similarity measure computation (2).

nn_provider (string): Provider of the nearest neighbor algorithm; allowed values are turi|faiss.

min_offset (int): Optional min offset to start iterating on the full file list. Default is -1.

max_offset (int): Optional max offset to stop iterating on the full file list. Default is -1.

faiss_mode (str): When nn_provider='faiss', selects the faiss mode. Supported options are HNSW32 and any other faiss string.

faiss_param (str): When nn_provider='faiss', assigns optional faiss parameters, for example efSearch=175. Multiple params are supported, for example 'efSearch=175,nprobes=200'.

Returns:
Status code 0 = success, 1 = error.
```
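
As an illustration of the parameters above, here is a sketch of a combined call; the paths and chosen values are placeholders, not recommendations:
```
import fastdup

# Illustrative values only; each keyword is documented in the listing above.
status = fastdup.run(
    input_dir="/path/to/images",   # or a file listing one image path per row
    work_dir="fastdup_workdir",    # intermediate and final outputs land here
    num_threads=-1,                # auto-configured by the number of cores
    threshold=0.9,                 # store only pairs with similarity above 0.9
    nearest_neighbors_k=2,         # similar images to look for per image
    nn_provider="faiss",
    faiss_mode="HNSW32",
)
print("success" if status == 0 else "error")  # status: 0 = success, 1 = error
```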
## Input / output formats
The input to the fastdup tool is given in the command line argument input_dir. There are a few options:
- Location of a local folder. In that case all images in this folder are searched recursively.
- Location of an s3 path. Again, all images in the path will be used recursively.
- A file containing image locations (either local or full s3 paths), each image in its own row.

The intermediate and final outputs are stored in the folder work_dir.
Feature extraction related files:
- A binary numpy array containing n rows of 576 columns with the feature vectors (default filename is features.dat).
- An additional csv file containing the full paths to the image names corresponding to the feature vectors (default filename is features.dat.csv). This is needed for two reasons: the order of extraction may change depending on the file system listing, and in case of a corrupted image its feature vector is skipped and not generated. In that case an additional output file is provided (features.bad.csv).

Similarity pair list
The output of the fastdup tool is a similarity file (filename is similarity.csv), which is a csv file with 3 columns: from, to, distance. The file is sorted from the closest matching images down to the least similar.
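
For instance (a sketch only: the column names and filename come from the description above, and the work_dir path is a placeholder), the similarity list can be inspected with pandas:
```
import pandas as pd

# similarity.csv has 3 columns (from, to, distance), sorted from the
# closest matching pairs downwards.
df = pd.read_csv("fastdup_workdir/similarity.csv")
print(df.head(10))  # the ten most similar image pairs
```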
Note: for making use of the binary features we provide the following function in Python:

```
def load_binary_feature(filename):

Example Python function for loading the stored binary features and their matching filenames.

Parameters:
filename (str): The binary feature file location.

Returns:
A list with all image file names, of length X.
An np matrix of shape X rows x 576 cols. Each row corresponds to the feature vector of a single image.

Example:
import fastdup
file_list, mat_features = fastdup.load_binary_feature('features.dat')
```

Faiss index files
When using faiss, an additional intermediate results file is created: faiss.index.

[Detailed running instructions](RUN.md)

## Error handling
When bad images are encountered, namely corrupted images that cannot be read, an additional csv output file is generated called features.dat.bad. The filenames of the bad images are stored there. In addition, there is a printout that states the number of good and bad images encountered. The filenames of the good images are stored in the file features.dat.csv; that is, the bad images are excluded from the total image listing. The function fastdup.load_binary_features() reads the features corresponding to the good images and returns a list of all the good images and a numpy array of all their corresponding features.
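
A short sketch of that flow (the function name is as stated above, but the exact filename argument mirrors the earlier load_binary_feature example and is an assumption here):
```
import fastdup

# Load only the good images and their corresponding features after a run.
file_list, features = fastdup.load_binary_features('features.dat')
print(len(file_list), "good images; feature matrix shape:", features.shape)
```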
