Extracting GIST features on the full dataset

In this chapter, we run the GIST baseline on the full dataset so that we can produce a real submission.

Extracting query descriptors

We re-use the PCA matrix computed in the previous chapter. The GIST descriptors for the 50k query images can be extracted with:

python baselines/gist_baseline.py \
    --file_list list_files/dev_queries \
    --image_dir images/queries \
    --o data/dev_queries_gist_pca.hdf5 \
    --pca_file data/pca_gist.vt \
    --nproc 20

However, ground truth is available only for the first 25k queries, so we also extract the GIST features for that subset in order to evaluate the results locally:

python baselines/gist_baseline.py \
    --file_list list_files/dev_queries_25k \
    --image_dir images/queries \
    --o data/dev_queries_25k_gist_pca.hdf5 \
    --pca_file data/pca_gist.vt \
    --nproc 20

Yes, we could also obtain these descriptors by taking the first 25k rows of the full query matrix. But GIST is very fast to extract, so it's not worth the trouble.

Extracting the reference descriptors by batches

The 1 million reference descriptors could be extracted in a single run just as before, which takes 10 to 20 minutes. However, for large datasets it is often useful to extract features in batches, so that the computation can be distributed or restarted after a failure.

To support this, the feature extraction script takes the arguments --i0 and --i1, which specify the subset of images to extract. We use them to run the feature extraction in 20 batches of 50k images in a shell loop:

for i in {0..19}; do
    python baselines/gist_baseline.py \
        --file_list list_files/references \
        --i0 $((i * 50000)) --i1 $(((i + 1) * 50000)) \
        --image_dir images/references \
        --o data/references_${i}_gist_pca.hdf5 \
        --pca_file data/pca_gist.vt \
        --nproc 20
done

This produces the files data/references_0_gist_pca.hdf5 to data/references_19_gist_pca.hdf5, which we use in the next sections. There, we rely on the handy shell brace expansion data/references_{0..19}_gist_pca.hdf5 (available in bash and zsh), which expands to that list of files.
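Because each batch writes its own output file, the loop can be made restartable by skipping batches whose output already exists. The following is a minimal bash sketch, not part of the repository. It assumes that --i1 is an exclusive upper bound (as the loop bounds above suggest) and that the script writes exactly to the path given to --o. The helpers batch_bounds and run_batches, the EXTRACT_CMD override, and the .tmp suffix are hypothetical names introduced for illustration.

```shell
# Print "index i0 i1" for each batch covering images [0, total).
# Assumption: i1 is an exclusive upper bound.
batch_bounds() {
    local total=$1 size=$2 i=0 i0=0 i1
    while [ "$i0" -lt "$total" ]; do
        i1=$((i0 + size))
        if [ "$i1" -gt "$total" ]; then i1=$total; fi   # clamp the last batch
        echo "$i $i0 $i1"
        i=$((i + 1))
        i0=$i1
    done
}

# Run every batch whose output file does not exist yet.
# EXTRACT_CMD (hypothetical) can replace the extraction command for testing.
run_batches() {
    local total=$1 size=$2 i i0 i1 out
    batch_bounds "$total" "$size" | while read -r i i0 i1; do
        out="data/references_${i}_gist_pca.hdf5"
        if [ -e "$out" ]; then
            echo "skip batch $i: $out exists"
            continue
        fi
        # Write to a temporary name and rename only on success, so an
        # interrupted run never leaves a partial file under the final name.
        # stdin is redirected so the command cannot consume the batch list.
        ${EXTRACT_CMD:-python baselines/gist_baseline.py} \
            --file_list list_files/references \
            --i0 "$i0" --i1 "$i1" \
            --image_dir images/references \
            --o "$out.tmp" \
            --pca_file data/pca_gist.vt \
            --nproc 20 < /dev/null \
        && mv "$out.tmp" "$out"
    done
}
```

Calling run_batches 1000000 50000 covers the same 20 ranges as the loop above. After a failure, running it again only recomputes the batches that did not finish.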

Evaluation on the public queries subset

We can now evaluate locally on the 25k public queries. The evaluation script takes all the reference descriptor files at once.
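Brace expansion generates the 20 file names whether or not the files exist, so a batch that failed would only be noticed once the evaluation tries to read it. A small pre-flight check can catch this earlier. This is only a sketch; missing_files is a hypothetical helper, not part of the repository.

```shell
# Print every argument that is not an existing, non-empty file.
# Returns success only if all of them are present.
missing_files() {
    local rc=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            rc=1
        fi
    done
    return "$rc"
}
```

In bash, missing_files data/references_{0..19}_gist_pca.hdf5 && echo "all batches present" runs the check on the same list of files that the evaluation command receives.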

python scripts/compute_metrics.py \
    --query_descs data/dev_queries_25k_gist_pca.hdf5 \
    --db_descs data/references_{0..19}_gist_pca.hdf5 \
    --gt_filepath list_files/public_ground_truth.csv \
    --track2

This takes a bit more time (especially if the machine's GPU is not supported by Faiss) and outputs:

Average Precision: 0.15209
Recall at P90    : 0.10439
Threshold at P90 : -0.0442495
Recall at rank 1:  0.24063
Recall at rank 10: 0.24925

Preparing a submission file for track 2

The submission file can be constructed from the (full) query descriptors and the reference descriptors:

python scripts/convert_track2_format.py \
    --query_descs data/dev_queries_gist_pca.hdf5 \
    --db_descs data/references_{0..19}_gist_pca.hdf5 \
    --o data/gist_pca_descriptors.hdf5

The output data/gist_pca_descriptors.hdf5 can be used as an official submission file.