Commit fe031cb

EmersonFras, Copilot, and egrace479 authored
Feat: Pre-check for existing images using in_img_dir column (Closes #34) (#43)
* Add check_existing_images() to compare existing image files with CSV list
* Integrate existing image check into main download flow
* Only print pre-download directory status if missing images
* Add tests for check_existing_images() including partial and complete directory cases, update test_download_images for new logic flow
* Update CLI Examples in README for new check existing image examples
* Use fullpath when checking existing files
* Handle directory existing but empty case
* Update description to match enhanced functionality
* Use os.path.normpath to normalize pathing for comparison
* Implement check_existing_images with starting_idx
* Add subfolders handling check for existing images
* Remove test_main_directory_exists

---------

Co-authored-by: Copilot <[email protected]>
Co-authored-by: egrace479 <[email protected]>
Co-authored-by: Elizabeth Campolongo <[email protected]>
1 parent ac97289 commit fe031cb

5 files changed: +215 -50 lines changed

README.md

Lines changed: 39 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -2,7 +2,7 @@
22

33
<img align="right" src="cautious-robot_logo.png" alt="cautious-robot logo, an image of a robot generated with Canva Magic Media" width="384"/>
44

5-
I am a simple downloader that downloads images from URLs in a CSV and names them by the given column (after ensuring all its values are unique). I can organize your images into subfolders based on any column in your CSV and will warn you if the parent image folder already exists before overwriting it. If you need square images for modeling, I'll create a second directory (organized in the same format) with downsized copies of your images. Patience is a virtue, so I will wait a designated time before re-requesting an image after receiving an error on my retry list; if all retries are expended or I receive another error, I log that for your review and move on. I also keep a log of all successful responses. After download, [`sum-buddy`](https://github.com/Imageomics/sum-buddy) helps me gather and record checksums for all downloaded images. If the source CSV has a checksum column, I can then do a buddy-check to verify all expected images are downloaded intact. At a minimum, I check the number of expected images matches the number sum-buddy counts.
5+
I am a simple downloader that downloads images from URLs in a CSV and names them by the given column (after ensuring all its values are unique). I can organize your images into subfolders based on any column in your CSV, and will check for images already downloaded in your target folder. If you need square images for modeling, I'll create a second directory (organized in the same format) with downsized copies of your images. Patience is a virtue, so I will wait a designated time before re-requesting an image after receiving an error on my retry list; if all retries are expended or I receive another error, I log that for your review and move on. I also keep a log of all successful responses. After download, [`sum-buddy`](https://github.com/Imageomics/sum-buddy) helps me gather and record checksums for all downloaded images. If the source CSV has a checksum column, I can then do a buddy-check to verify all expected images are downloaded intact. At a minimum, I check the number of expected images matches the number sum-buddy counts.
66

77

88
<p align="right">
@@ -20,7 +20,7 @@ pip install cautious-robot
2020

2121
## How it Works
2222

23-
Cautious-robot will check the provided CSV for `IMG_NAME`, `URL`, and `SUBFOLDERS` (if provided), then download all images that have a value in the `IMG_NAME` column. Note that choice of image filename should be unique; cautious-robot will refuse the request if the filename column selected is not unique within the dataset. It will also check if the provided `OUTPUT` folder already exists, asking the user before proceeding. Images that have a filename but no `URL` are recorded in the error log; the user is prompted whether to ignore or address the missing URLs prior to downloading. Logs are saved in the same directory as the source CSV (logging is done by adding to an existing JSON, so it will not overwrite existing logs with the same name in case of a restarted download). Please note that if the streamed response is interrupted before the image is downloaded in its entirety this error may not be recorded in the error log, but the verifier would register them as missing.
23+
Cautious-robot will check the provided CSV for `IMG_NAME`, `URL`, and `SUBFOLDERS` (if provided), then download all images that have a value in the `IMG_NAME` column. Note that choice of image filename should be unique; cautious-robot will refuse the request if the filename column selected is not unique within the dataset. It will also check if the images already exist in the provided `OUTPUT` folder to avoid overwriting existing files. Images that have a filename but no `URL` are recorded in the error log; the user is prompted whether to ignore or address missing filenames for URLs prior to downloading. Logs are saved in the same directory as the source CSV (logging is done by adding to an existing JSON, so it will not overwrite existing logs with the same name in case of a restarted download). Please note that if the streamed response is interrupted before the image is downloaded in its entirety this error may not be recorded in the error log, but the verifier would register them as missing.
2424

2525
If desired, a secondary output directory (`OUTPUT_downsized`) will be created with square copies of the images downsized to the specified size (e.g., 256 x 256). The folder structure of this secondary output directory will match that of the un-processed images. Parameters such as time to wait between retries on a failed download, the maximum number of times to retry downloading an image, and which index of the CSV to start with can all also be passed. Cautious-robot will retry image downloads when receiving one of the following [HTTP response status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes): `429, 500, 502, 503, 504`.
2626

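The retry rule described in the paragraph above can be sketched roughly as follows. This is a minimal illustration, not cautious-robot's actual download code: `fetch_with_retries`, `fake_fetch`, and the injected `fetch` callable are hypothetical names chosen for the example, and only the list of retryable status codes is taken from the README.

```python
import time

# Status codes the README lists as retryable.
RETRY_CODES = {429, 500, 502, 503, 504}


def fetch_with_retries(fetch, url, wait_time=0.0, max_retries=3):
    """Call fetch(url) and retry while it returns a retryable status code.

    `fetch` is a hypothetical callable that returns an HTTP status code (int).
    Returns a tuple of (final_status, retries_used).
    """
    retries = 0
    status = fetch(url)
    while status in RETRY_CODES and retries < max_retries:
        time.sleep(wait_time)  # wait the designated time before re-requesting
        retries += 1
        status = fetch(url)
    return status, retries


# Stand-in for a real request: fails twice with 503, then succeeds.
responses = iter([503, 503, 200])


def fake_fetch(url):
    return next(responses)


result = fetch_with_retries(fake_fetch, "https://example.org/image.png")
print(result)  # (200, 2)
```

In this sketch a status code outside the retry list (for example 404) is returned immediately, and a code that stays on the list is returned once `max_retries` is used up; in both cases the caller would be the one to record the failure in the error log.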
@@ -78,7 +78,7 @@ cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test
7878
> Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl.
7979
> Calculating md5 checksums on examples/test_images: 100%|███████████████████████████████████████████| 16/16 [00:00<00:00, 3133.00it/s]
8080
> md5 checksums for examples/test_images written to examples/HCGSD_testNA_checksums.csv
81-
> 8 images were downloaded to examples/test_images of the 8 expected.
81+
> There are 8 files in examples/test_images. Based on examples/HCGSD_testNA.csv, there should be 8 images.
8282
> ```
8383
```
8484
head -n 9 examples/HCGSD_testNA_checksums.csv
@@ -107,7 +107,7 @@ cautious-robot -i examples/HCGSD_testNA.csv -o examples/test_images_subdirs --su
107107
> Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl.
108108
> Calculating md5 checksums on examples/test_images_subdirs: 100%|█████████████████████████████████████████████| 8/8 [00:00<00:00, 3106.60it/s]
109109
> md5 checksums for examples/test_images_subdirs written to examples/HCGSD_testNA_checksums.csv
110-
> 8 images were downloaded to examples/test_images_subdirs of the 8 expected.
110+
> There are 8 files in examples/test_images_subdirs. Based on examples/HCGSD_testNA.csv, there should be 8 images.
111111
> ```
112112
```
113113
ls examples/test_images_subdirs
@@ -144,19 +144,50 @@ cautious-robot -i examples/HCGSD_test_MD5_mismatch.csv -o examples/test_images_m
144144
> Download logs are in examples/HCGSD_test_MD5_mismatch_log.jsonl and examples/HCGSD_test_MD5_mismatch_error_log.jsonl.
145145
> Calculating md5 checksums on examples/test_images_md5_mismatch: 100%|████████████████████████████████| 8/8 [00:00<00:00, 4159.98it/s]
146146
> md5 checksums for examples/test_images_md5_mismatch written to examples/HCGSD_test_MD5_mismatch_checksums.csv
147-
> 8 images were downloaded to examples/test_images_md5_mismatch of the 8 expected.
147+
> There are 8 files in examples/test_images_md5_mismatch. Based on examples/HCGSD_test_MD5_mismatch.csv, there should be 8 images.
148148
> Image mismatch: 1 image(s) not aligned, see examples/HCGSD_test_MD5_mismatch_missing.csv for missing image info and check logs.
149149
> ```
150-
```
150+
```bash
151151
# Check on that mis-aligned image
152152
head -n 2 examples/HCGSD_test_MD5_mismatch_missing.csv
153153
```
154154
> Output:
155155
> ```console
156-
> nhm_specimen,species,subspecies,sex,file_url,filename,md5
157-
> 10428972,erato,petiverana,male,https://github.com/Imageomics/dashboard-prototype/raw/main/test_data/images/ventral_images/10428972_V_lowres.png,10428972_V_lowres.png,mismatch
156+
> nhm_specimen,species,subspecies,sex,file_url,filename,md5,in_img_dir
157+
> 10428972,erato,petiverana,male,https://github.com/Imageomics/dashboard-prototype/raw/main/test_data/images/ventral_images/10428972_V_lowres.png,10428972_V_lowres.png,mismatch,False
158158
> ```
159159
160+
- **Download Partially Existing Images:** some (or all) images may already exist in the output directory
161+
```bash
162+
# 1. Download the images
163+
cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images
164+
# 2. Remove some of the images
165+
rm ./examples/test_images/104281*
166+
# 3. Download the same set of images to get only those removed at 2
167+
cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images
168+
```
169+
170+
> Output:
171+
> ```console
172+
> There are 6 of the desired files already in examples/test_images. Based on examples/HCGSD_testNA.csv, 2 images should be downloaded.
173+
> 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [02:32<00:00, 76.36s/it]
174+
> Images downloaded from examples/HCGSD_testNA.csv to examples/test_images.
175+
> Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl.
176+
> Calculating md5 checksums on examples/test_images: 100%|████████████████████████████████| 8/8 [00:00<00:00, 4159.98it/s]
177+
> md5 checksums for examples/test_images written to examples/HCGSD_testNA_checksums.csv
178+
> There are 8 files in examples/test_images. Based on examples/HCGSD_testNA.csv, there should be 8 images.
179+
> ```
180+
```bash
181+
# Attempt to download the same set of images
182+
cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images
183+
```
184+
185+
> Output:
186+
> ```console
187+
> 'examples/test_images' already contains all images. Exited without executing.
188+
> ```
189+
190+
160191
## Development
161192
To develop the package further:
162193

src/cautiousrobot/__main__.py

Lines changed: 6 additions & 7 deletions
@@ -11,11 +11,10 @@
 import os
 import sys
 from sumbuddy import get_checksums
-from cautiousrobot.utils import process_csv
+from cautiousrobot.utils import process_csv, check_existing_images
 from cautiousrobot.buddy_check import BuddyCheck
 from cautiousrobot.download import download_images

-
 def parse_args():
     available_algorithms = ', '.join(hashlib.algorithms_available)

@@ -104,7 +103,7 @@ def process_checksums(img_dir, metadata_path, args, source_df):
         # Verify numbers
         checksum_df = pd.read_csv(checksum_path, low_memory=False)
         expected_num_imgs = source_df.shape[0]
-        print(f"{checksum_df.shape[0]} images were downloaded to {img_dir} of the {expected_num_imgs} expected.")
+        print(f"There are {checksum_df.shape[0]} files in {img_dir}. Based on {args.input_file}, there should be {expected_num_imgs} images.")

         return checksum_df, expected_num_imgs
     except Exception as e:
@@ -160,9 +159,9 @@ def main():
     # Set source DataFrame for only non-null filename values
     source_df = data_df.loc[data_df[filename_col].notna()].copy()

-    # Validate output directory
+    # Validate and handle existing output directory
     img_dir = args.output_dir
-    validate_output_directory(img_dir)
+    source_df, filtered_df = check_existing_images(csv_path, img_dir, source_df, filename_col, subfolders, args.starting_idx)

     # Set up log paths
     log_filepath, error_log_filepath, metadata_path = setup_log_paths(csv_path)
@@ -171,7 +170,7 @@ def main():
     if isinstance(args.side_length, int):
         downsample_dest_path = img_dir + "_downsized"
         # Download images from urls & save downsample copy
-        download_images(source_df,
+        download_images(filtered_df,
                         img_dir=img_dir,
                         log_filepath=log_filepath,
                         error_log_filepath=error_log_filepath,
@@ -186,7 +185,7 @@ def main():
         print(f"Images downloaded from {csv_path} to {img_dir}, with downsampled images in {downsample_dest_path}.")
     else:
         # Download images from urls without downsample copy
-        download_images(source_df,
+        download_images(filtered_df,
                         img_dir=img_dir,
                         log_filepath=log_filepath,
                         error_log_filepath=error_log_filepath,

src/cautiousrobot/utils.py

Lines changed: 81 additions & 1 deletion
@@ -1,9 +1,12 @@
 # Helper functions for download

 import json
+import sys
 import pandas as pd
 import os
 from PIL import Image
+from sumbuddy import gather_file_paths
+from sumbuddy.exceptions import EmptyInputDirectoryError


 def log_response(log_data, index, image, file_path, response_code):
@@ -80,4 +83,81 @@ def downsample_and_save_image(image_dir_path, image_name, downsample_dir_path, d
             response_code=str(e)
         )
         update_log(log=log_errors, index=image_index, filepath=error_log_filepath)
-
+
+def check_existing_images(csv_path, img_dir, source_df, filename_col, subfolders = None, starting_idx = 0):
+    """
+    Checks which files from the CSV already exist in the image directory.
+
+    Adds a new boolean column `in_img_dir` to source_df indicating which images
+    are already in the directory.
+
+    If all images already exist in the directory, the function will exit early
+    by calling `sys.exit()`, and no further processing will occur.
+
+    Parameters:
+        csv_path (str): Path to the CSV file containing image information.
+        img_dir (str): Path to the directory where images are to be stored.
+        source_df (pd.DataFrame): DataFrame loaded from the CSV, containing image metadata.
+        filename_col (str): Name of the column in source_df that contains image filenames.
+        subfolders (str): Name of the column in source_df that contains subfolder names. (optional)
+        starting_idx (int): Index to start checking from. (optional)
+
+    Returns:
+        updated_df (pd.DataFrame): DataFrame with new column 'in_img_dir' indicating presence in img_dir.
+        filtered_df (pd.DataFrame): DataFrame filtered to only files not present in img_dir.
+    """
+    # Create a copy to avoid modifying the original DataFrame
+    df = source_df.copy()
+
+    if not os.path.exists(img_dir):
+        # Directory doesn't exist, so nothing to check
+        df["in_img_dir"] = False
+
+        # If we have a starting index, we still need to mark the skipped ones as True
+        if starting_idx > 0:
+            df.iloc[:starting_idx, df.columns.get_loc("in_img_dir")] = True
+        # Return the updated df and the filtered dataframe of items that still need downloading
+        filtered_df = df[~df["in_img_dir"]].copy()
+        return df, filtered_df
+
+    try:
+        existing_files = gather_file_paths(img_dir)
+    except EmptyInputDirectoryError:
+        # If the directory exists but is empty, sumbuddy raises an error.
+        # We catch it and treat it as an empty file list.
+        existing_files = []
+
+    existing_full_paths = {os.path.normpath(os.path.relpath(f, img_dir)) for f in existing_files}
+
+    if subfolders:
+        # We use a generic join here, but the apply(os.path.normpath) below fixes it for the specific OS
+        raw_paths = df[subfolders].astype(str) + os.sep + df[filename_col].astype(str)
+
+        # This converts '/' to '\' on Windows, or vice versa, ensuring a match
+        df["expected_path"] = raw_paths.apply(os.path.normpath)
+    else:
+        # Normalize even simple filenames just in case they contain pathing characters
+        df["expected_path"] = df[filename_col].astype(str).apply(os.path.normpath)
+
+    # Determine which expected paths physically exist
+    expected_present = df["expected_path"].isin(existing_full_paths)
+    df["in_img_dir"] = expected_present.copy()
+
+    if starting_idx > 0:
+        df.iloc[:starting_idx, df.columns.get_loc("in_img_dir")] = True
+
+    # Clean up the temporary column before returning.
+    df = df.drop(columns=["expected_path"])
+
+    # Create filtered DataFrame
+    filtered_df = df[~df["in_img_dir"]].copy()
+
+    # Exit if all images are already there
+    if filtered_df.empty:
+        sys.exit(f"'{img_dir}' already contains all images. Exited without executing.")
+    else:
+        # Print directory status message - pre-download
+        num_existing = len(existing_files)
+        print(f"There are {num_existing} of the desired files already in {img_dir}. Based on {csv_path}, {filtered_df.shape[0]} images should be downloaded.")
+
+    return df, filtered_df

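The core idea of `check_existing_images()` above (compare normalized, directory-relative paths against the paths expected from the CSV, then keep only the rows still missing) can be illustrated without pandas or sum-buddy. The following is a simplified sketch under stated assumptions, not the function from this commit: `flag_existing`, its parameter names, and the sample rows are hypothetical, rows are plain dictionaries rather than a DataFrame, and `os.walk` stands in for `gather_file_paths`.

```python
import os
import tempfile


def flag_existing(rows, img_dir, filename_key, subfolder_key=None, starting_idx=0):
    """Add an `in_img_dir` flag to each row and return (flagged, to_download)."""
    existing = set()
    # os.walk yields nothing for a missing or empty directory,
    # so neither case needs special handling in this sketch.
    for root, _dirs, files in os.walk(img_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), img_dir)
            existing.add(os.path.normpath(rel))

    flagged = []
    for i, row in enumerate(rows):
        parts = [str(row[filename_key])]
        if subfolder_key:
            parts.insert(0, str(row[subfolder_key]))
        expected = os.path.normpath(os.path.join(*parts))
        # Rows before starting_idx are treated as already handled.
        present = i < starting_idx or expected in existing
        flagged.append({**row, "in_img_dir": present})

    to_download = [r for r in flagged if not r["in_img_dir"]]
    return flagged, to_download


# Demo: one of two expected images is already in its subfolder.
with tempfile.TemporaryDirectory() as img_dir:
    os.makedirs(os.path.join(img_dir, "erato"))
    open(os.path.join(img_dir, "erato", "a.png"), "w").close()
    rows = [
        {"filename": "a.png", "species": "erato"},
        {"filename": "b.png", "species": "melpomene"},
    ]
    flagged, to_download = flag_existing(rows, img_dir, "filename", "species")

print([r["filename"] for r in to_download])  # ['b.png']
```

One caveat on path normalization: `os.path.normpath` converts `/` to `\` on Windows, but on POSIX systems it leaves backslashes unchanged, so a CSV that stores Windows-style separators would not match files on Linux or macOS without an extra replacement step.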
tests/test_download_images.py

Lines changed: 0 additions & 34 deletions
@@ -423,39 +423,5 @@ def test_main_missing_filenames(self, mock_input, mock_process_csv, mock_parse_a

         self.assertEqual(cm.exception.code, "Exited without executing.")

-    @patch('cautiousrobot.__main__.parse_args')
-    @patch('cautiousrobot.__main__.process_csv')
-    @patch('builtins.input', return_value='n')
-    @patch('os.path.exists', return_value=True)
-    def test_main_directory_exists(self, mock_exists, mock_input, mock_process_csv, mock_parse_args):
-        mock_args = MagicMock()
-        mock_args.input_file = 'test.csv'
-        mock_args.img_name_col = 'filename_col'
-        mock_args.url_col = 'url_col'
-        mock_args.subdir_col = None
-        mock_args.output_dir = 'output_dir'
-        mock_args.side_length = None
-        mock_args.wait_time = 0
-        mock_args.max_retries = 3
-        mock_args.starting_idx = 0
-        mock_args.checksum_algorithm = 'md5'
-        mock_args.verifier_col = None
-
-        mock_parse_args.return_value = mock_args
-
-        mock_data = pd.DataFrame({
-            'filename_col': ['file1', 'file2', 'file3', 'file4'],
-            'url_col': ['url1', 'url2', 'url3', 'url4']
-        })
-
-        mock_process_csv.return_value = mock_data
-
-        with self.assertRaises(SystemExit) as cm:
-            main()
-
-        # self.assertEqual(cm.exception.code, "mock_args.output_dir Exited without executing.")
-        self.assertEqual(cm.exception.code, f"'{mock_args.output_dir}' already exists. Exited without executing.")
-
-
 if __name__ == '__main__':
     unittest.main()

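The removed test asserted the old "already exists" exit message. The commit message mentions new tests for `check_existing_images()`, but their diff is not shown on this page, so the following is only a sketch of how the new "already contains all images" exit could be asserted in the same `unittest` style. `exit_if_complete` is a hypothetical stand-in for the early-exit branch, not a function from the package.

```python
import sys
import unittest


def exit_if_complete(to_download, img_dir):
    """Hypothetical stand-in for the early exit in check_existing_images()."""
    if not to_download:
        sys.exit(f"'{img_dir}' already contains all images. Exited without executing.")


class ExitMessageTest(unittest.TestCase):
    def test_exits_when_nothing_to_download(self):
        with self.assertRaises(SystemExit) as cm:
            exit_if_complete([], "output_dir")
        # sys.exit(message) stores the message on SystemExit.code
        self.assertEqual(
            cm.exception.code,
            "'output_dir' already contains all images. Exited without executing.",
        )

    def test_continues_when_images_are_missing(self):
        self.assertIsNone(exit_if_complete(["file1"], "output_dir"))


# Run the two tests explicitly instead of calling unittest.main(),
# so the sketch does not terminate the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExitMessageTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```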