Skip to content
Merged
36 commits
d35e2b2
Add check_existing_images() to compare existing image files with CSV …
EmersonFras Oct 30, 2025
32ae64d
Integrate existing image check into main download flow
EmersonFras Oct 30, 2025
489a896
Only print pre-download directory status if missing images
EmersonFras Oct 30, 2025
aa434ef
Add tests for check_existing_images() including partial and complete …
EmersonFras Oct 30, 2025
37f4099
Update CLI Examples in README for new check existing image examples
EmersonFras Oct 30, 2025
bf39663
Fix CLI Examples wrong output dir
EmersonFras Nov 25, 2025
f73f16e
Remove validate_output_directory call
EmersonFras Nov 25, 2025
a0f839c
Combine imports in main
EmersonFras Nov 25, 2025
1997b6d
Complete docstring for check_existing_images
EmersonFras Nov 25, 2025
2cc1c16
Remove useless test assert
EmersonFras Nov 25, 2025
6ffa15a
Remove unused variable
EmersonFras Nov 25, 2025
90653cf
Use `.code` over `str()` for consistency
EmersonFras Nov 25, 2025
17f49a2
Use fullpath when checking existing files
EmersonFras Nov 25, 2025
c9a8d39
Merge branch 'feature/issue-34/check-existing-images' of github.com:I…
EmersonFras Nov 25, 2025
6fdcc47
Handle directory existing but empty case
EmersonFras Nov 25, 2025
8d19c70
Update description to match enhanced functionality
egrace479 Nov 26, 2025
3810adb
Use bash section in MD for proper rendering
EmersonFras Dec 1, 2025
fa8e01b
Move comments in README code section to new lines
EmersonFras Dec 1, 2025
1c46e94
Move comments to new lines
EmersonFras Dec 1, 2025
2001973
Renamed 'subfolders_col' -> 'subfolders' for consistency
EmersonFras Dec 1, 2025
ae5c211
Spacing after imports in main
EmersonFras Dec 1, 2025
29dffc4
Use os.path.normpath to normalize pathing for comparison
EmersonFras Dec 1, 2025
a2ad790
Use copy of source_df to avoid future side-effects
EmersonFras Dec 1, 2025
641e4f4
Update docstring for test_directory_does_note_exist
EmersonFras Dec 1, 2025
0612a26
Implement check_existing_images with starting_idx
EmersonFras Dec 1, 2025
78fbfde
Add subfolders handling check for existing images
EmersonFras Dec 1, 2025
212011f
Remove test_main_directory_exists
EmersonFras Dec 1, 2025
ced40e9
Rename missing_df to filtered_df
EmersonFras Dec 3, 2025
ce1aab2
Pass actual DF in tests
EmersonFras Dec 3, 2025
b8e6da0
Update docstring wording in tests/test_existing_images.py
EmersonFras Dec 4, 2025
923e243
Update docstring wording in src/cautiousrobot/utils.py
EmersonFras Dec 4, 2025
385b12f
Return filtered_df when directory does not exist
EmersonFras Dec 4, 2025
1092d74
Update print message in check_existing_images
EmersonFras Dec 4, 2025
b9fcf09
Update test assert messages
EmersonFras Dec 8, 2025
b62a09f
Update phrasing for how many desired images downloaded
EmersonFras Dec 8, 2025
ab3f398
Update README for extra column
EmersonFras Dec 8, 2025
43 changes: 37 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
@@ -2,7 +2,7 @@

<img align="right" src="cautious-robot_logo.png" alt="cautious-robot logo, an image of a robot generated with Canva Magic Media" width="384"/>

I am a simple downloader that downloads images from URLs in a CSV and names them by the given column (after ensuring all its values are unique). I can organize your images into subfolders based on any column in your CSV and will warn you if the parent image folder already exists before overwriting it. If you need square images for modeling, I'll create a second directory (organized in the same format) with downsized copies of your images. Patience is a virtue, so I will wait a designated time before re-requesting an image after receiving an error on my retry list; if all retries are expended or I receive another error, I log that for your review and move on. I also keep a log of all successful responses. After download, [`sum-buddy`](https://github.com/Imageomics/sum-buddy) helps me gather and record checksums for all downloaded images. If the source CSV has a checksum column, I can then do a buddy-check to verify all expected images are downloaded intact. At a minimum, I check the number of expected images matches the number sum-buddy counts.
I am a simple downloader that downloads images from URLs in a CSV and names them by the given column (after ensuring all its values are unique). I can organize your images into subfolders based on any column in your CSV, and will check for images already downloaded in your target folder. If you need square images for modeling, I'll create a second directory (organized in the same format) with downsized copies of your images. Patience is a virtue, so I will wait a designated time before re-requesting an image after receiving an error on my retry list; if all retries are expended or I receive another error, I log that for your review and move on. I also keep a log of all successful responses. After download, [`sum-buddy`](https://github.com/Imageomics/sum-buddy) helps me gather and record checksums for all downloaded images. If the source CSV has a checksum column, I can then do a buddy-check to verify all expected images are downloaded intact. At a minimum, I check the number of expected images matches the number sum-buddy counts.


<p align="right">
@@ -20,7 +20,7 @@ pip install cautious-robot

## How it Works

Cautious-robot will check the provided CSV for `IMG_NAME`, `URL`, and `SUBFOLDERS` (if provided), then download all images that have a value in the `IMG_NAME` column. Note that choice of image filename should be unique; cautious-robot will refuse the request if the filename column selected is not unique within the dataset. It will also check if the provided `OUTPUT` folder already exists, asking the user before proceeding. Images that have a filename but no `URL` are recorded in the error log; the user is prompted whether to ignore or address the missing URLs prior to downloading. Logs are saved in the same directory as the source CSV (logging is done by adding to an existing JSON, so it will not overwrite existing logs with the same name in case of a restarted download). Please note that if the streamed response is interrupted before the image is downloaded in its entirety this error may not be recorded in the error log, but the verifier would register them as missing.
Cautious-robot will check the provided CSV for `IMG_NAME`, `URL`, and `SUBFOLDERS` (if provided), then download all images that have a value in the `IMG_NAME` column. Note that choice of image filename should be unique; cautious-robot will refuse the request if the filename column selected is not unique within the dataset. It will also check if the images already exist in the provided `OUTPUT` folder to avoid overwriting existing files. Images that have a filename but no `URL` are recorded in the error log; the user is prompted whether to ignore or address missing filenames for URLs prior to downloading. Logs are saved in the same directory as the source CSV (logging is done by adding to an existing JSON, so it will not overwrite existing logs with the same name in case of a restarted download). Please note that if the streamed response is interrupted before the image is downloaded in its entirety this error may not be recorded in the error log, but the verifier would register them as missing.

If desired, a secondary output directory (`OUTPUT_downsized`) will be created with square copies of the images downsized to the specified size (e.g., 256 x 256). The folder structure of this secondary output directory will match that of the un-processed images. Parameters such as time to wait between retries on a failed download, the maximum number of times to retry downloading an image, and which index of the CSV to start with can all also be passed. Cautious-robot will retry image downloads when receiving one of the following [HTTP response status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes): `429, 500, 502, 503, 504`.
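The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not cautious-robot's actual download code: `fetch_with_retries` and the injected `fetch` callable are hypothetical names, and only the status-code list is taken from the README text.

```python
import time

# Status codes the README lists as retryable
RETRY_CODES = {429, 500, 502, 503, 504}


def fetch_with_retries(fetch, url, wait_time=0.0, max_retries=3):
    """Call fetch(url) -> (status_code, body), retrying only on RETRY_CODES.

    Hypothetical helper: returns (status_code, body, number_of_retries_used).
    """
    attempts = 0
    while True:
        status, body = fetch(url)
        # Stop on success, on a non-retryable error, or when retries run out
        if status not in RETRY_CODES or attempts >= max_retries:
            return status, body, attempts
        attempts += 1
        time.sleep(wait_time)  # wait the designated time before re-requesting


# Stand-in for a real HTTP request: fails twice with 503, then succeeds
responses = iter([(503, b""), (503, b""), (200, b"image-bytes")])
status, body, retries = fetch_with_retries(lambda url: next(responses),
                                           "https://example.org/img.png")
print(status, retries)  # 200 2
```

Passing the request function in as an argument keeps the sketch runnable without network access; a real downloader would also record the final non-success status in its error log.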

@@ -78,7 +78,7 @@ cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test
> Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl.
> Calculating md5 checksums on examples/test_images: 100%|███████████████████████████████████████████| 16/16 [00:00<00:00, 3133.00it/s]
> md5 checksums for examples/test_images written to examples/HCGSD_testNA_checksums.csv
> 8 images were downloaded to examples/test_images of the 8 expected.
> There are 8 files in examples/test_images. Based on examples/HCGSD_testNA.csv, there should be 8 images.
> ```
```
head -n 9 examples/HCGSD_testNA_checksums.csv
@@ -107,7 +107,7 @@ cautious-robot -i examples/HCGSD_testNA.csv -o examples/test_images_subdirs --su
> Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl.
> Calculating md5 checksums on examples/test_images_subdirs: 100%|█████████████████████████████████████████████| 8/8 [00:00<00:00, 3106.60it/s]
> md5 checksums for examples/test_images_subdirs written to examples/HCGSD_testNA_checksums.csv
> 8 images were downloaded to examples/test_images_subdirs of the 8 expected.
> There are 8 files in examples/test_images_subdirs. Based on examples/HCGSD_testNA.csv, there should be 8 images.
> ```
```
ls examples/test_images_subdirs
@@ -144,10 +144,10 @@ cautious-robot -i examples/HCGSD_test_MD5_mismatch.csv -o examples/test_images_m
> Download logs are in examples/HCGSD_test_MD5_mismatch_log.jsonl and examples/HCGSD_test_MD5_mismatch_error_log.jsonl.
> Calculating md5 checksums on examples/test_images_md5_mismatch: 100%|████████████████████████████████| 8/8 [00:00<00:00, 4159.98it/s]
> md5 checksums for examples/test_images_md5_mismatch written to examples/HCGSD_test_MD5_mismatch_checksums.csv
> 8 images were downloaded to examples/test_images_md5_mismatch of the 8 expected.
> There are 8 files in examples/test_images_md5_mismatch. Based on examples/HCGSD_test_MD5_mismatch.csv, there should be 8 images.
> Image mismatch: 1 image(s) not aligned, see examples/HCGSD_test_MD5_mismatch_missing.csv for missing image info and check logs.
> ```
```
```bash
# Check on that mis-aligned image
head -n 2 examples/HCGSD_test_MD5_mismatch_missing.csv
```
@@ -157,6 +157,37 @@ head -n 2 examples/HCGSD_test_MD5_mismatch_missing.csv
> 10428972,erato,petiverana,male,https://github.com/Imageomics/dashboard-prototype/raw/main/test_data/images/ventral_images/10428972_V_lowres.png,10428972_V_lowres.png,mismatch
> ```

- **Download Partially Existing Images:** some (or all) images may already exist in the output directory
```bash
# 1. Download the images
cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images
# 2. Remove some of the images
rm ./examples/test_images/104281*
# 3. Download the same set of images to get only those removed at 2
cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images
```

> Output:
> ```console
> There are 6 files in examples/test_images. Based on examples/HCGSD_testNA.csv, there should be 8 images.
> 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [02:32<00:00, 76.36s/it]
> Images downloaded from examples/HCGSD_testNA.csv to examples/test_images.
> Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl.
> Calculating md5 checksums on examples/test_images: 100%|████████████████████████████████| 8/8 [00:00<00:00, 4159.98it/s]
> md5 checksums for examples/test_images written to examples/HCGSD_testNA_checksums.csv
> There are 8 files in examples/test_images. Based on examples/HCGSD_testNA.csv, there should be 8 images.
> ```
```bash
# Attempt to download the same set of images
cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images
```

> Output:
> ```console
> 'examples/test_images' already contains all images. Exited without executing.
> ```


## Development
To develop the package further:

13 changes: 6 additions & 7 deletions src/cautiousrobot/__main__.py
@@ -11,11 +11,10 @@
import os
import sys
from sumbuddy import get_checksums
from cautiousrobot.utils import process_csv
from cautiousrobot.utils import process_csv, check_existing_images
from cautiousrobot.buddy_check import BuddyCheck
from cautiousrobot.download import download_images


def parse_args():
available_algorithms = ', '.join(hashlib.algorithms_available)

@@ -104,7 +103,7 @@ def process_checksums(img_dir, metadata_path, args, source_df):
# Verify numbers
checksum_df = pd.read_csv(checksum_path, low_memory=False)
expected_num_imgs = source_df.shape[0]
print(f"{checksum_df.shape[0]} images were downloaded to {img_dir} of the {expected_num_imgs} expected.")
print(f"There are {checksum_df.shape[0]} files in {img_dir}. Based on {args.input_file}, there should be {expected_num_imgs} images.")

return checksum_df, expected_num_imgs
except Exception as e:
@@ -160,9 +159,9 @@ def main():
# Set source DataFrame for only non-null filename values
source_df = data_df.loc[data_df[filename_col].notna()].copy()

# Validate output directory
# Validate and handle existing output directory
img_dir = args.output_dir
validate_output_directory(img_dir)
source_df, filtered_df = check_existing_images(csv_path, img_dir, source_df, filename_col, subfolders, args.starting_idx)

# Set up log paths
log_filepath, error_log_filepath, metadata_path = setup_log_paths(csv_path)
@@ -171,7 +170,7 @@ def main():
if isinstance(args.side_length, int):
downsample_dest_path = img_dir + "_downsized"
# Download images from urls & save downsample copy
download_images(source_df,
download_images(filtered_df,
img_dir=img_dir,
log_filepath=log_filepath,
error_log_filepath=error_log_filepath,
@@ -186,7 +185,7 @@ def main():
print(f"Images downloaded from {csv_path} to {img_dir}, with downsampled images in {downsample_dest_path}.")
else:
# Download images from urls without downsample copy
download_images(source_df,
download_images(filtered_df,
img_dir=img_dir,
log_filepath=log_filepath,
error_log_filepath=error_log_filepath,
82 changes: 81 additions & 1 deletion src/cautiousrobot/utils.py
@@ -1,9 +1,12 @@
# Helper functions for download

import json
import sys
import pandas as pd
import os
from PIL import Image
from sumbuddy import gather_file_paths
from sumbuddy.exceptions import EmptyInputDirectoryError


def log_response(log_data, index, image, file_path, response_code):
@@ -80,4 +83,81 @@ def downsample_and_save_image(image_dir_path, image_name, downsample_dir_path, d
response_code=str(e)
)
update_log(log=log_errors, index=image_index, filepath=error_log_filepath)


def check_existing_images(csv_path, img_dir, source_df, filename_col, subfolders = None, starting_idx = 0):
    """
    Checks which files from the CSV already exist in the image directory.

    Adds a new boolean column `in_img_dir` to source_df indicating which images
    are already in the directory.

    If all images already exist in the directory, the function will exit early
    by calling `sys.exit()`, and no further processing will occur.

    Parameters:
        csv_path (str): Path to the CSV file containing image information.
        img_dir (str): Path to the directory where images are to be stored.
        source_df (pd.DataFrame): DataFrame loaded from the CSV, containing image metadata.
        filename_col (str): Name of the column in source_df that contains image filenames.
        subfolders (str): Name of the column in source_df that contains subfolder names. (optional)
        starting_idx (int): Index to start checking from. (optional)

    Returns:
        updated_df (pd.DataFrame): DataFrame with new column 'in_img_dir' indicating presence in img_dir.
        filtered_df (pd.DataFrame): DataFrame filtered to only files not present in img_dir.
    """
    # Create a copy to avoid modifying the original DataFrame
    df = source_df.copy()

    if not os.path.exists(img_dir):
        # Directory doesn't exist, so nothing to check
        df["in_img_dir"] = False

        # If we have a starting index, we still need to mark the skipped ones as True
        if starting_idx > 0:
            df.iloc[:starting_idx, df.columns.get_loc("in_img_dir")] = True
        # Return the updated df and the filtered dataframe of items that still need downloading
        filtered_df = df[~df["in_img_dir"]].copy()
        return df, filtered_df

    try:
        existing_files = gather_file_paths(img_dir)
    except EmptyInputDirectoryError:
        # If the directory exists but is empty, sumbuddy raises an error.
        # We catch it and treat it as an empty file list.
        existing_files = []

    existing_full_paths = {os.path.normpath(os.path.relpath(f, img_dir)) for f in existing_files}

    if subfolders:
        # We use a generic join here, but the apply(os.path.normpath) below fixes it for the specific OS
        raw_paths = df[subfolders].astype(str) + os.sep + df[filename_col].astype(str)

        # This converts '/' to '\' on Windows, or vice versa, ensuring a match
        df["expected_path"] = raw_paths.apply(os.path.normpath)
    else:
        # Normalize even simple filenames just in case they contain pathing characters
        df["expected_path"] = df[filename_col].astype(str).apply(os.path.normpath)

    # Determine which expected paths physically exist
    expected_present = df["expected_path"].isin(existing_full_paths)
    df["in_img_dir"] = expected_present.copy()

    if starting_idx > 0:
        df.iloc[:starting_idx, df.columns.get_loc("in_img_dir")] = True

    # Clean up the temporary column before returning.
    df = df.drop(columns=["expected_path"])

    # Create filtered DataFrame
    filtered_df = df[~df["in_img_dir"]].copy()

    # Exit if all images are already there
    if filtered_df.empty:
        sys.exit(f"'{img_dir}' already contains all images. Exited without executing.")
    else:
        # Print directory status message - pre-download
        num_existing = len(existing_files)
        print(f"There are {num_existing} of the desired files already in {img_dir}. Based on {csv_path}, {filtered_df.shape[0]} images should be downloaded.")

    return df, filtered_df
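
The core of the check above (normalize the expected relative paths, compare them with what is on disk, then drop the rows already present) can be tried in isolation with the sketch below. It is an illustration under stated assumptions rather than the package's code: `split_missing` is a hypothetical stand-in, it walks the directory with `os.walk` instead of calling sum-buddy's `gather_file_paths`, and it omits the early `sys.exit()` and the status message.

```python
import os
import tempfile

import pandas as pd


def split_missing(img_dir, df, filename_col, subfolders=None, starting_idx=0):
    """Hypothetical stand-in: return (df with 'in_img_dir', rows still to download)."""
    out = df.copy()
    # Collect existing file paths relative to img_dir (empty set if it is absent)
    existing = set()
    if os.path.isdir(img_dir):
        for root, _dirs, files in os.walk(img_dir):
            for name in files:
                rel = os.path.relpath(os.path.join(root, name), img_dir)
                existing.add(os.path.normpath(rel))
    # Build the expected relative path for every row
    names = out[filename_col].astype(str)
    if subfolders:
        names = out[subfolders].astype(str) + os.sep + names
    out["in_img_dir"] = names.apply(os.path.normpath).isin(existing)
    # Rows before starting_idx are treated as already handled
    if starting_idx > 0:
        out.iloc[:starting_idx, out.columns.get_loc("in_img_dir")] = True
    return out, out[~out["in_img_dir"]].copy()


with tempfile.TemporaryDirectory() as tmp:
    # Simulate one previously downloaded image in a subfolder
    os.makedirs(os.path.join(tmp, "erato"))
    open(os.path.join(tmp, "erato", "a.png"), "w").close()
    demo = pd.DataFrame({"species": ["erato", "erato", "melpomene"],
                         "filename": ["a.png", "b.png", "c.png"]})
    marked, todo = split_missing(tmp, demo, "filename", subfolders="species")
    print(todo["filename"].tolist())  # ['b.png', 'c.png']
```

Returning both the marked frame and the filtered frame mirrors the two-value return of `check_existing_images`, so the caller can keep the full row count for verification while downloading only the missing rows.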
34 changes: 0 additions & 34 deletions tests/test_download_images.py
@@ -423,39 +423,5 @@ def test_main_missing_filenames(self, mock_input, mock_process_csv, mock_parse_a

self.assertEqual(cm.exception.code, "Exited without executing.")

    @patch('cautiousrobot.__main__.parse_args')
    @patch('cautiousrobot.__main__.process_csv')
    @patch('builtins.input', return_value='n')
    @patch('os.path.exists', return_value=True)
    def test_main_directory_exists(self, mock_exists, mock_input, mock_process_csv, mock_parse_args):
        mock_args = MagicMock()
        mock_args.input_file = 'test.csv'
        mock_args.img_name_col = 'filename_col'
        mock_args.url_col = 'url_col'
        mock_args.subdir_col = None
        mock_args.output_dir = 'output_dir'
        mock_args.side_length = None
        mock_args.wait_time = 0
        mock_args.max_retries = 3
        mock_args.starting_idx = 0
        mock_args.checksum_algorithm = 'md5'
        mock_args.verifier_col = None

        mock_parse_args.return_value = mock_args

        mock_data = pd.DataFrame({
            'filename_col': ['file1', 'file2', 'file3', 'file4'],
            'url_col': ['url1', 'url2', 'url3', 'url4']
        })

        mock_process_csv.return_value = mock_data

        with self.assertRaises(SystemExit) as cm:
            main()

        # self.assertEqual(cm.exception.code, "mock_args.output_dir Exited without executing.")
        self.assertEqual(cm.exception.code, f"'{mock_args.output_dir}' already exists. Exited without executing.")


if __name__ == '__main__':
    unittest.main()