Feat: Pre-check for existing images using in_img_dir column (Closes #34) (#43)

EmersonFras · Copilot · egrace479 · web-flow · commit fe031cb151e4 · 2025-12-08T15:42:35.000-05:00
* Add check_existing_images() to compare existing image files with CSV list

* Integrate existing image check into main download flow

* Only print pre-download directory status if missing images

* Add tests for check_existing_images() including partial and complete directory cases, update test_download_images for new logic flow

* Update CLI Examples in README for new check existing image examples

* Use fullpath when checking existing files

* Handle directory existing but empty case

* Update description to match enhanced functionality

* Use os.path.normpath to normalize pathing for comparison

* Implement check_existing_images with starting_idx

* Add subfolders handling check for existing images

* Remove test_main_directory_exists
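
A minimal sketch of the pre-check described in the bullets above, under simplifying assumptions: plain dictionaries stand in for the pandas DataFrame, and `os.walk` stands in for sum-buddy's `gather_file_paths`. The names `find_existing` and `mark_existing` are hypothetical and are not part of cautious-robot. The sketch only illustrates the idea of comparing normalized relative paths and flagging each row with `in_img_dir`.

```python
import os
import tempfile


def find_existing(img_dir):
    """Return normalized paths of all files under img_dir, relative to img_dir."""
    found = set()
    if not os.path.isdir(img_dir):
        # A missing directory simply means nothing has been downloaded yet.
        return found
    for root, _dirs, files in os.walk(img_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), img_dir)
            found.add(os.path.normpath(rel))
    return found


def mark_existing(rows, img_dir, filename_key, subfolder_key=None, starting_idx=0):
    """Flag each row with 'in_img_dir'; return (all_rows, rows_still_to_download)."""
    existing = find_existing(img_dir)
    marked = []
    for i, row in enumerate(rows):
        parts = [str(row[filename_key])]
        if subfolder_key:
            parts.insert(0, str(row[subfolder_key]))
        expected = os.path.normpath(os.path.join(*parts))
        # Rows before starting_idx are treated as already handled (assumption
        # mirroring the starting index behavior described in the commit).
        present = i < starting_idx or expected in existing
        marked.append({**row, "in_img_dir": present})
    to_download = [r for r in marked if not r["in_img_dir"]]
    return marked, to_download


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "erato"))
        open(os.path.join(tmp, "erato", "a.png"), "w").close()
        rows = [
            {"species": "erato", "filename": "a.png"},
            {"species": "erato", "filename": "b.png"},
        ]
        marked, todo = mark_existing(rows, tmp, "filename", "species")
        print([r["filename"] for r in todo])  # ['b.png']
```

Normalizing both the paths found on disk and the paths built from the CSV columns keeps the comparison independent of how separators were written, which is the motivation given for `os.path.normpath` in this change.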

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: egrace479 <e.campolongo479@gmail.com>
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
 
 <img align="right" src="cautious-robot_logo.png" alt="cautious-robot logo, an image of a robot generated with Canva Magic Media" width="384"/>
 
-I am a simple downloader that downloads images from URLs in a CSV and names them by the given column (after ensuring all its values are unique). I can organize your images into subfolders based on any column in your CSV and will warn you if the parent image folder already exists before overwriting it. If you need square images for modeling, I'll create a second directory (organized in the same format) with downsized copies of your images. Patience is a virtue, so I will wait a designated time before re-requesting an image after receiving an error on my retry list; if all retries are expended or I receive another error, I log that for your review and move on. I also keep a log of all successful responses. After download, [`sum-buddy`](https://github.com/Imageomics/sum-buddy) helps me gather and record checksums for all downloaded images. If the source CSV has a checksum column, I can then do a buddy-check to verify all expected images are downloaded intact. At a minimum, I check the number of expected images matches the number sum-buddy counts.
+I am a simple downloader that downloads images from URLs in a CSV and names them by the given column (after ensuring all its values are unique). I can organize your images into subfolders based on any column in your CSV, and will check for images already downloaded in your target folder. If you need square images for modeling, I'll create a second directory (organized in the same format) with downsized copies of your images. Patience is a virtue, so I will wait a designated time before re-requesting an image after receiving an error on my retry list; if all retries are expended or I receive another error, I log that for your review and move on. I also keep a log of all successful responses. After download, [`sum-buddy`](https://github.com/Imageomics/sum-buddy) helps me gather and record checksums for all downloaded images. If the source CSV has a checksum column, I can then do a buddy-check to verify all expected images are downloaded intact. At a minimum, I check the number of expected images matches the number sum-buddy counts.
 
   
 <p align="right">
@@ -20,7 +20,7 @@ pip install cautious-robot
 
 ## How it Works
 
-Cautious-robot will check the provided CSV for `IMG_NAME`, `URL`, and `SUBFOLDERS` (if provided), then download all images that have a value in the `IMG_NAME` column. Note that choice of image filename should be unique; cautious-robot will refuse the request if the filename column selected is not unique within the dataset. It will also check if the provided `OUTPUT` folder already exists, asking the user before proceeding. Images that have a filename but no `URL` are recorded in the error log; the user is prompted whether to ignore or address the missing URLs prior to downloading. Logs are saved in the same directory as the source CSV (logging is done by adding to an existing JSON, so it will not overwrite existing logs with the same name in case of a restarted download). Please note that if the streamed response is interrupted before the image is downloaded in its entirety this error may not be recorded in the error log, but the verifier would register them as missing.
+Cautious-robot will check the provided CSV for `IMG_NAME`, `URL`, and `SUBFOLDERS` (if provided), then download all images that have a value in the `IMG_NAME` column. Note that choice of image filename should be unique; cautious-robot will refuse the request if the filename column selected is not unique within the dataset. It will also check if the images already exist in the provided `OUTPUT` folder to avoid overwriting existing files. Images that have a filename but no `URL` are recorded in the error log; the user is prompted whether to ignore or address missing filenames for URLs prior to downloading. Logs are saved in the same directory as the source CSV (logging is done by adding to an existing JSON, so it will not overwrite existing logs with the same name in case of a restarted download). Please note that if the streamed response is interrupted before the image is downloaded in its entirety this error may not be recorded in the error log, but the verifier would register them as missing.
 
 If desired, a secondary output directory (`OUTPUT_downsized`) will be created with square copies of the images downsized to the specified size (e.g., 256 x 256). The folder structure of this secondary output directory will match that of the un-processed images. Parameters such as time to wait between retries on a failed download, the maximum number of times to retry downloading an image, and which index of the CSV to start with can all also be passed. Cautious-robot will retry image downloads when receiving one of the following [HTTP response status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes): `429, 500, 502, 503, 504`.
 
@@ -78,7 +78,7 @@ cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test
  > Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl.
  > Calculating md5 checksums on examples/test_images: 100%|███████████████████████████████████████████| 16/16 [00:00<00:00, 3133.00it/s]
  > md5 checksums for examples/test_images written to examples/HCGSD_testNA_checksums.csv
- > 8 images were downloaded to examples/test_images of the 8 expected.
+ > There are 8 files in examples/test_images. Based on examples/HCGSD_testNA.csv, there should be 8 images.
  > ```
 ```
 head -n 9 examples/HCGSD_testNA_checksums.csv
@@ -107,7 +107,7 @@ cautious-robot -i examples/HCGSD_testNA.csv -o examples/test_images_subdirs --su
  > Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl.
  > Calculating md5 checksums on examples/test_images_subdirs: 100%|█████████████████████████████████████████████| 8/8 [00:00<00:00, 3106.60it/s]
  > md5 checksums for examples/test_images_subdirs written to examples/HCGSD_testNA_checksums.csv
- > 8 images were downloaded to examples/test_images_subdirs of the 8 expected.
+ > There are 8 files in examples/test_images_subdirs. Based on examples/HCGSD_testNA.csv, there should be 8 images.
  > ```
 ```
 ls examples/test_images_subdirs
@@ -144,19 +144,50 @@ cautious-robot -i examples/HCGSD_test_MD5_mismatch.csv -o examples/test_images_m
  > Download logs are in examples/HCGSD_test_MD5_mismatch_log.jsonl and examples/HCGSD_test_MD5_mismatch_error_log.jsonl.
  > Calculating md5 checksums on examples/test_images_md5_mismatch: 100%|████████████████████████████████| 8/8 [00:00<00:00, 4159.98it/s]
  > md5 checksums for examples/test_images_md5_mismatch written to examples/HCGSD_test_MD5_mismatch_checksums.csv
- > 8 images were downloaded to examples/test_images_md5_mismatch of the 8 expected.
+ > There are 8 files in examples/test_images_md5_mismatch. Based on examples/HCGSD_test_MD5_mismatch.csv, there should be 8 images.
  > Image mismatch: 1 image(s) not aligned, see examples/HCGSD_test_MD5_mismatch_missing.csv for missing image info and check logs.
  > ```
-```
+```bash
 # Check on that mis-aligned image
 head -n 2 examples/HCGSD_test_MD5_mismatch_missing.csv
 ```
  > Output:
  > ```console
- > nhm_specimen,species,subspecies,sex,file_url,filename,md5
- > 10428972,erato,petiverana,male,https://github.com/Imageomics/dashboard-prototype/raw/main/test_data/images/ventral_images/10428972_V_lowres.png,10428972_V_lowres.png,mismatch
+ > nhm_specimen,species,subspecies,sex,file_url,filename,md5,in_img_dir
+ > 10428972,erato,petiverana,male,https://github.com/Imageomics/dashboard-prototype/raw/main/test_data/images/ventral_images/10428972_V_lowres.png,10428972_V_lowres.png,mismatch,False
  > ```
 
+- **Download Partially Existing Images:** some (or all) images may already exist in the output directory
+```bash
+# 1. Download the images
+cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images
+# 2. Remove some of the images
+rm ./examples/test_images/104281*
+# 3. Download the same set of images to get only those removed at 2
+cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images
+```
+
+ > Output:
+ > ```console
+ > There are 6 of the desired files already in examples/test_images. Based on examples/HCGSD_testNA.csv, 2 images should be downloaded.
+ > 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [02:32<00:00, 76.36s/it]
+ > Images downloaded from examples/HCGSD_testNA.csv to examples/test_images.
+ > Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl.
+ > Calculating md5 checksums on examples/test_images: 100%|████████████████████████████████| 8/8 [00:00<00:00, 4159.98it/s]
+ > md5 checksums for examples/test_images written to examples/HCGSD_testNA_checksums.csv
+ > There are 8 files in examples/test_images. Based on examples/HCGSD_testNA.csv, there should be 8 images.
+ > ```
+```bash
+# Attempt to download the same set of images
+cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images
+```
+
+ > Output:
+ > ```console
+ > 'examples/test_images' already contains all images. Exited without executing.
+ > ```
+
+
 ## Development
 To develop the package further:
 
diff --git a/src/cautiousrobot/__main__.py b/src/cautiousrobot/__main__.py
@@ -11,11 +11,10 @@
 import os
 import sys
 from sumbuddy import get_checksums
-from cautiousrobot.utils import process_csv
+from cautiousrobot.utils import process_csv, check_existing_images
 from cautiousrobot.buddy_check import BuddyCheck
 from cautiousrobot.download import download_images
 
-
 def parse_args():
     available_algorithms = ', '.join(hashlib.algorithms_available)
 
@@ -104,7 +103,7 @@ def process_checksums(img_dir, metadata_path, args, source_df):
         # Verify numbers
         checksum_df = pd.read_csv(checksum_path, low_memory=False)
         expected_num_imgs = source_df.shape[0]
-        print(f"{checksum_df.shape[0]} images were downloaded to {img_dir} of the {expected_num_imgs} expected.")
+        print(f"There are {checksum_df.shape[0]} files in {img_dir}. Based on {args.input_file}, there should be {expected_num_imgs} images.")
         
         return checksum_df, expected_num_imgs
     except Exception as e:
@@ -160,9 +159,9 @@ def main():
     # Set source DataFrame for only non-null filename values
     source_df = data_df.loc[data_df[filename_col].notna()].copy()
 
-    # Validate output directory
+    # Validate and handle existing output directory
     img_dir = args.output_dir
-    validate_output_directory(img_dir)
+    source_df, filtered_df = check_existing_images(csv_path, img_dir, source_df, filename_col, subfolders, args.starting_idx)
 
     # Set up log paths
     log_filepath, error_log_filepath, metadata_path = setup_log_paths(csv_path)
@@ -171,7 +170,7 @@ def main():
     if isinstance(args.side_length, int):
         downsample_dest_path = img_dir + "_downsized"
         # Download images from urls & save downsample copy
-        download_images(source_df,
+        download_images(filtered_df,
                        img_dir=img_dir,
                        log_filepath=log_filepath,
                        error_log_filepath=error_log_filepath,
@@ -186,7 +185,7 @@ def main():
         print(f"Images downloaded from {csv_path} to {img_dir}, with downsampled images in {downsample_dest_path}.")
     else:
         # Download images from urls without downsample copy
-        download_images(source_df,
+        download_images(filtered_df,
                        img_dir=img_dir,
                        log_filepath=log_filepath,
                        error_log_filepath=error_log_filepath,
diff --git a/src/cautiousrobot/utils.py b/src/cautiousrobot/utils.py
@@ -1,9 +1,12 @@
 # Helper functions for download
 
 import json
+import sys
 import pandas as pd
 import os
 from PIL import Image
+from sumbuddy import gather_file_paths
+from sumbuddy.exceptions import EmptyInputDirectoryError
 
 
 def log_response(log_data, index, image, file_path, response_code):
@@ -80,4 +83,81 @@ def downsample_and_save_image(image_dir_path, image_name, downsample_dir_path, d
             response_code=str(e)
         )
         update_log(log=log_errors, index=image_index, filepath=error_log_filepath)
-        
+        
+def check_existing_images(csv_path, img_dir, source_df, filename_col, subfolders = None, starting_idx = 0):
+    """
+    Checks which files from the CSV already exist in the image directory.
+
+    Adds a new boolean column `in_img_dir` to source_df indicating which images
+    are already in the directory.
+
+    If all images already exist in the directory, the function will exit early
+    by calling `sys.exit()`, and no further processing will occur.
+
+    Parameters:
+        csv_path (str): Path to the CSV file containing image information.
+        img_dir (str): Path to the directory where images are to be stored.
+        source_df (pd.DataFrame): DataFrame loaded from the CSV, containing image metadata.
+        filename_col (str): Name of the column in source_df that contains image filenames.
+        subfolders (str): Name of the column in source_df that contains subfolder names. (optional)
+        starting_idx (int): Index to start checking from. (optional)
+
+    Returns:
+        updated_df (pd.DataFrame): DataFrame with new column 'in_img_dir' indicating presence in img_dir.
+        filtered_df (pd.DataFrame): DataFrame filtered to only files not present in img_dir.
+    """
+    # Create a copy to avoid modifying the original DataFrame
+    df = source_df.copy()
+    
+    if not os.path.exists(img_dir):
+        # Directory doesn't exist, so nothing to check
+        df["in_img_dir"] = False
+        
+        # If we have a starting index, we still need to mark the skipped ones as True
+        if starting_idx > 0:
+            df.iloc[:starting_idx, df.columns.get_loc("in_img_dir")] = True
+        # Return the updated df and the filtered dataframe of items that still need downloading
+        filtered_df = df[~df["in_img_dir"]].copy()
+        return df, filtered_df
+
+    try:
+        existing_files = gather_file_paths(img_dir)
+    except EmptyInputDirectoryError:
+        # If the directory exists but is empty, sumbuddy raises an error.
+        # We catch it and treat it as an empty file list.
+        existing_files = []
+    
+    existing_full_paths = {os.path.normpath(os.path.relpath(f, img_dir)) for f in existing_files}
+
+    if subfolders:
+        # We use a generic join here, but the apply(os.path.normpath) below fixes it for the specific OS
+        raw_paths = df[subfolders].astype(str) + os.sep + df[filename_col].astype(str)
+
+        # normpath converts '/' to '\' on Windows and collapses redundant separators, so paths compare consistently
+        df["expected_path"] = raw_paths.apply(os.path.normpath)
+    else:
+        # Normalize even simple filenames just in case they contain pathing characters
+        df["expected_path"] = df[filename_col].astype(str).apply(os.path.normpath)
+        
+    # Determine which expected paths physically exist
+    expected_present = df["expected_path"].isin(existing_full_paths)
+    df["in_img_dir"] = expected_present.copy()
+    
+    if starting_idx > 0:
+        df.iloc[:starting_idx, df.columns.get_loc("in_img_dir")] = True
+    
+    # Clean up the temporary column before returning.
+    df = df.drop(columns=["expected_path"])
+    
+    # Create filtered DataFrame
+    filtered_df = df[~df["in_img_dir"]].copy()
+    
+    # Exit if all images are already there
+    if filtered_df.empty:
+        sys.exit(f"'{img_dir}' already contains all images. Exited without executing.")
+    else:
+        # Print directory status message - pre-download
+        num_existing = int(expected_present.sum())  # count only the desired files found, not every file in img_dir
+        print(f"There are {num_existing} of the desired files already in {img_dir}. Based on {csv_path}, {filtered_df.shape[0]} images should be downloaded.")
+        
+    return df, filtered_df
diff --git a/tests/test_download_images.py b/tests/test_download_images.py
@@ -423,39 +423,5 @@ def test_main_missing_filenames(self, mock_input, mock_process_csv, mock_parse_a
         
         self.assertEqual(cm.exception.code, "Exited without executing.")
 
-    @patch('cautiousrobot.__main__.parse_args')
-    @patch('cautiousrobot.__main__.process_csv')
-    @patch('builtins.input', return_value='n')
-    @patch('os.path.exists', return_value=True)
-    def test_main_directory_exists(self, mock_exists, mock_input, mock_process_csv, mock_parse_args):
-        mock_args = MagicMock()
-        mock_args.input_file = 'test.csv'
-        mock_args.img_name_col = 'filename_col'
-        mock_args.url_col = 'url_col'
-        mock_args.subdir_col = None
-        mock_args.output_dir = 'output_dir'
-        mock_args.side_length = None
-        mock_args.wait_time = 0
-        mock_args.max_retries = 3
-        mock_args.starting_idx = 0
-        mock_args.checksum_algorithm = 'md5'
-        mock_args.verifier_col = None
-
-        mock_parse_args.return_value = mock_args
-
-        mock_data = pd.DataFrame({
-            'filename_col': ['file1', 'file2', 'file3', 'file4'],
-            'url_col': ['url1', 'url2', 'url3', 'url4']
-        })
-        
-        mock_process_csv.return_value = mock_data
-
-        with self.assertRaises(SystemExit) as cm:
-            main()
-        
-        # self.assertEqual(cm.exception.code, "mock_args.output_dir Exited without executing.")
-        self.assertEqual(cm.exception.code, f"'{mock_args.output_dir}' already exists. Exited without executing.")
-
-
 if __name__ == '__main__':
     unittest.main()
diff --git a/tests/test_existing_images.py b/tests/test_existing_images.py