Conversation

@EmersonFras
Contributor

Summary

Implements a feature to check for and skip downloading images that already exist in the target image directory (img_dir), addressing Issue #34.

Changes Implemented

  • A new method, check_existing_images, uses sum-buddy.gather_file_paths to collect file names from the existing img_dir.
  • A boolean column, in_img_dir, is added to the source_df to track which images are already present.
  • The download_images function is now passed a filtered dataframe containing only images that need to be downloaded (in_img_dir == False).
  • Added an early exit condition: if the filtered dataframe is empty (all images exist), the process exits with the message: '{img_dir}' already contains all images. Exited without executing.
  • Updated the output message regarding file counts to: There are {checksum_df.shape[0]} files in {img_dir}. Based on {csv_path}, there should be {expected_num_imgs} images.
  • Updated the README's CLI Examples section to show the new outputs and the cases affected by these changes.
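
As an illustration of the flow described in the list above, here is a minimal sketch, not the code merged in this PR. It assumes the source dataframe has a `filename` column, and it uses `os.walk` as a stand-in for sum-buddy's path gathering; the helper names `mark_existing_images` and `images_to_download` are hypothetical.

```python
import os
import tempfile

import pandas as pd


def mark_existing_images(source_df, img_dir, filename_col="filename"):
    """Return a copy of source_df with a boolean in_img_dir column.

    Hypothetical helper: os.walk stands in for sum-buddy's file gathering.
    """
    existing = set()
    for _root, _dirs, files in os.walk(img_dir):
        existing.update(files)
    marked = source_df.copy()
    # True where the expected file name is already present under img_dir
    marked["in_img_dir"] = marked[filename_col].isin(list(existing))
    return marked


def images_to_download(source_df, img_dir, filename_col="filename"):
    """Keep only the rows whose image is not yet in img_dir."""
    marked = mark_existing_images(source_df, img_dir, filename_col)
    return marked[~marked["in_img_dir"]]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as img_dir:
        # Pretend one image was downloaded on an earlier run
        open(os.path.join(img_dir, "a.jpg"), "w").close()
        source_df = pd.DataFrame({"filename": ["a.jpg", "b.jpg"]})
        todo = images_to_download(source_df, img_dir)
        if todo.empty:
            # Early exit, using the message quoted in the summary
            print(f"'{img_dir}' already contains all images. Exited without executing.")
        else:
            print(todo["filename"].tolist())
```

The filter returns a new dataframe rather than mutating `source_df`, so the caller can still compare the full expected image count against what is in the directory afterwards.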

Closes #34

Contributor

Copilot AI left a comment

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

EmersonFras and others added 10 commits November 25, 2025 10:06
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…mageomics/cautious-robot into feature/issue-34/check-existing-images
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Member

@egrace479 egrace479 left a comment

I pushed one update to the description (beginning of the README) to reflect the updated functionality. A few other suggestions for clarity are noted below.

A bigger issue I noticed is that the check_existing_images function will conflict with the --starting-idx parameter when it is passed, and I don't believe that's addressed.

Suggested method: run the existing-image check on all filenames from the passed starting index onward, then download only the images from that index onward that aren't already in the output folder. This should maintain the current, expected behavior.
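
A rough sketch of this suggestion, under stated assumptions: the starting index is treated as a zero-based row position in the source dataframe, the file names live in a `filename` column, and `select_downloads` is a hypothetical helper rather than a function in the package.

```python
import pandas as pd


def select_downloads(source_df, existing_filenames, starting_idx=0,
                     filename_col="filename"):
    """Rows at or after starting_idx whose image is not already present.

    Hypothetical helper; starting_idx is assumed to be a row position.
    """
    # Rows before the starting index are never considered for download
    candidates = source_df.iloc[starting_idx:].copy()
    # Run the existing-image check only on the remaining rows
    candidates["in_img_dir"] = candidates[filename_col].isin(
        list(existing_filenames)
    )
    return candidates[~candidates["in_img_dir"]]


if __name__ == "__main__":
    df = pd.DataFrame({"filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]})
    # c.jpg is already on disk; the run should start at row 1
    todo = select_downloads(df, {"c.jpg"}, starting_idx=1)
    print(todo["filename"].tolist())
```

Slicing before the check keeps the original row labels on the result, so any downstream logic that reports positions still lines up with the input CSV.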

EmersonFras and others added 4 commits December 1, 2025 11:38
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Member

@egrace479 egrace479 left a comment

A couple more items. I am also concerned that process_checksums needs an edit here.

EmersonFras and others added 4 commits December 4, 2025 12:42
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Member

@egrace479 egrace479 left a comment

One last thing I noticed while running through the README examples: the new in_img_dir column now shows up in the missing CSV. I made the suggestion on the line I could and just pointed to the line above.
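
One possible way to keep a bookkeeping column like in_img_dir out of a written CSV is to drop it just before saving. This is only a sketch of that idea, not necessarily the change made in this PR; `write_missing_csv` and the example column names are hypothetical.

```python
import io

import pandas as pd


def write_missing_csv(missing_df, destination, helper_cols=("in_img_dir",)):
    """Write missing_df to destination without internal helper columns.

    Hypothetical helper; destination may be a path or a file-like object.
    """
    # errors="ignore" keeps this safe when the column was never added
    cleaned = missing_df.drop(columns=list(helper_cols), errors="ignore")
    cleaned.to_csv(destination, index=False)
    return cleaned


if __name__ == "__main__":
    missing_df = pd.DataFrame(
        {
            "filename": ["b.jpg"],
            "file_url": ["https://example.org/b.jpg"],
            "in_img_dir": [False],
        }
    )
    buffer = io.StringIO()
    write_missing_csv(missing_df, buffer)
    print(buffer.getvalue())
```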

Member

@egrace479 egrace479 left a comment

🚀

@EmersonFras EmersonFras merged commit fe031cb into main Dec 8, 2025
7 checks passed
@egrace479 egrace479 deleted the feature/issue-34/check-existing-images branch December 8, 2025 21:11

Development

Successfully merging this pull request may close these issues.

Prune image download CSV based on existing directory contents

3 participants