Feat: Pre-check for existing images using in_img_dir column (Closes #34) (#43)
* Add check_existing_images() to compare existing image files with CSV list
* Integrate existing image check into main download flow
* Only print pre-download directory status if missing images
* Add tests for check_existing_images() including partial and complete directory cases, update test_download_images for new logic flow
* Update CLI Examples in README for new check existing image examples
* Use fullpath when checking existing files
* Handle directory existing but empty case
* Update description to match enhanced functionality
* Use os.path.normpath to normalize pathing for comparison
* Implement check_existing_images with starting_idx
* Add subfolders handling check for existing images
* Remove test_main_directory_exists
---------
Co-authored-by: Copilot <[email protected]>
Co-authored-by: egrace479 <[email protected]>
Co-authored-by: Elizabeth Campolongo <[email protected]>
README.md: 39 additions & 8 deletions
@@ -2,7 +2,7 @@
<img align="right" src="cautious-robot_logo.png" alt="cautious-robot logo, an image of a robot generated with Canva Magic Media" width="384" />
-
I am a simple downloader that downloads images from URLs in a CSV and names them by the given column (after ensuring all its values are unique). I can organize your images into subfolders based on any column in your CSV and will warn you if the parent image folder already exists before overwriting it. If you need square images for modeling, I'll create a second directory (organized in the same format) with downsized copies of your images. Patience is a virtue, so I will wait a designated time before re-requesting an image after receiving an error on my retry list; if all retries are expended or I receive another error, I log that for your review and move on. I also keep a log of all successful responses. After download, [`sum-buddy`](https://github.com/Imageomics/sum-buddy) helps me gather and record checksums for all downloaded images. If the source CSV has a checksum column, I can then do a buddy-check to verify all expected images are downloaded intact. At a minimum, I check the number of expected images matches the number sum-buddy counts.
+
I am a simple downloader that downloads images from URLs in a CSV and names them by the given column (after ensuring all its values are unique). I can organize your images into subfolders based on any column in your CSV, and will check for images already downloaded in your target folder. If you need square images for modeling, I'll create a second directory (organized in the same format) with downsized copies of your images. Patience is a virtue, so I will wait a designated time before re-requesting an image after receiving an error on my retry list; if all retries are expended or I receive another error, I log that for your review and move on. I also keep a log of all successful responses. After download, [`sum-buddy`](https://github.com/Imageomics/sum-buddy) helps me gather and record checksums for all downloaded images. If the source CSV has a checksum column, I can then do a buddy-check to verify all expected images are downloaded intact. At a minimum, I check the number of expected images matches the number sum-buddy counts.
<p align="right">
@@ -20,7 +20,7 @@ pip install cautious-robot
## How it Works
-
Cautious-robot will check the provided CSV for `IMG_NAME`, `URL`, and `SUBFOLDERS` (if provided), then download all images that have a value in the `IMG_NAME` column. Note that choice of image filename should be unique; cautious-robot will refuse the request if the filename column selected is not unique within the dataset. It will also check if the provided `OUTPUT` folder already exists, asking the user before proceeding. Images that have a filename but no `URL` are recorded in the error log; the user is prompted whether to ignore or address the missing URLs prior to downloading. Logs are saved in the same directory as the source CSV (logging is done by adding to an existing JSON, so it will not overwrite existing logs with the same name in case of a restarted download). Please note that if the streamed response is interrupted before the image is downloaded in its entirety this error may not be recorded in the error log, but the verifier would register them as missing.
+
Cautious-robot will check the provided CSV for `IMG_NAME`, `URL`, and `SUBFOLDERS` (if provided), then download all images that have a value in the `IMG_NAME` column. Note that choice of image filename should be unique; cautious-robot will refuse the request if the filename column selected is not unique within the dataset. It will also check if the images already exist in the provided `OUTPUT` folder to avoid overwriting existing files. Images that have a filename but no `URL` are recorded in the error log; the user is prompted whether to ignore or address missing filenames for URLs prior to downloading. Logs are saved in the same directory as the source CSV (logging is done by adding to an existing JSON, so it will not overwrite existing logs with the same name in case of a restarted download). Please note that if the streamed response is interrupted before the image is downloaded in its entirety this error may not be recorded in the error log, but the verifier would register them as missing.
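As a rough illustration of the pre-check described above, the sketch below splits the expected image paths into those already on disk and those still missing. It is not cautious-robot's actual implementation: the function name `find_existing_images`, its parameters, and the assumption that the filename column already includes the file extension are all hypothetical.

```python
import os


def find_existing_images(rows, img_dir, name_col, subfolder_col=None, starting_idx=0):
    """Split expected image paths into (existing, missing) lists.

    Hypothetical helper for illustration only.
    `rows` is a list of dicts, e.g. as produced by csv.DictReader.
    """
    existing, missing = [], []
    # Skip rows before starting_idx, mirroring a restarted download.
    for row in rows[starting_idx:]:
        parts = [img_dir]
        if subfolder_col:
            parts.append(row[subfolder_col])
        parts.append(row[name_col])
        # Normalize so that "out/./a.jpg" and "out/a.jpg" compare equal.
        path = os.path.normpath(os.path.join(*parts))
        # A missing or empty directory simply yields no existing files.
        (existing if os.path.isfile(path) else missing).append(path)
    return existing, missing
```

With this shape, a caller could download only the paths in `missing` and report a directory status only when that list is non-empty.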
If desired, a secondary output directory (`OUTPUT_downsized`) will be created with square copies of the images downsized to the specified size (e.g., 256 x 256). The folder structure of this secondary output directory will match that of the un-processed images. Parameters such as time to wait between retries on a failed download, the maximum number of times to retry downloading an image, and which index of the CSV to start with can all also be passed. Cautious-robot will retry image downloads when receiving one of the following [HTTP response status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes): `429, 500, 502, 503, 504`.
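The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the tool's code: `fetch_with_retries`, the injected `fetch` callable, and the default wait time and retry count are assumptions made for the example; only the list of retryable status codes comes from the text.

```python
import time

# Status codes that trigger a retry, as listed in the README.
RETRY_STATUS_CODES = {429, 500, 502, 503, 504}


def fetch_with_retries(fetch, url, wait_seconds=3.0, max_retries=5, sleep=time.sleep):
    """Call fetch(url) -> (status_code, content), retrying on retryable codes.

    Hypothetical helper: `fetch` and `sleep` are injected so the logic can be
    exercised without network access. Defaults are illustrative assumptions.
    Returns (status_code, content, retries_used).
    """
    status, content = fetch(url)
    retries = 0
    # Any status outside the retry list (success or another error) ends the loop.
    while status in RETRY_STATUS_CODES and retries < max_retries:
        sleep(wait_seconds)
        retries += 1
        status, content = fetch(url)
    return status, content, retries
```

A caller would then log the final status: a success goes to the response log, while an exhausted retry budget or a non-retryable error goes to the error log.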