| 1 | +# Copilot Instructions for cautious-robot |
| 2 | + |
| 3 | +## Project Overview |
| 4 | + |
| 5 | +**cautious-robot** is a Python CLI tool that downloads images from URLs listed in CSV files. It retries failed downloads, logs errors, verifies checksums, and can optionally downsample images. |
| 6 | + |
| 7 | +### Key Features |
| 8 | +- Download images from URLs in CSV files |
| 9 | +- Generate and verify checksums (MD5, SHA256, etc.) using the `sum-buddy` library |
| 10 | +- Optional image downsampling and resizing using Pillow |
| 11 | +- Comprehensive logging of downloads and errors |
| 12 | +- CSV validation and data integrity checks |
| 13 | +- Support for organizing downloads into subdirectories |
| 14 | + |
| 15 | +## Architecture and Code Organization |
| 16 | + |
| 17 | +### Core Modules |
| 18 | +- `__main__.py` - Main entry point with CLI argument parsing and workflow orchestration |
| 19 | +- `download.py` - Core image downloading functionality with retry logic |
| 20 | +- `buddy_check.py` - Checksum validation and download verification using BuddyCheck class |
| 21 | +- `utils.py` - Helper functions for CSV processing, logging, and image downsampling |
| 22 | +- `exceptions.py` - Custom exception classes |
| 23 | + |
| 24 | +### Key Classes and Functions |
| 25 | +- `BuddyCheck` - Main class for validating downloads against expected checksums |
| 26 | +- `download_images()` - Core function for downloading images with progress tracking |
| 27 | +- `process_csv()` - Validates and processes input CSV files |
| 28 | +- `downsample_and_save_image()` - Handles image resizing operations |
| 29 | + |
| 30 | +## Development Guidelines |
| 31 | + |
| 32 | +### Code Style and Standards |
| 33 | +- Follow PEP 8 style guidelines |
| 34 | +- Use descriptive function and variable names |
| 35 | +- Include comprehensive docstrings for all functions and classes |
| 36 | +- Use type hints where appropriate |
| 37 | +- Handle exceptions gracefully with meaningful error messages |
| 38 | + |
| 39 | +### Testing Patterns |
| 40 | +- Use `unittest` framework for test cases |
| 41 | +- Create temporary files for testing file operations |
| 42 | +- Test both success and failure scenarios |
| 43 | +- Mock external dependencies (HTTP requests) when appropriate |
| 44 | +- Include tests for edge cases (empty files, network errors, etc.) |
| 45 | + |
| 46 | +### Dependencies Management |
| 47 | +- Core dependencies: `requests`, `pandas`, `pillow`, `sum-buddy` (`argparse` is part of the standard library and needs no install) |
| 48 | +- Development dependencies: `pytest`, `ruff`, `pre-commit` |
| 49 | +- Use `pyproject.toml` for project configuration |
| 50 | +- Pin dependency versions for reproducible builds |
| 51 | + |
| 52 | +## CLI Interface Patterns |
| 53 | + |
| 54 | +### Required Arguments |
| 55 | +- `-i, --input-file`: Path to CSV file with URLs |
| 56 | +- `-o, --output-dir`: Directory for downloaded images |
| 57 | + |
| 58 | +### Optional Arguments |
| 59 | +- `-s, --subdir-col`: Column name for subdirectory organization |
| 60 | +- `-n, --img-name-col`: Column for image filenames (default: "filename") |
| 61 | +- `-u, --url-col`: Column with URLs (default: "file_url") |
| 62 | +- `-w, --wait-time`: Retry delay in seconds (default: 3) |
| 63 | +- `-r, --max-retries`: Maximum retry attempts for a single image (default: 5) |
| 64 | +- `-x, --starting-idx`: Row index of the DataFrame loaded from the CSV at which to start downloading (default: 0) |
| 65 | +- `-l, --side-length`: Pixels per side for square resized images |
| 66 | +- `-a, --checksum-algorithm`: Hash algorithm for checksums (default: "md5") |
| 67 | +- `-v, --verifier-col`: Column with expected checksums for validation |
| 68 | + |
| 69 | +### CSV Format Expectations |
| 70 | +- Required columns: filename and URL columns (customizable names) |
| 71 | +- Optional columns: subdirectory organization, checksum verification |
| 72 | +- Case-insensitive column matching (all converted to lowercase) |
| 73 | +- Handle missing values gracefully |
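The CSV expectations above can be sketched as follows. This is a minimal illustration that uses the standard library `csv` module as a stand-in, not the project's pandas-based `process_csv()`; the helper name `load_rows` and the sample data are hypothetical, while the default column names come from the CLI options listed earlier.

```python
import csv
import io


def load_rows(csv_text, url_col="file_url", name_col="filename"):
    """Read CSV text, lowercase the header, and check required columns.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(io.StringIO(csv_text))
    # Lowercase the header so column matching is case-insensitive.
    header = [col.strip().lower() for col in next(reader)]
    missing = [col for col in (name_col, url_col) if col not in header]
    if missing:
        raise ValueError(f"CSV is missing required column(s): {missing}")
    # Pair each value with its column name; empty cells become None.
    return [
        {col: (value if value != "" else None) for col, value in zip(header, row)}
        for row in reader
    ]


sample = "Filename,File_URL,Species\nfrog.jpg,https://example.org/frog.jpg,\n"
rows = load_rows(sample)
```

Representing empty cells as `None` keeps missing values explicit, so later steps can decide whether to skip the row or prompt the user.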
| 74 | + |
| 75 | +## Error Handling and Validation |
| 76 | + |
| 77 | +### Input Validation |
| 78 | +- Verify CSV file extension |
| 79 | +- Check for required columns in CSV |
| 80 | +- Validate filename uniqueness |
| 81 | +- Prevent overwriting existing output directories |
| 82 | +- Handle missing filenames with user prompts |
| 83 | + |
| 84 | +### Download Reliability |
| 85 | +- Implement retry logic with exponential backoff |
| 86 | +- Log all HTTP response codes as strings (not integers) |
| 87 | +- Create separate log files for successful downloads and errors |
| 88 | +- Generate JSONL format logs for structured data |
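One way to read the retry guideline above is sketched below. It assumes `max_retries` counts total attempts and that the delay doubles after each failure starting from `wait_time`; the `fetch` callable, the `sleep` parameter, and the function name are hypothetical and exist so the sketch can run without network access. The project's `download_images()` may behave differently.

```python
import time


def fetch_with_retries(fetch, url, max_retries=5, wait_time=3, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or attempts run out.

    `fetch` is a hypothetical callable returning (status_code, body).
    Returns (body_or_None, list_of_status_codes_as_strings).
    """
    codes = []
    for attempt in range(max_retries):
        status, body = fetch(url)
        codes.append(str(status))  # response codes are kept as strings
        if status == 200:
            return body, codes
        if attempt < max_retries - 1:
            # Exponential backoff: wait_time, 2 * wait_time, 4 * wait_time, ...
            sleep(wait_time * 2 ** attempt)
    return None, codes


# Simulated server: two failures, then success. Delays are recorded, not slept.
responses = iter([(503, b""), (429, b""), (200, b"image-bytes")])
delays = []
body, codes = fetch_with_retries(
    lambda url: next(responses), "https://example.org/a.jpg", sleep=delays.append
)
```

Injecting `fetch` and `sleep` also makes the retry path easy to unit test with mocked HTTP responses.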
| 89 | + |
| 90 | +### Checksum Verification |
| 91 | +- Support multiple hash algorithms via `hashlib` |
| 92 | +- Compare downloaded file checksums with expected values |
| 93 | +- Report mismatches and missing files in a separate CSV |
| 94 | +- Use `sum-buddy` library for checksum generation |
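The comparison step can be illustrated with `hashlib` alone. This is a sketch under the assumption that expected checksums are hex strings; it does not use or reproduce the `sum-buddy` API or the `BuddyCheck` class, and the helper names `file_checksum` and `matches_expected` are hypothetical.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="md5", chunk_size=65536):
    """Hash a file in chunks so large images are never loaded whole."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_expected(path, expected, algorithm="md5"):
    """Compare a file's checksum with an expected hex digest."""
    return file_checksum(path, algorithm) == expected.strip().lower()


# Demo with a throwaway file standing in for a downloaded image.
with tempfile.TemporaryDirectory() as tmp_dir:
    image_path = os.path.join(tmp_dir, "frog.jpg")
    with open(image_path, "wb") as handle:
        handle.write(b"hello")
    md5_hex = file_checksum(image_path)
    sha256_ok = matches_expected(
        image_path, hashlib.sha256(b"hello").hexdigest(), "sha256"
    )
    md5_mismatch = matches_expected(image_path, "0" * 32)
```

Normalizing the expected value with `strip().lower()` guards against whitespace or uppercase digests in the verifier column.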
| 95 | + |
| 96 | +## File Organization |
| 97 | + |
| 98 | +### Output Structure |
| 99 | +- Main directory: specified by `--output-dir` |
| 100 | +- Optional subdirectories: based on `--subdir-col` values |
| 101 | +- Downsampled images: `<output-dir>_downsized` directory |
| 102 | +- Log files: same directory as input CSV with descriptive suffixes |
| 103 | + |
| 104 | +### Logging Files |
| 105 | +- `<csv_name>_log.jsonl` - Successful downloads |
| 106 | +- `<csv_name>_error_log.jsonl` - Failed downloads |
| 107 | +- `<csv_name>_checksums.csv` - Generated checksums |
| 108 | +- `<csv_name>_missing.csv` - Verification mismatches |
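The naming scheme above can be derived from the input CSV path with `pathlib`. This is a sketch; the function name `log_paths` and the dictionary keys are hypothetical, while the suffixes are the ones listed above.

```python
from pathlib import Path


def log_paths(csv_path):
    """Build the log and report paths that sit next to the input CSV."""
    csv_path = Path(csv_path)
    stem, parent = csv_path.stem, csv_path.parent
    return {
        "log": parent / f"{stem}_log.jsonl",
        "errors": parent / f"{stem}_error_log.jsonl",
        "checksums": parent / f"{stem}_checksums.csv",
        "missing": parent / f"{stem}_missing.csv",
    }


paths = log_paths("data/images.csv")
```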
| 109 | + |
| 110 | +## Performance Considerations |
| 111 | + |
| 112 | +### Image Processing |
| 113 | +- Use Pillow for image operations |
| 114 | +- Implement memory-efficient downsampling |
| 115 | +- Handle various image formats gracefully |
| 116 | +- Process images sequentially to avoid memory issues |
| 117 | + |
| 118 | +### Progress Tracking |
| 119 | +- Use `tqdm` for progress bars |
| 120 | +- Report download statistics |
| 121 | +- Provide clear status messages for user feedback |
| 122 | + |
| 123 | +## Testing and CI/CD |
| 124 | + |
| 125 | +### Test Structure |
| 126 | +- Unit tests for individual functions |
| 127 | +- Integration tests for full workflows |
| 128 | +- Use temporary files for file I/O testing |
| 129 | +- Test error conditions and edge cases |
| 130 | + |
| 131 | +### GitHub Actions |
| 132 | +- Run tests on Python 3.10, 3.11, 3.12, 3.13 |
| 133 | +- Use Ruff for linting |
| 134 | +- Test on `ubuntu-latest` |
| 135 | +- Separate workflows for PRs and pushes |
| 136 | + |
| 137 | +### Pre-commit Hooks |
| 138 | +- Ruff linting is applied automatically on commit |
| 139 | +- Use `.pre-commit-config.yaml` for configuration |
| 140 | +- Keep hooks updated with `pre-commit autoupdate` |
| 141 | + |
| 142 | +## Common Patterns and Conventions |
| 143 | + |
| 144 | +### DataFrame Operations |
| 145 | +- Convert column names to lowercase for case-insensitive matching |
| 146 | +- Use `pd.read_csv()` with `low_memory=False` so column dtypes are inferred from the whole file rather than chunk by chunk |
| 147 | +- Filter DataFrames using `.loc[]` to keep row selection explicit and avoid chained indexing |
| 148 | +- Handle NaN values explicitly |
| 149 | + |
| 150 | +### Logging and Error Reporting |
| 151 | +- Use structured logging with JSON (specifically JSONL) format |
| 152 | +- Include context information (image name, URL, index) |
| 153 | +- Provide user-friendly error messages |
| 154 | +- Log response codes as strings to avoid serialization issues |
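A JSONL log entry following these conventions might be written as below. The field names (`index`, `image`, `url`, `response_code`) and the helper `append_jsonl` are assumptions for illustration, not the project's actual log schema.

```python
import json
import os
import tempfile


def append_jsonl(path, record):
    """Append one JSON object per line (JSONL) to the log file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


# Demo: write two error records to a throwaway log, then read them back.
with tempfile.TemporaryDirectory() as tmp_dir:
    log_file = os.path.join(tmp_dir, "data_error_log.jsonl")
    append_jsonl(log_file, {
        "index": 0,
        "image": "frog.jpg",
        "url": "https://example.org/frog.jpg",
        "response_code": str(404),  # stored as a string, not an integer
    })
    append_jsonl(log_file, {
        "index": 1,
        "image": "toad.jpg",
        "url": "https://example.org/toad.jpg",
        "response_code": str(503),
    })
    with open(log_file, encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle]
```

Appending one record per line means a partially completed run still leaves a readable log.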
| 155 | + |
| 156 | +### Configuration Management |
| 157 | +- Use argument groups for required vs optional parameters |
| 158 | +- Provide sensible defaults for optional parameters |
| 159 | +- Validate user inputs before processing |
| 160 | +- Support flexible column naming |
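The argument-group pattern can be sketched with `argparse` as follows. Flags and defaults are taken from the CLI options listed earlier; the group titles, the `type=int` choices, and the `build_parser` name are assumptions, and only a subset of the options is shown.

```python
import argparse


def build_parser():
    """Build a parser with separate groups for required and optional flags."""
    parser = argparse.ArgumentParser(prog="cautious-robot")

    required = parser.add_argument_group("required arguments")
    required.add_argument("-i", "--input-file", required=True,
                          help="path to CSV file with URLs")
    required.add_argument("-o", "--output-dir", required=True,
                          help="directory for downloaded images")

    optional = parser.add_argument_group("optional arguments")
    optional.add_argument("-s", "--subdir-col",
                          help="column name for subdirectory organization")
    optional.add_argument("-u", "--url-col", default="file_url",
                          help="column with URLs")
    optional.add_argument("-w", "--wait-time", type=int, default=3,
                          help="retry delay in seconds")
    optional.add_argument("-r", "--max-retries", type=int, default=5,
                          help="maximum retry attempts for a single image")
    optional.add_argument("-a", "--checksum-algorithm", default="md5",
                          help="hash algorithm for checksums")
    return parser


args = build_parser().parse_args(["-i", "data.csv", "-o", "images", "-s", "species"])
```

Grouping the flags changes only the `--help` layout; `required=True` is what actually enforces the two mandatory arguments.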
| 161 | + |
| 162 | +## Examples and Usage |
| 163 | + |
| 164 | +### Basic Usage |
| 165 | +```bash |
| 166 | +cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images |
| 167 | +``` |
| 168 | + |
| 169 | +### With Subdirectories and Checksums |
| 170 | +```bash |
| 171 | +cautious-robot -i data.csv -o images -s species -a sha256 -v expected_sha256 |
| 172 | +``` |
| 173 | + |
| 174 | +### With Downsampling |
| 175 | +```bash |
| 176 | +cautious-robot -i data.csv -o images -l 512 |
| 177 | +``` |
| 178 | + |
| 179 | +When implementing features or fixing bugs, consider the tool's primary use case in scientific image processing workflows, where data integrity and reliable downloads are critical. |