Commit 5487a1a

Copilot, hlapp, and egrace479 authored
✨ Set up Copilot instructions for cautious-robot repository (#39)
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: Hilmar Lapp <[email protected]>
Co-authored-by: Elizabeth Campolongo <[email protected]>
1 parent a7b42cd commit 5487a1a

1 file changed: +179 -0 lines changed

.github/copilot-instructions.md

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
1+
# Copilot Instructions for cautious-robot
2+
3+
## Project Overview
4+
5+
**cautious-robot** is a Python CLI tool designed to download images from URLs specified in CSV files. It provides robust downloading capabilities with verification, error handling, and optional image downsampling.
6+
7+
### Key Features
8+
- Download images from URLs in CSV files
9+
- Generate and verify checksums (MD5, SHA256, etc.) using the `sum-buddy` library
10+
- Optional image downsampling and resizing using Pillow
11+
- Comprehensive logging of downloads and errors
12+
- CSV validation and data integrity checks
13+
- Support for organizing downloads into subdirectories
14+
15+
## Architecture and Code Organization
16+
17+
### Core Modules
18+
- `__main__.py` - Main entry point with CLI argument parsing and workflow orchestration
19+
- `download.py` - Core image downloading functionality with retry logic
20+
- `buddy_check.py` - Checksum validation and download verification using BuddyCheck class
21+
- `utils.py` - Helper functions for CSV processing, logging, and image downsampling
22+
- `exceptions.py` - Custom exception classes
23+
24+
### Key Classes and Functions
25+
- `BuddyCheck` - Main class for validating downloads against expected checksums
26+
- `download_images()` - Core function for downloading images with progress tracking
27+
- `process_csv()` - Validates and processes input CSV files
28+
- `downsample_and_save_image()` - Handles image resizing operations
29+
30+
## Development Guidelines
31+
32+
### Code Style and Standards
33+
- Follow PEP 8 style guidelines
34+
- Use descriptive function and variable names
35+
- Include comprehensive docstrings for all functions and classes
36+
- Use type hints where appropriate
37+
- Handle exceptions gracefully with meaningful error messages
38+
39+
### Testing Patterns
40+
- Use `unittest` framework for test cases
41+
- Create temporary files for testing file operations
42+
- Test both success and failure scenarios
43+
- Mock external dependencies (HTTP requests) when appropriate
44+
- Include tests for edge cases (empty files, network errors, etc.)
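
The testing patterns above can be sketched roughly as follows. This is only an illustration: `save_image` is a hypothetical helper written for this example (it is not a function from cautious-robot), and the mocked `session` stands in for whatever HTTP client the real code uses.

```python
import tempfile
import unittest
from pathlib import Path
from unittest import mock


def save_image(url, dest, session):
    """Hypothetical helper: fetch `url` via `session` and write it to `dest`.

    Returns the HTTP status code as a string.
    """
    response = session.get(url)
    if response.status_code == 200:
        Path(dest).write_bytes(response.content)
    return str(response.status_code)


class SaveImageTest(unittest.TestCase):
    def setUp(self):
        # A temporary directory keeps file-system side effects out of the repo.
        self.tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self.tmp.cleanup)
        self.dest = Path(self.tmp.name) / "img.jpg"

    def test_success_writes_file(self):
        # Mock the HTTP layer so the test never touches the network.
        session = mock.Mock()
        session.get.return_value = mock.Mock(status_code=200, content=b"fake-bytes")
        code = save_image("https://example.org/img.jpg", self.dest, session)
        self.assertEqual(code, "200")
        self.assertEqual(self.dest.read_bytes(), b"fake-bytes")

    def test_failure_writes_nothing(self):
        session = mock.Mock()
        session.get.return_value = mock.Mock(status_code=404, content=b"")
        code = save_image("https://example.org/missing.jpg", self.dest, session)
        self.assertEqual(code, "404")
        self.assertFalse(self.dest.exists())
```

Since `pytest` is listed as a development dependency, a `unittest.TestCase` like this can be collected by either runner.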
45+
46+
### Dependencies Management
47+
- Core dependencies: `requests`, `pandas`, `pillow`, `sum-buddy` (`argparse` ships with the Python standard library and needs no install)
48+
- Development dependencies: `pytest`, `ruff`, `pre-commit`
49+
- Use `pyproject.toml` for project configuration
50+
- Pin dependency versions for reproducible builds
51+
52+
## CLI Interface Patterns
53+
54+
### Required Arguments
55+
- `-i, --input-file`: Path to CSV file with URLs
56+
- `-o, --output-dir`: Directory for downloaded images
57+
58+
### Optional Arguments
59+
- `-s, --subdir-col`: Column name for subdirectory organization
60+
- `-n, --img-name-col`: Column for image filenames (default: "filename")
61+
- `-u, --url-col`: Column with URLs (default: "file_url")
62+
- `-w, --wait-time`: Retry delay in seconds (default: 3)
63+
- `-r, --max-retries`: Maximum retry attempts for a single image (default: 5)
64+
- `-x, --starting-idx`: Index of DataFrame from CSV at which to start download (default: 0)
65+
- `-l, --side-length`: Pixels per side for square resized images
66+
- `-a, --checksum-algorithm`: Hash algorithm for checksums (default: "md5")
67+
- `-v, --verifier-col`: Column with expected checksums for validation
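
A minimal sketch of how the flags listed above could be declared with `argparse`. The flag names and defaults come from the lists above; the argument types, help strings, group titles, and the `build_parser` name are assumptions for illustration and may differ from the actual `__main__.py`.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (types are assumed)."""
    parser = argparse.ArgumentParser(
        prog="cautious-robot",
        description="Download images from URLs listed in a CSV file.",
    )

    # Separate groups make required vs. optional flags obvious in --help.
    required = parser.add_argument_group("required arguments")
    required.add_argument("-i", "--input-file", required=True,
                          help="path to CSV file with URLs")
    required.add_argument("-o", "--output-dir", required=True,
                          help="directory for downloaded images")

    optional = parser.add_argument_group("optional arguments")
    optional.add_argument("-s", "--subdir-col", default=None,
                          help="column name for subdirectory organization")
    optional.add_argument("-n", "--img-name-col", default="filename",
                          help="column for image filenames")
    optional.add_argument("-u", "--url-col", default="file_url",
                          help="column with URLs")
    optional.add_argument("-w", "--wait-time", type=float, default=3,
                          help="retry delay in seconds")
    optional.add_argument("-r", "--max-retries", type=int, default=5,
                          help="maximum retry attempts for a single image")
    optional.add_argument("-x", "--starting-idx", type=int, default=0,
                          help="DataFrame index at which to start downloading")
    optional.add_argument("-l", "--side-length", type=int, default=None,
                          help="pixels per side for square resized images")
    optional.add_argument("-a", "--checksum-algorithm", default="md5",
                          help="hash algorithm for checksums")
    optional.add_argument("-v", "--verifier-col", default=None,
                          help="column with expected checksums")
    return parser
```

`argparse` converts the dashed long options to underscored attribute names, so `--input-file` is read back as `args.input_file`.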
68+
69+
### CSV Format Expectations
70+
- Required columns: filename and URL columns (customizable names)
71+
- Optional columns: subdirectory organization, checksum verification
72+
- Case-insensitive column matching (all converted to lowercase)
73+
- Handle missing values gracefully
74+
75+
## Error Handling and Validation
76+
77+
### Input Validation
78+
- Verify CSV file extension
79+
- Check for required columns in CSV
80+
- Validate filename uniqueness
81+
- Prevent overwriting existing output directories
82+
- Handle missing filenames with user prompts
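
Some of the checks above (file extension, required columns, filename uniqueness) might look like the sketch below. To stay dependency-free it uses the standard-library `csv` module rather than pandas, and it raises `ValueError` rather than the project's custom exceptions; `validate_csv` and its messages are hypothetical names for this example.

```python
import csv
from collections import Counter
from pathlib import Path


def validate_csv(csv_path, name_col="filename", url_col="file_url"):
    """Return the CSV rows as dicts, or raise ValueError on invalid input."""
    path = Path(csv_path)
    if path.suffix.lower() != ".csv":
        raise ValueError(f"Expected a .csv file, got: {path.name}")

    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Lowercase the header so column matching is case-insensitive.
        reader.fieldnames = [name.lower() for name in (reader.fieldnames or [])]
        columns = set(reader.fieldnames)
        rows = list(reader)

    wanted = (name_col.lower(), url_col.lower())
    missing = [col for col in wanted if col not in columns]
    if missing:
        raise ValueError(f"Missing required column(s): {', '.join(missing)}")

    # Duplicate filenames would silently overwrite each other on disk.
    counts = Counter(row[name_col.lower()] for row in rows)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    if duplicates:
        raise ValueError(f"Duplicate filenames: {', '.join(duplicates)}")
    return rows
```

Running these checks before any download starts means a malformed CSV fails fast instead of partway through a long job.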
83+
84+
### Download Reliability
85+
- Implement retry logic with exponential backoff
86+
- Log all HTTP response codes as strings (not integers)
87+
- Create separate log files for successful downloads and errors
88+
- Generate JSONL format logs for structured data
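
As a rough illustration of retry logic with exponential backoff, the sketch below doubles the delay after each failed attempt, starting from `wait_time`. Several things here are assumptions rather than the tool's actual behavior: the `fetch_with_backoff` name, the injected `fetch(url)` callable (a stand-in for an HTTP client that returns a `(status_code, body)` pair), the set of retryable status codes, and the reading of `max_retries` as the total number of attempts.

```python
import time

# Assumed set of status codes worth retrying; the real tool may differ.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def _record(url, status, attempts):
    # Response codes are stored as strings, per the guideline above.
    return {"url": url, "response_code": str(status), "attempts": attempts}


def fetch_with_backoff(url, fetch, wait_time=3.0, max_retries=5, sleep=time.sleep):
    """Call fetch(url) up to max_retries times, doubling the delay each time.

    Returns (body, record) on success or (None, record) on failure.
    """
    status = None
    attempts = 0
    for attempt in range(max_retries):
        attempts = attempt + 1
        status, body = fetch(url)
        if status == 200:
            return body, _record(url, status, attempts)
        # Give up on non-retryable codes or once the attempt budget is spent.
        if status not in RETRYABLE_STATUSES or attempts == max_retries:
            break
        sleep(wait_time * 2 ** attempt)  # 3 s, 6 s, 12 s, ... for wait_time=3
    return None, _record(url, status, attempts)
```

Passing `sleep` as a parameter lets tests substitute a recorder so the backoff schedule can be checked without actually waiting.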
89+
90+
### Checksum Verification
91+
- Support multiple hash algorithms via `hashlib`
92+
- Compare downloaded file checksums with expected values
93+
- Report mismatches and missing files in separate CSV
94+
- Use `sum-buddy` library for checksum generation
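
The comparison step can be sketched with `hashlib` alone. The tool itself generates checksums through `sum-buddy`, so this is a stand-in for the idea rather than the project's implementation; `file_checksum` and `find_problems` are hypothetical names.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large images are never fully loaded in memory."""
    digest = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_problems(expected, root, algorithm="md5"):
    """Compare files under `root` against expected digests.

    `expected` maps a relative filename to its expected hex digest.
    Returns a list of (filename, "missing" | "mismatch") tuples.
    """
    problems = []
    for name, want in expected.items():
        target = Path(root) / name
        if not target.is_file():
            problems.append((name, "missing"))
        elif file_checksum(target, algorithm) != want.strip().lower():
            # Normalizing case avoids false mismatches on upper-case digests.
            problems.append((name, "mismatch"))
    return problems
```

`hashlib.new()` accepts the algorithm name as a string, which maps naturally onto a value passed via `--checksum-algorithm`.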
95+
96+
## File Organization
97+
98+
### Output Structure
99+
- Main directory: specified by `--output-dir`
100+
- Optional subdirectories: based on `--subdir-col` values
101+
- Downsampled images: `<output-dir>_downsized` directory
102+
- Log files: same directory as input CSV with descriptive suffixes
103+
104+
### Logging Files
105+
- `<csv_name>_log.jsonl` - Successful downloads
106+
- `<csv_name>_error_log.jsonl` - Failed downloads
107+
- `<csv_name>_checksums.csv` - Generated checksums
108+
- `<csv_name>_missing.csv` - Verification mismatches
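
The naming scheme above can be derived from the input and output paths with `pathlib`. The function names and dictionary keys below are hypothetical; only the suffixes and the `_downsized` convention come from the lists above.

```python
from pathlib import Path


def log_file_paths(input_csv):
    """Build the log/report paths that sit next to the input CSV."""
    csv_path = Path(input_csv)
    stem = csv_path.stem  # "<csv_name>" without the .csv extension
    return {
        "log": csv_path.with_name(f"{stem}_log.jsonl"),
        "error_log": csv_path.with_name(f"{stem}_error_log.jsonl"),
        "checksums": csv_path.with_name(f"{stem}_checksums.csv"),
        "missing": csv_path.with_name(f"{stem}_missing.csv"),
    }


def downsized_dir(output_dir):
    """Sibling directory for resized images: <output-dir>_downsized."""
    out = Path(output_dir)
    return out.with_name(f"{out.name}_downsized")


def image_path(output_dir, filename, subdir=None):
    """Destination path, nested under `subdir` when one is given."""
    base = Path(output_dir)
    return base / subdir / filename if subdir else base / filename
```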
109+
110+
## Performance Considerations
111+
112+
### Image Processing
113+
- Use Pillow for image operations
114+
- Implement memory-efficient downsampling
115+
- Handle various image formats gracefully
116+
- Process images sequentially to avoid memory issues
117+
118+
### Progress Tracking
119+
- Use `tqdm` for progress bars
120+
- Report download statistics
121+
- Provide clear status messages for user feedback
122+
123+
## Testing and CI/CD
124+
125+
### Test Structure
126+
- Unit tests for individual functions
127+
- Integration tests for full workflows
128+
- Use temporary files for file I/O testing
129+
- Test error conditions and edge cases
130+
131+
### GitHub Actions
132+
- Run tests on Python 3.10, 3.11, 3.12, 3.13
133+
- Use Ruff for linting
134+
- Test on Ubuntu latest
135+
- Separate workflows for PRs and pushes
136+
137+
### Pre-commit Hooks
138+
- Ruff linting automatically applied
139+
- Use `.pre-commit-config.yaml` for configuration
140+
- Keep hooks updated with `pre-commit autoupdate`
141+
142+
## Common Patterns and Conventions
143+
144+
### DataFrame Operations
145+
- Convert column names to lowercase for case-insensitive matching
146+
- Use `pd.read_csv()` with `low_memory=False` for large files
147+
- Filter DataFrames using `.loc[]` with boolean masks rather than chained indexing
148+
- Handle NaN values explicitly
149+
150+
### Logging and Error Reporting
151+
- Use structured logging with JSON (specifically JSONL) format
152+
- Include context information (image name, URL, index)
153+
- Provide user-friendly error messages
154+
- Log response codes as strings to avoid serialization issues
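
A small sketch of JSONL logging in this spirit: one JSON object per line, with context fields and the response code stored as a string. The field names and the `append_log_entry` / `read_log` helpers are assumptions for illustration, not the tool's actual log schema.

```python
import json
from pathlib import Path


def append_log_entry(log_path, index, filename, url, response_code):
    """Append one JSON object as a single line (JSONL) and return it."""
    entry = {
        "index": int(index),
        "filename": filename,
        "file_url": url,
        # Cast to str so the value serializes the same way regardless of
        # whether it arrived as an int or some other numeric type.
        "response_code": str(response_code),
    }
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_log(log_path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(log_path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because each line is a complete JSON document, the log stays readable even if a run is interrupted partway through.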
155+
156+
### Configuration Management
157+
- Use argument groups for required vs optional parameters
158+
- Provide sensible defaults for optional parameters
159+
- Validate user inputs before processing
160+
- Support flexible column naming
161+
162+
## Examples and Usage
163+
164+
### Basic Usage
165+
```bash
cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images
```
168+
169+
### With Subdirectories and Checksums
170+
```bash
cautious-robot -i data.csv -o images -s species -a sha256 -v expected_sha256
```
173+
174+
### With Downsampling
175+
```bash
cautious-robot -i data.csv -o images -l 512
```
178+
179+
When implementing features or fixing bugs, consider the tool's primary use case in scientific image processing workflows, where data integrity and reliable downloads are critical.
