Generalize Amazon product refresh and image matching #418

base: development
Conversation
Reviewer's Guide

This PR refactors and generalizes the Amazon product refresh flow: it renames and extracts the existing issues-fetching factory into a new FetchRecentlySyncedProductFactory, which now supports ASIN synchronization, issue persistence, and optional image matching via pHash. It also updates cron tasks and GraphQL mutations to use this new flow, adjusts import logic to opt out of image matching during imports, and adds an image similarity utility along with its dependency.
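As a rough sketch of the flow the guide describes (only the class name and the match_images flag come from this PR; every other name below is an illustrative assumption):

```python
class FetchRecentlySyncedProductFactory:
    """Hypothetical skeleton; the real implementation lives in the PR."""

    def __init__(self, sales_channel, match_images: bool = True):
        self.sales_channel = sales_channel
        # Imports opt out of image matching by passing match_images=False.
        self.match_images = match_images

    def run(self) -> None:
        for remote_product in self._recently_synced_products():
            self._sync_asin(remote_product)       # pull the ASIN from the listing
            self._persist_issues(remote_product)  # store validation issues locally
            if self.match_images:
                self._match_images(remote_product)  # pHash-compare local vs. remote images

    # Illustrative stubs only.
    def _recently_synced_products(self) -> list:
        return []

    def _sync_asin(self, remote_product) -> None:
        pass

    def _persist_issues(self, remote_product) -> None:
        pass

    def _match_images(self, remote_product) -> None:
        pass
```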
Hey there - I've reviewed your changes - here's some feedback:
- Consider adding explicit logging in the _match_images method to surface phash comparison failures instead of silently ignoring exceptions for easier debugging.
- The zip-based matching between MediaProductThrough records and remote URLs relies on consistent ordering—please verify or document this assumption to avoid misaligned image associations when counts differ.
- For high-volume issue imports, consider batching or using bulk_create for AmazonProductIssue objects to reduce database round trips and improve performance.
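A minimal sketch of the batching idea from the last bullet, assuming the Django model is named AmazonProductIssue as above (the field names are illustrative, not the actual schema):

```python
from typing import Iterable, Mapping


def persist_issues_bulk(remote_product, raw_issues: Iterable[Mapping]) -> None:
    # Build all rows in memory first; the field names here are assumptions.
    rows = [
        AmazonProductIssue(
            remote_product=remote_product,
            code=raw.get("code", ""),
            message=raw.get("message", ""),
        )
        for raw in raw_issues
    ]
    # One chunked bulk INSERT instead of a database round trip per issue.
    AmazonProductIssue.objects.bulk_create(rows, batch_size=500)
```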
## Individual Comments
### Comment 1
<location> `OneSila/sales_channels/integrations/amazon/image_similarity.py:12` </location>
<code_context>
+MAX_BYTES = 25 * 1024 * 1024  # 25 MB safety cap
+
+
+def _fetch_bytes(url: str, timeout: Tuple[float, float] = (5, 20)) -> bytes:
+    """Download URL into memory with a size cap."""
+    with requests.get(url, headers=DEFAULT_HEADERS, stream=True, timeout=timeout) as r:
+        r.raise_for_status()
+        total = 0
+        chunks = []
+        for chunk in r.iter_content(1024 * 32):
+            if not chunk:
+                break
+            total += len(chunk)
+            if total > MAX_BYTES:
+                raise ValueError(f"Image too large (> {MAX_BYTES} bytes): {url}")
+            chunks.append(chunk)
+    return b"".join(chunks)
+
+
</code_context>
<issue_to_address>
No retry logic for transient network errors when fetching images.
Adding retry logic will help prevent missed matches due to temporary network issues. You can implement a retry loop or use a library with built-in retry support to improve reliability.
Suggested implementation:
```python
import math
from typing import Tuple
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from PIL import Image
import imagehash
```
```python
def _fetch_bytes(url: str, timeout: Tuple[float, float] = (5, 20)) -> bytes:
    """Download URL into memory with a size cap and retry logic for transient errors."""
    session = requests.Session()
    # Retry transient 5xx responses up to 3 times with exponential backoff;
    # raise_on_status=False defers error handling to raise_for_status() below.
    retries = Retry(
        total=3,
        backoff_factor=0.5,
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["GET"],
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    with session.get(url, headers=DEFAULT_HEADERS, stream=True, timeout=timeout) as r:
        r.raise_for_status()
        total = 0
        chunks = []
        for chunk in r.iter_content(1024 * 32):
            if not chunk:
                break
            total += len(chunk)
            if total > MAX_BYTES:
                raise ValueError(f"Image too large (> {MAX_BYTES} bytes): {url}")
            chunks.append(chunk)
    return b"".join(chunks)
```
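A further refinement worth considering, beyond the suggestion itself: construct the Session and mounted adapter once at module scope and reuse them across calls, since building a new session per fetch discards connection pooling.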
</issue_to_address>
### Comment 2
<location> `OneSila/sales_channels/integrations/amazon/image_similarity.py:28` </location>
<code_context>
+    return b"".join(chunks)
+
+
+def _pil_from_source(src: str) -> Image.Image:
+    if src.startswith("http://") or src.startswith("https://"):
+        data = _fetch_bytes(src)
</code_context>
<issue_to_address>
No validation of image format or content after loading.
Consider adding checks to ensure the image is valid and in a supported format to prevent exceptions from corrupt or unsupported files.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def _pil_from_source(src: str) -> Image.Image:
    if src.startswith("http://") or src.startswith("https://"):
        data = _fetch_bytes(src)
        img = Image.open(io.BytesIO(data))
    else:
        img = Image.open(src)
    img.load()
    return img
=======
def _pil_from_source(src: str) -> Image.Image:
    SUPPORTED_FORMATS = {"JPEG", "PNG", "BMP", "GIF", "WEBP"}
    try:
        if src.startswith("http://") or src.startswith("https://"):
            data = _fetch_bytes(src)
            img = Image.open(io.BytesIO(data))
        else:
            img = Image.open(src)
        img.load()
    except Exception as e:
        raise ValueError(f"Failed to load image from source '{src}': {e}")
    if img.format is None:
        raise ValueError(f"Image format could not be determined for source '{src}'.")
    if img.format.upper() not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported image format '{img.format}' for source '{src}'. "
            f"Supported formats: {', '.join(SUPPORTED_FORMATS)}."
        )
    return img
>>>>>>> REPLACE
</suggested_fix>
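One note on the suggested fix above: Image.open is lazy, so the img.load() call is what forces the full decode; truncated or corrupt files typically raise only at that point, which is why it sits inside the try block.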
suggestion (code-quality): Replace x * x with x ** 2 (square-identity)

Code context:

```python
h1 = imagehash.phash(img1, hash_size=hash_size)
h2 = imagehash.phash(img2, hash_size=hash_size)
dist = int(h1 - h2)
max_bits = hash_size * hash_size
```

Suggested change:

```python
max_bits = hash_size ** 2
```
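For context on this snippet: imagehash.phash produces a hash_size × hash_size bit hash, and subtracting two hashes yields their Hamming distance, which can be normalized into a similarity score. A self-contained sketch of that normalization (any threshold applied on top of it would be a project-specific choice, not something defined in this PR):

```python
import imagehash
from PIL import Image


def phash_similarity(img1: Image.Image, img2: Image.Image, hash_size: int = 16) -> float:
    """Return a similarity score in [0, 1]; 1.0 means identical perceptual hashes."""
    h1 = imagehash.phash(img1, hash_size=hash_size)
    h2 = imagehash.phash(img2, hash_size=hash_size)
    dist = int(h1 - h2)        # Hamming distance between the two hashes
    max_bits = hash_size ** 2  # total number of bits in each hash
    return 1.0 - dist / max_bits
```

A matcher could then treat two images as the same when the score clears a threshold, e.g. phash_similarity(a, b) >= 0.9 (illustrative value).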
Summary

- `FetchRecentlySyncedProductFactory` to refresh issues, ASINs, and match images via pHash
- `match_images` flag so flows can opt into image matching while imports skip it
- `FetchRemoteValidationIssueFactory` in dedicated `issues` module and update mixin accordingly
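For illustration, the call sites implied by these bullets would presumably look like this (the exact constructor signature is an assumption):

```python
# Cron/GraphQL refresh: full flow including pHash image matching.
FetchRecentlySyncedProductFactory(sales_channel, match_images=True).run()

# Import flow: skip image matching.
FetchRecentlySyncedProductFactory(sales_channel, match_images=False).run()
```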
Testing

- `pre-commit run --files OneSila/sales_channels/integrations/amazon/factories/sales_channels/recently_synced_products.py`
- `pytest OneSila/sales_channels/integrations/amazon/tests`

https://chatgpt.com/codex/tasks/task_e_68b1919a384c832ea3e4b2974d620e89
Summary by Sourcery
Generalize the Amazon product refresh process by consolidating issue fetching, ASIN synchronization, and optional image matching into a new factory, replace legacy issue fetch logic across imports, GraphQL, and cron tasks, and introduce a 15-minute refresh flow using perceptual hashing for image validation.