Generalize Amazon product refresh and image matching #418

base: development
Conversation
Reviewer's Guide

This PR refactors and generalizes the Amazon product refresh flow: it renames and extracts the existing issues-fetching factory into a new FetchRecentlySyncedProductFactory, which now supports ASIN synchronization, issue persistence, and optional image matching via pHash. It also updates cron tasks and GraphQL mutations to use this new flow, adjusts import logic to opt out of image matching during imports, and adds an image similarity utility along with its dependency.
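As a rough sketch of the flow the guide describes (only the class name and the match_images flag come from this PR; every other name below is an illustrative assumption):

```python
class FetchRecentlySyncedProductFactory:
    """Hypothetical skeleton; the real implementation lives in the PR."""

    def __init__(self, sales_channel, match_images: bool = True):
        self.sales_channel = sales_channel
        # Imports opt out of image matching by passing match_images=False.
        self.match_images = match_images

    def run(self) -> None:
        for remote_product in self._recently_synced_products():
            self._sync_asin(remote_product)       # pull the ASIN from the listing
            self._persist_issues(remote_product)  # store validation issues locally
            if self.match_images:
                self._match_images(remote_product)  # pHash-compare local vs. remote images

    # Illustrative stubs only.
    def _recently_synced_products(self) -> list:
        return []

    def _sync_asin(self, remote_product) -> None:
        pass

    def _persist_issues(self, remote_product) -> None:
        pass

    def _match_images(self, remote_product) -> None:
        pass
```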
Hey there - I've reviewed your changes - here's some feedback:
- Consider adding explicit logging in the _match_images method to surface phash comparison failures instead of silently ignoring exceptions for easier debugging.
- The zip-based matching between MediaProductThrough records and remote URLs relies on consistent ordering—please verify or document this assumption to avoid misaligned image associations when counts differ.
- For high-volume issue imports, consider batching or using bulk_create for AmazonProductIssue objects to reduce database round trips and improve performance.
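A minimal sketch of the batching idea from the last bullet, assuming the Django model is named AmazonProductIssue as above (the field names are illustrative, not the actual schema):

```python
from typing import Iterable, Mapping


def persist_issues_bulk(remote_product, raw_issues: Iterable[Mapping]) -> None:
    # Build all rows in memory first; the field names here are assumptions.
    rows = [
        AmazonProductIssue(
            remote_product=remote_product,
            code=raw.get("code", ""),
            message=raw.get("message", ""),
        )
        for raw in raw_issues
    ]
    # One chunked bulk INSERT instead of a database round trip per issue.
    AmazonProductIssue.objects.bulk_create(rows, batch_size=500)
```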
## Individual Comments
### Comment 1
<location> `OneSila/sales_channels/integrations/amazon/image_similarity.py:12` </location>
<code_context>
+MAX_BYTES = 25 * 1024 * 1024  # 25 MB safety cap
+
+
+def _fetch_bytes(url: str, timeout: Tuple[float, float] = (5, 20)) -> bytes:
+    """Download URL into memory with a size cap."""
+    with requests.get(url, headers=DEFAULT_HEADERS, stream=True, timeout=timeout) as r:
+        r.raise_for_status()
+        total = 0
+        chunks = []
+        for chunk in r.iter_content(1024 * 32):
+            if not chunk:
+                break
+            total += len(chunk)
+            if total > MAX_BYTES:
+                raise ValueError(f"Image too large (> {MAX_BYTES} bytes): {url}")
+            chunks.append(chunk)
+    return b"".join(chunks)
+
+
</code_context>
<issue_to_address>
No retry logic for transient network errors when fetching images.
Adding retry logic will help prevent missed matches due to temporary network issues. You can implement a retry loop or use a library with built-in retry support to improve reliability.
Suggested implementation:
```python
import math
from typing import Tuple
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from PIL import Image
import imagehash
```
```python
def _fetch_bytes(url: str, timeout: Tuple[float, float] = (5, 20)) -> bytes:
    """Download URL into memory with a size cap and retry logic for transient errors."""
    session = requests.Session()
    # Retry transient 5xx responses up to 3 times with exponential backoff;
    # raise_on_status=False defers error handling to raise_for_status() below.
    retries = Retry(
        total=3,
        backoff_factor=0.5,
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["GET"],
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    with session.get(url, headers=DEFAULT_HEADERS, stream=True, timeout=timeout) as r:
        r.raise_for_status()
        total = 0
        chunks = []
        for chunk in r.iter_content(1024 * 32):
            if not chunk:
                break
            total += len(chunk)
            if total > MAX_BYTES:
                raise ValueError(f"Image too large (> {MAX_BYTES} bytes): {url}")
            chunks.append(chunk)
    return b"".join(chunks)
```
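A further refinement worth considering, beyond the suggestion itself: construct the Session and mounted adapter once at module scope and reuse them across calls, since building a new session per fetch discards connection pooling.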
</issue_to_address>
### Comment 2
<location> `OneSila/sales_channels/integrations/amazon/image_similarity.py:28` </location>
<code_context>
+    return b"".join(chunks)
+
+
+def _pil_from_source(src: str) -> Image.Image:
+    if src.startswith("http://") or src.startswith("https://"):
+        data = _fetch_bytes(src)
</code_context>
<issue_to_address>
No validation of image format or content after loading.
Consider adding checks to ensure the image is valid and in a supported format to prevent exceptions from corrupt or unsupported files.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def _pil_from_source(src: str) -> Image.Image:
    if src.startswith("http://") or src.startswith("https://"):
        data = _fetch_bytes(src)
        img = Image.open(io.BytesIO(data))
    else:
        img = Image.open(src)
    img.load()
    return img
=======
def _pil_from_source(src: str) -> Image.Image:
    SUPPORTED_FORMATS = {"JPEG", "PNG", "BMP", "GIF", "WEBP"}
    try:
        if src.startswith("http://") or src.startswith("https://"):
            data = _fetch_bytes(src)
            img = Image.open(io.BytesIO(data))
        else:
            img = Image.open(src)
        img.load()
    except Exception as e:
        raise ValueError(f"Failed to load image from source '{src}': {e}")
    if img.format is None:
        raise ValueError(f"Image format could not be determined for source '{src}'.")
    if img.format.upper() not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported image format '{img.format}' for source '{src}'. "
            f"Supported formats: {', '.join(SUPPORTED_FORMATS)}."
        )
    return img
>>>>>>> REPLACE
</suggested_fix>
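One note on the suggested fix above: Image.open is lazy, so the img.load() call is what forces the full decode; truncated or corrupt files typically raise only at that point, which is why it sits inside the try block.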
suggestion (code-quality): Replace x * x with x ** 2 (square-identity)

Code context:

```python
h1 = imagehash.phash(img1, hash_size=hash_size)
h2 = imagehash.phash(img2, hash_size=hash_size)
dist = int(h1 - h2)
max_bits = hash_size * hash_size
```

Suggested change:

```python
max_bits = hash_size ** 2
```
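For context on this snippet: imagehash.phash produces a hash_size × hash_size bit hash, and subtracting two hashes yields their Hamming distance, which can be normalized into a similarity score. A self-contained sketch of that normalization (any threshold applied on top of it would be a project-specific choice, not something defined in this PR):

```python
import imagehash
from PIL import Image


def phash_similarity(img1: Image.Image, img2: Image.Image, hash_size: int = 16) -> float:
    """Return a similarity score in [0, 1]; 1.0 means identical perceptual hashes."""
    h1 = imagehash.phash(img1, hash_size=hash_size)
    h2 = imagehash.phash(img2, hash_size=hash_size)
    dist = int(h1 - h2)        # Hamming distance between the two hashes
    max_bits = hash_size ** 2  # total number of bits in each hash
    return 1.0 - dist / max_bits
```

A matcher could then treat two images as the same when the score clears a threshold, e.g. phash_similarity(a, b) >= 0.9 (illustrative value).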
Summary

- `FetchRecentlySyncedProductFactory` to refresh issues, ASINs, and match images via pHash
- `match_images` flag so flows can opt into image matching while imports skip it
- `FetchRemoteValidationIssueFactory` in dedicated `issues` module and update mixin accordingly
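For illustration, the call sites implied by these bullets would presumably look like this (the exact constructor signature is an assumption):

```python
# Cron/GraphQL refresh: full flow including pHash image matching.
FetchRecentlySyncedProductFactory(sales_channel, match_images=True).run()

# Import flow: skip image matching.
FetchRecentlySyncedProductFactory(sales_channel, match_images=False).run()
```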
Testing

- `pre-commit run --files OneSila/sales_channels/integrations/amazon/factories/sales_channels/recently_synced_products.py`
- `pytest OneSila/sales_channels/integrations/amazon/tests`

https://chatgpt.com/codex/tasks/task_e_68b1919a384c832ea3e4b2974d620e89
Summary by Sourcery
Generalize the Amazon product refresh process by consolidating issue fetching, ASIN synchronization, and optional image matching into a new factory, replace legacy issue fetch logic across imports, GraphQL, and cron tasks, and introduce a 15-minute refresh flow using perceptual hashing for image validation.