
Conversation

@BranDavidSebastian
Collaborator

@BranDavidSebastian BranDavidSebastian commented Aug 29, 2025

Summary

  • Rename the factory to FetchRecentlySyncedProductFactory so it refreshes issues, syncs ASINs, and matches images via pHash
  • Restore direct image import handling and wire new factory into sync flow and cron task
  • Add match_images flag so flows can opt into image matching while imports skip it
  • Keep FetchRemoteValidationIssueFactory in dedicated issues module and update mixin accordingly
  • Refactor recently-synced product factory to delegate work to smaller helper methods

Testing

  • pre-commit run --files OneSila/sales_channels/integrations/amazon/factories/sales_channels/recently_synced_products.py
  • pytest OneSila/sales_channels/integrations/amazon/tests

https://chatgpt.com/codex/tasks/task_e_68b1919a384c832ea3e4b2974d620e89

Summary by Sourcery

Generalize the Amazon product refresh process by consolidating issue fetching, ASIN synchronization, and optional image matching into a new factory. Replace the legacy issue-fetch logic across imports, GraphQL, and cron tasks, and introduce a 15-minute refresh flow that uses perceptual hashing for image validation.

New Features:

  • Introduce FetchRecentlySyncedProductFactory to fetch listing issues, sync ASINs, and optionally match images via perceptual hashing
  • Add an image similarity module with phash-based matching and HTTP fetching safeguards
  • Add a new cron task and flow to refresh recently synced Amazon products every 15 minutes

Enhancements:

  • Refactor the product-refresh factory into smaller helper methods and add a match_images flag
  • Replace legacy FetchRemoteIssuesFactory references in GraphQL mutations, import processors, and cron tasks with the new factory
  • Retain FetchRemoteValidationIssueFactory in the dedicated issues module and clean up unused mixins

Build:

  • Add the imagehash library for pHash image matching
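
The pHash matching this enables boils down to comparing perceptual hashes by Hamming distance and normalizing by the total bit count. A minimal sketch of that threshold logic (function names and the 0.9 threshold are illustrative, not the module's actual API):

```python
def phash_similarity(h1: int, h2: int, hash_size: int = 8) -> float:
    """Normalized similarity between two perceptual hashes.

    h1/h2 are the hashes as integers; hash_size ** 2 is the total
    number of bits (64 for the default 8x8 pHash).
    """
    max_bits = hash_size ** 2
    dist = bin(h1 ^ h2).count("1")  # Hamming distance between the hashes
    return 1.0 - dist / max_bits


def images_match(h1: int, h2: int, threshold: float = 0.9) -> bool:
    """Treat two images as the same when similarity meets the threshold."""
    return phash_similarity(h1, h2) >= threshold
```

With the imagehash library the integer hashes would come from `int(str(imagehash.phash(img)), 16)` or, more directly, from subtracting two `ImageHash` objects, which yields the Hamming distance.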

Tests:

  • Update existing tests to reference and validate behaviors of FetchRecentlySyncedProductFactory

@sourcery-ai

sourcery-ai bot commented Aug 29, 2025

Reviewer's Guide

This PR refactors and generalizes the Amazon product refresh flow. It renames and extracts the existing issues-fetching factory into a new FetchRecentlySyncedProductFactory that supports ASIN synchronization, issue persistence, and optional image matching via pHash; updates cron tasks and GraphQL mutations to use the new flow; adjusts import logic so imports opt out of image matching; and adds an image similarity utility along with its dependency.

File-Level Changes

Introduce FetchRecentlySyncedProductFactory with optional image matching
  • Create a new factory class with helper methods for validation, issue clearing, response extraction, ASIN sync, issue persistence, and image matching
  • Add match_images flag to control pHash-based image matching
  • Implement image_similarity module to download images, compute pHash, and compare with a threshold
  • Add imagehash dependency in requirements
  Files:
  • OneSila/sales_channels/integrations/amazon/factories/sales_channels/recently_synced_products.py
  • OneSila/sales_channels/integrations/amazon/image_similarity.py
  • requirements.txt

Replace and remove old FetchRemoteIssuesFactory references
  • Delete the outdated FetchRemoteIssuesFactory implementation from the issues module
  • Update imports in schema mutations, import processors, and mixins to use the new factory
  • Adjust tests to reference and patch FetchRecentlySyncedProductFactory instead of the old factory
  Files:
  • OneSila/sales_channels/integrations/amazon/factories/sales_channels/issues.py
  • OneSila/sales_channels/integrations/amazon/schema/mutations.py
  • OneSila/sales_channels/integrations/amazon/factories/imports/products_imports.py
  • OneSila/sales_channels/integrations/amazon/factories/mixins.py
  • OneSila/sales_channels/integrations/amazon/tests/tests_factories/tests_issues_factory.py
  • OneSila/sales_channels/integrations/amazon/tests/tests_factories/tests_products_imports.py

Revamp cron job and extract flow for recent product refresh
  • Remove the old bi-daily issues cron task and replace it with a 15-minute cron calling the new flow
  • Create flows/recently_synced_products.py encapsulating the cutoff logic and invoking the new factory
  • Wire the new flow into tasks.py
  Files:
  • OneSila/sales_channels/integrations/amazon/tasks.py
  • OneSila/sales_channels/integrations/amazon/flows/recently_synced_products.py
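
The flow's cutoff logic amounts to selecting products whose last sync falls inside the 15-minute window and handing each to the factory. A hedged sketch of that selection (the field name and list-based input are illustrative, not OneSila's actual schema; in Django this would be a `.filter(last_synced_at__gte=cutoff)` queryset):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

REFRESH_WINDOW = timedelta(minutes=15)


def recently_synced(products: list, now: Optional[datetime] = None) -> list:
    """Return products whose 'last_synced_at' falls within the refresh window.

    'products' stands in for a queryset of dicts with a 'last_synced_at'
    datetime; everything synced since 'now - REFRESH_WINDOW' is kept.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - REFRESH_WINDOW
    return [p for p in products if p["last_synced_at"] >= cutoff]
```

The cron task would then iterate the result and run the factory per product, with match_images enabled.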


@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • Consider adding explicit logging in the _match_images method to surface phash comparison failures instead of silently ignoring exceptions for easier debugging.
  • The zip-based matching between MediaProductThrough records and remote URLs relies on consistent ordering—please verify or document this assumption to avoid misaligned image associations when counts differ.
  • For high-volume issue imports, consider batching or using bulk_create for AmazonProductIssue objects to reduce database round trips and improve performance.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider adding explicit logging in the _match_images method to surface phash comparison failures instead of silently ignoring exceptions for easier debugging.
- The zip-based matching between MediaProductThrough records and remote URLs relies on consistent ordering—please verify or document this assumption to avoid misaligned image associations when counts differ.
- For high-volume issue imports, consider batching or using bulk_create for AmazonProductIssue objects to reduce database round trips and improve performance.
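
The bulk_create suggestion amounts to accumulating issue rows and inserting them in chunks instead of saving one row per round trip. A generic sketch of the chunking half (pure Python; in Django each chunk would be passed to `AmazonProductIssue.objects.bulk_create`, a real queryset method, while the helper name here is hypothetical):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int = 500) -> Iterator[List[T]]:
    """Yield lists of at most 'size' items, preserving input order."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch


# Illustrative Django usage:
# for batch in chunked(issue_objects, size=500):
#     AmazonProductIssue.objects.bulk_create(batch)
```

Note that Django's bulk_create also accepts a `batch_size` argument, so the explicit chunking is only needed when the objects are built incrementally or memory must stay bounded.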

## Individual Comments

### Comment 1
<location> `OneSila/sales_channels/integrations/amazon/image_similarity.py:12` </location>
<code_context>
+MAX_BYTES = 25 * 1024 * 1024  # 25 MB safety cap
+
+
+def _fetch_bytes(url: str, timeout: Tuple[float, float] = (5, 20)) -> bytes:
+    """Download URL into memory with a size cap."""
+    with requests.get(url, headers=DEFAULT_HEADERS, stream=True, timeout=timeout) as r:
+        r.raise_for_status()
+        total = 0
+        chunks = []
+        for chunk in r.iter_content(1024 * 32):
+            if not chunk:
+                break
+            total += len(chunk)
+            if total > MAX_BYTES:
+                raise ValueError(f"Image too large (> {MAX_BYTES} bytes): {url}")
+            chunks.append(chunk)
+        return b"".join(chunks)
+
+
</code_context>

<issue_to_address>
No retry logic for transient network errors when fetching images.

Adding retry logic will help prevent missed matches due to temporary network issues. You can implement a retry loop or use a library with built-in retry support to improve reliability.

Suggested implementation:

```python
import math
from typing import Tuple
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from PIL import Image
import imagehash

```

```python
def _fetch_bytes(url: str, timeout: Tuple[float, float] = (5, 20)) -> bytes:
    """Download URL into memory with a size cap and retry logic for transient errors."""
    session = requests.Session()
    retries = Retry(
        total=3,
        backoff_factor=0.5,
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["GET"],
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    with session.get(url, headers=DEFAULT_HEADERS, stream=True, timeout=timeout) as r:
        r.raise_for_status()
        total = 0
        chunks = []
        for chunk in r.iter_content(1024 * 32):
            if not chunk:
                break
            total += len(chunk)
            if total > MAX_BYTES:
                raise ValueError(f"Image too large (> {MAX_BYTES} bytes): {url}")
            chunks.append(chunk)
        return b"".join(chunks)

```
</issue_to_address>

### Comment 2
<location> `OneSila/sales_channels/integrations/amazon/image_similarity.py:28` </location>
<code_context>
+        return b"".join(chunks)
+
+
+def _pil_from_source(src: str) -> Image.Image:
+    if src.startswith("http://") or src.startswith("https://"):
+        data = _fetch_bytes(src)
</code_context>

<issue_to_address>
No validation of image format or content after loading.

Consider adding checks to ensure the image is valid and in a supported format to prevent exceptions from corrupt or unsupported files.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def _pil_from_source(src: str) -> Image.Image:
    if src.startswith("http://") or src.startswith("https://"):
        data = _fetch_bytes(src)
        img = Image.open(io.BytesIO(data))
    else:
        img = Image.open(src)
    img.load()
    return img
=======
def _pil_from_source(src: str) -> Image.Image:
    SUPPORTED_FORMATS = {"JPEG", "PNG", "BMP", "GIF", "WEBP"}
    try:
        if src.startswith("http://") or src.startswith("https://"):
            data = _fetch_bytes(src)
            img = Image.open(io.BytesIO(data))
        else:
            img = Image.open(src)
        img.load()
    except Exception as e:
        raise ValueError(f"Failed to load image from source '{src}': {e}")

    if not hasattr(img, "format") or img.format is None:
        raise ValueError(f"Image format could not be determined for source '{src}'.")

    if img.format.upper() not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported image format '{img.format}' for source '{src}'. Supported formats: {', '.join(SUPPORTED_FORMATS)}.")

    return img
>>>>>>> REPLACE

</suggested_fix>


### Comment 3
<location> `OneSila/sales_channels/integrations/amazon/image_similarity.py` </location>
<code_context>
```python
h1 = imagehash.phash(img1, hash_size=hash_size)
h2 = imagehash.phash(img2, hash_size=hash_size)
dist = int(h1 - h2)
max_bits = hash_size * hash_size
```
</code_context>

suggestion (code-quality): Replace `x * x` with `x ** 2` (square-identity)

Suggested change:
```python
max_bits = hash_size**2
```