Conversation

@m4dm4rtig4n
Contributor

Summary

Replaces the Engie PDF scraper with scraping of the HelloWatt comparison site to get more accurate, up-to-date pricing. The scraper now supports 34 Engie offers (Référence 3 ans and Tranquillité, each in BASE and HC/HP options). Also fixes a React crash in AdminOffers.tsx that occurred after logout because of corrupted cache data.

Changes

  • Engie scraper: HTML parsing with BeautifulSoup instead of PDF extraction
  • Added defensive array check in AdminOffers.tsx to prevent crashes
  • Updated fallback pricing to December 2025
  • Updated documentation

Testing

Visit /admin/offers and click "Prévisualiser" on the Engie provider to verify all 34 offers are detected correctly.

🤖 Generated with Claude Code

- Replace Engie PDF scraper with HelloWatt comparison site scraping
- Add HTML parsing with BeautifulSoup for pricing tables
- Support 34 Engie offers: Référence 3 ans + Tranquillité (BASE + HC/HP)
- Update fallback pricing data to December 2025
- Fix AdminOffers.tsx: defensive array check to prevent crashes after logout

Co-Authored-By: Claude <[email protected]>
Copilot AI review requested due to automatic review settings December 5, 2025 07:53
Clément VALENTIN and others added 2 commits December 5, 2025 09:00
Co-Authored-By: Claude <[email protected]>
Change label from "Elec Référence 1 an (PDF officiel)" to "Tarifs Engie (HelloWatt)"

Co-Authored-By: Claude <[email protected]>
Copilot AI left a comment

Pull request overview

This PR refactors the Engie scraper from PDF parsing to HTML web scraping via HelloWatt, increasing coverage from 17 to 34 offers. It also adds a defensive fix in AdminOffers.tsx to prevent crashes from corrupted cache data, and updates documentation to reflect the new scraping approach and pricing data as of December 2025.

Key Changes

  • Engie scraper refactored: Replaced PDF extraction with BeautifulSoup HTML parsing from HelloWatt comparison site
  • Offer coverage doubled: Now scrapes 34 offers (Référence 3 ans and Tranquillité, both in BASE and HC/HP variants) vs. previous 17 offers
  • Frontend crash fix: Added defensive Array.isArray() check in AdminOffers.tsx to handle corrupted cache scenarios

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| apps/api/src/services/price_scrapers/engie_scraper.py | Complete rewrite: removed PDF parsing, added HTML scraping with BeautifulSoup, new table/header parsing methods, updated fallback prices to December 2025 |
| apps/web/src/pages/AdminOffers.tsx | Added defensive array check to prevent crashes when offersData is corrupted or undefined |
| docs/features-spec/energy-providers-scrapers.md | Updated documentation: source changed to HelloWatt, offer count increased from 17 to 34, updated pricing mechanism details |
| docs/pages/admin-offers.md | Updated total offer count from ~236 to ~253 and data source description |

Comment on lines +151 to +154
**Prix TTC** (décembre 2025):
- Référence 3 ans BASE: 0.2124€/kWh (3-6 kVA), 0.2109€/kWh (9+ kVA)
- Tranquillité BASE: 0.2612€/kWh (3-6 kVA), 0.2597€/kWh (9+ kVA)

Copilot AI Dec 5, 2025

[nitpick] The documentation states "Prix TTC (décembre 2025)", which implies these are the current prices as of December 2025. However, since the current date is December 5, 2025, the text should clarify whether the prices date from the beginning of December or are expected to be updated later in the month. Consider giving a more specific date (e.g., "début décembre 2025" or the actual date).

return offer_name, offer_type

def _find_next_table(self, header) -> "BeautifulSoup | None":
Copilot AI Dec 5, 2025

The return type annotation uses quoted "BeautifulSoup | None" which suggests this is a forward reference, but BeautifulSoup is already imported at the top of the file (line 6). The quotes are unnecessary and should be removed for consistency with Python typing conventions when the type is already imported.

Suggested change
- def _find_next_table(self, header) -> "BeautifulSoup | None":
+ def _find_next_table(self, header) -> BeautifulSoup | None:

try:
    value = float(match.group(1))
    # Basic sanity check - prices should be reasonable
    if 0 < value < 1000:
Copilot AI Dec 5, 2025

The price validation range 0 < value < 1000 (line 405) is too permissive for energy pricing. Based on the fallback prices in the code, subscription prices range from ~9-55€/month and kWh prices range from ~0.17-0.28€. A value like 500 or 800 would pass this validation but is clearly incorrect for energy pricing. Consider tightening the validation range to something like 0.001 < value < 100 to catch more potential parsing errors.

Suggested change
- if 0 < value < 1000:
+ if 0.001 < value < 100:
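
One way to tighten the check further is to validate each kind of value against its own range instead of using a single range for everything. The following is a minimal sketch, not the PR's code: `PRICE_BOUNDS` and `parse_price` are hypothetical names, and the bounds are assumptions loosely based on the figures quoted in the comment above (subscriptions of roughly 9-55 €/month, kWh prices of roughly 0.17-0.28 €).

```python
from __future__ import annotations

import re

# Hypothetical bounds (assumption): loosely derived from the ranges quoted in
# the review comment, with some headroom for future price changes.
PRICE_BOUNDS = {
    "kwh": (0.05, 1.0),            # EUR per kWh
    "subscription": (1.0, 100.0),  # EUR per month
}

# First decimal number in the text, with either "." or "," as the separator.
_NUMBER = re.compile(r"(\d+(?:[.,]\d+)?)")


def parse_price(text: str, kind: str) -> float | None:
    """Return the first number in `text` if it is plausible for `kind`, else None."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", "."))
    low, high = PRICE_BOUNDS[kind]
    return value if low <= value <= high else None
```

With per-kind bounds, a value such as 500 is rejected for both kinds, and a subscription amount that lands in a kWh column by mistake is rejected as well.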

Comment on lines +214 to +217
elif "elec' car" in text or 'elec car' in text:
    offer_name = "Elec' Car"
    # Elec' Car is always HC/HP type
    offer_type = "HC_HP"
Copilot AI Dec 5, 2025

[nitpick] The offer name detection for "Elec' Car" (lines 214-217) includes a check for both "elec' car" and "elec car" (without apostrophe), but the apostrophe handling seems inconsistent. If the website uses a specific format, consider being more precise about the expected text format. Additionally, the apostrophe character might be a regular apostrophe (') or a typographic apostrophe (’), which could cause matching issues.
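
A possible way to make the match independent of the apostrophe variant is to normalize the text before comparing. This is a sketch under assumptions: `normalize_label` and `is_elec_car` are hypothetical helpers, not functions from the PR, and the set of apostrophe-like characters is a guess at what scraped French text may contain.

```python
# Apostrophe-like characters mapped to the plain ASCII apostrophe (assumed set).
_APOSTROPHES = str.maketrans({
    "\u2019": "'",  # right single quotation mark (typographic apostrophe)
    "\u2018": "'",  # left single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
})


def normalize_label(text: str) -> str:
    """Lowercase, treat any apostrophe as a space, and collapse whitespace."""
    text = text.translate(_APOSTROPHES).lower().replace("'", " ")
    # str.split() with no argument also splits on non-breaking spaces.
    return " ".join(text.split())


def is_elec_car(text: str) -> bool:
    """True if the text mentions "Elec' Car", whatever the apostrophe style."""
    return "elec car" in normalize_label(text)
```

Normalizing once up front keeps the offer-name checks themselves down to a single comparison per offer.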

base_price = self._extract_price(cells[base_idx].get_text())

# If no specific base column, try the last column
if base_price is None and len(cells) > 2:
Copilot AI Dec 5, 2025

In the BASE price extraction fallback (lines 382-383), when no specific base price column is found, the code tries to extract from the last column cells[-1]. However, this same fallback is attempted even when base_idx is not None but the extraction from cells[base_idx] returned None. This means the code will always try cells[-1] as a fallback, which could pick up unrelated data from the last column if the actual base price column exists but couldn't be parsed. Consider only using the cells[-1] fallback when base_idx is None.

Suggested change
- if base_price is None and len(cells) > 2:
+ if base_price is None and base_idx is None and len(cells) > 2:
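
The intended behavior can be illustrated in isolation. In this sketch, `pick_base_price` is a hypothetical helper that works on plain cell strings rather than BeautifulSoup cells, and `extract` stands in for the scraper's `_extract_price`; neither the name nor the signature comes from the PR.

```python
from __future__ import annotations

from typing import Callable


def pick_base_price(
    cells: list[str],
    base_idx: int | None,
    extract: Callable[[str], float | None],
) -> float | None:
    """Pick the BASE price, using the last column only when no base column is known."""
    if base_idx is not None:
        # A base column was identified: trust it, even if parsing fails,
        # rather than picking up unrelated data from another column.
        return extract(cells[base_idx]) if base_idx < len(cells) else None
    # No base column identified: fall back to the last column.
    if len(cells) > 2:
        return extract(cells[-1])
    return None
```

Returning None when the identified column cannot be parsed surfaces the parsing problem instead of hiding it behind a value from the wrong column.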

Comment on lines +273 to +274
# Default to current month
return datetime.now(UTC).replace(day=1, hour=0, minute=0, second=0, microsecond=0)
Copilot AI Dec 5, 2025

The _extract_update_date method returns datetime.now(UTC) as a fallback when the date cannot be parsed (line 274). This means if the scraping succeeds but the date extraction fails, the offers will be marked with the current date rather than indicating uncertainty or using the fallback date (December 2025). This could be misleading - consider either logging a warning and using the fallback date (December 2025) consistent with the fallback offers, or returning None to indicate the date is unknown.

"""Extract the update date from the page"""
# Look for "Mise à jour le X MONTH YEAR" pattern
text = soup.get_text()
date_match = re.search(r'Mise\s+à\s+jour\s+le\s+(\d+)\s+(\w+)\s+(\d{4})', text, re.IGNORECASE)
Copilot AI Dec 5, 2025

[nitpick] The date extraction regex pattern r'Mise\s+à\s+jour\s+le\s+(\d+)\s+(\w+)\s+(\d{4})' (line 258) expects the format "Mise à jour le D MONTH YEAR", but it's unclear if the website consistently uses this exact format. Consider making the pattern more flexible to handle variations like:

  • Different whitespace (multiple spaces, tabs, non-breaking spaces)
  • Optional "le" article
  • Date formats like "Mise à jour: DD/MM/YYYY"

Add logging when the pattern doesn't match to aid debugging.
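
A more tolerant parser along these lines could look as follows. This is a sketch under assumptions, not the PR's implementation: `extract_update_date`, `FRENCH_MONTHS`, and both patterns are hypothetical, and the page formats they accept are guesses. It returns None and logs a warning when nothing matches, which is one of several possible fallback policies.

```python
from __future__ import annotations

import logging
import re
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "fevrier": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8, "aout": 8,
    "septembre": 9, "octobre": 10, "novembre": 11,
    "décembre": 12, "decembre": 12,
}

# In Python 3 str patterns, \s also matches non-breaking spaces.
_PREFIX = r"mise\s+à\s+jour\s*:?\s*(?:le\s+)?"
_TEXT_DATE = re.compile(_PREFIX + r"(\d{1,2})(?:er)?\s+([^\W\d_]+)\s+(\d{4})", re.IGNORECASE)
_NUM_DATE = re.compile(_PREFIX + r"(\d{1,2})/(\d{1,2})/(\d{4})", re.IGNORECASE)


def extract_update_date(text: str) -> datetime | None:
    """Parse "Mise à jour [le] D MONTH YYYY" or "Mise à jour: DD/MM/YYYY"."""
    parts = None  # (year, month, day) once a pattern has matched
    if m := _TEXT_DATE.search(text):
        month = FRENCH_MONTHS.get(m.group(2).lower())
        if month is not None:
            parts = (int(m.group(3)), month, int(m.group(1)))
    elif m := _NUM_DATE.search(text):
        parts = (int(m.group(3)), int(m.group(2)), int(m.group(1)))
    if parts is not None:
        try:
            return datetime(*parts, tzinfo=timezone.utc)
        except ValueError:
            # Impossible calendar date (e.g. day 31 in a 30-day month):
            # fall through to the warning below.
            pass
    logger.warning("Could not parse an update date from the page text")
    return None
```

Returning None leaves the caller free to decide between the fallback date and the current month, and the warning makes a silent format change on the site visible in the logs.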

Comment on lines +222 to +251
"""Find the next table element after a header"""
# Try to find table in next siblings
for sibling in header.next_siblings:
    if hasattr(sibling, 'name'):
        if sibling.name == 'table':
            return sibling
        # If we hit another header, stop looking
        if sibling.name in ['h2', 'h3', 'h4']:
            break
        # Look for table inside divs or other containers
        if sibling.name in ['div', 'section', 'article']:
            table = sibling.find('table')
            if table:
                return table

# Try parent's next siblings
parent = header.parent
if parent:
    for sibling in parent.next_siblings:
        if hasattr(sibling, 'name'):
            if sibling.name == 'table':
                return sibling
            if sibling.name in ['div', 'section', 'article']:
                table = sibling.find('table')
                if table:
                    return table
            # Stop if we hit another header-like element
            if sibling.name in ['h2', 'h3', 'h4']:
                break

Copilot AI Dec 5, 2025

[nitpick] The _find_next_table method has potential performance issues when searching for tables. It first searches through immediate siblings, then through parent's siblings, but doesn't limit the depth of nested container searches. If the page has deeply nested structures or many divs, this could iterate through a large portion of the page. Consider:

  1. Adding a maximum depth limit for nested searches
  2. Adding a maximum sibling search limit
  3. Using BeautifulSoup's find_next method instead of manual sibling traversal for better performance

Suggested change
- """Find the next table element after a header"""
- # Try to find table in next siblings
- for sibling in header.next_siblings:
-     if hasattr(sibling, 'name'):
-         if sibling.name == 'table':
-             return sibling
-         # If we hit another header, stop looking
-         if sibling.name in ['h2', 'h3', 'h4']:
-             break
-         # Look for table inside divs or other containers
-         if sibling.name in ['div', 'section', 'article']:
-             table = sibling.find('table')
-             if table:
-                 return table
- # Try parent's next siblings
- parent = header.parent
- if parent:
-     for sibling in parent.next_siblings:
-         if hasattr(sibling, 'name'):
-             if sibling.name == 'table':
-                 return sibling
-             if sibling.name in ['div', 'section', 'article']:
-                 table = sibling.find('table')
-                 if table:
-                     return table
-             # Stop if we hit another header-like element
-             if sibling.name in ['h2', 'h3', 'h4']:
-                 break
+ """Find the next table element after a header, with search limits for performance"""
+ # Use BeautifulSoup's find_next to efficiently locate the next table
+ # Limit search to 20 elements after the header to avoid performance issues
+ max_search = 20
+ count = 0
+ current = header
+ while count < max_search:
+     current = current.find_next()
+     if current is None:
+         break
+     if hasattr(current, 'name'):
+         if current.name == 'table':
+             return current
+         if current.name in ['h2', 'h3', 'h4']:
+             break
+     count += 1
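
The bounded, document-order walk described in this suggestion can be tried out without any third-party package. The sketch below is a stand-in that uses the standard library's `xml.etree.ElementTree` on well-formed markup instead of BeautifulSoup, so it illustrates the search strategy only; `find_next_table` and `HEADINGS` are hypothetical names, and the limit of 20 mirrors the value in the suggestion.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

HEADINGS = {"h2", "h3", "h4"}


def find_next_table(
    root: ET.Element, header: ET.Element, max_search: int = 20
) -> ET.Element | None:
    """Return the first <table> after `header` in document order, or None.

    The walk stops at the next heading, or after `max_search` elements
    have been examined without finding a table.
    """
    seen_header = False
    examined = 0
    for element in root.iter():  # iter() yields elements in document order
        if element is header:
            seen_header = True
            continue
        if not seen_header:
            continue
        if element.tag == "table":
            return element
        if element.tag in HEADINGS:
            return None  # reached the next section without finding a table
        examined += 1
        if examined >= max_search:
            return None
    return None
```

A fixed element budget is a blunt limit: a table that sits behind a long introduction would be missed, so the value needs to be chosen against the real page structure.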

Comment on lines +308 to +318
elif 'tarif base' in header_lower or ('base' in header_lower and 'tarif' in header_lower):
    base_idx = idx
elif 'tarif' in header_lower and 'base' not in header_lower and hp_idx is None:
    # Generic tarif column - use as base for BASE type
    if offer_type == "BASE":
        base_idx = idx
elif 'hp' in header_lower or 'heures pleines' in header_lower or 'tarif hp' in header_lower:
    hp_idx = idx
elif 'hc' in header_lower or 'heures creuses' in header_lower or 'tarif hc' in header_lower:
    hc_idx = idx

Copilot AI Dec 5, 2025

The column header detection logic on lines 302-317 has overlapping conditions that could lead to incorrect column assignment. Specifically:

  • Line 310: Checks for generic "tarif" when offer_type is "BASE"
  • Line 314: Checks for "hp" or "heures pleines" for HP column
  • Line 316: Checks for "hc" or "heures creuses" for HC column

If a header contains "tarif hp" or "tarif hc", the generic "tarif" check on line 310 will match first (when iterating through headers), potentially assigning it as base_idx before the HP/HC checks. The order of these checks matters. Consider checking for more specific patterns (HP/HC) before generic patterns, or restructure the logic to check all patterns for each header before deciding which column type it is.

Suggested change
- elif 'tarif base' in header_lower or ('base' in header_lower and 'tarif' in header_lower):
-     base_idx = idx
- elif 'tarif' in header_lower and 'base' not in header_lower and hp_idx is None:
-     # Generic tarif column - use as base for BASE type
-     if offer_type == "BASE":
-         base_idx = idx
- elif 'hp' in header_lower or 'heures pleines' in header_lower or 'tarif hp' in header_lower:
-     hp_idx = idx
- elif 'hc' in header_lower or 'heures creuses' in header_lower or 'tarif hc' in header_lower:
-     hc_idx = idx
+ # Check for HP/HC columns first to avoid misassignment
+ elif 'hp' in header_lower or 'heures pleines' in header_lower or 'tarif hp' in header_lower:
+     hp_idx = idx
+ elif 'hc' in header_lower or 'heures creuses' in header_lower or 'tarif hc' in header_lower:
+     hc_idx = idx
+ elif 'tarif base' in header_lower or ('base' in header_lower and 'tarif' in header_lower):
+     base_idx = idx
+ elif 'tarif' in header_lower and 'base' not in header_lower and hp_idx is None and hc_idx is None:
+     # Generic tarif column - use as base for BASE type
+     if offer_type == "BASE":
+         base_idx = idx
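
The "specific patterns first" idea can also be expressed by classifying each header on its own and only then recording its index. This is a sketch, not the PR's code: `classify_header` and `map_columns` are hypothetical helpers, the header strings in the usage example are assumed, and the word-boundary match for "hp"/"hc" is a deliberate change from the substring checks above.

```python
from __future__ import annotations

import re


def classify_header(header: str, offer_type: str) -> str | None:
    """Classify a column header as "hp", "hc", "base", or None (not a price column)."""
    text = header.lower()
    # Most specific patterns first, so "tarif hp" is never treated as generic "tarif".
    if re.search(r"\bhp\b", text) or "heures pleines" in text:
        return "hp"
    if re.search(r"\bhc\b", text) or "heures creuses" in text:
        return "hc"
    if "base" in text and "tarif" in text:
        return "base"
    # A generic "tarif" column only counts as the base price for BASE offers.
    if "tarif" in text and offer_type == "BASE":
        return "base"
    return None


def map_columns(headers: list[str], offer_type: str) -> dict[str, int | None]:
    """Map each price kind to the index of the first header classified as that kind."""
    indices: dict[str, int | None] = {"base": None, "hp": None, "hc": None}
    for idx, header in enumerate(headers):
        kind = classify_header(header, offer_type)
        if kind is not None and indices[kind] is None:
            indices[kind] = idx
    return indices


print(map_columns(["Puissance", "Abonnement", "Tarif HP", "Tarif HC"], "HC_HP"))
```

Because each header is classified independently, the result no longer depends on which index variables happen to be set earlier in the loop.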

    # Basic sanity check - prices should be reasonable
    if 0 < value < 1000:
        return value
except ValueError:
Copilot AI Dec 5, 2025

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
- except ValueError:
+ except ValueError:
+     # If conversion fails, the price string is invalid; return None.

Clément VALENTIN and others added 2 commits December 5, 2025 09:13
Provides step-by-step guidance for adding new energy provider scrapers,
including all required files and configuration changes.

Co-Authored-By: Claude <[email protected]>
Resolved conflict in docs/pages/admin-offers.md - kept main's 9 providers count

Co-Authored-By: Claude <[email protected]>
@m4dm4rtig4n m4dm4rtig4n merged commit d5a6d71 into main Dec 5, 2025
5 checks passed
@m4dm4rtig4n m4dm4rtig4n deleted the engie-hellowatt-scraper branch December 18, 2025 07:58