feat(engie-scraper): switch from PDF to HelloWatt web scraping #67
- Replace Engie PDF scraper with HelloWatt comparison site scraping
- Add HTML parsing with BeautifulSoup for pricing tables
- Support 34 Engie offers: Référence 3 ans + Tranquillité (BASE + HC/HP)
- Update fallback pricing data to December 2025
- Fix AdminOffers.tsx: defensive array check to prevent crashes after logout

🤖 Generated with Claude Code
Co-Authored-By: Claude <[email protected]>
Change label from "Elec Référence 1 an (PDF officiel)" to "Tarifs Engie (HelloWatt)"
Pull request overview
This PR refactors the Engie scraper from PDF parsing to HTML web scraping via HelloWatt, increasing coverage from 17 to 34 offers. It also adds a defensive fix in AdminOffers.tsx to prevent crashes from corrupted cache data, and updates documentation to reflect the new scraping approach and pricing data as of December 2025.
Key Changes
- Engie scraper refactored: Replaced PDF extraction with BeautifulSoup HTML parsing from HelloWatt comparison site
- Offer coverage doubled: Now scrapes 34 offers (Référence 3 ans and Tranquillité, both in BASE and HC/HP variants) vs. previous 17 offers
- Frontend crash fix: Added defensive `Array.isArray()` check in AdminOffers.tsx to handle corrupted cache scenarios
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| apps/api/src/services/price_scrapers/engie_scraper.py | Complete rewrite: removed PDF parsing, added HTML scraping with BeautifulSoup, new table/header parsing methods, updated fallback prices to December 2025 |
| apps/web/src/pages/AdminOffers.tsx | Added defensive array check to prevent crashes when offersData is corrupted or undefined |
| docs/features-spec/energy-providers-scrapers.md | Updated documentation: source changed to HelloWatt, offer count increased from 17 to 34, updated pricing mechanism details |
| docs/pages/admin-offers.md | Updated total offer count from ~236 to ~253 and data source description |

    **Prix TTC** (décembre 2025):
    - Référence 3 ans BASE: 0.2124€/kWh (3-6 kVA), 0.2109€/kWh (9+ kVA)
    - Tranquillité BASE: 0.2612€/kWh (3-6 kVA), 0.2597€/kWh (9+ kVA)

Copilot
AI
Dec 5, 2025
[nitpick] The documentation states "Prix TTC (décembre 2025)" which implies these are current prices as of December 2025. However, since the current date is December 5, 2025, this should be clarified whether the prices are from the beginning of December or expected to be updated later in December. Consider being more specific with the date (e.g., "début décembre 2025" or the actual date).

        return offer_name, offer_type

    def _find_next_table(self, header) -> "BeautifulSoup | None":

Copilot
AI
Dec 5, 2025
The return type annotation uses quoted "BeautifulSoup | None" which suggests this is a forward reference, but BeautifulSoup is already imported at the top of the file (line 6). The quotes are unnecessary and should be removed for consistency with Python typing conventions when the type is already imported.
Suggested change:

    - def _find_next_table(self, header) -> "BeautifulSoup | None":
    + def _find_next_table(self, header) -> BeautifulSoup | None:


    try:
        value = float(match.group(1))
        # Basic sanity check - prices should be reasonable
        if 0 < value < 1000:

Copilot
AI
Dec 5, 2025
The price validation range 0 < value < 1000 (line 405) is too permissive for energy pricing. Based on the fallback prices in the code, subscription prices range from ~9-55€/month and kWh prices range from ~0.17-0.28€. A value like 500 or 800 would pass this validation but is clearly incorrect for energy pricing. Consider tightening the validation range to something like 0.001 < value < 100 to catch more potential parsing errors.
Suggested change:

    - if 0 < value < 1000:
    + if 0.001 < value < 100:

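A single range still has to cover both kWh prices and monthly subscription prices, so it cannot be very tight. One possible refinement, sketched below, validates each kind of price against its own range. The function name `extract_price`, the `kind` parameter, and the bounds are illustrative assumptions (the bounds are loosely derived from the ranges quoted in the comment above); none of this is taken from the PR's code.

```python
import re
from typing import Optional

# Hypothetical bounds, loosely based on the ranges quoted in the review
# comment (kWh ~0.17-0.28 EUR, subscription ~9-55 EUR/month).
PRICE_BOUNDS = {
    "kwh": (0.05, 1.0),
    "subscription": (1.0, 100.0),
}

# First number in the text, with either "." or "," as decimal separator.
_NUMBER_RE = re.compile(r"(\d+(?:[.,]\d+)?)")


def extract_price(text: str, kind: str) -> Optional[float]:
    """Return the first number in `text` if it is plausible for `kind`, else None."""
    match = _NUMBER_RE.search(text)
    if match is None:
        return None
    try:
        value = float(match.group(1).replace(",", "."))
    except ValueError:
        # The matched string is not a valid number; treat it as "no price".
        return None
    low, high = PRICE_BOUNDS[kind]
    return value if low <= value <= high else None
```

With these assumed bounds, a value such as 500 is rejected for both kinds, and a kWh price accidentally read from a subscription column is rejected as well.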

    elif "elec' car" in text or 'elec car' in text:
        offer_name = "Elec' Car"
        # Elec' Car is always HC/HP type
        offer_type = "HC_HP"

Copilot
AI
Dec 5, 2025
[nitpick] The offer name detection for "Elec' Car" (lines 214-217) checks for both "elec' car" and "elec car" (without apostrophe), but the apostrophe handling seems inconsistent. If the website uses a specific format, consider being more precise about the expected text format. Additionally, the apostrophe character might be a regular apostrophe (') or a typographic apostrophe (’), which could cause matching issues.
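One way to make the match independent of the apostrophe variant is to normalize the text before comparing. The sketch below is an illustration only: the helper names `normalize_offer_text` and `is_elec_car` are hypothetical, and the set of apostrophe-like characters is an assumption about what the page might contain.

```python
# Apostrophe-like characters that may appear in scraped text (assumed set):
# right single quote, left single quote, modifier letter apostrophe.
_APOSTROPHES = "\u2019\u2018\u02bc"
_TO_ASCII_APOSTROPHE = str.maketrans({ch: "'" for ch in _APOSTROPHES})


def normalize_offer_text(text: str) -> str:
    """Lowercase, unify apostrophes, and collapse whitespace."""
    text = text.translate(_TO_ASCII_APOSTROPHE)
    # str.split() with no argument also splits on non-breaking spaces.
    return " ".join(text.lower().split())


def is_elec_car(text: str) -> bool:
    """True if the text mentions the Elec' Car offer, with or without apostrophe."""
    normalized = normalize_offer_text(text)
    return "elec' car" in normalized or "elec car" in normalized
```

After normalization, the two existing substring checks cover the typographic variants without adding more branches.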

    base_price = self._extract_price(cells[base_idx].get_text())

    # If no specific base column, try the last column
    if base_price is None and len(cells) > 2:

Copilot
AI
Dec 5, 2025
In the BASE price extraction fallback (lines 382-383), when no specific base price column is found, the code tries to extract from the last column cells[-1]. However, this same fallback is attempted even when base_idx is not None but the extraction from cells[base_idx] returned None. This means the code will always try cells[-1] as a fallback, which could pick up unrelated data from the last column if the actual base price column exists but couldn't be parsed. Consider only using the cells[-1] fallback when base_idx is None.
Suggested change:

    - if base_price is None and len(cells) > 2:
    + if base_price is None and base_idx is None and len(cells) > 2:

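The intended behaviour can be shown in isolation. This is a minimal sketch, assuming cells are already plain strings; `pick_base_price` and `parse_float` are hypothetical helpers standing in for the scraper's own cell handling and `_extract_price`.

```python
from typing import Callable, Optional, Sequence


def parse_float(text: str) -> Optional[float]:
    """Toy parser used for the example: returns None when text is not a number."""
    try:
        return float(text.replace(",", "."))
    except ValueError:
        return None


def pick_base_price(
    cells: Sequence[str],
    base_idx: Optional[int],
    parse: Callable[[str], Optional[float]],
) -> Optional[float]:
    """Use the identified base column if there is one; otherwise try the last column."""
    if base_idx is not None:
        # A base column was identified: trust it, even if parsing fails,
        # rather than silently reading an unrelated column.
        return parse(cells[base_idx]) if base_idx < len(cells) else None
    # No base column identified: fall back to the last column of wide rows.
    if len(cells) > 2:
        return parse(cells[-1])
    return None
```

For a row like `["6 kVA", "n/a", "0.30"]` with `base_idx=1`, this returns `None` instead of picking up `0.30` from the last column.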

    # Default to current month
    return datetime.now(UTC).replace(day=1, hour=0, minute=0, second=0, microsecond=0)

Copilot
AI
Dec 5, 2025
The _extract_update_date method returns datetime.now(UTC) as a fallback when the date cannot be parsed (line 274). This means if the scraping succeeds but the date extraction fails, the offers will be marked with the current date rather than indicating uncertainty or using the fallback date (December 2025). This could be misleading - consider either logging a warning and using the fallback date (December 2025) consistent with the fallback offers, or returning None to indicate the date is unknown.

    """Extract the update date from the page"""
    # Look for "Mise à jour le X MONTH YEAR" pattern
    text = soup.get_text()
    date_match = re.search(r'Mise\s+à\s+jour\s+le\s+(\d+)\s+(\w+)\s+(\d{4})', text, re.IGNORECASE)

Copilot
AI
Dec 5, 2025
[nitpick] The date extraction regex pattern r'Mise\s+à\s+jour\s+le\s+(\d+)\s+(\w+)\s+(\d{4})' (line 258) expects the format "Mise à jour le D MONTH YEAR", but it's unclear if the website consistently uses this exact format. Consider making the pattern more flexible to handle variations like:
- Different whitespace (multiple spaces, tabs, non-breaking spaces)
- Optional "le" article
- Date formats like "Mise à jour: DD/MM/YYYY"
Add logging when the pattern doesn't match to aid debugging.
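A more tolerant extraction could look like the sketch below. It is an assumption-laden illustration, not the scraper's code: the function name `extract_update_date`, the month table, and the accepted formats are all hypothetical, and it returns `None` (after logging a warning) instead of defaulting to the current month.

```python
import logging
import re
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger(__name__)

# French month names, with unaccented variants as a precaution.
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "fevrier": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8, "aout": 8,
    "septembre": 9, "octobre": 10, "novembre": 11,
    "décembre": 12, "decembre": 12,
}

_PREFIX = r"mise\s+[àa]\s+jour\s*:?\s*(?:le\s+)?"
# "Mise à jour: 05/12/2025"
_NUMERIC_DATE_RE = re.compile(_PREFIX + r"(\d{1,2})/(\d{1,2})/(\d{4})", re.IGNORECASE)
# "Mise à jour [le] 5 décembre 2025", also "1er août 2025"
_TEXT_DATE_RE = re.compile(_PREFIX + r"(\d{1,2})(?:er)?\s+(\w+)\s+(\d{4})", re.IGNORECASE)


def extract_update_date(text: str) -> Optional[datetime]:
    """Return the page's update date, or None (with a warning) if none is found."""
    # Collapse tabs, newlines, and non-breaking spaces into single spaces.
    text = " ".join(text.split())
    try:
        match = _NUMERIC_DATE_RE.search(text)
        if match:
            day, month, year = (int(g) for g in match.groups())
            return datetime(year, month, day, tzinfo=timezone.utc)
        match = _TEXT_DATE_RE.search(text)
        if match:
            month = FRENCH_MONTHS.get(match.group(2).lower())
            if month is not None:
                return datetime(int(match.group(3)), month, int(match.group(1)),
                                tzinfo=timezone.utc)
    except ValueError:
        # Out-of-range day or month (e.g. "32/13/2025"): fall through to the warning.
        pass
    logger.warning("Could not extract update date from page text")
    return None
```

Returning `None` leaves the caller free to decide between the fallback date and the current month, and the warning makes a silent format change on the site visible in the logs.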

    """Find the next table element after a header"""
    # Try to find table in next siblings
    for sibling in header.next_siblings:
        if hasattr(sibling, 'name'):
            if sibling.name == 'table':
                return sibling
            # If we hit another header, stop looking
            if sibling.name in ['h2', 'h3', 'h4']:
                break
            # Look for table inside divs or other containers
            if sibling.name in ['div', 'section', 'article']:
                table = sibling.find('table')
                if table:
                    return table

    # Try parent's next siblings
    parent = header.parent
    if parent:
        for sibling in parent.next_siblings:
            if hasattr(sibling, 'name'):
                if sibling.name == 'table':
                    return sibling
                if sibling.name in ['div', 'section', 'article']:
                    table = sibling.find('table')
                    if table:
                        return table
                # Stop if we hit another header-like element
                if sibling.name in ['h2', 'h3', 'h4']:
                    break

Copilot
AI
Dec 5, 2025
[nitpick] The _find_next_table method has potential performance issues when searching for tables. It first searches through immediate siblings, then through parent's siblings, but doesn't limit the depth of nested container searches. If the page has deeply nested structures or many divs, this could iterate through a large portion of the page. Consider:
- Adding a maximum depth limit for nested searches
- Adding a maximum sibling search limit
- Using BeautifulSoup's `find_next` method instead of manual sibling traversal for better performance
Suggested change (replaces the method body shown above):

    """Find the next table element after a header, with search limits for performance"""
    # Use BeautifulSoup's find_next to efficiently locate the next table
    # Limit search to 20 elements after the header to avoid performance issues
    max_search = 20
    count = 0
    current = header
    while count < max_search:
        current = current.find_next()
        if current is None:
            break
        if hasattr(current, 'name'):
            if current.name == 'table':
                return current
            if current.name in ['h2', 'h3', 'h4']:
                break
        count += 1


    elif 'tarif base' in header_lower or ('base' in header_lower and 'tarif' in header_lower):
        base_idx = idx
    elif 'tarif' in header_lower and 'base' not in header_lower and hp_idx is None:
        # Generic tarif column - use as base for BASE type
        if offer_type == "BASE":
            base_idx = idx
    elif 'hp' in header_lower or 'heures pleines' in header_lower or 'tarif hp' in header_lower:
        hp_idx = idx
    elif 'hc' in header_lower or 'heures creuses' in header_lower or 'tarif hc' in header_lower:
        hc_idx = idx

Copilot
AI
Dec 5, 2025
The column header detection logic on lines 302-317 has overlapping conditions that could lead to incorrect column assignment. Specifically:
- Line 310: Checks for generic "tarif" when offer_type is "BASE"
- Line 314: Checks for "hp" or "heures pleines" for HP column
- Line 316: Checks for "hc" or "heures creuses" for HC column
If a header contains "tarif hp" or "tarif hc", the generic "tarif" check on line 310 will match first (when iterating through headers), potentially assigning it as base_idx before the HP/HC checks. The order of these checks matters. Consider checking for more specific patterns (HP/HC) before generic patterns, or restructure the logic to check all patterns for each header before deciding which column type it is.
Suggested change (replaces the branches shown above):

    # Check for HP/HC columns first to avoid misassignment
    elif 'hp' in header_lower or 'heures pleines' in header_lower or 'tarif hp' in header_lower:
        hp_idx = idx
    elif 'hc' in header_lower or 'heures creuses' in header_lower or 'tarif hc' in header_lower:
        hc_idx = idx
    elif 'tarif base' in header_lower or ('base' in header_lower and 'tarif' in header_lower):
        base_idx = idx
    elif 'tarif' in header_lower and 'base' not in header_lower and hp_idx is None and hc_idx is None:
        # Generic tarif column - use as base for BASE type
        if offer_type == "BASE":
            base_idx = idx

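The "specific before generic" ordering can also be expressed as a small classifier that decides the column type for one header at a time. The sketch below is illustrative only: `classify_header` is a hypothetical helper, and matching `hp`/`hc` as whole words is an added assumption meant to avoid accidental substring hits.

```python
import re
from typing import Optional


def classify_header(header: str, offer_type: str) -> Optional[str]:
    """Return "hp", "hc", "base", or None for a single table header."""
    h = " ".join(header.lower().split())
    # Most specific patterns first: peak / off-peak columns.
    if "heures pleines" in h or re.search(r"\bhp\b", h):
        return "hp"
    if "heures creuses" in h or re.search(r"\bhc\b", h):
        return "hc"
    # Explicit base tariff column.
    if "base" in h and "tarif" in h:
        return "base"
    # Generic "tarif" column: only meaningful for BASE offers.
    if "tarif" in h and offer_type == "BASE":
        return "base"
    return None
```

Because each header is classified on its own, the result no longer depends on which columns were seen earlier in the loop.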

        # Basic sanity check - prices should be reasonable
        if 0 < value < 1000:
            return value
    except ValueError:

Copilot
AI
Dec 5, 2025
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

    - except ValueError:
    + except ValueError:
    +     # If conversion fails, the price string is invalid; return None.

Provides step-by-step guidance for adding new energy provider scrapers, including all required files and configuration changes.
Resolved conflict in docs/pages/admin-offers.md - kept main's 9 providers count
Summary
Replaces Engie PDF scraper with HelloWatt comparison site scraping to get more accurate, up-to-date pricing. Now supports 34 Engie offers (Référence 3 ans and Tranquillité in both BASE and HC/HP options). Also fixes a React crash in AdminOffers.tsx that occurred after logout due to corrupted cache data.
Changes
Testing
Visit `/admin/offers` and click "Prévisualiser" (Preview) on the Engie provider to verify all 34 offers are detected correctly.