fix(scraper): implement Priméo Énergie PDF parsing #59
Conversation
The Priméo scraper was always falling back to hardcoded values because _parse_pdf() returned an empty list. This commit implements actual PDF parsing that extracts prices directly from the Priméo tariff sheet:
- Parse concatenated PDF text by splitting on 'kVA' markers
- Extract BASE subscriptions (3-36 kVA) and kWh price (0.1327€ HT)
- Extract HC/HP subscriptions (3-36 kVA) with HP (0.1434€) and HC (0.1147€)
- Update fallback values to match current PDF prices (2025-12-05)
- Add 3 kVA option to HC/HP (Priméo offers it unlike standard TRV)

The scraper now returns 18 offers (9 BASE + 9 HC/HP) from live PDF data.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Pull request overview
This PR implements actual PDF parsing for the Priméo Énergie scraper, replacing the previous stub implementation that always fell back to hardcoded values. The new implementation extracts electricity tariff prices directly from the Priméo Énergie PDF document using regex-based text parsing.
Key Changes:
- Implemented the _parse_pdf() method with actual parsing logic instead of returning an empty list
- Added _extract_base_prices() and _extract_hc_hp_prices() helper methods to parse the BASE and HC/HP tariff tables from the PDF text
- Updated fallback prices to reflect current rates (valid from 04/08/2025), with a 3 kVA option added to HC/HP offers
# Extract the kWh BASE price (HT) - look for 0,1327 pattern
kwh_price = 0.1327  # Default
kwh_matches = re.findall(r"0[,\.]1[23]\d{2}", text)
Copilot
AI
Dec 4, 2025
The regex pattern r"0[,\.]1[23]\d{2}" will match values like "0.13xx" or "0.12xx" but won't match "0.14xx" or "0.11xx". This pattern seems overly restrictive. Based on the context (looking for BASE kWh price around 0.1327), the pattern should be r"0[,\.]1[0-9]\d{2}" to match any value in the 0.10-0.19 range. The current pattern could fail to find the price if the PDF format changes slightly or if the price shifts outside the 0.12-0.13 range.
Suggested change:
- kwh_matches = re.findall(r"0[,\.]1[23]\d{2}", text)
+ kwh_matches = re.findall(r"0[,\.]1[0-9]\d{2}", text)
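The difference between the two patterns can be checked on a small sample. This is a minimal sketch: the `sample` string is made up for illustration (the real PDF text is not shown in this PR) and only reuses the three kWh prices quoted in the PR description.

```python
import re

# Hypothetical sample text, not the actual Priméo PDF content.
sample = "BASE 0,1327 HP 0,1434 HC 0,1147"

# Pattern from the PR: third digit restricted to 2 or 3.
narrow = re.findall(r"0[,\.]1[23]\d{2}", sample)

# Pattern from the suggestion: any value in the 0.10-0.19 range.
wide = re.findall(r"0[,\.]1[0-9]\d{2}", sample)

print(narrow)  # ['0,1327']
print(wide)    # ['0,1327', '0,1434', '0,1147']
```

The wider pattern returns more candidates, so the caller then has to decide which match is the BASE price, for example by position in the text or by a range check.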
# Build the prices dict (exclude 3 kVA if not valid for HC/HP)
for power, subscription in subscription_mapping.items():
    # Standard HC/HP is 6+ kVA, but include 3 if Primeo offers it
    if power >= 3:
        prices[power] = {
            "subscription": subscription,
            "hp": hp_price,
            "hc": hc_price,
        }
Copilot
AI
Dec 4, 2025
The condition if power >= 3 is always true since all values in hchp_powers are >= 3 (the list is [3, 6, 9, 12, 15, 18, 24, 30, 36]). This makes the condition redundant. Either remove the condition, or if there's a specific reason to exclude 3 kVA for HC/HP (as the comment on line 267 suggests might be standard), implement the proper logic to handle this case. The comment on line 311 says "Standard HC/HP is 6+ kVA, but include 3 if Primeo offers it" - this suggests the condition should be checking something else or the logic needs clarification.
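One way to make the intent explicit is to name the threshold and pass it in, so the filter does real work when a supplier does not offer 3 kVA for HC/HP. The sketch below is an assumption about how this could look: `MIN_HCHP_POWER` and `build_hchp_prices` are hypothetical names, not part of the PR.

```python
# Hypothetical constant: lowest power (kVA) for which HC/HP offers are kept.
# Set to 3 because the PR states that Priméo offers HC/HP at 3 kVA.
MIN_HCHP_POWER = 3


def build_hchp_prices(subscription_mapping, hp_price, hc_price,
                      min_power=MIN_HCHP_POWER):
    """Build the HC/HP price dict, skipping powers below min_power."""
    return {
        power: {"subscription": subscription, "hp": hp_price, "hc": hc_price}
        for power, subscription in subscription_mapping.items()
        if power >= min_power
    }


# With the default threshold every power is kept; with 6 the 3 kVA entry is dropped.
mapping = {3: 11.74, 6: 15.47}
all_powers = build_hchp_prices(mapping, 0.1434, 0.1147)
standard_only = build_hchp_prices(mapping, 0.1434, 0.1147, min_power=6)
```

If the threshold is never expected to change, simply deleting the condition is the smaller fix.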
# Fallback to hardcoded values
fallback = {
    3: 11.74,
    6: 15.47,
    9: 19.39,
    12: 23.32,
    15: 27.06,
    18: 30.76,
    24: 38.80,
    30: 46.44,
    36: 54.29,
}
for power in fallback:
    if power not in subscription_mapping:
        subscription_mapping[power] = fallback[power]
Copilot
AI
Dec 4, 2025
The hardcoded fallback subscription prices are duplicated - they appear both in the FALLBACK_PRICES constant at the top of the class (lines 41-49) and here within the _extract_hc_hp_prices method (lines 294-304). This violates the DRY principle and creates a maintenance burden. If prices need to be updated, they must be changed in two places, increasing the risk of inconsistency. Consider reusing self.FALLBACK_PRICES["FIXE_HC_HP"] instead of hardcoding these values again.
Suggested change:
- # Fallback to hardcoded values
- fallback = {
-     3: 11.74,
-     6: 15.47,
-     9: 19.39,
-     12: 23.32,
-     15: 27.06,
-     18: 30.76,
-     24: 38.80,
-     30: 46.44,
-     36: 54.29,
- }
- for power in fallback:
-     if power not in subscription_mapping:
-         subscription_mapping[power] = fallback[power]
+ # Fallback to class-level hardcoded values
+ fallback = self.FALLBACK_PRICES["FIXE_HC_HP"]
+ for power in fallback:
+     if power not in subscription_mapping:
+         subscription_mapping[power] = fallback[power]["subscription"]
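Since the same merge is needed for both BASE and HC/HP, it could live in one small helper. The sketch below assumes the FALLBACK_PRICES entries have the shape implied by the suggestion (power mapped to a dict with a "subscription" key); `fill_missing_subscriptions` is a hypothetical name and the table is truncated to two entries for brevity.

```python
# Assumed shape of the class-level constant, truncated for illustration.
FALLBACK_PRICES = {
    "FIXE_HC_HP": {
        3: {"subscription": 11.74},
        6: {"subscription": 15.47},
    },
}


def fill_missing_subscriptions(subscription_mapping, fallback):
    """Add fallback subscriptions for powers the parser did not find.

    Values already extracted from the PDF are left untouched.
    """
    for power, entry in fallback.items():
        subscription_mapping.setdefault(power, entry["subscription"])
    return subscription_mapping


# The parsed 3 kVA value wins; only the missing 6 kVA entry is filled in.
parsed = {3: 12.00}
merged = fill_missing_subscriptions(parsed, FALLBACK_PRICES["FIXE_HC_HP"])
print(merged)  # {3: 12.0, 6: 15.47}
```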
The PDF text when split by 'kVA' gives parts like:
- Part 1: "8,516 " = price 8.51 for 3 kVA, "6" is start of next power
- Part 2: "11,0711,309 " = price 11.07 for 6 kVA (+ TRV), "9" is next power
Copilot
AI
Dec 4, 2025
The comment describes the format as "8,516 " where "6" is the start of the next power value. This is confusing and suggests the parsing logic might be unreliable. If the trailing digit is actually part of the next value rather than the price itself, the regex r"(\d+[,\.]\d{2})" might not correctly extract the intended price in all cases. The comment should clarify the exact expected format more precisely, and the parsing logic should be validated to ensure it handles this concatenated format correctly.
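The behaviour on the two documented parts can be verified directly. This is a small sketch using the regex quoted in the review and the two example strings from the docstring; `first_price` is a hypothetical helper name.

```python
import re

# Same pattern as in the PR: digits, a comma or dot, then exactly two digits.
PRICE_RE = re.compile(r"(\d+[,\.]\d{2})")


def first_price(part):
    """Return the leading two-decimal price of a part, or None if absent."""
    match = PRICE_RE.match(part)  # re.match only matches at the very start
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))


print(first_price("8,516 "))        # 8.51  (the trailing "6" is the next power)
print(first_price("11,0711,309 "))  # 11.07 (the second price and "9" are ignored)
print(first_price(" 8,516"))        # None  (leading whitespace defeats re.match)
```

The extraction relies on every price having exactly two decimals; a price printed with one or three decimals would be read incorrectly, which supports the request to validate the format.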
kwh_matches = re.findall(r"0[,\.]1[23]\d{2}", text)
for m in kwh_matches:
    val = float(m.replace(",", "."))
    if 0.12 < val < 0.15:
Copilot
AI
Dec 4, 2025
The condition checks if the price is between 0.12 and 0.15, but the expected HP price is 0.1434 and HC price is 0.1147. The HC price (0.1147) would fall outside this range and wouldn't be captured. Additionally, this validation range seems too narrow and could cause the parser to miss valid prices if they change slightly in future PDFs. Consider widening the range to something like 0.10 < val < 0.20 to be more robust.
Suggested change:
- if 0.12 < val < 0.15:
+ if 0.10 < val < 0.20:
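Note that the range check only sees what the regex on the first line lets through, and r"0[,\.]1[23]\d{2}" cannot match 0.11xx or 0.14xx values in the first place. The sketch below illustrates the interaction on a made-up sample string; `kwh_candidates` is a hypothetical helper, not code from the PR.

```python
import re


def kwh_candidates(text, pattern, low, high):
    """Return regex matches converted to floats and filtered to (low, high)."""
    values = [float(m.replace(",", ".")) for m in re.findall(pattern, text)]
    return [v for v in values if low < v < high]


NARROW = r"0[,\.]1[23]\d{2}"   # pattern used in the PR
WIDE = r"0[,\.]1[0-9]\d{2}"    # pattern proposed in an earlier comment

# Hypothetical sample reusing the HP and HC prices from the PR description.
sample = "HP 0,1434 HC 0,1147"

print(kwh_candidates(sample, NARROW, 0.10, 0.20))  # []
print(kwh_candidates(sample, WIDE, 0.12, 0.15))    # [0.1434]
print(kwh_candidates(sample, WIDE, 0.10, 0.20))    # [0.1434, 0.1147]
```

On this sample, both the pattern and the range need widening before the HC price is captured.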
price_match = re.match(r"(\d+[,\.]\d{2})", part)
if price_match:
    price = float(price_match.group(1).replace(",", "."))
    if 5 < price < 45:  # Valid subscription range for BASE
Copilot
AI
Dec 4, 2025
[nitpick] The hardcoded range 5 < price < 45 is a magic number without clear documentation. This range validation could be made more maintainable by defining named constants like MIN_BASE_SUBSCRIPTION and MAX_BASE_SUBSCRIPTION with comments explaining why these bounds were chosen. This would make the code more self-documenting and easier to update if the valid ranges change.
# Find the starting index for BASE section
# BASE section starts after headers, look for part containing "3 "
start_idx = None
for i, part in enumerate(parts):
    if part.strip().endswith("3 ") or part.strip().endswith("3") or "3 " in part[-5:]:
        start_idx = i + 1
        break
Copilot
AI
Dec 4, 2025
The logic for finding the BASE section start index is fragile. The condition checks if a part endswith("3 ") or endswith("3") or has "3 " in the last 5 characters (part[-5:]). However, this could match many unintended locations in the PDF text where "3 " appears. For example, any text mentioning "3" followed by a space could trigger this. A more robust approach would be to look for a more specific marker that uniquely identifies the start of the BASE pricing table, such as a combination of keywords or a more distinctive pattern.
Suggested change:
- # Find the starting index for BASE section
- # BASE section starts after headers, look for part containing "3 "
- start_idx = None
- for i, part in enumerate(parts):
-     if part.strip().endswith("3 ") or part.strip().endswith("3") or "3 " in part[-5:]:
-         start_idx = i + 1
-         break
+ # Find the starting index for BASE section using a more robust marker
+ # Look for a part containing a distinctive marker, e.g., "Abonnement annuel HT" followed by "3 kVA"
+ start_idx = None
+ marker_regex = re.compile(r"(Abonnement annuel HT.*?3\s*kVA)", re.IGNORECASE | re.DOTALL)
+ marker_match = marker_regex.search(text)
+ if marker_match:
+     # Find which part contains the marker
+     marker_text = marker_match.group(1)
+     for i, part in enumerate(parts):
+         if marker_text in part:
+             start_idx = i
+             break
+ else:
+     # Fallback to previous logic if marker not found
+     for i, part in enumerate(parts):
+         if part.strip().endswith("3 ") or part.strip().endswith("3") or "3 " in part[-5:]:
+             start_idx = i + 1
+             break
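One caveat with the suggested snippet: assuming parts comes from text.split("kVA"), as the PR description indicates, no part can contain the substring "kVA", so a marker_text ending in "kVA" is never found in any part and start_idx would stay None. A possible alternative is to derive the index from the match position by counting the separators before the end of the marker. The sketch below is an assumption, not the PR's code: `find_start_index` is a hypothetical name, the header text "Abonnement annuel HT" is taken from the suggestion as an example only, and the sample string is invented.

```python
import re

# Header matched case-insensitively; "kVA" stays case-sensitive so that it
# agrees with the separator used by text.split("kVA").
MARKER_RE = re.compile(r"(?i:Abonnement annuel HT).*?3\s*kVA", re.DOTALL)


def find_start_index(text):
    """Return the index in text.split("kVA") of the part following the marker."""
    match = MARKER_RE.search(text)
    if match is None:
        return None
    # Each "kVA" before the end of the marker closes one part.
    return text[: match.end()].count("kVA")


# Hypothetical concatenated text in the format described by the docstring.
sample = "Offre Fixe Abonnement annuel HT 3 kVA8,516 kVA11,0711,309 kVA"
parts = sample.split("kVA")
idx = find_start_index(sample)
print(parts[idx])  # "8,516 " (the part holding the 3 kVA price)

# The substring search from the suggestion cannot succeed on these parts.
marker_in_some_part = any(MARKER_RE.search(sample).group(0) in p for p in parts)
print(marker_in_some_part)  # False
```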
# Fallback to hardcoded values if extraction failed
fallback = {
    3: 8.51,
    6: 11.07,
    9: 13.79,
    12: 16.51,
    15: 19.07,
    18: 21.60,
    24: 27.18,
    30: 32.45,
    36: 37.88,
}
for power in fallback:
    if power not in subscription_mapping:
        subscription_mapping[power] = fallback[power]
Copilot
AI
Dec 4, 2025
The hardcoded fallback subscription prices are duplicated - they appear both in the FALLBACK_PRICES constant at the top of the class (lines 30-38) and here within the _extract_base_prices method (lines 214-224). This violates the DRY principle and creates a maintenance burden. If prices need to be updated, they must be changed in two places, increasing the risk of inconsistency. Consider reusing self.FALLBACK_PRICES["FIXE_BASE"] instead of hardcoding these values again.
Suggested change:
- # Fallback to hardcoded values if extraction failed
- fallback = {
-     3: 8.51,
-     6: 11.07,
-     9: 13.79,
-     12: 16.51,
-     15: 19.07,
-     18: 21.60,
-     24: 27.18,
-     30: 32.45,
-     36: 37.88,
- }
- for power in fallback:
-     if power not in subscription_mapping:
-         subscription_mapping[power] = fallback[power]
+ # Fallback to class constant values if extraction failed
+ fallback = self.FALLBACK_PRICES["FIXE_BASE"]
+ for power in fallback:
+     if power not in subscription_mapping:
+         subscription_mapping[power] = fallback[power]["subscription"]
price_match = re.match(r"(\d+[,\.]\d{2})", part)
if price_match:
    price = float(price_match.group(1).replace(",", "."))
    if 10 < price < 60:  # Valid subscription range for HC/HP
Copilot
AI
Dec 4, 2025
[nitpick] The hardcoded range 10 < price < 60 is a magic number without clear documentation. This range validation could be made more maintainable by defining named constants like MIN_HCHP_SUBSCRIPTION and MAX_HCHP_SUBSCRIPTION with comments explaining why these bounds were chosen. This would make the code more self-documenting and easier to update if the valid ranges change.
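A sketch of what named bounds could look like for both tariff types. The constant names follow the reviewer's suggestion, the numeric bounds are the ones already used in the PR, and `is_plausible_subscription` is a hypothetical helper; the reasoning in the comments is an assumption based on the fallback tables shown in this review.

```python
# Sanity bounds for parsed subscription prices. They bracket the fallback
# tables (BASE: 8.51 to 37.88, HC/HP: 11.74 to 54.29) with some margin, so a
# value outside them most likely comes from a misparsed PDF fragment.
MIN_BASE_SUBSCRIPTION = 5.0
MAX_BASE_SUBSCRIPTION = 45.0
MIN_HCHP_SUBSCRIPTION = 10.0
MAX_HCHP_SUBSCRIPTION = 60.0


def is_plausible_subscription(price, low, high):
    """Return True when price lies strictly between low and high."""
    return low < price < high


base_ok = is_plausible_subscription(
    8.51, MIN_BASE_SUBSCRIPTION, MAX_BASE_SUBSCRIPTION)
hchp_ok = is_plausible_subscription(
    54.29, MIN_HCHP_SUBSCRIPTION, MAX_HCHP_SUBSCRIPTION)
garbage_ok = is_plausible_subscription(
    110.71, MIN_HCHP_SUBSCRIPTION, MAX_HCHP_SUBSCRIPTION)
```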
if part_idx < len(parts):
    part = parts[part_idx]
    # Extract the first price from this part (Primeo price)
    # Format: "8,516 " -> price is 8,51 (exactly 2 decimals)
Copilot
AI
Dec 4, 2025
[nitpick] The comment says "price is 8,51 (exactly 2 decimals)" but this is misleading. The regex r"(\d+[,\.]\d{2})" matches numbers with exactly 2 decimal places, but 8,51 actually represents 8.51 (with a comma as decimal separator), not "exactly 2 decimals" in a mathematical sense. The comment should be clarified to say something like "Format: '8,51' -> extracts price with 2 decimal digits" to avoid confusion.
Suggested change:
- # Format: "8,516 " -> price is 8,51 (exactly 2 decimals)
+ # Format: "8,516 " -> extracts price string "8,51" (with 2 decimal digits, comma as decimal separator)
Summary
Implement actual PDF parsing for Priméo Énergie scraper instead of always falling back to hardcoded values. The _parse_pdf() method now extracts prices directly from the tariff PDF.
Changes
Result
Scraper now returns 18 offers (9 BASE + 9 HC/HP) from live PDF data without fallback.