[HB] Replace Bremen HTML scraper with INSPIRE shapefile #221
Conversation
Switch data source from HTML scraping to INSPIRE Download Service shapefiles, adding geolocation support for all 253 schools (206 Bremen + 47 Bremerhaven).
Key changes:
- Use official Schulnummer (SNR) from shapefile as stable ID source
- SNR is zero-padded 3-digit format (002, 117) matching official Bremen convention
- Transform coordinates from EPSG:25832 to WGS84 using pyproj
- Implement caching: download ZIP once, extract locally
- Add pyshp dependency for lightweight shapefile reading (no numpy/GDAL)
- Validate all SNRs are unique and in correct format
Data improvements:
- 100% geolocation coverage (was 0%)
- Stable IDs using official Schulnummer field
- 253 schools total (previous HTML scraper missed some)
- Consistent with Bremen's official materials and email aliases
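For illustration, a minimal pyproj sketch of the EPSG:25832 → WGS84 transform mentioned in the commit message above; the variable names and sample coordinates are illustrative, not taken from the PR.

from pyproj import Transformer

# always_xy=True keeps (easting, northing) in and (lon, lat) out
transformer = Transformer.from_crs("EPSG:25832", "EPSG:4326", always_xy=True)

x, y = 486600.0, 5881000.0  # illustrative UTM 32N coordinates roughly in Bremen
lon, lat = transformer.transform(x, y)
print(lat, lon)  # ~53.08, ~8.80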
Enhancements:
- Read shapefiles directly from ZIP using BytesIO (no disk extraction)
- Auto-detect encoding from .cpg file for proper character handling
- Use iterShapeRecords() for memory efficiency
- Validate required fields exist before processing
- Track duplicate SNRs per shapefile with better error messages
- Case-insensitive field mapping for robustness
- Cross-platform path handling with os.path.join()
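A rough sketch of the in-memory reading described in that commit; only the ZIP filename comes from the diff, the member name stem and surrounding code are assumptions.

import io
import zipfile
import shapefile  # pyshp

with zipfile.ZipFile("Schulstandorte_HB_BHV.zip") as zf:
    stem = "schulstandorte_bremen"  # assumed ZIP member name
    cpg_name = f"{stem}.cpg"
    # Auto-detect the DBF text encoding from the .cpg sidecar if present
    encoding = zf.read(cpg_name).decode().strip() if cpg_name in zf.namelist() else "utf-8"
    sf = shapefile.Reader(
        shp=io.BytesIO(zf.read(f"{stem}.shp")),
        shx=io.BytesIO(zf.read(f"{stem}.shx")),
        dbf=io.BytesIO(zf.read(f"{stem}.dbf")),
        encoding=encoding,
    )
    for sr in sf.iterShapeRecords():  # streams records instead of loading everything
        ...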
Raise ValueError instead of logging warnings for:
- Missing required shapefile fields (snr_txt, nam, strasse, plz, ort)
- Invalid SNR format (must be 3 digits)
- Duplicate SNR
- Missing core data (name, address, zip, city)
Prevents silent failures by failing hard with clear error messages.
- Add REQUIRED_MAP to centralize field name mappings
- Aggregate missing field validation to report all issues at once
- Add fail_on_missing_core attribute for configurable failure behavior
- Default to logging errors and continuing (lenient mode)
- Can be set to fail hard by setting fail_on_missing_core=True
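For context, one plausible shape of the REQUIRED_MAP mentioned above, inferred from the field names that appear elsewhere in this PR; the actual dict in the branch may differ.

# Assumed mapping of our schema keys to the lowercased shapefile field names
# (nam, strasse, plz, ort are the required fields named in this PR).
REQUIRED_MAP = {
    "name": "nam",
    "address": "strasse",
    "zip": "plz",
    "city": "ort",
}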
jedeschule/spiders/bremen.py
Outdated
CACHE_DIR = "cache"
CACHE_FILE = os.path.join(CACHE_DIR, "Schulstandorte_HB_BHV.zip")
Since we are only ever writing and never reading the file, I don't think we need this caching logic.
jedeschule/spiders/bremen.py
Outdated
# Build robust field-name map
# sf.fields: [("DeletionFlag","C",1,0), ("NAM","C",80,0), ...]
field_names = [f[0].lower() for f in sf.fields[1:]]  # skip DeletionFlag
required = set(self.REQUIRED_MAP.values())
missing = required.difference(field_names)
if missing:
    raise ValueError(
        f"Missing required fields in {stem}: {missing}. Found: {field_names}"
    )

# Iterate records
seen_ids = set()
for sr in sf.iterShapeRecords():
    rec = dict(zip(field_names, sr.record))

    snr_txt = (rec.get("snr_txt") or "").strip()
    if not re.fullmatch(r"\d{3}", snr_txt):
        raise ValueError(
            f"[{city_name}] Invalid SNR format '{snr_txt}' for {rec.get('nam')} (expected 3 digits)"
        )
    if snr_txt in seen_ids:
        raise ValueError(f"[{city_name}] Duplicate SNR '{snr_txt}'")
    seen_ids.add(snr_txt)

    # Validate core fields (non-silent). Aggregate and act once.
    core = {
        k: (rec.get(v) or "").strip()
        for k, v in self.REQUIRED_MAP.items()
    }
    missing_core = [k for k, val in core.items() if not val]
    if missing_core:
        msg = f"[{city_name}] Missing required fields {missing_core} for SNR '{snr_txt}'"
        if getattr(self, "fail_on_missing_core", False):
            raise CloseSpider(reason=msg)
        self.logger.error(msg)
        continue

    # geometry
    shp = sr.shape
    lat = lon = None
    if shp and shp.points:
        # Expect Point; take first coordinate defensively
        x, y = shp.points[0]
        lon, lat = transformer.transform(x, y)

    yield {
        "snr": snr_txt,
        "name": core["name"],
        "address": core["address"],
        "zip": core["zip"],
        "city": core["city"],
        "district": rec.get("ortsteilna"),
        "school_type": rec.get("schulart_2"),
        "provider": rec.get("traegernam"),
        "latitude": lat,
        "longitude": lon,
    }
I think I would prefer a much simpler approach. Something like this:
field_names = [f.name for f in sf.fields]
for sr in sf.iterShapeRecords():
    rec = dict(zip(field_names, sr.record))
    # geometry
    shp = sr.shape
    if shp and shp.points:
        # Expect Point; take first coordinate defensively
        x, y = shp.points[0]
        rec['lon'], rec['lat'] = transformer.transform(x, y)
    yield rec
Our goal generally is to have the parse function return the data (which is going to be stored as raw in the database later) as close to the original as possible. It is then the job of the normalize method to turn the field names into the common names that we picked for our model of the data.
My preferred way would hence be to return the data as untouched as possible and to then do the mapping from their short names to our schema names (as you already do in REQUIRED_MAP) in the normalize function.
This way we also don't need to bake expectations about the shape of the data into the parse function yet. It will continue to work even if the schema changes, and we will only have to adapt the normalize function later to match the field names again.
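A hypothetical sketch of that split, purely for illustration (the repo's actual normalize() signature and model are not shown in this thread): parse() yields the raw shapefile record untouched, and normalize() maps the source field names to the schema names.

# Illustrative only: raw DBF field names (uppercase, as stored in the shapefile) -> schema names
RAW_TO_SCHEMA = {
    "NAM": "name",
    "STRASSE": "address",
    "PLZ": "zip",
    "ORT": "city",
    "SNR_TXT": "id",
}

def normalize(raw: dict) -> dict:
    normalized = {schema: raw.get(source) for source, schema in RAW_TO_SCHEMA.items()}
    normalized["latitude"] = raw.get("lat")   # set in parse() from the point geometry
    normalized["longitude"] = raw.get("lon")
    return normalized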
Resolve uv.lock conflict by regenerating lock file
- Remove unnecessary file caching and SHA256 computation
- Simplify parse() to return raw shapefile data
- Move field mapping and validation to normalize()
- Remove unused imports (os, hashlib, CloseSpider)
This follows the architecture pattern where parse() keeps data as close to the source as possible, and normalize() handles the transformation to our standard schema.
k-nut left a comment
I wasn't sure if we should merge this or not since we are losing some information (e.g. phone and websites) but since our new strategy is to prioritize geodata over other information, I think that this is good.
We could consider re-using the old scraper as well (since both sources use the same ids) and make a new parse_details that fetches the details from https://www.bildung.bremen.de/-8954?Sid=<school_id>. Let me know what you prefer.
In any case, please re-base on main and fix the conflicts in the lockfile :)
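For what the parse_details idea above could look like, a purely illustrative Scrapy sketch; only the URL pattern comes from the comment, while the class, names, and wiring are assumptions.

import scrapy

class BremenDetailsSketch(scrapy.Spider):
    name = "bremen_details_sketch"  # illustrative, not the real spider name

    def request_details(self, school_id: str) -> scrapy.Request:
        # Follow up on a school from the shapefile using the shared id
        return scrapy.Request(
            f"https://www.bildung.bremen.de/-8954?Sid={school_id}",
            callback=self.parse_details,
            cb_kwargs={"school_id": school_id},
        )

    def parse_details(self, response, school_id):
        # Pull the extra attributes (e.g. phone, website) from the detail page here
        ...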
# Bremen shapefile cache
cache/
This can be removed again now
Replaces the Bremen HTML scraper with a more robust INSPIRE shapefile-based implementation that includes geolocation support.
Summary
Data improvements
Data quality validation
fail_on_missing_core attribute
Dependencies
New dependency:
pyshp>=3.0.2.post1 - Python shapefile library for reading INSPIRE shapefiles
Test plan