[HB] Replace Bremen HTML scraper with INSPIRE shapefile #221
Conversation
Switch data source from HTML scraping to INSPIRE Download Service shapefiles, adding geolocation support for all 253 schools (206 Bremen + 47 Bremerhaven).
Key changes:
- Use official Schulnummer (SNR) from shapefile as stable ID source
- SNR is zero-padded 3-digit format (002, 117) matching official Bremen convention
- Transform coordinates from EPSG:25832 to WGS84 using pyproj
- Implement caching: download ZIP once, extract locally
- Add pyshp dependency for lightweight shapefile reading (no numpy/GDAL)
- Validate all SNRs are unique and in correct format
Data improvements:
- 100% geolocation coverage (was 0%)
- Stable IDs using official Schulnummer field
- 253 schools total (previous HTML scraper missed some)
- Consistent with Bremen's official materials and email aliases
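For illustration, a minimal pyproj sketch of the EPSG:25832 → WGS84 transform mentioned in the commit message above; the variable names and sample coordinates are illustrative, not taken from the PR.

from pyproj import Transformer

# always_xy=True keeps (easting, northing) in and (lon, lat) out
transformer = Transformer.from_crs("EPSG:25832", "EPSG:4326", always_xy=True)

x, y = 486600.0, 5881000.0  # illustrative UTM 32N coordinates roughly in Bremen
lon, lat = transformer.transform(x, y)
print(lat, lon)  # ~53.08, ~8.80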
Enhancements:
- Read shapefiles directly from ZIP using BytesIO (no disk extraction)
- Auto-detect encoding from .cpg file for proper character handling
- Use iterShapeRecords() for memory efficiency
- Validate required fields exist before processing
- Track duplicate SNRs per shapefile with better error messages
- Case-insensitive field mapping for robustness
- Cross-platform path handling with os.path.join()
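A rough sketch of the in-memory reading described in that commit; only the ZIP filename comes from the diff, the member name stem and surrounding code are assumptions.

import io
import zipfile
import shapefile  # pyshp

with zipfile.ZipFile("Schulstandorte_HB_BHV.zip") as zf:
    stem = "schulstandorte_bremen"  # assumed ZIP member name
    cpg_name = f"{stem}.cpg"
    # Auto-detect the DBF text encoding from the .cpg sidecar if present
    encoding = zf.read(cpg_name).decode().strip() if cpg_name in zf.namelist() else "utf-8"
    sf = shapefile.Reader(
        shp=io.BytesIO(zf.read(f"{stem}.shp")),
        shx=io.BytesIO(zf.read(f"{stem}.shx")),
        dbf=io.BytesIO(zf.read(f"{stem}.dbf")),
        encoding=encoding,
    )
    for sr in sf.iterShapeRecords():  # streams records instead of loading everything
        ...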
Raise ValueError instead of logging warnings for:
- Missing required shapefile fields (snr_txt, nam, strasse, plz, ort)
- Invalid SNR format (must be 3 digits)
- Duplicate SNR
- Missing core data (name, address, zip, city)
Prevents silent failures by failing hard with clear error messages.
- Add REQUIRED_MAP to centralize field name mappings
- Aggregate missing field validation to report all issues at once
- Add fail_on_missing_core attribute for configurable failure behavior
- Default to logging errors and continuing (lenient mode)
- Can be set to fail hard by setting fail_on_missing_core=True
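For context, one plausible shape of the REQUIRED_MAP mentioned above, inferred from the field names that appear elsewhere in this PR; the actual dict in the branch may differ.

# Assumed mapping of our schema keys to the lowercased shapefile field names
# (nam, strasse, plz, ort are the required fields named in this PR).
REQUIRED_MAP = {
    "name": "nam",
    "address": "strasse",
    "zip": "plz",
    "city": "ort",
}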
jedeschule/spiders/bremen.py
Outdated
CACHE_DIR = "cache"
CACHE_FILE = os.path.join(CACHE_DIR, "Schulstandorte_HB_BHV.zip")
Since we are only ever writing and never reading the file, I don't think we need this caching logic.
jedeschule/spiders/bremen.py
Outdated
# Build robust field-name map
# sf.fields: [("DeletionFlag","C",1,0), ("NAM","C",80,0), ...]
field_names = [f[0].lower() for f in sf.fields[1:]]  # skip DeletionFlag
required = set(self.REQUIRED_MAP.values())
missing = required.difference(field_names)
if missing:
    raise ValueError(
        f"Missing required fields in {stem}: {missing}. Found: {field_names}"
    )

# Iterate records
seen_ids = set()
for sr in sf.iterShapeRecords():
    rec = dict(zip(field_names, sr.record))

    snr_txt = (rec.get("snr_txt") or "").strip()
    if not re.fullmatch(r"\d{3}", snr_txt):
        raise ValueError(
            f"[{city_name}] Invalid SNR format '{snr_txt}' for {rec.get('nam')} (expected 3 digits)"
        )
    if snr_txt in seen_ids:
        raise ValueError(f"[{city_name}] Duplicate SNR '{snr_txt}'")
    seen_ids.add(snr_txt)

    # Validate core fields (non-silent). Aggregate and act once.
    core = {
        k: (rec.get(v) or "").strip()
        for k, v in self.REQUIRED_MAP.items()
    }
    missing_core = [k for k, val in core.items() if not val]
    if missing_core:
        msg = f"[{city_name}] Missing required fields {missing_core} for SNR '{snr_txt}'"
        if getattr(self, "fail_on_missing_core", False):
            raise CloseSpider(reason=msg)
        self.logger.error(msg)
        continue

    # geometry
    shp = sr.shape
    lat = lon = None
    if shp and shp.points:
        # Expect Point; take first coordinate defensively
        x, y = shp.points[0]
        lon, lat = transformer.transform(x, y)

    yield {
        "snr": snr_txt,
        "name": core["name"],
        "address": core["address"],
        "zip": core["zip"],
        "city": core["city"],
        "district": rec.get("ortsteilna"),
        "school_type": rec.get("schulart_2"),
        "provider": rec.get("traegernam"),
        "latitude": lat,
        "longitude": lon,
    }
I think I would prefer a much simpler approach. Something like this:
field_names = [f.name for f in sf.fields]
for sr in sf.iterShapeRecords():
    rec = dict(zip(field_names, sr.record))
    # geometry
    shp = sr.shape
    if shp and shp.points:
        # Expect Point; take first coordinate defensively
        x, y = shp.points[0]
        rec['lon'], rec['lat'] = transformer.transform(x, y)
    yield rec
Our goal generally is to have the parse function return the data (which is going to be stored as raw in the database later) as close to the original as possible. It is then the job of the normalize method to turn the field names into the common names that we picked for our model of the data.
My preferred way would hence be to return the data as untouched as possible and to then do the mapping from their short names to our schema names (as you already do in REQUIRED_MAP) in the normalize function.
This way we also don't need to bake expectations about the shape of the data into the parse function yet. It will continue to work even if the schema changes, and we will only have to adapt the normalize function later to match the field names again.
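A hypothetical sketch of that split, purely for illustration (the repo's actual normalize() signature and model are not shown in this thread): parse() yields the raw shapefile record untouched, and normalize() maps the source field names to the schema names.

# Illustrative only: raw DBF field names (uppercase, as stored in the shapefile) -> schema names
RAW_TO_SCHEMA = {
    "NAM": "name",
    "STRASSE": "address",
    "PLZ": "zip",
    "ORT": "city",
    "SNR_TXT": "id",
}

def normalize(raw: dict) -> dict:
    normalized = {schema: raw.get(source) for source, schema in RAW_TO_SCHEMA.items()}
    normalized["latitude"] = raw.get("lat")   # set in parse() from the point geometry
    normalized["longitude"] = raw.get("lon")
    return normalized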
Resolve uv.lock conflict by regenerating lock file
- Remove unnecessary file caching and SHA256 computation
- Simplify parse() to return raw shapefile data
- Move field mapping and validation to normalize()
- Remove unused imports (os, hashlib, CloseSpider)
This follows the architecture pattern where parse() keeps data as close to the source as possible, and normalize() handles the transformation to our standard schema.
k-nut left a comment
I wasn't sure if we should merge this or not since we are losing some information (e.g. phone and websites) but since our new strategy is to prioritize geodata over other information, I think that this is good.
We could consider re-using the old scraper as well (since both sources use the same ids) and make a new parse_details that fetches the details from https://www.bildung.bremen.de/-8954?Sid=<school_id>. Let me know what you prefer.
In any case, please re-base on main and fix the conflicts in the lockfile :)
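For what the parse_details idea above could look like, a purely illustrative Scrapy sketch; only the URL pattern comes from the comment, while the class, names, and wiring are assumptions.

import scrapy

class BremenDetailsSketch(scrapy.Spider):
    name = "bremen_details_sketch"  # illustrative, not the real spider name

    def request_details(self, school_id: str) -> scrapy.Request:
        # Follow up on a school from the shapefile using the shared id
        return scrapy.Request(
            f"https://www.bildung.bremen.de/-8954?Sid={school_id}",
            callback=self.parse_details,
            cb_kwargs={"school_id": school_id},
        )

    def parse_details(self, response, school_id):
        # Pull the extra attributes (e.g. phone, website) from the detail page here
        ...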
# Bremen shapefile cache
cache/
This can be removed again now
Replaces the Bremen HTML scraper with a more robust INSPIRE shapefile-based implementation that includes geolocation support.
Summary
Data improvements
Data quality validation
fail_on_missing_core attribute
Dependencies
New dependency:
pyshp>=3.0.2.post1 - Python shapefile library for reading INSPIRE shapefiles
Test plan