
Conversation

tifa365 (Contributor) commented on Oct 25, 2025

Replaces the Bremen HTML scraper with a more robust INSPIRE shapefile-based implementation that includes geolocation support.

Summary

  • Switches from HTML scraping to the INSPIRE shapefile (official geodata source)
  • Adds geolocation support (100% coverage for all schools)
  • Uses official SNR (Schulnummer) identifiers from the shapefile data
  • Reads the ZIP file directly without extraction (see the sketch below)
  • Automatic encoding detection from .cpg files
  • SHA256 checksum verification for data integrity

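For illustration, a minimal sketch (not the PR's actual code) of the in-memory ZIP reading, .cpg encoding detection, and checksum verification described above; the file names and expected hash are placeholders:

import hashlib
import io
import zipfile

import shapefile  # pyshp

EXPECTED_SHA256 = "<known-good-hash>"  # placeholder, not the real checksum

def read_school_records(zip_bytes: bytes, stem: str = "Schulstandorte_HB"):
    # Verify data integrity before parsing anything.
    if hashlib.sha256(zip_bytes).hexdigest() != EXPECTED_SHA256:
        raise ValueError("Checksum mismatch for downloaded shapefile ZIP")

    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # Detect encoding from the .cpg sidecar file, falling back to UTF-8.
        try:
            encoding = zf.read(f"{stem}.cpg").decode("ascii").strip() or "utf-8"
        except KeyError:
            encoding = "utf-8"

        # pyshp accepts file-like objects, so nothing is extracted to disk.
        with shapefile.Reader(
            shp=io.BytesIO(zf.read(f"{stem}.shp")),
            shx=io.BytesIO(zf.read(f"{stem}.shx")),
            dbf=io.BytesIO(zf.read(f"{stem}.dbf")),
            encoding=encoding,
        ) as sf:
            for sr in sf.iterShapeRecords():
                yield sr
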
Data improvements

  • Geolocation: Now includes latitude/longitude for all schools, converted from EPSG:25832 (see the sketch below)
  • Stable IDs: Uses 3-digit SNR from shapefile (e.g., HB-002, HB-117)
  • Additional fields: Provider (Träger) and district (Ortsteil) now available

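For illustration, a small sketch of the coordinate conversion mentioned above, assuming pyproj's Transformer API; the sample coordinates are made up:

from pyproj import Transformer

# ETRS89 / UTM zone 32N (EPSG:25832) -> WGS84 (EPSG:4326), always x/y order
transformer = Transformer.from_crs("EPSG:25832", "EPSG:4326", always_xy=True)

x, y = 487000.0, 5880000.0  # easting/northing from the shapefile geometry
lon, lat = transformer.transform(x, y)
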
Data quality validation

  • Strict validation for shapefile schema (required fields must be present)
  • SNR format validation (must be exactly 3 digits)
  • Duplicate SNR detection
  • Core field validation with aggregated error reporting
  • Configurable failure behavior via the fail_on_missing_core attribute (see below)

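A hedged sketch of how the configurable failure behavior might look on the spider; the class name is an assumption, only the attribute name comes from this PR:

import scrapy

class BremenSpider(scrapy.Spider):
    name = "bremen"
    # Lenient default: records with missing core fields are logged and skipped.
    # Set to True (e.g. when instantiating the spider) to abort the crawl instead.
    fail_on_missing_core = False
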
Dependencies

New dependency:

  • pyshp>=3.0.2.post1 - Python shapefile library for reading INSPIRE shapefiles

Test plan

  • Tested with small sample (3 schools)
  • Verified geolocation data (coordinates converted correctly)
  • Confirmed all 253 schools from Bremen + Bremerhaven processed successfully
  • Validated SNR format and uniqueness (see the check sketched below)
  • Tested strict validation behavior

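For illustration, a standalone helper capturing the SNR checks from the test plan; the function name and the record dict shape are assumptions:

import re

def check_snrs(records, expected_total=253):
    """Validate SNR format, uniqueness, and total count of parsed school records."""
    snrs = [r["snr"] for r in records]
    bad = [s for s in snrs if not re.fullmatch(r"\d{3}", s)]
    assert not bad, f"Invalid SNR values: {bad}"
    assert len(snrs) == len(set(snrs)), "Duplicate SNR detected"
    assert len(records) == expected_total, f"Expected {expected_total} schools, got {len(records)}"
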
tim added 4 commits October 24, 2025 08:30
Switch data source from HTML scraping to INSPIRE Download Service shapefiles,
adding geolocation support for all 253 schools (206 Bremen + 47 Bremerhaven).

Key changes:
- Use official Schulnummer (SNR) from shapefile as stable ID source
- SNR is zero-padded 3-digit format (002, 117) matching official Bremen convention
- Transform coordinates from EPSG:25832 to WGS84 using pyproj
- Implement caching: download ZIP once, extract locally
- Add pyshp dependency for lightweight shapefile reading (no numpy/GDAL)
- Validate all SNRs are unique and in correct format

Data improvements:
- 100% geolocation coverage (was 0%)
- Stable IDs using official Schulnummer field
- 253 schools total (previous HTML scraper missed some)
- Consistent with Bremen's official materials and email aliases

Enhancements:
- Read shapefiles directly from ZIP using BytesIO (no disk extraction)
- Auto-detect encoding from .cpg file for proper character handling
- Use iterShapeRecords() for memory efficiency
- Validate required fields exist before processing
- Track duplicate SNRs per shapefile with better error messages
- Case-insensitive field mapping for robustness
- Cross-platform path handling with os.path.join()

Raise ValueError instead of logging warnings for:
- Missing required shapefile fields (snr_txt, nam, strasse, plz, ort)
- Invalid SNR format (must be 3 digits)
- Duplicate SNR
- Missing core data (name, address, zip, city)

Prevents silent failures by failing hard with clear error messages.

- Add REQUIRED_MAP to centralize field name mappings
- Aggregate missing field validation to report all issues at once
- Add fail_on_missing_core attribute for configurable failure behavior
- Default to logging errors and continuing (lenient mode)
- Can be set to fail hard by setting fail_on_missing_core=True
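
A plausible shape for the REQUIRED_MAP mentioned above; the exact contents are an assumption, inferred from the shapefile fields referenced elsewhere in this PR:

# our schema name -> shapefile field name
REQUIRED_MAP = {
    "snr": "snr_txt",      # official Schulnummer, zero-padded to 3 digits
    "name": "nam",         # school name
    "address": "strasse",  # street and house number
    "zip": "plz",          # postal code
    "city": "ort",         # Bremen or Bremerhaven
}
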
Comment on lines 20 to 22
CACHE_DIR = "cache"
CACHE_FILE = os.path.join(CACHE_DIR, "Schulstandorte_HB_BHV.zip")

Since we are only ever writing and never reading the file, I don't think we need this caching logic.

Comment on lines 67 to 123
# Build robust field-name map
# sf.fields: [("DeletionFlag","C",1,0), ("NAM","C",80,0), ...]
field_names = [f[0].lower() for f in sf.fields[1:]]  # skip DeletionFlag
required = set(self.REQUIRED_MAP.values())
missing = required.difference(field_names)
if missing:
    raise ValueError(
        f"Missing required fields in {stem}: {missing}. Found: {field_names}"
    )

# Iterate records
seen_ids = set()
for sr in sf.iterShapeRecords():
    rec = dict(zip(field_names, sr.record))

    snr_txt = (rec.get("snr_txt") or "").strip()
    if not re.fullmatch(r"\d{3}", snr_txt):
        raise ValueError(
            f"[{city_name}] Invalid SNR format '{snr_txt}' for {rec.get('nam')} (expected 3 digits)"
        )
    if snr_txt in seen_ids:
        raise ValueError(f"[{city_name}] Duplicate SNR '{snr_txt}'")
    seen_ids.add(snr_txt)

    # Validate core fields (non-silent). Aggregate and act once.
    core = {
        k: (rec.get(v) or "").strip()
        for k, v in self.REQUIRED_MAP.items()
    }
    missing_core = [k for k, val in core.items() if not val]
    if missing_core:
        msg = f"[{city_name}] Missing required fields {missing_core} for SNR '{snr_txt}'"
        if getattr(self, "fail_on_missing_core", False):
            raise CloseSpider(reason=msg)
        self.logger.error(msg)
        continue

    # geometry
    shp = sr.shape
    lat = lon = None
    if shp and shp.points:
        # Expect Point; take first coordinate defensively
        x, y = shp.points[0]
        lon, lat = transformer.transform(x, y)

    yield {
        "snr": snr_txt,
        "name": core["name"],
        "address": core["address"],
        "zip": core["zip"],
        "city": core["city"],
        "district": rec.get("ortsteilna"),
        "school_type": rec.get("schulart_2"),
        "provider": rec.get("traegernam"),
        "latitude": lat,
        "longitude": lon,
    }

I think I would prefer a much simpler approach. Something like this:

                field_names = [f.name for f in sf.fields]
                for sr in sf.iterShapeRecords():
                    rec = dict(zip(field_names, sr.record))

                    # geometry
                    shp = sr.shape
                    if shp and shp.points:
                        # Expect Point; take first coordinate defensively
                        x, y = shp.points[0]
                        rec['lon'], rec['lat'] = transformer.transform(x, y)
                    yield rec

Our goal is generally to have the parse function return the data (which is going to be stored as raw in the database later) as close to the original as possible. It is then the job of the normalize method to turn the field names into the common names that we picked for our model of the data.

My preferred way would hence be to return the data as untouched as possible and to then do the mapping from their short names to our schema names (as you already do in REQUIRED_MAP) in the normalize function.

This way we also don't need to add expectations about the shape of the data to the parse function yet. It will continue to work even if the schema changes, and we will only have to adapt the normalize function later to match the new field names.
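
For illustration, a rough sketch of the normalize step described here; the method signature and field names are illustrative, not the final implementation:

def normalize(self, record: dict) -> dict:
    """Map raw shapefile field names to our common schema."""
    return {
        "snr": (record.get("snr_txt") or "").strip(),
        "name": record.get("nam"),
        "address": record.get("strasse"),
        "zip": record.get("plz"),
        "city": record.get("ort"),
        "latitude": record.get("lat"),
        "longitude": record.get("lon"),
    }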

tim added 2 commits November 4, 2025 12:19
Resolve uv.lock conflict by regenerating lock file

- Remove unnecessary file caching and SHA256 computation
- Simplify parse() to return raw shapefile data
- Move field mapping and validation to normalize()
- Remove unused imports (os, hashlib, CloseSpider)

This follows the architecture pattern where parse() keeps data
as close to the source as possible, and normalize() handles
the transformation to our standard schema.

k-nut (Member) left a comment:

I wasn't sure if we should merge this or not since we are losing some information (e.g. phone and websites) but since our new strategy is to prioritize geodata over other information, I think that this is good.
We could consider re-using the old scraper as well (since both sources use the same ids) and make a new parse_details that fetches the details from https://www.bildung.bremen.de/-8954?Sid=<school_id>. Let me know what you prefer.

In any case, please re-base on main and fix the conflicts in the lockfile :)
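
For illustration, a rough sketch of the suggested parse_details flow; the class, callback, and helper names plus the CSS selectors are assumptions, only the URL pattern comes from the comment above:

import scrapy

class BremenSchoolSpider(scrapy.Spider):
    name = "bremen"  # class and attribute names are assumptions

    def parse(self, response):
        for record in self.parse_shapefile(response):  # hypothetical helper
            yield scrapy.Request(
                f"https://www.bildung.bremen.de/-8954?Sid={record['snr']}",
                callback=self.parse_details,
                cb_kwargs={"record": record},
            )

    def parse_details(self, response, record):
        # Enrich the shapefile record with the details the old scraper provided.
        record["phone"] = response.css(".telefon::text").get()            # selector is a guess
        record["website"] = response.css("a.homepage::attr(href)").get()  # selector is a guess
        yield record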

Comment on lines +26 to +28

# Bremen shapefile cache
cache/

This can be removed again now
