Commit fe9257f
tim
Rewrite Niedersachsen spider with shapefile-first approach

Replaces API-only scraping with LSN geodata integration, providing
100% geolocation coverage (4,250 schools). Implements robust matching,
normalization, and async downloads.

Key improvements:
- Add shapefile integration for 4 LSN categories (ABS, Förder, BBS, SdG)
- Implement Unicode-aware (NFKD) German character normalization
- Add collision-safe API indexing with form-aware disambiguation
- Use async Scrapy Requests for shapefile downloads (non-blocking)
- Generate stable synthetic IDs (SHA-1) for shapefile-only schools
- Add path traversal protection in ZIP extraction
- Ensure API-only schools are fetched after shapefile processing
- Add enhanced school form normalization with long-form variants
- Add allowed_domains and stats tracking
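
The Unicode-aware normalization item in the list above might look roughly like the following. This is a minimal sketch, not the spider's actual code: the function name `normalize_name` and the choice to transliterate umlauts (ä to ae, and so on) before applying NFKD are assumptions, since the commit message does not show the implementation.

```python
import unicodedata

# Assumed transliteration table; the real spider may strip umlauts instead.
_GERMAN_MAP = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def normalize_name(name: str) -> str:
    """Return a lowercase, ASCII-friendly matching key for a school name."""
    # Compose first so that "u" + combining diaeresis becomes a single "ü".
    text = unicodedata.normalize("NFC", name).casefold()  # casefold maps ß to ss
    text = text.translate(_GERMAN_MAP)
    # NFKD splits any remaining accented letters into base letter + mark.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse runs of whitespace so spacing differences do not break matching.
    return " ".join(text.split())
```

Keys built this way let a shapefile record and an API record compare equal even when one source writes "Straße" and the other "Strasse".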
Coverage: 4,250 schools with geodata (~75% API-enriched, ~25% shapefile-only)
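
For the shapefile-only schools, a stable synthetic ID can be derived by hashing fields that do not change between runs. The sketch below is illustrative only: the function name `synthetic_id`, the choice of fields (name, city, category), and the `NI-shp-` prefix are all assumptions rather than what the commit implements.

```python
import hashlib


def synthetic_id(name: str, city: str, category: str) -> str:
    """Build a deterministic ID from fields assumed to be stable across runs."""
    # Trim and casefold so cosmetic differences do not change the ID.
    key = "|".join(part.strip().casefold() for part in (name, city, category))
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # Hypothetical prefix; a truncated digest keeps the ID short.
    return f"NI-shp-{digest[:12]}"
```

Because the ID depends only on the input fields, re-running the spider yields the same identifier for the same school, which keeps downstream records from being duplicated.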
Adds pyproj dependency for CRS transformation support.
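
The path traversal protection mentioned in the improvements list could be sketched as follows. This is an assumption about the approach, not the code from this commit: `safe_extract` is a hypothetical helper that checks every archive member before writing anything to disk (requires Python 3.9+ for `Path.is_relative_to`).

```python
import zipfile
from pathlib import Path


def safe_extract(archive: zipfile.ZipFile, target_dir: Path) -> list[Path]:
    """Extract all members, rejecting any that would land outside target_dir."""
    root = target_dir.resolve()
    members = archive.infolist()
    # Validate every member first so a bad archive extracts nothing at all.
    for member in members:
        destination = (root / member.filename).resolve()
        if not destination.is_relative_to(root):
            raise ValueError(f"Unsafe path in archive: {member.filename}")
    extracted = []
    for member in members:
        archive.extract(member, root)
        extracted.append((root / member.filename).resolve())
    return extracted
```

`zipfile` already sanitizes some dangerous member names during extraction, so the explicit check is a defensive layer that fails loudly instead of silently rewriting paths.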
Note: the API endpoint (schulen.nibis.de) is public but may experience
intermittent connectivity. Shapefile processing is fully functional
and provides complete geodata coverage.

1 parent e30ae74 commit fe9257f
File tree
4 files changed: +825 -354 lines changed
- jedeschule/spiders