Migrate schools scraper from HTML scraping to xacen-backend REST API by jsanz · Pull Request #5 · decasaalcole/decasaalcole-data

jsanz · 2026-03-06T13:05:51Z

Summary

The education portal (ceice.gva.es) replaced its old ASP pages with a modern Angular SPA backed by a JSON REST API at xacen-backend.gva.es. The old HTML scraping endpoints all return 404.

This PR rewrites the schools scraper to use the new API:

Authentication: JWT flow using public credentials (embedded in the frontend JS bundle)
School list: Paginated GET /guiadecentros/listaCentrosAulariosLibre (100 per page, ~3700 schools across 39 pages)
School details: 6 GET endpoints per school (datosGenerales, nivelesAutorizados, jornadas, informacionAdicional, servicios, programaLinguistico)
Output: Always produces both schools.csv and schools.json in a single run (no more OUTPUT_FORMAT toggle)
Legacy compatibility: API responses are transformed to match the existing 18-field output schema (codigo, denGenEs, deno, reg, dir, cp, muni, tel, email, com, titular, lat, long, cif, niveles, horario, info, inst)
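
The paginated school list can be sketched as below. This is a minimal illustration, not the code in scraper.py: `fetch_page` is a hypothetical callable standing in for the authenticated GET request, and the zero-based page index and the "stop on a short page" rule are assumptions about how the API paginates.

```python
from typing import Callable, Iterator

PAGE_SIZE = 100  # page size stated in the PR description


def iter_schools(fetch_page: Callable[[int, int], list[dict]],
                 page_size: int = PAGE_SIZE) -> Iterator[dict]:
    """Yield schools page by page until a short or empty page is returned.

    fetch_page(page, size) is a hypothetical wrapper around the
    authenticated list request; it returns the rows of one page.
    """
    page = 0  # assumption: zero-based page index
    while True:
        rows = fetch_page(page, page_size)
        yield from rows
        if len(rows) < page_size:
            break  # a short (or empty) page means we reached the end
        page += 1


# Stand-in for the real API: 250 fake schools served in pages.
def fake_fetch_page(page: int, size: int) -> list[dict]:
    data = [{"codigo": f"{i:08d}"} for i in range(250)]
    return data[page * size:(page + 1) * size]


schools = list(iter_schools(fake_fetch_page))
print(len(schools))  # 250
```

Injecting the page fetcher keeps the loop testable without network access or credentials.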

Changes

Rewrite scraper.py: API client with JWT auth, pagination, retry with backoff, and ThreadPoolExecutor for concurrent detail fetching
Simplify main.py: remove local HTML mode (--local flag)
Drop dependencies: beautifulsoup4, lxml, requests-cache, debugpy
Add python-dotenv loading for non-Docker runs
Update README.md to document the new API-based approach

Verified against previous scrape

Same total of 3710 schools (33 removed and 33 added, consistent with natural churn)
Identical 18-field schema in both CSV and JSON outputs
Data quality improvements: cleaner text (no \r artifacts), no navigation icon titles in inst, no header labels in horario
All coordinate data preserved (1 school gained coordinates)
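
The churn and schema checks above could be reproduced along these lines. This is a hedged sketch: `compare_scrapes` and `make_row` are hypothetical helpers, the sample codes are made up, and the actual comparison against data.old/ may have been done differently. With 33 codes removed and 33 added, the total stays at 3710 - 33 + 33 = 3710.

```python
# The 18 legacy fields, in the order given in the PR description.
LEGACY_FIELDS = [
    "codigo", "denGenEs", "deno", "reg", "dir", "cp", "muni", "tel",
    "email", "com", "titular", "lat", "long", "cif", "niveles",
    "horario", "info", "inst",
]


def compare_scrapes(old_rows: list[dict], new_rows: list[dict]) -> dict:
    """Report school codes removed/added and whether the schema matches."""
    old_codes = {row["codigo"] for row in old_rows}
    new_codes = {row["codigo"] for row in new_rows}
    return {
        "removed": sorted(old_codes - new_codes),
        "added": sorted(new_codes - old_codes),
        # Every new row must expose exactly the legacy fields, in order.
        "schema_ok": all(list(row) == LEGACY_FIELDS for row in new_rows),
    }


def make_row(code: str) -> dict:
    """Build an otherwise empty row for the demo (codes are made up)."""
    row = dict.fromkeys(LEGACY_FIELDS, "")
    row["codigo"] = code
    return row


old = [make_row("00000001"), make_row("00000002")]
new = [make_row("00000002"), make_row("00000003")]
report = compare_scrapes(old, new)
```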

Test plan

uv run src/main.py --school-codes 03000047 — single school detail fetch
uv run src/main.py --subset 3 — paginated list with subset
Full scrape of all ~3700 schools with --threads 15
Field-by-field comparison against data.old/ output
Both CSV and JSON outputs generated in single run
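
For reference, producing both output files in one pass might look like the following sketch. The file names and field list come from this PR; `write_outputs` is a hypothetical helper, and the JSON layout (a list of objects) is an assumption about the legacy format.

```python
import csv
import json
import tempfile
from pathlib import Path

# The 18 legacy fields, in the order given in the PR description.
LEGACY_FIELDS = [
    "codigo", "denGenEs", "deno", "reg", "dir", "cp", "muni", "tel",
    "email", "com", "titular", "lat", "long", "cif", "niveles",
    "horario", "info", "inst",
]


def write_outputs(rows: list[dict], out_dir: Path) -> tuple[Path, Path]:
    """Write schools.csv and schools.json from the same rows."""
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "schools.csv"
    json_path = out_dir / "schools.json"

    # CSV: fixed column order; missing keys become "", extra keys are dropped.
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=LEGACY_FIELDS,
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

    # JSON: same fields and order, assumed here to be a list of objects.
    records = [{f: row.get(f, "") for f in LEGACY_FIELDS} for row in rows]
    with json_path.open("w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)

    return csv_path, json_path


# Demo with one placeholder row written to a temporary directory.
demo_dir = Path(tempfile.mkdtemp())
csv_path, json_path = write_outputs(
    [{"codigo": "03000047", "extra": "ignored"}], demo_dir)
header = csv_path.read_text(encoding="utf-8").splitlines()[0].split(",")
records = json.loads(json_path.read_text(encoding="utf-8"))
```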

The education portal (ceice.gva.es) replaced the old ASP pages with a modern Angular SPA backed by a JSON REST API. The old HTML endpoints return 404.

- Rewrite scraper to use xacen-backend API with JWT authentication
- Fetch paginated school list and enrich with detail endpoints (datosGenerales, nivelesAutorizados, jornadas, informacionAdicional, servicios, programaLinguistico)
- Transform API response to match legacy output schema (same 18 fields)
- Always output both CSV and JSON in a single run
- Use ThreadPoolExecutor instead of multiprocessing Pool
- Remove local HTML mode (no longer applicable)
- Drop beautifulsoup4, lxml, requests-cache, debugpy dependencies
- Load .env via python-dotenv for non-Docker runs

Made-with: Cursor

jsanz requested a review from vehrka March 6, 2026 13:06

jsanz wants to merge 1 commit into main from migrate-scraper-to-api
