Migrate schools scraper from HTML scraping to xacen-backend REST API#5

Open
jsanz wants to merge 1 commit into main from migrate-scraper-to-api

jsanz commented Mar 6, 2026

Summary

The education portal (ceice.gva.es) replaced its old ASP pages with a modern Angular SPA backed by a JSON REST API at xacen-backend.gva.es. The old HTML scraping endpoints all return 404.

This PR rewrites the schools scraper to use the new API:

  • Authentication: JWT flow using public credentials (embedded in the frontend JS bundle)
  • School list: Paginated GET /guiadecentros/listaCentrosAulariosLibre (100 per page, ~3700 schools across 39 pages)
  • School details: 6 GET endpoints per school (datosGenerales, nivelesAutorizados, jornadas, informacionAdicional, servicios, programaLinguistico)
  • Output: Always produces both schools.csv and schools.json in a single run (no more OUTPUT_FORMAT toggle)
  • Legacy compatibility: API responses are transformed to match the existing 18-field output schema (codigo, denGenEs, deno, reg, dir, cp, muni, tel, email, com, titular, lat, long, cif, niveles, horario, info, inst)
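
As an illustration of the paginated school list described above, here is a minimal sketch, not the code in this PR. The names `iter_schools` and `fetch_page` are hypothetical, and the stop condition (a page shorter than the page size) is an assumption; the real API may report a total page count instead. The HTTP call is injected as a callable because the actual query parameters and JWT header handling are not shown in this description.

```python
from typing import Callable, Iterator

PAGE_SIZE = 100  # page size stated in this PR


def iter_schools(
    fetch_page: Callable[[int, int], list[dict]],
    page_size: int = PAGE_SIZE,
) -> Iterator[dict]:
    """Yield every row from a paginated listing.

    fetch_page(page, page_size) is a hypothetical callable that performs the
    authenticated GET and returns the rows of one page as a list of dicts.
    Iteration stops after the first page that holds fewer rows than page_size
    (assumption: the API signals the end with a short or empty page).
    """
    page = 0
    while True:
        rows = fetch_page(page, page_size)
        yield from rows
        if len(rows) < page_size:
            break
        page += 1


if __name__ == "__main__":
    # Stand-in for the real API: 250 fake rows served 100 at a time.
    fake_rows = [{"codigo": f"{i:08d}"} for i in range(250)]

    def fake_fetch(page: int, size: int) -> list[dict]:
        return fake_rows[page * size:(page + 1) * size]

    print(len(list(iter_schools(fake_fetch))))
```

Injecting the fetch function keeps the paging loop testable without network access or credentials.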

Changes

  • Rewrite scraper.py: API client with JWT auth, pagination, retry with backoff, and ThreadPoolExecutor for concurrent detail fetching
  • Simplify main.py: remove local HTML mode (--local flag)
  • Drop dependencies: beautifulsoup4, lxml, requests-cache, debugpy
  • Add python-dotenv loading for non-Docker runs
  • Update README.md to document the new API-based approach
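
The retry and concurrency pattern named in the first bullet could look roughly like the sketch below. It is an illustration under assumptions, not the contents of scraper.py: `with_retry`, `fetch_all_details`, the attempt count, and the delay values are all hypothetical, and `fetch_detail` stands in for whatever function calls the six detail endpoints for one school.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retry(func, *args, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller can see it.
    """
    for attempt in range(attempts):
        try:
            return func(*args)
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def fetch_all_details(codes, fetch_detail, threads=15, **retry_kwargs):
    """Fetch details for every school code concurrently.

    pool.map returns results in the same order as `codes`, so the output
    lines up with the input list even though requests run in parallel.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(
            pool.map(lambda c: with_retry(fetch_detail, c, **retry_kwargs), codes)
        )
```

Threads suit this workload because the detail fetches are I/O-bound; the default of 15 workers mirrors the `--threads 15` value used in the test plan.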

Verified against previous scrape

  • Same total of 3710 schools (33 removed and 33 added, consistent with natural churn)
  • Identical 18-field schema in both CSV and JSON outputs
  • Data quality improvements: cleaner text (no \r artifacts), no navigation icon titles in inst, no header labels in horario
  • All coordinate data preserved (1 school gained coordinates)
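
A comparison like the one summarized above can be sketched as follows. This is a hypothetical helper, not the script used for the verification: `compare_scrapes` is an invented name, and it assumes both scrapes are loaded as lists of dicts keyed by the `codigo` field from the 18-field schema.

```python
def compare_scrapes(old_rows, new_rows, key="codigo"):
    """Compare two scrapes and report churn and per-field differences.

    Returns a tuple (removed, added, changed):
      removed: sorted codes present only in the old scrape
      added:   sorted codes present only in the new scrape
      changed: {code: [field, ...]} for schools in both scrapes whose
               values differ in at least one field of the old row
    """
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}

    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())

    changed = {}
    for code in old.keys() & new.keys():
        fields = sorted(f for f in old[code] if old[code][f] != new[code].get(f))
        if fields:
            changed[code] = fields
    return removed, added, changed
```

Equal lengths of `removed` and `added` would correspond to the unchanged total reported above, while `changed` lists the fields worth inspecting by hand, such as `inst` or `horario`.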

Test plan

  • uv run src/main.py --school-codes 03000047 — single school detail fetch
  • uv run src/main.py --subset 3 — paginated list with subset
  • Full scrape of all ~3700 schools with --threads 15
  • Field-by-field comparison against data.old/ output
  • Both CSV and JSON outputs generated in single run

The education portal (ceice.gva.es) replaced the old ASP pages with
a modern Angular SPA backed by a JSON REST API. The old HTML endpoints
return 404.

- Rewrite scraper to use xacen-backend API with JWT authentication
- Fetch paginated school list and enrich with detail endpoints
  (datosGenerales, nivelesAutorizados, jornadas, informacionAdicional,
  servicios, programaLinguistico)
- Transform API response to match legacy output schema (same 18 fields)
- Always output both CSV and JSON in a single run
- Use ThreadPoolExecutor instead of multiprocessing Pool
- Remove local HTML mode (no longer applicable)
- Drop beautifulsoup4, lxml, requests-cache, debugpy dependencies
- Load .env via python-dotenv for non-Docker runs

Made-with: Cursor

jsanz requested a review from vehrka on March 6, 2026, 13:06