
Prod DB Integration: Sitemap Fetching and Parsing Module #67

@Enniwhere

Description:
Create a module to fetch and parse lex.dk sitemaps, extracting article URLs, modification timestamps, and deriving encyclopedia IDs and permalinks from URLs.

Acceptance criteria:

  • Function fetch_sitemap(url: str) -> str fetches XML content from a sitemap URL with error handling
  • Function parse_sitemap(xml_content: str) -> list[SitemapEntry] parses sitemap XML and extracts the <loc> and <lastmod> elements from each <url> entry
  • Function fetch_all_sitemaps() -> list[SitemapEntry] fetches and parses all 6 sub-sitemaps (sitemap1.xml through sitemap6.xml)
  • Function derive_encyclopedia_id(url: str) -> int extracts subdomain and maps to encyclopedia_id using reverse lookup of get_url_base() mapping
  • Function derive_permalink(url: str) -> str extracts path component after domain (e.g., "ekstern" from "https://lex.dk/ekstern")
  • SitemapEntry dataclass includes: url, lastmod (datetime), encyclopedia_id, permalink, article_id (see the sketch after this list)
  • Unit tests with mocked HTTP requests and sample sitemap XML fixtures
  • Handles URL-encoded permalinks (e.g., "eksteri%C3%B8rbed%C3%B8mmelse" → "eksteriørbedømmelse")
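A minimal sketch of the dataclass and the two derivation helpers. The subdomain-to-encyclopedia_id dictionary below is a placeholder: in the real module it would be built by reversing the existing get_url_base() mapping, and the values shown are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import unquote, urlparse

# Placeholder mapping; the real module would reverse get_url_base().
# The keys/values here are illustrative assumptions, not actual IDs.
SUBDOMAIN_TO_ENCYCLOPEDIA_ID = {
    "lex": 1,
    "denstoredanske": 2,
}


@dataclass
class SitemapEntry:
    url: str
    lastmod: datetime
    encyclopedia_id: int
    permalink: str
    article_id: int | None = None


def derive_encyclopedia_id(url: str) -> int:
    """Map the URL's subdomain (or the bare lex.dk domain) to an encyclopedia_id."""
    host = urlparse(url).netloc  # e.g. "denstoredanske.lex.dk" or "lex.dk"
    subdomain = host.split(".")[0] if host.count(".") > 1 else "lex"
    return SUBDOMAIN_TO_ENCYCLOPEDIA_ID[subdomain]


def derive_permalink(url: str) -> str:
    """Return the URL-decoded path component after the domain."""
    path = urlparse(url).path.lstrip("/")  # e.g. "eksteri%C3%B8rbed%C3%B8mmelse"
    return unquote(path)                   # -> "eksteriørbedømmelse"
```

With this sketch, derive_permalink("https://lex.dk/ekstern") returns "ekstern", matching the acceptance criterion above.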

Technical details:

  • Use httpx for HTTP requests with timeout and retry configuration (see the sketch after this list)
  • Use xml.etree.ElementTree for XML parsing (stdlib, no extra dependencies)
  • Handle timezone-aware datetime parsing for lastmod field
  • Sitemaps are available at https://lex.dk/.sitemap/sitemap.xml (overview index) and https://lex.dk/.sitemap/sitemap1.xml through https://lex.dk/.sitemap/sitemap6.xml (the actual sitemaps)
  • Every article has a JSON representation that can be retrieved by appending ".json" to the article URL (e.g. lex.dk/ekstern -> lex.dk/ekstern.json)
  • Add httpx to pyproject.toml dependencies
  • Create module at src/lex_db/sitemap.py
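A rough sketch of the fetching and parsing side, under a few assumptions: the sub-sitemaps use the standard sitemaps.org namespace, lastmod is an ISO-8601 timestamp with an offset, and httpx's transport-level retries (which cover connection failures only) are enough. It returns plain dicts for brevity; the real parse_sitemap would build full SitemapEntry objects using the derivation helpers sketched earlier.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

import httpx

# Standard sitemaps.org namespace; assumed to be what lex.dk serves.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fetch_sitemap(url: str, timeout: float = 10.0, retries: int = 3) -> str:
    """Fetch raw sitemap XML. Retries apply to connection failures only."""
    transport = httpx.HTTPTransport(retries=retries)
    with httpx.Client(transport=transport, timeout=timeout) as client:
        response = client.get(url)
        response.raise_for_status()
        return response.text


def parse_sitemap(xml_content: str) -> list[dict]:
    """Extract <loc> and <lastmod> from each <url> element."""
    root = ET.fromstring(xml_content)
    entries = []
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod_raw = url_el.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        # fromisoformat keeps the offset, so the result is timezone-aware
        # (Python 3.11+ also accepts a trailing "Z").
        lastmod = datetime.fromisoformat(lastmod_raw) if lastmod_raw else None
        entries.append({"url": loc, "lastmod": lastmod})
    return entries


def fetch_all_sitemaps() -> list[dict]:
    """Fetch and parse sitemap1.xml through sitemap6.xml."""
    entries: list[dict] = []
    for i in range(1, 7):
        xml_content = fetch_sitemap(f"https://lex.dk/.sitemap/sitemap{i}.xml")
        entries.extend(parse_sitemap(xml_content))
    return entries
```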
