Open
Description:
Create a module to fetch and parse lex.dk sitemaps, extracting article URLs, modification timestamps, and deriving encyclopedia IDs and permalinks from URLs.
Acceptance criteria:
- Function fetch_sitemap(url: str) -> str fetches XML content from a sitemap URL with error handling
- Function parse_sitemap(xml_content: str) -> list[SitemapEntry] parses sitemap XML and extracts the <loc> and <lastmod> elements
- Function fetch_all_sitemaps() -> list[SitemapEntry] fetches and parses all 6 sub-sitemaps (sitemap1.xml through sitemap6.xml)
- Function derive_encyclopedia_id(url: str) -> int extracts the subdomain and maps it to an encyclopedia_id via a reverse lookup of the get_url_base() mapping
- Function derive_permalink(url: str) -> str extracts path component after domain (e.g., "ekstern" from "https://lex.dk/ekstern")
- SitemapEntry dataclass includes: url, lastmod (datetime), encyclopedia_id, permalink, article_id
- Unit tests with mocked HTTP requests and sample sitemap XML fixtures
- Handles URL-encoded permalinks (e.g., "eksteri%C3%B8rbed%C3%B8mmelse" → "eksteriørbedømmelse")
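
The parsing and derivation criteria above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: `ASSUMED_URL_BASES` is a hypothetical stand-in for the get_url_base() mapping (its real contents and IDs are not given in this issue), `parse_lastmod` is an illustrative helper name, entries lacking <loc> or <lastmod> are skipped by assumption, and `article_id` is left as `None` because the issue does not say how it is derived.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import unquote, urlsplit

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# HYPOTHETICAL stand-in for get_url_base(): encyclopedia_id -> base URL.
# The IDs and the first host are placeholders, not real project data.
ASSUMED_URL_BASES = {
    1: "https://example-subdomain.lex.dk",
    2: "https://lex.dk",
}


@dataclass(frozen=True)
class SitemapEntry:
    url: str
    lastmod: datetime
    encyclopedia_id: int
    permalink: str
    article_id: int | None = None  # derivation not specified in the issue


def parse_lastmod(text: str) -> datetime:
    """Parse a W3C/ISO 8601 timestamp into a timezone-aware datetime."""
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
    parsed = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: date-only or naive values are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def derive_permalink(url: str) -> str:
    """Return the percent-decoded path after the domain."""
    return unquote(urlsplit(url).path.lstrip("/"))


def derive_encyclopedia_id(url: str) -> int:
    """Reverse-lookup the URL's host in the (assumed) base-URL mapping."""
    host = urlsplit(url).hostname
    by_host = {urlsplit(base).hostname: eid for eid, base in ASSUMED_URL_BASES.items()}
    if host not in by_host:
        raise ValueError(f"unknown encyclopedia host: {host!r}")
    return by_host[host]


def parse_sitemap(xml_content: str) -> list[SitemapEntry]:
    """Extract one SitemapEntry per <url> element that has <loc> and <lastmod>."""
    root = ET.fromstring(xml_content)
    entries = []
    for node in root.findall("sm:url", SITEMAP_NS):
        loc = node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = node.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc or not lastmod:
            continue  # assumption: incomplete entries are skipped
        entries.append(
            SitemapEntry(
                url=loc,
                lastmod=parse_lastmod(lastmod),
                encyclopedia_id=derive_encyclopedia_id(loc),
                permalink=derive_permalink(loc),
            )
        )
    return entries
```

A small inline XML string with one or two <url> elements should be enough as a test fixture for these functions, since none of them touch the network.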
Technical details:
- Use httpx for HTTP requests with timeout and retry configuration
- Use xml.etree.ElementTree for XML parsing (stdlib, no extra dependencies)
- Handle timezone-aware datetime parsing for lastmod field
- Sitemap index available at https://lex.dk/.sitemap/sitemap.xml (overview); the actual sitemaps are https://lex.dk/.sitemap/sitemap1.xml through https://lex.dk/.sitemap/sitemap6.xml
- Every article has a JSON representation that can be retrieved by appending ".json" to the article URL (e.g. lex.dk/ekstern -> lex.dk/ekstern.json)
- Add httpx to pyproject.toml dependencies
- Create module at src/lex_db/sitemap.py
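
The URL conventions listed above (six numbered sub-sitemaps, ".json" suffix for articles) could be captured in a few helpers like the ones below. This is only a sketch: the function names `sub_sitemap_urls`, `fetch_all_sitemap_xml`, and `article_json_url` are hypothetical, and passing the fetcher in as a parameter is an assumption made so the code can be exercised with a fake in unit tests. In the real module the fetcher would presumably be the httpx-backed fetch_sitemap described in the acceptance criteria.

```python
from __future__ import annotations

from typing import Callable
from urllib.parse import urlsplit, urlunsplit

SITEMAP_BASE = "https://lex.dk/.sitemap"
SUB_SITEMAP_COUNT = 6  # sitemap1.xml through sitemap6.xml, per the issue


def sub_sitemap_urls() -> list[str]:
    """Return the URLs of the numbered sub-sitemaps in order."""
    return [f"{SITEMAP_BASE}/sitemap{i}.xml" for i in range(1, SUB_SITEMAP_COUNT + 1)]


def fetch_all_sitemap_xml(fetch: Callable[[str], str]) -> dict[str, str]:
    """Fetch every sub-sitemap with the supplied fetcher, keyed by URL.

    `fetch` is any callable mapping a URL to its XML text, which lets a
    test substitute a stub for the real HTTP client.
    """
    return {url: fetch(url) for url in sub_sitemap_urls()}


def article_json_url(article_url: str) -> str:
    """Append ".json" to the article path, dropping any query or fragment."""
    parts = urlsplit(article_url)
    json_path = parts.path.rstrip("/") + ".json"
    return urlunsplit(parts._replace(path=json_path, query="", fragment=""))
```

For example, `fetch_all_sitemap_xml(lambda url: "<urlset/>")` exercises the aggregation logic without any network access, which fits the mocked-HTTP requirement in the acceptance criteria.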