Major Refactoring & Company Details Feature #35

maximiliancw · 2026-01-01T04:16:15Z

Summary

This PR represents a major refactoring and feature enhancement of the handelsregister package, with ~60 commits since the fork.

Key Changes

Architecture & Refactoring

Restructured from single-file to modular package structure
Migrated to Pydantic models for type safety and validation
Improved code organization and separation of concerns
Added comprehensive settings management with pydantic-settings

New Features

Company Details Feature: Full support for fetching detailed company information (SI, AD, UT detail types)
Batch Operations: search_batch() function for processing multiple searches
Enhanced CLI: Improved command-line interface with better options
Better Error Handling: Structured exception hierarchy with retry logic

Documentation

Multilingual documentation (English & German) with MkDocs Material
Comprehensive API reference with mkdocstrings
Integration examples (FastAPI, Django, Jupyter)
Improved guides and tutorials

Improvements

Enhanced caching with TTL support
Better rate limiting and retry mechanisms
Comprehensive test suite (unit + integration tests)
Backward compatibility maintained via compatibility shim

Breaking Changes

⚠️ None - Backward compatibility is maintained through the handelsregister.py compatibility shim. However, users are encouraged to migrate to the new package structure.

Testing

All unit tests pass
Integration tests pass
Manual CLI testing completed
Documentation examples verified

Migration Guide

For users of the old single-file structure:

# Old (still works, but deprecated)
from handelsregister import search

# New (recommended)
from handelsregister import search, get_details, Company, CompanyDetails

Checklist

- Add type hints throughout the codebase (functions, methods, variables)
- Create Company and HistoryEntry dataclasses for structured data
- Add docstrings to key functions (parse_result, pr_company_info, get_companies_in_searchresults)
- Extract SUFFIX_MAP as a module-level constant
- Use list comprehension for cell parsing
- Add null check for grid in get_companies_in_searchresults
- Use f-strings in pr_company_info

- Add custom exception hierarchy: HandelsregisterError, NetworkError, ParseError, FormError, CacheError
- Wrap network operations in try/except with NetworkError
- Handle form selection failures with FormError
- Add parse validation with ParseError for malformed HTML
- Handle cache read/write failures gracefully
- Create main() function with proper exit codes for each error type
- Add docstrings with Raises documentation
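
The exception hierarchy and per-error exit codes listed above could look roughly like this. It is a minimal sketch, not the PR's actual code: the class names come from the commit message, while the exit code values, `exit_code_for`, and the `action` parameter of `main` are assumptions made for illustration.

```python
import sys


class HandelsregisterError(Exception):
    """Base class for all package errors (name taken from the commit message)."""


class NetworkError(HandelsregisterError):
    """Raised when a request to the portal fails."""


class ParseError(HandelsregisterError):
    """Raised when the returned HTML is malformed."""


class FormError(HandelsregisterError):
    """Raised when a form cannot be selected or submitted."""


class CacheError(HandelsregisterError):
    """Raised when the cache cannot be read or written."""


# Hypothetical mapping: the real exit codes are not stated in the PR.
EXIT_CODES = {
    NetworkError: 2,
    ParseError: 3,
    FormError: 4,
    CacheError: 5,
}


def exit_code_for(error: Exception) -> int:
    """Return the exit code for an error, defaulting to 1."""
    for error_type, code in EXIT_CODES.items():
        if isinstance(error, error_type):
            return code
    return 1


def main(action) -> int:
    """Run `action` and translate package errors into an exit code."""
    try:
        action()
    except HandelsregisterError as error:
        print(f"error: {error}", file=sys.stderr)
        return exit_code_for(error)
    return 0
```

A CLI entry point would then end with something like `sys.exit(main(run_search))`, where `run_search` is whatever callable performs the actual work.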

- Use SHA-256 hashing for cache filenames to prevent path traversal
- Add CacheEntry dataclass to store metadata with cached content
- Implement TTL-based cache expiration (default: 1 hour)
- Store cache as JSON with query, options, timestamp, and HTML
- Auto-delete expired or corrupted cache files
- Add _get_cache_key, _get_cache_path, _load_from_cache, _save_to_cache methods
- Include search options in cache key for proper cache invalidation
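
To illustrate the caching scheme described above, here is a minimal sketch using only the standard library. The `CacheEntry` fields follow the commit message (query, options, timestamp, HTML); the function names `cache_key`, `save`, and `load` are hypothetical stand-ins for the private methods and may differ from the real implementation.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional

DEFAULT_TTL_SECONDS = 3600  # 1 hour, as stated in the commit message


@dataclass
class CacheEntry:
    query: str
    options: dict
    timestamp: float
    html: str


def cache_key(query: str, options: dict) -> str:
    """Hash query and options so the filename never contains user input."""
    payload = json.dumps({"query": query, "options": options}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def save(cache_dir: Path, entry: CacheEntry) -> Path:
    """Write the entry as JSON under its hashed filename."""
    path = cache_dir / f"{cache_key(entry.query, entry.options)}.json"
    path.write_text(json.dumps(asdict(entry)), encoding="utf-8")
    return path


def load(cache_dir: Path, query: str, options: dict,
         ttl: float = DEFAULT_TTL_SECONDS) -> Optional[CacheEntry]:
    """Return a fresh entry, deleting expired or corrupted files."""
    path = cache_dir / f"{cache_key(query, options)}.json"
    if not path.exists():
        return None
    try:
        entry = CacheEntry(**json.loads(path.read_text(encoding="utf-8")))
    except (ValueError, TypeError):
        path.unlink()  # corrupted file
        return None
    if time.time() - entry.timestamp > ttl:
        path.unlink()  # expired entry
        return None
    return entry
```

Because the options are part of the hashed payload, the same query with different filters maps to a different file, which is what makes the invalidation work.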

- Bump minimum Python version to 3.9 (3.6-3.8 are EOL)
- Remove unused mechanicalsoup dependency
- Add beautifulsoup4 as explicit dependency (was used but not declared)
- Add pytest to dev dependencies
- Add version constraints to dependencies for reproducibility
- Update tox envlist to py39, py310, py311, py312
- Add project metadata: description, license, repository, keywords
- Bump version to 0.2.0

- Replace print() debug statements with `logging` module
- Add module-level logger configuration
- Replace `if x == True:` with `if x:` (PEP 8)
- Organize imports: `stdlib` first, third-party second
- Configure logging format based on debug flag
- Enable mechanize logger in debug mode
- Use logger.debug/info/warning for appropriate log levels

- Add pytest markers: @integration and @slow for live API tests
- Skip integration tests by default (run with -m integration)
- Add conftest.py with marker configuration
- Create test fixtures: sample_search_html, mock_args, temp_cache_dir
- Add unit tests for parsing (TestParseSearchResults)
- Add unit tests for dataclasses (TestDataClasses)
- Add unit tests for cache key generation (TestCache)
- Add unit tests for suffix mapping (TestSuffixMap)
- Move live API tests to TestLiveAPI class with proper markers
- Improve test documentation and organization

- Extract SearchCache class for cache operations with configurable TTL
- Extract ResultParser class with static methods for HTML parsing
- Refactor HandelsRegister to use dependency injection for cache
- Add _create_browser() factory method for browser configuration
- Split search_company() into smaller focused methods
- Add backward-compatible aliases for deprecated functions
- Add configuration constants (BASE_URL, REQUEST_TIMEOUT)
- Improve code organization with section headers
- Add module docstring describing architecture
- Update CLI help text with examples
- Update tests to use new SearchCache class directly

- Add SearchOptions dataclass to encapsulate all search parameters
- Add STATE_CODES mapping for all 16 German states (bundesland filtering)
- Add REGISTER_TYPES list (HRA, HRB, GnR, PR, VR)
- Add RESULTS_PER_PAGE_OPTIONS (10, 25, 50, 100)
- Implement state filtering via --states CLI option
- Implement register type filtering via --register-type option
- Implement register number search via --register-number option
- Add --include-deleted flag for historical entries
- Add --similar-sounding flag for phonetic search
- Add --results-per-page option to control pagination
- Update _submit_search to set all form fields with proper error handling
- Add _build_search_options method for args to SearchOptions conversion
- Improve CLI help text with grouped arguments and examples
- Add unit tests for SearchOptions and configuration constants
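
A rough sketch of how a `SearchOptions` dataclass can validate against the constants named above. The constant values are taken from the commit message. The field names, defaults, and error messages are assumptions and are not confirmed by the PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional

REGISTER_TYPES = ["HRA", "HRB", "GnR", "PR", "VR"]
RESULTS_PER_PAGE_OPTIONS = [10, 25, 50, 100]


@dataclass
class SearchOptions:
    # Field names and defaults are hypothetical.
    keywords: str
    states: List[str] = field(default_factory=list)
    register_type: Optional[str] = None
    include_deleted: bool = False
    results_per_page: int = 100

    def __post_init__(self) -> None:
        """Reject values the search form would not accept."""
        if self.register_type is not None and self.register_type not in REGISTER_TYPES:
            raise ValueError(f"unknown register type: {self.register_type!r}")
        if self.results_per_page not in RESULTS_PER_PAGE_OPTIONS:
            raise ValueError(f"invalid results per page: {self.results_per_page}")
```

Validating in `__post_init__` means both the CLI path and library callers get the same checks before any request is sent.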

- Document all new CLI arguments (--states, --register-type, etc.)
- Add state codes reference table
- Add usage examples for common scenarios
- Add testing instructions (unit vs integration tests)
- Keep original API documentation intact

- Written entirely in German
- Better structure with clear sections
- Legal notices highlighted
- API parameters in clear tables
- Added table of legal forms
- Documented the state (Bundesland) filter

- New search() function for programmatic use
- Clean Python API without argparse.Namespace
- Complete documentation with docstring and examples
- All search options available as named parameters
- Enables easy integration into other applications

- Constructor now accepts optional args (backward compatibility)
- New debug parameter for programmatic use
- New from_options() class method for SearchOptions
- New search_with_options() method as a clean API
- search_company() now delegates to search_with_options()
- German docstrings for better consistency

- Title changed to 'Handelsregister' (not just CLI)
- New section 'Verwendung als Library' (usage as a library) with examples
- Documented the simple API (search function)
- Documented the advanced API (HandelsRegister class)
- Explained the return format with an example dictionary
- Moved CLI documentation into its own section

- Import of the new search() function
- TestPublicAPI: tests for the search() function and SearchOptions
- TestHandelsRegisterClass: tests for the new initialization
  - test_init_without_args
  - test_init_with_debug
  - test_init_with_custom_cache
  - test_from_options_classmethod
  - test_search_company_requires_args
- Integration tests for search() and search_with_options()
- All 25 unit tests passed

- From Poetry to the standard PEP 621 [project] format
- hatchling as build backend
- [project.scripts] for the CLI entry point
- [tool.uv] for dev-dependencies
- Added pytest marker configuration
- Added black configuration

- Created uv.lock file with all dependencies
- dependency-groups.dev instead of tool.uv.dev-dependencies (deprecated)
- All 25 unit tests pass with uv run pytest

- Installation with uv sync instead of poetry install
- Added pip alternative
- CLI examples: uv run handelsregister instead of poetry run python
- Tests: uv run pytest instead of poetry run pytest

- Removed Pipfile (Pipenv no longer used)
- Removed poetry.lock (replaced by uv.lock)
- conftest.py: removed marker definition (now in pyproject.toml)

- Address: Business address with street, postal code, city, country
- Representative: Company representatives (Geschäftsführer, Vorstand, etc.)
- Owner: Company owners/shareholders (Gesellschafter)
- CompanyDetails: Extended company information combining all detail views

These models will be used to store structured data from the SI, AD, and UT detail views of the Handelsregister.

Add parser for extracting company details from SI (Strukturierter Registerinhalt) HTML views:
- Parse company name, legal form, capital, currency
- Extract business address with street, postal code, city
- Parse company purpose (Unternehmensgegenstand)
- Extract representatives (Geschäftsführer, Vorstand, Prokurist)
- Smart legal form detection with priority ordering

The parser handles various HTML table structures and text patterns commonly found in the Handelsregister detail views.

…ister:

HandelsRegister class:
- get_company_details(): Fetch details for a single company (SI/AD/UT)
- search_with_details(): Search and fetch details in one call
- _fetch_detail_page(): Handle JSF form submission for details
- _parse_details(): Route to appropriate parser

DetailsParser class:
- parse_ad(): Parse 'Aktueller Abdruck' (current printout)
- parse_ut(): Parse 'Unternehmensträger' (company owners)
- _extract_representatives_from_text(): Extract from free-form text
- _extract_owners(): Extract owner/shareholder information

Public API:
- get_details(): Simple function to fetch details for a company

The detail fetching uses the existing mechanize session to submit the JSF form with the appropriate control parameters.

- Add DETAILS_CACHE_TTL_SECONDS (24h) for longer caching of details
- SearchCache now accepts details_ttl_seconds parameter
- Cache.get() automatically uses longer TTL for 'details:' prefixed keys
- Add clear() method to remove cache files (optionally details only)
- Add get_stats() method for cache statistics

This allows company details to be cached for longer periods since register data changes infrequently, while search results still use the shorter 1-hour TTL.
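
The prefix-based TTL selection described above can be sketched as follows. `DETAILS_CACHE_TTL_SECONDS` and the `'details:'` prefix come from the commit message. `SEARCH_TTL_SECONDS`, `ttl_for_key`, and `is_fresh` are hypothetical names used only for this illustration.

```python
SEARCH_TTL_SECONDS = 60 * 60              # 1 hour for search results
DETAILS_CACHE_TTL_SECONDS = 24 * 60 * 60  # 24 hours for detail views


def ttl_for_key(key: str,
                ttl_seconds: int = SEARCH_TTL_SECONDS,
                details_ttl_seconds: int = DETAILS_CACHE_TTL_SECONDS) -> int:
    """Pick the TTL from the key prefix ('details:' gets the longer one)."""
    return details_ttl_seconds if key.startswith("details:") else ttl_seconds


def is_fresh(key: str, stored_at: float, now: float) -> bool:
    """Return True while the entry is still within its TTL."""
    return now - stored_at <= ttl_for_key(key)
```

With these values, an entry stored two hours ago is still served for a details key but counts as expired for a search key.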

Extend the command-line interface with detail fetching options:

New CLI arguments:
- --details: Enable fetching of detailed company information
- --detail-type: Choose detail type (SI/AD/UT, default: SI)

New output function:
- pr_company_details(): Pretty-print CompanyDetails with all fields

The main() function now supports two modes:
1. Standard search (existing behavior)
2. Search with details (--details flag)

Example usage:
handelsregister.py -s 'GASAG AG' --details --detail-type SI
handelsregister.py -s 'Bank' --states BE --details --json

Library usage:
- Document get_details() function
- Show available detail types (SI, AD, UT)
- Add CompanyDetails response format example

CLI usage:
- Document --details and --detail-type options
- Add examples for detail fetching
- Show JSON output for details

The documentation explains how to fetch extended company information including legal form, capital, address, representatives, and owners.

…oject.toml and uv.lock

…mentation structure
- Changed site name and description to English.
- Updated theme language to English and modified toggle names for dark/light mode.
- Added i18n plugin configuration for multilingual support, including English and German translations.
- Translated navigation and documentation sections to English.
- Created a new German index file for localized documentation.
- Updated existing index file to reflect English content and structure.

- Simplify CompanyDetails.to_dict() to use model_dump() with mode='python'
- Pydantic automatically handles nested model serialization
- Optimize Company.to_dict() to use model_dump() with by_alias=True
- Reduces code duplication and improves maintainability
- No functional changes

- Remove _get_cache_key, _get_cache_path, _load_from_cache, _save_to_cache
- These were deprecated private methods that just delegated to SearchCache
- Use cache.get() and cache.set() directly instead
- Breaking change: private API removed (methods were already deprecated)

- Replace incomplete fallback with proper FormError exception
- Add original_error support to FormError for better error context
- Improves robustness by failing fast with clear error messages
- No silent failures from empty string returns
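
As an illustration of the fail-fast behaviour described above, the sketch below wraps a lookup failure in a `FormError` that keeps the underlying exception. Only `FormError` and `original_error` are named in the commit message. The `select_form` helper and its dictionary-based lookup are hypothetical and stand in for the real form handling.

```python
from typing import Optional


class FormError(Exception):
    """Form-related failure that remembers the underlying exception."""

    def __init__(self, message: str, original_error: Optional[Exception] = None):
        super().__init__(message)
        self.original_error = original_error


def select_form(forms: dict, name: str) -> object:
    """Hypothetical helper: raise instead of returning an empty value."""
    try:
        return forms[name]
    except KeyError as error:
        # Chain the exception so the traceback shows both errors.
        raise FormError(f"form {name!r} not found", original_error=error) from error
```

Callers can then inspect `error.original_error` when logging, rather than having to debug a silently returned empty string.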

- Change parameter type from dict to Company for type safety
- Use attribute access instead of dict.get() for better performance
- Update history iteration to use HistoryEntry objects
- Breaking change: function signature changed (public API)

- Add URL context to NetworkError messages
- Include current page URL in FormError messages for debugging
- Add original_error to FormError in _navigate_to_search and _submit_search
- Improves debugging experience when errors occur

- Remove deprecated cache methods from HandelsRegister (private API)
- Change pr_company_info() signature to accept Company instead of dict (public API)
- All other changes are non-breaking optimizations and improvements

- Replace dict access patterns with attribute access (company.name instead of company['name'])
- Update return value descriptions from 'list of dicts' to 'list of Company objects'
- Fix pandas DataFrame examples to properly convert Company objects
- Update both English and German documentation files
- All examples now use Pydantic model attribute access

…ests
- Introduced a new fixture `shared_hr_client` to optimize API calls during integration tests by reusing a single instance of `HandelsRegister`
- Updated `pytest_collection_modifyitems` to skip integration tests by default unless specified
- Added critical rate limit warnings in `test_handelsregister.py` to inform users about potential API request limits and recommendations for running tests efficiently
- Improved code formatting and consistency across test file

- Update test matrix to test Python 3.9, 3.10, 3.11, and 3.12, matching the requires-python >=3.9 constraint in pyproject.toml
- Remove Python 3.7 and 3.8 which are no longer supported
- Also update GitHub Actions to latest versions (checkout@v4, setup-python@v5)

- Replace `URL | None` with `Optional[URL]` in `build_url` function to support Python 3.9
- The `|` union syntax for type hints was introduced in Python 3.10 and is therefore not compatible with Python 3.9
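
For reference, a small sketch of the Python 3.9 compatible spelling. The real `build_url` signature and the `URL` type are not shown in this PR, so plain `str` is used as a stand-in and the body is purely illustrative.

```python
from typing import Optional


def build_url(base: str, path: Optional[str] = None) -> str:
    """Join base and path.

    Writing `path: str | None` here would raise a TypeError when the
    function is defined on Python 3.9 (unless annotations are deferred
    with `from __future__ import annotations`).
    """
    if path is None:
        return base
    return f"{base.rstrip('/')}/{path.lstrip('/')}"
```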

- Introduced a new workflow to deploy documentation to GitHub Pages upon successful completion of the Lint and Python tests workflows
- The workflow checks if documentation files have changed and only proceeds with deployment if both linting and testing workflows have passed
- Utilizes actions for checking workflow status, setting up Python, installing dependencies, building static site with MkDocs, and deploying to GitHub Pages

- Added a step to install `uv` for improved dependency management
- Updated the installation command to utilize `uv` for syncing and installing dependencies
- Modified the MkDocs build command to run through `uv`, ensuring a more efficient build process

LilithWittmann · 2026-01-03T14:02:11Z

Hey, thank you for your contribution. There are a bunch of nice things in there. Buut also a lot of AI-Slop. I would propose we start with merging this small feature by feature in separate PRs. Like I really like the additional support of attributes or in general the improved cli - buut there are also things that just don't make a lot of sense (e.g. --results-per-page).

Especially the more complex parsing logic is kinda hard to test and I would rather suggest offering an option to retrieve and parse the actual structured information (SI XML) part of HR instead of trying to do more and more string matching via the UT view/…. Cause there are the weirdest things in the HR and we need to build this so robust that not slight page structure changes break it over and over again (the risk for this increases with every string bs4 matching/parsing we do).

Especially the part of the code which gives me the impression of "this fetches structured information" does not use the structured information XML but string matching gives me the feeling that this is not properly self reviewed but vibe coded. I am not up for reviewing thousands of locs of vibecoded slop with some good parts. If you want to contribute this I would really encourage you do it in small chunks. The proper SI parser could be an amazing first little project. But it will need a looot of tests (so maybe a few hundred different downloaded SI XMLs)

- Changed site author to "BundesAPI Contributors" and updated copyright year to 2025
- Modified theme colors from deep purple and amber to indigo and blue
- Updated navigation labels for better understanding, including translations for "User Guide" and "Fetching Details"
- Enhanced clarity in other sections of the documentation structure

Changed the reference from `handelsregister.main` to `handelsregister.cli.main` in both German and English documentation files

- Introduced the alias `hrg` for the `handelsregister` CLI command
- Updated both German and English docs to reflect this change
- Updated README.md accordingly

…nd keyword matching:
- Introduced `State`, `KeywordMatch`, and `RegisterType` enums for better type safety and clarity in search options
- Updated the `search` function to accept these enums alongside string values for states and keyword options
- Modified the `SearchOptions` model to validate and handle both enum and string inputs for states and register types
- Updated and improved documentation to reflect these changes and provide usage examples for the new enums
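
Accepting both enum members and plain strings, as described above, can be done with a string-valued enum and a small normalization step. This is a sketch under assumptions: only the enum name `State` comes from the commit message. The two members shown, the `normalize_states` helper, and the case-insensitive matching are illustrative and may not match the PR.

```python
from enum import Enum
from typing import List, Sequence, Union


class State(str, Enum):
    # Only two members shown; the PR describes all 16 German states.
    BE = "BE"
    BY = "BY"


def normalize_states(values: Sequence[Union[State, str]]) -> List[State]:
    """Accept enum members or plain strings and return enum members."""
    normalized = []
    for value in values:
        code = value.value if isinstance(value, State) else value
        try:
            normalized.append(State(code.upper()))
        except ValueError:
            raise ValueError(f"unknown state code: {value!r}") from None
    return normalized
```

Callers can then write either `normalize_states(["be"])` or `normalize_states([State.BE])` and get the same validated result.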

maximiliancw · 2026-01-04T21:52:59Z

Hi @LilithWittmann,

thanks for your swift feedback! I'm not hiding the fact that I used LLMs to support my work – however you're absolutely right. I'll put in some more work and split this up into multiple PRs.

maximiliancw · 2026-01-05T00:35:03Z

Split Strategy

Based on @LilithWittmann's feedback, this PR is being split into smaller, focused PRs:

New PRs

Remove questionable CLI options #36: Removes --results-per-page and --details options that add complexity without clear benefit
Add SI XML parser for structured register content #37: Proper XML-based parsing using xml.etree.ElementTree instead of HTML string matching

What's happening

The "structured details" feature in this PR uses HTML/label string matching via BeautifulSoup, which is fragile and breaks when the portal HTML changes. The new approach parses the actual SI XML endpoint which provides stable, machine-readable data.
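
To make the difference concrete, here is a minimal sketch of path-based XML parsing with `xml.etree.ElementTree`. The sample document, its element names, and the `ParsedCompany` type are invented for illustration. The real SI XML format is still being investigated and is not reproduced here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

# Invented sample document; NOT the real SI XML schema.
SAMPLE_XML = """
<company>
  <name>Beispiel GmbH</name>
  <legalForm>GmbH</legalForm>
  <address><city>Berlin</city></address>
</company>
"""


@dataclass
class ParsedCompany:
    name: str
    legal_form: Optional[str]
    city: Optional[str]


def parse_si_xml(xml_text: str) -> ParsedCompany:
    """Read fields by element path instead of matching label strings."""
    root = ET.fromstring(xml_text.strip())
    name = root.findtext("name")
    if name is None:
        raise ValueError("missing <name> element")
    return ParsedCompany(
        name=name.strip(),
        legal_form=root.findtext("legalForm"),
        city=root.findtext("address/city"),
    )
```

Missing optional elements simply come back as `None`, so a layout change in the HTML portal does not affect this code path at all.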

Status

CLI cleanup (branch: pr-cli-improvements)
SI XML parser foundation (branch: pr-si-xml-parser)
Integration of SI XML fetching into client
Fixture corpus expansion (100+ real SI XMLs)

This PR will remain as draft until the new PRs are merged.

maximiliancw · 2026-01-05T00:57:20Z

The SI XML parser skeleton is in place with 24 tests. I'm investigating the actual SI format returned by the portal to adapt the parser and build out the fixture corpus.

maximiliancw added 30 commits January 1, 2026 03:28

Update poetry.lock for new dependencies

bae6720

Revise README.md

3da6755

- Written entirely in German
- Better structure with clear sections
- Legal notices highlighted
- API parameters in clear tables
- Added table of legal forms
- Documented the state (Bundesland) filter

Convert pyproject.toml to uv/PEP 621 format

cd8c58d

- From Poetry to the standard PEP 621 [project] format
- hatchling as build backend
- [project.scripts] for the CLI entry point
- [tool.uv] for dev-dependencies
- Added pytest marker configuration
- Added black configuration

Create uv.lock and fix pyproject.toml:

825a647

- Created uv.lock file with all dependencies
- dependency-groups.dev instead of tool.uv.dev-dependencies (deprecated)
- All 25 unit tests pass with uv run pytest

Update README

180317b

- Installation with uv sync instead of poetry install
- Added pip alternative
- CLI examples: uv run handelsregister instead of poetry run python
- Tests: uv run pytest instead of poetry run pytest

Remove Poetry/Pipenv files:

a668e4f

- Removed Pipfile (Pipenv no longer used)
- Removed poetry.lock (replaced by uv.lock)
- conftest.py: removed marker definition (now in pyproject.toml)

Add mkdocs.yml and required dependencies to pyproject.toml

0a6f4ff

Create docs/index.md

1413c2e

Add mkdocs-static-i18n dependency and update package versions in pypr…

4de434a

…oject.toml and uv.lock

Add docs: quickstart.md

5be6fbe

maximiliancw added 8 commits January 3, 2026 01:13

Improve error messages with better context:

c1d652f

- Add URL context to NetworkError messages
- Include current page URL in FormError messages for debugging
- Add original_error to FormError in _navigate_to_search and _submit_search
- Improves debugging experience when errors occur

Bump version to 0.3.0 for breaking changes:

d8d4f60

- Remove deprecated cache methods from HandelsRegister (private API)
- Change pr_company_info() signature to accept Company instead of dict (public API)
- All other changes are non-breaking optimizations and improvements

Add new author to pyproject.toml

8f80a75

maximiliancw marked this pull request as draft January 3, 2026 00:33

maximiliancw marked this pull request as ready for review January 3, 2026 00:39

maximiliancw added 10 commits January 3, 2026 13:22

Add ruff for linting and formatting

5391a2a

Add pre-commit hook and GH action for linting using ruff

f04fcb2

Lint and format entire codebase using ruff

c9b4a70

Extend .gitignore

394f37f

Use Python 3.12 for lint workflow

15a622d

Use Optional for Python 3.9 compatibility:

8e14ded

- Replace `URL | None` with `Optional[URL]` in `build_url` function to support Python 3.9
- The `|` union syntax for type hints was introduced in Python 3.10 and is therefore not compatible with Python 3.9

maximiliancw added 4 commits January 3, 2026 15:52

Update docs to reflect changes in main function reference:

a3ee4cb

Changed the reference from `handelsregister.main` to `handelsregister.cli.main` in both German and English documentation files

Add shorter alias for CLI and update docs accordingly:

a893070

- Introduced the alias `hrg` for the `handelsregister` CLI command
- Updated both German and English docs to reflect this change
- Updated README.md accordingly

maximiliancw marked this pull request as draft January 4, 2026 23:32

maximiliancw closed this Jan 9, 2026
