@maximiliancw maximiliancw commented Jan 1, 2026

Summary

This PR represents a major refactoring and feature enhancement of the handelsregister package, with ~60 commits since the fork.

Key Changes

Architecture & Refactoring

  • Restructured from single-file to modular package structure
  • Migrated to Pydantic models for type safety and validation
  • Improved code organization and separation of concerns
  • Added comprehensive settings management with pydantic-settings

New Features

  • Company Details Feature: Full support for fetching detailed company information (SI, AD, UT detail types)
  • Batch Operations: search_batch() function for processing multiple searches
  • Enhanced CLI: Improved command-line interface with better options
  • Better Error Handling: Structured exception hierarchy with retry logic

Documentation

  • Multilingual documentation (English & German) with MkDocs Material
  • Comprehensive API reference with mkdocstrings
  • Integration examples (FastAPI, Django, Jupyter)
  • Improved guides and tutorials

Improvements

  • Enhanced caching with TTL support
  • Better rate limiting and retry mechanisms
  • Comprehensive test suite (unit + integration tests)
  • Backward compatibility maintained via compatibility shim

Breaking Changes

⚠️ None - Backward compatibility is maintained through the handelsregister.py compatibility shim. However, users are encouraged to migrate to the new package structure.

Testing

  • All unit tests pass
  • Integration tests pass
  • Manual CLI testing completed
  • Documentation examples verified

Migration Guide

For users of the old single-file structure:

# Old (still works, but deprecated)
from handelsregister import search
# New (recommended)
from handelsregister import search, get_details, Company, CompanyDetails

Checklist

  • Code follows project style guidelines
  • Tests added/updated
  • Documentation updated
  • Changelog updated
  • Backward compatibility maintained

- Add type hints throughout the codebase (functions, methods, variables)
- Create Company and HistoryEntry dataclasses for structured data
- Add docstrings to key functions (parse_result, pr_company_info, get_companies_in_searchresults)
- Extract SUFFIX_MAP as a module-level constant
- Use list comprehension for cell parsing
- Add null check for grid in get_companies_in_searchresults
- Use f-strings in pr_company_info
- Add custom exception hierarchy: HandelsregisterError, NetworkError,
  ParseError, FormError, CacheError
- Wrap network operations in try/except with NetworkError
- Handle form selection failures with FormError
- Add parse validation with ParseError for malformed HTML
- Handle cache read/write failures gracefully
- Create main() function with proper exit codes for each error type
- Add docstrings with Raises documentation
- Use SHA-256 hashing for cache filenames to prevent path traversal
- Add CacheEntry dataclass to store metadata with cached content
- Implement TTL-based cache expiration (default: 1 hour)
- Store cache as JSON with query, options, timestamp, and HTML
- Auto-delete expired or corrupted cache files
- Add _get_cache_key, _get_cache_path, _load_from_cache, _save_to_cache methods
- Include search options in cache key for proper cache invalidation
- Bump minimum Python version to 3.9 (3.6-3.8 are EOL)
- Remove unused mechanicalsoup dependency
- Add beautifulsoup4 as explicit dependency (was used but not declared)
- Add pytest to dev dependencies
- Add version constraints to dependencies for reproducibility
- Update tox envlist to py39, py310, py311, py312
- Add project metadata: description, license, repository, keywords
- Bump version to 0.2.0
- Replace print() debug statements with `logging` module
- Add module-level logger configuration
- Replace `if x == True:` with `if x:` (PEP 8)
- Organize imports: `stdlib` first, third-party second
- Configure logging format based on debug flag
- Enable mechanize logger in debug mode
- Use logger.debug/info/warning for appropriate log levels
- Add pytest markers: @integration and @slow for live API tests
- Skip integration tests by default (run with -m integration)
- Add conftest.py with marker configuration
- Create test fixtures: sample_search_html, mock_args, temp_cache_dir
- Add unit tests for parsing (TestParseSearchResults)
- Add unit tests for dataclasses (TestDataClasses)
- Add unit tests for cache key generation (TestCache)
- Add unit tests for suffix mapping (TestSuffixMap)
- Move live API tests to TestLiveAPI class with proper markers
- Improve test documentation and organization
- Extract SearchCache class for cache operations with configurable TTL
- Extract ResultParser class with static methods for HTML parsing
- Refactor HandelsRegister to use dependency injection for cache
- Add _create_browser() factory method for browser configuration
- Split search_company() into smaller focused methods
- Add backward-compatible aliases for deprecated functions
- Add configuration constants (BASE_URL, REQUEST_TIMEOUT)
- Improve code organization with section headers
- Add module docstring describing architecture
- Update CLI help text with examples
- Update tests to use new SearchCache class directly
- Add SearchOptions dataclass to encapsulate all search parameters
- Add STATE_CODES mapping for all 16 German states (bundesland filtering)
- Add REGISTER_TYPES list (HRA, HRB, GnR, PR, VR)
- Add RESULTS_PER_PAGE_OPTIONS (10, 25, 50, 100)
- Implement state filtering via --states CLI option
- Implement register type filtering via --register-type option
- Implement register number search via --register-number option
- Add --include-deleted flag for historical entries
- Add --similar-sounding flag for phonetic search
- Add --results-per-page option to control pagination
- Update _submit_search to set all form fields with proper error handling
- Add _build_search_options method for args to SearchOptions conversion
- Improve CLI help text with grouped arguments and examples
- Add unit tests for SearchOptions and configuration constants
- Document all new CLI arguments (--states, --register-type, etc.)
- Add state codes reference table
- Add usage examples for common scenarios
- Add testing instructions (unit vs integration tests)
- Keep original API documentation intact
- Entirely in German
- Better structure with clear sections
- Legal notices highlighted
- API parameters in clear tables
- Legal forms table added
- State (Bundesland) filter documented
- New search() function for programmatic use
- Clean Python API without argparse.Namespace
- Complete documentation with docstring and examples
- All search options available as named parameters
- Enables easy integration into other applications
- Constructor now accepts optional args (backward compatibility)
- New debug parameter for programmatic use
- New from_options() class method for SearchOptions
- New search_with_options() method as a clean API
- search_company() now delegates to search_with_options()
- German docstrings for better consistency
- Title changed to 'Handelsregister' (not just CLI)
- New section 'Using as a library' ('Verwendung als Library') with examples
- Simple API (search function) documented
- Advanced API (HandelsRegister class) documented
- Return format explained with an example dictionary
- CLI documentation moved to its own section
- Import the new search() function
- TestPublicAPI: tests for search() function and SearchOptions
- TestHandelsRegisterClass: tests for new initialization
  - test_init_without_args
  - test_init_with_debug
  - test_init_with_custom_cache
  - test_from_options_classmethod
  - test_search_company_requires_args
- Integration tests for search() and search_with_options()
- All 25 unit tests passed
- From Poetry to the standard PEP 621 [project] format
- hatchling as build backend
- [project.scripts] for CLI entry point
- [tool.uv] for dev-dependencies
- pytest marker configuration added
- black configuration added
- uv.lock file created with all dependencies
- dependency-groups.dev instead of tool.uv.dev-dependencies (deprecated)
- All 25 unit tests pass with uv run pytest
- Installation with uv sync instead of poetry install
- pip alternative added
- CLI examples: uv run handelsregister instead of poetry run python
- Tests: uv run pytest instead of poetry run pytest
- Pipfile removed (Pipenv no longer used)
- poetry.lock removed (replaced by uv.lock)
- conftest.py: marker definition removed (now in pyproject.toml)
- Address: Business address with street, postal code, city, country
- Representative: Company representatives (Geschäftsführer, Vorstand, etc.)
- Owner: Company owners/shareholders (Gesellschafter)
- CompanyDetails: Extended company information combining all detail views

These models will be used to store structured data from the SI, AD,
and UT detail views of the Handelsregister.
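For orientation, the shape of these models might look roughly like the sketch below. It uses standard-library dataclasses as a stand-in for the Pydantic models in this PR, and every field name shown is an assumption based only on the descriptions above, not the actual model definitions.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Address:
    # Field names are illustrative assumptions.
    street: Optional[str] = None
    postal_code: Optional[str] = None
    city: Optional[str] = None
    country: Optional[str] = None


@dataclass
class Representative:
    name: str
    role: str  # e.g. "Geschäftsführer" or "Vorstand"


@dataclass
class Owner:
    name: str


@dataclass
class CompanyDetails:
    name: str
    address: Optional[Address] = None
    representatives: List[Representative] = field(default_factory=list)
    owners: List[Owner] = field(default_factory=list)


# Made-up sample data, only to show how the pieces nest.
details = CompanyDetails(
    name="Example GmbH",
    address=Address(street="Musterstr. 1", postal_code="10115", city="Berlin"),
    representatives=[Representative(name="Erika Mustermann", role="Geschäftsführer")],
)
as_dict = asdict(details)  # nested dataclasses become nested dicts
```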
Add parser for extracting company details from SI (Strukturierter
Registerinhalt) HTML views:
- Parse company name, legal form, capital, currency
- Extract business address with street, postal code, city
- Parse company purpose (Unternehmensgegenstand)
- Extract representatives (Geschäftsführer, Vorstand, Prokurist)
- Smart legal form detection with priority ordering

The parser handles various HTML table structures and text patterns
commonly found in the Handelsregister detail views.
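A minimal sketch of what "priority ordering" for legal form detection could mean: more specific forms are checked before the shorter forms they contain. The pattern list and the function name `detect_legal_form` are hypothetical and are not taken from the parser in this PR.

```python
import re
from typing import Optional

# Hypothetical priority list: the most specific form comes first, so that
# "GmbH & Co. KG" is not misread as plain "GmbH" or plain "KG".
LEGAL_FORM_PATTERNS = [
    ("GmbH & Co. KG", re.compile(r"\bGmbH\s*&\s*Co\.?\s*KG\b")),
    ("GmbH", re.compile(r"\bGmbH\b")),
    ("AG", re.compile(r"\bAG\b")),
    ("KG", re.compile(r"\bKG\b")),
]


def detect_legal_form(text: str) -> Optional[str]:
    """Return the first (highest-priority) legal form found in text."""
    for label, pattern in LEGAL_FORM_PATTERNS:
        if pattern.search(text):
            return label
    return None
```

The word boundaries keep the "AG" pattern from matching inside a name such as "GASAG"; only the standalone suffix counts.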
…ister:

HandelsRegister class:
- get_company_details(): Fetch details for a single company (SI/AD/UT)
- search_with_details(): Search and fetch details in one call
- _fetch_detail_page(): Handle JSF form submission for details
- _parse_details(): Route to appropriate parser

DetailsParser class:
- parse_ad(): Parse 'Aktueller Abdruck' (current printout)
- parse_ut(): Parse 'Unternehmensträger' (company owners)
- _extract_representatives_from_text(): Extract from free-form text
- _extract_owners(): Extract owner/shareholder information

Public API:
- get_details(): Simple function to fetch details for a company

The detail fetching uses the existing mechanize session to submit
the JSF form with the appropriate control parameters.
- Add DETAILS_CACHE_TTL_SECONDS (24h) for longer caching of details
- SearchCache now accepts details_ttl_seconds parameter
- Cache.get() automatically uses longer TTL for 'details:' prefixed keys
- Add clear() method to remove cache files (optionally details only)
- Add get_stats() method for cache statistics

This allows company details to be cached for longer periods since
register data changes infrequently, while search results still use
the shorter 1-hour TTL.
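The prefix-based TTL selection described above, together with the SHA-256 cache filenames from the earlier cache commit, can be sketched as follows. Apart from DETAILS_CACHE_TTL_SECONDS, the names here (`SEARCH_TTL_SECONDS`, `cache_filename`, `ttl_for`, `is_expired`) are assumptions for illustration and may not match SearchCache.

```python
import hashlib
import time
from typing import Optional

SEARCH_TTL_SECONDS = 60 * 60               # 1 hour for search results
DETAILS_CACHE_TTL_SECONDS = 24 * 60 * 60   # 24 hours for company details


def cache_filename(key: str) -> str:
    """Hash the key so user input never becomes part of a file path."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest() + ".json"


def ttl_for(key: str) -> int:
    """Use the longer TTL for keys carrying the 'details:' prefix."""
    if key.startswith("details:"):
        return DETAILS_CACHE_TTL_SECONDS
    return SEARCH_TTL_SECONDS


def is_expired(key: str, stored_at: float, now: Optional[float] = None) -> bool:
    """True if an entry stored at stored_at is older than the TTL for its key."""
    current = time.time() if now is None else now
    return (current - stored_at) > ttl_for(key)
```

Passing `now` explicitly keeps expiry checks deterministic in unit tests.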
Extend the command-line interface with detail fetching options:

New CLI arguments:
- --details: Enable fetching of detailed company information
- --detail-type: Choose detail type (SI/AD/UT, default: SI)

New output function:
- pr_company_details(): Pretty-print CompanyDetails with all fields

The main() function now supports two modes:
1. Standard search (existing behavior)
2. Search with details (--details flag)

Example usage:
  handelsregister.py -s 'GASAG AG' --details --detail-type SI
  handelsregister.py -s 'Bank' --states BE --details --json
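A rough argparse sketch of how the flags in these examples could be wired up. The flag names come from the commit message; the `dest` names, help texts, and the `build_parser` helper are assumptions and may differ from the real CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="handelsregister.py")
    parser.add_argument("-s", dest="search", required=True, help="search term")
    parser.add_argument("--states", help="state code(s), e.g. BE")
    parser.add_argument("--details", action="store_true",
                        help="also fetch detailed company information")
    parser.add_argument("--detail-type", choices=["SI", "AD", "UT"], default="SI",
                        help="which detail view to fetch (default: SI)")
    parser.add_argument("--json", action="store_true", help="print JSON output")
    return parser


# Parse the two example invocations from the commit message.
args_a = build_parser().parse_args(
    ["-s", "GASAG AG", "--details", "--detail-type", "SI"]
)
args_b = build_parser().parse_args(
    ["-s", "Bank", "--states", "BE", "--details", "--json"]
)
```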
Library usage:

- Document get_details() function
- Show available detail types (SI, AD, UT)
- Add CompanyDetails response format example

CLI usage:

- Document --details and --detail-type options
- Add examples for detail fetching
- Show JSON output for details

The documentation explains how to fetch extended company information
including legal form, capital, address, representatives, and owners.
…mentation structure

- Changed site name and description to English.
- Updated theme language to English and modified toggle names for dark/light mode.
- Added i18n plugin configuration for multilingual support, including English and German translations.
- Translated navigation and documentation sections to English.
- Created a new German index file for localized documentation.
- Updated existing index file to reflect English content and structure.
- Simplify CompanyDetails.to_dict() to use model_dump() with mode='python'
- Pydantic automatically handles nested model serialization
- Optimize Company.to_dict() to use model_dump() with by_alias=True
- Reduces code duplication and improves maintainability
- No functional changes
- Remove _get_cache_key, _get_cache_path, _load_from_cache, _save_to_cache
- These were deprecated private methods that just delegated to SearchCache
- Use cache.get() and cache.set() directly instead
- Breaking change: private API removed (methods were already deprecated)
- Replace incomplete fallback with proper FormError exception
- Add original_error support to FormError for better error context
- Improves robustness by failing fast with clear error messages
- No silent failures from empty string returns
- Change parameter type from dict to Company for type safety
- Use attribute access instead of dict.get() for better performance
- Update history iteration to use HistoryEntry objects
- Breaking change: function signature changed (public API)
- Add URL context to NetworkError messages
- Include current page URL in FormError messages for debugging
- Add original_error to FormError in _navigate_to_search and _submit_search
- Improves debugging experience when errors occur
- Remove deprecated cache methods from HandelsRegister (private API)
- Change pr_company_info() signature to accept Company instead of dict (public API)
- All other changes are non-breaking optimizations and improvements
- Replace dict access patterns with attribute access (company.name instead of company['name'])
- Update return value descriptions from 'list of dicts' to 'list of Company objects'
- Fix pandas DataFrame examples to properly convert Company objects
- Update both English and German documentation files
- All examples now use Pydantic model attribute access
@maximiliancw maximiliancw marked this pull request as draft January 3, 2026 00:33
@maximiliancw maximiliancw marked this pull request as ready for review January 3, 2026 00:39
…ests

- Introduced a new fixture `shared_hr_client` to optimize API calls during integration tests by reusing a single instance of `HandelsRegister`
- Updated `pytest_collection_modifyitems` to skip integration tests by default unless specified
- Added critical rate limit warnings in `test_handelsregister.py` to inform users about potential API request limits and recommendations for running tests efficiently
- Improved code formatting and consistency across test file
- Update test matrix to test Python 3.9, 3.10, 3.11, and 3.12, matching the requires-python >=3.9 constraint in pyproject.toml
- Remove Python 3.7 and 3.8, which are no longer supported
- Also update GitHub Actions to latest versions (checkout@v4, setup-python@v5)
- Replace `URL | None` with `Optional[URL]` in `build_url` function to support Python 3.9
- The `|` union syntax for type hints was introduced in Python 3.10 and is therefore not compatible with Python 3.9
- Introduced a new workflow to deploy documentation to GitHub Pages upon successful completion of the Lint and Python tests workflows
- The workflow checks if documentation files have changed and only proceeds with deployment if both linting and testing workflows have passed
- Utilizes actions for checking workflow status, setting up Python, installing dependencies, building static site with MkDocs, and deploying to GitHub Pages
- Added a step to install `uv` for improved dependency management
- Updated the installation command to utilize `uv` for syncing and installing dependencies
- Modified the MkDocs build command to run through `uv`, ensuring a more efficient build process
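To illustrate the `Optional[URL]` change from the Python 3.9 compatibility commit above: the sketch below uses `str` in place of the `URL` type, and the signature and body of `build_url` are assumptions, since the real function is not shown in this conversation.

```python
from typing import Optional


# On Python 3.9 an annotation written as `str | None` is evaluated when the
# function is defined and raises TypeError; Optional[str] works on 3.9+.
def build_url(base: str, path: Optional[str] = None) -> str:
    """Join base and path with exactly one slash; return base if no path."""
    if path is None:
        return base
    return base.rstrip("/") + "/" + path.lstrip("/")
```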
@LilithWittmann
Member

Hey, thank you for your contribution. There are a bunch of nice things in there. Buut also a lot of AI-Slop. I would propose we start with merging this small feature by feature in separate PRs. Like I really like the additional support of attributes or in general the improved cli - buut there are also things that just don't make a lot of sense (e.g. --results-per-page).

Especially the more complex parsing logic is kinda hard to test and I would rather suggest offering an option to retrieve and parse the actual structured information (SI XML) part of HR instead of trying to do more and more string matching via the UT view/…. Cause there are the weirdest things in the HR and we need to build this so robust that not slight page structure changes break it over and over again (the risk for this increases with every string bs4 matching/parsing we do).

Especially the part of the code which gives me the impression of "this fetches structured information" does not use the structured information XML but string matching gives me the feeling that this is not properly self reviewed but vibe coded. I am not up for reviewing thousands of locs of vibecoded slop with some good parts. If you want to contribute this I would really encourage you do it in small chunks. The proper SI parser could be an amazing first little project. But it will need a looot of tests (so maybe a few hundred different downloaded SI XMLs)

- Changed site author to "BundesAPI Contributors" and updated copyright year to 2025
- Modified theme colors from deep purple and amber to indigo and blue
- Updated navigation labels for better understanding, including translations for "User Guide" and "Fetching Details"
- Enhanced clarity in other sections of the documentation structure
Changed the reference from `handelsregister.main` to `handelsregister.cli.main` in both German and English documentation files
- Introduced the alias `hrg` for the `handelsregister` CLI command
- Updated both German and English docs to reflect this change
- Updated README.md accordingly
…nd keyword matching:

- Introduced `State`, `KeywordMatch`, and `RegisterType` enums for better type safety and clarity in search options
- Updated the `search` function to accept these enums alongside string values for states and keyword options
- Modified the `SearchOptions` model to validate and handle both enum and string inputs for states and register types
- Updated and improved documentation to reflect these changes and provide usage examples for the new enums
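Accepting enums alongside plain strings could work along these lines. The enum names follow the commit message and the register types follow the REGISTER_TYPES list above; the member names, the single `State` member shown, and the `coerce` helper are assumptions for illustration, not the validators used in SearchOptions.

```python
from enum import Enum
from typing import Type, TypeVar, Union


class RegisterType(str, Enum):
    HRA = "HRA"
    HRB = "HRB"
    GNR = "GnR"
    PR = "PR"
    VR = "VR"


class State(str, Enum):
    BE = "BE"  # Berlin; the remaining state codes are omitted in this sketch


E = TypeVar("E", bound=Enum)


def coerce(enum_cls: Type[E], value: Union[E, str]) -> E:
    """Return value as a member of enum_cls, matching strings case-insensitively."""
    if isinstance(value, enum_cls):
        return value
    wanted = value.strip().lower()
    for member in enum_cls:
        if member.value.lower() == wanted:
            return member
    raise ValueError(f"{value!r} is not a valid {enum_cls.__name__}")
```

Failing with ValueError on unknown input keeps typos in state codes or register types from silently becoming empty filters.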
@maximiliancw
Author

Hi @LilithWittmann,

thanks for your swift feedback! I'm not hiding the fact that I used LLMs to support my work – however you're absolutely right. I'll put in some more work and split this up into multiple PRs.

@maximiliancw maximiliancw marked this pull request as draft January 4, 2026 23:32
@maximiliancw
Author

maximiliancw commented Jan 5, 2026

Split Strategy

Based on @LilithWittmann's feedback, this PR is being split into smaller, focused PRs:

New PRs

What's happening

The "structured details" feature in this PR uses HTML/label string matching via BeautifulSoup, which is fragile and breaks when the portal HTML changes. The new approach parses the actual SI XML endpoint which provides stable, machine-readable data.
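As a rough illustration of the difference, parsing a structured XML document means addressing elements by name instead of matching labels in rendered HTML. The XML below is entirely made up for this sketch; the real SI format returned by the portal is still being investigated, so the element names and the `parse_company` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; NOT the actual SI XML schema.
SAMPLE_XML = """<company>
  <name>Example GmbH</name>
  <legalForm>GmbH</legalForm>
  <address>
    <street>Musterstr. 1</street>
    <city>Berlin</city>
  </address>
  <representative role="Geschäftsführer">Erika Mustermann</representative>
</company>"""


def parse_company(xml_text: str) -> dict:
    """Read fields by element path; no label or string matching involved."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),
        "legal_form": root.findtext("legalForm"),
        "city": root.findtext("address/city"),
        "representatives": [
            {"name": (el.text or "").strip(), "role": el.get("role")}
            for el in root.findall("representative")
        ],
    }


parsed = parse_company(SAMPLE_XML)
```

A missing element simply yields `None` from `findtext`, which is easier to test against a fixture corpus than a failed text match.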

Status

  • CLI cleanup (branch: pr-cli-improvements)
  • SI XML parser foundation (branch: pr-si-xml-parser)
  • Integration of SI XML fetching into client
  • Fixture corpus expansion (100+ real SI XMLs)

This PR will remain as draft until the new PRs are merged.

@maximiliancw
Author

The SI XML parser skeleton is in place with 24 tests. I'm investigating the actual SI format returned by the portal to adapt the parser and build out the fixture corpus.
