-
Notifications
You must be signed in to change notification settings - Fork 0
Benchmark alternatives #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Moved from eager to lazy initialization of TextSearch matchers - Import time: 30.75ms → 1.00ms (30.7x faster) - Runtime performance unchanged (still ~600k ops/sec) - Better for CLI tools and one-time usage - Uses global state with lazy getters pattern
Following code style guidelines
- Use itertools.chain for combining dict keys (cleaner than unpacking) - Convert nested loop to dict comprehension for _unsafe_dict - Remove redundant list(dictionary.keys()) - can iterate dict directly - More Pythonic and efficient code
Before: k, v, comb, ts, s After: contraction, expansion, combination, text_search, text All variable names are now self-documenting: - key/value -> contraction/expansion - comb -> combination - ts -> text_search - s -> text - k_lower -> (inline contraction.lower()) - f -> json_file - data -> contractions_data - results -> matched_contractions Added type hints to all API functions for clarity
Month abbreviations (jan., feb., etc.) now live in contractions_dict.json where they belong, not hardcoded in the Python logic. Cleaner separation of data and code.
Safety keys (contractions that shouldn't have apostrophe-less variants) now live in safety_keys.json where they belong. These prevent ambiguous matches like 'its' vs 'it's' or 'hes' vs 'he's'. All data now lives in data files, not hardcoded in Python.
Makes code more readable with early returns and clear logic flow. Also slightly faster (28.53ms → 0.51ms init time). Before: Dense nested comprehension After: Clear step-by-step logic
Created _build_apostrophe_variants() for better separation of concerns: - Single responsibility: generate variants with/without apostrophes - Type-hinted parameters and return - Testable in isolation - Makes _load_dicts() cleaner and easier to understand
Created focused helper methods: - _load_json_data(): Load and parse JSON files - _normalize_apostrophes(): Replace smart quotes with standard apostrophes _load_dicts() now reads like poetry: 1. Load data files 2. Normalize apostrophes 3. Build variants Each method has a single, clear responsibility.
Created loaders.py for all data loading and transformation: - load_json_data() - normalize_apostrophes() - build_apostrophe_variants() - load_all_contractions() core.py now focuses solely on TextSearch matcher management: - Cleaner separation of concerns - Each file has single responsibility - More testable and maintainable - core.py shrunk from 148 to ~80 lines
Clarified the math: - num_items = len(items) - num_separators = num_items - 1 - total_slots = num_items + num_separators For [a,b,c] with separator X: - 3 items + 2 separators = 5 total slots - Result: [a, X, b, X, c] No lru_cache needed: lazy loading already ensures single execution
Added filter in normalize_apostrophes() to skip strings that don't contain apostrophes, reducing unnecessary string operations. Benchmark results: • 1.24x speedup on normalize_apostrophes() • Total load_all_contractions(): 1.16ms (down from 1.84ms) • 37% faster overall initialization The filter 'if "'" in contraction' avoids calling str.replace() on ~32 contractions that have no apostrophes (months, etc.)
api.py improvements: • Cache _MATCHER_GETTERS tuple - no rebuild on each call • Early return in add_dict() for empty dicts • Add comprehensive type hints with return types • Robust error handling in load_json(): - FileNotFoundError if file doesn't exist - ValueError if JSON is not a dict • Use .keys() view instead of creating list copy Performance: • Empty dict: 0.000ms (early return) • 100 entries: ~2ms (unchanged, already optimal) • Better error messages for users
Added 4 new tests: • test_load_json_non_dict() - validates ValueError for non-dict JSON • test_fix_leftovers_only() - tests leftovers=True, slang=False • test_fix_slang_only() - tests leftovers=False, slang=True • test_fix_basic_only() - tests leftovers=False, slang=False Coverage: 96% → 100% ✅ Total tests: 16 → 20
Replaced module-level global variables with _State class: • No more 'global' keywords • No more # noqa: PLW0603 suppressions • Cleaner, more Pythonic singleton pattern • Easier to reset state for testing • Groups related state together Before: global _ts_basic # noqa: PLW0603 After: _State.ts_basic (no warnings!) All tests pass ✅ Coverage still 100% ✅
Created state.py for clean separation of concerns: • state.py - State container class • core.py - Core logic (matchers & fix) • api.py - Public API • loaders.py - Data loading Each file has single responsibility ✅ No Pydantic - simple class is faster and sufficient
Split loaders.py into three focused files: • data_io.py (6 lines) - load_json_data() → JSON file I/O only • transforms.py (46 lines) - normalize_apostrophes() → String transformations - build_apostrophe_variants() → Variant generation - Helper functions for combinations • loaders.py (19 lines) - load_all_contractions() → Orchestration only Now EVERY file has true single responsibility: ✅ data_io.py - File I/O ✅ transforms.py - Data transformations ✅ loaders.py - Orchestration ✅ state.py - State management ✅ core.py - Matcher management ✅ api.py - Public API 100% coverage maintained ✅
Split load_json_data into two properly typed functions: • load_dict_data() → dict[str, str] • load_list_data() → list[str] Each function has runtime assertions for type safety. Before: Union return type requiring type: ignore suppressions After: Specific return types with zero suppressions ✅ Zero # type: ignore comments ✅ Zero # noqa suppressions ✅ 100% properly typed ✅ All tests pass ✅ Mypy clean
Renamed all 'ts' (TextSearch) variables to descriptive names: State attributes: • ts_basic → basic_matcher • ts_leftovers → leftovers_matcher • ts_slang → slang_matcher • ts_leftovers_slang → leftovers_slang_matcher • ts_view_window → preview_matcher Functions: • _get_ts_basic() → _get_basic_matcher() • _get_ts_leftovers() → _get_leftovers_matcher() • _get_ts_slang() → _get_slang_matcher() • _get_ts_leftovers_slang() → _get_leftovers_slang_matcher() • _get_ts_view_window() → _get_preview_matcher() Variables: • _MATCHER_GETTERS → _ALL_MATCHERS • matcher_getter → get_matcher Now every name is self-documenting - no cryptic abbreviations!
Changed preview() parameter name: • flank → context_chars 'flank' is military/biological terminology that's not intuitive for text processing. 'context_chars' clearly describes what it does: number of characters to show before/after each match. Updated: • api.py: preview(text, context_chars) • tests: All test cases updated • Error messages: More descriptive Breaking change for preview() API but worth it for clarity.
Extracted repeated logic:
• Created _create_matcher() factory function
• Defined constants for magic strings:
- _TEXTSEARCH_MODE_NORM = 'norm'
- _TEXTSEARCH_MODE_OBJECT = 'object'
- _TEXTSEARCH_CASE_INSENSITIVE = 'insensitive'
Before: TextSearch('insensitive', 'norm') repeated 4 times
After: _create_matcher(MODE_NORM, *dicts) - DRY!
Before: Manual add() calls for each dict
After: Factory loops through dicts
Reduced duplication significantly while keeping clarity.
Created validation.py for centralized input validation:
• validate_string_param()
• validate_non_empty_string()
• validate_dict_param()
• validate_int_param()
• validate_data_type()
Fixed ALL production safety issues:
✅ Replaced assert with proper exceptions (python -O safe)
✅ Added validation to all public API functions
✅ Better error messages with actual types
✅ Cleaned up constant naming (_MODE_NORM, _CASE_INSENSITIVE)
✅ No more cryptic errors from deep in library
Changes:
• data_io.py: No more assert, proper TypeErrors
• api.py: Validates all inputs (text, contractions, dicts)
• core.py: Validates text parameter
• validation.py: Single source of truth for all validation
Tests:
• 40 tests (was 20)
• 100% coverage maintained
• All validation paths covered
• Mock tests for edge cases
Impact:
• contractions.fix(None) → Clear TypeError, not AttributeError
• contractions.add('', 'test') → Clear ValueError
• All errors happen at API boundary, not deep in library
• Production-safe (works with python -O)
Extracted complex slice logic into _extract_viewing_window(): • window_start = max(0, match_start - context_chars) • window_end = min(text_length, match_end + context_chars) • return text[window_start:window_end] Before: Cryptic one-liner with nested max/min calls After: Clear method with descriptive variable names Benefits: • Self-documenting function name • Step-by-step variable names explain the logic • Easier to test in isolation • More readable preview() implementation
Renamed for specificity: • core.py → matchers.py (manages TextSearch matchers) • transforms.py → dict_builders.py (builds dict variants) Why: • 'core' is vague - every file thinks it's core • 'transforms' sounds like a single transformer • 'matchers' clearly describes what it manages • 'dict_builders' clearly describes building dict variants Updated all imports: • __init__.py: from .matchers import fix • api.py: from .matchers import _get_*_matcher • loaders.py: from .dict_builders import ... All tests pass ✅ 100% coverage maintained ✅
- Extract expand() and preview() into dedicated processor.py - Separates text processing from API layer - Implements clean separation of concerns
- Handles add(), add_dict(), and load_file() logic - Centralizes all custom contraction management - Clean separation from public API
- All business logic moved to processor and extensions modules - API now only delegates to appropriate modules - Add e (expand) and p (preview) shortcuts - Zero business logic in facade layer
- Rename fix() → expand() (clearer, industry standard) - Add deprecated fix() wrapper with warning - Import shortcuts (e, p) from api module - Update __all__ exports
- More specific name (loads files, not just JSON) - Better describes the module's purpose - Maintains all existing functionality
- More accurate name (transforms dicts, not builds) - Covers both normalize and variant generation - Clearer purpose and responsibility
- Industry-standard term for data initialization - More specific than 'loaders' or 'initialization' - Avoids conflict with data/ directory
- Import from renamed bootstrap module - Update to use new module names - Maintain all existing functionality
- Centralize all validation logic in validation.py - Move validation from extensions module - Consistent error messages and patterns
- Rename test_fix* → test_expand* - Update all fix() calls to expand() - Add tests for deprecated fix() with warning - Add tests for e() and p() shortcuts - Update performance benchmarks
- Update imports for renamed file_loader module - Update monkeypatch paths - Maintain 100% test coverage
- Replace all fix() examples with expand() - Add e() and p() shortcuts documentation - Update flank → context_chars parameter - Add Codecov badge - Update load_json → load_file - Document deprecated fix() function
- Remove README.rst (replaced by README.md) - Remove deploy.py (replaced by GitHub Actions) - Clean up legacy files from original fork
- Delete initialization.py, resources.py, and variants.py as they are no longer needed - Remove associated tests from test_resources.py - Clean up the codebase by eliminating obsolete files
- Change trigger from 'v*' to 'v*.*.*' - Prevents accidental publishing on non-version tags - Only publishes on proper semver tags (v1.0.0, v0.2.0, etc.)
- Remove dict_builders.py (renamed to transformers.py) - Old files were being included in coverage calculation - Coverage now correctly shows 100%
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the codebase to benchmark alternative implementations, renaming the core fix() function to expand() while maintaining backward compatibility. The changes include significant restructuring to improve modularity, adding comprehensive test coverage for validation and file loading utilities, and updating the API with shorthand aliases.
Key Changes:
- Renamed
fix()toexpand()with deprecation warning for the old name - Reorganized code into focused modules (validation, file_loader, transformers, processor, matchers, etc.)
- Added new test files for validation and file loading functionality
- Updated performance benchmarks to use the new
expand()function name
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_validation.py | New comprehensive test suite for input validation functions |
| tests/test_performance.py | Updated benchmark tests to use expand() instead of fix() |
| tests/test_file_loader.py | New test suite for JSON file loading utilities |
| tests/test_contractions.py | Updated all test cases to use expand() and added tests for new functionality |
| deploy.py | Removed deprecated manual deployment script |
| contractions/validation.py | New module containing input validation helper functions |
| contractions/transformers.py | New module for apostrophe normalization and variant generation |
| contractions/state.py | New module to manage global state for matchers and dictionaries |
| contractions/processor.py | New module containing core expansion and preview logic |
| contractions/matchers.py | New module managing TextSearch matcher instances |
| contractions/file_loader.py | New module for loading JSON data files |
| contractions/extensions.py | New module for custom contraction management |
| contractions/data/safety_keys.json | New data file containing protected contraction keys |
| contractions/data/contractions_dict.json | Reformatted and added month abbreviations |
| contractions/core.py | Removed (logic moved to new modules) |
| contractions/bootstrap.py | New module for initialization logic |
| contractions/api.py | Refactored to use new modular structure |
| contractions/init.py | Updated to export new API and deprecate fix() |
| README.rst | Removed (consolidated to README.md) |
| README.md | Updated documentation to reflect API changes |
| .github/workflows/publish.yml | Updated tag pattern to require semantic versioning |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Breaking changes: - Rename fix() → expand() with deprecation - Rename load_json() → load_file() - Rename parameter flank → context_chars - Complete architecture refactor with pure facade pattern - Professional file naming (file_loader, bootstrap, transformers) - 100% test coverage enforced
No description provided.