Skip to content

Conversation

@devjerry0
Copy link
Owner

No description provided.

- Moved from eager to lazy initialization of TextSearch matchers
- Import time: 30.75ms → 1.00ms (30.7x faster)
- Runtime performance unchanged (still ~600k ops/sec)
- Better for CLI tools and one-time usage
- Uses global state with lazy getters pattern
- Use itertools.chain for combining dict keys (cleaner than unpacking)
- Convert nested loop to dict comprehension for _unsafe_dict
- Remove redundant list(dictionary.keys()) - can iterate dict directly
- More Pythonic and efficient code
Before: k, v, comb, ts, s
After: contraction, expansion, combination, text_search, text

All variable names are now self-documenting:
- key/value -> contraction/expansion
- comb -> combination
- ts -> text_search
- s -> text
- k_lower -> (inline contraction.lower())
- f -> json_file
- data -> contractions_data
- results -> matched_contractions

Added type hints to all API functions for clarity
Month abbreviations (jan., feb., etc.) now live in contractions_dict.json
where they belong, not hardcoded in the Python logic.

Cleaner separation of data and code.
Safety keys (contractions that shouldn't have apostrophe-less variants)
now live in safety_keys.json where they belong.

These prevent ambiguous matches like 'its' vs 'it's' or 'hes' vs 'he's'.

All data now lives in data files, not hardcoded in Python.
Makes code more readable with early returns and clear logic flow.
Also slightly faster (28.53ms → 0.51ms init time).

Before: Dense nested comprehension
After: Clear step-by-step logic
Created _build_apostrophe_variants() for better separation of concerns:
- Single responsibility: generate variants with/without apostrophes
- Type-hinted parameters and return
- Testable in isolation
- Makes _load_dicts() cleaner and easier to understand
Created focused helper methods:
- _load_json_data(): Load and parse JSON files
- _normalize_apostrophes(): Replace smart quotes with standard apostrophes

_load_dicts() now reads like poetry:
1. Load data files
2. Normalize apostrophes
3. Build variants

Each method has a single, clear responsibility.
Created loaders.py for all data loading and transformation:
- load_json_data()
- normalize_apostrophes()
- build_apostrophe_variants()
- load_all_contractions()

core.py now focuses solely on TextSearch matcher management:
- Cleaner separation of concerns
- Each file has single responsibility
- More testable and maintainable
- core.py shrunk from 148 to ~80 lines
Clarified the math:
- num_items = len(items)
- num_separators = num_items - 1
- total_slots = num_items + num_separators

For [a,b,c] with separator X:
- 3 items + 2 separators = 5 total slots
- Result: [a, X, b, X, c]

No lru_cache needed: lazy loading already ensures single execution
Added filter in normalize_apostrophes() to skip strings that don't
contain apostrophes, reducing unnecessary string operations.

Benchmark results:
• 1.24x speedup on normalize_apostrophes()
• Total load_all_contractions(): 1.16ms (down from 1.84ms)
• 37% faster overall initialization

The filter 'if "'" in contraction' avoids calling str.replace()
on ~32 contractions that have no apostrophes (months, etc.)
api.py improvements:
• Cache _MATCHER_GETTERS tuple - no rebuild on each call
• Early return in add_dict() for empty dicts
• Add comprehensive type hints with return types
• Robust error handling in load_json():
  - FileNotFoundError if file doesn't exist
  - ValueError if JSON is not a dict
• Use .keys() view instead of creating list copy

Performance:
• Empty dict: 0.000ms (early return)
• 100 entries: ~2ms (unchanged, already optimal)
• Better error messages for users
Added 4 new tests:
• test_load_json_non_dict() - validates ValueError for non-dict JSON
• test_fix_leftovers_only() - tests leftovers=True, slang=False
• test_fix_slang_only() - tests leftovers=False, slang=True
• test_fix_basic_only() - tests leftovers=False, slang=False

Coverage: 96% → 100% ✅
Total tests: 16 → 20
Replaced module-level global variables with _State class:
• No more 'global' keywords
• No more # noqa: PLW0603 suppressions
• Cleaner, more Pythonic singleton pattern
• Easier to reset state for testing
• Groups related state together

Before: global _ts_basic  # noqa: PLW0603
After:  _State.ts_basic (no warnings!)

All tests pass ✅
Coverage still 100% ✅
Created state.py for clean separation of concerns:
• state.py - State container class
• core.py - Core logic (matchers & fix)
• api.py - Public API
• loaders.py - Data loading

Each file has single responsibility ✅
No Pydantic - simple class is faster and sufficient
Split loaders.py into three focused files:

• data_io.py (6 lines)
  - load_json_data() → JSON file I/O only

• transforms.py (46 lines)
  - normalize_apostrophes() → String transformations
  - build_apostrophe_variants() → Variant generation
  - Helper functions for combinations

• loaders.py (19 lines)
  - load_all_contractions() → Orchestration only

Now EVERY file has true single responsibility:
✅ data_io.py - File I/O
✅ transforms.py - Data transformations
✅ loaders.py - Orchestration
✅ state.py - State management
✅ core.py - Matcher management
✅ api.py - Public API

100% coverage maintained ✅
Split load_json_data into two properly typed functions:
• load_dict_data() → dict[str, str]
• load_list_data() → list[str]

Each function has runtime assertions for type safety.

Before: Union return type requiring type: ignore suppressions
After: Specific return types with zero suppressions

✅ Zero # type: ignore comments
✅ Zero # noqa suppressions
✅ 100% properly typed
✅ All tests pass
✅ Mypy clean
Renamed all 'ts' (TextSearch) variables to descriptive names:

State attributes:
• ts_basic → basic_matcher
• ts_leftovers → leftovers_matcher
• ts_slang → slang_matcher
• ts_leftovers_slang → leftovers_slang_matcher
• ts_view_window → preview_matcher

Functions:
• _get_ts_basic() → _get_basic_matcher()
• _get_ts_leftovers() → _get_leftovers_matcher()
• _get_ts_slang() → _get_slang_matcher()
• _get_ts_leftovers_slang() → _get_leftovers_slang_matcher()
• _get_ts_view_window() → _get_preview_matcher()

Variables:
• _MATCHER_GETTERS → _ALL_MATCHERS
• matcher_getter → get_matcher

Now every name is self-documenting - no cryptic abbreviations!
Changed preview() parameter name:
• flank → context_chars

'flank' is military/biological terminology that's not intuitive
for text processing. 'context_chars' clearly describes what it
does: number of characters to show before/after each match.

Updated:
• api.py: preview(text, context_chars)
• tests: All test cases updated
• Error messages: More descriptive

Breaking change for preview() API but worth it for clarity.
Extracted repeated logic:
• Created _create_matcher() factory function
• Defined constants for magic strings:
  - _TEXTSEARCH_MODE_NORM = 'norm'
  - _TEXTSEARCH_MODE_OBJECT = 'object'
  - _TEXTSEARCH_CASE_INSENSITIVE = 'insensitive'

Before: TextSearch('insensitive', 'norm') repeated 4 times
After: _create_matcher(MODE_NORM, *dicts) - DRY!

Before: Manual add() calls for each dict
After: Factory loops through dicts

Reduced duplication significantly while keeping clarity.
Created validation.py for centralized input validation:
• validate_string_param()
• validate_non_empty_string()
• validate_dict_param()
• validate_int_param()
• validate_data_type()

Fixed ALL production safety issues:
✅ Replaced assert with proper exceptions (python -O safe)
✅ Added validation to all public API functions
✅ Better error messages with actual types
✅ Cleaned up constant naming (_MODE_NORM, _CASE_INSENSITIVE)
✅ No more cryptic errors from deep in library

Changes:
• data_io.py: No more assert, proper TypeErrors
• api.py: Validates all inputs (text, contractions, dicts)
• core.py: Validates text parameter
• validation.py: Single source of truth for all validation

Tests:
• 40 tests (was 20)
• 100% coverage maintained
• All validation paths covered
• Mock tests for edge cases

Impact:
• contractions.fix(None) → Clear TypeError, not AttributeError
• contractions.add('', 'test') → Clear ValueError
• All errors happen at API boundary, not deep in library
• Production-safe (works with python -O)
Extracted complex slice logic into _extract_viewing_window():
• window_start = max(0, match_start - context_chars)
• window_end = min(text_length, match_end + context_chars)
• return text[window_start:window_end]

Before: Cryptic one-liner with nested max/min calls
After: Clear method with descriptive variable names

Benefits:
• Self-documenting function name
• Step-by-step variable names explain the logic
• Easier to test in isolation
• More readable preview() implementation
Renamed for specificity:
• core.py → matchers.py (manages TextSearch matchers)
• transforms.py → dict_builders.py (builds dict variants)

Why:
• 'core' is vague - every file thinks it's core
• 'transforms' sounds like a single transformer
• 'matchers' clearly describes what it manages
• 'dict_builders' clearly describes building dict variants

Updated all imports:
• __init__.py: from .matchers import fix
• api.py: from .matchers import _get_*_matcher
• loaders.py: from .dict_builders import ...

All tests pass ✅
100% coverage maintained ✅
- Extract expand() and preview() into dedicated processor.py
- Separates text processing from API layer
- Implements clean separation of concerns
- Handles add(), add_dict(), and load_file() logic
- Centralizes all custom contraction management
- Clean separation from public API
- All business logic moved to processor and extensions modules
- API now only delegates to appropriate modules
- Add e (expand) and p (preview) shortcuts
- Zero business logic in facade layer
- Rename fix() → expand() (clearer, industry standard)
- Add deprecated fix() wrapper with warning
- Import shortcuts (e, p) from api module
- Update __all__ exports
- More specific name (loads files, not just JSON)
- Better describes the module's purpose
- Maintains all existing functionality
- More accurate name (transforms dicts, not builds)
- Covers both normalize and variant generation
- Clearer purpose and responsibility
- Industry-standard term for data initialization
- More specific than 'loaders' or 'initialization'
- Avoids conflict with data/ directory
- Import from renamed bootstrap module
- Update to use new module names
- Maintain all existing functionality
- Centralize all validation logic in validation.py
- Move validation from extensions module
- Consistent error messages and patterns
- Rename test_fix* → test_expand*
- Update all fix() calls to expand()
- Add tests for deprecated fix() with warning
- Add tests for e() and p() shortcuts
- Update performance benchmarks
- Update imports for renamed file_loader module
- Update monkeypatch paths
- Maintain 100% test coverage
- Replace all fix() examples with expand()
- Add e() and p() shortcuts documentation
- Update flank → context_chars parameter
- Add Codecov badge
- Update load_json → load_file
- Document deprecated fix() function
- Remove README.rst (replaced by README.md)
- Remove deploy.py (replaced by GitHub Actions)
- Clean up legacy files from original fork
- Delete initialization.py, resources.py, and variants.py as they are no longer needed
- Remove associated tests from test_resources.py
- Clean up the codebase by eliminating obsolete files
Copilot AI review requested due to automatic review settings December 12, 2025 20:21
- Change trigger from 'v*' to 'v*.*.*'
- Prevents accidental publishing on non-version tags
- Only publishes on proper semver tags (v1.0.0, v0.2.0, etc.)
- Remove dict_builders.py (renamed to transformers.py)
- Old files were being included in coverage calculation
- Coverage now correctly shows 100%
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the codebase to benchmark alternative implementations, renaming the core fix() function to expand() while maintaining backward compatibility. The changes include significant restructuring to improve modularity, adding comprehensive test coverage for validation and file loading utilities, and updating the API with shorthand aliases.

Key Changes:

  • Renamed fix() to expand() with deprecation warning for the old name
  • Reorganized code into focused modules (validation, file_loader, transformers, processor, matchers, etc.)
  • Added new test files for validation and file loading functionality
  • Updated performance benchmarks to use the new expand() function name

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/test_validation.py New comprehensive test suite for input validation functions
tests/test_performance.py Updated benchmark tests to use expand() instead of fix()
tests/test_file_loader.py New test suite for JSON file loading utilities
tests/test_contractions.py Updated all test cases to use expand() and added tests for new functionality
deploy.py Removed deprecated manual deployment script
contractions/validation.py New module containing input validation helper functions
contractions/transformers.py New module for apostrophe normalization and variant generation
contractions/state.py New module to manage global state for matchers and dictionaries
contractions/processor.py New module containing core expansion and preview logic
contractions/matchers.py New module managing TextSearch matcher instances
contractions/file_loader.py New module for loading JSON data files
contractions/extensions.py New module for custom contraction management
contractions/data/safety_keys.json New data file containing protected contraction keys
contractions/data/contractions_dict.json Reformatted and added month abbreviations
contractions/core.py Removed (logic moved to new modules)
contractions/bootstrap.py New module for initialization logic
contractions/api.py Refactored to use new modular structure
contractions/init.py Updated to export new API and deprecate fix()
README.rst Removed (consolidated to README.md)
README.md Updated documentation to reflect API changes
.github/workflows/publish.yml Updated tag pattern to require semantic versioning

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Repository owner deleted a comment from codecov-commenter Dec 12, 2025
@codecov
Copy link

codecov bot commented Dec 12, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Breaking changes:
- Rename fix() → expand() with deprecation
- Rename load_json() → load_file()
- Rename parameter flank → context_chars
- Complete architecture refactor with pure facade pattern
- Professional file naming (file_loader, bootstrap, transformers)
- 100% test coverage enforced
@devjerry0 devjerry0 merged commit ec2b506 into main Dec 12, 2025
12 checks passed
@devjerry0 devjerry0 deleted the benchmark-alternatives branch December 12, 2025 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants