Benchmark alternatives #2

devjerry0 · 2025-12-12T20:21:47Z

No description provided.

- Moved from eager to lazy initialization of TextSearch matchers - Import time: 30.75ms → 1.00ms (30.7x faster) - Runtime performance unchanged (still ~600k ops/sec) - Better for CLI tools and one-time usage - Uses global state with lazy getters pattern

Following code style guidelines

- Use itertools.chain for combining dict keys (cleaner than unpacking) - Convert nested loop to dict comprehension for _unsafe_dict - Remove redundant list(dictionary.keys()) - can iterate dict directly - More Pythonic and efficient code

Before: k, v, comb, ts, s After: contraction, expansion, combination, text_search, text All variable names are now self-documenting: - key/value -> contraction/expansion - comb -> combination - ts -> text_search - s -> text - k_lower -> (inline contraction.lower()) - f -> json_file - data -> contractions_data - results -> matched_contractions Added type hints to all API functions for clarity

Month abbreviations (jan., feb., etc.) now live in contractions_dict.json where they belong, not hardcoded in the Python logic. Cleaner separation of data and code.

Safety keys (contractions that shouldn't have apostrophe-less variants) now live in safety_keys.json where they belong. These prevent ambiguous matches like 'its' vs 'it's' or 'hes' vs 'he's'. All data now lives in data files, not hardcoded in Python.

Makes code more readable with early returns and clear logic flow. Also slightly faster (28.53ms → 0.51ms init time). Before: Dense nested comprehension After: Clear step-by-step logic

Created _build_apostrophe_variants() for better separation of concerns: - Single responsibility: generate variants with/without apostrophes - Type-hinted parameters and return - Testable in isolation - Makes _load_dicts() cleaner and easier to understand

Created focused helper methods: - _load_json_data(): Load and parse JSON files - _normalize_apostrophes(): Replace smart quotes with standard apostrophes _load_dicts() now reads like poetry: 1. Load data files 2. Normalize apostrophes 3. Build variants Each method has a single, clear responsibility.

Created loaders.py for all data loading and transformation: - load_json_data() - normalize_apostrophes() - build_apostrophe_variants() - load_all_contractions() core.py now focuses solely on TextSearch matcher management: - Cleaner separation of concerns - Each file has single responsibility - More testable and maintainable - core.py shrunk from 148 to ~80 lines

Clarified the math: - num_items = len(items) - num_separators = num_items - 1 - total_slots = num_items + num_separators For [a,b,c] with separator X: - 3 items + 2 separators = 5 total slots - Result: [a, X, b, X, c] No lru_cache needed: lazy loading already ensures single execution

Added filter in normalize_apostrophes() to skip strings that don't contain apostrophes, reducing unnecessary string operations. Benchmark results: • 1.24x speedup on normalize_apostrophes() • Total load_all_contractions(): 1.16ms (down from 1.84ms) • 37% faster overall initialization The filter 'if "'" in contraction' avoids calling str.replace() on ~32 contractions that have no apostrophes (months, etc.)

api.py improvements: • Cache _MATCHER_GETTERS tuple - no rebuild on each call • Early return in add_dict() for empty dicts • Add comprehensive type hints with return types • Robust error handling in load_json(): - FileNotFoundError if file doesn't exist - ValueError if JSON is not a dict • Use .keys() view instead of creating list copy Performance: • Empty dict: 0.000ms (early return) • 100 entries: ~2ms (unchanged, already optimal) • Better error messages for users

Added 4 new tests: • test_load_json_non_dict() - validates ValueError for non-dict JSON • test_fix_leftovers_only() - tests leftovers=True, slang=False • test_fix_slang_only() - tests leftovers=False, slang=True • test_fix_basic_only() - tests leftovers=False, slang=False Coverage: 96% → 100% ✅ Total tests: 16 → 20

Replaced module-level global variables with _State class: • No more 'global' keywords • No more # noqa: PLW0603 suppressions • Cleaner, more Pythonic singleton pattern • Easier to reset state for testing • Groups related state together Before: global _ts_basic # noqa: PLW0603 After: _State.ts_basic (no warnings!) All tests pass ✅ Coverage still 100% ✅

Created state.py for clean separation of concerns: • state.py - State container class • core.py - Core logic (matchers & fix) • api.py - Public API • loaders.py - Data loading Each file has single responsibility ✅ No Pydantic - simple class is faster and sufficient

Split loaders.py into three focused files: • data_io.py (6 lines) - load_json_data() → JSON file I/O only • transforms.py (46 lines) - normalize_apostrophes() → String transformations - build_apostrophe_variants() → Variant generation - Helper functions for combinations • loaders.py (19 lines) - load_all_contractions() → Orchestration only Now EVERY file has true single responsibility: ✅ data_io.py - File I/O ✅ transforms.py - Data transformations ✅ loaders.py - Orchestration ✅ state.py - State management ✅ core.py - Matcher management ✅ api.py - Public API 100% coverage maintained ✅

Split load_json_data into two properly typed functions: • load_dict_data() → dict[str, str] • load_list_data() → list[str] Each function has runtime assertions for type safety. Before: Union return type requiring type: ignore suppressions After: Specific return types with zero suppressions ✅ Zero # type: ignore comments ✅ Zero # noqa suppressions ✅ 100% properly typed ✅ All tests pass ✅ Mypy clean

Renamed all 'ts' (TextSearch) variables to descriptive names: State attributes: • ts_basic → basic_matcher • ts_leftovers → leftovers_matcher • ts_slang → slang_matcher • ts_leftovers_slang → leftovers_slang_matcher • ts_view_window → preview_matcher Functions: • _get_ts_basic() → _get_basic_matcher() • _get_ts_leftovers() → _get_leftovers_matcher() • _get_ts_slang() → _get_slang_matcher() • _get_ts_leftovers_slang() → _get_leftovers_slang_matcher() • _get_ts_view_window() → _get_preview_matcher() Variables: • _MATCHER_GETTERS → _ALL_MATCHERS • matcher_getter → get_matcher Now every name is self-documenting - no cryptic abbreviations!

Changed preview() parameter name: • flank → context_chars 'flank' is military/biological terminology that's not intuitive for text processing. 'context_chars' clearly describes what it does: number of characters to show before/after each match. Updated: • api.py: preview(text, context_chars) • tests: All test cases updated • Error messages: More descriptive Breaking change for preview() API but worth it for clarity.

Extracted repeated logic: • Created _create_matcher() factory function • Defined constants for magic strings: - _TEXTSEARCH_MODE_NORM = 'norm' - _TEXTSEARCH_MODE_OBJECT = 'object' - _TEXTSEARCH_CASE_INSENSITIVE = 'insensitive' Before: TextSearch('insensitive', 'norm') repeated 4 times After: _create_matcher(MODE_NORM, *dicts) - DRY! Before: Manual add() calls for each dict After: Factory loops through dicts Reduced duplication significantly while keeping clarity.

Created validation.py for centralized input validation: • validate_string_param() • validate_non_empty_string() • validate_dict_param() • validate_int_param() • validate_data_type() Fixed ALL production safety issues: ✅ Replaced assert with proper exceptions (python -O safe) ✅ Added validation to all public API functions ✅ Better error messages with actual types ✅ Cleaned up constant naming (_MODE_NORM, _CASE_INSENSITIVE) ✅ No more cryptic errors from deep in library Changes: • data_io.py: No more assert, proper TypeErrors • api.py: Validates all inputs (text, contractions, dicts) • core.py: Validates text parameter • validation.py: Single source of truth for all validation Tests: • 40 tests (was 20) • 100% coverage maintained • All validation paths covered • Mock tests for edge cases Impact: • contractions.fix(None) → Clear TypeError, not AttributeError • contractions.add('', 'test') → Clear ValueError • All errors happen at API boundary, not deep in library • Production-safe (works with python -O)

Extracted complex slice logic into _extract_viewing_window(): • window_start = max(0, match_start - context_chars) • window_end = min(text_length, match_end + context_chars) • return text[window_start:window_end] Before: Cryptic one-liner with nested max/min calls After: Clear method with descriptive variable names Benefits: • Self-documenting function name • Step-by-step variable names explain the logic • Easier to test in isolation • More readable preview() implementation

Renamed for specificity: • core.py → matchers.py (manages TextSearch matchers) • transforms.py → dict_builders.py (builds dict variants) Why: • 'core' is vague - every file thinks it's core • 'transforms' sounds like a single transformer • 'matchers' clearly describes what it manages • 'dict_builders' clearly describes building dict variants Updated all imports: • __init__.py: from .matchers import fix • api.py: from .matchers import _get_*_matcher • loaders.py: from .dict_builders import ... All tests pass ✅ 100% coverage maintained ✅

- Extract expand() and preview() into dedicated processor.py - Separates text processing from API layer - Implements clean separation of concerns

- Handles add(), add_dict(), and load_file() logic - Centralizes all custom contraction management - Clean separation from public API

- All business logic moved to processor and extensions modules - API now only delegates to appropriate modules - Add e (expand) and p (preview) shortcuts - Zero business logic in facade layer

- Rename fix() → expand() (clearer, industry standard) - Add deprecated fix() wrapper with warning - Import shortcuts (e, p) from api module - Update __all__ exports

- More specific name (loads files, not just JSON) - Better describes the module's purpose - Maintains all existing functionality

- More accurate name (transforms dicts, not builds) - Covers both normalize and variant generation - Clearer purpose and responsibility

- Industry-standard term for data initialization - More specific than 'loaders' or 'initialization' - Avoids conflict with data/ directory

- Import from renamed bootstrap module - Update to use new module names - Maintain all existing functionality

- Centralize all validation logic in validation.py - Move validation from extensions module - Consistent error messages and patterns

- Rename test_fix* → test_expand* - Update all fix() calls to expand() - Add tests for deprecated fix() with warning - Add tests for e() and p() shortcuts - Update performance benchmarks

- Update imports for renamed file_loader module - Update monkeypatch paths - Maintain 100% test coverage

- Replace all fix() examples with expand() - Add e() and p() shortcuts documentation - Update flank → context_chars parameter - Add Codecov badge - Update load_json → load_file - Document deprecated fix() function

- Remove README.rst (replaced by README.md) - Remove deploy.py (replaced by GitHub Actions) - Clean up legacy files from original fork

- Delete initialization.py, resources.py, and variants.py as they are no longer needed - Remove associated tests from test_resources.py - Clean up the codebase by eliminating obsolete files

- Change trigger from 'v*' to 'v*.*.*' - Prevents accidental publishing on non-version tags - Only publishes on proper semver tags (v1.0.0, v0.2.0, etc.)

- Remove dict_builders.py (renamed to transformers.py) - Old files were being included in coverage calculation - Coverage now correctly shows 100%

Copilot

Pull request overview

This PR refactors the codebase to benchmark alternative implementations, renaming the core fix() function to expand() while maintaining backward compatibility. The changes include significant restructuring to improve modularity, adding comprehensive test coverage for validation and file loading utilities, and updating the API with shorthand aliases.

Key Changes:

Renamed fix() to expand() with deprecation warning for the old name
Reorganized code into focused modules (validation, file_loader, transformers, processor, matchers, etc.)
Added new test files for validation and file loading functionality
Updated performance benchmarks to use the new expand() function name

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
tests/test_validation.py	New comprehensive test suite for input validation functions
tests/test_performance.py	Updated benchmark tests to use `expand()` instead of `fix()`
tests/test_file_loader.py	New test suite for JSON file loading utilities
tests/test_contractions.py	Updated all test cases to use `expand()` and added tests for new functionality
deploy.py	Removed deprecated manual deployment script
contractions/validation.py	New module containing input validation helper functions
contractions/transformers.py	New module for apostrophe normalization and variant generation
contractions/state.py	New module to manage global state for matchers and dictionaries
contractions/processor.py	New module containing core expansion and preview logic
contractions/matchers.py	New module managing TextSearch matcher instances
contractions/file_loader.py	New module for loading JSON data files
contractions/extensions.py	New module for custom contraction management
contractions/data/safety_keys.json	New data file containing protected contraction keys
contractions/data/contractions_dict.json	Reformatted and added month abbreviations
contractions/core.py	Removed (logic moved to new modules)
contractions/bootstrap.py	New module for initialization logic
contractions/api.py	Refactored to use new modular structure
contractions/init.py	Updated to export new API and deprecate `fix()`
README.rst	Removed (consolidated to README.md)
README.md	Updated documentation to reflect API changes
.github/workflows/publish.yml	Updated tag pattern to require semantic versioning

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

codecov · 2025-12-12T20:28:36Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Breaking changes: - Rename fix() → expand() with deprecation - Rename load_json() → load_file() - Rename parameter flank → context_chars - Complete architecture refactor with pure facade pattern - Professional file naming (file_loader, bootstrap, transformers) - 100% test coverage enforced

devjerry0 added 30 commits December 12, 2025 13:32

refactor: use early returns instead of if-elif-else

e2700ed

Following code style guidelines

refactor: move month abbreviations from code to data file

d6fb246

Month abbreviations (jan., feb., etc.) now live in contractions_dict.json where they belong, not hardcoded in the Python logic. Cleaner separation of data and code.

refactor: replace nested comprehension with explicit loop

7bfc868

Makes code more readable with early returns and clear logic flow. Also slightly faster (28.53ms → 0.51ms init time). Before: Dense nested comprehension After: Clear step-by-step logic

feat: create processor module for text processing

c48bae7

- Extract expand() and preview() into dedicated processor.py - Separates text processing from API layer - Implements clean separation of concerns

feat: create extensions module for user customizations

019ea08

- Handles add(), add_dict(), and load_file() logic - Centralizes all custom contraction management - Clean separation from public API

refactor: convert api.py to pure facade pattern

a9cc1d7

- All business logic moved to processor and extensions modules - API now only delegates to appropriate modules - Add e (expand) and p (preview) shortcuts - Zero business logic in facade layer

feat: rename fix() to expand() with deprecation

eafd7f5

- Rename fix() → expand() (clearer, industry standard) - Add deprecated fix() wrapper with warning - Import shortcuts (e, p) from api module - Update __all__ exports

refactor: rename json_loader.py → file_loader.py

deb849b

- More specific name (loads files, not just JSON) - Better describes the module's purpose - Maintains all existing functionality

refactor: rename dict_builders.py → transformers.py

649ddbe

- More accurate name (transforms dicts, not builds) - Covers both normalize and variant generation - Clearer purpose and responsibility

devjerry0 added 8 commits December 12, 2025 15:18

refactor: rename loaders.py → bootstrap.py

8d3f86c

- Industry-standard term for data initialization - More specific than 'loaders' or 'initialization' - Avoids conflict with data/ directory

refactor: update imports in matchers.py

ba3b22a

- Import from renamed bootstrap module - Update to use new module names - Maintain all existing functionality

feat: add validate_file_contains_dict() to validation

aa8e801

- Centralize all validation logic in validation.py - Move validation from extensions module - Consistent error messages and patterns

test: update tests for expand() and new API

a87f73b

- Rename test_fix* → test_expand* - Update all fix() calls to expand() - Add tests for deprecated fix() with warning - Add tests for e() and p() shortcuts - Update performance benchmarks

test: rename test_data_io.py → test_file_loader.py

020508b

- Update imports for renamed file_loader module - Update monkeypatch paths - Maintain 100% test coverage

docs: update README for expand() and new API

1a81d47

- Replace all fix() examples with expand() - Add e() and p() shortcuts documentation - Update flank → context_chars parameter - Add Codecov badge - Update load_json → load_file - Document deprecated fix() function

chore: remove obsolete files

f995b8c

- Remove README.rst (replaced by README.md) - Remove deploy.py (replaced by GitHub Actions) - Clean up legacy files from original fork

chore: remove unused modules and tests

740b00b

- Delete initialization.py, resources.py, and variants.py as they are no longer needed - Remove associated tests from test_resources.py - Clean up the codebase by eliminating obsolete files

Copilot AI review requested due to automatic review settings December 12, 2025 20:21

Copilot started reviewing on behalf of devjerry0 December 12, 2025 20:22 View session

devjerry0 added 2 commits December 12, 2025 15:23

fix: restrict publish workflow to version tags only

d3312cb

- Change trigger from 'v*' to 'v*.*.*' - Prevents accidental publishing on non-version tags - Only publishes on proper semver tags (v1.0.0, v0.2.0, etc.)

chore: remove old renamed files to restore 100% coverage

97ca413

- Remove dict_builders.py (renamed to transformers.py) - Old files were being included in coverage calculation - Coverage now correctly shows 100%

Copilot AI reviewed Dec 12, 2025

View reviewed changes

Repository owner deleted a comment from codecov-commenter Dec 12, 2025

devjerry0 merged commit ec2b506 into main Dec 12, 2025
12 checks passed

devjerry0 deleted the benchmark-alternatives branch December 12, 2025 21:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmark alternatives #2

Benchmark alternatives #2

Uh oh!

devjerry0 commented Dec 12, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

codecov bot commented Dec 12, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Benchmark alternatives #2

Benchmark alternatives #2

Uh oh!

Conversation

devjerry0 commented Dec 12, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

codecov bot commented Dec 12, 2025

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants