Fix type hint inconsistencies and improve type precision#1232
Conversation
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
- Fix variable shadowing by renaming ref_sample and sample to ref_sample_arr and sample_arr - Update return type of _find_words_correctly_tokenised to tuple[int, ...] - Rename tokenization_indicators to tokenization_indicators_str to avoid type confusion
- Change tokenize parameter type from object to Callable[[str], list[str]] - Change modify_sent parameter sent type from str to list[str] - Update augment return type to list[tuple[str, ...]] for variable-length tuples
- Change augment return type from list[tuple[str]] to list[str] - Return type matches actual implementation which returns list of strings
- Add type annotation for sent2 as list[str] in both generate and augment methods - Use local variable sent2 instead of instance variable self.sent2 for clarity
- Add type annotation for _tagger as Optional[Any] - Add type: ignore[assignment] for Parse class import conflicts - Fix model parameter None handling with default values - Add type: ignore[misc] for _tagger callable check
- Add type annotations for weights, classes, _totals, _tstamps attributes - Add type annotations for scores, features, counts local variables - Fix data type annotation to dict[str, Union[dict, list]] - Convert sentences to list before shuffling to fix Iterable shuffling issue - Import Union from typing module
- Change engine attribute type to Any to handle different engine types - Add type: ignore[assignment] for ThaiNameTagger import conflict - Add type: ignore[call-arg] for wangchanberta ThaiNameTagger constructor - Add type: ignore[union-attr] for engine.get_ner call - Import Any from typing module
- Add type annotation for _convert_token_to_id parameter and return value - Add type check and cast for encoder to ensure it's a dict - Fix save_vocabulary return type to tuple[str, str] - Convert Path to str in save_json call - Add null check in _convert_id_to_token - Import cast from typing module
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
…transliterate/royin.py
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Verified all changes maintain and enhance file safety measures. My changes to |
There was a problem hiding this comment.
Pull request overview
This PR addresses type hint inconsistencies, generator function fixes, import organization, and file safety enhancements across multiple modules in PyThaiNLP.
Changes:
- Fixed generator function signatures in
tcc.pyandtcc_p.pyto properly returnIterator[str]instead ofstr, removing unreachableyieldstatements - Updated type annotations across numerous modules for better precision (with some exceptions noted below)
- Added conditional import for
importlib_resourcesfor Python < 3.11 compatibility with proper fallback handling - Enhanced file safety measures in
corpus/core.pywith None checks before file operations - Fixed import sorting issues and added appropriate
# noqa: I001markers for conditional imports - Updated tests to match new behavior where
find_keyword([])returns{}instead of raising an error
Reviewed changes
Copilot reviewed 40 out of 40 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pythainlp/tokenize/tcc.py, tcc_p.py | Fixed generator function signatures from -> str to -> Iterator[str] |
| pythainlp/corpus/common.py | Modified type annotations for _THAI_WSD_DICT and _THAI_SYNONYMS |
| pythainlp/tools/path.py | Added conditional import for importlib_resources with proper type handling |
| pythainlp/corpus/core.py | Enhanced file safety with explicit None checks |
| pythainlp/util/*.py | Various type hint improvements (keywords, date, time, numtoword, etc.) |
| pythainlp/transliterate/*.py | Type hint improvements and conditional imports |
| pythainlp/tag/*.py | Type hint improvements and None checks |
| pythainlp/augment/*.py | Type hint improvements |
| pythainlp/parse/core.py | Added type ignores for conditional imports |
| tests/core/test_util.py | Updated test to match new graceful handling of empty list |
| pyproject.toml | Added importlib_resources dependency for Python < 3.11 |
|
@copilot apply changes based on the comments in this thread |
Co-authored-by: bact <128572+bact@users.noreply.github.com>
|


What does this changes
Fixes type hint inconsistencies across multiple modules including improving type annotation precision, fixing generator function patterns, clarifying type ignore comments, resolving ruff import sorting errors, and enhancing file safety measures.
What was wrong
Several issues were identified during code review:
corpus/common.py: Type annotations for
_THAI_WSD_DICTand_THAI_SYNONYMSneeded more precise typing to reflect heterogeneous value types (some keys holdlist[str], others holdlist[list[str]])tokenize/tcc.py and tcc_p.py: Generator functions had unreachable
yieldstatements afterreturn, creating dead code. The original code incorrectly had signature-> strwith generator implementation usingyield, and returned""for invalid input which is incompatible with generator functions.augment/word2vec/core.py: The
modify_sentmethod had incorrect type annotation (sent: str) that didn't match its actual implementation which expectslist[str]translate/tokenization_small100.py: Incomplete cast using
cast(dict, ...)without type parameterstools/path.py: Needed clarification on why
type: ignore[import-not-found,no-redef]requires both error codes for theimportlib_resourcesfallback importMultiple files: Ruff import sorting warnings (I001) for conditional/dynamic imports that cannot be reordered
File safety: Need to maintain file access safety measures from recent PRs
How this fixes it
Type Precision Improvements:
_THAI_WSD_DICTand_THAI_SYNONYMStodict[str, Union[list[str], list[list[str]]]]to accurately reflect heterogeneous value types where some keys holdlist[str](like "word", "pos") and others holdlist[list[str]](like "meaning", "synonym")tokenization_small100.pyfromcast(dict, encoder_data)tocast(dict[str, int], encoder_data)to provide complete type informationmodify_sentsignature fromsent: strtosent: list[str]to match actual implementation (this was a bug fix - the original annotation was incorrect)Generator Pattern Fix:
yieldstatements intcc()andtcc_p()functions-> strto-> Iterator[str]to correctly reflect generator behaviortcc("")returns empty iterator[], not['']Documentation:
tools/path.pyexplaining mypy'sno-redefcheck operates on variable names, not module sourcesimportlib.resourcesandimportlib_resourcesare different modules, but both assign tofiles, triggering the redefinition checkRuff Compliance:
tools/path.py# noqa: I001comments for conditional/dynamic imports inparse/core.py,tag/named_entity.py, andtransliterate/core.pythat cannot be moved to top-levelFile Safety Enhancements:
corpus/core.pybefore callingget_full_data_path(),os.remove(), andshutil.rmtree()get_path_folder_corpus()to raise descriptive ValueError instead of letting None propagatecast(PathLike[str], ...)intools/path.pyfor type safetyYour checklist for this pull request
Original prompt
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.