Skip to content

Fix type hint inconsistencies and improve type precision#1232

Merged
bact merged 32 commits intodevfrom
copilot/fix-inconsistent-type-hints
Jan 30, 2026
Merged

Fix type hint inconsistencies and improve type precision#1232
bact merged 32 commits intodevfrom
copilot/fix-inconsistent-type-hints

Conversation

Copy link
Contributor

Copilot AI commented Jan 29, 2026

What does this changes

Fixes type hint inconsistencies across multiple modules including improving type annotation precision, fixing generator function patterns, clarifying type ignore comments, resolving ruff import sorting errors, and enhancing file safety measures.

What was wrong

Several issues were identified during code review:

  1. corpus/common.py: Type annotations for _THAI_WSD_DICT and _THAI_SYNONYMS needed more precise typing to reflect heterogeneous value types (some keys hold list[str], others hold list[list[str]])

  2. tokenize/tcc.py and tcc_p.py: Generator functions had unreachable yield statements after return, creating dead code. The original code incorrectly had signature -> str with generator implementation using yield, and returned "" for invalid input which is incompatible with generator functions.

  3. augment/word2vec/core.py: The modify_sent method had incorrect type annotation (sent: str) that didn't match its actual implementation which expects list[str]

  4. translate/tokenization_small100.py: Incomplete cast using cast(dict, ...) without type parameters

  5. tools/path.py: Needed clarification on why type: ignore[import-not-found,no-redef] requires both error codes for the importlib_resources fallback import

  6. Multiple files: Ruff import sorting warnings (I001) for conditional/dynamic imports that cannot be reordered

  7. File safety: Need to maintain file access safety measures from recent PRs

How this fixes it

Type Precision Improvements:

  • Updated _THAI_WSD_DICT and _THAI_SYNONYMS to dict[str, Union[list[str], list[list[str]]]] to accurately reflect heterogeneous value types where some keys hold list[str] (like "word", "pos") and others hold list[list[str]] (like "meaning", "synonym")
  • Fixed incomplete cast in tokenization_small100.py from cast(dict, encoder_data) to cast(dict[str, int], encoder_data) to provide complete type information
  • Corrected modify_sent signature from sent: str to sent: list[str] to match actual implementation (this was a bug fix - the original annotation was incorrect)

Generator Pattern Fix:

  • Removed unreachable yield statements in tcc() and tcc_p() functions
  • Fixed function signature from -> str to -> Iterator[str] to correctly reflect generator behavior
  • Functions now properly return early for invalid input, creating empty iterators (not yielding empty strings)
  • Verified behavior matches existing tests: tcc("") returns empty iterator [], not ['']

Documentation:

  • Added clarifying comment in tools/path.py explaining mypy's no-redef check operates on variable names, not module sources
  • Both importlib.resources and importlib_resources are different modules, but both assign to files, triggering the redefinition check

Ruff Compliance:

  • Fixed import formatting in tools/path.py
  • Added # noqa: I001 comments for conditional/dynamic imports in parse/core.py, tag/named_entity.py, and transliterate/core.py that cannot be moved to top-level

File Safety Enhancements:

  • Added explicit None checks in corpus/core.py before calling get_full_data_path(), os.remove(), and shutil.rmtree()
  • Enhanced get_path_folder_corpus() to raise descriptive ValueError instead of letting None propagate
  • Added cast(PathLike[str], ...) in tools/path.py for type safety
  • All changes maintain and enhance safety measures from recent PRs, preventing unsafe file operations

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit test
Original prompt

Find in the code, code comments, and documentation for wrong type hints or inconsistent type hints.

Fix them all, in every modules and submodules. Inside and outside functions.
Fix one submodule at a time, review, revise, commit, then iterate to other submodules.

Keep the Python 3.9 compatibility (with import future annotations).
Do not the | union syntax as it will fail at runtime in Python 3.9 when introspection tools call typing.get_type_hints().
We need our modules to be useable by static analysis tools, documentation generators, and runtime type inspection.

Use if TYPE_CHECKING when appropriate to put the type-check related imports into their own block.

mypy, pyrefly, and pyright can be used to help find type issues.


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI and others added 21 commits January 29, 2026 20:54
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
- Fix variable shadowing by renaming ref_sample and sample to ref_sample_arr and sample_arr
- Update return type of _find_words_correctly_tokenised to tuple[int, ...]
- Rename tokenization_indicators to tokenization_indicators_str to avoid type confusion
- Change tokenize parameter type from object to Callable[[str], list[str]]
- Change modify_sent parameter sent type from str to list[str]
- Update augment return type to list[tuple[str, ...]] for variable-length tuples
- Change augment return type from list[tuple[str]] to list[str]
- Return type matches actual implementation which returns list of strings
- Add type annotation for sent2 as list[str] in both generate and augment methods
- Use local variable sent2 instead of instance variable self.sent2 for clarity
- Add type annotation for _tagger as Optional[Any]
- Add type: ignore[assignment] for Parse class import conflicts
- Fix model parameter None handling with default values
- Add type: ignore[misc] for _tagger callable check
- Add type annotations for weights, classes, _totals, _tstamps attributes
- Add type annotations for scores, features, counts local variables
- Fix data type annotation to dict[str, Union[dict, list]]
- Convert sentences to list before shuffling to fix Iterable shuffling issue
- Import Union from typing module
- Change engine attribute type to Any to handle different engine types
- Add type: ignore[assignment] for ThaiNameTagger import conflict
- Add type: ignore[call-arg] for wangchanberta ThaiNameTagger constructor
- Add type: ignore[union-attr] for engine.get_ner call
- Import Any from typing module
- Add type annotation for _convert_token_to_id parameter and return value
- Add type check and cast for encoder to ensure it's a dict
- Fix save_vocabulary return type to tuple[str, str]
- Convert Path to str in save_json call
- Add null check in _convert_id_to_token
- Import cast from typing module
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix inconsistent type hints across all modules Fix type hints for Python 3.9 compatibility and runtime introspection Jan 29, 2026
Copilot AI requested a review from bact January 29, 2026 21:23
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI changed the title Document necessity of no-redef for importlib_resources fallback import Fix type hint inconsistencies and address code review feedback Jan 29, 2026
Copilot AI requested a review from bact January 29, 2026 22:24
Copy link
Member

@bact bact left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot please see the comment.

review changes one more time.
does there anything unnecessary or can be cleaner?

Copilot AI requested a review from bact January 29, 2026 22:38
@bact bact marked this pull request as ready for review January 29, 2026 22:41
@bact
Copy link
Member

bact commented Jan 29, 2026

@copilot we had quite recently made changes in PRs like #1225 and #1212 to make the access to files more safe. Please make sure that this PR will keep those safety measures.

Copy link
Contributor Author

Copilot AI commented Jan 29, 2026

@copilot we had quite recently made changes in PRs like #1225 and #1212 to make the access to files more safe. Please make sure that this PR will keep those safety measures.

Verified all changes maintain and enhance file safety measures. My changes to corpus/core.py add explicit None checks before file operations (os.remove, shutil.rmtree, get_full_data_path) and raise descriptive ValueError in get_path_folder_corpus instead of letting None propagate. Changes to tools/path.py only add type safety (cast) and Python version compatibility (importlib_resources fallback) without changing behavior. All path functions tested and working correctly.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses type hint inconsistencies, generator function fixes, import organization, and file safety enhancements across multiple modules in PyThaiNLP.

Changes:

  • Fixed generator function signatures in tcc.py and tcc_p.py to properly return Iterator[str] instead of str, removing unreachable yield statements
  • Updated type annotations across numerous modules for better precision (with some exceptions noted below)
  • Added conditional import for importlib_resources for Python < 3.11 compatibility with proper fallback handling
  • Enhanced file safety measures in corpus/core.py with None checks before file operations
  • Fixed import sorting issues and added appropriate # noqa: I001 markers for conditional imports
  • Updated tests to match new behavior where find_keyword([]) returns {} instead of raising an error

Reviewed changes

Copilot reviewed 40 out of 40 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pythainlp/tokenize/tcc.py, tcc_p.py Fixed generator function signatures from -> str to -> Iterator[str]
pythainlp/corpus/common.py Modified type annotations for _THAI_WSD_DICT and _THAI_SYNONYMS
pythainlp/tools/path.py Added conditional import for importlib_resources with proper type handling
pythainlp/corpus/core.py Enhanced file safety with explicit None checks
pythainlp/util/*.py Various type hint improvements (keywords, date, time, numtoword, etc.)
pythainlp/transliterate/*.py Type hint improvements and conditional imports
pythainlp/tag/*.py Type hint improvements and None checks
pythainlp/augment/*.py Type hint improvements
pythainlp/parse/core.py Added type ignores for conditional imports
tests/core/test_util.py Updated test to match new graceful handling of empty list
pyproject.toml Added importlib_resources dependency for Python < 3.11

@bact
Copy link
Member

bact commented Jan 29, 2026

@copilot apply changes based on the comments in this thread

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI changed the title Fix type hint inconsistencies and address code review feedback Fix type hint inconsistencies and improve type precision Jan 29, 2026
@sonarqubecloud
Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

@bact bact merged commit 7cc69e9 into dev Jan 30, 2026
30 of 32 checks passed
@bact bact deleted the copilot/fix-inconsistent-type-hints branch January 30, 2026 00:01
@bact bact added the refactoring a technical improvement which does not add any new features or change existing features. label Jan 30, 2026
@bact bact added this to the 5.3 milestone Jan 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

refactoring a technical improvement which does not add any new features or change existing features.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants