Skip to content

fix: modify getTextLength to use BreakIterator for grapheme count#499

Open
Pranavjeet-Naidu wants to merge 3 commits intoopenzim:mainfrom
Pranavjeet-Naidu:fix/correct_grapheme_count
Open

fix: modify getTextLength to use BreakIterator for grapheme count#499
Pranavjeet-Naidu wants to merge 3 commits intoopenzim:mainfrom
Pranavjeet-Naidu:fix/correct_grapheme_count

Conversation

@Pranavjeet-Naidu
Copy link
Copy Markdown
Contributor

brings up a fix for #498.
Used ICU's BreakIterator, took the definition code from their documentation + added a status check as well.
Please suggest changes if needed, considering this is my first time working with this.

Copy link
Copy Markdown
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix begs a unit test. Therefore the PR should be implemented as follows:

Commit 1: move the function getTextLength() to src/tools.cpp and declare it in src/tools.h.
Commit 2: add a unit test in test/tools-test.cpp. It must include a test-point reflecting the bug being addressed.
Commit 3: fix the bug

@Pranavjeet-Naidu
Copy link
Copy Markdown
Contributor Author

@veloman-yunkan, the three commits like you've asked are up, once you give a go ahead, I will squash them as well.

Copy link
Copy Markdown
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@veloman-yunkan, the three commits like you've asked are up, once you give a go ahead, I will squash them as well.

The commits must not be squashed. The PR must consist of the said three commits at the time of merge. Now there are four commits.

Please also don't abuse the "fix:" label in the commit title. The commit message must not be misleading - adding a unit test is not a fix. Also it must be informative. For example, if a new unit test fails because it demonstrates a pre-existing bug, that must be stated in the commit description.

Signed-off-by: Pranavjeet-Naidu <pranavjeetnaidu@gmail.com>
The new test cases include grapheme clusters (e.g., emoji with skin tone
modifiers) which demonstrate bug openzim#498 - the current implementation counts
code points instead of graphemes.

Signed-off-by: Pranavjeet-Naidu <pranavjeetnaidu@gmail.com>
@Pranavjeet-Naidu Pranavjeet-Naidu force-pushed the fix/correct_grapheme_count branch 2 times, most recently from b25563f to 486325e Compare March 31, 2026 19:10
Use ICU's BreakIterator to properly count grapheme clusters. This fixes
cases where characters like emoji with skin tone modifiers (e.g., 👍)
were incorrectly counted as multiple characters.

- Update build configuration to include ICU dependency for tools.cpp
- Add tools.cpp to metadata-test dependencies

Fixes openzim#498

Signed-off-by: Pranavjeet-Naidu <pranavjeetnaidu@gmail.com>
@Pranavjeet-Naidu Pranavjeet-Naidu force-pushed the fix/correct_grapheme_count branch from 486325e to 834ff17 Compare March 31, 2026 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants