feat: regex for counter validation by Pranavjeet-Naidu · Pull Request #497 · openzim/zim-tools

Pranavjeet-Naidu · 2026-03-19T08:10:32Z

Solves #473

The added regex also has tests that validate it, have been run locally and verified.
( inspiration taken from openzim/libzim#1000 as well ).

I didn't notice a contributions.md so please let me know if any changes in formatting are needed.

Pranavjeet-Naidu · 2026-03-19T14:15:25Z

@benoit74 please take a look, suggest some additional testcases if necessary as well :)

benoit74 · 2026-03-19T16:23:02Z

I really doubt you took any inspiration for bad mimetype values from openzim/libzim#1000, please take a closer look

codecov · 2026-03-19T16:25:09Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 27.68%. Comparing base (3533bd3) to head (5c62ae5).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #497   +/-   ##
=======================================
  Coverage   27.68%   27.68%           
=======================================
  Files          26       26           
  Lines        2348     2348           
  Branches     1282     1282           
=======================================
  Hits          650      650           
  Misses       1257     1257           
  Partials      441      441

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Pranavjeet-Naidu · 2026-03-19T20:36:16Z

@benoit74 I went through the convo again, the regex remains unchanged but is now validated by more test cases right from the issue. Also, I agree with you on stripping the MIME type params. Please take a look at this, and I can open an issue, make a pr for counterHandler.h ?

benoit74

LGTM, thank you

benoit74 · 2026-03-20T09:48:32Z

@kelson42 I feel like a more experience C++ person should review this as well, who?

Also, I agree with you on stripping the MIME type params. Please take a look at this, and I can open an issue, make a pr for counterHandler.h ?

You mean in libzim right? Yes, please propose a PR as well. We should deploy both zim-tools and libzim changes in a coordinated manner if we do not want to many failures in zimcheck due to libzim not cleaning things correctly

veloman-yunkan

Thanks for the contribution! It looks good to me, but I chose to outline possible ways to improve it just in case. Please feel free to leave the PR as is.

src/metadata.cpp

test/metadata-test.cpp

Pranavjeet-Naidu · 2026-03-23T04:04:35Z

@veloman-yunkan please take a look, the changes should be up. Also @benoit74, I don't think this is the right place for it but I just noticed that :
The icu::UnicodeString::length() method returns the number of UTF-16 code units, not grapheme clusters.

The documentation here, mentions this exact bug as well.

When counting length of strings (e.g. for title, description, ...) we want to count the number of visual characters (since this is the reason we limit the length of these metadata, we want to ensure they do not break reader UIs by taking way too much visual space) and not the number of Unicode characters needed to store/render this visual character. Some languages / characters need multiple Unicode characters. One example is में which has only 1 grapheme but uses 3 Unicode characters (e.g. in Python, len("में") == 3).

For the string "में" (Hindi):
Unicode code points: 3 (म U+092E, े U+0947, ं U+0902)
UTF-16 code units: 3 (all in BMP, so 1 code unit each)
Grapheme clusters: 1 (visually one character)
So the current getTextLength("में") returns 3, but according to the spec it should return 1.

Do you think this is worth the rewrite? Seems like only one function should be modified.

benoit74 · 2026-03-23T08:09:07Z

I've opened #498 for the grapheme issue

test/metadata-test.cpp

veloman-yunkan

Also please squash all your changes so that this PR consists of a single commit.

Signed-off-by: Pranavjeet-Naidu <pranavjeetnaidu@gmail.com>

Pranavjeet-Naidu · 2026-03-23T21:49:10Z

@veloman-yunkan I've added the tests and squashed as well, please take a look

kelson42 · 2026-03-24T08:58:34Z

@Pranavjeet-Naidu Thank you very much for your succesfuly merged PR!

benoit74 self-requested a review March 20, 2026 09:48

benoit74 approved these changes Mar 20, 2026

View reviewed changes

veloman-yunkan reviewed Mar 20, 2026

View reviewed changes

src/metadata.cpp Outdated Show resolved Hide resolved

test/metadata-test.cpp Outdated Show resolved Hide resolved

benoit74 mentioned this pull request Mar 23, 2026

Count graphemes, not unicode code points #498

Open

veloman-yunkan requested changes Mar 23, 2026

View reviewed changes

test/metadata-test.cpp Outdated Show resolved Hide resolved

test/metadata-test.cpp Show resolved Hide resolved

veloman-yunkan reviewed Mar 23, 2026

View reviewed changes

feat: regex for counter validation

5c62ae5

Signed-off-by: Pranavjeet-Naidu <pranavjeetnaidu@gmail.com>

Pranavjeet-Naidu force-pushed the feat/regex_counter_validation branch from 1892417 to 5c62ae5 Compare March 23, 2026 21:47

veloman-yunkan approved these changes Mar 24, 2026

View reviewed changes

veloman-yunkan merged commit 20fb860 into openzim:main Mar 24, 2026
10 checks passed

benoit74 mentioned this pull request Mar 31, 2026

ZIM Counter Metadata should be checked properly #473

Closed

Uh oh!

Conversation

Pranavjeet-Naidu commented Mar 19, 2026

Uh oh!

Pranavjeet-Naidu commented Mar 19, 2026

Uh oh!

benoit74 commented Mar 19, 2026

Uh oh!

codecov bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Pranavjeet-Naidu commented Mar 19, 2026

Uh oh!

benoit74 left a comment

Choose a reason for hiding this comment

Uh oh!

benoit74 commented Mar 20, 2026

Uh oh!

veloman-yunkan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Pranavjeet-Naidu commented Mar 23, 2026

Uh oh!

benoit74 commented Mar 23, 2026

Uh oh!

Uh oh!

Uh oh!

veloman-yunkan left a comment

Choose a reason for hiding this comment

Uh oh!

Pranavjeet-Naidu commented Mar 23, 2026

Uh oh!

Uh oh!

kelson42 commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov bot commented Mar 19, 2026 •

edited

Loading