Skip to content

feat: regex for counter validation#497

Merged
veloman-yunkan merged 1 commit intoopenzim:mainfrom
Pranavjeet-Naidu:feat/regex_counter_validation
Mar 24, 2026
Merged

feat: regex for counter validation#497
veloman-yunkan merged 1 commit intoopenzim:mainfrom
Pranavjeet-Naidu:feat/regex_counter_validation

Conversation

@Pranavjeet-Naidu
Copy link
Copy Markdown
Contributor

Solves #473

The added regex also has tests that validate it, have been run locally and verified.
( inspiration taken from openzim/libzim#1000 as well ).

I didn't notice a contributions.md so please let me know if any changes in formatting are needed.

@Pranavjeet-Naidu
Copy link
Copy Markdown
Contributor Author

@benoit74 please take a look, suggest some additional testcases if necessary as well :)

@benoit74
Copy link
Copy Markdown
Contributor

I really doubt you took any inspiration for bad mimetype values from openzim/libzim#1000, please take a closer look

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 27.68%. Comparing base (3533bd3) to head (5c62ae5).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #497   +/-   ##
=======================================
  Coverage   27.68%   27.68%           
=======================================
  Files          26       26           
  Lines        2348     2348           
  Branches     1282     1282           
=======================================
  Hits          650      650           
  Misses       1257     1257           
  Partials      441      441           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Pranavjeet-Naidu
Copy link
Copy Markdown
Contributor Author

@benoit74 I went through the convo again, the regex remains unchanged but is now validated by more test cases right from the issue. Also, I agree with you on stripping the MIME type params. Please take a look at this, and I can open an issue, make a pr for counterHandler.h ?

@benoit74 benoit74 self-requested a review March 20, 2026 09:48
Copy link
Copy Markdown
Contributor

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you

@benoit74
Copy link
Copy Markdown
Contributor

@kelson42 I feel like a more experience C++ person should review this as well, who?

Also, I agree with you on stripping the MIME type params. Please take a look at this, and I can open an issue, make a pr for counterHandler.h ?

You mean in libzim right? Yes, please propose a PR as well. We should deploy both zim-tools and libzim changes in a coordinated manner if we do not want to many failures in zimcheck due to libzim not cleaning things correctly

Copy link
Copy Markdown
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! It looks good to me, but I chose to outline possible ways to improve it just in case. Please feel free to leave the PR as is.

@Pranavjeet-Naidu
Copy link
Copy Markdown
Contributor Author

@veloman-yunkan please take a look, the changes should be up. Also @benoit74, I don't think this is the right place for it but I just noticed that :
The icu::UnicodeString::length() method returns the number of UTF-16 code units, not grapheme clusters.

The documentation here, mentions this exact bug as well.

When counting length of strings (e.g. for title, description, ...) we want to count the number of visual characters (since this is the reason we limit the length of these metadata, we want to ensure they do not break reader UIs by taking way too much visual space) and not the number of Unicode characters needed to store/render this visual character. Some languages / characters need multiple Unicode characters. One example is में which has only 1 grapheme but uses 3 Unicode characters (e.g. in Python, len("में") == 3).

For the string "में" (Hindi):
Unicode code points: 3 (म U+092E, े U+0947, ं U+0902)
UTF-16 code units: 3 (all in BMP, so 1 code unit each)
Grapheme clusters: 1 (visually one character)
So the current getTextLength("में") returns 3, but according to the spec it should return 1.

Do you think this is worth the rewrite? Seems like only one function should be modified.

@benoit74
Copy link
Copy Markdown
Contributor

I've opened #498 for the grapheme issue

Copy link
Copy Markdown
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please squash all your changes so that this PR consists of a single commit.

Signed-off-by: Pranavjeet-Naidu <pranavjeetnaidu@gmail.com>
@Pranavjeet-Naidu Pranavjeet-Naidu force-pushed the feat/regex_counter_validation branch from 1892417 to 5c62ae5 Compare March 23, 2026 21:47
@Pranavjeet-Naidu
Copy link
Copy Markdown
Contributor Author

@veloman-yunkan I've added the tests and squashed as well, please take a look

@veloman-yunkan veloman-yunkan merged commit 20fb860 into openzim:main Mar 24, 2026
10 checks passed
@kelson42
Copy link
Copy Markdown
Contributor

@Pranavjeet-Naidu Thank you very much for your succesfuly merged PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants