Fix #1619: Implement grapheme cluster counting for character count co…#1765
moshaid wants to merge 6 commits into nhsuk:main
…nt component
- Add graphemeCount utility function with Intl.Segmenter support and fallback
- Update character count component to use grapheme counting instead of UTF-16 code units
- Add comprehensive tests for Unicode, emoji, and complex characters
- Update component documentation with counting methodology
- Ensures client-side validation matches Python len() behavior for server-side consistency
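As an illustration of the first bullet, a feature-detected grapheme counter might look something like the sketch below. This is not the PR's actual code: the function body, the `locale` parameter and the choice of a code point fallback are all assumptions made for the example.

```javascript
// Hypothetical sketch, not the implementation from this PR.
// Counts user-perceived characters where Intl.Segmenter is available,
// and falls back to counting Unicode code points where it is not.
function graphemeCount(text, locale) {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' })

    // The Segments object is iterable, yielding one entry per grapheme cluster
    return Array.from(segmenter.segment(text)).length
  }

  // Assumed fallback: the string iterator walks code points, not UTF-16 code units
  return Array.from(text).length
}
```

With a fallback like this, browsers without `Intl.Segmenter` would report a higher count than newer ones for text such as emoji with skin tone modifiers, which is worth bearing in mind when comparing against a server-side count.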
b923c57 to
27cf1c1
|
Thanks for raising this @moshaid
Better compatibility with Python
Knowing we only have
Server-side consistency
Because the error message set server-side might include a different count to client-side, should we:
Have you tried this already on your service? Thanks |
|
Appreciate you reviewing this @colinrotherham. I have added a fallback to support the gap. |
|
@colinrotherham I just ran a Prettier check on the file and it is well formatted. I am stuck on what could be causing the Prettier check to fail. Is there anything you could recommend? Thanks |
|
Can you run I'm getting the following change on Thanks @moshaid
--- A/packages/nhsuk-frontend/src/nhsuk/i18n.mjs
+++ B/packages/nhsuk-frontend/src/nhsuk/i18n.mjs
@@ -141,7 +141,7 @@
hasIntlPluralRulesSupport() {
return Boolean(
'PluralRules' in window.Intl &&
+ Intl.PluralRules.supportedLocalesOf(this.locale).length
- Intl.PluralRules.supportedLocalesOf(this.locale).length
)
} |
|
Thanks for the tip @colinrotherham |
## How characters are counted

By default, the character count component uses **code point counting**, which matches Python's `len()` function for Unicode strings. This ensures consistency between client-side (JavaScript) and server-side (Python) validation in `nhsuk-frontend-jinja`, preventing mismatched error messages.
Not everyone is using Python and Jinja, so I wonder whether this paragraph should be aimed at a broader audience than teams using Python.
nhsuk-frontend-jinja is a port of the Nunjucks components, so it's not actually the thing providing server-side validation, but wherever the template relies on the len filter, it will be using code point counting, so it's good that it's consistent.
This is a great contribution btw 👍🏻
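For anyone comparing the counting methods mentioned here, the difference between JavaScript's default string length and code point counting can be seen with a single emoji. This is only a small illustration; the variable names are made up for the example.

```javascript
// Thumbs up (U+1F44D) followed by a light skin tone modifier (U+1F3FB)
const thumbsUp = '\u{1F44D}\u{1F3FB}'

// String.prototype.length counts UTF-16 code units
const codeUnits = thumbsUp.length

// Iterating a string yields one item per code point
const codePoints = Array.from(thumbsUp).length

console.log(codeUnits) // 4
console.log(codePoints) // 2, the same value Python's len() gives for this string
```

A user would see this as one character, so neither number matches what they perceive, but the code point count is the one that lines up with Python.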
What do you both think about a customisable count function?
- Character count component counts code points, not characters alphagov/govuk-frontend#1104
- Character count's character/word count functions should be customisable alphagov/govuk-frontend#1364
It might need to support promises (should a server-side count be needed)
Similar to a config option, you could pass in pre-exported count functions that we publish. Or alternatively use your own if necessary
I like the idea in theory. I think calling the server for the count is a bit excessive though. Just being able to specify a count function on the client side to match what the server does seems like it would be flexible enough.
Counting bytes seems like the main use case other than the options we've got now (graphemes, code points, words). The byte count would depend on the encoding the server/database uses, so if we export a function for this it probably shouldn't assume UTF-8.
I'm sure there are plenty of cases where we're constrained by some legacy backend thing, but I'm not sure what the ideal UX is in that case. With graphemes - and to a lesser extent, the existing code point counting - the display will track with what the user is entering, but it will lie about whether they are over the limit if they enter any multi-byte characters. But if they customise the count function to use byte counting, the count will jump by more than 1 when they type multi-byte characters, which might confuse users as well. 🤷🏻
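To make the trade-off concrete, here is a rough sketch of what a client-side pluggable count function could look like. None of these names exist in nhsuk-frontend: `countFunctions` and `remaining` are hypothetical, and the byte counter deliberately says UTF-8 in its name because `TextEncoder` only encodes to UTF-8.

```javascript
// Hypothetical pre-exported count functions; not part of any published API
const countFunctions = {
  // One per Unicode code point
  codePoints: (text) => Array.from(text).length,

  // One per byte, assuming the server stores text as UTF-8
  utf8Bytes: (text) => new TextEncoder().encode(text).length,

  // One per run of non-whitespace characters
  words: (text) => (text.match(/\S+/g) || []).length
}

// Hypothetical helper: how many units are left under a given count function
function remaining(text, maxlength, count = countFunctions.codePoints) {
  return maxlength - count(text)
}

console.log(remaining('h\u00e9llo', 10)) // 5
console.log(remaining('h\u00e9llo', 10, countFunctions.utf8Bytes)) // 4
```

The second result shows the jump described above: the single "é" uses two of the ten available bytes.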
I wonder if we can partially mitigate mis-matches by:
- recommending that you set quite a high limit where possible
- recommending that teams use a threshold, so that most users don’t see the character count at all
- even if both client side and server-side are counting characters, you could allow an extra 5 or so characters server-side vs client-side, just in case there’s a discrepancy in the counting functions...
- **Characters with combining marks**: Accented characters like é, ñ, and ü are counted correctly regardless of whether they're stored as a single code point or as a base character plus combining mark
- **Complex scripts**: Non-Latin scripts (Chinese, Japanese, Korean, Arabic, etc.) are counted accurately

**Important**: Only enable grapheme counting if your server-side validation also uses grapheme counting. Otherwise, you may see different counts between client and server validation messages.
Would it be helpful to signpost the way to do this in JavaScript and the various other languages we use on the backend? (In Python I think you probably need a third-party library like https://grapheme.readthedocs.io/en/latest/grapheme.html)
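As background for the combining-mark bullet quoted above, the two ways of storing "é" give different results under code point counting, which is the gap grapheme counting closes. The sketch below uses a made-up helper name, `countCodePoints`, and shows NFC normalisation purely as an illustration; it only helps for characters that have a precomposed form, not for emoji sequences.

```javascript
const composed = '\u00e9' // "é" as one precomposed code point
const decomposed = 'e\u0301' // "e" plus U+0301 COMBINING ACUTE ACCENT

// Hypothetical helper: counts code points, as Python's len() does
const countCodePoints = (text) => Array.from(text).length

console.log(countCodePoints(composed)) // 1
console.log(countCodePoints(decomposed)) // 2
console.log(countCodePoints(decomposed.normalize('NFC'))) // 1
```

Python's standard library offers the same normalisation through `unicodedata.normalize`, so a service that normalises on both sides would at least agree on accented Latin characters.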
This PR implements grapheme cluster counting for the character count component to ensure client-side validation matches Python's `len()` behaviour for server-side consistency: