
Fix #1619: Implement grapheme cluster counting for character count co…#1765

Open
moshaid wants to merge 6 commits into nhsuk:main from moshaid:fix/1619-grapheme-counting

@moshaid commented Jan 13, 2026

This PR implements grapheme cluster counting for the character count component to ensure client-side validation matches Python's len() behaviour for server-side consistency:

  • Add graphemeCount utility function with Intl.Segmenter support and fallback
  • Update character count component to use grapheme counting instead of UTF-16 code units
  • Add comprehensive tests for Unicode, emoji, and complex characters
  • Update component documentation with counting methodology
  • Ensures client-side validation matches Python len() behaviour for server-side consistency
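
For readers following along, a minimal sketch of what a `graphemeCount` helper with an `Intl.Segmenter` path and a fallback might look like. This is an illustration only, not the code in this PR: the `locale` parameter and the choice of code point counting as the fallback are assumptions.

```javascript
// Hypothetical sketch, not the implementation from this PR.
// Counts user-perceived characters (grapheme clusters) where
// Intl.Segmenter is available, otherwise falls back to code points.
function graphemeCount(text, locale = 'en') {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' })
    // segment() returns an iterable with one entry per grapheme cluster
    return Array.from(segmenter.segment(text)).length
  }

  // Assumed fallback: string iteration yields Unicode code points,
  // which is the unit Python's len() counts for str
  return Array.from(text).length
}
```

With this sketch, `graphemeCount('e\u0301')` (an "e" followed by a combining acute accent) is 1 when `Intl.Segmenter` is available and 2 on the fallback path, which is the kind of support gap discussed below.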

…nt component

- Add graphemeCount utility function with Intl.Segmenter support and fallback
- Update character count component to use grapheme counting instead of UTF-16 code units
- Add comprehensive tests for Unicode, emoji, and complex characters
- Update component documentation with counting methodology
- Ensures client-side validation matches Python len() behavior for server-side consistency
@moshaid force-pushed the fix/1619-grapheme-counting branch from b923c57 to 27cf1c1 on January 15, 2026 15:28

@colinrotherham (Contributor) commented

Thanks for raising this @moshaid

Better compatibility with Python len() and our nhsuk-frontend-jinja port would be great

Knowing we only have Intl.Segmenter in Baseline 2024, what should we do about the support gap?

Server-side consistency

Because the error message set server-side might include a different count to client-side, should we:

  1. Support grapheme counting with string counting fallback
  2. Support grapheme counting with no fallback, textarea only
  3. Support grapheme counting via config only

Have you tried this already on your service?

Thanks

@moshaid (Author) commented Jan 16, 2026

Appreciate you reviewing this @colinrotherham

I have added a fallback to cover the support gap.
For server-side consistency, I tried the third option, which I consider the best.

@moshaid (Author) commented Jan 27, 2026

@colinrotherham I just ran a Prettier check on the file and it is well formatted. I am stuck on why the Prettier check is failing. Is there anything you would recommend?

Thanks

@colinrotherham (Contributor) commented

Can you run npm install and try again?

I'm getting the following change on npm run lint:prettier:fix

Thanks @moshaid

--- A/packages/nhsuk-frontend/src/nhsuk/i18n.mjs
+++ B/packages/nhsuk-frontend/src/nhsuk/i18n.mjs
@@ -141,7 +141,7 @@
   hasIntlPluralRulesSupport() {
     return Boolean(
       'PluralRules' in window.Intl &&
+      Intl.PluralRules.supportedLocalesOf(this.locale).length
-        Intl.PluralRules.supportedLocalesOf(this.locale).length
     )
   }

@moshaid closed this Jan 28, 2026
@moshaid reopened this Jan 28, 2026

@moshaid (Author) commented Jan 28, 2026

Thanks for the tip @colinrotherham


## How characters are counted

By default, the character count component uses **code point counting**, which matches Python's `len()` function for Unicode strings. This ensures consistency between client-side (JavaScript) and server-side (Python) validation in `nhsuk-frontend-jinja`, preventing mismatched error messages.
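
To make the distinction concrete, here is a small sketch comparing UTF-16 code unit counting (`String.prototype.length`) with code point counting. The helper names are hypothetical and are not taken from the component.

```javascript
// Hypothetical helper names, for illustration only.

// UTF-16 code units: what String.prototype.length reports
function codeUnitCount(text) {
  return text.length
}

// Unicode code points: string iteration never splits a surrogate pair,
// so this counts the same unit as Python's len() on a str
function codePointCount(text) {
  return Array.from(text).length
}

const plain = 'caf\u00e9' // "café" with a precomposed é
const thumbsUp = '\u{1F44D}\u{1F3FB}' // thumbs up + skin tone modifier

console.log(codeUnitCount(plain), codePointCount(plain)) // 4 4
console.log(codeUnitCount(thumbsUp), codePointCount(thumbsUp)) // 4 2
```

The two counts only diverge for characters outside the Basic Multilingual Plane, such as most emoji.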
Contributor

Not everyone is using Python and Jinja, so I wonder whether this paragraph should be aimed at a broader audience than teams using Python.

nhsuk-frontend-jinja is a port of the Nunjucks components, so it's not actually the thing providing server-side validation. But wherever the template relies on the len filter, it will be using code point counting, so it's good that it's consistent.

This is a great contribution btw 👍🏻

@colinrotherham (Contributor) commented Feb 5, 2026

What do you both think about a customisable count function?

It might need to support promises (should a server-side count be needed)

Similar to a config option, you could pass in pre-exported count functions that we publish. Or alternatively use your own if necessary
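
As a rough sketch of this idea, a count function could be passed in and awaited, so that both synchronous and promise-returning functions work. Everything here is hypothetical: `countFunction`, `getRemaining` and the pre-exported `countCodePoints` are illustrative names, not existing nhsuk-frontend APIs.

```javascript
// Hypothetical sketch: none of these names exist in nhsuk-frontend.

// Example of a pre-exported count function that could be published
const countCodePoints = (text) => Array.from(text).length

// `await` accepts plain numbers and promises alike, so a service
// could supply either a local function or one that calls a server
async function getRemaining(text, maxlength, countFunction = countCodePoints) {
  const count = await countFunction(text)
  return maxlength - count
}

// Usage with the default and with a custom asynchronous function
getRemaining('hello', 10).then((remaining) => {
  console.log(remaining) // 5
})

getRemaining('hello', 10, async (text) => text.length * 2).then((remaining) => {
  console.log(remaining) // 0
})
```

One trade-off of allowing promises is that every update to the count message becomes asynchronous, even when the count function itself is synchronous.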

@MatMoore (Contributor) commented Feb 5, 2026

I like the idea in theory. I think calling the server for the count is a bit excessive though. Just being able to specify a count function on the client side to match what the server does seems like it would be flexible enough.

Counting bytes seems like the main use case other than the options we've got now (graphemes, codepoints, words). The byte count would depend on the encoding the server/database uses, so if we export a function for this it probably shouldn't assume UTF-8.

I'm sure there are plenty of cases where we're constrained by some legacy backend thing, but I'm not sure what the ideal UX is in that case. With graphemes - and to a lesser extent, the existing codepoint counting - the display will track with what the user is entering, but it will lie about whether they are over the limit if they enter any multi-byte characters. But if they customise the count function to use byte counting, the count will jump by more than 1 when they type multi-byte characters, which might confuse users as well. 🤷🏻
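
To illustrate how the byte count depends on encoding, here is a minimal sketch with two hypothetical count functions, one for UTF-8 and one for UTF-16. The function names are assumptions; `TextEncoder` is a standard API that always encodes to UTF-8.

```javascript
// Hypothetical helper names, for illustration only.

// UTF-8: TextEncoder always produces UTF-8 bytes
function utf8ByteCount(text) {
  return new TextEncoder().encode(text).length
}

// UTF-16: every code unit in a JavaScript string is 2 bytes
function utf16ByteCount(text) {
  return text.length * 2
}

const word = 'caf\u00e9' // "café" with a precomposed é

console.log(utf8ByteCount(word)) // 5: é takes 2 bytes in UTF-8
console.log(utf16ByteCount(word)) // 8: 4 code units of 2 bytes each
console.log(utf8ByteCount('\u{1F44D}')) // 4: one emoji, four bytes
```

This also shows the jump described above: with UTF-8 byte counting, typing a single "é" moves the count by 2 and a single emoji moves it by 4.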

Contributor

I wonder if we can partially mitigate mismatches by:

  • recommending that you set quite a high limit where possible
  • recommending that teams use a threshold, so that most users don’t see the character count at all
  • even if both client-side and server-side are counting characters, you could allow an extra 5 or so characters server-side vs client-side, just in case there’s a discrepancy in the counting functions...
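
A rough sketch of the threshold and tolerance ideas, assuming a JavaScript server for the sake of the example. The function names, the percentage threshold and the tolerance of 5 are all illustrative assumptions rather than anything the component provides.

```javascript
// Hypothetical sketch: names and values are illustrative only.

// Client side: only show the count once the user has used a given
// percentage of the limit, so most users never see it
function shouldShowCount(count, maxlength, thresholdPercent = 75) {
  return count >= (maxlength * thresholdPercent) / 100
}

// Server side: accept a few extra characters in case the client
// and server counting functions disagree slightly
function isWithinServerLimit(text, maxlength, tolerance = 5) {
  const count = Array.from(text).length // code points
  return count <= maxlength + tolerance
}

console.log(shouldShowCount(74, 100)) // false
console.log(shouldShowCount(75, 100)) // true
console.log(isWithinServerLimit('a'.repeat(203), 200)) // true
console.log(isWithinServerLimit('a'.repeat(206), 200)) // false
```

The tolerance only hides small discrepancies; it would not cover long runs of emoji or combining sequences, where the two counts can differ by much more than 5.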

- **Characters with combining marks**: Accented characters like é, ñ, and ü are counted correctly regardless of whether they're stored as a single code point or as a base character plus combining mark
- **Complex scripts**: Non-Latin scripts (Chinese, Japanese, Korean, Arabic, etc.) are counted accurately
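
As an illustration of the combining marks point, the sketch below compares a precomposed "é" with an "e" plus combining accent. The helper names are hypothetical, and the grapheme helper assumes `Intl.Segmenter` is available in the runtime.

```javascript
// Hypothetical helper names, for illustration only.
function countCodePoints(text) {
  return Array.from(text).length
}

// Assumes Intl.Segmenter exists (no fallback in this sketch)
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' })
  return Array.from(segmenter.segment(text)).length
}

const composed = 'caf\u00e9' // é as a single code point
const decomposed = 'cafe\u0301' // e followed by a combining acute accent

console.log(countCodePoints(composed), countCodePoints(decomposed)) // 4 5
console.log(countGraphemes(composed), countGraphemes(decomposed)) // 4 4

// Normalising to NFC first makes code point counts agree for this input
console.log(countCodePoints(decomposed.normalize('NFC'))) // 4
```

Normalisation helps for characters that have a precomposed form, but not for sequences without one, so it narrows the gap between code point and grapheme counts rather than closing it.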

**Important**: Only enable grapheme counting if your server-side validation also uses grapheme counting. Otherwise, you may see different counts between client and server validation messages.
Contributor

Would it be helpful to signpost the way to do this in JavaScript and the various other languages we use on the backend? (In Python I think you probably need a third-party library like https://grapheme.readthedocs.io/en/latest/grapheme.html)


4 participants