Improve Unicode support with grapheme clusters #396

Open
jquast wants to merge 1 commit into peterbrittain:master from jquast:jq/wcwidth-integration

jquast commented Jan 27, 2026

  • I read the contributing guidelines, which is why I included some flake8 and typing/mypy fixes.
  • I also included comprehensive tests; many were written test-first (failing before the change, passing after it).

Issues fixed by this PR

  • I did not find any open issues for this, which surprised me.
  • There must not be much CJK and emoji usage so far.

What does this implement/fix?

Problem: asciimatics text utilities incorrectly split grapheme clusters (emoji ZWJ sequences like 👨‍👩‍👧, regional flags like 🇨🇦, skin tone modifiers, combining characters, etc.). Because grapheme support is missing, this corrupts the display in SpeechBubble and produces incorrect width calculations and padding elsewhere.
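
To illustrate the problem, here is a minimal stdlib-only sketch of why iterating by code point breaks these sequences. The helper names (`extends_cluster`, `simple_graphemes`) are hypothetical, and the clustering rule is a deliberately simplified stand-in, not wcwidth's implementation and not the full Unicode segmentation algorithm:

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def is_regional_indicator(ch):
    """True for the 26 regional indicator symbols used in flag pairs."""
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def extends_cluster(ch):
    """True if ch attaches to the preceding character (simplified rule)."""
    return (
        unicodedata.combining(ch) != 0            # combining marks
        or unicodedata.category(ch) in ("Mn", "Me")
        or "\uFE00" <= ch <= "\uFE0F"             # variation selectors
        or "\U0001F3FB" <= ch <= "\U0001F3FF"     # skin tone modifiers
        or ch == ZWJ
    )


def simple_graphemes(text):
    """Group code points into approximate user-perceived characters."""
    clusters = []
    for ch in text:
        if clusters:
            prev = clusters[-1]
            joined = prev.endswith(ZWJ)
            flag_pair = (
                is_regional_indicator(ch)
                and len(prev) == 1
                and is_regional_indicator(prev)
            )
            if extends_cluster(ch) or joined or flag_pair:
                clusters[-1] += ch
                continue
        clusters.append(ch)
    return clusters


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man ZWJ woman ZWJ girl
flag = "\U0001F1E8\U0001F1E6"                          # regional indicators C + A

# Naive iteration sees five and two separate items respectively.
naive_family = list(family)
naive_flag = list(flag)

# Cluster-aware iteration keeps each sequence intact.
family_clusters = simple_graphemes(family)
flag_clusters = simple_graphemes(flag)
```

Any code that slices, truncates, or pads by indexing into `list(text)` can cut one of these sequences in half, which is the kind of corruption described above.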

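Solution: Integrate with wcwidth>=0.5.0 by using:

  • wcwidth.iter_graphemes for iteration in _enforce_width_ext(), _find_min_start(), _get_offset()
  • wcwidth.wrap for grapheme-aware word wrapping in _split_text()
  • wcwidth.ljust for line padding in SpeechBubble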

Any other comments?

  • I notice there is an option to "ignore unicode" as a performance improvement. wcwidth has many "fast path" checks for pure-ASCII strings (returning len(string), and so on) and uses lru_cache, so the cost of always supporting Unicode should be negligible. Related downstream automated benchmark results for the wcwidth upgrade follow:
View benchmarks of wcwidth 0.2.14 to 0.5.0:

| Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- |
| test_center_ascii | 104.8 µs | 62.1 µs | +68.87% |
| test_rjust_cjk | 760.6 µs | 640.2 µs | +18.81% |
| test_center_cjk | 763.5 µs | 642.6 µs | +18.81% |
| test_truncate_ascii | 33,519.5 µs | 83 µs | ×400 |
| test_center_ansi | 678.1 µs | 359.2 µs | +88.78% |
| test_truncate_cjk | 14.2 ms | 6.8 ms | ×2.1 |
| test_length_cjk | 749.7 µs | 633 µs | +18.45% |
| test_truncate_emoji_zwj | 5.2 ms | 1.6 ms | ×3.4 |
| test_rjust_ansi | 673.7 µs | 356.1 µs | +89.19% |
| test_truncate_ansi | 46.1 ms | 23.2 ms | +99.18% |
| test_length_ansi | 672.2 µs | 488.9 µs | +37.5% |
| test_length_ascii | 99.2 µs | 60.4 µs | +64.23% |
| test_rjust_ascii | 99.8 µs | 58.9 µs | +69.33% |
| test_ljust_ansi | 670.5 µs | 355.3 µs | +88.74% |
| test_ljust_ascii | 100.2 µs | 58.9 µs | +70.1% |
| test_ljust_cjk | 759.6 µs | 640.7 µs | +18.55% |
| test_length_emoji_vs16 | 739.7 µs | 670.7 µs | +10.28% |
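
The "fast path plus lru_cache" idea mentioned above can be sketched roughly as follows. This is an assumption-laden illustration using only the standard library, with hypothetical names (`char_width`, `display_width`); it is not wcwidth's code, and its per-character sum does not treat ZWJ sequences as a single cluster:

```python
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=1024)
def char_width(ch):
    """Approximate terminal cell width of one code point (cached)."""
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # combining marks and format characters occupy no cell
    # East Asian Wide and Fullwidth characters occupy two cells.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def display_width(text):
    """Approximate cell width of text, with a fast path for plain ASCII."""
    if text.isascii() and text.isprintable():
        return len(text)  # fast path: one cell per character
    return sum(char_width(ch) for ch in text)


ascii_width = display_width("hello")
cjk_width = display_width("\u30b3\u30f3\u30cb\u30c1\u30cf")  # five katakana
accent_width = display_width("e\u0301")                      # e + combining acute
```

With this shape, pure-ASCII input never reaches the per-character lookup, and repeated non-ASCII characters are answered from the cache.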

Let me know if you would like any changes or additional PRs; happy to help.

jquast force-pushed the jq/wcwidth-integration branch from ac684a5 to 944238c on January 27, 2026 06:57
**Problem**: asciimatics text utilities wrongly split up grapheme
clusters (emoji ZWJ sequences like 👨‍👩‍👧, regional flags like
🇨🇦, skin tone modifiers, combining characters), causing display
corruption and incorrect width calculations.

**Solution**: Integrate with wcwidth >= 0.5.0 by using:
  - https://wcwidth.readthedocs.io/en/latest/api.html#wcwidth.iter_graphemes
    for iteration in _enforce_width_ext(), _find_min_start(), _get_offset()
  - https://wcwidth.readthedocs.io/en/latest/api.html#wcwidth.wrap
    for grapheme-aware word wrapping in _split_text()
  - https://wcwidth.readthedocs.io/en/latest/api.html#wcwidth.ljust
    for line padding in SpeechBubble
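
As a rough illustration of why padding needs a width-aware `ljust`, the sketch below compares `str.ljust` with padding by terminal cells. It uses only the standard library, and the helpers `cell_width` and `ljust_cells` are hypothetical stand-ins rather than the wcwidth functions linked above:

```python
import unicodedata


def cell_width(text):
    """Approximate number of terminal cells text occupies."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # zero-width: combining marks and format characters
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


def ljust_cells(text, dest_width, fillchar=" "):
    """Left-justify text to dest_width terminal cells, not code points."""
    return text + fillchar * max(0, dest_width - cell_width(text))


line = "\u65e5\u672c"                   # two wide CJK characters, four cells
by_codepoints = line.ljust(6)           # pads as if the text were two cells wide
by_cells = ljust_cells(line, 6)         # pads to six cells as intended
```

Here `str.ljust` adds four spaces, so the line occupies eight cells and would overflow a six-cell bubble, while the cell-aware version adds two.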

I notice there is an option to "ignore unicode" as a performance
improvement. wcwidth has many "fast path" checks for pure-ASCII strings
(returning len(string), and so on) and uses lru_cache, so the cost of
always supporting Unicode should be negligible.
jquast force-pushed the jq/wcwidth-integration branch from 944238c to f49700d on January 27, 2026 07:02