@Jayant-kernel
Contributor

Adds Vietnamese name-like strings benchmark for collation performance testing.
The benchmark uses common Vietnamese surnames (Nguyễn, Trần, Lê, Phạm, etc.)
with various diacritical marks to provide interesting Latin1/UTF-16 performance
comparisons.
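
One possible reading of why this data is interesting for a Latin1/UTF-16 comparison: some Vietnamese letters (à, á, ã, â, ê) fall inside Latin1, while others (ạ, ả, ầ, ễ) need code points above U+00FF. The sketch below is only an illustration of that split; `fits_latin1` is a hypothetical helper, not part of the benchmark or of any collation API.

```python
import unicodedata

def fits_latin1(s: str) -> bool:
    """Return True if every code point of the NFC form of s is <= U+00FF.

    Hypothetical helper for illustration only.
    """
    return all(ord(c) <= 0xFF for c in unicodedata.normalize("NFC", s))

# Surnames taken from the description above.
for name in ["Nguyễn", "Trần", "Lê", "Phạm"]:
    print(name, fits_latin1(name))
```

Under this check only "Lê" stays within Latin1, so a list built from such names would mix both storage widths.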

Fixes #7560

@CLAassistant

CLAassistant commented Feb 3, 2026

CLA assistant check
All committers have signed the CLA.

@Jayant-kernel
Contributor Author

@hsivonen @echeran
Please review this PR; it is the corrected version.

Member

@hsivonen hsivonen left a comment

Thanks. The data superficially (I don't know enough Vietnamese to say for sure) appears to follow Vietnamese name structure, but it's not actually useful for exercising Vietnamese collation, because the first difference is never the same base vowel with different diacritics. Also, the list seems to over-do the identical prefixes.
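
The criterion in this review comment could be checked mechanically along the following lines. This is a minimal sketch, assuming that "base vowel" means the first code point of a character's canonical decomposition (NFD); both function names are hypothetical. Note that under this assumption ă, â, ê, ô, ơ and ư also count as sharing a base with a, e, o and u, which may or may not match the intent.

```python
import unicodedata

def base_letter(ch: str) -> str:
    """Return the first code point of the NFD form of ch (assumed 'base')."""
    return unicodedata.normalize("NFD", ch)[0]

def first_difference_is_diacritic_only(a: str, b: str) -> bool:
    """Return True if the first differing characters of a and b (compared
    in NFC) share the same base letter and so differ only in diacritics.

    Returns False when the strings are equal or one is a prefix of the other.
    """
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    for ca, cb in zip(a, b):
        if ca != cb:
            return base_letter(ca) == base_letter(cb)
    return False

print(first_difference_is_diacritic_only("bàn", "bạn"))      # same base 'a'
print(first_difference_is_diacritic_only("Nguyễn", "Trần"))  # 'N' vs 'T'
```

A pair such as "bàn"/"bạn" passes the check, while two names with different surnames fail at the first letter.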

@Jayant-kernel Jayant-kernel force-pushed the vietnamese-benchmark-clean branch from 239208e to 07abe93 Compare February 4, 2026 11:04
@Jayant-kernel Jayant-kernel force-pushed the vietnamese-benchmark-clean branch from 07abe93 to 60c5ec9 Compare February 4, 2026 11:10
@Jayant-kernel
Contributor Author

Jayant-kernel commented Feb 9, 2026

@hsivonen @sffc @echeran
Thanks for the feedback! I've completely revised the benchmark data.

What Changed:
Focus on diacritical mark differences on the same base vowel
Complete coverage of all 5 Vietnamese tone marks (sắc, huyền, hỏi, ngã, nặng)
Pattern examples: ba → bá → bà → bả → bã → bạ | ban → bàn → bán → bản → bạn
Includes ~30+ actual Vietnamese syllables (bàn=table, bạn=friend, ăn=eat, etc.)
12 consonant varieties to prevent prefix over-repetition
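
The pattern described in the list above could be generated roughly as follows. This is a sketch, not the code used in the PR: `TONE_MARKS` and `tone_variants` are hypothetical names, and the approach (appending a combining mark to the vowel, then normalizing to NFC) is an assumption about one way to build such data.

```python
import unicodedata

# The five tone marks named above, as Unicode combining characters.
TONE_MARKS = {
    "sắc": "\u0301",    # combining acute accent
    "huyền": "\u0300",  # combining grave accent
    "hỏi": "\u0309",    # combining hook above
    "ngã": "\u0303",    # combining tilde
    "nặng": "\u0323",   # combining dot below
}

def tone_variants(onset: str, vowel: str, coda: str = "") -> list[str]:
    """Return the unmarked syllable followed by its five tone-marked forms,
    each normalized to NFC."""
    variants = [onset + vowel + coda]
    for mark in TONE_MARKS.values():
        variants.append(
            unicodedata.normalize("NFC", onset + vowel + mark + coda)
        )
    return variants

print(tone_variants("b", "a"))       # ba, bá, bà, bả, bã, bạ
print(tone_variants("b", "a", "n"))  # ban, bán, bàn, bản, bãn, bạn
```

Not every generated form is necessarily a real Vietnamese word, so a list meant to contain actual syllables would still need manual filtering.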

Result:
Now properly tests Vietnamese collation where the first difference is always a tone mark on the same base character, exercising Latin1/UTF-16 performance with diacritics.

Development

Successfully merging this pull request may close these issues.

Add a Vietnamese collation benchmark

3 participants