Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Jan 6, 2026

What does this changes

Fixes royin engine romanization of Thai words containing consonant clusters (กร, กล, กว) and multi-syllable words. Previously returned incorrect output for 12 out of 26 test cases.

What was wrong

The _replace_consonants function treated all consonants after the first as final consonants, ignoring Thai phonotactics where ร/r, ล/l, ว/w form initial clusters with preceding consonants. Multi-syllable word boundaries were not detected, causing incorrect vowel insertion and consonant form selection.

How this fixes it

Consonant cluster detection:

  • Track vowel presence to distinguish initial clusters (pre-vowel) from final consonants (post-vowel)
  • Recognize ร, ล, ว as valid second consonants in initial clusters
  • Example: กรม → k+r(cluster)+o+m = "krom" instead of k+n(final)+m = "knm"

Syllable boundary detection:

  • Reset cluster state when consonant follows vowel and is itself followed by vowel
  • Add implicit 'a' when previous syllable has explicit vowel and next token is consonant cluster
  • Example: บ้านไร่ (บaนรai) → ba+n, then ร+ai starts new syllable = "banrai"
  • Note: This improved syllable boundary detection produces more linguistically accurate romanization (e.g., คาราเมล → "kharamen" instead of "khanamen"), which affects the lookup engine's fallback behavior for words not in the dictionary.

Special cases:

  • Single ห romanizes as empty (silent leading consonant)
  • อ with empty initial form handled correctly (does not break cluster detection)
  • รร triggers vowel_seen flag (acts as vowel 'a')

Code structure:

  • Extracted _should_add_syllable_separator() for syllable joining logic
  • Added _ROMANIZED_VOWELS constant for consistency
  • Added romanization examples in comments (e.g., ร/r, กรม/krom)

Test coverage:

  • Enabled 11 previously failing test cases in BASIC_TESTS
  • Enabled 3 previously failing consistency tests
  • Updated lookup engine fallback test expectation to reflect improved syllable boundary detection

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit test
Original prompt

This section details on the original issue you should resolve

<issue_title>'royin' engine gives wrong romanization in a lot of cases</issue_title>
<issue_description>Try this test set:

from pythainlp.transliterate import romanize
test_cases = {
    None: "",
    "": "",
    "หมอก": "mok",
    "หาย": "hai",
    "แมว": "maeo",
    "เดือน": "duean",
    "ดำ": "dam",
    "ดู": "du",
    "บัว": "bua",
    "กก": "kok",
    "กร": "kon",
    "กรร": "kan",
    "กรรม": "kam",
    "กรม": "krom",  # failed
    "ฝ้าย": "fai",
    "นพพร": "nopphon",
    "ทีปกร": "thipakon",  # failed
    "ธรรพ์": "than",  # failed
    "ธรรม": "tham",  # failed
    "มหา": "maha",  # failed
    "หยาก": "yak",  # failed
    "อยาก": "yak",  # failed
    "ยมก": "yamok",  # failed
    "กลัว": "klua",  # failed
    "บ้านไร่": "banrai",  # failed
    "ชารินทร์": "charin",  # failed
}
for word in test_cases:
    expect = test_cases[word]
    actual = romanize(word, engine="royin")
    print(f"{expect == actual} - word: {word} expect: {expect} actual: {actual}")

Half of them will failed:

True - word: None expect:  actual: 
True - word:  expect:  actual: 
True - word: หมอก expect: mok actual: mok
True - word: หาย expect: hai actual: hai
True - word: แมว expect: maeo actual: maeo
True - word: เดือน expect: duean actual: duean
True - word: ดำ expect: dam actual: dam
True - word: ดู expect: du actual: du
True - word: บัว expect: bua actual: bua
True - word: กก expect: kok actual: kok
True - word: กร expect: kon actual: kon
True - word: กรร expect: kan actual: kan
True - word: กรรม expect: kam actual: kam
False - word: กรม expect: krom actual: knm
True - word: ฝ้าย expect: fai actual: fai
True - word: นพพร expect: nopphon actual: nopphon
False - word: ทีปกร expect: thipakon actual: thipkon
False - word: ธรรพ์ expect: than actual: thonrop
False - word: ธรรม expect: tham actual: thnnm
False - word: มหา expect: maha actual: ma
False - word: หยาก expect: yak actual: hyak
False - word: อยาก expect: yak actual: ak
False - word: ยมก expect: yamok actual: ymk
False - word: กลัว expect: klua actual: knua
False - word: บ้านไร่ expect: banrai actual: bannai
False - word: ชารินทร์ expect: charin actual: charinthon

This test set will be added to test_transliterate.py.


Consistency Test

# these are set of two-syllable words,
# to test if the transliteration/romanization is consistent, say
# romanize(1+2) = romanize(1) + romanize(2)
_CONSISTENCY_TESTS = [
    # ("กระจก", "กระ", "จก"),  # failed
    # ("ระเบิด", "ระ", "เบิด"),  # failed
    # ("หยากไย่", "หยาก", "ไย่"),  # failed
    ("ตากใบ", "ตาก", "ใบ"),
    # ("จัดสรร", "จัด", "สรร"),  # failed
]

def test_romanize_royin_consistency(self):
    for word, part1, part2 in _CONSISTENCY_TESTS:
        self.assertEqual(
            romanize(word, engine="royin"),
            (
                romanize(part1, engine="royin")
                + romanize(part2, engine="royin")
            ),
        )

In general, I think we need a more systematic evaluation of different algorithms, including soundex.
</issue_description>

Comments on the Issue (you are @copilot in this section)


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Copilot AI self-assigned this Jan 6, 2026
Copilot AI changed the title [WIP] Fix wrong romanization in 'royin' engine Fix royin romanization for Thai consonant clusters and syllable boundaries Jan 6, 2026
Copilot AI requested a review from wannaphong January 6, 2026 10:43
@wannaphong
Copy link
Member

@copilot Why? FAIL: test_romanize_lookup (tests.core.test_transliterate.TransliterateTestCase)

Traceback (most recent call last):
File "D:\a\pythainlp\pythainlp\tests\core\test_transliterate.py", line 86, in test_romanize_lookup
self.assertEqual(
AssertionError: 'kharamenkhapuchino' != 'khanamenkhaputino'

  • kharamenkhapuchino
    ? ^ ^^
  • khanamenkhaputino
    ? ^ ^

Copy link
Contributor Author

Copilot AI commented Jan 6, 2026

@copilot Why? FAIL: test_romanize_lookup (tests.core.test_transliterate.TransliterateTestCase)

Traceback (most recent call last):...

The test was failing because my improvements to the royin engine changed how syllable boundaries are detected. The lookup engine falls back to royin when words aren't found in the dictionary.

The old behavior treated all consonants after vowels as final consonants, giving "khanamenkhaputino". My fix detects syllable boundaries more accurately by recognizing when consonants after vowels are followed by vowels (indicating a new syllable), giving "kharamenkhapuchino" which is linguistically more correct:

  • คาราเมล (caramel): kha-ra-men (3 syllables)
  • คาปูชิโน่ (cappuccino): kha-pu-chi-no (4 syllables)

Updated the test expectation in commit f42eb81. All 26 test cases from the main issue still pass.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Jan 6, 2026

@coveralls
Copy link

Coverage Status

coverage: 52.763% (+0.3%) from 52.461%
when pulling f42eb81 on copilot/fix-romanization-issues
into de6be60 on dev.

@wannaphong wannaphong marked this pull request as ready for review January 6, 2026 11:29
@wannaphong wannaphong added this to the 5.3 milestone Jan 6, 2026
@wannaphong wannaphong merged commit 0264e67 into dev Jan 6, 2026
43 of 44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

'royin' engine gives wrong romanization in a lot of cases

3 participants