
Conversation


Copilot AI commented Jan 11, 2026

Implements a comprehensive PyThaiNLP-compatible test suite for the newmm tokenizer. Tests were failing due to incorrect Thai Character Cluster (TCC) boundary detection and a suboptimal tokenization algorithm.

Changes

Thai Character Cluster boundary detection (ThaiCharacterCluster.java)

  • Separated leading vowels (เ, แ, โ, ใ, ไ) from the characters that follow them
  • Leading vowels now correctly start a new character cluster rather than continuing the previous one
  • Example: in "ฉันรักภาษาไทยเพราะฉันเป็นคนไทย", a cluster boundary now falls at index 13 (before the "เ" of "เพราะ"), allowing "ภาษาไทย" to be recognized as a single word; see the sketch below
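
A minimal illustration of the rule, with hypothetical names (the actual ThaiCharacterCluster.java API may differ):

```java
// Hypothetical sketch of the leading-vowel rule; class and method names
// are illustrative, not the real ThaiCharacterCluster.java API.
import java.util.Set;

public class LeadingVowelRule {
    // Thai leading vowels are written before the consonant they modify.
    private static final Set<Character> LEADING_VOWELS =
            Set.of('เ', 'แ', 'โ', 'ใ', 'ไ');

    /** A leading vowel always opens a new Thai Character Cluster. */
    public static boolean startsNewCluster(char c) {
        return LEADING_VOWELS.contains(c);
    }

    public static void main(String[] args) {
        String text = "ฉันรักภาษาไทยเพราะฉันเป็นคนไทย";
        for (int i = 0; i < text.length(); i++) {
            if (startsNewCluster(text.charAt(i))) {
                // Prints indices 10, 13, 21, 27; index 13 is the "เ"
                // that opens "เพราะ", right after "ภาษาไทย".
                System.out.println("TCC boundary before index " + i);
            }
        }
    }
}
```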

Newmm tokenization algorithm (NewmmTokenizer.java)

  • Replaced BFS graph pathfinding with straightforward maximal matching
  • Builds candidate words only at TCC boundaries and chooses the longest dictionary match at each position (sketched below)
  • Simplified from ~80 lines to ~40 lines while improving accuracy
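
A simplified sketch of that loop; the dictionary set and precomputed boundary set stand in for the real trie and TCC scanner:

```java
// Simplified maximal-matching sketch; Set-based dictionary and boundary
// lookups stand in for the real trie and TCC scanner in NewmmTokenizer.java.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MaximalMatching {
    /** Greedy longest-match segmentation, clipped to TCC boundaries. */
    static List<String> segment(String text, Set<String> dict, Set<Integer> boundaries) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Fall back to a single character (the real code would fall
            // back to the next TCC boundary instead).
            int best = pos + 1;
            for (int end = pos + 1; end <= text.length(); end++) {
                // Only end positions on TCC boundaries are legal, so no
                // token can ever split a character cluster.
                if (boundaries.contains(end) && dict.contains(text.substring(pos, end))) {
                    best = end; // longest match so far wins
                }
            }
            tokens.add(text.substring(pos, best));
            pos = best;
        }
        return tokens;
    }
}
```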

Test coverage (NewmmTokenizerTest.java)

  • Added testNewmmComprehensive() covering all PyThaiNLP test cases
  • Validates null/empty inputs, numeric patterns (IP addresses, decimals, currency), whitespace handling, and special characters; a representative excerpt follows
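
A representative excerpt, assuming JUnit 5 and that Tokenize.wordTokenize returns a List<String> (expected outputs come from the PyThaiNLP cases quoted in the original prompt below):

```java
// Illustrative excerpt only; the real testNewmmComprehensive() covers the
// full PyThaiNLP suite. Assumes wordTokenize returns List<String>.
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import org.junit.jupiter.api.Test;

class NewmmTokenizerExcerptTest {
    @Test
    void numericPatterns() {
        // Decimals and IP-like strings stay whole; a trailing dot splits off.
        assertEquals(List.of("19.84"),
                Tokenize.wordTokenize("19.84", null, "newmm", false));
        assertEquals(List.of("127.0.0.1"),
                Tokenize.wordTokenize("127.0.0.1", null, "newmm", false));
        assertEquals(List.of("19", "."),
                Tokenize.wordTokenize("19.", null, "newmm", false));
    }
}
```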

Example

```java
// Previously failed - "ภาษาไทย" was incorrectly split
Tokenize.wordTokenize("ฉันรักภาษาไทยเพราะฉันเป็นคนไทย", null, "newmm", false);
// Before: [ฉัน, รัก, ภา, ษาไทยเพราะฉันเป็น, คนไทย]
// After:  [ฉัน, รัก, ภาษาไทย, เพราะ, ฉัน, เป็น, คนไทย] ✓

// Numeric patterns now handled correctly
Tokenize.wordTokenize("USD1,984.42", null, "newmm", false);
// Returns: [USD, 1,984.42] ✓
```
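
PyThaiNLP's newmm handles non-Thai runs with a regex pass before dictionary matching; a hedged sketch of such a pattern (the actual expression in NewmmTokenizer.java may differ):

```java
// Hypothetical non-Thai token pattern: Latin runs, and digit runs that may
// embed "." or "," separators. Not necessarily the pattern the PR uses.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonThaiPattern {
    static final Pattern NON_THAI = Pattern.compile("[A-Za-z]+|[0-9]+([.,][0-9]+)*");

    public static void main(String[] args) {
        Matcher m = NON_THAI.matcher("USD1,984.42");
        while (m.find()) {
            System.out.println(m.group()); // prints "USD", then "1,984.42"
        }
    }
}
```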

All 17 tests pass (9 newmm + 8 trie tests).

Original prompt

This section details the original issue to resolve.

<issue_title>Make passed all test like pythainlp</issue_title>
<issue_description>Make passed all test:

```python
    def test_newmm(self):
        self.assertEqual(newmm.segment(None), [])
        self.assertEqual(newmm.segment(""), [])
        self.assertEqual(
            word_tokenize("ฉันรักภาษาไทยเพราะฉันเป็นคนไทย", engine="newmm"),
            ["ฉัน", "รัก", "ภาษาไทย", "เพราะ", "ฉัน", "เป็น", "คนไทย"],
        )
        self.assertEqual(
            word_tokenize("19...", engine="newmm"),
            ["19", "..."],
        )
        self.assertEqual(
            word_tokenize("19.", engine="newmm"),
            ["19", "."],
        )
        self.assertEqual(
            word_tokenize("19.84", engine="newmm"),
            ["19.84"],
        )
        self.assertEqual(
            word_tokenize("127.0.0.1", engine="newmm"),
            ["127.0.0.1"],
        )
        self.assertEqual(
            word_tokenize("USD1,984.42", engine="newmm"),
            ["USD", "1,984.42"],
        )
        self.assertEqual(
            word_tokenize(
                "สวัสดีครับ สบายดีไหมครับ",
                engine="newmm",
                keep_whitespace=True,
            ),
            ["สวัสดี", "ครับ", " ", "สบายดี", "ไหม", "ครับ"],
        )
        self.assertEqual(
            word_tokenize("จุ๋มง่วงนอนยัง", engine="newmm"),
            ["จุ๋ม", "ง่วงนอน", "ยัง"],
        )
        self.assertEqual(word_tokenize("จุ๋มง่วง", engine="newmm"), ["จุ๋ม", "ง่วง"])
        self.assertEqual(
            word_tokenize("จุ๋ม   ง่วง", engine="newmm", keep_whitespace=False),
            ["จุ๋ม", "ง่วง"],
        )
        self.assertNotIn(
            " ",
            word_tokenize(
                "จุ๋มง่วง",
                keep_whitespace=False,
            ),
        )
        self.assertEqual(
            word_tokenize("(คนไม่เอา)", engine="newmm"),
            ["(", "คน", "ไม่", "เอา", ")"],
        )
        self.assertEqual(
            word_tokenize("กม/ชม", engine="newmm"), ["กม", "/", "ชม"]
        )
        self.assertEqual(
            word_tokenize("สีหน้า(รถ)", engine="newmm"), ["สีหน้า", "(", "รถ", ")"]
        )
```
</issue_description>

## Comments on the Issue (you are @copilot in this section)

<comments>
</comments>



Copilot AI self-assigned this Jan 11, 2026
Copilot AI changed the title from "[WIP] Fix segmentation tests to pass for newmm" to "Fix newmm tokenizer: Thai Character Cluster boundaries and maximal matching" on Jan 11, 2026
Copilot AI requested a review from wannaphong January 11, 2026 10:08
@wannaphong wannaphong marked this pull request as ready for review January 11, 2026 10:09
@wannaphong wannaphong merged commit 10d0f7a into master Jan 11, 2026
1 check passed
