
Conversation


Copilot AI commented Jan 11, 2026

Implements a comprehensive PyThaiNLP-compatible test suite for the newmm tokenizer. Tests were failing due to incorrect Thai Character Cluster (TCC) boundary detection and a suboptimal tokenization algorithm.

Changes

Thai Character Cluster boundary detection (ThaiCharacterCluster.java)

  • Separated leading vowels (เ, แ, โ, ใ, ไ) from the characters that follow them
  • Leading vowels now correctly start a new character cluster rather than continuing the previous one
  • Example: in "ฉันรักภาษาไทยเพราะฉันเป็นคนไทย", a cluster boundary now falls at index 13 (before the "เ" of "เพราะ"), allowing "ภาษาไทย" to be recognized as a single word; see the sketch below
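
A minimal illustration of the rule, with hypothetical names (the actual ThaiCharacterCluster.java API may differ):

```java
// Hypothetical sketch of the leading-vowel rule; class and method names
// are illustrative, not the real ThaiCharacterCluster.java API.
import java.util.Set;

public class LeadingVowelRule {
    // Thai leading vowels are written before the consonant they modify.
    private static final Set<Character> LEADING_VOWELS =
            Set.of('เ', 'แ', 'โ', 'ใ', 'ไ');

    /** A leading vowel always opens a new Thai Character Cluster. */
    public static boolean startsNewCluster(char c) {
        return LEADING_VOWELS.contains(c);
    }

    public static void main(String[] args) {
        String text = "ฉันรักภาษาไทยเพราะฉันเป็นคนไทย";
        for (int i = 0; i < text.length(); i++) {
            if (startsNewCluster(text.charAt(i))) {
                // Prints indices 10, 13, 21, 27; index 13 is the "เ"
                // that opens "เพราะ", right after "ภาษาไทย".
                System.out.println("TCC boundary before index " + i);
            }
        }
    }
}
```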

Newmm tokenization algorithm (NewmmTokenizer.java)

  • Replaced BFS graph pathfinding with straightforward maximal matching
  • Builds candidate words only at TCC boundaries and chooses the longest dictionary match at each position (sketched below)
  • Simplified from ~80 lines to ~40 lines while improving accuracy
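
A simplified sketch of that loop; the dictionary set and precomputed boundary set stand in for the real trie and TCC scanner:

```java
// Simplified maximal-matching sketch; Set-based dictionary and boundary
// lookups stand in for the real trie and TCC scanner in NewmmTokenizer.java.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MaximalMatching {
    /** Greedy longest-match segmentation, clipped to TCC boundaries. */
    static List<String> segment(String text, Set<String> dict, Set<Integer> boundaries) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Fall back to a single character (the real code would fall
            // back to the next TCC boundary instead).
            int best = pos + 1;
            for (int end = pos + 1; end <= text.length(); end++) {
                // Only end positions on TCC boundaries are legal, so no
                // token can ever split a character cluster.
                if (boundaries.contains(end) && dict.contains(text.substring(pos, end))) {
                    best = end; // longest match so far wins
                }
            }
            tokens.add(text.substring(pos, best));
            pos = best;
        }
        return tokens;
    }
}
```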

Test coverage (NewmmTokenizerTest.java)

  • Added testNewmmComprehensive() covering all PyThaiNLP test cases
  • Validates null/empty inputs, numeric patterns (IP addresses, decimals, currency), whitespace handling, and special characters; a representative excerpt follows
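
A representative excerpt, assuming JUnit 5 and that Tokenize.wordTokenize returns a List<String> (expected outputs come from the PyThaiNLP cases quoted in the original prompt below):

```java
// Illustrative excerpt only; the real testNewmmComprehensive() covers the
// full PyThaiNLP suite. Assumes wordTokenize returns List<String>.
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import org.junit.jupiter.api.Test;

class NewmmTokenizerExcerptTest {
    @Test
    void numericPatterns() {
        // Decimals and IP-like strings stay whole; a trailing dot splits off.
        assertEquals(List.of("19.84"),
                Tokenize.wordTokenize("19.84", null, "newmm", false));
        assertEquals(List.of("127.0.0.1"),
                Tokenize.wordTokenize("127.0.0.1", null, "newmm", false));
        assertEquals(List.of("19", "."),
                Tokenize.wordTokenize("19.", null, "newmm", false));
    }
}
```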

Example

```java
// Previously failed - "ภาษาไทย" was incorrectly split
Tokenize.wordTokenize("ฉันรักภาษาไทยเพราะฉันเป็นคนไทย", null, "newmm", false);
// Before: [ฉัน, รัก, ภา, ษาไทยเพราะฉันเป็น, คนไทย]
// After:  [ฉัน, รัก, ภาษาไทย, เพราะ, ฉัน, เป็น, คนไทย] ✓

// Numeric patterns now handled correctly
Tokenize.wordTokenize("USD1,984.42", null, "newmm", false);
// Returns: [USD, 1,984.42] ✓
```
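
PyThaiNLP's newmm handles non-Thai runs with a regex pass before dictionary matching; a hedged sketch of such a pattern (the actual expression in NewmmTokenizer.java may differ):

```java
// Hypothetical non-Thai token pattern: Latin runs, and digit runs that may
// embed "." or "," separators. Not necessarily the pattern the PR uses.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonThaiPattern {
    static final Pattern NON_THAI = Pattern.compile("[A-Za-z]+|[0-9]+([.,][0-9]+)*");

    public static void main(String[] args) {
        Matcher m = NON_THAI.matcher("USD1,984.42");
        while (m.find()) {
            System.out.println(m.group()); // prints "USD", then "1,984.42"
        }
    }
}
```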

All 17 tests pass (9 newmm + 8 trie tests).

Original prompt

This section details the original issue to resolve.

<issue_title>Make passed all test like pythainlp</issue_title>
<issue_description>Make passed all test:

```python
    def test_newmm(self):
        self.assertEqual(newmm.segment(None), [])
        self.assertEqual(newmm.segment(""), [])
        self.assertEqual(
            word_tokenize("ฉันรักภาษาไทยเพราะฉันเป็นคนไทย", engine="newmm"),
            ["ฉัน", "รัก", "ภาษาไทย", "เพราะ", "ฉัน", "เป็น", "คนไทย"],
        )
        self.assertEqual(
            word_tokenize("19...", engine="newmm"),
            ["19", "..."],
        )
        self.assertEqual(
            word_tokenize("19.", engine="newmm"),
            ["19", "."],
        )
        self.assertEqual(
            word_tokenize("19.84", engine="newmm"),
            ["19.84"],
        )
        self.assertEqual(
            word_tokenize("127.0.0.1", engine="newmm"),
            ["127.0.0.1"],
        )
        self.assertEqual(
            word_tokenize("USD1,984.42", engine="newmm"),
            ["USD", "1,984.42"],
        )
        self.assertEqual(
            word_tokenize(
                "สวัสดีครับ สบายดีไหมครับ",
                engine="newmm",
                keep_whitespace=True,
            ),
            ["สวัสดี", "ครับ", " ", "สบายดี", "ไหม", "ครับ"],
        )
        self.assertEqual(
            word_tokenize("จุ๋มง่วงนอนยัง", engine="newmm"),
            ["จุ๋ม", "ง่วงนอน", "ยัง"],
        )
        self.assertEqual(word_tokenize("จุ๋มง่วง", engine="newmm"), ["จุ๋ม", "ง่วง"])
        self.assertEqual(
            word_tokenize("จุ๋ม   ง่วง", engine="newmm", keep_whitespace=False),
            ["จุ๋ม", "ง่วง"],
        )
        self.assertNotIn(
            " ",
            word_tokenize(
                "จุ๋มง่วง",
                keep_whitespace=False,
            ),
        )
        self.assertEqual(
            word_tokenize("(คนไม่เอา)", engine="newmm"),
            ["(", "คน", "ไม่", "เอา", ")"],
        )
        self.assertEqual(
            word_tokenize("กม/ชม", engine="newmm"), ["กม", "/", "ชม"]
        )
        self.assertEqual(
            word_tokenize("สีหน้า(รถ)", engine="newmm"), ["สีหน้า", "(", "รถ", ")"]
        )
```
</issue_description>

## Comments on the Issue (you are @copilot in this section)

<comments>
</comments>



Copilot AI self-assigned this Jan 11, 2026
Copilot AI changed the title from "[WIP] Fix segmentation tests to pass for newmm" to "Fix newmm tokenizer: Thai Character Cluster boundaries and maximal matching" on Jan 11, 2026
Copilot AI requested a review from wannaphong January 11, 2026 10:08
@wannaphong wannaphong marked this pull request as ready for review January 11, 2026 10:09
@wannaphong wannaphong merged commit 10d0f7a into master Jan 11, 2026
1 check passed
