
Implement PyThaiNLP-compatible test_newmm with numeric pattern tokenization#9

Merged
wannaphong merged 3 commits into main from copilot/fix-newmm-segment-tests
Jan 11, 2026

Conversation


Copilot AI commented Jan 11, 2026

Adds a comprehensive test suite matching PyThaiNLP's newmm tokenization behavior; making it pass required fixes to numeric/punctuation handling and API compatibility.

Changes

API Compatibility

  • Added cthainlp.newmm module with segment() function accepting None/empty inputs
  • Added keep_whitespace parameter to word_tokenize() (default: True)
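The keep_whitespace behavior can be illustrated with a small post-filtering sketch (the function name and pure-Python filtering here are illustrative; the actual binding may handle this at the C level):

```python
def filter_whitespace(tokens, keep_whitespace=True):
    """Drop tokens that are pure whitespace when keep_whitespace is False.

    Illustrative sketch of what a keep_whitespace flag implies for the
    token stream; not the actual cthainlp implementation.
    """
    if keep_whitespace:
        return tokens
    return [t for t in tokens if not t.isspace()]

print(filter_whitespace(["สวัสดี", " ", "ครับ"], keep_whitespace=False))
```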

Numeric Pattern Tokenization (src/newmm.c)

  • Split trailing punctuation from numbers: "19..." → ["19", "..."] instead of ["19..."]
  • Preserve decimal separators followed by digits: "19.84" → ["19.84"]
  • Preserve IP addresses: "127.0.0.1" → ["127.0.0.1"]
  • Group consecutive identical punctuation: "..." as single token

Implementation uses lookahead to distinguish valid numeric patterns (decimal, IP) from punctuation boundaries.
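The lookahead rule can be sketched in Python (an illustrative reimplementation, not the actual src/newmm.c code): a '.' or ',' inside a number is kept only when a digit follows it; otherwise the punctuation is split off, and runs of the same punctuation character are grouped into one token.

```python
def split_numeric(text):
    """Tokenize digits and punctuation using one-character lookahead.

    Keeps '.' or ',' inside a number only when the next character is a
    digit (covers decimals like 19.84 and IPs like 127.0.0.1); otherwise
    splits it off and groups runs of identical punctuation (e.g. "...").
    Illustrative sketch only; the real logic lives in src/newmm.c.
    """
    tokens, i, n = [], 0, len(text)
    while i < n:
        c = text[i]
        if c.isdigit():
            j = i + 1
            # Extend the number while the next char is a digit, or a
            # separator that is itself followed by a digit (lookahead).
            while j < n and (text[j].isdigit()
                             or (text[j] in ".,"
                                 and j + 1 < n and text[j + 1].isdigit())):
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            # Group consecutive identical punctuation into one token.
            j = i + 1
            while j < n and text[j] == c:
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens

print(split_numeric("19..."))      # ["19", "..."]
print(split_numeric("19.84"))      # ["19.84"]
print(split_numeric("127.0.0.1"))  # ["127.0.0.1"]
```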

Test Suite (tests/python/test_tokenize.py)

  • Added test_newmm() with 16 assertions covering edge cases, numeric patterns, whitespace handling, and special characters
  • All existing tests continue passing

Dictionary

  • Added missing word จุ๋ม for correct Thai tokenization

Example

from cthainlp import newmm, word_tokenize

# New module API
newmm.segment(None)  # []
newmm.segment("")    # []

# Numeric pattern handling
word_tokenize("19...", engine="newmm")      # ["19", "..."]
word_tokenize("19.84", engine="newmm")      # ["19.84"]
word_tokenize("127.0.0.1", engine="newmm")  # ["127.0.0.1"]

# Whitespace control
word_tokenize("สวัสดี ครับ", keep_whitespace=True)   # ["สวัสดี", " ", "ครับ"]
word_tokenize("สวัสดี ครับ", keep_whitespace=False)  # ["สวัสดี", "ครับ"]
Original prompt

This section describes the original issue to resolve.

<issue_title>Make passed all test like pythainlp</issue_title>
<issue_description>Make passed all test:

```python
    def test_newmm(self):
        self.assertEqual(newmm.segment(None), [])
        self.assertEqual(newmm.segment(""), [])
        self.assertEqual(
            word_tokenize("ฉันรักภาษาไทยเพราะฉันเป็นคนไทย", engine="newmm"),
            ["ฉัน", "รัก", "ภาษาไทย", "เพราะ", "ฉัน", "เป็น", "คนไทย"],
        )
        self.assertEqual(
            word_tokenize("19...", engine="newmm"),
            ["19", "..."],
        )
        self.assertEqual(
            word_tokenize("19.", engine="newmm"),
            ["19", "."],
        )
        self.assertEqual(
            word_tokenize("19.84", engine="newmm"),
            ["19.84"],
        )
        self.assertEqual(
            word_tokenize("127.0.0.1", engine="newmm"),
            ["127.0.0.1"],
        )
        self.assertEqual(
            word_tokenize("USD1,984.42", engine="newmm"),
            ["USD", "1,984.42"],
        )
        self.assertEqual(
            word_tokenize(
                "สวัสดีครับ สบายดีไหมครับ",
                engine="newmm",
                keep_whitespace=True,
            ),
            ["สวัสดี", "ครับ", " ", "สบายดี", "ไหม", "ครับ"],
        )
        self.assertEqual(
            word_tokenize("จุ๋มง่วงนอนยัง", engine="newmm"),
            ["จุ๋ม", "ง่วงนอน", "ยัง"],
        )
        self.assertEqual(word_tokenize("จุ๋มง่วง", engine="newmm"), ["จุ๋ม", "ง่วง"])
        self.assertEqual(
            word_tokenize("จุ๋ม   ง่วง", engine="newmm", keep_whitespace=False),
            ["จุ๋ม", "ง่วง"],
        )
        self.assertNotIn(
            " ",
            word_tokenize(
                "จุ๋มง่วง",
                keep_whitespace=False,
            ),
        )
        self.assertEqual(
            word_tokenize("(คนไม่เอา)", engine="newmm"),
            ["(", "คน", "ไม่", "เอา", ")"],
        )
        self.assertEqual(
            word_tokenize("กม/ชม", engine="newmm"), ["กม", "/", "ชม"]
        )
        self.assertEqual(
            word_tokenize("สีหน้า(รถ)", engine="newmm"), ["สีหน้า", "(", "รถ", ")"]
        )
```
</issue_description>

## Comments on the Issue (you are @copilot in this section)

<comments>
</comments>



Copilot AI and others added 2 commits January 11, 2026 10:02
- Create newmm module with segment function for PyThaiNLP compatibility
- Add keep_whitespace parameter support to word_tokenize
- Fix numeric tokenization to handle patterns like 19..., 19., 19.84, etc.
- Group consecutive same punctuation characters (e.g., ...)
- Add missing word จุ๋ม to dictionary
- All tests now pass including new test_newmm test

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
- Remove unused is_valid_numeric_continuation function
- Fix import example in newmm.py docstring

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix tests for newmm segment functionality Implement PyThaiNLP-compatible test_newmm with numeric pattern tokenization Jan 11, 2026
Copilot AI requested a review from wannaphong January 11, 2026 10:06
@wannaphong wannaphong marked this pull request as ready for review January 11, 2026 10:08
@wannaphong wannaphong merged commit a107915 into main Jan 11, 2026
34 checks passed


Development

Successfully merging this pull request may close these issues.

Make passed all test like pythainlp
