
Implement PyThaiNLP-compatible test_newmm with numeric pattern tokenization#9

Merged
wannaphong merged 3 commits into main from copilot/fix-newmm-segment-tests
Jan 11, 2026

Conversation


Copilot AI commented Jan 11, 2026

Adds a comprehensive test suite matching PyThaiNLP's newmm tokenization behavior; making it pass required fixes to numeric/punctuation handling and API compatibility.

Changes

API Compatibility

  • Added cthainlp.newmm module with segment() function accepting None/empty inputs
  • Added keep_whitespace parameter to word_tokenize() (default: True)
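The keep_whitespace behavior can be illustrated with a small post-filtering sketch (the function name and pure-Python filtering here are illustrative; the actual binding may handle this at the C level):

```python
def filter_whitespace(tokens, keep_whitespace=True):
    """Drop tokens that are pure whitespace when keep_whitespace is False.

    Illustrative sketch of what a keep_whitespace flag implies for the
    token stream; not the actual cthainlp implementation.
    """
    if keep_whitespace:
        return tokens
    return [t for t in tokens if not t.isspace()]

print(filter_whitespace(["สวัสดี", " ", "ครับ"], keep_whitespace=False))
```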

Numeric Pattern Tokenization (src/newmm.c)

  • Split trailing punctuation from numbers: "19..." → ["19", "..."] instead of ["19..."]
  • Preserve decimal separators followed by digits: "19.84" → ["19.84"]
  • Preserve IP addresses: "127.0.0.1" → ["127.0.0.1"]
  • Group consecutive identical punctuation: "..." as single token

Implementation uses lookahead to distinguish valid numeric patterns (decimal, IP) from punctuation boundaries.
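The lookahead rule can be sketched in Python (an illustrative reimplementation, not the actual src/newmm.c code): a '.' or ',' inside a number is kept only when a digit follows it; otherwise the punctuation is split off, and runs of the same punctuation character are grouped into one token.

```python
def split_numeric(text):
    """Tokenize digits and punctuation using one-character lookahead.

    Keeps '.' or ',' inside a number only when the next character is a
    digit (covers decimals like 19.84 and IPs like 127.0.0.1); otherwise
    splits it off and groups runs of identical punctuation (e.g. "...").
    Illustrative sketch only; the real logic lives in src/newmm.c.
    """
    tokens, i, n = [], 0, len(text)
    while i < n:
        c = text[i]
        if c.isdigit():
            j = i + 1
            # Extend the number while the next char is a digit, or a
            # separator that is itself followed by a digit (lookahead).
            while j < n and (text[j].isdigit()
                             or (text[j] in ".,"
                                 and j + 1 < n and text[j + 1].isdigit())):
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            # Group consecutive identical punctuation into one token.
            j = i + 1
            while j < n and text[j] == c:
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens

print(split_numeric("19..."))      # ["19", "..."]
print(split_numeric("19.84"))      # ["19.84"]
print(split_numeric("127.0.0.1"))  # ["127.0.0.1"]
```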

Test Suite (tests/python/test_tokenize.py)

  • Added test_newmm() with 16 assertions covering edge cases, numeric patterns, whitespace handling, and special characters
  • All existing tests continue passing

Dictionary

  • Added missing word จุ๋ม for correct Thai tokenization

Example

from cthainlp import newmm, word_tokenize

# New module API
newmm.segment(None)  # []
newmm.segment("")    # []

# Numeric pattern handling
word_tokenize("19...", engine="newmm")      # ["19", "..."]
word_tokenize("19.84", engine="newmm")      # ["19.84"]
word_tokenize("127.0.0.1", engine="newmm")  # ["127.0.0.1"]

# Whitespace control
word_tokenize("สวัสดี ครับ", keep_whitespace=True)   # ["สวัสดี", " ", "ครับ"]
word_tokenize("สวัสดี ครับ", keep_whitespace=False)  # ["สวัสดี", "ครับ"]
Original prompt

This section describes the original issue to resolve.

<issue_title>Make passed all test like pythainlp</issue_title>
<issue_description>Make passed all test:

```python
    def test_newmm(self):
        self.assertEqual(newmm.segment(None), [])
        self.assertEqual(newmm.segment(""), [])
        self.assertEqual(
            word_tokenize("ฉันรักภาษาไทยเพราะฉันเป็นคนไทย", engine="newmm"),
            ["ฉัน", "รัก", "ภาษาไทย", "เพราะ", "ฉัน", "เป็น", "คนไทย"],
        )
        self.assertEqual(
            word_tokenize("19...", engine="newmm"),
            ["19", "..."],
        )
        self.assertEqual(
            word_tokenize("19.", engine="newmm"),
            ["19", "."],
        )
        self.assertEqual(
            word_tokenize("19.84", engine="newmm"),
            ["19.84"],
        )
        self.assertEqual(
            word_tokenize("127.0.0.1", engine="newmm"),
            ["127.0.0.1"],
        )
        self.assertEqual(
            word_tokenize("USD1,984.42", engine="newmm"),
            ["USD", "1,984.42"],
        )
        self.assertEqual(
            word_tokenize(
                "สวัสดีครับ สบายดีไหมครับ",
                engine="newmm",
                keep_whitespace=True,
            ),
            ["สวัสดี", "ครับ", " ", "สบายดี", "ไหม", "ครับ"],
        )
        self.assertEqual(
            word_tokenize("จุ๋มง่วงนอนยัง", engine="newmm"),
            ["จุ๋ม", "ง่วงนอน", "ยัง"],
        )
        self.assertEqual(word_tokenize("จุ๋มง่วง", engine="newmm"), ["จุ๋ม", "ง่วง"])
        self.assertEqual(
            word_tokenize("จุ๋ม   ง่วง", engine="newmm", keep_whitespace=False),
            ["จุ๋ม", "ง่วง"],
        )
        self.assertNotIn(
            " ",
            word_tokenize(
                "จุ๋มง่วง",
                keep_whitespace=False,
            ),
        )
        self.assertEqual(
            word_tokenize("(คนไม่เอา)", engine="newmm"),
            ["(", "คน", "ไม่", "เอา", ")"],
        )
        self.assertEqual(
            word_tokenize("กม/ชม", engine="newmm"), ["กม", "/", "ชม"]
        )
        self.assertEqual(
            word_tokenize("สีหน้า(รถ)", engine="newmm"), ["สีหน้า", "(", "รถ", ")"]
        )
```
</issue_description>

## Comments on the Issue (you are @copilot in this section)

<comments>
</comments>



Copilot AI and others added 2 commits January 11, 2026 10:02
- Create newmm module with segment function for PyThaiNLP compatibility
- Add keep_whitespace parameter support to word_tokenize
- Fix numeric tokenization to handle patterns like 19..., 19., 19.84, etc.
- Group consecutive same punctuation characters (e.g., ...)
- Add missing word จุ๋ม to dictionary
- All tests now pass including new test_newmm test

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
- Remove unused is_valid_numeric_continuation function
- Fix import example in newmm.py docstring

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix tests for newmm segment functionality Implement PyThaiNLP-compatible test_newmm with numeric pattern tokenization Jan 11, 2026
Copilot AI requested a review from wannaphong January 11, 2026 10:06
@wannaphong wannaphong marked this pull request as ready for review January 11, 2026 10:08
@wannaphong wannaphong merged commit a107915 into main Jan 11, 2026
34 checks passed


Development

Successfully merging this pull request may close these issues.

Make passed all test like pythainlp
