Port newmm tokenizer from PyThaiNLP to C #1

Merged
wannaphong merged 7 commits into main from copilot/port-newmm-to-c-api
Jan 11, 2026

Copilot AI commented Jan 11, 2026

Implements the newmm (New Maximum Matching) word segmentation algorithm from PyThaiNLP in C, with a similar API design.

Implementation

  • Trie data structure (src/trie.c) - Prefix matching with UTF-8 support for dictionary lookups
  • Thai Character Cluster (src/tcc.c) - TCC boundary detection following Theeramunkong et al. 2000 rules
  • newmm algorithm (src/newmm.c) - Maximal matching constrained by TCC boundaries, handles Thai/English/numeric text
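
As a rough illustration of the dictionary lookup described above, the following is a minimal sketch of a byte-level trie with longest-prefix matching. It is not the code in src/trie.c: the names `TrieNode`, `trie_new`, `trie_insert`, `trie_longest_prefix`, and `trie_free` are hypothetical, and the 256-way child array is an assumption chosen for brevity rather than memory efficiency. Because edges are indexed by byte, a multi-byte UTF-8 character is simply a longer path, and a match against a valid UTF-8 dictionary word always ends on a character boundary.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical byte-level trie node: one child slot per byte value,
   so a multi-byte UTF-8 character is stored as a path of several nodes. */
typedef struct TrieNode {
    struct TrieNode *child[256];
    int is_word; /* 1 if a dictionary word ends at this node */
} TrieNode;

/* Allocate an empty node (calloc zero-fills the child slots and the flag). */
TrieNode *trie_new(void) {
    return calloc(1, sizeof(TrieNode));
}

/* Insert a NUL-terminated UTF-8 word. Returns 0 on success, -1 if out of memory. */
int trie_insert(TrieNode *root, const char *word) {
    TrieNode *node = root;
    for (const unsigned char *p = (const unsigned char *)word; *p; ++p) {
        if (!node->child[*p]) {
            node->child[*p] = trie_new();
            if (!node->child[*p]) return -1;
        }
        node = node->child[*p];
    }
    node->is_word = 1;
    return 0;
}

/* Return the length in bytes of the longest dictionary word that is a
   prefix of text, or 0 if no dictionary word matches. */
size_t trie_longest_prefix(const TrieNode *root, const char *text) {
    const TrieNode *node = root;
    size_t best = 0, len = 0;
    for (const unsigned char *p = (const unsigned char *)text; *p; ++p) {
        node = node->child[*p];
        if (!node) break;
        ++len;
        if (node->is_word) best = len;
    }
    return best;
}

/* Recursively release a node and all of its descendants. */
void trie_free(TrieNode *node) {
    if (!node) return;
    for (int i = 0; i < 256; ++i) trie_free(node->child[i]);
    free(node);
}
```

This sketch covers only the prefix lookup. The PR describes newmm as maximal matching constrained by TCC boundaries, which a single longest-prefix query does not do on its own.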

API

// Segment Thai text with optional custom dictionary
char** newmm_segment(const char* text, const char* dict_path, int* token_count);
void newmm_free_result(char** tokens, int token_count);

Usage mirrors PyThaiNLP:

// C
int count;
char** tokens = newmm_segment("ฉันไปโรงเรียน", "dict.txt", &count);
// ['ฉัน', 'ไป', 'โรงเรียน']
newmm_free_result(tokens, count);

# Python (PyThaiNLP)
from pythainlp.tokenize import word_tokenize
tokens = word_tokenize("ฉันไปโรงเรียน", engine="newmm")
# ['ฉัน', 'ไป', 'โรงเรียน']
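
The C usage above implies that the caller owns the returned array and must hand it back to `newmm_free_result` together with its count. The sketch below shows one way such an allocate/free pair could work. It is an assumption about the ownership pattern, not the PR's implementation: `tokens_copy` and `tokens_free` are hypothetical names, and `tokens_copy` only duplicates strings it is given rather than segmenting text.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy n strings into a freshly allocated array,
   mimicking the shape of a result returned through (char**, int*).
   Returns NULL and sets *token_count to 0 on empty input or allocation failure. */
char **tokens_copy(const char *const *src, int n, int *token_count) {
    *token_count = 0;
    if (n <= 0) return NULL;

    char **out = calloc((size_t)n, sizeof *out);
    if (!out) return NULL;

    for (int i = 0; i < n; ++i) {
        size_t len = strlen(src[i]) + 1; /* include the terminating NUL */
        out[i] = malloc(len);
        if (!out[i]) {
            /* Roll back everything allocated so far. */
            for (int j = 0; j < i; ++j) free(out[j]);
            free(out);
            return NULL;
        }
        memcpy(out[i], src[i], len);
    }
    *token_count = n;
    return out;
}

/* Hypothetical counterpart with the same parameter shape as
   newmm_free_result: free each token, then the array itself. */
void tokens_free(char **tokens, int token_count) {
    if (!tokens) return;
    for (int i = 0; i < token_count; ++i) free(tokens[i]);
    free(tokens);
}
```

Passing the count to the free function, as the PR's API does, avoids the need for a NULL sentinel at the end of the array.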

Testing

  • Test suite with 8 cases covering Thai text, mixed content, edge cases
  • Example program demonstrating usage
  • Sample dictionary included in data/

CI/CD

  • GitHub Actions workflow (.github/workflows/test.yml) - Automated testing on Ubuntu (gcc) and macOS (clang)
  • Runs on push and pull requests to ensure code quality
  • CI badge added to README for build status visibility
Original prompt

Port newmm from PyThaiNLP to C, with an API similar to PyThaiNLP's

GitHub: https://github.com/PyThaiNLP/pythainlp


Copilot AI and others added 2 commits January 11, 2026 05:51
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
@wannaphong

@copilot Add Github action for testing

Copilot AI and others added 3 commits January 11, 2026 05:55
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Port newmm from Pythainlp to C with similar API" to "Port newmm tokenizer from PyThaiNLP to C" Jan 11, 2026
Copilot AI requested a review from wannaphong January 11, 2026 06:00
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>

Copilot AI commented Jan 11, 2026

@copilot Add Github action for testing

Added GitHub Actions workflow in commit dcba1f6. The CI pipeline now:

  • Tests on Ubuntu (gcc) and macOS (clang)
  • Runs on push and pull requests
  • Executes full test suite and example programs
  • Uploads artifacts on failure for debugging

CI badge added to README.

@wannaphong wannaphong marked this pull request as ready for review January 11, 2026 06:03
@wannaphong wannaphong merged commit ae8e000 into main Jan 11, 2026
4 checks passed
2 participants