A collection of Thai NLP libraries, dictionaries, and corpora. Pull requests are always welcome.
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
TCC | Thai Character Cluster | C | | | Thanaruk et al. |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Swath | SWATH (Smart Word Analysis for THai) is a word segmentation tool for Thai | C | Longest Matching, Maximal Matching, and Part-of-Speech Bigram | GPL | CMU |
Lexto | Lexto: Thai Lexeme Tokenizer | Java | | LGPL | NECTEC |
Lexto | Lexto: Thai Lexeme Tokenizer | Python 2 | | LGPL | Python2 Wrapper |
Lexto | Lexto: Thai Lexeme Tokenizer | Python 3 | | LGPL | Python3 Wrapper |
Wordcut | Thai word breaker for Node.js | JavaScript, Node.js | | LGPL-3.0 | Veer66, GitHub |
CutKum | Thai word segmentation with deep learning in TensorFlow | Python | 0.93 recall, 0.92 precision, 0.93 F-measure | MIT | Pucktada, GitHub |
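
The "Longest Matching" feature listed for Swath refers to a classic dictionary-based segmentation strategy. The snippet below is only a minimal sketch of that general idea, not Swath's actual implementation; the function name `longest_matching` and the four-word toy dictionary are hypothetical and chosen for illustration.

```python
def longest_matching(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary word that starts there."""
    if max_len is None:
        # Longest entry bounds how far ahead we need to look.
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            # Unknown character: emit it as a single-character token.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"ตา", "ตาก", "กลม", "ลม"}
print(longest_matching("ตากลม", toy_dictionary))  # ['ตาก', 'ลม']
```

The example input is ambiguous: it can also be read as ตา + กลม. Greedy longest matching always commits to ตาก + ลม, which is one reason the libraries above also offer maximal matching or statistical models.
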
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Jitar+NAiST | A simple trigram HMM part-of-speech tagger | Java | | | Veer66, Jitar + NAiST, 1 + NAiST, 2 |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Chart-parser | Extracts syntactic structure from a POS-tagged sentence | C | | Copyright | Thanaruk T. |
Grammar Processing | Labelled Buckets -> Context-Free Grammar (CFG) | Python | Transform and compute probability | | Thodsaporn C. |
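
The Grammar Processing entry describes turning labelled structures into a CFG and computing probabilities. The sketch below shows one common way to do that, assuming the input is labelled bracketed trees such as `(S (NP ...) (VP ...))` and that probabilities are relative frequencies of each rule per left-hand side. Both assumptions, and all function names here, are hypothetical; this is not the listed project's code.

```python
from collections import Counter


def parse_brackets(s):
    """Parse '(S (NP x) (VP y))' into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def count_rules(tree, counts):
    """Count one production (lhs -> rhs) per tree node, recursively."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if isinstance(c, tuple):
            count_rules(c, counts)


def rule_probabilities(bracketings):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter()
    for s in bracketings:
        count_rules(parse_brackets(s), counts)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Two hypothetical toy trees, for illustration only.
trees = [
    "(S (NP ฉัน) (VP (V กิน) (NP ข้าว)))",
    "(S (NP แมว) (VP (V นอน)))",
]
probs = rule_probabilities(trees)
print(probs[("VP", ("V", "NP"))])  # 0.5
```

With these two trees, `S -> NP VP` gets probability 1.0, while `VP -> V NP` and `VP -> V` each get 0.5.
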
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
Transliteration Corpus | | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
Lexitron | Open-source Thai-English dictionary | | TH->EN, EN->TH | LGPL | NECTEC |
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
ORCHID | | 30K sentences | Word-segmented, POS-tagged | CC BY-NC-SA 3.0 TH | NECTEC |
InterBEST 2009/2010 | | 5M words | Word-segmented | CC BY-NC-SA 3.0 TH | NECTEC |
Thai Wikipedia | Formal Articles | 1.49GB (~213.1 MB compressed) | XML | GFDL | WIKIPEDIA |
TNC Top-5000 Words | Word frequency | 5,000 words | Frequency of Thai words in various genres, EXCEL | Copyright | CHULA |
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
Thai National Corpus 2 | | 32M words | Query text by genre, domain | Copyright | CHULA |
Pre-trained Model | Description | Size | Dimensions | License | Link |
---|---|---|---|---|---|
fastText | Skip-gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
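
The "Text Only" download uses fastText's plain-text vector format: a header line with the vocabulary size and dimension, then one word per line followed by its vector components. The snippet below is a minimal sketch of reading that format with the standard library only. The function name `load_vec` and the tiny 3-dimensional sample are hypothetical; the real file has 300 dimensions, as listed above.

```python
import io


def load_vec(lines):
    """Parse fastText text format: a 'count dim' header line, then
    'word v1 ... v_dim' on each following line."""
    lines = iter(lines)
    count, dim = map(int, next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Hypothetical in-memory sample standing in for a downloaded .vec file.
sample = io.StringIO("2 3\nแมว 0.1 0.2 0.3\nหมา 0.4 0.5 0.6\n")
dim, vectors = load_vec(sample)
print(dim, vectors["แมว"])  # 3 [0.1, 0.2, 0.3]
```

For a real file, pass an open file handle instead, for example `open(path, encoding="utf-8")`, where `path` points at the downloaded vectors.
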