NanoSoftTech/nlp_thai_resources

 
 


Thai NLP Resource

A collection of Thai NLP libraries, dictionaries, and corpora. Pull requests are always welcome.

Thai NLP Libraries

Thai Character Cluster

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| TCC | Thai Character Cluster | C | | | Thanaruk et al. |

Word Segmentation

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Swath | SWATH (Smart Word Analysis for THai) is a word segmentation tool for Thai | C | Longest Matching, Maximal Matching and Part-of-Speech Bigram | GPL | CMU |
| Lexto | Lexto: Thai Lexeme Tokenizer | Java | | LGPL | NECTEC |
| | | Python 2 | | LGPL | Python2 Wrapper |
| | | Python 3 | | LGPL | Python3 Wrapper |
| Wordcut | Thai word breaker for Node.js | JavaScript, Node.js | | LGPL-3.0 | Veer66, github |
| CutKum | Thai word segmentation with deep learning in TensorFlow | Python | 0.93 recall, 0.92 precision, 0.93 F-measure | MIT | Pucktada, github |
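Longest matching, one of the strategies listed for Swath, can be illustrated with a minimal sketch. This is not Swath's implementation: the function name `longest_match_segment`, its parameters, and the toy dictionary are all hypothetical, chosen only to show the idea of greedily taking the longest dictionary word at each position.

```python
def longest_match_segment(text, dictionary, max_word_len=None):
    """Greedy left-to-right longest matching against a word list (illustrative only)."""
    if max_word_len is None:
        max_word_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            # No dictionary word starts here: emit a single-character token.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Toy dictionary, invented for illustration; a real system would load a full lexicon.
toy_dictionary = {"ตา", "ตาก", "กลม", "ลม"}
print(longest_match_segment("ตากลม", toy_dictionary))
```

On this toy input the greedy strategy commits to ตาก first and never considers the alternative split ตา + กลม. Ambiguities of this kind are one reason the libraries above also offer maximal matching, part-of-speech bigrams, or learned models.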

Part of Speech Tagging (POS Tagging)

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Jitar+NAiST | A simple trigram HMM part-of-speech tagger | Java | | | Veer66, Jitar + NAiST, 1 + NAiST, 2 |

Named Entity Recognition

Syntactic Parsing & Tools

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Chart-parser | Extract syntactic structure from a POS-tagged sentence | C | | Copyright | Thanaruk T. |
| Grammar Processing | Labelled Buckets -> Context Free Grammar (CFG) | Python | Transform and compute probability | | Thodsaporn C. |

Dictionaries / Translation Pairs

| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| Transliteration Corpus | | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
| Lexitron | Open-source Thai-English Dictionary | | TH->EN, EN->TH | LGPL | NECTEC |

Downloadable Text Corpus

| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| ORCHID | | 30K sent. | Word Seg., POS Tagged | CC BY-NC-SA 3.0 TH | NECTEC |
| InterBEST 2009/2010 | | 5M words | Word Seg. | CC BY-NC-SA 3.0 TH | NECTEC |
| Thai Wikipedia | Formal Articles | 1.49GB (~213.1 MB compressed) | XML | GFDL | WIKIPEDIA |
| TNC Top-5000 Words | Word frequency | 5,000 words | Frequency of Thai words in various genres, EXCEL | Copyright | CHULA |

Web Query Text Corpus

| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| Thai National Corpus 2 | | 32M words | Query text by genre, domain | Copyright | CHULA |

Pre-trained Word Vectors

| Pre-trained Model | Description | Size | Dimensions | License | Link |
|---|---|---|---|---|---|
| fastText | Skip-Gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
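The "Text Only" download is typically a plain-text vector file: a header line with the word count and dimension, then one word per line followed by its values. The sketch below reads that layout with the standard library only. It assumes that format; the helper name `read_vec` and the three-dimensional sample data are made up for illustration (the real vectors have 300 dimensions).

```python
import io


def read_vec(stream, limit=None):
    """Parse text vectors: a 'count dim' header, then 'word v1 v2 ...' per line."""
    header = stream.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # stop early when only the first few words are needed
    return dim, vectors


# Tiny in-memory sample standing in for a downloaded file (values are invented).
sample = io.StringIO("2 3\nแมว 0.1 0.2 0.3\nหมา 0.4 0.5 0.6\n")
dim, vectors = read_vec(sample)
print(dim, sorted(vectors))
```

For a real download, pass a file opened with `open(path, encoding="utf-8")`, where `path` is wherever the text file was saved. The `limit` argument helps when only the first entries are needed.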

About

More than 17 collections of Thai NLP libraries. Updated daily.
