Commit 0538ccb

Copilot and wannaphong committed
Update Phupha integration to filter in spell checker
- Changed to use full Phupha dataset (62,264 words) in corpus file
- Added filtering logic in pythainlp/spell/pn.py to filter by thai_orst_words
- This allows the full Phupha dataset to be available for other uses
- Updated tests to verify filtering works correctly
- Spell checker now filters 38,160 ORST words from full Phupha dataset

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
1 parent 10b566d commit 0538ccb
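
The filtering step described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in pythainlp/spell/pn.py, which is not shown here. The helper name `filter_word_freqs` and the sample data are hypothetical stand-ins for the Phupha frequency list and the ORST word set.

```python
from collections.abc import Iterable


def filter_word_freqs(
    word_freqs: Iterable[tuple[str, int]],
    allowed_words: frozenset[str],
) -> list[tuple[str, int]]:
    """Keep only (word, frequency) pairs whose word is in allowed_words.

    Hypothetical helper: the real filtering in the spell checker may differ.
    """
    return [(word, freq) for word, freq in word_freqs if word in allowed_words]


# Hypothetical stand-ins for the full Phupha list and the ORST word set.
sample_freqs = [("ไทย", 120), ("ภาษา", 95), ("zzz", 3)]
sample_orst = frozenset({"ไทย", "ภาษา"})

# Words absent from the allowed set are dropped; order is preserved.
filtered = filter_word_freqs(sample_freqs, sample_orst)
print(filtered)
```

Filtering at the point of use, rather than in the corpus file, leaves the unfiltered list available to other callers, which matches the intent stated in the commit message.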

5 files changed: +24151 -13 lines changed


pythainlp/corpus/phupha.py

Lines changed: 4 additions & 10 deletions
@@ -4,8 +4,6 @@
 """Phupha: Thai Word Frequency Dataset
 
 Phupha is A Thai Word Frequency Dataset from Common Crawl Corpus.
-The dataset is filtered to include only words from the Royal Society of
-Thailand (ORST) word list.
 
 Dataset:
 Phatthiyaphaibun, W. (2026). Phupha: Thai Word Frequency Dataset
@@ -30,11 +28,9 @@
 
 
 def word_freqs() -> list[tuple[str, int]]:
-    """Get word frequency from Phupha dataset (filtered with thai_orst_words)
+    """Get word frequency from Phupha dataset
 
-    Phupha is A Thai Word Frequency Dataset from Common Crawl Corpus,
-    filtered to include only words from the Royal Society of Thailand
-    (ORST) word list.
+    Phupha is A Thai Word Frequency Dataset from Common Crawl Corpus.
 
     :return: List of tuples (word, frequency)
     :rtype: list[tuple[str, int]]
@@ -63,11 +59,9 @@ def word_freqs() -> list[tuple[str, int]]:
 
 
 def unigram_word_freqs() -> dict[str, int]:
-    """Get unigram word frequency from Phupha dataset (filtered with thai_orst_words)
+    """Get unigram word frequency from Phupha dataset
 
-    Phupha is A Thai Word Frequency Dataset from Common Crawl Corpus,
-    filtered to include only words from the Royal Society of Thailand
-    (ORST) word list.
+    Phupha is A Thai Word Frequency Dataset from Common Crawl Corpus.
 
     :return: Dictionary mapping words to their frequencies
     :rtype: dict[str, int]
