Commit f0532a0

Updated README
1 parent 7b21cbe commit f0532a0

3 files changed, +84 -42 lines changed

README.md

Lines changed: 69 additions & 26 deletions
@@ -58,7 +58,12 @@ Includes word tokeniser for Taiwanese Hokkien.
 <li><a href="#convert-non-cjk">Convert non-CJK</a></li>
 </ul>
 </li>
-<li><a href="#tokeniser">Tokeniser</a></li>
+<li>
+<a href="#tokeniser">Tokeniser</a>
+<ul>
+<li><a href="#keep-original">Keep original</a></li>
+</ul>
+</li>
 <li><a href="#other-functions">Other Functions</a></li>
 </ul>
 </li>
@@ -135,7 +140,7 @@ c.get(input)
 
 `format` String - format in which tones will be represented in the converted sentence.
 
-* `mark` (default) - uses diacritics for each syllable. Not available for TLPA.
+* `mark` (default) - uses diacritics for each syllable. Not available for TLPA
 * `number` - add a number which represents the tone at the end of the syllable
 * `strip` - removes any tone marking
 
@@ -173,9 +178,9 @@ Default value depends on the chosen `system`:
 * `auto` - for `Tongiong`
 * `none` - for `Tailo`, `POJ`, `Zhuyin`, `TLPA`, `Pingyim`, `IPA`
 
-| text | none | auto | exc_last | incl_last |
-| ---------------- | ------------------------- | -------------------------- | ------------------------- | ------------------------- |
-| 這是你的手機仔無 | Tse sī lí ê tshiú-ki-á bô | Tse sì li ē tshiu-kī-á bô? | Tsē sì li ē tshiu-kī-a bô | Tsē sì li ē tshiu-kī-a bō |
+| text | none | auto | exc_last | incl_last |
+| ---------------- | ----------------------- | ---------------------- | ---------------------- | ---------------------- |
+| 這是你的茶桌仔無 | Tse sī lí ê tê-toh-á bô | Tse sì li ē tē-to-á bô | Tsē sì li ē tē-tó-a bô | Tsē sì li ē tē-tó-a bō |
 
 Sandhi rules also change depending on the dialect chosen.
 
@@ -187,8 +192,8 @@ Sandhi rules also change depending on the dialect chosen.
 
 `punctuation` String
 
-* `format` (default) - converts Chinese-style punctuation to Latin-style punctuation and capitalises words at the beginning of each sentence.
-* `none` - preserves Chinese-style punctuation and doesn't capitalise words at the beginning of new sentences.
+* `format` (default) - converts Chinese-style punctuation to Latin-style punctuation and capitalises words at the beginning of each sentence
+* `none` - preserves Chinese-style punctuation and doesn't capitalise words at the beginning of new sentences
 
 | text | format | none |
 | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
@@ -211,24 +216,38 @@ Sandhi rules also change depending on the dialect chosen.
 
 ```python
 # Constructor
-t = Tokeniser()
+t = Tokeniser(keep_original)
 
 # Tokenise Taiwanese Hokkien sentence
 t.tokenise(input)
 ```
 
+#### Keep original
+
+`keep_original` Boolean - defines whether the original characters of the input are retained.
+
+* `True` (default) - preserve original characters
+* `False` - replace original characters with characters defined in the dataset
+
+| text | True | False |
+| ------------ | -------------------- | -------------------- |
+| 臺灣火鸡肉饭 | ['臺灣', '火鸡肉饭'] | ['台灣', '火雞肉飯'] |
+
 ### Other Functions
 
 Handy functions for NLP tasks in Taiwanese Hokkien.
 
+`to_traditional` function converts input to Traditional Chinese characters that are used in the dataset. Also accounts for different variants of Traditional Chinese characters.
+
+`to_simplified` function converts input to Simplified Chinese characters.
+
+`is_cjk` function checks whether the input string consists entirely of Chinese characters.
+
 ```python
-# Convert to Traditional
 to_traditional(input)
 
-# Convert to Simplified
 to_simplified(input)
 
-# Check if the string is fully composed of Chinese characters
 is_cjk(input)
 ```
 
@@ -283,20 +302,20 @@ c.get("先生講,學生恬恬聽。")
 
 ## Sandhi
 c = Converter() # for Tailo, sandhi none by default
-c.get("這是台灣囡仔")
->> Tse sī Tâi-uân gín-á
+c.get("這是你的茶桌仔無")
+>> Tse sī lí ê tê-toh-á bô
 
 c = Converter(sandhi='auto')
-c.get("這是台灣囡仔")
->> Tse sì Tāi-uān gin-á
+c.get("這是你的茶桌仔無")
+>> Tse sì li ē tē-to-á bô
 
 c = Converter(sandhi='exc_last')
-c.get("這是台灣囡仔")
->> Tsē sì Tāi-uān gin-á
+c.get("這是你的茶桌仔無")
+>> Tsē sì li ē tē--a bô
 
 c = Converter(sandhi='incl_last')
-c.get("這是台灣囡仔")
->> Tsē sì Tāi-uān gin-a
+c.get("這是你的茶桌仔無")
+>> Tsē sì li ē tē--a bō
 
 ## Punctuation
 c = Converter() # format punctuation default
@@ -308,11 +327,11 @@ c.get("太空朋友,恁好!恁食飽未?")
 >> thài-khong pîng-iú,lín-hó!lín tsia̍h-pá buē?
 
 ## Convert non-CJK
-c = Convert(system='Zhuyin') # False convert_non_cjk default
+c = Converter(system='Zhuyin') # False convert_non_cjk default
 c.get("我食pháng")
 >> ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ pháng
 
-c = Convert(system='Zhuyin', convert_non_cjk=True)
+c = Converter(system='Zhuyin', convert_non_cjk=True)
 c.get("我食pháng")
 >> ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ ㄆㄤˋ
 
@@ -324,16 +343,40 @@ t = Tokeniser()
 t.tokenise("太空朋友,恁好!恁食飽未?")
 >> ['太空', '朋友', '', '恁好', '', '', '食飽', '', '']
 
+## Keep Original
+t = Tokeniser() # True keep_original default
+t.tokenise("爲啥物臺灣遮爾好?")
+>> ['爲啥物', '臺灣', '遮爾', '', '']
+
+t.tokenise("为啥物台湾遮尔好?")
+>> ['为啥物', '台湾', '遮尔', '', '']
+
+t = Tokeniser(False)
+t.tokenise("爲啥物臺灣遮爾好?")
+>> ['為啥物', '台灣', '遮爾', '', '']
+
+t.tokenise("为啥物台湾遮尔好?")
+>> ['為啥物', '台灣', '遮爾', '', '']
+
 
 # Other Functions
 from taibun import to_traditional, to_simplified, is_cjk
 
-to_traditional("我听无台湾话")
->> 我聽無台灣話
+## to_traditional
+to_traditional("我听无台语")
+>> 我聽無台語
+
+to_traditional("我爱这个个人台面")
+>> 我愛這个個人檯面
+
+to_traditional("爲啥物")
+>> 為啥物
 
-to_simplified("我聽無臺灣話")
->> 我听无台湾话
+## to_simplified
+to_simplified("我聽無台語")
+>> 我听无台语
 
+## is_cjk
 is_cjk('我食麭')
 >> True
 
@@ -377,7 +420,7 @@ The data is licensed under [CC BY-SA 4.0][data-cc]
 [licence-badge]: https://img.shields.io/github/license/andreihar/taibun?color=000000&style=for-the-badge
 [licence]: LICENSE
 [linkedin-badge]: https://img.shields.io/badge/LinkedIn-0077b5?style=for-the-badge&logo=linkedin&logoColor=ffffff
-[linkedin]: https://www.linkedin.com/in/andrei-harbachov/
+[linkedin]: https://www.linkedin.com/in/andreihar/
 [js-badge]: https://img.shields.io/badge/JS_Version-f7df1e?style=for-the-badge&logo=javascript&logoColor=000000
 [js-link]: https://github.com/andreihar/taibun.js
 [downloads-badge]: https://img.shields.io/pypi/dm/taibun.svg?style=for-the-badge
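
The `keep_original` option documented in the README changes above can be pictured with a small toy model. The sketch below is not taibun's implementation: `VARIANTS`, `VOCAB`, `normalise` and this `tokenise` function are hypothetical stand-ins, and the greedy longest-match strategy is an assumption made purely for illustration. It only shows how tokens could be cut on a normalised copy of the input while the returned text comes from either the original or the normalised string.

```python
# Toy illustration only: VARIANTS and VOCAB are made-up stand-ins,
# not taibun's real dataset or algorithm.
VARIANTS = {"臺": "台", "鸡": "雞", "饭": "飯"}  # hypothetical one-to-one character map
VOCAB = {"台灣", "火雞肉飯"}                     # hypothetical dataset words
MAX_LEN = max(len(word) for word in VOCAB)


def normalise(text):
    """Map each character to its assumed dataset form (one-to-one)."""
    return "".join(VARIANTS.get(ch, ch) for ch in text)


def tokenise(text, keep_original=True):
    """Greedy longest-match split; boundaries are found on the normalised text."""
    norm = normalise(text)
    source = text if keep_original else norm
    tokens, i = [], 0
    while i < len(norm):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(MAX_LEN, len(norm) - i), 0, -1):
            if size == 1 or norm[i:i + size] in VOCAB:
                break
        tokens.append(source[i:i + size])
        i += size
    return tokens


print(tokenise("臺灣火鸡肉饭"))                       # ['臺灣', '火鸡肉饭']
print(tokenise("臺灣火鸡肉饭", keep_original=False))  # ['台灣', '火雞肉飯']
```

Because the assumed character map is one-to-one, offsets in the normalised string line up with offsets in the original, which is what lets the toy version return original characters without a second lookup.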

setup.cfg

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [metadata]
 name = taibun
-version = 1.1.1
+version = 1.1.2
 author = Andrei Harbachov
 author_email = andrei.harbachov@gmail.com
 description = Taiwanese Hokkien Transliterator and Tokeniser

taibun/data/words.json

Lines changed: 14 additions & 15 deletions
@@ -2386,7 +2386,6 @@
 "無論": "bô-lūn",
 "無名": "bô-miâ",
 "無名化": "bô-miâ-huà",
-"無名先生": "bô-miâ-sin-senn/bô-miâ-sin-sinn",
 "無命": "bô-miā",
 "無暝無日": "bô-mê-bô-ji̍t/bô-mî-bô-li̍t",
 "無我": "bô-ngóo",
@@ -3880,7 +3879,7 @@
 "玩具": "guán-khū",
 "玩樂": "guán-lo̍k",
 "阮兩人": "guán-nn̄g-lâng",
-"阮先生": "guán-sian-senn/guán-sian-sinn",
+"阮先生": "guán-sian-sinn",
 "阮太太": "guán-thài-thài",
 "玩者": "guán-tsiá",
 "元": "guân",
@@ -24168,7 +24167,7 @@
 "老步": "lāu-pōo",
 "老步定": "lāu-pōo-tiānn",
 "老步在": "lāu-pōo-tsāi",
-"老先生": "lāu-sian-senn/lāu-sian-sinn",
+"老先生": "lāu-sian-sinn",
 "漏洩": "lāu-sia̍p",
 "老身": "lāu-sin",
 "老生": "lāu-sing",
@@ -28634,7 +28633,7 @@
 "壁爐": "piah-lôo",
 "壁邊": "piah-pinn",
 "壁報": "piah-pò",
-"壁先生": "piah-sian-senn/piah-sian-sinn",
+"壁先生": "piah-sian-sinn",
 "壁頭": "piah-thâu",
 "壁燈": "piah-ting",
 "壁鐘": "piah-tsing",
@@ -30038,7 +30037,7 @@
 "拜師傅": "pài-sai-hū",
 "拜三": "pài-sann",
 "拜生日": "pài-senn-ji̍t/pài-sinn-li̍t",
-"拜先生": "pài-sian-senn/pài-sian-sinn",
+"拜先生": "pài-sian-sinn",
 "拜上帝": "pài-siāng-tè",
 "拜壽": "pài-siū",
 "拜歲蘭": "pài-suè-lân",
@@ -32173,12 +32172,12 @@
 "仙巴掌": "sian-pa-tsiáng",
 "仙拚仙": "sian-piànn-sian",
 "先輩": "sian-puè",
-"先生": "sian-senn/sian-sinn",
-"先生公": "sian-senn-kong/sian-sinn-kong",
-"先生禮": "sian-senn-lé/sian-sinn-lé",
-"先生媽": "sian-senn-má/sian-sinn-má",
-"先生娘": "sian-senn-niû/sian-sinn-niû",
-"先生仔": "sian-senn-á/sian-sinn-á",
+"先生": "sian-sinn",
+"先生公": "sian-sinn-kong",
+"先生禮": "sian-sinn-lé",
+"先生媽": "sian-sinn-má",
+"先生娘": "sian-sinn-niû",
+"先生仔": "sian-sinn-á",
 "仙屎": "sian-sái",
 "仙丹": "sian-tan",
 "仙丹花": "sian-tan-hue",
@@ -53012,7 +53011,7 @@
 "地理": "tē-lí/tuē-lí",
 "地理仙": "tē-lí-sian/tuē-lí-sian",
 "地理先": "tē-lí-sian/tuē-lí-sian",
-"地理先生": "tē-lí-sian-senn/tuē-lí-sian-sinn",
+"地理先生": "tē-lí-sian-sinn/tuē-lí-sian-sinn",
 "地理仙仔": "tē-lí-sian-á/tuē-lí-sian-á",
 "地理師": "tē-lí-su/tuē-lí-su",
 "地理師仔": "tē-lí-su-á/tuē-lí-su-á",
@@ -53614,7 +53613,7 @@
 "蝹碖蜷": "un-lún-khûn",
 "溫瓶": "un-pân",
 "溫房": "un-pâng",
-"蝹先生": "un-sian-senn/un-sian-sinn",
+"蝹先生": "un-sian-sinn",
 "溫室": "un-sik",
 "溫室效應": "un-sik-hāu-ìng",
 "溫燒": "un-sio",
@@ -54973,8 +54972,8 @@
 "法蘭克福": "Huat-lân-khik-hok",
 "法蘭西": "Huat-lân-se",
 "法西斯": "Huat-se-su",
-"花蓮": "Hue-liân",
-"花蓮港": "Hue-liân-káng",
+"花蓮": "Hua-liân",
+"花蓮港": "Hua-liân-káng",
 "花霸王": "Hue-pà-ông",
 "花壇": "Hue-tuânn",
 "花壇鄉": "Hue-tuânn-hiong",

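Many values in `words.json` hold two readings separated by `/`, and most of the changes above drop the `-senn` reading of 先生 while keeping `-sinn`. Assuming `/` does nothing more than separate alternative readings (an interpretation of the data, not something the commit states), a reviewer could check which reading a change removes with a helper like the hypothetical `removed_readings` below.

```python
def readings(value):
    """Split a words.json value into its slash-separated readings."""
    return value.split("/")


def removed_readings(old_value, new_value):
    """Return the readings present in old_value but absent from new_value."""
    kept = set(readings(new_value))
    return [r for r in readings(old_value) if r not in kept]


# Values copied from the words.json hunks in this commit.
print(removed_readings("sian-senn/sian-sinn", "sian-sinn"))
# ['sian-senn']
print(removed_readings("tē-lí-sian-senn/tuē-lí-sian-sinn",
                       "tē-lí-sian-sinn/tuē-lí-sian-sinn"))
# ['tē-lí-sian-senn']
```

The second call shows that the 地理先生 entry keeps two readings after the commit, so only its first reading changes.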