This repository was archived by the owner on Mar 9, 2023. It is now read-only.

Commit 80cdf94

Author: Sorami Hisamoto
Major content update (#119)
1 parent bb5195a commit 80cdf94

README.md

Lines changed: 172 additions & 97 deletions
@@ -6,58 +6,66 @@
66

77
SudachiPy is a Python version of [Sudachi](https://github.com/WorksApplications/Sudachi), a Japanese morphological analyzer.
88

9-
Sudachi & SudachiPy are developed in [WAP Tokushima Laboratory of AI and NLP](http://nlp.worksap.co.jp/), an institute under [Works Applications](http://www.worksap.com/) that focuses on Natural Language Processing (NLP).
109

11-
**Warning: some functions are still incompatible with Java Sudachi.**
12-
13-
## Easy Setup
14-
15-
### Step 1: Install SudachiPy
16-
17-
SudachiPy is distributed from PyPI. You can install SudachiPy by executing `pip install SudachiPy` from the command line.
10+
## TL;DR
1811

1912
```bash
20-
$ pip install SudachiPy
13+
$ pip install sudachipy sudachidict_core
14+
15+
$ echo "高輪ゲートウェイ駅" | sudachipy
16+
高輪ゲートウェイ駅 名詞,固有名詞,一般,*,*,* 高輪ゲートウェイ駅
17+
EOS
18+
19+
$ echo "高輪ゲートウェイ駅" | sudachipy -m A
20+
高輪 名詞,固有名詞,地名,一般,*,* 高輪
21+
ゲートウェイ 名詞,普通名詞,一般,*,*,* ゲートウェー
22+
駅 名詞,普通名詞,一般,*,*,*
23+
EOS
24+
25+
$ echo "空缶空罐空きカン" | sudachipy -a
26+
空缶 名詞,普通名詞,一般,*,*,* 空き缶 空缶 アキカン 0
27+
空罐 名詞,普通名詞,一般,*,*,* 空き缶 空罐 アキカン 0
28+
空きカン 名詞,普通名詞,一般,*,*,* 空き缶 空きカン アキカン 0
29+
EOS
2130
```
2231

23-
SudachiPy(>=v0.3.0) refers to system.dic of SudachiDict_core (not included in SudachiPy) package by default.
24-
Please proceed to Step 2 to install the dict package.
32+
## Setup
2533

26-
### Step 2: Get The Dictionary
34+
You need SudachiPy and a dictionary.
2735

28-
You can install a dictionary as a Python package. It make take a while to download the dictionary file (around 70MB for the `core` edition).
36+
### Step 1. Install SudachiPy
2937

3038
```bash
31-
$ pip install sudachidict_core
39+
$ pip install sudachipy
3240
```
3341

34-
Alternatively, you can choose other editions of the dictionary. There are three editions, namely, `small`, `core`, and `full`. See [WorksApplications/SudachiDict](https://github.com/WorksApplications/SudachiDict) for the detail.
42+
### Step 2. Get a Dictionary
3543

36-
You need to specify the dictionary with the `link -t` command.
44+
You can get a dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the `core` edition).
3745

3846
```bash
39-
$ pip install sudachidict_small
40-
$ sudachipy link -t small
47+
$ pip install sudachidict_core
4148
```
4249

43-
```bash
44-
$ pip install sudachidict_full
45-
$ sudachipy link -t full
46-
```
50+
Alternatively, you can choose another dictionary edition. See [this section](#dictionary-edition) for details.
4751

48-
## Usage
4952

50-
### As a command
53+
## Usage: As a command
5154

52-
After installing SudachiPy, you may also use it in the terminal via command `sudachipy`.
55+
There is a CLI command `sudachipy`.
5356

54-
You can excute `sudachipy` with standard input by this way:
5557
```bash
56-
$ sudachipy
58+
$ echo "外国人参政権" | sudachipy
59+
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権
60+
EOS
61+
$ echo "外国人参政権" | sudachipy -m A
62+
外国 名詞,普通名詞,一般,*,*,* 外国
63+
人 接尾辞,名詞的,一般,*,*,*
64+
参政 名詞,普通名詞,一般,*,*,* 参政
65+
権 接尾辞,名詞的,一般,*,*,*
66+
EOS
5767
```
5868

59-
`sudachipy` has 4 subcommands (default: `tokenize`)
60-
6169
```bash
6270
$ sudachipy tokenize -h
6371
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v]
@@ -77,66 +85,51 @@ optional arguments:
7785
-d print the debug information
7886
-v, --version print sudachipy version
7987
```
80-
```bash
81-
$ sudachipy link -h
82-
usage: sudachipy link [-h] [-t {small,core,full}] [-u]
8388
84-
Link Default Dict Package
89+
### Output
8590
86-
optional arguments:
87-
-h, --help show this help message and exit
88-
-t {small,core,full} dict dict
89-
-u unlink sudachidict
90-
```
91-
```bash
92-
$ sudachipy build -h
93-
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]
91+
Columns are tab separated.
9492
95-
Build Sudachi Dictionary
93+
- Surface
94+
- Part-of-Speech Tags (comma separated)
95+
- Normalized Form
9696
97-
positional arguments:
98-
file source files with CSV format (one of more)
97+
When you add the `-a` option, it additionally outputs
9998
100-
optional arguments:
101-
-h, --help show this help message and exit
102-
-o file output file (default: system.dic)
103-
-d string description comment to be embedded on dictionary
99+
- Dictionary Form
100+
- Reading Form
101+
- Dictionary ID
102+
- `0` for the system dictionary
103+
- `1` and above for the [user dictionaries](#user-dictionary)
104+
- `-1\t(OOV)` if a word is Out-of-Vocabulary (not in the dictionary)
104105
105-
required named arguments:
106-
-m file connection matrix file with MeCab's matrix.def format
107-
```
108-
**WARNING: v0.3.\* ubuild contains bug.**
109106
```bash
110-
$ sudachipy ubuild -h
111-
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]
112-
113-
Build User Dictionary
114-
115-
positional arguments:
116-
file source files with CSV format (one or more)
107+
$ echo "外国人参政権" | sudachipy -a
108+
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権 外国人参政権 ガイコクジンサンセイケン 0
109+
EOS
110+
```
117111
118-
optional arguments:
119-
-h, --help show this help message and exit
120-
-d string description comment to be embedded on dictionary
121-
-o file output file (default: user.dic)
122-
-s file system dictionary (default: linked system_dic, see link -h)
112+
```bash
113+
$ echo "阿quei" | sudachipy -a
114+
阿 名詞,普通名詞,一般,*,*,* 阿 阿 -1 (OOV)
115+
quei 名詞,普通名詞,一般,*,*,* quei quei -1 (OOV)
116+
EOS
123117
```
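
The column layout described above can be read back into structured records. The following is a minimal sketch, assuming the output is tab separated exactly as listed (three columns by default, six with `-a`, plus a trailing `(OOV)` column for out-of-vocabulary words). `TokenLine` and `parse_line` are hypothetical names for illustration and are not part of SudachiPy.

```python
from typing import NamedTuple, Optional, Tuple


class TokenLine(NamedTuple):
    """One line of `sudachipy` CLI output (hypothetical helper type)."""
    surface: str
    pos: Tuple[str, ...]
    normalized_form: str
    dictionary_form: Optional[str] = None
    reading_form: Optional[str] = None
    dictionary_id: Optional[int] = None
    is_oov: bool = False


def parse_line(line: str) -> Optional[TokenLine]:
    """Parse one output line; return None for the `EOS` sentence marker."""
    line = line.rstrip("\n")
    if line == "EOS":
        return None
    cols = line.split("\t")
    surface = cols[0]
    pos = tuple(cols[1].split(","))  # POS tags are comma separated
    normalized = cols[2] if len(cols) > 2 else ""
    if len(cols) < 6:
        # Default output: surface, POS, normalized form only.
        return TokenLine(surface, pos, normalized)
    # Output with `-a`: dictionary form, reading form, dictionary ID follow.
    dictionary_id = int(cols[5])  # 0: system, 1 and above: user, -1: OOV
    is_oov = len(cols) > 6 and cols[6] == "(OOV)"
    return TokenLine(surface, pos, normalized, cols[3], cols[4],
                     dictionary_id, is_oov)
```

Under the same assumption, a caller could filter on `dictionary_id >= 1` to find tokens that came from a user dictionary, or on `is_oov` to collect unknown words.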
124118
125-
### As a Python package
126119
127-
Here is an example usage;
120+
## Usage: As a Python package
121+
122+
Here is an example:
128123
129124
```python
130125
from sudachipy import tokenizer
131126
from sudachipy import dictionary
132127
133-
134128
tokenizer_obj = dictionary.Dictionary().create()
129+
```
135130
136-
137-
# Multi-granular tokenization
138-
# using `system_core.dic` or `system_full.dic` version 20190781
139-
# you may not be able to replicate this particular example due to dictionary you use
131+
```python
132+
# Multi-granular Tokenization
140133
141134
mode = tokenizer.Tokenizer.SplitMode.C
142135
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
@@ -149,8 +142,10 @@ mode = tokenizer.Tokenizer.SplitMode.B
149142
mode = tokenizer.Tokenizer.SplitMode.A
150143
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
151144
# => ['国家', '公務', '員']
145+
```
152146
153147
148+
```python
154149
# Morpheme information
155150
156151
m = tokenizer_obj.tokenize("食べ", mode)[0]
@@ -159,8 +154,10 @@ m.surface() # => '食べ'
159154
m.dictionary_form() # => '食べる'
160155
m.reading_form() # => 'タベ'
161156
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']
157+
```
162158
163159
160+
```python
164161
# Normalization
165162
166163
tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
@@ -171,31 +168,42 @@ tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
171168
# => 'シミュレーション'
172169
```
173170
174-
## Install dict packages
171+
(Results are from the `20200330` `core` dictionary and may differ with other versions.)
175172
176-
You can download and install the built dictionaries from [Python packages · WorksApplications/SudachiDict](https://github.com/WorksApplications/SudachiDict#python-packages).
173+
174+
## Dictionary Edition
175+
176+
There are three editions of Sudachi Dictionary, namely, `small`, `core`, and `full`. See [WorksApplications/SudachiDict](https://github.com/WorksApplications/SudachiDict) for details.
177+
178+
SudachiPy uses `sudachidict_core` by default. You can specify the dictionary with the `link -t` command.
177179
178180
```bash
179-
$ pip install SudachiDict_full-20190718.tar.gz
181+
$ pip install sudachidict_small
182+
$ sudachipy link -t small
180183
```
181184
182-
You can change the default dict package by executing link command.
183-
184185
```bash
186+
$ pip install sudachidict_full
185187
$ sudachipy link -t full
186188
```
187189
188-
You can remove default dict setting.
190+
You can remove the dictionary link with the `link -u` command.
189191
190192
```bash
191193
$ sudachipy link -u
192194
```
193195
194-
## Customized dictionary
196+
Dictionaries are installed as Python packages `sudachidict_small`, `sudachidict_core`, and `sudachidict_full`. SudachiPy looks for the `sudachidict` package to load a dictionary. The `link` subcommand creates *a symbolic link* from `sudachidict` to one of the `sudachidict_*` packages, which is how you switch between them.
197+
198+
* [SudachiDict-small · PyPI](https://pypi.org/project/SudachiDict-small/)
199+
* [SudachiDict-core · PyPI](https://pypi.org/project/SudachiDict-core/)
200+
* [SudachiDict-full · PyPI](https://pypi.org/project/SudachiDict-full/)
195201
196-
If you need to apply customized `system.dic`,
197-
place [sudachi.json](https://github.com/WorksApplications/Sudachi/blob/develop/src/main/resources/sudachi.json) to anywhere you like,
198-
and overwrite `systemDict` value with the relative path from `sudachi.json` to your `system.dic`.
202+
The dictionary files are not in the package itself; they are downloaded upon installation.
203+
204+
### Dictionary in The Setting File
205+
206+
Alternatively, if the dictionary file is specified in the setting file, `sudachi.json`, SudachiPy will use that file.
199207
200208
```
201209
{
@@ -204,42 +212,109 @@ and overwrite `systemDict` value with the relative path from `sudachi.json` to y
204212
}
205213
```
206214
207-
Then you can specify `sudachi.json` with `-r` option.
215+
The default setting file is [sudachipy/resources/sudachi.json](https://github.com/WorksApplications/SudachiPy/blob/develop/sudachipy/resources/sudachi.json). You can specify your `sudachi.json` with the `-r` option.
216+
208217
```bash
209218
$ sudachipy -r path/to/sudachi.json
210219
```
211220
212-
In the end, we would like to make a flow to get these resources via the code, like [NLTK](https://www.nltk.org/data.html) (e.g., `import nltk; nltk.download()`) or [spaCy](https://spacy.io/usage/models) (e.g., `$python -m spacy download en`).
213221
214-
## User defined Dictionary
222+
## User Dictionary
215223
216-
If you need to apply customized user dictionary, `user.dic`,
217-
place [sudachi.json](https://github.com/WorksApplications/Sudachi/blob/develop/src/main/resources/sudachi.json) to anywhere you like,
218-
and add `userDict` value with the relative path from `sudachi.json` to your `user.dic`.
224+
To use a user dictionary, `user.dic`, place [sudachi.json](https://github.com/WorksApplications/SudachiPy/blob/develop/sudachipy/resources/sudachi.json) anywhere you like, and add a `userDict` value with the relative path from `sudachi.json` to your `user.dic`.
219225
220-
```
226+
```js
221227
{
222228
"userDict" : ["relative/path/to/user.dic"],
223229
...
224230
}
225231
```
226232
227-
Also, you can build user dictionary with sub-command `ubuild`.
233+
Then specify your `sudachi.json` with the `-r` option.
228234
229-
About file format, see [here](https://github.com/WorksApplications/Sudachi/blob/develop/docs/user_dict.md)
230-
(written in Japanese, English document is unavailable now)
235+
```bash
236+
$ sudachipy -r path/to/sudachi.json
237+
```
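
The `userDict` entry has to be written relative to the location of `sudachi.json`, which is easy to get wrong when the two files live in different directories. The following is a minimal sketch that computes that relative path and writes the settings file. `write_settings` is a hypothetical helper, not part of SudachiPy, and the sketch assumes you pass in the contents of an existing `sudachi.json` as a dict so that its other keys are kept as they are.

```python
import json
import os
from pathlib import Path


def write_settings(base_settings: dict, settings_path: Path,
                   user_dic: Path) -> dict:
    """Save a copy of base_settings whose userDict points at user_dic.

    The path stored in userDict is relative to the directory that
    contains settings_path, as the text above requires.
    """
    settings = dict(base_settings)  # shallow copy; the input is not modified
    relative = os.path.relpath(user_dic, start=settings_path.parent)
    # Use forward slashes so the file reads the same on every platform.
    settings["userDict"] = [Path(relative).as_posix()]
    settings_path.write_text(
        json.dumps(settings, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return settings
```

The resulting file can then be passed to the CLI with the `-r` option as shown above.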
238+
239+
240+
You can build a user dictionary with the subcommand `ubuild`.
241+
242+
**WARNING: v0.3.\* ubuild contains a bug.**
243+
244+
```bash
245+
$ sudachipy ubuild -h
246+
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]
247+
248+
Build User Dictionary
249+
250+
positional arguments:
251+
file source files with CSV format (one or more)
252+
253+
optional arguments:
254+
-h, --help show this help message and exit
255+
-d string description comment to be embedded on dictionary
256+
-o file output file (default: user.dic)
257+
-s file system dictionary (default: linked system_dic, see link -h)
258+
```
259+
260+
For the dictionary file format, see [this document](https://github.com/WorksApplications/Sudachi/blob/develop/docs/user_dict.md) (written in Japanese; an English version is not available yet).
231261
232-
## For developer
233262
234-
### Code format
263+
## Customized System Dictionary
235264
236-
You can use `./scripts/format.sh` and check if your code is in rule. `flake8` `flake8-import-order` `flake8-buitins` is required. See `requirements.txt`
265+
```bash
266+
$ sudachipy build -h
267+
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]
268+
269+
Build Sudachi Dictionary
270+
271+
positional arguments:
272+
file source files with CSV format (one of more)
273+
274+
optional arguments:
275+
-h, --help show this help message and exit
276+
-o file output file (default: system.dic)
277+
-d string description comment to be embedded on dictionary
278+
279+
required named arguments:
280+
-m file connection matrix file with MeCab's matrix.def format
281+
```
282+
283+
To use your customized `system.dic`, place [sudachi.json](https://github.com/WorksApplications/SudachiPy/blob/develop/sudachipy/resources/sudachi.json) anywhere you like, and overwrite the `systemDict` value with the relative path from `sudachi.json` to your `system.dic`.
284+
285+
```
286+
{
287+
"systemDict" : "relative/path/to/system.dic",
288+
...
289+
}
290+
```
291+
292+
Then specify your `sudachi.json` with the `-r` option.
293+
294+
```bash
295+
$ sudachipy -r path/to/sudachi.json
296+
```
297+
298+
299+
## For Developers
300+
301+
### Code Format
302+
303+
Run `scripts/format.sh` to check if your code is formatted correctly.
304+
305+
You need the packages `flake8`, `flake8-import-order`, and `flake8-builtins` (see `requirements.txt`).
237306
238307
### Test
239308
240-
You can use `./scripts/test.sh` and check if your changes do not cause regression.
309+
Run `scripts/test.sh` to run the tests.
310+
241311
242312
## Contact
243313
244-
We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.
245-
- https://sudachi-dev.slack.com/ (Please take invitation from [here](https://join.slack.com/t/sudachi-dev/shared_invite/enQtMzg2NTI2NjYxNTUyLTMyYmNkZWQ0Y2E5NmQxMTI3ZGM3NDU0NzU4NGE1Y2UwYTVmNTViYjJmNDI0MWZiYTg4ODNmMzgxYTQ3ZmI2OWU))
314+
Sudachi and SudachiPy are developed by [WAP Tokushima Laboratory of AI and NLP](http://nlp.worksap.co.jp/).
315+
316+
Open an issue, or come to our Slack workspace for questions and discussion.
317+
318+
https://sudachi-dev.slack.com/ (Get an invitation [here](https://join.slack.com/t/sudachi-dev/shared_invite/enQtMzg2NTI2NjYxNTUyLTMyYmNkZWQ0Y2E5NmQxMTI3ZGM3NDU0NzU4NGE1Y2UwYTVmNTViYjJmNDI0MWZiYTg4ODNmMzgxYTQ3ZmI2OWU))
319+
320+
Enjoy tokenization!
