This repository was archived by the owner on Mar 9, 2023. It is now read-only.

Commit 27825cf

Merge pull request #46 from megagonlabs/feature/default_dict_package
Feature/default dict package
2 parents ced1c7b + defdac8 commit 27825cf

12 files changed: 189 additions and 30 deletions.

MANIFEST.in

Lines changed: 0 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,2 +1 @@
11
include README.md LICENSE requirements.txt
2-
recursive-include resources *.def *.json *.dic

README.md

Lines changed: 88 additions & 15 deletions
@@ -7,42 +7,75 @@ Sudachi & SudachiPy are developed in [WAP Tokushima Laboratory of AI and NLP](ht
77

88
**Warning: SudachiPy is still under development, and some of the functions are still not complete. Please use it at your own risk.**
99

10+
## Breaking changes
11+
### v0.3.0
1012

11-
## Setup
13+
- `resources/` directory was moved to `sudachipy/`.
14+
15+
### v0.2.2
16+
17+
- Distribute SudachiPy package via PyPI
18+
- `pip install SudachiPy`
19+
20+
### v0.2.0
21+
22+
- User dictionary feature added
23+
24+
25+
## Easy Setup
1226

1327
SudachiPy requires Python3.5+.
1428

15-
SudachiPy is not registered to PyPI just yet, so you may not install it via `pip` command at the moment.
29+
You can install SudachiPy and SudachiDict_core packages together from PyPI.
1630

31+
```bash
32+
$ pip install SudachiPy
1733
```
18-
$ pip install -e git+git://github.com/WorksApplications/SudachiPy@develop#egg=SudachiPy
19-
```
20-
The dictionary file is not included in the repository. You can get the built dictionary from [Releases · WorksApplications/Sudachi](https://github.com/WorksApplications/Sudachi/releases). Please download either `sudachi-x.y.z-dictionary-core.zip` or `sudachi-x.y.z-dictionary-full.zip`, unzip and rename it to `system.dic`, then place it under `SudachiPy/resources/`. In the end, we would like to make a flow to get these resources via the code, like [NLTK](https://www.nltk.org/data.html) (e.g., `import nltk; nltk.download()`) or [spaCy](https://spacy.io/usage/models) (e.g., `$python -m spacy download en`).
34+
35+
SudachiPy (>= v0.3.0) uses the `system.dic` of the SudachiDict_core package by default.
2136

2237
## Usage
2338

2439
### As a command
2540

2641
After installing SudachiPy, you may also use it in the terminal via command `sudachipy`.
27-
`sudachipy` has 3 subcommands (in default `tokenize`)
42+
43+
You can execute `sudachipy` on standard input like this:
44+
```bash
45+
$ sudachipy
46+
```
47+
48+
`sudachipy` has 4 subcommands (`tokenize` is the default):
2849

2950
```bash
3051
$ sudachipy tokenize -h
31-
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d]
32-
file [file ...]
52+
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v]
53+
[file [file ...]]
3354

3455
Tokenize Text
3556

3657
positional arguments:
37-
file text written in utf-8
58+
file text written in utf-8
3859

3960
optional arguments:
40-
-h, --help show this help message and exit
41-
-r file the setting file in JSON format
42-
-m {A,B,C} the mode of splitting
43-
-o file the output file
44-
-a print all of the fields
45-
-d print the debug information
61+
-h, --help show this help message and exit
62+
-r file the setting file in JSON format
63+
-m {A,B,C} the mode of splitting
64+
-o file the output file
65+
-a print all of the fields
66+
-d print the debug information
67+
-v, --version print sudachipy version
68+
```
69+
```bash
70+
$ sudachipy link -h
71+
usage: sudachipy link [-h] [-t {small,core,full}] [-u]
72+
73+
Link Default Dict Package
74+
75+
optional arguments:
76+
-h, --help show this help message and exit
77+
-t {small,core,full} dict dict
78+
-u unlink sudachidict
4679
```
4780
```bash
4881
$ sudachipy build -h
@@ -126,6 +159,46 @@ tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
126159
# => 'シミュレーション'
127160
```
128161
162+
## Install dict packages
163+
164+
You can download and install the built dictionaries from [Python packages · WorksApplications/SudachiDict](https://github.com/WorksApplications/SudachiDict#python-packages).
165+
166+
```bash
167+
$ pip install SudachiDict_full-20190531.tar.gz
168+
```
169+
170+
You can change the default dict package by executing the `link` command.
171+
172+
```bash
173+
$ sudachipy link -t full
174+
```
175+
176+
You can remove the default dict setting with the `-u` option.
177+
178+
```bash
179+
$ sudachipy link -u
180+
```
181+
182+
## Customized dictionary
183+
184+
If you need to use a customized `system.dic`,
185+
place [sudachi.json](https://github.com/WorksApplications/Sudachi/blob/develop/src/main/resources/sudachi.json) anywhere you like,
186+
and overwrite the `systemDict` value with the relative path from `sudachi.json` to your `system.dic`.
187+
188+
```
189+
{
190+
"systemDict" : "relative/path/to/system.dic",
191+
...
192+
}
193+
```
194+
195+
Then you can specify `sudachi.json` with the `-r` option.
196+
```bash
197+
$ sudachipy -r path/to/sudachi.json
198+
```
199+
200+
In the end, we would like to make a flow to get these resources via the code, like [NLTK](https://www.nltk.org/data.html) (e.g., `import nltk; nltk.download()`) or [spaCy](https://spacy.io/usage/models) (e.g., `$python -m spacy download en`).
201+
129202
## For developer
130203
131204
### Code format

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@ sortedcontainers >= 2.1.0, < 2.2.0
22
# flake8 >= 3.7.7, < 3.8.0
33
# flake8-import-order >= 0.18.1, < 0.19.0
44
# flake8-buitins >= 1.4.1, < 1.5.0
5+
https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_core-20190531.tar.gz

setup.py

Lines changed: 7 additions & 3 deletions
@@ -1,7 +1,8 @@
11
from setuptools import setup, find_packages
2+
from sudachipy import SUDACHIPY_VERSION
23

34
setup(name="SudachiPy",
4-
version="0.2.1",
5+
version=SUDACHIPY_VERSION,
56
description="Python version of Sudachi, the Japanese Morphological Analyzer",
67
long_description=open('README.md').read(),
78
long_description_content_type="text/markdown",
@@ -10,9 +11,12 @@
1011
author="Works Applications",
1112
author_email="takaoka_k@worksap.co.jp",
1213
packages=find_packages(include=["sudachipy", "sudachipy.*"]),
14+
package_data={"": ["resources/*.json", "resources/*.dic", "resources/*.def"]},
1315
entry_points={
1416
"console_scripts": ["sudachipy=sudachipy.command_line:main"],
1517
},
16-
install_requires=["sortedcontainers>=2.1.0,<2.2.0"],
17-
include_package_data=True,
18+
install_requires=[
19+
"sortedcontainers>=2.1.0,<2.2.0",
20+
"SudachiDict_core @ https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_core-20190531.tar.gz",
21+
],
1822
)

sudachipy/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
11
from . import utf8inputtextbuilder
22
from . import tokenizer
33
from . import config
4+
5+
SUDACHIPY_VERSION = '0.3.0'
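
Note on the change above: `setup.py` now does `from sudachipy import SUDACHIPY_VERSION`, which executes `sudachipy/__init__.py` (and the modules it imports) at install time. A minimal sketch of one common alternative, reading the constant from the file's text instead of importing the package, might look like the following. The helper name `read_version` and the throwaway file are hypothetical and are not part of this commit.

```python
import re
import tempfile
from pathlib import Path


def read_version(init_path):
    """Extract SUDACHIPY_VERSION from a module's source text without importing it.

    Hypothetical helper for illustration; not part of SudachiPy.
    """
    text = Path(init_path).read_text(encoding="utf-8")
    match = re.search(
        r"^SUDACHIPY_VERSION\s*=\s*['\"]([^'\"]+)['\"]", text, re.MULTILINE
    )
    if match is None:
        raise RuntimeError("SUDACHIPY_VERSION not found in {}".format(init_path))
    return match.group(1)


# Demonstration on a throwaway file that mimics sudachipy/__init__.py.
with tempfile.TemporaryDirectory() as tmp:
    init_file = Path(tmp) / "__init__.py"
    init_file.write_text(
        "from . import config\n\nSUDACHIPY_VERSION = '0.3.0'\n", encoding="utf-8"
    )
    demo_version = read_version(init_file)

print(demo_version)
```

Whether this trade-off is worthwhile depends on what the imported submodules require at install time, which this diff does not show.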

sudachipy/command_line.py

Lines changed: 33 additions & 7 deletions
@@ -6,6 +6,8 @@
66

77
from . import dictionary
88
from . import tokenizer
9+
from . import SUDACHIPY_VERSION
10+
from .config import set_default_dict_package, unlink_default_dict_package
911
from .dictionarylib import BinaryDictionary
1012
from .dictionarylib import SYSTEM_DICT_VERSION, USER_DICT_VERSION_2
1113
from .dictionarylib.dictionarybuilder import DictionaryBuilder
@@ -70,10 +72,6 @@ def _system_dic_checker(args, print_usage):
7072

7173

7274
def _input_files_checker(args, print_usage):
73-
if not args.in_files:
74-
print_usage()
75-
print('{}: error: no input files'.format(__name__))
76-
exit()
7775
for file in args.in_files:
7876
if not os.path.exists(file):
7977
print_usage()
@@ -111,7 +109,26 @@ def _command_build(args, print_usage):
111109
builder.build(args.in_files, rf, wf)
112110

113111

112+
def _command_link(args, print_usage):
113+
output = sys.stdout
114+
if args.unlink:
115+
unlink_default_dict_package(output=output)
116+
return
117+
118+
dict_package = 'sudachidict_' + args.dict_type
119+
try:
120+
return set_default_dict_package(dict_package, output=output)
121+
except ImportError:
122+
print_usage()
123+
print('{} not installed'.format(dict_package))
124+
exit()
125+
126+
114127
def _command_tokenize(args, print_usage):
128+
if args.version:
129+
print_version()
130+
return
131+
115132
_input_files_checker(args, print_usage)
116133

117134
if args.mode == "A":
@@ -140,23 +157,32 @@ def _command_tokenize(args, print_usage):
140157
output.close()
141158

142159

160+
def print_version():
161+
print('sudachipy v{}'.format(SUDACHIPY_VERSION))
162+
163+
143164
def main():
144165
parser = argparse.ArgumentParser(description="Japanese Morphological Analyzer")
145166

146167
subparsers = parser.add_subparsers(description='')
147168

148-
parser.add_argument("-v", "--version", action="version", version="%(prog)s v0.2.0")
149-
150169
# root, tokenizer parser
151170
parser_tk = subparsers.add_parser('tokenize', help='(default) see `tokenize -h`', description='Tokenize Text')
152171
parser_tk.add_argument("-r", dest="fpath_setting", metavar="file", help="the setting file in JSON format")
153172
parser_tk.add_argument("-m", dest="mode", choices=["A", "B", "C"], default="C", help="the mode of splitting")
154173
parser_tk.add_argument("-o", dest="fpath_out", metavar="file", help="the output file")
155174
parser_tk.add_argument("-a", action="store_true", help="print all of the fields")
156175
parser_tk.add_argument("-d", action="store_true", help="print the debug information")
157-
parser_tk.add_argument("in_files", metavar="file", nargs=argparse.ONE_OR_MORE, help='text written in utf-8')
176+
parser_tk.add_argument("-v", "--version", action="store_true", dest="version", help="print sudachipy version")
177+
parser_tk.add_argument("in_files", metavar="file", nargs=argparse.ZERO_OR_MORE, help='text written in utf-8')
158178
parser_tk.set_defaults(handler=_command_tokenize, print_usage=parser_tk.print_usage)
159179

180+
# link default dict package
181+
parser_ln = subparsers.add_parser('link', help='see `link -h`', description='Link Default Dict Package')
182+
parser_ln.add_argument("-t", dest="dict_type", choices=["small", "core", "full"], default="core", help="dict dict")
183+
parser_ln.add_argument("-u", dest="unlink", action="store_true", help="unlink sudachidict")
184+
parser_ln.set_defaults(handler=_command_link, print_usage=parser_ln.print_usage)
185+
160186
# build dictionary parser
161187
parser_bd = subparsers.add_parser('build', help='see `build -h`', description='Build Sudachi Dictionary')
162188
parser_bd.add_argument('-o', dest='out_file', metavar='file', default='system.dic',
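
The `tokenize` parser in this file switches `in_files` from `argparse.ONE_OR_MORE` to `argparse.ZERO_OR_MORE` and drops the "no input files" check, so the command can run with no file arguments. A minimal, self-contained sketch of that pattern follows. How the real `_command_tokenize` reads standard input is not shown in this diff, so the `read_lines` helper and its fallback are assumptions made for illustration.

```python
import argparse
import io
import sys


def build_parser():
    # Mirrors the two arguments touched by this commit; "demo" is a placeholder name.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("-v", "--version", action="store_true", dest="version")
    parser.add_argument("in_files", metavar="file", nargs=argparse.ZERO_OR_MORE)
    return parser


def read_lines(args, stdin=sys.stdin):
    """Yield lines from the given files, or from stdin when no file is given.

    Hypothetical helper; the actual fallback in SudachiPy may differ.
    """
    if not args.in_files:
        for line in stdin:
            yield line.rstrip("\n")
        return
    for path in args.in_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")


# With no positional arguments, in_files is an empty list, so input comes from stdin.
args = build_parser().parse_args([])
lines = list(read_lines(args, stdin=io.StringIO("first\nsecond\n")))
print(args.in_files, lines)
```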

sudachipy/config.py

Lines changed: 57 additions & 3 deletions
@@ -1,9 +1,59 @@
1+
from importlib import import_module
12
import json
23
import os
4+
from pathlib import Path
35
from typing import List
46

5-
DEFAULT_SETTINGFILE = os.path.join(os.path.dirname(os.path.abspath(__file__)), os.pardir, "resources/sudachi.json")
6-
DEFAULT_RESOURCEDIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), os.pardir, "resources")
7+
DEFAULT_SETTINGFILE = os.path.join(os.path.dirname(os.path.abspath(__file__)), "resources/sudachi.json")
8+
DEFAULT_RESOURCEDIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "resources")
9+
10+
11+
def unlink_default_dict_package(output):
12+
try:
13+
dst_path = Path(import_module('sudachidict').__file__).parent
14+
except ImportError:
15+
print('sudachidict not exists', file=output)
16+
return
17+
18+
if dst_path.is_symlink():
19+
print('unlinking sudachidict', file=output)
20+
dst_path.unlink()
21+
print('sudachidict unlinked', file=output)
22+
if dst_path.exists():
23+
raise IOError('unlink failed (directory exists)')
24+
25+
26+
def set_default_dict_package(dict_package, output):
27+
unlink_default_dict_package(output)
28+
29+
src_path = Path(import_module(dict_package).__file__).parent
30+
dst_path = src_path.parent / 'sudachidict'
31+
dst_path.symlink_to(src_path)
32+
print('default dict package = {}'.format(dict_package), file=output)
33+
34+
return dst_path
35+
36+
37+
def create_default_link_for_sudachidict_core(output):
38+
try:
39+
dict_path = Path(import_module('sudachidict').__file__).parent
40+
except ImportError:
41+
try:
42+
import_module('sudachidict_core')
43+
except ImportError:
44+
raise KeyError('`systemDict` must be specified if `SudachiDict_core` not installed')
45+
try:
46+
import_module('sudachidict_full')
47+
raise KeyError('Multiple packages of `SudachiDict_*` installed. Set default dict with link command.')
48+
except ImportError:
49+
pass
50+
try:
51+
import_module('sudachidict_small')
52+
raise KeyError('Multiple packages of `SudachiDict_*` installed. Set default dict with link command.')
53+
except ImportError:
54+
pass
55+
dict_path = set_default_dict_package('sudachidict_core', output=output)
56+
return dict_path / 'resources' / 'system.dic'
757

858

959
class _Settings(object):
@@ -37,7 +87,11 @@ def __contains__(self, item):
3787
def system_dict_path(self) -> str:
3888
if 'systemDict' in self.__dict_:
3989
return os.path.join(self.resource_dir, self.__dict_['systemDict'])
40-
raise KeyError('`systemDict` not defined in setting file')
90+
else:
91+
with open(os.devnull, 'w') as f:
92+
dict_path = create_default_link_for_sudachidict_core(output=f)
93+
self.__dict_['systemDict'] = dict_path
94+
return dict_path
4195

4296
def char_def_path(self) -> str:
4397
if 'characterDefinitionFile' in self.__dict_:
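
The `set_default_dict_package` and `unlink_default_dict_package` functions above select a dictionary by symlinking a `sudachidict` directory next to the installed `sudachidict_*` package. A minimal sketch of that mechanism on a temporary directory is shown below. The function `link_default` and the fake package layout are hypothetical stand-ins, and the sketch assumes a platform where creating symlinks is permitted.

```python
import tempfile
from pathlib import Path


def link_default(site_dir, dict_package):
    """Point site_dir/'sudachidict' at site_dir/dict_package via a symlink.

    Hypothetical stand-in for the link logic; it takes a directory instead of
    locating an installed package with import_module.
    """
    src_path = site_dir / dict_package
    dst_path = site_dir / "sudachidict"
    if dst_path.is_symlink():
        dst_path.unlink()  # drop a previous link first
    if dst_path.exists():
        # A real directory is in the way; refuse to touch it.
        raise IOError("unlink failed (directory exists)")
    dst_path.symlink_to(src_path, target_is_directory=True)
    return dst_path


with tempfile.TemporaryDirectory() as tmp:
    site_dir = Path(tmp)
    # Fake two installed dictionary packages, each with its own system.dic.
    for name in ("sudachidict_core", "sudachidict_full"):
        (site_dir / name / "resources").mkdir(parents=True)
        (site_dir / name / "resources" / "system.dic").write_text(name)

    link_default(site_dir, "sudachidict_core")
    first = (site_dir / "sudachidict" / "resources" / "system.dic").read_text()

    # Relinking replaces the old symlink and switches the default dictionary.
    link_default(site_dir, "sudachidict_full")
    second = (site_dir / "sudachidict" / "resources" / "system.dic").read_text()

print(first, second)
```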
Lines changed: 0 additions & 1 deletion
@@ -1,5 +1,4 @@
11
{
2-
"systemDict" : "system.dic",
32
"characterDefinitionFile" : "char.def",
43
"inputTextPlugin" : [
54
{ "class" : "sudachipy.plugin.input_text.DefaultInputTextPlugin" }
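
With `systemDict` removed from the bundled settings file, `system_dict_path` in `sudachipy/config.py` falls through to the default dict package. A minimal sketch of that resolution order, assuming a plain dict of settings and a caller-supplied fallback, could look like this. The names `resolve_system_dict` and `fake_default` and the example paths are hypothetical.

```python
import os


def resolve_system_dict(settings, resource_dir, find_default):
    """Return the system dictionary path for a parsed settings dict.

    Hypothetical helper: an explicit `systemDict` wins, otherwise the
    fallback is called once and its result is cached in the settings.
    """
    if "systemDict" in settings:
        # Explicit setting: interpreted relative to the resource directory.
        return os.path.join(resource_dir, settings["systemDict"])
    # No explicit setting: ask the fallback and remember the answer.
    path = str(find_default())
    settings["systemDict"] = path
    return path


calls = []


def fake_default():
    # Stands in for the default dict package lookup; the path is made up.
    calls.append(1)
    return "/site-packages/sudachidict/resources/system.dic"


explicit = resolve_system_dict({"systemDict": "my/system.dic"}, "/conf", fake_default)

settings = {}
fallback = resolve_system_dict(settings, "/conf", fake_default)
again = resolve_system_dict(settings, "/conf", fake_default)  # served from the cache
print(explicit, fallback, again, len(calls))
```

Caching an absolute path works here because `os.path.join` discards earlier components when a later one is absolute.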
