Python Word Segmentation
========================

`WordSegment`_ is an Apache2 licensed module for English word
segmentation, written in pure-Python, and based on a trillion-word corpus.

Based on code from the chapter "`Natural Language Corpus Data`_" by Peter
Norvig from the book "`Beautiful Data`_" (Segaran and Hammerbacher, 2009).

Data files are derived from the `Google Web Trillion Word Corpus`_, as
described by Thorsten Brants and Alex Franz, and `distributed`_ by the
Linguistic Data Consortium. This module contains only a subset of that
data. The unigram data includes only the most common 333,000 words. Similarly,
bigram data includes only the most common 250,000 phrases. Every word and
phrase is lowercased with punctuation removed.

.. _`WordSegment`: http://www.grantjenks.com/docs/wordsegment/
.. _`Natural Language Corpus Data`: http://norvig.com/ngrams/
.. _`Beautiful Data`: http://oreilly.com/catalog/9780596157111/
.. _`Google Web Trillion Word Corpus`: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
.. _`distributed`: https://catalog.ldc.upenn.edu/LDC2006T13

Features
--------

- Pure-Python
- Fully documented
- 100% test coverage
- Includes unigram and bigram data
- Command line interface for batch processing
- Easy to hack (e.g. different scoring, new data, different language)
- Developed on Python 2.7
- Tested on CPython 2.6, 2.7, 3.2, 3.3, 3.4 and PyPy 2.5+, PyPy3 2.4+

.. image:: https://api.travis-ci.org/grantjenks/wordsegment.svg
    :target: http://www.grantjenks.com/docs/wordsegment/

Quickstart
----------

Installing WordSegment is simple with
`pip <http://www.pip-installer.org/>`_::

    $ pip install wordsegment

You can access documentation in the interpreter with Python's built-in help
function::

    >>> import wordsegment
    >>> help(wordsegment)

Tutorial
--------

In your own Python programs, you'll mostly want to use `segment` to divide a
phrase into a list of its parts::

    >>> from wordsegment import segment
    >>> segment('thisisatest')
    ['this', 'is', 'a', 'test']

WordSegment also provides a command-line interface for batch processing. This
interface accepts two arguments: in-file and out-file. Lines from in-file are
iteratively segmented, joined by a space, and written to out-file. Input and
output default to stdin and stdout respectively. ::

    $ echo thisisatest | python -m wordsegment
    this is a test

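Because the two positional arguments default to stdin and stdout, the same
segmentation can be run over a file instead of a pipe (a sketch; `input.txt`
and `output.txt` are placeholder file names)::

    $ python -m wordsegment input.txt output.txt
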
The maximum segmented word length is 24 characters. Neither the unigram nor
bigram data contain words exceeding that length. The corpus also excludes
punctuation, and all letters are lowercased. Before segmenting text, `clean`
is called to transform the input to a canonical form::

    >>> from wordsegment import clean
    >>> clean('She said, "Python rocks!"')
    'shesaidpythonrocks'
    >>> segment('She said, "Python rocks!"')
    ['she', 'said', 'python', 'rocks']

Sometimes it's interesting to explore the unigram and bigram counts
themselves. These are stored in Python dictionaries mapping word to count. ::

    >>> import wordsegment as ws
    >>> ws.load()
    >>> ws.UNIGRAMS['the']
    23135851162.0
    >>> ws.UNIGRAMS['gray']
    21424658.0
    >>> ws.UNIGRAMS['grey']
    18276942.0

Above we see that the spelling `gray` is more common than the spelling `grey`.

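To quantify the difference, divide the two counts (simple arithmetic on the
values shown above; the trailing digits are truncated here)::

    >>> ws.UNIGRAMS['gray'] / ws.UNIGRAMS['grey']
    1.1722...
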
Bigrams are joined by a space::

    >>> import heapq
    >>> from pprint import pprint
    >>> from operator import itemgetter
    >>> pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
    [('of the', 2766332391.0),
     ('in the', 1628795324.0),
     ('to the', 1139248999.0),
     ('on the', 800328815.0),
     ('for the', 692874802.0),
     ('and the', 629726893.0),
     ('to be', 505148997.0),
     ('is a', 476718990.0),
     ('with the', 461331348.0),
     ('from the', 428303219.0)]

Some bigrams begin with `<s>`. This is to indicate the start of a sentence::

    >>> ws.BIGRAMS['<s> where']
    15419048.0
    >>> ws.BIGRAMS['<s> what']
    11779290.0

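Building on that convention, the most common sentence-opening words can be
ranked with the same tools used above (a minimal sketch; `sentence_starts` is
an illustrative name, and the output is omitted because it depends on the
loaded data)::

    >>> sentence_starts = dict(
    ...     (key.split()[1], value)
    ...     for key, value in ws.BIGRAMS.items()
    ...     if key.startswith('<s> ')
    ... )
    >>> pprint(heapq.nlargest(3, sentence_starts.items(), itemgetter(1)))
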
The unigram and bigram data are stored in the `wordsegment_data` directory in
the `unigrams.txt` and `bigrams.txt` files respectively.

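Those files are plain text, so they can be parsed by hand (a sketch under the
assumption that each line holds a word, a tab, and a count; inspect the files
to confirm the format, and note that `parse` and the relative path are
illustrative)::

    >>> import io
    >>> def parse(filename):
    ...     "Read `filename` into a dict mapping word to float count."
    ...     with io.open(filename, encoding='utf-8') as reader:
    ...         pairs = (line.split('\t') for line in reader)
    ...         return dict((word, float(count)) for word, count in pairs)
    >>> unigrams = parse('wordsegment_data/unigrams.txt')
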
Reference and Indices
---------------------

.. toctree::

   api
   using-a-different-corpus
   python-load-dict-fast-from-file

* `WordSegment Documentation`_
* `WordSegment at PyPI`_
* `WordSegment at Github`_
* `WordSegment Issue Tracker`_
* :ref:`search`
* :ref:`genindex`

.. _`WordSegment Documentation`: http://www.grantjenks.com/docs/wordsegment/
.. _`WordSegment at PyPI`: https://pypi.python.org/pypi/wordsegment
.. _`WordSegment at Github`: https://github.com/grantjenks/python-wordsegment
.. _`WordSegment Issue Tracker`: https://github.com/grantjenks/python-wordsegment/issues

WordSegment License
-------------------

.. include:: ../LICENSE