Commit 72f61c1

Update README and make index.rst use that

1 parent fbf3dab commit 72f61c1
File tree

2 files changed: +9 -150 lines changed


README.rst

Lines changed: 8 additions & 4 deletions
@@ -30,7 +30,7 @@ Features
 - Command line interface for batch processing
 - Easy to hack (e.g. different scoring, new data, different language)
 - Developed on Python 2.7
-- Tested on CPython 2.6, 2.7, 3.2, 3.3, 3.4 and PyPy 2.5+, PyPy3 2.4+
+- Tested on CPython 2.6, 2.7, 3.2, 3.3, 3.4, 3.5, 3.6 and PyPy, PyPy3
 
 .. image:: https://api.travis-ci.org/grantjenks/wordsegment.svg
    :target: http://www.grantjenks.com/docs/wordsegment/
@@ -55,10 +55,14 @@ Tutorial
 In your own Python programs, you'll mostly want to use `segment` to divide a
 phrase into a list of its parts::
 
-    >>> from wordsegment import segment
+    >>> from wordsegment import load, segment
+    >>> load()
     >>> segment('thisisatest')
     ['this', 'is', 'a', 'test']
 
+The `load` function reads and parses the unigrams and bigrams data from
+disk. Loading the data only needs to be done once.
+
 WordSegment also provides a command-line interface for batch processing. This
 interface accepts two arguments: in-file and out-file. Lines from in-file are
 iteratively segmented, joined by a space, and written to out-file. Input and
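The `segment` call shown in the hunk above picks the most probable split of a phrase under corpus unigram counts. As a rough, self-contained sketch of that Norvig-style approach (with toy counts and a hypothetical unknown-word penalty, not the module's shipped data or exact implementation):

```python
import math
from functools import lru_cache

# Toy unigram counts standing in for the real data that load() reads from disk.
UNIGRAMS = {'this': 120, 'is': 300, 'a': 500, 'test': 80}
TOTAL = sum(UNIGRAMS.values())

def log_score(word):
    """Log-probability of a word; unknown words get a length-based penalty."""
    if word in UNIGRAMS:
        return math.log(UNIGRAMS[word] / TOTAL)
    return math.log(10.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text):
    """Return the highest-scoring tuple of words covering text."""
    if not text:
        return ()
    # Try every split point; score a parse by summing word log-probabilities.
    splits = ((text[:i], text[i:]) for i in range(1, len(text) + 1))
    candidates = ((first,) + segment(rest) for first, rest in splits)
    return max(candidates, key=lambda words: sum(log_score(w) for w in words))

print(list(segment('thisisatest')))  # ['this', 'is', 'a', 'test']
```

Memoizing `segment` with `lru_cache` keeps the exhaustive split search tractable, the same dynamic-programming idea the library's scoring builds on.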
@@ -116,7 +120,7 @@ Some bigrams begin with `<s>`. This is to indicate the start of a bigram::
     >>> ws.BIGRAMS['<s> what']
     11779290.0
 
-The unigrams and bigrams data is stored in the `wordsegment_data` directory in
+The unigrams and bigrams data is stored in the `wordsegment` directory in
 the `unigrams.txt` and `bigrams.txt` files respectively.
 
 Reference and Indices
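The hunk above only renames the directory holding `unigrams.txt` and `bigrams.txt`. For illustration, here is a minimal parser for a counts file of that kind, assuming a tab-separated `word<TAB>count` layout per line (the on-disk format itself is not shown in this diff):

```python
import io

def parse_counts(lines):
    """Parse 'word<TAB>count' lines into a dict mapping word -> float count."""
    pairs = (line.split('\t') for line in lines if line.strip())
    return {word: float(count) for word, count in pairs}

# Sample data mirroring the UNIGRAMS values quoted in the tutorial.
sample = io.StringIO('the\t23135851162\ngray\t21424658\n')
counts = parse_counts(sample)
print(counts['the'])  # 23135851162.0
```

Counts are floats, matching the `23135851162.0`-style values the tutorial prints from `ws.UNIGRAMS`.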
@@ -135,7 +139,7 @@ Reference and Indices
 WordSegment License
 -------------------
 
-Copyright 2016 Grant Jenks
+Copyright 2017 Grant Jenks
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.

docs/index.rst

Lines changed: 1 addition & 146 deletions
@@ -1,146 +1 @@
-Python Word Segmentation
-========================
-
-`WordSegment`_ is an Apache2 licensed module for English word
-segmentation, written in pure-Python, and based on a trillion-word corpus.
-
-Based on code from the chapter "`Natural Language Corpus Data`_" by Peter
-Norvig from the book "`Beautiful Data`_" (Segaran and Hammerbacher, 2009).
-
-Data files are derived from the `Google Web Trillion Word Corpus`_, as
-described by Thorsten Brants and Alex Franz, and `distributed`_ by the
-Linguistic Data Consortium. This module contains only a subset of that
-data. The unigram data includes only the most common 333,000 words. Similarly,
-bigram data includes only the most common 250,000 phrases. Every word and
-phrase is lowercased with punctuation removed.
-
-.. _`WordSegment`: http://www.grantjenks.com/docs/wordsegment/
-.. _`Natural Language Corpus Data`: http://norvig.com/ngrams/
-.. _`Beautiful Data`: http://oreilly.com/catalog/9780596157111/
-.. _`Google Web Trillion Word Corpus`: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
-.. _`distributed`: https://catalog.ldc.upenn.edu/LDC2006T13
-
-Features
---------
-
-- Pure-Python
-- Fully documented
-- 100% Test Coverage
-- Includes unigram and bigram data
-- Command line interface for batch processing
-- Easy to hack (e.g. different scoring, new data, different language)
-- Developed on Python 2.7
-- Tested on CPython 2.6, 2.7, 3.2, 3.3, 3.4 and PyPy 2.5+, PyPy3 2.4+
-
-.. image:: https://api.travis-ci.org/grantjenks/wordsegment.svg
-   :target: http://www.grantjenks.com/docs/wordsegment/
-
-Quickstart
-----------
-
-Installing WordSegment is simple with
-`pip <http://www.pip-installer.org/>`_::
-
-    $ pip install wordsegment
-
-You can access documentation in the interpreter with Python's built-in help
-function::
-
-    >>> import wordsegment
-    >>> help(wordsegment)
-
-Tutorial
---------
-
-In your own Python programs, you'll mostly want to use `segment` to divide a
-phrase into a list of its parts::
-
-    >>> from wordsegment import segment
-    >>> segment('thisisatest')
-    ['this', 'is', 'a', 'test']
-
-WordSegment also provides a command-line interface for batch processing. This
-interface accepts two arguments: in-file and out-file. Lines from in-file are
-iteratively segmented, joined by a space, and written to out-file. Input and
-output default to stdin and stdout respectively. ::
-
-    $ echo thisisatest | python -m wordsegment
-    this is a test
-
-The maximum segmented word length is 24 characters. Neither the unigram nor
-bigram data contain words exceeding that length. The corpus also excludes
-punctuation and all letters have been lowercased. Before segmenting text,
-`clean` is called to transform the input to a canonical form::
-
-    >>> from wordsegment import clean
-    >>> clean('She said, "Python rocks!"')
-    'shesaidpythonrocks'
-    >>> segment('She said, "Python rocks!"')
-    ['she', 'said', 'python', 'rocks']
-
-Sometimes its interesting to explore the unigram and bigram counts
-themselves. These are stored in Python dictionaries mapping word to count. ::
-
-    >>> import wordsegment as ws
-    >>> ws.load()
-    >>> ws.UNIGRAMS['the']
-    23135851162.0
-    >>> ws.UNIGRAMS['gray']
-    21424658.0
-    >>> ws.UNIGRAMS['grey']
-    18276942.0
-
-Above we see that the spelling `gray` is more common than the spelling `grey`.
-
-Bigrams are joined by a space::
-
-    >>> import heapq
-    >>> from pprint import pprint
-    >>> from operator import itemgetter
-    >>> pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
-    [('of the', 2766332391.0),
-     ('in the', 1628795324.0),
-     ('to the', 1139248999.0),
-     ('on the', 800328815.0),
-     ('for the', 692874802.0),
-     ('and the', 629726893.0),
-     ('to be', 505148997.0),
-     ('is a', 476718990.0),
-     ('with the', 461331348.0),
-     ('from the', 428303219.0)]
-
-Some bigrams begin with `<s>`. This is to indicate the start of a bigram::
-
-    >>> ws.BIGRAMS['<s> where']
-    15419048.0
-    >>> ws.BIGRAMS['<s> what']
-    11779290.0
-
-The unigrams and bigrams data is stored in the `wordsegment_data` directory in
-the `unigrams.txt` and `bigrams.txt` files respectively.
-
-Reference and Indices
----------------------
-
-.. toctree::
-
-   api
-   using-a-different-corpus
-   python-load-dict-fast-from-file
-
-* `WordSegment Documentation`_
-* `WordSegment at PyPI`_
-* `WordSegment at Github`_
-* `WordSegment Issue Tracker`_
-* :ref:`search`
-* :ref:`genindex`
-
-.. _`WordSegment Documentation`: http://www.grantjenks.com/docs/wordsegment/
-.. _`WordSegment at PyPI`: https://pypi.python.org/pypi/wordsegment
-.. _`WordSegment at Github`: https://github.com/grantjenks/python-wordsegment
-.. _`WordSegment Issue Tracker`: https://github.com/grantjenks/python-wordsegment/issues
-
-WordSegment License
--------------------
-
-.. include:: ../LICENSE
+.. include:: ../README.rst
