Commit f68b8a6

Update docs references for unigram/bigram counts and load function
1 parent 0f51bb2 commit f68b8a6

5 files changed

+34
-42
lines changed


README.rst

Lines changed: 7 additions & 6 deletions
@@ -82,11 +82,12 @@ Sometimes it's interesting to explore the unigram and bigram counts
8282
themselves. These are stored in Python dictionaries mapping word to count. ::
8383

8484
>>> import wordsegment as ws
85-
>>> ws.unigram_counts['the']
85+
>>> ws.load()
86+
>>> ws.UNIGRAMS['the']
8687
23135851162.0
87-
>>> ws.unigram_counts['gray']
88+
>>> ws.UNIGRAMS['gray']
8889
21424658.0
89-
>>> ws.unigram_counts['grey']
90+
>>> ws.UNIGRAMS['grey']
9091
18276942.0
9192

9293
Above we see that the spelling `gray` is more common than the spelling `grey`.
@@ -96,7 +97,7 @@ Bigrams are joined by a space::
9697
>>> import heapq
9798
>>> from pprint import pprint
9899
>>> from operator import itemgetter
99-
>>> pprint(heapq.nlargest(10, ws.bigram_counts.items(), itemgetter(1)))
100+
>>> pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
100101
[('of the', 2766332391.0),
101102
('in the', 1628795324.0),
102103
('to the', 1139248999.0),
@@ -110,9 +111,9 @@ Bigrams are joined by a space::
110111

111112
Some bigrams begin with `<s>`. This is to indicate the start of a bigram::
112113

113-
>>> ws.bigram_counts['<s> where']
114+
>>> ws.BIGRAMS['<s> where']
114115
15419048.0
115-
>>> ws.bigram_counts['<s> what']
116+
>>> ws.BIGRAMS['<s> what']
116117
11779290.0
117118

118119
The unigrams and bigrams data is stored in the `wordsegment_data` directory in

docs/api.rst

Lines changed: 8 additions & 3 deletions
@@ -12,6 +12,11 @@ WordSegment API Reference
1212
Yield (prefix, suffix) pairs from `text` with len(prefix) not
1313
exceeding `limit`.
1414

15+
.. py:function:: load()
16+
:module: wordsegment
17+
18+
Load unigram and bigram counts from disk.
19+
1520
.. py:function:: score(word, prev=None)
1621
:module: wordsegment
1722

@@ -22,13 +27,13 @@ WordSegment API Reference
2227

2328
Return a list of words that is the best segmentation of `text`.
2429

25-
.. py:data:: unigram_counts
30+
.. py:data:: UNIGRAMS
2631
:module: wordsegment
2732

2833
Mapping of (unigram, count) pairs.
2934
Loaded from the file 'wordsegment_data/unigrams.txt'.
3035

31-
.. py:data:: bigram_counts
36+
.. py:data:: BIGRAMS
3237
:module: wordsegment
3338

3439
Mapping of (bigram, count) pairs.
@@ -39,5 +44,5 @@ WordSegment API Reference
3944
:module: wordsegment
4045

4146
Total number of unigrams in the corpus.
42-
Need not match `sum(unigram_counts.values())`.
47+
Need not match `sum(UNIGRAMS.values())`.
4348
Defaults to 1,024,908,267,229.
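
The `load()` entry added to this file means the count dictionaries are only populated once `load()` has been called, which is why the README examples now call `ws.load()` first. The sketch below illustrates that pattern in isolation. It is not the library's implementation: `parse_counts`, the `source` argument, and the tab-separated `key<TAB>count` line format are assumptions made for illustration. The three counts are the ones quoted in the README hunk of this commit.

```python
import io

# Module-level mapping, empty until load() is called (mirrors the documented
# UNIGRAMS name; the loading logic here is a hypothetical stand-in).
UNIGRAMS = {}


def parse_counts(lines):
    """Parse 'key<TAB>count' lines into a dict of float counts.

    The tab-separated format is an assumption, not taken from the diff.
    """
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, count = line.split("\t")
        counts[key] = float(count)
    return counts


def load(source):
    """Populate UNIGRAMS from an iterable of lines (hypothetical signature)."""
    UNIGRAMS.update(parse_counts(source))


# Stand-in for a data file, using counts shown in the README examples.
sample = io.StringIO("the\t23135851162\ngray\t21424658\ngrey\t18276942\n")

assert not UNIGRAMS  # nothing is available before load()
load(sample)
print(UNIGRAMS["gray"] > UNIGRAMS["grey"])
```

Deferring the read to an explicit call keeps `import` cheap and lets callers replace the dictionaries with their own counts before or after loading.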

docs/conf.py

Lines changed: 3 additions & 3 deletions
@@ -14,7 +14,6 @@
1414

1515
import sys
1616
import os
17-
import shlex
1817

1918
# If extensions (or modules to document with autodoc) are in another directory,
2019
# add these directories to sys.path here. If the directory is relative to the
@@ -51,8 +50,8 @@
5150
master_doc = 'index'
5251

5352
# General information about the project.
54-
project = u'WordSegment'
55-
copyright = u'2015, Grant Jenks'
53+
project = u'Word Segment'
54+
copyright = u'2016, Grant Jenks'
5655
author = u'Grant Jenks'
5756

5857
# The version info for the project you're documenting, acts as replacement for
@@ -128,6 +127,7 @@
128127
'show_related': True,
129128
'github_user': 'grantjenks',
130129
'github_repo': 'wordsegment',
130+
'github_type': 'star',
131131
}
132132

133133
# Add any paths that contain custom themes here, relative to this directory.

docs/index.rst

Lines changed: 7 additions & 6 deletions
@@ -78,11 +78,12 @@ Sometimes it's interesting to explore the unigram and bigram counts
7878
themselves. These are stored in Python dictionaries mapping word to count. ::
7979

8080
>>> import wordsegment as ws
81-
>>> ws.unigram_counts['the']
81+
>>> ws.load()
82+
>>> ws.UNIGRAMS['the']
8283
23135851162.0
83-
>>> ws.unigram_counts['gray']
84+
>>> ws.UNIGRAMS['gray']
8485
21424658.0
85-
>>> ws.unigram_counts['grey']
86+
>>> ws.UNIGRAMS['grey']
8687
18276942.0
8788

8889
Above we see that the spelling `gray` is more common than the spelling `grey`.
@@ -92,7 +93,7 @@ Bigrams are joined by a space::
9293
>>> import heapq
9394
>>> from pprint import pprint
9495
>>> from operator import itemgetter
95-
>>> pprint(heapq.nlargest(10, ws.bigram_counts.items(), itemgetter(1)))
96+
>>> pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
9697
[('of the', 2766332391.0),
9798
('in the', 1628795324.0),
9899
('to the', 1139248999.0),
@@ -106,9 +107,9 @@ Bigrams are joined by a space::
106107

107108
Some bigrams begin with `<s>`. This is to indicate the start of a bigram::
108109

109-
>>> ws.bigram_counts['<s> where']
110+
>>> ws.BIGRAMS['<s> where']
110111
15419048.0
111-
>>> ws.bigram_counts['<s> what']
112+
>>> ws.BIGRAMS['<s> what']
112113
11779290.0
113114

114115
The unigrams and bigrams data is stored in the `wordsegment_data` directory in

docs/using-a-different-corpus.rst

Lines changed: 9 additions & 24 deletions
@@ -1,4 +1,3 @@
1-
21
Using a Different Corpus
32
========================
43

@@ -22,44 +21,38 @@ Jane Austen's *Pride and Prejudice*.
2221
2322
print len(text)
2423
25-
2624
.. parsed-literal::
2725
2826
717573
2927
30-
3128
Great. We've got a new corpus for ``wordsegment``. Now let's look at
3229
what parts of the API we need to change. There's one function and two
33-
dictionaries: ``wordsegment.clean``, ``wordsegment.bigram_counts`` and
34-
``wordsegment.unigram_counts``. We'll work on these in reverse.
30+
dictionaries: ``wordsegment.clean``, ``wordsegment.BIGRAMS`` and
31+
``wordsegment.UNIGRAMS``. We'll work on these in reverse.
3532

3633
.. code:: python
3734
3835
import wordsegment
3936
4037
.. code:: python
4138
42-
print type(wordsegment.unigram_counts), type(wordsegment.bigram_counts)
43-
39+
print type(wordsegment.UNIGRAMS), type(wordsegment.BIGRAMS)
4440
4541
.. parsed-literal::
4642
4743
<type 'dict'> <type 'dict'>
4844
49-
5045
.. code:: python
5146
52-
print wordsegment.unigram_counts.items()[:3]
53-
print wordsegment.bigram_counts.items()[:3]
54-
47+
print wordsegment.UNIGRAMS.items()[:3]
48+
print wordsegment.BIGRAMS.items()[:3]
5549
5650
.. parsed-literal::
5751
5852
[('biennials', 37548.0), ('verplank', 48349.0), ('tsukino', 19771.0)]
5953
[('personal effects', 151369.0), ('basic training', 294085.0), ('it absolutely', 130505.0)]
6054
61-
62-
Ok, so ``wordsegment.unigram_counts`` is just a dictionary mapping
55+
Ok, so ``wordsegment.UNIGRAMS`` is just a dictionary mapping
6356
unigrams to their counts. Let's write a method to tokenize our text.
6457

6558
.. code:: python
@@ -72,19 +65,17 @@ unigrams to their counts. Let's write a method to tokenize our text.
7265
7366
print list(tokenize("Wait, what did you say?"))
7467
75-
7668
.. parsed-literal::
7769
7870
['Wait', 'what', 'did', 'you', 'say']
7971
80-
8172
Now we'll build our dictionaries.
8273

8374
.. code:: python
8475
8576
from collections import Counter
8677
87-
wordsegment.unigram_counts = Counter(tokenize(text))
78+
wordsegment.UNIGRAMS = Counter(tokenize(text))
8879
8980
def pairs(iterable):
9081
iterator = iter(iterable)
@@ -94,7 +85,7 @@ Now we'll build our dictionaries.
9485
yield ' '.join(values)
9586
del values[0]
9687
97-
wordsegment.bigram_counts = Counter(pairs(tokenize(text)))
88+
wordsegment.BIGRAMS = Counter(pairs(tokenize(text)))
9889
9990
That's it.
10091

@@ -115,24 +106,18 @@ input to ``segment``.
115106
116107
wordsegment.segment('wantofawife')
117108
118-
119-
120-
121109
.. parsed-literal::
122110
123111
['want', 'of', 'a', 'wife']
124112
125-
126-
127113
If you find this behaves poorly then you may need to change the
128114
``wordsegment.TOTAL`` variable to reflect the total of all unigrams. In
129115
our case that's simply:
130116

131117
.. code:: python
132118
133-
wordsegment.TOTAL = float(sum(wordsegment.unigram_counts.values()))
119+
wordsegment.TOTAL = float(sum(wordsegment.UNIGRAMS.values()))
134120
135121
WordSegment doesn't require any fancy machine learning training
136122
algorithms. Simply update the unigram and bigram count dictionaries and
137123
you're ready to go.
138-
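
The walkthrough in `docs/using-a-different-corpus.rst` uses Python 2 `print` statements, and its `tokenize` and `pairs` helpers are only partly visible in this diff. As a rough Python 3 sketch of the same idea (count unigrams, count adjacent-word bigrams, then total the unigram counts), something like the following could work. The helper bodies, the lowercasing step, and the sample text are assumptions for illustration, not the code from the document.

```python
import re
from collections import Counter


def tokenize(text):
    """Yield lowercase alphabetic tokens (a hypothetical tokenizer)."""
    for match in re.finditer(r"[a-z]+", text.lower()):
        yield match.group()


def pairs(tokens):
    """Yield each pair of adjacent tokens joined by a single space."""
    previous = None
    for token in tokens:
        if previous is not None:
            yield previous + " " + token
        previous = token


# Tiny stand-in corpus; the document uses the text of Pride and Prejudice.
text = "The cat and the dog and the bird."

unigrams = Counter(tokenize(text))         # word -> count
bigrams = Counter(pairs(tokenize(text)))   # 'word word' -> count
total = float(sum(unigrams.values()))      # total number of unigram tokens

print(unigrams.most_common(2))
print(bigrams.most_common(1))
print(total)
```

In the documentation changed by this commit, the equivalent results are assigned to `wordsegment.UNIGRAMS`, `wordsegment.BIGRAMS`, and `wordsegment.TOTAL`.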
