
Commit d3aceef

Update docs for API changes
1 parent c39e7b3 commit d3aceef


2 files changed: 17 additions, 15 deletions


docs/api.rst

Lines changed: 7 additions & 10 deletions
@@ -1,12 +1,16 @@
 WordSegment API Reference
 =========================

+`WordSegment`_ API reference.
+
+.. _`WordSegment`: http://www.grantjenks.com/docs/wordsegment/
+
 .. py:function:: clean(text)
    :module: wordsegment

    Return `text` lower-cased with non-alphanumeric characters removed.

-.. py:function:: divide(text, limit=24)
+.. py:function:: divide(text)
    :module: wordsegment

    Yield (prefix, suffix) pairs from `text` with len(prefix) not
@@ -36,18 +40,11 @@ WordSegment API Reference
    :module: wordsegment

    Mapping of (unigram, count) pairs.
-   Loaded from the file 'wordsegment_data/unigrams.txt'.
+   Loaded from the file 'wordsegment/unigrams.txt'.

 .. py:data:: BIGRAMS
    :module: wordsegment

    Mapping of (bigram, count) pairs.
    Bigram keys are joined by a space.
-   Loaded from the file 'wordsegment_data/bigrams.txt'.
-
-.. py:data:: TOTAL
-   :module: wordsegment
-
-   Total number of unigrams in the corpus.
-   Need not match `sum(UNIGRAMS.values())`.
-   Defaults to 1,024,908,267,229.
+   Loaded from the file 'wordsegment/bigrams.txt'.
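
A hedged sketch of the two documented helpers after this change, used as api.rst now documents them; the return values in the comments are illustrative, not verbatim output:

    import wordsegment

    # clean(): lower-case the text and drop non-alphanumeric characters.
    print(wordsegment.clean("Can't segment THIS!"))  # 'cantsegmentthis'

    # divide(): the limit= keyword is gone from the public signature;
    # the prefix-length cap is no longer a caller-facing parameter.
    print(list(wordsegment.divide('ab')))  # [('a', 'b'), ('ab', '')]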

docs/using-a-different-corpus.rst

Lines changed: 10 additions & 5 deletions
@@ -33,6 +33,7 @@ dictionaries: ``wordsegment.clean``, ``wordsegment.BIGRAMS`` and
 .. code:: python

     import wordsegment
+    wordsegment.load()

 .. code:: python

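This one-line addition is the behavioral change driving the doc update: the unigram and bigram mappings now start out empty until `load()` reads the bundled data files, so it must run before segmenting. A hedged sanity check:

    import wordsegment

    wordsegment.load()  # reads the bundled unigrams.txt / bigrams.txt files
    print(len(wordsegment.UNIGRAMS), len(wordsegment.BIGRAMS))  # non-zero once loaded
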
@@ -75,7 +76,8 @@ Now we'll build our dictionaries.

     from collections import Counter

-    wordsegment.UNIGRAMS = Counter(tokenize(text))
+    wordsegment.UNIGRAMS.clear()
+    wordsegment.UNIGRAMS.update(Counter(tokenize(text)))

     def pairs(iterable):
         iterator = iter(iterable)
@@ -85,7 +87,8 @@ Now we'll build our dictionaries.
             yield ' '.join(values)
             del values[0]

-    wordsegment.BIGRAMS = Counter(pairs(tokenize(text)))
+    wordsegment.BIGRAMS.clear()
+    wordsegment.BIGRAMS.update(Counter(pairs(tokenize(text))))

 That's it.

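Both hunks make the same substitution: mutate the existing dictionaries with clear()/update() rather than rebinding the module attributes. A self-contained sketch of the rebuilt flow; tokenize() and the toy text are hypothetical stand-ins for the corpus setup defined earlier in this doc, and pairs() is reconstructed from the context lines above:

    import re
    from collections import Counter

    import wordsegment

    def tokenize(text):
        # Hypothetical stand-in for the doc's tokenizer.
        return re.findall('[a-z0-9]+', text.lower())

    text = 'all happy families are alike'  # placeholder corpus

    def pairs(iterable):
        # Yield space-joined bigrams: 'all happy', 'happy families', ...
        iterator = iter(iterable)
        values = [next(iterator)]
        for value in iterator:
            values.append(value)
            yield ' '.join(values)
            del values[0]

    # Mutating in place matters: the segmenter holds references to these
    # dict objects, so plain assignment would leave it using stale counts.
    wordsegment.UNIGRAMS.clear()
    wordsegment.UNIGRAMS.update(Counter(tokenize(text)))
    wordsegment.BIGRAMS.clear()
    wordsegment.BIGRAMS.update(Counter(pairs(tokenize(text))))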

@@ -97,10 +100,12 @@ input to ``segment``.

 .. code:: python

+    from wordsegment import _segmenter
+
     def identity(value):
         return value

-    wordsegment.clean = identity
+    _segmenter.clean = identity

 .. code:: python

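The new import explains the switch: `wordsegment.clean` is a reference to the shared segmenter's method, so rebinding the module attribute would not change what `segment` calls internally; the override has to land on the instance itself. A hedged sketch:

    import wordsegment
    from wordsegment import _segmenter

    wordsegment.load()

    def identity(value):
        return value

    # Patch the instance attribute; segment() resolves clean via the
    # Segmenter instance, not via the module namespace.
    _segmenter.clean = identity

    print(wordsegment.segment('wantofawife'))  # ['want', 'of', 'a', 'wife']
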
@@ -111,12 +116,12 @@ input to ``segment``.
     ['want', 'of', 'a', 'wife']

 If you find this behaves poorly then you may need to change the
-``wordsegment.TOTAL`` variable to reflect the total of all unigrams. In
+``_segmenter.total`` variable to reflect the total of all unigrams. In
 our case that's simply:

 .. code:: python

-    wordsegment.TOTAL = float(sum(wordsegment.UNIGRAMS.values()))
+    _segmenter.total = float(sum(wordsegment.UNIGRAMS.values()))

 WordSegment doesn't require any fancy machine learning training
 algorithms. Simply update the unigram and bigram count dictionaries and
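
The same instance-versus-module distinction applies here: the old `TOTAL` module constant is gone (see the api.rst hunk above), and the denominator used for unigram probabilities now lives on the segmenter as `total`. A hedged end-to-end check after rebuilding the counts:

    import wordsegment
    from wordsegment import _segmenter

    wordsegment.load()  # or rebuild UNIGRAMS/BIGRAMS in place as shown earlier

    # Keep the probability denominator consistent with the current counts.
    _segmenter.total = float(sum(wordsegment.UNIGRAMS.values()))

    print(wordsegment.segment('wantofawife'))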
