Update docs references for unigram/bigram counts and load function
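
For callers updating code alongside these docs, here is a minimal sketch of
the rename. It assumes only what the diff below documents: `load()` is called
before the counts are read, and `unigram_counts`/`bigram_counts` are now
`UNIGRAMS`/`BIGRAMS`. The helper `get_counts` and the stand-in module built by
`make_stub` are hypothetical, and the counts in the stub are made up so the
example runs without the package installed.

```python
import types


def get_counts(ws):
    """Return (unigrams, bigrams) from a wordsegment-like module.

    Hypothetical helper: accepts both the old and the new attribute names.
    """
    if hasattr(ws, 'load'):
        # New names: the updated docs call load() before reading the counts.
        ws.load()
        return ws.UNIGRAMS, ws.BIGRAMS
    # Old names: the previous docs read these attributes with no load() call.
    return ws.unigram_counts, ws.bigram_counts


def make_stub():
    """Build a stand-in module that mimics the documented new interface."""
    stub = types.ModuleType('wordsegment_stub')
    stub.UNIGRAMS = {}
    stub.BIGRAMS = {}

    def load():
        # Made-up counts for illustration; the real data is loaded from disk.
        stub.UNIGRAMS.update({'the': 3.0, 'gray': 2.0})
        stub.BIGRAMS.update({'of the': 2.0, '<s> what': 1.0})

    stub.load = load
    return stub


unigrams, bigrams = get_counts(make_stub())
print(sorted(unigrams))
```

With the real package, `get_counts(wordsegment)` would take the place of the
stub call; whether a given installed version exposes the old or the new names
is not something this sketch checks beyond the presence of `load`.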

grantjenks · grantjenks · commit f68b8a69ffc6 · 2016-12-20T16:26:53.000-08:00
diff --git a/README.rst b/README.rst
@@ -82,11 +82,12 @@ Sometimes its interesting to explore the unigram and bigram counts
 themselves. These are stored in Python dictionaries mapping word to count. ::
 
     >>> import wordsegment as ws
-    >>> ws.unigram_counts['the']
+    >>> ws.load()
+    >>> ws.UNIGRAMS['the']
     23135851162.0
-    >>> ws.unigram_counts['gray']
+    >>> ws.UNIGRAMS['gray']
     21424658.0
-    >>> ws.unigram_counts['grey']
+    >>> ws.UNIGRAMS['grey']
     18276942.0
 
 Above we see that the spelling `gray` is more common than the spelling `grey`.
@@ -96,7 +97,7 @@ Bigrams are joined by a space::
     >>> import heapq
     >>> from pprint import pprint
     >>> from operator import itemgetter
-    >>> pprint(heapq.nlargest(10, ws.bigram_counts.items(), itemgetter(1)))
+    >>> pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
     [('of the', 2766332391.0),
      ('in the', 1628795324.0),
      ('to the', 1139248999.0),
@@ -110,9 +111,9 @@ Bigrams are joined by a space::
 
 Some bigrams begin with `<s>`. This is to indicate the start of a bigram::
 
-    >>> ws.bigram_counts['<s> where']
+    >>> ws.BIGRAMS['<s> where']
     15419048.0
-    >>> ws.bigram_counts['<s> what']
+    >>> ws.BIGRAMS['<s> what']
     11779290.0
 
 The unigrams and bigrams data is stored in the `wordsegment_data` directory in
diff --git a/docs/api.rst b/docs/api.rst
@@ -12,6 +12,11 @@ WordSegment API Reference
     Yield (prefix, suffix) pairs from `text` with len(prefix) not
     exceeding `limit`.
 
+.. py:function:: load()
+   :module: wordsegment
+
+    Load unigram and bigram counts from disk.
+
 .. py:function:: score(word, prev=None)
    :module: wordsegment
 
@@ -22,13 +27,13 @@ WordSegment API Reference
 
     Return a list of words that is the best segmenation of `text`.
 
-.. py:data:: unigram_counts
+.. py:data:: UNIGRAMS
    :module: wordsegment
 
     Mapping of (unigram, count) pairs.
     Loaded from the file 'wordsegment_data/unigrams.txt'.
 
-.. py:data:: bigram_counts
+.. py:data:: BIGRAMS
    :module: wordsegment
 
     Mapping of (bigram, count) pairs.
@@ -39,5 +44,5 @@ WordSegment API Reference
    :module: wordsegment
 
     Total number of unigrams in the corpus.
-    Need not match `sum(unigram_counts.values())`.
+    Need not match `sum(UNIGRAMS.values())`.
     Defaults to 1,024,908,267,229.
diff --git a/docs/conf.py b/docs/conf.py
@@ -14,7 +14,6 @@
 
 import sys
 import os
-import shlex
 
 # If extensions (or modules to document with autodoc) are in another directory,
 # add these directories to sys.path here. If the directory is relative to the
@@ -51,8 +50,8 @@
 master_doc = 'index'
 
 # General information about the project.
-project = u'WordSegment'
-copyright = u'2015, Grant Jenks'
+project = u'Word Segment'
+copyright = u'2016, Grant Jenks'
 author = u'Grant Jenks'
 
 # The version info for the project you're documenting, acts as replacement for
@@ -128,6 +127,7 @@
     'show_related': True,
     'github_user': 'grantjenks',
     'github_repo': 'wordsegment',
+    'github_type': 'star',
 }
 
 # Add any paths that contain custom themes here, relative to this directory.
diff --git a/docs/index.rst b/docs/index.rst
@@ -78,11 +78,12 @@ Sometimes its interesting to explore the unigram and bigram counts
 themselves. These are stored in Python dictionaries mapping word to count. ::
 
     >>> import wordsegment as ws
-    >>> ws.unigram_counts['the']
+    >>> ws.load()
+    >>> ws.UNIGRAMS['the']
     23135851162.0
-    >>> ws.unigram_counts['gray']
+    >>> ws.UNIGRAMS['gray']
     21424658.0
-    >>> ws.unigram_counts['grey']
+    >>> ws.UNIGRAMS['grey']
     18276942.0
 
 Above we see that the spelling `gray` is more common than the spelling `grey`.
@@ -92,7 +93,7 @@ Bigrams are joined by a space::
     >>> import heapq
     >>> from pprint import pprint
     >>> from operator import itemgetter
-    >>> pprint(heapq.nlargest(10, ws.bigram_counts.items(), itemgetter(1)))
+    >>> pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
     [('of the', 2766332391.0),
      ('in the', 1628795324.0),
      ('to the', 1139248999.0),
@@ -106,9 +107,9 @@ Bigrams are joined by a space::
 
 Some bigrams begin with `<s>`. This is to indicate the start of a bigram::
 
-    >>> ws.bigram_counts['<s> where']
+    >>> ws.BIGRAMS['<s> where']
     15419048.0
-    >>> ws.bigram_counts['<s> what']
+    >>> ws.BIGRAMS['<s> what']
     11779290.0
 
 The unigrams and bigrams data is stored in the `wordsegment_data` directory in
diff --git a/docs/using-a-different-corpus.rst b/docs/using-a-different-corpus.rst
@@ -1,4 +1,3 @@
-
 Using a Different Corpus
 ========================
 
@@ -22,44 +21,38 @@ Jane Austen's *Pride and Prejudice*.
     
     print len(text)
 
-
 .. parsed-literal::
 
     717573
 
-
 Great. We've got a new corpus for ``wordsegment``. Now let's look at
 what parts of the API we need to change. There's one function and two
-dictionaries: ``wordsegment.clean``, ``wordsegment.bigram_counts`` and
-``wordsegment.unigram_counts``. We'll work on these in reverse.
+dictionaries: ``wordsegment.clean``, ``wordsegment.BIGRAMS`` and
+``wordsegment.UNIGRAMS``. We'll work on these in reverse.
 
 .. code:: python
 
     import wordsegment
 
 .. code:: python
 
-    print type(wordsegment.unigram_counts), type(wordsegment.bigram_counts)
-
+    print type(wordsegment.UNIGRAMS), type(wordsegment.BIGRAMS)
 
 .. parsed-literal::
 
     <type 'dict'> <type 'dict'>
 
-
 .. code:: python
 
-    print wordsegment.unigram_counts.items()[:3]
-    print wordsegment.bigram_counts.items()[:3]
-
+    print wordsegment.UNIGRAMS.items()[:3]
+    print wordsegment.BIGRAMS.items()[:3]
 
 .. parsed-literal::
 
     [('biennials', 37548.0), ('verplank', 48349.0), ('tsukino', 19771.0)]
     [('personal effects', 151369.0), ('basic training', 294085.0), ('it absolutely', 130505.0)]
 
-
-Ok, so ``wordsegment.unigram_counts`` is just a dictionary mapping
+Ok, so ``wordsegment.UNIGRAMS`` is just a dictionary mapping
 unigrams to their counts. Let's write a method to tokenize our text.
 
 .. code:: python
@@ -72,19 +65,17 @@ unigrams to their counts. Let's write a method to tokenize our text.
     
     print list(tokenize("Wait, what did you say?"))
 
-
 .. parsed-literal::
 
     ['Wait', 'what', 'did', 'you', 'say']
 
-
 Now we'll build our dictionaries.
 
 .. code:: python
 
     from collections import Counter
     
-    wordsegment.unigram_counts = Counter(tokenize(text))
+    wordsegment.UNIGRAMS = Counter(tokenize(text))
     
     def pairs(iterable):
         iterator = iter(iterable)
@@ -94,7 +85,7 @@ Now we'll build our dictionaries.
             yield ' '.join(values)
             del values[0]
     
-    wordsegment.bigram_counts = Counter(pairs(tokenize(text)))
+    wordsegment.BIGRAMS = Counter(pairs(tokenize(text)))
 
 That's it.
 
@@ -115,24 +106,18 @@ input to ``segment``.
 
     wordsegment.segment('wantofawife')
 
-
-
-
 .. parsed-literal::
 
     ['want', 'of', 'a', 'wife']
 
-
-
 If you find this behaves poorly then you may need to change the
 ``wordsegment.TOTAL`` variable to reflect the total of all unigrams. In
 our case that's simply:
 
 .. code:: python
 
-    wordsegment.TOTAL = float(sum(wordsegment.unigram_counts.values()))
+    wordsegment.TOTAL = float(sum(wordsegment.UNIGRAMS.values()))
 
 WordSegment doesn't require any fancy machine learning training
 algorithms. Simply update the unigram and bigram count dictionaries and
 you're ready to go.
-