
Commit 4ba0d87

Fixes for doc8
1 parent fd2a783 · commit 4ba0d87

4 files changed: +32 −33 lines changed


README.rst

Lines changed: 9 additions & 9 deletions

@@ -7,15 +7,15 @@ Python Word Segmentation
 `WordSegment`_ is an Apache2 licensed module for English word
 segmentation, written in pure-Python, and based on a trillion-word corpus.
 
-Based on code from the chapter "`Natural Language Corpus Data`_" by Peter Norvig
-from the book "`Beautiful Data`_" (Segaran and Hammerbacher, 2009).
-
-Data files are derived from the `Google Web Trillion Word Corpus`_, as described
-by Thorsten Brants and Alex Franz, and `distributed`_ by the Linguistic Data
-Consortium. This module contains only a subset of that data. The unigram data
-includes only the most common 333,000 words. Similarly, bigram data includes
-only the most common 250,000 phrases. Every word and phrase is lowercased with
-punctuation removed.
+Based on code from the chapter "`Natural Language Corpus Data`_" by Peter
+Norvig from the book "`Beautiful Data`_" (Segaran and Hammerbacher, 2009).
+
+Data files are derived from the `Google Web Trillion Word Corpus`_, as
+described by Thorsten Brants and Alex Franz, and `distributed`_ by the
+Linguistic Data Consortium. This module contains only a subset of that
+data. The unigram data includes only the most common 333,000 words. Similarly,
+bigram data includes only the most common 250,000 phrases. Every word and
+phrase is lowercased with punctuation removed.
 
 .. _`WordSegment`: http://www.grantjenks.com/docs/wordsegment/
 .. _`Natural Language Corpus Data`: http://norvig.com/ngrams/
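
The diffed text above only describes the bundled data; as a point of
reference, a minimal usage sketch follows, assuming the module-level
``segment`` function referenced in docs/using-a-different-corpus.rst below
(the sketch is illustrative and not part of this commit).

.. code:: python

    # Illustrative sketch, not part of this commit. Assumes the
    # module-level segment() referenced elsewhere in these docs.
    import wordsegment

    # segment() picks the most probable split of a run-together string
    # using the bundled unigram and bigram counts described above.
    print(wordsegment.segment('thisisatest'))
    # expected: ['this', 'is', 'a', 'test']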

docs/index.rst

Lines changed: 9 additions & 9 deletions

@@ -4,15 +4,15 @@ Python Word Segmentation
 Python WordSegment is an Apache2 licensed module for English word segmentation,
 written in pure-Python, and based on a trillion-word corpus.
 
-Based on code from the chapter "`Natural Language Corpus Data`_" by Peter Norvig
-from the book "`Beautiful Data`_" (Segaran and Hammerbacher, 2009).
-
-Data files are derived from the `Google Web Trillion Word Corpus`_, as described
-by Thorsten Brants and Alex Franz, and `distributed`_ by the Linguistic Data
-Consortium. This module contains only a subset of that data. The unigram data
-includes only the most common 333,000 words. Similarly, bigram data includes
-only the most common 250,000 phrases. Every word and phrase is lowercased with
-punctuation removed.
+Based on code from the chapter "`Natural Language Corpus Data`_" by Peter
+Norvig from the book "`Beautiful Data`_" (Segaran and Hammerbacher, 2009).
+
+Data files are derived from the `Google Web Trillion Word Corpus`_, as
+described by Thorsten Brants and Alex Franz, and `distributed`_ by the
+Linguistic Data Consortium. This module contains only a subset of that
+data. The unigram data includes only the most common 333,000 words. Similarly,
+bigram data includes only the most common 250,000 phrases. Every word and
+phrase is lowercased with punctuation removed.
 
 .. _`Natural Language Corpus Data`: http://norvig.com/ngrams/
 .. _`Beautiful Data`: http://oreilly.com/catalog/9780596157111/

docs/python-load-dict-fast-from-file.rst

Lines changed: 5 additions & 6 deletions

@@ -45,16 +45,16 @@ platform.
     print subprocess.check_output([
         '/usr/sbin/sysctl', '-n', 'machdep.cpu.brand_string'
     ])
-
+
     import sys
     print sys.version


 .. parsed-literal::

     Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
-
-    2.7.10 (default, May 25 2015, 13:06:17)
+
+    2.7.10 (default, May 25 2015, 13:06:17)
     [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)]


@@ -204,10 +204,10 @@ two.
     pairs = [line.split('\t') for line in reader]
     words = [pair[0] for pair in pairs]
     counts = [float(pair[1]) for pair in pairs]
-
+
     with open('words.txt', 'wb') as writer:
        writer.write('\n'.join(words))
-
+
    from array import array
    values = array('d')
    values.fromlist(counts)

@@ -262,4 +262,3 @@ I also tried formatting the ``dict`` in a Python module which would be
 parsed on import. This was actually a little slower than the initial
 code. My guess is the Python interpreter is doing roughly the same
 thing.
-
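
The second hunk above writes the unigram words to ``words.txt`` and packs
their counts into an ``array('d')``; a hedged sketch of the matching
read-back path follows, assuming the counts were also dumped with
``array.tofile()`` to a file named ``counts.bin`` (that filename does not
appear in the visible diff).

.. code:: python

    # Sketch of the read-back side, under the assumption that the counts
    # array was dumped with values.tofile() to a hypothetical 'counts.bin';
    # only 'words.txt' is visible in the diff above.
    import os
    from array import array

    with open('words.txt') as reader:
        words = reader.read().split('\n')

    values = array('d')
    with open('counts.bin', 'rb') as reader:
        # fromfile() reads raw doubles straight into the array, skipping
        # the per-line float() parsing that a text format would need.
        values.fromfile(reader, os.path.getsize('counts.bin') // values.itemsize)

    unigrams = dict(zip(words, values))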

docs/using-a-different-corpus.rst

Lines changed: 9 additions & 9 deletions

@@ -14,11 +14,11 @@ Jane Austen's *Pride and Prejudice*.
 .. code:: python

     import requests
-
+
     response = requests.get('https://www.gutenberg.org/ebooks/1342.txt.utf-8')
-
+
     text = response.text
-
+
     print len(text)

 .. parsed-literal::

@@ -58,11 +58,11 @@ unigrams to their counts. Let's write a method to tokenize our text.
 .. code:: python

     import re
-
+
     def tokenize(text):
         pattern = re.compile('[a-zA-Z]+')
         return (match.group(0) for match in pattern.finditer(text))
-
+
     print list(tokenize("Wait, what did you say?"))

 .. parsed-literal::

@@ -74,17 +74,17 @@ Now we'll build our dictionaries.
 .. code:: python

     from collections import Counter
-
+
     wordsegment.UNIGRAMS = Counter(tokenize(text))
-
+
     def pairs(iterable):
         iterator = iter(iterable)
         values = [next(iterator)]
         for value in iterator:
             values.append(value)
             yield ' '.join(values)
             del values[0]
-
+
     wordsegment.BIGRAMS = Counter(pairs(tokenize(text)))

 That's it.

@@ -99,7 +99,7 @@ input to ``segment``.

     def identity(value):
         return value
-
+
     wordsegment.clean = identity

 .. code:: python
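
With ``UNIGRAMS``, ``BIGRAMS``, and ``clean`` replaced as in the hunks
above, a short hedged follow-up (not part of this commit; the phrase is an
illustrative guess drawn from the novel's opening line):

.. code:: python

    # Illustrative follow-up, not part of this commit: segment() now scores
    # candidate splits against the Pride and Prejudice counts built above.
    print(wordsegment.segment('wantofawife'))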
