
Commit 4ba0d87

Fixes for doc8
1 parent fd2a783 · commit 4ba0d87

4 files changed: +32 −33 lines changed


README.rst

Lines changed: 9 additions & 9 deletions

@@ -7,15 +7,15 @@ Python Word Segmentation
 `WordSegment`_ is an Apache2 licensed module for English word
 segmentation, written in pure-Python, and based on a trillion-word corpus.
 
-Based on code from the chapter "`Natural Language Corpus Data`_" by Peter Norvig
-from the book "`Beautiful Data`_" (Segaran and Hammerbacher, 2009).
-
-Data files are derived from the `Google Web Trillion Word Corpus`_, as described
-by Thorsten Brants and Alex Franz, and `distributed`_ by the Linguistic Data
-Consortium. This module contains only a subset of that data. The unigram data
-includes only the most common 333,000 words. Similarly, bigram data includes
-only the most common 250,000 phrases. Every word and phrase is lowercased with
-punctuation removed.
+Based on code from the chapter "`Natural Language Corpus Data`_" by Peter
+Norvig from the book "`Beautiful Data`_" (Segaran and Hammerbacher, 2009).
+
+Data files are derived from the `Google Web Trillion Word Corpus`_, as
+described by Thorsten Brants and Alex Franz, and `distributed`_ by the
+Linguistic Data Consortium. This module contains only a subset of that
+data. The unigram data includes only the most common 333,000 words. Similarly,
+bigram data includes only the most common 250,000 phrases. Every word and
+phrase is lowercased with punctuation removed.
 
 .. _`WordSegment`: http://www.grantjenks.com/docs/wordsegment/
 .. _`Natural Language Corpus Data`: http://norvig.com/ngrams/
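
The diffed text above only describes the bundled data; as a point of
reference, a minimal usage sketch follows, assuming the module-level
``segment`` function referenced in docs/using-a-different-corpus.rst below
(the sketch is illustrative and not part of this commit).

.. code:: python

    # Illustrative sketch, not part of this commit. Assumes the
    # module-level segment() referenced elsewhere in these docs.
    import wordsegment

    # segment() picks the most probable split of a run-together string
    # using the bundled unigram and bigram counts described above.
    print(wordsegment.segment('thisisatest'))
    # expected: ['this', 'is', 'a', 'test']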

docs/index.rst

Lines changed: 9 additions & 9 deletions

@@ -4,15 +4,15 @@ Python Word Segmentation
 Python WordSegment is an Apache2 licensed module for English word segmentation,
 written in pure-Python, and based on a trillion-word corpus.
 
-Based on code from the chapter "`Natural Language Corpus Data`_" by Peter Norvig
-from the book "`Beautiful Data`_" (Segaran and Hammerbacher, 2009).
-
-Data files are derived from the `Google Web Trillion Word Corpus`_, as described
-by Thorsten Brants and Alex Franz, and `distributed`_ by the Linguistic Data
-Consortium. This module contains only a subset of that data. The unigram data
-includes only the most common 333,000 words. Similarly, bigram data includes
-only the most common 250,000 phrases. Every word and phrase is lowercased with
-punctuation removed.
+Based on code from the chapter "`Natural Language Corpus Data`_" by Peter
+Norvig from the book "`Beautiful Data`_" (Segaran and Hammerbacher, 2009).
+
+Data files are derived from the `Google Web Trillion Word Corpus`_, as
+described by Thorsten Brants and Alex Franz, and `distributed`_ by the
+Linguistic Data Consortium. This module contains only a subset of that
+data. The unigram data includes only the most common 333,000 words. Similarly,
+bigram data includes only the most common 250,000 phrases. Every word and
+phrase is lowercased with punctuation removed.
 
 .. _`Natural Language Corpus Data`: http://norvig.com/ngrams/
 .. _`Beautiful Data`: http://oreilly.com/catalog/9780596157111/

docs/python-load-dict-fast-from-file.rst

Lines changed: 5 additions & 6 deletions

@@ -45,16 +45,16 @@ platform.
     print subprocess.check_output([
         '/usr/sbin/sysctl', '-n', 'machdep.cpu.brand_string'
     ])
-
+
     import sys
     print sys.version


 .. parsed-literal::

     Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
-
-    2.7.10 (default, May 25 2015, 13:06:17)
+
+    2.7.10 (default, May 25 2015, 13:06:17)
     [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)]


@@ -204,10 +204,10 @@ two.
     pairs = [line.split('\t') for line in reader]
     words = [pair[0] for pair in pairs]
     counts = [float(pair[1]) for pair in pairs]
-
+
     with open('words.txt', 'wb') as writer:
        writer.write('\n'.join(words))
-
+
    from array import array
    values = array('d')
    values.fromlist(counts)

@@ -262,4 +262,3 @@ I also tried formatting the ``dict`` in a Python module which would be
 parsed on import. This was actually a little slower than the initial
 code. My guess is the Python interpreter is doing roughly the same
 thing.
-
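
The second hunk above writes the unigram words to ``words.txt`` and packs
their counts into an ``array('d')``; a hedged sketch of the matching
read-back path follows, assuming the counts were also dumped with
``array.tofile()`` to a file named ``counts.bin`` (that filename does not
appear in the visible diff).

.. code:: python

    # Sketch of the read-back side, under the assumption that the counts
    # array was dumped with values.tofile() to a hypothetical 'counts.bin';
    # only 'words.txt' is visible in the diff above.
    import os
    from array import array

    with open('words.txt') as reader:
        words = reader.read().split('\n')

    values = array('d')
    with open('counts.bin', 'rb') as reader:
        # fromfile() reads raw doubles straight into the array, skipping
        # the per-line float() parsing that a text format would need.
        values.fromfile(reader, os.path.getsize('counts.bin') // values.itemsize)

    unigrams = dict(zip(words, values))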

docs/using-a-different-corpus.rst

Lines changed: 9 additions & 9 deletions

@@ -14,11 +14,11 @@ Jane Austen's *Pride and Prejudice*.
 .. code:: python

     import requests
-
+
     response = requests.get('https://www.gutenberg.org/ebooks/1342.txt.utf-8')
-
+
     text = response.text
-
+
     print len(text)

 .. parsed-literal::

@@ -58,11 +58,11 @@ unigrams to their counts. Let's write a method to tokenize our text.
 .. code:: python

     import re
-
+
     def tokenize(text):
         pattern = re.compile('[a-zA-Z]+')
         return (match.group(0) for match in pattern.finditer(text))
-
+
     print list(tokenize("Wait, what did you say?"))

 .. parsed-literal::

@@ -74,17 +74,17 @@ Now we'll build our dictionaries.
 .. code:: python

     from collections import Counter
-
+
     wordsegment.UNIGRAMS = Counter(tokenize(text))
-
+
     def pairs(iterable):
         iterator = iter(iterable)
         values = [next(iterator)]
         for value in iterator:
             values.append(value)
             yield ' '.join(values)
             del values[0]
-
+
     wordsegment.BIGRAMS = Counter(pairs(tokenize(text)))

 That's it.

@@ -99,7 +99,7 @@ input to ``segment``.

     def identity(value):
         return value
-
+
     wordsegment.clean = identity

 .. code:: python
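
With ``UNIGRAMS``, ``BIGRAMS``, and ``clean`` replaced as in the hunks
above, a short hedged follow-up (not part of this commit; the phrase is an
illustrative guess drawn from the novel's opening line):

.. code:: python

    # Illustrative follow-up, not part of this commit: segment() now scores
    # candidate splits against the Pride and Prejudice counts built above.
    print(wordsegment.segment('wantofawife'))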
