<!DOCTYPE html>
<html>
    <head>
        <script src="/bjc-r/llab/loader.js"></script>
        <title>Top Words</title>
    <!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-176402054-1"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-176402054-1');
</script>
</head>
    <body>
        <p>
            Now that we know how to count the number of occurrences of each word in a file, let's use that to find the most popular words in a file.
        </p>

        <p>At the bottom of word_analyzer.py, add the code below (not part of any function):
        </p>

    <pre>
      <code>
text = read_file("horse_ebooks.txt")
counts = count_words(text)
print(sorted(counts))</code>
    </pre>

    <p>The <code>sorted</code> function takes an iterable as input and returns a new, sorted list of its items. When you run this program, you'll see that it prints a list of the words in the dictionary sorted alphabetically (capitalized words come first, because Python orders uppercase letters before lowercase ones). However, we want them sorted by their values (i.e. by their counts in the original file).
    </p>
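    <p>As a small illustration of the behavior described above, the sketch below calls <code>sorted</code> on a tiny hand-made dictionary. The name <code>sample_counts</code> and its contents are made up for this example and are not taken from the lab files. Iterating over a dictionary yields its keys, so the counts play no role in the ordering here.
    </p>

```python
# A made-up dictionary standing in for the result of count_words.
sample_counts = {"to": 1, "Fruit": 1, "and": 5, "a": 2}

# sorted iterates over the dictionary's keys, so the counts are ignored here.
alphabetical = sorted(sample_counts)
print(alphabetical)  # ['Fruit', 'a', 'and', 'to']
```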

    <p>Luckily for us, <code>sorted</code> is a higher-order function. It can take an optional input called <code>key</code>, a function that tells <code>sorted</code> what to sort by. For example, we can sort a list of numbers by their absolute value by passing the <code>abs</code> function as the <code>key</code> parameter, as shown below. Here we're using Python's special syntax for optional parameters, where we set the optional parameter by name with an equals sign. There is no equivalent for this in Snap<em>!</em>, but the idea should hopefully be clear.
    </p>

    <pre>
      <code>&gt;&gt;&gt; sorted([-5, -1, 2, 3], key = abs)
[-1, 2, 3, -5]</code>
    </pre>
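    <p>Any function that takes one item and returns something comparable can serve as the <code>key</code>. As a further sketch (the list <code>words</code> is made up for this example), passing the built-in <code>len</code> function sorts strings by their length:
    </p>

```python
words = ["Vegetables", "on", "Store", "a"]

# key=len tells sorted to compare the words by their lengths.
by_length = sorted(words, key=len)
print(by_length)  # ['a', 'on', 'Store', 'Vegetables']

# The original list is left unchanged; sorted builds a new list.
print(words)  # ['Vegetables', 'on', 'Store', 'a']
```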


    <p>In our case, in order to find the most popular words, we want to sort the keys of the dictionary by their values. This may be a bit mind-blowing, but it means that we need to pass the <code>counts</code> dictionary's <code>get</code> method as the <code>key</code>, as shown below. Try copying and pasting this into word_analyzer.py to see what happens:
    </p>

    <pre>
      <code>
text = read_file("horse_ebooks.txt")
counts = count_words(text)
print(sorted(counts, key = counts.get))</code>
    </pre>

    <pre>
      <code>
$ python word_analyzer.py
['on', 'Budget', 'to', 'Fruit', 'Clean', 'Fruits', 'Store', 'at', 'a', 'and', 'Vegetables']
</code>
    </pre>


        <p>What this does is call the <code>get</code> method of the <code>counts</code> dictionary on each word and use the result (that word's count) for sorting. You'll see that the words are now sorted by frequency (i.e. their value in the dictionary). Because <code>sorted</code> puts the smallest items first by default, the less common words come first. If we want the opposite order, we can use another optional parameter of <code>sorted</code> called <code>reverse</code>, as shown below:
        </p>

    <pre>
      <code>
text = read_file("horse_ebooks.txt")
counts = count_words(text)
print(sorted(counts, key = counts.get, reverse = True))</code>
    </pre>
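    <p>To see what <code>key</code> and <code>reverse</code> are doing without reading a file, here is a minimal sketch on a hand-made dictionary. The name <code>sample_counts</code> and its contents are invented for illustration, and the counts were chosen to avoid ties so that the order is unambiguous.
    </p>

```python
# A made-up dictionary standing in for the result of count_words.
sample_counts = {"on": 1, "and": 5, "a": 2, "Vegetables": 4}

# get(word) looks up the count stored for that word.
print(sample_counts.get("and"))  # 5

# sorted calls sample_counts.get on every word and orders by the results.
least_first = sorted(sample_counts, key=sample_counts.get)
most_first = sorted(sample_counts, key=sample_counts.get, reverse=True)
print(least_first)  # ['on', 'a', 'Vegetables', 'and']
print(most_first)   # ['and', 'Vegetables', 'a', 'on']
```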

    <p>Using the examples above as inspiration, write a function <code>top_n_words</code> as defined below:
    </p>

<pre>
<code>def top_n_words(counts, n):
    """Returns the top n words by count. For example:

    top_n_words({'and': 5, 'on': 1, 'Vegetables': 5, 'Budget': 1, 'to': 1, 'Fruit': 1, 'a': 2, 'Clean': 1, 'Fruits': 1, 'Store': 1, 'at': 1}, 2)

    would return ["and", "Vegetables"].

    In the case of a tie, it doesn't matter which words are chosen to break the tie."""</code>
    </pre>

    <p>Make sure to test that <code>top_n_words</code> works. We haven't told you how to do this explicitly, but using what you know so far, try to figure out how to test <code>top_n_words</code> by printing its result.</p>

 <p class="alert quoteBlue">
Hint by example: <code>x[:5]</code> returns the first 5 items of a list.
 </p>
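 <p>
The slicing syntax from the hint can be tried on any list. A short sketch, using a made-up list called <code>letters</code>:
 </p>

```python
letters = ["a", "b", "c", "d", "e", "f", "g"]

# letters[:3] is a new list holding the first 3 items.
first_three = letters[:3]
print(first_three)  # ['a', 'b', 'c']

# Asking for more items than the list holds just returns the whole list.
print(letters[:10])  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```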

 <h3>Printing the Top N Words</h3>

    <p>Complete this section of the lab by writing a function <code>print_top_n_words</code> that prints out the top n words along with their counts, with one word on each line.</p>

<pre>
<code>def print_top_n_words(counts, n):
    """Prints the top n words along with their counts. For example:

    print_top_n_words({'and': 5, 'on': 1, 'Vegetables': 5, 'Budget': 1, 'to': 1, 'Fruit': 1, 'a': 2, 'Clean': 1, 'Fruits': 1, 'Store': 1, 'at': 1}, 2)

    would print:
    and 5
    Vegetables 5"""</code>
    </pre>


 <p class="alert quoteBlue">
Hint: Make sure to use your <code>top_n_words</code> function when writing this function. <br>
Hint 2: <code>"and " + 5</code> will cause a TypeError since Python doesn't want to add a string to a number. To fix this, use the <code>str</code> function to convert the number to a string, e.g. <code>"and " + str(5)</code>
 </p>
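 <p>
The second hint can be checked directly. The sketch below (the variable names <code>word</code>, <code>count</code>, and <code>line</code> are only for illustration) first triggers the TypeError and catches it, then builds the line correctly with <code>str</code>:
 </p>

```python
word = "and"
count = 5

try:
    line = word + " " + count  # adding a str and an int is not allowed
except TypeError as error:
    print("TypeError:", error)

# Converting the number to a string first makes the concatenation work.
line = word + " " + str(count)
print(line)  # and 5
```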


    <p>Try printing the top words in a few of the provided files (beatles.txt, nietzsche.txt, etc.). You may need to make n fairly large to see anything interesting.
    </p>

 <h3>Removing Boring Words (optional)</h3>

    <p>You might find it annoying that the most common words for most texts are boring things like "the". Create another function <code>top_n_words_except(counts, n, boring)</code> that returns the top words except for anything that appears in <code>boring</code>, which is a list of boring words.
    </p>
    </body>
</html>