alone. (The ASCII tab character should also be included for good
measure in a production script.)
At this point, we have data consisting of words separated by blank
space. The words only contain alphanumeric characters (and the
underscore). The next step is to break the data apart so that we have one
word per line. This makes the counting operation much easier, as we
will see shortly.
$ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \n' |
> tr -s ' ' '\n' | ...
This command turns blanks into newlines. The ‘-s’ option squeezes
multiple newline characters in the output into just one, removing
blank lines. (The ‘>’ is the shell’s “secondary prompt.” This is what
the shell prints when it notices you haven’t finished typing in all of
a command.)
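To see the squeezing in isolation, here is a minimal sketch that runs
only the last ‘tr’ stage. The sample text and the variable names
‘sample’ and ‘words’ are made up for illustration; they are not part of
‘whats.gnu’.

```shell
# Hypothetical sample text containing runs of several spaces.
sample='the  quick   brown fox'

# Translate each space to a newline; -s then squeezes every run of
# newlines in the output down to a single newline.
words=$(printf '%s\n' "$sample" | tr -s ' ' '\n')

# Print the result: one word per line, with no empty lines.
printf '%s\n' "$words"
```

Without ‘-s’, each extra space in the input would become an empty line
in the output, and those empty lines would later be counted as if they
were a word.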
We now have data consisting of one word per line, no punctuation, all
one case. We’re ready to count each word:
$ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \n' |
> tr -s ' ' '\n' | sort | uniq -c | ...
At this point, the data might look something like this:
60 a
2 able
6 about
1 above
2 accomplish
1 acquire
1 actually
2 additional
The output is sorted by word, not by count! What we want is the most
frequently used words first. Fortunately, this is easy to accomplish,
with the help of two more ‘sort’ options:
‘-n’
do a numeric sort, not a textual one
‘-r’
reverse the order of the sort
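As a small illustration of these two options, the following sketch
counts a made-up list of words and then ranks the counts. The word list
and the variable name ‘counts’ are assumptions chosen for the example.

```shell
# Hypothetical input: one word per line, as the earlier stages produce.
# sort groups identical words together so that uniq -c can count them;
# sort -n -r then orders the lines by count, largest first.
counts=$(printf '%s\n' the a the of the a | sort | uniq -c | sort -n -r)

# Each line holds a count followed by the word it belongs to.
printf '%s\n' "$counts"
```

Here ‘the’ appears three times, ‘a’ twice, and ‘of’ once, so ‘the’ is
listed first. The exact width of the leading padding before each count
depends on the ‘uniq’ implementation.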
The final pipeline looks like this:
$ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \n' |
> tr -s ' ' '\n' | sort | uniq -c | sort -n -r
⊣ 156 the
⊣ 60 a
⊣ 58 to
⊣ 51 of
⊣ 51 and
...
Whew! That’s a lot to digest. Yet, the same principles apply. With
six commands, on two lines (really one long one split for convenience),
we’ve created a program that does something interesting and useful, in
much less time than we could have written a C program to do the same
thing.
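If you expect to reuse the pipeline, one possible approach is to wrap it
in a shell function. This is only a sketch: the name ‘wordfreq’ is
hypothetical, and the function reads standard input instead of naming
‘whats.gnu’ directly.

```shell
# wordfreq: hypothetical helper name. Reads text on standard input and
# prints "count word" lines, most frequent word first.
wordfreq () {
    tr '[:upper:]' '[:lower:]' |   # fold everything to lowercase
    tr -cd '[:alnum:]_ \n' |       # keep only word characters, spaces, newlines
    tr -s ' ' '\n' |               # one word per line, no empty lines
    sort | uniq -c | sort -n -r    # count each word, then rank by count
}

# Try it on a made-up sentence and show the most frequent word.
printf 'The cat and the dog. The end!\n' | wordfreq | head -n 1
```

With the function defined, the original example becomes
‘wordfreq < whats.gnu’, and adding ‘| head’ limits the output to the
top few words.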
A minor modification to the above pipeline can give us a simple
spelling checker! To determine if you’ve spelled a word correctly, all
you have to do is look it up in a dictionary. If it is not there, then
chances are that your spelling is incorrect. So, we need a dictionary.
The conventional location for a dictionary is ‘/usr/share/dict/words’.
Now, how to compare our file with the dictionary? As before, we
generate a sorted list of words, one per line: