
Commit 3c1013c

edited uniq and comm chapters
1 parent f5736fe · commit 3c1013c


4 files changed: +18 -15 lines changed


comm.html

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -38,7 +38,7 @@
3838
Array.from(document.querySelectorAll('#sidebar a')).forEach(function(link) {
3939
link.setAttribute('tabIndex', sidebar === 'visible' ? 0 : -1);
4040
});
41-
</script><div id=content class=content><main><div class=sidetoc><nav class=pagetoc></nav></div><h1 id=comm><a class=header href=#comm>comm</a></h1><p>The <code>comm</code> command finds common and unique lines between two sorted files. These results are formatted as a table with three columns, one or more of these columns can be suppressed as needed.<h2 id=three-column-output><a class=header href=#three-column-output>Three column output</a></h2><p>Consider the sample input files as shown below:<pre><code class=language-bash># side by side view of the sample files
41+
</script><div id=content class=content><main><div class=sidetoc><nav class=pagetoc></nav></div><h1 id=comm><a class=header href=#comm>comm</a></h1><p>The <code>comm</code> command finds common and unique lines between two sorted files. These results are formatted as a table with three columns and one or more of these columns can be suppressed as required.<h2 id=three-column-output><a class=header href=#three-column-output>Three column output</a></h2><p>Consider the sample input files as shown below:<pre><code class=language-bash># side by side view of the sample files
4242
# note that these files are already sorted
4343
$ paste colors_1.txt colors_2.txt
4444
Blue Black
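As a minimal illustrative sketch (small inline inputs via process substitution rather than the chapter's colors_1.txt and colors_2.txt), the three-column layout and column suppression look like this:

# column 1: lines unique to the first file
# column 2: lines unique to the second file (prefixed by one tab)
# column 3: lines common to both files (prefixed by two tabs)
$ comm <(printf 'Blue\nBrown\nPurple\n') <(printf 'Black\nBlue\nGreen\n')
        Black
                Blue
Brown
        Green
Purple

# -1, -2 and -3 suppress the corresponding columns
# for example, -12 retains only the common lines
$ comm -12 <(printf 'Blue\nBrown\nPurple\n') <(printf 'Black\nBlue\nGreen\n')
Blue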
@@ -96,7 +96,7 @@
9696
Pink
9797
</code></pre><p>You can combine all the three options as well. Useful with the <code>--total</code> option to get only the count of lines for each of the three columns.<pre><code class=language-bash>$ comm --total -123 colors_1.txt colors_2.txt
9898
3 3 4 total
99-
</code></pre><h2 id=duplicate-lines><a class=header href=#duplicate-lines>Duplicate lines</a></h2><p>The number of duplicate lines in the common column will be minimum of the duplicate occurrences between the two files. Rest of the duplicate lines, if any, will be considered as unique to that file. Here's an example:<pre><code class=language-bash>$ paste list_1.txt list_2.txt
99+
</code></pre><h2 id=duplicate-lines><a class=header href=#duplicate-lines>Duplicate lines</a></h2><p>The number of duplicate lines in the common column will be minimum of the duplicate occurrences between the two files. Rest of the duplicate lines, if any, will be considered as unique to the file having the excess lines. Here's an example:<pre><code class=language-bash>$ paste list_1.txt list_2.txt
100100
apple cherry
101101
banana cherry
102102
cherry mango
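To make the "minimum of the duplicate occurrences" rule concrete, here's a small sketch with illustrative inline inputs (not the chapter's list_1.txt and list_2.txt):

# 'a' occurs 3 times in the first file and 2 times in the second,
# so 2 copies go to the common column and the extra copy is
# treated as unique to the first file
$ comm <(printf 'a\na\na\nb\n') <(printf 'a\na\nc\n')
                a
                a
a
b
        c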
@@ -117,6 +117,6 @@
117117
papaya
118118
</code></pre><h2 id=nul-separator><a class=header href=#nul-separator>NUL separator</a></h2><p>Use <code>-z</code> option if you want to use NUL character as the line separator. In this scenario, <code>comm</code> will ensure to add a final NUL character even if not present in the input.<pre><code class=language-bash>$ comm -z -12 <(printf 'a\0b\0c') <(printf 'a\0c\0x') | cat -v
119119
a^@c^@
120-
</code></pre><h2 id=alternatives><a class=header href=#alternatives>Alternatives</a></h2><p>Here's some alternate commands you can explore if <code>comm</code> isn't enough to solve your task. These alternatives do not require input to be sorted.<ul><li><a href=https://github.com/yarrow/zet>zet</a> — set operations on one or more input files<li><a href=https://learnbyexample.github.io/learn_gnugrep_ripgrep/frequently-used-options.html#comparing-lines-between-files>Comparing lines between files</a> section from my <strong>GNU grep</strong> ebook<li><a href=https://learnbyexample.github.io/learn_gnuawk/two-file-processing.html>Two file processing</a> chapter from my <strong>GNU awk</strong> ebook, both line and field based comparison<li><a href=https://learnbyexample.github.io/learn_perl_oneliners/two-file-processing.html>Two file processing</a> chapter from my <strong>Perl one-liners</strong> ebook, both line and field based comparison</ul></main><nav class=nav-wrapper aria-label="Page navigation"><a rel=prev href=uniq.html class="mobile-nav-chapters previous"title="Previous chapter"aria-label="Previous chapter"aria-keyshortcuts=Left> <i class="fa fa-angle-left"></i> </a><a rel=next href=join.html class="mobile-nav-chapters next"title="Next chapter"aria-label="Next chapter"aria-keyshortcuts=Right> <i class="fa fa-angle-right"></i> </a><div style="clear: both"></div></nav></div></div><nav class=nav-wide-wrapper aria-label="Page navigation"><a rel=prev href=uniq.html class="nav-chapters previous"title="Previous chapter"aria-label="Previous chapter"aria-keyshortcuts=Left> <i class="fa fa-angle-left"></i> </a><a rel=next href=join.html class="nav-chapters next"title="Next chapter"aria-label="Next chapter"aria-keyshortcuts=Right> <i class="fa fa-angle-right"></i> </a></nav></div><script>
120+
</code></pre><h2 id=alternatives><a class=header href=#alternatives>Alternatives</a></h2><p>Here's some alternate commands you can explore if <code>comm</code> isn't enough to solve your task. These alternatives do not require the input files to be sorted.<ul><li><a href=https://github.com/yarrow/zet>zet</a> — set operations on one or more input files<li><a href=https://learnbyexample.github.io/learn_gnugrep_ripgrep/frequently-used-options.html#comparing-lines-between-files>Comparing lines between files</a> section from my <strong>GNU grep</strong> ebook<li><a href=https://learnbyexample.github.io/learn_gnuawk/two-file-processing.html>Two file processing</a> chapter from my <strong>GNU awk</strong> ebook, both line and field based comparison<li><a href=https://learnbyexample.github.io/learn_perl_oneliners/two-file-processing.html>Two file processing</a> chapter from my <strong>Perl one-liners</strong> ebook, both line and field based comparison</ul></main><nav class=nav-wrapper aria-label="Page navigation"><a rel=prev href=uniq.html class="mobile-nav-chapters previous"title="Previous chapter"aria-label="Previous chapter"aria-keyshortcuts=Left> <i class="fa fa-angle-left"></i> </a><a rel=next href=join.html class="mobile-nav-chapters next"title="Next chapter"aria-label="Next chapter"aria-keyshortcuts=Right> <i class="fa fa-angle-right"></i> </a><div style="clear: both"></div></nav></div></div><nav class=nav-wide-wrapper aria-label="Page navigation"><a rel=prev href=uniq.html class="nav-chapters previous"title="Previous chapter"aria-label="Previous chapter"aria-keyshortcuts=Left> <i class="fa fa-angle-left"></i> </a><a rel=next href=join.html class="nav-chapters next"title="Next chapter"aria-label="Next chapter"aria-keyshortcuts=Right> <i class="fa fa-angle-right"></i> </a></nav></div><script>
121121
window.playground_copyable = true;
122122
</script><script src=elasticlunr.min.js charset=utf-8></script><script src=mark.min.js charset=utf-8></script><script src=searcher.js charset=utf-8></script><script src=clipboard.min.js charset=utf-8></script><script src=highlight.js charset=utf-8></script><script src=book.js charset=utf-8></script><script src=sidebar.js></script>
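For unsorted inputs, the awk approach linked above can be sketched roughly as shown below (file1 and file2 are placeholder names, and this is only an illustration, not an excerpt from the linked chapter):

# common lines: print lines of file2 that are also present in file1
# NR==FNR is true only while the first file is being read
$ awk 'NR==FNR{lines[$0]; next} $0 in lines' file1 file2

# lines unique to file2: negate the membership test
$ awk 'NR==FNR{lines[$0]; next} !($0 in lines)' file1 file2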

searchindex.js

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

searchindex.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

uniq.html

Lines changed: 13 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -38,23 +38,24 @@
3838
Array.from(document.querySelectorAll('#sidebar a')).forEach(function(link) {
3939
link.setAttribute('tabIndex', sidebar === 'visible' ? 0 : -1);
4040
});
41-
</script><div id=content class=content><main><div class=sidetoc><nav class=pagetoc></nav></div><h1 id=uniq><a class=header href=#uniq>uniq</a></h1><p>The <code>uniq</code> command identifies similar lines that are adjacent to each other. There are various options to help you filter unique or duplicate lines, count them, group them, etc.<h2 id=retain-single-copy-of-duplicates><a class=header href=#retain-single-copy-of-duplicates>Retain single copy of duplicates</a></h2><p>This is the default behavior of the <code>uniq</code> command. If adjacent lines are the same, only the first copy will be displayed in the output. Unlike <code>sort</code>, the <code>uniq</code> command doesn't have to read the entire input since it compares only the lines that are next to each other.<pre><code class=language-bash># uniq will add a newline even if not present for the last input line
41+
</script><div id=content class=content><main><div class=sidetoc><nav class=pagetoc></nav></div><h1 id=uniq><a class=header href=#uniq>uniq</a></h1><p>The <code>uniq</code> command identifies similar lines that are adjacent to each other. There are various options to help you filter unique or duplicate lines, count them, group them, etc.<h2 id=retain-single-copy-of-duplicates><a class=header href=#retain-single-copy-of-duplicates>Retain single copy of duplicates</a></h2><p>This is the default behavior of the <code>uniq</code> command. If adjacent lines are the same, only the first copy will be displayed in the output.<pre><code class=language-bash># only the adjacent lines are compared to determine duplicates
42+
# which is why you get 'red' twice in the output for this input
4243
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | uniq
4344
red
4445
green
4546
red
4647
blue
47-
</code></pre><p>If you want to retain only a single copy based on the entire input contents, one option is to sort the input before applying <code>uniq</code>. Or, use <code>sort -u</code> if applicable.<pre><code class=language-bash># same as sort -u for this case
48+
</code></pre><p>You'll need sorted input to make sure all the input lines are considered to determine duplicates. For some cases, <code>sort -u</code> is enough, like the example shown below:<pre><code class=language-bash># same as sort -u for this case
4849
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | sort | uniq
4950
blue
5051
green
5152
red
52-
</code></pre><p>Sometimes though, you want to sort based on some specific criteria but identify duplicates based on the entire line contents. <code>uniq</code> will help in such cases.<pre><code class=language-bash># can't use sort -n -u here
53+
</code></pre><p>Sometimes though, you may need to sort based on some specific criteria and then identify duplicates based on the entire line contents. Here's an example:<pre><code class=language-bash># can't use sort -n -u here
5354
$ printf '2 balls\n13 pens\n2 pins\n13 pens\n' | sort -n | uniq
5455
2 balls
5556
2 pins
5657
13 pens
57-
</code></pre><p>If you need to preserve input order, use alternatives like <code>awk</code>, <code>perl</code> and <code>huniq</code>.<pre><code class=language-bash># retain single copy of duplicates, maintain input order
58+
</code></pre><blockquote><p><img src=./images/info.svg alt=info> <code>sort+uniq</code> won't be suitable if you need to preserve the input order as well. You can use alternatives like <code>awk</code>, <code>perl</code> and <a href=https://github.com/koraa/huniq>huniq</a> for such cases.</blockquote><pre><code class=language-bash># retain single copy of duplicates, maintain input order
5859
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | awk '!seen[$0]++'
5960
red
6061
green
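For the perl alternative mentioned in the note above, a roughly equivalent one-liner (a sketch, not an excerpt from the ebook) is:

# -l strips the line separator on input and adds it back on print,
# so the final line without a newline is still matched against the earlier 'blue'
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | perl -lne 'print if !$seen{$_}++'
red
green
blue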
@@ -83,6 +84,7 @@
8384
toothpaste
8485
washing powder
8586

87+
# just a reminder that uniq works based on adjacent lines only
8688
$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | uniq -u
8789
green
8890
red
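By contrast, sorting first makes uniq -u report only the lines that occur exactly once in the entire input (a small illustrative follow-up using the same sample input):

$ printf 'red\nred\nred\ngreen\nred\nblue\nblue' | sort | uniq -u
green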
@@ -131,12 +133,12 @@
131133
1 toothpaste
132134
1 soap
133135
</code></pre><h2 id=ignoring-case><a class=header href=#ignoring-case>Ignoring case</a></h2><p>Use the <code>-i</code> option to ignore case while determining duplicates.<pre><code class=language-bash># depending on your locale, sort and sort -f can give the same results
134-
$ printf 'cat\nbat\nCAT\ncar\nbat\n' | sort -f | uniq -iD
136+
$ printf 'cat\nbat\nCAT\ncar\nbat\nmat\nmoat' | sort -f | uniq -iD
135137
bat
136138
bat
137139
cat
138140
CAT
139-
</code></pre><h2 id=partial-match><a class=header href=#partial-match>Partial match</a></h2><p><code>uniq</code> has three options to change the matching criteria to partial parts of the input line. These aren't as powerful as the <code>sort -k</code> option, but they do come in handy for some use cases.<p>The <code>-f</code> option allows you to skip first <code>N</code> fields. Field separation is based on one or more space/tab characters only. Note that these separators will still be part of the field contents, so this will not work with variable number of blanks.<pre><code class=language-bash># skip first field
141+
</code></pre><h2 id=partial-match><a class=header href=#partial-match>Partial match</a></h2><p><code>uniq</code> has three options to change the matching criteria to partial parts of the input line. These aren't as powerful as the <code>sort -k</code> option, but they do come in handy for some use cases.<p>The <code>-f</code> option allows you to skip first <code>N</code> fields. Field separation is based on one or more space/tab characters only. Note that these separators will still be part of the field contents, so this will not work with variable number of blanks.<pre><code class=language-bash># skip first field, works as expected since no. of blanks is consistent
140142
$ printf '2 cars\n5 cars\n10 jeeps\n5 jeeps\n3 trucks\n' | uniq -f1 --group
141143
2 cars
142144
5 cars
@@ -147,22 +149,23 @@
147149
3 trucks
148150

149151
# example with variable number of blanks
150-
$ printf '2 cars\n5 cars\n10 jeeps\n5 jeeps\n3 trucks\n' | uniq -f1
152+
# 'cars' entries were identified as duplicates, but not 'jeeps'
153+
$ printf '2 cars\n5 cars\n1 jeeps\n5 jeeps\n3 trucks\n' | uniq -f1
151154
2 cars
152-
10 jeeps
155+
1 jeeps
153156
5 jeeps
154157
3 trucks
155158
</code></pre><p>The <code>-s</code> option allows you to skip first <code>N</code> characters (calculated as bytes).<pre><code class=language-bash># skip first character
156159
$ printf '* red\n* green\n- green\n* blue\n= blue' | uniq -s1
157160
* red
158161
* green
159162
* blue
160-
</code></pre><p>The <code>-w</code> option allows you to specify a maximum of <code>N</code> characters to be used for comparison (calculated as bytes).<pre><code class=language-bash># compare only first 2 characters
163+
</code></pre><p>The <code>-w</code> option restricts the comparison to the first <code>N</code> characters (calculated as bytes).<pre><code class=language-bash># compare only first 2 characters
161164
$ printf '1) apple\n1) almond\n2) banana\n3) cherry' | uniq -w2
162165
1) apple
163166
2) banana
164167
3) cherry
165-
</code></pre><p>When these options are used simultaneously, the priority is <code>-f</code> first, then <code>-s</code> and then <code>-w</code> option. Remember that blanks are part of the field content.<pre><code class=language-bash># skip first field
168+
</code></pre><p>When these options are used simultaneously, the priority is <code>-f</code> first, then <code>-s</code> and finally <code>-w</code> option. Remember that blanks are part of the field content.<pre><code class=language-bash># skip first field
166169
# then skip first two characters (including the blank character)
167170
# use next two characters for comparison ('bl' and 'ch' in this example)
168171
$ printf '2 @blue\n10 :black\n5 :cherry\n3 @chalk' | uniq -f1 -s2 -w2
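The output of this last snippet falls outside the hunk. Going by the comparison keys spelled out in the comments ('bl' for the first two lines, 'ch' for the last two), the expected result would be the first line of each run:

2 @blue
5 :cherry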
