OK. Let's sort our expression file using sort:
$ sort E-MTAB-2754-analytics.tsv | head
ENSGALG00000000003 PANX2 0.100242375805959 -0.4
ENSGALG00000000011 C10orf88 0.0802046773105167 0.2
ENSGALG00000000038 CTRB2 NA 0.2
ENSGALG00000000044 WFIKKN1 NA 0
ENSGALG00000000048 0.288103121752422 0.4
ENSGALG00000000055 LAMTOR3 0.529728058895927 0.1
ENSGALG00000000059 TUBB3 0.228430079834946 -0.2
ENSGALG00000000067 SPR 0.0560358954256604 -0.4
ENSGALG00000000071 0.878861305389193 0
ENSGALG00000000081 IL4I1 NA 0
- This isn't right.
- sort is sorting by the first column. Let's fix this by indicating we want to sort by the 3rd column (p-value).
- And, look, we have 'NA' in the 3rd column (the p-value column).
- Are there any other unexpected values in the p-value column?
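One quick way to answer that last question is to pull out the 3rd column and count everything that does not look like a number. This is only a sketch: the tiny sample file is made up from rows in the listing above (the real file is much bigger), the regular expression is my own guess at what a "normal" p-value looks like, and the empty gene name is assumed to be an empty field between two tabs.

```shell
# Hypothetical sample: a few rows copied from the listing above.
sample=$(mktemp)
printf 'Gene ID\tGene Name\tg1_g2.p-value\tg1_g2.log2foldchange\n' > "$sample"
printf 'ENSGALG00000000003\tPANX2\t0.100242375805959\t-0.4\n' >> "$sample"
printf 'ENSGALG00000000038\tCTRB2\tNA\t0.2\n' >> "$sample"
printf 'ENSGALG00000000048\t\t0.288103121752422\t0.4\n' >> "$sample"
printf 'ENSGALG00000000122\tGNB2L1\t1.60561843551779e-20\t-0.6\n' >> "$sample"

# Skip the header, keep column 3, drop anything that looks like a plain or
# scientific-notation number, then count what is left over.
odd_values=$(tail -n +2 "$sample" | cut -f3 \
  | grep -v -E '^[0-9.]+(e-?[0-9]+)?$' | sort | uniq -c)
echo "$odd_values"

rm -f "$sample"
```

On the real file you would swap `"$sample"` for `E-MTAB-2754-analytics.tsv`. If anything other than 'NA' shows up in the count, we would want to know before sorting.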
Review the file with more to check for unexpected values:
$ more E-MTAB-2754-analytics.tsv
Gene ID Gene Name g1_g2.p-value g1_g2.log2foldchange
ENSGALG00000000003 PANX2 0.100242375805959 -0.4
ENSGALG00000000011 C10orf88 0.0802046773105167 0.2
ENSGALG00000000038 CTRB2 NA 0.2
ENSGALG00000000044 WFIKKN1 NA 0
ENSGALG00000000048 0.288103121752422 0.4
ENSGALG00000000055 LAMTOR3 0.529728058895927 0.1
ENSGALG00000000059 TUBB3 0.228430079834946 -0.2
ENSGALG00000000067 SPR 0.0560358954256604 -0.4
ENSGALG00000000071 0.878861305389193 0
ENSGALG00000000081 IL4I1 NA 0
ENSGALG00000000086 TIMM17B 0.80618323892789 0
ENSGALG00000000091 INTS4 0.981452026849213 0
ENSGALG00000000094 ADIPOR1 0.283972808349171 -0.1
ENSGALG00000000102 UBE2T 0.901824687993534 0
ENSGALG00000000104 0.28635973460605 -0.4
ENSGALG00000000106 LRIF1 0.777128323405445 0
ENSGALG00000000107 TRIM7.1 NA 0
ENSGALG00000000109 AFMID 0.598969080234545 0.2
ENSGALG00000000112 PLP1 NA -0.1
ENSGALG00000000115 HEP21 NA 0
ENSGALG00000000120 BMP10 NA 0
ENSGALG00000000122 GNB2L1 1.60561843551779e-20 -0.6
ENSGALG00000000129 TCF25 0.756222862478979 0
ENSGALG00000000136 BLEC2 NA 0
ENSGALG00000000137 SNRPE 0.232542507231953 0.2
ENSGALG00000000141 BLB1 0.00790723600064685 0.9
ENSGALG00000000142 CEPT1 0.788385941146629 0
ENSGALG00000000146 SUPT6H 0.934343074557283 0
ENSGALG00000000150 RPL9 0.380362601236553 0.2
ENSGALG00000000151 ADAMTS19 NA 0
ENSGALG00000000154 DENND2D 0.527022644746131 0.2
ENSGALG00000000156 BRD2 3.43933719323364e-09 -0.5
ENSGALG00000000158 DMA 0.338457297207509 0.3
ENSGALG00000000161 ISOC1 0.401325361605509 0.2
ENSGALG00000000162 DMB1 NA 0
ENSGALG00000000164 MYBPHL NA 0
ENSGALG00000000168 ADORA1 NA 0
ENSGALG00000000172 MYOG NA 0
ENSGALG00000000184 SLC27A6 NA 0
ENSGALG00000000186 MTCH1 0.244657916348403 -0.3
ENSGALG00000000189 YTHDC2 0.818791595599391 0
ENSGALG00000000195 SLC23A2 0.00291000489916312 0.3
ENSGALG00000000201 PLEKHM1 0.104412345521551 0.2
ENSGALG00000000208 MCC 6.73362741791211e-11 -1.4
ENSGALG00000000209 PRNP 3.07006995025324e-18 0.8
ENSGALG00000000215 DCP2 0.000936131627143128 -0.4
ENSGALG00000000217 PPFIA4 0.676718383163428 0.1
- Right away, I see we have scientific notation.
- Also I notice we have some missing entries in the 2nd column.
- Let's keep these points in mind.
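To put numbers on those two observations, something like the following could be used. It is a sketch on a made-up sample built from rows shown above; the variable names are mine, and the empty gene name is assumed to be an empty field between two tabs.

```shell
# Hypothetical sample: rows copied from the listing above.
sample=$(mktemp)
printf 'Gene ID\tGene Name\tg1_g2.p-value\tg1_g2.log2foldchange\n' > "$sample"
printf 'ENSGALG00000000003\tPANX2\t0.100242375805959\t-0.4\n' >> "$sample"
printf 'ENSGALG00000000048\t\t0.288103121752422\t0.4\n' >> "$sample"
printf 'ENSGALG00000000071\t\t0.878861305389193\t0\n' >> "$sample"
printf 'ENSGALG00000000122\tGNB2L1\t1.60561843551779e-20\t-0.6\n' >> "$sample"
printf 'ENSGALG00000000156\tBRD2\t3.43933719323364e-09\t-0.5\n' >> "$sample"

# Count data rows whose 2nd tab-separated field is empty.
missing_names=$(awk -F'\t' 'NR > 1 && $2 == "" { n++ } END { print n + 0 }' "$sample")

# Count p-values written in scientific notation (digits, "e", exponent).
sci_values=$(tail -n +2 "$sample" | cut -f3 | grep -c -E '^[0-9.]+e[-+]?[0-9]+$')

echo "rows without a gene name: $missing_names"
echo "p-values in scientific notation: $sci_values"

rm -f "$sample"
```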
Let's sort by the third column (p-value):
$ sort -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000000295 0.807343900207163 -0.1
ENSGALG00000000407 0.416558476444968 -0.1
ENSGALG00000000497 0.77516911446794 -0.1
ENSGALG00000000629 0.657700041880819 -0.1
ENSGALG00000001110 0.632025804512114 -0.1
ENSGALG00000001304 0.718829473200678 -0.1
ENSGALG00000001533 0.90216379088787 -0.1
ENSGALG00000001620 NA -0.1
ENSGALG00000001701 0.814909576526554 -0.1
ENSGALG00000001720 0.363542657094163 -0.1
- Hmm. This doesn't look right either.
- Looks like we are sorting by the 4th column, the log2foldchange column.
- Why?? By default sort splits fields on runs of blanks, so two tabs in a row (an empty 2nd column) count as one separator and those rows appear to have only 3 columns.
- Let's indicate that we have tab characters as a column separator.
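Here is a small illustration of the field-counting problem. awk's default splitting also collapses runs of blanks, so I am using it as a stand-in to show the idea; this is not a look inside sort itself. The two sample rows come from the listing above, with the missing gene name assumed to be an empty field between two tabs.

```shell
# Hypothetical sample: one row with a gene name, one without.
sample=$(mktemp)
printf 'ENSGALG00000000003\tPANX2\t0.100242375805959\t-0.4\n' > "$sample"
printf 'ENSGALG00000000048\t\t0.288103121752422\t0.4\n' >> "$sample"

# Default splitting: a run of blanks counts as a single separator.
default_counts=$(awk '{ printf "%s%s", sep, NF; sep = " " } END { print "" }' "$sample")

# Tab-only splitting: the empty field between two tabs still counts.
tab_counts=$(awk -F'\t' '{ printf "%s%s", sep, NF; sep = " " } END { print "" }' "$sample")

echo "fields per row, default separator: $default_counts"
echo "fields per row, tab separator:     $tab_counts"

rm -f "$sample"
```

With the default separator the second row has only 3 fields, so "column 3" on that row is really the log2foldchange value.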
Use sort and specify our column separator character:
$ sort -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000002286 H3F3B 0 -7
ENSGALG00000020078 H3F3C 0 -8
ENSGALG00000008094 HSPD1 0.000100309166748761 -0.3
ENSGALG00000042503 0.000100639846707172 -0.5
ENSGALG00000039013 EXT1 0.000101180056590984 -0.5
ENSGALG00000002162 ACO1 0.000101317230226459 -0.4
ENSGALG00000001024 TPRG1L 0.000101701679175278 -0.3
ENSGALG00000026757 DHFR 0.0001017195368379 -0.4
ENSGALG00000028501 DCK 0.000101840426398403 -0.3
ENSGALG00000003580 MMP2 0.000102357608503999 0.4
- OK. That looks better. The 3rd column looks like it is being sorted.
- Wait, we had p-values with scientific notation (3.43933719323364e-09). Shouldn't they come first?
- Are we sorting by characters or by numbers? Characters! We should be sorting by numerical values!!
Use sort and specify that we want to sort column 3 numerically:
$ sort -n -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000000038 CTRB2 NA 0.2
ENSGALG00000000044 WFIKKN1 NA 0
ENSGALG00000000081 IL4I1 NA 0
ENSGALG00000000107 TRIM7.1 NA 0
ENSGALG00000000112 PLP1 NA -0.1
ENSGALG00000000115 HEP21 NA 0
ENSGALG00000000120 BMP10 NA 0
ENSGALG00000000136 BLEC2 NA 0
ENSGALG00000000151 ADAMTS19 NA 0
ENSGALG00000000162 DMB1 NA 0
- Wait, what? Now we have 'NA' sorted to the top.
- What about the other end of our list? Let's reverse the sort.
Reverse sort to take a quick look at the other end of our sorted file:
$ sort -r -n -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000041634 ACTG2 9.99765161055154e-06 0.5
ENSGALG00000003959 AP1M1 9.99364241038882e-05 -0.3
ENSGALG00000026019 ASCC1 9.99213530582087e-06 -0.4
ENSGALG00000029817 USP22 9.95241747734131e-18 0.6
ENSGALG00000007430 ARCN1 9.92710147646906e-11 0.5
ENSGALG00000035182 CMKLR1 9.9135182215432e-10 0.8
ENSGALG00000005617 NTHL1 9.9135182215432e-10 0.6
ENSGALG00000009112 LRRC57 9.88331240104096e-08 0.6
ENSGALG00000015446 POU2F1 9.8793969295065e-06 -0.5
ENSGALG00000016279 RAB23 9.86441714514607e-05 -0.6
- Huh. Looks a bit better, but it is sorting by the leading digits and ignoring the exponent of the scientific notation.
- We need to sort by the actual value and not just the starting digits.
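The difference is easy to see on four p-values taken from the listings above. This is a minimal sketch: `LC_ALL=C` is only there to keep the result the same from system to system, and the variable names are mine.

```shell
# Four p-values from the file, two of them in scientific notation.
values='0.000100309166748761
3.43933719323364e-09
0.00790723600064685
9.95241747734131e-18'

# -n reads only the leading digits, so 9.95...e-18 is treated like 9.95.
numeric_order=$(printf '%s\n' "$values" | LC_ALL=C sort -n | tr '\n' ' ')

# -g parses the whole value, exponent included.
general_order=$(printf '%s\n' "$values" | LC_ALL=C sort -g | tr '\n' ' ')

echo "sort -n: $numeric_order"
echo "sort -g: $general_order"
```

With `-n` the smallest value appears to be 0.000100309166748761, while `-g` correctly puts 9.95241747734131e-18 first.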
Use sort with -g option (general-numeric-sort):
$ sort -g -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000000038 CTRB2 NA 0.2
ENSGALG00000000044 WFIKKN1 NA 0
ENSGALG00000000081 IL4I1 NA 0
ENSGALG00000000107 TRIM7.1 NA 0
ENSGALG00000000112 PLP1 NA -0.1
ENSGALG00000000115 HEP21 NA 0
ENSGALG00000000120 BMP10 NA 0
ENSGALG00000000136 BLEC2 NA 0
ENSGALG00000000151 ADAMTS19 NA 0
ENSGALG00000000162 DMB1 NA 0
- Alright, well, not sure if it worked because we have these 'NA' values.
- Let's check out the other end of the file.
Use a reverse sort, or take the tail of your original sort, to check out the other end of the sorted file:
$ sort -r -g -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000035699 MARCHF6 0.999851391175025 0
ENSGALG00000021365 DCTN3 0.999719667557785 0
ENSGALG00000018264 gga-let-7d 0.999535070448575 0
ENSGALG00000041334 SNRPD1 0.999327008134205 0
ENSGALG00000007493 NSDHL 0.999327008134205 0
ENSGALG00000010980 STK31 0.999244795403493 0
ENSGALG00000033513 ELL 0.99911451358804 0
ENSGALG00000040018 NUP98 0.998960114063576 0
ENSGALG00000051668 0.998808098961675 0
ENSGALG00000048351 0.998647215307104 0
- OK. This looks good so far. Let's get rid of the 'NA' values then check out the top and bottom of our sort
- If you still have issues with your system's sort, check out this post.
Use grep -v to remove lines that match a pattern:
$ cat E-MTAB-2754-analytics.tsv | grep -v -e $'\tNA\t' | sort -g -t$'\t' -k3 | head
Gene ID Gene Name g1_g2.p-value g1_g2.log2foldchange
ENSGALG00000002286 H3F3B 0 -7
ENSGALG00000020078 H3F3C 0 -8
ENSGALG00000027353 C10orf71 1.29880461971355e-212 -3.7
ENSGALG00000030407 IL6R 3.8485270968074e-183 2.4
ENSGALG00000037669 FKBP9 3.58400895215323e-177 -3.6
ENSGALG00000031786 BLVRA 5.34950331436035e-177 2.2
ENSGALG00000023279 EPN3 3.12236315199182e-170 -2.9
ENSGALG00000011551 JCHAIN 6.47982639371527e-144 -1.7
ENSGALG00000002911 MOXD1 1.29318024850042e-142 -2.4
- Looks better. Scientific values look appropriately sorted.
- What about the other end of the list?
Review the other end of the sorted file using a reverse sort or a tail of the original sort:
$ cat E-MTAB-2754-analytics.tsv | grep -v -e $'\tNA\t' | sort -r -g -t$'\t' -k3 | head
ENSGALG00000035699 MARCHF6 0.999851391175025 0
ENSGALG00000021365 DCTN3 0.999719667557785 0
ENSGALG00000018264 gga-let-7d 0.999535070448575 0
ENSGALG00000041334 SNRPD1 0.999327008134205 0
ENSGALG00000007493 NSDHL 0.999327008134205 0
ENSGALG00000010980 STK31 0.999244795403493 0
ENSGALG00000033513 ELL 0.99911451358804 0
ENSGALG00000040018 NUP98 0.998960114063576 0
ENSGALG00000051668 0.998808098961675 0
ENSGALG00000048351 0.998647215307104 0
- OK!! I think we got our p-value sort!!
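The grep filter used above drops any line that has 'NA' between two tabs, whichever column it is in. If you would rather test the p-value column only, awk is one option. This is a sketch on a made-up sample built from rows shown earlier, not a replacement for the commands in this walkthrough. Two choices here are mine: `-k3,3` limits the sort key to column 3 alone, and `tab=$(printf '\t')` is a portable way to get the same tab character that `$'\t'` gives you in bash.

```shell
# Hypothetical sample: header plus four rows copied from the listings above.
sample=$(mktemp)
printf 'Gene ID\tGene Name\tg1_g2.p-value\tg1_g2.log2foldchange\n' > "$sample"
printf 'ENSGALG00000000003\tPANX2\t0.100242375805959\t-0.4\n' >> "$sample"
printf 'ENSGALG00000000038\tCTRB2\tNA\t0.2\n' >> "$sample"
printf 'ENSGALG00000000048\t\t0.288103121752422\t0.4\n' >> "$sample"
printf 'ENSGALG00000000156\tBRD2\t3.43933719323364e-09\t-0.5\n' >> "$sample"

tab=$(printf '\t')

# Keep rows whose 3rd tab-separated field is not exactly "NA",
# then sort on that field with general-numeric comparison.
filtered=$(awk -F'\t' '$3 != "NA"' "$sample" | LC_ALL=C sort -g -t "$tab" -k3,3)
echo "$filtered"

rm -f "$sample"
```

The header line is not a number, so it ends up at the top of the `-g` sort here, just as in the output above.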
Let's keep this file:
$ cat E-MTAB-2754-analytics.tsv | grep -v -e $'\tNA\t' | sort -g -t$'\t' -k3 > pvalue_sorted.tsv
Let's double check the new file. Does the top look okay?
$ head pvalue_sorted.tsv
Gene ID Gene Name g1_g2.p-value g1_g2.log2foldchange
ENSGALG00000002286 H3F3B 0 -7
ENSGALG00000020078 H3F3C 0 -8
ENSGALG00000027353 C10orf71 1.29880461971355e-212 -3.7
ENSGALG00000030407 IL6R 3.8485270968074e-183 2.4
ENSGALG00000037669 FKBP9 3.58400895215323e-177 -3.6
ENSGALG00000031786 BLVRA 5.34950331436035e-177 2.2
ENSGALG00000023279 EPN3 3.12236315199182e-170 -2.9
ENSGALG00000011551 JCHAIN 6.47982639371527e-144 -1.7
ENSGALG00000002911 MOXD1 1.29318024850042e-142 -2.4
- GOOD!!
How about the bottom?
$ tail pvalue_sorted.tsv
ENSGALG00000048351 0.998647215307104 0
ENSGALG00000051668 0.998808098961675 0
ENSGALG00000040018 NUP98 0.998960114063576 0
ENSGALG00000033513 ELL 0.99911451358804 0
ENSGALG00000010980 STK31 0.999244795403493 0
ENSGALG00000007493 NSDHL 0.999327008134205 0
ENSGALG00000041334 SNRPD1 0.999327008134205 0
ENSGALG00000018264 gga-let-7d 0.999535070448575 0
ENSGALG00000021365 DCTN3 0.999719667557785 0
ENSGALG00000035699 MARCHF6 0.999851391175025 0
- GOOD!!
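Eyeballing head and tail only covers the two ends of the file. sort can also check the whole thing with `-c`, which reports whether the input is already in order under the given options. Below is a sketch on a made-up stand-in for pvalue_sorted.tsv (a header and three rows from the output above); the `status` variable and the `-k3,3` key are my own choices.

```shell
# Hypothetical stand-in for pvalue_sorted.tsv, already in p-value order.
sorted_sample=$(mktemp)
printf 'Gene ID\tGene Name\tg1_g2.p-value\tg1_g2.log2foldchange\n' > "$sorted_sample"
printf 'ENSGALG00000002286\tH3F3B\t0\t-7\n' >> "$sorted_sample"
printf 'ENSGALG00000027353\tC10orf71\t1.29880461971355e-212\t-3.7\n' >> "$sorted_sample"
printf 'ENSGALG00000035699\tMARCHF6\t0.999851391175025\t0\n' >> "$sorted_sample"

tab=$(printf '\t')

# -c only checks the order; it exits non-zero at the first line out of place.
if LC_ALL=C sort -c -g -t "$tab" -k3,3 "$sorted_sample"; then
  status="sorted"
else
  status="not sorted"
fi
echo "column 3 is $status"

rm -f "$sorted_sample"
```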