Sort file by p-value

OK. Let's sort our expression file using sort:

$ sort E-MTAB-2754-analytics.tsv | head
ENSGALG00000000003	PANX2	0.100242375805959	-0.4
ENSGALG00000000011	C10orf88	0.0802046773105167	0.2
ENSGALG00000000038	CTRB2	NA	0.2
ENSGALG00000000044	WFIKKN1	NA	0
ENSGALG00000000048		0.288103121752422	0.4
ENSGALG00000000055	LAMTOR3	0.529728058895927	0.1
ENSGALG00000000059	TUBB3	0.228430079834946	-0.2
ENSGALG00000000067	SPR	0.0560358954256604	-0.4
ENSGALG00000000071		0.878861305389193	0
ENSGALG00000000081	IL4I1	NA	0

This isn't right. sort is comparing whole lines, so the order follows the first column.
Let's fix this by indicating we want to sort by the 3rd column (p-value).
And, look, we have 'NA' in the 3rd column (p-value column).
Are there any other unexpected values in the p-value column?

Page through the file with more to look for unexpected values:

$ more E-MTAB-2754-analytics.tsv
Gene ID Gene Name       g1_g2.p-value   g1_g2.log2foldchange
ENSGALG00000000003      PANX2   0.100242375805959       -0.4
ENSGALG00000000011      C10orf88        0.0802046773105167      0.2
ENSGALG00000000038      CTRB2   NA      0.2
ENSGALG00000000044      WFIKKN1 NA      0
ENSGALG00000000048              0.288103121752422       0.4
ENSGALG00000000055      LAMTOR3 0.529728058895927       0.1
ENSGALG00000000059      TUBB3   0.228430079834946       -0.2
ENSGALG00000000067      SPR     0.0560358954256604      -0.4
ENSGALG00000000071              0.878861305389193       0
ENSGALG00000000081      IL4I1   NA      0
ENSGALG00000000086      TIMM17B 0.80618323892789        0
ENSGALG00000000091      INTS4   0.981452026849213       0
ENSGALG00000000094      ADIPOR1 0.283972808349171       -0.1
ENSGALG00000000102      UBE2T   0.901824687993534       0
ENSGALG00000000104              0.28635973460605        -0.4
ENSGALG00000000106      LRIF1   0.777128323405445       0
ENSGALG00000000107      TRIM7.1 NA      0
ENSGALG00000000109      AFMID   0.598969080234545       0.2
ENSGALG00000000112      PLP1    NA      -0.1
ENSGALG00000000115      HEP21   NA      0
ENSGALG00000000120      BMP10   NA      0
ENSGALG00000000122      GNB2L1  1.60561843551779e-20    -0.6
ENSGALG00000000129      TCF25   0.756222862478979       0
ENSGALG00000000136      BLEC2   NA      0
ENSGALG00000000137      SNRPE   0.232542507231953       0.2
ENSGALG00000000141      BLB1    0.00790723600064685     0.9
ENSGALG00000000142      CEPT1   0.788385941146629       0
ENSGALG00000000146      SUPT6H  0.934343074557283       0
ENSGALG00000000150      RPL9    0.380362601236553       0.2
ENSGALG00000000151      ADAMTS19        NA      0
ENSGALG00000000154      DENND2D 0.527022644746131       0.2
ENSGALG00000000156      BRD2    3.43933719323364e-09    -0.5
ENSGALG00000000158      DMA     0.338457297207509       0.3
ENSGALG00000000161      ISOC1   0.401325361605509       0.2
ENSGALG00000000162      DMB1    NA      0
ENSGALG00000000164      MYBPHL  NA      0
ENSGALG00000000168      ADORA1  NA      0
ENSGALG00000000172      MYOG    NA      0
ENSGALG00000000184      SLC27A6 NA      0
ENSGALG00000000186      MTCH1   0.244657916348403       -0.3
ENSGALG00000000189      YTHDC2  0.818791595599391       0
ENSGALG00000000195      SLC23A2 0.00291000489916312     0.3
ENSGALG00000000201      PLEKHM1 0.104412345521551       0.2
ENSGALG00000000208      MCC     6.73362741791211e-11    -1.4
ENSGALG00000000209      PRNP    3.07006995025324e-18    0.8
ENSGALG00000000215      DCP2    0.000936131627143128    -0.4
ENSGALG00000000217      PPFIA4  0.676718383163428       0.1

Right away, I see we have scientific notation.
Also I notice we have some missing entries in the 2nd column.
Let's keep these points in mind.
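Paging through by eye only shows the first screenful. As a rough sketch (not part of the original walkthrough), the helper below tallies what kinds of values appear in the p-value column. The function name `classify_pvalues` and the four category labels are made up for illustration, and it assumes awk is available and that the first line is a header.

```shell
# Tally the kinds of values found in column 3 of a tab-separated file.
# Reads the file on stdin and skips the header line.
classify_pvalues() {
  awk -F'\t' 'NR > 1 {
    if ($3 == "NA") kind = "NA"
    else if ($3 ~ /^[0-9.]+[eE][-+]?[0-9]+$/) kind = "scientific"
    else if ($3 ~ /^[0-9]+(\.[0-9]+)?$/) kind = "decimal"
    else kind = "other"
    count[kind]++
  }
  END { for (k in count) print k "\t" count[k] }' | sort
}

# Usage (assumes the tutorial file is in the current directory):
# classify_pvalues < E-MTAB-2754-analytics.tsv
```

Anything reported as "other" would be worth a closer look before sorting.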

Let's sort by the third column (p-value):

$ sort -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000000295		0.807343900207163	-0.1
ENSGALG00000000407		0.416558476444968	-0.1
ENSGALG00000000497		0.77516911446794	-0.1
ENSGALG00000000629		0.657700041880819	-0.1
ENSGALG00000001110		0.632025804512114	-0.1
ENSGALG00000001304		0.718829473200678	-0.1
ENSGALG00000001533		0.90216379088787	-0.1
ENSGALG00000001620		NA	-0.1
ENSGALG00000001701		0.814909576526554	-0.1
ENSGALG00000001720		0.363542657094163	-0.1

Hmm. This doesn't look right either.
Looks like we are sorting by the 4th column, the log2foldchange column.
Why? Without a separator option, sort starts a new field wherever a run of whitespace begins, so the two tabs around an empty 2nd column count as a single break. Those rows look like they have only 3 columns, and their "3rd" column is log2foldchange.
Let's indicate that we have tab characters as a column separator.
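One way to see the collapsed column is to count fields. This is only an illustration: it uses awk, whose default whitespace splitting is similar to (not identical to) sort's, on two rows copied from the file.

```shell
# Two rows taken from the file; the second has an empty Gene Name (two tabs in a row).
rows=$(printf 'ENSGALG00000000011\tC10orf88\t0.0802046773105167\t0.2\nENSGALG00000000071\t\t0.878861305389193\t0\n')

# Default splitting: a run of whitespace is one separator, so the empty column vanishes.
printf '%s\n' "$rows" | awk '{ print NF }'          # prints 4, then 3

# Tab as the explicit separator: both rows keep their 4 columns.
printf '%s\n' "$rows" | awk -F'\t' '{ print NF }'   # prints 4, then 4
```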

Use sort and specify our column separator character:

$ sort -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000002286	H3F3B	0	-7
ENSGALG00000020078	H3F3C	0	-8
ENSGALG00000008094	HSPD1	0.000100309166748761	-0.3
ENSGALG00000042503		0.000100639846707172	-0.5
ENSGALG00000039013	EXT1	0.000101180056590984	-0.5
ENSGALG00000002162	ACO1	0.000101317230226459	-0.4
ENSGALG00000001024	TPRG1L	0.000101701679175278	-0.3
ENSGALG00000026757	DHFR	0.0001017195368379	-0.4
ENSGALG00000028501	DCK	0.000101840426398403	-0.3
ENSGALG00000003580	MMP2	0.000102357608503999	0.4

OK. That looks better, the 3rd column looks like it is being sorted.
Wait, we had p-values with scientific notation (3.43933719323364e-09). Shouldn't they come first?
Are we sorting by characters or by numbers? Characters! We should be sorting by numerical values!!
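To make the character-by-character behavior concrete, here is a small sketch on three p-values copied from the file. `LC_ALL=C` is my addition so the comparison is plain byte order regardless of your locale.

```shell
# Three p-values copied from the file, one per line.
pvals=$(printf '3.43933719323364e-09\n0.529728058895927\n0.000100309166748761\n')

# Without a numeric option, sort compares text one character at a time:
# "0" < "3", so both values starting with "0." land ahead of 3.4e-09.
char_sorted=$(printf '%s\n' "$pvals" | LC_ALL=C sort)
printf '%s\n' "$char_sorted"
```

The smallest value (3.4e-09) ends up last, which matches what we saw in the full file.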

Use sort and specify we want to sort the column 3 numerically:

$ sort -n -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000000038	CTRB2	NA	0.2
ENSGALG00000000044	WFIKKN1	NA	0
ENSGALG00000000081	IL4I1	NA	0
ENSGALG00000000107	TRIM7.1	NA	0
ENSGALG00000000112	PLP1	NA	-0.1
ENSGALG00000000115	HEP21	NA	0
ENSGALG00000000120	BMP10	NA	0
ENSGALG00000000136	BLEC2	NA	0
ENSGALG00000000151	ADAMTS19	NA	0
ENSGALG00000000162	DMB1	NA	0

Wait, what? Now we have 'NA' sorted to the top.
What about the other end of our list? Let's reverse the sort.

Reverse sort to take a quick look at the other end of our sorted file:

$ sort -r -n -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000041634	ACTG2	9.99765161055154e-06	0.5
ENSGALG00000003959	AP1M1	9.99364241038882e-05	-0.3
ENSGALG00000026019	ASCC1	9.99213530582087e-06	-0.4
ENSGALG00000029817	USP22	9.95241747734131e-18	0.6
ENSGALG00000007430	ARCN1	9.92710147646906e-11	0.5
ENSGALG00000035182	CMKLR1	9.9135182215432e-10	0.8
ENSGALG00000005617	NTHL1	9.9135182215432e-10	0.6
ENSGALG00000009112	LRRC57	9.88331240104096e-08	0.6
ENSGALG00000015446	POU2F1	9.8793969295065e-06	-0.5
ENSGALG00000016279	RAB23	9.86441714514607e-05	-0.6

Huh. Looks a bit better, but the values are ordered by their leading digits, not by their actual value in scientific notation. The -n option stops reading the number at the 'e'.
We need to sort by the actual value and not just the starting digits.
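A two-line sketch of the problem, using values copied from the outputs above (the `LC_ALL=C` prefix is my addition, so that "." is read as the decimal point):

```shell
# Two p-values copied from the file: roughly 0.00001 and roughly 0.9999.
pvals=$(printf '9.99765161055154e-06\n0.999851391175025\n')

# -n reads "9.99765161055154" and ignores the "e-06" suffix,
# so the tiny value is treated as roughly 10 and sorts last.
n_sorted=$(printf '%s\n' "$pvals" | LC_ALL=C sort -n)
printf '%s\n' "$n_sorted"
```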

Use sort with -g option (general-numeric-sort):

$ sort -g -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000000038	CTRB2	NA	0.2
ENSGALG00000000044	WFIKKN1	NA	0
ENSGALG00000000081	IL4I1	NA	0
ENSGALG00000000107	TRIM7.1	NA	0
ENSGALG00000000112	PLP1	NA	-0.1
ENSGALG00000000115	HEP21	NA	0
ENSGALG00000000120	BMP10	NA	0
ENSGALG00000000136	BLEC2	NA	0
ENSGALG00000000151	ADAMTS19	NA	0
ENSGALG00000000162	DMB1	NA	0

Alright, well, not sure if it worked because we have these 'NA' values.
Let's check out the other end of the file.
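The 'NA' rows at the top are expected with -g: GNU sort documents that lines which do not start with a number sort before all numbers. A quick sketch on three values from the file, assuming GNU sort (other implementations may order non-numeric lines differently):

```shell
# One NA and two p-values copied from the file.
pvals=$(printf '0.529728058895927\nNA\n3.43933719323364e-09\n')

# -g parses scientific notation; text that is not a number sorts first.
g_sorted=$(printf '%s\n' "$pvals" | LC_ALL=C sort -g)
printf '%s\n' "$g_sorted"
```

So the numbers may well be in the right order already, just hidden behind the 'NA' block.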

Use a reverse sort or do a tail of your original sort to check out the other end of the sorted file:

$ sort -r -g -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000035699	MARCHF6	0.999851391175025	0
ENSGALG00000021365	DCTN3	0.999719667557785	0
ENSGALG00000018264	gga-let-7d	0.999535070448575	0
ENSGALG00000041334	SNRPD1	0.999327008134205	0
ENSGALG00000007493	NSDHL	0.999327008134205	0
ENSGALG00000010980	STK31	0.999244795403493	0
ENSGALG00000033513	ELL	0.99911451358804	0
ENSGALG00000040018	NUP98	0.998960114063576	0
ENSGALG00000051668		0.998808098961675	0
ENSGALG00000048351		0.998647215307104	0

OK. This looks good so far. Let's get rid of the 'NA' values, then check out the top and bottom of our sort.
If you still have issues with your system's sort, check out this post.

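Use grep -v to remove lines that match a pattern. The pattern is written as $'\tNA\t' so the shell passes real tab characters to grep (GNU grep does not treat "\t" inside double quotes as a tab):

$ cat E-MTAB-2754-analytics.tsv | grep -v -e $'\tNA\t' | sort -g -t$'\t' -k3 | head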
Gene ID	Gene Name	g1_g2.p-value	g1_g2.log2foldchange
ENSGALG00000002286	H3F3B	0	-7
ENSGALG00000020078	H3F3C	0	-8
ENSGALG00000027353	C10orf71	1.29880461971355e-212	-3.7
ENSGALG00000030407	IL6R	3.8485270968074e-183	2.4
ENSGALG00000037669	FKBP9	3.58400895215323e-177	-3.6
ENSGALG00000031786	BLVRA	5.34950331436035e-177	2.2
ENSGALG00000023279	EPN3	3.12236315199182e-170	-2.9
ENSGALG00000011551	JCHAIN	6.47982639371527e-144	-1.7
ENSGALG00000002911	MOXD1	1.29318024850042e-142	-2.4

Looks better. Scientific values look appropriately sorted.
What about the other end of the list?
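The grep filter drops any line containing a tab-delimited NA, wherever it occurs. If you would rather test the p-value column specifically, one possible alternative is an awk filter. This is a sketch under the assumption that awk is available; `drop_na_pvalues` is a name I made up.

```shell
# Keep only rows whose third tab-separated column is not exactly "NA".
# Reads from stdin, writes to stdout; the header line passes through unchanged.
drop_na_pvalues() {
  awk -F'\t' '$3 != "NA"'
}

# Usage (assumes the tutorial file is in the current directory):
# drop_na_pvalues < E-MTAB-2754-analytics.tsv | sort -g -t "$(printf '\t')" -k3 | head
```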

Review the other end of the sorted file using a reverse sort or a tail of the original sort:

$ cat E-MTAB-2754-analytics.tsv | grep -v -e $'\tNA\t' | sort -r -g -t$'\t' -k3 | head
ENSGALG00000035699	MARCHF6	0.999851391175025	0
ENSGALG00000021365	DCTN3	0.999719667557785	0
ENSGALG00000018264	gga-let-7d	0.999535070448575	0
ENSGALG00000041334	SNRPD1	0.999327008134205	0
ENSGALG00000007493	NSDHL	0.999327008134205	0
ENSGALG00000010980	STK31	0.999244795403493	0
ENSGALG00000033513	ELL	0.99911451358804	0
ENSGALG00000040018	NUP98	0.998960114063576	0
ENSGALG00000051668		0.998808098961675	0
ENSGALG00000048351		0.998647215307104	0

OK!! I think we got our p-value sort!!

Let's keep this file:

$ cat E-MTAB-2754-analytics.tsv | grep -v -e $'\tNA\t' | sort -g -t$'\t' -k3 > pvalue_sorted.tsv

Let's double check the new file. Does the top look okay?

$ head pvalue_sorted.tsv
Gene ID	Gene Name	g1_g2.p-value	g1_g2.log2foldchange
ENSGALG00000002286	H3F3B	0	-7
ENSGALG00000020078	H3F3C	0	-8
ENSGALG00000027353	C10orf71	1.29880461971355e-212	-3.7
ENSGALG00000030407	IL6R	3.8485270968074e-183	2.4
ENSGALG00000037669	FKBP9	3.58400895215323e-177	-3.6
ENSGALG00000031786	BLVRA	5.34950331436035e-177	2.2
ENSGALG00000023279	EPN3	3.12236315199182e-170	-2.9
ENSGALG00000011551	JCHAIN	6.47982639371527e-144	-1.7
ENSGALG00000002911	MOXD1	1.29318024850042e-142	-2.4

GOOD!!
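Notice the header row is still on top. That happens because -g sorts non-numeric text first, not because sort knows about headers. If you prefer not to rely on that, a possible variant pins the header explicitly. The function name `sort_by_pvalue_keep_header` is hypothetical, and `-k3,3` (key limited to column 3) and `LC_ALL=C` are my choices rather than the commands used above.

```shell
# Print the header line as-is, then the remaining rows sorted by column 3.
# Rows whose p-value is NA are dropped. Takes the input file as its argument.
sort_by_pvalue_keep_header() {
  file=$1
  tab=$(printf '\t')
  head -n 1 "$file"
  tail -n +2 "$file" \
    | awk -F'\t' '$3 != "NA"' \
    | LC_ALL=C sort -g -t "$tab" -k3,3   # C locale so "." is the decimal point
}

# Usage (assumes the tutorial file is in the current directory):
# sort_by_pvalue_keep_header E-MTAB-2754-analytics.tsv > pvalue_sorted.tsv
```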

How about the bottom?

$ tail pvalue_sorted.tsv
ENSGALG00000048351		0.998647215307104	0
ENSGALG00000051668		0.998808098961675	0
ENSGALG00000040018	NUP98	0.998960114063576	0
ENSGALG00000033513	ELL	0.99911451358804	0
ENSGALG00000010980	STK31	0.999244795403493	0
ENSGALG00000007493	NSDHL	0.999327008134205	0
ENSGALG00000041334	SNRPD1	0.999327008134205	0
ENSGALG00000018264	gga-let-7d	0.999535070448575	0
ENSGALG00000021365	DCTN3	0.999719667557785	0
ENSGALG00000035699	MARCHF6	0.999851391175025	0

GOOD!!
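Checking head and tail only covers 20 lines. As an optional extra, sort can check the order of the whole file with -c. This is a sketch assuming GNU or BSD sort; `is_sorted_by_pvalue` is a made-up name, and -s is added so that rows with equal p-values are not compared any further.

```shell
# Exit status 0 if the data rows (everything after the header) are in
# general-numeric order on column 3; non-zero otherwise.
is_sorted_by_pvalue() {
  tab=$(printf '\t')
  tail -n +2 "$1" | LC_ALL=C sort -s -c -g -t "$tab" -k3,3
}

# Usage:
# is_sorted_by_pvalue pvalue_sorted.tsv && echo "sorted by p-value"
```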
