Sort file by p-value

OK. Let's sort our expression file using sort:

$ sort E-MTAB-2754-analytics.tsv | head
ENSGALG00000000003	PANX2	0.100242375805959	-0.4
ENSGALG00000000011	C10orf88	0.0802046773105167	0.2
ENSGALG00000000038	CTRB2	NA	0.2
ENSGALG00000000044	WFIKKN1	NA	0
ENSGALG00000000048		0.288103121752422	0.4
ENSGALG00000000055	LAMTOR3	0.529728058895927	0.1
ENSGALG00000000059	TUBB3	0.228430079834946	-0.2
ENSGALG00000000067	SPR	0.0560358954256604	-0.4
ENSGALG00000000071		0.878861305389193	0
ENSGALG00000000081	IL4I1	NA	0
  • This isn't right. sort is sorting by the first column.
    • Let's fix this by indicating that we want to sort by the 3rd column (p-value).
  • And, look, we have 'NA' in the 3rd column (the p-value column).
  • Are there any other unexpected values in the p-value column?
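
One quick way to answer that last question without reading the whole file is to pull out the third column and tally every value that does not start with a digit. The sketch below is only an illustration: it runs on five rows copied from the output above, written to a temporary file whose name is not part of the tutorial.

```shell
# Five rows copied from the output above (tab-separated), stored in a temp file.
sample=$(mktemp)
printf 'ENSGALG00000000003\tPANX2\t0.100242375805959\t-0.4\n'  > "$sample"
printf 'ENSGALG00000000038\tCTRB2\tNA\t0.2\n'                 >> "$sample"
printf 'ENSGALG00000000044\tWFIKKN1\tNA\t0\n'                 >> "$sample"
printf 'ENSGALG00000000048\t\t0.288103121752422\t0.4\n'       >> "$sample"
printf 'ENSGALG00000000081\tIL4I1\tNA\t0\n'                   >> "$sample"

# cut splits on tabs by default: keep column 3, drop values that begin with a digit,
# then count how often each leftover value occurs.
odd_values=$(cut -f3 "$sample" | grep -v '^[0-9]' | sort | uniq -c)
echo "$odd_values"
```

On the full file the header's g1_g2.p-value label would also show up in this tally, which is expected.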

Page through the file with more to check for unexpected values:

$ more E-MTAB-2754-analytics.tsv
Gene ID Gene Name       g1_g2.p-value   g1_g2.log2foldchange
ENSGALG00000000003      PANX2   0.100242375805959       -0.4
ENSGALG00000000011      C10orf88        0.0802046773105167      0.2
ENSGALG00000000038      CTRB2   NA      0.2
ENSGALG00000000044      WFIKKN1 NA      0
ENSGALG00000000048              0.288103121752422       0.4
ENSGALG00000000055      LAMTOR3 0.529728058895927       0.1
ENSGALG00000000059      TUBB3   0.228430079834946       -0.2
ENSGALG00000000067      SPR     0.0560358954256604      -0.4
ENSGALG00000000071              0.878861305389193       0
ENSGALG00000000081      IL4I1   NA      0
ENSGALG00000000086      TIMM17B 0.80618323892789        0
ENSGALG00000000091      INTS4   0.981452026849213       0
ENSGALG00000000094      ADIPOR1 0.283972808349171       -0.1
ENSGALG00000000102      UBE2T   0.901824687993534       0
ENSGALG00000000104              0.28635973460605        -0.4
ENSGALG00000000106      LRIF1   0.777128323405445       0
ENSGALG00000000107      TRIM7.1 NA      0
ENSGALG00000000109      AFMID   0.598969080234545       0.2
ENSGALG00000000112      PLP1    NA      -0.1
ENSGALG00000000115      HEP21   NA      0
ENSGALG00000000120      BMP10   NA      0
ENSGALG00000000122      GNB2L1  1.60561843551779e-20    -0.6
ENSGALG00000000129      TCF25   0.756222862478979       0
ENSGALG00000000136      BLEC2   NA      0
ENSGALG00000000137      SNRPE   0.232542507231953       0.2
ENSGALG00000000141      BLB1    0.00790723600064685     0.9
ENSGALG00000000142      CEPT1   0.788385941146629       0
ENSGALG00000000146      SUPT6H  0.934343074557283       0
ENSGALG00000000150      RPL9    0.380362601236553       0.2
ENSGALG00000000151      ADAMTS19        NA      0
ENSGALG00000000154      DENND2D 0.527022644746131       0.2
ENSGALG00000000156      BRD2    3.43933719323364e-09    -0.5
ENSGALG00000000158      DMA     0.338457297207509       0.3
ENSGALG00000000161      ISOC1   0.401325361605509       0.2
ENSGALG00000000162      DMB1    NA      0
ENSGALG00000000164      MYBPHL  NA      0
ENSGALG00000000168      ADORA1  NA      0
ENSGALG00000000172      MYOG    NA      0
ENSGALG00000000184      SLC27A6 NA      0
ENSGALG00000000186      MTCH1   0.244657916348403       -0.3
ENSGALG00000000189      YTHDC2  0.818791595599391       0
ENSGALG00000000195      SLC23A2 0.00291000489916312     0.3
ENSGALG00000000201      PLEKHM1 0.104412345521551       0.2
ENSGALG00000000208      MCC     6.73362741791211e-11    -1.4
ENSGALG00000000209      PRNP    3.07006995025324e-18    0.8
ENSGALG00000000215      DCP2    0.000936131627143128    -0.4
ENSGALG00000000217      PPFIA4  0.676718383163428       0.1
  • Right away, I see we have scientific notation.
  • Also I notice we have some missing entries in the 2nd column.
  • Let's keep these points in mind.
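
Both observations can be counted rather than spotted by eye. The sketch below assumes a small sample made of four rows from the listing above; on the real file you would point the same commands at E-MTAB-2754-analytics.tsv.

```shell
# Four rows copied from the listing above (tab-separated), stored in a temp file.
sample=$(mktemp)
printf 'ENSGALG00000000003\tPANX2\t0.100242375805959\t-0.4\n'      > "$sample"
printf 'ENSGALG00000000048\t\t0.288103121752422\t0.4\n'           >> "$sample"
printf 'ENSGALG00000000071\t\t0.878861305389193\t0\n'             >> "$sample"
printf 'ENSGALG00000000122\tGNB2L1\t1.60561843551779e-20\t-0.6\n' >> "$sample"

# Count p-values written in scientific notation (digit, 'e', optional sign, digit).
sci_count=$(cut -f3 "$sample" | grep -c '[0-9]e[-+]*[0-9]')

# Count rows whose 2nd tab-separated field (the gene name) is empty.
empty_names=$(awk -F'\t' '$2 == "" {n++} END {print n+0}' "$sample")

echo "scientific notation: $sci_count, empty gene names: $empty_names"
```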

Let's sort by the third column (p-value):

$ sort -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000000295		0.807343900207163	-0.1
ENSGALG00000000407		0.416558476444968	-0.1
ENSGALG00000000497		0.77516911446794	-0.1
ENSGALG00000000629		0.657700041880819	-0.1
ENSGALG00000001110		0.632025804512114	-0.1
ENSGALG00000001304		0.718829473200678	-0.1
ENSGALG00000001533		0.90216379088787	-0.1
ENSGALG00000001620		NA	-0.1
ENSGALG00000001701		0.814909576526554	-0.1
ENSGALG00000001720		0.363542657094163	-0.1
  • Hmm. This doesn't look right either.
  • Looks like we are sorting by the 4th column, the log2foldchange column.
  • Why?? Without -t, sort treats a run of blanks (spaces or tabs) as a single gap between columns, so a line with an empty 2nd column is read as having only 3 columns.
  • Let's indicate that we have tab characters as a column separator.
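
The effect of the separator can be reproduced on two made-up rows. This is a minimal sketch, not part of the original walkthrough: the IDs G1 and G2 and their values are invented, and TAB=$(printf '\t') is used as a portable spelling of bash's $'\t'.

```shell
TAB=$(printf '\t')   # portable equivalent of $'\t'
sample=$(mktemp)
# Two made-up rows: G2 has an empty 2nd column, G1 does not.
printf 'G2\t\t0.9\t-0.5\n'  > "$sample"
printf 'G1\tX\t0.1\t0.3\n' >> "$sample"

# Default separator: the two adjacent tabs in the G2 row do not create an empty field,
# so its "field 3" is -0.5 and that row sorts first.
default_first=$(LC_ALL=C sort -k3 "$sample" | head -n 1 | cut -f1)

# Explicit tab separator: field 3 is the p-value on both rows, so G1 (0.1) sorts first.
tab_first=$(LC_ALL=C sort -t "$TAB" -k3 "$sample" | head -n 1 | cut -f1)

echo "default: $default_first, tab: $tab_first"
```

LC_ALL=C is set only to make the character comparison predictable across systems.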

Use sort and specify our column separator character:

$ sort -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000002286	H3F3B	0	-7
ENSGALG00000020078	H3F3C	0	-8
ENSGALG00000008094	HSPD1	0.000100309166748761	-0.3
ENSGALG00000042503		0.000100639846707172	-0.5
ENSGALG00000039013	EXT1	0.000101180056590984	-0.5
ENSGALG00000002162	ACO1	0.000101317230226459	-0.4
ENSGALG00000001024	TPRG1L	0.000101701679175278	-0.3
ENSGALG00000026757	DHFR	0.0001017195368379	-0.4
ENSGALG00000028501	DCK	0.000101840426398403	-0.3
ENSGALG00000003580	MMP2	0.000102357608503999	0.4
  • OK. That looks better, the 3rd column looks like it is being sorted.
  • Wait, we had p-values with scientific notation (3.43933719323364e-09). Shouldn't they come first?
  • Are we sorting by characters or by numbers? Characters! We should be sorting by numerical values!!

Use sort and specify that we want to sort column 3 numerically:

$ sort -n -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000000038	CTRB2	NA	0.2
ENSGALG00000000044	WFIKKN1	NA	0
ENSGALG00000000081	IL4I1	NA	0
ENSGALG00000000107	TRIM7.1	NA	0
ENSGALG00000000112	PLP1	NA	-0.1
ENSGALG00000000115	HEP21	NA	0
ENSGALG00000000120	BMP10	NA	0
ENSGALG00000000136	BLEC2	NA	0
ENSGALG00000000151	ADAMTS19	NA	0
ENSGALG00000000162	DMB1	NA	0
  • Wait, What? Now we have 'NA' sorted to the top.
  • What about the other end of our list? Let's reverse the sort.

Reverse sort to take a quick look at the other end of our sorted file:

$ sort -r -n -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000041634	ACTG2	9.99765161055154e-06	0.5
ENSGALG00000003959	AP1M1	9.99364241038882e-05	-0.3
ENSGALG00000026019	ASCC1	9.99213530582087e-06	-0.4
ENSGALG00000029817	USP22	9.95241747734131e-18	0.6
ENSGALG00000007430	ARCN1	9.92710147646906e-11	0.5
ENSGALG00000035182	CMKLR1	9.9135182215432e-10	0.8
ENSGALG00000005617	NTHL1	9.9135182215432e-10	0.6
ENSGALG00000009112	LRRC57	9.88331240104096e-08	0.6
ENSGALG00000015446	POU2F1	9.8793969295065e-06	-0.5
ENSGALG00000016279	RAB23	9.86441714514607e-05	-0.6
  • Huh. Looks a bit better, but the values are sorted by their leading digits and not by the actual value in scientific notation.
  • We need to sort by the actual value and not just the starting digits.
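
To see the difference in isolation, the sketch below sorts three p-values from the listings above with -n and then with -g (a GNU/BSD sort option). It assumes a C locale so that the decimal point is read as ".".

```shell
# Three p-values taken from the listings above, one per line.
values=$(printf '0.529728058895927\n3.43933719323364e-09\n9.99765161055154e-06\n')

# -n reads only the leading decimal number, so 3.43...e-09 is compared as 3.43...
with_n=$(printf '%s\n' "$values" | LC_ALL=C sort -n | tr '\n' ' ')

# -g parses the whole floating-point value, exponent included.
with_g=$(printf '%s\n' "$values" | LC_ALL=C sort -g | tr '\n' ' ')

echo "sort -n: $with_n"
echo "sort -g: $with_g"
```

With -n the value 0.529... comes first; with -g it comes last, after both values written with an exponent.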

Use sort with -g option (general-numeric-sort):

$ sort -g -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000000038	CTRB2	NA	0.2
ENSGALG00000000044	WFIKKN1	NA	0
ENSGALG00000000081	IL4I1	NA	0
ENSGALG00000000107	TRIM7.1	NA	0
ENSGALG00000000112	PLP1	NA	-0.1
ENSGALG00000000115	HEP21	NA	0
ENSGALG00000000120	BMP10	NA	0
ENSGALG00000000136	BLEC2	NA	0
ENSGALG00000000151	ADAMTS19	NA	0
ENSGALG00000000162	DMB1	NA	0
  • Alright, well, not sure if it worked because we have these 'NA' values.
  • Let's check out the other end of the file
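
Why 'NA' lands at the top can be checked on a tiny input. The sketch below assumes GNU sort behavior: -n treats a field with no leading number as 0, while -g places values it cannot parse ahead of every number. The -1 is made up (it is not a real p-value) and is there only to show where 'NA' falls relative to zero.

```shell
# "NA" as in the file, plus 0.5 and a made-up -1 for illustration.
values=$(printf '0.5\nNA\n-1\n')

# -n: NA counts as 0, so it sorts between -1 and 0.5.
with_n=$(printf '%s\n' "$values" | LC_ALL=C sort -n | tr '\n' ' ')

# -g: NA cannot be converted to a number, so it sorts before all numbers.
with_g=$(printf '%s\n' "$values" | LC_ALL=C sort -g | tr '\n' ' ')

echo "sort -n: $with_n"
echo "sort -g: $with_g"
```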

Use a reverse sort, or take the tail of your original sort, to check out the other end of the sorted file:

$ sort -r -g -t$'\t' -k3 E-MTAB-2754-analytics.tsv | head
ENSGALG00000035699	MARCHF6	0.999851391175025	0
ENSGALG00000021365	DCTN3	0.999719667557785	0
ENSGALG00000018264	gga-let-7d	0.999535070448575	0
ENSGALG00000041334	SNRPD1	0.999327008134205	0
ENSGALG00000007493	NSDHL	0.999327008134205	0
ENSGALG00000010980	STK31	0.999244795403493	0
ENSGALG00000033513	ELL	0.99911451358804	0
ENSGALG00000040018	NUP98	0.998960114063576	0
ENSGALG00000051668		0.998808098961675	0
ENSGALG00000048351		0.998647215307104	0
  • OK. This looks good so far. Let's get rid of the 'NA' values then check out the top and bottom of our sort
  • If you still have issues with your system's sort, check out this post.

Use grep -v to remove lines that match a pattern:

$ cat E-MTAB-2754-analytics.tsv | grep -v -e $'\tNA\t' | sort -g -t$'\t' -k3 | head
Gene ID	Gene Name	g1_g2.p-value	g1_g2.log2foldchange
ENSGALG00000002286	H3F3B	0	-7
ENSGALG00000020078	H3F3C	0	-8
ENSGALG00000027353	C10orf71	1.29880461971355e-212	-3.7
ENSGALG00000030407	IL6R	3.8485270968074e-183	2.4
ENSGALG00000037669	FKBP9	3.58400895215323e-177	-3.6
ENSGALG00000031786	BLVRA	5.34950331436035e-177	2.2
ENSGALG00000023279	EPN3	3.12236315199182e-170	-2.9
ENSGALG00000011551	JCHAIN	6.47982639371527e-144	-1.7
ENSGALG00000002911	MOXD1	1.29318024850042e-142	-2.4
  • Looks better. Values in scientific notation look appropriately sorted.
  • The header line lands on top because -g places non-numeric values first.
  • What about the other end of the list?

Review the other end of the sorted file using a reverse sort or a tail of the original sort:

$ cat E-MTAB-2754-analytics.tsv | grep -v -e $'\tNA\t' | sort -r -g -t$'\t' -k3 | head
ENSGALG00000035699	MARCHF6	0.999851391175025	0
ENSGALG00000021365	DCTN3	0.999719667557785	0
ENSGALG00000018264	gga-let-7d	0.999535070448575	0
ENSGALG00000041334	SNRPD1	0.999327008134205	0
ENSGALG00000007493	NSDHL	0.999327008134205	0
ENSGALG00000010980	STK31	0.999244795403493	0
ENSGALG00000033513	ELL	0.99911451358804	0
ENSGALG00000040018	NUP98	0.998960114063576	0
ENSGALG00000051668		0.998808098961675	0
ENSGALG00000048351		0.998647215307104	0
  • OK!! I think we got our p-value sort!!
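
One caveat with the grep filter is that it drops any line containing tab-NA-tab, which would also remove a gene whose name in column 2 happened to be NA. A stricter alternative, sketched below as an assumption rather than the tutorial's method, tests column 3 directly with awk and limits the sort key to that column with -k3,3. The row with the ID MADE_UP_ID is hypothetical and exists only to show the difference.

```shell
TAB=$(printf '\t')   # portable equivalent of $'\t'
sample=$(mktemp)
printf 'ENSGALG00000000038\tCTRB2\tNA\t0.2\n'                    > "$sample"
printf 'ENSGALG00000000156\tBRD2\t3.43933719323364e-09\t-0.5\n' >> "$sample"
printf 'ENSGALG00000000003\tPANX2\t0.100242375805959\t-0.4\n'   >> "$sample"
printf 'MADE_UP_ID\tNA\t0.5\t0\n'                               >> "$sample"  # hypothetical gene named "NA"

# Keep rows whose 3rd field is not exactly "NA", sort on field 3 only,
# then list the gene names in sorted order.
kept=$(awk -F'\t' '$3 != "NA"' "$sample" | LC_ALL=C sort -g -t "$TAB" -k3,3 | cut -f2 | tr '\n' ' ')
echo "$kept"
```

The CTRB2 row is removed because its p-value is NA, while the hypothetical gene named NA is kept because its p-value is 0.5.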

Let's keep this file:

$ cat E-MTAB-2754-analytics.tsv | grep -v -e $'\tNA\t' | sort -g -t$'\t' -k3 > pvalue_sorted.tsv

Let's double check the new file. Does the top look okay?

$ head pvalue_sorted.tsv
Gene ID	Gene Name	g1_g2.p-value	g1_g2.log2foldchange
ENSGALG00000002286	H3F3B	0	-7
ENSGALG00000020078	H3F3C	0	-8
ENSGALG00000027353	C10orf71	1.29880461971355e-212	-3.7
ENSGALG00000030407	IL6R	3.8485270968074e-183	2.4
ENSGALG00000037669	FKBP9	3.58400895215323e-177	-3.6
ENSGALG00000031786	BLVRA	5.34950331436035e-177	2.2
ENSGALG00000023279	EPN3	3.12236315199182e-170	-2.9
ENSGALG00000011551	JCHAIN	6.47982639371527e-144	-1.7
ENSGALG00000002911	MOXD1	1.29318024850042e-142	-2.4
  • GOOD!!

How about the bottom?

$ tail pvalue_sorted.tsv
ENSGALG00000048351		0.998647215307104	0
ENSGALG00000051668		0.998808098961675	0
ENSGALG00000040018	NUP98	0.998960114063576	0
ENSGALG00000033513	ELL	0.99911451358804	0
ENSGALG00000010980	STK31	0.999244795403493	0
ENSGALG00000007493	NSDHL	0.999327008134205	0
ENSGALG00000041334	SNRPD1	0.999327008134205	0
ENSGALG00000018264	gga-let-7d	0.999535070448575	0
ENSGALG00000021365	DCTN3	0.999719667557785	0
ENSGALG00000035699	MARCHF6	0.999851391175025	0
  • GOOD!!
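
Looking at head and tail only checks the two ends. As a possible extra step, sort -c can confirm that every line in between is in order too. The sketch below assumes a four-line stand-in for pvalue_sorted.tsv built from rows shown above; it skips the header with tail -n +2 before checking.

```shell
TAB=$(printf '\t')   # portable equivalent of $'\t'
sorted=$(mktemp)
# Header plus three rows from the listings above, already in ascending p-value order.
printf 'Gene ID\tGene Name\tg1_g2.p-value\tg1_g2.log2foldchange\n'   > "$sorted"
printf 'ENSGALG00000002286\tH3F3B\t0\t-7\n'                          >> "$sorted"
printf 'ENSGALG00000027353\tC10orf71\t1.29880461971355e-212\t-3.7\n' >> "$sorted"
printf 'ENSGALG00000035699\tMARCHF6\t0.999851391175025\t0\n'         >> "$sorted"

# sort -c exits with status 0 only if its input is already sorted for the given key.
if tail -n +2 "$sorted" | LC_ALL=C sort -c -g -t "$TAB" -k3,3; then
  check="sorted"
else
  check="not sorted"
fi
echo "$check"
```

Rows with equal p-values are compared on the whole line, so it is safest to run the check with the same locale settings as the original sort.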