# Comparing qualitative data between individuals {#CompareQualData}

\index{Qualitative data!comparing \textit{between} individuals}

<!-- Introductions; easier to separate by format -->

```{r, child = if (knitr::is_html_output()) {'./introductions/15-CompareQual-HTML.Rmd'} else {'./introductions/15-CompareQual-LaTeX.Rmd'}}
```


<!-- Define colours as appropriate -->
```{r, child = if (knitr::is_html_output()) {'./children/coloursHTML.Rmd'} else {'./children/coloursLaTeX.Rmd'}}
```


## Introduction {#CompareQual-Intro}

Relational RQs compare groups.
This chapter considers how to compare *qualitative* variables in different groups.
Graphs are useful for this purpose, and a table of proportions, odds and odds ratios is usually produced as well.


## Two-way tables {#QualitativeTwoWaytables}
\index{Two-way tables}

When more than one qualitative variable is recorded for each individual, the data can be collated into a table.
When *two* qualitative variables are cross-tabulated, the resulting table is called a *two-way table*.\index{Two-way tables}
The categories for each variable should be *exhaustive*\index{Exhaustive} (cover all levels) and *mutually exclusive*\index{Mutually exclusive} (observations belong to one and only one level).
Usually, the levels of the explanatory variable are in the rows of the table.


::: {#SmallKidneyStones .example name="Two-way tables"}
To compare two treatments for kidney stones, @data:Charig:stones collected data from $700$\ UK patients on two qualitative variables:

* the treatment method ('A' or\ 'B'), the explanatory variable.
* the result of the procedure ('success' or\ 'failure'), the response variable.

Both variables are *qualitative* with two *levels*, and each treatment was used on $350$ patients.
Treatment\ A was used from 1972--1980, and Treatment\ B from 1980--1985; that is, treatments were *not randomly allocated*, and so *confounding* may be present.
For this reason, the researchers also recorded the *size* of the kidney stone ('small' or 'large') as one possible confounding variable.
Firstly, consider just the *small stones* [@julious1994confounding], displayed in the two-way table in Table\ \@ref(tab:KS-Small).
:::

(ref:KStonesNumbersSmall) *Counts* for two procedures with *small* kidney stones.

```{r KS-Small}
data(KStones)

KS.small <- xtabs( Counts ~ Method + Result,
                   data = subset(KStones, Size == "Small"))[, c(2, 1)]

KS.small2 <- cbind(KS.small,
                   "Total" = rowSums(KS.small))

KS.small2.full <- rbind( KS.small2,
                         "Total" = colSums(KS.small2) )

if( knitr::is_latex_output() ) {
  kable(pad(KS.small2.full,
            surroundMaths = TRUE,
            targetLength = c(3, 2, 3),
            decDigits = 0),
        format = "latex",
        longtable = FALSE,
        booktabs = TRUE,
        escape = FALSE,
        digits = 0,
        align = "c",
        col.names = c("Success",
                      "Failure",
                      "Total"),
        caption = "(ref:KStonesNumbersSmall)"
  ) %>%
    column_spec(1,
                bold = TRUE) %>%
    row_spec(0,
             bold = TRUE) %>%
    row_spec(3,
             bold = TRUE) %>%
    row_spec(2,
             hline_after = TRUE) %>%
    kable_styling(full_width = FALSE) %>%
    kable_styling(font_size = 8) %>%
    column_spec(column = 4,
                bold = TRUE)
}
if( knitr::is_html_output() ) {
  kable(pad(KS.small2.full,
            surroundMaths = TRUE,
            targetLength = c(3, 2, 3),
            decDigits = 0),
        format = "html",
        longtable = FALSE,
        booktabs = TRUE,
        digits = 0,
        align = "c",
        col.names = c("Success",
                      "Failure",
                      "Total"),
        caption = "(ref:KStonesNumbersSmall)"
  ) %>%
    kable_styling(full_width = FALSE) %>%
    column_spec(column = 4,
                bold = TRUE) %>%
    row_spec(3,
             bold = TRUE)
}
```
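
A table like Table\ \@ref(tab:KS-Small) can also be built directly from the four counts.
The following is a minimal sketch, assuming the counts are typed in by hand rather than read from the `KStones` data used above; the object names `counts` and `with_totals` are illustrative only.

```r
# Counts for the small kidney stones: rows are Methods, columns are Results
counts <- matrix(c( 81,  6,
                   234, 36),
                 nrow = 2,
                 byrow = TRUE,
                 dimnames = list(Method = c("A", "B"),
                                 Result = c("Success", "Failure")))

# Append row and column totals (labelled "Sum" by addmargins())
with_totals <- addmargins(as.table(counts))
with_totals
```

The explanatory variable (Method) is placed in the rows, matching the convention used in this chapter.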


## Summary tables by rows and columns {#RowPercentages}
\index{Two-way tables!summary by rows}\index{Two-way tables!summary by columns}

Each variable in a two-way table can be analysed separately, using percentages or proportions (Sect.\ \@ref(QualitativeProportionsPercentages)) or odds (Sect.\ \@ref(QualOdds)).
For example, the two variables in Table\ \@ref(tab:KS-Small) (Method; Result) can be analysed separately.
For overall results:

* the proportion of procedures that were successful is $315/357 = 0.882$ (or $88.2$%).
* the odds that a procedure was successful is $315/42 = 7.5$; that is, there were\ $7.5$ times as many successful procedures as unsuccessful procedures.

However, to *compare* Methods\ A and\ B, the proportions (or percentages) and odds of successful results need to be computed for each row separately.
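
Before moving to the row-by-row comparison, the overall figures above can be checked with a few lines of R.
This is only a sketch, assuming the column totals from Table\ \@ref(tab:KS-Small) are entered by hand; the names `totals`, `overall_prop` and `overall_odds` are illustrative.

```r
# Column totals from the small kidney-stone table
totals <- c(Success = 315, Failure = 42)

overall_prop <- totals[["Success"]] / sum(totals)          # proportion of all procedures
overall_odds <- totals[["Success"]] / totals[["Failure"]]  # successes per failure

round(overall_prop, 3)  # 0.882
overall_odds            # 7.5
```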


::: {#SummaryTableCompareQual .example name="Small kidney stones"}
The data in Table\ \@ref(tab:KS-Small) can be summarised by computing proportions or percentages by *row*.
Each row refers to a different method, so row percentages will compute success percentages for the two methods.

For the small kidney stones (Table\ \@ref(tab:KS-Small)), the *row percentages*
`r if (knitr::is_latex_output()) {
  '(Table\\ \\@ref(tab:KidneyRowColLATEX), left table)'
} else {
  '(Table\\ \\@ref(tab:KS-Small-rowPC))' }`
give the percentage of successes for each *Method*, since the rows represent the counts for Methods\ A and\ B.\xspace\index{Proportions}
*Row* proportions (or percentages) allow the proportions (or percentages) *within the rows* (i.e., for each Method) to be compared:

* with Method\ A, $81 \div 87 = 0.931$ (or\ $93.1$%) of operations in the sample were successful.
* with Method\ B, $234\div 270 = 0.867$ (or\ $86.7$%) of operations in the sample were successful.

For small kidney stones, Method\ A is slightly more successful\ ($93.1$%) than Method\ B\ ($86.7$%) in the *sample*.
These percentages are collated in\index{Percentages}
`r if (knitr::is_latex_output()) {
  'Table\\ \\@ref(tab:KidneyRowColLATEX) (left table).' } else
{
  'Table\\ \\@ref(tab:KS-Small-rowPC).'
}`

Odds can also be computed:\index{Odds}

* with Method\ A, the odds of success is $81\div6 = 13.5$; there are $13.5$ times as many successful procedures as failures for Method\ A.
* with Method\ B, the odds of success is $234\div36 = 6.5$; there are $6.5$ times as many successful procedures as failures for Method\ B.

The odds of a success is far greater for Method\ A than Method\ B in the sample.
:::
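
The row summaries in the example above can be sketched in R as follows.
This assumes the four counts are entered by hand, and the object names (`counts`, `row_prop`, `row_odds`) are illustrative rather than those used to build the tables in this chapter.

```r
# Counts for the small kidney stones: rows are Methods, columns are Results
counts <- matrix(c( 81,  6,
                   234, 36),
                 nrow = 2,
                 byrow = TRUE,
                 dimnames = list(Method = c("A", "B"),
                                 Result = c("Success", "Failure")))

# Row proportions: each row (Method) sums to 1
row_prop <- prop.table(counts, margin = 1)

# Odds of success within each row: successes divided by failures
row_odds <- counts[, "Success"] / counts[, "Failure"]

round(row_prop[, "Success"], 3)  # A: 0.931, B: 0.867
row_odds                         # A: 13.5,  B: 6.5
```

Using `margin = 1` asks `prop.table()` to work within rows; multiplying by $100$ gives row percentages.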


(ref:KStonesRowPercentSmall) *Row percentages* for two procedures with *small* kidney stones (from Table\ \@ref(tab:KS-Small)). Row *proportions* could also be used.

```{r KS-Small-rowPC}
KS.small.rowPC <- prop.table(KS.small,
                             margin = 1) * 100
KS.small.rowPC2 <- cbind(KS.small.rowPC,
                         "Total" = c(100, 100) )

if( knitr::is_html_output() ) {
  kable( pad(KS.small.rowPC2,
             surroundMaths = TRUE,
             targetLength = c(4, 3, 3),
             decDigits = c(1, 1, 1) ),
         format = "html",
         longtable = FALSE,
         booktabs = TRUE,
         escape = FALSE,
         digits = 1,
         align = c("r", "r", "r", "r"),
         col.names = c("Success",
                       "Failure",
                       "Total"),
         caption = "(ref:KStonesRowPercentSmall)") %>%
    kable_styling(full_width = FALSE) %>%
    column_spec(column = 4,
                bold = TRUE)
}
```


(ref:KStonesColPercentSmall) *Column percentages* for two procedures with *small* kidney stones (from Table\ \@ref(tab:KS-Small)). Column *proportions* could also be used.

```{r KS-Small-colPC}
KS.small.colPC <- prop.table(KS.small,
                             margin = 2) * 100
KS.small.colPC2 <- rbind(KS.small.colPC,
                         "Total" = c(100, 100) )


if( knitr::is_html_output() ) {
  kable(pad(KS.small.colPC2,
            surroundMaths = TRUE,
            targetLength = 3,
            decDigits = c(1, 1)),
        format = "html",
        longtable = FALSE,
        digits = 1,
        booktabs = TRUE,
        align = c("r", "r", "r"),
        col.names = c("Success",
                      "Failure"),
        caption = "(ref:KStonesColPercentSmall)") %>%
    kable_styling(full_width = FALSE) %>%
    row_spec(row = 3,
             bold = TRUE)
}
```


(ref:KStonesRowColPercentSmall) Two procedures with *small* kidney stones. Left: *row* percentages. Right: *column* percentages (from Table\ \@ref(tab:KS-Small)). Proportions could be used rather than percentages.

```{r}
KS.small.rowPC <- prop.table(KS.small,
                             margin = 1) * 100
KS.small.rowPC <- round(KS.small.rowPC, 1)
KS.small.rowPC2 <- cbind(KS.small.rowPC,
                         "Total" = c("$100.0$", "$100.0$") )

KS.small.colPC <- prop.table(KS.small,
                             margin = 2) * 100
# Round before adding the totals row, so the rounded values are the ones tabulated
KS.small.colPC <- round(KS.small.colPC, 1)
KS.small.colPC2 <- rbind(KS.small.colPC,
                         "Total" = c(100, 100) )

if( knitr::is_latex_output() ) {
  T1 <- kable(pad( KS.small.rowPC2,
                   surroundMaths = TRUE,
                   targetLength = c(4, 4, 5),
                   decDigits = c(1, 1, 1)),
              format = "latex",
              longtable = FALSE,
              valign = "t",
              booktabs = TRUE,
              escape = FALSE,
              digits = 1,
              align = "c",
              col.names = c("Success",
                            "Failure",
                            "Total")) %>%
    row_spec(0, bold = TRUE) #%>%    column_spec(1, bold = TRUE)

  T2 <- kable(pad(KS.small.colPC2,
                  surroundMaths = TRUE,
                  targetLength = 3,
                  decDigits = c(1, 1)),
              format = "latex",
              longtable = FALSE,
              valign = "t",
              digits = 1,
              booktabs = TRUE,
              escape = FALSE,
              align = c("r", "r", "r"),
              col.names = c("Success",
                            "Failure"))  %>%
    row_spec(0, bold = TRUE) %>%
    row_spec(2,
             hline_after = TRUE)

  out <- knitr::kables(list(T1, T2),
                       format = "latex",
                       label = "KidneyRowColLATEX",
                       caption = "(ref:KStonesRowColPercentSmall)") %>%
    kable_styling(font_size = 8) %>%
    column_spec(1, bold = TRUE)  # Causes an error if placed with each table...

  out2 <- prepareSideBySideTable(out,
                                 gap = "\\qquad")
  out2

}
```


Rather than comparing *methods* (in the rows), the procedure *results* can be compared (i.e., the columns).


::: {#KidneyStonesSmallColums .example name="Comparing by column"}
For the small kidney stones (Table\ \@ref(tab:KS-Small)), the *column percentages*
`r if (knitr::is_latex_output()) {
  '(Table\\ \\@ref(tab:KidneyRowColLATEX), right table)'
} else {
  '(Table\\ \\@ref(tab:KS-Small-colPC))'
}`
give the percentage of procedures from each method within each column (i.e., among the successes and among the failures), since the columns contain the procedure results.
*Column* percentages (or proportions) allow the percentages (or proportions) within *columns* to be compared:

* the proportion of the *successful* procedures from Method\ A is $81 \div 315 = 0.257$ (or\ $25.7$%).
* the proportion of the *failed* procedures from Method\ A is $6\div 42 = 0.143$ (or\ $14.3$%).

Odds can also be computed:

* the odds of a *success* coming from Method\ A is $81/234 = 0.346$; there are $0.346$\ times as many Method\ A procedures as Method\ B procedures among the successes.
* the odds of a *failure* coming from Method\ A is $6/36 = 0.167$; there are $0.167$\ times as many Method\ A procedures as Method\ B procedures among the failures.

The odds of a success being a Method\ A procedure is quite different from the odds of a failure being a Method\ A procedure.

Comparing rows (i.e., using row percentages and row odds) seems more intuitive than column proportions here: they compare the success percentages and odds for each method.
:::
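
The column summaries in the example above can be sketched in the same way, this time working within columns.
As before, the counts are assumed to be entered by hand and the object names (`counts`, `col_prop`, `col_odds`) are illustrative.

```r
# Counts for the small kidney stones: rows are Methods, columns are Results
counts <- matrix(c( 81,  6,
                   234, 36),
                 nrow = 2,
                 byrow = TRUE,
                 dimnames = list(Method = c("A", "B"),
                                 Result = c("Success", "Failure")))

# Column proportions: each column (Result) sums to 1
col_prop <- prop.table(counts, margin = 2)

# Odds that a procedure came from Method A, within each column
col_odds <- counts["A", ] / counts["B", ]

round(col_prop["A", ], 3)  # Success: 0.257, Failure: 0.143
round(col_odds, 3)         # Success: 0.346, Failure: 0.167
```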


## Graphs for the comparison {#QualitativeCompareGraphs}
\index{Qualitative data!comparing \textit{between} individuals!graphs}\index{Software output!graphs}

When a *qualitative* variable is compared across different groups (i.e., comparing between individuals), options for plotting include:

* *stacked bar charts* (Sect.\ \@ref(StackedBarCharts)).
* *side-by-side bar charts* (Sect.\ \@ref(SideBySideBarCharts)).
* *dot charts* (Sect.\ \@ref(TwoWayCountsDotCharts)).


### Stacked bar charts {#StackedBarCharts}
\index{Graphs!stacked bar chart}

The data can be graphed by using a bar for each level of one variable, and *stacking* the bars for the levels of the second variable.
Bars indicate the counts (or percentages) in each category.
The levels can be on the horizontal or vertical axis, but placing the level names on the vertical axis often makes for easier reading, and room for long labels.


::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The axis displaying the counts (or percentages) should *start from zero*, since the height of the bars visually implies the frequency of those observations (see Example\ \@ref(exm:VerticalTruncation)).
:::


::: {#BarStacked .example name="Stacked bar charts"}
For the small kidney-stone data in Example\ \@ref(exm:SmallKidneyStones), a stacked bar chart can be created by producing a bar for each method, and *stacking* the successes and failures for each method (Fig.\ \@ref(fig:QualGraphsStones), top left panel).

Rather than *numbers*, the *percentages* within each group can be plotted (Fig.\ \@ref(fig:QualGraphsStones), bottom left panel).
This makes comparing the *relative* proportions easier.
:::

```{r QualGraphsStones, fig.align="center", fig.cap="Six plots for the small kidney-stone data. Top plots: displaying the numbers for each method. Bottom plots: displaying the percentages for each method. Left: stacked bar chart. Centre: side-by-side bar charts. Right: dot charts.", out.width='100%', fig.width=6.25, fig.height=4.1}
par( mfrow = c(2, 3),
     mar = c(4.5, 4.0, 5, 0.2),
     xpd = TRUE)

# 1
barplot( t(KS.small),
         las = 1,
         main = "Procedure results:\nsmall kidney stones",
         ylab = "Number of patients",
         col = grey( c(0.2, 0.8) ) )
#         legend.text = TRUE,
#         args.legend = list(x = "topleft") )
legend(x = "top",
       ncol = 2,
       xpd = TRUE,
       bty = "n",
       inset = c(0.5, -0.3),
       legend = c("Success",
                  "Failure"),
       fill = grey( c(0.2, 0.8) )
)


# 2
barplot( t(KS.small),
         las = 1,
         ylim = c(0, 250),
         main = "Procedure results:\nsmall kidney stones",
         ylab = "Number of patients",
         col = grey( c(0.2, 0.8) ),  # match the legend fill colours
         beside = TRUE)
#legend.text = TRUE,
#args.legend = list(x = "topleft"))
legend(x = "top",
       ncol = 2,
       xpd = TRUE,
       bty = "n",
       inset = c(0.5, -0.3),
       legend = c("Success",
                  "Failure"),
       fill = grey( c(0.2, 0.8) )
)


# 3
dotchart(t (KS.small),
         main = "Procedure results:\nsmall kidney stones",
         xlab = "Number of patients",
         pt.cex = 0.9,
         lcol = rgb(1, 0, 0, alpha = 0) , # The lines extend past the end of the box...? Use transparent (alpha = 0)
         xlim = c(0, 250),
         pch = c(19, 4))
# Add the lines manually
lines( x = c(0, 250),
       y = c(1, 1),
       lwd = 1,
       col = "grey",
       lty = 2)
lines( x = c(0, 250),
       y = 2 * c(1, 1),
       lwd = 1,
       col = "grey",
       lty = 2)
lines( x = c(0, 250),
       y = 5 * c(1, 1),
       lwd = 1,
       col = "grey",
       lty = 2)
lines( x = c(0, 250),
       y = 6 * c(1, 1),
       lwd = 1,
       col = "grey",
       lty = 2)

# 4
barplot( prop.table( t(KS.small),
                     margin = 2 ) * 100,
         las = 1,
         main = "Procedure results:\nsmall kidney stones",
         ylab = "Percentage of patients",
         col = grey( c(0.2, 0.8) ) )  # match the legend fill colours
#         legend.text = TRUE,
#         args.legend = list(x = "right"))
legend(x = "top",
       ncol = 2,
       xpd = TRUE,
       bty = "n",
       inset = c(0.5, -0.3),
       legend = c("Success",
                  "Failure"),
       fill = grey( c(0.2, 0.8) )
)


# 5
barplot( prop.table( t(KS.small),
                     margin = 2 ) * 100,
         las = 1,
         ylim = c(0, 100),
         main = "Procedure results:\nsmall kidney stones",
         ylab = "Percentage of patients",
         col = grey( c(0.2, 0.8) ),  # match the legend fill colours
         beside = TRUE)
#         legend.text = TRUE,
#         args.legend = list(x = "left"))
legend(x = "top",
       ncol = 2,
       xpd = TRUE,
       bty = "n",
       inset = c(0.5, -0.3),
       legend = c("Success",
                  "Failure"),
       fill = grey( c(0.2, 0.8) )
)


# 6
dotchart(prop.table( t(KS.small),
                     margin = 2 ) * 100,
         main = "Procedure results:\nsmall kidney stones",
         xlab = "Percentage of patients",
         pt.cex = 0.9,
         lcol = rgb(1, 0, 0, alpha = 0) , # The lines extend past the end of the box...? Use transparent (alpha = 0)
         xlim = c(0, 100),
         pch = c(19, 4))
lines( x = c(0, 100),
       y = c(1, 1),
       lwd = 1,
       col = "grey",
       lty = 2)
lines( x = c(0, 100),
       y = 2 * c(1, 1),
       lwd = 1,
       col = "grey",
       lty = 2)
lines( x = c(0, 100),
       y = 5 * c(1, 1),
       lwd = 1,
       col = "grey",
       lty = 2)
lines( x = c(0, 100),
       y = 6 * c(1, 1),
       lwd = 1,
       col = "grey",
       lty = 2)
```

### Side-by-side bar charts {#SideBySideBarCharts}
\index{Graphs!side-by-side bar chart}

Instead of stacking the success and failures bars on top of each other, these bars can be placed *side-by-side* for each method.
Bars indicate the counts (or percentages) in each category.
The levels can be on the horizontal or vertical axis, but placing the level names on the vertical axis often makes for easier reading, and room for long labels.


::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The axis displaying the counts (or percentages) should *start from zero*, since the height of the bars visually implies the frequency of those observations (see Example\ \@ref(exm:VerticalTruncation)).
:::


::: {#BarSideBySide .example name="Side-by-side bar charts"}
For the small kidney-stone data in Example\ \@ref(exm:SmallKidneyStones), a side-by-side bar chart can be created by producing two bars for each method (one for failures; one for successes), and placing these side-by-side (Fig.\ \@ref(fig:QualGraphsStones), centre panels).
Again, numbers or percentages within each method can be graphed.
:::

### Dot charts {#TwoWayCountsDotCharts}
\index{Graphs!dot chart!comparing qualitative data}

Dots (or other symbols) can be used in place of the bars in a side-by-side bar chart to create a dot chart.

\clearpage

::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The axis displaying the counts (or percentages) should *start from zero*, since the distance of the dots from the axis visually implies the frequency of those observations (see Example\ \@ref(exm:VerticalTruncation)).
:::


::: {#BarSideBySide2 .example name="Dot charts"}
For the data in Example\ \@ref(exm:SmallKidneyStones), a dot chart can be created by placing plotting symbols for each result (one for failures; one for successes) side-by-side for each method (Fig.\ \@ref(fig:QualGraphsStones), right panels).
Again, numbers or percentages can be used.
:::

### Other variations {#OtherVariations}

Many variations of these charts are possible, by making different choices:

* using a stacked bar chart, side-by-side bar chart, or dot chart.
* using percentages or counts on one axis.
(The percentages can be percentages of the total, or within the total for each level of the variable, as in the bottom plots in Fig.\ \@ref(fig:QualGraphsStones).)
* using the counts (or percentage) on either the horizontal or vertical axis.
* choosing which variable is used as the first division of the data.

The guiding principle remains: *the purpose of a graph is to display the information in the clearest, simplest possible way, to facilitate understanding the message(s) in the data*.

Using a computer to create graphs is recommended, as it makes trying different variations easy when looking for the graph that best displays the message in the data.
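
As an illustration of trying variations, the three chart types can be drawn from the same table with base R graphics.
This is a minimal sketch rather than the code used for Fig.\ \@ref(fig:QualGraphsStones): the counts are assumed to be entered by hand, the object names are illustrative, and the titles and layout are arbitrary choices.

```r
# Counts for the small kidney stones: rows are Methods, columns are Results
counts <- matrix(c( 81,  6,
                   234, 36),
                 nrow = 2,
                 byrow = TRUE,
                 dimnames = list(Method = c("A", "B"),
                                 Result = c("Success", "Failure")))

# barplot() draws one bar (or group of bars) per column, so put Methods in columns
results_by_method <- t(counts)

# Percentages within each method (each column sums to 100)
pct <- prop.table(results_by_method, margin = 2) * 100

op <- par(mfrow = c(1, 3))      # three plots side by side

mids_stacked <- barplot(pct,    # stacked bar chart; axis starts at zero
                        las = 1,
                        ylim = c(0, 100),
                        ylab = "Percentage of patients",
                        main = "Stacked")

mids_beside <- barplot(pct,     # side-by-side bar chart
                       beside = TRUE,
                       las = 1,
                       ylim = c(0, 100),
                       ylab = "Percentage of patients",
                       legend.text = TRUE,
                       main = "Side-by-side")

dotchart(pct,                   # dot chart; axis starts at zero
         xlim = c(0, 100),
         xlab = "Percentage of patients",
         main = "Dot chart")

par(op)                         # restore the previous graphics settings
```

Replacing `pct` with `results_by_method` (and adjusting the axis limits) plots counts instead of percentages.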


## Numerical summary: difference between proportions {#DiffProportions}
\index{Difference between proportions}\index{Software output!comparing two proportions}\index{Summary table!comparing two proportions}\index{Summary table!comparing two odds}

The difference between the success-rates of the two methods for the small kidney-stone data (Table\ \@ref(tab:KS-Small)) can be summarised using the difference between the respective proportions:

- for *Method\ A*, the *sample* proportion of successful procedures is $\hat{p}_A = 0.931$.
- for *Method\ B*, the *sample* proportion of successful procedures is $\hat{p}_B = 0.867$.

The *difference* between these proportions is\ $\hat{p}_A - \hat{p}_B = 0.064$ (i.e., the success rate is higher for Method\ A).
The difference between the proportions is a *statistic*, and the (unknown) difference between the population proportions (i.e., $p_A - p_B$)  is a *parameter*.
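
The sample difference can be reproduced with a short calculation.
This sketch assumes the success counts and group sizes from Table\ \@ref(tab:KS-Small) are entered by hand; the names `successes`, `group_size`, `p_hat` and `diff_p` are illustrative.

```r
# Successes and group sizes for each method (small kidney stones)
successes  <- c(A = 81, B = 234)
group_size <- c(A = 87, B = 270)

p_hat  <- successes / group_size          # sample proportions of success
diff_p <- p_hat[["A"]] - p_hat[["B"]]     # Method A minus Method B

round(p_hat, 3)   # A: 0.931, B: 0.867
round(diff_p, 3)  # 0.064
```

The order of subtraction is a choice: here a positive value means a higher success rate for Method\ A.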


## Numerical summary: odds ratios {#OddsRatios}
\index{Odds ratio}

The small kidney-stone data (Table\ \@ref(tab:KS-Small)) also can be summarised using the odds of success for each method:

- for *Method\ A*, the odds of success are\ $13.5$ ($13.5$\ *times* as many successes as failures).
- for *Method\ B*, the odds of success are\ $6.5$ ($6.5$\ *times* as many successes as failures).

The odds of success for Method\ A and Method\ B are very different.
In the sample, the odds of success for Method\ A is about twice the odds of success for Method\ B.\spacex
In fact, in the sample, the odds of success for Method\ A is $13.5\div 6.5 = 2.08$ *times* the odds of a success for Method\ B.\spacex
This value is the *odds ratio* (OR).
The sample OR is a *statistic*, and the (unknown) population OR is a *parameter*.
There is no commonly-used symbol for odds ratios.


::: {style="float:right; width: 222px; border: 1px; padding:10px"}
<img src="Pics/iconmonstr-cursor-21-240.png" width="50px"/>
:::

::: {#OddsRatio .definition name="Odds Ratio (OR)"}
The *odds ratio* (often written OR) is the ratio of the odds of a result of interest in one group, compared to the odds of the *same* result in a *different* group:
$$
\text{Odds ratio (OR)} =
\frac{\text{Odds of a result in Group A}}
{\text{Odds of the same result in Group B}}.
$$
:::


::: {#InterpretingOdds .example name="Odds ratios"}
For the small kidney-stone data, the odds of a success for Method\ A is $81\div6 = 13.5$.
The odds of a success for Method\ B is $234\div 36 = 6.5$.
The OR is then computed as $13.5\div 6.5 = 2.08$.
The odds have been computed *within the rows*.

This means that the odds of a success for Method\ A is about\ $2.08$ times the odds of a success for Method\ B.
:::

Most software computes the OR from a two-way table by using the values in the *first* row and *first* column on the *top* of the fractions when computing the odds and the odds ratio.
In Example\ \@ref(exm:InterpretingOdds), for instance, the odds for both methods were computed with the Column\ 1 values on the top of the fraction ($81$ and\ $234$), and the OR comparing the *rows* was computed with the Row\ 1 odds ($13.5$) on top of the fraction.

However, the OR could also be computed using the odds within the columns (i.e., comparing the *columns*), rather than within the rows.


<!-- ::: {#ComputingOddsFractions .example name="Odds ratios"} -->
<!-- For the small kidney stone data, the odds of a *success* coming from Method\ A (i.e., Column\ 1) is $81/234 = 0.3462$. -->
<!-- Likewise, the odds of a *failure* (i.e., Column\ 2) coming from Method\ A is $6\div36 =  0.1667$. -->
<!-- The odds ratio is  $0.3462\div 0.1667 = 2.08$, as in Example\ \@ref(exm:InterpretingOdds). -->
<!-- This means that the odds of Method\ A producing a success is about\ $2.08$ times the odds of Method\ A producing a failure. -->

<!-- The two odds ratio calculations produce the same value. -->
<!-- The odds ratio can be interpreted in either way: as in this example or as in Example\ \@ref(exm:InterpretingOdds). -->
<!-- *Both interpretations are correct.* -->
<!-- ::: -->


::: {.softwareBox .software data-latex="{iconmonstr-laptop-4-240.png}"}
The OR can be interpreted in *either* of these ways (i.e., both are correct):\index{Odds ratio!interpreting}\index{Software output!comparing two odds (odds ratio)}

* the *odds* in each column compares Row\ 1 counts (top) to Row\ 2 counts (bottom).
  The *OR* then compares the Column\ 1 odds (top) to the Column\ 2 odds (bottom).
* the *odds* in each row compares Column\ 1 counts to Column\ 2 counts.
  The *OR* then compares the Row\ 1 odds to the Row\ 2 odds.

Odds and ORs are computed with the *first row* and *first column* values on the *top* of the fraction.
While both are correct, the levels of the explanatory variable are usually the rows of the table (as in Table\ \@ref(tab:KS-Small)), so usually the *second* interpretation makes more sense (as in Example\ \@ref(exm:InterpretingOdds)).
:::
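
For the small kidney-stone data, both ways of computing the OR give the same value, since each reduces to the same product of counts:
$$
\frac{81\div 6}{234\div 36}
= \frac{81\times 36}{6\times 234}
= \frac{81\div 234}{6\div 36}
\approx 2.08.
$$
The left-hand fraction compares the odds within the *rows* ($13.5\div 6.5$), and the right-hand fraction compares the odds within the *columns* ($0.346\div 0.167$).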


The OR compares the odds of the same result (e.g., success) in two groups (e.g., Method\ A and Method\ B).
This means a $2\times 2$ table can be summarised with one number: the OR.

When interpreting ORs:

* ORs *greater than*\ $1$ mean the odds of the result is *larger* for the group on top of the fraction compared to the group on the bottom.
* ORs *equal to*\ $1$ mean the odds of the result is the *same* for both groups (on the top and the bottom of the fraction).
* ORs *less than*\ $1$ mean the odds of the result is *smaller* for the group on the top of the fraction compared to the group on the bottom.
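
Which group is placed on top of the fraction determines whether the OR is greater than or less than\ $1$.
The following sketch, using the small kidney-stone counts entered by hand and illustrative object names, computes the OR both ways round.

```r
# Odds of success for each method (small kidney stones)
odds <- c(A = 81 / 6, B = 234 / 36)   # 13.5 and 6.5

or_A_vs_B <- odds[["A"]] / odds[["B"]]  # Method A on top of the fraction
or_B_vs_A <- odds[["B"]] / odds[["A"]]  # Method B on top of the fraction

round(or_A_vs_B, 2)  # 2.08: odds of success larger for Method A
round(or_B_vs_A, 2)  # 0.48: the same comparison, seen from Method B
```

The two values are reciprocals of each other, so they describe the same relationship.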

`r if (knitr::is_html_output()){   'The following short video may help explain some of these concepts:' }`

::: {style="text-align:center;"}
```{r}
htmltools::tags$video(src ="./videos/oddsratios.mp4",
                      width="550",
                      controls="controls",
                      loop="loop",
                      style="padding:5px; border: 2px solid gray;")
```
:::

<iframe src="https://learningapps.org/watch?v=pcyn538fj22" style="border:0px;width:100%;height:500px" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true">

</iframe>

The numerical summary information for comparing qualitative variables can be collated in a table.\index{Qualitative data!comparing \textit{between} individuals!summary tables}
The data should be summarised by one of the qualitative variables, producing proportions (or percentages) and odds for the other.
The summary table should also include the difference between the proportions (or percentages) and the odds ratio.

::: {#GorillaComparisonTable .example name="Numerical summary table"}
For the small kidney-stone data, the summary of the data can be tabulated as in Table\ \@ref(tab:KidneySmallSum), using percentages and odds.
:::
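
The entries of such a summary table can be computed in a few lines.
This is a simplified sketch, not the code used to produce the formatted table: the counts are assumed to be entered by hand, and the object and column names are illustrative.

```r
# Counts for the small kidney stones: rows are Methods, columns are Results
counts <- matrix(c( 81,  6,
                   234, 36),
                 nrow = 2,
                 byrow = TRUE,
                 dimnames = list(Method = c("A", "B"),
                                 Result = c("Success", "Failure")))

n            <- rowSums(counts)                             # sample size per method
pct_success  <- counts[, "Success"] / n * 100               # percentage success per method
odds_success <- counts[, "Success"] / counts[, "Failure"]   # odds of success per method

summary_tab <- data.frame(PercentSuccess = round(pct_success, 1),
                          OddsSuccess    = round(odds_success, 2),
                          SampleSize     = n)

# Comparisons between the two methods (Method A relative to Method B)
diff_pct   <- pct_success[["A"]] - pct_success[["B"]]    # difference, in percentage points
odds_ratio <- odds_success[["A"]] / odds_success[["B"]]  # odds ratio

summary_tab
round(diff_pct, 1)    # 6.4
round(odds_ratio, 2)  # 2.08
```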

```{r KidneySmallSum}
KidneySmall.Sum <- array( NA,
                          dim = c(3, 3))
rownames(KidneySmall.Sum) <- c("Method A",
                               "Method B",
                               "")

KidneySmall.Sum[1, 1] <- KS.small[1, 1] / sum(KS.small[1,]) * 100
KidneySmall.Sum[2, 1] <- KS.small[2, 1] / sum(KS.small[2,]) * 100

KidneySmall.Sum[1, 2] <- KS.small[1, 1] /KS.small[1, 2]
KidneySmall.Sum[2, 2] <- KS.small[2, 1] /KS.small[2, 2]

KidneySmall.Sum[3, 2] <- KidneySmall.Sum[1, 2] / KidneySmall.Sum[2, 2]
KidneySmall.Sum[3, 1] <- KidneySmall.Sum[1, 1] - KidneySmall.Sum[2, 1]

KidneySmall.Sum[1, 3] <- sum(KS.small[1, ])
KidneySmall.Sum[2, 3] <- sum(KS.small[2, ])

KidneySmall.Sum[, 1] <- round( KidneySmall.Sum[, 1], 1)
KidneySmall.Sum[, 2] <- round( KidneySmall.Sum[, 2], 3)
KidneySmall.Sum[, 3] <- round( KidneySmall.Sum[, 3], 0)


if( knitr::is_latex_output() ) {
  KidneySmall.Sum[3, 1] <- paste0("\\llap{Difference:\\ \\ }$\\phantom{0}",
                                  KidneySmall.Sum[3, 1],
                                  "$")
  KidneySmall.Sum[3, 2] <- paste0("\\llap{OR:\\ }$\\phantom{0}",
                                  round( as.numeric(KidneySmall.Sum[3, 2]), 2),
                                  "$")

  kable(pad(KidneySmall.Sum,
            surroundMaths = TRUE,
            targetLength = c(4, 5, 3),
            decDigits = c(1, 2, 0,
                          1, 2, 0,
                          1, 2, 0)),
        format = "latex",
        longtable = FALSE,
        booktabs = TRUE,
        escape = FALSE,
        align = "c",
        digits = c(1, 3),
        col.names = c("Percentage success",
                      "Odds of success",
                      "Sample size"),
        caption = "Numerical summary of the small kidney-stone data: odds and percentage of a successful procedure.") %>%
    row_spec(0, bold = TRUE) %>%
    column_spec(1, bold = TRUE) %>%
    kable_styling(font_size = 8) %>%
    row_spec(3, italic = TRUE) %>%
    row_spec(2, hline_after = TRUE)
}
if( knitr::is_html_output() ) {

  KidneySmall.Sum[3, 1] <- paste0("Difference: $\\phantom{0}",
                                  KidneySmall.Sum[3, 1],
                                  "$")
  KidneySmall.Sum[3, 2] <- paste0("OR: $\\phantom{0}",
                                  round( as.numeric(KidneySmall.Sum[3, 2]), 2),
                                  "$")

  kable(pad(KidneySmall.Sum,
            surroundMaths = TRUE,
            targetLength = c(4, 6, 3),
            decDigits = c(1, 3, 0)),
        format = "html",
        longtable = FALSE,
        booktabs = TRUE,
        align = c("r", "r", "r"),
        digits = c(1, 3),
        col.names =  c("Percentage success",
                       "Odds of success",
                       "Sample size"),
        caption = "Numerical summary of the small kidney-stone data: odds and percentage of a successful procedure.") %>%
    row_spec(3, bold = TRUE)
}
```

## Example: large kidney stones {#KidneyExample}

::: {style="float:right; width: 222px; border: 1px; padding:10px"}
<img src="Illustrations/surgery-3034133_640.jpg" width="200px"/>
:::

The data in Table\ \@ref(tab:KS-Small) are for procedures on *small* kidney stones.
Data were also recorded for the *large* kidney stones `r if (knitr::is_latex_output()) {    '(Table\\ \\@ref(tab:KStonesNumbersLargeAll), left table).' } else {    '(Table\\ \\@ref(tab:KS-Large)).' }`
As for small kidney stones, the *success proportions* can be computed for both methods:

* for *Method\ A*, the success proportion for *large* kidney stones: $192/263 = 0.730$.
* for *Method\ B*, the success proportion for *large* kidney stones: $55/80 = 0.688$.

For large kidney stones, then, *Method\ A* has a higher success proportion than Method\ B, just as with the small kidney stones.
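
The same comparison can be made using odds. As a worked check (the failure counts are found by subtraction: $263 - 192 = 71$ for Method\ A and $80 - 55 = 25$ for Method\ B):

$$
  \text{Odds (Method A)} = \frac{192}{71} \approx 2.70,
  \qquad
  \text{Odds (Method B)} = \frac{55}{25} = 2.20,
  \qquad
  \text{OR} = \frac{192/71}{55/25} \approx 1.23.
$$

The OR is greater than\ $1$, so the odds of success for large stones are larger for Method\ A than for Method\ B, in agreement with the success proportions.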

(ref:KStonesNumbersLarge) *Counts* for two procedures with *large* kidney stones.

```{r KS-Large}
KS.large <- xtabs( Counts ~ Method + Result,
                   data = subset(KStones, Size == "Large"))[, c(2, 1)]
KS.large2 <- cbind(KS.large,
                   "Total" = rowSums(KS.large))

if( knitr::is_html_output() ) {
  kable(pad(KS.large2,
            surroundMaths = TRUE,
            targetLength = c(3, 2, 3),
            decDigits = 0),
        format = "html",
        longtable = FALSE,
        booktabs = TRUE,
        digits = 0,
        align = "c",
        col.names = c("Success",
                      "Failure",
                      "Total"),
        caption = "(ref:KStonesNumbersLarge)") %>%
    column_spec(column = 4,
                bold = TRUE)
}
```

So, could the data for small (Table\ \@ref(tab:KS-Small)) and large kidney stones
`r if (knitr::is_latex_output()) {
'(Table\\ \\@ref(tab:KStonesNumbersLargeAll), left table)' } else
{
'(Table\\ \\@ref(tab:KS-Large))' }`
be combined, to produce a single two-way table of just Method and Result
`r if (knitr::is_latex_output()) {
'(Table\\ \\@ref(tab:KStonesNumbersLargeAll), right table)?' } else
{
'(Table\\ \\@ref(tab:KSAll))?' }`
From this table of small and large stones combined:

* for *Method\ A*, the success proportion for *all* kidney stones: $273/350 = 0.780$.
* for *Method\ B*, the success proportion for *all* kidney stones: $289/350 = 0.826$.

(ref:KStonesNumbersAll) Counts from two procedures, when small and large kidney stones are combined.

```{r KSAll}
KS.all <- KS.small + KS.large
KS.all2 <- cbind(KS.all,
                 "Total" = rowSums(KS.all))

if( knitr::is_html_output() ) {
  kable(pad(KS.all2,
            surroundMaths = TRUE,
            targetLength = c(3, 2, 3),
            decDigits = 0),
        format = "html",
        longtable = FALSE,
        booktabs = TRUE,
        digits = 0,
        align = "c",
        col.names = c("Success",
                      "Failure",
                      "Total"),
        caption = "(ref:KStonesNumbersAll)") %>%
    column_spec(column = 4,
                bold = TRUE)
}
```

(ref:KStonesNumbersLargeAll) The kidney stones data. Left: numbers for *large* stones only. Right: numbers for *all* kidney stones combined, without separating by the size of the kidney stone.

```{r}
if( knitr::is_latex_output() ) {

  # For some reason, col_spec(col = 0) does not work in T1 (works in T2 though...), so hack:
  rownames(KS.large2) <- c("\\textbf{Method A}",
                           "\\textbf{Method B}")
  T1 <- kable(pad(KS.large2,
                  surroundMaths = TRUE,
                  targetLength = c(3, 2, 3),
                  decDigits = 0),
              format  = "latex",
              longtable = FALSE,
              booktabs = TRUE,
              escape = FALSE,
              digits = 1,
              align = "c",
              col.names = c("Success",
                            "Failure",
                            "Total")) %>%
    row_spec(0, bold = TRUE)  %>%
    add_header_above( c("Large stones only" = 4) )

  T2 <- kable(pad(KS.all2,
                  surroundMaths = TRUE,
                  targetLength = c(3, 2, 3),
                  decDigits = 0),
              format = "latex",
              longtable = FALSE,
              digits = 1,
              booktabs = TRUE,
              escape = FALSE,
              align = "c",
              col.names = c("Success",
                            "Failure",
                            "Total"))  %>%
    row_spec(0, bold = TRUE)  %>%
    column_spec(1, bold = TRUE) %>%
    add_header_above( c("Large and small stones combined" = 4) )


  out <- knitr::kables(list(T1, T2),
                       format = "latex",
                       label = "KStonesNumbersLargeAll",
                       caption = "(ref:KStonesNumbersLargeAll)") %>%
    kable_styling(font_size = 8)
  out2 <- prepareSideBySideTable(out,
                                 gap = "\\qquad\\qquad")
  out2

}
```

When all kidney stones are combined, *Method\ A* has a *lower* success proportion than Method\ B.
To summarise:

* *Method\ A* is more successful for *small* stones ($0.931$ vs\ $0.867$).
* *Method\ A* is more successful for *large* stones ($0.730$ vs\ $0.688$).
* *Method\ B* is more successful for *all* stones combined ($0.826$ vs\ $0.780$).

That seems strange: Method\ A performs better for small *and* for large kidney stones, but Method\ B performs better when combining all kidney stones.
The explanation is that the *size of the stone* is a *confounding variable*\index{Variables!confounding} (Fig.\ \@ref(fig:SimpsonRulesStones)).
Size is associated with both the method (small stones are treated more often with Method\ B) *and* with the result (small stones have a higher success proportion for *both* methods).
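As a result, most of the Method\ B procedures ($270$ of\ $350$) were on small stones, for which success is more likely, whereas most of the Method\ A procedures ($263$ of\ $350$) were on large stones; this raises the combined success proportion for Method\ B relative to Method\ A.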
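
The reversal can be checked numerically. The following is a minimal sketch, not the code used elsewhere in this chapter: the object names are hypothetical, and the small-stone counts are those implied by subtracting the large-stone counts from the combined counts. The sketch shows that each combined success proportion is a *weighted average* of the proportions for small and large stones, where the weights are the share of each method's procedures performed on each size of stone.

```r
# Successes and number of procedures, by method and stone size.
# Object names are illustrative only.
success <- matrix(c( 81, 192,
                    234,  55),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Method = c("A", "B"),
                                  Size   = c("Small", "Large")))
total <- matrix(c( 87, 263,
                  270,  80),
                nrow = 2, byrow = TRUE,
                dimnames = dimnames(success))

# Success proportion for each method, within each stone size
prop_by_size <- success / total

# Share of each method's procedures performed on each stone size
size_share <- total / rowSums(total)

# Combined proportion: weighted average of the within-size proportions
prop_combined <- rowSums(size_share * prop_by_size)

round(prop_by_size, 3)   # Method A is higher in both columns
round(size_share, 3)     # Method B is used mostly on small stones
round(prop_combined, 3)  # Method B is higher overall
```

Because the two methods place very different weights on small and large stones, the combined proportions can be ordered differently from the proportions within each size.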

This confounding could have been avoided by randomly allocating a treatment method to patients.
However, random allocation was not possible in this observational study, so the researchers used a different method to manage confounding: *recording* the size of the kidney stones to use in the analysis (see Sect.\ \@ref(ManagingConfounding)).\index{Confounding}\index{Variables!lurking}\index{Variables!confounding}\index{Confounding!analysis}

In this example, incorporating information about a potential confounder (the size of the kidney stone) is important; otherwise, the wrong (opposite) conclusion is reached. Method\ B would be incorrectly considered better if the size of the stones was ignored, when the better method really is Method\ A.

This reversal is called\index{Simpson's paradox} `r if (knitr::is_latex_output()) {    "*Simpson's paradox*." } else {    "[*Simpson's paradox*](https://en.wikipedia.org/wiki/Simpson%27s_paradox)." }`
If the size of the kidney stone had not been recorded, size would be a *lurking variable*, and the incorrect conclusion would have been reached.


```{r SimpsonRulesStones, fig.cap="The size of the stones is associated with the success percentage and method.", fig.align="center", fig.height=2.15, fig.width=7, out.width='80%'}
source("R/showYInfluences.R")

showYInfluences(showConfounding = TRUE,
                ResponseName = "Success",
                ExplanatoryName = "Method",
                ExtraneousName = "Size",
                explanatoryBoxWidth = 1.1) # Enlarge explanatory box a bit
```


## Example: water access {#WaterAcessQualCompare}

@lopez2022farmers recorded data about access to water for three rural communities in Cameroon (see Sects.\ \@ref(WaterAccessQuant) and\ \@ref(WaterAccessQual)).
The study could be used to explore associations with the incidence of diarrhoea in young children ($85$\ households had children under\ $5$).
A cross-tabulation (Table\ \@ref(tab:WaterAcessQualCrosstab)) shows the relationship with keeping livestock; the numerical summary table (Table\ \@ref(tab:WaterAcessQualCompareTable)) may suggest a difference in the percentage of children with diarrhoea in households that do and do not keep livestock.
The comparison in Fig.\ \@ref(fig:WaterAccessQualCompare) includes some categories with small sample sizes, so the percentages shown may not be precise estimates\index{Precision} of the population values.

As usual, the data come from one of countless possible samples, but the RQ is about the population, so making a definitive decision about the population is difficult.


```{r WaterAcessQualCrosstab}
data(WaterAccess)

WAkids <- subset(WaterAccess, HouseholdUnder5s > 0 ) # 85 have kids
WAKidsTab <- xtabs( ~ HasLivestock + Diarrhea,
                    data = WAkids)
rownames(WAKidsTab) <- c("Household does not have livestock",
                         "Household has livestock")

if( knitr::is_latex_output() ) {
  kable(pad(WAKidsTab,
            surroundMaths = TRUE,
            targetLength = 2,
            decDigits = 0),
        format = "latex",
        longtable = FALSE,
        booktabs = TRUE,
        escape = FALSE,
        align = "c",
        digits = c(1, 3),
        col.names = c("in children",
                      "in children"),
        caption = "Cross-tabulation of having livestock in the household, and children under $5$ years of age having diarrhoea in the household in the last two weeks.") %>%
    row_spec(0, bold = TRUE) %>%
    column_spec(1, bold = TRUE) %>%
    kable_styling(font_size = 8) %>%
    add_header_above( c(" " = 1,
                        "No diarrhoea reported" = 1,
                        "Diarrhoea reported" = 1),
                      line = FALSE,
                      bold = TRUE)
}
if( knitr::is_html_output() ) {
  kable(pad(WAKidsTab,
            surroundMaths = TRUE,
            targetLength = 2,
            decDigits = 0),
        format = "html",
        longtable = FALSE,
        booktabs = TRUE,
        align = "c",
        digits = c(1, 3),
        col.names = c("No diarrhoea",
                      "Diarrhoea"),
        caption = "Cross-tabulation of having livestock in the household, and children under $5$ years of age having diarrhoea in the household in the last two weeks.")
}
```

```{r WaterAcessQualCompareTable}

WANumericalSummary <- array( dim = c(3, 3))

rownames(WANumericalSummary) <- c("Household does not have livestock",
                                  "Household has livestock",
                                  "")
colnames(WANumericalSummary) <- c("Percentage",
                                  "Odds",
                                  "Sample size")


WANumericalSummary[1:2, 1] <- prop.table(WAKidsTab, 1)[, 2] * 100
WANumericalSummary[1:2, 2] <- WAKidsTab[, 2] / WAKidsTab[, 1]
WANumericalSummary[1:2, 3] <- rowSums(WAKidsTab)

WANumericalSummary[3, 1] <- WANumericalSummary[1, 1] - WANumericalSummary[2, 1]
WANumericalSummary[3, 2] <- WANumericalSummary[1, 2] / WANumericalSummary[2, 2]

WANumericalSummary[, 1] <- round( WANumericalSummary[, 1], 1)
WANumericalSummary[, 2] <- round( WANumericalSummary[, 2], 3)
WANumericalSummary[, 3] <- round( WANumericalSummary[, 3], 0)

if( knitr::is_latex_output() ) {
  WANumericalSummary[3, 1] <- paste0("\\llap{Difference:\\ }$",
                                     WANumericalSummary[3, 1],
                                     "$")
  WANumericalSummary[3, 2] <- paste0("\\llap{OR:\\ }$",
                                     WANumericalSummary[3, 2],
                                     "$")

  kable(pad(WANumericalSummary,
            surroundMaths = TRUE,
            targetLength = c(5, 5, 2),
            decDigits = c(1, 3, 0)),
        format = "latex",
        longtable = FALSE,
        booktabs = TRUE,
        escape = FALSE,
        align = "c",