Commit 76a97b0

Added bootstrap p-value for regression coefficients
acp29 committed
1 parent c5985bb commit 76a97b0

5 files changed: 124 additions and 48 deletions

DESCRIPTION

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 name: statistics-bootstrap
-version: 5.2.5
-date: 2023-05-14
+version: 5.2.6
+date: 2023-05-18
 author: Andrew Penn <andy.c.penn@gmail.com>
 maintainer: Andrew Penn <andy.c.penn@gmail.com>
 title: A statistics package with a variety of bootstrap resampling tools

README.md

Lines changed: 4 additions & 5 deletions
@@ -46,18 +46,17 @@ To install (or test) the statistics-bootstrap package at its existing location
 
 * `boot` returns resampled data or indices created by balanced bootstrap or bootknife resampling
 * `bootknife` performs balanced bootknife resampling and calculates bootstrap bias, standard error and confidence intervals. The interval types supported are simple percentile, bias-corrected and accelerated, or calibrated percentile. This function supports iterated and stratified resampling.
-* `bootbayes` performs Bayesian nonparametric bootstrap and calculates posterior statistics for coefficients from a linear model. Two interval types are supported: shortest probability intervals and percentile intervals. See also `bootcoeff` and `bootemm`.
+* `bootbayes` performs Bayesian nonparametric bootstrap and calculates posterior statistics and frequentist *p*-values for the regression coefficients from a linear model. Two credible interval types are supported: shortest probability intervals and percentile intervals. The *p*-values are computed under the null hypothesis. See also `bootcoeff` and `bootemm`.
 * `bootnhst` calculates *p*-values by bootstrap null-hypothesis significance testing (two-tailed). This function can be used to compare 2 or more (independent) samples in designs with a one-way layout. This function resamples under the null hypothesis.
 * `bootmode` uses bootstrap to evaluate the likely number of real modes in a distribution
 * `bootci` is a function for calculating bootstrap confidence intervals. This function is a wrapper of the `bootknife` function but has the same usage as the `bootci` function from Matlab's Statistics and Machine Learning toolbox.
 * `bootstrp` is a function for calculating bootstrap statistics. This function is a wrapper of the `bootknife` function but has the same usage as the `bootstrp` function from Matlab's Statistics and Machine Learning toolbox.
-* `bootcoeff` (Octave only) is a function for calculating Bayesian nonparametric bootstrap credible intervals for the regression coefficients of a linear model fit using `anovan` or `fitlm`. This function uses `bootbayes`.
-* `bootemm` (Octave only) is a function for calculating Bayesian nonparametric bootstrap credible intervals for the estimated marginal means of a linear model fit using `anovan` or `fitlm`. This function uses `bootbayes`.
+* `bootcoeff` (Octave only) is a function for calculating Bayesian nonparametric bootstrap credible intervals and frequentist *p*-values for the regression coefficients of a linear model, which was fitted using `anovan` or `fitlm`. This function uses `bootbayes`.
+* `bootemm` (Octave only) is a function for calculating Bayesian nonparametric bootstrap credible intervals for the estimated marginal means of a linear model, which was fitted using `anovan` or `fitlm`. This function uses `bootbayes`.
 
 At the Octave/MATLAB command prompt, type `help function-name` for more information about the function and its input and output arguments. In Octave, you can also request demonstrations of function usage through examples by typing `demo function-name` at the command prompt.
 
 ## Development roadmap
 
-* We intend on including the following features in version 5.2.0: 1) an option in `cor` function to select Spearman's rank or Pearson's correlation coefficient; and 2) an additional function folder for fitting linear models and returning bootstrap confidence intervals for regression coefficients and estimated marginal means. The function have similar usage to `anovan`.
-* Make `bootcoeff` and `bootemm` compatible with Matlab.
+
 

inst/bootbayes.m

Lines changed: 74 additions & 15 deletions
@@ -15,13 +15,16 @@
 % [1,2]) distribution(s) is/are summarised with the following statistics
 % printed to the standard output:
 % • original: the mean of the data vector y
-% • bias: bootstrap estimate(s) of the bias
+% • bias: bootstrap bias estimate(s)
 % • median: the median of the posterior distribution(s)
 % • CI_lower: lower bound(s) of the 95% credible interval
 % • CI_upper: upper bound(s) of the 95% credible interval
+% • p-val: two-tailed p-value(s) for the parameter(s) being equal to 0
 % By default, the credible intervals are shortest probability intervals,
 % which represent a more computationally stable version of the highest
-% posterior density interval [3].
+% posterior density interval [3]. The p-value(s) is/are computed from
+% the Student-t (null) distribution(s) constructed from the posterior
+% statistics and heteroscedasticity-consistent standard errors [4,5].
 %
 % 'bootbayes (y, X)' also specifies the design matrix (X) for least squares
 % regression of y on X. X should be a column vector or matrix the same
@@ -64,11 +67,14 @@
 % 'bootbayes' results are reproducible.
 %
 % 'bootbayes (..., NBOOT, PROB, PRIOR, SEED, L)' multiplies the regression
-% coefficients by the hypothesis matrix L. If L is not provided or is empty,
-% it will assume the default value of 1.
+% coefficients by the hypothesis matrix L. If L is not provided or is empty,
+% it will assume the default value of 1. This functionality is usually used
+% to convert regression coefficients to estimated marginal means. NaN is
+% returned for p-values if a hypothesis matrix is provided.
 %
 % 'STATS = bootbayes (STATS, ...) returns a structure with the following
-% fields (defined above): original, bias, median, CI_lower, CI_upper.
+% fields (defined above): original, bias, median, CI_lower, CI_upper, tstat
+% and pval.
 %
 % '[STATS, BOOTSTAT] = bootbayes (STATS, ...) also returns a vector (or
 % matrix) of bootstrap statistics (BOOTSTAT) calculated over the bootstrap
@@ -80,8 +86,12 @@
 % Bootstrap Mean. Ann. Statist. 17(2):705-710
 % [3] Liu, Gelman & Zheng (2015). Simulation-efficient shortest probability
 % intervals. Statistics and Computing, 25(4), 809–819.
+% [4] Hall and Wilson (1991) Two Guidelines for Bootstrap Hypothesis Testing.
+% Biometrics, 47(2), 757-762
+% [5] Long and Ervin (2000) Using Heteroscedasticity Consistent Standard
+% Errors in the Linear Regression Model. Am. Stat, 54(3), 217-224
 %
-% bootbayes (version 2023.05.14)
+% bootbayes (version 2023.05.18)
 % Author: Andrew Charles Penn
 % https://www.researchgate.net/profile/Andrew_Penn/
 %
@@ -230,8 +240,11 @@
 % Create weighted least squares anonymous function
 bootfun = @(w) lmfit (X, y, diag (w), L);
 
-% Calculate estimate(s)
-original = bootfun (ones (n, 1));
+% Calculate estimate(s) and heteroscedasticity robust standard error(s) (HC1)
+S = bootfun (ones (n, 1));
+original = S.b;
+std_err = S.se;
+t = original ./ std_err;
 
 % Create weights by randomly sampling from a symmetric Dirichlet distribution.
 % This can be achieved by normalizing a set of randomly generated values from
@@ -243,8 +256,19 @@
 end
 W = bsxfun (@rdivide, r, sum (r));
 
-% Compute bootstat
-bootstat = cell2mat (cellfun (bootfun, num2cell (W, 1), 'UniformOutput', false));
+% Compute bootstrap statistics
+bootout = cell2mat (cellfun (bootfun, num2cell (W, 1), 'UniformOutput', false));
+bootstat = [bootout.b];
+bootse = [bootout.se];
+
+% Compute frequentist p-values following the guidelines described by
+% Hall and Wilson (1991) Biometrics, 47(2), 757-762
+if (all (isnan (t)))
+pval = t;
+else
+T = bsxfun (@minus, bootstat, original) ./ bootse; % Null distribution
+pval = sum (bsxfun(@gt, abs (T), abs (t)), 2) / nboot;
+end
 
 % Bootstrap bias estimation
 bias = mean (bootstat, 2) - original;
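
The studentized comparison added in this hunk can be checked by hand on a toy case. The sketch below is a standalone illustration, not package output: the values assigned to `original`, `std_err`, `bootstat` and `bootse` are made up. It centres the bootstrap statistics on the original estimate, as in the hunk above, and counts how often the centred, studentized values exceed the observed t statistic.

```octave
% Toy illustration of the p-value calculation (made-up numbers, one coefficient)
original = 2;                      % hypothetical coefficient estimate
std_err  = 0.5;                    % hypothetical HC1 standard error
t = original ./ std_err;           % observed statistic for H0: coefficient = 0

bootstat = [1.0, 2.5, 3.0, 4.5];   % hypothetical bootstrap coefficients
bootse   = [0.5, 0.5, 0.5, 0.5];   % hypothetical bootstrap standard errors
nboot = numel (bootstat);

% Centre on the original estimate so that T approximates the null distribution
T = bsxfun (@minus, bootstat, original) ./ bootse;

% Two-tailed p-value: proportion of |T| values that exceed |t|
pval = sum (bsxfun (@gt, abs (T), abs (t)), 2) / nboot;
fprintf ('p-value: %g\n', pval);
```

Here `T` is `[-2, 1, 2, 5]` and `t` is 4, so only one of the four bootstrap values is more extreme and the p-value is 0.25.
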
@@ -277,6 +301,8 @@
 stats.median = median (bootstat, 2);
 stats.CI_lower = ci(:, 1);
 stats.CI_upper = ci(:, 2);
+stats.tstat = t;
+stats.pval = pval;
 
 % Print output if no output arguments are requested
 if (nargout == 0)
@@ -289,14 +315,14 @@
 
 %% FUNCTION TO FIT THE LINEAR MODEL
 
-function b = lmfit (X, y, W, L)
+function S = lmfit (X, y, W, L)
 
 % Get model coefficients by solving the linear equation by matrix arithmetic
 % If optional argument W is provided, it should be a diagonal matrix of
 % weights or a positive definite covariance matrix
+n = numel (y);
 if (nargin < 3)
 % If no weights are provided, create an identity matrix
-n = numel (y);
 W = eye (n);
 end
 if (nargin < 4)
@@ -305,7 +331,28 @@
 end
 
 % Solve linear equation to minimize weighted least squares
-b = L * pinv (X' * W * X) * (X' * W * y);
+XW = X' * W;
+invG = pinv (XW * X); % calculate pseudoinverse of the Gram matrix
+b = L * (invG * (XW * y));
+
+% Calculate heteroscedasticity-consistent standard errors (HC1) for the
+% regression coefficients. The standard errors calculated here reproduce
+% the HC1 standard errors calculated in R using vcovHC from the sandwich
+% package.
+% Ref: Long and Ervin (2000) Am. Stat, 54(3), 217-224
+k = numel (b);
+if ((numel (L) == 1) && all (L == 1))
+yf = X * invG * (XW * y);
+r = y - yf;
+rw = W * r;
+se = sqrt (diag ((n / (n - k) * invG * X' * diag ((rw).^2) * X * invG)));
+else
+se = nan (k, 1);
+end
+
+% Prepare output
+S.b = b;
+S.se = se;
 
 end
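
As a rough sanity check on the HC1 expression in this hunk, the following standalone sketch (not part of the package; the data are made up) applies the same formula with unit weights to an intercept-only model. In that special case the HC1 standard error reduces to the usual standard error of the mean, `std (y) / sqrt (n)`.

```octave
% Standalone check of the HC1 expression (made-up data, unit weights)
y = [1; 2; 3; 4];
n = numel (y);
X = ones (n, 1);                % intercept-only design matrix
W = eye (n);                    % unit weights

XW = X' * W;
invG = pinv (XW * X);           % pseudoinverse of the Gram matrix
b = invG * (XW * y);            % least squares estimate (here, the mean of y)
k = numel (b);

r = y - X * b;                  % residuals
rw = W * r;                     % weighted residuals
se = sqrt (diag (n / (n - k) * invG * X' * diag (rw.^2) * X * invG));

fprintf ('HC1 standard error: %.4f\n', se);
fprintf ('std (y) / sqrt (n): %.4f\n', std (y) / sqrt (n));
```

Both lines should print the same value, since the `n / (n - k)` factor plays the role of the `n - 1` denominator in the sample variance when `k` is 1.
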

@@ -336,11 +383,23 @@ function print_output (stats, nboot, prob, prior, p, L)
 fprintf (' Credible interval: %.3g%%\n', mass);
 end
 end
+fprintf (' Null value (H0) used for hypothesis testing (p-values): 0 \n')
 fprintf ('\nPosterior Statistics: \n');
-fprintf (' original bias median CI_lower CI_upper\n');
+fprintf (' original bias median CI_lower CI_upper p-val\n');
 for j = 1:p
-fprintf (' %#-+12.6g %#-+12.6g %#-+12.6g %#-+12.6g %#-+12.6g \n',...
+if (stats.pval(j) <= 0.001)
+fprintf (' %#-+12.6g %#-+12.6g %#-+12.6g %#-+12.6g %#-+12.6g <.001 \n',...
+[stats.original(j), stats.bias(j), stats.median(j), stats.CI_lower(j), stats.CI_upper(j)]);
+elseif (stats.pval(j) < 0.9995)
+fprintf (' %#-+12.6g %#-+12.6g %#-+12.6g %#-+12.6g %#-+12.6g .%03u \n',...
+[stats.original(j), stats.bias(j), stats.median(j), stats.CI_lower(j), stats.CI_upper(j), round(stats.pval(j) * 1e+03)]);
+elseif (isnan (stats.pval(j)))
+fprintf (' %#-+12.6g %#-+12.6g %#-+12.6g %#-+12.6g %#-+12.6g NaN \n',...
 [stats.original(j), stats.bias(j), stats.median(j), stats.CI_lower(j), stats.CI_upper(j)]);
+else
+fprintf (' %#-+12.6g %#-+12.6g %#-+12.6g %#-+12.6g %#-+12.6g 1.000 \n',...
+[stats.original(j), stats.bias(j), stats.median(j), stats.CI_lower(j), stats.CI_upper(j)]);
+end
 end
 fprintf ('\n');
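
The four formatting branches in the last hunk can be exercised in isolation. The sketch below is a standalone illustration with made-up p-values; the `pvals` and `labels` variables are hypothetical and exist only here. It applies the same conditions in the same order, so a NaN falls through the first two comparisons (both are false for NaN) and is caught by the `isnan` branch.

```octave
% Standalone illustration of the p-value formatting branches (made-up values)
pvals = [0.0004, 0.0321, NaN, 0.9999];
labels = cell (size (pvals));
for j = 1:numel (pvals)
  if (pvals(j) <= 0.001)
    labels{j} = '<.001';                     % below the reporting threshold
  elseif (pvals(j) < 0.9995)
    % Three decimal places without the leading zero
    labels{j} = sprintf ('.%03u', round (pvals(j) * 1e+03));
  elseif (isnan (pvals(j)))
    labels{j} = 'NaN';                       % e.g. a hypothesis matrix was given
  else
    labels{j} = '1.000';                     % values that round up to 1
  end
  fprintf ('%-8g -> %s\n', pvals(j), labels{j});
end
```
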

inst/octave/bootcoeff.m

Lines changed: 21 additions & 14 deletions
@@ -12,13 +12,16 @@
 % Bayesian nonparametric bootstrap [1] to compute and return the following
 % statistics:
 % • original: the coefficient(s) from regressing y on X
-% • bias: bootstrap estimate(s) of the bias
+% • bias: bootstrap bias estimate(s)
 % • median: the median of the posterior distribution(s)
 % • CI_lower: lower bound(s) of the 95% credible interval
 % • CI_upper: upper bound(s) of the 95% credible interval
+% • p-val: two-tailed p-value(s) for the parameter(s) being equal to 0
 % By default, the credible intervals are shortest probability intervals,
 % which represent a more computationally stable version of the highest
-% posterior density interval [2].
+% posterior density interval [2]. The p-value(s) is/are computed from
+% the Student-t (null) distribution(s) constructed from the posterior
+% statistics and heteroscedasticity-consistent standard errors [3,4].
 %
 % 'bootcoeff (STATS, NBOOT)' specifies the number of bootstrap resamples,
 % where NBOOT must be a positive integer. If empty, the default value of
@@ -54,9 +57,9 @@
 % 'bootcoeff' results are reproducible.
 %
 % 'COEFF = bootcoeff (STATS, ...) returns a structure with the following
-% fields (defined above): original, bias, median, CI_lower, CI_upper.
-% These statistics summarise the posterior distributions of the coefficients
-% from the linear model.
+% fields (defined above): original, bias, median, CI_lower, CI_upper, tstat
+% and pval. These statistics summarise the posterior distributions of the
+% coefficients from the linear model.
 %
 % '[COEFF, BOOTSTAT] = bootcoeff (STATS, ...) also returns the bootstrap
 % statistics (i.e. posterior) for the coefficients.
@@ -69,9 +72,13 @@
 % Bibliography:
 % [1] Rubin (1981) The Bayesian Bootstrap. Ann. Statist. 9(1):130-134
 % [2] Liu, Gelman & Zheng (2015). Simulation-efficient shortest probability
-% intervals. Statistics and Computing, 25(4), 809–819.
+% intervals. Statistics and Computing, 25(4), 809–819.
+% [3] Hall and Wilson (1991) Two Guidelines for Bootstrap Hypothesis Testing.
+% Biometrics, 47(2), 757-762
+% [4] Long and Ervin (2000) Using Heteroscedasticity Consistent Standard
+% Errors in the Linear Regression Model. Am. Stat, 54(3), 217-224
 %
-% bootcoeff (version 2023.05.14)
+% bootcoeff (version 2023.05.18)
 % Author: Andrew Charles Penn
 % https://www.researchgate.net/profile/Andrew_Penn/
 %
@@ -105,13 +112,6 @@
 if (nargin < 4)
 prior = []; % Use default in bootbayes
 end
-if (nargin > 4)
-if (ISOCTAVE)
-randg ('seed', seed);
-else
-rng ('default');
-end
-end
 
 % Error checking
 info = ver;
@@ -123,6 +123,13 @@
 if ((~ any (statspackage)) || (str2double (info (statspackage).Version(1:3)) < 1.5))
 error ('bootcoeff: Requires version >= 1.5 of the statistics package')
 end
+if (nargin > 4)
+if (ISOCTAVE)
+randg ('seed', seed);
+else
+rng ('default');
+end
+end
 
 % Fetch required information from STATS structure
 X = full (STATS.X);

inst/octave/bootemm.m

Lines changed: 23 additions & 12 deletions
@@ -11,10 +11,11 @@
 % and Bayesian nonparametric bootstrap [1] to compute and return the
 % following statistics along the dimension DIM:
 % • original: the estimated marginal mean(s) from the regression of y on X
-% • bias: bootstrap estimate(s) of the bias
+% • bias: bootstrap bias estimate(s)
 % • median: the median of the posterior distribution(s)
 % • CI_lower: lower bound(s) of the 95% credible interval
 % • CI_upper: upper bound(s) of the 95% credible interval
+% • p-val: returns NaN for estimated marginal means
 % By default, the credible intervals are shortest probability intervals,
 % which represent a more computationally stable version of the highest
 % posterior density interval [2].
@@ -53,9 +54,9 @@
 % that 'bootemm' results are reproducible.
 %
 % 'EMM = bootemm (STATS, DIM, ...) returns a structure with the following
-% fields (defined above): original, bias, median, CI_lower, CI_upper.
-% These statistics summarise the posterior distributions of the estimated
-% marginal means of the linear model along the dimenions DIM.
+% fields (defined above): original, bias, median, CI_lower, CI_upper, tstat
+% and pval. These statistics summarise the posterior distributions of the
+% estimated marginal means of the linear model along the dimensions DIM.
 %
 % '[EMM, BOOTSTAT] = bootemm (STATS, DIM, ...) also returns the bootstrap
 % statistics for the estimated marginal means.
@@ -70,7 +71,7 @@
 % [2] Liu, Gelman & Zheng (2015). Simulation-efficient shortest probability
 % intervals. Statistics and Computing, 25(4), 809–819.
 %
-% bootemm (version 2023.05.14)
+% bootemm (version 2023.05.18)
 % Author: Andrew Charles Penn
 % https://www.researchgate.net/profile/Andrew_Penn/
 %
@@ -107,13 +108,6 @@
 if (nargin < 5)
 prior = []; % Use default in bootbayes
 end
-if (nargin > 5)
-if (ISOCTAVE)
-randg ('seed', seed);
-else
-rng ('default');
-end
-end
 
 % Error checking
 info = ver;
@@ -128,6 +122,23 @@
 if (ismember (dim, find (STATS.continuous)))
 error ('bootemm: Estimated marginal means are only calculated for categorical variables')
 end
+if (nargin > 5)
+if (ISOCTAVE)
+randg ('seed', seed);
+else
+rng ('default');
+end
+end
+N = numel (STATS.contrasts);
+for j = 1:N
+if (isnumeric (STATS.contrasts{j}))
+% Check that the columns sum to 0
+if (any (abs (sum (STATS.contrasts{j})) > eps("single")))
+error (strcat(["Use a STATS structure from a model refit with"], ...
+[" sum-to-zero contrast coding, e.g. ""simple"""]));
+end
+end
+end
 
 % Fetch required information from STATS structure
 X = full (STATS.X);
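
The new contrast check in `bootemm` tests whether each column of a numeric contrast matrix sums to zero within single-precision tolerance. The sketch below illustrates that test in isolation. The two matrices are made-up examples for a three-level factor, and `sums_to_zero` is a hypothetical helper name used only here.

```octave
% Sum-to-zero check on two made-up contrast matrices for a 3-level factor
simple    = [-1/3, -1/3; 2/3, -1/3; -1/3, 2/3];   % columns sum to 0
treatment = [0, 0; 1, 0; 0, 1];                   % columns sum to 1

% True when every column sums to zero within single-precision tolerance
sums_to_zero = @(C) ~ any (abs (sum (C)) > eps ('single'));

fprintf ('simple:    %d\n', sums_to_zero (simple));
fprintf ('treatment: %d\n', sums_to_zero (treatment));
```

Under this check the first matrix would be accepted, whereas the treatment-coded matrix would trigger the error added in the hunk above.
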
