Added bootstrap p-value for regression coefficients

acp29 · acp29 · commit 76a97b072dbe · 2023-05-18T17:01:23.000+01:00
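
For readers of this commit, the p-value calculation added to `bootbayes` and `lmfit` can be read as roughly the following standalone sketch. It is an illustration under stated assumptions, not the package code: the toy data, the variable names (`B`, `SE`, `meat`) and the use of `-log (rand (...))` to draw the exponential values that are normalised into Dirichlet weights are all made up here. It fits each weighted regression, computes HC1 standard errors, and then compares the studentized bootstrap statistics with the observed t-statistic, as the diff below does.

```octave
% Toy data (hypothetical): a linear trend with a deterministic perturbation
n = 20;
x = (1:n)';
X = [ones(n, 1), x];                 % design matrix with intercept
y = 2 + 0.5 * x + sin (x);
k = size (X, 2);
nboot = 1999;

% Column 1 holds unit weights (original fit); the remaining columns are
% Dirichlet(1,...,1) weights obtained by normalising unit exponential draws
r = -log (rand (n, nboot));
W = [ones(n, 1), bsxfun(@rdivide, r, sum (r))];

B  = zeros (k, nboot + 1);           % coefficients for each set of weights
SE = zeros (k, nboot + 1);           % HC1 standard errors for each set
for i = 1:(nboot + 1)
  w    = W(:, i);
  XW   = bsxfun (@times, X, w)';     % equivalent to X' * diag (w)
  invG = pinv (XW * X);              % pseudoinverse of the Gram matrix
  b    = invG * (XW * y);            % weighted least squares coefficients
  rw   = w .* (y - X * b);           % weighted residuals
  meat = X' * bsxfun (@times, X, rw.^2);
  V    = (n / (n - k)) * invG * meat * invG;   % HC1 sandwich estimator
  B(:, i)  = b;
  SE(:, i) = sqrt (diag (V));
end

% Observed statistics and studentized bootstrap null distribution
original = B(:, 1);
t = original ./ SE(:, 1);
T = bsxfun (@minus, B(:, 2:end), original) ./ SE(:, 2:end);

% Two-tailed p-values for H0: coefficient equals 0
pval = sum (bsxfun (@gt, abs (T), abs (t)), 2) / nboot;

fprintf ('%12s %12s %12s %8s\n', 'original', 'std_err', 'tstat', 'pval');
fprintf ('%12.4g %12.4g %12.4g %8.3f\n', [original, SE(:, 1), t, pval]');
```

Centring the bootstrap statistics on `original` rather than on 0 is what makes `T` an approximation to the null distribution. Because the weights are random and unseeded in this sketch, the printed p-values will vary slightly between runs.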
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 name: statistics-bootstrap
-version: 5.2.5
-date: 2023-05-14
+version: 5.2.6
+date: 2023-05-18
 author: Andrew Penn <andy.c.penn@gmail.com>
 maintainer: Andrew Penn <andy.c.penn@gmail.com>
 title: A statistics package with a variety of bootstrap resampling tools
diff --git a/README.md b/README.md
@@ -46,18 +46,17 @@ To install (or test) the statistics-bootstrap package at it's existing location
 
 * `boot` returns resamples data or indices created by balanced bootstrap or bootknife resampling  
 * `bootknife` performs balanced bootknife resampling and calculates bootstrap bias, standard error and confidence intervals. The interval types supported are simple percentile, bias-corrected and accelerated, or calibrated percentile. This function supports iterated and stratified resampling.
-* `bootbayes` performs Bayesian nonparametric bootstrap and calculates posterior statistics for coefficients from a linear model. Two interval types are supported: shortest probability intervals and percentile intervals. See also `bootcoeff` and `bootemm`.
+* `bootbayes` performs Bayesian nonparametric bootstrap and calculates posterior statistics and frequentist *p*-values for the regression coefficients from a linear model. Two credible interval types are supported: shortest probability intervals and percentile intervals. The *p*-values are computed under the null hypothesis. See also `bootcoeff` and `bootemm`.
 * `bootnhst` calculates *p*-values by bootstrap null-hypothesis significance testing (two-tailed). This function can be used to compare 2 or more (independent) samples in designs with a one-way layout. This function resamples under the null hypothesis.
 * `bootmode` uses bootstrap to evaluate the likely number of real modes in a distribution
 * `bootci` is a function for calculating bootstrap confidence intervals. This function is a wrapper of the `bootknife` function but has the same usage as the `bootci` function from Matlab's Statistics and Machine Learning toolbox.  
 * `bootstrp` is a function for calculating bootstrap statistics. This function is a wrapper of the `bootknife` function but has the same usage as the `bootstrp` function from Matlab's Statistics and Machine Learning toolbox.  
-* `bootcoeff` (Octave only) is a function for calculating Bayesian nonparametric bootstrap credible intervals for the regression coefficients of a linear model fit using `anovan` or `fitlm`. This function uses `bootbayes`.
-* `bootemm` (Octave only) is a function for calculating Bayesian nonparametric bootstrap credible intervals for the estimated marginal means of a linear model fit using `anovan` or `fitlm`. This function uses `bootbayes`.
+* `bootcoeff` (Octave only) is a function for calculating Bayesian nonparametric bootstrap credible intervals and frequentist *p*-values for the regression coefficients of a linear model, which was fitted using `anovan` or `fitlm`. This function uses `bootbayes`.
+* `bootemm` (Octave only) is a function for calculating Bayesian nonparametric bootstrap credible intervals for the estimated marginal means of a linear model, which was fitted using `anovan` or `fitlm`. This function uses `bootbayes`.
 
 At the Octave/MATLAB command prompt, type `help function-name` for more information about the function and it's input and output arguments. In Octave, you can also request demonstrations of function usage through examples by typing 'demo function-name` at the command prompt.
 
 ## Development roadmap
  
-* We intend on including the following features in version 5.2.0: 1) an option in `cor` function to select Spearman's rank or Pearson's correlation coefficient; and 2) an additional function folder for fitting linear models and returning bootstrap confidence intervals for regression coefficients and estimated marginal means. The function have similar usage to `anovan`.  
-* Make `bootcoeff` and `bootemm` compatible with Matlab.  
+ 
 
diff --git a/inst/bootbayes.m b/inst/bootbayes.m
@@ -15,13 +15,16 @@
 %     [1,2]) distribution(s) is/are summarised with the following statistics
 %     printed to the standard output:
 %        • original: the mean of the data vector y
-%        • bias: bootstrap estimate(s) of the bias
+%        • bias: bootstrap bias estimate(s)
 %        • median: the median of the posterior distribution(s)
 %        • CI_lower: lower bound(s) of the 95% credible interval
 %        • CI_upper: upper bound(s) of the 95% credible interval
+%        • p-val: two-tailed p-value(s) for the parameter(s) being equal to 0
 %          By default, the credible intervals are shortest probability intervals,
 %          which represent a more computationally stable version of the highest
-%          posterior density interval [3].
+%          posterior density interval [3]. The p-value(s) is/are computed from
+%          the studentized (null) distribution(s) constructed from the posterior
+%          statistics and heteroscedasticity-consistent standard errors [4,5].
 %
 %     'bootbayes (y, X)' also specifies the design matrix (X) for least squares
 %     regression of y on X. X should be a column vector or matrix the same
@@ -64,11 +67,14 @@
 %     'bootbayes' results are reproducible.
 %
 %     'bootbayes (..., NBOOT, PROB, PRIOR, SEED, L)' multiplies the regression
-%     coefficients by the hypothesis matrix L. If L is not provided or is empty,
-%     it will assume the default value of 1.
+%     coefficients by the hypothesis matrix L.  If L is not provided or is empty,
+%     it will assume the default value of 1. This functionality is usually used
+%     to convert regression coefficients to estimated marginal means. NaN is
+%     returned for the p-values if a hypothesis matrix is provided.
 %
 %     'STATS = bootbayes (STATS, ...) returns a structure with the following
-%     fields (defined above): original, bias, median, CI_lower, CI_upper.
+%     fields (defined above): original, bias, median, CI_lower, CI_upper, tstat
+%     and pval. 
 %
 %     '[STATS, BOOTSTAT] = bootbayes (STATS, ...)  also returns the a vector (or
 %     matrix) of bootstrap statistics (BOOTSTAT) calculated over the bootstrap
@@ -80,8 +86,12 @@
 %        Bootstrap Mean. Ann. Statist. 17(2):705-710
 %  [3] Liu, Gelman & Zheng (2015). Simulation-efficient shortest probability
 %        intervals. Statistics and Computing, 25(4), 809–819. 
+%  [4] Hall and Wilson (1991) Two Guidelines for Bootstrap Hypothesis Testing.
+%        Biometrics, 47(2), 757-762
+%  [5] Long and Ervin (2000) Using Heteroscedasticity Consistent Standard
+%        Errors in the Linear Regression Model. Am. Stat, 54(3), 217-224
 %
-%  bootbayes (version 2023.05.14)
+%  bootbayes (version 2023.05.18)
 %  Author: Andrew Charles Penn
 %  https://www.researchgate.net/profile/Andrew_Penn/
 %
@@ -230,8 +240,11 @@
   % Create weighted least squares anonymous function
   bootfun = @(w) lmfit (X, y, diag (w), L);
 
-  % Calculate estimate(s)
-  original = bootfun (ones (n, 1));
+  % Calculate estimate(s) and heteroscedasticity-robust standard error(s) (HC1)
+  S = bootfun (ones (n, 1));
+  original = S.b;
+  std_err = S.se;
+  t = original ./ std_err;
 
   % Create weights by randomly sampling from a symmetric Dirichlet distribution.
   % This can be achieved by normalizing a set of randomly generated values from
@@ -243,8 +256,19 @@
   end
   W = bsxfun (@rdivide, r, sum (r));
 
-  % Compute bootstat
-  bootstat = cell2mat (cellfun (bootfun, num2cell (W, 1), 'UniformOutput', false));
+  % Compute bootstrap statistics
+  bootout = cell2mat (cellfun (bootfun, num2cell (W, 1), 'UniformOutput', false));
+  bootstat = [bootout.b];
+  bootse = [bootout.se];
+
+  % Compute frequentist p-values following the guidelines described by 
+  % Hall and Wilson (1991) Biometrics, 47(2), 757-762
+  if (all (isnan (t)))
+    pval = t;
+  else
+    T = bsxfun (@minus, bootstat, original) ./ bootse; % Null distribution
+    pval = sum (bsxfun(@gt, abs (T), abs (t)), 2) / nboot;
+  end
 
   % Bootstrap bias estimation
   bias = mean (bootstat, 2) - original;
@@ -277,6 +301,8 @@
   stats.median = median (bootstat, 2);
   stats.CI_lower = ci(:, 1);
   stats.CI_upper = ci(:, 2);
+  stats.tstat = t;
+  stats.pval = pval;
 
   % Print output if no output arguments are requested
   if (nargout == 0) 
@@ -289,14 +315,14 @@
 
 %% FUNCTION TO FIT THE LINEAR MODEL
 
-function b = lmfit (X, y, W, L)
+function S = lmfit (X, y, W, L)
 
   % Get model coefficients by solving the linear equation by matrix arithmetic
   % If optional arument W is provided, it should be a diagonal matrix of
   % weights or a positive definite covariance matrix
+  n = numel (y);
   if (nargin < 3)
     % If no weights are provided, create an identity matrix
-    n = numel (y);
     W = eye (n);
   end
   if (nargin < 4)
@@ -305,7 +331,28 @@
   end
   
   % Solve linear equation to minimize weighted least squares
-  b = L * pinv (X' * W * X) * (X' * W * y);
+  XW = X' * W;
+  invG = pinv (XW * X); % calculate pseudoinverse of the Gram matrix
+  b = L * (invG * (XW * y));
+
+  % Calculate heteroscedasticity-consistent standard errors (HC1) for the
+  % regression coefficients. The standard errors calculated here reproduce
+  % the HC1 standard errors calculated in R using vcovHC from the sandwich
+  % package.
+  % Ref: Long and Ervin (2000) Am. Stat, 54(3), 217-224
+  k = numel (b);
+  if ((numel (L) == 1) && all (L == 1))
+    yf = X * invG * (XW * y);
+    r = y - yf;
+    rw = W * r;
+    se = sqrt (diag ((n / (n - k) * invG * X' * diag ((rw).^2) * X * invG)));
+  else
+    se = nan (k, 1);
+  end
+
+  % Prepare output
+  S.b = b;
+  S.se = se;
 
 end
 
@@ -336,11 +383,23 @@ function print_output (stats, nboot, prob, prior, p, L)
         fprintf (' Credible interval: %.3g%%\n', mass);
       end
     end
+    fprintf (' Null value (H0) used for hypothesis testing (p-values): 0 \n');
     fprintf ('\nPosterior Statistics: \n');
-    fprintf (' original       bias           median         CI_lower       CI_upper\n');
+    fprintf (' original       bias           median         CI_lower       CI_upper     p-val\n');
     for j = 1:p
-      fprintf (' %#-+12.6g   %#-+12.6g   %#-+12.6g   %#-+12.6g   %#-+12.6g \n',... 
+      if (stats.pval(j) <= 0.001)
+        fprintf (' %#-+12.6g   %#-+12.6g   %#-+12.6g   %#-+12.6g   %#-+12.6g <.001 \n',... 
+                 [stats.original(j), stats.bias(j), stats.median(j), stats.CI_lower(j), stats.CI_upper(j)]);
+      elseif (stats.pval(j) < 0.9995)
+        fprintf (' %#-+12.6g   %#-+12.6g   %#-+12.6g   %#-+12.6g   %#-+12.6g  .%03u \n',... 
+                 [stats.original(j), stats.bias(j), stats.median(j), stats.CI_lower(j), stats.CI_upper(j), round(stats.pval(j) * 1e+03)]);
+      elseif (isnan (stats.pval(j)))
+        fprintf (' %#-+12.6g   %#-+12.6g   %#-+12.6g   %#-+12.6g   %#-+12.6g   NaN \n',... 
                  [stats.original(j), stats.bias(j), stats.median(j), stats.CI_lower(j), stats.CI_upper(j)]);
+      else
+        fprintf (' %#-+12.6g   %#-+12.6g   %#-+12.6g   %#-+12.6g   %#-+12.6g 1.000 \n',... 
+                 [stats.original(j), stats.bias(j), stats.median(j), stats.CI_lower(j), stats.CI_upper(j)]);
+      end
     end
     fprintf ('\n');
 
diff --git a/inst/octave/bootcoeff.m b/inst/octave/bootcoeff.m
@@ -12,13 +12,16 @@
 %     Bayesian nonparametric bootstrap [1] to compute and return the following
 %     statistics:
 %        • original: the coefficient(s) from regressing y on X
-%        • bias: bootstrap estimate(s) of the bias
+%        • bias: bootstrap bias estimate(s)
 %        • median: the median of the posterior distribution(s)
 %        • CI_lower: lower bound(s) of the 95% credible interval
 %        • CI_upper: upper bound(s) of the 95% credible interval
+%        • p-val: two-tailed p-value(s) for the parameter(s) being equal to 0
 %          By default, the credible intervals are shortest probability intervals,
 %          which represent a more computationally stable version of the highest
-%          posterior density interval [2].
+%          posterior density interval [2]. The p-value(s) is/are computed from
+%          the studentized (null) distribution(s) constructed from the posterior
+%          statistics and heteroscedasticity-consistent standard errors [3,4].
 %
 %     'bootcoeff (STATS, NBOOT)' specifies the number of bootstrap resamples,
 %     where NBOOT must be a positive integer. If empty, the default value of
@@ -54,9 +57,9 @@
 %     'bootcoeff' results are reproducible.
 %
 %     'COEFF = bootcoeff (STATS, ...) returns a structure with the following
-%     fields (defined above): original, bias, median, CI_lower, CI_upper.
-%     These statistics summarise the posterior distributions of the coefficients
-%     from the linear model.
+%     fields (defined above): original, bias, median, CI_lower, CI_upper, tstat
+%     and pval. These statistics summarise the posterior distributions of the
+%     coefficients from the linear model.
 %
 %     '[COEFF, BOOTSTAT] = bootcoeff (STATS, ...) also returns the bootstrap
 %     statistics (i.e. posterior) for the coefficients.
@@ -69,9 +72,13 @@
 %  Bibliography:
 %  [1] Rubin (1981) The Bayesian Bootstrap. Ann. Statist. 9(1):130-134
 %  [2] Liu, Gelman & Zheng (2015). Simulation-efficient shortest probability
-%        intervals. Statistics and Computing, 25(4), 809–819. 
+%        intervals. Statistics and Computing, 25(4), 809–819.
+%  [3] Hall and Wilson (1991) Two Guidelines for Bootstrap Hypothesis Testing.
+%        Biometrics, 47(2), 757-762
+%  [4] Long and Ervin (2000) Using Heteroscedasticity Consistent Standard
+%        Errors in the Linear Regression Model. Am. Stat, 54(3), 217-224
 %
-%  bootcoeff (version 2023.05.14)
+%  bootcoeff (version 2023.05.18)
 %  Author: Andrew Charles Penn
 %  https://www.researchgate.net/profile/Andrew_Penn/
 %
@@ -105,13 +112,6 @@
   if (nargin < 4)
     prior = []; % Use default in bootbayes
   end
-  if (nargin > 4)
-    if (ISOCTAVE)
-      randg ('seed', seed);
-    else
-      rng ('default');
-    end
-  end
 
   % Error checking
   info = ver; 
@@ -123,6 +123,13 @@
   if ((~ any (statspackage)) || (str2double (info (statspackage).Version(1:3)) < 1.5))
     error ('bootcoeff: Requires version >= 1.5 of the statistics package')
   end
+  if (nargin > 4)
+    if (ISOCTAVE)
+      randg ('seed', seed);
+    else
+      rng ('default');
+    end
+  end
 
   % Fetch required information from STATS structure
   X = full (STATS.X);
diff --git a/inst/octave/bootemm.m b/inst/octave/bootemm.m
@@ -11,10 +11,11 @@
 %     and Bayesian nonparametric bootstrap [1] to compute and return the
 %     following statistics along the dimension DIM:
 %        • original: the estimated marginal mean(s) from the regression of y on X
-%        • bias: bootstrap estimate(s) of the bias
+%        • bias: bootstrap bias estimate(s)
 %        • median: the median of the posterior distribution(s)
 %        • CI_lower: lower bound(s) of the 95% credible interval
 %        • CI_upper: upper bound(s) of the 95% credible interval
+%        • p-val: returns NaN for estimated marginal means
 %          By default, the credible intervals are shortest probability intervals,
 %          which represent a more computationally stable version of the highest
 %          posterior density interval [2].
@@ -53,9 +54,9 @@
 %     that 'bootemm' results are reproducible.
 %
 %     'EMM = bootemm (STATS, DIM, ...) returns a structure with the following
-%     fields (defined above): original, bias, median, CI_lower, CI_upper.
-%     These statistics summarise the posterior distributions of the estimated
-%     marginal means of the linear model along the dimenions DIM.
+%     fields (defined above): original, bias, median, CI_lower, CI_upper, tstat
+%     and pval. These statistics summarise the posterior distributions of the
+%     estimated marginal means of the linear model along the dimension DIM.
 %
 %     '[EMM, BOOTSTAT] = bootemm (STATS, DIM, ...) also returns the bootstrap
 %     statistics for the estimated marginal means.
@@ -70,7 +71,7 @@
 %  [2] Liu, Gelman & Zheng (2015). Simulation-efficient shortest probability
 %        intervals. Statistics and Computing, 25(4), 809–819. 
 %
-%  bootemm (version 2023.05.14)
+%  bootemm (version 2023.05.18)
 %  Author: Andrew Charles Penn
 %  https://www.researchgate.net/profile/Andrew_Penn/
 %
@@ -107,13 +108,6 @@
   if (nargin < 5)
     prior = []; % Use default in bootbayes
   end
-  if (nargin > 5)
-    if (ISOCTAVE)
-      randg ('seed', seed);
-    else
-      rng ('default');
-    end
-  end
 
   % Error checking
   info = ver; 
@@ -128,6 +122,23 @@
   if (ismember (dim, find (STATS.continuous)))
     error ('bootemm: Estimated marginal means are only calculated for categorical variables')
   end
+  if (nargin > 5)
+    if (ISOCTAVE)
+      randg ('seed', seed);
+    else
+      rng ('default');
+    end
+  end
+  N = numel (STATS.contrasts);
+  for j = 1:N
+    if (isnumeric (STATS.contrasts{j}))
+      % Check that the columns sum to 0
+      if (any (abs (sum (STATS.contrasts{j})) > eps("single")))
+         error (strcat(["Use a STATS structure from a model refit with"], ...
+                       [" sum-to-zero contrast coding, e.g. ""simple"""]));
+      end
+    end
+  end
 
   % Fetch required information from STATS structure
   X = full (STATS.X);