New appendices on limitations of ANOVA/OLS thinking and Wilkinson-Roger notation #84
Open: palday wants to merge 10 commits into main from pa/ols
8e736f6 add appendix on identities vs definitions (palday)
20c19b7 wilkinson notation appendix (palday)
cacdbba quarto (palday)
d6a0f9c don't resave that optsum (palday)
57378c5 typos (palday)
0785a44 Update traditional_concepts.qmd (kliegl)
8109eab wip (palday)
13d7182 Merge branch 'main' of github.com:JuliaMixedModels/EmbraceUncertainty… (palday)
ab7584b Merge branch 'main' of github.com:JuliaMixedModels/EmbraceUncertainty… (palday)
257111f Merge branch 'main' of github.com:JuliaMixedModels/EmbraceUncertainty… (palday)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| --- | ||
| engine: julia | ||
| --- | ||
|
|
||
| # A note on identities and concepts from "classical" statistics {#sec-traditional-concepts} | ||
|
|
||
| One common source of misunderstanding about mixed-effects models seems to be the way in which linear regression and analysis of variance are taught. | ||
| In particular, many *identities* hold for ordinary least squares "fixed-effects" regression that are then taken as *definitions* of the relevant quantities. | ||
|
|
||
| For example, in simple regression `y ~ x`[^wilkinson], the coefficient of determination is equal to the square of the Pearson correlation coefficient between the response `y` and the predictor `x`. | ||
| This *identity* is reinforced by the usual notation for the respective quantities, i.e. $R^2$ and $r$. | ||
| However, it is important to note that these quantities are **not** formally defined in terms of each other.[^wikipedia-definition] | ||
| Instead, the Pearson correlation coefficient is formally defined as $$\frac{\text{cov}(X,Y)}{\sigma_X\sigma_Y}$$ i.e. the standardized covariance between two random variables. | ||
| The coefficient of determination is usually defined in terms of "total sum of squares" and "residual sum of squares" $$ 1 - \frac{SS_\text{residual}}{SS_\text{total}} $$ but even this definition again brings us to another set of identities being used as definitions. | ||
|
|
||
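The identity between $R^2$ and $r^2$ can be checked numerically. The following is a minimal sketch using only the Julia standard library; the toy data and all variable names are invented for illustration. It computes each quantity from its own definition and then shows that the identity no longer holds once the intercept is dropped from the regression.

```julia
using LinearAlgebra, Statistics

# Hypothetical toy data, invented purely for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.7]

# Pearson correlation from its definition: the standardized covariance
r = cov(x, y) / (std(x) * std(y))

# Coefficient of determination from its definition, for a given design matrix
function rsquared(X, y)
    yhat = X * (X \ y)                      # least-squares fitted values
    ss_residual = sum(abs2, y .- yhat)
    ss_total = sum(abs2, y .- mean(y))
    return 1 - ss_residual / ss_total
end

R2 = rsquared([ones(length(x)) x], y)       # y ~ 1 + x
R2_noint = rsquared(reshape(x, :, 1), y)    # y ~ 0 + x

@show r^2 R2 R2_noint
```

With the intercept, `R2` and `r^2` agree up to floating-point error; without it, the two numbers differ, even though neither definition has changed.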
| [^wilkinson]: Throughout this appendix, we use the Wilkinson-Rogers notation where convenient instead of the full mathematical notation. | ||
|
|
||
| [^wikipedia-definition]: Unfortunately, many popular sources, including [Wikipedia](https://web.archive.org/web/20250330013023/https://en.wikipedia.org/wiki/Coefficient_of_determination) at the time of writing, confuse this matter with statements such as | ||
|
|
||
| > There are several definitions of $R^2$ that are only sometimes equivalent. | ||
|
|
||
| There may be multiple possible definitions, but the "equivalences" are better thought of as *identities* that hold under certain conditions. | ||
|
|
||
| In the frequentist framework, we often use *maximum likelihood estimation* to fit a model to our data, such that the parameter estimates maximize the likelihood of the assumed statistical model. | ||
| For classical linear regression, this is equivalent to minimizing the sum of squared residuals, which is why the technique is often called "ordinary least squares". | ||
| However, this is again an *identity* and not a *definition*. | ||
| The likelihood is defined without using sums of squares, but it follows from the definition of the normal distribution that minimizing the squared error (i.e. the sum of squared residuals) will yield the maximum likelihood estimate. | ||
| In the classical ANOVA framework, this fact is then used to partition the total sum of squares into two components, the explained or model sum of squares and the residual sum of squares, whose sum is equal to the total.[^pythagoras] | ||
| The mixed-effects model extends the classical ANOVA framework by allowing further partitioning of the variance, which means that this simple identity quickly breaks down. | ||
| For this reason, many properties assumed within the classical framework break down for mixed-effects models. | ||
| Even concepts such as *the* fully saturated model, which is often used to define other quantities, quickly become difficult to define. | ||
| Note that we wrote **the** fully saturated model: there must be precisely one fully saturated model for many of these other definitions to hold -- such as the definition of total sum of squares -- and without a unique saturated model, we simply cannot define a single value. | ||
|
|
||
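The equivalence between maximum likelihood and least squares for classical linear regression can be made explicit. As a sketch, assume $n$ independent observations with normally distributed errors of constant variance $\sigma^2$, and write $x_i^\top$ for the $i$-th row of the design matrix (this notation is ours). The log-likelihood is then

$$ \ell(\beta, \sigma^2) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top\beta\right)^2 $$

For any fixed $\sigma^2 > 0$, the first term does not involve $\beta$ and the second is a negative multiple of the residual sum of squares, so maximizing $\ell$ over $\beta$ is the same as minimizing $\sum_{i}(y_i - x_i^\top\beta)^2$. The sum of squares appears as a consequence of the normal density, not as part of the definition of the likelihood.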
| [^pythagoras]: The particular geometry of these sums was used by Fisher to simplify certain computations in the days before computers. By construction, the residuals are *orthogonal* to the fitted values, which means that the residual sum of squares and the model sum of squares correspond to the squared lengths of the two shorter sides of a right triangle, with the total sum of squares being the squared length of the hypotenuse. This geometric interpretation is very useful, but also quickly becomes quite complicated when we consider further partitions of the variance contained within the model. | ||
|
|
||
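The orthogonality described in the footnote, and the resulting decomposition of the total sum of squares, can be verified directly for an ordinary least squares fit with an intercept. This is a minimal sketch using only the Julia standard library, with toy data and variable names invented for illustration.

```julia
using LinearAlgebra, Statistics

# Hypothetical toy data, invented purely for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.7]

X = [ones(length(x)) x]          # design matrix for y ~ 1 + x
yhat = X * (X \ y)               # least-squares fitted values
resid = y .- yhat

# The three sums of squares of the classical ANOVA decomposition
ss_model = sum(abs2, yhat .- mean(y))
ss_residual = sum(abs2, resid)
ss_total = sum(abs2, y .- mean(y))

# Residuals are orthogonal to the (centered) fitted values ...
orthogonality = dot(resid, yhat .- mean(y))

# ... which gives the Pythagorean identity for the sums of squares
@show orthogonality
@show ss_model + ss_residual ss_total
```

The inner product is zero up to floating-point error, and the model and residual sums of squares add up to the total. Both facts depend on the least-squares fit including an intercept; they are properties of this particular estimator, not definitions.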
| We have often commented in other fora (various mailing lists, help sites and GitHub issues) about the challenges of finding definitions of classical quantities that still hold onto all their original properties. Douglas Bates's [mailing list response](https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html) is a valuable read that highlights how even something as simple as defining the denominator degrees of freedom is challenging in the mixed models framework. | ||
| It is important to note here that many of these hard-to-define quantities are most useful as input to other "simple" formulae based on the asymptotic behavior of the classical linear regression model (such as convergence to an $F$-distribution). | ||
| Unfortunately, it is unclear whether that same asymptotic behavior holds in the general case of the mixed-effects model. | ||
| While the asymptotic behavior largely seems to hold in the idealized case of perfect balance and full nesting, it is not at all clear whether it does in the messiness of real world data, where balance, nesting and crossing are rarely perfect, as we have attempted to make clear throughout this book. | ||
|
|
||
| While this appendix may read as a pessimistic take on cherished concepts, we call out one point of optimism. | ||
| Much of the historical work around finding and applying the identities and properties of various quantities for classical ANOVA and linear regression stems from a time when datasets were comparatively small and "computer" referred to a person employed purely to perform calculations by hand. | ||
| With modern computation -- both hardware and software -- other methods are available to us. | ||
| For example, bootstrapping and profiling provide methods for computing confidence intervals, which are far more informative than $p$-values anyway. | ||
|
|
||
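To illustrate the idea of simulation-based intervals, here is a minimal sketch of a parametric bootstrap for the slope of an ordinary regression, using only the Julia standard library. It is a simplified analog and not the mixed-model procedure itself; for mixed models, MixedModels.jl provides `parametricbootstrap`. The simulated data, the seed, the number of replicates, and all variable names are assumptions made for this example.

```julia
using LinearAlgebra, Random, Statistics

rng = MersenneTwister(42)

# Simulated data, invented purely for illustration
x = collect(1.0:20.0)
y = 0.5 .+ 0.9 .* x .+ randn(rng, length(x))

# Fit y ~ 1 + x by least squares and estimate the residual standard deviation
X = [ones(length(x)) x]
β = X \ y
resid = y .- X * β
σ = sqrt(sum(abs2, resid) / (length(y) - size(X, 2)))

# Parametric bootstrap: simulate new responses from the fitted model, then refit
slopes = map(1:2000) do _
    ystar = X * β .+ σ .* randn(rng, length(y))
    (X \ ystar)[2]
end

# Percentile interval for the slope from the bootstrap distribution
lower, upper = quantile(slopes, [0.025, 0.975])
@show β[2] lower upper
```

The interval comes from refitting the model to simulated data rather than from a closed-form reference distribution, so no degrees of freedom need to be defined.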
| There is a model underlying all classical statistical tests, the general linear model, and more often than not |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| # Wilkinson-Rogers (1973) notation for models of (co)variance {#sec:wilkinson} | ||
|
|
||
| ## General rules | ||
|
|
||
| - "Addition" (`+`) indicates additive, i.e., main effects: `a + b` indicates main effects of `a` and `b`. | ||
| - "Multiplication" (`*`) indicates crossing: main effects and interactions between two terms: `a * b` indicates main effects of `a` and `b` as well as their interaction. | ||
| - Usual algebraic rules apply (associativity and distributivity): | ||
| - `(a + b) * c` is equivalent to `a * c + b * c` | ||
| - `a * b * c` corresponds to main effects of `a`, `b`, and `c`, as well as all three two-way interactions and the three-way interaction. | ||
| - Categorical terms are expanded into the associated indicators/contrast variables. | ||
| - Tilde (`~`) is used to separate response from predictors. | ||
| - The intercept is indicated by `1`. | ||
| - `y ~ 1 + (a + b) * c` is read as: | ||
| - The response variable is `y`. | ||
| - The model contains an intercept. | ||
| - The model contains main effects of `a`, `b`, and `c`. | ||
| - The model contains interactions between `a` and `c` and between `b` and `c`, but not between `a` and `b`. | ||
| - We extend this notation for mixed-effects models with the grouping notation (`|`): | ||
| - `(1 + a | subject)` indicates "by-subject random effects for the intercept and main effect `a`". | ||
| - This is in line with the usual statistical reading of `|` as "conditional on". | ||
|
|
||
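As a worked example of the rules above, assume for simplicity that `a`, `b`, and `c` are all numeric predictors, so that each contributes a single column. This assumption, the coefficient labels, and the error term $\varepsilon_i$ are ours. By distributivity, `y ~ 1 + (a + b) * c` expands to `y ~ 1 + a * c + b * c`, which corresponds to

$$ y_i = \beta_0 + \beta_1 a_i + \beta_2 b_i + \beta_3 c_i + \beta_4 a_i c_i + \beta_5 b_i c_i + \varepsilon_i $$

The main effect of `c` appears only once even though it arises from both `a * c` and `b * c`, and there is no $a_i b_i$ term because `a` and `b` are never crossed. For categorical predictors, each single column would be replaced by the associated set of contrast columns.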
| ## Mixed models in Wilkinson-Rogers and mathematical notation | ||
|
|
||
| Models fit with MixedModels.jl are generally linear mixed-effects models with unconstrained random effects covariance matrices and homoskedastic, normally distributed residuals. | ||
| Under these assumptions, the model specification | ||
|
|
||
| `response ~ 1 + (age + sex) * education * n_children + (1 | subject)` | ||
|
|
||
| corresponds to the statistical model | ||
|
|
||
| \begin{align*} | ||
| \left(Y |\mathcal{B}=b\right) &\sim N\left(X\beta + Zb, \sigma^2 I \right) \\ | ||
| \mathcal{B} &\sim N\left(0, G\right) | ||
| \end{align*} | ||
|
|
||
| for which we wish to obtain the maximum-likelihood estimates for $G$ and thus the fixed-effects $\beta$. | ||
|
|
||
| - The model contains no restrictions on $G$, except that it is positive semidefinite. | ||
| - The response $Y$ is the vector of observed responses. | ||
| - The fixed-effects design matrix $X$ consists of columns for | ||
| - the intercept, age, sex, education, and number of children (contrast coded as appropriate) | ||
| - the interaction of all lower order terms, excluding interactions between age and sex | ||
| - The random-effects design matrix $Z$ includes a column for | ||
| - the intercept for each subject |
@palday I think it would be useful to add extensions from RegressionFormulae in a separate section.