---
title: Advanced Statistical Data Analysis
subtitle: Lecture Notes
subject: Advanced Regression Modelling
authors:
  - name: Andreas Ruckstuhl
    affiliation: ZHAW School of Engineering
---

# Review of Multiple Linear Regression

## Initial Remarks

Regression analysis is used to model the relationship between a response variable $Y$ and one or more explanatory variables $x^{(1)}, \dots, x^{(m)}$, where the relationship is masked by random noise.

### Objectives of Regression Analysis
1. General description of the data structure.
2. Assessment of the effect of the explanatory variables on the response.
3. Prediction of future observations.

:::{prf:definition} Multiple Linear Regression Model
:label: def-mlr-model

The systematic relationship is modelled as linear in the unknown parameters:
$$Y_i = \beta_0 + \beta_1 x^{(1)}_i + \dots + \beta_m x^{(m)}_i + \mathcal{E}_i, \quad i = 1, \dots, n$$ (mlr-equation)

where the errors $\mathcal{E}_i$ are unobservable random variables.
:::
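
As a quick illustration, data from this model can be simulated; the coefficient values, sample size, and error scale below are arbitrary choices for the sketch, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 2                         # sample size and number of predictors (arbitrary)
beta = np.array([1.0, 2.0, -0.5])     # hypothetical coefficients (beta_0, beta_1, beta_2)
sigma = 0.5                           # hypothetical error standard deviation

x = rng.uniform(0.0, 10.0, size=(n, m))   # explanatory variables x^(1), x^(2)
eps = rng.normal(0.0, sigma, size=n)      # errors E_i ~ N(0, sigma^2)
y = beta[0] + x @ beta[1:] + eps          # Y_i as in the model equation
```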

:::{prf:remark}
In a **linear model**, the parameters enter linearly; the predictors themselves need not. For example, $y \approx \beta_0 + \beta_1 x^{(1)} + \beta_2 \log(x^{(2)})$ is a linear model, whereas $y \approx \beta_0 + \beta_1 (x^{(1)})^{\beta_2}$ is not, because $\beta_2$ enters non-linearly.
:::

### Error Assumptions
The standard assumptions on the error terms $\mathcal{E}_i$ are:
- Stochastically independent.
- Expectation zero and constant variance $\sigma^2$ (homoscedasticity).
- Normally (Gaussian) distributed: $\mathcal{E}_i \sim \mathcal{N}(0, \sigma^2)$.

## Matrix Representation

To simplify notation, the regression equation {eq}`mlr-equation` is written in matrix form:
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\mathcal{E}}$$

where:
- $\mathbf{Y}$ is the $n \times 1$ vector of responses.
- $\mathbf{X}$ is the $n \times p$ design matrix of explanatory variables (including a column of 1s for the intercept).
- $\boldsymbol{\beta}$ is the $p \times 1$ vector of unknown coefficients ($p = m + 1$).
- $\boldsymbol{\mathcal{E}}$ is the $n \times 1$ vector of unobserved random errors.

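In code, the design matrix with its intercept column can be assembled directly; this is a minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 2
x = rng.normal(size=(n, m))          # raw explanatory variables

# Design matrix X: a column of 1s for the intercept, then the m predictors,
# giving shape n x p with p = m + 1
X = np.column_stack([np.ones(n), x])
print(X.shape)   # (6, 3)
```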
## Tukey's First-Aid Transformations

When no domain theory guides the choice of transformation, these standard recommendations help linearize relationships and stabilize the variance.
They should be applied to both the explanatory variables and the response unless there is a good reason not to:

| Data Type | Recommended Transformation |
| :--- | :--- |
| **Concentrations and Amounts** | $\log(x)$ |
| **Count Data** | $\sqrt{x}$ |
| **Counted Fractions / Shares** | $\tilde{x} = \log\left(\frac{x + 0.005}{1.01 - x}\right)$ (a modified logit that stays finite at $x = 0$ and $x = 1$) |

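The table can be sketched as a small helper function; the function name and category labels below are ad-hoc choices for illustration:

```python
import numpy as np

def first_aid(x, kind):
    """Apply Tukey's first-aid transformation for the given data type."""
    x = np.asarray(x, dtype=float)
    if kind == "amount":      # concentrations and amounts
        return np.log(x)
    if kind == "count":       # count data
        return np.sqrt(x)
    if kind == "fraction":    # counted fractions / shares in [0, 1]
        return np.log((x + 0.005) / (1.01 - x))
    raise ValueError(f"unknown data type: {kind}")

# The modified logit remains finite even at the boundaries 0 and 1
print(first_aid(np.array([0.0, 0.5, 1.0]), "fraction"))
```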
## Model Fitting and Diagnostics

### Least Squares Estimation
The coefficients $\boldsymbol{\beta}$ are estimated by minimizing the sum of squared residuals.

:::{prf:theorem} Gauss-Markov Theorem
:label: thm-gauss-markov
Under the assumptions of zero mean, constant variance, and uncorrelated errors, the Ordinary Least Squares (OLS) estimator is the **Best Linear Unbiased Estimator (BLUE)**.
:::

The OLS estimator is given by:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$$

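A minimal numerical check of this formula on simulated data (all names and values are illustrative); in practice one solves the normal equations, or uses a QR-based least-squares routine, rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # design matrix, p = 3
beta_true = np.array([1.0, 2.0, -0.5])                       # hypothetical coefficients
y = X @ beta_true + rng.normal(0.0, 0.1, size=n)

# beta_hat = (X^T X)^{-1} X^T y, computed via a linear solve
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent but numerically more stable: QR-based least squares
beta_qr, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_qr))   # True
```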
### Model Adequacy (Residual Analysis)
Model adequacy is checked using diagnostic plots:
- **Tukey-Anscombe Plot:** Residuals vs. fitted values, to check for non-linearity or heteroscedasticity.
- **Normal Q-Q Plot:** To check the normality assumption of the errors.
- **Scale-Location Plot:** To check for constant variance.
- **Residuals vs. Leverage:** To identify influential observations (Cook's distance).
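
The quantities behind these plots (fitted values, residuals, leverages, Cook's distance) can be computed directly from the standard formulas; this sketch uses simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ beta_hat                 # x-axis of the Tukey-Anscombe plot
resid = y - fitted                    # y-axis of the Tukey-Anscombe plot

# Leverages: diagonal of the hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# Cook's distance combines residual size and leverage
sigma2_hat = resid @ resid / (n - p)
cooks_d = (resid**2 / (p * sigma2_hat)) * leverage / (1.0 - leverage) ** 2
```

Observations with both a large residual and high leverage get a large Cook's distance, which is exactly what the residuals-vs-leverage plot highlights.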