Commit e514726: Advanced Statistical Data Analysis
1 parent 05728b0


2 files changed: +81 -0


myst.yml: 1 addition & 0 deletions

@@ -43,6 +43,7 @@ project:
      - file: projects/masters-degree-in-data-science.md
        children:
          - file: projects/masters-degree-in-data-science/advanced-deep-learning.ipynb
+         - file: projects/masters-degree-in-data-science/advanced-statistical-data-analysis.md
          - file: projects/masters-degree-in-data-science/bayesian-machine-learning.ipynb
          - file: projects/masters-degree-in-data-science/complex-processes.ipynb
          - file: projects/masters-degree-in-data-science/predictive-modeling.ipynb

projects/masters-degree-in-data-science/advanced-statistical-data-analysis.md: 80 additions & 0 deletions

@@ -0,0 +1,80 @@
---
title: Advanced Statistical Data Analysis
subtitle: Lecture Notes
subject: Advanced Regression Modelling
authors:
  - name: Andreas Ruckstuhl
    affiliation: ZHAW School of Engineering
---

# Review of Multiple Linear Regression

## Initial Remarks

Regression analysis is used to model the relationship between a response variable $Y$ and one or more explanatory variables $x^{(1)}, \dots, x^{(m)}$, where the relationship is masked by random noise.

### Objectives of Regression Analysis

1. General description of the data structure.
2. Assessment of the effect of the explanatory variables on the response.
3. Prediction of future observations.

:::{prf:definition} Multiple Linear Regression Model
:label: mlr-equation

The systematic relationship is modelled as a linear function of the explanatory variables:

$$Y_i = \beta_0 + \beta_1 x^{(1)}_i + \dots + \beta_m x^{(m)}_i + \mathcal{E}_i, \quad i = 1, \dots, n$$ (mlr-equation)

where the errors $\mathcal{E}_i$ are unobservable random variables.
:::

:::{prf:remark}
In a **linear model**, the parameters enter linearly; the predictors themselves do not have to be linear functions of the data. For example, $y \approx \beta_0 + \beta_1 x^{(1)} + \beta_2 \log(x^{(2)})$ is a linear model, but $y \approx \beta_0 + \beta_1 (x^{(1)})^{\beta_2}$ is not, because $\beta_2$ enters non-linearly.
:::

### Error Assumptions

The standard assumptions for the error terms $\mathcal{E}_i$ are:

- They are stochastically independent.
- They have expectation zero and constant variance $\sigma^2$ (homoscedasticity).
- They are normally (Gaussian) distributed: $\mathcal{E}_i \sim \mathcal{N}(0, \sigma^2)$.

## Matrix Representation

To simplify notation, the regression equation {eq}`mlr-equation` is written in matrix form:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\mathcal{E}}$$

where:

- $\mathbf{Y}$ is an $n \times 1$ vector of responses.
- $\mathbf{X}$ is an $n \times p$ matrix of explanatory variables (including a column of 1s for the intercept).
- $\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown coefficients ($p = m + 1$).
- $\boldsymbol{\mathcal{E}}$ is an $n \times 1$ vector of unobserved random errors.
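
As an illustrative sketch (not part of the original notes), the matrix form can be set up in NumPy; all data here are simulated, and the coefficient values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

n, m = 50, 2                                  # 50 observations, m = 2 explanatory variables
x = rng.uniform(0, 10, size=(n, m))

# Design matrix X: prepend a column of 1s for the intercept, so p = m + 1
X = np.column_stack([np.ones(n), x])

beta = np.array([1.0, 2.0, -0.5])             # assumed "true" coefficients (for simulation only)
eps = rng.normal(0.0, 1.0, size=n)            # errors: i.i.d. N(0, sigma^2)
Y = X @ beta + eps                            # Y = X beta + E

print(X.shape)   # (50, 3)
```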

## Tukey's First-Aid Transformations

When no domain theory guides the choice of variable transformation, the following standard recommendations help linearize relationships and stabilize variance.
They should be applied to both explanatory variables and responses unless there is a valid reason to do otherwise:

| Data Type | Recommended Transformation |
| :--- | :--- |
| **Concentrations and Amounts** | $\log(x)$ |
| **Count Data** | $\sqrt{x}$ |
| **Counted Fractions / Shares** | $\tilde{x} = \text{logit}(x) = \log\left(\frac{x + 0.005}{1.01 - x}\right)$ |
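
A minimal sketch of applying these transformations in NumPy; the arrays are made-up illustration data, not from the notes:

```python
import numpy as np

concentration = np.array([0.5, 2.0, 10.0, 55.0])   # amounts: log-transform
counts        = np.array([0, 3, 12, 49])           # count data: square root
share         = np.array([0.0, 0.25, 0.9, 1.0])    # fractions in [0, 1]: modified logit

log_conc   = np.log(concentration)
sqrt_count = np.sqrt(counts)
# The 0.005 and 1.01 offsets keep the logit finite at the boundary values x = 0 and x = 1
logit_share = np.log((share + 0.005) / (1.01 - share))

print(logit_share)
```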

## Model Fitting and Diagnostics

### Least Squares Estimation

The coefficients $\boldsymbol{\beta}$ are estimated by minimizing the sum of squared residuals.

:::{prf:theorem} Gauss-Markov Theorem
:label: thm-gauss-markov

Under the assumptions of zero mean, constant variance, and uncorrelated errors, the Ordinary Least Squares (OLS) estimator is the **Best Linear Unbiased Estimator (BLUE)**.
:::

The OLS estimator is given by:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$$
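
A sketch of the OLS estimate on simulated data (not from the notes). The normal-equations formula is computed literally for illustration; in practice `np.linalg.lstsq` is numerically preferable to forming $(\mathbf{X}^T \mathbf{X})^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
x1, x2 = rng.uniform(0, 5, n), rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x1, x2])     # design matrix with intercept column
beta_true = np.array([2.0, 1.5, -0.7])        # assumed coefficients, for simulation only
Y = X @ beta_true + rng.normal(0, 0.5, n)

# Normal equations, exactly as in the formula above
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# Numerically preferable equivalent
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat)   # close to [2.0, 1.5, -0.7]
```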

### Model Adequacy (Residual Analysis)

Model adequacy is checked using diagnostic plots:

- **Tukey-Anscombe Plot:** residuals vs. fitted values, to check for non-linearity or heteroscedasticity.
- **Normal Q-Q Plot:** to check the normality assumption of the errors.
- **Scale-Location Plot:** to check for constant variance.
- **Residuals vs. Leverage:** to identify influential observations (Cook's distance).
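
The quantities behind these plots can be computed directly. A sketch on simulated data, using the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ (an assumption-free exercise, not code from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1.0, n)

p = X.shape[1]
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
fitted = H @ Y
resid = Y - fitted                        # y-axis of the Tukey-Anscombe plot
leverage = np.diag(H)                     # x-axis of the residuals-vs-leverage plot

sigma2_hat = resid @ resid / (n - p)
std_resid = resid / np.sqrt(sigma2_hat * (1 - leverage))   # standardized residuals
cooks_d = std_resid**2 * leverage / ((1 - leverage) * p)   # Cook's distance

print(cooks_d.max())
```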
