Commit 9ec247d

Merge pull request #5 from Machine-Learning-Foundations/suggestions
updated the readme of the exercise repo for day 3
2 parents fa39d77 + f792603 commit 9ec247d

4 files changed

+102
-32
lines changed

README.md

Lines changed: 81 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -1,28 +1,34 @@
11
# Sample Statistics and Gaussians Exercise
22

3-
### Exercise 1: Mean and Variance
4-
This exercise first explores sample statistics like mean and variance.
3+
Today's exercise first explores sample statistics like mean and variance.
54
It continues with the definition of the gaussian distributions.
65
It ends with applying gaussian mixture models (gmm) and the expectation-maximization algorithm used to optimize gmm.
76

7+
### ⊙ Task 1: Mean and Variance
8+
89
- To get started, take a look at the `src/sample_mean_corr_rhein.py` file.
910

1011
Implement functions to compute sample mean and the standard deviation.
11-
Use
12-
13-
$$ \hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i , $$
12+
1. Use
1413

15-
to calculate the mean and
14+
$$ \hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i , $$
1615

17-
$$ \hat{\sigma} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \hat{\mu})^2} $$
16+
to calculate the mean.
1817

19-
to compute the standard deviation. $x \in \mathbb{R}$ denotes individual sample elements, and $n \in \mathbb{N}$ the size of the sample.
18+
2. Use
2019

21-
Return to the Rhine data-set. Load the data from `./data/pegel.tab`. Compute the water level mean and standard deviation before and after the year 2000.
20+
$$ \hat{\sigma} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \hat{\mu})^2} $$
2221

22+
to compute the standard deviation.
23+
$x_i \in \mathbb{R}$ for $i \in \{1, ... , n\}$ denotes individual sample elements, and $n \in \mathbb{N}$ the size of the sample.
24+
Don't use the pre-built functions `np.mean()` or `np.std()` to solve these tasks.
2325

26+
Return to the Rhine data-set and go to the `__main__`-function.
27+
The data from `./data/pegel.tab` is already loaded and processed and is ready to be used.
28+
3. Compute the water level mean and standard deviation before the year 2000.
29+
4. And now compute the water level mean and standard deviation after the year 2000.
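
A minimal sketch of the two estimators from steps 1 and 2, written with plain loops so that neither `np.mean()` nor `np.std()` is needed. The function names follow the stubs `my_mean` and `my_std` in `src/sample_mean_corr_rhein.py`; the sample values are made up for illustration and are not Rhine data.

```python
import math


def my_mean(data_sample) -> float:
    """Sample mean: the sum of all elements divided by their count."""
    total = 0.0
    for value in data_sample:
        total += value
    return total / len(data_sample)


def my_std(data_sample) -> float:
    """Sample standard deviation with the 1/(n-1) normalization."""
    mu = my_mean(data_sample)
    squared_deviations = 0.0
    for value in data_sample:
        squared_deviations += (value - mu) ** 2
    return math.sqrt(squared_deviations / (len(data_sample) - 1))


# Hypothetical sample, chosen only to make the arithmetic easy to follow.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sample_mean = my_mean(sample)  # 40 / 8
sample_std = my_std(sample)    # sqrt(32 / 7)
```

For steps 3 and 4, the same two functions would be called once on the measurements before 2000 and once on those after 2000.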
2430

25-
### Exercise 2: Autocorrelation
31+
### ⊙ Task 2: Autocorrelation
2632
We now want to use autocorrelation to analyse the discrete time signal of the Rhine level measurements. Implement the `auto_corr` function in `src/sample_mean_corr_rhein.py`. It should implement the engineering version without the normalization and return the autocorrelation
2733

2834
$$ R_{xx} = (c_{-N+1},\ldots,c_{1}, c_0, c_{1}, \ldots, c_{N-1}) $$
@@ -31,36 +37,81 @@ with
3137

3238
$$ c_{k} = \sum_{t=1}^{N-|k|} n_t n_{t + |k|}$$
3339

34-
with $n$ the normalized version of your signal of length $N$. The time shift $k$ moves from $-(N-1)$ to $N-1$. Therefore, the resulting array has a length of $2N-1$. For example the autocorrelation of an input signal $x=(2,3,-1)$ is $R_{xx}=(c_{-2}, c_{-1}, c_0, c_1, c_2)=(-2, 3, 14, 3, -2)$ and is symmetrical. Make sure that you normalize the signal *before* giving it to your `auto_corr` function. Once you have checked your implementation using `nox -s test`, you can use `np.correlate` for efficiency. Plot the autocorrelation for the rhine level measurements since 2000.
35-
Normalize your data via
40+
with $n$ the normalized version of your signal of length $N$. The time shift $k$ moves from $-(N-1)$ to $N-1$. Therefore, the resulting array has a length of $2N-1$.
41+
42+
For example the autocorrelation of an input signal $x=(2,3,-1)$ is $R_{xx}=(c_{-2}, c_{-1}, c_0, c_1, c_2)=(-2, 3, 14, 3, -2)$ and is symmetrical.
43+
44+
>>> The following table illustrates how the $c_k$'s are calculated.
45+
The header contains the input signal $x$ padded with zeros on both sides.
46+
For autocorrelation, we compute correlation between $x$ and $x$ itself.
47+
So in visual terms, we slide $x$ from left to right across itself.
48+
At each step, we compute one $c_k$ by multiplying each number of the sliding copy with the header entry it is aligned with and then summing these products. The result is written in the last column of the respective row.
49+
50+
| 0 | 0 | 2 | 3 | -1 | 0 | 0 | $c_k$ |
51+
| -------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
52+
| 2 | 3 | -1 | | | | | 0 + 0 - 2 = -2
53+
| | 2 | 3 | -1 | | | | 0 + 6 - 3 = 3
54+
| | | 2 | 3 | -1 | | | 4 + 9 + 1 = 14
55+
| | | | 2 | 3 | -1 | | 6 - 3 + 0 = 3
56+
| | | | | 2 | 3 | -1 | -2 + 0 + 0 = -2
57+
>>>As you can see, when reading from top to bottom we get the correct solution $R_{xx}=(-2, 3, 14, 3, -2)$.
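
The sliding computation in the table can be sketched in a few lines. This is one possible approach, not the reference solution; the name `auto_corr_sketch` is hypothetical. It evaluates $c_k$ for every shift and compares the result with `np.correlate`.

```python
import numpy as np


def auto_corr_sketch(x: np.ndarray) -> np.ndarray:
    """Unnormalized autocorrelation for all shifts k = -(N-1), ..., N-1."""
    n = len(x)
    result = np.zeros(2 * n - 1)
    for k in range(-(n - 1), n):
        shift = abs(k)
        # Multiply the overlapping parts of the signal and its shifted copy,
        # then sum the products. Entries outside the overlap meet a padded 0.
        result[k + n - 1] = np.sum(x[: n - shift] * x[shift:])
    return result


x = np.array([2.0, 3.0, -1.0])
r_xx = auto_corr_sketch(x)
reference = np.correlate(x, x, mode="full")
```

Both `r_xx` and `reference` contain $(-2, 3, 14, 3, -2)$, the values from the table.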
58+
59+
So here are your tasks:
60+
1. Implement the `auto_corr` function as described above.
61+
>>> The function expects $x$ to be normalized. That means that no normalization is done inside the `auto_corr` function. Instead, normalize the input signal before calling `auto_corr`.
62+
2. Check your implementation using `nox -s test`.
63+
If the test passes, you can use `np.correlate` for efficiency in the following exercises!
3664

37-
$$ n_{t} = \frac{x_{t} - \hat{\mu}}{\hat{\sigma}} ,$$
65+
Now go back to the `__main__`-function and consider the Rhine data-set after 2000.
66+
3. Normalize the data of the Rhine level measurements since 2000 via
3867

39-
for all $t$ measurements until the signal length N. Before running the autocorrelation computation. Compare the autocorrelation to a random signal from `np.random.randn` by plotting both results with `plt.plot`.
68+
$$ n_{t} = \frac{x_{t} - \hat{\mu}}{\hat{\sigma}} ,$$
4069

70+
for all measurements $t = 1, \ldots, N$, where $N$ is the signal length.
71+
4. Compute and plot the autocorrelation for the Rhine level measurements since 2000.
4172

42-
### Exercise 3: Distributions
73+
Now we want to compare this autocorrelation to the one of a random signal.
74+
5. Create a random signal from `np.random.randn` that has the same shape as the Rhine level measurements.
75+
6. Normalize the random signal.
76+
7. Compute the autocorrelation of the normalized, random signal.
77+
8. Plot both autocorrelations of the Rhine level measurements and the random signal using `plt.plot` and compare the results.
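
Steps 3 to 8 might look roughly like the sketch below. Since the Rhine data is not reproduced here, a made-up periodic signal stands in for the measurements; the helper `normalize` and all variable names are hypothetical. Plotting is left out so that the block runs without a display.

```python
import numpy as np


def normalize(signal: np.ndarray) -> np.ndarray:
    """Subtract the sample mean and divide by the sample standard deviation."""
    mu = np.sum(signal) / signal.size
    sigma = np.sqrt(np.sum((signal - mu) ** 2) / (signal.size - 1))
    return (signal - mu) / sigma


np.random.seed(0)
# Stand-in for the Rhine measurements: a slow oscillation plus noise (made-up data).
t = np.arange(200)
levels = 300.0 + 50.0 * np.sin(2.0 * np.pi * t / 50.0) + 5.0 * np.random.randn(200)
# Random signal with the same shape as the measurements.
noise = np.random.randn(*levels.shape)

levels_corr = np.correlate(normalize(levels), normalize(levels), mode="full")
noise_corr = np.correlate(normalize(noise), normalize(noise), mode="full")

# Index of zero shift (k = 0) in an array of length 2N - 1.
zero_lag = levels.size - 1
```

With this normalization, $c_0 = \sum_t n_t^2 = N - 1$ for both signals, so the two curves share the same peak height. Away from the peak, the periodic stand-in keeps large values at multiples of its period, while the random signal stays close to zero. Passing `levels_corr` and `noise_corr` to `plt.plot` shows the difference.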
4378

4479

45-
- Consider the `src/plot_gaussian.py` module. Implement the `gaussian_pdf` function.
80+
### ⊙ Task 3: Distributions
4681

47-
In one dimension gaussian probability density function is defined as
82+
1. Consider the `src/plot_gaussian.py` module. Implement the `gaussian_pdf` function.
4883

49-
$$\phi_1(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp({-\frac{1}{2}(\frac{x - \mu}{\sigma})^2}) .$$
84+
In one dimension, the Gaussian probability density function is defined as
5085

51-
$\pi \in \mathbb{R}$ denotes Pi, $\mu \in \mathbb{R}$ the mean and $\sigma \in \mathbb{R}$ the standard deviation for a random variable $X$. $e^x$ denotes the exponential function. $X$ having a gaussian pdf is described as gaussion or normal distribution $\mathcal{N}$. Explore the behavior of $\mathcal{N}(\mu, \sigma)$ for different values of $\mu$ and $\sigma$.
86+
$$\phi_1(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp({-\frac{1}{2}(\frac{x - \mu}{\sigma})^2}) .$$
5287

88+
$\pi \in \mathbb{R}$ denotes Pi, $\mu \in \mathbb{R}$ the mean and $\sigma \in \mathbb{R}$ the standard deviation for a random variable $X$.
89+
$e^x$ denotes the exponential function.
90+
A random variable $X$ with a Gaussian pdf is said to follow a Gaussian or normal distribution $\mathcal{N}$.
91+
>>> Remark: In the notation $\phi_1(x | \mu, \sigma)$, $x$ is the variable that is plugged into the function, while $\mu$ and $\sigma$ are parameters that define the function and are fixed beforehand.
92+
2. Explore the behavior of $\mathcal{N}(\mu, \sigma)$ for different values of $\mu$ and $\sigma$.
93+
The code for plotting the pdfs is already given.
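
A possible implementation of $\phi_1$ is sketched below. The name `gaussian_pdf_sketch` and its signature are assumptions, since the actual stub in `src/plot_gaussian.py` is not shown here. The two parameter pairs are taken from the `params` tuple in that module.

```python
import numpy as np


def gaussian_pdf_sketch(x: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """Evaluate the one-dimensional Gaussian pdf at every entry of x."""
    normalization = 1.0 / np.sqrt(2.0 * np.pi * sigma**2)
    return normalization * np.exp(-0.5 * ((x - mu) / sigma) ** 2)


x = np.linspace(-5.0, 5.0, 500)
narrow = gaussian_pdf_sketch(x, mu=0.0, sigma=np.sqrt(0.2))
wide = gaussian_pdf_sketch(x, mu=0.0, sigma=np.sqrt(5.0))

# Riemann-sum approximation of the area under the narrow curve.
area_narrow = np.sum(narrow) * (x[1] - x[0])
```

A smaller $\sigma$ gives a taller and narrower bell, while the area under the curve stays close to one. Changing $\mu$ only shifts the curve along the $x$-axis.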
5394

54-
- Consider the `src/mixture_concpets.py` module.
55-
Implement a two-dimensional gaussian pdf following,
5695

57-
$$ \phi_2(\mathbf{x} | \mu_g, \Sigma_g) = \frac{1}{\sqrt{(2\pi)^2 \| \Sigma_g \|}} \exp({-\frac{1}{2}(\mathbf{x}-\mu_g)^T \Sigma_g^{-1}(\mathbf{x}-\mu_g)}).$$
5896

59-
$\mu_g \in \mathbb{R}^2$ denotes the two dimensional mean vector, $\Sigma_g \in \mathbb{R}^{2\times2}$ the covariance matrix, $^{-1}$ the matrix inverse, $T$ the transpose and $g \in \mathbb{N}$ the number of the distrubtion, which will be important later.
60-
Plot a 2d-bell curve with $\mu_1 = [-1.5, 2]$ and $\Sigma_1 = [[1, 0], [0, 1]]$ using the `plt.imshow` function. `np.linspace` and `np.meshgrid` will help you.
97+
3. Consider the `src/mixture_concepts.py` module.
98+
Go to the `twod_gaussian_pdf`-function and implement a two-dimensional gaussian pdf following,
6199

100+
$$ \phi_2(\mathbf{x} | \mu_g, \Sigma_g) = \frac{1}{\sqrt{(2\pi)^2 \| \Sigma_g \|}} \exp({-\frac{1}{2}(\mathbf{x}-\mu_g)^T \Sigma_g^{-1}(\mathbf{x}-\mu_g)}).$$
62101

63-
### Exercise 4: Gaussian mixture models (optional)
102+
$\mu_g \in \mathbb{R}^2$ denotes the two dimensional mean vector, $\Sigma_g \in \mathbb{R}^{2\times2}$ the covariance matrix, $^{-1}$ the matrix inverse, $T$ the transpose, $\| \|$ the determinant and $g \in \mathbb{N}$ the number of the distribution, which will be important later.
103+
104+
- As you can see, the x-parameter of the function is a grid of shape (grid_height, grid_width, 2).
105+
That means that you get not just one but grid_height*grid_width two-dimensional values that should be evaluated.
106+
It's up to you how you want to approach this task.
107+
Broadcasting might be an elegant way to evaluate all these values at the same time (https://numpy.org/doc/stable/user/basics.broadcasting.html). In this case, you might take a look at `np.swapaxes` to deal with the transposition.
108+
- At the very end of this document, we included a diagram that depicts how the shapes should develop when using Broadcasting. This is purely optional and just for when you need some guidance regarding the relevant shapes.
109+
110+
4. Plot a 2d-bell curve with $\mu_1 = [-1.5, 2]$ and $\Sigma_1 = [[1, 0], [0, 1]]$ using the `plt.imshow` function. `np.linspace` and `np.meshgrid` will help you.
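
The broadcasting idea from step 3 could be sketched as follows. This is one of several valid approaches, and it assumes that the function should return one density value per grid point, i.e. an array of shape (grid_height, grid_width). The name `twod_gaussian_pdf_sketch` and the grid resolution are chosen for illustration only.

```python
import numpy as np


def twod_gaussian_pdf_sketch(x: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Evaluate a 2D Gaussian pdf on a grid of shape (height, width, 2)."""
    diff = x - mu                       # (height, width, 2), mu is broadcast
    column = diff[..., np.newaxis]      # (height, width, 2, 1)
    row = np.swapaxes(column, -1, -2)   # (height, width, 1, 2)
    # The matrix products broadcast over the two leading grid axes.
    exponent = -0.5 * (row @ np.linalg.inv(sigma) @ column)  # (height, width, 1, 1)
    normalization = 1.0 / np.sqrt((2.0 * np.pi) ** 2 * np.linalg.det(sigma))
    return normalization * np.exp(exponent[..., 0, 0])       # (height, width)


mu_1 = np.array([-1.5, 2.0])
sigma_1 = np.array([[1.0, 0.0], [0.0, 1.0]])

# 101 points per axis so that the mean lies on a grid point.
axis = np.linspace(-5.0, 5.0, 101)
grid_x, grid_y = np.meshgrid(axis, axis)
grid = np.stack([grid_x, grid_y], axis=-1)  # (101, 101, 2)
pdf = twod_gaussian_pdf_sketch(grid, mu_1, sigma_1)
```

The resulting `pdf` array can be handed to `plt.imshow`. Its maximum sits at the grid point closest to $\mu_1$ and equals $1/(2\pi)$ for the identity covariance.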
111+
112+
113+
114+
### ✪ Task 4: Gaussian mixture models (optional)
64115

65116
We can use bell-curve sums for classification! A Gaussian mixture model has the density
66117

@@ -113,3 +164,8 @@ Train a gmm to find the diabetic patients.
113164

114165

115166
- Standard packages like scikit-learn implement GMMs, too. Take a minute to read https://scikit-learn.org/stable/modules/mixture.html .
167+
168+
---
169+
If you need some inspiration for broadcasting in Task 3.3:
170+
![broadcasting_hint](./figures/broadcasting.png)
171+

src/mixture_concepts.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@ def twod_gaussian_pdf(x: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.nd
1919
Returns:
2020
np.ndarray: The two dimensional gaussian distribution.
2121
"""
22-
# TODO: Implement me.
22+
# TODO: 3.3 Implement me.
2323
return np.zeros_like(x)
2424

2525

@@ -66,6 +66,8 @@ def fit_gmm(points: np.ndarray, init_params_list: List) -> List:
6666

6767
if __name__ == "__main__":
6868
np.random.seed(42)
69+
#--------------------------------------------------------------------------------------#
70+
6971
dist1 = np.random.normal(loc=(2, 2), scale=(1.0, 1.0), size=(100, 2))
7072
dist2 = np.random.normal(loc=(-2, -2), scale=(1.0, 1.0), size=(100, 2))
7173

src/plot_gaussian.py

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -45,12 +45,15 @@ def forward_euler(x: np.ndarray, fun: Callable, int_0: float = 0.0) -> np.ndarra
4545

4646

4747
if __name__ == "__main__":
48-
params = (
48+
# 3.2 Explore the behavior of the gaussian pdf by trying out different parameter values
49+
params = ( # each parameter-tuple contains a mean and a std: (mu, std)
4950
(0.0, np.sqrt(0.2)),
5051
(0.0, np.sqrt(1.0)),
5152
(0.0, np.sqrt(5.0)),
5253
(-2.0, np.sqrt(0.5)),
5354
)
55+
56+
# Plot the gaussian pdfs for the different parameters specified above
5457
for param in params:
5558
x = np.linspace(-5.0, 5.0, 500)
5659
plt.plot(

src/sample_mean_corr_rhein.py

Lines changed: 14 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -10,13 +10,13 @@
1010

1111
def my_mean(data_sample) -> float:
1212
"""Implement a function to find the mean of an input List."""
13-
# TODO: Implement me.
13+
# TODO: 1.1 Implement me.
1414
return 0.
1515

1616

1717
def my_std(data_sample) -> float:
1818
"""Implement a function to find the standard deviation of a sample in a List."""
19-
# TODO: Implement me.
19+
# TODO: 1.2 Implement me.
2020
return 0.
2121

2222

@@ -29,7 +29,8 @@ def auto_corr(x: np.ndarray) -> np.ndarray:
2929
Returns:
3030
np.ndarray: Autocorrelation of input signal of shape (signal_length*2 - 1,).
3131
"""
32-
# TODO: Implement me.
32+
# TODO: 2.1 Implement me.
33+
# TODO: 2.2 Check your implementation via nox -s test.
3334
return np.zeros_like(x)
3435

3536

@@ -57,6 +58,14 @@ def auto_corr(x: np.ndarray) -> np.ndarray:
5758
]
5859

5960

60-
# TODO: compute the mean and standard deviation before and after 2000.
61+
# TODO: 1.3 Compute the mean and standard deviation before 2000.
62+
# TODO: 1.4 Compute the mean and standard deviation after 2000.
6163

62-
# TODO: Compare the autocorrelation functions of the rhine data and of a random signal.
64+
#----------------------------------------------------------------------------------------------#
65+
# TODO: 2.3 Normalize the data of the Rhine level measurements since 2000.
66+
# TODO: 2.4 Compute and plot the autocorrelation.
67+
68+
# TODO: 2.5 Create a random signal.
69+
# TODO: 2.6 Normalize the random signal.
70+
# TODO: 2.7 Compute the autocorrelation.
71+
# TODO: 2.8 Plot both autocorrelations and compare the results.
