06_Iterative_methods.jl
### A Pluto.jl notebook ###
# v0.20.5
using Markdown
using InteractiveUtils
# ╔═╡ 0501f270-c39c-11ee-2077-7f7f409b6c66
begin
using LinearAlgebra
using PlutoUI
using PlutoTeachingTools
using Plots
using LaTeXStrings
using HypertextLiteral
end
# ╔═╡ 63bb7fe9-750f-4d2f-9d18-8374b113373e
md"""
!!! info ""
[Click here to view the PDF version.](https://teaching.matmat.org/numerical-analysis/06_Iterative_methods.pdf)
"""
# ╔═╡ 7d9c9392-3aec-4efd-a9ba-d8965687b163
TableOfContents()
# ╔═╡ 4e0e5311-3428-44f5-a616-68cf7634214f
md"""
# Iterative methods for linear systems
In the previous notebook we looked at direct methods for solving linear systems
$\mathbf{A} \mathbf{x} = \mathbf{b}$ based on LU factorisation of the system matrix $\mathbf{A}$. We saw that even for sparse matrices we may need $O(n^3)$ computational time to perform the factorisation and $O(n^2)$ memory to store the $\mathbf{L}$ and $\mathbf{U}$ factors due to fill-in.
Both can make the **solution of linear systems prohibitively expensive for large matrices**. In this notebook we will develop *iterative methods*.
These build a sequence of solution vectors $\mathbf{x}^{(k)}$,
which is designed to converge to the solution $\mathbf{x}$,
that is $\lim_{k \to \infty} \mathbf{x}^{(k)} = \mathbf{x}$.
Typically these methods are based on performing
matrix-vector products $\textbf{A} \textbf{v}$
in each iteration step.
Exactly this difference from direct methods
--- employing $\textbf{A}$ only through matrix-vector products instead of factorising it ---
is what often leads to an overall reduction of the computational cost.
"""
# ╔═╡ b5bf49df-f6ff-49f9-b016-7f0159b4af7d
md"""
## Richardson iterations
A general framework for building iterative methods for solving linear systems
is Richardson's method. The idea is to introduce a **matrix splitting** $\mathbf{A} = \mathbf{P} - \mathbf{R}$ where $\mathbf{P}$ is an invertible matrix
and $\mathbf{R} = \mathbf{P} - \mathbf{A}$.
With this idea we rewrite $\mathbf{A} \mathbf{x} = \mathbf{b}$ to
```math
(\mathbf{P} - \mathbf{R})\,\mathbf{x} = \mathbf{b}
\qquad \Leftrightarrow \qquad
\mathbf{P} \mathbf{x} = (\mathbf{P} - \mathbf{A})\, \mathbf{x} + \mathbf{b}
\qquad \Leftrightarrow \qquad
\mathbf{x} = \mathbf{P}^{-1} \left[(\mathbf{P} - \mathbf{A}) \mathbf{x} + \mathbf{b}\right]
```
If we define $g(\mathbf{x}) = \mathbf{P}^{-1} \left[(\mathbf{P} - \mathbf{A}) \mathbf{x} + \mathbf{b}\right]$ we observe this to be exactly the setting of
the multi-dimensional fixed-point methods we developed
in chapter 4,
i.e. our goal is to obtain a solution $\mathbf{x}_\ast$
with $\mathbf{x}_\ast = g(\mathbf{x}_\ast) = \mathbf{P}^{-1} \left[(\mathbf{P} - \mathbf{A}) \mathbf{x}_\ast + \mathbf{b}\right]$.
Starting from an initial vector $\mathbf{x}^{(0)}$ we thus iterate
```math
\tag{2}
\mathbf{P} \mathbf{x}^{(k+1)} = (\mathbf{P} - \mathbf{A}) \mathbf{x}^{(k)} + \mathbf{b},
\qquad k = 0, 1, \ldots,
```
that is in each step we solve the linear system $\mathbf{P} \mathbf{x}^{(k+1)} = \widetilde{\mathbf{b}}^{(k)}$ where $\widetilde{\mathbf{b}}^{(k)} = (\mathbf{P} - \mathbf{A}) \mathbf{x}^{(k)} + \mathbf{b}$. Assuming this sequence converges,
i.e. that $\lim_{k\to\infty} \textbf{x}^{(k)} = \textbf{x}_\ast$,
then
```math
\mathbf{P} \mathbf{x}_\ast = (\mathbf{P} - \mathbf{A}) \mathbf{x}_\ast + \mathbf{b}
\qquad \Leftrightarrow \qquad \textbf{0} = \mathbf{b} - \mathbf{A}\mathbf{x}_\ast,
```
i.e. the limit $\mathbf{x}_\ast$ is indeed the solution to the original equation.
"""
# ╔═╡ 1292abe9-5d88-45b0-8904-b89a03009968
md"""
The matrix $\textbf{P}$ is usually referred
to as the **preconditioner**. Since we need to solve a linear
system $\textbf{P} \textbf{u}^{(k)} = \textbf{r}^{(k)}$ in *each
iteration* $k$, an important criterion for a good preconditioner
is that solving such linear systems is cheap.
Let us rewrite equation (2) as
```math
\mathbf{P} \underbrace{\left(\textbf{x}^{(k+1)} - \textbf{x}^{(k)} \right)}_{=\textbf{u}^{(k)}} = \underbrace{\textbf{b} - \textbf{A} \textbf{x}^{(k)}}_{=\textbf{r}^{(k)}}.
```
The quantity $\textbf{r}^{(k)} = \textbf{b} - \textbf{A} \textbf{x}^{(k)}$
appearing in this expression is the **residual** at iteration $k$.
As usual it measures how far the iterate $\textbf{x}^{(k)}$
is from being a solution to $\mathbf{A} \mathbf{x} = \mathbf{b}$:
if $\textbf{r}^{(k)} = \textbf{0}$, then indeed
$\mathbf{A} \mathbf{x}^{(k)} = \mathbf{b}$,
i.e. that $\textbf{x}^{(k)}$ is the solution.
"""
# ╔═╡ c6617ee2-be59-49b2-b0bc-7423b49c2fce
md"""
We formulate Richardson's iterations:
!!! info "Algorithm 1: Richardson iteration"
Let $\mathbf{A} \in \mathbb{R}^{n\times n}$ be a system matrix,
$\mathbf{b} \in \mathbb{R}^n$ a right-hand side,
$\mathbf{P} \in \mathbb{R}^{n\times n}$ an invertible preconditioner,
$\mathbf{x}^{(0)} \in \mathbb{R}^n$ an initial guess
and $ε$ the desired convergence tolerance.
For $k = 0, 1, \ldots$ iterate:
1. Compute residual $\mathbf{r}^{(k)} = \mathbf{b} - \mathbf{A} \mathbf{x}^{(k)}$
2. Solve linear system $\mathbf{P} \mathbf{u}^{(k)} = \mathbf{r}^{(k)}$
for the update $\mathbf{u}^{(k)}$.
3. Compute new $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \mathbf{u}^{(k)}$.
The iteration is stopped as soon as
$\frac{\| \mathbf{r}^{(k)} \|}{\| \mathbf{b} \|} < ε$.
We will discuss the rationale behind the stopping condition
$\| \mathbf{r}^{(k)} \| / \| \mathbf{b} \| < ε$
in the subsection on error control below.
Comparing with our discussion on the
[fixed point methods we studied in chapter 4](https://teaching.matmat.org/numerical-analysis/04_Nonlinear_equations.html)
we notice that Algorithm 1 is essentially fixed-point iteration
$\mathbf{x}^{(k+1)} = g(\mathbf{x}^{(k)})$ with the map $g$ given by
```math
\tag{3}
g(\mathbf{x}) = \mathbf{x} + \mathbf{P}^{-1} \underbrace{\left( \textbf{b} - \mathbf{A} \mathbf{x} \right)}_{=\mathbf{r}}
```
and as discussed above the fixed point $\mathbf{x}_\ast = g(\mathbf{x}_\ast)$
necessarily satisfies $\mathbf{0} = \mathbf{r}_\ast = \textbf{b} - \mathbf{A} \mathbf{x}_\ast$ and therefore is a solution to the linear system.
An implementation of Algorithm 1 in Julia is given by
"""
# ╔═╡ 94a56970-0c97-4bc4-bbd5-d297cd54823e
function richardson(A, b, P; x=zero(b), tol=1e-6, maxiter=100)
history = [float(x)] # Keep history of xs
relnorms = Float64[] # Track relative residual norm
for k in 1:maxiter
r = b - A * x
relnorm = norm(r) / norm(b)
push!(relnorms, relnorm)
if relnorm < tol
break
end
u = P \ r
x = x + u
push!(history, x)
end
(; x, relnorms, history) # Return current iterate and history
end
# ╔═╡ 76dcdc97-b0e3-4f3f-8ebd-615df0cea57f
md"""Let us consider the test problem $\textbf{A} \textbf{x} = \textbf{b}$ with"""
# ╔═╡ 1c4ba2a0-553f-4b35-8b45-3de1c9e3a4e8
# Generate a random matrix, which has large entries on the diagonal
A = randn(100, 100) + Diagonal(15 .+ 50rand(100))
# ╔═╡ 49134b41-d3ca-4286-ae9a-5be0d1569241
b = rand(100);
# ╔═╡ e4a1e62c-646b-4378-b2a5-86cfe36bada9
md"""In this case the problem is sufficiently small that a reference solution can still be computed by a direct method, e.g. the LU factorisation performed by default when employing Julia's `\` operator:"""
# ╔═╡ 9a4c02fb-6cbf-4b25-9a3a-74066ba14666
x_reference = A \ b
# ╔═╡ 65bdbf4a-b304-43d7-a3b7-a5c932b23964
md"""
To apply Richardson iterations, we need a preconditioner $\mathbf P$.
By construction, $\mathbf A$ has its largest entries of each row on the diagonal.
A particularly simple preconditioner, which typically works well for such matrices,
is the diagonal of $\mathbf A$ itself, i.e.
"""
# ╔═╡ c9419d03-d32b-4811-aef7-644803e7d971
P = Diagonal(A)
# ╔═╡ 9efa5334-f2e9-43d6-899a-8cb90fff8c6a
md"""
Applying this preconditioner within the Richardson iterations yields the correct result to the chosen tolerance:
"""
# ╔═╡ 33ebd4d0-c369-4994-9aab-455a4a4e9c4b
begin
richardson_result = richardson(A, b, P; tol=1e-6)
maximum(abs, richardson_result.x - x_reference)
end
# ╔═╡ 0b374ba9-24af-43a2-86cb-c3a4cf5d999c
let
plot(richardson_result.relnorms;
yaxis=:log, mark=:o, lw=2, ylabel=L"||r|| / ||b||",
title="Richardson convergence", label=L"$P = \textrm{Diagonal(}A\textrm{)}$")
end
# ╔═╡ 03984b5d-3fe9-4557-9a0b-ddcc7d0a6c34
md"""
## Computational cost
Our main motivation for considering iterative methods was to overcome the $O(n^3)$ computational cost of Gaussian elimination / LU factorisation for solving linear systems.
Let us revisit Richardson iteration (Algorithm 1) and discuss its computational cost
for the case where $\mathbf{A} \in \mathbb{R}^{n\times n}$ is a full matrix.
Recall that the cost of LU factorisation scales as $O(n^3)$ in this setting.
1. In the **first step** the most costly operation is the **computation of the matrix-vector product** $\mathbf{A} \mathbf{x}^{(k)}$. For a full matrix this costs $O(n^2)$.
2. In the **second step** we solve $\mathbf{P} \mathbf{u}^{(k)} = \mathbf{r}^{(k)}$. If $\mathbf{P}$ is a **diagonal matrix** (like we just considered), this costs $O(n)$,
whereas for a **triangular matrix** (where one would perform backward or forward substitution) the cost is $O(n^2)$.
3. The computation of the new $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \mathbf{u}^{(k)}$ again costs only $O(n)$.
Overall **for full matrices** the **cost of Richardson iterations** is thus only $O(n^2)$.
"""
# ╔═╡ 5913dace-5e27-4d4d-8bc3-0b4118b1b093
md"""
For **sparse matrices** $\mathbf{A} \mathbf{x}^{(k)}$ can be computed in $O(n)$ computational cost. Similarly, using clever preconditioners, step 2 can be done in $O(n)$ time; examples are a sparsity-adapted forward substitution algorithm or again a diagonal preconditioner. Overall Richardson iterations thus **cost only $O(n)$ per iteration** --- in stark contrast to the up to $O(n^3)$ cost of LU factorisation in this setting.
"""
# ╔═╡ 7b8bb857-35d3-4014-851b-9ab263f48967
md"""
!!! note "Observation: Computational cost of iterative methods"
When solving linear systems using iterative methods,
the computational cost (per iteration) scales
- for **full matrices** $\mathbf{A} \in \mathbb{R}^{n\times n}$ as $O(n^2)$
- for **sparse matrices** as $O(n)$.
A recurring theme is that the cost of the matrix-vector product
$\mathbf{A} \mathbf{x}^{(k)}$ essentially determines the
cost of the iterative scheme.
"""
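This $O(n)$ scaling is easy to check: for a sparse matrix the matrix-vector product only touches the stored entries. A small sketch using the standard library `SparseArrays` (not loaded elsewhere in this notebook; the tridiagonal matrix is an arbitrary example):

```julia
using SparseArrays
using LinearAlgebra

# Sparse tridiagonal matrix with only O(n) stored entries
n = 1000
T = spdiagm(-1 => -ones(n - 1), 0 => 2ones(n), 1 => -ones(n - 1))

v = ones(n)
w = T * v        # costs O(nnz(T)) = O(n) operations, not O(n²)

nnz(T)           # only 3n - 2 entries are stored
```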
# ╔═╡ 1f7765a2-4986-4ef2-849e-ca1efb59abcf
md"""
## Convergence analysis
Having established above that Richardson iteration indeed leads to the numerically correct answer, we proceed to analyse its convergence.
As an iterative method Algorithm 1 can be brought into the setting of a fixed point method $\mathbf{x}^{(k+1)} = g(\mathbf{x}^{(k)})$ by choosing
```math
g(\mathbf{x}) = \mathbf{x} + \mathbf{P}^{-1} \left( \textbf{b} - \mathbf{A} \mathbf{x} \right).
```
In line with our discussion in [Root finding and fixed-point problems](https://teaching.matmat.org/numerical-analysis/04_Nonlinear_equations.html) the convergence depends on the properties of the **Jacobian** (i.e. the derivative) of this map at the fixed point. The term $\mathbf{P}^{-1} \mathbf{b}$ does not depend on $\mathbf{x}$ and thus drops out; for the rest we compute:
```math
\begin{align*}
g'(\mathbf{x}_\ast) &= \underbrace{\mathbf{I} - \mathbf{P}^{-1} \mathbf{A}}_{=\mathbf{B}}
\end{align*}
```
where the matrix $\mathbf{B}$ is usually referred to as the **iteration matrix**.
"""
# ╔═╡ ba77a115-f9fb-4f80-a0ea-eb50ddd9ae65
md"""
Previously we considered scalar-valued functions where the condition for convergence was that $|g'(x_\ast)| < 1$. A careful analysis in fact reveals that for vector-valued fixed-point maps like in the case of Richardson iterations this condition becomes
```math
\| g'(\mathbf{x}_\ast) \| = \| \mathbf{B} \| < 1,
```
i.e. the modulus is simply replaced by the matrix norm.
As a reminder, recall that the matrix norm of a matrix $\mathbf{M} \in \mathbb{R}^{n\times n}$ is defined as
```math
\|\mathbf{M}\| = \max_{\mathbf{0}\neq\mathbf{v}\in\mathbb{R}^n} \frac{\| \mathbf{M}\mathbf{v}\|}{\|\mathbf{v}\|}.
```
"""
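Numerically this definition can be cross-checked: `LinearAlgebra.opnorm` computes exactly this operator norm, and it agrees with $\sqrt{\lambda_\text{max}(\mathbf{M}^T\mathbf{M})}$. A minimal sketch on an arbitrary $2\times 2$ example:

```julia
using LinearAlgebra

M = [2.0 1.0;
     0.0 3.0]

norm_direct = opnorm(M)                          # operator 2-norm
norm_eig    = sqrt(maximum(eigvals(M' * M)))     # via largest eigenvalue of MᵀM

abs(norm_direct - norm_eig)    # zero up to rounding
```

Note that `norm(M)` would instead return the Frobenius norm of the matrix, which is why `opnorm` is the function to use here.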
# ╔═╡ 6f3d76f5-aeec-4a05-8aea-a4f97d625a4a
md"""
We summarise in a theorem:
!!! info "Theorem 1"
Given a system matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$,
RHS $\mathbf{b} \in \mathbb{R}^n$ and preconditioner
$\mathbf{P} \in \mathbb{R}^{n \times n}$ invertible,
the Richardson method with iteration matrix
$\mathbf{B} = \mathbf{I} - \mathbf{P}^{-1} \mathbf{A}$
converges for *any* initial guess $\mathbf{x}^{(0)}$
if $\left\|\mathbf{B}\right\| < 1$.
Moreover
```math
\|\mathbf{x}_\ast - \mathbf{x}^{(k)}\| ≤ \left\|\mathbf{B}\right\|^k \|\mathbf{x}_\ast - \mathbf{x}^{(0)}\|.
```
This is **linear convergence with rate** $\| \mathbf{B}\|$.
Notice that the theorem states that Richardson iterations
converge for *any initial guess*, which for general fixed-point methods
is not true.
"""
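The bound of Theorem 1 can be verified on a small example. The following sketch (with a hypothetical diagonally dominant $2\times 2$ system; fresh names to avoid clashing with the notebook's `A` and `b`) checks that the error indeed stays below $\|\mathbf{B}\|^k \, \|\mathbf{x}_\ast - \mathbf{x}^{(0)}\|$:

```julia
using LinearAlgebra

function richardson_errors(A, b, P; iters=20)
    x_star = A \ b
    x = zero(b)                       # initial guess x⁽⁰⁾ = 0
    errors = [norm(x_star - x)]
    for k in 1:iters
        x = x + P \ (b - A * x)       # one Richardson step
        push!(errors, norm(x_star - x))
    end
    errors
end

A2 = [4.0 1.0; 1.0 3.0]               # diagonally dominant toy system
b2 = [1.0, 2.0]
P2 = Diagonal(A2)
nB = opnorm(I - inv(P2) * A2)         # norm of the iteration matrix, here 1/3

errs = richardson_errors(A2, b2, P2)  # errors shrink at least by nB per step
```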
# ╔═╡ 28d6cb69-71ab-493f-a9a1-60f2fe37c1ed
md"""
This is in fact a consequence of the fact that $g$ is a linear function,
i.e. that its Taylor expansion
```math
g(\mathbf x^{(k)}) = g(\mathbf{x}_\ast) + g'(\mathbf{x}_\ast) (\mathbf{x}^{(k)} - \mathbf{x}_\ast)
```
terminates after the linear term as we easily verify by inserting our obtained expressions into the right-hand side:
```math
\begin{aligned}
g(\mathbf{x}_\ast) + g'(\mathbf{x}_\ast) (\mathbf{x}^{(k)} - \mathbf{x}_\ast)
&= \mathbf{x}_\ast + \mathbf{P}^{-1} \left( \textbf{b} - \mathbf{A} \mathbf{x}_\ast \right) + \left(\mathbf{I} - \mathbf{P}^{-1} \mathbf{A} \right) (\mathbf{x}^{(k)} - \mathbf{x}_\ast) \\
&=\mathbf{x}_\ast + \mathbf{P}^{-1} \left( \textbf{b} - \mathbf{A} \mathbf{x}_\ast \right) + \left(\mathbf{x}^{(k)} - \mathbf{x}_\ast - \mathbf{P}^{-1} \mathbf{A}\mathbf{x}^{(k)} + \mathbf{P}^{-1} \mathbf{A}\mathbf{x}_\ast\right) \\
&= \mathbf{x}^{(k)} + \mathbf{P}^{-1} \mathbf{b} - \mathbf{P}^{-1} \mathbf{A}\mathbf{x}^{(k)} \\
&= g(\mathbf{x}^{(k)})
\end{aligned}
```
Denoting the error in the $k$-th iteration as
$\mathbf{e}^{(k)} = \mathbf{x}^{(k)} - \mathbf{x}_\ast$
one can thus show that
```math
\tag{4}
\mathbf{e}^{(k+1)} = \underbrace{ \left(\mathbf{I} - \mathbf{P}^{-1} \mathbf{A} \right) }_{= \mathbf{B} \,=\, g'(\mathbf{x}_\ast)} \,\mathbf{e}^{(k)},
```
such that if $\left\|\mathbf{B}\right\| < 1$ the error is *guaranteed to shrink* in each iteration, independent of the current iterate $\mathbf{x}^{(k)}$.
"""
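Equation (4) holds *exactly*, not just to first order, which one can check numerically even for an iterate far from the solution. A brief sketch (again with hypothetical toy data):

```julia
using LinearAlgebra

A3 = [4.0 1.0; 1.0 3.0]
b3 = [1.0, 2.0]
P3 = Diagonal(A3)
B3 = I - inv(P3) * A3                 # iteration matrix

x_star = A3 \ b3
x      = [10.0, -7.0]                 # arbitrary iterate, far from x_star
e      = x - x_star
x_next = x + P3 \ (b3 - A3 * x)       # one Richardson step
e_next = x_next - x_star

norm(e_next - B3 * e)                 # zero up to rounding: e⁽ᵏ⁺¹⁾ = B e⁽ᵏ⁾
```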
# ╔═╡ 6a5346df-6c21-47fb-9f99-d0bc75532135
md"""
This is in contrast to the convergence analysis discussion in [Root finding and fixed-point problems](https://teaching.matmat.org/numerical-analysis/04_Nonlinear_equations.html),
where for *non-linear* fixed-point functions we found that
```math
e^{(k+1)} = g'(x_\ast) e^{(k)} + O(|e^{(k)}|^2).
```
Since this is *not* an exact equality, the error can only be guaranteed to shrink from $k$ to $k+1$ if the higher-order terms are small enough,
i.e. if $|e^{(k)}|$ is small enough, i.e. if one starts sufficiently close to the fixed point $x_\ast$.
"""
# ╔═╡ f2d4fbe9-e19a-41e0-9f3b-2f77ef0b12a7
md"""
!!! exercise
Show that for the fixed-point map of Richardson iteration
$g(\mathbf{x}) = \mathbf{x} + \mathbf{P}^{-1} \left( \textbf{b} - \mathbf{A} \mathbf{x} \right)$ we indeed have that
$\mathbf{e}^{(k+1)} = \mathbf{B} \,\mathbf{e}^{(k)}$
where $\mathbf{B} = \mathbf{I} - \mathbf{P}^{-1} \mathbf{A}$.
Making use of the inequality
```math
\| \mathbf{M}\mathbf{v}\| \leq \|\mathbf{M}\| \, \|\mathbf{v}\|
```
and by adapting our arguments in the
convergence analysis discussion of [Root finding and fixed-point problems](https://teaching.matmat.org/numerical-analysis/04_Nonlinear_equations.html)
show that for Richardson iteration the error always tends to zero
as the iteration progresses, i.e. that $\lim_{k\to\infty} \|\mathbf{e}^{(k)}\| = 0$.
"""
# ╔═╡ ef1ecb29-3ca0-4505-b96f-a56abd961979
md"""
From Theorem 1 we take away that the norm of iteration matrix
$\|\mathbf{B}\|$ is the crucial quantity to determine not only *if*
Richardson iterations converge, but also *at which rate*.
Recall in Lemma 4 of [Direct methods for linear systems](https://teaching.matmat.org/numerical-analysis/05_Direct_methods.html)
we had the result that for any matrix $\mathbf{B} \in \mathbb{R}^{m \times n}$
```math
\tag{5}
\|\mathbf{B}\| = \sqrt{ \lambda_\text{max}(\mathbf{B}^T\mathbf{B}) }.
```
With this and the condition $\|\mathbf{B}\| < 1$ for convergence
from Theorem 1 we can understand the **role of preconditioning**.
If we perform no preconditioning, i.e. $\mathbf{P} = \mathbf{I}$,
the iteration matrix norm $\|\mathbf{B}\| = \|\mathbf{I} - \mathbf{A}\|$
for our matrix $\mathbf{A}$ evaluates to
"""
# ╔═╡ e277ff4c-87a8-440d-bba4-f30a20b47788
A
# ╔═╡ 54cb29d7-4933-4ab4-a3b5-26431d8432f9
let
B = I - A # Iteration matrix for P = I, i.e. no preconditioner
norm_b = sqrt(maximum(eigvals( B' * B )))
end
# ╔═╡ af630960-a734-4124-81d3-a2657456f6c2
md"""
However, when using a diagonal preconditioner `P = Diagonal(A)`,
i.e. $\|\mathbf{B}\| = \|\mathbf{I} - \mathbf{P}^{-1} \mathbf{A}\|$,
then
"""
# ╔═╡ 51030114-04d7-49eb-9ac1-042941681446
let
P = Diagonal(A)
B = I - inv(P) * A # Iteration matrix for P = Diagonal(A)
norm_b = sqrt(maximum(eigvals( B' * B )))
end
# ╔═╡ 114bc28c-7888-4c6c-85bd-8c6232624491
md"""
This is much smaller, in particular less than $1$.
Therefore only the preconditioned iterations converge:
"""
# ╔═╡ d8e0e584-985d-4ff0-8a5c-b21c83c226d9
let
P = Diagonal(A)
richardson_diagonal = richardson(A, b, P; tol=1e-10, maxiter=30)
richardson_no_preconditioner = richardson(A, b, I; tol=1e-10, maxiter=30)
plot(richardson_diagonal.relnorms;
yaxis=:log, mark=:o, lw=2, ylabel=L"||r|| / ||b||",
label=L"$P = \textrm{Diagonal(}A\textrm{)}$", legend=:bottomright,
ylims=(1e-10, 1e5))
plot!(richardson_no_preconditioner.relnorms;
label=L"$P = I$", lw=2, mark=:o)
end
# ╔═╡ 96f2e900-d8e0-45d5-a79f-8df6d654fa3f
md"""
## Choosing a good preconditioner
From the above experiments we notice:
!!! info "Observation: Preconditioners"
A good preconditioner $\mathbf{P}$ for the Richardson iterations satisfies the following properties:
1. It is *cheap to invert*, that is linear systems $\mathbf{P} \mathbf{u} = \mathbf{r}$ are cheap to solve.
2. The iteration matrix norm $\|\mathbf I - \mathbf P^{-1} \mathbf A\|$ is as small as possible,
definitely smaller than one (to ensure convergence).
Clearly condition 2. suggests that the *perfect preconditioner* is $\mathbf{P} = \textbf{A}$, such that $\mathbf{P}^{-1} = \mathbf{A}^{-1}$. In this setting $\|\mathbf I - \mathbf{P}^{-1} \mathbf{A}\| = \|\mathbf I - \mathbf I\| = 0$,
i.e. we converge in a single Richardson step!
The trouble with this choice is that step 2 of Algorithm 1 (Richardson iteration) now effectively requires solving the system $\mathbf A \mathbf u^{(k)} = \mathbf r^{(k)}$ for $\mathbf u^{(k)}$, i.e. we are back to solving the original problem.
On the other hand condition 1. suggests using $\mathbf P = \mathbf I$ (the identity is cheapest to invert --- nothing needs to be done). However, this choice does not transform the problem at all and thus does not reduce $\|\mathbf{B}\|$, which remains $\|\mathbf I - \mathbf A\|$.
"""
# ╔═╡ a7ebacfb-b163-40aa-99c8-f5873478249b
md"""
In practice we thus need to find a compromise between the two. As mentioned above standard choices for the preconditioner are:
- the diagonal of $\mathbf A$
- the lower triangle of $\mathbf A$
- another common choice is the *incomplete LU factorisation*, i.e. where one seeks a factorisation $\tilde{\mathbf L} \tilde{\mathbf U} \approx \mathbf A$ with a lower triangular matrix $\tilde{\mathbf L}$ and an upper triangular $\tilde{\mathbf U}$, which only *approximately* factorises $\mathbf A$. This gives additional freedom in designing the factorisations, in particular to avoid fill-in for sparse matrices.
- in many physics or engineering applications $\mathbf A$ results from a physical model. A preconditioner $\mathbf P$ can then be obtained from an approximation to this model (e.g. by dropping the modelling of some physical effects).
We will not discuss the art of designing good preconditioners any further at this stage. Interested readers are referred to the excellent book [Yousef Saad *Iterative Methods for Sparse Linear Systems*, SIAM (2003)](https://www-users.cse.umn.edu/~saad/IterMethBook_2ndEd.pdf).
"""
# ╔═╡ d9bb93ce-5fac-4f3a-98b4-3024a23bcfb1
md"""
### Error control and stopping criterion
Let us return to clarifying the choice of the stopping criterion in the Richardson iterations (Algorithm 1).
As usual our goal is to control the error $\|\mathbf{x}_\ast - \mathbf{x}^{(k)}\|$
via our stopping criterion,
but without having access to the exact solution $\mathbf{x}_\ast$.
However, we know that $\mathbf{A} \mathbf{x}_\ast = \mathbf{b}$,
since $\mathbf{x}_\ast$ is just the solution to the linear system
we want to solve.
Similarly $\mathbf{r}^{(k)} = \mathbf{b} - \mathbf{A} \mathbf{x}^{(k)}$
directly implies
```math
\mathbf{A} \mathbf{x}^{(k)} = \underbrace{\mathbf{b} - \mathbf{r}^{(k)}}_{=\widetilde{\mathbf{b}}}
```
where we defined the "perturbed" right-hand side $\widetilde{\mathbf{b}}$.
Notably $\mathbf{x}^{(k)}$ is thus the *exact* solution of a linear system
involving $\mathbf{A}$ and this perturbed RHS,
while we actually want the true solution $\mathbf{x}_\ast$
of the system $\mathbf{A} \mathbf{x}_\ast = \mathbf{b}$ with the unperturbed right-hand side $\mathbf{b}$.
We are thus in exactly the same setting as our
final section on *Numerical stability* in our discussion
on [Direct methods for linear systems](https://teaching.matmat.org/numerical-analysis/05_Direct_methods.html)
where instead of solving $\mathbf{A} \mathbf{x}_\ast = \mathbf{b}$
we are only able to solve the perturbed system
$\mathbf{A} \widetilde{\textbf{x}} = \widetilde{\mathbf{b}}$.
"""
# ╔═╡ 55a69e52-002f-40dc-8830-7fa16b7af081
md"""
We can thus directly apply Theorem 2
from [Direct methods for linear systems](https://teaching.matmat.org/numerical-analysis/05_Direct_methods.html), which states that
```math
\frac{\|\mathbf{x}_\ast - \widetilde{\mathbf{x}} \|}{\| \mathbf{x}_\ast \|}
≤ κ(\mathbf{A})
\frac{ \left\| \mathbf{b} - \widetilde{\mathbf{b}} \right\| }{\| \mathbf{b} \|}
```
Keeping in mind that here $\widetilde{\mathbf{x}} = \mathbf{x}^{(k)}$
and $\mathbf{b} - \widetilde{\mathbf{b}} = \mathbf{r}^{(k)}$
we thus obtain
```math
\frac{\|\mathbf{x}_\ast - \mathbf{x}^{(k)} \|}{\| \mathbf{x}_\ast \|}
≤ κ(\mathbf{A})
\frac{ \left\| \mathbf{r}^{(k)} \right\| }{\| \mathbf{b} \|}
```
Combining this with our stopping criterion from **Algorithm 1**, that is
```math
\frac{\left\|\mathbf{r}^{(k)} \right\|}{\| \mathbf{b} \|} < ε,
```
we finally obtain
```math
\frac{\|\mathbf{x}_\ast - \mathbf{x}^{(k)} \|}{\| \mathbf{x}_\ast \|}
< κ(\mathbf{A}) \, ε.
```
In other words **our stopping criterion ensures** that the **relative error of the returned solution** is smaller than $κ(\mathbf{A})$ times the chosen tolerance.
If the matrix is well-conditioned, i.e. $κ(\mathbf{A})$ is close to $1$,
then the relative residual $\frac{\left\|\mathbf{r}^{(k)} \right\|}{\| \mathbf{b} \|}$
provides a good estimate of the true error and our stopping criterion is appropriate.
However, if $κ(\mathbf{A})$ is large, even a small residual may imply a large error in the returned solution.
"""
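This bound is straightforward to check numerically. In the sketch below (hypothetical small system; `cond` computes $κ(\mathbf{A})$ in the 2-norm) the relative error of a perturbed iterate is indeed bounded by $κ(\mathbf{A})$ times its relative residual:

```julia
using LinearAlgebra

A4 = [3.0 1.0; 1.0 2.0]
b4 = [1.0, 1.0]
x_star = A4 \ b4

x_approx = x_star + 1e-7 * [1.0, -1.0]    # a perturbed "iterate" x⁽ᵏ⁾
r = b4 - A4 * x_approx                    # its residual r⁽ᵏ⁾

relerr = norm(x_star - x_approx) / norm(x_star)
bound  = cond(A4) * norm(r) / norm(b4)    # κ(A) ‖r‖ / ‖b‖

relerr ≤ bound
```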
# ╔═╡ a5bedab1-b6a3-4539-82af-c18de2da4e92
Foldable("Alternative derivation without applying the previous Theorem 2",
md"""
We start by noting that $\mathbf{A} \mathbf{x}_\ast = \mathbf{b}$
and as discussed above that
$\mathbf{A} \mathbf{x}^{(k)} = \mathbf{b} - \mathbf{r}^{(k)} = \widetilde{\mathbf{b}}$. Now using the inequality
```math
\tag{6}
\|\mathbf{b}\| = \|\mathbf{A} \mathbf{x}_\ast\| ≤ \|\mathbf{A}\| \, \|\mathbf{x}_\ast\|
\quad
\Rightarrow
\quad \|\mathbf{x}_\ast\| ≥ \frac{\|\mathbf{b}\|}{\|\mathbf{A}\|}
```
we develop for the relative error
```math
\begin{aligned}
\frac{\|\mathbf{x}_\ast - \mathbf{x}^{(k)} \|}{\| \mathbf{x}_\ast \|}
\stackrel{(6)}{≤} \frac{ \| \mathbf{A} \| \, \left\|\mathbf{A}^{-1} \left(\mathbf{b} - \widetilde{\mathbf{b}}\right) \right\|}{\| \mathbf{b} \|}
≤ \| \mathbf{A} \| \, \| \mathbf{A}^{-1} \| \, \frac{ \left\| \mathbf{b} - \widetilde{\mathbf{b}} \right\| }{\| \mathbf{b} \|}
= \| \mathbf{A} \| \, \| \mathbf{A}^{-1} \| \, \frac{ \left\| \mathbf{r}^{(k)} \right\| }{\| \mathbf{b} \|}.
\end{aligned}
```
Recall the previously discussed condition number
```math
κ(\mathbf{A}) = \| \mathbf{A} \| \, \| \mathbf{A}^{-1} \|
= \sqrt{ \frac{λ_\text{max}(\mathbf{A}^T\mathbf{A} ) }{ λ_\text{min}(\mathbf{A}^T \mathbf{A}) } }
```
as well as our stopping criterion from Algorithm 1, that is
```math
\frac{\left\|\mathbf{r}^{(k)} \right\|}{\| \mathbf{b} \|} < ε,
```
Combining these, we again obtain
```math
\frac{\|\mathbf{x}_\ast - \mathbf{x}^{(k)} \|}{\| \mathbf{x}_\ast \|}
≤ κ(\mathbf{A}) \frac{\left\|\mathbf{r}^{(k)} \right\|}{\| \mathbf{b} \|} = κ(\mathbf{A}) \, ε.
```
""")
# ╔═╡ 858b7c87-80de-4c1a-8598-03657608e323
md"""
## Jacobi and Gauss-Seidel method
*Note:* We will only discuss the high-level ideas of this part in the lecture. You can expect that any detailed exam question on Jacobi or Gauss-Seidel will provide you with the formulas and algorithms.
We will now briefly discuss the Jacobi and Gauss-Seidel methods,
which can be seen as particular cases of Richardson iterations.
"""
# ╔═╡ d3f4ef3a-1a17-4078-b7a8-9e9b767ee541
md"""
Recall equation (2)
```math
\mathbf{P} \mathbf{x}^{(k+1)} = (\mathbf{P} - \mathbf{A}) \mathbf{x}^{(k)} + \mathbf{b},
\qquad k = 0, 1, \ldots.
```
In the **Jacobi method** the key idea is to use the *diagonal* of the matrix $\mathbf{A}$ as the preconditioner
```math
\mathbf{P} = \begin{pmatrix}
A_{11} & & & \\
& A_{22} & & \\
& & \ddots & \\
& & & A_{nn}
\end{pmatrix}.
```
and to explicitly insert this expression into equation (2).
Rewriting we obtain in the $(k+1)$-st iteration of Jacobi's method
```math
\begin{pmatrix}
A_{11} & & & \\
& A_{22} & & \\
& & \ddots & \\
& & & A_{nn}
\end{pmatrix}
\begin{pmatrix}
x_1^{(k+1)} \\ x_2^{(k+1)} \\ \vdots \\ x_n^{(k+1)}
\end{pmatrix}
= -\begin{pmatrix}
0 & A_{12} & \cdots & A_{1n} \\
A_{21} & 0 & & \vdots \\
\vdots & & \ddots & \\
A_{n1} & \cdots & A_{n,n-1} & 0
\end{pmatrix}
\begin{pmatrix}
x_1^{(k)} \\ x_2^{(k)} \\ \vdots \\ x_n^{(k)}
\end{pmatrix}
+
\begin{pmatrix}
b_1 \\ b_2 \\ \vdots \\ b_n
\end{pmatrix}
```
The diagonal structure of $\textbf{P}$ allows the explicit
computation of $\textbf{x}^{(k+1)}$. Its $i$-th component can be obtained as
```math
\tag{7}
x^{(k+1)}_i = \frac{1}{A_{ii}} \left( b_i - \sum_{\stackrel{j=1}{j\neq i}}^n A_{ij} \, x^{(k)}_j \right)
```
for all $i = 1, \ldots, n$ and as long as $A_{ii} \neq 0$.
"""
# ╔═╡ 77f1c8d5-8058-47cb-a0b5-436b4259aec9
function jacobi(A, b; x=zero(b), tol=1e-6, maxiter=100)
    history = [float(x)]   # History of iterates
    relnorms = Float64[]   # Track relative residual norms
    n = length(x)
    xᵏ = float(x)
    for k in 1:maxiter
        xᵏ⁺¹ = zeros(n)
        for i in 1:n
            Aᵢⱼxⱼ = 0.0
            for j in 1:n
                if i ≠ j
                    Aᵢⱼxⱼ += A[i, j] * xᵏ[j]
                end
            end  # Loop j
            xᵏ⁺¹[i] = (b[i] - Aᵢⱼxⱼ) / A[i, i]   # Equation (7)
        end  # Loop i
        xᵏ = xᵏ⁺¹                                # Accept the new iterate ...
        push!(history, xᵏ)
        relnorm_rᵏ = norm(b - A * xᵏ) / norm(b)  # Relative residual norm
        push!(relnorms, relnorm_rᵏ)
        if relnorm_rᵏ < tol                      # ... before checking convergence
            break
        end
    end  # Loop k
    (; x=xᵏ, relnorms, history)
end
# ╔═╡ 96352c57-7f86-4af8-ae6c-a3c20f190461
md"""
The **Gauss-Seidel method** employs the lower triangular part of $\mathbf A$ as the preconditioner $\mathbf{P}$ (in Julia `P = LowerTriangular(A)`):
```math
\mathbf{P} = \begin{pmatrix}
A_{11} & & & \\
A_{21} & A_{22} & & \\
\vdots & & \ddots & \\
A_{n1} & A_{n2} & \cdots & A_{nn}
\end{pmatrix}.
```
By inserting the lower-triangular $\textbf{P}$ into (2) we obtain in the $(k+1)$-st iteration of Gauss-Seidel:
```math
\begin{pmatrix}
A_{11} & & & \\
A_{21} & A_{22} & & \\
\vdots & & \ddots & \\
A_{n1} & A_{n2} & \cdots & A_{nn}
\end{pmatrix}
\begin{pmatrix}
x_1^{(k+1)} \\ x_2^{(k+1)} \\ \vdots \\ x_n^{(k+1)}
\end{pmatrix}
= -\begin{pmatrix}
0 & A_{12} & \cdots & A_{1n} \\
& 0 & \ddots & \vdots \\
& & \ddots & A_{n-1,n} \\
& & & 0
\end{pmatrix}
\begin{pmatrix}
x_1^{(k)} \\ x_2^{(k)} \\ \vdots \\ x_n^{(k)}
\end{pmatrix}
+
\begin{pmatrix}
b_1 \\ b_2 \\ \vdots \\ b_n
\end{pmatrix}.
```
Using forward substitution to solve this linear system
leads to the following form for the $i$-th component
of the vector $\mathbf{x}^{(k+1)}$.
```math
\tag{8}
x^{(k+1)}_i = \frac{1}{A_{ii}} \left( b_i - \sum_{j=1}^{i-1} A_{ij} \, x^{(k+1)}_j - \sum_{j=i+1}^n A_{ij} \, x^{(k)}_j \right)
```
for all $i = 1, \ldots, n$ and as long as $A_{ii} \neq 0$.
Notice that Gauss-Seidel is very similar to Jacobi's method, just with the small difference that for computing the new component $x^{(k+1)}_i$ we use the components $x^{(k+1)}_j$ for $j<i$, which have already been updated to their new values.
"""
# ╔═╡ d376fb74-6ebd-4b07-9c5c-c34e7e16c25a
function gauss_seidel(A, b; x=zero(b), tol=1e-6, maxiter=100)
    history = [float(x)]      # History of iterates
    relnorms = Float64[]      # Track relative residual norms
    n = length(x)
    xᵏ = x
    for k in 1:maxiter
        xᵏ⁺¹ = zeros(n)
        for i in 1:n
            Aᵢⱼxⱼ = 0.0       # Accumulates the two sums of equation (8)
            for j in 1:i-1    # Components already updated in this sweep
                Aᵢⱼxⱼ += A[i, j] * xᵏ⁺¹[j]
            end  # Loop j
            for j in i+1:n    # Components still at their old values
                Aᵢⱼxⱼ += A[i, j] * xᵏ[j]
            end  # Loop j
            xᵏ⁺¹[i] = (b[i] - Aᵢⱼxⱼ) / A[i, i]
        end  # Loop i
        xᵏ = xᵏ⁺¹
        push!(history, xᵏ)
        relnorm_rᵏ = norm(b - A * xᵏ) / norm(b)  # Relative residual norm
        push!(relnorms, relnorm_rᵏ)
        if relnorm_rᵏ < tol   # Converged; xᵏ is the final iterate
            break
        end
    end  # Loop k
    (; x=xᵏ, relnorms, history)
end
# ╔═╡ de24e92e-faab-41fe-afc2-287835b1e015
md"""
Jacobi and Gauss-Seidel are just two examples of the many flavours
of Richardson iteration used in practice.
Chapter 4 of
[Yousef Saad *Iterative Methods for Sparse Linear Systems*, SIAM (2003)](https://www-users.cse.umn.edu/~saad/IterMethBook_2ndEd.pdf) provides a good overview.
"""
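Both sweeps are instances of the same splitting idea: one Jacobi or Gauss-Seidel update coincides with one preconditioned Richardson step $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \mathbf{P}^{-1} \mathbf{r}^{(k)}$. A minimal sketch of this equivalence (the helper `richardson_step` and the small test matrix are illustrative, not part of the notebook):

```julia
using LinearAlgebra

# One preconditioned Richardson step: xᵏ⁺¹ = xᵏ + P⁻¹ (b - A xᵏ)
richardson_step(A, b, P, x) = x + P \ (b - A * x)

Atest = [4.0 1.0 0.0;
         1.0 3.0 1.0;
         0.0 1.0 5.0]        # illustrative diagonally dominant matrix
btest = [1.0, 2.0, 3.0]
x0    = zeros(3)

# Jacobi sweep: P is the diagonal part of A
x_jacobi = richardson_step(Atest, btest, Diagonal(Atest), x0)

# Gauss-Seidel sweep: P is the lower-triangular part of A
x_gs = richardson_step(Atest, btest, LowerTriangular(Atest), x0)
```

Starting from `x0 = 0`, the Jacobi step returns `Diagonal(Atest) \ btest` and the Gauss-Seidel step performs one forward substitution, exactly as in equations (7) and (8).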
# ╔═╡ b36bda99-fddb-422c-9ac4-cab7d3495334
md"We return to our example problem $\mathbf{A} \mathbf{x} = \mathbf{b}$ with"
# ╔═╡ bc76f57d-22ba-488b-9c57-99e0b89c09f0
A
# ╔═╡ 08ca7c51-34c9-4cff-8e11-d875f75c899e
md"and"
# ╔═╡ f4484dda-074e-4323-844a-319c69a478b0
b
# ╔═╡ 5e217a36-ceb1-4dc9-9e3c-ad89a4c83503
md"On this problem the convergence is as follows:"
# ╔═╡ d8b7f223-20fc-4052-8614-28e5f226014d
let
    result_jacobi = jacobi(A, b; tol=1e-10, maxiter=30)
    result_gauss_seidel = gauss_seidel(A, b; tol=1e-10, maxiter=30)

    plot(result_jacobi.relnorms;
         yaxis=:log, mark=:o, lw=2, ylabel=L"||r|| / ||b||",
         label="Jacobi", legend=:topright)
    plot!(result_gauss_seidel.relnorms;
          label="Gauss-Seidel", lw=2, mark=:o)
end
# ╔═╡ 08dc1f90-2fc6-4328-a39d-8856f215db33
md"""
## Linear systems involving symmetric positive-definite matrices
In many applications in engineering and the sciences
the arising system matrices $\mathbf{A}$ have additional properties,
which can be exploited to obtain better-suited algorithms.
An important case are the symmetric positive definite matrices.
This part of the notes only provides a condensed introduction.
A more detailed, but very accessible introduction to the steepest descent and conjugate gradient methods discussed in this section is given by
[Jonathan Shewchuk *An Introduction to the Conjugate Gradient Method Without the Agonizing Pain* (1994)](https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf).
Let us first recall the definition of symmetric positive definite matrices.
!!! info "Definition: Positive definite"
    A matrix $\textbf{A} \in \mathbb{R}^{n\times n}$ is called **positive definite**
    if
    ```math
    \forall\, \mathbf{0} \neq \mathbf{v} \in \mathbb{R}^n
    \qquad \mathbf{v}^T \mathbf{A} \mathbf{v} > 0.
    ```
"""
# ╔═╡ 33cb7e52-abe1-40f0-81d0-f8b00468a9ff
md"""
For **symmetric matrices** we additionally have that $\mathbf{A}^T = \mathbf{A}$.
Symmetric positive definite matrices (often abbreviated as s.p.d. matrices)
arise naturally in physical systems when looking for configurations
minimising their energy. A result underpinning this connection is
!!! info "Theorem 2"
    A symmetric matrix $\textbf{A} \in \mathbb{R}^{n\times n}$ is
    positive definite if and only if all its eigenvalues are positive
    (recall that the eigenvalues of a symmetric matrix are always real).
"""
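Theorem 2 is easy to check numerically; a small sketch (the matrices `S` and `T` are arbitrary illustrations, not the notebook's example problem):

```julia
using LinearAlgebra

S = [4.0 1.0;
     1.0 3.0]           # symmetric, eigenvalues (7 ± √5)/2, both positive
@show eigvals(S)
@show isposdef(S)       # s.p.d., consistent with Theorem 2

T = [1.0 2.0;
     2.0 1.0]           # symmetric, but eigenvalues -1 and 3
@show isposdef(T)       # not positive definite
```

`isposdef` attempts a Cholesky factorisation, which succeeds exactly for s.p.d. matrices, so it agrees with the eigenvalue criterion.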
# ╔═╡ b46abcc8-f739-4da0-9d63-f6673ed6884f
md"""
To every linear system $\mathbf{A} \mathbf{x} = \mathbf{b}$
involving an s.p.d. system matrix $\mathbf{A}$ we can associate an
**energy function** $\phi$
```math
ϕ\, :\, \mathbb{R}^n \to \mathbb{R}\, :\, ϕ(\mathbf{v}) = \frac12 \mathbf{v}^T\mathbf{A}\mathbf{v} - \mathbf{v}^T \mathbf{b}
```
Using basic vector calculus one shows that
```math
\tag{9}
\begin{aligned}
\text{Gradient} && \nabla ϕ(\mathbf{x}) &= \mathbf{A} \mathbf{x} - \mathbf{b} = -\mathbf{r} && \text{\textit{(negative residual)}} \\
\text{Hessian} && H_ϕ(\mathbf{x}) &= \mathbf{A} && \text{\textit{(system matrix)}}
\end{aligned}
```
Setting the gradient to zero, we obtain the stationary points.
The corresponding equation
```math
\mathbf{0} = \nabla ϕ(\mathbf{x}) = \mathbf{A} \mathbf{x} - \mathbf{b}
```
only has a single solution, provided that $\mathbf{A}$ is non-singular.
Since the Hessian $H_ϕ(\mathbf{x}) = \mathbf{A}$ is positive definite,
this stationary point is a minimum. As a result we obtain
!!! info "Proposition 3"
Given an s.p.d. matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$,
the solution of the system $\mathbf{A} \mathbf{x}_\ast = \mathbf{b}$
is the unique minimum of the function $ϕ$, i.e.
```math
\tag{10}
\mathbf{x}_\ast = \text{argmin}_{\textbf{v} \in \mathbb{R}^{n}} \, ϕ(\mathbf{v}).
```
"""
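Proposition 3 can be verified numerically by comparing the energy at the solution against the energy at perturbed points. A sketch (the names `energy`, `Aspd`, `bspd` are hypothetical, chosen to avoid the notebook's globals):

```julia
using LinearAlgebra

# Energy function ϕ(v) = ½ vᵀ A v - vᵀ b associated with A x = b
energy(v, A, b) = dot(v, A * v) / 2 - dot(v, b)

Aspd = [3.0 1.0;
        1.0 2.0]          # illustrative s.p.d. matrix
bspd = [1.0, 1.0]
x_star = Aspd \ bspd      # solution of the linear system, here [0.2, 0.4]

# The solution minimises the energy: any perturbation increases ϕ
for _ in 1:100
    v = x_star + randn(2)
    @assert energy(v, Aspd, bspd) ≥ energy(x_star, Aspd, bspd)
end
```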
# ╔═╡ 41a269c8-fe64-4660-aa66-054b8af8af20
md"""
Importantly there is thus a **relation between optimisation problems** and **solving linear systems** if the system matrix $\textbf{A}$ is s.p.d.
"""
# ╔═╡ bf9a171a-8aa4-4f21-bde3-56ccef40de24
md"""
SPD matrices are not unusual. For example, recall that in polynomial regression problems (see least-squares problems in [Interpolation](https://teaching.matmat.org/numerical-analysis/07_Interpolation.html)),
where we wanted to find the best polynomial through the points
$(x_i, y_i)$ for $i=1, \ldots, n$ by minimising the least-squares error,
we had to solve the *normal equations*
```math
\mathbf{V}^T \mathbf{V} \mathbf{c} = \mathbf{V}^T \mathbf{y},
```
where $\mathbf{c} \in \mathbb{R}^{m+1}$ collects the unknown coefficients of the polynomial $\sum_{j=0}^m c_j x^j$, $\mathbf{y}$ is the vector
collecting all $y_i$ values and $\mathbf{V}$ is the Vandermonde matrix
```math
\mathbf{V} = \left(\begin{array}{cccc}
1 & x_1 & \ldots & x_1^m \\
1 & x_2 & \ldots & x_2^m \\
\vdots & \vdots & & \vdots \\
1 & x_n & \ldots & x_n^m \\
\end{array}\right) \in \mathbb{R}^{n\times (m+1)}.
```
In this case the system matrix $\mathbf{A} = \mathbf{V}^T \mathbf{V}$ is *always* symmetric and positive semi-definite, and positive definite whenever $\mathbf{V}$ has full column rank, e.g. for $n \geq m+1$ distinct nodes $x_i$.
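This is easy to verify directly; a sketch under these assumptions (the nodes `xs` and degree `m` are arbitrary illustrations):

```julia
using LinearAlgebra

xs = [0.0, 0.5, 1.0, 1.5, 2.0]       # distinct nodes, n = 5
m  = 2                                # polynomial degree, so m + 1 = 3 columns
V  = [xᵢ^j for xᵢ in xs, j in 0:m]    # Vandermonde matrix, size n × (m+1)

N = Symmetric(V' * V)                 # normal-equations matrix VᵀV
@show isposdef(N)                     # full column rank ⇒ positive definite
```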
# ╔═╡ 590e8877-0600-4bbb-8e88-7033548815b5
md"""
## Steepest descent method
"""
# ╔═╡ fff80ab4-d1df-4bef-a1d2-2aa08c78ce54
md"""
If we view the problem of solving linear systems as an optimisation problem,
a relatively simple idea is to construct an iterative method, where at every step $k$ we try to decrease the energy function $\phi$.
To guide our thoughts we consider a 2D problem which is easy to visualise. We take as an example
```math
\mathbf A^\text{2D} = \begin{pmatrix} 1 & 0 \\ 0 & 20 \end{pmatrix}
\qquad \mathbf b^\text{2D} = \begin{pmatrix}0\\0\end{pmatrix}
```
such that
```math
\phi(\mathbf x) = \frac12 \mathbf x^T \mathbf A^\text{2D} \mathbf x - \mathbf x^T \mathbf b^\text{2D} = \frac12 x_1^2 + 10\, x_2^2.
```
"""
# ╔═╡ bb45f71b-b4cc-4b9a-abcb-7b0647f32a9b
A2d = [1.0 0.0;
0.0 20.0]
# ╔═╡ 53cc323c-d742-4cc6-8b96-e966e2b1ccb4
b2d = [0.0;
0.0]
# ╔═╡ e22c0abe-caa8-4e65-a3c1-8b05e820b7e7
ϕ(x₁, x₂) = x₁^2 / 2 + 10 * x₂^2
# ╔═╡ 46cf4acb-5883-4cf0-bc15-3150f401a401
md"""
This problem is visualised as a contour plot below:
"""
# ╔═╡ 92035249-948e-4f2f-a00e-8b4ae0ec2742
begin
    function plot_contour(; show_grad=false)
        x = -6:0.1:6
        y = -2:0.1:2
        p = contour(x, y, ϕ;
                    levels=5:5:100, clim=(0, 100),
                    xlabel=L"x_1", ylabel=L"x_2",
                    title="Contour plot of ϕ",
                    aspect_ratio=1, legend=:topleft)
        scatter!(p, [5], [1]; mark=:o, label=L"x^{(k)}", markersize=5, c=1)
        if show_grad
            α = 1/10
            grad = A2d * [5; 1]   # ∇ϕ at x⁽ᵏ⁾ = (5, 1)
            plot!(p, [5, 5 - α*grad[1]], [1, 1 - α*grad[2]];
                  arrow=true, lw=3, c=3,
                  label=L"$-\frac{1}{10} \, \nabla \phi(x^{(k)})$ ")
        end
        scatter!(p, [0], [0]; mark=:utriangle, label=L"Minimum $x_\ast$",
                 markersize=6, c=1)
    end
    plot_contour(; show_grad=true)
end
# ╔═╡ 32b2a6e4-31ca-477f-9839-71cd940949b8
md"""
Now suppose we have an estimate $\mathbf{x}^{(k)}$ of the solution of $\mathbf{A}\mathbf{x} = \mathbf{b}$, shown above as the blue dot.
This $\mathbf{x}^{(k)}$ is also an estimate of the minimum of $\phi$.
Our goal is thus to find a $\textbf{x}^{(k+1)}$ satisfying
$\phi(\textbf{x}^{(k+1)}) < \phi(\textbf{x}^{(k)})$,
i.e. to get closer to the minimum (blue triangle).
The gradient $\nabla \phi(\textbf{x}^{(k)})$ provides
the slope of the function $\phi$ at $\textbf{x}^{(k)}$
in the *upwards* direction.
Therefore, taking a step along the direction of $-\nabla \phi$
takes us downhill (see green arrow). We thus propose an algorithm
```math
\tag{11}
\begin{aligned}
\textbf{x}^{(k+1)} &= \textbf{x}^{(k)} - \alpha_k \nabla \phi(\textbf{x}^{(k)}) \\&= \textbf{x}^{(k)} + \alpha_k \textbf r^{(k)},
\end{aligned}
```
where we used that $\nabla \phi(\textbf{x}^{(k)}) = - \textbf r^{(k)}$,
and where $\alpha_k > 0$ is a parameter determining
how far to follow along the direction of the negative gradient.
Note that this method is closely related to Richardson iterations:
for $α_k = 1$ we exactly recover **Algorithm 1** with $\mathbf{P} = \mathbf{I}$.
"""
# ╔═╡ f8f62fa0-6a68-4083-909f-4a3367157f12
Foldable("Rationale for introducing αₖ. Or: why not just take αₖ = 1 ?",
md"""
The rationale for introducing the parameter $\alpha_k$
is that taking too large a step from $\mathbf x^{(k)}$
to $\mathbf x^{(k+1)}$ is in general not useful.
In the example visualised above the green arrow
represents only $1/10$ of the actual gradient.
Simply updating $\mathbf x^{(k+1)} = \mathbf x^{(k)} + \mathbf r^{(k)}$
with $\alpha_k = 1$ would substantially overshoot,
such that $\mathbf x^{(k+1)}$ ends up even further from the minimum
than $\mathbf x^{(k)}$.
We therefore employ $0 < \alpha_k < 1$ to down-scale this step.
""")
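The overshoot can be reproduced numerically on the same 2D example; a sketch (the helper `descent_step` and the variable names are illustrative, mirroring the notebook's `A2d` and starting point):

```julia
using LinearAlgebra

Aex = [1.0 0.0;
       0.0 20.0]
bex = [0.0, 0.0]          # minimum of ϕ is at the origin

# One steepest-descent step with fixed step size α: x ← x + α r
descent_step(x, α) = x + α * (bex - Aex * x)

xstart = [5.0, 1.0]
xsmall = descent_step(xstart, 1/20)   # moderate step: moves towards the minimum
xbig   = descent_step(xstart, 1.0)    # α = 1 overshoots along the steep x₂ axis

@show norm(xsmall)   # smaller than norm(xstart): progress towards the minimum
@show norm(xbig)     # larger than norm(xstart): further away than before
```

Since $\mathbf b = \mathbf 0$, the distance to the origin directly measures the distance to the minimum, so the two norms illustrate why a fixed $\alpha_k = 1$ can fail while a smaller step succeeds.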
# ╔═╡ cf1af660-1326-4c57-900f-ea12e9836f85
md"""
Since our overall goal is to find the minimum of $\phi$,
a natural idea is to determine $\alpha_k$ exactly such that
the value of $\phi(\textbf{x}^{(k+1)})$ becomes as small as possible.
We compute
```math
\begin{aligned}
ϕ(\mathbf{x}^{(k+1)}) &= ϕ(\mathbf{x}^{(k)} + α_k \mathbf{r}^{(k)}) \\
&= \frac12 \left(\mathbf{x}^{(k)} + α_k \mathbf{r}^{(k)}\right)^T \mathbf{A}
\left(\mathbf{x}^{(k)} + α_k \mathbf{r}^{(k)}\right)
- \left(\mathbf{x}^{(k)} + α_k \mathbf{r}^{(k)}\right)^T\mathbf{b}\\
&= \underbrace{\frac12 (\mathbf{r}^{(k)})^T\mathbf A\mathbf{r}^{(k)}}_{=a} α_k^2
+ \underbrace{\left[
\frac12 (\mathbf{x}^{(k)})^T\mathbf A\mathbf{r}^{(k)}
+ \frac12 (\mathbf{r}^{(k)})^T\mathbf A\mathbf{x}^{(k)}
- (\mathbf{r}^{(k)})^T \mathbf{b}
\right]}_{=b} α_k \\
&\hspace{8.8em} + \underbrace{\left[
\frac12 (\mathbf{x}^{(k)})^T\mathbf A\mathbf{x}^{(k)}
- (\mathbf{x}^{(k)})^T \mathbf{b}
\right]}_{=c}
\end{aligned}
```
where
```math
\begin{aligned}
a &= \frac12 (\mathbf{r}^{(k)})^T\mathbf A\mathbf{r}^{(k)}&
c &=
\frac12 (\mathbf{x}^{(k)})^T\mathbf A\mathbf{x}^{(k)}
- (\mathbf{x}^{(k)})^T \mathbf{b}
\end{aligned}
```
and
```math