
Commit cb6f5d4

doc for node_failure + tests, GS missing
1 parent 726f6ee commit cb6f5d4

18 files changed

+240
-42
lines changed

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@ Projects
3535
:maxdepth: 2
3636

3737
projects/parallelSDC.rst
38+
projects/node_failure.rst
3839

3940
Playgrounds
4041
-----------
Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
Full code: `projects/node_failure/boussinesq_example.py <https://github.com/Parallel-in-Time/pySDC/blob/pySDC_v2/projects/node_failure/boussinesq_example.py>`_
2+
3+
.. literalinclude:: ../../../projects/node_failure/boussinesq_example.py
4+
5+
Results:
6+
7+
.. image:: ../../../data/BOUSSINESQ_steps_vs_iteration_hf_SPREAD.png
8+
:width: 19%
9+
10+
.. image:: ../../../data/BOUSSINESQ_steps_vs_iteration_hf_SPREAD_PREDICT.png
11+
:width: 19%
12+
13+
.. image:: ../../../data/BOUSSINESQ_steps_vs_iteration_hf_INTERP.png
14+
:width: 19%
15+
16+
.. image:: ../../../data/BOUSSINESQ_steps_vs_iteration_hf_INTERP_PREDICT.png
17+
:width: 19%
18+
19+
.. image:: ../../../data/BOUSSINESQ_steps_vs_iteration_hf_INTERP_PREDICT.png
20+
:width: 19%
21+
22+
.. image:: ../../../data/BOUSSINESQ_Kadd_vs_NOFAULT_hf.png
23+
:width: 33%
Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,47 @@
1+
Full code: `projects/node_failure/hard_faults_detail.py <https://github.com/Parallel-in-Time/pySDC/blob/pySDC_v2/projects/node_failure/hard_faults_detail.py>`_
2+
3+
.. literalinclude:: ../../../projects/node_failure/hard_faults_detail.py
4+
5+
Results:
6+
7+
Heat equation:
8+
9+
.. image:: ../../../data/HEAT_steps_vs_iteration_hf_7x7_NOFAULT.png
10+
:width: 19%
11+
12+
.. image:: ../../../data/HEAT_steps_vs_iteration_hf_7x7_SPREAD.png
13+
:width: 19%
14+
15+
.. image:: ../../../data/HEAT_steps_vs_iteration_hf_7x7_SPREAD_PREDICT.png
16+
:width: 19%
17+
18+
.. image:: ../../../data/HEAT_steps_vs_iteration_hf_7x7_INTERP.png
19+
:width: 19%
20+
21+
.. image:: ../../../data/HEAT_steps_vs_iteration_hf_7x7_INTERP_PREDICT.png
22+
:width: 19%
23+
24+
.. image:: ../../../data/HEAT_residuals_allstrategies.png
25+
:width: 33%
26+
27+
28+
29+
Advection equation:
30+
31+
.. image:: ../../../data/ADVECTION_steps_vs_iteration_hf_7x7_NOFAULT.png
32+
:width: 19%
33+
34+
.. image:: ../../../data/ADVECTION_steps_vs_iteration_hf_7x7_SPREAD.png
35+
:width: 19%
36+
37+
.. image:: ../../../data/ADVECTION_steps_vs_iteration_hf_7x7_SPREAD_PREDICT.png
38+
:width: 19%
39+
40+
.. image:: ../../../data/ADVECTION_steps_vs_iteration_hf_7x7_INTERP.png
41+
:width: 19%
42+
43+
.. image:: ../../../data/ADVECTION_steps_vs_iteration_hf_7x7_INTERP_PREDICT.png
44+
:width: 19%
45+
46+
.. image:: ../../../data/ADVECTION_residuals_allstrategies.png
47+
:width: 33%
Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
Full code: `projects/node_failure/hard_faults_test.py <https://github.com/Parallel-in-Time/pySDC/blob/pySDC_v2/projects/node_failure/hard_faults_test.py>`_
2+
3+
.. literalinclude:: ../../../projects/node_failure/hard_faults_test.py
4+
5+
Results:
6+
7+
Heat equation:
8+
9+
.. image:: ../../../data/HEAT_iteration_counts_hf_SPREAD.png
10+
:width: 19%
11+
12+
.. image:: ../../../data/HEAT_iteration_counts_hf_SPREAD_PREDICT.png
13+
:width: 19%
14+
15+
.. image:: ../../../data/HEAT_iteration_counts_hf_INTERP.png
16+
:width: 19%
17+
18+
.. image:: ../../../data/HEAT_iteration_counts_hf_INTERP_PREDICT.png
19+
:width: 19%
20+
21+
22+
Advection equation:
23+
24+
.. image:: ../../../data/ADVECTION_iteration_counts_hf_SPREAD.png
25+
:width: 19%
26+
27+
.. image:: ../../../data/ADVECTION_iteration_counts_hf_SPREAD_PREDICT.png
28+
:width: 19%
29+
30+
.. image:: ../../../data/ADVECTION_iteration_counts_hf_INTERP.png
31+
:width: 19%
32+
33+
.. image:: ../../../data/ADVECTION_iteration_counts_hf_INTERP_PREDICT.png
34+
:width: 19%
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
.. include:: /../../projects/node_failure/README.rst

projects/node_failure/README.rst

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
Fault-tolerance with PFASST: node failures
2+
==========================================
3+
4+
In this project, we explore PFASST's potential to deal with node failures.
5+
We derive different strategies which allow PFASST to continue after one time-step has failed.
6+
Failure injection and the different strategies are contained in ``emulate_hard_faults``, and the modified controller ``allinclusive_classic_nonMPI_hard_faults`` allows faults to appear before a fine sweep.
7+
We test our ideas on two simple toy problems and two more complex showcases.
8+
This project contains the pySDC v2 code for the publication `Toward fault-tolerant parallel-in-time integration with PFASST <https://arxiv.org/abs/1510.08334>`_,
9+
while the original code can be found under `pySDC: Fault-tolerant PFASST <https://doi.org/10.5281/zenodo.32765>`_.
10+
Note that due to the long runtime, the results are not generated via Travis. Only the visualization is tested.
11+
12+
Propagation of a single node failure
13+
------------------------------------
14+
15+
We start by analyzing the propagation and containment of a single fault at step 7, iteration 7, see ``hard_faults_detail``.
16+
We do this for the heat and advection equation, both in 1D.
17+
Four different strategies are tested:
18+
19+
- ``SPREAD``: a simple restart from scratch, i.e. the node/time-step is restarted by copying u0 to all quadrature nodes
20+
- ``SPREAD_PREDICT``: in addition to copying u0, we also do multiple SDC sweeps on the coarse level
21+
- ``INTERP``: instead of copying u0, we interpolate the values at the quadrature nodes by taking the next and the following time-step into account
22+
- ``INTERP_PREDICT``: in addition to interpolation, we also do coarse SDC sweeps
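
The difference between the ``SPREAD`` and ``INTERP`` restarts can be illustrated with a minimal sketch. It is not the code in ``emulate_hard_faults``: the function names ``spread`` and ``interp_linear`` are hypothetical, the interpolation is assumed to be a plain linear blend between two neighbouring values, and the nodes are assumed to be scaled to the unit interval.

```python
def spread(u0, num_nodes):
    """SPREAD-like restart: copy the initial value u0 to every quadrature node."""
    return [list(u0) for _ in range(num_nodes)]


def interp_linear(u_left, u_right, nodes):
    """INTERP-like restart (simplified assumption): blend two neighbouring
    values linearly at each quadrature node t in [0, 1]."""
    return [
        [(1.0 - t) * a + t * b for a, b in zip(u_left, u_right)]
        for t in nodes
    ]


# Hypothetical data: two unknowns, three nodes on the unit interval.
u_left = [1.0, 2.0]
u_right = [3.0, 6.0]
nodes = [0.0, 0.5, 1.0]

restart_spread = spread(u_left, len(nodes))
restart_interp = interp_linear(u_left, u_right, nodes)
```

The ``_PREDICT`` variants would additionally run coarse-level SDC sweeps on top of either starting guess; that step is omitted here.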
23+
24+
The results are plotted using ``postproc_hard_faults_detail``.
25+
26+
.. include:: doc_node_failure_hard_faults_detail.rst
27+
28+
Node failures at different steps and iterations
29+
-----------------------------------------------
30+
31+
The next step is to check how faults impact the convergence of PFASST at different steps and iterations.
32+
We systematically study this in ``hard_faults_test``, where for the heat and the advection equation each combination of step and iteration is tested separately.
33+
Heat maps generated by ``postproc_hard_faults_test`` then show how many more iterations are required to converge.
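
The quantity shown in such a heat map can be sketched as follows, assuming it is the difference between the iteration count of a faulty run and that of a fault-free reference run. The helper ``extra_iterations`` and the sample numbers are hypothetical and not taken from ``postproc_hard_faults_test``.

```python
def extra_iterations(faulty_counts, reference_count):
    """Map each (step, iteration) fault location to the number of
    additional iterations compared to the fault-free run."""
    return {
        location: count - reference_count
        for location, count in faulty_counts.items()
    }


# Hypothetical iteration counts, keyed by (step, iteration) of the fault.
faulty_counts = {(7, 7): 12, (3, 2): 9}
overhead = extra_iterations(faulty_counts, reference_count=8)
```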
34+
35+
.. include:: doc_node_failure_hard_faults_test.rst
36+
37+
38+
The Boussinesq test case
39+
------------------------
40+
41+
A first, more complex test case is the semi-implicit 2D Boussinesq system (order-coarsening only).
42+
We inject faults randomly with a rate of 3%, i.e. in 3% of all fine sweeps a node fails.
43+
To ensure comparability, we define these failures a priori, so that each run has to deal with the same failures.
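
One way to fix the failures a priori is to draw them once from a seeded random number generator and reuse the resulting list for every strategy. The sketch below illustrates this idea only: ``make_fault_schedule``, its parameters and the seed are assumptions, not the interface used in ``boussinesq_example``.

```python
import random


def make_fault_schedule(num_steps, max_iter, rate=0.03, seed=0):
    """Return a list of (step, iteration) pairs at which a node fails.

    Each fine sweep fails independently with probability `rate`; the
    fixed seed makes the schedule reproducible across runs.
    """
    rng = random.Random(seed)
    return [
        (step, iteration)
        for step in range(num_steps)
        for iteration in range(1, max_iter + 1)
        if rng.random() < rate
    ]


# The same schedule can be handed to every recovery strategy.
schedule_a = make_fault_schedule(num_steps=16, max_iter=10, seed=42)
schedule_b = make_fault_schedule(num_steps=16, max_iter=10, seed=42)
```

Because both calls use the same seed, every strategy sees the same set of failures, so differences in iteration counts can be attributed to the strategy rather than to the random draw.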
44+
45+
.. include:: doc_node_failure_boussinesq.rst
46+
47+
The Gray-Scott test case
48+
------------------------
