Tolerate linear solver failure in non-linear schemes#6825
Tolerate linear solver failure in non-linear schemes#6825lhy11009 wants to merge 2 commits intogeodynamics:mainfrom
Conversation
|
Here I’m experimenting with an option that allows the nonlinear solver to continue to the next iteration even if the cheap Stokes solve fails to reach its tolerance and skip the expensive one. From a strict numerical standpoint this isn’t ideal, but in practice I’ve found that a “failed” Stokes iteration can still move the solution into a better state for the next nonlinear iteration. In my runs, I use At this stage, I mainly want to get feedback on whether this makes sense to you all, and whether there are already existing mechanisms in ASPECT that achieve something similar. Depending on that, we can then decide whether it’s worth formalizing something along these lines. |
|
Hy Haoyuan, I think this is a very interesting idea. Is this mostly useful if you have strongly nonlinear material model properties that may lead to difficult / unrealistic results in the first nonlinear iterations? Conceptually this is similar to choosing a coarse linear solver tolerance inside the nonlinear solver under the assumption that the nonlinear solver will iterate until its tolerance is reached. Something like this (changing the linear solver tolerance between nonlinear iterations) is possible in the Newton solver, where the linear solver doesnt need to reach a very tight tolerance anyway in order to improve the solution. However, your approach would work for any nonlinear solver. We can check others opinions -- I would be curious to hear what @bangerth and @tjhei think about this -- but I wouldnt be opposed to adding a parameter like
What do you think? |
|
Hey Rene, Yes—this is very useful in cases with strongly nonlinear material behavior during the first nonlinear iterations. In practice, it saves a significant amount of wall-clock time because the cheap iterations are much faster (even though I haven’t quantified this yet). I’ve used this scheme across both 2-D and 3-D models, as well as setups with slightly different configurations (e.g., resolution changes), and it has worked consistently well. I think all of your suggestions make sense. In particular, it would be important to consider how this new parameter relates to “Maximum number of expensive Stokes solver steps” and “Nonlinear solver failure strategy.” Conceptually, I also agree that the final linear solve before exiting the nonlinear solver loop should still converge. I’d be curious to hear what others think as well. |
|
I discussed this with @YiminJin today, and he pointed out the similarity between my approach and the E–W method. However, in the E–W method, the tolerance of the linear solver is gradually tightened as the non-linear solver converges. In contrast, my approach effectively uses no explicit tolerance for the linear solver; instead, it relies on a fixed number of cheap Stokes iterations. |
|
I am not opposed to a strategy like the one @gassmoeller suggests whereby we tolerate the linear solver failing. I find that more appealing than futzing around with the tolerance or solver parameters of the linear solver because it would make sure that what you implement remains localized in the nonlinear solver, rather than spilling into the linear solver code. At the end of the day, you are exactly right: If you understand the E-W method as a way to say "We don't actually care about exactly solving the linear system while we're still far from the nonlinear solution", then choosing cheaper options for the linear solver or ignoring its non-convergence makes sense too. I get where @gassmoeller is coming from, but I'd even accept if that happens in the last nonlinear iteration -- all we really care about, and should care about, is the nonlinear residual. From a code perspective, this is about the nonlinear solver, and so it would be nice if we could contain the changes to that part of the code base. |
|
Both of you pointed out that what we should aim for is to “tolerate linear solver failure in non-linear schemes.” Based on this, I have changed the title of this PR. @gassmoeller suggests an implementation of storing and checking the Stokes solution from the last step as a safety check. This would make particular sense when the non-linear and linear solver tolerances are set to the same value. In several of my examples, the linear solver fails during the first few non-linear iterations but does converge in the final iteration, just before the overall non-linear solver converges. Which is kind of cool. Looking forward to the discussion next Monday. |
49f268e to
26f828c
Compare
|
I tried to follow the lines of discussion we had, and now the implementation works partially. I also added a test using this new strategy. In this test, the nonlinear solver converges in one step after ~40 iterations, despite the linear solver reporting a failure. However, I ran into an issue regarding the implementation. Ideally, we would only modify the nonlinear solver so that it continues the iteration even if the linear solver fails. But I realized that this does not work in practice: if we only change how the nonlinear solver handles exceptions, the linear solver will still throw the exception before updating the residual, so the nonlinear solver cannot proceed correctly. Instead, in this example I went to the lowest level of the linear solver, where the exception is first thrown, and modified it so that it continues from there. This might not be the best design choice, so I am very open to suggestions on how this should be handled. |
gassmoeller
left a comment
There was a problem hiding this comment.
I agree with your assessment, we need to indeed do this at the level of the linear solver. I think your general approach is correct, but I have some comments to the code (see below).
Also: could you model this approach after what we do with nonlinear_failure_strategy? I.e. use an enum type to allow more possible behaviors in the future (e.g. we could extend this at some point to cut the timestep if the linear solver fails).
Also please introduce a variable similar to nonlinear_solver_failures that keeps track of how often the linear solver failed, and then output that number at the end of the simulation like done in simulator/core.cc:1280.
source/simulator/parameters.cc
Outdated
| "`single Advection, single Stokes' or `single Advection, no Stokes'."); | ||
|
|
||
| // todo_solver | ||
| prm.declare_entry ("Continue nonlinear solver loop after linear solver failure", "false", |
There was a problem hiding this comment.
I didnt think about this earlier, but could you model this after the parameter nonlinear_solver_failure_strategy?
I.e. the name of the parameter should be Linear solver failure strategy,
its type should be Patterns::Selection("continue|abort") and you can create a type named LinearSolverFailureStrategy in parameters.h (just like for NonlinearSolverFailureStrategy).
There was a problem hiding this comment.
Instead of continue, I renamed this entry to continue_with_nonlinear_solver. The reason is that using continue as an enum conflicts syntactically with the continue keyword in C++.
In the same NonlinearSolverFailureStrategy enum, there is another option named "abort program". For that entry, I followed your suggestion and renamed it to simply abort.
7ef9e8b to
a7c84dc
Compare
| { | ||
| case Parameters<dim>::LinearSolverFailureStrategy::continue_with_nonlinear_solver: | ||
| { | ||
| this->get_pcout() << " linear solver failed (GMG), continuing" << std::endl; |
There was a problem hiding this comment.
I have put this both in the AMG and the BMG solver, and tweaked the outputs slightly to show these differences. I also have separate tests for these different linear solver schemes.
| exc, | ||
| mpi_communicator, | ||
| parameters.output_directory+"solver_history.txt"); | ||
| break; |
There was a problem hiding this comment.
Here, I would get a warning if I don't have this break, but I think as long as the previous "throw" is executed, this break is never reached, is that correct?
There was a problem hiding this comment.
Yes that is correct. It is good practice to end every case statement with break otherwise the other cases will be checked as well (unless of course that is the intended behavior). But throwing the exception will exit this function immediately, so we never reach the break statement.
|
@gassmoeller Done with replying to your comments. I also left a couple of open comments and questions. |
gassmoeller
left a comment
There was a problem hiding this comment.
Yes, I think this is the right behavior.
I would like to ask for the following changes, just to improve the safety of the functionality:
-
Add a variable
linear_solver_failuresin simulator.h like the variablenonlinear_solver_failures. Search for all places wherenonlinear_solver_failuresis used and letlinear_solver_failuresbehave the same way. Add a warning at the end of the model run (like in core.cc:2329) if any linear solver failures occured and report how many. This will document for the user how many times the linear solver failed (and hopefully push them to investigate if too many failures happened). Also I noticed thatnonlinear_solver_failuresis not serialized into the checkpoint for restarts. We should do that (and do the same forlinear_solver_failures). Can you add both to checkpoint_restart.cc:609? -
Add a function to parameters.h similar to
is_defect_correctionbut name itsolve_Stokes_iterativelyand check if any iterative Stokes solver scheme is selected. Then add something like the following in parameters.cc after reading the linear solver failure strategy:
if (linear_solver_failure_strategy == continue)
AssertThrow(Parameters<dim>::solve_Stokes_iteratively(nonlinear_solver) == true, ExcMessage("You are not allowed to select the linear solver failure strategy 'continue' with a solver scheme that does not iterate the Stokes equations."));
This will not be quite as safe as I originally hoped for (checking if the last Stokes solver converged), but at least it will make sure no one can continue with a non-converged solution if the solver scheme does not at least try to iterate.
| continue_with_nonlinear_solver, | ||
| abort | ||
| }; | ||
| /** |
There was a problem hiding this comment.
put an empty line here before the comment
| exc, | ||
| mpi_communicator, | ||
| parameters.output_directory+"solver_history.txt"); | ||
| break; |
There was a problem hiding this comment.
Yes that is correct. It is good practice to end every case statement with break otherwise the other cases will be checked as well (unless of course that is the intended behavior). But throwing the exception will exit this function immediately, so we never reach the break statement.
| set Dimension = 2 | ||
| set Start time = 0 | ||
| set End time = 0 | ||
| set Use years in output instead of seconds = true |
There was a problem hiding this comment.
please use the new spelling to avoid the deprecation warnings in the output:
| set Use years in output instead of seconds = true | |
| set Use years instead of seconds = true |
| set Dimension = 2 | ||
| set Start time = 0 | ||
| set End time = 0 | ||
| set Use years in output instead of seconds = true |
There was a problem hiding this comment.
as above:
| set Use years in output instead of seconds = true | |
| set Use years instead of seconds = true |
Pull Request Checklist. Please read and check each box with an X. Delete any part not applicable. Ask on the forum if you need help with any step.
This is a tentative PR that I opened to first discuss it with the maintainers. The code change is merely for demonstrating the idea. See the comments below.
Before your first pull request:
For all pull requests:
For new features/models or changes of existing features: