Skip to content

Tolerate linear solver failure in non-linear schemes#6825

Open
lhy11009 wants to merge 2 commits intogeodynamics:mainfrom
lhy11009:skip_expensive_interations
Open

Tolerate linear solver failure in non-linear schemes#6825
lhy11009 wants to merge 2 commits intogeodynamics:mainfrom
lhy11009:skip_expensive_interations

Conversation

@lhy11009
Copy link
Contributor

@lhy11009 lhy11009 commented Dec 31, 2025

Pull Request Checklist. Please read and check each box with an X. Delete any part not applicable. Ask on the forum if you need help with any step.

This is a tentative PR that I opened to first discuss it with the maintainers. The code change is merely for demonstrating the idea. See the comments below.

Before your first pull request:

For all pull requests:

For new features/models or changes of existing features:

  • I have tested my new feature locally to ensure it is correct.
  • I have created a testcase for the new feature/benchmark in the tests/ directory.
  • I have added a changelog entry in the doc/modules/changes directory that will inform other users of my change.

@lhy11009
Copy link
Contributor Author

Here I’m experimenting with an option that allows the nonlinear solver to continue to the next iteration even if the cheap Stokes solve fails to reach its tolerance and skip the expensive one. From a strict numerical standpoint this isn’t ideal, but in practice I’ve found that a “failed” Stokes iteration can still move the solution into a better state for the next nonlinear iteration.

In my runs, I use
Number of cheap Stokes solver steps = 60
and essentially skip the expensive Stokes solves altogether. With this setup, I’m able to reach a nonlinear solver tolerance of 5e-3 in fewer than 10 iterations. This is for a 3-D subduction model with viscosities spanning 10^19–10^24 Pa·s, using the visco-plastic module with composite rheology, a finest resolution of ~3 km, and ~10 million cells, so the problem is not trivial.

At this stage, I mainly want to get feedback on whether this makes sense to you all, and whether there are already existing mechanisms in ASPECT that achieve something similar. Depending on that, we can then decide whether it’s worth formalizing something along these lines.

@gassmoeller
Copy link
Member

Hy Haoyuan, I think this is a very interesting idea. Is this mostly useful if you have strongly nonlinear material model properties that may lead to difficult / unrealistic results in the first nonlinear iterations?

Conceptually this is similar to choosing a coarse linear solver tolerance inside the nonlinear solver under the assumption that the nonlinear solver will iterate until its tolerance is reached. Something like this (changing the linear solver tolerance between nonlinear iterations) is possible in the Newton solver, where the linear solver doesnt need to reach a very tight tolerance anyway in order to improve the solution. However, your approach would work for any nonlinear solver.

We can check others opinions -- I would be curious to hear what @bangerth and @tjhei think about this -- but I wouldnt be opposed to adding a parameter like Continue nonlinear solver loop after linear solver failure. However I would suggest the following modifications, do you think they make sense?

  • The parameter should not be limited to the cheap solver. If the cheap solver does not converge, and there are expensive iterations requested (by setting Maximum number of expensive Stokes solver steps to larger than 0 or leaving it at the default 1000) then those number of expensive iterations should be performed next. If that parameter is set to 0, or the expensive solver does not converge either then we can check the same new parameter to decide if we should throw an exception and crash or not (see solver.cc:912).
  • The last linear solve before exiting the nonlinear solver loop should succeed. I am not sure this is mathematically required, because if the nonlinear residual is small enough to fulfil the convergence criteria the Stokes residual should be small as well. But I would feel quite uncomfortable continuing with a Stokes solution that did not converge. Also many users set a maximum number of nonlinear iterations and continue even if the nonlinear solution is not converged. This would be dangerous if the linear solver didnt converge either. To check if the last Stokes solve converged may require storing the state (converged/not converged) of the last Stokes solution result somewhere (maybe we can work with the SolverControl objects we use inside solve_stokes()).
  • (Maybe) we could require that the parameter Nonlinear solver failure strategy is not set to continue with next timestep if your new parameter is set. This would mean we can never get into the situation I described above (linear solver doesnt converge, nonlinear solver doesnt converge either, but we reach the maximum number of nonlinear iterations and continue anyway).

What do you think?
Happy to hear the thoughts of others on the topic.

@lhy11009
Copy link
Contributor Author

lhy11009 commented Jan 7, 2026

Hey Rene,

Yes—this is very useful in cases with strongly nonlinear material behavior during the first nonlinear iterations. In practice, it saves a significant amount of wall-clock time because the cheap iterations are much faster (even though I haven’t quantified this yet). I’ve used this scheme across both 2-D and 3-D models, as well as setups with slightly different configurations (e.g., resolution changes), and it has worked consistently well.

I think all of your suggestions make sense. In particular, it would be important to consider how this new parameter relates to “Maximum number of expensive Stokes solver steps” and “Nonlinear solver failure strategy.” Conceptually, I also agree that the final linear solve before exiting the nonlinear solver loop should still converge.

I’d be curious to hear what others think as well.

@lhy11009
Copy link
Contributor Author

I discussed this with @YiminJin today, and he pointed out the similarity between my approach and the E–W method. However, in the E–W method, the tolerance of the linear solver is gradually tightened as the non-linear solver converges. In contrast, my approach effectively uses no explicit tolerance for the linear solver; instead, it relies on a fixed number of cheap Stokes iterations.

@bangerth
Copy link
Contributor

I am not opposed to a strategy like the one @gassmoeller suggests whereby we tolerate the linear solver failing. I find that more appealing than futzing around with the tolerance or solver parameters of the linear solver because it would make sure that what you implement remains localized in the nonlinear solver, rather than spilling into the linear solver code.

At the end of the day, you are exactly right: If you understand the E-W method as a way to say "We don't actually care about exactly solving the linear system while we're still far from the nonlinear solution", then choosing cheaper options for the linear solver or ignoring its non-convergence makes sense too. I get where @gassmoeller is coming from, but I'd even accept if that happens in the last nonlinear iteration -- all we really care about, and should care about, is the nonlinear residual.

From a code perspective, this is about the nonlinear solver, and so it would be nice if we could contain the changes to that part of the code base.

@lhy11009 lhy11009 changed the title skip expensive iterations by user choice Tolerate linear solver failure in non-linear schemes Jan 23, 2026
@lhy11009
Copy link
Contributor Author

lhy11009 commented Jan 23, 2026

Both of you pointed out that what we should aim for is to “tolerate linear solver failure in non-linear schemes.” Based on this, I have changed the title of this PR.

@gassmoeller suggests an implementation of storing and checking the Stokes solution from the last step as a safety check. This would make particular sense when the non-linear and linear solver tolerances are set to the same value. In several of my examples, the linear solver fails during the first few non-linear iterations but does converge in the final iteration, just before the overall non-linear solver converges. Which is kind of cool.

Looking forward to the discussion next Monday.

@lhy11009 lhy11009 force-pushed the skip_expensive_interations branch from 49f268e to 26f828c Compare March 6, 2026 01:58
@lhy11009 lhy11009 changed the title Tolerate linear solver failure in non-linear schemes Tolerate linear solver failure in non-linear schemes (WIP) Mar 6, 2026
@lhy11009
Copy link
Contributor Author

lhy11009 commented Mar 6, 2026

I tried to follow the lines of discussion we had, and now the implementation works partially. I also added a test using this new strategy. In this test, the nonlinear solver converges in one step after ~40 iterations, despite the linear solver reporting a failure.

However, I ran into an issue regarding the implementation.

Ideally, we would only modify the nonlinear solver so that it continues the iteration even if the linear solver fails. But I realized that this does not work in practice: if we only change how the nonlinear solver handles exceptions, the linear solver will still throw the exception before updating the residual, so the nonlinear solver cannot proceed correctly.

Instead, in this example I went to the lowest level of the linear solver, where the exception is first thrown, and modified it so that it continues from there.

This might not be the best design choice, so I am very open to suggestions on how this should be handled.

Copy link
Member

@gassmoeller gassmoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your assessment, we need to indeed do this at the level of the linear solver. I think your general approach is correct, but I have some comments to the code (see below).

Also: could you model this approach after what we do with nonlinear_failure_strategy? I.e. use an enum type to allow more possible behaviors in the future (e.g. we could extend this at some point to cut the timestep if the linear solver fails).

Also please introduce a variable similar to nonlinear_solver_failures that keeps track of how often the linear solver failed, and then output that number at the end of the simulation like done in simulator/core.cc:1280.

"`single Advection, single Stokes' or `single Advection, no Stokes'.");

// todo_solver
prm.declare_entry ("Continue nonlinear solver loop after linear solver failure", "false",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didnt think about this earlier, but could you model this after the parameter nonlinear_solver_failure_strategy?

I.e. the name of the parameter should be Linear solver failure strategy,
its type should be Patterns::Selection("continue|abort") and you can create a type named LinearSolverFailureStrategy in parameters.h (just like for NonlinearSolverFailureStrategy).

Copy link
Contributor Author

@lhy11009 lhy11009 Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of continue, I renamed this entry to continue_with_nonlinear_solver. The reason is that using continue as an enum conflicts syntactically with the continue keyword in C++.

In the same NonlinearSolverFailureStrategy enum, there is another option named "abort program". For that entry, I followed your suggestion and renamed it to simply abort.

@lhy11009 lhy11009 force-pushed the skip_expensive_interations branch from 7ef9e8b to a7c84dc Compare March 12, 2026 19:14
@lhy11009 lhy11009 changed the title Tolerate linear solver failure in non-linear schemes (WIP) Tolerate linear solver failure in non-linear schemes Mar 12, 2026
{
case Parameters<dim>::LinearSolverFailureStrategy::continue_with_nonlinear_solver:
{
this->get_pcout() << " linear solver failed (GMG), continuing" << std::endl;
Copy link
Contributor Author

@lhy11009 lhy11009 Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have put this both in the AMG and the BMG solver, and tweaked the outputs slightly to show these differences. I also have separate tests for these different linear solver schemes.

exc,
mpi_communicator,
parameters.output_directory+"solver_history.txt");
break;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, I would get a warning if I don't have this break, but I think as long as the previous "throw" is executed, this break is never reached, is that correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is correct. It is good practice to end every case statement with break otherwise the other cases will be checked as well (unless of course that is the intended behavior). But throwing the exception will exit this function immediately, so we never reach the break statement.

@lhy11009
Copy link
Contributor Author

@gassmoeller Done with replying to your comments. I also left a couple of open comments and questions.

Copy link
Member

@gassmoeller gassmoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is the right behavior.

I would like to ask for the following changes, just to improve the safety of the functionality:

  1. Add a variable linear_solver_failures in simulator.h like the variable nonlinear_solver_failures. Search for all places where nonlinear_solver_failures is used and let linear_solver_failures behave the same way. Add a warning at the end of the model run (like in core.cc:2329) if any linear solver failures occured and report how many. This will document for the user how many times the linear solver failed (and hopefully push them to investigate if too many failures happened). Also I noticed that nonlinear_solver_failures is not serialized into the checkpoint for restarts. We should do that (and do the same for linear_solver_failures). Can you add both to checkpoint_restart.cc:609?

  2. Add a function to parameters.h similar to is_defect_correction but name it solve_Stokes_iteratively and check if any iterative Stokes solver scheme is selected. Then add something like the following in parameters.cc after reading the linear solver failure strategy:

if (linear_solver_failure_strategy == continue)
  AssertThrow(Parameters<dim>::solve_Stokes_iteratively(nonlinear_solver) == true, ExcMessage("You are not allowed to select the linear solver failure strategy 'continue' with a solver scheme that does not iterate the Stokes equations."));

This will not be quite as safe as I originally hoped for (checking if the last Stokes solver converged), but at least it will make sure no one can continue with a non-converged solution if the solver scheme does not at least try to iterate.

continue_with_nonlinear_solver,
abort
};
/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put an empty line here before the comment

exc,
mpi_communicator,
parameters.output_directory+"solver_history.txt");
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is correct. It is good practice to end every case statement with break otherwise the other cases will be checked as well (unless of course that is the intended behavior). But throwing the exception will exit this function immediately, so we never reach the break statement.

set Dimension = 2
set Start time = 0
set End time = 0
set Use years in output instead of seconds = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use the new spelling to avoid the deprecation warnings in the output:

Suggested change
set Use years in output instead of seconds = true
set Use years instead of seconds = true

set Dimension = 2
set Start time = 0
set End time = 0
set Use years in output instead of seconds = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above:

Suggested change
set Use years in output instead of seconds = true
set Use years instead of seconds = true

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants