👌 `PwBaseWorkChain`: Improve handling of SCF convergence issues by mbercx · Pull Request #987 · aiidateam/aiida-quantumespresso

mbercx · 2023-11-24T13:02:59Z

Fixes #961

The current error handler for the PwBaseWorkChain that deals with SCF convergence issues only tries to reduce the mixing-beta by 20% each time the calculation fails, and doesn't consider the rate of convergence.

Here we improve the error handling for this case by calculating the slope on the logarithm of the "scf accuracy". Based on some statistics presented in #961, we noted two points:

Most of the calculations that converge do so within 50 SCF steps.
For those that don't, there is a very low chance of convergence in case the scf accuracy slope is higher than -0.1.

Hence, we:

Limit the default number of maximum steps (electron_maxstep) to 50.
In case of SCF convergence failure, check the scf accuracy slope. If < -0.1, do a full restart.
Else, check the mixing_beta. If above 0.1, half it and restart the calculation.

TODO

Update tests once we agree on the approach.
Add similar logic for the SCF failures during relaxations. We should probably gather the logic in a method and call this in both error handlers.

bastonero

Thanks a lot @mbercx for the draft! I've already left some comments/feedbacks as I would love to see this merged soon! Let me know.
We can also think of starting merging a first solution and refine it after we play a bit around with this?

bastonero · 2023-11-24T13:14:49Z

src/aiida_quantumespresso/workflows/protocols/pw/base.yaml

                degauss: 0.01
            ELECTRONS:
-                electron_maxstep: 80
+                electron_maxstep: 50


Shall we set it a bit larger, e.g. 60, as to include 99% of cases instead? Or you say that eventually one would just restart the calculation if the slope is good?

That was my thinking. And we would be a bit more pre-emptive in fixing ones that do not, so potentially save time that would otherwise be wasted. In the end I suppose the question is what we prefer: avoiding restarts or avoiding wasted resources? With a slope threshold of -0.1, it's very likely that we'll catch all the calculations who just need more steps, and those rare edge cases might still benefit from some reduced mixing and hence not be lost.

That said, 60 is probably also fine (it includes 98.8% of cases, so 1.3% more ^^).

Then again... I was redoing the analysis on the slopes, but then instead of looking at the slopes over all steps, look only at the 50 we actually calculated in case we'd set electron_maxstep to 50:

Now we see quite a few with slopes > -0.1 that do converge. Looking at the evolution of some of their scf accuracies:

We can see that indeed, only after around step 50 the SCF "catches its groove", and the convergence begins accelerating. Maybe this would also happen if we restart from the charge density and reduce the mixing, but that isn't certain and it'd most likely be slower.

You can play round with the "max nstep" to consider. Of course as you increase it, you have fewer cases where the slope before "max nstep" is > -0.1 and the calculation converges, but you have fewer cases of successful convergence between "max step" and 80 steps as well.

So we could ask two questions: if I would have set electron_maxstep to X, and used -0.1 as the slope threshold

what's the chance that we did an "incorrect stop", i.e. the slope is > -0.1, but the calculation converges in 80 steps?

what's the chance that we did an "incorrect restart", i.e. the slope is < -0.1, but the calculation fails to converge in 80 steps?

max nstep # incorrect stop % incorrect stop # incorrect restart % incorrect restart

70 13 6.3 1153 46.2

60 27 5.4 1292 51.7

50 54 5.3 1425 57.0

40 61 2.9 1584 63.4

30 72 1.6 1700 68.1

20 80 0.7 1919 76.8

But frankly, these % data are also skewed by the fact that we only have data up to step 80, i.e. if we had data up to 200 steps the % incorrect stop would go down for all "max nstep", but most likely more for 70 than for 40.

So to do this analysis "right", we'd need to gather statistics for a larger electron_maxstep. Moreover, the question is what happens if we "incorrectly stop" a calculation and decide to e.g. reduce the mixing. It's all quite a lot of analysis without too much benefit in the end, I think. 60 seems like a reasonable choice to me, so I'll pick that unless there are objections.

One note though: it's clear from the "scf accuracy" curves above that extrapolation may not always give a reasonable value. I'd keep it simple: if the slope < -0.1, just double the electron_maxstep to 120 and restart from the last charge density.

Thanks @mbercx for the analysis. I must say I couldn't completely follow the arguments. But the figure you showed is also interesting. Maybe we should compute the slope only on e.g. 20 of the last steps, and look at the slope, and decide whether to restart or to change the mixing mode?

Yeah, it was a little late, so maybe my explanation wasn't the best. ^^ Let me summarize:

In my previous analysis I used the slope over all 80 steps to see if we can decide after 50 steps to continue with more steps or switch to another strategy. That's of course not correct, since at step 50 we'd only have information about the first 50. If we look at the slope distributions for the first 50 steps (first image in my post above), we can see that there are more cases where selecting -0.1 as the threshold would lead to our approach switching to another strategy (e.g. adapting mixing_beta), or doing an "incorrect stop".

I was curious if you could "optimize" the choice for electron_maxstep, i.e. minimize the % of incorrect stops. But to judge this properly I think we need to have data for larger electron_maxstep runs, since we don't really know which of the failed runs would have converged if we had run with more than 80 steps. Also: we don't know if switching to e.g. a lower mixing_beta would improve or worsen the convergence after step 50. So perhaps we shouldn't overdo this analysis for now.

Maybe we should compute the slope only on e.g. 20 of the last steps

I was also thinking of this approach. Here is the analysis:

$ ./gather_stats.py slopes --last 20 --max-nstep 50 workchain/PBEsol/scf Report: Number of completed pw.x with > 50 SCF steps: 3524 Report: Calculating slopes from step 30 to 50. Report: Incorrect stops: 130 | 12.7% of 1026 successful runs with > 50 steps Report: Incorrect restarts: 834 | 33.4% of 2498 failed runs

So based on this setup (information about 80 steps, trial "max nstep" 50, slope threshold -0.1), only checking the last 20 steps increases the # of incorrect stops, but reduces the # of incorrect restarts. Below are some examples of runs where we would incorrectly stop:

So not sure if this really improves things. I suppose it depends what we prioritise: convergence or saving resources.

Thanks @mbercx ! Ok, I see now. Well, the the point is also whether a change of strategy (e.g. mixing beta) will ever produce significant change in the first 50 (or 60, or whatever) iterations. There is a high chance to keep trying changing strategy, while having just some iterations more would solve the convergence.
So I don't know how much we are going to save resources if we are not really sure to fix with a certain probability the problem. So I would vote to even increase such maximum number of iterations (which btw on QE I think the default is to 100).

bastonero · 2023-11-24T13:17:11Z

src/aiida_quantumespresso/workflows/pw/base.py

        'delta_threshold_degauss': 30,
        'delta_factor_degauss': 0.1,
-        'delta_factor_mixing_beta': 0.8,
+        'delta_factor_mixing_beta': 0.5,


So if we start from 0.4 then the next will be: 0.2, 0.1, 0.05, ... Nevertheless, 0.05 is rather uncommon. Shall we set something to reproduce e.g. 0.3, 0.2, 0.1, 0.05? Or probably just set a list of them? Ok, here I don't consider starting from a different mixing beta. What do you thik?

Actually coming back here - probably it is just best to leave 0.5 as a factor, son one explores quickly the possible betas. No need to stay too close I guess.

Frankly, I'm not sure either. My thinking is that in case reducing the mixing improves the convergence, better to reduce a bit more and then check if the slope is < -0.1 after 50 steps and then give it more max steps.

Yeah makes sense, something like in the ph.x code, where you can chnage mixing beta at a certain point

bastonero · 2023-11-24T13:18:09Z

src/aiida_quantumespresso/workflows/pw/base.py

        'delta_factor_max_seconds': 0.95,
        'delta_factor_nbnd': 0.05,
        'delta_minimum_nbnd': 4,
+        'conv_slope_threshold': -0.1,


max_electronic_steps too? E.g. ~200-250?

src/aiida_quantumespresso/workflows/pw/base.py

bastonero · 2023-11-24T13:22:21Z

src/aiida_quantumespresso/workflows/pw/base.py

+        import numpy
+
+        scf_accuracy = calculation.tools.get_scf_accuracy(0)
+        scf_accuracy_slope = numpy.polyfit(numpy.arange(0, len(scf_accuracy)), numpy.log(scf_accuracy), 1)[0]


I would also take the intercept to use for extrapolating the amount of extra steps to perform. Or otherwise, we can simply increase it to a fixed maximum? E.g. electrons_maxstep = self.defaults.max_electronic_steps?

bastonero · 2023-11-24T13:23:17Z

src/aiida_quantumespresso/workflows/pw/base.py

+        scf_accuracy_slope = numpy.polyfit(numpy.arange(0, len(scf_accuracy)), numpy.log(scf_accuracy), 1)[0]
+
+        if scf_accuracy_slope < self.defaults.conv_slope_threshold:
+


Update ELECTRONS.electron_maxstep. Discussion needed

bastonero · 2023-11-24T13:29:41Z

src/aiida_quantumespresso/workflows/pw/base.py

+            self.set_restart_type(RestartType.FULL, calculation.outputs.remote_folder)
+            self.report_error_handled(calculation, action)
+            return ProcessHandlerReport(True)



mixing_mode = self.ctx.inputs.parameters['ELECTRONS'].get_default('mixing_beta', 'plain') if mixing_mode == 'plain': self.ctx.inputs.parameters['ELECTRONS']['mixing_mode'] = 'local-TF' [...] mixing_ndim = self.ctx.inputs.parameters['ELECTRONS'].get_default('mixing_ndim', 8) if mixing_ndim <= 8: self.ctx.inputs.parameters['ELECTRONS']['mixing_ndim'] = 20 [...]

I would try also these solutions. Maybe at the same time?

AndresOrtegaGuerrero · 2024-05-25T08:42:57Z

@mbercx i was wondering how is the size of the system? the chemistry (elements) , did you consider systems with rare elements ? and/or dft+u calculations ?

mbercx · 2025-09-24T06:59:43Z

@mbercx i was wondering how is the size of the system? the chemistry (elements) , did you consider systems with rare elements ? and/or dft+u calculations ?

Ciao @AndresOrtegaGuerrero! Apologies, as usual something else came up and we never got this to 100%, and now I don't remember much. 😓

My plan is to pick this PR up again and so some field testing. Frankly, almost anything is better than the current approach of slowly changing the mixing.

But I can already see this issue: what will actually be a good strategy will depend on the system indeed, and we might not always be able to figure that out automatically. It's clear to me that having a hard-coded approach for error handlers is just not flexible enough in many cases, see also aiidateam/aiida-core#4765. But if fixing that still makes sense will have to be seen.

AndresOrtegaGuerrero · 2025-09-24T08:29:22Z

@mbercx i was wondering how is the size of the system? the chemistry (elements) , did you consider systems with rare elements ? and/or dft+u calculations ?

Ciao @AndresOrtegaGuerrero! Apologies, as usual something else came up and we never got this to 100%, and now I don't remember much. 😓

My plan is to pick this PR up again and so some field testing. Frankly, almost anything is better than the current approach of slowly changing the mixing.

But I can already see this issue: what will actually be a good strategy will depend on the system indeed, and we might not always be able to figure that out automatically. It's clear to me that having a hard-coded approach for error handlers is just not flexible enough in many cases, see also aiidateam/aiida-core#4765. But if fixing that still makes sense will have to be seen.

One thing to add is that if there are lanthanides present, we might need to increase the number of steps to about 3–4 times the default value (default is 50). These systems can sometimes be a bit noisy, but they usually converge with more steps. In the app, we now have control over both the number of steps and the mixing mode, which gives us ways to overcome convergence issues.

mbercx · 2025-09-24T08:38:12Z

Thanks @AndresOrtegaGuerrero! I'll first focus on getting the documentation in a semi-decent shape, while I test some of the approaches here in the background. At some point we can have a meeting and more ideas if needed.

The current error handler for the `PwBaseWorkChain` that deals with SCF convergence issues only tries to reduce the `mixing-beta` by 20% each time the calculation fails, and doesn't consider the rate of convergence. Here we improve the error handling for this case by calculating the slope on the logarithm of the "scf accuracy". Based on some statistics presented in aiidateam#961 We noted two points: 1. Most of the calculations that converge do so within 50 SCF steps. 2. For those that don't, there is a very low chance of convergence in case the scf accuracy slope is higher than -0.1. Hence, we: * Limit the default number of maximum steps (`electron_maxstep`) to 50. * In case of SCF convergence failure, check the scf accuracy slope. If < -0.1, do a full restart. * Else, check the `mixing_beta`. If above 0.1, half it and restart the calculation.

This was referenced Nov 24, 2023

PwBaseCalculations number of electron_maxstep is fixed and repeat the same simulation #961

Open

PhBaseWorkChain: better handler for convergence #986

Merged

bastonero reviewed Nov 24, 2023

View reviewed changes

mbercx force-pushed the fix/961/pw-scf-conv branch from 8f1066a to 33a7259 Compare November 29, 2023 00:01

mbercx force-pushed the main branch from 1c5b365 to 8a13266 Compare February 3, 2025 12:15

GeigerJ2 mentioned this pull request Sep 9, 2025

Usability: Changes to the documentation aiidateam/team-compass#27

Open

5 tasks

mbercx force-pushed the fix/961/pw-scf-conv branch from 4d15f6d to f9d7c65 Compare September 25, 2025 04:09

max nstep	# incorrect stop	% incorrect stop	# incorrect restart	% incorrect restart
70	13	6.3	1153	46.2
60	27	5.4	1292	51.7
50	54	5.3	1425	57.0
40	61	2.9	1584	63.4
30	72	1.6	1700	68.1
20	80	0.7	1919	76.8

		scf_accuracy_slope = numpy.polyfit(numpy.arange(0, len(scf_accuracy)), numpy.log(scf_accuracy), 1)[0]

		if scf_accuracy_slope < self.defaults.conv_slope_threshold:

Conversation

mbercx commented Nov 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TODO

Uh oh!

bastonero left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AndresOrtegaGuerrero commented May 25, 2024

Uh oh!

mbercx commented Sep 24, 2025

Uh oh!

AndresOrtegaGuerrero commented Sep 24, 2025

Uh oh!

mbercx commented Sep 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mbercx commented Nov 24, 2023 •

edited

Loading