Description
I have run a stress test of the new AiiDA daemon after the replacement of tornado with asyncio.
I updated to the most recent develop (716a1d8), also updated aiida-qe to develop (commit 1a9713aefbcd235c20ecfb65d0df226b5544bf7d of that repo), pip installed both, ran reentry scan, and stopped and restarted the daemon.
I'll also roughly describe what I did.
I prepared a script to submit on the order of 2000 relax workflows.
While the submission was running, I quickly hit the limit on available worker slots (the warning message at the end of verdi process list reported usage above 100%), so I ran verdi daemon incr 7 to work with 8 workers.
After submitting more than half of the workflows, I stopped, because 8 workers still weren't enough and I didn't want to overload the supercomputer with too many connections from too many workers.
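As a rough illustration of the slot arithmetic behind that warning, here is a minimal sketch. The function names are hypothetical (they are not part of AiiDA), and the figure of 200 slots per worker is an assumption about the default configuration, not something measured in this test; adjust it to your own setting.

```python
import math


def slot_usage_percent(active_processes: int, workers: int,
                       slots_per_worker: int = 200) -> float:
    """Return the percentage of daemon worker slots in use.

    slots_per_worker=200 is an assumed default, not a value taken
    from this report.
    """
    if workers < 1 or slots_per_worker < 1:
        raise ValueError("workers and slots_per_worker must be >= 1")
    return 100.0 * active_processes / (workers * slots_per_worker)


def workers_needed(active_processes: int, slots_per_worker: int = 200) -> int:
    """Return the smallest worker count that keeps usage at or below 100%."""
    if slots_per_worker < 1:
        raise ValueError("slots_per_worker must be >= 1")
    return max(1, math.ceil(active_processes / slots_per_worker))


if __name__ == "__main__":
    # Illustrative numbers only: 1000 active processes.
    print(slot_usage_percent(1000, workers=1))
    print(slot_usage_percent(1000, workers=8))
    print(workers_needed(2000))
```

Under these assumptions, a couple of thousand simultaneously active processes would need more workers than the 8 I was running, which is consistent with the slots filling up.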
I left it running overnight. The next morning it had stalled, with all slots taken, so I increased the number of workers a bit more and, after a while, submitted the rest of the workflows and let them finish.
Since I realised that most were excepting (see below), I also stopped the daemon (that took a while), made sure it was stopped, and started it again with just one worker to finish the work.
Unfortunately, I have seen a number of issues ( :-( ), and most calculations had some kind of problem. Pinging @sphuber @unkcpz @muhrin: since they have been working on this, they should be able to help debug and fix the bugs.
I am going to report some of the issues I'm seeing below as separate comments, but I'm not sure how to debug further, so if you need specific logs please let me know what to run (or @sphuber, I can give you temporary access to the machine if that's easier).
As I write, the last few (~30) jobs are still finishing, but I can already start reporting the issues I see.