Performance fix for FieldsIO #538
Conversation
But can't we make collective IO with _all work already for an unbalanced distribution? This would be a modification in the ...

Also, maybe we can already make the small modifications to make it work for Python <= 3.10 ...
I implemented the requests from @tlunet in the last commit:
```python
offset0 += self.tSize
_num_writes = 0
for (iVar, *iBeg) in itertools.product(range(self.nVar), *[range(n) for n in self.nLoc[:-1]]):
```
That's not exactly what I suggested ... all IO operations must be collective with _all. It's just that we have to ensure that all processes call MPI_WRITE_AT or MPI_READ_AT exactly the same number of times. It means that you have to add, after the first for loop:
```python
for _ in range(self.num_collective_IO - _num_writes):
    self.MPI_WRITE_AT(0, field[:0])  # should produce a no-op
```
That should avoid the deadlock.
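To make the suggestion concrete, here is a minimal standalone sketch of that padding pattern using mpi4py directly; the file name, offsets, and chunk counts are made up for illustration and are not the actual FieldsIO code.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
itemsize = np.dtype("float64").itemsize

# Pretend ranks own different numbers of contiguous chunks (uneven decomposition).
chunks = [np.full(4, rank, dtype="float64") for _ in range(1 + rank % 2)]

# All ranks agree on the maximum number of collective calls anyone will make.
num_collective_IO = comm.allreduce(len(chunks), op=MPI.MAX)

fh = MPI.File.Open(comm, "demo_field.dat", MPI.MODE_WRONLY | MPI.MODE_CREATE)
offset = rank * 8 * itemsize                      # illustrative, non-overlapping offsets
for chunk in chunks:
    fh.Write_at_all(offset, chunk)                # collective write of a real chunk
    offset += chunk.nbytes
for _ in range(num_collective_IO - len(chunks)):
    fh.Write_at_all(0, chunks[0][:0])             # zero-size buffer: keeps the collectives matched
fh.Close()
```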
Why is that better? I expect little performance hit from the non-collective IO here, because there are only a few non-collective operations relative to the total number of operations.
It still counts ... and with some decompositions you can end up with a majority of non-collective operations. For instance, take a 2D problem of size [10, 32] decomposed along the first direction into 6 sub-domains: 4 processes get 2 points and 2 processes get 1 point. Because you need to write contiguous chunks of data to the file, all processes do a collective write for their first point, but for the second point 4 processes are doing a non-collective operation (see the counting sketch below).
It's just better to always have all processes doing collective writes, with those that have nothing to write doing a no-op, and the MPI implementation is hopefully well done enough that nothing is lost on that.
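For concreteness, a tiny sketch of the counting in that example (plain Python, assuming the common "spread the remainder over the first ranks" split, which is an assumption rather than the exact pySDC decomposition):

```python
# 2D problem of size [10, 32], first direction split over 6 sub-domains.
nPoints, nProcs = 10, 6
base, rest = divmod(nPoints, nProcs)                      # 1 point each, 4 left over
nLoc = [base + (1 if p < rest else 0) for p in range(nProcs)]
print(nLoc)                                               # [2, 2, 2, 2, 1, 1]

# Each local point in the first direction is a separate contiguous chunk of 32
# values in the file, hence one write call per local point: 4 ranks make 2 calls,
# 2 ranks make only 1, so the second call cannot be collective without padding.
print(max(nLoc) - min(nLoc))                              # 1 unmatched call on the short ranks
```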
Actually, I found some inconsistent behaviour on my laptop when reading and writing nothing on some tasks: if I print something, it goes through, but if I don't, it deadlocks. I don't understand this behaviour. I suggest we leave this with some non-collective IO operations and, if you want, you can fix it when you return from vacation.
I found where the problem was: self.num_collective_IO must be evaluated at the beginning of readField or addField. If instead the property is only evaluated in the second for loop, it calls an allreduce, but at that point the processes having the maximum number of collective IO calls are still launching a write_at_all or a read_at_all, which creates the deadlock.
Now that the allreduce is done before any other collective operation, the tests pass on my laptop. Let's see what happens with the GitHub tests ...
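A compressed sketch of that ordering issue, assuming num_collective_IO is an allreduce-backed property as described; the class body and offsets are illustrative, not the actual FieldsIO implementation:

```python
import itertools
import numpy as np
from mpi4py import MPI

class FieldsIOSketch:
    def __init__(self, comm, nVar, nLoc):
        self.comm, self.nVar, self.nLoc = comm, nVar, nLoc

    @property
    def num_collective_IO(self):
        # Collective: every rank must reach this allreduce at the same point.
        nLocalWrites = self.nVar * int(np.prod(self.nLoc[:-1]))
        return self.comm.allreduce(nLocalWrites, op=MPI.MAX)

    def addField(self, fh, offset0, field):
        # Evaluate the collective property ONCE, before any write_at_all. If it were
        # evaluated lazily in the padding loop below, ranks that finish their first
        # loop early would sit in the allreduce while the "long" ranks are still in
        # Write_at_all -- mismatched collectives, hence the deadlock described above.
        nMaxWrites = self.num_collective_IO
        nWrites = 0
        for iVar, *iBeg in itertools.product(range(self.nVar), *[range(n) for n in self.nLoc[:-1]]):
            chunk = np.ascontiguousarray(field[(iVar, *iBeg)])
            fh.Write_at_all(offset0, chunk)          # offsets simplified for the sketch
            offset0 += chunk.nbytes
            nWrites += 1
        for _ in range(nMaxWrites - nWrites):
            fh.Write_at_all(0, field.ravel()[:0])    # matching no-op collective write
```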
Seems like all the relevant tests passed. Feel free to merge it when you want, I'll update #534.
Turns out that collective IO is much faster than individual operations when using FieldsIO within pySDC runs on HPC. This PR switches from individual to collective IO where possible and raises a warning where it is not. To be specific, you can generate a decomposition with an uneven amount of data on the tasks, which prevents collective IO and probably leads to load balancing issues in other places as well; in that case you get a warning telling you to change the resolution / number of tasks.
There are a few other small things in this PR, such as adding some missing pytest markers.
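One way such a check-and-warn could look, purely as a hedged illustration (the function name, criterion, and message are assumptions, not the actual FieldsIO code):

```python
import warnings
from mpi4py import MPI

def can_use_collective_IO(comm, nLocalWrites):
    """Return True only if every rank issues the same number of IO calls."""
    nMin = comm.allreduce(nLocalWrites, op=MPI.MIN)
    nMax = comm.allreduce(nLocalWrites, op=MPI.MAX)
    if nMin != nMax:
        warnings.warn(
            "Uneven data distribution across tasks prevents collective IO; "
            "consider changing the resolution or the number of tasks."
        )
        return False
    return True
```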