Conversation

@brownbaerchen (Contributor)

It turns out that collective IO is much faster than individual operations when using FieldsIO in pySDC runs on HPC. This PR switches from individual to collective IO where possible and raises a warning where it is not. To be specific, you can generate a decomposition with an uneven amount of data across tasks, which prevents collective IO and probably causes load-balancing issues elsewhere as well; in that case you get a warning telling you to change the resolution / number of tasks.
There are a few other small things in this PR, such as adding some missing pytest markers.
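
For context, the difference between the two IO modes looks roughly like this in mpi4py (a minimal sketch, not the FieldsIO implementation; the file name, buffer sizes and offsets are made up):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = np.full(4, rank, dtype='d')   # each rank contributes 4 doubles
offset = rank * data.nbytes          # contiguous, evenly sized chunks

fh = MPI.File.Open(comm, 'example.dat', MPI.MODE_CREATE | MPI.MODE_WRONLY)

# Either call writes the same bytes; only one would be used in practice.
fh.Write_at(offset, data)        # individual IO: each rank writes on its own
fh.Write_at_all(offset, data)    # collective IO: all ranks participate, so MPI can aggregate requests

fh.Close()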

@tlunet (Member) commented Mar 31, 2025

But can't we make collective IO with _all work for an unbalanced distribution already? This would be a modification in the addField and readField methods, and I'm not sure why it should be very complicated ...

@tlunet (Member) commented Mar 31, 2025

Also, maybe we can already make the small modifications needed to make it work for Python <= 3.10 ...

@brownbaerchen (Contributor, Author)

I implemented the requests by @tlunet in the last commit:

  • With unbalanced distributions, as much of the IO as possible now uses the fast collective operations; the warning about unbalanced distributions is gone
  • Changed the unpacking syntax to be compatible with older Python versions, which now also test FieldsIO (see the sketch below)
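
The construct at issue is presumably star-unpacking inside a subscript, which only parses from Python 3.11 on (PEP 646); building the index tuple explicitly works on older versions. A hypothetical illustration of that pattern, not a quote from the diff:

import itertools
import numpy as np

field = np.zeros((2, 4, 8))               # (nVar, *nLoc) layout, made-up sizes
nVar, nLoc = field.shape[0], field.shape[1:]

for (iVar, *iBeg) in itertools.product(range(nVar), *[range(n) for n in nLoc[:-1]]):
    # field[iVar, *iBeg] is a SyntaxError before Python 3.11;
    # wrapping the index in an explicit tuple is valid since Python 3.5.
    chunk = field[(iVar, *iBeg)]
    assert chunk.shape == (nLoc[-1],)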

offset0 += self.tSize

_num_writes = 0
for (iVar, *iBeg) in itertools.product(range(self.nVar), *[range(n) for n in self.nLoc[:-1]]):
@tlunet (Member) commented Mar 31, 2025

That's not exactly what I suggested ... all IO operations must be collective with _all. It's just that we have to ensure that all processes call MPI_WRITE_AT or MPI_READ_AT exactly the same number of times. That means you have to add, after the first for loop:

for _ in range(self.num_collective_IO - _num_writes):
    self.MPI_WRITE_AT(0, field[:0]) # should produce a no-op

That should avoid the deadlock

@brownbaerchen (Contributor, Author)

Why is that better? I expect only a small performance hit from the non-collective IO here, because there are few non-collective operations relative to the total number of operations.

@tlunet (Member)

It still counts ... and with some decompositions you can have a majority of non-collective operations. For instance, take a 2D problem of size [10, 32] decomposed in the first direction into 6 sub-domains => 4 processes have 2 points and 2 have 1 point. Because you need to write contiguous chunks of data to the file, in the first loop iteration all processes do a collective write for their first point, but for the second point the 4 remaining processes have to do a non-collective operation.

It's just better to always have all processes doing collective writes, with those that have nothing left to write doing a no-op; the MPI implementation is hopefully well done enough not to lose anything on that.
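
The counting in that example can be reproduced in a few lines (a standalone illustration, assuming the usual block decomposition of 10 points over 6 ranks):

nGlobal, nRanks = 10, 6
base, rest = divmod(nGlobal, nRanks)       # 10 = 6*1 + 4
nLoc = [base + (1 if r < rest else 0) for r in range(nRanks)]
print(nLoc)                                # [2, 2, 2, 2, 1, 1]

# One contiguous row of 32 values is written per local point in the split
# direction, so ranks 0-3 issue two write calls while ranks 4-5 issue only
# one. Without padding, the second call cannot use the *_all variant because
# two ranks never reach it.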

@brownbaerchen (Contributor, Author)

Actually, I found some inconsistent behaviour when reading and writing nothing on some tasks on my laptop. Namely, if I print something, it goes through, but if I don't, it deadlocks. I don't understand this behaviour. I suggest we leave this with some non-collective IO operations and if you want, you can fix it when you return from vacation.

@tlunet (Member) commented Apr 1, 2025

I found where the problem was: self.num_collective_IO must be evaluated at the beginning of readField or addField. If instead the property is only evaluated in the second for loop, the lagging processes call an allreduce while the processes with the maximum number of collective IO operations are still issuing a write_at_all or a read_at_all, which creates the deadlock.

Now that the allreduce is done before any other collective operation, the tests pass on my laptop. Let's see what happens with the GitHub tests ...
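
A self-contained sketch of that ordering in mpi4py (variable names and the uneven chunk distribution are made up for illustration; this is not the FieldsIO code):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Hypothetical uneven distribution: ranks own either one or two chunks.
chunks = [np.full(8, rank, dtype='d') for _ in range(rank % 2 + 1)]
localNumWrites = len(chunks)

# The allreduce happens before any file IO, so it is the first collective
# call on every rank.
numCollectiveIO = comm.allreduce(localNumWrites, op=MPI.MAX)

fh = MPI.File.Open(comm, 'padded.dat', MPI.MODE_CREATE | MPI.MODE_WRONLY)
empty = np.empty(0, dtype='d')

for i in range(numCollectiveIO):
    if i < localNumWrites:
        offset = (rank * numCollectiveIO + i) * chunks[0].nbytes
        fh.Write_at_all(offset, chunks[i])   # real data
    else:
        fh.Write_at_all(0, empty)            # empty buffer: participates, writes nothing

fh.Close()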

@tlunet (Member) commented Apr 1, 2025

Seems like all the relevant tests passed. Feel free to merge it when you want; I'll update #534.

@pancetta merged commit 4116ae3 into Parallel-in-Time:master on Apr 1, 2025 (87 checks passed).
@brownbaerchen deleted the fieldsIO_fix branch on April 1, 2025.