Performance fix for FieldsIO #538
Conversation
But can't we make collective IO with _all work already for an unbalanced distribution? This would be a modification in the ...

Also, maybe we can already make the small modifications to make it work for Python <= 3.10 ...
I implemented the requests from @tlunet in the last commit:
```python
offset0 += self.tSize
_num_writes = 0
for (iVar, *iBeg) in itertools.product(range(self.nVar), *[range(n) for n in self.nLoc[:-1]]):
```
That's not exactly what I suggested ... all IO operations must be collective with _all. It's just that we have to ensure that all processes call MPI_WRITE_AT or MPI_READ_AT exactly the same number of times. It means that you have to add, after the first for loop:
```python
for _ in range(self.num_collective_IO - _num_writes):
    self.MPI_WRITE_AT(0, field[:0])  # should produce a no-op
```
That should avoid the deadlock.
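To make the suggestion concrete, here is a minimal standalone sketch of that padding pattern using mpi4py directly; the file name, offsets, and chunk counts are made up for illustration and are not the actual FieldsIO code.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
itemsize = np.dtype("float64").itemsize

# Pretend ranks own different numbers of contiguous chunks (uneven decomposition).
chunks = [np.full(4, rank, dtype="float64") for _ in range(1 + rank % 2)]

# All ranks agree on the maximum number of collective calls anyone will make.
num_collective_IO = comm.allreduce(len(chunks), op=MPI.MAX)

fh = MPI.File.Open(comm, "demo_field.dat", MPI.MODE_WRONLY | MPI.MODE_CREATE)
offset = rank * 8 * itemsize                      # illustrative, non-overlapping offsets
for chunk in chunks:
    fh.Write_at_all(offset, chunk)                # collective write of a real chunk
    offset += chunk.nbytes
for _ in range(num_collective_IO - len(chunks)):
    fh.Write_at_all(0, chunks[0][:0])             # zero-size buffer: keeps the collectives matched
fh.Close()
```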
Why is that better? I expect little performance hit from the non-collective IO here, because there are only a few non-collective operations relative to the total number of operations.
It still counts ... and with some decompositions you can end up with a majority of non-collective operations. For instance, take a 2D problem of size [10, 32] decomposed along the first direction into 6 sub-domains: 4 processes get 2 points and 2 processes get 1 point. Because you need to write contiguous chunks of data to the file, all processes do a collective write for their first point, but for the second point 4 processes are doing a non-collective operation (see the counting sketch below).
It's just better to always have all processes doing collective writes, with those that have nothing to write doing a no-op, and the MPI implementation is hopefully well done enough that nothing is lost on that.
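For concreteness, a tiny sketch of the counting in that example (plain Python, assuming the common "spread the remainder over the first ranks" split, which is an assumption rather than the exact pySDC decomposition):

```python
# 2D problem of size [10, 32], first direction split over 6 sub-domains.
nPoints, nProcs = 10, 6
base, rest = divmod(nPoints, nProcs)                      # 1 point each, 4 left over
nLoc = [base + (1 if p < rest else 0) for p in range(nProcs)]
print(nLoc)                                               # [2, 2, 2, 2, 1, 1]

# Each local point in the first direction is a separate contiguous chunk of 32
# values in the file, hence one write call per local point: 4 ranks make 2 calls,
# 2 ranks make only 1, so the second call cannot be collective without padding.
print(max(nLoc) - min(nLoc))                              # 1 unmatched call on the short ranks
```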
Actually, I found some inconsistent behaviour on my laptop when reading and writing nothing on some tasks: if I print something, it goes through, but if I don't, it deadlocks. I don't understand this behaviour. I suggest we leave this with some non-collective IO operations and, if you want, you can fix it when you return from vacation.
I found where the problem was: self.num_collective_IO must be evaluated at the beginning of readField or addField. If instead the property is only evaluated in the second for loop, it calls an allreduce, but at that point the processes having the maximum number of collective IO calls are still launching a write_at_all or a read_at_all, which creates the deadlock.
Now that the allreduce is done before any other collective operation, the tests pass on my laptop. Let's see what happens with the GitHub tests ...
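A compressed sketch of that ordering issue, assuming num_collective_IO is an allreduce-backed property as described; the class body and offsets are illustrative, not the actual FieldsIO implementation:

```python
import itertools
import numpy as np
from mpi4py import MPI

class FieldsIOSketch:
    def __init__(self, comm, nVar, nLoc):
        self.comm, self.nVar, self.nLoc = comm, nVar, nLoc

    @property
    def num_collective_IO(self):
        # Collective: every rank must reach this allreduce at the same point.
        nLocalWrites = self.nVar * int(np.prod(self.nLoc[:-1]))
        return self.comm.allreduce(nLocalWrites, op=MPI.MAX)

    def addField(self, fh, offset0, field):
        # Evaluate the collective property ONCE, before any write_at_all. If it were
        # evaluated lazily in the padding loop below, ranks that finish their first
        # loop early would sit in the allreduce while the "long" ranks are still in
        # Write_at_all -- mismatched collectives, hence the deadlock described above.
        nMaxWrites = self.num_collective_IO
        nWrites = 0
        for iVar, *iBeg in itertools.product(range(self.nVar), *[range(n) for n in self.nLoc[:-1]]):
            chunk = np.ascontiguousarray(field[(iVar, *iBeg)])
            fh.Write_at_all(offset0, chunk)          # offsets simplified for the sketch
            offset0 += chunk.nbytes
            nWrites += 1
        for _ in range(nMaxWrites - nWrites):
            fh.Write_at_all(0, field.ravel()[:0])    # matching no-op collective write
```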
Seems like all the relevant tests passed. Feel free to merge it when you want, I'll update #534.
Turns out that collective IO is much faster than individual operations when using FieldsIO within pySDC runs on HPC. This PR switches from individual to collective IO where possible and raises a warning where it is not. To be specific, you can generate a decomposition with an uneven amount of data on the tasks, which prevents collective IO and probably leads to load balancing issues in other places as well; in that case you get a warning telling you to change the resolution / number of tasks.
There are a few other small things in this PR, such as adding some missing pytest markers.
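One way such a check-and-warn could look, purely as a hedged illustration (the function name, criterion, and message are assumptions, not the actual FieldsIO code):

```python
import warnings
from mpi4py import MPI

def can_use_collective_IO(comm, nLocalWrites):
    """Return True only if every rank issues the same number of IO calls."""
    nMin = comm.allreduce(nLocalWrites, op=MPI.MIN)
    nMax = comm.allreduce(nLocalWrites, op=MPI.MAX)
    if nMin != nMax:
        warnings.warn(
            "Uneven data distribution across tasks prevents collective IO; "
            "consider changing the resolution or the number of tasks."
        )
        return False
    return True
```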