-
Notifications
You must be signed in to change notification settings - Fork 30
Fix buffer overflow in CPU block_ldlt in find_maxloc on NaNs #234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@kkofler many thanks for the patch, could you rebase master onto your branch so it picks up our new ASan test? |
src/ssids/cpu/kernels/block_ldlt.hxx (spral::ssids::cpu::block_ldlt_internal::find_maxloc): Add the same sanity checks for (cloc < BLOCK_SIZE && rloc < BLOCK_SIZE) in the SIMD case as in the non-SIMD case. This fixes a buffer overflow in Ipopt when the input contains some NaNs. In that case, it can happen at this spot that cbest and rbest are still at their initial std::numeric_limits<int>::max() value, so cloc and rloc end up set to that, and so a[cloc*lda+rloc] is out of bounds. The result will still be garbage if we have NaNs, but at least SPRAL will not crash the entire application anymore. (spral::ssids::cpu::block_ldlt_internal::block_ldlt): Check whether the found maxloc is valid (t and m both < BLOCK_SIZE), throw a SingularError otherwise (which I think is a reasonable thing to claim when there were NaNs in the input). Otherwise we eventually exit the entire application with the pivsiz == 0 error case.
src/ssids/cpu/BuddyAllocator.hxx (spral::ssids::cpu::buddy_alloc_internal::Page::~Page): Remove invalid sanity check. When an exception is thrown, this condition is not satisfied.
7d8212d to
6299940
Compare
|
Done. |
mjacobse
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, setting bestv_out like that seems to make some sense to me. Right now I am unable to test this though, I assume this case will only happen with a full block of size BLOCK_SIZE full of NaN? Would you be able to provide a test matrix that causes this failure? Otherwise I would have to try to come up with one myself.
On a more general note I'm not quite sure what SSIDS is really supposed to do if there are NaN in the input. Surely a user should just clean that up, or is there any actual usecase for this? Though I agree that running into a buffer overflow is not the nicest way to (not) handle this of course. But maybe dealing with this in a preliminary check and returning an error directly would be simpler? Not sure.
| find_maxloc<T,BLOCK_SIZE>(p, a, lda, bestv, t, m); | ||
|
|
||
| // Handle case where find_maxloc failed (e.g., due to NaNs in the input) | ||
| if (t >= BLOCK_SIZE || m >= BLOCK_SIZE) throw SingularError(p); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure about this check. It unconditionally gives up with SingularError, ignoring the action setting. Also, if find_maxloc failed, the change above will cause bestv to be 0.0, and so fabs(bestv) < small will be true, so we should go into the handling of the singular case anyways, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if find_maxloc fails, no matter in what stage, it is always an error.
If I do not add this check, i.e., if I add only the bestv_out buffer overflow fix without this added check, then it continues attempting to compute with invalid indices (INT_MAX is not a valid index to work with) and eventually (I think in a later iteration, because in this one, we should have t==INT_MAX and m==INT_MAX and hence t==m, and so pivsiz should always be 1) runs into the
if(pivsiz == 0) {
// FIXME: debug remove
printf("broken!\n");
…
exit(1);
}crash case. Assuming it gets so far to begin with, because a11 = a[t*lda+t]; is also an invalid memory access if t is INT_MAX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But in my tests, the "broken!" error is what I hit without this added check.
| } | ||
| ~Page() noexcept(false) { | ||
| if(next_ && head_[nlevel-1] != 0) | ||
| throw std::runtime_error("outstanding allocations on cleanup\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand why this can just be removed and how it relates to the other changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe to just remove? To be honest, I do not know. I just know that it did not crash when I removed this check, so I can only hope it did not corrupt memory somewhere.
What I do know is that if I do not remove this check, then the exception I throw with the if (t >= BLOCK_SIZE || m >= BLOCK_SIZE) throw SingularError(p); in block_ldlt.hxx causes ~Page to be run in this (next_ && head_[nlevel-1] != 0) state, so it throws that std::runtime_error, and std::terminate gets called as a result. So something has to be done about that.
If removing the check is not safe, then we need another way to handle this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might also be that existing exceptions thrown in block_ldlt.hxx by this check:
if(fabs(bestv) < small) {
if(!action) throw SingularError(p);can also trigger this error case, but I am not sure about that either. It might also be that it only gets triggered because I am throwing an exception also in the case of a non-zero action (i.e., that that is the context in which the exception is unexpected). I do not understand the allocator usage well enough yet to know the answer.
|
I think it will not be easy to export the matrix. It comes from some complicated (and proprietary) Java code which calls Ipopt which calls SPRAL. I am not sure whether the block needs to be all NaN for the error to happen or just in some relevant places. And yes, I think checking for NaNs and returning an error before even starting the factorization would be another way to fix this crash. |
|
My view is this should be dealt with by the calling routine, not ssids. If I call an Lapack routine with NaNs, I don't expect to be told that the data is faulty, that is my job. Whether the fault is ipopt or the java program, it should be sorted up the chain.
|
|
Agreed, but at the same time SSIDS shouldn't be dying on a buffer overflow when it can't handle the input data. I think probably the best way to handle this is in a preliminary check that returns an error directly as @mjacobse suggests, otherwise there is a risk that with this PR we inadvertently break things lying very deep within the codebase. |
|
As I am sure you appreciate, adding checks comes with an overhead, and many users may already be doing so before the call. The analysis procedure has an argument check to enable checks if desired, but this is only for the integer data. It is a shame that this isn't a control parameter rather than an argument, as then doing the same for factorize wouldn't break existing calls. Of course, SSIDS is really showing its age, and time is approaching that |
This pull request fixes a buffer overflow in the SPRAL CPU code called from Ipopt when the input contains some NaNs, and 2 followup errors that occurred after I fixed the buffer overflow. With these 3 fixes, the input passed from Ipopt to SPRAL is declared singular by SPRAL at the rank in which the NaNs are encountered instead of crashing the entire application calling Ipopt.