Skip to content

Conversation

@kkofler
Copy link

@kkofler kkofler commented Apr 2, 2025

This pull request fixes a buffer overflow in the SPRAL CPU code called from Ipopt when the input contains some NaNs, and 2 followup errors that occurred after I fixed the buffer overflow. With these 3 fixes, the input passed from Ipopt to SPRAL is declared singular by SPRAL at the rank in which the NaNs are encountered instead of crashing the entire application calling Ipopt.

@amontoison
Copy link
Member

@mjacobse Can you review the PR?
@nimgould The modifications could be relevant for GALAHAD.

This PR could help to solve #218

@jfowkes
Copy link
Contributor

jfowkes commented Apr 2, 2025

@kkofler many thanks for the patch, could you rebase master onto your branch so it picks up our new ASan test?

kkofler added 2 commits April 2, 2025 14:55
src/ssids/cpu/kernels/block_ldlt.hxx
(spral::ssids::cpu::block_ldlt_internal::find_maxloc): Add the same
sanity checks for (cloc < BLOCK_SIZE && rloc < BLOCK_SIZE) in the SIMD
case as in the non-SIMD case. This fixes a buffer overflow in Ipopt when
the input contains some NaNs. In that case, it can happen at this spot
that cbest and rbest are still at their initial
std::numeric_limits<int>::max() value, so cloc and rloc end up set to
that, and so a[cloc*lda+rloc] is out of bounds. The result will still be
garbage if we have NaNs, but at least SPRAL will not crash the entire
application anymore.
(spral::ssids::cpu::block_ldlt_internal::block_ldlt): Check whether the
found maxloc is valid (t and m both < BLOCK_SIZE), throw a SingularError
otherwise (which I think is a reasonable thing to claim when there were
NaNs in the input). Otherwise we eventually exit the entire application
with the pivsiz == 0 error case.
src/ssids/cpu/BuddyAllocator.hxx
(spral::ssids::cpu::buddy_alloc_internal::Page::~Page): Remove invalid
sanity check. When an exception is thrown, this condition is not
satisfied.
@kkofler kkofler force-pushed the fix-buffer-overflow-in-block_ldlt branch from 7d8212d to 6299940 Compare April 2, 2025 12:55
@kkofler
Copy link
Author

kkofler commented Apr 2, 2025

Done.

Copy link
Collaborator

@mjacobse mjacobse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, setting bestv_out like that seems to make some sense to me. Right now I am unable to test this though, I assume this case will only happen with a full block of size BLOCK_SIZE full of NaN? Would you be able to provide a test matrix that causes this failure? Otherwise I would have to try to come up with one myself.

On a more general note I'm not quite sure what SSIDS is really supposed to do if there are NaN in the input. Surely a user should just clean that up, or is there any actual usecase for this? Though I agree that running into a buffer overflow is not the nicest way to (not) handle this of course. But maybe dealing with this in a preliminary check and returning an error directly would be simpler? Not sure.

find_maxloc<T,BLOCK_SIZE>(p, a, lda, bestv, t, m);

// Handle case where find_maxloc failed (e.g., due to NaNs in the input)
if (t >= BLOCK_SIZE || m >= BLOCK_SIZE) throw SingularError(p);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure about this check. It unconditionally gives up with SingularError, ignoring the action setting. Also, if find_maxloc failed, the change above will cause bestv to be 0.0, and so fabs(bestv) < small will be true, so we should go into the handling of the singular case anyways, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if find_maxloc fails, no matter in what stage, it is always an error.

If I do not add this check, i.e., if I add only the bestv_out buffer overflow fix without this added check, then it continues attempting to compute with invalid indices (INT_MAX is not a valid index to work with) and eventually (I think in a later iteration, because in this one, we should have t==INT_MAX and m==INT_MAX and hence t==m, and so pivsiz should always be 1) runs into the

      if(pivsiz == 0) {
         // FIXME: debug remove
         printf("broken!\n");
         …
         exit(1);
      }

crash case. Assuming it gets so far to begin with, because a11 = a[t*lda+t]; is also an invalid memory access if t is INT_MAX.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in my tests, the "broken!" error is what I hit without this added check.

}
~Page() noexcept(false) {
if(next_ && head_[nlevel-1] != 0)
throw std::runtime_error("outstanding allocations on cleanup\n");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand why this can just be removed and how it relates to the other changes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to just remove? To be honest, I do not know. I just know that it did not crash when I removed this check, so I can only hope it did not corrupt memory somewhere.

What I do know is that if I do not remove this check, then the exception I throw with the if (t >= BLOCK_SIZE || m >= BLOCK_SIZE) throw SingularError(p); in block_ldlt.hxx causes ~Page to be run in this (next_ && head_[nlevel-1] != 0) state, so it throws that std::runtime_error, and std::terminate gets called as a result. So something has to be done about that.

If removing the check is not safe, then we need another way to handle this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also be that existing exceptions thrown in block_ldlt.hxx by this check:

      if(fabs(bestv) < small) {
         if(!action) throw SingularError(p);

can also trigger this error case, but I am not sure about that either. It might also be that it only gets triggered because I am throwing an exception also in the case of a non-zero action (i.e., that that is the context in which the exception is unexpected). I do not understand the allocator usage well enough yet to know the answer.

@kkofler
Copy link
Author

kkofler commented Apr 2, 2025

I think it will not be easy to export the matrix. It comes from some complicated (and proprietary) Java code which calls Ipopt which calls SPRAL. I am not sure whether the block needs to be all NaN for the error to happen or just in some relevant places.

And yes, I think checking for NaNs and returning an error before even starting the factorization would be another way to fix this crash.

@nimgould
Copy link
Contributor

nimgould commented Apr 2, 2025 via email

@jfowkes
Copy link
Contributor

jfowkes commented Apr 3, 2025

Agreed, but at the same time SSIDS shouldn't be dying on a buffer overflow when it can't handle the input data. I think probably the best way to handle this is in a preliminary check that returns an error directly as @mjacobse suggests, otherwise there is a risk that with this PR we inadvertently break things lying very deep within the codebase.

@nimgould
Copy link
Contributor

nimgould commented Apr 4, 2025

As I am sure you appreciate, adding checks comes with an overhead, and many users may already be doing so before the call. The analysis procedure has an argument check to enable checks if desired, but this is only for the integer data. It is a shame that this isn't a control parameter rather than an argument, as then doing the same for factorize wouldn't break existing calls.

Of course, SSIDS is really showing its age, and time is approaching that
it is re-implemented (especially for GPUs)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants