Fix buffer overflow in CPU block_ldlt in find_maxloc on NaNs #234

kkofler · 2025-04-02T01:24:53Z

This pull request fixes a buffer overflow in the SPRAL CPU code called from Ipopt when the input contains some NaNs, and 2 followup errors that occurred after I fixed the buffer overflow. With these 3 fixes, the input passed from Ipopt to SPRAL is declared singular by SPRAL at the rank in which the NaNs are encountered instead of crashing the entire application calling Ipopt.

amontoison · 2025-04-02T04:39:31Z

@mjacobse Can you review the PR?
@nimgould The modifications could be relevant for GALAHAD.

This PR could help to solve #218

jfowkes · 2025-04-02T07:38:17Z

@kkofler many thanks for the patch, could you rebase master onto your branch so it picks up our new ASan test?

src/ssids/cpu/kernels/block_ldlt.hxx (spral::ssids::cpu::block_ldlt_internal::find_maxloc): Add the same sanity checks for (cloc < BLOCK_SIZE && rloc < BLOCK_SIZE) in the SIMD case as in the non-SIMD case. This fixes a buffer overflow in Ipopt when the input contains some NaNs. In that case, it can happen at this spot that cbest and rbest are still at their initial std::numeric_limits<int>::max() value, so cloc and rloc end up set to that, and so a[cloc*lda+rloc] is out of bounds. The result will still be garbage if we have NaNs, but at least SPRAL will not crash the entire application anymore. (spral::ssids::cpu::block_ldlt_internal::block_ldlt): Check whether the found maxloc is valid (t and m both < BLOCK_SIZE), throw a SingularError otherwise (which I think is a reasonable thing to claim when there were NaNs in the input). Otherwise we eventually exit the entire application with the pivsiz == 0 error case.

src/ssids/cpu/BuddyAllocator.hxx (spral::ssids::cpu::buddy_alloc_internal::Page::~Page): Remove invalid sanity check. When an exception is thrown, this condition is not satisfied.

kkofler · 2025-04-02T12:55:53Z

Done.

mjacobse

Thanks, setting bestv_out like that seems to make some sense to me. Right now I am unable to test this though, I assume this case will only happen with a full block of size BLOCK_SIZE full of NaN? Would you be able to provide a test matrix that causes this failure? Otherwise I would have to try to come up with one myself.

On a more general note I'm not quite sure what SSIDS is really supposed to do if there are NaN in the input. Surely a user should just clean that up, or is there any actual usecase for this? Though I agree that running into a buffer overflow is not the nicest way to (not) handle this of course. But maybe dealing with this in a preliminary check and returning an error directly would be simpler? Not sure.

mjacobse · 2025-04-02T17:30:08Z

src/ssids/cpu/kernels/block_ldlt.hxx

      find_maxloc<T,BLOCK_SIZE>(p, a, lda, bestv, t, m);

+      // Handle case where find_maxloc failed (e.g., due to NaNs in the input)
+      if (t >= BLOCK_SIZE || m >= BLOCK_SIZE) throw SingularError(p);


I'm not quite sure about this check. It unconditionally gives up with SingularError, ignoring the action setting. Also, if find_maxloc failed, the change above will cause bestv to be 0.0, and so fabs(bestv) < small will be true, so we should go into the handling of the singular case anyways, right?

Well, if find_maxloc fails, no matter in what stage, it is always an error.

If I do not add this check, i.e., if I add only the bestv_out buffer overflow fix without this added check, then it continues attempting to compute with invalid indices (INT_MAX is not a valid index to work with) and eventually (I think in a later iteration, because in this one, we should have t==INT_MAX and m==INT_MAX and hence t==m, and so pivsiz should always be 1) runs into the

if(pivsiz == 0) { // FIXME: debug remove printf("broken!\n"); … exit(1); }

crash case. Assuming it gets so far to begin with, because a11 = a[t*lda+t]; is also an invalid memory access if t is INT_MAX.

But in my tests, the "broken!" error is what I hit without this added check.

mjacobse · 2025-04-02T17:32:00Z

src/ssids/cpu/BuddyAllocator.hxx

   }
   ~Page() noexcept(false) {
-      if(next_ && head_[nlevel-1] != 0)
-         throw std::runtime_error("outstanding allocations on cleanup\n");


I do not understand why this can just be removed and how it relates to the other changes.

Is it safe to just remove? To be honest, I do not know. I just know that it did not crash when I removed this check, so I can only hope it did not corrupt memory somewhere.

What I do know is that if I do not remove this check, then the exception I throw with the if (t >= BLOCK_SIZE || m >= BLOCK_SIZE) throw SingularError(p); in block_ldlt.hxx causes ~Page to be run in this (next_ && head_[nlevel-1] != 0) state, so it throws that std::runtime_error, and std::terminate gets called as a result. So something has to be done about that.

If removing the check is not safe, then we need another way to handle this.

It might also be that existing exceptions thrown in block_ldlt.hxx by this check:

if(fabs(bestv) < small) { if(!action) throw SingularError(p);

can also trigger this error case, but I am not sure about that either. It might also be that it only gets triggered because I am throwing an exception also in the case of a non-zero action (i.e., that that is the context in which the exception is unexpected). I do not understand the allocator usage well enough yet to know the answer.

kkofler · 2025-04-02T17:52:07Z

I think it will not be easy to export the matrix. It comes from some complicated (and proprietary) Java code which calls Ipopt which calls SPRAL. I am not sure whether the block needs to be all NaN for the error to happen or just in some relevant places.

And yes, I think checking for NaNs and returning an error before even starting the factorization would be another way to fix this crash.

nimgould · 2025-04-02T18:05:23Z

My view is this should be dealt with by the calling routine, not ssids. If I call an Lapack routine with NaNs, I don't expect to be told that the data is faulty, that is my job. Whether the fault is ipopt or the java program, it should be sorted up the chain.

jfowkes · 2025-04-03T09:43:00Z

Agreed, but at the same time SSIDS shouldn't be dying on a buffer overflow when it can't handle the input data. I think probably the best way to handle this is in a preliminary check that returns an error directly as @mjacobse suggests, otherwise there is a risk that with this PR we inadvertently break things lying very deep within the codebase.

nimgould · 2025-04-04T11:13:15Z

As I am sure you appreciate, adding checks comes with an overhead, and many users may already be doing so before the call. The analysis procedure has an argument check to enable checks if desired, but this is only for the integer data. It is a shame that this isn't a control parameter rather than an argument, as then doing the same for factorize wouldn't break existing calls.

Of course, SSIDS is really showing its age, and time is approaching that
it is re-implemented (especially for GPUs)

jfowkes added the buffer-overflow label Apr 2, 2025

jfowkes requested a review from mjacobse April 2, 2025 06:54

kkofler added 2 commits April 2, 2025 14:55

Fix crash in CPU BuddyAllocator ~Page when block_ldlt throws

6299940

src/ssids/cpu/BuddyAllocator.hxx (spral::ssids::cpu::buddy_alloc_internal::Page::~Page): Remove invalid sanity check. When an exception is thrown, this condition is not satisfied.

kkofler force-pushed the fix-buffer-overflow-in-block_ldlt branch from 7d8212d to 6299940 Compare April 2, 2025 12:55

mjacobse reviewed Apr 2, 2025

View reviewed changes

Fix buffer overflow in CPU block_ldlt in find_maxloc on NaNs #234

Are you sure you want to change the base?

Fix buffer overflow in CPU block_ldlt in find_maxloc on NaNs #234

Uh oh!

Conversation

kkofler commented Apr 2, 2025

Uh oh!

amontoison commented Apr 2, 2025

Uh oh!

jfowkes commented Apr 2, 2025

Uh oh!

kkofler commented Apr 2, 2025

Uh oh!

mjacobse left a comment

Choose a reason for hiding this comment

Uh oh!

mjacobse Apr 2, 2025

Choose a reason for hiding this comment

Uh oh!

kkofler Apr 2, 2025

Choose a reason for hiding this comment

Uh oh!

kkofler Apr 2, 2025

Choose a reason for hiding this comment

Uh oh!

mjacobse Apr 2, 2025

Choose a reason for hiding this comment

Uh oh!

kkofler Apr 2, 2025

Choose a reason for hiding this comment

Uh oh!

kkofler Apr 2, 2025

Choose a reason for hiding this comment

Uh oh!

kkofler commented Apr 2, 2025

Uh oh!

nimgould commented Apr 2, 2025 via email • edited by jfowkes Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jfowkes commented Apr 3, 2025

Uh oh!

nimgould commented Apr 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

nimgould commented Apr 2, 2025 via email •

edited by jfowkes

Loading