Skip to content

fix(core): make events file comparator crash/terminate for io problems#4194

Open
nkuehnel wants to merge 9 commits intomatsim-org:mainfrom
moia-oss:fixeventsfilecomparator
Open

fix(core): make events file comparator crash/terminate for io problems#4194
nkuehnel wants to merge 9 commits intomatsim-org:mainfrom
moia-oss:fixeventsfilecomparator

Conversation

@nkuehnel
Copy link
Member

@nkuehnel nkuehnel commented Aug 13, 2025

I had a workflow run that timed out during a test that compared event files:
https://github.com/matsim-org/matsim-libs/actions/runs/16922537224

As the test output is not logged on github, it was not apparent that there was an io issue that is silently swallowed, causing the test to run indefinitely. This PR should ensure that there is a proper test failure in the case that the reference file is not found or empty/corrupt.

Includes a test for that

@nkuehnel nkuehnel requested a review from mrieser August 13, 2025 13:39
@nkuehnel
Copy link
Member Author

@mrieser @paulheinr we'd appreciate some feedback if this is the correct way of dealing with this cyclic barrier stuff

@paulheinr
Copy link
Contributor

I am currently on vacation and will have a look at it in the first week of September :-)

@nkuehnel
Copy link
Member Author

I am currently on vacation and will have a look at it in the first week of September :-)

sounds like a plan, enjoy your vacation!

@nkuehnel
Copy link
Member Author

@paulheinr any chance you could have a look at this? :)

Copy link
Contributor

@paulheinr paulheinr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To handle failed events reading correctly, I think we need to use another type of barrier that can be set "truely invalid". The Phraser class looks promising, however it doesn't support a barrierAction natively as the CycleBarrier does. But there should be manual workarounds.


/** Missing file should lead to IO_ERROR (no hang). */
@Test
void testMissingFile_returnsIoError() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a deadlock when I run this test locally.

// Any I/O or parse error: break the barrier so the peer won't hang.
log.warn("Error while reading events file '{}': {}", this.eFile, t.toString());
try {
doComparison.reset(); // causes the peer's await() to throw BrokenBarrierException
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason for the deadlock is the behaviour of reset(): It causes any waiting party to return with an Exception. But if this statement is executed before the other worker has reached the await(), there is no exception thrown.

This is in particular the case if event reading of worker 1 takes longer than resetting the barrier called by worker 0. In this case, worker 0 has reached its end but worker 1 is waiting at the barrier.

doComparison.reset(); // causes the peer's await() to throw BrokenBarrierException
} catch (Throwable ignored) {}
this.finished = true; // signal end for completeness
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even adding awaitBarrier() here causes problems: If the thread throwing an exception (let's call it t1) is slower than the other one (t2), t2 received the BrokenBarrierException and has finished. Thus, t1 will be waiting here forever 😕

@paulheinr
Copy link
Contributor

paulheinr commented Nov 14, 2025

@paulheinr any chance you could have a look at this? :)

Oh yes, sorry! I totally forgot that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants