fix(core): make events file comparator crash/terminate for io problems#4194
fix(core): make events file comparator crash/terminate for io problems#4194nkuehnel wants to merge 9 commits intomatsim-org:mainfrom
Conversation
|
@mrieser @paulheinr we'd appreciate some feedback if this is the correct way of dealing with this cyclic barrier stuff |
|
I am currently on vacation and will have a look at it in the first week of September :-) |
sounds like a plan, enjoy your vacation! |
|
@paulheinr any chance you could have a look at this? :) |
paulheinr
left a comment
There was a problem hiding this comment.
To handle failed events reading correctly, I think we need to use another type of barrier that can be set "truely invalid". The Phraser class looks promising, however it doesn't support a barrierAction natively as the CycleBarrier does. But there should be manual workarounds.
|
|
||
| /** Missing file should lead to IO_ERROR (no hang). */ | ||
| @Test | ||
| void testMissingFile_returnsIoError() throws IOException { |
There was a problem hiding this comment.
There is a deadlock when I run this test locally.
| // Any I/O or parse error: break the barrier so the peer won't hang. | ||
| log.warn("Error while reading events file '{}': {}", this.eFile, t.toString()); | ||
| try { | ||
| doComparison.reset(); // causes the peer's await() to throw BrokenBarrierException |
There was a problem hiding this comment.
I think the reason for the deadlock is the behaviour of reset(): It causes any waiting party to return with an Exception. But if this statement is executed before the other worker has reached the await(), there is no exception thrown.
This is in particular the case if event reading of worker 1 takes longer than resetting the barrier called by worker 0. In this case, worker 0 has reached its end but worker 1 is waiting at the barrier.
| doComparison.reset(); // causes the peer's await() to throw BrokenBarrierException | ||
| } catch (Throwable ignored) {} | ||
| this.finished = true; // signal end for completeness | ||
| } |
There was a problem hiding this comment.
Even adding awaitBarrier() here causes problems: If the thread throwing an exception (let's call it t1) is slower than the other one (t2), t2 received the BrokenBarrierException and has finished. Thus, t1 will be waiting here forever 😕
Oh yes, sorry! I totally forgot that. |
I had a workflow run that timed out during a test that compared event files:
https://github.com/matsim-org/matsim-libs/actions/runs/16922537224
As the test output is not logged on github, it was not apparent that there was an io issue that is silently swallowed, causing the test to run indefinitely. This PR should ensure that there is a proper test failure in the case that the reference file is not found or empty/corrupt.
Includes a test for that