
fix: CI failures #848

Open
coderbirju wants to merge 3 commits into firecracker-microvm:main from coderbirju:fix-test-failures


@coderbirju coderbirju commented Jan 27, 2026

Issue #, if available:

Our CI experiences two categories of failures:

  1. Consistently failing tests
  • tc-redirect-tap "permission denied" failures in go mod
    • This was caused by the older version no longer being available in the Go repository
    • Changed this to the only available version: v0.0.0-20250516183331-34bf829e9a5c
  2. Intermittently failing tests
  • TestJailerCPUSet_Isolated
  • TestOOM_Isolated
  • TestCreateVM_Isolated
  • TestStopVM_Isolated
  • TestEvents_Isolated: race condition in the event collection logic
    • The current implementation collects exactly 10 events and expects them in a specific order
    • Events arrive non-deterministically, causing test failures
    • Changed this to check only that the expected events are present, regardless of their order
  • TestPauseResume_Isolated variants: vsock connection timeouts
  • TestBrokenPipe-Isolated: this test simulates a broken ioPipe by removing the stdio and stderr streams and attaching another iostream to the same task
    • It is very flaky: sometimes the attach doesn't happen properly and we end up with nothing on the new streams
    • The test needs to be revisited and refactored, as this way of simulating a broken pipe is not deterministic
    • We should skip this test if the failures are consistent

Most of these failures happen because of timing delays during agent setup and cleanup. I have tried to alleviate this by adding timeouts and individual contexts in as many places as possible, but the results are still not consistent; the only way to make them consistent is probably to look into how the tests are structured.

Recent changes include a runc update, a firecracker-go-sdk update, and updates to various other small dependencies, including the Docker image used for testing. Any of these could be a reason for the added flakiness.

Description of changes:

  • Added extra timeouts and cleanup functions

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@coderbirju coderbirju requested a review from a team as a code owner January 27, 2026 13:35
@coderbirju coderbirju changed the title fix: go mod failures fix: CI failures Jan 27, 2026
@coderbirju coderbirju force-pushed the fix-test-failures branch 25 times, most recently from f0a783f to a0c8483 Compare February 4, 2026 21:04
@coderbirju coderbirju force-pushed the fix-test-failures branch 2 times, most recently from 0d4d785 to de08c40 Compare February 4, 2026 22:20
@coderbirju coderbirju force-pushed the fix-test-failures branch 6 times, most recently from 8aa17b9 to b5f0cd4 Compare February 5, 2026 18:20
Signed-off-by: Arjun Raja Yogidas <arjunry@amazon.com>
When the agent receives a Shutdown request, it may close the ttrpc
connection before sending the response. This is expected behavior.
The runtime should proceed to Wait() for the VM to exit rather than
treating this as a failure and force-terminating.

Signed-off-by: Arjun Raja Yogidas <arjunry@amazon.com>
…r subtest

Signed-off-by: Arjun Raja Yogidas <arjunry@amazon.com>
@coderbirju coderbirju force-pushed the fix-test-failures branch 2 times, most recently from 23f5494 to 1ff34b4 Compare February 5, 2026 23:15