Skip to content

Conversation

kalyazin
Copy link
Contributor

@kalyazin kalyazin commented Oct 7, 2024

Changes

Backport #4835 and its dependencies: #4782 , #4796 , #4798 , #4812 .

Reason

Regression testing in the release branch.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • [ ] If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this
    PR.
  • [ ] API changes follow the Runbook for Firecracker API changes.
  • [ ] User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • [ ] New TODOs link to an issue.
  • Commits meet
    contribution quality standards.

  • This functionality cannot be added in rust-vmm.

@kalyazin kalyazin self-assigned this Oct 7, 2024
@kalyazin kalyazin added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Oct 7, 2024
ShadowCurse
ShadowCurse previously approved these changes Oct 7, 2024
Copy link

codecov bot commented Oct 7, 2024

Codecov Report

Attention: Patch coverage is 38.46154% with 8 lines in your changes missing coverage. Please review.

Project coverage is 84.34%. Comparing base (3370eaf) to head (d062c9e).
Report is 17 commits behind head on firecracker-v1.9.

Files with missing lines Patch % Lines
src/vmm/src/device_manager/mmio.rs 0.00% 8 Missing ⚠️
Additional details and impacted files
@@                 Coverage Diff                  @@
##           firecracker-v1.9    #4836      +/-   ##
====================================================
- Coverage             84.35%   84.34%   -0.01%     
====================================================
  Files                   249      249              
  Lines                 27447    27449       +2     
====================================================
  Hits                  23152    23152              
- Misses                 4295     4297       +2     
Flag Coverage Δ
5.10-c5n.metal 84.56% <38.46%> (-0.01%) ⬇️
5.10-m5n.metal 84.55% <38.46%> (-0.01%) ⬇️
5.10-m6a.metal 83.83% <38.46%> (-0.02%) ⬇️
5.10-m6g.metal 80.90% <38.46%> (-0.01%) ⬇️
5.10-m6i.metal 84.54% <38.46%> (-0.02%) ⬇️
5.10-m7g.metal 80.90% <38.46%> (-0.01%) ⬇️
6.1-c5n.metal 84.56% <38.46%> (-0.01%) ⬇️
6.1-m5n.metal 84.54% <38.46%> (-0.02%) ⬇️
6.1-m6a.metal 83.84% <38.46%> (-0.01%) ⬇️
6.1-m6g.metal 80.90% <38.46%> (-0.01%) ⬇️
6.1-m6i.metal 84.54% <38.46%> (-0.02%) ⬇️
6.1-m7g.metal 80.90% <38.46%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

roypat
roypat previously approved these changes Oct 7, 2024
@kalyazin kalyazin dismissed stale reviews from roypat and ShadowCurse via dbd5660 October 7, 2024 15:40
roypat and others added 16 commits October 7, 2024 16:35
Tests that boot a microvm via a config file do not use `.start()`, as
`.spawn()` already start the bootup process. This is probably why these
tests were overlooked when we did the `.wait_for_up()` refactor.

Fix potential issues by adding the `wait_for_up()` calls here, to
replace various homegrown waiting solutions.

Some tests do not have network devices, and also don't really need to
wait for the guest to have booted as they only validate firecracker API
responses. Don't try to use wait_for_up in those (since we can't
anyway).

Signed-off-by: Patrick Roy <[email protected]>
The assigned value was immediately overwritten without ever being read.

Signed-off-by: Patrick Roy <[email protected]>
Remove assignment to a variable that was never read

Signed-off-by: Patrick Roy <[email protected]>
"Some times the ..." -> "Sometimes the ..." (no space)

Signed-off-by: Patrick Roy <[email protected]>
This test serves no purpose, as all it does is take a snapshot without
verifying anything about it. In this sense, it is perfectly subsumed by
`test_balloon_snapshot`, which also creates snapshots with balloon
devices, but actually verifies they restore as well (something that is
not given in this test, because we do not wait for the guest to boot
before taking a snapshot).

A look at the commit history shows that this test used to be responsible
for verifying cross-release snapshot support for the balloon device, but
since we no longer support such a thing, it became superfluous.

Signed-off-by: Patrick Roy <[email protected]>
Some of the tests in test_balloon.py were missing
`microvm.wait_for_up()` calls. This was mainly visible from those that
read the RSS of the firecracker process taking a long time to calibrate,
due to RSS being very unstable during guest boot up.

Signed-off-by: Patrick Roy <[email protected]>
Major page fault occur when a page fault can only be satisfied via disk
I/O [1]. In `test_stats`, we asserted that the number of major page
faults increased during execution of `make_guest_dirty_memory`, but this
cannot happen, as it uses MAP_ANONYMOUS, and thus will not trigger any
major page faults. The reason the test used to work in the past is that
it was not waiting for the VM to boot, so the first reading of the
balloon statistics happened before the boot process was finished.
However, during boot we _do_ expect major faults to happen, as things
need to be read from the rootfs.

Fix the test to instead just assert a non-zero number of major faults
happened during boot.

[1]: https://en.wikipedia.org/wiki/Page_fault#Major

Signed-off-by: Patrick Roy <[email protected]>
This is a regression test for the following fix:

commit a9e5f13
Author: Nikita Kalyazin <[email protected]>
Date:   Mon Sep 30 13:05:28 2024 +0000

    fix(net): set tap offload features on restore

The test verifies that tap offload features are configured for both
booted and restored VMs.

Signed-off-by: Nikita Kalyazin <[email protected]>
Add entry for vsock fix into CHANGELOG.

Signed-off-by: Egor Lazarchuk <[email protected]>
The test starts `socat` server on the host and `socat`
client in the guest. The test then validates that `socat`
client stops after snapshot is taken.

Signed-off-by: Egor Lazarchuk <[email protected]>
We need to kick vsock queue during resume
in order for a VM to process `TRANSPORT_RESET_EVENT`
we sent during snapshot creation. Otherwise it will
wait for it forever.

Signed-off-by: Egor Lazarchuk <[email protected]>
Always block inside `Microvm.start()` until the vm has finished booting
(if a network device is present in the microvm). This means that we no
longer need `wait_for_up` calls after every `start()` call.

Signed-off-by: Patrick Roy <[email protected]>
If we start Firecracker with a config file, then the microvm will boot
in `.spawn()` instead of `.start()` (since we are not driving
Firecracker via API requests). Thus, if this is the case, put the
`wait_for_up` into `Microvm.spawn()`.

Signed-off-by: Patrick Roy <[email protected]>
After restoring a snapshot, we are not waiting for the VM to finish
booting (it needs to have done that before we took a snapshot), but
having a `wait_for_up` here is still a good sanity check for ensuring
the VM is still responsive. So far our tests have been inconsistent on
whether to put a `wait_for_up` here. This makes them consistent by
always doing `wait_for_up` if a network device is available.

Signed-off-by: Patrick Roy <[email protected]>
 `wait_for_up`s were replaced with

```
grep -rl wait_for_up tests/integration_tests/ \
|xargs sed '/wait_for_up/d' -i
```

Signed-off-by: Patrick Roy <[email protected]>
`vm.start()` already asserts `vm.state == "Running"`, so no need to do
it again at the callsite.

Signed-off-by: Patrick Roy <[email protected]>
@kalyazin kalyazin force-pushed the tap_offload_test_1.9 branch from dbd5660 to ffaad42 Compare October 7, 2024 16:40
@kalyazin kalyazin requested review from pb8o and xmarcalx as code owners October 7, 2024 16:40
@kalyazin kalyazin changed the title [1.9] test(net): add test for tap offload features [1.9] add vsock fix and test for tap offload features Oct 7, 2024
This is a fix for a fix introduced in firecracker-microvm#4796
The issue was in vsock device hanging after snapshot
restoration due to the guest not being notified about
the termination packet. But there was bug in the fix, maily
we saved the vsock state before the notification was sent,
thus discarding all modifications made to sent the notification.

The reason original fix worked, is because we were only testing
with 1 iteration of snap/restore. This way even though we lost
synchronization with the guest in the event queue state, it worked
fine once. But doing more iterations causes vsock to hang
as before.

This commit fixes the issue by storing vsock state after the
notification is sent and modifies the vsock test to run
multiple iterations of snap/restore.

Signed-off-by: Egor Lazarchuk <[email protected]>
@bchalios bchalios merged commit f6f21b4 into firecracker-microvm:firecracker-v1.9 Oct 8, 2024
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: Awaiting review Indicates that a pull request is ready to be reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants