Hi, I have several questions related to snapshotting. I am running the `TestBenchParallelServe` test from the vhive `Makefile` on a single-node setup:
```
sudo mkdir -m777 -p $(CTRDLOGDIR) && sudo env "PATH=$(PATH)" /usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml 1>$(CTRDLOGDIR)/fccd_orch_noupf_log_bench.out 2>$(CTRDLOGDIR)/fccd_orch_noupf_log_bench.err &
sudo env "PATH=$(PATH)" go test $(EXTRAGOARGS) -run TestBenchParallelServe -args $(WITHSNAPSHOTS) $(WITHUPF) -benchDirTest configREAP -metricsTest -funcName helloworld
./scripts/clean_fcctr.sh
```
If I understand correctly, this target spawns `parallelNum` concurrent invocations of the same function, in my case `helloworld`, with both snapshots and REAP enabled (see my sketch of the flow after the log below). However, the maximum `parallelNum` it supports on my machine is only 10; with anything larger, it fails with the following error:
```
cc@vhive-inst-01:~/vhive$ make bench
sudo mkdir -m777 -p /tmp/ctrd-logs && sudo env "PATH=/home/cc/.vscode-server/bin/3b889b090b5ad5793f524b5d1d39fda662b96a2a/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin" /usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml 1>/tmp/ctrd-logs/fccd_orch_noupf_log_bench.out 2>/tmp/ctrd-logs/fccd_orch_noupf_log_bench.err &
sudo env "PATH=/home/cc/.vscode-server/bin/3b889b090b5ad5793f524b5d1d39fda662b96a2a/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin" go test -v -race -cover -run TestBenchParallelServe -args -snapshotsTest -upfTest -benchDirTest configREAP -metricsTest -funcName helloworld
time="2022-07-27T15:56:12.352968106-04:00" level=info msg="Orchestrator snapshots enabled: true"
time="2022-07-27T15:56:12.353343756-04:00" level=info msg="Orchestrator UPF enabled: true"
time="2022-07-27T15:56:12.353431463-04:00" level=info msg="Orchestrator lazy serving mode enabled: false"
time="2022-07-27T15:56:12.353527886-04:00" level=info msg="Orchestrator UPF metrics enabled: true"
time="2022-07-27T15:56:12.353589272-04:00" level=info msg="Drop cache: true"
time="2022-07-27T15:56:12.353640659-04:00" level=info msg="Bench dir: configREAP"
time="2022-07-27T15:56:12.353695564-04:00" level=info msg="Registering bridges for tap manager"
time="2022-07-27T15:56:12.355698518-04:00" level=info msg="Creating containerd client"
time="2022-07-27T15:56:12.358589476-04:00" level=info msg="Created containerd client"
time="2022-07-27T15:56:12.358791972-04:00" level=info msg="Creating firecracker client"
time="2022-07-27T15:56:12.359535647-04:00" level=info msg="Created firecracker client"
=== RUN TestBenchParallelServe
time="2022-07-27T15:56:12.370809757-04:00" level=info msg="New function added" fID=plr-fnc image="ghcr.io/ease-lab/helloworld:var_workload" isPinned=true servedTh=0
... (omitted some logs here)
time="2022-07-27T15:56:50.439361536-04:00" level=info msg="Creating snapshot for 1, vmID is 1-0"
time="2022-07-27T15:56:50.441571578-04:00" level=info msg="Orchestrator received CreateSnapshot" vmID=1-0
time="2022-07-27T15:56:50.666040404-04:00" level=info msg="Creating snapshot for 0, vmID is 0-0"
time="2022-07-27T15:56:50.667999825-04:00" level=info msg="Orchestrator received CreateSnapshot" vmID=0-0
time="2022-07-27T15:56:55.440588667-04:00" level=error msg="failed to create snapshot of the VM" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" vmID=1-0
time="2022-07-27T15:56:55.440898692-04:00" level=panic msg="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
panic: (*logrus.Entry) 0xc0002390a0
goroutine 186 [running]:
github.com/sirupsen/logrus.(*Entry).log(0xc000238bd0, 0x0, {0xc0002ae870, 0x43})
/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:259 +0x95b
github.com/sirupsen/logrus.(*Entry).Log(0xc000238bd0, 0x0, {0xc000d05630, 0x1, 0x1})
/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:285 +0x8c
github.com/sirupsen/logrus.(*Logger).Log(0xc00021fdc0, 0x0, {0xc000d05630, 0x1, 0x1})
/root/go/pkg/mod/github.com/sirupsen/[email protected]/logger.go:198 +0x85
github.com/sirupsen/logrus.(*Logger).Panic(...)
/root/go/pkg/mod/github.com/sirupsen/[email protected]/logger.go:247
github.com/sirupsen/logrus.Panic(...)
/root/go/pkg/mod/github.com/sirupsen/[email protected]/exported.go:129
github.com/ease-lab/vhive.(*Function).CreateInstanceSnapshot(0xc00041c370)
/home/cc/vhive/functions.go:463 +0x51b
github.com/ease-lab/vhive.(*Function).Serve.func2()
/home/cc/vhive/functions.go:308 +0x46
sync.(*Once).doSlow(0xc0001c26d4, 0xc000d05ae8)
/usr/local/go/src/sync/once.go:68 +0x102
sync.(*Once).Do(0xc0001c26d4, 0xc000536750?)
/usr/local/go/src/sync/once.go:59 +0x47
github.com/ease-lab/vhive.(*Function).Serve(0xc00041c370, {0x1ca7c15?, 0x1?}, {0x1ca7c15, 0x1}, {0x1cad79c, 0x28}, {0x1c81435, 0x6})
/home/cc/vhive/functions.go:306 +0xeee
github.com/ease-lab/vhive.(*FuncPool).Serve(0xc00021fdc0?, {0x1e9d988, 0xc0001c2008}, {0x1ca7c15, 0x1}, {0x1cad79c, 0x28}, {0x1c81435, 0x6})
/home/cc/vhive/functions.go:121 +0xea
github.com/ease-lab/vhive.createSnapshots.func1({0x1ca7c15, 0x1})
/home/cc/vhive/bench_test.go:297 +0x18a
created by github.com/ease-lab/vhive.createSnapshots
/home/cc/vhive/bench_test.go:292 +0xb6
exit status 2
FAIL github.com/ease-lab/vhive 43.155s
```
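For reference, here is my mental model of what the benchmark does and why I suspect a timeout rather than a crash: the timestamps show CreateSnapshot for vmID 1-0 being issued at 15:56:50 and the DeadlineExceeded arriving at 15:56:55, so the snapshot RPC seems to run under a roughly 5-second context deadline that parallel snapshotting can blow past. A minimal Go sketch of that flow; the helper name, the sleep, and the exact 5-second value are my assumptions, not vHive's actual code:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// createSnapshotRPC stands in for the real CreateSnapshot gRPC call;
// here it just sleeps to model a snapshot slowed down by parallel I/O.
func createSnapshotRPC(ctx context.Context, vmID string) error {
	select {
	case <-time.After(8 * time.Second): // snapshot outlives the deadline
		return nil
	case <-ctx.Done():
		return ctx.Err() // surfaces as "context deadline exceeded"
	}
}

func main() {
	parallelNum := 16 // values above ~10 fail on my node
	var wg sync.WaitGroup
	for i := 0; i < parallelNum; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			vmID := fmt.Sprintf("%d-0", i)
			// Assumed ~5s deadline, inferred from the log timestamps above.
			ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
			defer cancel()
			if err := createSnapshotRPC(ctx, vmID); err != nil {
				fmt.Printf("failed to create snapshot of the VM %s: %v\n", vmID, err)
			}
		}(i)
	}
	wg.Wait()
}
```

If that picture is right, then with `parallelNum` goroutines all snapshotting at once each individual snapshot takes longer, and the fixed deadline trips even though the machine itself is far from saturated.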
It fails in `createSnapshots()` in `bench_test.go`, where the snapshots are saved in parallel; the panic is raised in `CreateInstanceSnapshot()`, which sends the CreateSnapshot gRPC request. I'm using a single node on Chameleon Cloud with 128 GB of RAM and 48 cores of Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz. I doubt it's a hardware capacity issue, because `htop` shows that RAM usage stays below 4 GB for the entire test duration. However, I see from your ASPLOS paper: "We use the helloworld function and consider up to 64 concurrent independent function arrivals". Are you testing on a cluster where those functions run on different machines? Or do you think it's constrained by concurrent SSD I/O bandwidth? `df -h` shows:
```
cc@vhive-inst-01:~$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 63G 0 63G 0% /dev
tmpfs 13G 7.3M 13G 1% /run
/dev/sda1 230G 73G 148G 33% /
tmpfs 63G 0 63G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/loop0 56M 56M 0 100% /snap/core18/1932
/dev/loop1 68M 68M 0 100% /snap/lxd/18150
/dev/loop3 56M 56M 0 100% /snap/core18/2538
/dev/loop4 47M 47M 0 100% /snap/snapd/16292
/dev/loop5 62M 62M 0 100% /snap/core20/1581
/dev/loop6 68M 68M 0 100% /snap/lxd/22753
shm 64M 0 64M 0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/56a07329fbcd10e05e218702b95ed3b5c42d75d55321af0bfdc884db4711cbbf/shm
shm 64M 0 64M 0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/64fd4707bc99b77b6da37ce5cd6f188b5bb001d3e1bbcf9ba35b4f6c734b7d66/shm
... (tons of shms)
```
My second question: it seems you are storing the snapshot/working-set files locally, under `/fccd/snapshots/`. What if the next invocation of the same function lands on another machine, where the snapshot doesn't exist? In that case, S3 or some distributed storage solution should be considered, right? Or if you have a better idea, I'd love to hear it. A sketch of what I have in mind is below.
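To make the suggestion concrete, here is a minimal sketch of the kind of thing I mean, using the AWS SDK for Go v2; the bucket name, key layout, and file paths are placeholders I made up, not anything vHive actually does:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// uploadSnapshot pushes a local snapshot file to S3 so that another
// node could later fetch it (via GetObject) and restore the function
// without having a local copy.
func uploadSnapshot(ctx context.Context, client *s3.Client, bucket, key, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: &bucket,
		Key:    &key,
		Body:   f,
	})
	return err
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)
	// Hypothetical bucket/key/path, for illustration only.
	err = uploadSnapshot(ctx, client, "vhive-snapshots",
		"helloworld/snap_file", "/fccd/snapshots/helloworld/snap_file")
	if err != nil {
		log.Fatal(err)
	}
}
```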
Also, `lsblk` shows tons of entries like the following, which I can't identify. Are they created by Firecracker or by the vHive CRI?
```
loop0 7:0 0 55.4M 1 loop /snap/core18/1932
loop1 7:1 0 67.8M 1 loop /snap/lxd/18150
loop2 7:2 0 100G 0 loop
loop3 7:3 0 55.6M 1 loop /snap/core18/2538
loop4 7:4 0 47M 1 loop /snap/snapd/16292
loop5 7:5 0 62M 1 loop /snap/core20/1581
loop6 7:6 0 67.8M 1 loop /snap/lxd/22753
loop7 7:7 0 2G 0 loop
loop8 7:8 0 100G 0 loop
loop9 7:9 0 2G 0 loop
loop10 7:10 0 100G 0 loop
loop11 7:11 0 2G 0 loop
loop12 7:12 0 100G 0 loop
loop13 7:13 0 2G 0 loop
loop14 7:14 0 100G 0 loop
loop15 7:15 0 2G 0 loop
...
```
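In case it helps narrow this down, here is a quick way to list which files back those loop devices, reading the same sysfs attribute that `losetup` reports (a generic check I put together, nothing vHive-specific):

```go
// Print the backing file of each loop device via sysfs,
// the same information `losetup -ln` shows.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	matches, _ := filepath.Glob("/sys/block/loop*/loop/backing_file")
	for _, m := range matches {
		data, err := os.ReadFile(m)
		if err != nil {
			continue
		}
		dev := strings.Split(m, "/")[3] // e.g. "loop2"
		fmt.Printf("%s -> %s", dev, string(data))
	}
}
```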
I appreciate your time reviewing and answering these. Thank you!