
Snapshots Issue - Cannot Serve More Than 10 Concurrent Functions #579

@DanielLee343

Hi, I have several questions related to snapshotting. I am running the TestBenchParallelServe test from the vhive Makefile on a single-node setup.

sudo mkdir -m777 -p $(CTRDLOGDIR) && sudo env "PATH=$(PATH)" /usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml 1>$(CTRDLOGDIR)/fccd_orch_noupf_log_bench.out 2>$(CTRDLOGDIR)/fccd_orch_noupf_log_bench.err &
sudo env "PATH=$(PATH)" go test $(EXTRAGOARGS) -run TestBenchParallelServe -args $(WITHSNAPSHOTS) $(WITHUPF) -benchDirTest configREAP -metricsTest -funcName helloworld
./scripts/clean_fcctr.sh

If I understand correctly, this target spawns parallelNum concurrent instances of the same function (helloworld, in my case) with both snapshots and REAP enabled. However, the maximum parallelNum it supports on my machine is only 10; anything larger fails with the following error:

cc@vhive-inst-01:~/vhive$ make bench 
sudo mkdir -m777 -p /tmp/ctrd-logs && sudo env "PATH=/home/cc/.vscode-server/bin/3b889b090b5ad5793f524b5d1d39fda662b96a2a/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin" /usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml 1>/tmp/ctrd-logs/fccd_orch_noupf_log_bench.out 2>/tmp/ctrd-logs/fccd_orch_noupf_log_bench.err &
sudo env "PATH=/home/cc/.vscode-server/bin/3b889b090b5ad5793f524b5d1d39fda662b96a2a/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin" go test -v -race -cover -run TestBenchParallelServe -args -snapshotsTest -upfTest -benchDirTest configREAP -metricsTest -funcName helloworld
time="2022-07-27T15:56:12.352968106-04:00" level=info msg="Orchestrator snapshots enabled: true"
time="2022-07-27T15:56:12.353343756-04:00" level=info msg="Orchestrator UPF enabled: true"
time="2022-07-27T15:56:12.353431463-04:00" level=info msg="Orchestrator lazy serving mode enabled: false"
time="2022-07-27T15:56:12.353527886-04:00" level=info msg="Orchestrator UPF metrics enabled: true"
time="2022-07-27T15:56:12.353589272-04:00" level=info msg="Drop cache: true"
time="2022-07-27T15:56:12.353640659-04:00" level=info msg="Bench dir: configREAP"
time="2022-07-27T15:56:12.353695564-04:00" level=info msg="Registering bridges for tap manager"
time="2022-07-27T15:56:12.355698518-04:00" level=info msg="Creating containerd client"
time="2022-07-27T15:56:12.358589476-04:00" level=info msg="Created containerd client"
time="2022-07-27T15:56:12.358791972-04:00" level=info msg="Creating firecracker client"
time="2022-07-27T15:56:12.359535647-04:00" level=info msg="Created firecracker client"
=== RUN   TestBenchParallelServe
time="2022-07-27T15:56:12.370809757-04:00" level=info msg="New function added" fID=plr-fnc image="ghcr.io/ease-lab/helloworld:var_workload" isPinned=true servedTh=0
... (omitted some logs here)
time="2022-07-27T15:56:50.439361536-04:00" level=info msg="Creating snapshot for 1, vmID is 1-0"
time="2022-07-27T15:56:50.441571578-04:00" level=info msg="Orchestrator received CreateSnapshot" vmID=1-0
time="2022-07-27T15:56:50.666040404-04:00" level=info msg="Creating snapshot for 0, vmID is 0-0"
time="2022-07-27T15:56:50.667999825-04:00" level=info msg="Orchestrator received CreateSnapshot" vmID=0-0
time="2022-07-27T15:56:55.440588667-04:00" level=error msg="failed to create snapshot of the VM" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" vmID=1-0
time="2022-07-27T15:56:55.440898692-04:00" level=panic msg="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
panic: (*logrus.Entry) 0xc0002390a0

goroutine 186 [running]:
github.com/sirupsen/logrus.(*Entry).log(0xc000238bd0, 0x0, {0xc0002ae870, 0x43})
	/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:259 +0x95b
github.com/sirupsen/logrus.(*Entry).Log(0xc000238bd0, 0x0, {0xc000d05630, 0x1, 0x1})
	/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:285 +0x8c
github.com/sirupsen/logrus.(*Logger).Log(0xc00021fdc0, 0x0, {0xc000d05630, 0x1, 0x1})
	/root/go/pkg/mod/github.com/sirupsen/[email protected]/logger.go:198 +0x85
github.com/sirupsen/logrus.(*Logger).Panic(...)
	/root/go/pkg/mod/github.com/sirupsen/[email protected]/logger.go:247
github.com/sirupsen/logrus.Panic(...)
	/root/go/pkg/mod/github.com/sirupsen/[email protected]/exported.go:129
github.com/ease-lab/vhive.(*Function).CreateInstanceSnapshot(0xc00041c370)
	/home/cc/vhive/functions.go:463 +0x51b
github.com/ease-lab/vhive.(*Function).Serve.func2()
	/home/cc/vhive/functions.go:308 +0x46
sync.(*Once).doSlow(0xc0001c26d4, 0xc000d05ae8)
	/usr/local/go/src/sync/once.go:68 +0x102
sync.(*Once).Do(0xc0001c26d4, 0xc000536750?)
	/usr/local/go/src/sync/once.go:59 +0x47
github.com/ease-lab/vhive.(*Function).Serve(0xc00041c370, {0x1ca7c15?, 0x1?}, {0x1ca7c15, 0x1}, {0x1cad79c, 0x28}, {0x1c81435, 0x6})
	/home/cc/vhive/functions.go:306 +0xeee
github.com/ease-lab/vhive.(*FuncPool).Serve(0xc00021fdc0?, {0x1e9d988, 0xc0001c2008}, {0x1ca7c15, 0x1}, {0x1cad79c, 0x28}, {0x1c81435, 0x6})
	/home/cc/vhive/functions.go:121 +0xea
github.com/ease-lab/vhive.createSnapshots.func1({0x1ca7c15, 0x1})
	/home/cc/vhive/bench_test.go:297 +0x18a
created by github.com/ease-lab/vhive.createSnapshots
	/home/cc/vhive/bench_test.go:292 +0xb6
exit status 2
FAIL	github.com/ease-lab/vhive	43.155s
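
For context, the stack trace shows createSnapshots() spawning one goroutine per function, each of which calls FuncPool.Serve, which in turn takes the snapshot under a sync.Once. Below is a self-contained sketch of that fan-out pattern as I read the trace, not the actual bench_test.go code: serveOnce is a stub of mine, and parallelNum's value is a placeholder.

package main

import (
	"fmt"
	"strconv"
	"sync"
)

// serveOnce stands in for funcPool.Serve -> CreateInstanceSnapshot
// (functions.go in the stack trace above); it is a stub, not vHive code.
func serveOnce(fID string) error {
	fmt.Println("serving and snapshotting function", fID)
	return nil
}

func main() {
	const parallelNum = 16 // placeholder; the real value is set in bench_test.go

	var wg sync.WaitGroup
	for i := 0; i < parallelNum; i++ {
		wg.Add(1)
		go func(fID string) {
			defer wg.Done()
			// In the benchmark, an error here is logged with log.Panic,
			// which is the panic visible in the trace.
			if err := serveOnce(fID); err != nil {
				panic(err)
			}
		}(strconv.Itoa(i))
	}
	wg.Wait()
}

With snapshots enabled, all parallelNum snapshot creations therefore hit the backend at roughly the same time.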

So the failure is in createSnapshots() in bench_test.go, during the parallel snapshot saving; the error is thrown while CreateInstanceSnapshot sends the CreateSnapshot gRPC request. I'm using a single node on Chameleon Cloud with 128 GB of RAM and 48 cores of Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz. I doubt it's a hardware-capacity issue, because htop shows RAM usage staying below 4 GB for the entire test. However, your ASPLOS paper says, "We use the helloworld function and consider up to 64 concurrent independent function arrivals". Were you testing on a cluster where those functions run on different machines? Or do you think this is constrained by concurrent SSD I/O bandwidth? (See the timeout sketch after the df output below.) df -h shows:

cc@vhive-inst-01:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             63G     0   63G   0% /dev
tmpfs            13G  7.3M   13G   1% /run
/dev/sda1       230G   73G  148G  33% /
tmpfs            63G     0   63G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/loop0       56M   56M     0 100% /snap/core18/1932
/dev/loop1       68M   68M     0 100% /snap/lxd/18150
/dev/loop3       56M   56M     0 100% /snap/core18/2538
/dev/loop4       47M   47M     0 100% /snap/snapd/16292
/dev/loop5       62M   62M     0 100% /snap/core20/1581
/dev/loop6       68M   68M     0 100% /snap/lxd/22753
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/56a07329fbcd10e05e218702b95ed3b5c42d75d55321af0bfdc884db4711cbbf/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/64fd4707bc99b77b6da37ce5cd6f188b5bb001d3e1bbcf9ba35b4f6c734b7d66/shm
... (tons of shms)
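
One detail in the timestamps: the DeadlineExceeded error lands almost exactly 5 s after "Orchestrator received CreateSnapshot" (15:56:50.44 -> 15:56:55.44), so I suspect a fixed per-request deadline rather than memory pressure. My unverified assumption is that the snapshot RPC is wrapped in something like the following, in which case concurrent SSD I/O contention would push the slower requests past the deadline:

package main

import (
	"context"
	"fmt"
	"time"
)

// createSnapshot stands in for the CreateSnapshot RPC to firecracker-containerd;
// the 8s sleep simulates a request slowed down by parallel disk I/O.
func createSnapshot(ctx context.Context, vmID string) error {
	select {
	case <-time.After(8 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err() // surfaces as "context deadline exceeded"
	}
}

func main() {
	// 5s is my guess at the deadline, read off the log timestamps above.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if err := createSnapshot(ctx, "1-0"); err != nil {
		fmt.Println("failed to create snapshot of the VM:", err)
	}
}

If that is what's happening, then 10 concurrent snapshots fit under the deadline on my disk and anything more does not, independent of RAM.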

My second question: it seems the snapshot/working-set files are stored locally, under /fccd/snapshots/. What if the next invocation of the same function lands on another machine where the snapshot doesn't exist? In that case, S3 or some other distributed storage solution should be considered, right? Or, if you have a better idea, I'd love to hear it.
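
To make the question concrete, I imagine a node that misses a snapshot locally pulling it from an object store before loading, along these lines (a sketch using the aws-sdk-go v1 downloader; the bucket name, key layout, and file name are all made up):

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// fetchSnapshot downloads one snapshot artifact into the local
// /fccd/snapshots layout; the names here are hypothetical.
func fetchSnapshot(bucket, funcID, file string) error {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))

	f, err := os.Create(fmt.Sprintf("/fccd/snapshots/%s/%s", funcID, file))
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = s3manager.NewDownloader(sess).Download(f, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(funcID + "/" + file),
	})
	return err
}

func main() {
	if err := fetchSnapshot("vhive-snapshots", "helloworld-0", "mem_file"); err != nil {
		log.Fatal(err)
	}
}

Of course the memory files can be large, so pulling them cold over the network might erase much of the snapshot-restore benefit; that trade-off is really what I'm asking about.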

Also, lsblk shows tons of entries like the ones below, which I can't account for. Are they created by Firecracker or by the vHive CRI? (A small helper for inspecting them follows the listing.)

loop0               7:0    0  55.4M  1 loop /snap/core18/1932
loop1               7:1    0  67.8M  1 loop /snap/lxd/18150
loop2               7:2    0   100G  0 loop 
loop3               7:3    0  55.6M  1 loop /snap/core18/2538
loop4               7:4    0    47M  1 loop /snap/snapd/16292
loop5               7:5    0    62M  1 loop /snap/core20/1581
loop6               7:6    0  67.8M  1 loop /snap/lxd/22753
loop7               7:7    0     2G  0 loop 
loop8               7:8    0   100G  0 loop 
loop9               7:9    0     2G  0 loop 
loop10              7:10   0   100G  0 loop 
loop11              7:11   0     2G  0 loop 
loop12              7:12   0   100G  0 loop 
loop13              7:13   0     2G  0 loop 
loop14              7:14   0   100G  0 loop 
loop15              7:15   0     2G  0 loop
...
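
For what it's worth, each configured loop device exposes its backing file under /sys/block/loopN/loop/backing_file (the same information losetup -a prints), which should reveal who created them. A minimal Go helper:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Unconfigured loop devices have no backing_file entry and are skipped.
	entries, err := os.ReadDir("/sys/block")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		if !strings.HasPrefix(e.Name(), "loop") {
			continue
		}
		backing, err := os.ReadFile(filepath.Join("/sys/block", e.Name(), "loop", "backing_file"))
		if err != nil {
			continue
		}
		fmt.Printf("%s -> %s\n", e.Name(), strings.TrimSpace(string(backing)))
	}
}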

I appreciate your time reviewing and answering these questions. Thank you!
