Description
This is a global description for next week summarizing the current status of the code.
Branches
aghosn_dev
Setup
Tyche + Linux
This is unchanged and should work the same as before.
Gramine
You don't need to clone the repository; it will be cloned within Tyche-bench.
Tyche-bench
I added a lot of commands to do the entire setup in this repository.
There is a readme that should describe everything.
At a high level, the setup recipes create a to-copy directory that contains everything that should be copied to the VM.
The gramine subfolder is path-sensitive and must be placed at / in the VM.
The others can be moved anywhere. To check that your setup works, once you have copied the folders, do:
cd /path/to/gramine-benchmarks
# This should run the gramine-linux helloworld (not tyche)
make helloworld
# This should run the gramine-tyche helloworld
sudo TYCHE=1 make helloworld
The Makefile inside gramine-benchmarks attempts to run both the gramine program and the measurement logic (e.g., the client making HTTP requests). I copied the default gramine benchmark logic for the moment, but we can change that (especially if we have issues with wrk).
For benchmarks that create an output measurement, the results are automatically populated inside gramine-benchmarks/results/name-of-the-app/[results+date].txt.
For the moment, we have the following benchmarks:
- helloworld
- rust (runs a Rust hyper HTTP server)
- sqlite
- lighttpd
- redis
Make sure sqlite3 is installed on the VM with sudo apt install sqlite3.
I also played around with other benchmarks that are not fully supported:
- memcached (present in gramine original repo): requires 16 threads, so 16 cores.
- gzip (not in gramine original repo): uses Linux pipes, which gramine does not support in a confidential setting.
- blender (present in gramine original repo): requires 64 threads.
- Llama (not in gramine original repo): requires a lot of threads and memory, and the models are big.
Potential issues:
- missing wrk: wrk is a small program used by gramine to drive network-related benchmarks. We compile it from source and place it inside to-copy/my_bin. It should be added to the VM somewhere within a path that's available to the user AND the sudo command.
- segfault in wrk: I have observed that sometimes the command fails unexpectedly. Maybe we could fix/replace that.
- lighttpd not found: lighttpd is also compiled from source and placed inside /gramine/utils/lighttpd. It should normally be found by the gramine manifest.
- Unexpected tyche driver alreadyaliased error: I only had that once or twice while running benchmarks. It might require rebooting the machine.
- Threads vs. Cores: we do not have a thread abstraction and require one core per declared thread in gramine. Hopefully we won't need it or we will implement support for threads later...
- Gramine inside TD1: Gramine-tyche requires confidential memory by default so I do NOT EXPECT it to work inside TD1 just yet. We need to either (1) figure out how to run confidential VMs, or (2) allow gramine to run sandboxes rather than enclaves.
Changes in Linux tyche driver
To support gramine I had to extend the driver to be more flexible in its memory management.
The main fixes are:
Allow arbitrary size mmap allocation
We used to reject mmaps larger than MAX_ORDER, which is the Linux limit for alloc_exact_pages.
I removed that limitation by allowing one user mmap to be backed by multiple alloc_exact_pages allocations (segments). The driver aggressively attempts to keep mmapped segments sorted and merges them when contiguous.
This means that from userspace one mmap appears contiguous, but inside the driver's state this might not be the case.
One mmap to rule them all
We used to have duplicated logic between contalloc and tyche-driver for mmaps.
I removed the duplication by making contalloc call tyche-driver's implementation of mmaps.
Support for foreign mmap values
Gramine requires memory to hold futexes shared between the untrusted and trusted worlds.
Unfortunately, memory mapped by a driver is marked as VM_IO, which prevents it from hosting futexes.
To solve this issue, I added calls to the driver that allow registering a memory segment allocated (mmapped) by Linux rather than by the driver.
The driver goes through each page of the mmapped region and creates the corresponding contiguous physical-memory mmap segments in the driver's state.
This potentially leads to a lot of fragmentation, so we attempt to keep such regions small.
Support for TD1 with memory aliasing in drivers
As we want to run enclaves inside TD1, and since TD1 might be aliased (gpa != hpa), we need to figure out real physical addresses in the driver (hpa).
I added a new GET_HPA call to the tyche monitor that takes a GPA and a gpa_size.
It returns an HPA and an hpa_size such that, if [GPA, GPA+gpa_size) lies within one contiguous HPA segment, gpa_size == hpa_size.
Otherwise, it returns the size until the end of the contiguous segment. Here is an example of a complex scenario:
# Guest physical address
gpa
# Guest physical address size
gpa_size
# GPA world view of memory
[gpa..........................gpa+gpa_size]
# HPA world view, where x + y == gpa_size
[hpa.......hpa+x] [hpa2....hpa2+y]
# Tyche-driver calls and logic.
get_hpa(gpa, gpa_size) = (hpa, x)
// Register one mmap segment [gpa, hpa, size: x]
get_hpa(gpa+x, gpa_size-x) = (hpa2, y)
// Register another mmap segment [gpa+x, hpa2, y]
// Had the second HPA segment been larger than the remainder (y > gpa_size-x),
// the call would have returned:
get_hpa(gpa+x, gpa_size-x) = (hpa2, gpa_size-x)
// Note that the tyche monitor does not modify any capability for GET_HPA:
// it simply parses the tracker info and the alias info to figure out the mappings.
Research on CMA
For RISC-V we might need to enable the (C)ontiguous (M)emory (A)llocator.
This tells Linux to reserve some memory for later use by devices.
Apparently we can enable it inside the Linux config by adding:
CONFIG_CMA=y
CONFIG_CMA_SIZE_MBYTES=64 # For example, setting CMA size to 64MB
We can also (I don't know if we need both or either) change the Linux command line:
I am not sure about the difference between the first and the second argument.
Inside the driver, there are apparently different ways to allocate the memory (to replace alloc_exact_pages):
#include <linux/cma.h>
#include <linux/dma-mapping.h>
...
void *cma_mem;
dma_addr_t dma_handle;
// Returns a kernel virtual address; the DMA (typically physical) address is written to dma_handle
cma_mem = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
if (!cma_mem) {
pr_err("Failed to allocate CMA memory\n");
return -ENOMEM;
}
...
void *cma_mem;
dma_addr_t dma_handle;
// Likewise: the return value is a kernel virtual address, and dma_handle receives the DMA address.
cma_mem = dma_alloc_attrs(dev, size, &dma_handle, GFP_KERNEL, DMA_ATTR_FORCE_CONTIGUOUS);
if (!cma_mem) {
pr_err("Failed to allocate CMA memory\n");
return -ENOMEM;
}
...
struct page *cma_pages;
unsigned long count = size >> PAGE_SHIFT;
// Returns the first struct page of a physically contiguous range; get the
// physical address with page_to_phys(). Note: on recent kernels the last
// argument is a bool no_warn rather than GFP flags.
cma_pages = cma_alloc(cma_area, count, 0, GFP_KERNEL);
if (!cma_pages) {
pr_err("Failed to allocate CMA memory\n");
return -ENOMEM;
}
KVM & confidential VMs
We are trying to figure out what requires KVM to read VM memory so that we can evaluate the feasibility of turning memory confidential.
The main function that attempts to access VM memory is kvm_vcpu_read_guest_page in virt/kvm/kvm_main.c.
I have a partial trace with stack dumps here that seems to show that x86_decode_instruction is the main culprit.
However, it doesn't seem to be the only one responsible.
I have another trace that prints the gfn and pfn for each access, and a corresponding dump of capabilities here.
Both traces are partial because there are A LOT of them.
These traces were produced using qemu; the following one is with lkvm: here
I think most of them are due to IOInstructions, and some are EPT violations during boot at what seems to be the SCSI initialization.
Once booted, we seem to have mostly this for qemu:
[INFO | tyche::x86_64::platform] Handling EptViolation for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling EptViolation for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling EptViolation for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling EptViolation for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling EptViolation for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling ApicWrite for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling EptViolation for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
And this for lkvm:
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
Apparently there are a lot of emulate seg read calls in arch/x86/emulate.c inside x86_emulate_insn, which seem to be the IO instruction reads? The code in emulate.c says it reads memory from a special area of memory.
The Linux instrumentation to obtain the prints can be found here
Potential solutions:
- Confidential VMs: Maybe if we can locate the Linux kernel code in memory (physical range) we can leave Read access rights to dom0 for these regions. That might be enough to emulate most instructions? We would need to mark the APIC as RW for dom0 too.
- Non-Confidential VM + enclaves: We could use a CMA allocator for enclaves inside td1. We'd only need to mark the CMA region as confidential in that case.