Description
This is a global description for next week summarizing the current status of the code.
Branches
aghosn_dev
Setup
Tyche + Linux
This is unchanged and should work the same as before.
Gramine
You don't need to clone the repository; it will be cloned within Tyche-bench.
Tyche-bench
I added a lot of commands to do the entire setup in this repository.
There is a readme that should describe everything.
At a high level, the setup recipes create a to-copy directory that contains everything that should be copied to the VM.
The gramine subfolder is path-sensitive and must be placed at / in the VM.
The others can be moved anywhere. To check that your setup works, once you have copied the folders, do:
cd /path/to/gramine-benchmarks
# This should run the gramine-linux helloworld (not tyche)
make helloworld
# This should run the gramine-tyche helloworld
sudo TYCHE=1 make helloworld
The Makefile inside gramine-benchmarks attempts to run both the gramine program and the measurement logic (e.g., the client making HTTP requests). I copied the default gramine benchmark logic for the moment, but we can change that (especially if we have issues with wrk).
For benchmarks that create an output measurement, the results are automatically populated inside gramine-benchmarks/results/name-of-the-app/[results+date].txt.
For the moment, we have the following benchmarks:
- helloworld
- rust (runs a Rust hyper HTTP server)
- sqlite
- lighttpd
- redis
Make sure sqlite3 is installed on the VM with sudo apt install sqlite3.
I also played around with other benchmarks that are not fully supported:
- memcached (present in gramine original repo): requires 16 threads, so 16 cores.
- gzip (not in gramine original repo): uses Linux pipes, which gramine does not support in a confidential setting.
- blender (present in gramine original repo): requires 64 threads.
- Llama (not in gramine original repo): requires a lot of threads and memory, and the models are big.
Potential issues:
- missing wrk: wrk is a small program used by gramine to drive network-related benchmarks. We compile it from source and place it inside to-copy/my_bin. It should be added to the VM somewhere within a path that's available to the user AND the sudo command.
- segfault in wrk: I have observed that sometimes the command fails unexpectedly. Maybe we could fix/replace that.
- lighttpd not found: lighttpd is also compiled from source and placed inside /gramine/utils/lighttpd. It should normally be found by the gramine manifest.
- Unexpected tyche driver alreadyaliased error: I only had that once or twice while running benchmarks. It might require rebooting the machine.
- Threads vs. Cores: we do not have a thread abstraction and require one core per declared thread in gramine. Hopefully we won't need it or we will implement support for threads later...
- Gramine inside TD1: Gramine-tyche requires confidential memory by default so I do NOT EXPECT it to work inside TD1 just yet. We need to either (1) figure out how to run confidential VMs, or (2) allow gramine to run sandboxes rather than enclaves.
Changes in Linux tyche driver
To support gramine I had to extend the driver to be more flexible in its memory management.
The main fixes are:
Allow arbitrary size mmap allocation
We used to reject mmaps larger than MAX_ORDER, which is the Linux limit for alloc_exact_pages.
I removed that limitation by allowing one user mmap to be backed by multiple alloc_exact_pages allocations (segments). The driver aggressively attempts to keep mmapped segments sorted and merges them when contiguous.
This means that from userspace one mmap appears contiguous, but inside the driver's state this might not be the case.
One mmap to rule them all
We used to have duplicated logic between contalloc and tyche-driver for mmaps.
I removed the duplication by making contalloc call tyche-driver's implementation of mmaps.
Support for foreign mmap values
Gramine requires memory to hold futexes shared between the untrusted and trusted worlds.
Unfortunately, memory mapped by a driver is marked as VM_IO, which prevents it from hosting futexes.
To solve this issue, I added calls to the driver that allow registering a memory segment allocated (mmapped) by Linux rather than by the driver.
The driver goes through each page of the mmapped region and creates the corresponding contiguous physical-memory mmap segments in the driver's state.
This potentially leads to a lot of fragmentation, so we attempt to keep such regions small.
Support for TD1 with memory aliasing in drivers
As we want to run enclaves inside TD1, and since TD1 might be aliased (gpa != hpa), we need to figure out real physical addresses in the driver (hpa).
I added a new GET_HPA call to the tyche monitor that takes a GPA and a gpa_size.
It returns an HPA and an hpa_size such that, if [GPA, GPA+gpa_size) lies within one contiguous HPA segment, gpa_size == hpa_size.
Otherwise, it returns the size until the end of the contiguous segment. Here is an example of a complex scenario:
# Guest physical address
gpa
# Guest physical address size
gpa_size
# GPA world view of memory
[gpa..........................gpa+gpa_size]
# HPA world view, where x + y == gpa_size
[hpa.......hpa+x] [hpa2....hpa2+y]
# Tyche-driver calls and logic.
get_hpa(gpa, gpa_size) = (hpa, x)
// Register one mmap segment [gpa, hpa, size: x]
get_hpa(gpa+x, gpa_size-x) = (hpa2, y)
// Register another mmap segment [gpa+x, hpa2, y]
// Had the second HPA segment been larger than the remainder (y > gpa_size-x),
// the call would have returned:
get_hpa(gpa+x, gpa_size-x) = (hpa2, gpa_size-x)
// Note that the tyche monitor does not modify any capability for GET_HPA:
// it simply parses the tracker info and the alias info to figure out the mappings.
Research on CMA
For RISC-V we might need to enable the (C)ontiguous (M)emory (A)llocator.
This tells Linux to reserve some memory for later use by devices.
Apparently we can enable it inside the Linux config by adding:
CONFIG_CMA=y
CONFIG_CMA_SIZE_MBYTES=64 # For example, setting CMA size to 64MB
We can also (I don't know if we need both or either) change the Linux command line:
I am not sure about the difference between the first and the second argument.
Inside the driver, there are apparently different ways to allocate the memory (to replace alloc_exact_pages):
#include <linux/cma.h>
#include <linux/dma-mapping.h>
...
void *cma_mem;
dma_addr_t dma_handle;
// Returns a kernel virtual address; the DMA (typically physical) address is written to dma_handle
cma_mem = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
if (!cma_mem) {
pr_err("Failed to allocate CMA memory\n");
return -ENOMEM;
}
...
void *cma_mem;
dma_addr_t dma_handle;
// Likewise: the return value is a kernel virtual address, and dma_handle receives the DMA address.
cma_mem = dma_alloc_attrs(dev, size, &dma_handle, GFP_KERNEL, DMA_ATTR_FORCE_CONTIGUOUS);
if (!cma_mem) {
pr_err("Failed to allocate CMA memory\n");
return -ENOMEM;
}
...
struct page *cma_pages;
unsigned long count = size >> PAGE_SHIFT;
// Returns the first struct page of a physically contiguous range; get the
// physical address with page_to_phys(). Note: on recent kernels the last
// argument is a bool no_warn rather than GFP flags.
cma_pages = cma_alloc(cma_area, count, 0, GFP_KERNEL);
if (!cma_pages) {
pr_err("Failed to allocate CMA memory\n");
return -ENOMEM;
}
KVM & confidential VMs
We are trying to figure out what requires KVM to read VM memory so that we can evaluate the feasibility of turning memory confidential.
The main function that attempts to access VM memory is kvm_vcpu_read_guest_page in virt/kvm/kvm_main.c.
I have a partial trace with stack dumps here that seems to show that x86_decode_instruction is the main culprit.
However, it doesn't seem to be the only one responsible.
I have another trace that prints the gfn and pfn for each access, and a corresponding dump of capabilities here.
Both traces are partial because there are A LOT of them.
These traces were produced using qemu; the following one is with lkvm: here
I think most of them are due to IOInstructions, and some are EPT violations during boot at what seems to be the SCSI initialization.
Once booted, we seem to have mostly this for qemu:
[INFO | tyche::x86_64::platform] Handling EptViolation for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling EptViolation for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling EptViolation for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling EptViolation for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling EptViolation for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling ApicWrite for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling EptViolation for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
And this for lkvm:
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
[INFO | tyche::x86_64::platform] Handling Hlt for dom 2 on core 0
Apparently there are a lot of emulate seg read calls in arch/x86/emulate.c inside x86_emulate_insn, which seem to be the IO instruction reads? The code in emulate.c says it reads memory from a special area of memory.
The Linux instrumentation to obtain the prints can be found here
Potential solutions:
- Confidential VMs: Maybe if we can locate the Linux kernel code in memory (physical range) we can leave Read access rights to dom0 for these regions. That might be enough to emulate most instructions? We would need to mark the APIC as RW for dom0 too.
- Non-Confidential VM + enclaves: We could use a CMA allocator for enclaves inside td1. We'd only need to mark the CMA region as confidential in that case.