Genode on seL4 - IPC and virtual memory

This is the second part of a series of hands-on articles about bringing Genode to the seL4 kernel. Read the previous part here...

After having created a minimalistic root task consisting of two threads, we can move forward with exercising the functionality provided by the kernel, namely inter-process communication and the handling of virtual memory. Once we have tested those functionalities in our minimalistic root task environment, we will be able to apply the gained knowledge to the actual porting effort of Genode's core component.

Inter-process communication

In the L4 universe, the term IPC (inter-process communication) usually stands for synchronous communication between two threads. In seL4, IPC has two uses. First, it enables threads of different protection domains (or the same protection domain) to exchange messages. So information can be transferred across protection-domain boundaries. Second, IPC is the mechanism used to delegate access rights throughout the system. This is accomplished by sending capabilities as message payload. When a capability is part of a message, the kernel translates the local name of the capability in the sender's protection domain to a local name in the receiver's protection domain.

In Genode, IPC is realized via two thin abstractions that build upon each other:

At the low level, the IPC library src/base/ipc/ipc.cc is responsible for sending and receiving messages using the kernel mechanism. It has a generic interface base/include/base/ipc.h, which supports the marshalling and un-marshalling of message arguments and capabilities using C++ insertion and extraction operators. Genode users never directly interact with the IPC library.
Built on top the IPC library, the so-called RPC framework adds the notion of RPC functions and RPC objects. RPC interfaces are declared using abstract C++ base classes with a few annotations. Under the hood, the RPC framework uses C++ meta-programming techniques to turn RPC definitions into code that transfers messages via the IPC library. In contrast to the IPC library, the RPC library is platform-agnostic.

To enable Genode's RPC mechanism on seL4, we merely have to provide a seL4-specific IPC library implementation. To warm up with seL4's IPC mechanism, however, we first modify our existing test program (see the first part) to let the main thread perform an IPC call to the second thread.

To let the second thread receive IPC messages, we first need to create a synchronous IPC endpoint using the seL4_Untyped_RetypeAtOffset function with seL4_EndpointObject as type, an offset that skips the already allocated TCB (the TCB object has a known size of 1024 bytes), and the designated capability number (EP_CAP).

As a first test, we want the second thread to receive an incoming message. So we change the entry function as follows:

 PDBG("call seL4_Wait");
 seL4_MessageInfo_t msg_info = seL4_Wait(EP_CAP, nullptr);
 PDBG("returned from seL4_Wait, call seL4_Reply");
 seL4_Reply(msg_info);
 PDBG("returned from seL4_Reply");

At the end of the main function (the context of the first thread), we try to call the second thread via seL4_Call:

 PDBG("call seL4_Call");
 seL4_MessageInfo_t msg_info = seL4_MessageInfo_new(0, 0, 0, 0);
 seL4_Call(EP_CAP, msg_info);
 PDBG("returned from seL4_Call");

When executing the code, we get an error as follows:

 int main(): call seL4_Call
 void second_thread_entry(): call seL4_Wait
 Caught cap fault in send phase at address 0x0
 while trying to handle:
 vm fault on data at address 0x4 with status 0x6
 in thread 0xe0100080 at address 0x10002e1

By looking at the output of objdump -lSd, we see that fault happens at the instruction

 mov %edi,%gs:0x4(,%ebx,4)

The issue is the same as the one we experienced before for the main thread - we haven't initialized the GS register with a proper segment, yet. This can be easily fixed by adding a call to our init_ipc_buffer function right at the start of the second thread's entry function. Still, the program does not work yet:

 vm fault on data at address 0x4 with status 0x6
 in thread 0xe0100080 at address 0x10002e8

Looking at the objdump output, we see that the fault still happens at the same instruction. So what is missing? The answer is that we haven't equipped the second thread with a proper IPC buffer. The attempt to call seL4_Wait, however, tries to access the IPC buffer of the calling thread. The IPC buffer can be configured for a thread using the seL4_TCB_SetIPCBuffer function. But wait - what arguments do we need to pass? In addition to the TCB capability, there are two arguments: a pointer to the IPC buffer and a page capability, which contains the IPC buffer. Well, I had hoped to get away without dealing with the memory management at this point. I figure that setting up the IPC buffer for the second thread would require me to create a seL4_IA32_4K frame object via seL4_Untyped_RetypeAtOffset and insert a mapping of the page frame within the roottask's address space, and possibly also create and install a page-table object.

To avoid becoming side-tracked by those memory-management issues, I decide to assign the IPC buffer of the second thread right at the same page as the one for the initial thread. Both the local address and the page capability for the initial thread's IPC buffer are conveniently provided by seL4's boot info structure. So let's give this a try:

 /* assign IPC buffer to second thread */
 {
   static_assert(sizeof(seL4_IPCBuffer) % 512 == 0,
                 "unexpected seL4_IPCBuffer size");

   int const ret = seL4_TCB_SetIPCBuffer(SECOND_THREAD_CAP,
                                         (seL4_Word)(bi->ipcBuffer + 1),
                                         seL4_CapInitThreadIPCBuffer);

   PDBG("seL4_TCB_SetIPCBuffer returned %d", ret);
 }

With the initialization of the IPC buffer in place, we finally get our desired output:

 int main(): call seL4_Call
 void second_thread_entry(): call seL4_Wait
 void second_thread_entry(): returned from seL4_Wait, call seL4_Reply
 int main(): returned from seL4_Call
 void second_thread_entry(): returned from seL4_Reply

Delegation of capabilities via IPC

The seL4 kernel supports the delegation of capabilities across address-space boundaries by the means of synchronous IPC. As Genode fundamentally relies on such a mechanism, I decide to give it a try by extending the simple IPC test. Instead of letting the main thread call the second thread without any arguments, the main thread will pass the thread capability of the second thread as argument. Upon reception of the call, the second thread will find a capability in its IPC buffer. To validate that the received capability corresponds to the thread cap, the second thread issues a seL4_TCB_Suspend operation on the received cap. It is supposed to stop its execution right there. This experiment requires the following steps:

At the caller side, we need to supply a capability as argument to the seL4_Call operation by specifying the number of capabilities to transfer at the extraCaps field of the seL4_MessageInfo, and marshalling the index of the capability via the seL4_SetCap function (specifying SECOND_THREAD_CAP as argument).
At the callee side, we need to define where to receive an incoming capability. First, we have to reserve a CNode slot designated for the new capability. For the test, a known-free index will do:
```
 enum { RECV_CAP = 0x102 };
```
Second, we have to configure the IPC buffer of the second thread to point to the RECV_CAP:
```
 seL4_SetCapReceivePath(seL4_CapInitThreadCNode, RECV_CAP, 32);
```
We specify 32 as receive depth because the CNode of the initial thread has a size of 2^12 and a guard of 20.

At this point, I am wondering that there is apparently no way to specify a receive window rather than an individual CNode for receiving capabilities. After revisiting Section 4.2.2 of the manual, I came to the realization that that seL4 does not support delegating more than one capability in a single IPC. From Genode's perspective, this could become an issue because Genode's RPC framework generally allows for the delegation of multiple capabilities via a single RPC call.

That said, the simple capability-delegation test works as expected.

When repeatedly performing an IPC call with a delegated capability, the RECV_CAP index will be populated by the first call. Subsequent attempts to override the RECV_CAP capability do not work (the extraCaps field of the received message info remains 0). The receiver has to make sure that the specified CapReceivePath is an empty capability slot. I.e., by calling seL4_CNode_Delete prior seL4_Wait.

Translation of capabilities aka "unwrapping"

In addition to receiving delegated capabilities under a new name, seL4's IPC mechanism allows the recipient of a capability that originated from the recipient to obtain a custom defined value instead of receiving a new name. In seL4 terminology, this mechanism is called "unwrapping".

Capability unwrapping is supposed to happen if the transferred capability is a minted endpoint capability and the recipient is the original creator of the capability. For testing the mechanism, we replace the SECOND_THREAD_CAP argument of the seL4_Call by a minted endpoint capability derived from the endpoint used by the second thread.

For creating a minted endpoint capability, we allocate a new index for the minted capability (EP_MINTED_CAP) and use the seL4_CNode_Mint operation as follows:

 seL4_CNode     const service    = seL4_CapInitThreadCNode;
 seL4_Word      const dest_index = EP_MINTED_CAP;
 uint8_t        const dest_depth = 32;
 seL4_CNode     const src_root   = seL4_CapInitThreadCNode;
 seL4_Word      const src_index  = EP_CAP;
 uint8_t        const src_depth  = 32;
 seL4_CapRights const rights     = seL4_Transfer_Mint;
 seL4_CapData_t const badge      = seL4_CapData_Badge_new(111);

 int const ret = seL4_CNode_Mint(service,
                                 dest_index,
                                 dest_depth,
                                 src_root,
                                 src_index,
                                 src_depth,
                                 rights,
                                 badge);

The badge is set to the magic value 111. When specifying the resulting EP_MINTED_CAP as IPC argument for a seL4_Call, the kernel will translate the capability to the badge value. The callee observes the reception of such an "unwrapped" capability via the capsUnwrapped field of the seL4_MessageInfo structure returned by the seL4_Wait operation. The badge value can be obtained from the IPC buffer via seL4_GetBadge(0). This simple experiment shows that the mechanism works as expected.

Management of virtual memory

Besides synchronous IPC, Genode relies on two other forms of inter-process communication, namely asynchronous notifications and shared memory. I will save the investigation of the former mechanism for later but focus on seL4's mechanisms for managing virtual memory for now.

Virtual-memory management in traditional L4 kernels

Traditional L4 kernels rely on an in-kernel mapping database to track memory mappings. If a protection domain has access to a range of memory pages, it can transfer a "mapping" of those memory pages to other protection domains as IPC payload. As a side effect of the IPC operation, the kernel populates the page tables of the receiver of the mapping accordingly. In reverse, the originator of a mapping can revoke the mapping by using a system call. This system call takes a virtual-address range of the calling protection domain as argument and flushes all mappings that originated from this range. Because mappings could be established transitively, the kernel has to maintain a tree structure (mapping database) that records how each memory mapping got established. This approach has several problems:

First, it intertwines two concerns, namely the delegation of the authority over memory pages and making memory pages accessible. In order to have the authorization over memory pages, i.e., the right to hand them out to other protection domains or to revoke them from other protection domains, a protection domain has to make the pages accessible within its own virtual address space. For Genode's core component, which needs to have the authority over the entirety of physical memory, this raises two problems: First, even though core is not supposed to ever touch memory pages that are in use by other components, it still has to make those memory pages accessible within its virtual address space. This complicates the assessment of the isolation properties of core with respect to the components on top. For example, a dangling pointer bug in core could leak or corrupt information that should be local to an individual component. Second, the virtual address space of core limits the amount of physical memory that can be handed out to other components. On a 32-bit machine with 4 GiB of memory, core can hand out a maximum of 3 GiB to other components because the upper 1 GiB of its virtual address space is owned by the kernel.

Second, the mapping database keeps records about how mappings got established. Thereby, the memory required for storing this information in the kernel depends on the behaviour of the user land. As a consequence, a malicious user-level program is able to provoke a high consumption of kernel memory by establishing mappings. Eventually, this represents an attack vector for denial-of-service attacks onto the kernel.

Virtual-memory management in seL4

Fortunately, the seL4 kernel breaks with this traditional approach. Authority over memory pages is represented by capabilities. In fact, in order to use a memory page, a capability for the memory page must exist. This capability can be delegated. But the possession of the authority over a memory page does not imply that the memory page is accessible. Very good! Second, seL4 dropped the traditional mapping data base. It does not record how memory mappings were established. But it provides an thin abstraction of page tables that define who has access to which memory page. The backing store for those page tables is managed outside the kernel. This effectively alleviates the denial-of-service attack vector present in traditional L4 kernels.

To test the virtual-memory management facilities of seL4, I decide to conduct a simple experiment: I want to attach a page of physical memory into the virtual address space of the root task twice. I will write a string at the first virtual address and expect to read the same string from the second virtual address. Compared to other kernels, this simple scenario is quote elaborative on seL4 because the kernel barely abstracts from the MMU hardware. I have to perform the following steps:

Creating a page table and a page frame. I want to attach the page somewhere at 0x40000000, which is an address range not yet in use but the root task. So I have to create an seL4_IA32_PageTableObject first. This is done by the seL4_Untyped_RetypeAtOffset operation as for all other kernel objects. In order to be able to use a portion of physical memory as actual memory, we also have to convert untyped memory to a seL4_IA32_4K kernel object, also using the Untyped_Retype operation.

With the page table created, we can tell seL4 to insert the page table into the page directory of the root task:

 seL4_IA32_PageTable     const service = PAGE_TABLE_CAP;
 seL4_IA32_PageDirectory const pd      = seL4_CapInitThreadPD;
 seL4_Word               const vaddr   = 0x40000000;
 seL4_IA32_VMAttributes  const attr    = seL4_IA32_Default_VMAttributes;

 int const ret = seL4_IA32_PageTable_Map(service, pd, vaddr, attr);

Now that the virtual memory range at 0x40000000 is backed by a page table, we can finally insert our page frame at the designated virtual address:

 seL4_IA32_Page          const service = PAGE_CAP;
 seL4_IA32_PageDirectory const pd      = seL4_CapInitThreadPD;
 seL4_Word               const vaddr   = 0x40001000;
 seL4_CapRights          const rights  = seL4_AllRights;
 seL4_IA32_VMAttributes  const attr    = seL4_IA32_Default_VMAttributes;

 int const ret = seL4_IA32_Page_Map(service, pd, vaddr, rights, attr);

After these steps, root task is able to touch the memory at 0x40001000 without crashing, which is just expected.

However, the attempt to attach the same page at a second virtual address fails:

 <<seL4 [decodeIA32FrameInvocation/1630 Te3ffd880 @100025f]:
   IA32Frame: Frame already mapped.>>

This is the point where the similarity of seL4's kernel interface to real page tables ends. With real page tables, one physical page frame can be present in any number of page tables by simply writing the frame number into the corresponding page-table entry. In principle, a frame number corresponds to an IA32_4K capability. But in contrast to a frame number, which can be reused for any number of page-table entries, on seL4, each insertion of a page frame into a page table requires a distinct IA32_4K capability. For a given IA32_4K capability, the creation of a second capability that refers to the same frame is easy enough:

 seL4_CNode     const service    =  seL4_CapInitThreadCNode;
 seL4_Word      const dest_index = PAGE_CAP_2;
 uint8_t        const dest_depth = 32;
 seL4_CNode     const src_root   = seL4_CapInitThreadCNode;
 seL4_Word      const src_index  = PAGE_CAP;
 uint8_t        const src_depth  = 32;
 seL4_CapRights const rights     = seL4_AllRights;

 int const ret = seL4_CNode_Copy(service, dest_index, dest_depth,
                                 src_root, src_index, src_depth, rights);

Inserting the page mapping using the copy of the original IA32_4K capability works:

 seL4_IA32_Page          const service = PAGE_CAP_2;
 seL4_IA32_PageDirectory const pd      = seL4_CapInitThreadPD;
 seL4_Word               const vaddr   = 0x40002000;
 seL4_CapRights          const rights  = seL4_AllRights;
 seL4_IA32_VMAttributes  const attr    = seL4_IA32_Default_VMAttributes;

 int const ret = seL4_IA32_Page_Map(service, pd, vaddr, rights, attr);

The subsequent test of writing a string to 0x40001000 and reading it from 0x40002000 produces the desired result.

Sentiments

Compared to traditional L4 kernels, the virtual memory management of seL4 is much more advanced. It solves pressing problems that plagued L4 kernels since an eternity. This is extremely valuable!

On the other hand, it does so by putting the burden of kernel-resource management onto the user land while further complicating the problem compared to the underlying mechanism provided by the MMU hardware. I understand that the copies of the page-frame capabilities are needed to enable to kernel to find and flush all page-table entries when a page frame is destroyed. This is a fundamental security property of the kernel.

In Genode, each range of physical memory is represented as a dataspace that can be attached to different or the same virtual address spaces any number of times. The seemingly small detail that the population of each page-table entry requires a dedicated kernel object raises quite a challenge. We do not only need to perform the book keeping of the physical page frames and their corresponding capability numbers, but additionally need to create and keep track of an additional kernel object (a copy of the page-frame capability) each time a physical page is mapped to a virtual address space. It goes without saying that all the book-keeping meta data must be allocated somewhere and accounted properly. Fortunately, I see how Genode's resource-trading mechanisms could come to the rescue here.