Execution on the NOVA microhypervisor (base-nova)

NOVA is a so-called microhypervisor, a term denoting the combination of a microkernel and a virtualization platform (hypervisor). It is a high-performance microkernel for the x86 architecture. In contrast to other microkernels, it has been designed for hardware-based virtualization via user-level virtual-machine monitors. In line with Genode's architecture, NOVA's kernel interface is based on capability-based security. Hence, the kernel fully supports the kernel model expected by Genode as described in Section Capability-based security.

NOVA website

http://hypervisor.org

NOVA kernel-interface specification

https://github.com/udosteinberg/NOVA/raw/master/doc/specification.pdf

Integration of NOVA with Genode

The NOVA kernel is available via Genode's ports mechanism detailed in Section Integration of 3rd-party software. The port description is located at repos/base-nova/ports/nova.port.

Building the NOVA kernel

Even though NOVA is a third-party kernel with a custom build system, the kernel is built directly by the Genode build system. NOVA's build system remains unused.

From within a Genode build directory configured for one of the nova_x86_32 or nova_x86_64 platforms, the kernel can be built via

 make kernel

The build description for the kernel is located at repos/base-nova/src/kernel/target.mk.

System-call bindings

NOVA is not accompanied by bindings to its kernel interface. There is only a description of the kernel interface available in the form of the kernel specification. For this reason, Genode maintains the kernel bindings for NOVA within the Genode source tree. The bindings are located at repos/base-nova/include/ in the subdirectories nova/, spec/32bit/nova/, and spec/64bit/nova/.

Bootstrapping of a NOVA-based system

After finishing its initialization, the kernel starts the second boot module, the first being the kernel itself, as root task. The root task is Genode's core. The virtual address space of core contains the text and data segments of core, the UTCB of the initial execution context (EC), and the hypervisor info page (HIP). Details about the HIP are provided in Section 6 of the NOVA specification.

BSS section of core

The kernel's ELF loader does not support the concept of a BSS segment. It simply maps the physical pages of core's text and data segments into the virtual memory of core but does not allocate any additional physical pages for backing the BSS. For this reason, the NOVA version of core does not use the genode.ld linker script as described in Section Linker scripts but the linker script located at repos/base-nova/src/core/core.ld. This version hosts the BSS section within the data segment. Thereby, the BSS is physically present in the core binary in the form of zero-initialized data.

Initial information provided by NOVA to core

The kernel passes a pointer to the HIP to core as the initial value of the ESP register. Genode's startup code saves this value in the global variable _initial_sp (Section Startup code).
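
The following sketch illustrates how core may interpret this value. The Hip structure shown here is hypothetical and truncated; the actual layout is defined by the NOVA specification and mirrored by the kernel bindings under repos/base-nova/include/.

 /* sketch only - the real HIP layout is defined by the NOVA specification */
 #include <cstdint>

 extern std::uintptr_t _initial_sp;   /* saved by Genode's startup code */

 struct Hip                           /* hypothetical, truncated layout */
 {
     std::uint32_t signature;         /* magic value identifying the HIP */
     /* ... further fields according to the NOVA specification ... */
 };

 static Hip const &hip()
 {
     return *reinterpret_cast<Hip const *>(_initial_sp);
 }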

Log output on modern PC hardware

Because transmitting information over legacy comports does not require complex device drivers, serial output over comports is still the predominant way to output low-level system logs like kernel messages or the output of core's LOG service.

Unfortunately, most modern PCs lack dedicated comports. This leaves two options to obtain low-level system logs.

  1. The use of vendor-specific platform-management features such as Intel vPro / Intel Active Management Technology (AMT) or the Intelligent Platform Management Interface (IPMI). These platform features are able to emulate a legacy comport and provide the serial output over the network. Unfortunately, those solutions are not uniform across different vendors, difficult to use, and tend to be unreliable.

  2. The use of a PCI card or an Express Card that provides a physical comport. When using such a device, the added comport appears as a PCI I/O resource. Because the device interface is compatible with legacy comports, no special drivers are needed.

The latter option allows the retrieval of low-level system logs on hardware that lacks special management features. In contrast to the legacy comports, however, it has the minor disadvantage that the location of the device's I/O resources is not known beforehand. The I/O port range of the comport depends on the device-enumeration procedure of the BIOS. To enable the kernel to output information over this comport, the kernel must be configured with the I/O port range as assigned by the BIOS on the specific machine. One kernel binary cannot simply be used across different machines.

The Bender chain boot loader

To alleviate the need to adapt the kernel configuration to the used comport hardware, the bender chain boot loader can be used.

Bender is part of the MORBO tools

https://github.com/TUD-OS/morbo

Instead of starting the NOVA hypervisor directly, the multiboot-compliant boot loader (such as GRUB) starts bender as the kernel. All remaining boot modules including the real kernel have already been loaded into memory by the original boot loader. Bender scans the PCI bus for a comport device. If such a device is found (e.g., an Express Card), it writes the information about the device's I/O port range to a known offset within the BIOS data area (BDA).

After the comport-device probing is finished, bender passes control to the next boot module, which is the real kernel. The comport device driver of the kernel does not use a hard-coded I/O port range for the comport but looks up the comport location in the BDA. The use of bender is optional. When not used, the BDA always contains the I/O port range of the legacy comport 1.
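
For illustration, a comport driver may look up the I/O port base as sketched below. The sketch assumes the conventional PC layout of the BDA, where the 16-bit I/O port base of the first serial port resides at physical address 0x400, and that this memory is accessible via an identity mapping.

 /* sketch: read the comport I/O port base from the BIOS data area (BDA);
    assumes the first physical page is identity-mapped and that the first
    serial-port slot at physical address 0x400 holds the port base */
 #include <cstdint>

 static inline std::uint16_t comport_io_base()
 {
     return *reinterpret_cast<volatile std::uint16_t *>(0x400);
 }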

The Genode source tree contains a pre-compiled binary of bender at tool/boot/bender. This binary is automatically incorporated into boot images for the NOVA base platform when the run tool (Section Run tool) is used.

Relation of NOVA's kernel objects to Genode's core services

For the terminology of NOVA's kernel objects, refer to the NOVA specification mentioned in the introduction of Section Execution on the NOVA microhypervisor (base-nova). A brief glossary for the terminology used in the remainder of this section is given in Table 1.

 NOVA term   Meaning
 PD          Protection domain
 EC          Execution context (thread)
 SC          Scheduling context
 HIP         Hypervisor information page
 IDC         Inter-domain call (RPC call)
 portal      Communication endpoint

Table 1: Glossary of NOVA's terminology

NOVA capabilities are not Genode capabilities

Both NOVA and Genode use the term "capability". However, the term does not have the same meaning in both contexts. A Genode capability refers to an RPC object or a signal context. In the context of NOVA, a capability refers to a NOVA kernel object. To avoid confusing both meanings of the term, Genode refers to NOVA capabilities as "capability selectors", or simply "selectors". A Genode signal context capability corresponds to a NOVA semaphore; all other Genode capabilities correspond to NOVA portals.

PD service

A PD session corresponds to a NOVA PD.

A Genode capability that refers to a NOVA portal has a defined IP and an associated local EC (the Genode entrypoint). The invocation of such a Genode capability is an IDC call to a portal. A Genode capability is delegated by passing its corresponding portal or semaphore selector as IDC argument.

Page faults are handled as explained in Section Page-fault handling on NOVA. Each memory mapping installed in a component implicitly triggers the allocation of a node in the kernel's mapping database.

CPU service

NOVA distinguishes between so-called global ECs and local ECs. A global EC can be equipped with CPU time by associating it with an SC. It can perform IDC calls but it cannot receive IDC calls. In contrast to a global EC, a local EC is able to receive IDC calls but it has no CPU time. A local EC is not executed before it is called by another EC.

A regular Genode thread is a global EC. A Genode entrypoint is a local EC. Core distinguishes both cases based on the instruction-pointer (IP) argument of the CPU session's start function. For a local EC, the IP is set to zero.
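
The decision can be pictured as follows; the helper functions are purely illustrative and do not correspond to core's actual interface.

 /* sketch of core's thread-start decision (helper names are illustrative) */
 #include <cstdint>

 static void create_local_ec (std::uintptr_t)                 { /* EC without SC */ }
 static void create_global_ec(std::uintptr_t, std::uintptr_t) { /* EC with SC    */ }

 void start_thread(std::uintptr_t ip, std::uintptr_t sp)
 {
     if (ip == 0)
         create_local_ec(sp);        /* entrypoint: receives IDC calls         */
     else
         create_global_ec(ip, sp);   /* regular thread: equipped with CPU time */
 }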

IO_MEM service

Core's RAM and IO_MEM allocators are initialized based on the information found in NOVA's HIP.

ROM service

Core's ROM service provides all boot modules as ROM modules. Additionally, a copy of NOVA's HIP is provided as a ROM module named "hypervisor_info_page".
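
A component can inspect this ROM module like any other one, for example by using an attached ROM dataspace as sketched below (assuming Genode's regular ROM-session API).

 /* sketch: obtaining the HIP copy from core's ROM service */
 #include <base/attached_rom_dataspace.h>

 static void inspect_hip(Genode::Env &env)
 {
     Genode::Attached_rom_dataspace hip_rom(env, "hypervisor_info_page");

     /* the dataspace holds a copy of the HIP as defined by the NOVA specification */
     void const *hip = hip_rom.local_addr<void const>();
     (void)hip;
 }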

IRQ service

NOVA represents each interrupt as a semaphore created by the kernel. By registering a Genode signal context capability via the sigh method of the Irq_session interface, the semaphore of the signal context capability gets bound to the interrupt semaphore. Genode signals and NOVA semaphores are handled as described in Section Asynchronous notifications on NOVA.

Upon the IRQ session's initial ack_irq call, a NOVA semaphore-down operation is issued within core on the interrupt semaphore, which implicitly unmasks the interrupt at the CPU. When the interrupt occurs, the kernel masks the interrupt at the CPU and performs the semaphore-up operation on the IRQ's semaphore. Thereby, the chained semaphore, which corresponds to the previously registered Genode signal context, gets triggered and the interrupt is delivered as a Genode signal. The interrupt gets acknowledged and unmasked by calling the IRQ session's ack_irq method.
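
From the perspective of a driver, this interplay boils down to registering a signal handler and acknowledging each interrupt. The following sketch assumes Genode's Irq_connection API and omits error handling.

 /* sketch of a driver using core's IRQ service (simplified) */
 #include <irq_session/connection.h>
 #include <base/signal.h>

 static void install_irq_handler(Genode::Env &env, unsigned irq_number,
                                 Genode::Signal_context_capability sigh_cap)
 {
     static Genode::Irq_connection irq(env, irq_number);

     irq.sigh(sigh_cap);   /* bind the signal context to the interrupt semaphore */
     irq.ack_irq();        /* the initial acknowledgement unmasks the interrupt  */

     /* on each received signal: serve the device, then call irq.ack_irq() again */
 }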

Page-fault handling on NOVA

On NOVA, each EC has a pre-defined range of portal selectors. For each type of exception, the range has a dedicated portal that is entered in the event of an exception. The page-fault portal of a Genode thread is defined at the creation time of the thread and points to a pager EC per CPU within core. Hence, for each CPU, a pager EC in core pages all Genode threads running on the same CPU.

The operation of pager ECs

When an EC triggers a page fault, the faulting EC implicitly performs an IDC call to its pager. The IDC message contains the fault information. For resolving the page fault, core follows the procedure described in Section Page-fault handling. If the lookup for a dataspace within the faulter's region map succeeds, core establishes a memory mapping into the EC's PD by invoking the asynchronous map operation of the kernel and replies to the IDC message. In the case where the region lookup within the thread's corresponding region map fails, the faulting thread is retained in a blocked state via a kernel semaphore. In the event that the fault is later resolved by a region-map client as described in the paragraph "Region is empty" of Section Page-fault handling, the semaphore gets released, thus resuming the execution of the faulting thread. The faulter will immediately trigger another fault at the same address. This time, however, the region lookup succeeds.
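
In pseudo code, the handling of a single fault message by a pager EC looks roughly as follows; all types and helpers are illustrative stand-ins for core's actual machinery.

 /* rough sketch of the per-CPU pager's fault handling (names illustrative) */
 #include <cstdint>

 struct Fault   { std::uintptr_t pd, region_map, addr; };
 struct Mapping { std::uintptr_t phys, size; };

 bool lookup_dataspace(std::uintptr_t region_map, std::uintptr_t addr, Mapping &);
 void map_into_pd(std::uintptr_t pd, Mapping const &);
 void reply_to_fault_message();
 void block_faulter_on_semaphore(Fault const &);

 void handle_page_fault(Fault const &fault)
 {
     Mapping mapping;
     if (lookup_dataspace(fault.region_map, fault.addr, mapping)) {
         map_into_pd(fault.pd, mapping);   /* asynchronous kernel map operation */
         reply_to_fault_message();         /* resumes the faulting EC           */
     } else {
         /* no matching dataspace: keep the faulting thread blocked on a
            kernel semaphore until a region-map client resolves the fault */
         block_faulter_on_semaphore(fault);
     }
 }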

Mapping database

NOVA tracks memory mappings in a data structure called mapping database and has the notion of the delegation of memory mappings (rather than the delegation of memory access). Memory access can be delegated only if the originator of the delegation has a mapping. Core is the only exception because it can establish mappings originating from the physical memory space. Because mappings can be delegated transitively between PDs, the mapping database is a tree where each node denotes the delegation of a mapping. The tree is maintained in order to enable the kernel to rescind the authority. When a mapping is revoked, the kernel implicitly cancels all transitive mappings that originated from the revoked node.

Asynchronous notifications on NOVA

To support asynchronous notifications as described in Section Asynchronous notifications, we extended the NOVA kernel semaphores to support signalling via chained NOVA semaphores. This extension enables the creation of kernel semaphores with a per-semaphore value, which can be bound to another kernel semaphore. Each bound semaphore corresponds to a Genode signal context. The per-semaphore value is used to distinguish different sources of signals.

On this base platform, the blocking of the signal thread at the signal source is realized by using a kernel semaphore shared by the PD session and the PD client. All chained semaphores (signal contexts) are bound to this semaphore. When first issuing a wait-for-signal operation at the signal source, the client requests a capability selector for the shared semaphore (repos/base-nova/include/signal_session/source_client.h). It then performs a down operation on this semaphore to block.

If a signal sender issues a submit operation on a Genode signal capability, a regular NOVA semaphore-up syscall is used. If the kernel detects that the used semaphore is chained to another semaphore, the up operation is delegated to the semaphore that the signal-receiving thread obtained during its initial wait-for-signal operation.

In contrast to other base platforms, Genode's signal API is supported by the kernel so that the propagation of signals does not require any interaction with core's PD service. However, the creation of signal contexts is arbitrated by the PD service.

IOMMU support

As discussed in Section Direct memory access (DMA) transactions, misbehaving device drivers may exploit DMA transactions to circumvent their component boundaries. When executing Genode on the NOVA microhypervisor, however, bus-master DMA is subjected to the IOMMU.

The NOVA kernel applies a subset of the (MMU) address space of a protection domain to the (IOMMU) address space of a device. So the device's address space can be managed in the same way as one normally manages the address space of a PD. The only missing link is the assignment of device address spaces to PDs. This link is provided by the dedicated system call assign_pci that takes a PD capability selector and a device identifier as arguments. The PD capability selector represents the authorization over the protection domain, which is going to be targeted by DMA transactions. The device identifier is a virtual address where the extended PCI configuration space of the device is mapped in the specified PD. Only if a user-level device driver has access to the extended PCI configuration space of the device is it able to put the assignment in place.
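
Conceptually, putting the assignment in place takes the following form. The wrapper below merely restates the two arguments described above as a hypothetical declaration; the actual binding is part of the NOVA syscall bindings in repos/base-nova/include/.

 /* conceptual sketch of the device assignment (hypothetical declaration) */
 #include <cstdint>

 /* assumed: PD capability selector plus the virtual address at which the
    device's extended PCI configuration space is mapped within that PD */
 bool nova_assign_pci(std::uintptr_t pd_sel, std::uintptr_t pci_config_virt);

 bool assign_device_to_pd(std::uintptr_t pd_sel, std::uintptr_t pci_config_virt)
 {
     /* succeeds only if the caller is authorized over the PD and has the
        extended PCI configuration space mapped at the given address */
     return nova_assign_pci(pd_sel, pci_config_virt);
 }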

To make NOVA's IOMMU support available to Genode, the ACPI driver has the ability to look up the extended PCI configuration-space region for all devices and reports it via a Genode ROM. The platform driver on x86 evaluates the reported ROM and uses the information to obtain, transparently for platform clients (device drivers), the extended PCI configuration space of each device. The platform driver uses a NOVA-specific extension (assign_pci) to the PD session interface to associate a PCI device with a protection domain.

Even though these mechanisms combined should in theory suffice to let drivers operate with the IOMMU enabled, in practice, the situation is a bit more complicated. Because NOVA uses the same virtual-to-physical mappings for the device as it uses for the process, the DMA addresses the driver needs to supply to the device must be virtual addresses rather than physical addresses. Consequently, to be able to make a device driver usable on systems without IOMMU as well as on systems with IOMMU, the driver needs to become IOMMU-aware and distinguish both cases. This is an unfortunate consequence of the otherwise elegant mechanism provided by NOVA.

To relieve the device drivers from worrying about both cases, Genode decouples the virtual address space of the device from the virtual address space of the driver. The former address space is represented by a dedicated protection domain called device PD independent from the driver. Its sole purpose is to hold mappings of DMA buffers that are accessible by the associated device. By using one-to-one physical-to-virtual mappings for those buffers within the device PD, each device PD contains a subset of the physical address space. The platform driver performs the assignment of device PDs to PCI devices.

If a device driver intends to use DMA, it allocates a new DMA buffer for a specific PCI device at the platform driver. The platform driver responds to such a request by allocating a RAM dataspace at core, attaching it to the device PD using the dataspace's physical address as virtual address, and by handing out the dataspace capability to the client. If the driver requests the physical address of the dataspace, the address returned will be a valid virtual address in the associated device PD. This design implies that a device driver must allocate DMA buffers at the platform driver (specifying the PCI device the buffer is intended for) instead of using core's PD service to allocate buffers anonymously.
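
For a device driver, the resulting allocation pattern looks as sketched below. The sketch assumes the x86 platform-session API (alloc_dma_buffer) and the dataspace interface; the exact function names may vary between Genode versions.

 /* sketch: allocating a DMA buffer via the platform driver (x86) */
 #include <platform_session/connection.h>
 #include <dataspace/client.h>

 static void allocate_dma_buffer(Genode::Env &env)
 {
     Platform::Connection platform(env);

     /* the platform driver backs the buffer with core RAM and attaches it
        to the device PD at its physical address */
     Genode::Ram_dataspace_capability ds = platform.alloc_dma_buffer(0x10000);

     /* this address is valid as DMA address within the device PD */
     Genode::addr_t dma_addr = Genode::Dataspace_client(ds).phys_addr();
     (void)dma_addr;
 }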

Genode-specific modifications of the NOVA kernel

NOVA is not ready to be used as a Genode base platform as is. This section compiles the modifications that were needed to meet the functional requirements of the framework. All modifications are maintained at the following repository:

Genode's version of NOVA

https://github.com/alex-ab/NOVA.git

The repository contains a separate branch for each version of NOVA that has been used by Genode. When preparing the NOVA port using the port description at repos/base-nova/ports/nova.port, the NOVA branch that matches the used Genode version is checked out automatically. The port description refers to a specific commit ID. The commit history of each branch within the NOVA repository corresponds to the history of the original NOVA kernel followed by a series of Genode-specific commits. Each time NOVA is updated, a new branch is created and all Genode-specific commits are rebased on top of the history of the new NOVA version. This way, the differences between the original NOVA kernel and the Genode version remain clearly documented. The Genode-specific modifications solve the following problems:

Destruction of kernel objects

NOVA does not support the destruction of kernel objects. That is, PDs and ECs can be created but not destroyed. With Genode being a dynamic system, kernel-object destruction is a mandatory feature.

Inter-processor IDC

On NOVA, only local ECs can receive IDC calls. Furthermore, each local EC is bound to a particular CPU (hence the name "local EC"). Consequently, synchronous inter-component communication via IDC calls is possible only between ECs that reside on the same CPU and can never cross CPU boundaries. Unfortunately, IDC is the only mechanism for the delegation of capabilities. Consequently, authority cannot be delegated between subsystems that reside on different CPUs. For Genode, this scheme is too rigid.

Therefore, the Genode version of NOVA introduces inter-CPU IDC calls. When calling an EC on another CPU, the kernel creates a temporary EC and SC on the target CPU as a representative of the caller. The calling EC is blocked. The temporary EC uses the same UTCB as the calling EC. Thereby, the original IDC message is effectively transferred from one CPU to the other. The temporary EC then performs a local IDC to the destination EC using NOVA's existing IDC mechanism. Once the temporary EC receives the reply (with the reply message contained in the caller's UTCB), the kernel destroys the temporary EC and SC and unblocks the caller EC.

Support for priority-inheriting spinlocks

Genode's lock mechanism relies on a yielding spinlock for protecting the lock metadata. On most base platforms, there exists the invariant that all threads of one component share the same CPU priority. So priority inversion within a component cannot occur. NOVA breaks this invariant because the scheduling parameters (SC) are passed along IDC call chains. Consequently, when a client calls a server, the SCs of both client and server reside within the server. These SCs may have different priorities. The use of a naive spinlock for synchronization would produce priority-inversion problems. The kernel has been extended with the mechanisms needed to support the implementation of priority-inheriting spinlocks in userland.
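
For illustration, a plain yielding spinlock has the following shape. This is a generic sketch, not Genode's actual lock implementation, and it lacks the priority-inheritance support provided by the kernel extension.

 /* generic sketch of a yielding spinlock (not Genode's implementation) */
 #include <atomic>

 void yield_cpu();   /* hypothetical: donate the remaining time slice */

 struct Yielding_spinlock
 {
     std::atomic_flag _locked = ATOMIC_FLAG_INIT;

     void lock()
     {
         /* spin, but give up the CPU instead of busy-waiting */
         while (_locked.test_and_set(std::memory_order_acquire))
             yield_cpu();
     }

     void unlock() { _locked.clear(std::memory_order_release); }
 };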

Combination of capability delegation and translation

As described in Section Capability delegation through capability invocation, there are two cases when a capability is specified as an RPC argument. The callee may already have a capability referring to the specified object identity. In this case, the callee expects to receive the corresponding local name of the object identity. In the other case, when the callee does not yet have a capability for the object identity, it obtains a new local name that refers to the delegated capability.

NOVA does not support this mechanism per se. When specifying a capability selector as map item for an IDC call, the caller has to specify whether a new mapping should be created or the translation of the local names should be performed by the kernel. However, in the general case, this question is not decidable by the caller. Hence, NOVA had to be changed to take the decision depending on the existence of a valid translation for the specified capability selector.

Support for deferred page-fault resolution

With the original version of NOVA, the maximum number of threads is limited by core's stack area: NOVA's page-fault handling protocol works completely synchronously. When a page fault occurs, the faulting EC enters its page-fault portal and thereby activates the corresponding pager EC in core. If the pager's lookup for a matching dataspace within the faulter's region map succeeds, the page fault is resolved by delegating a memory mapping as the reply to the page-fault IDC call. However, if a page fault occurs on a managed dataspace, the pager cannot resolve it immediately. The resolution must be delayed until the region-map fault handler (outside of core) responds to the fault signal. In order to enable core to serve page faults of other threads in the meantime, each thread has its dedicated pager EC in core.

Each pager EC, in turn, consumes a slot in the stack area within core. Since core's stack area is limited, the maximum number of ECs within core is limited too. Because one core EC is needed as pager for each thread outside of core, the available stacks within core become a limited resource shared by all CPU-session clients. Because each Genode component is a client of core's CPU service, this bounded resource is effectively shared among all components. Consequently, the allocation of threads on NOVA's version of core represents a possible covert storage channel.

To avoid the downsides described above, we extended the NOVA IPC reply system call to accept an optional semaphore capability selector. The NOVA kernel validates the capability selector and blocks the faulting thread in the semaphore. The faulting thread remains blocked even after the pager has replied to the fault message. But the pager immediately becomes available for other page-fault requests. With this change, it suffices to maintain only one pager thread per CPU for all client threads.

The benefits are manifold. First, the base-nova implementation converges more closely to other Genode base platforms. Second, core cannot run out of threads anymore as the number of threads in core is fixed for a given setup. And the third benefit is that the helping mechanism of NOVA can be leveraged for concurrently faulting threads.

Remote revocation of memory mappings

In the original version of NOVA, roottask must retain mappings to all memory used throughout the system. In order to be able to delegate a mapping to another PD as the response to a page fault, it must possess a local mapping of the physical page. Otherwise, it would not be able to revoke the mapping later on because the kernel expects roottask's mapping node as a proof of the authorization for the revocation of the mapping. Consequently, even though roottask never touches memory handed out to other components, it needs to have memory mappings with full access rights installed within its virtual address space.

To relieve Genode's roottask (core) from the need to keep local mappings for all memory handed out to other components and thereby let core benefit from a sparsely populated address space as described in Section Sparsely populated core address space for base-hw, we changed the kernel's revoke operation to take a PD selector and a virtual address within the targeted PD as arguments. By presenting the PD selector as a token of authorization over the entire PD, we no longer need core-local mappings as the proof of authorization. Hence, memory mappings can always be installed directly from the physical address space to the target PD.

Support for write-combined access to memory-mapped I/O resources

The original version of NOVA is not able to benefit from write combining because the kernel interface does not allow the userland to specify cacheability attributes for memory mappings. To achieve good throughput to the framebuffer, write combining is crucial. Hence, we extended the kernel interface to allow the userland to propagate cacheability attributes to the page-table entries of memory mappings and set up the x86 page attribute table (PAT) with a configuration for write combining.

Support for the virtualization of 64-bit guest operating systems

The original version of NOVA supports 32-bit guest operating systems only. We enhanced the kernel to also support 64-bit guests.

Resource quotas for kernel resources

The NOVA kernel lacks the ability to adapt the kernel memory pool to the behavior of the userland. The kernel memory pool has a fixed size, which cannot be changed at runtime. Even though we have not removed this principal limitation, we extended the kernel with the ability to subject kernel-memory allocations to a user-level policy at the granularity of PDs. Each kernel operation that consumes kernel memory is accounted to a PD whereas each PD has a limited quota of kernel memory. This measure prevents arbitrary userland programs from bringing down the entire system by exhausting the kernel memory. The reach of damage is limited to the respective PD.

Asynchronous notification mechanism

We extended the NOVA kernel semaphores to support signalling via chained NOVA semaphores. This extension enables the creation of kernel semaphores with a per-semaphore value, which can be bound to another kernel semaphore. Each bound semaphore corresponds to a Genode signal context. The per-semaphore value is used to distinguish different sources of signals. Now, a signal sender issues a submit operation on a Genode signal capability via a regular NOVA semaphore-up syscall. If the kernel detects that the used semaphore is chained to another semaphore, the up operation is delegated to the chained one. If a thread is blocked, it gets woken up directly and the per-semaphore value of the bound semaphore gets delivered. In case no thread is currently blocked, the signal is stored and delivered as soon as a thread issues the next semaphore-down operation.

Chaining semaphores is an operation that is limited to a single level, which avoids attacks targeting endless loops in the kernel. The creation of such signals can be performed solely by an issuer holding a NOVA PD capability with the semaphore-create permission set. On Genode, this effectively reserves the operation to core. Furthermore, our solution preserves the invariant of the original NOVA kernel that a thread may be blocked in only one semaphore at a time.

Interrupt delivery

We applied the same principle of the asynchronous notification extension to the delivery of interrupts by the NOVA kernel. Interrupts are delivered as ordinary Genode signals, which alleviates the need for one thread per interrupt as required by the original NOVA kernel. In the case of a message-signalled interrupt (MSI), the interrupt gets delivered directly to the address space of the driver; in the case of a shared interrupt, it gets delivered to the x86 platform driver.

Known limitations of NOVA

This section summarizes the known limitations of NOVA and the NOVA version of core.

Fixed amount of kernel memory

NOVA allocates kernel objects out of a memory pool of a fixed size. The pool is dimensioned in the kernel's linker script nova/src/hypervisor.ld (at the symbol _mempool_f).

Bounded number of object capabilities within core

For each capability created via core's PD service, core allocates the corresponding NOVA portal or NOVA semaphore and maintains the capability selector during the lifetime of the associated object identity. Each allocation of a capability via core's PD service consumes one entry in core's capability space. Because the space is bounded, clients of the service could misuse core's capability space as a covert storage channel.