Execution on bare hardware (base-hw)

The code specific to the base-hw platform is located within the repos/base-hw/ directory. In the following description, unless explicitly stated otherwise, all paths are relative to this directory.

In contrast to classical L4 microkernels where Genode's core process runs as user-level roottask on top of the kernel, base-hw executes Genode's core directly on the hardware with no distinct kernel underneath. Core and the kernel are melted into one hybrid component. Although all threads of core are running in privileged processor mode, they call a kernel library to synchronize hardware interaction. However, most work is done outside of that library. This design has several benefits. First, the kernel part becomes much simpler. For example, there are no allocators needed within the kernel. Second, base-hw side-steps long-standing difficult kernel-level problems, in particular the management of kernel resources. For the allocation of kernel objects, the hybrid core/kernel can employ Genode's user-level resource trading concepts as described in Section Resource trading. Finally and most importantly, merging the kernel with roottask removes a lot of redundancies between both programs. Traditionally, both kernel and roottask perform the book keeping of physical-resource allocations and the existence of kernel objects such as address spaces and threads. In base-hw, those data structures exist only once. The complexity of the combined kernel/core is significantly lower than the sum of the complexities of a traditional self-sufficient kernel and a distinct roottask on top. This way, base-hw helps to make Genode's TCB less complex.

The following subsections detail the problems that base-hw had to address to become a self-sufficient base platform for Genode.

Bootstrapping of base-hw

Early bootstrap

After the boot loader has loaded the kernel image into memory, it calls the kernel's entry point. At this stage, the MMU is still switched off and no CPU other than the primary boot CPU are initialized. The first job of the loaded kernel is the initialization of all CPUs and their transition from the use of physical memory to virtual memory. This one-time code path is called bootstrap. The corresponding code is located at src/bootstrap/. Besides enabling the MMU, this code performs system-global static hardware configurations such as setting up an ARM TrustZone policy. Once completed, bootstrap ELF-loads the actual core/kernel executable, which is designated to run entirely in virtual memory. After this stage is complete, bootstrap is no longer part of the picture.

Startup of the base-hw kernel

Core on base-hw uses Genode's regular linker script. Like any regular Genode component, its execution starts at the _start symbol. But unlike a regular component, core is started by the bootstrap component as a kernel running in privileged mode. Instead of directly following the startup procedure described in Section Startup code, base-hw uses custom startup code that initializes the kernel part of core first. For example, the startup code for the ARM architecture is located at src/core/spec/arm/crt0.s. It calls the kernel initialization code in src/core/kernel/main.cc. Core's regular C++ startup code (the _main function) is executed by the first thread created by the kernel (see the thread setup in the Core_main_thread::Core_main_thread() constructor).

Kernel entry and exit

The execution model of the kernel can be roughly characterized as a single-stack kernel. In contrast to traditional L4 kernels that maintain one kernel thread per user thread, the base-hw kernel is a mere state machine that never blocks in the kernel. State transitions are triggered by core or user-level threads that enter the kernel via a system call, by device interrupts, or by a CPU exception. Once entered, the kernel applies the state change depending on the event that caused the kernel entry, and leaves the kernel again. The transition between normal threads and kernel execution depends on the concrete architecture. For ARM, the corresponding code is located at src/core/spec/arm/exception_vector.S.

Interrupt handling and preemptive multi-threading

In order to respond to interrupts, base-hw has to contain a driver for the interrupt controller. The interrupt-controller driver is named Board::Pic. The implementation depends on the used board. For each board supported by the base-hw kernel, there exists a board.h file that ties together all board-specific definitions. For example, the file core/board/pbxa9/board.h and the included headers define the properties of the PBX-A9 platform by populating the Board namespace.

To support preemptive multi-threading, base-hw requires a hardware timer. The timer is programmed with the time slice length of the currently executed thread. Once the programmed timeout elapses, the timer device generates an interrupt that is handled by the kernel. Similarly to interrupt controllers, there exist a variety of different timer devices for different hardware platforms. Therefore, base-hw features a variety of timer drivers. The board-specific board.h file defines a Board::Timer type that is mapped to one of the available drivers. For example, the pbxa9/board.h file includes spec/arm/cortex_a9_global_timer.h, which contains the definition of Board::Timer.

The in-kernel handler of the timer interrupt invokes the thread scheduler (src/core/kernel/scheduler.h). The scheduler maintains a list of so-called scheduling contexts where each context refers to a thread. Each time the kernel is entered, the scheduler is updated with the passed duration. At kernel exit, the control is passed to the user-level thread that corresponds to the head of the scheduler list.

Split kernel interface

The system-call interface of the base-hw kernel is split into two parts. One part is usable by all components and solely contains system calls for inter-component communication and thread synchronization. The definition of this interface is located at include/kernel/interface.h. The second part is exposed only to core. It supplements the public interface with operations for the creation, the management, and the destruction of kernel objects. The definition of the core-private interface is located at src/core/kernel/core_interface.h.

The distinction between both parts of the kernel interface is enforced by the dedicated Thread::_call and Core_thread::_call functions where the latter is private to core only.

Public part of the kernel interface

Threads do not run independently but interact with each other via synchronous inter-component communication as detailed in Section Inter-component communication. Within base-hw, this mechanism is referred to as IPC (for inter-process communication). To allow threads to perform calls to other threads or to receive RPC requests, the kernel interface is equipped with system calls for performing IPC (rpc_call, rpc_wait, rpc_reply, _rpc_reply_and_wait). To keep the kernel as simple as possible, IPC is performed using so-called user-level thread-control blocks (UTCB). Each thread has a corresponding memory page that is always mapped in the kernel. This UTCB page is used to carry IPC payload. The largely simplified procedure of transferring a message is as follows. (In reality, the state space is more complex because the receiver may not be in a blocking state when the sender issues the message)

The sender marshals its payload into its UTCB and invokes the kernel,
The kernel transfers the payload from the sender's UTCB to the receiver's UTCB and schedules the receiver,
The receiver retrieves the incoming message from its UTCB.

Because all UTCBs are always mapped in the kernel, no page faults can occur during the second step. This way, the flow of execution within the kernel becomes predictable and no kernel exception handling code is needed.

Capabilities are delegated via IPC. The recipient of a capability controls the lifetime of the corresponding local ID (cap_ack, cap_delete).

In addition to IPC, threads interact via the synchronization primitives provided by the Genode API. To implement these portions of the API, the kernel provides system calls for managing the execution control of threads (thread_stop, thread_restart, thread_yield). For the efficient execution of virtual machines, the kernel offers the monitored execution of virtual CPUs (vcpu_run, vcpu_pause).

To support asynchronous notifications as described in Section Asynchronous notifications, the kernel provides system calls for the submission and reception of signals (signal_wait, signal_pending, signal_submit, and signal_ack) as well as the life-time management of signal contexts (signal_kill). In contrast to other base platforms, Genode's signal API is directly supported by the kernel so that the propagation of signals does not require any interaction with core's PD service. However, the creation of signal contexts is arbitrated by the PD service. This way, the kernel objects needed for the signalling mechanism are accounted to the corresponding clients of the PD service.

The kernel provides an interface to make the kernel's scheduling timer available as time source to the user land. Using this interface, components can bind signal contexts to timeouts (timeout) and follow the progress of time (time and timeout_max_us).

Depending on the CPU architecture, CPU-cache maintenance may require kernel privileges. Hence, the kernel interface provides a set of cache-control operations (cache_coherent_region, cache_clean_invalidate_data_region, cache_invalidate_data_region, and cache_line_size).

Core-private part of the kernel interface

The core-private part of the kernel interface allows core to perform the privileged operations defined at src/core/kernel/core_interface.h. Note that even though the kernel and core provide different interfaces, both are executed in privileged CPU mode, share the same address space and ultimately trust each other. The kernel is regarded a mere support library of core that executes those functions that shall be synchronized between different CPU cores and core's threads. In particular, the kernel does not perform any allocation. Instead, the allocation of kernel objects is performed as an interplay of core and the kernel.

Core allocates physical memory from its physical-memory allocator. Most kernel-object allocations are performed in the context of one of core's services. Hence, those allocations can be properly accounted to a session quota (Section Resource trading). This way, kernel objects allocated on behalf of core's clients are "paid for" by those clients.
Core allocates virtual memory to make the allocated physical memory visible within core and the kernel.
Core invokes the kernel to construct the kernel object at the location specified by core. This kernel invocation is actually a system call that enters the kernel via the kernel-entry path.
The kernel initializes the kernel object at the virtual address specified by core and returns to core via the kernel-exit path.

The core-private kernel interface consists of the following operations:

The creation and destruction of protection domains (PD_CREATE, PD_DESTROY), invoked by the PD service
The creation, manipulation, and destruction of threads with the operations prefixed with THREAD (CREATE, DESTROY, START, PAUSE, RESUME, SINGLE_STEP, CPU_STATE_GET, CPU_STATE_SET, EXC_STATE_GET, PAGER_SET, and PAGER_SIGNAL_ACK) as used by the CPU service, and the core-specific implementation of the Genode::Thread API (THREAD_CORE_CREATE).
The creation and destruction of signal receivers (SIGNAL_RECEIVER_CREATE, SIGNAL_RECEIVER_DESTROY) and signal contexts (SIGNAL_CONTEXT_CREATE, SIGNAL_CONTEXT_DESTROY), invoked by the PD service
The creation and destruction of kernel-protected object identities (OBJECT_CREATE, OBJECT_DESTROY)
The creation, manipulation, and destruction of interrupt kernel objects (IRQ_CREATE, IRQ_DESTROY, IRQ_ACK).
The mechanisms hosting virtual machines via core's VM service (VCPU_CREATE, VCPU_DESTROY).
Platform-control functions (CPU_SUSPEND, PD_INVALIDATE_TLB).

Scheduler of the base-hw kernel

CPU scheduling in traditional L4 microkernels is based on static priorities. The scheduler always picks the runnable thread with highest priority for execution. If multiple threads share one priority, the kernel schedules those threads in a round-robin fashion. Whereas being fast and easy to implement, this scheme has disadvantages: First, there is no way to prevent high-prioritized threads from starving lower-prioritized ones. Second, CPU time is usually accounted to threads, which becomes unfair whenever one thread works for another. The base-hw kernel addresses those concerns with a custom scheduler that deviates from the traditional approach by modelling different classes of workloads and passing CPU time along IPC call chains.

Sparsely populated core address space

Even though core has the authority over all physical memory, it has no immediate access to the physical pages. Whenever core requires access to a physical memory page, it first has to explicitly map the physical page into its own virtual memory space. This way, the virtual address space of core stays clean from any data of other components. Even in the presence of a bug in core (e.g., a dangling pointer), information cannot accidentally leak between different protection domains because the virtual memory of the other components is not necessarily visible to core.

Multi-processor support of base-hw

On uniprocessor systems, the base-hw kernel is single-threaded. Its execution model corresponds to a mere state machine. On SMP systems, it maintains one kernel thread and one scheduler per CPU core. Access to kernel objects gets fully serialized by one global spin lock that is acquired when entering the kernel and released when leaving the kernel. This keeps the use of multiple cores transparent to the kernel model, which greatly simplifies the code compared to traditional L4 microkernels. Given that the kernel is a simple state machine providing lightweight non-blocking operations, there is little contention for the global kernel lock. Even though this claim may not hold up when scaling to a large number of cores, current platforms can be accommodated well.

Cross-CPU inter-component communication

Regarding synchronous and asynchronous inter-processor communication - thanks to the global kernel lock - there is no semantic difference to the uniprocessor case. The only difference is that on a multiprocessor system, one processor may change the schedule of another processor by unblocking one of its threads (e.g., when an RPC call is received by a server that resides on a different CPU as the client). This condition may rescind the current scheduling choice of the other processor. To avoid lags in this case, the kernel lets the unaware target processor trap into an inter-processor interrupt (IPI). The targeted processor can react to the IPI by taking the decision to schedule the receiving thread. As the IPI sender does not have to wait for an answer, the sending and receiving CPUs remain largely decoupled. There is no need for a complex IPI protocol between sender and receiver.

TLB shootdown

With respect to the synchronization of core-local hardware, there are two different situations to deal with. Some hardware components like most ARM caches and branch predictors implement their own coherence protocol and thus need adaption in terms of configuration only. Others, like the TLBs lack this feature. When for instance a page table entry gets invalid, the TLB invalidation of the affected entries must be performed locally by each core. To signal the necessity of TLB maintenance work, an IPI is sent to all other cores. Once all cores have completed the cleaning, the thread that invoked the TLB invalidation resumes its execution.

Asynchronous notifications on base-hw

The base-hw platform improves the mechanism described in Section Asynchronous notification mechanism by introducing signal receivers and signal contexts as first-class kernel objects. Core's PD service is merely used to arbitrate the creation and destruction of those kernel objects but it does not play the role of a signal-delivery proxy. Instead, signals are communicated directly by using the public kernel operations signal_wait, signal_submit, signal_ack, and signal_pending.