Release notes for the Genode OS Framework 9.08

Whereas the previous releases were focused on adding features to the framework, the overall theme for the current release 9.08 was refinement. We took the chance to revisit several parts of the framework that we considered as interim solutions, and replaced them with solid and hopefully long-lasting implementations. Specifically, we introduce a new lock implementation, a new timer service, a platform-independent signalling mechanism, completely reworked startup code for all platforms, and thread-local storage support. Even though some of the changes touch fundamental mechanisms, we managed to keep the actual Genode API almost unmodified.

With regard to features, the release introduces initial support for dynamic linking, a core extension to enable a user-level variant of Linux to run on the OKL4 version of Genode, and support for super pages and write-combined I/O memory access on featured L4 platforms.

The most significant change for the Genode Linux version is the grand unification with the other base platforms. Now, the Linux version shares the same linker script and most of the startup code with the supported L4 platforms. Thanks to our evolved system-call bindings, we were further able to completely dissolve Genode's dependency on Linux's glibc. Thereby, the Linux version of Genode is on track to become one of the lowest-complexity (in terms of source-code complexity) Linux-kernel-based OSes available.

Base framework

New unified lock implementation

Since the first Genode release one year ago, the lock implementation had been a known weak spot. To keep things simple, we employed a yielding spinlock as the basic synchronization primitive. All other thread-synchronization mechanisms such as semaphores were based on this lock. In principle, the yielding spinlock used to look like this:

 class Lock {
   private:
     enum Lock_variable { UNLOCKED, LOCKED };
     Lock_variable _lock_variable;

   public:
     Lock() : _lock_variable(UNLOCKED) { }

     /* spin until the atomic swap from UNLOCKED to LOCKED succeeds */
     void lock() {
       while (!cmpxchg(&_lock_variable, UNLOCKED, LOCKED))
         yield_cpu_time();
     }

     void unlock() { _lock_variable = UNLOCKED; }
 };

The compare-exchange is an atomic operation that compares the current value of _lock_variable to the value UNLOCKED, and, if equal, replaces the value by LOCKED. If this operation succeeds, cmpxchg returns true, which means that the lock acquisition succeeded. Otherwise, we know that the lock is already owned by someone else, so we yield the CPU time to another thread.

Besides the obvious simplicity of this solution, it requires minimal CPU time in the non-contention case, which we considered to be the common case. In the contention case, however, this implementation has a number of drawbacks. First, the lock is not fair. One thread may be able to grab and release the lock a number of times before another thread has the chance to be scheduled at the right time to acquire the lock while it is free. Second, the lock does not block the acquiring thread but lets it actively spin. This behavior consumes CPU time and slows down other threads that do real work. Furthermore, this lock is incompatible with the use of thread priorities. If the lock is owned by a low-priority thread and a high-priority thread tries to acquire the lock, the high-priority thread remains active after calling yield_cpu_time(). Therefore, the lock owner starves and has no chance to release the lock. This effect can be partially alleviated by replacing yield_cpu_time() with a sleep function, but this work-around implies higher wake-up latencies.

Because we regarded this yielding spinlock as an intermediate solution since the first release, we are happy to introduce a completely new implementation now. The new implementation is based on a wait queue of lock applicants that are trying to acquire the lock. If a thread detects that the lock is already owned by another thread (the lock holder), it adds itself to the wait queue of the lock and calls a blocking system call. When the lock holder releases the lock, it wakes up the next member of the lock's wait queue. In the non-contention case, the lock remains as cheap as the yielding spinlock. Because the new lock employs a FIFO wait queue, the lock guarantees fairness in the contention case.

The implementation has two interesting points worth noting. First, in order to make the wait-queue operations thread safe, we use a simple spinlock within the lock for protecting the wait queue. In practice, we measured that there is almost never contention for this spinlock because two threads would need to acquire the lock at exactly the same time. Nevertheless, the lock remains safe even for this case. Thanks to the use of the additional spinlock within the lock, the lock implementation is extremely simple.

The second interesting aspect is the base mechanism for blocking and waking up threads such that there is no race between detecting contention and blocking. On Linux, we use sleep for blocking and SIGUSR1 to cancel the sleep operation. Because Linux delivers signals to threads at kernel entry, the wake-up signal gets reliably delivered even if it occurs prior to the thread blocking. On OKL4 and Pistachio, we use the exchange-registers (exregs) system call for both blocking and waking up threads. Because exregs returns the previous thread state, the sender of the wake-up signal can detect if the targeted thread is already in a blocking state. If not, it helps the thread to enter the blocking state by a thread switch and then repeats the wake-up.

Unfortunately, Fiasco does not support the reporting of the previous thread state as exregs return value. On this kernel, we have to stick with the yielding spinlock.
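
The wait-queue scheme described above can be illustrated with a small sketch in portable C++. It is not Genode's implementation: the names Applicant, Queue_lock, and contended_increments are hypothetical, and a mutex/condition-variable pair stands in for the kernel-specific blocking and wake-up primitives (sleep and SIGUSR1, or exregs). The 'woken' flag plays the role of the kernel mechanism that makes a wake-up effective even if it arrives before the applicant blocks.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// One entry per waiting thread (hypothetical stand-in for kernel blocking)
struct Applicant {
    std::mutex              m;
    std::condition_variable cv;
    bool                    woken = false;

    // Returns immediately if the wake-up already happened
    void block() {
        std::unique_lock<std::mutex> guard(m);
        cv.wait(guard, [this] { return woken; });
    }
    // Notify while holding 'm' so the applicant cannot vanish mid-call
    void wake_up() {
        std::lock_guard<std::mutex> guard(m);
        woken = true;
        cv.notify_one();
    }
};

class Queue_lock {
    std::atomic_flag        _spin = ATOMIC_FLAG_INIT; // protects the members below
    bool                    _owned = false;
    std::deque<Applicant *> _queue;                   // FIFO of lock applicants

    void _spin_lock() {
        while (_spin.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();
    }
    void _spin_unlock() { _spin.clear(std::memory_order_release); }

public:
    void lock() {
        Applicant self;
        _spin_lock();
        if (!_owned) {               // non-contention case: cheap path
            _owned = true;
            _spin_unlock();
            return;
        }
        _queue.push_back(&self);     // contention: enqueue, then block
        _spin_unlock();
        self.block();                // ownership is handed over by unlock()
    }

    void unlock() {
        _spin_lock();
        if (_queue.empty()) {
            _owned = false;
            _spin_unlock();
            return;
        }
        Applicant *next = _queue.front();
        _queue.pop_front();
        _spin_unlock();
        next->wake_up();             // the lock stays owned, now by 'next'
    }
};

// Helper for trying the lock: several threads increment a shared counter
long contended_increments(unsigned num_threads, unsigned iterations) {
    Queue_lock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t)
        threads.emplace_back([&] {
            for (unsigned i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto &th : threads) th.join();
    return counter;
}
```

Calling contended_increments(4, 2000) exercises both the uncontended path and the hand-over path. Because unlock() passes ownership directly to the head of the queue, a thread that releases the lock and immediately tries to re-acquire it queues up behind the waiting applicants, which is where the fairness comes from.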

New platform-independent signalling mechanism

The release 8.11 introduced an API for asynchronous notifications. Until recently, however, we have not used this API to a large extent because it was not supported on all platforms (in particular OKL4) and its implementation was pretty heavy-weight: signalling required one additional thread for each signal transmitter and each signal receiver. The current release introduces a completely platform-independent light-weight (in terms of the use of threads) signalling mechanism based on a new core service called SIGNAL. A SIGNAL session can be used to allocate multiple signal receivers, each represented by a unique signal-receiver capability. Via such a capability, signals can be submitted to the receiver's session. The owner of a SIGNAL session can receive signals submitted to the receivers of this session by calling the blocking wait_for_signal function. Based on this simple mechanism, we have been able to reimplement Genode's signal API. Each process creates one SIGNAL session at core and owns a dedicated thread that blocks for signals submitted to any receiver allocated by the process. Once the signal thread receives a signal from core, it determines the local signal-receiver context and dispatches the signal accordingly.

The new implementation of the signal API required a small refinement. The original version allowed the specification of an opaque argument at the creation time of a signal receiver, which had been delivered with each signal submitted to the respective receiver. The new version replaces this opaque argument with a C++ class called Signal_context. This allows for a more object-oriented use of the signal API.
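
The following sketch illustrates the idea of dispatching signals to context objects instead of passing an opaque argument around. Only the name Signal_context is taken from the text. The virtual handle() function, the Signal_dispatcher class, and the numeric context IDs are assumptions made for illustration and do not reflect the actual Genode signal API.

```cpp
#include <cassert>
#include <map>

// Context object associated with a signal receiver; state that formerly
// travelled as an opaque argument lives in a derived class instead.
struct Signal_context {
    virtual ~Signal_context() { }
    virtual void handle(unsigned num_signals) = 0; // hypothetical hook
};

// Hypothetical process-local table as it might be used by the signal
// thread to find the context that belongs to an incoming signal.
class Signal_dispatcher {
    std::map<unsigned long, Signal_context *> _contexts;
    unsigned long                             _next_id = 1;

public:
    // Register a context and return the ID that identifies it
    unsigned long manage(Signal_context &context) {
        const unsigned long id = _next_id++;
        _contexts[id] = &context;
        return id;
    }

    void dissolve(unsigned long id) { _contexts.erase(id); }

    // Deliver a signal; returns false if the context is unknown
    bool dispatch(unsigned long id, unsigned num_signals) {
        std::map<unsigned long, Signal_context *>::iterator it = _contexts.find(id);
        if (it == _contexts.end())
            return false;
        it->second->handle(num_signals);
        return true;
    }
};

// Example context that merely counts the signals it received
struct Counter_context : Signal_context {
    unsigned total = 0;
    void handle(unsigned num_signals) override { total += num_signals; }
};
```

A client would derive its own type from Signal_context, keep whatever state it needs as members, and let the dispatching thread call into it. No casts from an untyped argument are required.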

Generic support for thread-local storage

Throughout Genode we avoid relying on thread-local storage (TLS) and, in fact, we had not needed such a feature while creating software solely using the framework. However, when porting existing code to Genode, in particular Linux device drivers and Qt-based applications, the need for TLS arises. For such cases, we have now extended Genode's Thread class with generic TLS support. The static function Thread_base::myself() returns a pointer to the Thread_base object of the calling thread, which may be cast to an inherited thread type (holding TLS information) as needed.

The Thread_base object is looked up by using the current stack pointer as key into an AVL tree of registered stacks. Hence, the lookup traverses a plain data structure and does not rely on platform-dependent CPU features (such as gs segment-register TLS lookups on Linux).
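
A minimal sketch of such a stack-pointer-keyed lookup is shown below. It is an illustration under assumptions, not Genode's code: Stack_registry and Thread_info are hypothetical names, and std::map (typically a red-black tree) stands in for the AVL tree mentioned above. In a real system, the key passed to lookup() would be the current stack pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

struct Thread_info { const char *name; }; // stand-in for a thread object

class Stack_registry {
    struct Stack { std::uintptr_t size; Thread_info *thread; };
    std::map<std::uintptr_t, Stack> _stacks; // keyed by stack base address

public:
    void insert(std::uintptr_t base, std::uintptr_t size, Thread_info *thread) {
        Stack s = { size, thread };
        _stacks[base] = s;
    }

    // Find the thread whose stack area contains the address 'sp'
    Thread_info *lookup(std::uintptr_t sp) const {
        // First stack with a base above 'sp'; the candidate is the one before
        std::map<std::uintptr_t, Stack>::const_iterator it = _stacks.upper_bound(sp);
        if (it == _stacks.begin())
            return nullptr;
        --it;
        if (sp - it->first < it->second.size)
            return it->second.thread;
        return nullptr;                      // address lies between two stacks
    }
};
```

The lookup costs a logarithmic number of comparisons and uses no CPU-specific registers, which matches the portability argument made above.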

Even though Genode does provide a mechanism for TLS, we strongly discourage the use of this feature when creating new code with the Genode API. A clean C++ program never has to rely on side effects bypassing the programming language. Instead, all context information needed by a function to operate should be passed to the function as arguments.

Core extensions to run Linux on top of Genode on OKL4

As announced on our road map, we are working on bringing a user-level variant of the Linux kernel to Genode. During this release cycle, we focused on enabling OKLinux aka Wombat to run on top of Genode. To run Wombat on Genode, we had to implement glue code between the Wombat kernel code and the Genode API, and slightly extend the PD service of core.

The PD-service extension is a great showcase for implementing inheritance of RPC interfaces on Genode. The extended PD-session interface resides in base-okl4/include/okl4_pd_session and provides the following additional functions:

 Okl4::L4SpaceId_t space_id();
 void space_pager(Thread_capability);

The space_id function returns the L4 address-space ID corresponding to the PD session. The space_pager function can be used to set the protection domain as pager and exception handler for the specified thread. This function is used by the Linux kernel to register itself as pager and exception handler for all Linux user processes.

In addition to the actual porting work, we worked on replacing the original priority-based synchronization scheme with a different synchronization mechanism based on OKL4's thread suspend/resume feature and Genode locks. This way, all Linux threads and user processes run at the same priority as normal Genode processes, which improves the overall (best-effort) performance and makes Linux robust against starvation in the presence of a Genode process that is active all the time.

At the current stage, we are able to successfully boot OKLinux on Genode and start the X Window System. The graphics output and user input are realized via custom stub drivers that use Genode's input and frame-buffer interfaces as back ends.

We consider the current version as a proof of concept. It is not yet included in the official release but we plan to make it a regular part of the official Genode distribution with the next release.

Preliminary shared-library support

Our Qt4 port made the need for dynamically linked binaries more than evident. Statically linked programs using the Qt4 library tend to grow far beyond 10MB of stripped binary size. To promote the practical use of Qt4 on Genode, we ported the dynamic linker from FreeBSD (part of libexec) to Genode. The port consists of three parts:

  1. Building the ldso binary on Genode, using Genode's parent interface to gain access to shared libraries and use Genode's address-space management facilities to construct the address space of the dynamically loaded program.

  2. Adding support for the detection of dynamically linked binaries, the starting of ldso in the presence of a dynamically linked binary, and passing the program's binary image to ldso.

  3. Adding support for building shared libraries and dynamically linked programs to the Genode build system.

At the current stage, we have completed the first two steps and are able to successfully load and run dynamically linked Qt4 applications. Thanks to dynamic linking, the binary size of Qt4 programs drops by an order of magnitude. Apparently, the use of shared Qt libraries already pays off when using only two Qt4 applications.

You can find our port of ldso in the separate ldso repository. We will finalize the build-system integration in the coming weeks and plan to support dynamic linking as a regular feature of the os repository with the next release.

Operating-system services and libraries

Improved handling of XML configuration data

Genode allows for configuring a whole process tree via a single configuration file. Core provides the file named config as a ROM-session dataspace to the init process. Init attaches the dataspace into its own address space and reads the configuration data via a simple XML parser. The XML parser takes a null-terminated string as input and provides functions for traversing the XML tree. This procedure, however, is a bit flawed because init cannot expect XML data provided as a dataspace to be null terminated. On most platforms, this was no problem so far because boot modules, as provided by core's ROM service, used to be padded with zeros. However, there are platforms, in particular OKL4, that do not initialize the padding space between boot modules. In this case, the actual XML data is followed by arbitrary bits but possibly no null termination. Furthermore, there exists the corner case of using a config file with a size of a multiple of 4096 bytes. In this case, the null termination would be expected just at the beginning of the page beyond the dataspace.

There are two possible solutions for this problem: copying the content of the config dataspace to a freshly allocated RAM dataspace and appending the null termination, or passing a size limit of the XML data to the XML parser. We went for the latter solution to avoid the memory overhead of copying configuration data just for appending the null termination. Making the XML parser respect a string-length boundary involved the following changes:

  • The strncpy function had to be made robust against source strings that are not null-terminated. Strictly speaking, passing a source buffer without null-termination violates the function interface because, by definition, src is a string, which should always be null-terminated. The size argument usually refers to the bound of the dst buffer. However, in our use case, for the XML parser, the source string may not be properly terminated. In this case, we want to ensure that the function does not read any characters beyond src + size.

  • Enhanced ascii_to_ulong function to accept an optional size-limitation argument

  • Added support for size-limited tokens in base/include/util/token.h

  • Added support for constructing an XML node from a size-limited string

  • Adapted init to restrict the size of the config XML node to the file size of the config file
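
To make the size-limit idea from the list above concrete, here is a small sketch of a bounded string copy and a bounded number parser. Both functions are illustrations written for this text. The names bounded_copy and bounded_ascii_to_ulong and their signatures are assumptions and are not the actual Genode utilities.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copy at most size - 1 characters from 'src' to 'dst' and always
// terminate 'dst'. The loop checks the bound before touching src[i],
// so no byte at or beyond src + size - 1 is ever read.
char *bounded_copy(char *dst, const char *src, std::size_t size)
{
    if (!dst || !src || size == 0)
        return dst;

    std::size_t i = 0;
    for (; i + 1 < size && src[i] != '\0'; ++i)
        dst[i] = src[i];

    dst[i] = '\0';
    return dst;
}

// Parse decimal digits from at most 'size' characters of 's'.
// Returns the number of characters consumed.
std::size_t bounded_ascii_to_ulong(const char *s, std::size_t size,
                                   unsigned long &result)
{
    result = 0;
    std::size_t i = 0;
    for (; i < size && s[i] >= '0' && s[i] <= '9'; ++i)
        result = result * 10 + static_cast<unsigned long>(s[i] - '0');
    return i;
}
```

With these semantics, a caller that knows the size of the config dataspace can pass that size down and be sure that parsing stops at the boundary, even when the data ends exactly at a page border.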

Nitpicker GUI server

  • Avoid superfluous calls of framebuffer.refresh() to improve the overall performance

  • Fixed stacking of views behind all others, but in front of the background. This problem occurred when seamlessly running another window system as Nitpicker client.

Misc

Alarm framework

Added next_deadline() function to the alarm framework. This function is used by the timer server to program the next one-shot timer interrupt, depending on the scheduled timeouts.
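
As an illustration of how a timer server might use such a function, the following sketch keeps deadlines in a sorted container and reports the earliest one. The class Alarm_scheduler, its member functions, and the pointer-based signature of next_deadline() are assumptions for this example. The release notes do not specify the real interface.

```cpp
#include <cassert>
#include <set>

class Alarm_scheduler {
public:
    typedef unsigned long Time;

    void schedule(Time deadline) { _deadlines.insert(deadline); }

    // Report the earliest pending deadline, if any
    bool next_deadline(Time *deadline) const {
        if (_deadlines.empty())
            return false;
        *deadline = *_deadlines.begin();
        return true;
    }

    // Trigger all alarms that are due at 'now', return how many fired
    unsigned handle(Time now) {
        unsigned triggered = 0;
        while (!_deadlines.empty() && *_deadlines.begin() <= now) {
            _deadlines.erase(_deadlines.begin());
            ++triggered;
        }
        return triggered;
    }

private:
    std::multiset<Time> _deadlines; // sorted, duplicates allowed
};
```

After each call of handle(), a one-shot timer driver would query next_deadline() and program the hardware with the difference between that deadline and the current time.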

DDE Kit

  • Implemented dde_kit_thread_usleep() and dde_kit_thread_nsleep()

  • Removed unused/useless dde_kit_init_threads() function

Qt4

Added support for QProcess. This class can be used to start Genode applications from within Qt applications in a Qt4-compatible way.

Device drivers

New single-threaded timer service

With the OKL4 support added with the previous release, the need for a new timer service emerged. In contrast to the other supported kernels, OKL4 imposed two restrictions, which made the old implementation unusable:

  • The kernel interface of OKL4 does not provide a time source. The kernel uses an APIC timer internally to implement preemptive scheduling but, in contrast to other L4 kernels that support IPC timeouts, OKL4 does not expose wall-clock time to the user land. Therefore, the user land has to provide a timer driver that programs a hardware timer, handles timer interrupts, and makes the time source available to multiple clients.

  • OKL4 restricts the number of threads per address space according to a global configuration value. By default, the current Genode version sets this value to 32. The old version of the timer service, however, employed one thread for each timer client. So the number of timer clients was severely limited.

Motivated by these observations, we created a completely new timer service that dispatches all clients with a single thread and also supports different time sources as back ends. For example, the back ends for Linux, L4/Fiasco, and L4ka::Pistachio simulate periodic timer interrupts using Linux' nanosleep system call or IPC timeouts, respectively. The OKL4 back end contains a PIT driver and operates this timer device in one-shot mode.

To implement the timer server in a single-threaded manner, we used an experimental API extension to Genode's server framework. Please note that we regard this extension as temporary and will possibly remove it with the next release. The timer will then service its clients using Genode's signal API.

Even though the timer service is a complete reimplementation, its interface remains unmodified. So this change remains completely transparent at the API level.

VESA graphics driver

The previous release introduced a simple PCI-bus virtualization into the VESA driver. At startup, the VESA driver uses the PCI bus driver to find a VGA card and provides this single PCI device to the VESA BIOS via a virtual PCI bus. All accesses to the virtualized PCI device are then handled locally by the VESA driver. In addition to PCI access, some VESA BIOS implementations tend to use the programmable interval timer (PIT) device at initialization time. Because we do not want to permit the VESA BIOS to gain access to the physical timer device, the VESA driver now provides an extremely crippled virtual PIT, which is just enough to satisfy all VESA BIOS implementations that we tested.

On the feature side, we added support for VESA mode-list handling and a default-mode fallback to the driver.

Misc

SDL-based frame buffer and input driver

For making the Linux version of Genode more usable, we complemented the existing key-code translations from SDL codes to Genode key codes.

PS/2 mouse and keyboard driver

Improved robustness against ring-buffer overruns in cases where input events are produced at a higher rate than they can be handled, in particular, if there is no input client connected to the driver.

Platform-specific changes

Support for super pages

Previous Genode versions for the OKL4, L4ka::Pistachio, and L4/Fiasco kernels used 4K pages only. The most visible implication was a very noticeable delay during system startup on L4ka::Pistachio and L4/Fiasco. This delay was caused by core requesting all physical memory from the root memory manager (sigma0) page by page. Another disadvantage of using 4K pages only is the resulting TLB footprint of large linear mappings such as the frame buffer. Updating a 16bit frame buffer with a resolution of 1024x768 would touch 384 pages and thereby significantly pollute the TLB.

This release introduces support for super pages for the L4ka::Pistachio and L4/Fiasco versions of Genode. In contrast to normal 4K pages, a super page describes a 4M region of virtual memory with a single entry in the page directory. By supporting super pages in core, the overhead of the startup protocol between core and sigma0 gets reduced by a factor of roughly 1000, as one super page covers 1024 4K pages.

Unfortunately, OKL4 does not support super pages, so this feature remains unused on this platform. However, since OKL4 does not employ a root memory manager, there is no startup delay anyway. Only the advantage of super pages with regard to reduced TLB footprint is not available on this platform.
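
The page counts in this section follow from simple arithmetic, sketched below. The helper pages_needed and the constants are written for this text only. The frame-buffer example assumes two bytes per pixel.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t PAGE_4K = 4096;             // normal page
constexpr std::size_t PAGE_4M = 4096 * 1024;      // super page

// Number of pages of 'page_size' bytes needed to cover 'bytes' bytes
constexpr std::size_t pages_needed(std::size_t bytes, std::size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

// Frame buffer of 1024x768 pixels with two bytes per pixel (assumption)
constexpr std::size_t FB_BYTES = 1024 * 768 * 2;
```

With these definitions, pages_needed(FB_BYTES, PAGE_4K) evaluates to 384 whereas a single 4M super page covers the whole frame buffer, and PAGE_4M / PAGE_4K yields the 1024 normal pages replaced by each super page.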

Support for write-combined access to I/O memory

To improve graphics performance, we added basic support for write-combined I/O access to the IO_MEM service of core. The creator of an IO_MEM session can now specify the session argument "write_combined=yes" at session-creation time. Depending on the actual base platform, core then tries to establish the correct page-table attribute configuration when mapping the corresponding I/O dataspace. Setting caching attributes differs for each kernel:

  • L4ka::Pistachio supports a MemoryControl system call, which allows for specifying caching attributes for a core-local virtual address range. The attributes are propagated to other processes when core specifies such a memory range as source operand during IPC map operations. However, with the current version, we have not yet succeeded in establishing the right attribute setting, so the performance improvement is not noticeable.

  • On L4/Fiasco, we fully implemented the use of the right attributes for marking the frame buffer for write-combined access. This change significantly boosts the graphics performance and, with regard to graphics performance, serves us as the benchmark for the other kernels.

  • OKL4 v2 does not support x86 page attribute tables. So write-combined access to I/O memory cannot be enabled.

  • On Linux, the IO_MEM service is not yet used because we still rely on libSDL as hardware abstraction on this platform.
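
How a service might evaluate such a session argument can be sketched as follows. This is not core's implementation. The helper names are hypothetical, and the sketch assumes that session arguments arrive as a comma-separated list of key=value pairs without quoted values. Only the argument "write_combined=yes" is taken from the text; the other keys in the usage note are made up for the example.

```cpp
#include <cassert>
#include <string>

// Return the value of 'key' in a string like "k1=v1, k2=v2", or an
// empty string if the key is absent. Quoted values are not handled.
std::string arg_value(const std::string &args, const std::string &key)
{
    std::size_t pos = 0;
    while (pos < args.size()) {
        std::size_t end = args.find(',', pos);
        if (end == std::string::npos)
            end = args.size();

        std::string item = args.substr(pos, end - pos);
        const std::size_t first = item.find_first_not_of(' ');
        if (first != std::string::npos) {
            // Strip surrounding blanks, then split at the first '='
            const std::size_t last = item.find_last_not_of(' ');
            item = item.substr(first, last - first + 1);
            const std::size_t eq = item.find('=');
            if (eq != std::string::npos && item.compare(0, eq, key) == 0)
                return item.substr(eq + 1);
        }
        pos = end + 1;
    }
    return std::string();
}

// True only if the client explicitly asked for write combining
bool write_combined_requested(const std::string &args)
{
    return arg_value(args, "write_combined") == "yes";
}
```

For example, write_combined_requested("base=0xe0000000, size=0x400000, write_combined=yes") returns true, whereas an argument string without the key keeps the default of uncached access.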

Unification of linker scripts and startup codes

During the last year, we consistently improved portability and the support for different kernel platforms. By working on different platforms in parallel, code duplications get detected pretty easily. The startup code was a steady source for such duplications. We have now generalized and unified the startup code for all platforms:

  • On all base platforms (Linux-x86_32, Linux-x86_64, OKL4, L4ka::Pistachio, and L4/Fiasco) Genode now uses the same linker script for statically linked binaries. Therefore, the linker script has now become part of the base repository.

  • We unified the assembly startup code (crt0) for all three L4 platforms. Linux has a custom crt0 code residing in base-linux/src/platform. For the other platforms, the crt0 code resides in the base/src/platform/ directory.

  • We factored out the platform-dependent bits of the C++ startup code (_main.cc) into platform-specific _main_helper.h files. The _main.cc file has become generic and moved to base/src/platform.

Linux

With the past two releases, we successively reduced the dependency of the Linux version of core on the glibc. Initially, this step had been required to enable the use of our custom libc. For example, the mmap function of our libc uses Genode primitives to map dataspaces into the local address space. The back end of the used Genode functions, in turn, relied on Linux' mmap syscall. We cannot use syscall bindings provided by the glibc for issuing the mmap syscall because the binding would clash with our libc implementation of mmap. Hence, we started to define our own syscall bindings.

With the current version, the base system of Genode has become completely independent of the glibc. Our custom syscall bindings for the x86_32 and x86_64 architectures reside in base-linux/src/platform and consist of 35 relatively simple functions using a custom variant of the syscall function. The only exception here is the clone system call, which requires assembly code and resides in a separate file.

This last step on our way towards a glibc-free Genode on Linux pushes the idea of using only the Linux kernel, but no further Linux user infrastructure, to the max. However, it is still not entirely possible to build a Linux-based OS completely based on Genode. First, we have to set up the loopback device to enable Genode's RPC communication over sockets. Second, we still rely on libSDL as hardware abstraction and libSDL, in turn, relies on the glibc.

Implications

Because the Linux version is now largely in line with the other kernel platforms, using custom startup code and direct system calls, we cannot support host tool chains to compile this version of Genode anymore. Host tool chains, in particular the C++ support library, rely on certain Linux features such as thread-local storage via the gs segment register. These things are normally handled by the glibc but Genode leaves them uninitialized. To build the Linux version of Genode, you have to use the official Genode tool chain.

OKL4

The build process for Genode on OKL4 used to be quite complicated. Before being able to build Genode, one had to build the original Iguana user land of OKL4 because the Genode build system looked at the Iguana build directory for the L4 headers actually used. We now have simplified this process by not relying on the presence of the Iguana build directory anymore. All needed header files are now shadowed from the OKL4 source tree to an include location within Genode's build directory. Furthermore, we build Iguana's boot-info library directly from within the Genode build system, instead of linking the binary archive as produced by Iguana's build process.

Of course, to run Genode on OKL4, you still need to build the OKL4 kernel but the procedure of building the Genode user land is now much easier.

Misc changes:

  • Fixed split of unmap address range into size2-aligned flexpages. The unmap function did not handle dataspaces with a size of more than 4MB properly.

  • Fixed line break in the console driver by appending a line feed to each carriage return. This is in line with L4/Fiasco and L4ka::Pistachio, which do the same trick when text is printed via their kernel debugger.
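
The flexpage split mentioned in the first item above can be illustrated by the following sketch, which decomposes an address range into chunks that each have a power-of-two size and are aligned to that size. It is a generic illustration and not the code used in core. The names Flexpage and split_into_flexpages, and the minimum size of 4K, are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Flexpage {
    std::uint64_t base;      // start address, aligned to the flexpage size
    unsigned      log2_size; // size is 1 << log2_size bytes
};

// Split [base, base + size) into naturally aligned power-of-two chunks.
// 'base' and 'size' must be multiples of the minimum flexpage size.
std::vector<Flexpage> split_into_flexpages(std::uint64_t base,
                                           std::uint64_t size,
                                           unsigned min_log2 = 12)
{
    const std::uint64_t min_size = 1ULL << min_log2;
    assert(base % min_size == 0 && size % min_size == 0);

    std::vector<Flexpage> result;
    while (size > 0) {
        unsigned log2 = min_log2;
        // Grow while the next larger size fits and 'base' stays aligned
        while (log2 + 1 < 64) {
            const std::uint64_t bigger = 1ULL << (log2 + 1);
            if (bigger > size || (base & (bigger - 1)) != 0)
                break;
            ++log2;
        }
        Flexpage fp = { base, log2 };
        result.push_back(fp);
        base += 1ULL << log2;
        size -= 1ULL << log2;
    }
    return result;
}
```

For a 6MB range starting at address 0x400000, the function yields one 4MB flexpage followed by one 2MB flexpage, which is the kind of case a range larger than 4MB has to cover.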

L4ka::Pistachio

The previous version of core on Pistachio assumed a memory split of 2GB/2GB between userland and kernel. Now, core reads the virtual-memory layout from the kernel information page and thereby can use up to 3GB of virtual memory.

Important: Because of the added support for super pages, the Pistachio kernel must be built with the "new mapping database" feature enabled!

L4/Fiasco

Removed superfluous zeroing-out of the memory we get from sigma0. This change further improves the startup performance of Genode on L4/Fiasco.

Build infrastructure

Tool chain

  • Bumped binutils version to 2.19.1

  • Support both x86_32 and x86_64

  • Made tool_chain's target directory customizable to enable building and installing the tool chain with user privileges

Build system

  • Do not include dependency rules when cleaning. This change brings not only a major speedup but it also prevents dependency rules from messing with generic rules, in particular those defined in spec-okl4.mk.

  • Enable the use of -ffunction-sections combined with -gc-sections by default and thereby reduce binary sizes by an average of 10-15%.

  • Because all base platforms, including Linux, now depend on the Genode tool chain, the build system uses this tool chain as default. You can still override the tool chain by creating a custom etc/tools.conf file in your build directory.