Release notes for the Genode OS Framework 14.05
With Genode version 14.05, we address two problems that are fundamental for the scalability of the framework. The first problem is the way Genode interoperates with existing software. A new concept for integrating 3rd-party source code with the framework makes the porting and use of software that is maintained outside the Genode source tree easier and more robust than ever. The rationale and the new concept are explained in Section Management of ported 3rd-party source code. The second problem concerns how programs that are built atop a C runtime (as is the case for most 3rd-party software) interact with the Genode world. Section Per-process virtual file systems describes how we consolidated many special-purpose solutions into one coherent design based on process-local virtual file systems.
In line with our road map, we put forward our storage-related agenda by enabling the use of NetBSD's cryptographic device driver (CGD) on Genode. Thereby, we continue our engagement with the rump kernel that we started to embrace with version 14.02. Section Block-level encryption using CGD explains the use of CGD as a Genode component.
Apart from those infrastructural improvements, the release cycle has focused on the NOVA and base-hw platforms. On NOVA, we are happy to have enabled static real-time priorities, which make the kernel much more appealing for its designated use as the basis of a general-purpose OS. Furthermore, we intensified our work on VirtualBox on NOVA by enabling guest-addition support and improving stability and performance. The NOVA-related improvements are covered by Sections VirtualBox on NOVA and NOVA microhypervisor.
The development of our custom base-hw kernel platform for the ARM architecture goes full steam ahead. With the added support for multiple processors, base-hw can finally leverage the CPU resources of modern ARM platforms. Furthermore, we largely redesigned the memory management to avoid the need to maintain identity mappings, which makes the kernel more robust. Section Execution on bare hardware (base-hw) explains those developments in detail.
Finally, we enhanced the driver support for x86-based platforms by enabling USB 3.0 in our Linux device-driver environment. Section USB 3.0 for x86-based platforms outlines the steps we had to take.
Management of ported 3rd-party source code
Without the wealth of existing open-source software, Genode would be of little use. We regularly combine the work of more than 70 open-source projects with the framework. The number is steadily growing because each Genode user longs for different features.
Since version 11.08, we employed a common way of integrating 3rd-party software with Genode, which came in the form of a makefile per source-code repository. Each of those makefiles offered "prepare" and "clean" rules that automated the downloading and integration of 3rd-party code. The introduced automatism was a big relief for our work flows. Since then, the amount of 3rd-party code ported to Genode has been steadily increasing. It eventually reached a complexity that became hard to manage using the original mechanism. In order to make Genode easier to conquer for new users and more enjoyable for regular developers, we had to reconsider how 3rd-party code is integrated with the framework.
We identified the following limitations of the existing approach:
-
From the viewpoint of Genode users, the most inconvenient limitation was the lack of proper error messages when a port was not prepared beforehand. Instead, the build system produced confusing error messages when unable to find the source code. According to the trouble-shooting requests on our mailing list, the missing preparation of 3rd-party code seems to be the most prominent road block for new users.
-
Even when all required 3rd-party ports had been prepared, the prepared versions could become outdated while using Genode over time. Eventually, the build process would expect a different version of the 3rd-party code than the one prepared. This happens particularly when switching between branches. In some cases, the version of the 3rd-party code is updated quite often (e.g., base-nova). The build system could not detect such inconsistencies and consequently responded with arcane error messages or, even worse, produced binaries with unexpected runtime behaviour.
-
There are many source-code repositories that deal with downloading and integrating 3rd-party code in different ways, namely libports, ports, ports-foc, base-<kernel>, dde_ipxe, dde_rump, dde_linux, dde_oss, and qt4. Even though the makefiles of all those repositories offered the "prepare" and "clean" rules, they were not consistent with regard to the handling of corner cases, the updating of packages, and the use of additional arguments ("PKG="). Moreover, the individual port-description files (<repository>/ports/*.mk) found in the ports and libports repositories contained a lot of boiler-plate content such as the rules for downloading files via wget, or the rules for checking signatures. Such duplicated code tends to degrade in quality and consistency over time, affecting the user experience and maintenance costs in a negative way.
-
The downloaded archives and the extracted 3rd-party code used to reside within the respective repositories (in the download/ and contrib/ subdirectories). This made the use of search tools like grep very inefficient when attempting to search Genode's source code while excluding 3rd-party sources. For this reason, most regular Genode developers have crafted special shell aliases for filtered search operations. But this should not be the way to go.
-
During the "make prepare" step, most ports of libraries used to create a bunch of symlinks within <rep-dir/include/ that pointed to the respective header files within <rep-dir>/contrib/. Effectively, this step touched Genode's source tree, which was bad in two ways. First, the portions of the source tree installed by the "make prepare" mechanism had to be blacklisted in Genode's .gitignore file. And second, executing the port-specific "make clean" rules was quite dangerous because those rules operated on the source tree.
The way forward
The points above made the need for a changed source-tree structure apparent. Traditionally, all of Genode's source-code repositories alongside the tool/ and doc/ directories were located at the root of the tree structure:
tool/
doc/
base/
base-okl4/
   Makefile
   download/
   include/
   lib/
   src/
os/
...
Repositories that incorporated 3rd-party code (e.g., base-okl4 as depicted above) hosted a makefile for the preparation, a download/ directory for the downloaded 3rd-party source code, and a contrib/ directory for the extracted source code. There was no notion of common tools that would work across repositories.
With Genode 14.05, we move all repositories to a repos/ directory:
tool/
doc/
repos/
   base/
   base-okl4/
   os/
   ...
contrib/
Downloaded 3rd-party source code resides outside of the actual repository at the central contrib/ directory. By using this structure, we achieve the following:
-
Working with grep within the repositories is very efficient now because downloaded and extracted 3rd-party code is no longer in the way. It resides next to the repositories.
-
In contrast to the original situation where we had no convention about the location of source-code repositories, tools can rely on a convention now. Because the repositories are located at a known position within the tree, the tools for creating build directories and for managing ports are aware of the location of the repositories as well as of the central contrib/ directory.
-
Adding a supplemental repository is pretty intuitive: Just clone a git repository into repos/.
-
Tutorials that describe the use of Genode can benefit from the introduced convention by suggesting the creation of build directories at the top level, which no longer interferes with the location of the source-code repositories. This makes those tutorials a bit easier to follow.
-
The create_builddir tool can create build directories at sensible default locations. E.g., when create_builddir is called with nova_x86_64 as argument but with no BUILD_DIR argument, the tool will create a build directory build/nova_x86_64/ by default. This way, we reinforce a useful convention about the naming and location of build directories that will ease the support of Genode users.
-
Storing all build directories and downloaded 3rd-party source code somewhere outside the Genode source tree, let's say on a different disk partition, can easily be accomplished by creating a symbolic link for each of the build/ and contrib/ directories, as sketched below.
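To illustrate the last two points, the following shell commands show a hypothetical relocation of build/ and contrib/ (the partition path is made up) along with the default build-directory location:

# hypothetical: keep build/ and contrib/ on a different partition
ln -s /mnt/spare_partition/genode-build   build
ln -s /mnt/spare_partition/genode-contrib contrib

# without a BUILD_DIR argument, the build directory ends up at build/nova_x86_64/
./tool/create_builddir nova_x86_64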
Of course, changing the source-tree structure at the top-level was no light-hearted decision. In particular, it raised the question of how to deal with topic branches that were branched off a Genode version with the old layout. During the transition, we observed the following patterns to deal with that problem:
-
Git can deal well with patches that change existing files, even if the file location has changed. For simple patches, e.g., small bug fixes, cherry-picking those individual commits to a current branch works quite well.
-
If a commit adds new files, the files will naturally end up at the location specified in the patch, i.e., somewhere outside of the repos/ directory. You will have to manually move them to the correct location using git mv and squash the resulting rename commit onto the original commit using git rebase -i.
-
For migrating a series of complex commits to the new layout, we use git format-patch to obtain a patch series for the topic branch, prefix the original pathnames with "repos/" using sed, and apply the result using git am.
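For instance, migrating a hypothetical topic branch to the new layout might look like this (the branch and remote names are placeholders):

git format-patch origin/master..my_topic_branch
sed -i -e 's|^--- a/|--- a/repos/|' \
       -e 's|^+++ b/|+++ b/repos/|' \
       -e 's|^diff --git a/\(.*\) b/|diff --git a/repos/\1 b/repos/|' *.patch
git checkout my_topic_branch_new_layout   # branch based on the new layout
git am *.patch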
Unification of the ports management
With the new source-tree layout in place, we could pursue a new take on unifying the management of ported 3rd-party source code. The new solution, which is very much inspired by the fabulous Nix package manager, comes in the form of new tools to be found at tool/ports/.
Note that even though the port mechanism described herein looks a bit like "package management", it covers a different problem. The problem covered here is the integration of existing 3rd-party source code with the Genode source tree. Packaging, on the other hand, would provide a means to distribute self-contained portions of the Genode source tree including their respective 3rd-party counterparts as separate packages. Package management is not addressed yet.
The new tools capture all ports present in the repositories located under repos/. Using them is as simple as follows:
- Obtain a list of available ports
-
tool/ports/list
- Download and install a port
-
tool/ports/prepare_port <port-name>
The prepare_port tool will scan all repositories for the specified port and install the port into contrib/. Each version of an installed port resides in a dedicated subdirectory within the contrib/ directory. The port-specific directory is called port directory. It is named <port-name>-<fingerprint>. The <fingerprint> uniquely identifies the version of the port (it is a SHA1 hash of the ingredients of the port). If two versions of the same port are installed, each of them will have a different fingerprint, so they end up in different directories.
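For example, after preparing a few ports, the contrib/ directory might look like this (the port names and shortened fingerprints are made up):

ls contrib/
# dde_rump-3e7a29...  dde_rump-91c4d0...  libc-5b82f1...  zlib-a6d301...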
Within a source-code repository, a port is represented by two files, a <port-name>.port and a <port-name>.hash file. Both files reside at the ports/ subdirectory of the corresponding repository. The <port-name>.port file is the port description, which declares the ingredients of the port, e.g., the archives to download and the patches to apply. The <port-name>.hash file contains the fingerprint of the corresponding port description, thereby uniquely identifying a version of the port as expected by the checked-out Genode version.
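For illustration, the port description of a hypothetical library could look roughly like the following (the name, URL, and hash values are placeholders). The corresponding <port-name>.hash file contains nothing but the fingerprint as a single SHA1 hex string.

LICENSE   := BSD
VERSION   := 1.0
DOWNLOADS := libfoo.archive

URL(libfoo) := http://example.org/libfoo-1.0.tgz
SHA(libfoo) := <sha1-of-archive>
DIR(libfoo) := src/lib/libfoo

PATCHES   := src/lib/libfoo/fix_build.patch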
So how does Genode's build system find the source code for a given port? If the build system encounters a target that incorporates ported source code, it looks up the respective <port-name>.hash file in the repositories as specified in the build configuration. The fingerprint found in the hash file is used to construct the path to the port directory under contrib/. If that lookup fails, a meaningful error is printed. Any number of versions of the same port can be installed at the same time. I.e., when switching Git branches that use different versions of the same port, the build system automatically finds the right port version as expected by the currently active branch.
For step-by-step instructions on how to add a port using the new mechanism, please refer to the updated porting guide:
- Genode Porting Guide
-
https://genode.org/documentation/developer-resources/porting
- Known limitations
-
There is no garbage collection of stale ports, yet. Each time a port gets updated, a new version will be created within the contrib/ directory. However, stale subdirectories can safely be deleted manually to regain disk space. In the worst case, if you delete a port that is still in use, the build system will let you know.
-
Even though some port files are equipped with information about cryptographic signatures, those signatures are not checked yet. However, each downloaded archive is verified against a known-good hash value declared in the port description, which ensures the integrity of the downloaded files. But as illustrated by the signature declarations in the port descriptions, we plan to increase the confidence by enabling signature checks in addition to the hash-sum checks.
-
Dependencies between ports are not covered by port descriptions, yet.
- Transition to the new mechanism
We have reworked the majority of the more than 70 existing ports to the new mechanism. The only ports not covered so far are base-codezero, qt5, gcc, gdb, and qt4. During the next release cycle, we will keep the original "make prepare" mechanism as a front end intact. So the "make prepare" instructions as found in many tutorials will still work. But under the hood, "make prepare" just invokes the new tool/ports/prepare_port tool.
Block-level encryption using CGD
The need for protection of personal data is becoming generally accepted in the information age, especially against the background of ubiquitous storage devices in smart phones, notebooks, and tablet computers, which may easily go missing.
There are several different approaches to prevent unauthorized access to data storage. For example, data can be encrypted on a per-file basis (e.g., EncFS or PEFS), whereby each file is encrypted using a cipher but stored on a regular file system beside unencrypted files. Beyond this approach, it is also common to encrypt data on the lower block-device layer. With block-level encryption, each block on the storage device is encrypted or decrypted when written to or read from the device (e.g., TrueCrypt, FreeBSD's geli(8), Linux LUKS). On top of this cryptographic storage device, a regular file system may be used.
Additionally, it is desirable to access the encrypted data from various operating systems. In our case, we want to use the data from Genode as well as from our current development platform Linux.
In Genode 14.02, we introduced a port of the NetBSD-based rump kernels to leverage file-system implementations, e.g., ext2. Beside file systems, NetBSD also offers block-level encryption in the form of its cryptographic disk driver cgd(4). In line with our roadmap, we enabled this cryptographic-device driver in our rump-kernels port as a first step to explore block-level encryption on Genode.
- NetBSD cryptographic-device driver (CGD)
-
https://www.netbsd.org/docs/guide/en/chap-cgd.html
The heart of our CGD port is the rump_cgd server, which encapsulates the rump kernels and the cgd device. The server uses a block session to get access to an existing block device and, in return, provides a block session to its client. Each block written or read by the client is transparently encrypted or decrypted by the server with a given key. This enables us to seamlessly integrate CGD into Genode's existing infrastructure.
To ease its use, the server interface is modelled after the interface of cgdconfig(8). This implies that the key must have the same format as used by cgdconfig, which means the key is a base64-encoded string. The first 4 bytes of the key string denote the actual length of the key in bits (these 4 bytes are stored in big-endian order). For now, we only support the use of a stored key. However, we plan to add the use of passphrases in combination with keys later.
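The key format can be inspected with common Unix tools. Taking the example key from the config snippet below, the first 4 decoded bytes read 0x00000100, announcing a 256-bit key:

echo -n "AAABAJhpB2Y2UvVjkFdlP4m44449Pi3A/uW211mkanSulJo8" | base64 -d | xxd -l 4
# 00000000: 0000 0100   -> big endian 0x100 = 256, i.e., a 256-bit key follows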
Currently, rump_cgd is only able to configure a cgd device but cannot generate the configuration itself. A configuration, or rather a working key, may be generated by using the new tool/rump script. The used cipher is hard-coded to aes-cbc with a key size of 256 bits at the moment. Note that the server serves only one client as it transparently encrypts/decrypts one back-end block session. Though rump_cgd is currently limited with regard to the used cipher and the way key input is handled, we plan to extend this rump-kernel-based component step by step in the future.
If you want to get some hands on with CGD, the first step is to prepare a raw encrypted and ext2-formatted partition image by using the tool/rump script:
dd if=/dev/urandom of=/path/to/image
rump -c /path/to/image              # key is printed to stdout
rump -c -k <key> -f -F ext2fs /path/to/image
To use this disk image, the following config snippet can be used:
<start name="rump_cgd"> <resource name="RAM" quantum="8M"/> <provides><service name="Block"/></provides> <config action="configure"> <params> <method>key</method> <key>AAABAJhpB2Y2UvVjkFdlP4m44449Pi3A/uW211mkanSulJo8</key> </params> </config> <route> <service name="Block"> <child name="ahci"/> </service> <any-service> <parent/> <any-child/> </any-service> </route> </start>
Note that we explicitly route the block-session requests for the underlying block device to the AHCI driver.
The block service provided by rump_cgd, in turn, is used by a file-system server:
<start name="rump_fs"> <resource name="RAM" quantum="16M"/> <provides><service name="File_system"/></provides> <config fs="ext2fs"> <policy label="" root="/" writeable="yes"/> </config> <route> <service name="Block"> <child name="rump_cgd"/> </service> <any-service> <parent/> <any-child/> </any-service> </route> </start>
Currently, the key to access the cryptographically secured device must be specified before using the device. Implementing a mechanism which asks for the key on the first attempt is in the works.
By using the rump kernels and the cryptographic-device driver, we are able to use block-level encryption on Genode as well as on Linux. In the Linux case, we depend on rumprun, which can run unmodified NetBSD userland tools on top of the rump kernels to manage the cgd device. To ease this task, we provide the aforementioned rump wrapper script.
Since the rump script covers the most common use cases for the tools, it offers a fair number of options. Hence, a short tutorial is in order.
- Format a disk image with Ext2
First, prepare the actual image file:
dd if=/dev/zero of=/path/to/image bs=1M count=128
Second, use tool/rump to format the disk image:
rump -f -F ext2fs /path/to/image
Afterwards, the file system just created may be populated with the contents of another directory by executing:
rump -F ext2fs -p /path/to/source /path/to/image
To list the contents of the image, run:
rump -F ext2fs -l /path/to/image
- Create an encrypted disk image
Creating a cryptographic disk image based on cgd(4) is done by executing the following command:
rump -c /path/to/image
This will generate a key that may be used to decrypt the image later on. Since this command will only generate a key and not initialize the disk image, it is highly advisable to prepare the disk image by using /dev/urandom instead of /dev/zero because only new blocks later written to the disk image are encrypted on the fly. In addition, while generating the key, a temporary configuration file will be created. Although this file has proper permissions, it may leak the generated key if it is created on persistent storage. To specify a more secure directory, the -t option can be used:
rump -c -t /path/to/secure/directory /path/to/image
It is advised to carefully select an empty directory because the specified directory is removed after completion.
Decrypting the disk image requires the key generated in the previous step:
rump -c -k <key> /path/to/image
For now, this key has to be specified as a command-line argument. This is an issue if the shell in use maintains a history of executed commands.
For the sake of completeness, let us put all examples together by creating an encrypted ext2 image that will contain all files of Genode's demo scenario:
dd if=/dev/urandom of=/tmp/demo.img bs=1M count=16
rump -c /tmp/demo.img               # key is printed to stdout
rump -c -k <key> -f -F ext2fs -d /dev/rcgd0a /tmp/demo.img
rump -c -k <key> -F ext2fs -p $(BUILD_DIR)/var/run/demo /tmp/demo.img
To check if the image was populated successfully, execute the following:
rump -c -k <key> -F ext2fs -l /tmp/demo.img
More detailed information about the options and arguments of this tool can be obtained by running:
rump -h
Because tool/rump utilizes the rump kernels running on the host system to do its duty, there is additionally a script called tool/rump_cgdconf that extracts the key from a cgdconfig(8)-generated configuration file and is also able to generate such a file from a given key. Thereby, we try to accommodate the interoperability between the general rump-kernel-based tools and the rump_cgd server used on Genode.
Per-process virtual file systems
Our C runtime has served us quite well over the years. At its core, it has a flexible plugin architecture that allows us to combine different back ends such as the lwIP socket API (using libc_lwip_nic_dhcp), using LOG as stdout (via libc_log), or using a ROM dataspace as a file (via libc_rom). Recently, however, the original design has started to show its limitations:
-
Although there is the libc_fs plugin that allows a program to access files from a file-system server, there is no way for a program to access two different file-system servers, for example, if a web server wants to obtain its configuration and the website content from two different file systems.
-
Beside the lack of features of individual libc plugins, there are problems stemming from combining multiple plugins. For example, there is the libc_block plugin that makes a block session accessible as a pseudo block device named "/dev/blkdev". However, when combined with the libc_fs plugin, it is not defined which of the two plugins will respond to requests for a file with this name. As a quick and dirty work-around, the libc_fs plugin explicitly blacklists "/dev/blkdev". The need for such a work-around hints at a deficiency of the overall design. In general, if multiple plugins are combined, there is no consistent virtual file-system structure exposed via getdirentries.
-
Another inconvenience is a missing concept for handling standard input and output. Most programs use libc_log to direct stdout to the LOG service. But what if we want to direct the output of such a program to a terminal? Granted, there exists the terminal_log server to translate a LOG session to a terminal session, but it would be much nicer to have this flexibility at the C-runtime level.
-
Finally, when looking at the implementation of the plugins, it becomes apparent that many of them look similar. We have to admit that there are quite a few dusty corners where duplicated code has accumulated over the years. That said, the semantic details (e.g., the quality of error handling) differ from plugin to plugin. Seeing the number of file systems (and thereby the number of added libc plugins) grow, it became clear that our original design would make the situation even worse.
On the other hand, we have gathered thoroughly positive experiences with the virtual file-system implementation of our Noux runtime, which is an environment for running Unix software on Genode. The VFS as implemented for Noux supports stacked file systems (similar to union mounts) of various types. It is stable and complete enough to run our tool chain to build Genode on Genode. Wouldn't it be a good idea to reuse the Noux VFS for the normal libc? With the current release cycle, we pursued this line of thought.
The first step was transplanting the VFS code from the Noux runtime to a free-standing library. The most substantial change was the decoupling of the VFS interfaces from the types provided by Noux. All those types had been moved to the VFS library. In the process of reshaping the Noux VFS into a library, several existing pseudo file systems received a welcome clean-up, and some new ones were added. In particular, there is a new "log" file system for writing data to a LOG session, a "rom" file system for reading ROM modules, and an "inline" file system for reading data defined within the VFS configuration.
The second step was the addition of a new libc_vfs plugin to the C runtime. This plugin makes the VFS library available to libc-using programs via the original libc plugin interface. It translates the types and functions of the VFS library to the types and functions of the C library. At this point, it was an optional plugin. As the VFS was meant to replace the various existing plugins instead of accompanying them, the next challenge was to revisit all the users of the various libc plugins and to adapt them to use the libc_vfs plugin instead. This was, by far, the more elaborate step. More than 50 programs and their respective run scripts had to be adapted and tested. However, this process was very satisfying because we could see how the new VFS plugin covers all the use cases formerly accommodated by a zoo of special plugins.
As the last step, we could retire several libc plugins such as libc_rom, libc_block, libc_log, and libc_fs and merge the libc_vfs into the libc. Technically, it is still a plugin, but it is always present.
- How has the libc changed?
Each libc-using program can be configured with a program-local virtual file system as illustrated by the following example:
<config>
  ...
  <libc stdin="/dev/null" stdout="/dev/log" stderr="/dev/log">
    <vfs>
      <dir name="dev"> <log/> <null/> </dir>
      <dir name="etc">
        <dir name="lighttpd">
          <inline name="lighttpd.conf"> ... </inline>
        </dir>
      </dir>
      <dir name="website"> <tar name="website.tar"/> </dir>
    </vfs>
  </libc>
</config>
Here, you see a lighttpd server that serves a website coming from a TAR archive (which is obtained from a ROM module named "website.tar"). There are two pseudo devices "/dev/log" and "/dev/null", to which the "stdin", "stdout", and "stderr" attributes refer. The "log" file system consists of a single node that represents a LOG session. The web-server configuration is supplied inline as part of the configuration. (BTW, you can try out a very similar scenario using the ports/genode_org.run script, as shown below.)
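For instance, assuming an existing build directory, the scenario can be started as follows:

cd build/<platform>
make run/genode_org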
The VFS implementation resides at os/include/vfs/. This is where you can see the file-system types that are available (look for *_file_system.h files). Because the same code is used by Noux, we have one unified and coherent VFS implementation throughout the framework now.
There are two things needed to adapt your work to the change:
-
Remove the use of the libc_{rom, block, log, fs} plugins from your target description files. Those plugins are no more. As of now, the VFS is still internally a plugin, but it is always included with the libc.
-
Configure the VFS of your libc-using program in your run script. For most former users of the sole libc_log plugin, this configuration looks like this:
<config>
  <libc stdout="/dev/log" stderr="/dev/log">
    <vfs>
      <dir name="dev"> <log/> </dir>
    </vfs>
  </libc>
</config>
For former users of other plugins, there are the block, rom, and fs file-system types available.
- Feature set and limitations
As of now, the following file-system types are supported (a combined example follows the list):
- dir
-
represents a directory, which, in turn, can host multiple file systems.
- block
-
accesses a block session. The label of the session can be configured via the "label" attribute.
- fs
-
accesses a file-system server via a file-system session. The session label can be defined via the "label" attribute.
- inline
-
provides the content of the configuration node as the content of a read-only file.
- log
-
represents a pseudo device for writing to a LOG session. This type is useful for redirecting stdout to a LOG service such as the one provided by core.
- null and zero
-
represent pseudo devices similar to /dev/null and /dev/zero on Unix.
- rom
-
makes a ROM module available as a read-only file. If the name of the ROM module differs from the node name, the module name can be expressed by the "label" attribute.
- tar
-
obtains a TAR archive as ROM module and makes its content available as a file system. The name of the ROM module corresponds to the name of the tar node.
- terminal
-
is a pseudo device that accesses a terminal session. The session can be labeled using the "label" attribute.
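To give a rough impression, several of these types might be combined as follows (the session labels and file names are made up for illustration):

<vfs>
  <dir name="dev"> <log/> <null/> <zero/> <terminal/> </dir>
  <dir name="data"> <fs label="data"/> </dir>
  <inline name="app.conf"> ... </inline>
  <rom name="README" label="readme_module"/>
</vfs>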
There are still two major limitations: First, select is not supported yet. That means that programs cannot block for I/O (such as reading from a terminal). Because of this limitation, we still keep the libc_terminal plugin around, which supports select. As the second limitation, the VFS interface performs read and write operations as synchronous requests. This behaviour is inherited from the Noux implementation. It goes without saying that we plan to change it to support non-blocking operations. But this step is not taken yet.
Revised session interfaces
The session interfaces for framebuffer and file-system access underwent the following minor changes.
- Framebuffer session
-
We simplified the framebuffer-session interface by removing the Framebuffer::Session::release() method. This step makes the mode-change protocol consistent with the way the ROM-session interface handles ROM-module changes. That is, the client acknowledges the release of its current dataspace by requesting a new dataspace via the Framebuffer::Session::dataspace() method.
To enable framebuffer clients to synchronize their operations with the display frequency, the session interface received the new sync_sigh function. Using this function, a client can register a handler for receiving display-synchronization events (see the sketch below). As of now, no framebuffer service implements this feature in a useful way. But this will change in the upcoming release cycle when we overhaul Genode's GUI stack.
- File-system session
-
Until now, there was no exception type for the condition where a symbolic link was created on a file system without symlink support, e.g., FAT. The corresponding file-system server (ffat_fs) used to return a negative handle as a work-around. Hence, we added Permission_denied to the list of exceptions thrown by File_system::Session::symlink to handle this case in a clean way.
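Coming back to the framebuffer change, the following minimal sketch shows how a client might use sync_sigh to drive its render loop. The rendering itself is left out, and the code is merely an illustration of the API, not taken from an actual component:

#include <framebuffer_session/connection.h>
#include <base/signal.h>

void render_loop()
{
	static Framebuffer::Connection fb;
	static Genode::Signal_receiver sig_rec;
	static Genode::Signal_context  sync_ctx;

	/* register handler for display-synchronization events */
	fb.sync_sigh(sig_rec.manage(&sync_ctx));

	for (;;) {
		/* block until the next display-sync event occurs */
		sig_rec.wait_for_signal();

		/* ...draw the next frame into the framebuffer dataspace... */

		Framebuffer::Mode const mode = fb.mode();
		fb.refresh(0, 0, mode.width(), mode.height());
	}
}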
Ported 3rd-party software
VirtualBox on NOVA
With Genode 14.02, we successfully executed more than seven guest operating systems, including MS Windows 7, on top of Genode/NOVA. Based on this proof of concept, we invested significant efforts to stabilize and extend our port of VirtualBox during the last three months. We also paid attention to user friendliness (i.e., features) by enabling support for guest additions.
Regarding stability, one issue we encountered was occasional synchronization problems during the early VMM bootstrap phase. Several internal threads of the VMM are started concurrently, like the timer thread, emulation thread (EMT), virtual CPU handler thread, hard-disk thread, and user-interface front-end thread. Some of these threads are favoured over others according to their importance. VirtualBox expresses this by host-specific mechanisms like priorities and nice levels of the host operating system. For Genode, we implemented this specific part accordingly by using multiple Genode CPU sessions.
The next field of work was the emulation code and the code for handling VM exits, which used to be executed by two different threads. We chose this structure in the original port to satisfy the following specific characteristics of the underlying NOVA kernel: The emulation code is provided by VirtualBox and is started as a pthread (EMT thread). In contrast, the hardware-accelerated vCPU thread is running solely in the context of the VM in guest mode. When a VM exit happens, the exit is reflected by an IPC message sent through a NOVA portal and received by a vCPU handler thread running in our port of the VirtualBox VMM. This thread must be a NOVA worker thread, i.e., one that has no scheduling context (SC) associated. The emulation thread, however, is a global thread with an associated SC.
Using two separate threads and synchronization points between them enabled us to make quick progress in the first release of the port, which led to the successful execution of Windows guests. Now, one goal was to merge both threads in order to avoid the thread-context-switching costs between them. Also, we wanted to get rid of transferring the state between the vCPU handler and the emulation thread back and forth, including all that ugly synchronization code. For that purpose, we changed the startup of the emulation code: We first set up the vCPU handler thread and then start the vCPU in the VM. Hereafter, the VM exits immediately via a NOVA-specific vCPU startup exception and the vCPU handler thread gets in control. The vCPU handler thread then actually starts executing the VirtualBox-specific emulation code (originally executed by the EMT thread). Now, the vCPU handler thread and the VirtualBox EMT thread are physically one execution context. Whenever the emulation code decides to switch to the hardware-accelerated mode, the vCPU handler thread can directly set up the transfer of the VM state from the VirtualBox emulation mode into the state fields of the vCPU of the guest.
Additionally, we had to re-adjust the memory management of our port to meet requirements expected by VirtualBox. For some internal data structures, VirtualBox saves a pointer to a memory location not just as an absolute pointer, but instead splits this pointer into a process-absolute base and a base-local offset. These structures can thereby be shared across different protection domains where the base pointer typically differs (shared memory attached at different addresses). For the Genode port, we actually don't need this shared-memory feature. However, we had to recognize that the space for the offset value is a signed integer (int32_t). On a 64-bit host, this caused trouble if the distance of two memory pointers was larger than 31 bits (2 GiB). Fortunately, each memory-allocation request for such data structures comes with a type field, which we can use to make sure that all allocations per type are located within a 2-GiB virtual range.
Finally, we optimized the VM exits marginally and now try to avoid entering the emulation mode during a recall VM exit. If we detect that an IRQ is pending in the VMM models during the recall VM-exit handling, we inject the IRQ directly into the VM instead of changing into the VirtualBox emulation mode by default.
Regarding our keen endeavor to enable VirtualBox's guest additions, we started by enabling the VMMDev PCI pseudo device, which is the basis for VMM-specific hypercalls executed by guest systems. Beside basic functions (e.g., software-version reporting from host to guest and vice versa), also complex communication protocols can be implemented by storing request structures in guest-physical memory and passing their addresses to the VMMDev request I/O port. The communication mechanism in VirtualBox is called host-guest-communication manager (HGCM) and provides host services to the enlightened guest operating system. Among the available services, the most interesting one for us was the support for shared folders to exchange data between Genode and the guest OS. Now, we are able to configure shares in VirtualBox, which are mapped to VFS directories. For example:
<start name="virtualbox"> ... <config> ... <libc> <vfs> <dir name="ram"> <fs label="ram" /> </dir> </vfs> </libc> <share host="/ram/miezekatze" guest="miezekatze" /> ... </config> <route> <service name="File_system"> <if-arg key="label" value="ram" /> <child name="ram_fs"/> </service> ... </route> </start>
configures one shared folder miezekatze, which is backed by a VFS mount to a pre-populated RAM file system.
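Within a Linux guest that has the guest additions installed, such a share could then be mounted in the usual VirtualBox fashion, for example:

mount -t vboxsf miezekatze /mnt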
Furthermore, we integrated the guest-pointer device with the Nitpicker pointer and connected the real-time-clock VMM model to our RTC device driver. Both features are enabled by default and need no further configuration. Currently, both Nitpicker and the guest OS draw their mouse pointers on screen. We will improve this in the future because the guest informs the host about its GUI state via distinct pointer shapes.
During our development, we updated our port to VirtualBox 4.2.24 with the rough plan to go for 4.3 during the rest of the year.
Ported libraries
We updated OpenSSL to version 1.0.1g, which contains a fix for the Heartbleed bug. Furthermore, we enabled OpenSSL and curl for the ARM architecture.
Device drivers
USB 3.0 for x86-based platforms
Having supported USB 3.0 (XHCI) host controllers on the Exynos 5 platform since mid 2013, we decided it was about time to enable USB 3.0 on x86 platforms as well. Because XHCI is a standardized interface, which is also exposed by the Exynos 5 host controller, the enablement was relatively straightforward. The major open issue for x86 was the missing connection of the USB controller to the PCI bus. For this, we ported the XHCI-PCI part from Linux and connected it with the internal PCI driver of our dde_linux environment. This step enabled basic XHCI support for x86 platforms. Unfortunately, there seems not to be a single USB 3.0 controller without quirks. Thus, we tested several PCI cards and notebooks and added controller-specific quirks as needed. These quirks may not cover all current production chips though.
We also enabled and tested the HID, storage, and network profiles for USB 3.0, where the supported network chip is, as for Exynos 5, the ASIX AX88179 Gigabit-Ethernet Adapter.
Platforms
Execution on bare hardware (base-hw)
Multi-processor support
When we started to contemplate the support for symmetric multiprocessing (SMP) within the base-hw kernel, plenty of fresh influences on this subject floated around in our minds. Most notably, the NOVA port of Genode recently obtained SMP support in the course of a prototypical comparison of different models for inter-processor communication. In addition to the very insightful conclusions of this evaluation, our knowledge about other kernel projects and their ways to SMP went in. In general, this showed us that the subject - if addressed too ambitiously - may pose lots of complex stabilization problems, and coping with them easily draws down SMP efficiency in the aftermath.
Against this backdrop, we decided - as so often in the evolution of the base-hw kernel - to pick the easiest-to-reach and easiest-to-grasp solution first, with preliminary disregard of secondary requirements like scalability. As the base-hw kernel is single-threaded on uniprocessor systems, it was obvious to maintain one kernel thread per SMP processor and, as far as possible, let them all work in a similar way. To moreover keep the code base of the kernel as unmodified as possible while introducing SMP, access to kernel objects gets fully serialized by one global spin lock. Therewith, we had a very minimalistic starting point for what shall emerge on the kernel side.
Likewise, we started with a feature set narrowed to only the essentials on the user side, prohibiting thread migration, any kind of inter-processor communication, and also the unmapping of dataspaces, as the latter would have raised the need for the synchronization of TLBs. While thread migration is still an open issue, means of inter-processor communication and TLB synchronization were added successively after the basics worked stably.
First of all, the startup code of the kernel had to be adapted. The simple uniprocessor instantiation was split into three phases: At the very beginning, the primary processor runs alone and initializes everything that is needed for calling a simple C function, which then prepares and performs the activation of the other processors. For each processor, the program provides a dedicated piece of memory for the local kernel stack to live in. Now, each processor goes through the second - the asynchronous multiprocessor - phase, initializing its local caches and its memory-management unit. This is a basic prerequisite for spin locks to behave globally coherent, which also implies that memory accesses at this level can't be synchronized. Therefore, the first initialization phase prepares everything in such a way that the second phase can be done without writing to global memory. As soon as the processors are done with the second phase, they acquire the global spin lock that protects all kernel data. This way, all processors consecutively pass the third initialization phase that handles all remaining drivers and kernel objects. This is the last time the primary processor plays a special role by doing all the work that isn't related to processor-local resources. Afterwards, the processors can proceed to the main function that is called on every kernel pass.
Another main challenge was the mode-transition assembler code path that performs both transitions from a processor exception to the call of the kernel-main function and from the return of the kernel-main function back to the user space. As this can't be synchronized, all corresponding data must be provided per processor. This brought in additional offset calculations, which were a little tricky to achieve without polluting the user state. But after we managed to do so, the kernel was already able to handle user threads on different processors as long as they didn't interact with each other.
When it came to synchronous and asynchronous inter-processor communication, we enjoyed a big benefit of our approach. Due to fully serializing all kernel code paths, none of the communication models had to change with SMP. Thanks to the cache coherence of ARM hardware, even shared memory amongst processors isn't a problem. The only difference is that now a processor may change the schedule of another processor by unblocking one of its threads on communication feedback. This may rescind the current scheduling choice of the other processor. To avoid lags in this case, we let the unaware processor trap into an IPI. As the IPI sender doesn't have to wait for an answer, this isn't a big deal, neither conceptually nor performance-wise.
The last problem we had to solve for common Genode scenarios was the coherency of the TLBs. When unmapping a dataspace at one processor, the corresponding TLB entries must be invalidated on all processors, which - at least on ARM systems - can be done only processor-locally. Thus, we needed a protocol to broadcast the operation. At first, we decided to leave it to the user land to reserve a worker thread at each processor and synchronize between them. This way, we didn't have to modify the kernel back end that was responsible for updating the caches back in uniprocessor mode. Unfortunately, the revised memory management explained in Section Sparsely populated core address space relies on unmap operations at the startup of user threads, which led us into a chicken-and-egg situation. Therefore, the broadcasting was moved from the userland into the kernel. If a user thread now asks the kernel to update the TLBs, the kernel blocks the thread and informs all processors. The last processor that completes the operation unblocks the user thread. If this unblocking happens remotely, the kernel acts exactly the same as described above in the user-communication model. This way, the kernel never blocks itself but only the thread that requests a TLB update.
Given that all kernel operations are lightweight non-blocking operations, we assume that there is little contention for the global kernel lock. So we hope that the simple SMP model will perform well for the foreseeable future where we will have to accommodate only a handful of processors. If this assumption turns out to be wrong, or if the kernel should scale to large-scale SMP systems one day, we still have the choice to advance to a more sophisticated approach without much backpedaling.
Sparsely populated core address space
As the base-hw platform started as an experiment, its memory management was built in a pretty straightforward manner. All physical memory of the corresponding hardware was mapped one-to-one into the virtual address space of the kernel/core. This approach comes with several limitations:
-
The amount of physical memory that can be used is limited to a maximum of 4 GB on 32-bit ARM platforms
-
Several classes of potential memory bugs within base-hw's core may remain undetected (i.e., dangling pointers)
-
A static mapping of the core/kernel code within a dedicated, restricted area of the address space of all tasks is impossible, although this might be valuable to minimize the runtime overhead of interrupts and page faults.
-
As all physical RAM is mapped into core/kernel's address space as cacheable memory, it is in general impossible to map a portion of RAM with other caching attributes because the cache works with physical addresses on ARM. In the past, this caused problems when dealing with DMA memory, or when sharing uncached memory between TrustZone's secure and normal world.
These limitations are resolved as only the memory actually used by base-hw's core/kernel is mapped on demand now. Moreover, the mapping from physical to virtual addresses isn't necessarily one-to-one anymore.
NOVA microhypervisor
In line with most L4 kernels, the NOVA microhypervisor supports priority-based round-robin scheduling. However, on Genode, we did not leverage this feature. The reason was simple: We had no use for priorities on NOVA until now. This changed when heading towards using Genode on a daily basis to perform our work. On live Genode systems, we want to prioritize particular workloads over others. Admittedly, beside just enabling the priority configuration, we also wanted to postpone the solution of one challenging technical issue.
The NOVA kernel supports the creation of threads with and without a scheduling context attached. Scheduling contexts define a time quantum, a budget, and a priority. The scheduler uses these contexts to decide which activity runs next on the CPU. Therefore, a thread without a scheduling context attached can be executed only if a thread with a scheduling context transfers its context during IPC, or implicitly during an exception, for the time of the request. The transfer of the scheduling context implicitly defines the thread's current priority level. As a consequence, entrypoint threads inherit the priority of client threads and may run on completely different priority levels than other threads in the same process. Unfortunately, the described behaviour interferes with an invariant required by Genode's yielding spinlock implementation: All threads of one process must run at the same priority level. Otherwise, the system may end up in a live lock. Although the user-level yielding spinlock implementation is used solely to protect a few instructions in the lock implementation, the live lock bears a high risk for the system.
To overcome this issue in base-nova, we replaced the generic yielding spinlock implementation with a NOVA-specific helping lock. So, lower-priority threads potentially holding the helping lock get lent the scheduling context of a higher-priority lock applicant and can thereby finish the critical section. The core idea is to store the identity of the lock holder in the form of an execution-context capability in the lock variable. Other lock applicants use the stored capability and instruct the kernel to help the lock holder with their own scheduling context. Consequently, the lock-holder thread will run on the budget of the scheduling context obtained from the helping thread and, therefore, implicitly at the inherited priority level. The lock holder will instruct the kernel to pass back the lent scheduling context to the applicant when leaving the critical section.
We had to extend the NOVA syscall interface to express that a thread wants to pass its current scheduling context explicitly to another thread if and only if both threads belong to the same process and CPU. On reschedule, the context implicitly returns to the lending thread. Additionally, a thread may request an explicit reschedule in order to return a lent scheduling context obtained from another thread.
The current solution enables Genode to make use of NOVA's static priorities.
Another, unrelated NOVA extension is the ability of a thread to yield the CPU. The scheduling context gets enqueued at the end of the run queue without refreshing the remaining budget.
Build system and tools
Build system
Sometimes software requires custom tools that are used to generate source code or other ingredients for the build process, for example IDL compilers. Such tools won't be executed on top of Genode but on the host platform during the build process. Hence, they must be compiled with the tool chain installed on the host, not the Genode tool chain. The Genode build system received new support for building such host tools as a side effect of building a library or a target.
Even though it is possible to add the tool compilation step to a regular build description file, it is recommended to introduce a dedicated pseudo library for building such tools. This way, the rules for building host tools are kept separate from rules that refer to Genode programs. By convention, the pseudo library should be named <package>_host_tools and the host tools should be built at <build-dir>/tool/<package>/. With <package>, we refer to the name of the software package the tool belongs to, e.g., qt5 or mupdf. To build a tool named <tool>, the pseudo library contains a custom make rule like the following:
$(BUILD_BASE_DIR)/tool/<package>/<tool>:
	$(MSG_BUILD)$(notdir $@)
	$(VERBOSE)mkdir -p $(dir $@)
	$(VERBOSE)...build commands...
To let the build system trigger the rule, add the custom target to the HOST_TOOLS variable:
HOST_TOOLS += $(BUILD_BASE_DIR)/tool/<package>/<tool>
Once the pseudo library for building the host tools is in place, it can be referenced by each target or library that relies on the respective tools via the LIBS declaration. The tool can then be invoked by referring to $(BUILD_BASE_DIR)/tool/<package>/<tool>.
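As a hedged sketch, a library-description file that relies on such a tool might contain rules along the following lines (all names are hypothetical):

# lib/mk/foo.mk (hypothetical)
LIBS += foo_host_tools

FOO_GEN := $(BUILD_BASE_DIR)/tool/foo/gen_tables

tables.c: tables.def
	$(VERBOSE)$(FOO_GEN) $< > $@

SRC_C += tables.c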
For an example of using custom host tools, please refer to the mupdf package found within the libports repository. During the build of the mupdf library, two custom tools fontdump and cmapdump are invoked. The tools are built via the lib/mk/mupdf_host_tools.mk library-description file. The actual mupdf library (lib/mk/mupdf.mk) has the pseudo library mupdf_host_tools listed in its LIBS declaration and refers to the tools relative to $(BUILD_BASE_DIR).
Rump-kernel tools
During our work on porting the cryptographic-device driver to Genode, we identified the need for tools to process block-device and file-system images on our development machines. For this purpose, we added the rump-kernel-based tools, which are used for preparing and populating disk images as well as creating cgd(4)-based cryptographic disk devices.
The rump tool chain can be built (similarly to building GCC for Genode) by executing tool/tool_chain_rump build. Afterwards, the tools can be installed via tool/tool_chain_rump install to the default install location /usr/local/genode-rump. As mentioned in Section Block-level encryption using CGD, instead of using the tools directly, we added the wrapper shell script tool/rump.