Resource trading
As introduced in Section Component ownership, child components are created out of the resources of their respective parent components. This section describes the underlying mechanism. It first introduces the concept of PD sessions as resource accounts in Section Resource assignment. Section Trading memory between clients and servers explains how PD sessions are used to trade resources between components. The resource-trading mechanism ultimately allows servers to become resilient against client-driven resource-exhaustion attacks. However, such servers need to take special precautions that are explained in Section Component-local heap partitioning. Section Dynamic resource balancing presents a mechanism for the dynamic balancing of resources among cooperative components.
Resource assignment
In general, it is the operating system's job to manage the physical resources of the machine in a way that enables multiple applications to utilize them in a safe and efficient manner. The physical resources are foremost the physical memory, the processing time of the CPUs, and devices.
The traditional approach to resource management
Traditional operating systems usually provide abstractions of physical resources to applications running on top of the operating system. For example, instead of exposing the real interface of a device to an application, a Unix kernel provides a representation of the device as a pseudo file in the virtual file system. An application interacts with the device indirectly by operating on the respective pseudo file via a device-class-specific API (ioctl operations). As another example, a traditional OS kernel provides each application with an arbitrary amount of virtual memory, which may be much larger than the available physical memory. The application's virtual memory is backed with physical memory only once the application actually uses the memory. By pretending that memory is unlimited, the kernel relieves application developers from considering memory as a limited resource. On the other hand, this convenient abstraction creates problems that are extremely hard or even impossible for the OS kernel to solve.
- The amount of physical memory that is available for backing virtual memory is limited. Traditional OS kernels employ strategies to uphold the illusion of unlimited memory by swapping memory pages to disk. However, the swap space on disk is ultimately limited, too. At some point, when the physical resources are exhausted, the pretense of unlimited memory becomes a leaky abstraction and forces the kernel to make drastic decisions such as killing arbitrary processes to free up physical memory.
- Multiple applications, including critical applications as well as potentially misbehaving applications, share one pool of physical resources. In the presence of a misbehaving application that exhausts the physical memory, all applications are equally put at risk.
- By granting each application the legitimate ability to consume as much memory as the application desires, applications cannot be held accountable for their consumption of physical memory. The kernel cannot distinguish a misbehaving application from a well-behaved but memory-demanding one.
There are several approaches to alleviate those problems. For example, OS kernels that are optimized for resource utilization may employ heuristics that take the application behavior into account for parametrizing page-swapping strategies. Another example is the provisioning of a facility for pinned memory to applications. Such memory is guaranteed to be backed by physical memory. But such a facility bears the risk of allowing any application to exhaust physical memory directly. Hence, further heuristics are needed to limit the amount of pinned memory an application may use. Those countermeasures and heuristics, while making the OS kernel more complex, merely fight symptoms and cannot solve the actual problems caused by the lack of accounting. The behavior of such systems remains largely nondeterministic.
As a further consequence of the abstraction from physical resources, the kernel has to include the functionality that supports the abstraction. For example, for swapping memory pages to disk, the kernel has to depend on an in-kernel disk driver. For each application, whether or not it ever touches the disk, the in-kernel disk driver is part of its trusted computing base.
PD sessions and balances
Genode does not abstract from physical resources. Instead, it solely arbitrates the access to such resources and provides means to delegate the authority over resources between components. Low-level physical resources are represented as services provided by the core component at the root of the component tree. The core component is described in detail in Section Core - the root of the component tree. The following description focuses on memory as the most prominent low-level resource managed by the operating system. Processing time is subject to the kernel's scheduling policy whereas the management of the higher-level resources such as disk space is left to the respective servers that provide those resources.
Physical memory is handed out and accounted for by the PD service of core. The best way to describe the idea is to draw an analogy between the PD service and a bank. Each PD session corresponds to a bank account. Initially, when opening a new account, there is no balance. However, by having the authority over an existing bank account with a balance, one can transfer funds from the existing account to the new account. Naturally, such a transaction will decrease the balance of the originating account. Internally at the bank, the transfer does not involve any physical bank notes. The transaction is merely a change of balances of both bank accounts involved. A bank customer with the authority over a given bank account can use the value stored on the bank account to purchase physical goods while withdrawing the costs from the account. Such a withdrawal will naturally decrease the balance on the account. If the account is depleted, the bank denies the purchase attempt. Analogously to purchasing physical goods by withdrawing balances from a bank account, physical memory can be allocated from a PD session. The balance of the PD session is the PD session's quota. A piece of allocated physical memory is represented by a so-called dataspace (see Section Dataspaces for more details). A RAM dataspace is a container of physical memory that can be used for storing data.
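The bank analogy can be made concrete with a small toy model. The following Python sketch is not Genode code and does not mirror core's actual interface; the class PdSession, its methods, and the exception OutOfQuota are hypothetical names chosen for illustration. It only shows the accounting idea: an allocation withdraws from the balance, and an allocation that exceeds the balance is denied.

```python
class OutOfQuota(Exception):
    """The account is depleted: the 'bank' denies the purchase."""


class PdSession:
    """Toy model of a PD session as a bank account (not Genode's real API)."""

    def __init__(self):
        self.balance = 0          # a new account starts without any balance
        self.dataspaces = {}      # id -> size of each allocated RAM dataspace
        self._next_id = 0

    def alloc(self, size):
        """Withdraw 'size' bytes from the balance and hand out a dataspace id."""
        if size > self.balance:
            raise OutOfQuota(f"need {size} bytes, only {self.balance} left")
        self.balance -= size
        ds = self._next_id
        self._next_id += 1
        self.dataspaces[ds] = size
        return ds

    def free(self, ds):
        """Give a dataspace back, restoring the balance it was paid from."""
        self.balance += self.dataspaces.pop(ds)


pd = PdSession()
pd.balance = 8192                 # pretend that funds were transferred in
ds = pd.alloc(4096)
print(pd.balance)                 # 4096
try:
    pd.alloc(8192)                # more than the remaining balance
except OutOfQuota as e:
    print("denied:", e)
pd.free(ds)
print(pd.balance)                 # 8192
```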
Subdivision of budgets
Similar to a person with a bank account, each component of a Genode system has a session at core's PD service. At boot time, the core component creates an initial PD session with the balance set to the amount of available physical memory. This PD session is designated for the init component, which is the first and only child of core. On request by init, core delegates the capability for this initial PD session to the init component.
For each child component spawned by the init component, init creates a new PD session at core. Figure 1 exemplifies this step for one child. As the result of the session creation, init obtains the capability for the new PD session. Because it has the authority over both its own and the child's designated PD session, it can transfer a certain amount of RAM quota from its own account to the child's account by invoking its own PD-session capability and specifying the beneficiary's PD-session capability as argument. Core responds to the request by atomically adjusting the quotas of both PD sessions by the specified amount. In the case of init, the amount depends on init's configuration. Thereby, init explicitly splits its own RAM budget among its child components. Each child created by init can obtain the capability for its own PD session from init via the parent interface and thereby gains the authority over the memory budget that was assigned to it. Note, however, that no child has the authority over init's PD session or the PD sessions of any siblings. The mechanism for distributing a given budget among multiple children works recursively. The children of init can follow the same procedure to further subdivide their budgets for spawning grandchildren.
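The recursive subdivision of budgets can be sketched as follows. This is a minimal model under simplifying assumptions: PdSession, transfer_quota, and spawn are hypothetical Python names, the component labels and amounts are invented, and capabilities are represented by plain object references. The point is merely that every transfer moves quota between two accounts, so the sum over all accounts stays constant.

```python
class PdSession:
    """Hypothetical stand-in for a PD session; only the quota is modeled."""

    def __init__(self, label, quota=0):
        self.label = label
        self.quota = quota

    def transfer_quota(self, beneficiary, amount):
        """Move 'amount' to 'beneficiary'; both balances change together."""
        if amount > self.quota:
            raise ValueError(f"{self.label}: insufficient quota")
        self.quota -= amount
        beneficiary.quota += amount


def spawn(parent_pd, label, amount):
    """Create a child account and fund it out of the parent's budget."""
    child_pd = PdSession(label)
    parent_pd.transfer_quota(child_pd, amount)
    return child_pd


MiB = 1024 * 1024
init = PdSession("init", 64 * MiB)              # initial budget created by core
gui = spawn(init, "gui", 16 * MiB)              # amounts would come from init's configuration
net = spawn(init, "net", 8 * MiB)
applet = spawn(gui, "gui -> applet", 4 * MiB)   # grandchild, funded by gui

for pd in (init, gui, net, applet):
    print(f"{pd.label}: {pd.quota // MiB} MiB")
```

In this run, init keeps 40 MiB, gui 12 MiB, net 8 MiB, and the applet 4 MiB, which still adds up to the 64 MiB that init started with.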
Protection against resource stealing
A parent that created a child subsystem out of its own memory resources expects to regain the spent resources when destroying the subsystem. For this reason, it must not be possible for a child to transfer funds to another branch of the component tree without the consent of the parent. Figure 2 illustrates an example scenario that violates this expectation. The client and server components conspire to steal memory from the child. The client was created by the child and received a portion of the child's memory budget. The client requested a session for a service that was eventually routed to the server. The client-server relationship allows the client to delegate capabilities to the server. Therefore, it is able to delegate its own PD session capability to the server. The server, now in possession of the client's and its own PD session capabilities, can transfer memory from the client's to its own PD session. After this transaction, the child has no way to regain its memory resources because it has no authority over the server's PD session.
To prevent such resource-stealing scenarios, Genode restricts the quota transfer between arbitrary PD sessions. Each PD session must have a reference PD session, which can be defined only once. Transfers are permitted only between a PD session and its reference PD session. When creating the PD session of a child component, the parent registers its own PD session as the child's reference PD session. This way, the parent becomes able to transfer budgets between its own and the child's PD session.
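The effect of the reference-account rule on the stealing scenario can be illustrated with a toy model. The sketch below is written in Python with hypothetical names (PdSession, ref_account, transfer_quota, TransferDenied) and invented amounts; it is an assumption-laden illustration of the rule stated above, not core's implementation.

```python
class TransferDenied(Exception):
    pass


class PdSession:
    """Toy account with a reference account that can be set only once."""

    def __init__(self, label, quota=0):
        self.label = label
        self.quota = quota
        self.ref = None

    def ref_account(self, ref):
        if self.ref is not None:
            raise TransferDenied(f"{self.label}: reference account already defined")
        self.ref = ref

    def transfer_quota(self, to, amount):
        # Permitted only between a session and its reference session.
        if self.ref is not to and to.ref is not self:
            raise TransferDenied(f"{self.label} -> {to.label}: unrelated accounts")
        if amount > self.quota:
            raise TransferDenied(f"{self.label}: insufficient quota")
        self.quota -= amount
        to.quota += amount


parent = PdSession("parent", 100)
child = PdSession("child")
child.ref_account(parent)
server = PdSession("server")
server.ref_account(parent)
client = PdSession("client")
client.ref_account(child)        # the client was created by the child

parent.transfer_quota(child, 40)
parent.transfer_quota(server, 10)
child.transfer_quota(client, 20)

try:
    client.transfer_quota(server, 20)    # the attempted theft
except TransferDenied as e:
    print("denied:", e)

client.transfer_quota(child, 20)         # returning quota to the creator works
print(child.quota)                       # 40
```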
PD session destruction
When a PD session is closed, core destroys all dataspaces that were allocated from the PD session and transfers the PD session's final budget to the corresponding reference PD session.
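As a rough illustration of this rule, the following Python sketch models the closing of a PD session. The names PdSession, fund, alloc, and close are hypothetical, and the model ignores everything but the accounting. It shows that the parent regains the complete budget it once assigned, regardless of how much of it the child had turned into dataspaces.

```python
class PdSession:
    """Toy PD session: unallocated quota plus the dataspaces bought from it."""

    def __init__(self, ref=None, quota=0):
        self.ref = ref            # reference PD session (None for the root)
        self.quota = quota        # unallocated budget
        self.dataspaces = []      # sizes of RAM dataspaces allocated from here

    def fund(self, child, amount):
        """Transfer quota to a session that uses this one as reference."""
        if child.ref is not self or amount > self.quota:
            raise ValueError("transfer not permitted")
        self.quota -= amount
        child.quota += amount

    def alloc(self, size):
        if size > self.quota:
            raise ValueError("out of quota")
        self.quota -= size
        self.dataspaces.append(size)


def close(pd):
    """Model of what happens when a PD session is closed."""
    pd.quota += sum(pd.dataspaces)   # destroying the dataspaces frees their memory
    pd.dataspaces.clear()
    pd.ref.quota += pd.quota         # the final budget goes to the reference session
    pd.quota = 0


parent = PdSession(quota=1000)
child = PdSession(ref=parent)
parent.fund(child, 300)
child.alloc(100)
child.alloc(150)
print(parent.quota, child.quota)     # 700 50
close(child)
print(parent.quota)                  # 1000
```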
Trading memory between clients and servers
An initial assignment of memory to a child is not always practical because the memory demand of a given component may be unknown at its construction time. For example, the memory needed by a GUI server over its lifetime is not known a priori but depends on the number of its clients, the number of windows on screen, or the number of pixels that must be held at the server. In many cases, the memory usage of a server depends on the behavior of its clients. In traditional operating systems, a system service like a GUI server would allocate memory on behalf of its clients. Even though the allocation was induced by a client, the server performs the allocation. The OS kernel remains unaware of the fact that the server solely needs the allocated memory for serving its client. In the presence of a misbehaving client that issues an unbounded number of requests to the server where each request triggers a server-side allocation (for example the creation of a new window), the kernel will observe the server as a resource hog. Under resource pressure, it will likely select the server to be punished. Each server that performs allocations on behalf of its clients is prone to this kind of attack. Genode solves this problem by letting clients pay for server-side allocations. Client and server may be arbitrary nodes in the component tree.
Session quotas
As described in the previous section, at the creation time of a child, the parent assigns a part of its own memory quota to the new child. Since the parent retains the PD-session capabilities of all its children, it can issue further quota transfers back and forth between the children's PD sessions and its own PD session, which represents the reference account for all children. When a child requests a session at the parent interface, it can attach a fraction of its quota to the new session by specifying an amount of memory to be donated to the server as a session argument. This amount is called session quota. The session quota can be used by the server during the lifetime of the session. It is returned to the client when the session is closed.
When receiving a session request, the parent has to distinguish three different cases depending on its session-routing decision as described in Section Services and sessions.
- Parent provides the service: If the parent provides the requested service by itself, it first checks whether the session quota meets its need for providing the service. If so, it transfers the session quota from the requesting child's PD session to its own PD session. This step may fail if the child offered a session quota larger than the available quota in the child's PD session.
- Server is another child: If the parent decides to route the session request to another child, it transfers the session quota from the client's PD session to the server's PD session. Because the two PD sessions are not directly related to each other (both have the parent's PD session as reference account), this transfer from the client to the server consists of two steps. First, the parent transfers the session quota to its own PD session. If this step succeeded, it transfers the session quota from its own PD session to the server's PD session. The parent keeps track of the session quota for each session so that the quota transfers can be reverted later when closing the session. Only after the transfer of the session quota to the server's PD session has succeeded does the parent issue the actual session request at the server's root interface along with the information about the transferred session quota.
- Forward to grandparent: The parent may decide to forward the session request to its own parent. In this case, the parent requests a session on behalf of its child. The grandparent neither knows nor cares about the actual origin of the request and will simply decrease the memory quota of the parent. For this reason, the parent transfers the session quota from the requesting child to its own PD session before issuing the session request at the grandparent.
Quota transfers may fail if there is not enough budget on the originating account. In this case, the parent aborts the session creation and reflects the lack of resources as an error to the originator of the session request.
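The case where the server is another child can be sketched as a two-step transfer through the parent's account. The Python model below is a simplified illustration; Component, Parent, transfer, session_request, and InsufficientQuota are hypothetical names, and the server's root interface is reduced to a list that records the announced session quotas.

```python
class InsufficientQuota(Exception):
    pass


class Component:
    def __init__(self, name, quota=0):
        self.name = name
        self.quota = quota      # balance of the component's PD session
        self.sessions = []      # session quotas announced to a server


def transfer(src, dst, amount):
    if amount > src.quota:
        raise InsufficientQuota(f"{src.name} cannot provide {amount}")
    src.quota -= amount
    dst.quota += amount


class Parent(Component):
    def __init__(self, quota):
        super().__init__("parent", quota)
        self.ledger = []        # (client, server, quota), kept for reverting on close

    def session_request(self, client, server, session_quota):
        transfer(client, self, session_quota)    # step 1: client -> parent
        transfer(self, server, session_quota)    # step 2: parent -> server
        self.ledger.append((client, server, session_quota))
        server.sessions.append(session_quota)    # stands in for the root-interface call


parent = Parent(10)
client = Component("client", 100)
server = Component("server", 50)

parent.session_request(client, server, 30)
print(client.quota, parent.quota, server.quota)   # 70 10 80

try:
    parent.session_request(client, server, 500)   # more than the client owns
except InsufficientQuota as e:
    print("session creation aborted:", e)
```

Because the first step fails before any balance is touched, the aborted request leaves all three accounts unchanged and the server never sees it.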
This procedure works recursively. Once the server receives the session request along with the information about the provided session quota, it can use this information to decide whether or not to provide the session under these resource conditions. It can also use the information to tailor the quality of the service according to the provided session quota. For example, a larger session quota might enable the server to use larger caches or communication buffers for the client's session.
Session upgrades
During the lifetime of a session, the initial session quota may turn out to be too scarce. Usually, the server returns such a scarcity condition as an error of operations that imply server-side allocations. The client may handle such a condition by upgrading the session quota of an existing session by issuing an upgrade request to its parent along with the targeted session capability and the additional session quota. The upgrade works analogously to the session creation. The server will receive the information about the upgrade via the root interface of the service.
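A client-side pattern that follows from this description is to retry a failed operation after upgrading the session. The sketch below models that loop in Python. GuiSession, Client, OutOfRam, the fixed per-window cost, and the upgrade step size are all assumptions made for the example; the parent's role in forwarding the upgrade is collapsed into a direct method call.

```python
class OutOfRam(Exception):
    pass


class GuiSession:
    """Server-side state of one session (hypothetical)."""
    WINDOW_COST = 4096

    def __init__(self, quota):
        self.quota = quota
        self.windows = 0

    def create_window(self):
        if self.quota < self.WINDOW_COST:
            raise OutOfRam("session quota exhausted")
        self.quota -= self.WINDOW_COST
        self.windows += 1

    def upgrade(self, amount):
        """Would reach the server via the root interface of the service."""
        self.quota += amount


class Client:
    def __init__(self, quota, session):
        self.quota = quota
        self.session = session

    def upgrade_session(self, amount):
        """Stand-in for the upgrade request issued to the parent."""
        if amount > self.quota:
            raise OutOfRam("client itself is out of quota")
        self.quota -= amount
        self.session.upgrade(amount)

    def create_window(self):
        while True:
            try:
                return self.session.create_window()
            except OutOfRam:
                # Donate more quota, then retry the failed operation.
                self.upgrade_session(GuiSession.WINDOW_COST)


session = GuiSession(quota=4096)
client = Client(quota=8192, session=session)
for _ in range(3):
    client.create_window()
print(session.windows, client.quota)   # 3 0
```

A fourth call to client.create_window() would raise OutOfRam because the client has no quota left to donate, so the cost of the windows stays with the client rather than the server.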
Closing sessions
If a child issues a session-close request to its parent, the parent determines the corresponding server, which, depending on the route of the original session request, may be locally implemented, provided by another child, or provided by the grandparent. Once the server receives the session-close request, it is responsible for releasing all resources that were allocated from the session quota. The release of resources should revert all allocations the server has performed on behalf of its client. Stressing the analogy with the bank account, the server has to sell the physical goods (i.e., RAM dataspaces) it purchased from the client's session quota to restore the balance on its PD session. After the server has reverted all session-specific allocations, the server's PD session is expected to have at least as much available budget as the session quota of the to-be-closed session. As a result, the session quota can be transferred back to the client.
However, a misbehaving server may fail to release those resources by malice or because of a bug. For example, the server may be unable to free a dataspace because it mistakenly used the dataspace for another client's data. Another example would be a memory leak in the server. Such misbehavior is detected on the attempt to withdraw the session quota from the server's PD session. If the server's available RAM quota after closing a session remains lower than the session quota, the server has apparently embezzled memory. If the misbehaving service was locally provided by the parent, the parent has the full authority to not hand back the session quota to its child. If the misbehaving service was provided by the grandparent, the parent (and its whole subsystem) has to accept the loss. If, however, the server was provided by another child and the child refuses to release resources, the parent's attempt to withdraw the session quota from the server's PD session will fail. It is up to the policy of the parent to handle such a failure either by punishing the server (e.g., killing the component) or by granting more of its own quota. Generally, misbehavior is against the server's own interests. A server's best interest is to obey the parent's close request to avoid intervention.
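How the withdrawal attempt exposes a server that fails to release resources can be shown with a small model. The Python sketch below uses hypothetical names (Server, close_session) and invented numbers. The parent's policy decision after a failed withdrawal is left out, and the function merely reports whether the session quota could be handed back.

```python
class Server:
    def __init__(self, quota, leaky=False):
        self.avail = quota       # unallocated budget of the server's PD session
        self.allocated = {}      # session id -> bytes allocated on its behalf
        self.leaky = leaky       # a leaky server never releases session memory

    def serve(self, sid, size):
        if size > self.avail:
            raise MemoryError("server out of quota")
        self.avail -= size
        self.allocated[sid] = self.allocated.get(sid, 0) + size

    def close(self, sid):
        if not self.leaky:
            self.avail += self.allocated.pop(sid, 0)


def close_session(server, sid, session_quota, client):
    """Parent-side handling of a close request; returns True on success."""
    server.close(sid)
    if server.avail < session_quota:
        return False             # withdrawal fails, the parent's policy takes over
    server.avail -= session_quota
    client["quota"] += session_quota
    return True


def run(leaky):
    client = {"quota": 0}
    server = Server(quota=100 + 500, leaky=leaky)   # own budget plus session quota
    server.serve("s1", 400)                         # allocation on the client's behalf
    ok = close_session(server, "s1", 500, client)
    return ok, client["quota"], server.avail


print(run(leaky=False))   # (True, 500, 100)
print(run(leaky=True))    # (False, 0, 200)
```

Note that a server with a large budget of its own could pass this check despite a leak, because the parent only observes the available quota. In that case, the leak is paid for out of the server's own budget.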
Component-local heap partitioning
Components that perform memory allocations on behalf of untrusted parties must take special precautions for the component-local memory management. There are two prominent examples of such components. First, as discussed in Section Trading memory between clients and servers, a server may be used by multiple clients that must not interfere with each other. Therefore, server-side memory allocations on behalf of a particular client must strictly be accounted to the client's session quota. Second, a parent with multiple children may need to allocate memory to perform the bookkeeping for the individual children, for example, maintaining the information about their open sessions and their session quotas. The parent should account those child-specific allocations to the respective children. In both cases, it is not sufficient to merely keep track of the amount of memory consumed on behalf of each untrusted party. The actual allocations must be performed on independent backing stores.
Figure 3 shows a scenario where a server performs anonymous memory allocations on behalf of two sessions. The memory is allocated from the server's heap. Whereas allocations from the heap are of byte granularity, the heap's backing store consists of several dataspaces. Those dataspaces are allocated from the server's PD session as needed but at a much larger granularity. As depicted in the figure, allocations from both sessions end up in the same dataspaces. This becomes a problem once one session is closed. As described in the previous section, the server's parent expects the server to release all resources that were allocated from the corresponding session quota. However, even if the server reverts all heap allocations that belong to the to-be-closed session, the server could still not release the underlying backing store because all dataspaces are still occupied with memory objects of another session. Therefore, the server becomes unable to comply with the parent's expectation.
The solution to this problem is illustrated in Figure 4. For each session, the server maintains a separate heap partition. Each memory allocation on behalf of a client is performed from the session-specific heap partition rather than from a global heap. This way, memory objects of different sessions populate disjoint dataspaces. When closing a session, the server reverts all memory allocations from the session's heap. After freeing the session's memory objects, the heap partition becomes empty and can be destroyed. By destroying the heap partition, the underlying dataspaces that were used as the backing store can be properly released.
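The difference between a shared heap and per-session heap partitions can be reproduced with a toy allocator. The following Python sketch assumes a simple bump allocator over fixed-size chunks, where each chunk stands for one backing-store dataspace. The class Heap, the chunk size, and the allocation pattern are assumptions made for illustration and do not reflect Genode's heap implementation.

```python
CHUNK = 4096   # assumed size of each backing-store dataspace


class Heap:
    """Bump allocator over fixed-size chunks; requests must not exceed CHUNK."""

    def __init__(self):
        self.fill = []   # bytes handed out per chunk
        self.live = []   # number of live objects per chunk

    def alloc(self, size):
        if not self.fill or self.fill[-1] + size > CHUNK:
            self.fill.append(0)      # obtain a new dataspace as backing store
            self.live.append(0)
        self.fill[-1] += size
        self.live[-1] += 1
        return len(self.fill) - 1    # chunk index tells where the object lives

    def free(self, chunk):
        self.live[chunk] -= 1

    def releasable(self):
        """Number of chunks that no longer hold any live object."""
        return sum(1 for n in self.live if n == 0)


# One heap shared by sessions A and B, with interleaved allocations
shared = Heap()
objs = {"A": [], "B": []}
for _ in range(4):
    for s in ("A", "B"):
        objs[s].append(shared.alloc(1024))
for chunk in objs["A"]:              # session A gets closed
    shared.free(chunk)
print(shared.releasable(), "of", len(shared.fill))            # 0 of 2

# One heap partition per session, same allocation pattern
heaps = {"A": Heap(), "B": Heap()}
objs = {"A": [], "B": []}
for _ in range(4):
    for s in ("A", "B"):
        objs[s].append(heaps[s].alloc(1024))
for chunk in objs["A"]:
    heaps["A"].free(chunk)
print(heaps["A"].releasable(), "of", len(heaps["A"].fill))    # 1 of 1
```

With the shared heap, every chunk still contains objects of session B after session A is closed, so nothing can be returned. With partitions, the complete backing store of session A becomes releasable.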
Dynamic resource balancing
As described in Section Resource assignment, parent components explicitly assign physical resource budgets to their children. Once assigned, the budget is at the disposal of the respective child subsystem until the subsystem gets destroyed by the parent.
However, not all components have well-defined resource demands. For example, a block cache should utilize as much memory as possible unless the memory is needed by another component. The assignment of a fixed amount of memory to such a block cache cannot accommodate changes of workloads over the potentially long lifetime of the component. If dimensioned too small, there may be a lot of slack memory remaining unutilized. If dimensioned too large, the block cache would prevent other and possibly more important components from using the memory. A better alternative is to enable a component to adapt its resource use to the resource constraints of its parent. The parent interface supports this alternative with a protocol for the dynamic balancing of resources.
The resource-balancing protocol uses a combination of synchronous remote procedure calls and asynchronous notifications. Both mechanisms are described in Section Inter-component communication. The child uses remote procedure calls to talk to its parent whereas the parent uses asynchronous notifications to signal state changes to the child. The protocol consists of two parts, which are complementary.
Resource requests
By issuing a resource request to its parent, a child applies for an upgrade of its resources. The request takes the amount of desired resources as argument. A child would issue such a request if it detects a scarcity of resources. A resource request returns immediately regardless of whether additional resources have been granted or not. The child may proceed working under the low resource conditions or it may block and wait for a resource-available signal from its parent. The parent may respond to this request in different ways. It may just ignore the request, possibly stalling the child. Alternatively, it may immediately transfer additional quota to the child's PD session. Or it may take further actions to free up resources to accommodate the child. Those actions may involve long-running operations such as the destruction of subsystems or the further propagation of resource requests towards the root of the component tree. Once the parent has freed up enough resources to accommodate the child's request, it transfers the new resources to the child's PD session and notifies the child by sending a resource-available signal.
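The interplay of the immediately returning request and the later resource-available signal can be sketched as follows. This Python model is a simplification under stated assumptions: Parent, Child, resource_request, and balance are hypothetical names, the asynchronous notification is modeled as a queue on the child, and the parent's policy is reduced to "grant pending requests in order as soon as the own budget suffices".

```python
from collections import deque


class Parent:
    def __init__(self, quota):
        self.quota = quota
        self.pending = deque()   # (child, amount) requests not yet served

    def resource_request(self, child, amount):
        """RPC from the child; returns immediately without granting anything."""
        self.pending.append((child, amount))

    def balance(self):
        """Serve pending requests as far as the parent's own budget allows."""
        while self.pending and self.pending[0][1] <= self.quota:
            child, amount = self.pending.popleft()
            self.quota -= amount
            child.quota += amount
            child.signals.append("resource-available")


class Child:
    def __init__(self, parent, quota):
        self.parent = parent
        self.quota = quota
        self.signals = deque()   # asynchronous notifications from the parent

    def request_more(self, amount):
        self.parent.resource_request(self, amount)


parent = Parent(quota=100)
child = Child(parent, quota=10)

child.request_more(300)
parent.balance()
print(child.quota, list(child.signals))   # 10 []

parent.quota += 250                       # e.g., after destroying another subsystem
parent.balance()
print(child.quota, list(child.signals))   # 310 ['resource-available']
```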
Yield requests
The second part of the protocol enables the parent to express its wish for regaining resources. The parent notifies the child about this condition by sending a yield signal to the child. On the reception of such a signal, the child picks up the so-called yield request at the parent using a remote procedure call. The yield request contains the amount of resources the parent wishes to regain. It is up to the child to comply with a yield request or not. Some subsystems have meaningful ways to respond to yield requests. For example, an in-memory block cache could write back the cached information and release the memory consumed by the cache. Once the child has succeeded in freeing up resources, it reports to the parent by issuing a so-called yield response via a remote procedure call to the parent. The parent may respond to a yield response by withdrawing resources from the child's PD session.
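The yield part of the protocol can be modeled in the same spirit. In the following Python sketch, Parent, BlockCache, and all method names are hypothetical, the yield signal is modeled as a direct method call rather than a truly asynchronous notification, and the child is assumed to be a cooperative block cache that releases as much cached memory as requested.

```python
class Parent:
    def __init__(self, quota):
        self.quota = quota
        self.wanted = 0

    def ask_to_yield(self, child, amount):
        self.wanted = amount
        child.on_yield_signal()          # models the yield signal

    def yield_request(self):
        """RPC: the child picks up the amount the parent wishes to regain."""
        return self.wanted

    def yield_response(self, child):
        """RPC: the child reports that it has freed up resources."""
        amount = min(self.wanted, child.avail)
        child.avail -= amount            # withdraw from the child's PD session
        self.quota += amount
        self.wanted = 0


class BlockCache:
    def __init__(self, parent, quota):
        self.parent = parent
        self.avail = quota               # unallocated budget
        self.cached = 0                  # memory currently used for cache entries

    def fill(self, size):
        if size > self.avail:
            raise MemoryError("cache out of quota")
        self.avail -= size
        self.cached += size

    def on_yield_signal(self):
        wanted = self.parent.yield_request()
        released = min(wanted, self.cached)   # write back and drop cache entries
        self.cached -= released
        self.avail += released
        self.parent.yield_response(self)


parent = Parent(quota=0)
cache = BlockCache(parent, quota=1000)
cache.fill(900)
parent.ask_to_yield(cache, 500)
print(parent.quota, cache.cached, cache.avail)   # 500 400 100
```

A less cooperative child could simply never issue the yield response, in which case this model leaves the parent's quota unchanged, matching the statement that compliance is up to the child.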