One of the essential differences between EROS and L4 is their handling of mapping state. In light of Hermann's note:
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2003/001559.html
I am beginning to believe that the difference in mapping architectures may be the only really basic difference. It affects how we think about caching of mapping state, how we think about memory management algorithms, and so forth.
Before I start, I want to say "thank you" to Volkmar. In trying to understand the L4 model, I have gone through a progression of confusions. It is possible that I am still confused. The key clarification was Volkmar's comment that the current L4 map operation proceeds entirely on the native page tables in the x86 implementations.
I currently believe that *neither* the L4 mapping system nor the EROS mapping system is adequate today. I have been thinking lately about how to find a happy middle ground, and I provisionally believe that I may have one. That will be the subject of my next note.
So:
The key differences between the L4 mapping state and the EROS-NG (next generation) mapping state may be described as follows:
+ In L4, the only important state is the state recorded in the mapping database. This state is a cache, and applications are required to be able to reconstruct their own mapping state on demand.
Mappings are not named by any sort of name (local or global) that can be directly referenced by any application. Because they cannot be named, mappings are not first class objects -- there is no contract with any application about the kernel's logical mapping structure.
Page fault handling is done at the granularity of a 'task'.
+ In EROS-NG, address spaces are expressed as guarded page tables (GPTs), and every GPT is named by a capability. GPTs are therefore first-class objects exported by the kernel. The kernel can "page" these objects in the same way that it can "page" data pages, but it is not free to discard them.
Page fault handlers can be injected at any arbitrary GPT.
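To make the GPT structure concrete, here is a minimal C sketch of what a guarded page table node might look like: a guard that must match the next address bits, an array of capability slots naming pages or further GPTs, and an optional injected fault handler. Every name and field layout here is my own invention for illustration, not EROS-NG's actual representation.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of a guarded page table (GPT) node: a first-class,
 * capability-named object whose slots hold capabilities to data pages or
 * to further GPTs.  Names and layout are illustrative only. */

#define GPT_SLOTS 16   /* assume 4 address bits resolved per level */

typedef enum { CAP_NONE, CAP_PAGE, CAP_GPT } cap_type_t;

typedef struct capability {
    cap_type_t type;
    void      *object;       /* page frame or child GPT */
} capability_t;

typedef struct gpt {
    uint64_t     guard;      /* address bits that must match the guard */
    unsigned     guard_bits; /* width of the guard in bits */
    capability_t slot[GPT_SLOTS];
    capability_t handler;    /* optional fault handler injected here */
} gpt_t;

/* Resolve one translation step: check the guard against the top address
 * bits, then consume 4 more bits to index a slot.  Returns NULL on a
 * guard mismatch (a fault delivered to the nearest handler). */
capability_t *gpt_step(gpt_t *g, uint64_t *addr, unsigned *bits_left)
{
    uint64_t guard_mask = (1ULL << g->guard_bits) - 1;
    uint64_t top = (*addr >> (*bits_left - g->guard_bits)) & guard_mask;
    if (top != g->guard)
        return NULL;                     /* guard mismatch: fault */
    *bits_left -= g->guard_bits;
    unsigned idx = (*addr >> (*bits_left - 4)) & (GPT_SLOTS - 1);
    *bits_left -= 4;
    return &g->slot[idx];
}
```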
There seem to be advantages and disadvantages to each design.
1. Cost of mapping:
I believe the correct way to measure the cost of a map operation is to include all of the costs necessary to actually get a valid PTE into the recipient address space.
* L4 Map
The dominant cost in the L4 map operation is the cost to build the necessary mapping DB entries in the kernel. PTEs are copied aggressively, so there are usually no further hardware page faults needed to load them when the recipient starts running.
In addition to the kernel-level map operation, the recipient task must record the incoming mapping in some per-task database. This database essentially duplicates the state in the kernel, though my guess is that it can be accomplished via a region-based recording scheme in the usual cases (that is: I record that X bytes from File Y got mapped starting at address A, and that faults in this region should be recovered by making a request to thread-id T). It is not clear to me what the practical overhead of this additional tracking is.
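The region-based recording scheme I am guessing at could look something like the following C sketch: the recipient records "X bytes of file Y mapped at A, faults go to thread T", and on a fault looks up the region to find the pager to re-ask. The names and the fixed-size table are assumptions for illustration, not any actual L4 user-level library.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-task region table duplicating the kernel's mapping
 * state, as conjectured in the text. */
typedef struct region {
    uintptr_t start;      /* address A where the region is mapped */
    size_t    len;        /* X bytes */
    int       file_id;    /* file Y backing this region */
    int       pager_tid;  /* thread-id T to re-request the mapping from */
} region_t;

#define MAX_REGIONS 32
static region_t regions[MAX_REGIONS];
static int n_regions;

void record_mapping(uintptr_t a, size_t x, int file_y, int tid)
{
    regions[n_regions++] = (region_t){ a, x, file_y, tid };
}

/* On a fault at 'addr', find the pager thread to contact, or -1 if the
 * fault falls in no recorded region. */
int pager_for_fault(uintptr_t addr)
{
    for (int i = 0; i < n_regions; i++)
        if (addr >= regions[i].start &&
            addr <  regions[i].start + regions[i].len)
            return regions[i].pager_tid;
    return -1;
}
```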
Unless something has changed in a way that I have failed to understand, L4 does not provide a means to share page tables across more than one task.
* EROS-NG map (capability transfer)
The dominant cost in the EROS-NG design is the cost of the page faults. PTEs are NOT transferred eagerly, and the recipient takes a fast path per-page fault to validate each mapping when it is first encountered. These mapping validation faults can be "batched" for performance, but the current EROS implementation does not do so, and there does not appear to be any compelling performance-motivated reason to *want* to do so. The kernel uses a variety of tricks to make these traversals more efficient on architectures that implement hierarchical page tables.
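The lazy path can be illustrated with a toy C sketch (all names assumed, nothing from the actual EROS kernel): a map is a pure capability transfer, and the PTE is only built on the first access, after which subsequent accesses hit the cached translation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of lazy PTE validation: capability slots record what MAY be
 * mapped; "hardware" PTEs record what IS mapped.  Names are invented. */
#define N_PAGES 16

static void *space_caps[N_PAGES];  /* capability slots for this space */
static void *ptes[N_PAGES];        /* software stand-in for the PTEs  */
static int   fault_count;          /* validation faults taken so far  */

/* Map = capability transfer only; no PTE is built eagerly. */
void map_page(int vpn, void *frame) { space_caps[vpn] = frame; }

/* Access path: hit the cached PTE, or take one validation fault that
 * walks the capability state and installs the PTE. */
void *access_page(int vpn)
{
    if (ptes[vpn] == NULL) {          /* first touch: validation fault */
        fault_count++;
        ptes[vpn] = space_caps[vpn];  /* walk caps, install the PTE */
    }
    return ptes[vpn];                 /* later touches: no fault */
}
```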
In the current EROS system, there are several steps required on the part of both sender and receiver to extract the relevant capability for transmission and to insert it into the desired location in the recipient space. In EROS-NG these steps are accomplished in the kernel, though they may result in what I might call "GPT faults" -- I'll try to explain this in my next mail.
For some architectures, it is an important point that the EROS strategy guarantees page table sharing whenever doing so is (a) possible and (b) correct -- even across threads that do not share an address space. In our experience, page table sharing is very important to overall system performance. This sort of sharing is especially important to provide efficient support for shared libraries.
2. Cost of Unmap:
The costs of unmap in the two systems seem comparable if the implementation is done with care. Because EROS-NG provides explicit names for the GPTs, and because it is possible to build "alias" GPTs, the EROS unmap operation can be used more selectively than the L4 unmap operation. This may not be a critical issue.
3. Encapsulation
If I understand matters correctly, an L4 page fault is always reported first to the client's per-task page fault handler. In EROS, page fault handlers can be associated with arbitrary regions of the address space, and faults delivered to these handlers are usually invisible to the client (except for latency). Mach, just for comparison, does per-region page fault handlers. They are not structured like those of EROS, but like EROS they allow specification of address fault handlers on a region by region basis.
When I first encountered KeyKOS, I didn't see why this distinction mattered. Here is an example that may help clarify whether or not this distinction is important:
Imagine that client C maps a file F into the address space of C. Portions of the file data may not be in memory, and may need to be demand paged.
Note that C cannot have enough information to accomplish this demand paging action. At *best*, C has enough information to ask the file server to provide the missing mapping. Because C is not in a position to actually solve the problem, I would argue that the kernel has misdirected the fault. The L4 design therefore appears to require an extra IPC operation relative to the corresponding EROS design.
From an abstraction perspective, there is a second objection: the demand paging of this object is really none of C's business, but C is given full knowledge of all of the faults. This is an encapsulation failure. In fact, the design places a burden of mapping work on C that appears to me to be unnecessary.
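The IPC counting argument can be made explicit with a toy sketch. The exact message counts below are my simplifying assumption about the two protocols; the point is only the relative comparison, not the precise numbers.

```c
#include <assert.h>

/* Assumed L4-style path: the fault goes first to C's per-task pager,
 * which must forward the request to the file server. */
int l4_style_fault_ipcs(void)
{
    int ipcs = 0;
    ipcs++;   /* kernel -> C's per-task pager (fault message)   */
    ipcs++;   /* C's pager -> file server (forwarded request)   */
    ipcs++;   /* file server -> C's pager (mapping reply)       */
    ipcs++;   /* C's pager -> kernel/C (resume the client)      */
    return ipcs;
}

/* Assumed EROS-style path: the kernel delivers the fault directly to
 * the handler associated with the faulting region (the file server). */
int eros_style_fault_ipcs(void)
{
    int ipcs = 0;
    ipcs++;   /* kernel -> region handler (fault message)       */
    ipcs++;   /* handler -> kernel/C (mapping installed, resume) */
    return ipcs;
}
```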
The one argument I see that might make this design desirable is the argument that the L4 kernel is simplified as a result. While I certainly believe that this is true, I think that the absence of first-class memory objects is a potential weakness.
4. Space
The EROS-NG design is more space intensive in the kernel, because the kernel must maintain two representations of the mapping structure and a correlation structure:
1. The GPTs
2. The page tables
3. The depend table (for dependency tracking)
The hardware page tables and the depend tables can be discarded at any time, and the GPTs are pageable.
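A hypothetical sketch of what such a depend table might look like: it correlates each GPT slot with the hardware PTEs that were built by translating through that slot, so that when the slot changes the stale PTEs can be found and zapped. The page tables and depend entries are discardable caches; the GPT is not. Layout and names below are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One depend entry: "this PTE was derived from this GPT slot". */
typedef struct depend_entry {
    void  *gpt_slot;   /* key: address of the GPT slot */
    void **pte;        /* value: PTE built through that slot */
} depend_entry_t;

#define MAX_DEPENDS 64
static depend_entry_t depends[MAX_DEPENDS];
static int n_depends;

/* Record a dependency when a PTE is built from a GPT slot. */
void depend_add(void *slot, void **pte)
{
    depends[n_depends++] = (depend_entry_t){ slot, pte };
}

/* When 'slot' is modified, invalidate every PTE derived from it.
 * Returns the number of PTEs zapped. */
int depend_invalidate(void *slot)
{
    int zapped = 0;
    for (int i = 0; i < n_depends; i++)
        if (depends[i].gpt_slot == slot) {
            *depends[i].pte = NULL;   /* zap the cached translation */
            zapped++;
        }
    return zapped;
}
```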
I do not believe that requiring more kernel space is inherently bad -- the issue is the amount of total pageable space at user and kernel levels combined. We have no evidence from the EROS work that total kernel virtual addressable space is at risk from this particular increase.
The EROS kernel translation strategy is more complex than the corresponding L4 strategy. I do not believe that this is necessarily bad. More precisely, I believe that *all* of the complexities of mapping need to be examined together; not just the complexities that are incurred in the kernel. It is reasonable and desirable to push function out of the kernel when the cost of doing so is not excessive.
Random Comments:
It appears to me that there are peculiar boundary conditions implicit in the L4 design: if the very last page of a process is paged out, I am not sure how its pager thread runs, because I do not understand what memory that pager thread references in order to fetch instructions or store temporary data. I suspect that the solution is that the pager thread in turn specifies a pager thread (which I will call the meta-pager), and the meta-pager arranges to page enough state back in that the pager thread can make progress.
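The meta-pager arrangement I am conjecturing can be sketched as a chain walk: if a pager is itself paged out, its own pager (the meta-pager) must run first to bring it back, and the chain must bottom out at a pager that is always resident (e.g. pinned). The structure below is my own illustration, not anything from an actual L4 implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Each pager may itself be paged out; its 'pager' field names the
 * meta-pager responsible for bringing it back in. */
typedef struct pager {
    bool          resident;  /* is this pager's memory paged in?    */
    struct pager *pager;     /* who pages THIS pager in (meta-pager) */
} pager_t;

/* Make 'p' runnable; returns the number of page-in steps needed.
 * The chain must end in an always-resident (pinned) pager. */
int make_runnable(pager_t *p)
{
    if (p->resident)
        return 0;
    int steps = make_runnable(p->pager);  /* meta-pager runs first... */
    p->resident = true;                   /* ...then pages us back in */
    return steps + 1;
}
```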
Probably this all seems perfectly obvious to people who are familiar with L4. I find it confusing that every process must have two threads or must delegate mapping management to a third-party task.
shap
Just a few clarifications, at least I hope I clarify things ;)
On Mon, 7 Dec 2003, Jonathan S. Shapiro wrote:
> So:
> The key differences between the L4 mapping state and the EROS-NG (next generation) mapping state may be described as follows:
> - In L4, the only important state is the state recorded in the mapping database. This state is a cache, and applications are required to be able to reconstruct their own mapping state on demand.
The mapping database is a cache only in the sense that at any time a higher-level pager can revoke a mapping from an application or from lower-level pagers. In the current model the kernel 'never' throws away mappings.
> Page fault handlers can be injected at any arbitrary GPT.
> There seem to be advantages and disadvantages to each design.
> - Cost of mapping:
> I believe the correct way to measure the cost of a map operation is to include all of the costs necessary to actually get a valid PTE into the recipient address space.
> - L4 Map
> The dominant cost in the L4 map operation is the cost to build the necessary mapping DB entries in the kernel. PTEs are copied aggressively, so there are usually no further hardware page faults needed to load them when the recipient starts running.
> In addition to the kernel-level map operation, the recipient task must record the incoming mapping in some per-task database. This database essentially duplicates the state in the kernel, though my guess is that it can be accomplished via a region-based recording scheme in the usual cases (that is: I record that X bytes from File Y got mapped starting at address A, and that faults in this region should be recovered by making a request to thread-id T). It is not clear to me what the practical overhead of this additional tracking is.
The mapping database is not per-task. There is a single mapping database in the kernel, and it consists of a mapping tree per physical frame of memory. In fact, in some of the implementations the mapping database nodes are all stored in page-table leaf nodes.
> Unless something has changed in a way that I have failed to understand, L4 does not provide a means to share page tables across more than one task.
L4 does not explicitly export any concept of sharing page table subtrees, but you can build this knowledge from the map/grant/unmap primitives. There have been proposals floated to make sharing explicit (i.e. the Link operation) and via mapping hints.
> Random Comments:
> It appears to me that there are peculiar boundary conditions implicit in the L4 design: if the very last page of a process is paged out, I am not sure how its pager thread runs, because I do not understand what memory that pager thread references in order to fetch instructions or store temporary data. I suspect that the solution is that the pager thread in turn specifies a pager thread (which I will call the meta-pager), and the meta-pager arranges to page enough state back in that the pager thread can make progress.
The pager thread need not be in the same address space as the thread it is paging. So, as you point out, if all threads in an address space (bar one) use an address-space-local thread as their pager, then you could specify an address-space-external thread as that remaining thread's pager.
> Probably this all seems perfectly obvious to people who are familiar with L4. I find it confusing that every process must have two threads or must delegate mapping management to a third-party task.
Pagers are not specified per address space. They are specified per thread. No third-party task would be required if, by some established protocol, the minimal set of pages an address-space-local pager needed were pinned in memory.
Cheers, Adam
--
Adam "WeirdArms" Wiggins        School of Computer Sci. & Eng.
PhD Student                     The University of NSW
Phone: +61 2 9385 7359          UNSW SYDNEY NSW 2052, Australia
Fax: +61 2 9385 7942            http://www.cse.unsw.edu.au/~awiggins