One of the essential differences between EROS and L4 is their handling of mapping state. In light of Hermann's note:
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2003/001559.html
I am beginning to believe that the difference in mapping architectures may be the only really basic difference. It affects how we think about caching of mapping state, how we think about memory management algorithms, and so forth.
Before I start, I want to say "thank you" to Volkmar. In trying to understand the L4 model, I have gone through a progression of confusions. It is possible that I am still confused. The key clarification was Volkmar's comment that the current L4 map operation proceeds entirely on the native page tables in the x86 implementations.
I currently believe that *neither* the L4 mapping system nor the EROS mapping system is adequate today. I have been thinking lately about how to find a happy middle ground, and I provisionally believe that I may have one. That will be the subject of my next note.
So:
The key differences between the L4 mapping state and the EROS-NG (next generation) mapping state may be described as follows:
+ In L4, the only important state is the state recorded in the mapping database. This state is a cache, and applications are required to be able to reconstruct their own mapping state on demand.
Mappings are not named by any sort of name (local or global) that can be directly referenced by any application. Because they cannot be named, mappings are not first class objects -- there is no contract with any application about the kernel's logical mapping structure.
Page fault handling is done at the granularity of a 'task'.
+ In EROS-NG, address spaces are expressed as guarded page tables (GPTs), and every GPT is named by a capability. GPTs are therefore first-class objects exported by the kernel. The kernel can "page" these objects in the same way that it can "page" data pages, but it is not free to discard them.
Page fault handlers can be injected at any arbitrary GPT.
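To make the GPT structure concrete, here is a minimal C sketch of what a guarded page table node might look like: a guard that must match the next address bits, an array of capability slots naming pages or further GPTs, and an optional injected fault handler. Every name and field layout here is my own invention for illustration, not EROS-NG's actual representation.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of a guarded page table (GPT) node: a first-class,
 * capability-named object whose slots hold capabilities to data pages or
 * to further GPTs.  Names and layout are illustrative only. */

#define GPT_SLOTS 16   /* assume 4 address bits resolved per level */

typedef enum { CAP_NONE, CAP_PAGE, CAP_GPT } cap_type_t;

typedef struct capability {
    cap_type_t type;
    void      *object;       /* page frame or child GPT */
} capability_t;

typedef struct gpt {
    uint64_t     guard;      /* address bits that must match the guard */
    unsigned     guard_bits; /* width of the guard in bits */
    capability_t slot[GPT_SLOTS];
    capability_t handler;    /* optional fault handler injected here */
} gpt_t;

/* Resolve one translation step: check the guard against the top address
 * bits, then consume 4 more bits to index a slot.  Returns NULL on a
 * guard mismatch (a fault delivered to the nearest handler). */
capability_t *gpt_step(gpt_t *g, uint64_t *addr, unsigned *bits_left)
{
    uint64_t guard_mask = (1ULL << g->guard_bits) - 1;
    uint64_t top = (*addr >> (*bits_left - g->guard_bits)) & guard_mask;
    if (top != g->guard)
        return NULL;                     /* guard mismatch: fault */
    *bits_left -= g->guard_bits;
    unsigned idx = (*addr >> (*bits_left - 4)) & (GPT_SLOTS - 1);
    *bits_left -= 4;
    return &g->slot[idx];
}
```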
There seem to be advantages and disadvantages to each design.
1. Cost of mapping:
I believe the correct way to measure the cost of a map operation is to include all of the costs necessary to actually get a valid PTE into the recipient address space.
* L4 Map
The dominant cost in the L4 map operation is the cost to build the necessary mapping DB entries in the kernel. PTEs are copied aggressively, so there are usually no further hardware page faults needed to load them when the recipient starts running.
In addition to the kernel-level map operation, the recipient task must record the incoming mapping in some per-task database. This database essentially duplicates the state in the kernel, though my guess is that it can be accomplished via a region-based recording scheme in the usual cases (that is: I record that X bytes from File Y got mapped starting at address A, and that faults in this region should be recovered by making a request to thread-id T). It is not clear to me what the practical overhead of this additional tracking is.
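The region-based recording scheme I am guessing at could look something like the following C sketch: the recipient records "X bytes of file Y mapped at A, faults go to thread T", and on a fault looks up the region to find the pager to re-ask. The names and the fixed-size table are assumptions for illustration, not any actual L4 user-level library.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-task region table duplicating the kernel's mapping
 * state, as conjectured in the text. */
typedef struct region {
    uintptr_t start;      /* address A where the region is mapped */
    size_t    len;        /* X bytes */
    int       file_id;    /* file Y backing this region */
    int       pager_tid;  /* thread-id T to re-request the mapping from */
} region_t;

#define MAX_REGIONS 32
static region_t regions[MAX_REGIONS];
static int n_regions;

void record_mapping(uintptr_t a, size_t x, int file_y, int tid)
{
    regions[n_regions++] = (region_t){ a, x, file_y, tid };
}

/* On a fault at 'addr', find the pager thread to contact, or -1 if the
 * fault falls in no recorded region. */
int pager_for_fault(uintptr_t addr)
{
    for (int i = 0; i < n_regions; i++)
        if (addr >= regions[i].start &&
            addr <  regions[i].start + regions[i].len)
            return regions[i].pager_tid;
    return -1;
}
```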
Unless something has changed in a way that I have failed to understand, L4 does not provide a means to share page tables across more than one task.
* EROS-NG map (capability transfer)
The dominant cost in the EROS-NG design is the cost of the page faults. PTEs are NOT transferred eagerly, and the recipient takes a fast path per-page fault to validate each mapping when it is first encountered. These mapping validation faults can be "batched" for performance, but the current EROS implementation does not do so, and there does not appear to be any compelling performance-motivated reason to *want* to do so. The kernel uses a variety of tricks to make these traversals more efficient on architectures that implement hierarchical page tables.
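The lazy path can be illustrated with a toy C sketch (all names assumed, nothing from the actual EROS kernel): a map is a pure capability transfer, and the PTE is only built on the first access, after which subsequent accesses hit the cached translation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of lazy PTE validation: capability slots record what MAY be
 * mapped; "hardware" PTEs record what IS mapped.  Names are invented. */
#define N_PAGES 16

static void *space_caps[N_PAGES];  /* capability slots for this space */
static void *ptes[N_PAGES];        /* software stand-in for the PTEs  */
static int   fault_count;          /* validation faults taken so far  */

/* Map = capability transfer only; no PTE is built eagerly. */
void map_page(int vpn, void *frame) { space_caps[vpn] = frame; }

/* Access path: hit the cached PTE, or take one validation fault that
 * walks the capability state and installs the PTE. */
void *access_page(int vpn)
{
    if (ptes[vpn] == NULL) {          /* first touch: validation fault */
        fault_count++;
        ptes[vpn] = space_caps[vpn];  /* walk caps, install the PTE */
    }
    return ptes[vpn];                 /* later touches: no fault */
}
```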
In the current EROS system, there are several steps required on the part of both sender and receiver to extract the relevant capability for transmission and to insert it into the desired location in the recipient space. In EROS-NG these steps are accomplished in the kernel, though they may result in what I might call "GPT faults" -- I'll try to explain this in my next mail.
For some architectures, it is an important point that the EROS strategy guarantees page table sharing whenever doing so is (a) possible and (b) correct -- even across threads that do not share an address space. In our experience, page table sharing is very important to overall system performance. This sort of sharing is especially important to provide efficient support for shared libraries.
2. Cost of Unmap:
The costs of unmap in the two systems seem comparable if the implementation is done with care. Because EROS-NG provides explicit names for the GPTs, and because it is possible to build "alias" GPTs, the EROS unmap operation can be used more selectively than the L4 unmap operation. This may not be a critical issue.
3. Encapsulation
If I understand matters correctly, an L4 page fault is always reported first to the client's per-task page fault handler. In EROS, page fault handlers can be associated with arbitrary regions of the address space, and faults delivered to these handlers are usually invisible to the client (except for latency). Mach, just for comparison, does per-region page fault handlers. They are not structured like those of EROS, but like EROS they allow specification of address fault handlers on a region by region basis.
When I first encountered KeyKOS, I didn't see why this distinction mattered. Here is an example that may help clarify whether or not this distinction is important:
Imagine that client C maps a file F into the address space of C. Portions of the file data may not be in memory, and may need to be demand paged.
Note that C cannot have enough information to accomplish this demand paging action. At *best*, C has enough information to ask the file server to provide the missing mapping. Because C is not in a position to actually solve the problem, I would argue that the kernel has misdirected the fault. The L4 design therefore appears to require an extra IPC operation relative to the corresponding EROS design.
From an abstraction perspective, there is a second objection: the demand paging of this object is really none of C's business, but C is given full knowledge of all of the faults. This is an encapsulation failure. In fact, the design places a burden of mapping work on C that appears to me to be unnecessary.
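The IPC counting argument can be made explicit with a toy sketch. The exact message counts below are my simplifying assumption about the two protocols; the point is only the relative comparison, not the precise numbers.

```c
#include <assert.h>

/* Assumed L4-style path: the fault goes first to C's per-task pager,
 * which must forward the request to the file server. */
int l4_style_fault_ipcs(void)
{
    int ipcs = 0;
    ipcs++;   /* kernel -> C's per-task pager (fault message)   */
    ipcs++;   /* C's pager -> file server (forwarded request)   */
    ipcs++;   /* file server -> C's pager (mapping reply)       */
    ipcs++;   /* C's pager -> kernel/C (resume the client)      */
    return ipcs;
}

/* Assumed EROS-style path: the kernel delivers the fault directly to
 * the handler associated with the faulting region (the file server). */
int eros_style_fault_ipcs(void)
{
    int ipcs = 0;
    ipcs++;   /* kernel -> region handler (fault message)       */
    ipcs++;   /* handler -> kernel/C (mapping installed, resume) */
    return ipcs;
}
```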
The one argument I see that might make this design desirable is the argument that the L4 kernel is simplified as a result. While I certainly believe that this is true, I think that the absence of first-class memory objects is a potential weakness.
4. Space
The EROS-NG design is more space intensive in the kernel, because the kernel must maintain two representations of the mapping structure and a correlation structure:
1. The GPTs
2. The page tables
3. The depend table (for dependency tracking)
The hardware page tables and the depend tables can be discarded at any time, and the GPTs are pageable.
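A hypothetical sketch of what such a depend table might look like: it correlates each GPT slot with the hardware PTEs that were built by translating through that slot, so that when the slot changes the stale PTEs can be found and zapped. The page tables and depend entries are discardable caches; the GPT is not. Layout and names below are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One depend entry: "this PTE was derived from this GPT slot". */
typedef struct depend_entry {
    void  *gpt_slot;   /* key: address of the GPT slot */
    void **pte;        /* value: PTE built through that slot */
} depend_entry_t;

#define MAX_DEPENDS 64
static depend_entry_t depends[MAX_DEPENDS];
static int n_depends;

/* Record a dependency when a PTE is built from a GPT slot. */
void depend_add(void *slot, void **pte)
{
    depends[n_depends++] = (depend_entry_t){ slot, pte };
}

/* When 'slot' is modified, invalidate every PTE derived from it.
 * Returns the number of PTEs zapped. */
int depend_invalidate(void *slot)
{
    int zapped = 0;
    for (int i = 0; i < n_depends; i++)
        if (depends[i].gpt_slot == slot) {
            *depends[i].pte = NULL;   /* zap the cached translation */
            zapped++;
        }
    return zapped;
}
```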
I do not believe that requiring more kernel space is inherently bad -- the issue is the amount of total pageable space at user and kernel levels combined. We have no evidence from the EROS work that total kernel virtual addressable space is at risk from this particular increase.
The EROS kernel translation strategy is more complex than the corresponding L4 strategy. I do not believe that this is necessarily bad. More precisely, I believe that *all* of the complexities of mapping need to be examined together; not just the complexities that are incurred in the kernel. It is reasonable and desirable to push function out of the kernel when the cost of doing so is not excessive.
Random Comments:
It appears to me that there are peculiar boundary conditions implicit in the L4 design: if the very last page of a process is paged out, I am not sure how its pager thread runs, because I do not understand what memory that pager thread references in order to fetch instructions or store temporary data. I suspect that the solution is that the pager thread in turn specifies a pager thread (which I will call the meta-pager), and the meta-pager arranges to page enough state back in that the pager thread can make progress.
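The meta-pager arrangement I am conjecturing can be sketched as a chain walk: if a pager is itself paged out, its own pager (the meta-pager) must run first to bring it back, and the chain must bottom out at a pager that is always resident (e.g. pinned). The structure below is my own illustration, not anything from an actual L4 implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Each pager may itself be paged out; its 'pager' field names the
 * meta-pager responsible for bringing it back in. */
typedef struct pager {
    bool          resident;  /* is this pager's memory paged in?    */
    struct pager *pager;     /* who pages THIS pager in (meta-pager) */
} pager_t;

/* Make 'p' runnable; returns the number of page-in steps needed.
 * The chain must end in an always-resident (pinned) pager. */
int make_runnable(pager_t *p)
{
    if (p->resident)
        return 0;
    int steps = make_runnable(p->pager);  /* meta-pager runs first... */
    p->resident = true;                   /* ...then pages us back in */
    return steps + 1;
}
```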
Probably this all seems perfectly obvious to people who are familiar with L4. I find it confusing that every process must have two threads or must delegate mapping management to a third-party task.
shap
Just a few clarifications, at least I hope I clarify things ;)
On Mon, 7 Dec 2003, Jonathan S. Shapiro wrote:
> So:
> The key differences between the L4 mapping state and the EROS-NG (next generation) mapping state may be described as follows:
> - In L4, the only important state is the state recorded in the mapping database. This state is a cache, and applications are required to be able to reconstruct their own mapping state on demand.
The mapping database is a cache only in the sense that at any time a higher-level pager can revoke a mapping from an application or from lower-level pagers. In the current model the kernel 'never' throws away mappings.
> Page fault handlers can be injected at any arbitrary GPT.
> There seem to be advantages and disadvantages to each design.
> - Cost of mapping:
> I believe the correct way to measure the cost of a map operation is to include all of the costs necessary to actually get a valid PTE into the recipient address space.
> - L4 Map
> The dominant cost in the L4 map operation is the cost to build the necessary mapping DB entries in the kernel. PTEs are copied aggressively, so there are usually no further hardware page faults needed to load them when the recipient starts running.
> In addition to the kernel-level map operation, the recipient task must record the incoming mapping in some per-task database. This database essentially duplicates the state in the kernel, though my guess is that it can be accomplished via a region-based recording scheme in the usual cases (that is: I record that X bytes from File Y got mapped starting at address A, and that faults in this region should be recovered by making a request to thread-id T). It is not clear to me what the practical overhead of this additional tracking is.
The mapping database is not per-task. There is a single mapping database in the kernel, and it consists of a mapping tree per physical frame of memory. In fact, in some of the implementations the mapping database nodes are all stored in page-table leaf nodes.
> Unless something has changed in a way that I have failed to understand, L4 does not provide a means to share page tables across more than one task.
L4 does not explicitly export any concept of sharing page table subtrees, but you can build this knowledge from the map/grant/unmap primitives. There have been proposals floated to make sharing explicit (i.e. the Link operation) and via mapping hints.
> Random Comments:
> It appears to me that there are peculiar boundary conditions implicit in the L4 design: if the very last page of a process is paged out, I am not sure how its pager thread runs, because I do not understand what memory that pager thread references in order to fetch instructions or store temporary data. I suspect that the solution is that the pager thread in turn specifies a pager thread (which I will call the meta-pager), and the meta-pager arranges to page enough state back in that the pager thread can make progress.
The pager thread need not be in the same address space as the thread it is paging. So, as you point out, if all threads in an address space (bar one) use an address-space-local thread as their pager, then you could specify an address-space-external thread as that remaining thread's pager.
> Probably this all seems perfectly obvious to people who are familiar with L4. I find it confusing that every process must have two threads or must delegate mapping management to a third-party task.
Pagers are not specified per address space. They are specified per thread. No third-party task would be required if, by some established protocol, the minimal set of pages an address-space-local pager needed were pinned in memory.
Cheers, Adam
--
Adam "WeirdArms" Wiggins        School of Computer Sci. & Eng.
PhD Student                     The University of NSW
Phone: +61 2 9385 7359          UNSW SYDNEY NSW 2052, Australia
Fax: +61 2 9385 7942            http://www.cse.unsw.edu.au/~awiggins