I am trying to understand the implications of the "mapping is a cache" design argument. I suspect that this design can only be upheld if encapsulation is violated. First, however, I would like to understand the sequence of events in the following scenario:
Consider a situation in which
- A maps some region to B
- B completes the receive operation, and therefore now has a copy of the mapping
- B is immediately preempted, before it can do any user-level book keeping about the mapping
- ... other stuff runs ...
- kernel runs out of mapping cache space, chooses to evict the mapping just received by B
- ... other stuff runs ...
- B attempts to reference the region that it believes should be mapped, and page faults.
Can someone explain the process by which B is able to get the mapping reconstructed?
shap
[Jonathan S Shapiro]
I am trying to understand the implications of the "mapping is a cache" design argument. I suspect that this design can only be upheld if encapsulation is violated. First, however, I would like to understand the sequence of events in the following scenario:
Consider a situation in which
- A maps some region to B
- B completes the receive operation, and therefore now has a copy of the mapping
- B is immediately preempted, before it can do any user-level book keeping about the mapping
- ... other stuff runs ...
- kernel runs out of mapping cache space, chooses to evict the mapping just received by B
- ... other stuff runs ...
- B attempts to reference the region that it believes should be mapped, and page faults.
Can someone explain the process by which B is able to get the mapping reconstructed?
A really quick answer:
- B's pager, Pb, receives the page fault
- Pb requests the mapping from A
Note that Pb and A could here be the same thread, in which case A must know how to translate a virtual address in B's space to some region. In practice, however, B will not use A as its pager because:
- A might be an untrusted entity.
- Allowing the virtual address in the page fault to be first somehow translated into a higher-level object allows for much greater flexibility.
Of course, in order for this scheme to work, A (and Pb), not B, must keep some sort of data structures that allow page faults to be resolved. These data structures must be initialized *before* A actually maps the memory region to B.
A longer answer would require a better understanding of our concept of "data spaces", "data space managers", and "region maps" [1]. Here's a rather shortish explanation of this scheme:
Data space: An unstructured data container, e.g., a file, anonymous memory, pinned memory, etc.
Data space manager: A server that manages accesses to a particular data space. The data space manager will typically have parts (or the whole) of the data space mapped into its own address space. It will map these parts off to clients.
Region map: A region map is a part of the client's address space that contains parts (or the whole) of a data space. Note that the region map need not be fully populated. If the client accesses a part of the region which is not mapped, a page fault will be generated.
Region mapper: The region mapper serves as the page fault handler for the threads within the client. The region mapper keeps track of all region maps attached to the address space. When the region mapper catches page faults, it translates them into requests that are forwarded to the respective data space manager.
Data spaces are typically constructed recursively. At the bottom (or top depending on your point of view) there is a data space that manages the complete physical memory. On top of this data space one can build data spaces that handle anonymous memory, pinned memory, frame buffer memory, etc. The anonymous memory data spaces can implement various policies for paging, one can build data spaces on top of this that provides access to files, distributed shared memory, etc.
Now, to map our concepts of data spaces onto your question. The thread A in your scheme would correspond to a data space manager, B would correspond to a client thread, and Pb would correspond to the region mapper. For B to access parts of the data space, the following steps would typically be taken (Rm = region mapper, Dm = data space manager):
1. Rm: Create region (R)
2. Rm: Request data space manager (Dm) to attach a data space (D) to R.
3. B: Touch some memory in R. Nothing is mapped yet and a page fault is therefore raised.
4. Rm: Receive page fault and use virtual address to identify region.
5. Rm: Request Dm to map parts of the data space to R.
6. Dm: Map parts of D to R.
An obvious optimization here is for Rm to request parts of the region map to be pre-populated before step 3.
At any time when B attempts to access parts of R that is not mapped, the region mapper will translate the page fault into a data space request. It does not matter why the memory is not mapped. All that matters is that Rm and Dm keep data structures that allow the page fault to be resolved.
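The translation step above can be sketched as a small simulation. This is a toy model with invented names (RegionMapper, DataSpaceManager, handle_fault), not the actual L4 or SawMill interfaces; it only illustrates how a region mapper turns a fault address into a data space offset and asks the responsible data space manager for a mapping:

```python
# Toy model: a region mapper resolves page faults by translating fault
# addresses into data space offsets and asking the data space manager
# for a mapping. All names are invented for illustration.
PAGE = 4096

class DataSpaceManager:
    def __init__(self, dataspace):
        self.dataspace = dataspace          # e.g. the pages of a file

    def map_page(self, offset):
        # Step 6: map one page of the data space to the client region.
        return self.dataspace[offset // PAGE]

class RegionMapper:
    def __init__(self):
        self.regions = []                   # (base, size, dsm, ds_offset)
        self.mapped = {}                    # page-aligned vaddr -> contents

    def attach(self, base, size, dsm, ds_offset=0):
        # Steps 1-2: create a region and attach a data space to it.
        self.regions.append((base, size, dsm, ds_offset))

    def handle_fault(self, vaddr):
        # Steps 4-5: identify the region containing the fault address and
        # request a mapping from the responsible data space manager.
        for base, size, dsm, ds_offset in self.regions:
            if base <= vaddr < base + size:
                page = vaddr & ~(PAGE - 1)
                self.mapped[page] = dsm.map_page(ds_offset + vaddr - base)
                return True
        return False                        # no region: genuine fault

dsm = DataSpaceManager(dataspace=["page0", "page1", "page2"])
rm = RegionMapper()
rm.attach(base=0x10000, size=3 * PAGE, dsm=dsm)
assert rm.handle_fault(0x11008)             # step 3: B touches memory in R
assert rm.mapped[0x11000] == "page1"        # steps 6-8: mapping established
```

Note that it does not matter to this logic whether the page was never mapped or was evicted from the mapping cache; the same lookup resolves both cases.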
[Hmm... my "short" answer turned out to be a bit longer than expected.]
eSk
[1] http://i30www.ira.uka.de/research/documents/l4ka/sawmill-framework.pdf
On Mon, 2003-12-08 at 09:30, Espen Skoglund wrote:
[Jonathan S Shapiro]
- A maps some region to B
- B completes the receive operation, and therefore now has a copy of the mapping
- B is immediately preempted, before it can do any user-level book keeping about the mapping
- ... other stuff runs ...
- kernel runs out of mapping cache space, chooses to evict the mapping just received by B
- ... other stuff runs ...
- B attempts to reference the region that it believes should be mapped, and page faults.
Can someone explain the process by which B is able to get the mapping reconstructed?
A really quick answer:
- B's pager, Pb, receives the page fault
- Pb requests the mapping from A
Note that Pb and A could here be the same thread...
This makes sense to me, but it also seems to me that if A is a process implementing the file server, and B has memory mapped a file from A, then the current design requires Pb to act as an intermediary -- primarily for the purpose of normalizing file offsets and doing a little bit of protocol translation.
Further, it seems to me that there is an interesting problem with deceiving IPC here, since the file server may not know that Pb and B are equivalent for access control purposes.
Am I missing something that simplifies this scenario?
A longer answer would require a better understanding of our concept of "data spaces", "data space managers", and "region maps" [1]. Here's a rather shortish explanation of this scheme:
Data space: An unstructured data container, e.g., a file, anonymous memory, pinned memory, etc.
Data space manager: A server that manages accesses to a particular data space. The data space manager will typically have parts (or the whole) of the data space mapped into its own address space. It will map these parts off to clients.
Region map: A region map is a part of the client's address space that contains parts (or the whole) of a data space. Note that the region map need not be fully populated. If the client accesses a part of the region which is not mapped, a page fault will be generated.
Region mapper: The region mapper serves as the page fault handler for the threads within the client. The region mapper keeps track of all region maps attached to the address space. When the region mapper catches page faults, it translates them into requests that are forwarded to the respective data space manager.
Okay. This is roughly the model that I was reconstructing from first principles. I will try to use these terms from here on to avoid confusion.
Based on your description, I am now reasonably convinced that the L4 operations are individually faster, but that the collective end to end protocol needed to resolve page faults when data spaces are involved may be significantly more complicated in L4 than it is in EROS. I suspect that the aggregate end to end costs in L4 are likely to be *slower* than EROS, but at best they are going to be very similar.
For B to access parts of the data space, the following steps would typically be taken (Rm = region mapper, Dm = data space manager):
- Rm: Create region (R)
- Rm: Request data space manager (Dm) to attach a data space (D) to R.
- B: Touch some memory in R. Nothing is mapped yet and a page fault is therefore raised.
- Rm: Receive page fault and use virtual address to identify region.
- Rm: Request Dm to map parts of the data space to R.
- Dm: Map parts of D to R.
An obvious optimization here is for Rm to request parts of the region map to be pre-populated before step 3.
A better optimization might be to provide sufficient information to the kernel so that it can more directly localize the correct fault handler.
Or perhaps the L4 design embeds a philosophical argument that resolving these things at user level is (a) feasible and (b) likely as efficient as any kernel implementation, and therefore should not be done in the kernel? If so, I understand the philosophical point, and I am not sure that I agree. In my mind, the answer depends on what gets the job done best on an end to end basis.
Please note that I'm not advocating placing policy in the kernel here. I'm wondering if there might be a better *mechanism* by which to express the user-desired policy.
[Hmm... my "short" answer turned out to be a bit longer than expected.]
Perhaps so, but it was VERY helpful!
shap
Jonathan S. Shapiro wrote:
This makes sense to me, but it also seems to me that if A is a process implementing the file server, and B has memory mapped a file from A, then the current design requires Pb to act as an intermediary -- primarily for the purpose of normalizing file offsets and doing a little bit of protocol translation.
Further, it seems to me that there is an interesting problem with deceiving IPC here, since the file server may not know that Pb and B are equivalent for access control purposes.
Am I missing something that simplifies this scenario?
That is a result of using thread ids for identification of senders, which I consider a bad idea. If we indeed need sender identification (which I tend to believe), the id space should be designed such that ids can be managed in user space and enforced by the kernel, i.e. Pb and the file server should be able to act under the same sender id.
--hermann
On Mon, 2003-12-08 at 11:39, Hermann Härtig wrote:
Jonathan S. Shapiro wrote:
That is a result of using thread ids for identification of senders, which I consider a bad idea. If we indeed need sender identification (which I tend to believe), the id space should be designed such that ids can be managed in user space and enforced by the kernel, i.e. Pb and the file server should be able to act under the same sender id.
We need to talk further about whether sender IDs are useful, but let me answer on the assumption that they are necessary.
If sender IDs are used, then any change to a sender ID must be a privileged operation. It can be managed in user mode, but the software that does the management must be universally trusted.
shap
Jonathan S. Shapiro wrote:
We need to talk further about whether sender IDs are useful, but let me
Ok. This is a tough issue on which I have not yet decided which side I am on.
answer on the assumption that they are necessary.
If sender IDs are used, then any change to a sender ID must be a privileged operation. It can be managed in user mode, but the software that does the management must be universally trusted.
No, not all changes are allowed, and trust need not be universal. The analogy again is pager hierarchies. Each pager must be trusted with respect to the pages it manages; the root pager must be trusted universally. It is a hierarchy. Exactly the same with ids.
Here is the proposal for id management in the generalized mapping scheme that we discussed in Dresden (Marcus is redoing the internal paper I wrote years ago, which is extremely obsolete; I tried to include that in my email of 7.12., but it was obviously too short).
- A send mapping is a "descriptor"/"capability"/"mapping" (whichever) that is used by its task (L4-speak for something that has its address space + at least one thread) via a local name and that allows it to send a message to another task. No threads are named as receiver, just the task (more options in my email of 7.12.).
- Send mappings (like page mappings and all other mappings) can be mapped, flushed, etc.
- A send mapping (for example the send mapping with local name 57) has the form (abbreviated): (physical task address, Principal ID: LENGTH, VALUE) and cannot be inspected. The only operation allowed is to increase LENGTH and append a value to VALUE during a map operation (an IPC containing a send mapping).
- The so-called Principal ID is used as the sender id.
A task may send out a message M by calling send(57, M). The kernel assigns "57.VALUE" to the first part of M, where the "first part" has length "57.LENGTH", as the sender id.
Send mappings can be mapped from pagers to pagees. While doing so, LENGTH may increase, but not decrease. That way, senders further down a pager hierarchy can be restricted in their "Principal ID" space or, in other words, in what the first part of their messages may contain. Almost arbitrary user-level naming schemes can be provided.
As a (stupid) example for a name space: The root pager owns all ids, i.e. the LENGTH field in all its send mappings is 0. It may map the send descriptor to one of its pagees X with (VALUE="85", LENGTH=8 bits). Thus X can only use sender ids starting with "85". X can map the send mapping to Y with (VALUE="854", LENGTH=12 bits). In this situation, Y must be trusted with respect to the name space "854x..x", X with respect to the name space "85x...x", and the root pager with respect to the full name space. X can flush Y's send mappings, thus preventing Y from further using any id in the space "854x..x". This is an exact analogy to resources provided by the kernel. Thus, ids are treated as a kind of resource.
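The prefix restriction in this example can be sketched as a toy model. All names here are invented (SendMapping, map_to, send), and the prefixes are digit strings with lengths counted in digits rather than bits, purely for readability; this is a sketch of the append-only Principal ID idea, not a proposed kernel interface:

```python
# Toy model of the send-mapping proposal: a mapping carries a
# (LENGTH, VALUE) principal-id prefix that can only grow when the
# mapping is mapped further down, so a pager constrains the sender-id
# space of its pagees. Invented names; prefixes are digit strings.
class SendMapping:
    def __init__(self, task, value="", length=0):
        self.task = task
        self.value = value                  # id prefix (VALUE)
        self.length = length                # prefix length (LENGTH), in digits

    def map_to(self, suffix):
        # Map operation: the only allowed change is to increase LENGTH
        # and append to VALUE. The prefix can never be shortened.
        return SendMapping(self.task, self.value + suffix,
                           self.length + len(suffix))

def send(mapping, message):
    # The kernel forces the first LENGTH digits of the sender id to be
    # the mapping's VALUE; the sender chooses only the remainder.
    sender_id = mapping.value + message["id_tail"]
    return (sender_id, message["body"])

root = SendMapping(task="server")           # root pager: LENGTH = 0, owns all ids
x = root.map_to("85")                       # X restricted to ids 85x..x
y = x.map_to("4")                           # Y restricted to ids 854x..x
assert send(y, {"id_tail": "7", "body": "hi"}) == ("8547", "hi")
```

Flushing Y's send mapping would simply revoke the "854" prefix from Y, exactly as flushing a page mapping revokes memory.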
(I omit to discuss the possibilities and problems on the receiver side that need to be addressed in this scheme. That is where the various proposals differ and real confusion still exists at this stage of discussion.)
--hermann
[Jonathan S Shapiro]
Based on your description, I am now reasonably convinced that the L4 operations are individually faster, but that the collective end to end protocol needed to resolve page faults when data spaces are involved may be significantly more complicated in L4 than it is in EROS. I suspect that the aggregate end to end costs in L4 are likely to be *slower* than EROS, but at best they are going to be very similar.
Let's try to summarize what needs to be done for resolving a page fault using the data space model.
1. Page fault is raised, execution traps into the kernel.
2. Kernel translates the page fault into a page fault IPC.
3. The kernel switches to the pager---the region mapper. The region mapper resides in the same address space, so no address space switch is needed.
4. The region mapper translates the page fault into a region map access.
5. The region mapper sends a request to the corresponding data space manager. Note that the request is sent "deceiving" or "propagating", meaning that the data space manager can reply directly to the faulting thread.
6. The data space manager checks if the request is valid and translates the request into a map operation (this translation can be implemented very efficiently).
7. The data space manager replies with a mapping to the faulting thread.
8. The faulting thread resumes execution.
As you point out below, it would be possible to associate a separate pager with different regions of virtual memory, but for reasons I argue below, this reduces flexibility.
Anyhow, by associating a pager with separate memory regions we can only avoid one (intra address space) IPC operation (step 3). I'm not convinced that this matters much in practice since page faults are generally treated by the hardware as exceptions and incur a substantial overhead in the first place (pipeline flushing, various synchronization when updating page tables, change of cache working sets, etc.). The performance numbers of the "data spaces" paper I cited in the last mail substantiate these claims.
Or perhaps the L4 design embeds a philosophical argument that resolving these things at user level is (a) feasible and (b) likely as efficient as any kernel implementation, and therefore should not be done in the kernel? If so, I understand the philosophical point, and I am not sure that I agree. In my mind, the answer depends on what gets the job done best on an end to end basis.
Yes, this philosophical argument does apply to the design decisions of L4.
Please note that I'm not advocating placing policy in the kernel here. I'm wondering if there might be a better *mechanism* by which to express the user-desired policy.
We've found that the only mechanism that allows us to express the user-desired policy is to perform all these policy decisions at user level. For instance, the region mapper might suddenly choose to delay all write accesses to a particular region map, e.g., to capture a consistent snapshot of the region. It can do this by revoking all write accesses to the region, and if anyone tries to perform a write operation it can delay that operation until the snapshot has been taken.
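The snapshot policy described here can be sketched in a few lines. This is a toy model with invented names (Region, write, delayed), not an L4 interface; it only shows the revoke-delay-replay sequence:

```python
# Toy sketch of the snapshot policy: the region mapper revokes write
# access, copies the region while writers are delayed, then restores
# access and replays the delayed writes. Invented names throughout.
class Region:
    def __init__(self, pages):
        self.pages = pages
        self.writable = True
        self.delayed = []                   # writes held back during a snapshot

    def write(self, index, value):
        if self.writable:
            self.pages[index] = value
        else:
            self.delayed.append((index, value))   # write fault: delay it

    def snapshot(self):
        snap = list(self.pages)             # consistent copy, writes revoked
        self.writable = True                # restore write access
        for index, value in self.delayed:   # replay the delayed writes
            self.pages[index] = value
        self.delayed.clear()
        return snap

r = Region(["a", "b"])
r.writable = False                          # region mapper revokes write access
r.write(0, "A")                             # a client write faults and is delayed
snap = r.snapshot()
assert snap == ["a", "b"]                   # snapshot unaffected by the write
assert r.pages == ["A", "b"]                # the write completes afterwards
```

The point is that no kernel support beyond fault redirection is needed: the policy lives entirely in the region mapper.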
On a more general level the argument seems to come down to whether the kernel transparently handles exceptions or whether exceptions are exposed to the application and handled in an application-specific way. Clearly, the L4 design favours the second approach. I realize that this approach may be unsatisfactory in certain situations, e.g., if one for some reason needs to make the application unaware of external communication. This is one of the gripes I have with the current versions of L4.
One solution to make exceptions transparent to applications, while still allowing policies to be defined at user level, could be to disallow an application from changing its pager/exception handler, thereby making sure that the exception IPCs cannot be seen by the application. Such a scheme would not really work with the current L4 specification because an application is always allowed to change its pager/exception handler. Even if we could make this restriction about non-changeable pagers/exception handlers, the application might still be able to intercept the exception IPCs by probing the state of other threads in the address space at the right time and aborting ongoing exception IPC operations.
eSk
On Tue, 2003-12-09 at 04:00, Espen Skoglund wrote:
- The region mapper sends a request to the corresponding data space manager. Note that the request is sent "deceiving" or "propagating", meaning that the data space manager can reply directly to the faulting thread.
- The data space manager checks if the request is valid and translates the request to a map operation (this translation can be implemented very efficiently).
- The data space manager replies with a mapping to the faulting thread.
- The faulting thread resumes execution.
This protocol is actually an optimization of the following original protocol which is implemented in SawMill Linux:
5. RM sends non-propagating request and waits for reply
6. DSM checks validity and translates request
7. DSM replies to RM
8. RM replies to faulting thread
9. faulting thread resumes execution
With this protocol, the client does not have to trust the DSM. If the DSM does not reply (properly), the RM can fix up the faulting thread to e.g. execute something like a signal handler.
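The trust argument can be sketched as a toy model. The names here (resolve_fault, the DSM callables, the "signal" fixup) are invented for illustration and are not the SawMill Linux implementation; the sketch only shows why keeping the RM in the loop protects the client from a misbehaving DSM:

```python
# Toy model of the non-propagating protocol: the region mapper stays in
# the loop, so a DSM that fails to reply properly cannot hang the
# client; the RM falls back to a signal-handler-style fixup instead.
def resolve_fault(vaddr, dsm_reply):
    # Step 5: RM sends a non-propagating request and waits for a reply.
    reply = dsm_reply(vaddr)                # steps 6-7: DSM checks and replies
    if reply is None:                       # DSM did not reply (properly)
        return ("signal", vaddr)            # RM fixes up the faulting thread
    return ("mapped", reply)                # step 8: RM replies to the client

good_dsm = lambda vaddr: f"page@{vaddr:#x}" # well-behaved data space manager
bad_dsm = lambda vaddr: None                # DSM that never answers

assert resolve_fault(0x2000, good_dsm) == ("mapped", "page@0x2000")
assert resolve_fault(0x2000, bad_dsm) == ("signal", 0x2000)
```

In the propagating variant, by contrast, the DSM replies directly to the faulting thread, so the client must trust the DSM to reply at all; the two extra IPCs here (steps 7 and 8) are the price of removing that trust requirement.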
Stefan
Stefan Götz wrote:
- RM sends non-propagating request and waits for reply
- DSM checks validity and translates request
- DSM replies to RM
In general the DSM should not be able to alter regions it is not managing, i.e. at least the receive window must be shrunk to the region the fault occurred in. On x86 Pistachio the hack of modifying BR0 in the UTCB of the faulting thread might work. On architectures that place BR0 in registers this is not possible, so steps 6 and 7 are required anyway.
- RM replies to faulting thread
- faulting thread resumes execution
With this protocol, the client does not have to trust the DSM. If the DSM does not reply (properly), the RM can fix up the faulting thread to e.g. execute something like a signal handler.
Marcus
[Marcus Völp]
Stefan Götz wrote:
- RM sends non-propagating request and waits for reply
- DSM checks validity and translates request
- DSM replies to RM
In general the DSM should not be able to alter regions it is not managing, i.e. at least the receive window must be shrunk to the region the fault occurred in. On x86 Pistachio the hack of modifying BR0 in the UTCB of the faulting thread might work. On architectures that place BR0 in registers this is not possible, so steps 6 and 7 are required anyway.
I believe our intention is to change the specs so that this becomes possible, e.g., by making the ExchangeRegisters system call more general.
eSk
l4-hackers@os.inf.tu-dresden.de