Hi, I am directing this question to both the Karlsruhe and the Dresden L4 mailing lists, because I know both are working on new L4 specifications.
(Background:) I am working with L4 in Dresden on real-time systems, which have become fairly complex lately. One question that has begun to bother me more and more is: What is the use of finite IPC timeouts?
Ideally, sending an IPC should either be immediate (zero timeout) or be performed eventually (infinite timeout). The only useful application of finite timeouts I have seen so far is the combination of multiple tasks in a single-threaded server, that is, waiting for an IPC and for a regularly occurring timer event. (This sometimes implied adjusting the IPC timeout by the time spent between the IPC waits.) Otherwise the use of finite timeouts is rather inconvenient, since a suitable timeout value depends on the whole surrounding system (software and hardware) and is thus impractical to determine.
I'm personally convinced that you do not need finite IPC timeouts. A single bit differentiating zero and infinite timeouts should be sufficient. To imitate the concurrent waiting for an IPC or timer event, one could set up a timer with the kernel and include the "timer source" in the receive scope of the wait IPC. Timer events can be dropped if the receiver is not ready. This would also simplify the tasks that use IPC timeouts as a timer event source.
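To make the idea concrete, here is a minimal sketch in C. Every name in it (source_t, timer_arm_periodic, ipc_wait_any, the handlers) is hypothetical, not an existing L4 interface; it only shows what a single-threaded server loop could look like under the proposed scheme.

    #include <stdint.h>

    typedef uint32_t source_t;      /* an IPC partner or a kernel timer */

    /* Hypothetical kernel interface for this sketch: */
    extern source_t timer_arm_periodic(uint64_t period_us); /* timer as IPC source */
    extern source_t ipc_wait_any(void *msg, int block);     /* block: 0 = zero, 1 = infinite */

    extern void handle_timer_tick(void);
    extern void handle_client_request(source_t who, void *msg);

    void server_loop(void)
    {
        char msg[64];
        source_t timer = timer_arm_periodic(10 * 1000);  /* 10 ms tick, delivered as IPC */

        for (;;) {
            /* One infinite-timeout wait covers both client IPC and the
             * timer source; no finite IPC timeout is needed.  A tick that
             * fires while the thread is not receiving is simply dropped. */
            source_t who = ipc_wait_any(msg, /* block = */ 1);
            if (who == timer)
                handle_timer_tick();
            else
                handle_client_request(who, msg);
        }
    }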
Coming back to my question: What is the use of finite timeouts?
Or maybe the question should be: Would it simplify kernel (and user-land) design if the concepts of finite timeouts and IPC were separated?
If you have reasons to include finite timeouts in future versions of the L4 specification, please convince me.
Thanks, Ron.
On Wed, 2005-02-23 at 18:29 +0100, Ronald Aigner wrote:
I'm personally convinced that you do not need finite IPC timeouts.
I believe that this was agreed at the Dresden L4 summit meeting. The only well-motivated case for timeouts appears to be interacting with physical real-world devices that have embedded timeouts (e.g. disk drives -- if the 15ms seek is not complete in 20ms, your drive is dead). That is, the "watchdog pattern".
My memory is that it was agreed that this case is rare enough, and occurs in software that is unusual enough, that it is not justified to preserve this function in the IPC.
The remaining cases of "no timeout" and "block indefinitely" should remain.
Jonathan S. Shapiro
Jonathan S. Shapiro said:
On Wed, 2005-02-23 at 18:29 +0100, Ronald Aigner wrote:
I'm personally convinced that you do not need finite IPC timeouts.
I believe that this was agreed at the Dresden L4 summit meeting. The only well-motivated case for timeouts appears to be interacting with physical real-world devices that have embedded timeouts (e.g. disk drives -- if the 15ms seek is not complete in 20ms, your drive is dead). That is, the "watchdog pattern".
My memory is that it was agreed that this case is rare enough, and occurs in software that is unusual enough, that it is not justified to preserve this function in the IPC.
The remaining cases of "no timeout" and "block indefinitely" should remain.
I don't recall that, but then: even better :-)
It was brought to my attention that pagefault timeouts _are_ important to enforce a trust relation with your communication partner. I don't know what the semantics of a zero pagefault timeout are. If it means that the page has to be present, and an infinite pagefault timeout means that you don't care, then finite pagefault timeouts seem reasonable. Still, defining a useful value seems impractical to me.
Ron.
On Wed, 2005-02-23 at 19:49 +0100, Ronald Aigner wrote:
It was brought to my attention that pagefault timeouts _are_ important to enforce a trust relation with your communication partner.
Unfortunately, this is true. Even more unfortunately, there is absolutely no way to set a robust timeout for this case. In consequence, the need for this timeout must be seen as a fundamental architectural deficiency.
To resolve this problem even in part, the architecture must distinguish between (a) addresses that are logically undefined, and (b) addresses that are currently unmapped because of being paged out. The former case is *always* an error in the logic of the recipient. The latter case is a situation where either the sender trusts the paging agent completely or no safe foundation for *any* communication of data can exist in the architecture.
For some of the details, you might want to review "Vulnerabilities in Synchronous IPC Designs" from IEEE Security and Privacy a few years ago:
http://www.eros-os.org/papers/IPC-Assurance.ps
shap
At Wed, 23 Feb 2005 19:49:44 +0100 (CET), "Ronald Aigner" ra3@os.inf.tu-dresden.de wrote:
It was brought to my attention that pagefault timeouts _are_ important to enforce a trust relation with your communication partner. I don't know what the semantics of a zero pagefault timeout are. If it means that the page has to be present, and an infinite pagefault timeout means that you don't care, then finite pagefault timeouts seem reasonable. Still, defining a useful value seems impractical to me.
If you use string items in a reply from a server to the client, I think even small timeouts can be used for DoS attacks. This is why I use timeout and transfer timeout 0 for all IPC from the server to the client. The client just has to be ready, and all buffers to receive string items need to be wired down (or other mechanisms need to be used, like trusted buffer objects, or resuming the operation for the not-transferred data). Of course, other systems may have different trust considerations.
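As a sketch of this discipline in C (the names ipc_reply, ipc_call, wire_pages and the timeout constant are invented stand-ins for illustration, not the real L4 convenience API):

    #include <stddef.h>

    #define TIMEOUT_ZERO 0   /* fail immediately instead of blocking */

    /* Invented stand-ins for the IPC and pinning primitives: */
    extern int ipc_reply(int client, const void *buf, size_t len,
                         int snd_timeout, int xfer_timeout);
    extern int ipc_call(int server, void *rcv_buf, size_t rcv_len);
    extern int wire_pages(void *buf, size_t len);

    /* Server side: both the send timeout and the transfer (pagefault)
     * timeout are zero, so a client that is not ready, or that faults
     * during the string transfer, cannot stall the server. */
    int reply_string(int client, const void *data, size_t len)
    {
        return ipc_reply(client, data, len, TIMEOUT_ZERO, TIMEOUT_ZERO);
    }

    /* Client side: the receive buffers must be resident for the reply
     * to succeed, so they are wired down before the call is made. */
    int call_server(int server, void *rcv_buf, size_t rcv_len)
    {
        if (wire_pages(rcv_buf, rcv_len) != 0)
            return -1;
        return ipc_call(server, rcv_buf, rcv_len);
    }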
To answer your original question:
Finite IPC timeouts seem to be necessary to sleep for a specified time (receiving from yourself), for implementing functions like sleep() and timed waits on a synchronization primitive. I don't care much about the mechanism, but it must be robust - dropping a timeout and thus going into an infinite receive does not seem to be enough to me (maybe I misunderstood your proposal).
However, a slightly different scheme could work. The timer event is not dropped, but instead deferred: the next time the thread does an IPC and there are no pending partners, it is canceled immediately and does not block. In addition, any IPC operation will always clear any pending timer before it returns (so it won't accidentally affect later IPC operations). I have not thought this completely through; there seems to be some hair attached.
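My reading of that scheme, as a kernel-side sketch (the thread structure, helpers, and error code are all invented for illustration):

    /* Invented kernel-side types and helpers: */
    struct thread { int timer_pending; /* ... */ };

    #define ETIMERCANCEL 125  /* invented error code for the sketch */

    extern int  partner_ready(struct thread *partner);
    extern long do_ipc(struct thread *self, struct thread *partner, int is_receive);

    long sys_ipc(struct thread *self, struct thread *partner, int is_receive)
    {
        long ret;

        if (is_receive && !partner_ready(partner) && self->timer_pending)
            /* No pending partner: the deferred timer event cancels the
             * receive immediately instead of letting it block. */
            ret = -ETIMERCANCEL;
        else
            ret = do_ipc(self, partner, is_receive);

        /* Every IPC clears a pending timer before returning, so a stale
         * event cannot affect a later, unrelated IPC operation. */
        self->timer_pending = 0;
        return ret;
    }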
It's interesting that you raise this issue, though: we have a patch for L4 that implements asymmetric xfer timeouts (i.e., the timeout depends on where the page fault happens), and the semantics are clear if you have only 0 or inf timeouts, but a bit unclear when you have other timeouts _and_ multiple page faults, some in the sender and some in the receiver (this is why we opted for another, simpler asymmetric xfer timeout scheme, where the xfer timeout for local pagefaults always defaults to "infinite").
Thanks, Marcus
On Wed, 2005-02-23 at 21:52 +0100, Marcus Brinkmann wrote:
At Wed, 23 Feb 2005 19:49:44 +0100 (CET), "Ronald Aigner" ra3@os.inf.tu-dresden.de wrote:
It was brought to my attention that pagefault timeouts _are_ important to enforce a trust relation with your communication partner. I don't know what the semantics of a zero pagefault timeout are. If it means that the page has to be present, and an infinite pagefault timeout means that you don't care, then finite pagefault timeouts seem reasonable. Still, defining a useful value seems impractical to me.
If you use string items in a reply from a server to the client, I think even small timeouts can be used for DoS attacks. This is why I use timeout and transfer timeout 0 for all IPC from the server to the client. The client just has to be ready, and all buffers to receive string items need to be wired down (or other mechanisms need to be used, like trusted buffer objects, or resuming the operation for the not-transferred data). Of course, other systems may have different trust considerations.
[In the following, please note that when I use the term "paging" I mean the virtualization of available physical memory by migrating pages of content between store and memory. This should not be confused with "page fault handling", which is the implementation of a policy that defines the validity and protection of locations in an address space. Allowing a timeout for the recipient page definition policy engine (which in L4 is a pager) is potentially useful, but not compellingly so.]
Marcus is exactly correct. The problem with the need for a zero timeout on the pager is that it violates the encapsulation of paging. A paging system is supposed to be able to page out portions of a process without altering the semantics of its behavior (ignoring latency). An immediate consequence of this definition is that it should not be necessary for a server transmitting a string item to a client to consider the behavior of the paging agent to be part of the threat model. Because the L4 memory model cannot distinguish between the logical presence/absence of a mapping and the physical presence/absence of a mapping, it is not possible to accurately capture the semantics of paging within the L4 operational semantics.
Some might object that the pager should not have to be trusted. What is emerging from the discussion at hand is that this architectural view does not map well to real usage. Marcus's approach of using a zero pager timeout in server->client sends implies that the client must have means to pin its receive area (a client-defined number of page frames) for an indefinite time (there may be no contract on how fast the server replies). This can induce a very real and urgent resource denial of service problem, and may be in contradiction with the residency policy requirements under which the client must operate. It works very well in non-paging environments such as embedded applications.
Just to be clear: I'm certainly not claiming for an instant that EROS got this stuff perfect. And I think it would definitely be nice to have a paging agent that could be outside of the universally trusted computing base. Unfortunately, nobody has been able to suggest a paging architecture in which this has turned out to be feasible. In all current L4-derived systems, the region manager (or the software that plays the equivalent functional role) is necessarily trusted completely by all of the programs that it serves.
It appears to me that there is no behavioral difference that motivates the desire for untrusted pagers. All pagers do basically the same thing: move pages into and out of memory. The rationale for untrusted pagers usually turns out to be a desire to have distinguishable *policies* for page replacement (equivalently: for residency retention). Separating the policy of paging from the mechanism of paging is definitely possible, and it simplifies the trust problem greatly. Once these things are separated, we may reduce the trust assumption to:
1. The *mechanism* of paging must be universally trusted.
2. The relevant *policy* under which the page fault is triggered must ensure progress of transfer for the string item that is being returned to the client. The latency characteristics of this policy must be understood by the sender (the server).
We have spent some time thinking about this in the Coyotos design, and we have concluded that the *second* requirement (knowledge of the policy) is still problematic. Our conclusion is that the sender must always be in charge of the policy under which any page (including a receiver page) involved in an IPC is fetched in. Only if this is true can the server know whether the delay characteristics of the recipient page faults will be acceptable. We have therefore concluded that the *sender* must have the ability to specify the working set (or whatever page replacement policy embodiment is used) that should be used to bring in recipient pages if the receiver is untrusted. The reverse is also true -- the receiver must be able to deny the sender control over receiver residency policy. The end result is very much like schedule donation by mutual agreement.
Once the IPC is done, these pages become subject to cleaning in the usual way, and will remain resident only if the recipient has ensured adequate working set guarantees to ensure their residence.
This is not a simple mechanism, and there are serious difficulties in reasoning about residence behavior of redundantly sponsored pages, but it is the best that we have been able to come up with.
Finite IPC timeouts seem to be necessary to sleep for a specified time (receiving from yourself), for implementing functions like sleep() and timed waits on a synchronization primitive.
This is indeed a reasonable way to do these things in L4. As an alternative approach to consider, these functions are provided in EROS by kernel-implemented services. In the case of both kernel-implemented and process-implemented services, what the invoker sees is that they are invoking a capability.
Apologies if this aside is off topic. My point is only that there is a choice of design spaces, and embedding the timeout in the IPC specification may be a reasonable choice, but that the desire for delays of the type that Marcus identifies for self-send and sleep do not imply a requirement for timeout in the IPC primitive.
The timer event is not dropped, but instead deferred: the next time the thread does an IPC and there are no pending partners, it is canceled immediately and does not block.
It appears to me that this amounts to a special case of a more general problem: non-blocking reliable delivery of event notification. We have run into places in EROS where there is a serious need for such a mechanism, and we have been contemplating how to achieve this in Coyotos. We have now concluded that it should *not* be done in endpoints, because it is extremely desirable for endpoints to be stateless and pending events must be recorded somewhere. It is unclear at this point what alternative will emerge.
shap
At Wed, 23 Feb 2005 17:24:47 -0500, "Jonathan S. Shapiro" shap@eros-os.org wrote:
Some might object that the pager should not have to be trusted. What is emerging from the discussion at hand is that this architectural view does not map well to real usage. Marcus's approach of using a zero pager timeout in server->client sends implies that the client must have means to pin its receive area (a client-defined number of page frames) for an indefinite time (there may be no contract on how fast the server replies). This can induce a very real and urgent resource denial of service problem, and may be in contradiction with the residency policy requirements under which the client must operate. It works very well in non-paging environments such as embedded applications.
The reason why I hope we will get away with it is that we are pursuing a model where every task is self-paged. Tasks get a quota of physical memory, which is guaranteed over a long time (the exact details of how to negotiate that number are not determined yet). There are many complications, but in the end this allows the client to make paging decisions itself, and thus it can wire down a region of memory effectively.
Sounds good, but there is a problem, and I think this just illustrates the point you were making: The physical memory server needs to be able to revoke arbitrary mappings temporarily, for example to make space for DMA buffer regions or to reorganize memory for super-page allocation. This means that the operation of the physical memory server is not transparent to the clients, and thus even to the server. (Only the kernel could make this operation transparent by an atomic copy-and-remap operation).
So, yes, the whole notion is troublesome.
Just to be clear: I'm certainly not claiming for an instant that EROS got this stuff perfect. And I think it would definitely be nice to have a paging agent that could be outside of the universally trusted computing base. Unfortunately, nobody has been able to suggest a paging architecture in which this has turned out to be feasible. In all current L4-derived systems, the region manager (or the software that plays the equivalent functional role) is necessarily trusted completely by all of the programs that it serves.
Well, we are at least trying. Whether it is feasible, we will see when we have implemented it...
The timer event is not dropped, but instead deferred: the next time the thread does an IPC and there are no pending partners, it is canceled immediately and does not block.
It appears to me that this amounts to a special case of a more general problem: non-blocking reliable delivery of event notification. We have run into places in EROS where there is a serious need for such a mechanism, and we have been contemplating how to achieve this in Coyotos. We have now concluded that it should *not* be done in endpoints, because it is extremely desirable for endpoints to be stateless and pending events must be recorded somewhere. It is unclear at this point what alternative will emerge.
Yes. For me, such issues tend to crop up when I think about cancellation (for example due to signals). Due to lack of low-level support, we will have to do it with expensive and complicated high level constructs in the endpoints. (Now, for us this is probably OK, as we are going to need some such support anyway to correctly implement POSIX semantics. Still, it seems to be a burden and better low-level support could possibly simplify things a lot).
Thank you also for your other comments, they are very much appreciated.
Marcus
Good morning ;) After reading tonight's discussion, I believe that finite timeouts in IPC do not have to be supported, not even for transfer timeouts (IPC pagefault timeouts). Finite timeouts in IPC are impractical.
Concerning the usage of IPC timeouts to sleep/wait for a finite amount of time: I claim that separating the support for finite time events (see below) and IPC will *not* make a microkernel more complex, but rather simpler. The reason for this claim is that it is easier to verify the behaviour of IPC with either zero or infinite timeouts than with additional finite timeouts. (I agree that this is painting in black and white and leaving all the gray out.)
Finite time events are in some way already implemented in Fiasco: Udo Steinberg provided mechanisms in the kernel to set up time-slices for real-time execution. When the time-slice expired, or a thread reached the deadline of its period, an "event" (read: IPC) was sent to the thread's preempter, which can react to this event. The real-time thread could avoid the delivery of the event by synchronizing with the kernel: it could wait for the start of the next period (or the start of the next time-slice, albeit this is a rather simplistic description). This could also be combined with the reception of an IPC, effectively providing a "receive IPC with (absolute) timeout". I would argue that this mechanism can be used (with modifications) to implement time events in a microkernel.
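As a sketch of how a periodic thread could use such a mechanism (rt_wait_next_period and ipc_recv_until_period_end are invented renderings of the mechanism described above, not Fiasco's real interface):

    /* Invented interface, loosely modeled on the mechanism described: */
    extern void rt_wait_next_period(void);            /* sync with start of next period */
    extern int  ipc_recv_until_period_end(void *msg); /* 0 = got message, <0 = deadline hit */

    extern void do_periodic_work(void);
    extern void handle_msg(void *msg);

    void periodic_thread(void)
    {
        char msg[64];

        for (;;) {
            rt_wait_next_period();   /* kernel releases the thread at period start */
            do_periodic_work();
            /* Receive until the period ends: effectively "receive IPC
             * with absolute timeout", without a finite IPC timeout. */
            while (ipc_recv_until_period_end(msg) == 0)
                handle_msg(msg);
        }
    }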
I disagree with the opinion that the complexity of a microkernel should be measured by the number of its system calls. I find it rather complex to multiplex a dozen flavours of IPC via one system call.
Marcus Brinkmann said:
The reason why I hope we will get away with it is that we are pursuing a model where every task is self-paged. Tasks get a quota of physical memory, which is guaranteed over a long time (the exact details of how to negotiate that number are not determined yet). There are many complications, but in the end this allows the client to make paging decisions itself, and thus it can wire down a region of memory effectively.
Sounds good, but there is a problem, and I think this just illustrates the point you were making: The physical memory server needs to be able to revoke arbitrary mappings temporarily, for example to make space for DMA buffer regions or to reorganize memory for super-page allocation. This means that the operation of the physical memory server is not transparent to the clients, and thus even to the server. (Only the kernel could make this operation transparent by an atomic copy-and-remap operation).
I assume that the available physical memory for a task (its working set?) will be entirely used for IPC. Therefore you might differentiate between pinned and pageable physical memory.
Greetings, Ron.
On Thu, 2005-02-24 at 09:15 +0100, Ronald Aigner wrote:
I disagree with the opinion that the complexity of a microkernel should be measured by the number of its system calls. I find it rather complex to multiplex a dozen flavours of IPC via one system call.
I am not sure quite what motivated this comment. We have argued in the EROS design that having exactly one system call ("invoke capability", in our case) is good, but not because it reduces microkernel complexity -- in fact, it complicates it.
Our rationale for having a common method of invocation in EROS came from two objectives:
1. We wanted it to be very obvious that every call was authorized by a capability, and for simplicity of reasoning about the evolution of security state we wanted this to be very regular and very obvious in the interface.
2. We wanted to be able to "virtualize" kernel services by having a process implement a "front end".
For example, KeyKOS (the predecessor to EROS) implemented exactly one kernel timer. This was multiplexed by user-level code. The interface of the user-level timer was *identical* to the interface of the kernel capability.
There is a substantial cost to this design decision: because the EROS "call" mechanism lets the "returnee" be specified explicitly, it is possible to invoke a kernel-implemented capability in such a way that the return goes to another process. This introduces quite a number of corner cases in the implementation. In fact, the EROS implementation has a three-tiered invocation implementation:
1. The assembly path, which handles interprocess invocations.
2. The fast C path, which handles the simple call/return case.
3. The general path, which deals with all of the corner cases.
One concern in Coyotos is that when we moved to an endpoint-based design, the "call" operation effectively went away, and it may no longer be possible to easily virtualize kernel services. We have not yet had a chance to look at this aspect of the design.
shap
[Jonathan S Shapiro]
On Thu, 2005-02-24 at 09:15 +0100, Ronald Aigner wrote:
I disagree with the opinion that the complexity of a microkernel should be measured by the number of its system calls. I find it rather complex to multiplex a dozen flavours of IPC via one system call.
I am not sure quite what motivated this comment. We have argued in the EROS design that having exactly one system call ("invoke capability", in our case) is good, but not because it reduces microkernel complexity -- in fact, it complicates it.
I would also like to add that the reason for multiplexing all these operations in the IPC mechanism is not really to reduce complexity. We multiplex all these operations because they are similar enough in nature to share the same code path, thereby reducing the cache/memory footprint of the kernel.
I don't think anyone has ever stated that the number of system calls is a measure for the *complexity* of the kernel.
eSk
On Feb 24, 2005, at 6:52 AM, Espen Skoglund wrote:
[Jonathan S Shapiro]
On Thu, 2005-02-24 at 09:15 +0100, Ronald Aigner wrote:
I disagree with the opinion that the complexity of a microkernel should be measured by the number of its system calls. I find it rather complex to multiplex a dozen flavours of IPC via one system call.
I am not sure quite what motivated this comment. We have argued in the EROS design that having exactly one system call ("invoke capability", in our case) is good, but not because it reduces microkernel complexity -- in fact, it complicates it.
I would also like to add that the reason for multiplexing all these operations in the IPC mechanism is not really to reduce complexity. We multiplex all these operations because they are similar enough in nature to share the same code path, thereby reducing the cache/memory footprint of the kernel.
I don't think anyone has ever stated that the number of system calls is a measure for the *complexity* of the kernel.
Just to chime in quickly here... At the most primitive levels of the kernel, everything can be multiplexed onto one system call. I don't think this would really "bother" me in trying to understand how to use the kernel, because it always seems that someone comes along and wraps that one system call in fairly lightweight abstractions that help demystify the heavy multiplexing.
In this way that perceived complexity can be reduced with little overhead.
Can't this sort of thing be "generated", either by macro or by IDL, to add an artificial abstraction layer that makes it clearer to the higher-level L4 coder exactly what functionality is being called upon?
I'm not sure I see how heavy multiplexing has to be a problem for anyone. If it's possible to do it all in one syscall, I'd go for it :). After all, increasing the complexity of the upper layers is traditionally what modern microkernels do, and once you have a really decent library of tools I bet a lot of the perceived complexity falls away. [maybe not so much as to make it as "easy" to code for as a monolithic kernel, but good enough :)]
Is this consistent with the current philosophy of microkernels?
Dave
eSk
[David Leimbach]
I'm not sure I see how heavy multiplexing has to be a problem for anyone. If it's possible to do it all in one syscall, I'd go for it :). After all, increasing the complexity of the upper layers is traditionally what modern microkernels do, and once you have a really decent library of tools I bet a lot of the perceived complexity falls away. [maybe not so much as to make it as "easy" to code for as a monolithic kernel, but good enough :)]
Is this consistent with the current philosophy of microkernels?
Our system call ABIs are typically optimized to hold as many relevant parameters as possible in registers (so as to avoid memory accesses). You could of course conceive of a solution where part of one register holds the "syscall type" and the remaining registers have semantics depending on this type. I don't see how this gains you anything at all (apart from consuming some register real estate and requiring a demultiplexer -- not really what I'd call gains). If you know that you are performing a completely different operation, you might as well jump directly to the routine that implements that operation.
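In code, the demultiplexer in question would look roughly like this (the type field layout and the handler names are invented for illustration):

    #include <stdint.h>

    #define SYSCALL_TYPE(r0) ((r0) >> 28)   /* type in the top 4 bits of r0 */

    extern long sys_ipc(uint32_t r0, uint32_t r1, uint32_t r2);
    extern long sys_unmap(uint32_t r0, uint32_t r1, uint32_t r2);
    extern long sys_threadctl(uint32_t r0, uint32_t r1, uint32_t r2);

    long syscall_entry(uint32_t r0, uint32_t r1, uint32_t r2)
    {
        /* This switch is the cost: part of r0 is consumed by the type
         * field, and every call pays for an extra dispatch that separate
         * entry points would avoid. */
        switch (SYSCALL_TYPE(r0)) {
        case 0:  return sys_ipc(r0, r1, r2);
        case 1:  return sys_unmap(r0, r1, r2);
        case 2:  return sys_threadctl(r0, r1, r2);
        default: return -1;
        }
    }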
As Jonathan pointed out, there can be other reasons for having a single syscall, though.
eSk
Multiplexing many operations onto a single syscall isn't the most efficient way to go about it. And yet many operating systems do just that: they use a register to encode the syscall number. Some even have to deal with conflicting allocations of syscall numbers: Darwin multiplexes both Mach and BSD system calls, and accomplishes this by splitting them between negative and positive numbers.
A bit of information can be encoded into the instruction stream, helping to avoid allocating precious processor registers. Thus if you jump straight to the routine that implements the system call, you can avoid allocating a register for a sub-function ID, and you can take advantage of the processor's branch prediction hardware that is based on instruction addresses.
IA64 easily supports system call implementations that jump straight to the routines' implementations. It has the epc instruction. But other architectures can mimic the same behavior. I do it on PowerPC, by using the sc instruction to raise the privilege level of the code. The user-level programs get to jump straight to the system calls' implementations, and the kernel raises the privilege as appropriate. Thus I avoid the typical table lookup and indirect function call. I even eliminate more branches by placing the C++ kernel code immediately after their assembler prologues.
-Josh
On Thu, 2005-02-24 at 02:08 +0100, Marcus Brinkmann wrote:
Marcus's approach of using a zero pager timeout in server->client sends implies that the client must have means to pin its receive area (a client-defined number of page frames) for an indefinite time (there may be no contract on how fast the server replies). This can induce a very real and urgent resource denial of service problem...
The reason why I hope we will get away with it is that we are pursuing a model where every task is self-paged. Tasks get a quota of physical memory, which is guaranteed over a long time (the exact details of how to negotiate that number are not determined yet)...
This would certainly look initially attractive, but let's dig in on it a little bit.
First, let us make the assumption that all string items returned from servers have a length that is known at client compile time -- the dynamically sized return strings need to be handled through other mechanisms anyway.
So: given a client's source, we can determine the longest reply string it will receive.
This number effectively sets a lower bound on the number of pages of that client that must be pinned in order for the client to receive these IPC replies. Note that "pin" is a commitment to *real* memory, not virtual memory.
I believe we will shortly conclude that the design effectively sets a hard limit on the number of client processes that can be effectively run.
The physical memory server needs to be able to revoke arbitrary mappings temporarily, for example to make space for DMA buffer regions or to reorganize memory for super-page allocation. This means that the operation of the physical memory server is not transparent to the clients, and thus even to the server. (Only the kernel could make this operation transparent by an atomic copy-and-remap operation).
So, yes, the whole notion is troublesome.
I don't like having a universally trusted paging agent either, but in 15 years of thinking about this in my spare cycles I haven't come up with anything better yet. :-)
In all current L4-derived systems, the region manager (or the software that plays the equivalent functional role) is necessarily trusted completely by all of the programs that it serves.
Well, we are at least trying. Whether it is feasible, we will see when we have implemented it...
Good luck. If my comments help to resolve what the critical collision of design objectives may be, and perhaps what degrees of design freedom might exist to resolve it, then they have perhaps achieved something useful.
It appears to me that this amounts to a special case of a more general problem: non-blocking reliable delivery of event notification. We have run into places in EROS where there is a serious need for such a mechanism, and we have been contemplating how to achieve this in Coyotos. We have now concluded that it should *not* be done in endpoints, because it is extremely desirable for endpoints to be stateless and pending events must be recorded somewhere. It is unclear at this point what alternative will emerge.
Yes. For me, such issues tend to crop up when I think about cancellation (for example due to signals). Due to lack of low-level support, we will have to do it with expensive and complicated high level constructs in the endpoints.
Since this is a separate topic, I shall reply separately with a new subject line.
shap
Ronald Aigner wrote:
<...>
To imitate the concurrent waiting for an IPC or timer event, one could set up a timer with the kernel and include the "timer source" in the receive scope of the wait IPC. Timer events can be dropped if the receiver is not ready. This would also simplify the tasks that use IPC timeouts as a timer event source.
I see one problem with this:
Scenario:
I use finite timeouts in a library emulating keyboard repeat behaviour (you press the key -> one press event is generated, you keep the key pressed -> after 250 ms another event is generated every 100 ms until you release the key).
This is implemented with IPC-Recv from an event thread with the timeouts mentioned.
Problem:
With IPC timeouts, the setup and start of the timer and the IPC operation are atomic in this scenario. If I had to set up and start the timer independently of the IPC, I could be interrupted and/or delayed in between unless other measures (e.g. delayed preemption) were taken.
There could be similar, more critical examples (although I think that a spongy keyboard is bad enough).
Greets, Martin
Martin Pohlack said:
Ronald Aigner wrote:
<...>
To imitate the concurrent waiting for an IPC or timer event, one could set up a timer with the kernel and include the "timer source" in the receive scope of the wait IPC. Timer events can be dropped if the receiver is not ready. This would also simplify the tasks that use IPC timeouts as a timer event source.
I see one problem with this:
Scenario:
I use finite timeouts in a library emulating keyboard repeat behaviour (you press the key -> one press event is generated, you keep the key pressed -> after 250 ms another event is generated every 100 ms until you release the key).
This is implemented with IPC-Recv from an event thread with the timeouts mentioned.
Problem:
With IPC timeouts, the setup and start of the timer and the IPC operation are atomic in this scenario. If I had to set up and start the timer independently of the IPC, I could be interrupted and/or delayed in between unless other measures (e.g. delayed preemption) were taken.
There could be similar, more critical examples (although I think that a spongy keyboard is bad enough).
Wrong assumption: you receive an IPC (key pressed). Then you do some calculation and decide to wait for the KEY_UP IPC with a timeout (which you have to calculate as 250 ms minus the time it took you from the IPC receive to the start of the next IPC). And who ensures that you are not interrupted between the calculation of the new timeout and the IPC? You might as well set up the (one-shot) timer in the meanwhile. If it fires and you're not there to get it (because you received the KEY_UP IPC before that), who cares: it's dropped.
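Spelled out as a sketch (using the same invented timer/IPC interface as in my first mail; is_key_down, is_key_up, and emit_press are hypothetical helpers):

    #include <stdint.h>

    typedef uint32_t source_t;

    /* Same invented interface as before, plus a one-shot timer: */
    extern source_t timer_oneshot(uint64_t us);   /* fires once; dropped if unreceived */
    extern void     timer_cancel(source_t t);
    extern source_t ipc_wait_any(void *msg, int block);

    extern int  is_key_down(const void *msg);
    extern int  is_key_up(const void *msg);
    extern void emit_press(void);

    void key_repeat_loop(source_t keyboard)
    {
        char msg[32];

        for (;;) {
            if (ipc_wait_any(msg, 1) != keyboard || !is_key_down(msg))
                continue;

            emit_press();                              /* the initial press event */
            source_t t = timer_oneshot(250 * 1000);    /* 250 ms until repeat starts */
            for (;;) {
                source_t who = ipc_wait_any(msg, 1);
                if (who == t) {
                    emit_press();                      /* repeat tick */
                    t = timer_oneshot(100 * 1000);     /* re-arm at the 100 ms rate */
                } else if (who == keyboard && is_key_up(msg)) {
                    timer_cancel(t);   /* a late firing would just be dropped */
                    break;
                }
            }
        }
    }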
Greetings, Ron.
Ronald Aigner wrote:
Martin Pohlack said:
Ronald Aigner wrote:
<...>
To imitate the concurrent waiting for an IPC or timer event, one could set up a timer with the kernel and include the "timer source" in the receive scope of the wait IPC. Timer events can be dropped if the receiver is not ready. This would also simplify the tasks that use IPC timeouts as a timer event source.
I see one problem with this:
Scenario:
I use finite timeouts in a library emulating keyboard repeat behaviour (you press the key -> one press event is generated, you keep the key pressed -> after 250 ms another event is generated every 100 ms until you release the key).
This is implemented with IPC-Recv from an event thread with the timeouts mentioned.
Problem:
With IPC timeouts, the setup and start of the timer and the IPC operation are atomic in this scenario. If I had to set up and start the timer independently of the IPC, I could be interrupted and/or delayed in between unless other measures (e.g. delayed preemption) were taken.
There could be similar, more critical examples (although I think that a spongy keyboard is bad enough).
Wrong assumption: you receive an IPC (key pressed). Then you do some calculation and decide to wait for the KEY_UP IPC with a timeout (which you have to calculate as 250 ms minus the time it took you from the IPC receive to the start of the next IPC). And who ensures that you are not interrupted between the calculation of the new timeout and the IPC?
I did not say that this solution is currently perfect, but it might easily be fixed by "simple delayed preemption" (aka cli/sti) between the two IPC operations (ca. 10 lines of simple C code).
You might as well set up the (one-shot) timer in the meanwhile. If it fires and you're not there to get it (because you received the KEY_UP IPC before that), who cares: it's dropped.
However, if I have to do another thing between both IPCs (like setting up the timer), more things can go wrong: I would probably have to enter the kernel to set up the timer, which, at least with current Fiasco and "simple delayed preemption", is not possible, as the interrupt flags are not preserved across kernel entries.
So, as a consequence, without IPC timeouts we need at least delayed preemption that is robust across kernel entries *and* fast, deterministic timer programming.
I'm not saying these are impossible but the consequences should be considered.
Greets, Martin
Just for your info (I don't have time to really get into the debate): our current thinking is that timeouts (besides poll and block-forever) should go away and be replaced with a time service (implemented in a user-level server or in the kernel, depending on the granularity required and the architecture; the API can remain the same for both).
Our reasons:
* Can (help) control the propagation of time in cases where it is desirable to restrict it.
* The only design patterns that emerged for timeouts were for generating timer events. Even for pagefault timeouts, most systems architected themselves such that zero or infinite could be used.
* No method emerged (even ideas?) that took advantage of finite timeouts in combination with IPC; the required analysis was always put in the too-hard basket, and even given it, it may not scale beyond very simple systems. There still might be an opportunity for someone clever here, but it is not on anyone's agenda that we are aware of.
* The loose timekeeping in L4, combined with the way timeouts are specified, means they are unusable in practice for precise timing. You can't specify a timeout (whether absolute or relative) without the potential for a pathological case delaying the IPC (the registering of the event) such that the timeout becomes incorrect, or for the specified time being usurped by a burst of interrupt handling. I have yet to see the analysis that demonstrates this can't happen (which implies that the requirements to avoid it have not been identified), and such an analysis probably won't scale beyond simple systems.
The previous two points basically agree with shap's comment about the lack of a robust method to set a finite time; and even with one, L4 does not do the appropriate accounting to give you the time you specify.
Summary: I'd argue strongly for absolute time events, specified absolutely, not relative to some current time. There is obvious motivation for absolute wall-clock events, but there may be cases for "absolute" events on a process virtual time scale. There may be cases for relative time events on a process virtual time scale, but I think relative events based on the wall clock (what we have now) are unusable. The combination with IPC seems unwarranted.
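To illustrate the difference (clock_now_us and the two receive variants are invented names, not any real L4 call):

    #include <stdint.h>

    extern uint64_t clock_now_us(void);
    extern int ipc_recv_rel(void *msg, uint64_t rel_us);      /* relative timeout (today) */
    extern int ipc_recv_abs(void *msg, uint64_t deadline_us); /* absolute event (proposed) */

    void relative_is_fragile(void *msg, uint64_t t_start)
    {
        /* Assumes the 250 ms deadline has not already passed. */
        uint64_t rel = 250 * 1000 - (clock_now_us() - t_start);
        /* If the thread is preempted HERE, 'rel' is already stale by the
         * time the kernel registers the timeout: the wait ends late. */
        ipc_recv_rel(msg, rel);
    }

    void absolute_is_robust(void *msg, uint64_t t_start)
    {
        uint64_t deadline = t_start + 250 * 1000;
        /* Preemption between these two lines does not move the deadline;
         * the kernel compares against the same instant either way. */
        ipc_recv_abs(msg, deadline);
    }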
Ronald Aigner <> scribbled on Thursday, 24 February 2005 4:29 AM:
(Background:) I am working with L4 in Dresden on real-time systems, which have become fairly complex lately. One question that has begun to bother me more and more is: What is the use of finite IPC timeouts?
Oops, email got out early. My apologies for the potentially loose wording, hopefully the points are clear enough (and I think others have captured similar points elsewhere).
- Kevin
Kevin Elphinstone <> scribbled on Thursday, 24 February 2005 9:32 AM:
[removed previous text]
[Kevin Elphinstone]
Summary: I'd argue strongly for absolute time events, specified absolutely, not relative to some current time. There is obvious motivation for absolute wall-clock events, but there may be cases for "absolute" events on a process virtual time scale. There may be cases for relative time events on a process virtual time scale, but I think relative events based on the wall clock (what we have now) are unusable. The combination with IPC seems unwarranted.
I must say that I'm inclined to agree with much of this. I have *very* loosely been playing with the idea of having time/scheduling domains that can somehow be managed using some form of map/unmap mechanism. A time domain could encapsulate a single thread, a program, a collection of cooperating applications, a whole subsystem, a virtual machine, etc.
Time events would be relative to a particular time domain. Wall clock time events would have to be specified within the root domain.
A reason why timeouts are currently specified within the IPC is that doing so enables a program to invoke an IPC and set a timeout in one atomic operation. There is no risk of being preempted between setting the timeout and invoking the IPC. Using time domains, this becomes a non-issue: time events (timeouts) are specified within a particular time domain, and if the thread is preempted before invoking the IPC, that will not cause time to pass within this domain (unless of course the next thread being scheduled consumes time from the same domain).
Time domains may also give the possibility of performing time slice donation in a more controlled manner. They *may* also raise the bar for creating time based covert channels.
What the exact semantics of time domains would be and whether such time domains can be implemented in an efficient manner is of course a completely different matter.
Anyhow, agreed, the manner in which we currently deal with many things related to time and scheduling in L4 is somewhat... er... lacking.
eSk
On Thu, 2005-02-24 at 12:58 +0100, Espen Skoglund wrote:
A reason why timeouts are currently specified within the IPC is that doing so enables a program to invoke an IPC and set a timeout in one atomic operation.
Just to expand on what Espen is saying: this issue is very real. In EROS, it is handled differently. An application that needs such a timeout has a "watchdog thread" (I am translating approximately into L4 terminology).
The timed IPC executes:

    watchMe := true        /* shared memory */
    doIPC()
    watchMe := false
    cancelTimer(wait-object)
The watchdog thread executes:

    call(wait-object, timeout)
    if (watchMe)
        "stun" the main thread
        if the main thread is in the receiving state,
            set an error code and advance the PC
        else
            do nothing
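In rough C, the same pattern looks like this; the constants and primitives are invented stand-ins for the EROS concepts named above, not its actual interface:

    #include <stdbool.h>

    /* Invented constants and primitives standing in for EROS concepts: */
    #define WAIT_OBJECT 1
    #define MAIN_THREAD 2
    #define TIMEOUT_MS  20

    extern void call_with_timeout(int wait_object, unsigned timeout_ms);
    extern void cancel_timer(int wait_object);
    extern void do_ipc(void);
    extern void stun_and_fail_receive(int thread); /* stun; if receiving, set error, advance PC */

    static volatile bool watch_me;   /* shared memory between the two threads */

    void timed_ipc(void)             /* the main thread */
    {
        watch_me = true;
        do_ipc();
        watch_me = false;
        cancel_timer(WAIT_OBJECT);
    }

    void watchdog(void)              /* the watchdog thread */
    {
        call_with_timeout(WAIT_OBJECT, TIMEOUT_MS);
        if (watch_me)
            stun_and_fail_receive(MAIN_THREAD);
        /* else: the IPC completed in time; nothing to do */
    }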
Yes, this is complex, but remember that this case is really very very rare.
I'm not sure if this is useful to the L4 discussion, but perhaps it will suggest other solutions that may have appeal.