Hi,
I am trying to understand whether (and if so, why) unmap() in L4 architecture implementations is guaranteed to succeed. I hesitate to say I understood the description in Marcus Völp's "Design and Implementation of the Recursive Virtual Address Space Model for Small Scale Multiprocessor Systems", but at least I skimmed through it and have developed a sort of mental picture of the post-order node processing. Bernhard Kauer's "L4.sec Implementation" also refers to this work.
I understand there are substantial variations of L4 implementations, and I don't have a complete picture of which one uses which algorithm.
In any case, I am worried about the following scenario, and wonder if you can share your insight on whether this is a valid concern:
Let's say there is a mapping of the same page from A to B and from A to C. Now, unmap() proceeds in post-order, and deletes the mapping from A to C. Then it is preempted, and a restart point at B is noted. Next B maps the page to B', which updates the restart point to be at B'. Now, unmap() proceeds by deleting the mapping from B to B'. Again, the unmap operation is preempted, with the restart point being B. The cycle continues.
There are variations of this, where B' creates a mapping to B'', which creates a mapping to B''', etc., n times in total, and unmap() undoes the last k such mappings before it is preempted. As long as n >= k, the B's can create new mappings as fast as or faster than unmap() can delete them, so there doesn't seem to be a way for unmap() to succeed eventually.
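The scenario can be sketched as a toy simulation (this quantum model is my own simplification, not any kernel's actual accounting): each quantum, unmap() removes up to k leaf mappings in post-order, and on preemption the adversary extends the chain below the restart point by n fresh mappings.

```python
# Toy model of the livelock concern: unmap() removes at most k mappings
# per quantum; after each preemption the chain below the restart point
# grows by n new mappings. With n > k the chain never drains.
def rounds_to_finish(n, k, max_rounds=100):
    """Return the number of quanta until the chain is empty, or None."""
    chain = n  # mappings B -> B' -> ... below the restart point
    for r in range(1, max_rounds + 1):
        chain -= min(k, chain)   # unmap() removes up to k leaves
        if chain == 0:
            return r             # unmap() finished
        chain += n               # preempted: the chain is extended again
    return None                  # gave up: livelock in this model
```

In this toy model unmap() finishes immediately when n < k and never finishes once n > k, matching the concern above.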
I have the feeling that this is an obvious threat scenario and thus is probably addressed already, but I can't find where. I might be missing something obvious.
Thanks, Marcus
Hi Marcus,
in short: you are right. Both L4.sec and my solution have this problem. Fiasco does not. Details below.
Best regards
Marcus
Marcus Brinkmann wrote:
Hi,
I am trying to understand whether (and if so, why) unmap() in L4 architecture implementations is guaranteed to succeed. I hesitate to say I understood the description in Marcus Völp's "Design and Implementation of the Recursive Virtual Address Space Model for Small Scale Multiprocessor Systems", but at least I skimmed through it and have developed a sort of mental picture of the post-order node processing. Bernhard Kauer's "L4.sec Implementation" also refers to this work.
I understand there are substantial variations of L4 implementations, and I don't have a complete picture of which one uses which algorithm.
You are right, there are different implementations in the various kernels. To my knowledge none is currently using the algorithm I described in my diploma thesis.
In any case, I am worried about the following scenario, and wonder if you can share your insight on whether this is a valid concern:
Let's say there is a mapping of the same page from A to B and from A to C. Now, unmap() proceeds in post-order, and deletes the mapping from A to C. Then it is preempted, and a restart point at B is noted. Next B maps the page to B', which updates the restart point to be at B'. Now, unmap() proceeds by deleting the mapping from B to B'. Again, the unmap operation is preempted, with the restart point being B. The cycle continues.
In my approach the rights have to be removed in preorder (I don't remember whether this is already in the thesis). This way B cannot generate mappings that still have to be traversed by the unmapping thread, and thus the restart point remains at B (and not at B'). The mapping node itself must not be removed until it is passed in the post-order sweep. What may still happen is that while the pre-order sweep is processing node A, a concurrent map creates a new mapping B -> B', and then while the unmap processes B, a concurrent map creates B' -> B'', and so on. The consequence is that unless you restrict the mapping-tree size or play some nasty tricks with priorities, you cannot give an upper bound on the duration of unmap.
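As I understand the two-sweep scheme, it could be sketched like this (a toy model with invented names, not the thesis code): the pre-order sweep strips rights top-down, so a node that has already been visited can no longer hand out mappings the sweep would still have to traverse; a post-order sweep then removes the nodes themselves.

```python
# Toy sketch of pre-order rights revocation followed by post-order
# node removal. map_to() fails once a node's rights are revoked, so
# no new mappings appear below the sweep's frontier.
class Node:
    def __init__(self, name):
        self.name, self.rights, self.children = name, True, []

    def map_to(self, child):
        # Mapping only succeeds while the mapper still holds rights.
        if not self.rights:
            raise PermissionError(f"{self.name} has no rights to map")
        self.children.append(child)
        return child

def revoke_preorder(node):
    node.rights = False          # strip rights before descending
    for c in node.children:
        revoke_preorder(c)

def remove_postorder(node):
    for c in node.children:
        remove_postorder(c)
    node.children.clear()        # node removed only after its subtree
```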
L4.sec has the same problems; however, we use subtree helping locks instead of restart-point tracking to synchronize concurrent unmaps.
Fiasco implements a different algorithm which does not show the above problems. In Fiasco the entire mapping tree corresponding to a page is locked via a helping lock that must be taken both for mapping and for unmapping nodes in this tree. When splitting a superpage, additional locks are established for the smaller pages. The unmapping process takes and holds the lock corresponding to the root node of its mapping tree and all locks of smaller pages. To obtain this lock it may have to temporarily take locks corresponding to larger page sizes, but these are freed once the smaller-page locks are taken. As a consequence, once the lock is held, the tree size is fixed and the unmap operation will finish in bounded time. All other threads will first help out this unmap operation before they modify the tree. To obtain this lock, in the worst case all previously started operations must be helped to completion. This time is bounded as well but can potentially be very large.
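The key consequence can be illustrated with a toy (this is not Fiasco code; the names and the flat list standing in for the mapping tree are invented): once the unmapper holds the lock covering the whole tree, no concurrent map() can grow it, so the amount of unmap work is fixed at lock-acquisition time.

```python
import threading

# Toy illustration: with the tree lock held, the tree cannot grow,
# so the unmap work is bounded by the tree size when the lock is taken.
def unmap_with_lock(tree_lock, tree):
    with tree_lock:                   # from here on the tree is frozen
        size_at_lock = len(tree)
        removed = 0
        while tree:
            tree.pop()                # remove one mapping node
            removed += 1
        assert removed == size_at_lock  # work bounded at lock time
        return removed
```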
... I have the feeling that this is an obvious threat scenario and thus is probably addressed already, but I can't find where.
So far we have mostly ignored this problem as far as the mainstream implementations go. One of the reasons, I think, was that map and unmap were not used by real-time tasks, so their potentially unbounded execution time was not much of a problem. Volkmar's Beleg discusses some ideas on how to solve this problem.
From a security perspective this may really be a problem. Do you know how other systems solve it? EROS, e.g., implements immediate revocation by destruction of the intermediate objects. However, internally they also have a list of the address spaces in which a capability is mapped.
Best regards
Marcus
[Marcus Voelp]
I understand there are substantial variations of L4 implementations, and I don't have a complete picture of which one uses which algorithm.
You are right, there are different implementations in the various kernels. To my knowledge none is currently using the algorithm I described in my diploma thesis.
Just thought I'd chip in to describe how the new mapping database in Pistachio works. Modification of access rights occurs in top-down order using a depth-first traversal. Deletion of mappings or resetting of access-status bits occurs in the opposite order (i.e., a reversed depth-first traversal) after the end of the mapping tree has been reached.
The problems you describe would not apply to any Pistachio implementation, since locking (either of subtrees or of the whole mapping DB) ensures that there are no concurrent conflicting operations.
eSk
At Wed, 13 Dec 2006 11:51:52 +0100, Marcus Voelp voelp@os.inf.tu-dresden.de wrote:
in short: you are right. Both L4.sec and my solution have this problem. Fiasco does not. Details below.
Hi Marcus,
thanks for the detailed explanations, I find them very helpful. Also thanks to Espen for describing Pistachio and the new implementation. I somehow managed to miss the preorder sweep, but as you point out there is a corresponding problem there. I can see how locking would change the picture. Another concern is that the time is bounded in principle but depends on actions taken by subordinate processes (from a security point of view); however, my question was not about that, and I think the trade-offs there are sufficiently clear. More comments inline.
... I have the feeling that this is an obvious threat scenario and thus is probably addressed already, but I can't find where.
So far we have mostly ignored this problem as far as the mainstream implementations go. One of the reasons, I think, was that map and unmap were not used by real-time tasks, so their potentially unbounded execution time was not much of a problem. Volkmar's Beleg discusses some ideas on how to solve this problem.
Is some of that material online?
From a security perspective this may really be a problem. Do you know how other systems solve it? EROS, e.g., implements immediate revocation by destruction of the intermediate objects. However, internally they also have a list of the address spaces in which a capability is mapped.
Not really. "Removing an object" in EROS/KeyKOS means incrementing its allocation count, which immediately invalidates all keys pointing to the object, as long as those keys are on disk (recall we are talking about persistent systems). The story is a bit different for objects in memory ("prepared"): for this case there is an indirection table, and the indirection entry is invalidated. The rest is done at the next checkpoint by the scavenger.
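The allocation-count part of this can be sketched in a few lines (a toy model with invented names, not EROS source): every key records the allocation count of its target at creation time, and a key is live only while the counts still match, so a single increment revokes every outstanding key at once without touching any of them.

```python
# Toy model of allocation-count invalidation: rescinding an object
# bumps its count, which invalidates all keys minted earlier.
class Obj:
    def __init__(self):
        self.alloc_count = 0

    def rescind(self):
        self.alloc_count += 1    # all existing keys become stale

class Key:
    def __init__(self, obj):
        self.obj = obj
        self.count = obj.alloc_count  # snapshot at creation

    def valid(self):
        return self.count == self.obj.alloc_count
```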
The details are a bit complicated. The costs for keeping track of everything are swallowed up and thus amortized by the object cache magic. I am not sure, but it might be possible for a malicious process to try to screw with the checkpointing mechanism, for example by forcing early checkpoints and thus degrading the performance of the system; whether that is a possible threat scenario or just a paranoid feeling is difficult for me to assess at this point, and may be very difficult to analyse. Eventually you will run out of allocation counts, and then you can only recover object space by a garbage collection on the object store.
There is an article by Jonathan Shapiro about the dependency tracking on the EROS website:
http://www.eros-os.org/design-notes/DependencyMgmt/DependencyMgmt.html
Thanks, Marcus
Marcus Brinkmann wrote:
At Wed, 13 Dec 2006 11:51:52 +0100, voelp@os.inf.tu-dresden.de wrote:
From a security perspective this may really be a problem. Do you know how other systems solve it? EROS, e.g., implements immediate revocation by destruction of the intermediate objects. However, internally they also have a list of the address spaces in which a capability is mapped.
Not really. "Removing an object" in EROS/KeyKOS means incrementing its allocation count, which immediately invalidates all keys pointing to the object, as long as those keys are on disk (recall we are talking about persistent systems). The story is a bit different for objects in memory ("prepared"): for this case there is an indirection table, and the indirection entry is invalidated. The rest is done at the next checkpoint by the scavenger.
You are both correct here. Prepared capabilities are most often found in the address space definition tree of a process. Objects which have been cached in page tables must also invalidate the PTEs referencing them in addition to the allocation count dance described above. KeyKOS and its successors (EROS, Coyotos, and CapROS) use a hash table to track the mapping from object -> list of PTEs which were produced by that object. It's called the "depends table", and is the list Marcus Voelp mentioned. The depends table is analogous to the mapping database in L4.
The time needed to unmap an object's PTEs is bounded by the number of entries in a hash bucket of the depends table (33 [start... end] ranges in EROS). On overflow during a page fault, an existing mapping must be reclaimed, though it can be faulted in again later.
Malicious processes could exploit this overflow to degrade performance, and it formed part of an (overt) communication channel in the KeyKOS and EROS specifications (corrected in CapROS and Coyotos).
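The bounded-unmap property could be sketched like this (an illustrative toy, not EROS source; the bucket count and eviction policy are invented, only the 33-entry bucket capacity is from the description above): because all PTE ranges for an object live in one bounded bucket, invalidating an object touches at most a bucket's worth of entries, and overflow evicts an existing mapping.

```python
# Toy depends table: object -> PTE ranges, bounded bucket capacity.
BUCKET_CAPACITY = 33  # per-bucket bound mentioned for EROS

class DependsTable:
    def __init__(self, nbuckets=64):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, obj):
        return self.buckets[hash(obj) % len(self.buckets)]

    def record(self, obj, pte_range):
        """Track a PTE range produced by obj; return any evicted entry."""
        bucket = self._bucket(obj)
        evicted = None
        if len(bucket) >= BUCKET_CAPACITY:
            evicted = bucket.pop(0)   # overflow: reclaim an old mapping
        bucket.append((obj, pte_range))
        return evicted

    def invalidate(self, obj):
        """Drop obj's entries; cost bounded by the bucket size."""
        bucket = self._bucket(obj)
        victims = [e for e in bucket if e[0] == obj]
        bucket[:] = [e for e in bucket if e[0] != obj]
        return victims
```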
The details are a bit complicated. The costs for keeping track of everything are swallowed up and thus amortized by the object cache magic. I am not sure, but it might be possible for a malicious process to try to screw with the checkpointing mechanism, for example by forcing early checkpoints and thus degrading the performance of the system; whether that is a possible threat scenario or just a paranoid feeling is difficult for me to assess at this point, and may be very difficult to analyse. Eventually you will run out of allocation counts, and then you can only recover object space by a garbage collection on the object store.
Very few processes are given the capability to force a checkpoint.
There were designs for both offline and incremental GC of allocation counts, but none were implemented until a need came up - which never happened. With a 16-bit allocation count / 5 min. checkpoint interval (original KeyKOS), it takes over 7 months to wrap in the worst case. With 24 bit / 1 min., it takes almost 32 years.
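The wrap-around figures check out, assuming (as I read it) the count can advance at most once per checkpoint interval:

```python
# Worst-case time for an allocation count of the given width to wrap,
# if it can advance at most once per checkpoint interval.
def wrap_time_days(bits, interval_minutes):
    return (2 ** bits) * interval_minutes / (60 * 24)

# 16-bit count, 5-minute checkpoints (original KeyKOS):
# about 228 days, i.e. over 7 months.
# 24-bit count, 1-minute checkpoints:
# about 11651 days, i.e. almost 32 years.
```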
-Eric
At Fri, 22 Dec 2006 12:36:56 -0500, Eric Northup digitale@digitaleric.net wrote:
Marcus Brinkmann wrote:
At Wed, 13 Dec 2006 11:51:52 +0100, voelp@os.inf.tu-dresden.de wrote:
From a security perspective this may really be a problem. Do you know how other systems solve it? EROS, e.g., implements immediate revocation by destruction of the intermediate objects. However, internally they also have a list of the address spaces in which a capability is mapped.
Not really. "Removing an object" in EROS/KeyKOS means incrementing its allocation count, which immediately invalidates all keys pointing to the object, as long as those keys are on disk (recall we are talking about persistent systems). The story is a bit different for objects in memory ("prepared"): for this case there is an indirection table, and the indirection entry is invalidated. The rest is done at the next checkpoint by the scavenger.
You are both correct here. Prepared capabilities are most often found in the address space definition tree of a process. Objects which have been cached in page tables must also invalidate the PTEs referencing them in addition to the allocation count dance described above. KeyKOS and its successors (EROS, Coyotos, and CapROS) use a hash table to track the mapping from object -> list of PTEs which were produced by that object. It's called the "depends table", and is the list Marcus Voelp mentioned. The depends table is analogous to the mapping database in L4.
The time needed to unmap an object's PTEs is bounded by the number of entries in a hash bucket of the depends table (33 [start... end] ranges in EROS). On overflow during a page fault, an existing mapping must be reclaimed, though it can be faulted in again later.
Malicious processes could exploit this overflow to degrade performance, and it formed part of an (overt) communication channel in the KeyKOS and EROS specifications (corrected in CapROS and Coyotos).
Eric, thanks for correcting me and providing this information.
The details are a bit complicated. The costs for keeping track of everything are swallowed up and thus amortized by the object cache magic. I am not sure, but it might be possible for a malicious process to try to screw with the checkpointing mechanism, for example by forcing early checkpoints and thus degrading the performance of the system; whether that is a possible threat scenario or just a paranoid feeling is difficult for me to assess at this point, and may be very difficult to analyse. Eventually you will run out of allocation counts, and then you can only recover object space by a garbage collection on the object store.
Very few processes are given the capability to force a checkpoint.
I realize that, but my concern is not that they have a capability to force a checkpoint, but that they can force a checkpoint simply by exhausting the object table. I am sticking my neck out here, because I definitely don't have all the details. Is it feasible that there is a sequence of create and rescind operations that one or multiple cooperating programs could perform to overflow the object table that provides the indirection layer? What happens in that case? Also, what happens if this occurs while a checkpoint is already in progress?
There were designs for both offline and incremental GC of allocation counts, but none were implemented until a need came up - which never happened. With a 16-bit allocation count / 5 min. checkpoint interval (original KeyKOS), it takes over 7 months to wrap in the worst case. With 24 bit / 1 min., it takes almost 32 years.
I agree that we can ignore the allocation count (for now :)
Thanks, Marcus
Marcus Brinkmann wrote:
... I have the feeling that this is an obvious threat scenario and thus is probably addressed already, but I can't find where.
So far we have mostly ignored this problem as far as the mainstream implementations go. One of the reasons, I think, was that map and unmap were not used by real-time tasks, so their potentially unbounded execution time was not much of a problem. Volkmar's Beleg discusses some ideas on how to solve this problem.
Is some of that material online?
Yes, but unfortunately only in German:
http://os.inf.tu-dresden.de/papers_ps/micro_kernel_memory_management.pdf
Best regards
Marcus
l4-hackers@os.inf.tu-dresden.de