That is wrong. The direct lookup drastically reduces the cache and TLB footprint. For a full IPC we have to access two TCBs (which are virtually mapped and share their page with the stack), which costs two TLB entries. The complete lookup is then just a mask (plus perhaps a shift), a relative mov (e.g. mov 0xe0000000(%eax), %ebx), and a compare. The overall costs on IA-32 are therefore:
- 2 TLB entries (we need them anyway for the stack; they could be reduced to one by using 4M pages for all TCBs, but that would add an indirection table and therefore an extra cache line); a refetch costs ~80 cycles per entry
- a shift and a move (~3 cycles)
- 1 cache line for the thread id (which is shared with the thread state etc.)
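For concreteness, the mask-shift-load-compare lookup might be sketched in C as below; the base address, shift widths, and mask are assumptions for illustration, not L4's actual layout:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants, for illustration only: */
#define TCB_AREA_BASE   0xe0000000UL  /* virtually mapped TCB array   */
#define VERSION_BITS    10            /* low bits of the thread id    */
#define TCB_INDEX_MASK  0x3ffffUL     /* 18-bit thread number         */
#define TCB_SIZE_SHIFT  12            /* one 4K page per TCB          */

/* Mask + shift yields the TCB address directly; the IPC path then
 * performs one relative load of the id stored in the TCB and compares
 * it against the requested thread id to catch stale ids. */
static uintptr_t tcb_addr(uint32_t thread_id)
{
    uintptr_t index = (thread_id >> VERSION_BITS) & TCB_INDEX_MASK;
    return TCB_AREA_BASE + (index << TCB_SIZE_SHIFT);
}
```

Since the stack sits in the same page as the TCB, the load touches no TLB entry beyond the ones the IPC path needs anyway.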
Assume you add 2 more TLB entries and 5 more L2 cache lines: your after-costs for an IPC go up by 2*80 + 5*80 = 560 cycles. Set against an overall IPC cost of about 1000 cycles on a P4, with all those nasty cache and TLB flushes, you add an overhead of more than 50%.
The "but we need them anyway for the stack, they could be reduced to one TLB entry when using 4M pages for all TCBs, but that would add an indirection table and therefore a cache line" part seems very interesting, I've seen something about it in your microkernel presentations, so I guess you have done some speed measurements.
Now, if you gave each address space its own indirection table, you would have a thread space. Have you done measurements? What does this method cost?
-- Rudy Koot
On Mon, 2004-01-05 at 11:00, Rudy Koot wrote:
Now, if you gave each address space its own indirection table, you would have a thread space. Have you done measurements? What does this method cost?
Having thought about Volkmar's statement on cache misses, I actually don't believe that this is the right thing to do. Most clients invoke a small number of services, so a per-client space is not amortized well in the TCB, and the misses are likely to be expensive as a result.
I think that a better approach is to aggregate the descriptors in a global hash table of the form
(source-unique-ID, sender's-recipient-name, &target-PCB, &hash-next)
Convert the sender-supplied thread-id field into an opaque value that is matched against 'sender's-recipient-name'. The hash-table entry matches IFF
(source-unique-id == sender-id && requested-name == sender's-recipient-name)
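A minimal sketch of such a descriptor cache in C follows; every name here is mine, and the hash function and bucket count are arbitrary choices, not part of the proposal:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pcb;                        /* opaque kernel control block */

struct desc_entry {
    uint32_t source_unique_id;     /* identifies the sending client   */
    uint32_t recipient_name;       /* opaque name the sender supplied */
    struct pcb *target;            /* resolved target PCB             */
    struct desc_entry *hash_next;  /* bucket chain                    */
};

#define HASH_BUCKETS 256u

static unsigned bucket_of(uint32_t sid, uint32_t name)
{
    return (sid ^ (name * 2654435761u)) & (HASH_BUCKETS - 1);
}

/* An entry matches IFF both the sender's identity and the name it
 * used agree, exactly as in the match condition above. */
static struct pcb *resolve(struct desc_entry *table[HASH_BUCKETS],
                           uint32_t sender_id, uint32_t requested_name)
{
    struct desc_entry *e;
    for (e = table[bucket_of(sender_id, requested_name)];
         e != NULL; e = e->hash_next)
        if (e->source_unique_id == sender_id &&
            e->recipient_name == requested_name)
            return e->target;
    return NULL;  /* miss: take the slow path / thread fault handler */
}
```

A miss falls through to whatever slow-path resolution the kernel provides; on a hit, one chain walk replaces any per-client table lookup.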
I seem to recall that Trent's design used &source-PCB as the source-unique-ID. I would recommend using a field in the PCB for this instead. It is NOT likely to cause an extra cache miss (since it resides next to the registers, it is likely to be in-cache anyway), and it would allow all of a client's threads to share a common access policy, if desired, without creating pressure in the indirection table.
In order to support descriptor spaces, however, one would require a very efficient invalidation mechanism. One possibility is to add a version number in the sender's PCB and in the indirection table. The two version numbers must match for an entry to be valid, and the PCB's version number is incremented by the kernel whenever a thread is mapped into the recipient's thread space. This makes all of that recipient's cached descriptors temporarily invalid, and the thread fault handler can clean things up on the next invocation.
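The version-stamp scheme might look like this in C; the field and function names are illustrative, not from any existing kernel:

```c
#include <assert.h>
#include <stdint.h>

struct pcb {
    uint32_t desc_version;   /* bumped on every thread-space remap */
};

struct cached_desc {
    uint32_t version;        /* copy of pcb->desc_version at fill time */
    /* ... id, name, and target fields as in the hash entry ... */
};

/* A cached descriptor is valid only while its stamp matches the PCB. */
static int desc_valid(const struct cached_desc *d, const struct pcb *p)
{
    return d->version == p->desc_version;
}

/* Mapping a thread into the recipient's thread space: one increment
 * makes every descriptor cached for this PCB stale at once; a
 * user-mode fault handler revalidates lazily on the next invocation. */
static void invalidate_all(struct pcb *p)
{
    p->desc_version++;
}
```

One increment invalidates everything; no walk over the cache is ever needed on the invalidation path, which matches the hypothesis below that revalidation, not invalidation, can afford to be slow.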
The kernel, when performing descriptor mapping, would update the recipient descriptor mapping table, but is NOT responsible for updating the descriptor cache. It is only responsible for ensuring that the existing entries in the descriptor cache are efficiently invalidated. A global, user-mode fault handler (perhaps sigma-0) does on-demand revalidation of the cache entries.
Hypothesis: thread id mapping is a rare thing. The requirement is to do invalidation efficiently, not revalidation.
I'm sure that there is some better way to do this, but at the moment I do not see what it is. This is the best I can think of without spending some time at a whiteboard.
shap
l4-hackers@os.inf.tu-dresden.de