RE: IPC/Capabilities Overview

1 Jan 2004

      On Wed, 2003-12-31 at 10:25, Volkmar Uhlig wrote:
...
...
-----Original Message-----

L4 IPC speed has relatively little to do with the "direct lookup"
aspect of thread ids. Any indirect encoding will carry a cost, but in
terms of the overall performance of IPC this cost is quite small.
That is wrong.  The direct lookup drastically reduces cache and TLB
footprint.  For a full IPC we have to access two TCBs (which are
virtually mapped and have the stack in the same page) which costs two
TLB entries.  The complete lookup is therefore a simple mask (plus maybe
a shift), a relative mov (e.g. mov 0xe0000000(%eax), %ebx) and a
compare.  Overall costs therefore (on IA32):

- 2 TLB entries (but we need them anyway for the stack; they could be
  reduced to one TLB entry when using 4M pages for all TCBs, but that
  would add an indirection table and therefore a cache line); refetch
  costs ~80 cycles/entry
- shift and move (~3 cycles)
- 1 cache line for the thread id (which is shared with thread state
  etc).
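
The lookup sequence described above (shift, mask, one relative load, one compare) can be sketched in C roughly as follows. This is only an illustration: the bit layout (`VERSION_BITS`), the `tcb_t` structure, and the array `tcb_area` standing in for the virtually mapped TCB region are assumptions made for the example, not the actual L4 encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the upper bits of a thread id select the TCB and
 * the low VERSION_BITS hold a version. Not the real L4 encoding. */
#define VERSION_BITS 14u
#define TCB_COUNT    64u   /* kept small for the sketch; power of two */

typedef struct tcb {
    uint32_t thread_id;   /* full id; shares a cache line with state */
    uint32_t state;
} tcb_t;

/* Stands in for the virtually mapped TCB region (the 0xe0000000 base in
 * the quoted mov instruction). */
tcb_t tcb_area[TCB_COUNT];

/* Shift, mask, one load, one compare: no table walk is involved. */
tcb_t *lookup_tcb(uint32_t tid)
{
    uint32_t index = (tid >> VERSION_BITS) & (TCB_COUNT - 1u);
    tcb_t *tcb = &tcb_area[index];
    return (tcb->thread_id == tid) ? tcb : NULL;
}
```

The only memory touched is the TCB itself, which is the basis of the footprint claim being made here.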
This part of your argument I understand. I don't agree with it, because
you aren't measuring the right times, but I understand the argument that
you are making.
You aren't measuring the right times because you fail to consider
application-level costs that are imposed by deficiencies in the kernel
layer interface. The end to end time is the important time, and this
must include mandated application-level costs.
...
Assume you add 2 more TLB entries and 5 more L2 cache lines--your
aftercosts for IPC go up by 2*80 + 5*80 = 560 cycles.
Considering overall IPC costs of 1000 cycles on a P4 with all those
nasty cache and TLB flushes you add an overhead of >50%.
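
The arithmetic in this estimate can be written out as a small C sketch. The 80-cycle refetch cost and the 1000-cycle baseline are the figures quoted above; the function names are hypothetical and exist only for this illustration.

```c
/* Extra refetch cost in cycles, using the ~80 cycles per TLB entry and
 * per L2 cache line assumed in the estimate above. */
unsigned extra_ipc_cycles(unsigned tlb_entries, unsigned cache_lines)
{
    const unsigned refetch_cycles = 80u;
    return (tlb_entries + cache_lines) * refetch_cycles;
}

/* Overhead relative to a baseline IPC cost, in whole percent
 * (integer division, so the result is rounded down). */
unsigned overhead_percent(unsigned extra_cycles, unsigned baseline_cycles)
{
    return (extra_cycles * 100u) / baseline_cycles;
}
```

With 2 TLB entries and 5 cache lines, `extra_ipc_cycles(2, 5)` gives 560 cycles, and `overhead_percent(560, 1000)` gives 56, which matches the ">50%" figure.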
But why would one wish to assume this? First, the implementation can be
better than what you propose, and second, the relevant entries are
frequently resident in practice.
In any case, these references are unavoidable. The only question is
whether you will do them at user level or kernel level.
...
...

Security checks done in the server vs. the kernel are not necessarily
slower or faster. It depends greatly on what security checks you wish
to do. I argue that:
a) *many* (not all) of the security checks currently done in L4
      servers could be eliminated if kernel-protected bits existed
      in each descriptor
b) For some types of systems (capability systems), disclosure of
      the sender id is an absolute violation of design requirements,
      so any microkernel that relies exclusively on server-side
      security checks based on sender-id is not a universal
      microkernel.
c) More specifically, any microkernel that requires checks based
      on sender-id is entirely unsuitable as a platform for EROS-NG.
And as you stated this is _also_ a limited view, because you only look
from the EROS and capability point of view.
You clearly did not read what I wrote. I am certainly considering EROS
as a candidate application, but I also considered the performance of
current L4 servers.
...
By throwing away one
register for an identifier you reduce your register real estate by 33%.
I agree that register-based transfer on the x86 is severely restricted.
This is why K42, and more recently EROS, are abandoning it in favor of a
better model.
...
And the argument that everything should go in memory (one of your last
emails) is not convincing--register-based IPC is still much faster, it
is mostly a question of a reasonable IDL compiler.
Perhaps this is so, but since this is not a correct description of what
I have proposed, I am not convinced by this argument.
...
And we are all aware
that IA32 is crippled from that perspective.  Take any other
architecture (worst case: IA64) and the argument becomes completely bogus.
On other architectures, register-based transfer may remain viable. IA64
is *not* one of them.
...
...
Ignoring clans and chiefs (which we all agree is too expensive and
inflexible), here is how the three schemes break down:
Thread IDs:
   No restriction who can send.
   Server makes decisions based on sender-id
You forgot sender restriction and redirection.
Yes, I did, because these do not appear in the L4 interface definition.
Would you be kind enough to describe these mechanisms in adequate detail
for me to understand them?
...
...
Hybrid:
   Sender can only invoke a thread descriptor that is mapped in their
      thread descriptor space (thread space)
   Server makes decisions based on either (a) a field that is encoded
      within the descriptor, or (b) the sender-id.
** Sender-id is software controlled by the thread manager, and can
      be set to zero for all threads to simulate capability behavior.
A possibility you did not mention is a hybrid thread id which has a
thread part and a descriptor part.  The descriptor is kernel enforced.  That
is what we currently have in V4 (please note that V2 and X.0 are
completely outdated!!!)--the version part of the thread id.
Where can I get a copy of the V4 specification? There is no benefit in
wasting your time (or others').
...
...
In the thread-IDs design, there are two distinguished phases in the
server-side security checks:

1. Object resolution. Based on sender-id and arguments, determines
   the identity and permissions of the server-implemented object
   that has been invoked. This phase may conclude that no such
   object exists.

2. Permissions check. Given the object identity and permissions,
   make a decision about whether the particular operation is to be
   permitted.

All of the bits needed for phase 2 can be encoded in the 
descriptor. All of the expensive parts of the current L4 protocol 
lie in phase 1 (object resolution).
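
The two phases can be sketched in C as follows. This is a minimal illustration under stated assumptions: the `binding_t` table, the linear search, and the `SRV_PERM_*` bits are hypothetical and are not taken from any L4 server.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission bits for the sketch. */
#define SRV_PERM_READ  0x1u
#define SRV_PERM_WRITE 0x2u

/* One entry per (sender, object name) pair known to the server. */
typedef struct {
    uint32_t sender_id;   /* who may name this object */
    uint32_t object_arg;  /* object name supplied in the message */
    uint32_t perms;       /* SRV_PERM_* bits granted to that sender */
} binding_t;

/* Phase 1: object resolution. A linear search is used for brevity; a
 * real server would use a hash table or tree, which is where the cost
 * lies. Returns NULL if no such object exists for this sender. */
const binding_t *resolve_object(const binding_t *table, size_t n,
                                uint32_t sender_id, uint32_t object_arg)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i].sender_id == sender_id &&
            table[i].object_arg == object_arg)
            return &table[i];
    }
    return NULL;
}

/* Phase 2: permissions check. Only needs the bits found in phase 1. */
int operation_permitted(const binding_t *b, uint32_t required)
{
    return b != NULL && (b->perms & required) == required;
}
```

Phase 2 reduces to a mask and compare on a few bits, which is why those bits are candidates for encoding in the descriptor.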
See above.  Furthermore, your suggestion is to move that part into the
kernel... Then the overhead is on _every_ invocation, not just for the
ones where you need it.
I do not agree, because the object resolution phase is largely
eliminated by placing it in the kernel.
However, you are once again arguing about kernel times and not end to
end times.
...
...
...
Now I want to make clear why capabilities are much better than virtual
thread objects:

The extra word does not seem to decrease performance in any way (is this
true?) so it is a free feature, that can be used but doesn't have to be.
I believe that this is true, and the evidence of the EROS 
implementation seems to support this view.
Where are benchmarks with cold caches?  Do you have a detailed analysis
of the cache and TLB footprint?
The cold cache numbers aren't significantly different from L4 in the
current EROS implementations, but there is an important caveat: EROS
does not implement thread address spaces. We instead use
software-implemented capability registers, and these retain all of the
locality properties that you cite for thread-ids in L4.
I concur that thread spaces involve indirection overheads, and I am
concerned about this. Enough so that I am considering introducing
distinct invocation patterns to avoid the cost in the fast path. I do,
however, want to look at hashing tricks before I adopt that approach.
The big overhead in EROS relative to L4 comes in the capability copy
path, because of the need to update the doubly linked lists of
capabilities. The problem is that the "neighbor" capabilities are rarely
in cache, so this leads to cache misses. We have a design for
eliminating this that I am in the process of implementing, but no
numbers yet.
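
To show why the neighbor capabilities matter, here is a generic C sketch of copying an entry that lives on a circular doubly linked chain. It is not EROS code: the `cap_t` layout and the function names are assumptions made for the illustration.

```c
/* Hypothetical capability record linked into a circular chain of all
 * capabilities that name the same object. */
typedef struct cap {
    struct cap *prev;
    struct cap *next;
    void       *object;
} cap_t;

/* A capability alone on its chain points at itself. */
void cap_init(cap_t *c, void *object)
{
    c->prev = c;
    c->next = c;
    c->object = object;
}

/* Copy src into dst and link dst right after src. Besides src and dst,
 * this must write to the neighbor, which may not be in cache. */
void cap_copy(cap_t *dst, cap_t *src)
{
    cap_t *neighbor = src->next;  /* load from src */
    dst->object = src->object;
    dst->prev = src;
    dst->next = neighbor;
    src->next = dst;
    neighbor->prev = dst;         /* store to the neighbor's cache line */
}
```

The store through `neighbor->prev` is the access that touches a third cache line on every copy.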
...
...
This is a possible usage. In our experience, the more common behavior is
to have a pointer to some data structure that describes a server
implemented object (i.e. has nothing to do with any particular client),
and reuse the low bits for permissions. For example, the pointer might
point to a file metadata structure, and the low bits might indicate
read and write permissions.
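
A minimal C sketch of this pointer-plus-low-bits encoding is shown below. The `file_meta_t` structure, the `TAG_*` names, and the two-bit width are assumptions for the example. It also assumes that converting a pointer to `uintptr_t` preserves the alignment bits, which holds on mainstream platforms but is implementation-defined in C.

```c
#include <stdint.h>

/* Hypothetical permission bits stored in the low bits of the word. */
#define TAG_READ  ((uintptr_t)0x1)
#define TAG_WRITE ((uintptr_t)0x2)
#define TAG_MASK  ((uintptr_t)0x3)

/* Hypothetical server-side object. Its alignment is at least 4 bytes,
 * so the low two bits of its address are always zero. */
typedef struct file_meta {
    uint32_t size;
    uint32_t flags;
} file_meta_t;

/* Pack an object pointer and permission bits into one word. */
uintptr_t make_descriptor_word(file_meta_t *meta, uintptr_t perms)
{
    return (uintptr_t)meta | (perms & TAG_MASK);
}

/* Recover the object pointer by clearing the permission bits. */
file_meta_t *descriptor_object(uintptr_t word)
{
    return (file_meta_t *)(word & ~TAG_MASK);
}

/* Recover the permission bits. */
uintptr_t descriptor_perms(uintptr_t word)
{
    return word & TAG_MASK;
}
```

On invocation the server needs no lookup at all: one mask yields the object and another yields the permissions.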
In our case we use an object identifier in the message, which is a
handle to an object descriptor and do a reverse check (i.e. if the
thread is allowed to invoke that object).  Costs: boundary check (can be
a simple AND) one MOV and a CMP.  The permissions can go into the same
cache line.
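
The handle-based reverse check described here can be sketched in C as follows. The table size, the `obj_desc_t` layout, and the simplification of one permitted thread per descriptor are assumptions for the illustration, not a description of any actual server.

```c
#include <stddef.h>
#include <stdint.h>

/* Power-of-two table size so the boundary check is a single AND. */
#define OBJ_TABLE_SIZE 256u

/* Hypothetical object descriptor. The permitted thread and the
 * permission bits sit next to each other, so they can share a cache
 * line. */
typedef struct {
    uint32_t allowed_thread;  /* thread permitted to invoke the object */
    uint32_t perms;
} obj_desc_t;

obj_desc_t obj_table[OBJ_TABLE_SIZE];

/* Reverse check: map the handle from the message to a descriptor and
 * verify that the sender may invoke it. Returns NULL on failure. */
const obj_desc_t *check_handle(uint32_t handle, uint32_t sender_id)
{
    /* AND for the boundary check, then one load of the descriptor. */
    const obj_desc_t *d = &obj_table[handle & (OBJ_TABLE_SIZE - 1u)];
    /* One compare against the sender id. */
    return (d->allowed_thread == sender_id) ? d : NULL;
}
```

Note that this check depends on the sender id being visible to the server, which is the property under dispute in this thread.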
That may deserve some more careful thought on my part. From the EROS
perspective, the problem is that the resulting descriptor cannot be
selectively rescinded, but I'm not sure that EROS does this very well
either.
shap
