Greetings, all!
I am beginning to explore nano/micro-kernel issues again after a long hiatus (24 years). I don’t know how active this list or its followers are, so I will be brief.
At the moment I am discussing kernel- vs. user-scheduling speed bumps on the now-prevalent multi-core CPU architectures, in the context of the ISO C11 subcommittee CPLEX, which is considering adding parallelization language extensions. I have already demonstrated feature equivalence using closures (Apple’s Blocks, which I introduced and have previously proposed for C11) and some new synchronization and scheduling primitives that I am already using as part of some Actor research.
I am looking for a non-GPL kernel/executive/nano-kernel to redo some nano-kernel work I did as part of a joint Bell Labs–Sun collaboration that gave rise to SVR4 and, in opposition, the Open Software Foundation. Essentially I built a capability-based nano-kernel somewhat along the lines of David Cheriton’s V Kernel (the Sun folks did “Spring”). My prototype at the time had an IPC transfer cost of about 10x a vtable dispatch. At the time I wrote my own IDL, but after joining NeXT we added such a concept as a first-class language feature, and it was copied into Java as “interfaces”. This is still an active interest of mine, also in the context of Actors.
Is this an appropriate list to discuss whether L4 is an appropriate starting point, and such issues as might arise from my exploration? (My friends and former colleagues at Apple have IP restrictions such that I would prefer not to risk even the appearance of possible improprieties in discussing such matters.)
Blaine Garst
On 02/14/2014 07:10 AM, Blaine Garst wrote:
I am looking for a non-GPL kernel/executive/nano-kernel [...]
Is there any specific reason why you would shy away from GPL code? The only non-GPL L4 kernel I can think of is Pistachio:
It works, but is not actively maintained anymore. If you need a userland for Pistachio, you can try Genode: http://genode.org/
Is this an appropriate list to discuss whether L4 is an appropriate starting point and such issues that might arise from my exploration?
Sure!
HTH Julian
On Feb 14, 2014, at 4:44 AM, Julian Stecklina jsteckli@os.inf.tu-dresden.de wrote:
On 02/14/2014 07:10 AM, Blaine Garst wrote:
I am looking for a non-GPL kernel/executive/nano-kernel [...]
Is there any specific reason why you would shy away from GPL code? The only non-GPL L4 kernel I can think of is Pistachio:
Yes, thank you, now it is clear why there wasn’t a single consistent license reference.
My interest in non-GPL is that, if successful, my explorations will be released and supported under yet another licensing arrangement: free for personal, non-monetary use in some higher-level software I’m cooking, and then possibly some revenue-generating for-profit spinoffs etc., such that GPL could not be used. Re-exporting changes under “dual BSD” would also be the friendly thing to do.
At first glance I suspect that my architectural work will improve L4 IPC times.
The premise is/was that threads don’t belong to address spaces but instead wander with the IPC from one address domain to another, carrying their arguments in registers. IPC is: trap, adjust mmu, proceed. If the IPC is carrying an IPC end-point, e.g. a capability, it’s a different trap and some bookkeeping must be done, but it can also be blindingly fast. The hard question is and was: well, if you don’t have a blocking thread waiting for the IPC, how do you manage all these spontaneous “up-calls”?
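As a toy model of that fast path (every name here is invented for illustration; a real kernel entry is assembly, and switch_mmu would load a hardware page-table root rather than flip a pointer):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the "trap, adjust mmu, proceed" fast path. All names are
 * invented; in a real kernel the entry is assembly and switch_mmu()
 * would write CR3/TTBR. */

typedef struct domain { uintptr_t page_table_root; } domain;

static domain client_dom, server_dom;
static domain *current = &client_dom;

static void switch_mmu(domain *d) {
    current = d;                      /* stands in for the MMU switch */
}

/* Arguments travel in registers with the migrating thread; the call is
 * nothing more than a domain switch plus a jump to the callee's entry. */
typedef uintptr_t (*entry_fn)(uintptr_t arg);

static uintptr_t ipc_call(domain *callee, entry_fn entry, uintptr_t arg) {
    domain *caller = current;
    switch_mmu(callee);               /* "adjust mmu" */
    uintptr_t ret = entry(arg);       /* "proceed" in the callee's domain */
    switch_mmu(caller);               /* reply path: switch back */
    return ret;
}

/* A sample server entry point living in server_dom. */
static uintptr_t double_it(uintptr_t arg) { return arg * 2; }
```

The capability-carrying variant would take the different trap and do its bookkeeping before the switch; this sketch shows only the no-checks path.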
My answer is found in my Actor Runtime.
But what about my suspicion? Seems like “swap thread registers system call” will alone be more expensive in both time (I avoid the need to do so) and space (since there are no nano-kernel threads waiting for activation). Anybody have my same hunch?
It works, but is not actively maintained anymore. If you need a userland for Pistachio, you can try Genode: http://genode.org/
Is this an appropriate list to discuss whether L4 is an appropriate starting point and such issues that might arise from my exploration?
Sure!
Okay, some deeper thoughts.
“After disastrous results in the early 90's, the microkernel approach now seems to be promising, although it still bears a lot of research risks.”
I’m curious as to what “results” were being referenced. I shifted out of kernel work around 1992 (after co-architecting the maxi kernel to end all maxi-kernels, SVR4, and being forced to abandon the nano-kernel that would have made it all better!).
At this point I can easily persuade people that the existing maxi-kernel notions of Threads are completely off-base - these were extended simulations of “multi-core” that unified and extended the fundamental notion of simulated “multi-processors” that is the basis for time-sharing by way of kernel timer-interrupts. Well, now we have multi-core for real but there is also fairly widespread agreement that we computer scientists and engineers have been caught flat-footed (to a large degree by Intel’s multi-decade just-make-one-core-faster marathon).
So I think it’s time yet again (for others) to re-think these notions, and I have some experience and ideas on the subject. I’m not keen on preaching, I prefer to build and show and let people learn by exploring.
Not that it’s multicore (yet), but has anyone in this community been exploring putting L4 on a Raspberry Pi?
Blaine Garst (former “Wizard of Runtimes”@Apple)
HTH Julian
l4-hackers mailing list l4-hackers@os.inf.tu-dresden.de http://os.inf.tu-dresden.de/mailman/listinfo/l4-hackers
Hi,
On Fri Feb 14, 2014 at 10:39:49 -0800, Blaine Garst wrote:
On Feb 14, 2014, at 4:44 AM, Julian Stecklina jsteckli@os.inf.tu-dresden.de wrote:
On 02/14/2014 07:10 AM, Blaine Garst wrote:
I am looking for a non-GPL kernel/executive/nano-kernel [...]
Is there any specific reason why you would shy away from GPL code? The only non-GPL L4 kernel I can think of is Pistachio:
Yes, thank you, now it is clear why there wasn’t a single consistent license reference.
My interest in non-GPL is that, if successful, my explorations will be released and supported under yet another licensing arrangement: free for personal, non-monetary use in some higher-level software I’m cooking, and then possibly some revenue-generating for-profit spinoffs etc., such that GPL could not be used. Re-exporting changes under “dual BSD” would also be the friendly thing to do.
At first glance I suspect that my architectural work will improve L4 IPC times.
The premise is/was that threads don’t belong to address spaces but instead wander with the IPC from one address domain to another, carrying their arguments in registers. IPC is: trap, adjust mmu, proceed. If the IPC is carrying an IPC end-point, e.g. a capability, it’s a different trap and some bookkeeping must be done, but it can also be blindingly fast. The hard question is and was: well, if you don’t have a blocking thread waiting for the IPC, how do you manage all these spontaneous “up-calls”?
My answer is found in my Actor Runtime.
But what about my suspicion? Seems like “swap thread registers system call” will alone be more expensive in both time (I avoid the need to do so) and space (since there are no nano-kernel threads waiting for activation). Anybody have my same hunch?
Overall, I think that current kernels do what's architecturally possible regarding IPC. So I'd assume there's not much to squeeze out further. And yes, it's not only about the good case but also for all the other cases where more handling is needed and thus checks on the way. I'd suggest to look at the existing kernels to see how they're doing.
“After disastrous results in the early 90's, the microkernel approach now seems to be promising, although it still bears a lot of research risks.”
I’m curious as to what “results” were being referenced.
1st-gen kernels, i.e. Mach.
I shifted out of kernel work around 1992 (after co-architecting the maxi kernel to end all maxi-kernels, SVR4, and being forced to abandon the nano-kernel that would have made it all better!).
At this point I can easily persuade people that the existing maxi-kernel notions of Threads are completely off-base - these were extended simulations of “multi-core” that unified and extended the fundamental notion of simulated “multi-processors” that is the basis for time-sharing by way of kernel timer-interrupts. Well, now we have multi-core for real but there is also fairly widespread agreement that we computer scientists and engineers have been caught flat-footed (to a large degree by Intel’s multi-decade just-make-one-core-faster marathon).
So I think it’s time yet again (for others) to re-think these notions, and I have some experience and ideas on the subject. I’m not keen on preaching, I prefer to build and show and let people learn by exploring.
Not that it’s multicore (yet), but has anyone in this community been exploring putting L4 on a Raspberry Pi?
Fiasco/L4Re is also running on the Raspberry.
Adam
But what about my suspicion? Seems like “swap thread registers system call” will alone be more expensive in both time (I avoid the need to do so) and space (since there are no nano-kernel threads waiting for activation). Anybody have my same hunch?
Overall, I think that current kernels do what's architecturally possible regarding IPC.
Sorry, but I disagree, because the software architecture is wrong.
Change the architecture and more speed is possible.
So I'd assume there's not much to squeeze out further. And yes, it's not only about the good case but also for all the other cases where more handling is needed and thus checks on the way.
Umm, I was suggesting that one has known-to-need-no-checks entries and known-to-need-checks entries.
For Apple’s generational conservative GC we overloaded the assignment operator to call a helper function. At runtime we page mapped in the appropriate helper function, and used a single instruction call.
For non-GC the implementation was “store; return”.
So bloody fast we couldn’t measure its cost.
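A rough sketch of that trick (all names are invented; this is not Apple’s actual runtime code): the compiler always emits a call to one assignment-helper symbol, and at launch the runtime maps in whichever implementation matches the process’s GC mode:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sketch of the overloaded-assignment trick. The compiler
 * emits a call to a single helper for every object-reference store; the
 * runtime page-maps in the implementation matching the GC mode, so the
 * non-GC case stays a single call to "store; return". */

typedef struct object {
    struct object *field;
    bool remembered;       /* stand-in for real remembered-set state */
} object;

/* Non-GC build of the helper: just "store; return". */
static void assign_nongc(object *dst, object *value) {
    dst->field = value;
}

/* GC build: the store also records the slot so the generational
 * collector knows to rescan it. */
static void assign_gc(object *dst, object *value) {
    dst->field = value;
    dst->remembered = true;
}

/* Function pointer standing in for the page-mapped entry point that the
 * single-instruction call would reach. */
static void (*objc_assign)(object *, object *) = assign_nongc;
```

The point of the anecdote carries over to IPC: if the no-checks path and the needs-checks path enter at different addresses, the common case pays for nothing it doesn’t use.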
I'd suggest to look at the existing kernels to see how they're doing.
“After disastrous results in the early 90's, the microkernel approach now seems to be promising, although it still bears a lot of research risks.”
I’m curious as to what “results” were being referenced.
1st-gen kernels, i.e. Mach.
I was hired at NeXT on the premise that Mach’s IPC wasn’t fast enough!
Not that its multicore (yet), but has anyone in this community been exploring putting L4 on a Raspberry Pi?
Fiasco/L4Re is also running on the Raspberry.
Thanks - I’ll peruse its licensing agreement!
Blaine
On 15 Feb 2014, at 2:04 pm, "Blaine Garst" blaine@mac.com wrote:
But what about my suspicion? Seems like “swap thread registers system call” will alone be more expensive in both time (I avoid the need to do so) and space (since there are no nano-kernel threads waiting for activation). Anybody have my same hunch?
Overall, I think that current kernels do what's architecturally possible regarding IPC.
Sorry, but I disagree, because the software architecture is wrong.
Change the architecture and more speed is possible.
That's quite an assertion. Sure - change the APIs, remove functionality, etc, and you'll get it down maybe. But it won't be an L4 system.
You do realize some implementations of L4 IPC are sub 50 cycles with full address space switch?? A lot has happened since 1992!! You've got a lot of reading (papers and code) to do.
My two cents: one of the big problems we have today with IPC latency in microkernels is all the nasty errata that production CPUs seem to have making what should be a fast operation orders of magnitude slower. Another big issue is misuse of IPC..
Perhaps focusing somehow on those would be more interesting.
So I'd assume there's not much to squeeze out further. And yes, it's not only about the good case but also for all the other cases where more handling is needed and thus checks on the way.
Umm, I was suggesting that one has known-to-need-no-checks entries and known-to-need-checks entries.
For Apple’s generational conservative GC we overloaded the assignment operator to call a helper function. At runtime we page mapped in the appropriate helper function, and used a single instruction call.
For non-GC the implementation was “store; return”.
So bloody fast we couldn’t measure its cost.
I'd suggest to look at the existing kernels to see how they're doing.
“After disastrous results in the early 90's, the microkernel approach now seems to be promising, although it still bears a lot of research risks.”
I’m curious as to what “results” were being referenced.
1st-gen kernels, i.e. Mach.
I was hired at NeXT on the premise that Mach’s IPC wasn’t fast enough!
Not that its multicore (yet), but has anyone in this community been exploring putting L4 on a Raspberry Pi?
Fiasco/L4Re is also running on the Raspberry.
Thanks - I’ll peruse its licensing agreement!
Blaine
On Feb 14, 2014, at 9:46 PM, Daniel Potts danielp@ok-labs.com wrote:
On 15 Feb 2014, at 2:04 pm, "Blaine Garst" blaine@mac.com wrote:
But what about my suspicion? Seems like “swap thread registers system call” will alone be more expensive in both time (I avoid the need to do so) and space (since there are no nano-kernel threads waiting for activation). Anybody have my same hunch?
Overall, I think that current kernels do what's architecturally possible regarding IPC.
Sorry, but I disagree, because the software architecture is wrong.
Change the architecture and more speed is possible.
That's quite an assertion. Sure - change the APIs, remove functionality, etc, and you'll get it down maybe. But it won't be an L4 system.
Understood, and regrettable considering all the Linux hosting work that has been done.
You do realize some implementations of L4 IPC are sub 50 cycles with full address space switch?? A lot has happened since 1992!! You've got a lot of reading (papers and code) to do.
My hunch is that I could make it 30. More fun to see if it’s true than reading papers about other people’s experiments. It’s a (validating) proof-of-concept exercise at this point.
My two cents: one of the big problems we have today with IPC latency in microkernels is all the nasty errata that production CPUs seem to have making what should be a fast operation orders of magnitude slower.
Yes. It’s not a mistake that the multi-cores are going to simpler, earlier designs: less surface area and less errata.
Orders of magnitude? (Got a paper I could read, beyond Ousterhout ’90?) So, I could be completely wrong in that the hardware is swamping what should otherwise be countable instructions and their contention cycles. At the time I was providing IPC at a 1-order-of-magnitude slowdown compared to a more typical 3 orders (a 10-instruction equivalent vs. 1000).
Another big issue is misuse of IPC..
Perhaps focusing somehow on those would be more interesting.
Well, correcting “misuses” in a systematic way is actually where I hope to get to!
Thanks for the input. I’ll see if I can find succor in reading the 2 BSD licensed kernel.
Blaine
On 15 Feb 2014, at 13:30 , Blaine Garst blaine@mac.com wrote:
Overall, I think that current kernels do what's architecturally possible regarding IPC.
Sorry, but I disagree, because the software architecture is wrong.
Change the architecture and more speed is possible.
Good luck with that! L4 IPC has been unbeaten for 20 years.
On 15 Feb 2014, at 16:46 , Daniel Potts danielp@ok-labs.com wrote:
You do realize some implementations of L4 IPC are sub 50 cycles with full address space switch?? A lot has happened since 1992!! You've got a lot of reading (papers and code) to do.
A good starting point would be Elphinstone & Heiser, From L3 to seL4 -- What Have We Learnt in 20 Years of L4 Microkernels?, SOSP 2013
Gernot
On Fri Feb 14, 2014 at 10:39:49 -0800, Blaine Garst wrote:
At first glance I suspect that my architectural work will improve L4 IPC times.
The premise is/was that threads don’t belong to address spaces but instead wander with the IPC from one address domain to another carrying their arguments in registers.
You’re talking about a migrating-threads model. Bryan Ford implemented that in Mach in the ‘90s [1], it improved Mach IPC (from a very low baseline), but still not even close to L4’s. (And note that they don’t compare to L4, bit of a benchmarking crime…) Pebble [2] was a from-scratch kernel using a migrating threads model, it got within 10% of L4 IPC performance but not better. More recently Gabe Parmer’s and Rich West’s Composite OS [3] tried the same, their IPC costs are also higher than L4’s.
IPC is a trap, adjust mmu, proceed. If the IPC is carrying an IPC end-point, e.g. a capability, its a different trap and some bookkeeping must be done, but it can also be blindingly fast. The hard question is and was, well, if you don’t have a blocking thread waiting for the IPC, how do you manage all these spontaneous “up-calls”.
You’ll find that it ain’t that easy. On the one hand, L4 IPC is designed to be little more than a context switch, so, as Adam says, there isn’t much to shave off. (In fact, about 10–15 years ago, when we were building Mungi on L4, some of my students argued that we should be moving to a kernel with a migrating threads model as this would map more efficiently onto Mungi’s migrating threads model. But when going through the operations that needed to be performed, no-one could show me how it would end up faster than using L4.)
On the other hand, you have to do considerably more than switching page tables. In particular, while logically the thread continues executing on its old stack, in reality that doesn’t work: the thread switches protection domains, and its old stack is no longer accessible. While logically, the whole stack moves between protection domains, in practice, this means that you need to provide a new stack on the fly. Obviously, the stack will be cached, so it can be re-used on a repeat call, but it isn’t as easy as only changing the page table.
And, there is no guarantee (except if you’re in a single-address-space OS like Mungi) that you actually *can* allocate a new stack where you need it: as you’re switching to a new AS, the address range used by the original stack might be in use by something else, which means you’re hosed.
Plus, maintaining a cache of stacks introduces resource-management policies into the kernel, in violation of microkernel principles.
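To make the stack-provision point concrete, here is a minimal sketch (all names hypothetical, nothing from a real kernel) of the per-domain stack cache such a kernel would need, with comments marking where resource-management policy leaks into the kernel:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-domain stack cache for migrating-thread IPC. */

#define STACK_CACHE_SLOTS 4

typedef struct pd {
    void *stack_cache[STACK_CACHE_SLOTS];   /* NULL = empty slot */
} pd;

/* On IPC entry, hand the migrating thread a cached callee-side stack. */
static void *acquire_stack(pd *d) {
    for (int i = 0; i < STACK_CACHE_SLOTS; i++) {
        if (d->stack_cache[i]) {
            void *s = d->stack_cache[i];
            d->stack_cache[i] = NULL;
            return s;
        }
    }
    return NULL;   /* cache empty: the kernel must now allocate, i.e.
                      apply a resource-management policy in-kernel */
}

/* On IPC return, put the stack back for re-use on a repeat call. */
static void release_stack(pd *d, void *s) {
    for (int i = 0; i < STACK_CACHE_SLOTS; i++) {
        if (!d->stack_cache[i]) {
            d->stack_cache[i] = s;
            return;
        }
    }
    /* cache full: the stack must be freed, another policy decision */
}
```

The sketch doesn’t model the address-range conflict described above: even with a cached stack available, the range the caller’s stack occupied may already be in use in the callee’s address space.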
Gernot
[1] Bryan Ford and Jay Lepreau, Evolving Mach 3.0 to a Migrating Thread Model, USENIX Winter, 1994
[2] Eran Gabber, Christopher Small, John Bruno, José Brustoloni and Avi Silberschatz, The Pebble Component-Based Operating System, Usenix’99
[3] Gabriel Parmer, Composite: A component-based operating system for predictable and dependable computing, PhD thesis, Boston University, 2009
Awesome! Thanks for the references.
The fact that L4 hasn’t been beaten is quite significant to me and so I’ll dig in further.
My intuition is that we have to turn our thinking upside down. There are no Threads. Have a few stacks for I/O interrupts in the kernel, do all stack scheduling in user space, and let IPC be as pure as trap, swap mmu, jump. If this model has been explored I’ll be fascinated to learn where it ran aground.
Again, thanks for the references; this is essential work which will take a long time to fully realize. I now know that L4 has set the bar, and I have some papers and code to peruse, and so I have refined my starting position. It’s a good result!
Blaine
On Feb 14, 2014, at 9:18 PM, Gernot Heiser gernot@unsw.edu.au wrote:
On Fri Feb 14, 2014 at 10:39:49 -0800, Blaine Garst wrote:
At first glance I suspect that my architectural work will improve L4 IPC times.
The premise is/was that threads don’t belong to address spaces but instead wander with the IPC from one address domain to another carrying their arguments in registers.
You’re talking about a migrating-threads model. Bryan Ford implemented that in Mach in the ‘90s [1], it improved Mach IPC (from a very low baseline), but still not even close to L4’s. (And note that they don’t compare to L4, bit of a benchmarking crime…) Pebble [2] was a from-scratch kernel using a migrating threads model, it got within 10% of L4 IPC performance but not better. More recently Gabe Parmer’s and Rich West’s Composite OS [3] tried the same, their IPC costs are also higher than L4’s.
IPC is a trap, adjust mmu, proceed. If the IPC is carrying an IPC end-point, e.g. a capability, its a different trap and some bookkeeping must be done, but it can also be blindingly fast. The hard question is and was, well, if you don’t have a blocking thread waiting for the IPC, how do you manage all these spontaneous “up-calls”.
You’ll find that it ain’t that easy. On the one hand, L4 IPC is designed to be little more than a context switch, so, as Adam says, there isn’t much to shave off. (In fact, about 10–15 years ago, when we were building Mungi on L4, some of my students argued that we should be moving to a kernel with a migrating threads model as this would map more efficiently onto Mungi’s migrating threads model. But when going through the operations that needed to be performed, no-one could show me how it would end up faster than using L4.)
On the other hand, you have to do considerably more than switching page tables. In particular, while logically the thread continues executing on its old stack, in reality that doesn’t work: the thread switches protection domains, and its old stack is no longer accessible. While logically, the whole stack moves between protection domains, in practice, this means that you need to provide a new stack on the fly. Obviously, the stack will be cached, so it can be re-used on a repeat call, but it isn’t as easy as only changing the page table.
And, there is no guarantee (except if you’re in a single-address-space OS like Mungi) that you actually *can* allocate a new stack where you need it: as you’re switching to a new AS, the address range used by the original stack might be in use by something else, which means you’re hosed.
Plus, maintaining a cache of stacks introduces resource-management policies into the kernel, in violation of microkernel principles.
Gernot
[1] Bryan Ford and Jay Lepreau, Evolving Mach 3.0 to a Migrating Thread Model, USENIX Winter, 1994
[2] Eran Gabber, Christopher Small, John Bruno, José Brustoloni and Avi Silberschatz, The Pebble Component-Based Operating System, Usenix’99
[3] Gabriel Parmer, Composite: A component-based operating system for predictable and dependable computing, PhD thesis, Boston University, 2009
On 16 Feb 2014, at 5:15 , Blaine Garst blaine@mac.com wrote:
My intuition is that we have to turn our thinking upside down. There are no Threads. Have a few stacks for I/O interrupts in the kernel, do all stack scheduling in user space, and let IPC be as pure as trap, swap mmu, jump. If this model has been explored I’ll be fascinated to learn where it ran aground.
Have a look at scheduler activations [1], and their use in K42 [2].
Gernot
[1] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska and Henry M. Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Threads, TOCS (10) 1992
[2] Orran Krieger, Marc Auslander, Bryan Rosenburg, Robert W. Wisniewski, Jimi Xenidis, Dilma Da Silva, Michal Ostrowski, Jonathan Appavoo, Maria Butrico, Mark Mergen, Amos Waterland and Volkmar Uhlig. K42: Building a Complete Operating System, EuroSys 2006
On 02/15/2014 07:15 PM, Blaine Garst wrote:
let IPC be as pure as trap, swap mmu, jump
Since IPC on L4-like kernels usually allows capability/rights delegation, it is not quite as simple, but when I look at NOVA's IPC path[1] it roughly fits your description, even with some form of migrating threads. Check out the original paper[2] and another paper that describes the design of the IPC system in more detail[3].
That being said, in practice IPC performance is not as important as it may initially seem.
Julian
[1] https://github.com/udosteinberg/NOVA/blob/master/src/syscall.cpp Starts at sys_call.
[2] https://os.inf.tu-dresden.de/papers_ps/steinberg_eurosys2010.pdf [3] https://os.inf.tu-dresden.de/papers_ps/ospert2010_steinberg_boettcher_kauer....
On Feb 16, 2014, at 3:25 PM, Julian Stecklina jsteckli@os.inf.tu-dresden.de wrote:
On 02/15/2014 07:15 PM, Blaine Garst wrote:
let IPC be as pure as trap, swap mmu, jump
Since IPC on L4-like kernels usually allows capability/rights delegation, it is not quite as simple, but when I look at NOVA's IPC path[1] it roughly fits your description, even with some form of migrating threads. Check out the original paper[2] and another paper that describes the design of the IPC system in more detail[3].
That being said, in practice IPC performance is not as important as it may initially seem.
My initial goal is indeed to eliminate kernel scheduling and measure that win; it happens to be the case that the user-land architecture for that is exactly what I didn’t finish in my prototype, but now have, and so the IPC wins of yester-decade again come to mind.
Minimal IPC times are a desirable side-effect for many reasons.
And thank you for the references!! There is so much to read its great to start with well-regarded work!
Blaine
Julian
[1] https://github.com/udosteinberg/NOVA/blob/master/src/syscall.cpp Starts at sys_call.
[2] https://os.inf.tu-dresden.de/papers_ps/steinberg_eurosys2010.pdf [3] https://os.inf.tu-dresden.de/papers_ps/ospert2010_steinberg_boettcher_kauer....