Hello,
I am trying to compare the IPC performance of Pistachio and Fiasco.OC. I found the performance to be almost the same for the Pingpong benchmark (I modified the one in Pistachio so that it can also run on Fiasco.OC). I noticed that if I turn off the "tracebuffer" and "perfmon counter" switches when compiling the Pistachio kernel, Pistachio's IPC is much faster -- about a 150x speedup for Pingpong! I tried to do the same thing for Fiasco.OC by turning off debugging-related switches, but I was not able to observe any performance improvement. I wonder what config options can dramatically affect IPC performance for Fiasco.OC.
Also, I would appreciate it if anyone could explain to me the major IPC implementation differences.
Thanks,
Chen
On Thu Feb 03, 2011 at 15:47:08 -0800, Chen Tian wrote:
> I am trying to compare the IPC performance of Pistachio and Fiasco.OC. I found the performance to be almost the same for the Pingpong benchmark (I modified the one in Pistachio so that it can also run on Fiasco.OC). I noticed that if I turn off the "tracebuffer" and "perfmon counter" switches when compiling the Pistachio kernel, Pistachio's IPC is much faster -- about a 150x speedup for Pingpong!
I can't believe that those options cause such a tremendous difference.
> I tried to do the same thing for Fiasco.OC by turning off debugging-related switches, but I was not able to observe any performance improvement. I wonder what config options can dramatically affect IPC performance for Fiasco.OC.
When Fiasco boots, the enabled (debug) options scroll by in red; turning those off would be good.
> Also, I would appreciate it if anyone could explain to me the major IPC implementation differences.
Could you elaborate a bit more? Overall they are both doing the same thing.
Adam
What are the actual figures, and on what processor? Practical limits on IPC round-trip times (i.e. ping-pong) depend on the architecture.
On x86 it is in the range of 500-5000 cycles depending on the microarchitecture (worst on Pentium-4, better on more recent implementations, and historically better on AMD than on Intel processors, although that might have changed since I last looked at it in detail). Pistachio IPC performance used to be fairly optimal, but probably hasn't been maintained, so may not show the full benefit of the more recent microarchitectures. But if you're seeing more than 2000 cycles round-trip on a recent x86 processor you're a fair bit away from optimal.
For comparison, on ARM we're seeing about 300-500 cycles depending on architecture version and core implementation.
Gernot
Thanks for the reply.
The architecture is x86. For Pistachio, I got ~2000 cycles for a round-trip IPC with all kernel debug features disabled. With kernel debug features (i.e. tracebuffer etc.) enabled, the number is about 300K cycles.
For Fiasco.OC, the number is also around 300K cycles regardless of whether the JDB debugging switch is on or off.
All these numbers were obtained from runs inside VirtualBox (4-core processor).
In the pingpong benchmark (found in the Pistachio userland), I replaced the L4_Ipc call with a send/wait pair: the ping thread does L4_Send/L4_Wait, and the pong thread does L4_Wait/L4_Send. For Fiasco.OC, these calls become l4_ipc_send and l4_ipc_wait. I don't understand why my pingpong program shows such slow IPCs.
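Roughly, the modified inner loops look like the sketch below. This is reconstructed from the description above using the L4Ka::Pistachio convenience API; header names, message setup and thread-ID plumbing are simplified and may differ from the actual benchmark code.

#include <l4/types.h>
#include <l4/ipc.h>        /* L4_Send, L4_Wait */
#include <l4/message.h>    /* L4_LoadMR */

/* Thread IDs are assumed to be set up by the benchmark's init code. */
extern L4_ThreadId_t ping_tid, pong_tid;

static void ping_loop(int rounds)
{
    L4_ThreadId_t from;
    for (int i = 0; i < rounds; i++) {
        L4_LoadMR(0, 0);       /* MR0 = empty message tag (0 untyped, 0 typed) */
        L4_Send(pong_tid);     /* send phase, one kernel entry */
        L4_Wait(&from);        /* receive phase, another kernel entry */
    }
}

static void pong_loop(int rounds)
{
    L4_ThreadId_t from;
    for (int i = 0; i < rounds; i++) {
        L4_Wait(&from);        /* wait for ping */
        L4_LoadMR(0, 0);
        L4_Send(ping_tid);     /* reply to ping */
    }
}

Note that, compared with the original combined send-and-receive, each direction here takes two separate kernel entries, which adds some overhead of its own, though nowhere near the factors discussed in this thread.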
Chen
[Apologies for top-quote in previous mail.]
On 05/02/2011, at 12:01 , Chen Tian wrote:
> The architecture is x86. For Pistachio, I got ~2000 cycles for a round-trip IPC with all kernel debug features disabled.
That makes sense. May not be optimal, but not too far out.
> With kernel debug features (i.e. tracebuffer etc.) enabled, the number is about 300K cycles.
Makes sense too.
> For Fiasco.OC, the number is also around 300K cycles regardless of whether the JDB debugging switch is on or off.
Clearly way too high. Can't comment on Fiasco, though.
Gernot
On Fri, 4 Feb 2011 17:01:59 -0800 Chen Tian (CT) wrote:
CT> For Fiasco.OC, the number is also around 300K cycles regardless of whether
CT> the JDB debugging switch is on or off.
CT>
CT> All these numbers were obtained from runs inside VirtualBox (4-core processor).
So that's why. Try running your benchmarks on bare hardware, not in a VM.
Cheers,
- Udo
Well, I did run it on a real machine (a dual-core processor with hyper-threading). It takes more than one million cycles for a one-way IPC. It seems the numbers I'm getting are unusual. Do you think using affinity could affect the results? I pinned the ping thread and the pong thread to different cores by setting their affinities before they started making the send/wait IPC calls. I am not sure if I did something wrong there.
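(For reference, the cycles-per-round-trip figures in this thread are normally obtained by reading the TSC around a batch of iterations. Below is a minimal sketch of such a measurement, not the benchmark's actual timing code, reusing the ping_loop() sketched earlier:)

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>     /* __rdtsc() on GCC/Clang */

extern void ping_loop(int rounds);   /* the loop sketched earlier in this thread */

static void measure_roundtrip(int rounds)
{
    ping_loop(1000);                 /* warm-up, so cold caches/TLBs don't skew the average */

    uint64_t start = __rdtsc();
    ping_loop(rounds);
    uint64_t end = __rdtsc();

    printf("%llu cycles per round trip\n",
           (unsigned long long)((end - start) / (uint64_t)rounds));
}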
-Chen
On 06/02/2011, at 10:36 , Chen Tian wrote:
> Well, I did run it on a real machine (a dual-core processor with hyper-threading). It takes more than one million cycles for a one-way IPC. It seems the numbers I'm getting are unusual. Do you think using affinity could affect the results? I pinned the ping thread and the pong thread to different cores by setting their affinities before they started making the send/wait IPC calls. I am not sure if I did something wrong there.
Cross-core IPC is more expensive than core-local IPC, as cache lines must be migrated and, depending on how it's implemented, you may get inter-processor interrupts (IPIs), which are expensive on x86. And you don't get any benefit from parallelism, as one thread is always blocked.
But none of that should result in such bad latencies. I could see a few tens of kilocycles, but not megacycles.
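One way to separate the cross-core cost from everything else is to pin both threads to the same core and re-run the benchmark. A rough sketch of that experiment follows; pin_to_cpu() and thread_handle_t are hypothetical stand-ins for whatever the kernel's affinity interface actually is (the Schedule/processor-control call on Pistachio, the scheduler object on Fiasco.OC):

/* Hypothetical types and helpers, named here for illustration only. */
typedef unsigned long thread_handle_t;
extern void pin_to_cpu(thread_handle_t t, unsigned cpu);
extern thread_handle_t ping_thread, pong_thread;
extern void measure_roundtrip(int rounds);   /* from the earlier measurement sketch */

static void compare_local_vs_cross_core(void)
{
    /* Both threads on CPU 0: exercises only the core-local IPC path. */
    pin_to_cpu(ping_thread, 0);
    pin_to_cpu(pong_thread, 0);
    measure_roundtrip(100000);

    /* Threads on different cores: adds cache-line migration and,
     * depending on the implementation, IPIs. */
    pin_to_cpu(ping_thread, 0);
    pin_to_cpu(pong_thread, 1);
    measure_roundtrip(100000);
}

If the core-local numbers are already in the hundreds of thousands of cycles, the cross-core path is not the problem.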
Gernot