L4Linux AIM Benchmark results

Martin Pohlack (pohlack (at) os.inf.tu-dresden.de)
2007-06-05


Version 2.01

Introduction

In this report I compare L4Linux, a paravirtualized version of the Linux kernel, with its native counterpart. I do this on two different microarchitectures (Intel's P4 and AMD's Opteron) of the x86 architecture. AIM was chosen as the benchmark to allow comparing results with older publications, which also used AIM.

Some words about the measurement scenario

Machines

My first test machine, brom, is an Intel P4, Family 15, Model 4, Stepping 9, running at 3.2 GHz. I gave 256 MB of RAM to Linux and L4Linux; of that, 50 MB are always used as a ramdisk with tmpfs ('mount -t tmpfs tmpfs -o size=50M tmpfs/'). I never observed swapping during the benchmarks.
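One way to verify the absence of swapping, sketched here as an example (not taken from the original setup), is to compare the kernel's swap counters before and after a run:

    # pswpin/pswpout count pages swapped in/out since boot;
    # unchanged values across a run mean no swapping occurred.
    grep -E '^pswp(in|out)' /proc/vmstat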

My second test machine, lutetium, is an AMD Opteron, Family 15, Model 65, Stepping 2, running at 2.0 GHz. The memory setup is the same as for brom.

For the benchmarks I stopped all unnecessary background activity with this script. I kept sshd and NFS active for transporting configuration files before and measurement data after the benchmark runs, never in between.
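In spirit, such a script amounts to the following sketch; the concrete service names are placeholders and depend on the installation:

    #!/bin/sh
    # Stop unnecessary daemons before benchmarking (the service
    # list here is an example, not the original one).
    for s in cron atd syslogd klogd; do
        /etc/init.d/$s stop
    done
    # sshd and NFS stay up; they are used only before and after
    # the benchmark runs, never during them.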

Drivers

Linux and L4Linux use native drivers to be as comparable as possible. For IDE disk access, I had to use the SATA drivers (see Section Problems encountered). Here are the .config files for L4Linux and Linux.

Performance counter

For several measurements I used CPU-specific performance counters of the P4 (brom) and the Opteron (lutetium) respectively. For their setup in Linux, I enabled the msr device (/dev/cpu/0/msr) and wrote into that device using wrmsr, a small tool from msr-tools 1.1.2.
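As an illustration, the setup boils down to the following sketch. The register address and value shown are my own example for the Opteron (PerfEvtSel0 counting non-halted cycles), not taken from the original scripts:

    # Make /dev/cpu/0/msr available.
    modprobe msr
    # Program performance-event-select register 0 (PerfEvtSel0,
    # MSR 0xc0010000 on the Opteron): event 0x76 (CPU clocks not
    # halted), counted in user and kernel mode, counter enabled.
    wrmsr 0xc0010000 0x430076
    # Read the corresponding counter (PerfCtr0, MSR 0xc0010004):
    rdmsr 0xc0010004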

For the setup in L4, I patched the -loadcnt argument in Fiasco to initialize additional performance counters, so that in both setups the performance counters were initialized with the same values.

In Section Microbenchmarks I describe which counters I used. Here is a script for programming the counters in Linux on a P4. Here is the corresponding script for Opterons.

What is AIM

AIM is a suite of benchmarks to assess and compare operating systems. I chose AIM7 (a multiuser benchmark), mostly to be able to compare the results with earlier studies. AIM simulates several users running a mixed set of applications in parallel. The number of simultaneous users is called the load.

Problems encountered

While taking measurements I hit several problems, which I describe in the following. Where available, I also describe a solution.

Results

This graph shows the overall AIM results for Linux and L4Linux on 'brom', using the hard disk and the ramdisk respectively.
The second graph shows the overall AIM results on 'lutetium'.

We see that Linux and L4Linux performance is relatively close for the hard-disk cases, which represent the normal operation mode and reflect AIM's purpose (a multiuser benchmark). The benchmark consists of a mixture of CPU-bound and IO-bound tasks. At a load of 300 both perform nearly equally well.

For the ramdisk case, we see a large gap between Linux and L4Linux. The system is basically never idle and there is no real IO to hide the overhead behind. This scenario is one of the most extreme cases for demonstrating virtualization overhead.

We also see that the gap between Linux and L4Linux in the ramdisk setup is larger on brom than on lutetium, which suggests a larger effective overhead on the P4 than on the Opteron.

This graph shows the absolute frequencies of the syscalls that occurred during one AIM run.

That brk is the most common syscall is probably AIM7-specific and should not be generalized to other workloads: a brk-specific microbenchmark in AIM causes the exceptionally high brk numbers.
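The report does not include the tooling used to collect these counts. One common way to obtain such per-syscall frequencies, shown here purely as a sketch, is strace's counting mode (the AIM command line is a placeholder):

    # Count syscalls of the benchmark and all its children.
    strace -f -c -o syscall-counts.txt <aim7-command>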

Microbenchmarks

In the following I show data about the individual microbenchmarks used in AIM7 with the standard setup. AIM weighs these microbenchmarks non-uniformly, that is, it repeats some of them several times (up to four), whereas others are executed only once. The names are the ones used in AIM's logfiles and are mostly self-descriptive.

I repeated this test 100 times at load 1 on my otherwise idle test machines. I used the ramdisk setup for these measurements, so nothing is hidden in IO times.
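Schematically, the repetition looks as follows. The binary and workfile names are hypothetical placeholders, since AIM7 runs are driven by its own workfile configuration:

    # Run the AIM7 workload 100 times at load 1, keeping each log
    # (binary and workfile names are hypothetical placeholders).
    for i in $(seq 1 100); do
        ./aim7 < workfile.load1 > run-$i.log
    done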

The first table shows the results for brom (P4):



[Table for brom: one row per microbenchmark, each cell a histogram graph (Linux vs. L4Linux) for the columns Times, ITLB misses, TLB page walks, Cache misses, Global Power events (cycles non-halted), and Instructions retired. The graphs themselves are not reproduced here. The microbenchmarks are: add_double, add_float, add_int, add_long, add_short, array_rtns, brk_test, creat-clo, dgram_pipe, dir_rtns_1, disk_cp, disk_rd, disk_rr, disk_rw, disk_src, disk_wrt, div_double, div_float, div_int, div_long, div_short, exec_test, fork_test, jmp_test, link_test, matrix_rtns, mem_rtns_1, mem_rtns_2, misc_rtns_1, mul_double, mul_float, mul_int, mul_long, mul_short, new_raph, num_rtns_1, page_test, pipe_cpy, ram_copy, series_1, shared_memory, shell_rtns_1, sieve, signal_test, sort_rtns_1, stream_pipe, string_rtns, sync_disk_cp, sync_disk_rw, sync_disk_wrt, tcp_test, trig_rtns, udp_test.]

The second table shows the results for lutetium (Opteron):



[Table for lutetium: the same microbenchmark rows as above, each cell a histogram graph (Linux vs. L4Linux) for the columns Times, DTLB misses, ICache refill from system, Global Power events (cycles non-halted), and Instructions retired. The graphs themselves are not reproduced here.]

Conclusions

In the following I will discuss some observations from the benchmark results.

From the times column we see that nearly all microbenchmarks in this setup fully utilize the CPU (the busy graphs); there is basically no waiting time. The difference in total execution times therefore directly reflects the overhead of L4Linux compared to native Linux.

The TLB columns show that we have a huge overhead in the number of TLB misses. This starts at an approximate factor of two for simple, CPU-bound benchmarks (e.g., add_*) and goes well beyond a factor of twenty for benchmarks with many syscalls (e.g., brk_test and dgram_pipe). Reducing the number of context switches, or using tagged TLBs, could provide performance improvements.

A general pattern in the microbenchmark graphs is that the L4Linux graphs are typically shifted to the right by a certain amount (overhead, added latency). Additionally, the L4Linux graphs are typically smoother than their Linux counterparts: the distributions are less compact (more spread out) and have more noise (increased jitter).

What matters for the slowdown of L4Linux compared to Linux in AIM7 is the absolute difference between the average values in the histograms of the times column (e.g., exec_test and fork_test show large absolute differences of about 16 ms and 97 ms respectively, whereas dgram_pipe, at about 0.8 ms, does not). These microbenchmarks should be analyzed for optimization targets (e.g., the exec and fork system calls).

Future Work

Future work on this topic could target the following areas:


