Hello
I am trying to load real time applications on the L4re hypervisor. I would like to make sure that each application has a set ram space and a specific core.
I have followed some examples that are proposed but I cannot segregate the various applications correctly.
Is it possible to have an example of how I have to configure or what I have to do to achieve this?
I have managed to start and segregate applications using uvmm, but this approach leads to a noticeable performance penalty.
I say this because, when running for loops executing the 'NOP' instruction, the times obtained under uvmm are about 100 times higher.
Hi,
On Thu May 08, 2025 at 13:06:49 -0000, agaku03@gmail.com wrote:
I have managed to start and segregate applications using uvmm, but this approach leads to a noticeable performance penalty.
I say this because, when running for loops executing the 'NOP' instruction, the times obtained under uvmm are about 100 times higher.
That is interesting. A NOP instruction won't cause any VM-Exit, and thus there should be no difference between running bare-metal and virtualized. How do you measure time?
Adam
Hi Adam
Thank you for your reply. This seems strange to me too. As mentioned, I am running my own application in place of the Linux OS; below are the module list and the .cfg file used.
(module.list)

entry[arch=arm64] VM-G
roottask moe rom/vm-g.cfg
module l4re
module ned
module cons
module io
module vm-g.cfg
module[arch=arm64,fname=icarmvpx3a.io] drivers.io
module uvmm
module[arch=arm64,fname=kernel.dtb] dtb/kernel.dtb
module[arch=arm64,fname=driver_ethernet,nostrip] /home/user/Documents/TestEthernet/build/ethernet_loop
(vm-g.cfg)

-- vim:set ft=lua:
local L4 = require "L4";
local l = L4.default_loader;

local flags = L4.Mem_alloc_flags.Continuous
            | L4.Mem_alloc_flags.Pinned
            | L4.Mem_alloc_flags.Super_pages;
local align = 21;

-- start console server
local cons = l:new_channel();
l:start({
  caps = { cons = cons:svr() },
  log = L4.Env.log,
}, "rom/cons -a");
l.log_fab = cons;

local serialdev = { arm = "ttyAMA0", arm64 = "ttyAMA0", amd64 = "ttyS0" };

-- start io server
local vbus_l4 = l:new_channel();
l:start({
  caps = {
    vbus = vbus_l4:svr(),
    icu = L4.Env.icu,
    iommu = L4.Env.iommu,
    sigma0 = L4.Env.sigma0,
  },
  log = { "IO", "y" },
  l4re_dbg = L4.Dbg.Info,
  scheduler = L4.Env.user_factory:create(L4.Proto.Scheduler, 0xa0, 0x80, 0x02);
}, "rom/io rom/drivers.io");

-- start vmm server
l:startv({
  caps = {
    ram = L4.Env.user_factory:create(L4.Proto.Dataspace, 0x30000000, flags, align):m("rw"),
    vbus = vbus_l4,
  },
  log = { "vm", "Black" },
  l4re_dbg = L4.Dbg.Info,
  scheduler = L4.Env.user_factory:create(L4.Proto.Scheduler, 0x18, 0x8, 0x02);
}, "rom/uvmm", "-v", "-i",
   "-krom/ethernet_loop",
   "-drom/kernel.dtb",
   "-b0xC0000000",
   "-cconsole=" .. serialdev[L4.Info.arch()] .. " rw");
To measure the core usage time and get performance estimates I use the following code:
uint32_t cnt_freq = read_cntfrq_el0();

start = read_cntpct_el0();
for (volatile uint64_t i = 0; i < counter; i++) {
    __asm__ volatile("nop");
}
end = read_cntpct_el0();

uint64_t elapsed_ticks = end - start;
double time_ns = ((double)elapsed_ticks * 1e9) / cnt_freq;
I have run this measurement with different iteration counts; to give you an estimate, here are only the results for 1000 iterations.
1) Application in bare metal => 2719 ns
2) Application on Linux vmm => 3680 ns
3) Application on vmm without Linux (i.e. the application of point 1) => 390880 ns
I hope you can help me understand how I can handle this situation correctly.
Thank you for your support
Hi,
if it is bare-metal code running in a VM, does it run with caching enabled?
Adam
On Mon May 12, 2025 at 06:12:27 -0000, agaku03@gmail.com wrote:
Hi Adam
Thank you for your reply. This seems strange to me too. As mentioned, I am running my own application in place of the Linux OS; below are the module list and the .cfg file used.
[...]
- Application in bare metal => 2719 ns
- Application on Linux vmm => 3680 ns
- Application on vmm without Linux (i.e. the application of point 1) => 390880 ns
I hope you can help me understand how I can handle this situation correctly.
Thank you for your support
Hi Adam
Yes, the cache is enabled: I manually set the I and C enable bits in the SCTLR_EL1 register. I can see by debugging that some instructions are cached.
On analysis, I noticed that the latencies seem to occur when accessing the stack or other memory areas. As a test I rewrote the time-calculation function in assembly, and that version is considerably faster (from 4650 ticks down to 79 ticks).
Digging deeper and disassembling the code with the for loop, I noticed that the difference is that the assembly version avoids instructions like ldr x0, [sp, #104], i.e. accesses to the stack.
The stack is an area of memory set up by the linker and used as described in the application note (ARM DAI 0527A, Non-Confidential, Application Note: Bare-metal Boot Code for ARMv8-A Processors).
Is there something I need to manage on the hypervisor side, or settings I need to make, to optimise these accesses?
Thanks again for your support and courtesy
Gianluca
Hi Gianluca,
did you also install (identity-mapped) page tables? Just enabling the I and C bits in SCTLR will not be enough. It looks like the Application Note contains all the code needed for that. I am not aware of anything that needs to be done differently here on the hypervisor level.
Adam
On Fri May 16, 2025 at 13:50:25 -0000, agaku03@gmail.com wrote:
Yes, the cache is enabled as I manually activate the SCTLR_EL1 registers with the enable values for I and C. I can see by debugging that some instructions are cached.
[...]
Hi Adam
I have carried out all the tests, and it is indeed necessary to create a one-to-one (identity) MMU mapping in order to get optimal performance back.
I thank you for your support.