Hello,
I finally got round to experimenting with L4Re again, but in attempting to investigate task creation, I seem to have some difficulties understanding the mechanism by which tasks are typically created and how the l4_task_map function might be used in the process.
After looking at lots of different files in the L4Re distribution, my understanding of the basic mechanism is as follows:
1. Some memory is reserved for the UTCB of a new task, perhaps using the l4re_ma_alloc_align function (or equivalent) to obtain a dataspace.
2. A task is created using l4_factory_create_task, indicating the UTCB flexpage, with this being defined as...
l4_factory_create_task(l4re_env()->factory, new_task, l4_fpage(utcb_start, utcb_log2size, L4_FPAGE_RW))
3. A thread is created using l4_factory_create_thread.
l4_factory_create_thread(l4re_env()->factory, new_thread)
4. The thread attributes are set using the l4_thread_control API.
5. The l4_thread_ex_regs function is used to set the instruction pointer (program counter) and stack pointer of the thread.
6. The l4_scheduler_run_thread function is used to initiate the thread.
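In code, I imagine the steps look roughly like this (my own sketch, not taken from L4Re: `new_task`, `new_thread`, `pager_cap`, `utcb_start`, `entry_ip` and `stack_sp` are placeholder names, capability slot allocation and error handling are omitted, and the thread-control calls would need checking against the l4sys documentation):

```c
/* Steps 2-6 from the list above, using the C API.
 * Assumes new_task/new_thread are already-allocated capability slots. */
l4_factory_create_task(l4re_env()->factory, new_task,
                       l4_fpage(utcb_start, utcb_log2size, L4_FPAGE_RW));
l4_factory_create_thread(l4re_env()->factory, new_thread);

/* Step 4: configure the thread - pager, exception handler, task binding. */
l4_thread_control_start();
l4_thread_control_pager(pager_cap);
l4_thread_control_exc_handler(pager_cap);
l4_thread_control_bind((l4_utcb_t *) utcb_start, new_task);
l4_thread_control_commit(new_thread);

/* Steps 5 and 6: set initial IP and SP, then run the thread. */
l4_thread_ex_regs(new_thread, entry_ip, stack_sp, 0);
l4_sched_param_t sp = l4_sched_param(2, 0);
l4_scheduler_run_thread(l4re_env()->scheduler, new_thread, &sp);
```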
The expectation is that the thread will immediately fault because there is no memory mapped at the instruction pointer location. However, it seems to me that it should be possible to use l4_task_map to make a memory region available within the task's address space, although I don't ever see this function used in L4Re for anything.
(The C++ API makes it difficult to perform ad-hoc searches for such low-level primitives, in my view, so perhaps I am missing use of the equivalent methods.)
Tentatively, I would imagine that something like this might work:
l4_task_map(new_task, L4RE_THIS_TASK_CAP, l4_fpage(program_start, program_log2size, L4_FPAGE_RX), task_program_start)
Here, the program payload would be loaded into the creating task at program_start, but the new task would be receiving the payload at task_program_start, with the configured instruction pointer location occurring within the receive window (after task_program_start, in other words).
There are, of course, many other considerations around creating tasks, which I have noted from looking at the different packages (libloader, l4re_kernel, moe, ned), and I am aware that a few other things need to be done to start a task such as...
* Defining capability selectors and mapping appropriate capabilities to the new task.
* Creating a stack for the task and populating it with arguments and environment information.
* Defining a suitable pager and exception handler, with this usually being provided by the l4re binary, as I understand it.
Also, when dealing with program loading more generally, I realise that the ELF binary needs to be interpreted and the appropriate regions associated with different parts of memory, this typically being handled by the region mapper/manager in L4Re. And there is also the matter of dynamic library loading.
But here, I am just attempting to establish the basic mechanism when a task starts up. Unfortunately, the only discussion I found was this (after some initial discussion about a related topic):
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2014/015366.html
There are various examples in Subversion (maybe somewhere in the Git repositories, too) that create tasks or threads, but I don't find them particularly helpful, apparently being oriented towards very specific applications. A previous example was referenced in the above thread for the older L4Env system (or maybe an even earlier system):
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2000/000384.html
As for why I would be wondering about such things - a question inevitably asked in the first thread referenced above - I firstly want to be able to understand the mechanism involved, but I also want to be able to integrate work I have been doing on file paging into task creation.
Although I can probably do this by customising the "app model" normally used by the different loaders, it seems that I would need to construct an alternative l4re binary, which is rather cumbersome and perhaps a weakness of the abstractions provided: they are rather oriented towards obtaining dataspaces via the namespace API, which I don't want to have to support in my filesystem.
In any case, I wonder if there are any resources that describe the use of l4_task_map and the details of the program environment within tasks.
Paul
Hi Paul,
On Sun Apr 10, 2022 at 18:58:10 +0200, Paul Boddie wrote:
I finally got round to experimenting with L4Re again, but in attempting to investigate task creation, I seem to have some difficulties understanding the mechanism by which tasks are typically created and how the l4_task_map function might be used in the process.
After looking at lots of different files in the L4Re distribution, my understanding of the basic mechanism is as follows:
- Some memory is reserved for the UTCB of a new task, perhaps using the
l4re_ma_alloc_align function (or equivalent) to obtain a dataspace.
No, for UTCBs there's a dedicated call l4_task_add_ku_mem in case one needs more UTCB memory than has been initially created with l4_factory_create_task().
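For reference, such a call might look roughly like this (a sketch only; `new_task` and the address/size are placeholders, and the exact signature should be checked against the l4sys documentation):

```c
/* Ask the kernel to create additional kernel-user memory (UTCB space)
 * in the task, at a task-virtual address of the caller's choosing. */
l4_fpage_t ku_fpage = l4_fpage(extra_utcb_start, extra_utcb_log2size,
                               L4_FPAGE_RW);
l4_task_add_ku_mem(new_task, ku_fpage);
```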
- A task is created using l4_factory_create_task, indicating the UTCB
flexpage, with this being defined as...
l4_factory_create_task(l4re_env()->factory, new_task, l4_fpage(utcb_start, utcb_log2size, L4_FPAGE_RW))
Yes. Here the flexpage defines where memory usable for UTCBs shall be created.
- A thread is created using l4_factory_create_thread.
l4_factory_create_thread(l4re_env()->factory, new_thread)
- The thread attributes are set using the l4_thread_control API.
- The l4_thread_ex_regs function is used to set the instruction pointer
(program counter) and stack pointer of the thread.
- The l4_scheduler_run_thread function is used to initiate the thread.
All yes.
The expectation is that the thread will immediately fault because there is no memory mapped at the instruction pointer location. However, it seems to me that it should be possible to use l4_task_map to make a memory region available within the task's address space, although I don't ever see this function used in L4Re for anything.
(The C++ API makes it difficult to perform ad-hoc searches for such low-level primitives, in my view, so perhaps I am missing use of the equivalent methods.)
Indeed. L4::Task::map is used, for example, to map some initial capabilities, but typically not memory.
Tentatively, I would imagine that something like this might work:
l4_task_map(new_task, L4RE_THIS_TASK_CAP, l4_fpage(program_start, program_log2size, L4_FPAGE_RX), task_program_start)
Here, the program payload would be loaded into the creating task at program_start, but the new task would be receiving the payload at task_program_start, with the configured instruction pointer location occurring within the receive window (after task_program_start, in other words).
Yes, this would work.
There are, of course, many other considerations around creating tasks, which I have noted from looking at the different packages (libloader, l4re_kernel, moe, ned), and I am aware that a few other things need to be done to start a task such as...
- Defining capability selectors and mapping appropriate capabilities to the
new task.
- Creating a stack for the task and populating it with arguments and
environment information.
- Defining a suitable pager and exception handler, with this usually being
provided by the l4re binary, as I understand it.
Yes, this all needs to be done.
Also, when dealing with program loading more generally, I realise that the ELF binary needs to be interpreted and the appropriate regions associated with different parts of memory, this typically being handled by the region mapper/manager in L4Re. And there is also the matter of dynamic library loading.
Yes, indeed.
But here, I am just attempting to establish the basic mechanism when a task starts up. Unfortunately, the only discussion I found was this (after some initial discussion about a related topic):
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2014/015366.html
There are various examples in Subversion (maybe somewhere in the Git repositories, too) that create tasks or threads, but I don't find them particularly helpful, apparently being oriented towards very specific applications. A previous example was referenced in the above thread for the older L4Env system (or maybe an even earlier system):
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2000/000384.html
As for why I would be wondering about such things - a question inevitably asked in the first thread referenced above - I firstly want to be able to understand the mechanism involved, but I also want to be able to integrate work I have been doing on file paging into task creation.
I think you described it very well with your steps listed above. Also, l4util has a l4util_create_thread() function that lists all the steps needed to create a thread in an existing task.
Although I can probably do this by customising the "app model" normally used by the different loaders, it seems that I would need to construct an alternative l4re binary, which is rather cumbersome and perhaps a weakness of the abstractions provided: they are rather oriented towards obtaining dataspaces via the namespace API, which I don't want to have to support in my filesystem.
In case you want to use other abstractions you probably need to adapt l4re to use them such that they fit together.
In any case, I wonder if there are any resources that describe the use of l4_task_map and the details of the program environment within tasks.
l4_task_map() has documentation: https://l4re.org/doc/group__l4__task__api.html#ga8ed2ff7ba204de7c01311c22412... and is a direct API to the kernel for mapping resources, defined by l4sys. At this level, there is not really a definition of what a program environment looks like. However, as Fiasco needs to supply its initial programs some capabilities, those are defined (https://l4re.org/doc/group__l4__cap__api.html#gaa7801b63edba351bad9ea8026432...). What moe and ned do is similar, but not necessarily the same, as they provide a more powerful interface to this (https://l4re.org/doc/group__api__l4re__env.html) and also provide all the functionality normal programs enjoy, like argument lists, environment variables, etc.
Adam
Adam,
Thanks for the reply!
On Monday, 11 April 2022 01:02:37 CEST Adam Lackorzynski wrote:
Hi Paul,
On Sun Apr 10, 2022 at 18:58:10 +0200, Paul Boddie wrote:
I finally got round to experimenting with L4Re again, but in attempting to investigate task creation, I seem to have some difficulties understanding the mechanism by which tasks are typically created and how the l4_task_map function might be used in the process.
After looking at lots of different files in the L4Re distribution, my understanding of the basic mechanism is as follows:
- Some memory is reserved for the UTCB of a new task, perhaps using the
l4re_ma_alloc_align function (or equivalent) to obtain a dataspace.
No, for UTCBs there's a dedicated call l4_task_add_ku_mem in case one needs more UTCB memory than has been initially created with l4_factory_create_task().
OK, I did see that function being used, too, but I also found plenty of other things in my perusal of the different files. Obviously, being able to extend the UTCB memory is an important consideration.
- A task is created using l4_factory_create_task, indicating the UTCB
flexpage, with this being defined as...
l4_factory_create_task(l4re_env()->factory, new_task,
l4_fpage(utcb_start, utcb_log2size, L4_FPAGE_RW))
Yes. Here the flexpage defines where memory usable for UTCBs shall be created.
Right. I see that the factory function actually sends the flexpage in the IPC call (using l4_factory_create_add_fpage_u), thus mapping it in the task. I find it hard to follow where this message is actually handled (I presume that Moe acts as the factory) or what the factory actually does with the flexpage, but I presume that it ultimately causes it to be mapped in the new task.
[Thread creation and initiation]
The expectation is that the thread will immediately fault because there is no memory mapped at the instruction pointer location. However, it seems to me that it should be possible to use l4_task_map to make a memory region available within the task's address space, although I don't ever see this function used in L4Re for anything.
(The C++ API makes it difficult to perform ad-hoc searches for such low-level primitives, in my view, so perhaps I am missing use of the equivalent methods.)
Indeed. L4::Task::map is used, for example to map some initial capabilities, and typically not memory.
Yes, I see a lot of these map operations operating on capabilities. For example:
pkg/l4re-core/libloader/include/remote_app_model
However, I wonder about the "chicken and egg" situation in new tasks. It seems to me that the way things work is that a new task in L4Re is typically populated with the l4re binary containing the region mapper/manager (RM). This seems to be initiated here (in launch_loader):
pkg/l4re-core/ned/server/src/lua_exec.cc
This RM is then able to handle the page fault when an attempt is made to load and run a new program. But one cannot rely on the RM when it isn't already installed in a task, so there must be a way of mapping it into the new task so that it can be present. I assumed that using l4_task_map might be one way of doing so.
Otherwise, I thought that perhaps an existing task could provide a kind of RM to act as the new task's pager in the bootstrapping phase, so that page faults would be directed towards the existing task's RM and mappings established to get the new task's RM up and running. However, in that case, since the usual IPC traffic between RM and dataspaces does not involve sending flexpages to the new task (and thus implicitly mapping them in the task, as I understand it), it seems that the existing task's RM would also need to explicitly map flexpages in the new task, again using something like l4_task_map.
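Concretely, I imagine the existing task's pager doing something like the following on a page fault (purely my sketch: the fault-handling loop, alignment considerations and error checking are omitted, and `new_task`, `fault_addr` and `backing_addr` are placeholder names):

```c
/* On a page fault at fault_addr in the new task, explicitly map one
 * page of backing memory from this task into the faulting task,
 * rather than replying to the fault IPC with a flexpage. */
l4_addr_t page = l4_trunc_page(fault_addr);
l4_fpage_t fp = l4_fpage(backing_addr, L4_PAGESHIFT, L4_FPAGE_RWX);
l4_task_map(new_task, L4RE_THIS_TASK_CAP, fp, page);
```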
I think I understand the usual mechanism between a task's RM and dataspaces, at least enough to have implemented paging with dataspaces myself, but I don't follow what is actually being done here.
Tentatively, I would imagine that something like this might work: l4_task_map(new_task, L4RE_THIS_TASK_CAP,
l4_fpage(program_start, program_log2size, L4_FPAGE_RX), task_program_start)
Here, the program payload would be loaded into the creating task at program_start, but the new task would be receiving the payload at task_program_start, with the configured instruction pointer location occurring within the receive window (after task_program_start, in other words).
Yes, this would work.
As far as I have seen, the "send base" with l4_task_map is effectively defining the location where the flexpage is mapped, since the receive window is "the whole address space" of the destination task according to the documentation. Some testing with existing tasks appears to confirm this, but then I also seem to experience issues with memory coherency or something resembling it, meaning that I write to several pages, map the memory to another task, and yet the mapped region does not reflect the written contents (and it is indeed mapped, since if I neglect to map it I get a page fault).
I just spent some time mapping memory by sending flexpages from one task to another, defining the receive window using the recipient's buffer registers so that the mappings are established by the kernel (again, as I understand how this actually ends up working), and this does establish the mappings, but I still see a lack of coherency. Maybe I am just failing to call the appropriate functions to let the recipient see all of the changed memory contents, however.
[...]
In any case, I wonder if there are any resources that describe the use of l4_task_map and the details of the program environment within tasks.
l4_task_map() has documentation: https://l4re.org/doc/group__l4__task__api.html#ga8ed2ff7ba204de7c01311c22412a2063 and is a direct API to the kernel for mapping resources, defined by l4sys. At this level, there is not really a definition of what a program environment looks like. However, as Fiasco needs to supply its initial programs some capabilities, those are defined (https://l4re.org/doc/group__l4__cap__api.html#gaa7801b63edba351bad9ea8026432b5c4). What moe and ned do is similar, but not necessarily the same, as they provide a more powerful interface to this (https://l4re.org/doc/group__api__l4re__env.html) and also provide all the functionality normal programs enjoy, like argument lists, environment variables, etc.
I've been spending plenty of time looking at the documentation. However, I really feel that the fundamentals of the system are often not readily documented, at least in such reference documentation. So, I've also spent time looking at teaching materials related to L4Re and Fiasco, some of which are helpful, but they obviously do not go into much detail.
That leaves the code, which is not always very easy to follow. Part of the reason I just decided to implement my own IPC library and interface description language was that the IPC support in L4Re is incoherent, with different approaches used in different places, and sometimes convoluted, also not relating obviously to the low-level libraries, thus making it difficult to identify common areas of functionality.
This might just sound like me complaining, but I also have some concerns about being able to verify the behaviour of some of the code. For example, I recently found that my dataspace implementation was getting requests from a region mapper/manager with an opcode of 0x100000000, which doesn't make any sense to me at all, given that the dataspace interface code in L4Re implicitly defines opcodes that are all likely to be very small integers. At first I obviously blamed my own code, but then I found that in the IPC call implementation found here...
pkg/l4re-core/l4sys/include/cxx/ipc_iface
...if I explicitly cleared the first message register before this statement...
int send_bytes = Args::template write_op<Do_in_data>(mrs->mr, 0, Mr_bytes, Opt::Opcode, a...);
...then the opcode was produced as expected again. I suppose what I am trying to communicate is that some of the organisation of the code is not conducive to inspection, nor does it readily reflect the mechanisms involved. Although the availability of software to do a task arguably diminishes the need to be familiar with what the software does, that arrangement only works if the software is usable, extensible and does what it is supposed to.
In case you might be wondering why I am doing any of this (as I sometimes do myself), I am attempting to integrate a filesystem into L4Re, but this also leads to the matter of running programs from the filesystem. And so, it becomes interesting to try and create tasks and populate them with those programs.
Paul
Hi Paul,
On Tue Apr 12, 2022 at 01:09:40 +0200, Paul Boddie wrote:
On Monday, 11 April 2022 01:02:37 CEST Adam Lackorzynski wrote:
Hi Paul,
On Sun Apr 10, 2022 at 18:58:10 +0200, Paul Boddie wrote:
I finally got round to experimenting with L4Re again, but in attempting to investigate task creation, I seem to have some difficulties understanding the mechanism by which tasks are typically created and how the l4_task_map function might be used in the process.
After looking at lots of different files in the L4Re distribution, my understanding of the basic mechanism is as follows:
- Some memory is reserved for the UTCB of a new task, perhaps using the
l4re_ma_alloc_align function (or equivalent) to obtain a dataspace.
No, for UTCBs there's a dedicated call l4_task_add_ku_mem in case one needs more UTCB memory than has been initially created with l4_factory_create_task().
OK, I did see that function being used, too, but I also found plenty of other things in my perusal of the different files. Obviously, being able to extend the UTCB memory is an important consideration.
It is, because one might not know in advance how many threads a task will have; in particular, the component that creates the task may not know.
- A task is created using l4_factory_create_task, indicating the UTCB
flexpage, with this being defined as...
l4_factory_create_task(l4re_env()->factory, new_task,
l4_fpage(utcb_start, utcb_log2size, L4_FPAGE_RW))
Yes. Here the flexpage defines where memory usable for UTCBs shall be created.
Right. I see that the factory function actually sends the flexpage in the IPC call (using l4_factory_create_add_fpage_u), thus mapping it in the task. I find it hard to follow where this message is actually handled (I presume that Moe acts as the factory) or what the factory actually does with the flexpage, but I presume that it ultimately causes it to be mapped in the new task.
It is handled in Fiasco.
[Thread creation and initiation]
The expectation is that the thread will immediately fault because there is no memory mapped at the instruction pointer location. However, it seems to me that it should be possible to use l4_task_map to make a memory region available within the task's address space, although I don't ever see this function used in L4Re for anything.
(The C++ API makes it difficult to perform ad-hoc searches for such low-level primitives, in my view, so perhaps I am missing use of the equivalent methods.)
Indeed. L4::Task::map is used, for example to map some initial capabilities, and typically not memory.
Yes, I see a lot of these map operations operating on capabilities. For example:
pkg/l4re-core/libloader/include/remote_app_model
However, I wonder about the "chicken and egg" situation in new tasks. It seems to me that the way things work is that a new task in L4Re is typically populated with the l4re binary containing the region mapper/manager (RM). This seems to be initiated here (in launch_loader):
Yes, the l4re binary loads the application and then serves as its pager.
pkg/l4re-core/ned/server/src/lua_exec.cc
This RM is then able to handle the page fault when an attempt is made to load and run a new program. But one cannot rely on the RM when it isn't already installed in a task, so there must be a way of mapping it into the new task so that it can be present. I assumed that using l4_task_map might be one way of doing so.
Otherwise, I thought that perhaps an existing task could provide a kind of RM to act as the new task's pager in the bootstrapping phase, so that page faults would be directed towards the existing task's RM and mappings established to get the new task's RM up and running. However, in that case, since the usual IPC traffic between RM and dataspaces does not involve sending flexpages to the new task (and thus implicitly mapping them in the task, as I understand it), it seems that the existing task's RM would also need to explicitly map flexpages in the new task, again using something like l4_task_map.
That's how it works. Moe also has region managers that are used to page the l4re binary. When a page fault is resolved, there is someone sending memory via a flexpage to the task in question. In our case it's the dataspace manager, which sends the memory via a 'map' call. Here it does not matter whether the l4re binary faulted or the application, because in the end the task is the receiver of the flexpage, not the particular program (both run in the same task).
I think I understand the usual mechanism between a task's RM and dataspaces, at least enough to have implemented paging with dataspaces myself, but I don't follow what is actually being done here.
Tentatively, I would imagine that something like this might work: l4_task_map(new_task, L4RE_THIS_TASK_CAP,
l4_fpage(program_start, program_log2size, L4_FPAGE_RX), task_program_start)
Here, the program payload would be loaded into the creating task at program_start, but the new task would be receiving the payload at task_program_start, with the configured instruction pointer location occurring within the receive window (after task_program_start, in other words).
Yes, this would work.
As far as I have seen, the "send base" with l4_task_map is effectively defining the location where the flexpage is mapped, since the receive window is "the whole address space" of the destination task according to the documentation. Some testing with existing tasks appears to confirm this, but then I also seem to experience issues with memory coherency or something resembling it, meaning that I write to several pages, map the memory to another task, and yet the mapped region does not reflect the written contents (and it is indeed mapped, since if I neglect to map it I get a page fault).
I just spent some time mapping memory by sending flexpages from one task to another, defining the receive window using the recipient's buffer registers so that the mappings are established by the kernel (again, as I understand how this actually ends up working), and this does establish the mappings, but I still see a lack of coherency. Maybe I am just failing to call the appropriate functions to let the recipient see all of the changed memory contents, however.
Jdb has facilities to check what the address spaces look like, exactly to debug issues like the one you describe. You can press 's' to see all the tasks in the system, navigate onto one of them, and then press 'p' to see the page-table view. Here you can navigate the page tables and verify that a page at some virtual address actually points to the physical location it should point at. For a particular physical address (page frame number) you can also show the mapping hierarchy via the 'm' key.
[...]
In any case, I wonder if there are any resources that describe the use of l4_task_map and the details of the program environment within tasks.
l4_task_map() has documentation: https://l4re.org/doc/group__l4__task__api.html#ga8ed2ff7ba204de7c01311c22412a2063 and is a direct API to the kernel for mapping resources, defined by l4sys. At this level, there is not really a definition of what a program environment looks like. However, as Fiasco needs to supply its initial programs some capabilities, those are defined (https://l4re.org/doc/group__l4__cap__api.html#gaa7801b63edba351bad9ea8026432b5c4). What moe and ned do is similar, but not necessarily the same, as they provide a more powerful interface to this (https://l4re.org/doc/group__api__l4re__env.html) and also provide all the functionality normal programs enjoy, like argument lists, environment variables, etc.
I've been spending plenty of time looking at the documentation. However, I really feel that the fundamentals of the system are often not readily documented, at least in such reference documentation. So, I've also spent time looking at teaching materials related to L4Re and Fiasco, some of which are helpful, but they obviously do not go into much detail.
That leaves the code, which is not always very easy to follow. Part of the reason I just decided to implement my own IPC library and interface description language was that the IPC support in L4Re is incoherent, with different approaches used in different places, and sometimes convoluted, also not relating obviously to the low-level libraries, thus making it difficult to identify common areas of functionality.
For sure this is an area where the code is pretty involved. Back in the old days we had an IDL compiler that grew and grew and was in the end not easy to maintain. When we switched to the capability system, and thus changed all the APIs, we had the choice of either adapting the IDL compiler or doing something different. Back then it was a major hassle to parse the input, because eventually one wants the whole language understood by the IDL compiler to be able to use all sorts of types (of course one could make compromises there). With C/C++ that is not so easy, or at least it wasn't back then. Now there's LLVM, which is a major improvement in this area; still, the actual tool needs to be implemented and maintained. As we all see, we opted for the "do something else" option. With C++ as our main language and the possibilities it offers, the idea was to implement the "IDL thing" purely in C++, directly in the code. That's what we have now. For me, all the code around it is the IDL compiler; abstractly, if it were not in header files it would sit somewhere else, but it would exist in one form or another.
This might just sound like me complaining, but I also have some concerns about being able to verify the behaviour of some of the code. For example, I recently found that my dataspace implementation was getting requests from a region mapper/manager with an opcode of 0x100000000, which doesn't make any sense to me at all, given that the dataspace interface code in L4Re implicitly defines opcodes that are all likely to be very small integers. At first I obviously blamed my own code, but then I found that in the IPC call implementation found here...
pkg/l4re-core/l4sys/include/cxx/ipc_iface
...if I explicitly cleared the first message register before this statement...
int send_bytes = Args::template write_op<Do_in_data>(mrs->mr, 0, Mr_bytes, Opt::Opcode, a...);
...then the opcode was produced as expected again.
Which does not fully make sense to me, because the message registers seem to be written starting from 0. Anyway, do you perhaps have an example?
I suppose what I am trying to communicate is that some of the organisation of the code is not conducive to inspection, nor does it readily reflect the mechanisms involved. Although the availability of software to do a task arguably diminishes the need to be familiar with what the software does, that arrangement only works if the software is usable, extensible and does what it is supposed to.
Indeed, I cannot agree more.
In case you might be wondering why I am doing any of this (as I sometimes do myself), I am attempting to integrate a filesystem into L4Re, but this also leads to the matter of running programs from the filesystem. And so, it becomes interesting to try and create tasks and populate them with those programs.
Yes, for sure!
Adam
Adam,
Thanks for the reply again!
On Monday, 18 April 2022 23:26:03 CEST you wrote:
Hi Paul,
On Tue Apr 12, 2022 at 01:09:40 +0200, Paul Boddie wrote:
OK, I did see that function being used, too, but I also found plenty of other things in my perusal of the different files. Obviously, being able to extend the UTCB memory is an important consideration.
It is, because one might not know in advance how many threads a task will have; in particular, the component that creates the task may not know.
Right. And so, in the Moe_app_model the size of the UTCB area is dependent on the default number of threads. It still seems that I have to provide a UTCB flexpage to l4_factory_create_task, however.
[...]
Right. I see that the factory function actually sends the flexpage in the IPC call (using l4_factory_create_add_fpage_u), thus mapping it in the task. I find it hard to follow where this message is actually handled (I presume that Moe acts as the factory) or what the factory actually does with the flexpage, but I presume that it ultimately causes it to be mapped in the new task.
It is handled in Fiasco.
OK. I think I found this in Task::create (src/kern/task.cpp).
[...]
However, I wonder about the "chicken and egg" situation in new tasks. It seems to me that the way things work is that a new task in L4Re is typically populated with the l4re binary containing the region mapper/manager (RM). This seems to be initiated here (in launch_loader):
Yes, the l4re binary loads the application and then serves as its pager.
OK. And it seems that the RM provided by this binary is able to indicate the receive window for flexpages when asking for mappings from dataspaces.
[Pagers for other tasks]
That's how it works. Moe also has region managers that are used to page the l4re binary. When a page fault is resolved, there is someone sending memory via a flexpage to the task in question. In our case it's the dataspace manager, which sends the memory via a 'map' call. Here it does not matter whether the l4re binary faulted or the application, because in the end the task is the receiver of the flexpage, not the particular program (both run in the same task).
So, the one thing I didn't understand until I started digging around in the Fiasco sources and also implementing my own page fault handler was the scope of the receive window for issued flexpages, but it appears that the whole address space is indicated as the receive window in Thread::handle_page_fault_pager (src/kern/thread-ipc.cpp).
[Mapping memory using l4_task_map]
Jdb has facilities to check what the address spaces look like, exactly to debug issues like you describe. You can press 's' to see all the tasks in the system, navigate onto them, and then press 'p' to see the page-table view. Here you can navigate the page tables and verify that the pages at some virtual address are actually pointing to the physical location they should point at. For a particular physical address (page frame number) you can also show the mapping hierarchy via the 'm' key.
The "coherency" problems actually turned out to be me forgetting the appropriate alignment for the mapped flexpages. But I have discovered a few things about jdb in my attempts to troubleshoot my code, including 's' to look at tasks. I found the page table view bewildering, though, rather hoping for a nice summary of mapped pages instead. However, the object space view was very useful in establishing that capabilities were being mapped.
I did manage to get l4_task_map to work between two fully initialised tasks created by Ned as follows:
l4_fpage_t payload_fpage = l4_fpage((l4_addr_t) buf, l4util_log2(L4_PAGESIZE * NUM_PAGES), L4_FPAGE_RW);
l4_task_map(recipient, L4RE_THIS_TASK_CAP, payload_fpage, 0x3000000);
This permitted the recipient task to access the memory in the appropriately aligned buf region, this being mapped into the region starting at 0x3000000 in the recipient.
I also managed to achieve the same thing using IPC between the two tasks instead, having the recipient indicate a receive window in the buffer registers as follows:
br[0] = l4_map_control(0, 0, L4_MAP_ITEM_MAP);
br[1] = l4_fpage(0x3000000, l4util_log2(L4_PAGESIZE * NUM_PAGES), 0).raw;
And then constructing the send flexpage as follows:
mr[0] = l4_map_control(0, L4_FPAGE_CACHEABLE, 0);
mr[1] = l4_fpage((l4_addr_t) buf, l4util_log2(L4_PAGESIZE * NUM_PAGES), L4_FPAGE_RW).raw;
Alternatively, setting the receive window to the entire address space...
br[1] = l4_fpage_all().raw;
...and indicating a different send base also worked:
mr[0] = l4_map_control(0x3000000, L4_FPAGE_CACHEABLE, 0);
This approximates the l4_task_map scenario.
However, trying this with a completely newly created task, I cannot seem to get l4_task_map to map memory into the task, and my page fault handler does not seem to be able to respond to a page fault message with such a flexpage and have the request satisfied. This means that there is some detail I am overlooking, but I have yet to determine what it is.
One thing I have tried to do is to get Fiasco to report what it is doing when processing page fault messages, but this is quite challenging. Really, I just want to establish whether the reply to a page fault message gets interpreted correctly and causes mappings to be established in the page tables.
Doing some "old school" tracing in various routines like transfer_msg_items by detecting the appropriate fault conditions, enabling tracing, and then producing output on the console doesn't seem to yield any indications of the messages being processed, but perhaps I misunderstand the flow of control from Thread::handle_page_fault_pager within the kernel when the IPC is initiated.
Reviewing the old thread on this broader topic, I found this advice:
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2014/015441.html
This does yield page fault trace log entries of the following form:
pf: 00bc pfa=0000000001000ae3 ip=0000000001000ae3 (rp) spc=0xffffffff13dc5ad8 err=15
Here, I presume that the error is R_aborted (src/abi/l4_error.cpp), meaning that the page fault was not handled.
Looking at advice about IPC tracing...
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2014/015475.html
...I can also get log entries that I think might indicate some element of success in terms of the messages being sent, with this being the fault message:
00be answ [fffffffffffe0002] L=8000fcf3 err=0 (OK) (1000ae5,1000ae3)
ipc: 00be wait->[C:INV] DID=be L=0 TO=INF
And elsewhere:
00be answ [00000040] L=0 err=0 (OK) (1000038,400595)
ipc: 00be send rcap->[C:INV] DID=bc L=0 [0000000000000040] (0000000001000038,0000000000400595) TO=INF
Here, I am attempting to resolve the page fault caused by execution at 0x1000ae3 by sending a flexpage providing memory in one task at 0x400000 to the recipient at 0x1000000.
None of this really explains why the page fault handler keeps getting called with the same details, sending the same message, and so on.
[...]
For sure this is an area where the code is pretty involved. Back in the old days we had an IDL compiler that grew and grew and was in the end not easy to maintain. When we switched to the capability system, and thus changed all the APIs, we had the choice of either adapting the IDL compiler or doing something different. Back then it was a major hassle to parse the input, because eventually one wants the whole language understood by the IDL compiler so as to be able to use all sorts of types (of course one could make compromises there). With C / C++ that is not so easy, at least it wasn't back then. Now there's LLVM, and that's a major improvement in this area, but still the actual tool needs to be implemented and maintained. As we all see, we opted for the "do something else" option. With C++ as our main language and the possibilities it offers, there was the idea to implement the "IDL thing" purely in C++, directly in the code. That's what we have now. For me, all the code around it is the IDL compiler; abstractly, if it were not in header files it would sit somewhere else, but it would be there in one form or another.
The unfortunate thing about the "do something else" solution is that it becomes difficult to determine what the interfaces are for components: the details are all presumably present, but they are encoded in ways that are not particularly readable. That might not seem to matter if the existing classes are usable and readily understandable, but my own experience was that I found myself trying to understand the low-level details before I could hope to understand why the C++ API did things in a particular way, which is really the wrong way round. I must have spent hours staring at Gen_fpage and related types in one way or another.
This might just sound like me complaining, but I also have some concerns about being able to verify the behaviour of some of the code. For example, I recently found that my dataspace implementation was getting requests from a region mapper/manager with an opcode of 0x100000000, which doesn't make any sense to me at all, given that the dataspace interface code in L4Re implicitly defines opcodes that are all likely to be very small integers. At first I obviously blamed my own code, but then I found that in the IPC call implementation found here...
pk/l4re-core/l4sys/include/cxx/ipc_iface
...if I explicitly cleared the first message register before this statement...
int send_bytes =
  Args::template write_op<Do_in_data>(mrs->mr, 0, Mr_bytes, Opt::Opcode, a...);
...then the opcode was produced as expected again.
Which does not fully make sense to me, because the message registers seem to be written starting from 0. Anyway, do you have an example maybe?
I only found this to happen when a program of mine had fetched many pages from a dataspace via the RM. It would be useful to understand the conditions under which this occurs, and I obviously suspect that I must be doing something wrong, but I can't see how my dataspace implementation would corrupt the opcode sent by the RM in its own IPC messages. But I do think it is odd that somehow, the rather opaque code above doesn't manage to fully initialise the message registers.
Anyway, I will aim to continue my investigations and hopefully make some kind of progress.
Thanks for following up once again!
Paul
Hello,
Continuing my bad habit of following up to my own messages...
On Tuesday, 19 April 2022 01:20:30 CEST Paul Boddie wrote:
Doing some "old school" tracing in various routines like transfer_msg_items by detecting the appropriate fault conditions, enabling tracing, and then producing output on the console doesn't seem to yield any indications of the messages being processed, but perhaps I misunderstand the flow of control from Thread::handle_page_fault_pager within the kernel when the IPC is initiated.
Reviewing the old thread on this broader topic, I found this advice:
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2014/015441.html
This does yield page fault trace log entries of the following form:
pf: 00bc pfa=0000000001000ae3 ip=0000000001000ae3 (rp) spc=0xffffffff13dc5ad8 err=15
Here, I presume that the error is R_aborted (src/abi/l4_error.cpp), meaning that the page fault was not handled.
I see that formatter_pf (in src/kern/tb_entry_output.cc) is responsible for this form of output. Similarly, formatter_ipc and formatter_ipc_res seem to be responsible for the IPC-related log entries.
The log entries appear to be populated when Thread::handle_page_fault (in src/kern/thread-pagefault.cpp) calls page_fault_log (in src/kern/thread-log.cpp), which populates a Tb_entry_pf instance with the given error details. So, it seems like the reported error is that causing the page fault, not any kind of eventual outcome.
Looking at advice about IPC tracing...
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2014/015475.html
...I can also get log entries that I think might indicate some element of success in terms of the messages being sent, with this being the fault message:
00be answ [fffffffffffe0002] L=8000fcf3 err=0 (OK) (1000ae5,1000ae3)
ipc: 00be wait->[C:INV] DID=be L=0 TO=INF
And elsewhere:
00be answ [00000040] L=0 err=0 (OK) (1000038,400595)
ipc: 00be send rcap->[C:INV] DID=bc L=0 [0000000000000040] (0000000001000038,0000000000400595) TO=INF
Here, I am attempting to resolve the page fault caused by execution at 0x1000ae3 by sending a flexpage providing memory in one task at 0x400000 to the recipient at 0x1000000.
None of this really explains why the page fault handler keeps getting called with the same details, sending the same message, and so on.
I still can't explain this. Doing some more invasive debugging in Task::sys_map (in src/kern/task.cpp) indicates that an explicit l4_task_map call will cause fpage_map (in src/kern/map_util.cpp) and subsequently mem_map (in src/kern/map_util-mem.cpp) and then map (in src/kern/map_util.cpp) to be called, attempting to map 0x400000 (with size 0x400000) in the original task at 0x1000000 in the recipient. This is reportedly successful.
With page fault handling, the fpage_map, mem_map and map functions are called similarly, with the same supposedly successful outcome. But the page fault continues to occur with the same details, as if the mapping did not actually happen. I did wonder if it might be due to the original task not really having the pages it is trying to "export" actually mapped in itself (this being a potential pitfall when implementing dataspaces, in my experience), but this is not the case.
I suppose I must be overlooking something else...
Paul
Hi Paul,
On Tue Apr 19, 2022 at 01:20:30 +0200, Paul Boddie wrote:
On Monday, 18 April 2022 23:26:03 CEST you wrote:
Hi Paul,
On Tue Apr 12, 2022 at 01:09:40 +0200, Paul Boddie wrote:
OK, I did see that function being used, too, but I also found plenty of other things in my perusal of the different files. Obviously, being able to extend the UTCB memory is an important consideration.
It is, because, for example, one might not know in advance how many threads a task will have; in particular, the component that creates the task may not know.
Right. And so, in the Moe_app_model the size of the UTCB is dependent on the default number of threads. It still seems that I have to provide a UTCB flexpage to l4_factory_create_task, however.
Yes. Both are different levels of APIs. l4_factory_create_task / L4::Factory::create_task is a kernel API, while Moe_app_model is a user-level abstraction that is using those APIs.
Right. I see that the factory function actually sends the flexpage in the IPC call (using l4_factory_create_add_fpage_u), thus mapping it in the task. I find it hard to follow where this message is actually handled (I presume that Moe acts as the factory) or what the factory actually does with the flexpage, but I presume that it ultimately causes it to be mapped in the new task.
It is handled in Fiasco.
OK. I think I found this in Task::create (src/kern/task.cpp).
Yes, that's there.
However, I wonder about the "chicken and egg" situation in new tasks. It seems to me that the way things work is that a new task in L4Re is typically populated with the l4re binary containing the region mapper/manager (RM). This seems to be initiated here (in launch_loader):
Yes, the l4re binary loads the application and then serves as its pager.
OK. And it seems that the RM provided by this binary is able to indicate the receive window for flexpages when asking for mappings from dataspaces.
Yes, it is.
[Pagers for other tasks]
That's how it works. Moe also has region managers that are used for the l4re binary to be paged. When a page fault is resolved, then there is someone sending memory via a flexpage to the task in question. In our case it's the dataspace manager which sends the memory via a 'map' call. Here it does not matter whether the l4re binary faulted or the application, because in the end the task is the receiver of the flexpage, not the particular application (both of which are running in the same task).
So, the one thing I didn't understand until I started digging around in the Fiasco sources and also implementing my own page fault handler was the scope of the receive window for issued flexpages, but it appears that the whole address space is indicated as the receive window in Thread::handle_page_fault_pager (src/kern/thread-ipc.cpp).
Yes, for page faults this is the case, i.e., the kernel allows the pager to map into the whole address space to resolve the page fault. There is no restriction made by the kernel there, and the page fault handler has control over the virtual memory space of the task in any case.
[Mapping memory using l4_task_map]
Jdb has facilities to check what the address spaces look like, exactly to debug issues like you describe. You can press 's' to see all the tasks in the system, navigate onto them, and then press 'p' to see the page-table view. Here you can navigate the page tables and verify that the pages at some virtual address are actually pointing to the physical location they should point at. For a particular physical address (page frame number) you can also show the mapping hierarchy via the 'm' key.
The "coherency" problems actually turned out to be me forgetting the appropriate alignment for the mapped flexpages. But I have discovered a few things about jdb in my attempts to troubleshoot my code, including 's' to look at tasks. I found the page table view bewildering, though, rather hoping for a nice summary of mapped pages instead. However, the object space view was very useful in establishing that capabilities were being mapped.
Yeah, the page-table view really shows lots of tables :) I agree that a list of mapped regions would also be nice to have.
I did manage to get l4_task_map to work between two fully initialised tasks created by Ned as follows:
l4_fpage_t payload_fpage = l4_fpage((l4_addr_t) buf, l4util_log2(L4_PAGESIZE * NUM_PAGES), L4_FPAGE_RW);
l4_task_map(recipient, L4RE_THIS_TASK_CAP, payload_fpage, 0x3000000);
This permitted the recipient task to access the memory in the appropriately aligned buf region, this being mapped into the region starting at 0x3000000 in the recipient.
I also managed to achieve the same thing using IPC between the two tasks instead, having the recipient indicate a receive window in the buffer registers as follows:
br[0] = l4_map_control(0, 0, L4_MAP_ITEM_MAP);
br[1] = l4_fpage(0x3000000, l4util_log2(L4_PAGESIZE * NUM_PAGES), 0).raw;
And then constructing the send flexpage as follows:
mr[0] = l4_map_control(0, L4_FPAGE_CACHEABLE, 0);
mr[1] = l4_fpage((l4_addr_t) buf, l4util_log2(L4_PAGESIZE * NUM_PAGES), L4_FPAGE_RW).raw;
Alternatively, setting the receive window to the entire address space...
br[1] = l4_fpage_all().raw;
...and indicating a different send base also worked:
mr[0] = l4_map_control(0x3000000, L4_FPAGE_CACHEABLE, 0);
This approximates the l4_task_map scenario.
However, trying this with a completely newly created task, I cannot seem to get l4_task_map to map memory into the task, and my page fault handler does not seem to be able to respond to a page fault message with such a flexpage and have the request satisfied. This means that there is some detail I am overlooking, but I have yet to determine what it is.
One thing I have tried to do is to get Fiasco to report what it is doing when processing page fault messages, but this is quite challenging. Really, I just want to establish whether the reply to a page fault message gets interpreted correctly and causes mappings to be established in the page tables.
Doing some "old school" tracing in various routines like transfer_msg_items by detecting the appropriate fault conditions, enabling tracing, and then producing output on the console doesn't seem to yield any indications of the messages being processed, but perhaps I misunderstand the flow of control from Thread::handle_page_fault_pager within the kernel when the IPC is initiated.
Reviewing the old thread on this broader topic, I found this advice:
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2014/015441.html
This does yield page fault trace log entries of the following form:
pf: 00bc pfa=0000000001000ae3 ip=0000000001000ae3 (rp) spc=0xffffffff13dc5ad8 err=15
Here, I presume that the error is R_aborted (src/abi/l4_error.cpp), meaning that the page fault was not handled.
Looking at advice about IPC tracing...
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2014/015475.html
...I can also get log entries that I think might indicate some element of success in terms of the messages being sent, with this being the fault message:
00be answ [fffffffffffe0002] L=8000fcf3 err=0 (OK) (1000ae5,1000ae3)
ipc: 00be wait->[C:INV] DID=be L=0 TO=INF
And elsewhere:
00be answ [00000040] L=0 err=0 (OK) (1000038,400595)
ipc: 00be send rcap->[C:INV] DID=bc L=0 [0000000000000040] (0000000001000038,0000000000400595) TO=INF
Here, I am attempting to resolve the page fault caused by execution at 0x1000ae3 by sending a flexpage providing memory in one task at 0x400000 to the recipient at 0x1000000.
None of this really explains why the page fault handler keeps getting called with the same details, sending the same message, and so on.
[...]
For sure this is an area where the code is pretty involved. Back in the old days we had an IDL compiler that grew and grew and was in the end not easy to maintain. When we switched to the capability system, and thus changed all the APIs, we had the choice of either adapting the IDL compiler or doing something different. Back then it was a major hassle to parse the input, because eventually one wants the whole language understood by the IDL compiler so as to be able to use all sorts of types (of course one could make compromises there). With C / C++ that is not so easy, at least it wasn't back then. Now there's LLVM, and that's a major improvement in this area, but still the actual tool needs to be implemented and maintained. As we all see, we opted for the "do something else" option. With C++ as our main language and the possibilities it offers, there was the idea to implement the "IDL thing" purely in C++, directly in the code. That's what we have now. For me, all the code around it is the IDL compiler; abstractly, if it were not in header files it would sit somewhere else, but it would be there in one form or another.
The unfortunate thing about the "do something else" solution is that it becomes difficult to determine what the interfaces are for components: the details are all presumably present, but they are encoded in ways that are not particularly readable. That might not seem to matter if the existing classes are usable and readily understandable, but my own experience was that I found myself trying to understand the low-level details before I could hope to understand why the C++ API did things in a particular way, which is really the wrong way round. I must have spent hours staring at Gen_fpage and related types in one way or another.
Yes, this is involved. I'm convinced that the code handling all this needs to be somewhere. For sure it can be argued how it is written, but the overall details seem to be as they are.
This might just sound like me complaining, but I also have some concerns about being able to verify the behaviour of some of the code. For example, I recently found that my dataspace implementation was getting requests from a region mapper/manager with an opcode of 0x100000000, which doesn't make any sense to me at all, given that the dataspace interface code in L4Re implicitly defines opcodes that are all likely to be very small integers. At first I obviously blamed my own code, but then I found that in the IPC call implementation found here...
pk/l4re-core/l4sys/include/cxx/ipc_iface
...if I explicitly cleared the first message register before this statement...
int send_bytes =
  Args::template write_op<Do_in_data>(mrs->mr, 0, Mr_bytes, Opt::Opcode, a...);
...then the opcode was produced as expected again.
Which does not fully make sense to me, because the message registers seem to be written starting from 0. Anyway, do you have an example maybe?
I only found this to happen when a program of mine had fetched many pages from a dataspace via the RM. It would be useful to understand the conditions under which this occurs, and I obviously suspect that I must be doing something wrong, but I can't see how my dataspace implementation would corrupt the opcode sent by the RM in its own IPC messages. But I do think it is odd that somehow, the rather opaque code above doesn't manage to fully initialise the message registers.
Anyway, I will aim to continue my investigations and hopefully make some kind of progress.
Thanks!
On Thu Apr 21, 2022 at 01:14:02 +0200, Paul Boddie wrote:
Hello,
Continuing my bad habit of following up to my own messages...
On Tuesday, 19 April 2022 01:20:30 CEST Paul Boddie wrote:
Doing some "old school" tracing in various routines like transfer_msg_items by detecting the appropriate fault conditions, enabling tracing, and then producing output on the console doesn't seem to yield any indications of the messages being processed, but perhaps I misunderstand the flow of control from Thread::handle_page_fault_pager within the kernel when the IPC is initiated.
Reviewing the old thread on this broader topic, I found this advice:
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2014/015441.html
This does yield page fault trace log entries of the following form:
pf: 00bc pfa=0000000001000ae3 ip=0000000001000ae3 (rp) spc=0xffffffff13dc5ad8 err=15
Here, I presume that the error is R_aborted (src/abi/l4_error.cpp), meaning that the page fault was not handled.
'err' is the error code of the page fault from the CPU, as described in the x86 architecture manual (in Intel's, chapter 4.7). It is a hex number, and it says user-mode access, page not present, and instruction fetch (that's also what the 'rp' says: read fault, not present).
I see that formatter_pf (in src/kern/tb_entry_output.cc) is responsible for this form of output. Similarly, formatter_ipc and formatter_ipc_res seem to be responsible for the IPC-related log entries.
The log entries appear to be populated when Thread::handle_page_fault (in src/kern/thread-pagefault.cpp) calls page_fault_log (in src/kern/thread-log.cpp), which populates a Tb_entry_pf instance with the given error details. So, it seems like the reported error is that causing the page fault, not any kind of eventual outcome.
Looking at advice about IPC tracing...
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2014/015475.html
...I can also get log entries that I think might indicate some element of success in terms of the messages being sent, with this being the fault message:
00be answ [fffffffffffe0002] L=8000fcf3 err=0 (OK) (1000ae5,1000ae3)
ipc: 00be wait->[C:INV] DID=be L=0 TO=INF
And elsewhere:
00be answ [00000040] L=0 err=0 (OK) (1000038,400595)
ipc: 00be send rcap->[C:INV] DID=bc L=0 [0000000000000040] (0000000001000038,0000000000400595) TO=INF
Here, I am attempting to resolve the page fault caused by execution at 0x1000ae3 by sending a flexpage providing memory in one task at 0x400000 to the recipient at 0x1000000.
None of this really explains why the page fault handler keeps getting called with the same details, sending the same message, and so on.
I still can't explain this. Doing some more invasive debugging in Task::sys_map (in src/kern/task.cpp) indicates that an explicit l4_task_map call will cause fpage_map (in src/kern/map_util.cpp) and subsequently mem_map (in src/kern/map_util-mem.cpp) and then map (in src/kern/map_util.cpp) to be called, attempting to map 0x400000 (with size 0x400000) in the original task at 0x1000000 in the recipient. This is reportedly successful.
With page fault handling, the fpage_map, mem_map and map functions are called similarly, with the same supposedly successful outcome. But the page fault continues to occur with the same details, as if the mapping did not actually happen. I did wonder if it might be due to the original task not really having the pages it is trying to "export" actually mapped in itself (this being a potential pitfall when implementing dataspaces, in my experience), but this is not the case.
I suppose I must be overlooking something else...
Maybe it helps to share some code here?
Adam
Adam,
Thanks once again for indulging me!
On Thursday, 21 April 2022 22:57:40 CEST Adam Lackorzynski wrote:
On Tue Apr 19, 2022 at 01:20:30 +0200, Paul Boddie wrote:
I still can't explain this. Doing some more invasive debugging in Task::sys_map (in src/kern/task.cpp) indicates that an explicit l4_task_map call will cause fpage_map (in src/kern/map_util.cpp) and subsequently mem_map (in src/kern/map_util-mem.cpp) and then map (in src/kern/map_util.cpp) to be called, attempting to map 0x400000 (with size 0x400000) in the original task at 0x1000000 in the recipient. This is reportedly successful.
With page fault handling, the fpage_map, mem_map and map functions are called similarly, with the same supposedly successful outcome. But the page fault continues to occur with the same details, as if the mapping did not actually happen. I did wonder if it might be due to the original task not really having the pages it is trying to "export" actually mapped in itself (this being a potential pitfall when implementing dataspaces, in my experience), but this is not the case.
I suppose I must be overlooking something else...
Maybe it helps to share some code here?
I suppose it might be best to condense this into an example that uses the basic L4Re APIs instead of my own libraries (that wrap those APIs), so as not to confuse things. I'll try and put that together tomorrow.
Paul
On Friday, 22 April 2022 01:16:44 CEST Paul Boddie wrote:
On Thursday, 21 April 2022 22:57:40 CEST Adam Lackorzynski wrote:
Maybe it helps to share some code here?
I suppose it might be best to condense this into an example that uses the basic L4Re APIs instead of my own libraries (that wrap those APIs), so as not to confuse things. I'll try and put that together tomorrow.
So, I finally got to looking at this and making it self-contained, with an archive of the code available here:
https://www.boddie.org.uk/downloads/tests_l4re.tar.bz2
This is a package, intended to work within the L4Re build system, containing two source files: a trivial program (exec_payload_l4re.c) and a loader program that attempts to create a new task to load and start the trivial program (exec_l4re.cc). Both programs are statically linked.
Note that the loader program does not do all the work to start the program, focusing only on being able to resolve the initial page fault. So, although it attempts to set up various capabilities and other resources, it doesn't even initialise the stack or map in other regions that would be needed to actually run the program.
The log when running the loader program using the supplied configuration will report page faults occurring as follows:
exec_l4r| page_fault(0x1000ae4, 0x1000ae3) -> 0x1000ae0 (0x4)...
exec_l4r| -> l4_fpage(0x40000, 18, 0x5) @ 1000000
exec_l4r| page_fault(0x1000ae5, 0x1000ae3) -> 0x1000ae0 (0x5)...
exec_l4r| -> l4_fpage(0x40000, 18, 0x5) @ 1000000
Here, the received fault is decoded and the flexpage for returning is described. When I switch on page fault and IPC logging (P*, I*, IR+) in jdb, I get entries like this:
0047 answ [00000040] L=0 err=0 (OK) (1000038,40495)
pf: 0043 pfa=0000000001000ae3 ip=0000000001000ae3 (rp) spc=0xffffffff13dc5cb8 err=15
ipc: 0047 send rcap->[C:INV] DID=43 L=0 [0000000000000040] (0000000001000038,0000000000040495) TO=INF
Here, thread 0047 belongs to the loader program, and 0043 belongs to the trivial payload program.
It is entirely possible that I am not sufficiently initialising the created task, but it surprises me that the page fault does not get resolved, and I did some old-fashioned debugging with Fiasco to establish that the mapping does occur. One difference between the above and my previous report is that the mapped region is now significantly smaller, since I am now using a binary file installed in the "rom" as a module, not the initially generated ELF binary.
When looking for other code that might be doing this kind of thing, I remembered L4Linux, which seems to use the C APIs. Much of what it does is similar to what I have been trying, although it seems to use the "alien" thread mechanism at various points, which I don't fully understand.
Anyway, I hope that I am making some obvious mistake that just hasn't been obvious to me. Although there were examples in the package that was available via Subversion (and maybe still is available via Git), they covered alien threads and the vCPU support, but I couldn't find anything related to task creation in its own right. Interestingly, I did find this:
https://github.com/phipse/L4RTEMS/blob/master/l4/pkg/RTEMS_wrapper/server/sr... wrapper_1.cc
But it is also using the vCPU feature. However, it does seem to attempt ELF header decoding and related activities, so it might still prove to be a useful reference.
Thanks once again for listening!
Paul
On Sat Apr 23, 2022 at 00:40:05 +0200, Paul Boddie wrote:
On Friday, 22 April 2022 01:16:44 CEST Paul Boddie wrote:
On Thursday, 21 April 2022 22:57:40 CEST Adam Lackorzynski wrote:
Maybe it helps to share some code here?
I suppose it might be best to condense this into an example that uses the basic L4Re APIs instead of my own libraries (that wrap those APIs), so as not to confuse things. I'll try and put that together tomorrow.
So, I finally got to looking at this and making it self-contained, with an archive of the code available here:
https://www.boddie.org.uk/downloads/tests_l4re.tar.bz2
This is a package, intended to work within the L4Re build system, containing two source files: a trivial program (exec_payload_l4re.c) and a loader program that attempts to create a new task to load and start the trivial program (exec_l4re.cc). Both programs are statically linked.
Note that the loader program does not do all the work to start the program, focusing only on being able to resolve the initial page fault. So, although it attempts to set up various capabilities and other resources, it doesn't even initialise the stack or map in other regions that would be needed to actually run the program.
The log when running the loader program using the supplied configuration will report page faults occurring as follows:
exec_l4r| page_fault(0x1000ae4, 0x1000ae3) -> 0x1000ae0 (0x4)...
exec_l4r| -> l4_fpage(0x40000, 18, 0x5) @ 1000000
exec_l4r| page_fault(0x1000ae5, 0x1000ae3) -> 0x1000ae0 (0x5)...
exec_l4r| -> l4_fpage(0x40000, 18, 0x5) @ 1000000
Here, the received fault is decoded and the flexpage for returning is described. When I switch on page fault and IPC logging (P*, I*, IR+) in jdb, I get entries like this:
0047 answ [00000040] L=0 err=0 (OK) (1000038,40495)
pf:  0043 pfa=0000000001000ae3 ip=0000000001000ae3 (rp) spc=0xffffffff13dc5cb8 err=15
ipc: 0047 send rcap->[C:INV] DID=43 L=0 [0000000000000040] (0000000001000038,0000000000040495) TO=INF
Here, thread 0047 belongs to the loader program, and 0043 belongs to the trivial payload program.
It is entirely possible that I am not sufficiently initialising the created task, but it surprises me that the page fault does not get resolved, and I did some old-fashioned debugging with Fiasco to establish that the mapping does occur. One difference between the above and my previous report is that the mapped region is now significantly smaller, since I am now using a binary file installed in the "rom" as a module, not the initially generated ELF binary.
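The fault-handling loop described above can be sketched in pseudocode, modelled loosely on the L4 pager protocol (names such as payload_start, payload_log2size and task_program_start are placeholders from this discussion, not actual API symbols):

```
/* pseudocode: loader acting as pager for the new task */
loop:
  tag = ipc_wait(&label)              /* wait for a page-fault IPC */
  pfa = MR[0]                         /* fault address plus access-type bits */
  ip  = MR[1]                         /* faulting instruction pointer */
  log("page_fault(%lx, %lx)", pfa, ip)

  /* reply with a flexpage covering the payload in the loader,
     to be mapped at the configured location in the child task */
  MR[0] = send_base(task_program_start)
  MR[1] = fpage(payload_start, payload_log2size, RX)
  ipc_reply(map_item)
```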
Thanks for the example.
I believe I see the issue, but first: I immediately changed buf_log2size to 12 for the sake of less surprise, and did not change the sizes any further. Then I noticed that your way of handling the UTCB involved allocating memory. That's not needed. With the fpage you specify the window for the UTCB memory in the other task, so there is no need to allocate memory in the launcher task, if it was meant to reserve the virtual memory.
Then, the issue is that posix_memalign allocates memory which does not have the x-bit set, i.e., memory that is not executable. Change it to

  buf = (char *)mmap(NULL, region_size, PROT_EXEC | PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
  if (buf == MAP_FAILED)
  {
    printf("Could not reserve memory.\n");
    return 1;
  }

and it should work (it did for me).
Adam
On Monday, 25 April 2022 01:04:44 CEST Adam Lackorzynski wrote:
Thanks for the example.
Thanks for looking at it! I appreciate the help.
I believe I see the issue, but first: I immediately changed buf_log2size to 12 for the sake of less surprise, and did not change the sizes any further. Then I noticed that your way of handling the UTCB involved allocating memory. That's not needed. With the fpage you specify the window for the UTCB memory in the other task, so there is no need to allocate memory in the launcher task, if it was meant to reserve the virtual memory.
I didn't really understand this when looking through the existing code. It seemed that the memory was reserved, and that seemed to involve telling a region manager/mapper about it, such as in Remote_app_model where the prog_reserve_utcb_area method appears to attach an invalid dataspace (obtained from the reserved_area method) to an existing RM.
Meanwhile, the l4_factory_create_task function accepts a flexpage as parameter whose details are then provided in the IPC message. As you noted before, Fiasco is meant to handle this flexpage. And it does appear that if I just remove the dataspace allocation and provide the flexpage details, the UTCB gets set up in the new task at the appropriate location.
Then, the issue is that posix_memalign allocates memory which does not have the x-bit set, i.e., memory that is not executable. Change it to

  buf = (char *)mmap(NULL, region_size, PROT_EXEC | PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
  if (buf == MAP_FAILED)
  {
    printf("Could not reserve memory.\n");
    return 1;
  }

and it should work (it did for me).
This seems like the obvious thing that I couldn't see: that the memory needs to have the appropriate permissions associated with it. Well, it seems a bit more obvious now!
For the larger region I had in mind, just to keep things simple, mmap is a bit cumbersome because it only supports page-level alignment, so I used the L4Re memory allocator to get a dataspace that I could attach at a suitably aligned address. I imagine that if the parent task were to terminate, having an independently allocated dataspace would be desirable, too.
Thanks once again for the guidance, and sorry I didn't see my mistake!
Paul
Hi Paul,
On Mon Apr 25, 2022 at 17:55:45 +0200, Paul Boddie wrote:
On Monday, 25 April 2022 01:04:44 CEST Adam Lackorzynski wrote:
Thanks for the example.
Thanks for looking at it! I appreciate the help.
I believe I see the issue, but first: I immediately changed buf_log2size to 12 for the sake of less surprise, and did not change the sizes any further. Then I noticed that your way of handling the UTCB involved allocating memory. That's not needed. With the fpage you specify the window for the UTCB memory in the other task, so there is no need to allocate memory in the launcher task, if it was meant to reserve the virtual memory.
I didn't really understand this when looking through the existing code. It seemed that the memory was reserved, and that seemed to involve telling a region manager/mapper about it, such as in Remote_app_model where the prog_reserve_utcb_area method appears to attach an invalid dataspace (obtained from the reserved_area method) to an existing RM.
Generally, the RM API has a reserve_area call which should be used for this.
Meanwhile, the l4_factory_create_task function accepts a flexpage as parameter whose details are then provided in the IPC message. As you noted before, Fiasco is meant to handle this flexpage. And it does appear that if I just remove the dataspace allocation and provide the flexpage details, the UTCB gets set up in the new task at the appropriate location.
Yes, a dataspace has nothing to do with this.
Then, the issue is that posix_memalign allocates memory which does not have the x-bit set, i.e., memory that is not executable. Change it to

  buf = (char *)mmap(NULL, region_size, PROT_EXEC | PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
  if (buf == MAP_FAILED)
  {
    printf("Could not reserve memory.\n");
    return 1;
  }

and it should work (it did for me).
This seems like the obvious thing that I couldn't see: that the memory needs to have the appropriate permissions associated with it. Well, it seems a bit more obvious now!
For the larger region I had in mind, just to keep things simple, mmap is a bit cumbersome because it only supports page-level alignment, so I used the L4Re memory allocator to get a dataspace that I could attach at a suitably aligned address. I imagine that if the parent task were to terminate, having an independently allocated dataspace would be desirable, too.
Yes, of course using a dataspace is also totally fine.
Adam
On Thursday, 28 April 2022 23:21:46 CEST Adam Lackorzynski wrote:
On Mon Apr 25, 2022 at 17:55:45 +0200, Paul Boddie wrote:
I didn't really understand this when looking through the existing code. It seemed that the memory was reserved, and that seemed to involve telling a region manager/mapper about it, such as in Remote_app_model where the prog_reserve_utcb_area method appears to attach an invalid dataspace (obtained from the reserved_area method) to an existing RM.
Generally, the RM API has a reserve_area call which should be used for this.
Thanks once again for the clarification! I suppose I should ask the more general question of whether the different L4Re mechanisms are documented in a manual or something of that nature. The reference documentation covers some of the concepts, along with APIs for use by applications, but not necessarily the various framework-level abstractions (like Remote_app_model), so it would be informative to be able to read something that discusses the techniques involved in building a system on top of Fiasco.
At the moment, I am just trying to persuade a program to run in a new task, which will hopefully only be a matter of properly initialising the details of the program environment. As well as paging in the appropriate segments of the binary in the right places, of course.
Paul
Hi Paul,
On Fri Apr 29, 2022 at 00:28:10 +0200, Paul Boddie wrote:
On Thursday, 28 April 2022 23:21:46 CEST Adam Lackorzynski wrote:
On Mon Apr 25, 2022 at 17:55:45 +0200, Paul Boddie wrote:
Thanks once again for the clarification! I suppose I should ask the more general question of whether the different L4Re mechanisms are documented in a manual or something of that nature. The reference documentation covers some of the concepts, along with APIs for use by applications, but not necessarily the various framework-level abstractions (like Remote_app_model), so it would be informative to be able to read something that discusses the techniques involved in building a system on top of Fiasco.
I'm afraid that for this particular area there is no better documentation, I believe.
Adam
Adam,
On Wednesday, 4 May 2022 00:25:42 CEST Adam Lackorzynski wrote:
On Fri Apr 29, 2022 at 00:28:10 +0200, Paul Boddie wrote:
I'm afraid that for this particular area there is no better documentation, I believe.
Thanks for following up again!
Currently, I am just exploring various abstractions of my own for initialising the environment of new tasks. Previously, I had to deal with some of this when getting Newlib to work on top of L4Re, so it isn't completely unfamiliar, although it can be quite difficult to remember where to find the details, especially given that it has been a couple of years or longer since I last looked at this. Since I seem to have program arguments working, I think I must be doing something right!
The present situation still involves having a pager in the "parent" task also acting as a simple region mapper, sending flexpages corresponding to parts of allocated and populated memory regions in the parent to the "child" task as it encounters page faults. This allows the different segments of the payload to be exposed to the child task and for the code/data to be loaded in on demand, as expected.
What I probably want to do, I think, is to deploy such code within the child task so as to be able to take advantage of defining receive windows and then requesting mappings from dataspaces directly. Then, I imagine that I would start a separate thread for the actual program to be run, with it configured to use the RM in the same task as its pager. I imagine that would then mostly reproduce what the l4re_kernel (l4re program) is doing.
So, there is still some work to be done, I think, but progress is being made. Thanks once again for the help!
Paul
On Monday, 18 April 2022 23:26:03 CEST Adam Lackorzynski wrote:
On Tue Apr 12, 2022 at 01:09:40 +0200, Paul Boddie wrote:
This might just sound like me complaining, but I also have some concerns about being able to verify the behaviour of some of the code. For example, I recently found that my dataspace implementation was getting requests from a region mapper/manager with an opcode of 0x100000000, which doesn't make any sense to me at all, given that the dataspace interface code in L4Re implicitly defines opcodes that are all likely to be very small integers. At first I obviously blamed my own code, but then I found that in the IPC call implementation found here...
pk/l4re-core/l4sys/include/cxx/ipc_iface
This was obviously...
pkg/l4re-core/l4sys/include/cxx/ipc_iface
...if I explicitly cleared the first message register before this statement...
int send_bytes = Args::template write_op<Do_in_data>(mrs->mr, 0, Mr_bytes, Opt::Opcode, a...);
...then the opcode was produced as expected again.
Which does not fully make sense to me because the message registers seem to be written from 0 on. Anyway, do you have an example maybe?
I just spent quite some time seeing errors like this...
ext2svr | L4Re[rm]: mapping for page fault failed with error -39 at 0x1002fbc00 pc=0x10b7804
ext2svr | L4Re: rom/ext2_server: Unhandled exception: PC=0x10b7804 PFA=0x1002fbc00 LdrFlgs=0x0
-39 being -L4_EBADPROTO (unsupported protocol), of course.
And then I remembered that there was some curious IPC-related problem I had encountered a while ago. It turns out that it was this again, and since I had obtained a fresh checkout of the L4Re code and had forgotten that this could be a problem, I had revived it! Introducing my workaround eliminated the error.
Do you have any ideas as to why the first message register gets corrupted?
Thanks in advance for any guidance!
Paul
Hi Paul,
On Sun Aug 21, 2022 at 00:18:57 +0200, Paul Boddie wrote:
On Monday, 18 April 2022 23:26:03 CEST Adam Lackorzynski wrote:
On Tue Apr 12, 2022 at 01:09:40 +0200, Paul Boddie wrote:
Do you have any ideas as to why the first message register gets corrupted?
No, still not. Any chance I could see a small example of this?
Adam
On Sunday, 28 August 2022 23:07:38 CEST Adam Lackorzynski wrote:
On Sun Aug 21, 2022 at 00:18:57 +0200, Paul Boddie wrote:
Do you have any ideas as to why the first message register gets corrupted?
No, still not. Any chance I could see a small example of this?
I'll try and package up what I've been doing so that it can be more readily investigated. I was actually in the middle of this packaging process when I discovered the problem once again.
Ideally, I would be able to make a very compact example that exhibits this behaviour, but it would probably need a bit of extra contemplation about why it occurs. Obviously, the region manager provided by the l4re binary is trying to send requests to a dataspace, but it is somehow producing the wrong IPC opcodes, and it does so after a number of requests have occurred. On this most recent occasion, the number of requests was reproducible, so I wonder a bit about that (and about my own code's involvement in it occurring, of course).
Anyway, I'll try and get something out there to look at in the next few days. Sorry not to have done that already, but I just wondered if there had been any insights into the matter, discoveries related to the code, and so on.
Thanks once again,
Paul
On Monday, 29 August 2022 00:37:31 CEST Paul Boddie wrote:
On Sunday, 28 August 2022 23:07:38 CEST Adam Lackorzynski wrote:
On Sun Aug 21, 2022 at 00:18:57 +0200, Paul Boddie wrote:
I'll try and package up what I've been doing so that it can be more readily investigated. I was actually in the middle of this packaging process when I discovered the problem once again.
Following up, I decided to give my code a test in 32-bit x86 and MIPS virtual machines which caused the problem to be much more pronounced. This led me to review a few things where I had misread the definitions of certain types (in pkg/l4re-core/l4sys/include/l4int.h). However, I think that the nature of the problem is actually as follows.
When a map request is sent by the L4Re region mapper, the IPC framework pieces together the necessary message. What isn't entirely obvious is the nature of the opcode being used. I originally thought that it was of type l4_umword_t, and dumping the bytes in the message, it does appear that the opcode is actually a 32-bit value (compatible with l4_umword_t on a 32-bit platform) but that the first operand only appears after an initial 64-bit unit containing the opcode, even on a 32-bit platform.
This might not produce problems on 64-bit platforms, although my original report did concern such a platform, but problems are immediately evident on a 32-bit platform. For example, here is a map request on x86 that caused problems:
00 00 00 00 ce e3 08 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...
----------- ----------- ----------------------- -----------------------
opcode      ???         offset                  hot_spot
So, when the IPC framework writes the opcode, it fills the first 32-bit unit but apparently not the second half of the initial 64-bit unit. Consequently, the contents of the second half of the word appear to persist from whatever they were previously. Hence my annotation of "???" above. It looks like an address from the program, perhaps an earlier page fault address.
Often, zero bytes are involved, thus preserving the appropriate behaviour even on 64-bit platforms where the opcode could be interpreted as the first 64-bit (l4_umword_t) unit. Zeroing the first message register (as I noted in an earlier message) fixed any cases where the prior non-zero contents leaked into the message and corrupted the opcode when considered to be a 64-bit value.
The problem on 32-bit platforms is that the operands are displaced and are not found after the first 32-bit word. In the above example, this causes the offset operand to be misinterpreted, along with the values that follow it. Obviously, this brings programs needing dataspaces to a halt very quickly.
The above behaviour contradicts the way the IPC messages are constructed by Lua, for example. What I see on amd64/x86-64 is different from on x86 for the same message payload. For example, an invocation of a factory create operation with an opcode of 6:
On amd64:
06 00 00 00 00 00 00 00 81 00 08 00 00 00 00 00 e8 03 00 00 00 00 00 00 ...
----------------------- ----- ----- ----------- -----------------------
opcode                  type  size  unused      value (1000)
On x86:
06 00 00 00 81 00 04 00 e8 03 00 00 ...
----------- ----- ----- -----------
opcode      type  size  value (1000)
Here, the opcode is dependent on the word size, and the Lua code is happy to use the word size for the operands with no padding or gaps being introduced. Other IPC messages also appear to use the word size for the opcode. For example, when attach operations are invoked on the region mapper I have implemented, the opcode is only a 32-bit value with no trailing data before the initial operands.
I imagine that none of this would manifest itself if I used precisely the same libraries and/or code as other L4Re components, but then that rather makes the system monolithic. There should be a degree of interoperability based on message specifications and interface descriptions, and there should be some consistency, too. What I have seen is that the dataspace IPC is not consistent with other IPC.
Paul
Hi Paul,
thanks for the detailed explanation and sorry for the long wait. I waded through the templates and am wondering how this happens. I too am of the opinion that on a 32-bit platform the opcode is a 32-bit number and that it is always written to the first field in the MRs. (Currently, I'd say it's an int.)
I need to build a test to reproduce this and figure out what happens within the templates, but I'm not sure whether I understood the details of your explanation, so let me put it in my own words:
* You observe the opcode of a Dataspace::map operation to be written into a 64bit field in the MRs, but only in the lower 32bit. The upper 32bit remain the old value.
* This happens only with IPC using the RPC framework. In IPC using code that is written by hand - like L4::Factory.create() or L4::Task.map() - the opcode-field is a 32bit MR; on 64-bit architectures this is a 64bit value and field respectively.
* As far as I understood you are writing the client-side of the Dataspace.map() request yourself and do not use the RPC framework? With that I mean writing the opcode to mr[0], the offset to mr[1], and so on.
But the server side then reads the mr[0] and mr[1] together as one 64bit value and then replies with EBADPROTO?
On this last point I'm a bit lost on which side does what exactly. Can you maybe write a bit of pseudo code on what happens on client and what on server side to help me understand?
Cheers Philipp
Hi Paul,
I believe I have an idea of what you are observing.
00 00 00 00 ce e3 08 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...
----------- ----------- ----------------------- -----------------------
opcode      ???         offset                  hot_spot
On a 32bit architecture, the opcode is 32bit, the ??? is 32bit padding before the offset, which is - also on a 32bit architecture - a 64bit value and needs to be aligned to 64bit. The client has to adhere to the platform specific alignment constraints of the members in the data structure, as does the server side.
From what I have understood, you are implementing a dataspace provider and are observing the provider generating the error (-39/-L4_EBADPROTO)?
EBADPROTO normally means that the protocol value in the l4_msgtag_t is not supported by the server. In the case of an unsupported opcode, the server reports L4_ENOSYS. Thus, I'm a bit confused about the EBADPROTO error code and the opcode/MR issue. What am I missing?
Cheers Philipp
On 9/28/22 18:36, Philipp Eppelt wrote:
Hi Paul,
thanks for the detailed explanation and sorry for the long wait. I waded through the templates and am wondering how this happens. I too am of the opinion that on a 32bit platform the opcode is a 32-bit number and it always is written to the first field in the MRs. (Currently, I'd say it's an int).
I need to build a test for to reproduce this to figure out what happens within the templates, but I'm not sure if I understood the details of your explanation, so let me put it in my own words:
- You observe the opcode of a Dataspace::map operation to be written
into a 64bit field in the MRs, but only in the lower 32bit. The upper 32bit remain the old value.
- This happens only with IPC using the RPC framework. In IPC using code
that is written by hand - like L4::Factory.create() or L4::Task.map() - the opcode-field is a 32bit MR; on 64-bit architectures this is a 64bit value and field respectively.
- As far as I understood you are writing the client-side of the
Dataspace.map() request yourself and do not use the RPC framework? With that I mean writing the opcode to mr[0], the offset to mr[1], and so on.
But the server side then reads the mr[0] and mr[1] together as one 64bit value and then replies with EBADPROTO?
On this last point I'm a bit lost on which side does what exactly. Can you maybe write a bit of pseudo code on what happens on client and what on server side to help me understand?
Cheers Philipp
On 9/18/22 23:15, Paul Boddie wrote:
On Monday, 29 August 2022 00:37:31 CEST Paul Boddie wrote:
On Sunday, 28 August 2022 23:07:38 CEST Adam Lackorzynski wrote:
On Sun Aug 21, 2022 at 00:18:57 +0200, Paul Boddie wrote:
I just spent quite some time seeing errors like this...
ext2svr | L4Re[rm]: mapping for page fault failed with error -39 at 0x1002fbc00 pc=0x10b7804 ext2svr | L4Re: rom/ext2_server: Unhandled exception: PC=0x10b7804 PFA=0x1002fbc00 LdrFlgs=0x0
-39 being -L4_EBADPROTO (unsupported protocol), of course.
[...]
Do you have any ideas as to why the first message register gets corrupted?
No, still not. Any chance I could see a small example of this?
I'll try and package up what I've been doing so that it can be more readily investigated. I was actually in the middle of this packaging process when I discovered the problem once again.
Following up, I decided to give my code a test in 32-bit x86 and MIPS virtual machines which caused the problem to be much more pronounced. This led me to review a few things where I had misread the definitions of certain types (in pkg/l4re-core/l4sys/include/l4int.h). However, I think that the nature of the problem is actually as follows.
When a map request is sent by the L4Re region mapper, the IPC framework pieces together the necessary message. What isn't entirely obvious is the nature of the opcode being used. I originally thought that it was of type l4_umword_t, and dumping the bytes in the message, it does appear that the opcode is actually a 32-bit value (compatible with l4_umword_t on a 32-bit platform) but that the first operand only appears after an initial 64-bit unit containing the opcode, even on a 32-bit platform.
This might not produce problems on 64-bit platforms, although my original report did concern such a platform, but problems are immediately evident on a 32-bit platform. For example, here is a map request on x86 that caused problems:
00 00 00 00  ce e3 08 01  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ...
opcode       ???          offset                   hot_spot
So, when the IPC framework writes the opcode, it fills the first 32-bit unit but apparently not the second half of the initial 64-bit unit. Consequently, the contents of the second half of the word appear to persist from whatever they were previously. Hence my annotation of "???" above. It looks like an address from the program, perhaps an earlier page fault address.
Often, zero bytes are involved, thus preserving the appropriate behaviour even on 64-bit platforms where the opcode could be interpreted as the first 64-bit (l4_umword_t) unit. Zeroing the first message register (as I noted in an earlier message) fixed any cases where the prior non-zero contents leaked into the message and corrupted the opcode when considered to be a 64-bit value.
The problem on 32-bit platforms is that the operands are displaced and are not found after the first 32-bit word. In the above example, this causes the offset operand to be misinterpreted, along with the values that follow it. Obviously, this brings programs needing dataspaces to a halt very quickly.
The above behaviour contradicts the way the IPC messages are constructed by Lua, for example. What I see on amd64/x86-64 is different from on x86 for the same message payload. For example, an invocation of a factory create operation with an opcode of 6:
On amd64:
06 00 00 00 00 00 00 00  81 00  08 00  00 00 00 00  e8 03 00 00 00 00 00 00  ...
opcode                   type   size   unused       value (1000)
On x86:
06 00 00 00  81 00  04 00  e8 03 00 00  ...
opcode       type   size   value (1000)
Here, the opcode is dependent on the word size, and the Lua code is happy to use the word size for the operands with no padding or gaps being introduced. Other IPC messages also appear to use the word size for the opcode. For example, when attach operations are invoked on the region mapper I have implemented, the opcode is only a 32-bit value with no trailing data before the initial operands.
I imagine that none of this would manifest itself if I used precisely the same libraries and/or code as other L4Re components, but then that rather makes the system monolithic. There should be a degree of interoperability based on message specifications and interface descriptions, and there should be some consistency, too. What I have seen is that the dataspace IPC is not consistent with other IPC.
Paul
l4-hackers mailing list l4-hackers@os.inf.tu-dresden.de https://os.inf.tu-dresden.de/mailman/listinfo/l4-hackers
On Thursday, 29 September 2022 13:55:49 CEST Philipp Eppelt wrote:
Hi Paul,
I believe I have an idea of what you are observing.
00 00 00 00  ce e3 08 01  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ...
opcode       ???          offset                   hot_spot
On a 32-bit architecture, the opcode is 32-bit and the ??? is 32 bits of padding before the offset, which - also on a 32-bit architecture - is a 64-bit value and needs to be aligned to a 64-bit boundary.
Right. I was in the process of writing a reply to your last message - thank you for sending that! - when I started to reconsider the way my own IPC framework operates, and I realised that I had made an embarrassing mistake with regard to structure member alignment, since I use structures to access message/call parameters. As you note, there are alignment constraints for structure members...
The client has to adhere to the platform specific alignment constraints of the members in the data structure, as does the server side.
...these being that even on a 32-bit platform, any 64-bit members will be aligned to 64-bit boundaries, thus causing padding to be inserted after any members preceding them if those members do not occupy the space leading up to the 64-bit boundary. This is presumably dictated by the "alignment requirement" for each type mentioned in this document:
https://en.cppreference.com/w/c/language/object
I had been assuming that the members would be word-aligned. However, since most of my IPC messaging was between peers that interpreted messages in the same way, this was not a general problem.
Now, it is interesting that you mention the alignment requirements in the context of IPC messages prepared by the L4Re RPC framework (found in pkg/l4re-core/l4sys/include/cxx). I would have expected the framework to be serialising the values and filling up message registers, as opposed to treating the message registers as a structure, just from briefly looking at it (and also considering the other IPC mechanisms in L4Re).
From what I have understood, you are implementing a dataspace provider and are observing the provider generating the error (-39/-L4_EBADPROTO)?
EBADPROTO normally means that the protocol value in the l4_msgtag_t is not supported by the server. In the case of an unsupported opcode, the server reports L4_ENOSYS. Thus, I'm a bit confused about the EBADPROTO error code and the opcode/MR issue. What am I missing?
So, EBADPROTO would be generated by my server code upon receiving an opcode it doesn't understand. I suppose I am using the wrong error in this case: there's so much to learn about the conventions involved.
Going back to the original problem (on a 64-bit system), I did indeed get a "corrupt" opcode that caused my server code to return EBADPROTO. On a 32-bit system, what happens instead is that the opcode can be interpreted as a machine word (l4_umword_t), but that the structure padding displaces the parameters.
I suppose what I have been clumsily trying to clarify (and dragging you into this) is what the alignment issues are for message parameters. Maybe I should have been reading some kind of ABI documentation, and I can certainly understand that alignment constraints would apply when treating message parameters like normal function call parameters, although that is also in the realm of a platform's ABI documentation.
Thanks once again for following up, and sorry to be a nuisance!
Paul
On Thursday, 29 September 2022 15:36:02 CEST Paul Boddie wrote:
I suppose what I have been clumsily trying to clarify (and dragging you into this) is what the alignment issues are for message parameters. Maybe I should have been reading some kind of ABI documentation, and I can certainly understand that alignment constraints would apply when treating message parameters like normal function call parameters, although that is also in the realm of a platform's ABI documentation.
Following up, trying to make sense of all this, I have largely concluded that I can observe the alignment requirements on AMD64 and MIPS32 in my own IPC mechanisms. For a dataspace map operation, the parameter alignments (in bytes) are as follows:
0:  opcode
8:  offset
16: hot_spot
24: flags
These apply to both a structure describing the parameters and the way the L4Re RPC framework populates the message registers.
However, on IA32, the structure member alignments appear to be as follows:
0:  opcode
4:  offset
12: hot_spot
20: flags
At the same time, the RPC framework is generating values that would comply with the expected alignment given above for AMD64 and MIPS32.
So this kind of explains some of my earlier confusion about what should happen on a 32-bit system. In fact, IA32 (x86) seems to be a special case, although I cannot immediately find the documentation that explains why the structure member alignment should be different from the rules applying to other architectures.
I was reminded of Compiler Explorer and did a little experiment:
https://gcc.godbolt.org/z/bzxoceGzs
A structure starting with a 32-bit member and followed by 64-bit members has 64-bit-aligned members on AMD64, as demonstrated above, whereas code generated with the -m32 option exhibits the same properties as that shown above for IA32.
Sorry to confuse things even further!
Paul
Sorry: another follow-up to myself!
On Thursday, 29 September 2022 22:15:06 CEST Paul Boddie wrote:
So this kind of explains some of my earlier confusion about what should happen on a 32-bit system. In fact, IA32 (x86) seems to be a special case, although I cannot immediately find the documentation that explains why the structure member alignment should be different from the rules applying to other architectures.
Here is something that might be relevant:
"System V Application Binary Interface Intel386 Architecture Processor Supplement Version 1.0"
https://uclibc.org/docs/psABI-i386.pdf
Specifically, table 2.1 on page 8 shows that the long long types are 64-bit types with 32-bit alignment.
Paul
Hi all,
On Thursday, 29 September 2022 23:54:04 CEST Paul Boddie wrote:
On Thursday, 29 September 2022 22:15:06 CEST Paul Boddie wrote:
So this kind of explains some of my earlier confusion about what should happen on a 32-bit system. In fact, IA32 (x86) seems to be a special case, although I cannot immediately find the documentation that explains why the structure member alignment should be different from the rules applying to other architectures.
Here is something that might be relevant:
"System V Application Binary Interface Intel386 Architecture Processor Supplement Version 1.0"
https://uclibc.org/docs/psABI-i386.pdf
Specifically, table 2.1 on page 8 shows that the long long types are 64-bit types with 32-bit alignment.
True, and gcc confirms that:
$ cat foo.cc
#include <cstdio>

int main()
{
  printf("%zd\n", alignof(long long));
  return 0;
}
$ uname -m
x86_64
$ g++ -Wall -Wextra -o foo foo.cc && ./foo
8
$ g++ -m32 -Wall -Wextra -o foo foo.cc && ./foo
4
Kind regards,
Frank
On 9/29/22 15:36, Paul Boddie wrote:
I suppose what I have been clumsily trying to clarify (and dragging you into this) is what the alignment issues are for message parameters. Maybe I should have been reading some kind of ABI documentation, and I can certainly understand that alignment constraints would apply when treating message parameters like normal function call parameters, although that is also in the realm of a platform's ABI documentation.
Thanks once again for following up, and sorry to be a nuisance!
Hi Paul,
it's a complex system and a lot to take in and remember. I am certain your explanations and open communication about your projects are appreciated by other readers on this list, as they are by me.
Keep asking. I'm keen to see what comes up next.
Cheers, Philipp
On Friday, 30 September 2022 14:59:07 CEST Philipp Eppelt wrote:
it's a complex system and a lot to take in and remember. I am certain your explanations and open communication about your projects are appreciated by other readers on this list, as they are by me.
Pleased to be of service, I guess!
Keep asking. I'm keen to see what comes up next.
Well, I was going to ask what I should do about the x86 alignment issues. Clearly, the structure alignment regime is different from the RPC framework's message passing regime, unless I am doing something else that is wrong (very possible).
My workaround was to specify the opcode as occupying an l4_uint64_t, but to actually read only an l4_umword_t to get the value; this is merely a consequence of the message having its parameters aligned to 64-bit boundaries due to the presence of 64-bit values. Obviously, if both peers in the messaging use the RPC framework, they are insulated from any possible mismatch, but that doesn't apply to me.
Thanks once again for the guidance!
Paul