On Monday, 1 March 2021 21:30:11 CET Philipp Eppelt wrote:
Sorry for the long wait; I didn't find time last Friday.
No worries: I very much appreciate the reply!
On 2/24/21 10:10 PM, Paul Boddie wrote:
Of course, upon a page fault, the dataspace presents a flexpage to the region manager to satisfy an access for some, if not all, of the region associated with the dataspace. Here, I interpret the span of any given flexpage as being part of a region, with an entire region corresponding to the entire span of memory associated with a dataspace. I hope that is correct.
Yes, correct. Just a detail: the dataspace is not the acting entity on a page fault; the region manager is. It calls ds->map(...) on the dataspace registered for the page-fault address.
Yes, the fault occurs in the task, and the region manager is acting on behalf of the task to request a mapping from the dataspace. Again, as I hopefully understand things.
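In other words, as I picture it (an illustrative sketch with made-up types and names, not the actual region-manager code), the essential step is the offset arithmetic:

#include <stdint.h>

/* Illustrative stand-ins only; the real region manager lives in l4re_kernel. */
typedef struct region {
  uintptr_t start;   /* base address of the attached region */
  uintptr_t offset;  /* offset into the dataspace at 'start' */
} region_t;

/* On a page fault at pf_addr within this region, the region manager asks
 * the registered dataspace to map the corresponding dataspace offset,
 * i.e. ds->map(...) is invoked with this value. */
static uintptr_t ds_offset_for_fault(const region_t *r, uintptr_t pf_addr)
{
  return r->offset + (pf_addr - r->start);
}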
[...]
This is what I would expect, yes. I detach the region...
l4re_rm_detach(a1)
...or...
l4re_rm_detach_unmap(a1, L4RE_THIS_TASK_CAP)
Yes, that's equivalent. In both cases the capabilities used should be valid (otherwise something is seriously wrong), and then the whole region is removed from your task's virtual address space.
Understood.
...and I also call l4re_util_cap_free_um on the dataspace capability. However, I am not completely familiar with the difference between l4re_util_cap_free and l4re_util_cap_free_um.
The one returns the capability index to the capability allocator; the other returns the capability index and also unmaps the object that the capability referenced from the local object space, thus making it inaccessible through all capabilities in this task, even if they still reference this object. E.g. if slots 5, 7 and 9 reference an object O, and you call l4re_util_cap_free_um(9), then 5 and 7 will get an L4_IPC_ENOT_EXISTENT error when they try to access it.
In my case, then, I imagine that I will almost always want to unmap the object.
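Spelled out with the C convenience wrappers (a sketch; cap5, cap7 and cap9 are assumed to already reference the same object O):

#include <l4/sys/types.h>
#include <l4/re/c/util/cap_alloc.h>

static void free_variants(l4_cap_idx_t cap5, l4_cap_idx_t cap7,
                          l4_cap_idx_t cap9)
{
  (void)cap5; (void)cap7;

  /* Variant 1: only recycle the index; O stays mapped, so invocations
   * through cap5 and cap7 keep working. */
  l4re_util_cap_free(cap9);

  /* Variant 2 (instead of variant 1): recycle the index AND unmap O from
   * this task's object space; cap5 and cap7 would then get
   * L4_IPC_ENOT_EXISTENT on invocation. */
  /* l4re_util_cap_free_um(cap9); */
}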
However, if there are other regions attached, e.g. (a2, s2) -> (d1, o2), these will still remain, and as soon as you unmap the d1 capability, you have stale entries in your region map.
What happens when a task tries to access the memory within a2 to a2+s2? Are there virtual memory associations that may still provide access to the memory exported by the now-unmapped capability?
This I actually don't know. I'll investigate. I hope the mappings are gone and you'll get a page fault, though.
So do I. :-)
[Strange behaviour]
I also saw it with a region that overlapped the old one instead of having precisely the same base address:
(a1+0x1000, s2) -> d2 -> mem[o2:o2+s2]
Here, an access to the new base of a1+0x1000 appeared to expose mem[o1+0x1000] instead of mem[o2].
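For reference, the overlapping attachment amounts to something like this (a sketch using the older C API names from the Subversion tree; newer trees spell the flags L4RE_RM_F_*):

#include <l4/sys/consts.h>
#include <l4/re/c/dataspace.h>
#include <l4/re/c/rm.h>

/* Sketch: attach d2 at the fixed address a1 + 0x1000 (no search flag, so
 * the region manager must use this exact address), overlapping the region
 * previously backed by d1. */
static long attach_overlapping(l4_addr_t a1, unsigned long s2,
                               l4re_ds_t d2, l4_addr_t o2)
{
  void *addr = (void *)(a1 + 0x1000);
  return l4re_rm_attach(&addr, s2, 0 /* fixed address, default rights */,
                        d2, o2, L4_PAGESHIFT);
}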
Are you certain that d1 and d2 are actually different dataspaces? Are you getting only d1 data or only d2 data? Are you getting a mix of d1 and d2 data?
It is, of course, always possible that I have been making a mistake - this being the usual discovery when I report strange behaviour - but the means of acquiring dataspaces d1 and d2 may involve distinct objects, and it creates further distinct objects to act as the dataspaces themselves. So, something like this would occur:
d1 = c1.open()
d2 = c2.open()
Here, c1 and c2 may even be the same object, but even then they should still allocate a new object for each invocation of the open operation, yielding two distinct dataspaces d1 and d2.
What I would observe is d1 data even after d2 was attached. I was somewhat confused as to whether d1 might still be active or not. But if it is, then d2 should not be allocated an address region coinciding with that of d1. If it isn't, then d2 should be unaffected by whatever d1 had been doing.
Let me summarize the steps I think are necessary during the lifetime of the dataspace:
- Allocate a capability index for the dataspace
- Allocate the memory and receive the dataspace capability in the allocated index (see http://l4re.org/doc/classL4Re_1_1Mem__alloc.html#a44b301573ae859e8406400338cc8e924), or use something similar to get the mapping for the dataspace capability under the allocated capability index (to be sure, use http://l4re.org/doc/group__l4__task__api.html#ga829a1b5cb4d5dba33ffee57534a505af)
Do I need to use the memory allocation interface if the dataspace is sending flexpage items? I have previously used the l4re_ma functions (and possibly C++ equivalents) to allocate memory, but this was mostly useful for device drivers where physical addresses may need to be obtained for hardware peripheral usage, plus convenient sharing of entire memory regions between tasks without any of my tasks needing to act as dataspaces.
My strategy with this work is to implement paging by sending flexpage items to satisfy paging requests and thus provide a dataspace implementation. In the dataspace itself, I actually use posix_memalign to obtain memory, but that is ultimately going to be using l4re_ma functions at the lowest level, I imagine.
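In rough terms, and purely as a sketch of the idea rather than my actual code, the backing memory is page-aligned and each map reply then carries a flexpage describing the page covering the requested offset:

#include <stdlib.h>
#include <l4/sys/types.h>
#include <l4/sys/consts.h>

/* Sketch: page-aligned backing store obtained with posix_memalign
 * (ultimately backed by the allocator underneath, presumably l4re_ma). */
static void *alloc_backing(unsigned long size)
{
  void *mem = NULL;
  if (posix_memalign(&mem, L4_PAGESIZE, size) != 0)
    return NULL;
  return mem;
}

/* Describe the page covering the given offset as a flexpage, of the kind
 * a map reply would carry. */
static l4_fpage_t page_for_offset(void *base, unsigned long offset)
{
  l4_addr_t page = ((l4_addr_t)base + offset) & ~((l4_addr_t)L4_PAGESIZE - 1);
  return l4_fpage(page, L4_PAGESHIFT, L4_FPAGE_RW);
}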
- Attach the dataspace to the region manager
- <use region/memory>
- Detach the region from the region manager
- Unmap the dataspace capability using l4_task_unmap()
- Return the capability index to the capability allocator.
The last two steps are done by l4re_util_cap_free_um.
The other steps are consistent with my approach.
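Condensed into a single C sketch (using the C convenience wrappers; error handling mostly omitted, and flag spellings depend on the L4Re revision):

#include <l4/sys/consts.h>
#include <l4/re/c/util/cap_alloc.h>
#include <l4/re/c/mem_alloc.h>
#include <l4/re/c/rm.h>

static void dataspace_lifecycle(unsigned long size)
{
  /* 1. Allocate a capability index for the dataspace. */
  l4re_ds_t ds = l4re_util_cap_alloc();

  /* 2. Allocate memory; the dataspace capability arrives in that index. */
  if (l4re_ma_alloc(size, ds, 0))
    return;

  /* 3. Attach the dataspace to the region manager, searching for a free
   * virtual address range (L4RE_RM_SEARCH_ADDR in older trees,
   * L4RE_RM_F_SEARCH_ADDR | L4RE_RM_F_RW in newer ones). */
  void *addr = 0;
  if (l4re_rm_attach(&addr, size, L4RE_RM_SEARCH_ADDR, ds, 0, L4_PAGESHIFT))
    return;

  /* 4. <use region/memory> */

  /* 5. Detach the region from the region manager. */
  l4re_rm_detach(addr);

  /* 6+7. Unmap the dataspace capability and return the index. */
  l4re_util_cap_free_um(ds);
}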
Hopefully, this helps you as a baseline. I'm a bit puzzled by the mem[o1+0x1000] case. I went through the code and I don't see how this can happen unless the "task" capability given to l4re_rm_detach_unmap is invalid; however, l4re_rm_detach uses the correct capability. Which code version are you working on? Maybe I'm looking at the wrong code?
I'm still using the Subversion distribution (version 83) of L4Re. I know I should be following the different GitHub repositories but I find the Subversion distribution more convenient and I have not wanted to introduce too many different variables in my own experiments. Plus, it seems to be reliable enough for my needs.
Over the weekend, I tried to troubleshoot this issue and investigate the nature of it. I then retraced my steps, introducing wrapper functions around l4re_rm_attach and l4re_rm_detach to see if the region manager was giving out duplicate addresses. This seemed to indicate that it was indeed doing so. If I introduced synchronisation around the l4re_rm calls (effectively extending the synchronisation already in place around the STL data structure recording active regions), the observed problem went away.
Now, this is not consistent with what Christian wrote a few weeks ago (where he also noted that the capability slot allocator is not thread-safe). I imagine that either my own code somehow uses the region manager API in a thread-unsafe way (although I cannot see exactly how that might be), or there is some element of using this API that is not thread-safe. So, I have just added synchronisation around both the capability slot allocator and the region manager operations.
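Concretely, the workaround looks something like this (a sketch with hypothetical wrapper names; the attach signature follows the older C API):

#include <pthread.h>
#include <l4/re/c/rm.h>

/* Serialise all region-manager calls behind one lock, extending the
 * synchronisation already used for the region-tracking structure. */
static pthread_mutex_t rm_lock = PTHREAD_MUTEX_INITIALIZER;

static long rm_attach_locked(void **start, unsigned long size,
                             unsigned long flags, l4re_ds_t ds,
                             l4_addr_t offs, unsigned char align)
{
  pthread_mutex_lock(&rm_lock);
  long err = l4re_rm_attach(start, size, flags, ds, offs, align);
  pthread_mutex_unlock(&rm_lock);
  return err;
}

static int rm_detach_locked(void *addr)
{
  pthread_mutex_lock(&rm_lock);
  int err = l4re_rm_detach(addr);
  pthread_mutex_unlock(&rm_lock);
  return err;
}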
Hopefully before too long, I aim to bundle this work up once again (it exists in a much rougher state from an earlier iteration), and then anyone suitably interested can see what I have been doing wrong all along. For now, though, I hope that I may be able to continue to work around whatever the problem might be.
I hope these observations are at the very least informative, if not particularly helpful.
Thanks once again for your advice!
Paul