ARM & caches (closed)

Sebastian Sumpf Sebastian.Sumpf at genode-labs.com
Sun Mar 3 19:52:42 CET 2013


Hi Dmitry,

On 02/28/2013 09:31 PM, Sebastian Sumpf wrote:
> Hi Dmitry,
> 
> On 02/27/2013 04:20 PM, Dmitry Shubin wrote:
>> On 02/20/2013 08:54 PM, Sebastian Sumpf wrote:
>>> Hi Adam,
>>>
>>> On 02/18/2013 07:30 PM, Adam Lackorzynski wrote:
>>>> On Mon Feb 11, 2013 at 19:42:47 +0100, Sebastian Sumpf wrote:
>>>>> On 02/07/2013 09:38 PM, Adam Lackorzynski wrote:
>>>>>> On Wed Feb 06, 2013 at 11:27:51 +0100, Sebastian Sumpf wrote:
>>>>>>> On 01/29/2013 12:09 AM, Adam Lackorzynski wrote:
>>>>>>>> On Fri Jan 25, 2013 at 00:00:26 +0100, Sebastian Sumpf wrote:
>>>>>>>>> On 01/17/2013 11:31 PM, Adam Lackorzynski wrote:
>>>>>>>>>> On Thu Jan 17, 2013 at 17:03:36 +0100, Sebastian Sumpf wrote:
>>>>>>>>>>> I recently upgraded Fiasco.OC to SVN revision 42 and experience
>>>>>>>>>>> some pretty severe performance degradation compared to revision
>>>>>>>>>>> 40 on the PandaBoard (SMP). It seems that 'sigma0' and the root
>>>>>>>>>>> task stall for 5 to 10 seconds during boot up. I tracked the
>>>>>>>>>>> issue down to the initial mapping operations; in particular, our
>>>>>>>>>>> root task maps all the available memory during bootstrap. Within
>>>>>>>>>>> the kernel, 'Context::xcpu_tlb_flush' is called for each mapping.
>>>>>>>>>>> The function sends an IPI (to CPU1, which is idle) and then waits
>>>>>>>>>>> for an IPI that signals the end of the operation. The whole
>>>>>>>>>>> operation seems to have gotten slower compared to revision 40,
>>>>>>>>>>> but I could not find many differences in the IPI-handling code.
>>>>>>>>>>> Do you have any ideas or suggestions as to what could cause the
>>>>>>>>>>> delay (maybe scheduling changes) and how to fix it?
>>>>>>>>>>
>>>>>>>>>> I noticed a similar/same thing but hadn't time to investigate yet.
>>>>>>>>>
>>>>>>>>> Okay, I just wanted to make sure that the problem is neither on
>>>>>>>>> our side nor in our usage pattern.
>>>>>>>>> Another thing I wonder about: since you now have second-level
>>>>>>>>> cache support for the PandaBoard, how do I map DMA memory to a
>>>>>>>>> client? The problem seems to be that sigma0 maps all memory as
>>>>>>>>> cached. So what we have been trying to do is this: when someone
>>>>>>>>> requests DMA memory, we map the page as uncached and then call
>>>>>>>>> 'l4_cache_dma_coherent' afterwards. This doesn't seem to work out
>>>>>>>>> well for our drivers. What I think I could gather is that memory
>>>>>>>>> that is mapped cached (sigma0, roottask) and uncached (client) at
>>>>>>>>> the same time has undefined behavior on ARM (I might be wrong
>>>>>>>>> here). So, what is the protocol to implement this on
>>>>>>>>> Fiasco.OC/L4RE setups?
>>>>>>>>
>>>>>>>> Indeed, having memory with different attributes must be avoided.
>>>>>>>> But it's also about accessing that memory. So for example for sigma0
>>>>>>>> this isn't a problem because sigma0 does not touch the memory
>>>>>>>> itself.
>>>>>>>> Is your roottask accessing the memory, i.e. pulling it into caches?
>>>>>>>
>>>>>>> Yes, this is the problem we're trying to solve. We don't have a
>>>>>>> notion of normal RAM and DMA pools within our roottask (if we had
>>>>>>> one, the question of how to dimension DMA pools would arise; also,
>>>>>>> this seems to be an ARM-only issue). So here is what we did with
>>>>>>> only the L1 caches enabled: acquire the memory in roottask, map it
>>>>>>> to the client as non-cacheable, zero out the memory (as it might
>>>>>>> have been used as normal RAM before, and for security reasons),
>>>>>>> clean and invalidate the data cache. With the L2 cache enabled on
>>>>>>> the PandaBoard, I tried to use the 'l4_cache_dma_coherent' function
>>>>>>> to get the same behavior that worked out well in the L1 case ...
>>>>>>> but it didn't work. So, what am I doing wrong here, or isn't this
>>>>>>> supported anyway?
>>>>>>
>>>>>> Sounds reasonable to me. Could you check whether
>>>>>> l4_cache_dma_coherent_full makes a difference in your setup?
>>>>>
>>>>> Thanks for the 'coherent full' hint, it does work. I'm going to
>>>>> double-check the addresses we hand over in the 'dma coherent' case next.
>>>>
>>>> Please check the current version, there are fixes for both the IPC issue
>>>> and hopefully also for the L2 issue.
>>>
>>> Good stuff! The IPC issue is gone completely and the board starts as
>>> fast as in the single core case. The cache issue is unfortunately still
>>> around. The strange thing is that it only seems to occur when using our
>>> HDMI driver. One can observe small black lines in the lower left corner
>>> which seem to be 32 bytes (cache-line size?) long and it looks like they
>>> occur on page boundaries only. Beats me. All other drivers seem to work
>>> fine.
>>>
>>> Thanks for your efforts,
>>>
>>> Sebastian
>>>
>>
>> Hi, Sebastian.
>>
>> I've just played around with the demo that comes with Genode
>> (git: ce075c05b9f54) a bit, and my experience differs from yours.
>> I'll try to summarize what I've found so far.
>>
>> (1) Issues with u-boot:
>>
>> (1.a) I've been told elsewhere that the CONFIG_SYS_L2CACHE_OFF option
>> matters when building u-boot. That does not seem to be the case, however:
>> present or not, it does not affect the observed behavior. Anyway, that is
>> just something to keep in mind.
> 
> Indeed, it doesn't matter any more: since Fiasco.OC revision 42 the
> kernel enables the L2 cache, so u-boot is out of the game.
> 
>> (1.b) The u-boot version, on the other hand, matters. For instance,
>> omap_fb doesn't work with u-boot from the master branch (47104c37);
>> nothing shows up on the screen. I've had better luck with the version
>> tagged 'v2012.04.01'.
> 
> Yes, it's always a bit tricky to find the right version that initializes
> stuff correctly.
> 
>> (2) Black lines in the corner of the screen: I've actually observed
>> them without the L2 cache turned on. The effect is less pronounced
>> (usually just 1 or 2 lines), they don't show up every time (!), and they
>> disappear after you click anywhere on the screen or just drag the mouse
>> cursor over them. So, this may or may not be an L2 issue.
> 
> I have seen that by now, too. It is a cache issue and does not show up
> as often because the L1 cache is way smaller than the L2 one.
> 
>> (3) Other drivers don't work for me the way you say they do when L2 is
>> turned on. Namely, usb_drv fails to detect whatever you plug in most
>> of the time, and when it succeeds the device doesn't work anyway.
> 
> Also true. I can tell you my current state of affairs: As mentioned
> above, the ARM manual states that mapping a page as cached and uncached
> at the same time results in undefined behavior. Unfortunately, DMA memory
> is mapped in 'roottask/core' as cached memory, which is enforced by the
> sigma0 protocol. So what the roottask does when a client requests DMA
> memory is to zero out that cached memory, clean and invalidate the
> caches, and map the page 'uncached' to the client. What happens next is
> very weird. We inspected the caches after cleaning them (we cleaned them
> completely for the test) using a hardware debugger; they are indeed
> empty (all lines invalid). We also made sure that there are no accesses
> to the cached memory from roottask's side, and guess what? After some
> interaction with this memory on the uncached client side, a lot of valid,
> zeroed-out cache lines appear in both caches. After a while these
> cache lines get evicted, which in turn causes memory corruption on
> the client side. Today one of my colleagues suggested clearing the data
> TLB as well, but that made no difference.
> If we omit the cleanup of the DMA memory at the roottask, which we don't
> want to do, everything works out fine.
> So what is the problem here? Maybe these are the results of the
> 'undefined behavior'. I don't know for sure. If anyone on this list
> finds out something, let me know.
> 

Problem solved, and just for the sake of documenting the outcome: Our
roottask does indeed touch the memory before it maps it to the client.
The access was, how should I say it, just a little hard to find, but
finding it saved me from almost believing in Santa Claus again .-)
A patch will appear soon,

Sebastian
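
As an illustration of the failure mode described in this thread, here is
a toy model in C. It is not Fiasco.OC, L4Re, or Genode code: it models a
single write-back cache line in front of a small RAM, and every name in
it (toy_cache, cached_write, uncached_write, clean_invalidate,
run_scenario) is made up for this sketch. It also assumes that the stray
roottask access lands after the clean-and-invalidate step, which the
thread does not spell out, and that the access is a write, so the line
becomes dirty.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LINE_SIZE = 32, MEM_SIZE = 64 };

/* Toy model (hypothetical): one write-back cache line in front of RAM. */
struct toy_cache {
    uint8_t  ram[MEM_SIZE];   /* what uncached accesses and devices see */
    uint8_t  line[LINE_SIZE]; /* contents of the single cache line */
    unsigned tag;             /* RAM offset the line currently caches */
    int      valid, dirty;
};

/* Clean + invalidate; also used to model a later eviction of the line. */
static void clean_invalidate(struct toy_cache *c)
{
    if (c->valid && c->dirty)
        memcpy(c->ram + c->tag, c->line, LINE_SIZE); /* write back */
    c->valid = c->dirty = 0;
}

/* Write through the cached alias (roottask): allocates a dirty line. */
static void cached_write(struct toy_cache *c, unsigned off, uint8_t val)
{
    assert(off < MEM_SIZE);
    unsigned base = off - off % LINE_SIZE;
    if (!c->valid || c->tag != base) {
        clean_invalidate(c);                       /* evict old line */
        memcpy(c->line, c->ram + base, LINE_SIZE); /* line fill */
        c->tag = base;
        c->valid = 1;
    }
    c->line[off - base] = val;
    c->dirty = 1;
}

/* Write through the uncached alias (client): bypasses the cache. */
static void uncached_write(struct toy_cache *c, unsigned off, uint8_t val)
{
    assert(off < MEM_SIZE);
    c->ram[off] = val;
}

/* Returns what ends up in RAM at offset 0 after the client wrote 0xAB.
 * touch_after_clean != 0 models the stray cached access by roottask. */
uint8_t run_scenario(int touch_after_clean)
{
    struct toy_cache c;
    memset(&c, 0, sizeof c);
    memset(c.ram, 0xff, MEM_SIZE);      /* stale contents of the page */

    for (unsigned i = 0; i < LINE_SIZE; i++)
        cached_write(&c, i, 0);         /* roottask zeroes the page */
    clean_invalidate(&c);               /* cache maintenance */

    if (touch_after_clean)
        cached_write(&c, 1, 0);         /* stray access: line is back */

    uncached_write(&c, 0, 0xAB);        /* client uses its uncached map */
    clean_invalidate(&c);               /* line gets evicted later on */
    return c.ram[0];
}
```

In this model, run_scenario(0) returns 0xAB, so the client's data
survives. run_scenario(1) returns 0x00, because the late write-back of
the zeroed line overwrites what the client stored through its uncached
mapping. Real hardware is more involved (two cache levels, line fills
caused by reads or speculation), so this only shows the basic mechanism.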



