ARM & caches (was 'Fiasco.OC performance issues')

Thu Feb 28 21:31:44 CET 2013

Hi Dmitry,

On 02/27/2013 04:20 PM, Dmitry Shubin wrote:
> On 02/20/2013 08:54 PM, Sebastian Sumpf wrote:
>> Hi Adam,
>>
>> On 02/18/2013 07:30 PM, Adam Lackorzynski wrote:
>>> On Mon Feb 11, 2013 at 19:42:47 +0100, Sebastian Sumpf wrote:
>>>> On 02/07/2013 09:38 PM, Adam Lackorzynski wrote:
>>>>> On Wed Feb 06, 2013 at 11:27:51 +0100, Sebastian Sumpf wrote:
>>>>>> On 01/29/2013 12:09 AM, Adam Lackorzynski wrote:
>>>>>>> On Fri Jan 25, 2013 at 00:00:26 +0100, Sebastian Sumpf wrote:
>>>>>>>> On 01/17/2013 11:31 PM, Adam Lackorzynski wrote:
>>>>>>>>> On Thu Jan 17, 2013 at 17:03:36 +0100, Sebastian Sumpf wrote:
>>>>>>>>>> I recently upgraded Fiasco.OC to SVN revision 42 and
>>>>>>>>>> experience some
>>>>>>>>>> pretty severe performance degradation compared to revision 40
>>>>>>>>>> on the
>>>>>>>>>> Pandaboard (SMP). It seems that 'sigma0' and the root task
>>>>>>>>>> stall for 5
>>>>>>>>>> to 10 seconds during boot up. I tracked the issue down to be
>>>>>>>>>> caused by
>>>>>>>>>> the initial mapping operations, especially our root task maps
>>>>>>>>>> all the
>>>>>>>>>> available memory during bootstrap. Within the kernel the
>>>>>>>>>> 'Context::xcpu_tlb_flush' is called for each mapping. The
>>>>>>>>>> function sends
>>>>>>>>>> an IPI (to CPU1 which is idle) and then waits for an IPI in
>>>>>>>>>> order to
>>>>>>>>>> signal the end of the operation. The whole operation seems to
>>>>>>>>>> have
>>>>>>>>>> gotten slower compared to revision 40, but I could not find many
>>>>>>>>>> differences in the IPI-handling code. Do you have any ideas or
>>>>>>>>>> suggestions what could cause the delay (maybe scheduling
>>>>>>>>>> changes) and
>>>>>>>>>> how to fix it?
>>>>>>>>>
>>>>>>>>> I noticed a similar/same thing but hadn't time to investigate yet.
>>>>>>>>
>>>>>>>> Okay, I just wanted to make sure that the problem is not at our
>>>>>>>> side nor
>>>>>>>> at our usage pattern.
>>>>>>>> Another thing I wonder is: Since you now have second level cache
>>>>>>>> support
>>>>>>>> for the PandaBoard, how do I map DMA memory to a client? The
>>>>>>>> problem
>>>>>>>> seems to be that sigma0 maps all memory as cached. So what we
>>>>>>>> have been
>>>>>>>> trying to do is this: When someone requests DMA memory we map
>>>>>>>> the page
>>>>>>>> as uncached and then call 'l4_cache_dma_coherent' afterwards. This
>>>>>>>> doesn't seem to work out well for our drivers. The thing I think
>>>>>>>> I could
>>>>>>>> gather is that memory that is mapped cached (sigma0, roottask) and
>>>>>>>> uncached (client) at the same time has an undefined behavior (I
>>>>>>>> might be
>>>>>>>> wrong here) on ARM. So, what is the protocol to implement this on
>>>>>>>> Fiasco.OC/L4RE setups?
>>>>>>>
>>>>>>> Indeed, having memory with different attributes must be avoided.
>>>>>>> But it's also about accessing that memory. So for example for sigma0
>>>>>>> this isn't a problem because sigma0 does not touch the memory
>>>>>>> itself.
>>>>>>> Is your roottask accessing the memory, i.e. pulling it into caches?
>>>>>>
>>>>>> Yes this is the problem we're trying to solve. We don't have a
>>>>>> notion of
>>>>>> normal RAM and DMA pools within our roottask (if we had this, the
>>>>>> question of how to dimension DMA pools would arise, also this
>>>>>> seems to
>>>>>> be an ARM only issue). So here is what we did with L1 caches enabled
>>>>>> only: Acquire the memory in roottask, map it to the client as
>>>>>> non-cacheable, zero out the memory (as it might have been previously
>>>>>> used as normal RAM and for security reasons), clean and invalidate
>>>>>> the
>>>>>> data-cache. With the L2-cache enabled on PandaBoard, I tried to
>>>>>> use the
>>>>>> 'l4_cache_dma_coherent' function to accomplish the same behavior,
>>>>>> which
>>>>>> worked out well for the L1-case ... but it didn't. So, what am I
>>>>>> doing
>>>>>> wrong here, or isn't this supported anyways?
>>>>>
>>>>> Sounds reasonable to me. Could you check whether
>>>>> l4_cache_dma_coherent_full makes a difference in your setup?
>>>>
>>>> Thanks for the 'coherent full' hint, it does work. I am going to double
>>>> check the addresses we hand over in the 'dma coherent' case next.
>>>
>>> Please check the current version, there are fixes for both the IPC issue
>>> and hopefully also for the L2 issue.
>>
>> Good stuff! The IPC issue is gone completely and the board starts as
>> fast as in the single core case. The cache issue is unfortunately still
>> around. The strange thing is, that it only seems to occur when using our
>> HDMI driver. One can observe small black lines in the lower left corner
>> which seem to be 32 bytes (cache-line size?) long and it looks like they
>> occur on page boundaries only. Beats me. All other drivers seem to work
>> fine.
>>
>> Thanks for your efforts,
>>
>> Sebastian
>>
> 
> Hi, Sebastian.
> 
> I've just played around with demo that comes with Genode
> (git: ce075c05b9f54) a bit and my experience differs from yours.
> I'll try to summarize what I've found so far.
> 
> (1) Issues with u-boot:
> 
> (1.a) I've been told elsewhere that CONFIG_SYS_L2CACHE_OFF option
> matters when building u-boot. It does not seem to be the case however,
> present or not it does not affect observed behavior. Anyway, that is
> just something to keep in mind.

Indeed, it doesn't matter any more: since Fiasco.OC revision 42 the kernel
enables the L2 cache itself, so u-boot is out of the game.

> (1.b) u-boot version on the other hand matters. For instance omap_fb
> doesn't work with u-boot from master branch (47104c37), nothing shows
> up on the screen. I've got better luck with the version tagged
> 'v2012.04.01'.

Yes, it's always a bit tricky to find the right version that initializes
stuff correctly.

> (2) Black lines in the corner of the screen: I've actually observed
> them without L2 cache turned on. The effect is less pronounced
> (usually just 1 or 2 lines), they don't show up every time (!) and
> disappear after you click anywhere on the screen or just drag a mouse
> cursor over. So, this may or may not be L2 issue.

I have seen them by now, too. It is a cache issue; it shows up less often
without L2 because the L1 cache is way smaller than the L2 one.

> (3) Other drivers don't work for me as you claim they do when L2 is
> turned on. Namely, usb_drv fails to detect whatever you plug in most
> of the time, and when it succeeds the device doesn't work anyway.

Also true. Here is my current state of affairs: as mentioned above, the ARM
manual states that mapping a page as cached and uncached at the same time
results in undefined behavior. Unfortunately, DMA memory is mapped as cached
memory in 'roottask/core', which is enforced by the sigma0 protocol. So when
a client requests DMA memory, the roottask zeroes out that cached memory,
cleans and invalidates the caches, and maps the page 'uncached' to the
client.

What happens next is very weird. We inspected the caches with a hardware
debugger after cleaning them (we cleaned them completely for the test), and
they are indeed empty, with all lines invalid. We also made sure that there
are no accesses to the cached memory from the roottask's side. And guess
what? After some interaction with this memory on the uncached client side,
a lot of valid, zeroed-out cache lines appear in both caches. After a while
these cache lines are evicted, which in turn causes memory corruption on the
client side. Today one of my colleagues suggested clearing the data TLB as
well, but that made no difference.

If we omit the cleanup of the DMA memory in the roottask, which we don't
want to do, everything works out fine.

So what is the problem here? Maybe these are the results of the 'undefined
behavior'; I don't know for sure. If anyone on this list finds out
something, let me know.
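
To make the corruption mechanism concrete, here is a toy model in C. It is
only a sketch under simplifying assumptions: a tiny direct-mapped write-back
cache in front of a byte array, with no relation to the real Cortex-A9 or
L2 controller behavior. All names (toy_system, toy_run, and so on) are
hypothetical and are not Fiasco.OC or L4Re interfaces. The model shows only
the consequence of a dirty, zeroed-out line sitting in the cache while the
client writes through an uncached mapping; it does not explain why such
lines show up again after a complete clean and invalidate.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LINE_SIZE = 32, NUM_LINES = 4, MEM_SIZE = 256 };

struct toy_line {
    int      valid, dirty;
    uint32_t base;               /* memory address of the first byte */
    uint8_t  data[LINE_SIZE];
};

struct toy_system {
    uint8_t         mem[MEM_SIZE];     /* "physical" memory */
    struct toy_line cache[NUM_LINES];  /* direct-mapped, write-back */
};

/* Write a dirty line back to memory, then mark it invalid. */
static void toy_evict(struct toy_system *s, struct toy_line *l)
{
    if (l->valid && l->dirty)
        memcpy(&s->mem[l->base], l->data, LINE_SIZE);
    l->valid = l->dirty = 0;
}

/* Clean and invalidate the whole toy cache. */
static void toy_clean_invalidate_all(struct toy_system *s)
{
    for (int i = 0; i < NUM_LINES; i++)
        toy_evict(s, &s->cache[i]);
}

/* Write through the cached mapping (what the roottask does). */
static void toy_cached_write(struct toy_system *s, uint32_t addr, uint8_t v)
{
    assert(addr < MEM_SIZE);
    struct toy_line *l = &s->cache[(addr / LINE_SIZE) % NUM_LINES];
    uint32_t base = addr - addr % LINE_SIZE;

    if (!l->valid || l->base != base) {   /* miss: evict, then fill */
        toy_evict(s, l);
        memcpy(l->data, &s->mem[base], LINE_SIZE);
        l->base  = base;
        l->valid = 1;
    }
    l->data[addr - base] = v;
    l->dirty = 1;
}

/* Write through the uncached mapping (what the client does). */
static void toy_uncached_write(struct toy_system *s, uint32_t addr, uint8_t v)
{
    assert(addr < MEM_SIZE);
    s->mem[addr] = v;                     /* bypasses the cache */
}

/*
 * Roottask zeroes one line via the cached mapping, optionally cleans and
 * invalidates, then the client fills the line with 0x5A via the uncached
 * mapping. A later eviction is modelled by a full clean and invalidate.
 * Returns the first byte the client would read back from memory.
 */
static uint8_t toy_run(int clean_before_handover)
{
    struct toy_system s;

    memset(&s, 0, sizeof s);
    memset(s.mem, 0xAA, sizeof s.mem);    /* stale data from earlier use */

    for (uint32_t a = 0; a < LINE_SIZE; a++)
        toy_cached_write(&s, a, 0x00);
    if (clean_before_handover)
        toy_clean_invalidate_all(&s);

    for (uint32_t a = 0; a < LINE_SIZE; a++)
        toy_uncached_write(&s, a, 0x5A);

    toy_clean_invalidate_all(&s);         /* "after a while" eviction */
    return s.mem[0];
}
```

In this model, toy_run(1) leaves the client's data intact, while toy_run(0)
lets the late write-back of the zeroed line overwrite what the client wrote.
What we observe on the board looks like the second case, even though we
perform the clean and invalidate step.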

Cheers,

Sebastian