Fiasco.OC performance issues

Mon Feb 11 19:42:47 CET 2013

On 02/07/2013 09:38 PM, Adam Lackorzynski wrote:
> On Wed Feb 06, 2013 at 11:27:51 +0100, Sebastian Sumpf wrote:
>> On 01/29/2013 12:09 AM, Adam Lackorzynski wrote:
>>> On Fri Jan 25, 2013 at 00:00:26 +0100, Sebastian Sumpf wrote:
>>>> On 01/17/2013 11:31 PM, Adam Lackorzynski wrote:
>>>>> On Thu Jan 17, 2013 at 17:03:36 +0100, Sebastian Sumpf wrote:
>>>>>> I recently upgraded Fiasco.OC to SVN revision 42 and experience some
>>>>>> pretty severe performance degradation compared to revision 40 on the
>>>>>> PandaBoard (SMP). It seems that 'sigma0' and the root task stall for 5
>>>>>> to 10 seconds during boot up. I tracked the issue down to the initial
>>>>>> mapping operations; in particular, our root task maps all the
>>>>>> available memory during bootstrap. Within the kernel,
>>>>>> 'Context::xcpu_tlb_flush' is called for each mapping. The function sends
>>>>>> an IPI (to CPU1 which is idle) and then waits for an IPI in order to
>>>>>> signal the end of the operation. The whole operation seems to have
>>>>>> gotten slower compared to revision 40, but I could not find many
>>>>>> differences in the IPI-handling code. Do you have any ideas or
>>>>>> suggestions what could cause the delay (maybe scheduling changes) and
>>>>>> how to fix it?
>>>>>
>>>>> I noticed a similar/same thing but haven't had time to investigate yet.
>>>>
>>>> Okay, I just wanted to make sure that the problem is neither on our
>>>> side nor in our usage pattern.
>>>> Another thing I wonder is: Since you now have second level cache support
>>>> for the PandaBoard, how do I map DMA memory to a client? The problem
>>>> seems to be that sigma0 maps all memory as cached. So what we have been
>>>> trying to do is this: When someone requests DMA memory we map the page
>>>> as uncached and then call 'l4_cache_dma_coherent' afterwards. This
>>>> doesn't seem to work out well for our drivers. What I could gather is
>>>> that memory that is mapped cached (sigma0, roottask) and uncached
>>>> (client) at the same time has undefined behavior on ARM (I might be
>>>> wrong here). So, what is the protocol to implement this on
>>>> Fiasco.OC/L4RE setups?
>>>
>>> Indeed, having memory with different attributes must be avoided.
>>> But it's also about accessing that memory. So for example for sigma0
>>> this isn't a problem because sigma0 does not touch the memory itself.
>>> Is your roottask accessing the memory, i.e. pulling it into caches?
>>
>> Yes, this is the problem we're trying to solve. We don't have a notion of
>> normal RAM and DMA pools within our roottask (if we had this, the
>> question of how to dimension DMA pools would arise, also this seems to
>> be an ARM only issue). So here is what we did with L1 caches enabled
>> only: Acquire the memory in roottask, map it to the client as
>> non-cacheable, zero out the memory (as it might have been previously
>> used as normal RAM and for security reasons), clean and invalidate the
>> data-cache. With the L2-cache enabled on PandaBoard, I tried to use the
>> 'l4_cache_dma_coherent' function to accomplish the same behavior that
>> worked out well in the L1-only case, but it didn't work. So, what am I
>> doing wrong here, or is this not supported anyway?
> 
> Sounds reasonable to me. Could you check whether
> l4_cache_dma_coherent_full makes a difference in your setup?

Thanks for the 'coherent full' hint, it does work. I am going to
double-check the addresses we hand over in the 'dma coherent' case next.
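
For illustration only, a check on the addresses handed to a ranged cache
operation could be sketched as below. This is not the roottask code: the
helper 'dma_prepare', the recording stub 'stub_dma_coherent' (a stand-in
for a ranged call such as 'l4_cache_dma_coherent'), the 32-byte
CACHE_LINE value, and the convention that the end address is exclusive
are all assumptions made for the sketch. It zeroes the buffer first and
then passes a range widened to cache-line boundaries, so that no
partially covered line is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 32-byte cache lines; adjust for the actual platform. */
#define CACHE_LINE 32UL

/* Hypothetical stand-in for a ranged cache operation such as
 * l4_cache_dma_coherent(start, end). It only records its arguments
 * so that the range handed over can be inspected. */
static unsigned long last_start, last_end;
static int ranged_calls;

static void stub_dma_coherent(unsigned long start, unsigned long end)
{
    last_start = start;
    last_end = end;
    ranged_calls++;
}

/* Zero 'size' bytes at 'buf', then request cache maintenance for the
 * enclosing cache-line-aligned range [start, end), end assumed exclusive.
 * Returns 0 on success, -1 on invalid arguments. */
static int dma_prepare(void *buf, size_t size)
{
    if (buf == NULL || size == 0)
        return -1;

    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t start = (uintptr_t)buf & mask;                       /* round down */
    uintptr_t end   = ((uintptr_t)buf + size + CACHE_LINE - 1) & mask; /* round up */

    assert(end > start);

    /* Zero first, so the maintenance call covers the freshly written lines. */
    memset(buf, 0, size);
    stub_dma_coherent((unsigned long)start, (unsigned long)end);
    return 0;
}
```

In a real setup the stub would be replaced by the actual ranged call, and
comparing its behavior against 'l4_cache_dma_coherent_full' would show
whether the range computation is the culprit.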

Sebastian