Re: Fiasco.OC performance issues

6 Feb 2013

On 01/29/2013 12:09 AM, Adam Lackorzynski wrote:
...
On Fri Jan 25, 2013 at 00:00:26 +0100, Sebastian Sumpf wrote:
...
On 01/17/2013 11:31 PM, Adam Lackorzynski wrote:
...
On Thu Jan 17, 2013 at 17:03:36 +0100, Sebastian Sumpf wrote:
...
I recently upgraded Fiasco.OC to SVN revision 42 and am experiencing
pretty severe performance degradation compared to revision 40 on the
Pandaboard (SMP). It seems that 'sigma0' and the root task stall for 5
to 10 seconds during boot up. I tracked the issue down to the initial
mapping operations; in particular, our root task maps all the
available memory during bootstrap. Within the kernel the
'Context::xcpu_tlb_flush' is called for each mapping. The function sends
an IPI (to CPU1 which is idle) and then waits for an IPI in order to
signal the end of the operation. The whole operation seems to have
gotten slower compared to revision 40, but I could not find many
differences in the IPI-handling code. Do you have any ideas or
suggestions what could cause the delay (maybe scheduling changes) and
how to fix it?

I noticed a similar/same thing but haven't had time to investigate yet.

Okay, I just wanted to make sure that the problem is neither on our
side nor in our usage pattern.
Another thing I wonder is: Since you now have second level cache support
for the PandaBoard, how do I map DMA memory to a client? The problem
seems to be that sigma0 maps all memory as cached. So what we have been
trying to do is this: When someone requests DMA memory we map the page
as uncached and then call 'l4_cache_dma_coherent' afterwards. This
doesn't seem to work out well for our drivers. As far as I could
gather, memory that is mapped cached (sigma0, roottask) and uncached
(client) at the same time has undefined behavior on ARM (I might be
wrong here). So, what is the protocol to implement this on
Fiasco.OC/L4RE setups?

Indeed, having memory with different attributes must be avoided.
But it's also about accessing that memory. So for example for sigma0
this isn't a problem because sigma0 does not touch the memory itself.
Is your roottask accessing the memory, i.e. pulling it into caches?

Yes, this is the problem we're trying to solve. We don't have a notion
of normal RAM and DMA pools within our roottask (if we had this, the
question of how to dimension DMA pools would arise; also, this seems to
be an ARM-only issue). So here is what we did with only L1 caches enabled:
only: Acquire the memory in roottask, map it to the client as
non-cacheable, zero out the memory (as it might have been previously
used as normal RAM and for security reasons), clean and invalidate the
data-cache. With the L2-cache enabled on PandaBoard, I tried to use the
'l4_cache_dma_coherent' function to accomplish the same behavior that
worked out well in the L1-only case, but it didn't work. So, what am I
doing wrong here, or isn't this supported at all?
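
For reference, the L1-only sequence described above can be sketched
roughly as follows. This is only an illustration, not our actual
roottask code: the two stub_* functions, the trace array and
dma_hand_out() are hypothetical names, and the stubs merely record the
order of the steps so the sketch compiles on any host. On the real
system the cache stub would be replaced by the platform's cache
maintenance call (in our L2 attempt, 'l4_cache_dma_coherent').

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Steps of the hand-out sequence, recorded so the order can be checked. */
enum dma_step { STEP_MAP_UNCACHED = 1, STEP_ZERO, STEP_CLEAN_INV };

#define MAX_STEPS 8
static enum dma_step dma_trace[MAX_STEPS];
static size_t dma_trace_len;

static void dma_trace_push(enum dma_step s)
{
    if (dma_trace_len < MAX_STEPS)
        dma_trace[dma_trace_len++] = s;
}

/* Hypothetical stub: stands in for mapping the range to the client
 * with non-cacheable attributes. */
static void stub_map_uncached_to_client(void *addr, size_t size)
{
    (void)addr;
    (void)size;
    dma_trace_push(STEP_MAP_UNCACHED);
}

/* Hypothetical stub: stands in for "clean and invalidate the data
 * cache" over [start, end). */
static void stub_clean_inv_dcache(uintptr_t start, uintptr_t end)
{
    assert(start <= end);
    dma_trace_push(STEP_CLEAN_INV);
}

/* Hand out already acquired memory as DMA memory, in the order
 * described in the text: map uncached, zero, clean + invalidate. */
static void dma_hand_out(void *mem, size_t size)
{
    dma_trace_len = 0;
    stub_map_uncached_to_client(mem, size);
    memset(mem, 0, size);   /* roottask writes through its cached mapping */
    dma_trace_push(STEP_ZERO);
    stub_clean_inv_dcache((uintptr_t)mem, (uintptr_t)mem + size);
}
```

The memset is the step where the roottask itself pulls the memory into
its caches, which is why the clean and invalidate has to follow it.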

Thanks a lot,

Sebastian