Booting L4Re on the CI20: Panic in sigma0

Tue Jul 18 01:43:24 CEST 2017

And some more...

On Sunday 16. July 2017 02.24.03 Paul Boddie wrote:
> 
> But with the --l4re-dbg option set to "all", after this output...
> 
> L4Re: load binary 'rom/hello'
> L4Re: Start server loop
> 
> ...I notice this continuously recurring message:
> 
> L4Re[svr]: request: tag=0xfffb1026 proto=-5 obj=0x0

(Note that this is the only message that appears, and it repeats endlessly 
and very frequently.)

Well, I haven't really figured this out at all. I thought it might be useful 
to investigate what the message actually represents. First of all, it 
originates from the Dispatcher::dispatch method in...

pkg/l4re-core/l4re_kernel/server/src/dispatcher.cc

I think the "tag" breaks down into something like this:

  0xfffb1026 -> label=0xfffb (-5) -> L4_PROTO_EXCEPTION
                flags=0x1 -> L4_MSGTAG_TRANSFER_FPU
                items=0x00
                words=0x26 (38)

(Reference: pkg/l4re-core/l4sys/include/types.h)
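
The breakdown above can be reproduced with a short script. This is only a 
sketch in Python, assuming the field layout implied by the values shown: a 
signed label in the upper 16 bits, flags in bits 12-15, items in bits 6-11 
and words in bits 0-5. The name decode_msgtag is made up for illustration and 
is not part of L4Re.

```python
def decode_msgtag(raw):
    """Split a 32-bit message tag into its fields (assumed layout)."""
    label = (raw >> 16) & 0xffff
    if label & 0x8000:
        # The label is a signed 16-bit value, so sign-extend it.
        label -= 0x10000
    return {
        "label": label,               # protocol or error code
        "flags": (raw >> 12) & 0xf,   # bits 12-15
        "items": (raw >> 6) & 0x3f,   # bits 6-11: typed items
        "words": raw & 0x3f,          # bits 0-5: untyped words
    }


# The tag from the recurring log message.
print(decode_msgtag(0xfffb1026))
```

For 0xfffb1026 this gives a label of -5, flags of 1, no items and 38 words, 
matching the breakdown done by hand.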

The code performing this logging doesn't indicate what the result of the 
message dispatch was, so I added a trace statement to see, which yielded this 
"tag" information:

  0xfc170000 -> label=0xfc17 (-1001) -> -L4_EMSGTOOSHORT
                flags=0x0
                items=0x00
                words=0x00

As far as I can tell, having followed the dispatch through four or so nested 
invocations, this is probably produced in the handle_svr_obj_call function 
in...

pkg/l4re-core/l4sys/include/cxx/ipc_server

I would add some debugging statements here as well, but doing so seems to 
cause a cascade of library requirements across a range of components. So, 
although I suspect that the request is malformed in some way, I have no firm 
evidence that this is the case. One check that the code performs on the 
request is the following:

  tag.words() + tag.items() * Item_words > Mr_words

Here, the left-hand side of the comparison is 38 (38 words and no items), 
whereas Mr_words is apparently 63, so this test does not identify the cause 
of the problem.
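
For concreteness, the comparison can be evaluated with the numbers from the 
request. This is a sketch only: Mr_words = 63 is the value observed above, 
whereas Item_words = 2 is an assumption (it makes no difference here, since 
the request carries no items), and request_too_large is a hypothetical name.

```python
MR_WORDS = 63     # value reported above
ITEM_WORDS = 2    # assumption: not confirmed anywhere in this message


def request_too_large(words, items, item_words=ITEM_WORDS, mr_words=MR_WORDS):
    """Mirror of: tag.words() + tag.items() * Item_words > Mr_words"""
    return words + items * item_words > mr_words


# The request seen in the log: 38 words, 0 items.
print(request_too_large(38, 0))
```

This prints False, so the request passes this particular check.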

Meanwhile, I looked into the nature of the request and found the UTCB-related 
functions in...

pkg/l4re-core/l4sys/include/utcb.h

Attempting to determine the nature of the supposed exception, I managed to 
discover that...

l4_utcb_exc_is_pf returns 1 (page fault)
l4_utcb_exc_pfa returns 0x800d1308 (which is a kernel mode address on MIPS)

The program counter is given as 0x7000049c, and the exception cause value of 
0x10 decodes to an "exception code" (ExcCode) of 4 in the CP0_CAUSE register 
(address error on load or instruction fetch).
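
The decoding of these values can be sketched as follows. This assumes the 
standard MIPS32 conventions: ExcCode occupies bits 2-6 of the Cause register, 
and the 32-bit address space is split into kuseg, kseg0, kseg1 and 
kseg2/kseg3. The function names and the abbreviated code table are 
illustrative only and do not come from the L4Re or Fiasco sources.

```python
# Only the two address error codes are listed; the table is incomplete.
EXC_CODES = {
    4: "AdEL (address error, load or instruction fetch)",
    5: "AdES (address error, store)",
}


def exc_code(cause):
    """Extract the ExcCode field (bits 2-6) from a CP0 Cause value."""
    return (cause >> 2) & 0x1f


def mips32_segment(addr):
    """Name the MIPS32 segment that a 32-bit virtual address falls in."""
    if addr < 0x80000000:
        return "kuseg"          # user-accessible
    if addr < 0xa0000000:
        return "kseg0"          # kernel, unmapped, cached
    if addr < 0xc0000000:
        return "kseg1"          # kernel, unmapped, uncached
    return "kseg2/kseg3"        # kernel, mapped


code = exc_code(0x10)
print(code, EXC_CODES.get(code, "unknown"))
print(mips32_segment(0x800d1308))   # the faulting address
print(mips32_segment(0x7000049c))   # the program counter
```

With the values above, the faulting address lands in kseg0 while the program 
counter is in kuseg, which would be consistent with user-mode code touching a 
kernel address and getting an address error for it.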

I thought that enabling more logging in sigma0 might help, presuming that the 
page fault would be propagated through the pager hierarchy. But changing 
debug_ipc to 1 in...

pkg/l4re-core/sigma0/server/src/globals.h

...indicated that sigma0 is not involved when the above requests are made and 
dispatched: there is a lot of logging from sigma0 at first, but logging from 
Moe takes over after a certain point. And I don't see any logging from Moe 
when these page faults occur: they are only reported with the "L4Re[svr]" 
prefix, as shown above.

So, I don't really have much more to go on here. There's a chance that my 
rdhwr instruction support introduced a bug, I suppose, even though I've read 
through that code several times and can't see anything obviously wrong with 
it. I do wonder whether the initialisation routine for other programs is 
initialising the t9 register improperly, as noted previously. But then, I 
don't understand why the erroneously initialised program isn't just terminated 
when its page fault can't be handled.

Although I've seen a fair amount of the L4Re internals now, I don't think I 
have any productive way of finding the problem here, unfortunately. I guess 
this exercise has at least given me a "tour" of the framework, and maybe that 
will be useful in the future, but I had hoped that this board was already 
supported well enough to run the example programs.

Paul