Booting L4Re on the CI20: Panic in sigma0

Thu Jul 20 22:10:48 CEST 2017

On Wednesday 19. July 2017 19.40.23 Paul Boddie wrote:
> 
> It always seems to involve an address of 0x8, which seems rather bizarre.
> Again, I think I must be missing something fundamental and must only be
> seeing the consequences.

So, I adjusted the kernel code, reinstating a commented-out debugging 
statement in the Thread::handle_page_fault method. With some of the details 
changed, it looks like this:

  printf("Translation error ? %p\n"
         "  is_kmem_page_fault ? %x\n"
         "  is_sigma0 ? %x\n"
         "  program counter: %p\n"
         "  regs->ip(): %p\n"
         "  page fault address: %p\n",
         (void *) PF::is_translation_error(error_code),
         Kmem::is_kmem_page_fault(pfa, error_code),
         !PF::is_translation_error(error_code) && mem_space()->is_sigma0(),
         (void *) pc,
         (void *) regs->ip(),
         (void *) pfa);

I also introduced a statement in Thread::handle_page_fault_pager as follows:

  printf("handle_page_fault_pager: pfa=" L4_PTR_FMT
         ", errorcode=" L4_PTR_FMT ", pc=%lx, bad_v_addr=%lx\n",
         pfa, error_code, regs()->ip(), regs()->bad_v_addr);

I then observe some strange behaviour:

Translation error ? 0x1
  is_kmem_page_fault ? 0
  is_sigma0 ? 0
  program counter: 0x80019c8c
  regs->ip(): 0x80019c8c
  page fault address: 0xc
  regs->bad_v_addr: 0xc
handle_page_fault_pager: pfa=0000000c, errorcode=00000009, pc=103502c, 
bad_v_addr=8cc4
L4Re[svr]: request: tag=0xfffe0002 proto=-2 obj=0x0
L4Re: page fault: 9 pc=103502c
L4Re[rm]: unhandled read page fault at 0x8 pc=0x103502c

In the above, the last three lines are normal debugging output. The (wrapped) 
line above those is from my statement in handle_page_fault_pager.

For some reason, the presumably correct bad_v_addr (bad virtual address, 
0x8cc4) arising in the apparent initial page fault (at 0x0103502c) does not 
get propagated back to L4Re alongside the associated program counter value. 
Instead, 0x8 gets reported in the L4Re logging output.

While handling this page fault, there appears to be another page fault in the 
kernel (at 0x80019c8c). This latter fault can't be handled (as discussed 
below) and so the original exception is eventually exposed in L4Re with the 
confused mix of details noted above.

The unlikely address of 0x8 reported by L4Re may be related to the kernel 
fault address of 0xc. According to the above details, that fault is raised at 
0x80019c8c, in the following code (found in Ram_quota::alloc):

80019c7c <_ZN9Ram_quota5allocEl>:
80019c7c:       40036000        mfc0    v1,c0_status
80019c80:       30670001        andi    a3,v1,0x1
80019c84:       41606000        di
80019c88:       000000c0        ehb
80019c8c:       8c82000c        lw      v0,12(a0)

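Note how the final, fault-causing instruction loads from offset 12 (0xc) 
relative to a0, suggesting that a0 is zero. That is not an expected value: 
under the usual MIPS calling convention, a0 carries the first argument, which 
for a C++ method such as this one should be the implicit "this" pointer. So 
the method appears to have been invoked on a null Ram_quota pointer.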
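
To check the reading of that instruction, here is a minimal sketch that 
decodes the word 0x8c82000c by hand using the standard MIPS I-type field 
layout and computes the address it would access. The names IType, 
decode_itype and effective_address are made up for illustration and are not 
taken from the kernel sources.

```cpp
#include <cstdint>

// Fields of a MIPS I-type instruction such as lw.
struct IType {
  std::uint32_t opcode;  // bits 31..26 (0x23 is lw)
  std::uint32_t rs;      // bits 25..21, base register number
  std::uint32_t rt;      // bits 20..16, destination register number
  std::int32_t  offset;  // bits 15..0, sign-extended
};

// Split a 32-bit instruction word into its I-type fields.
constexpr IType decode_itype(std::uint32_t insn)
{
  return IType{
    insn >> 26,
    (insn >> 21) & 0x1f,
    (insn >> 16) & 0x1f,
    static_cast<std::int16_t>(insn & 0xffff)
  };
}

// Address accessed by a load: base register value plus signed offset.
constexpr std::uint32_t effective_address(std::uint32_t base, const IType &i)
{
  return base + static_cast<std::uint32_t>(i.offset);
}

// The faulting word from the disassembly: lw v0,12(a0).
static_assert(decode_itype(0x8c82000cu).opcode == 0x23, "lw opcode");
static_assert(decode_itype(0x8c82000cu).rs == 4, "base register 4 is a0");
static_assert(decode_itype(0x8c82000cu).rt == 2, "target register 2 is v0");
static_assert(effective_address(0, decode_itype(0x8c82000cu)) == 0xc,
              "a0 == 0 gives the observed fault address");
```

With a base register value of zero, the computed address is 0xc, which 
matches the page fault address printed by the kernel.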

Unfortunately, I don't know the invocation chain responsible for this, and it 
doesn't appear to be very obvious how I might discover it efficiently.
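
One cheap way of finding at least the immediate caller might be to log the 
return address when the pointer turns out to be null, and then look that 
address up in the kernel disassembly, as was done for 0x80019c8c above. The 
following is only a sketch of the idea under stated assumptions: Quota_sketch, 
alloc_checked and last_null_caller are hypothetical stand-ins, the field 
layout is invented so that one member sits at offset 12, and I have not tried 
this inside the kernel. The only non-standard facility used is the GCC/Clang 
builtin __builtin_return_address.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for a quota object; the layout is an assumption,
// chosen only so that a member lives at offset 12 as in the faulting load.
struct Quota_sketch {
  std::uint32_t pad[3];
  std::uint32_t current;
};
static_assert(offsetof(Quota_sketch, current) == 12, "member at offset 0xc");

// Return address recorded by the most recent call made with a null pointer.
static void *last_null_caller = nullptr;

// Checked variant of an allocation routine: instead of dereferencing a null
// pointer, report where the call came from and fail.
__attribute__((noinline))
bool alloc_checked(Quota_sketch *q, std::uint32_t bytes)
{
  if (!q) {
    // Level 0 is the address this call will return to, i.e. a location
    // inside the immediate caller.
    last_null_caller = __builtin_return_address(0);
    std::printf("alloc: null quota, called from %p\n", last_null_caller);
    return false;
  }
  q->current += bytes;
  return true;
}
```

Calling alloc_checked(nullptr, 16) prints the caller's return address instead 
of faulting. This only reveals one level of the chain, so the same check 
would need repeating in the caller to go further up.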

Paul