Re: L4Re: Identifying the source location of a program exception

31 Jul 2020

Frank,

Thank you very much for your very descriptive account of how the exception 
location might be discovered using the kernel debugger. I think this may be a 
long exercise, but I wanted to respond to at least acknowledge your message.

On Wednesday, 29 July 2020 09:24:17 CEST Frank Mehnert wrote:
...
I want to encourage you to take the program counter value seriously. The
message says that there was an access to the memory at address 0x38
(sounds like an access to offset 0x38 of an object where the object pointer
was not initialized) and the corresponding program counter in userland
is 0x3a8bd9. From that value I guess that your host is AMD64.
Yes, that is correct. I also assumed that the error was related to a null 
reference.
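
To illustrate the reasoning about such a small fault address: a member
access through a null object pointer touches address 0 plus the member's
offset. The sketch below uses Python's ctypes to model a purely hypothetical
structure layout (the name Example and its fields are assumptions, not taken
from the program in question) in which a member happens to sit at offset 0x38.

```python
import ctypes


class Example(ctypes.Structure):
    """Hypothetical layout: 7 * 8 = 56 (0x38) bytes precede 'value'."""
    _fields_ = [
        ("header", ctypes.c_uint64 * 7),
        ("value", ctypes.c_uint64),
    ]


def fault_address(object_pointer, field):
    """Address touched when 'field' is accessed through 'object_pointer'."""
    return object_pointer + field.offset


# With a null (zero) object pointer, the access lands on the bare offset.
null_fault = fault_address(0, Example.value)
print(hex(null_fault))
```

A fault address that is small and looks like a plausible member offset is
therefore a hint, not proof, that the object pointer itself was zero.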
...
Now the question is of course: Which application triggered this exception?
If you know the answer then you should disassemble the corresponding binary
with
objdump -ldC <filename> | less
and search for the program counter. If your binary was compiled with
debugging information, you will even see the source code around the
faulting instruction.
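
As a rough sketch of the "search for the program counter" step, the helper
below scans disassembly text (as might be captured from the objdump command
above) for the instruction at a given address and returns it with a little
preceding context. The function name and the regular expression are
assumptions for illustration; the sample lines are copied from the listing
quoted later in this message.

```python
import re

# Matches instruction lines such as " 100491b:   ff 50 70   callq ..."
INSTRUCTION_LINE = re.compile(r"^\s*([0-9a-fA-F]+):\s")


def find_instruction(disassembly, pc, context=2):
    """Return the line at address 'pc' plus up to 'context' lines before it."""
    lines = disassembly.splitlines()
    for index, line in enumerate(lines):
        match = INSTRUCTION_LINE.match(line)
        if match and int(match.group(1), 16) == pc:
            return lines[max(0, index - context):index + 1]
    return []  # address not found in this text


SAMPLE = """\
 100490f:       49 8b 04 24             mov    (%r12),%rax
 1004913:       4c 89 ee                mov    %r13,%rsi
 1004916:       31 d2                   xor    %edx,%edx
 1004918:       4c 89 e7                mov    %r12,%rdi
 100491b:       ff 50 70                callq  *0x70(%rax)
"""

hits = find_instruction(SAMPLE, 0x100491B)
for line in hits:
    print(line)
```

An empty result means the address does not appear in the text searched,
which may indicate that the wrong binary is being inspected.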
If your binary was not compiled with debugging information:
1. If the application is compiled within the L4Re tree then use the
    binary from the package build directory because that one is not
    stripped, for example
build-x86-64/pkg/hello/server/src/OBJ-amd64_gen-l4f/hello
rather than
build-x86-64/bin/amd64_gen/l4f/hello
because the latter binary is stripped (i.e. contains no debugging
    information) if CONFIG_BID_STRIP_PROGS is set to 'y'.
This is a useful reminder, but I think I must have experienced difficulties 
before with the bin subdirectory's contents, so I tend to access the 
appropriate binaries inside their package directories, anyway. It's probably 
just good fortune that something in my mind remembers the right kind of 
location to investigate.
...
2. If you compiled the binary yourself, make sure to add the '-g' flag
    to the compiler options. For L4Re applications using the L4Re build
    infrastructure this is done automatically, see 1.
I think that getting programs built outside the L4Re build framework would be 
too advanced for me.
...
Next question: Is your binary linked statically or does it use dynamic
libraries? You can find this out by doing
objdump -p <filename>
If the output contains at least one line with 'NEEDED' then your binary
uses dynamic libraries and looking for the program counter can be more
difficult if the fault happens in a dynamic library because the library
code is relocated to an unknown address when the library is loaded at
program start.
Therefore for debugging it's always advisable to use statically linked
binaries. If your application uses the L4Re build infrastructure, set
MODE = static
in the Makefile. If you use your own Makefile, make sure to add
-static
to the linker flags.
Exploring your application binary is always the first advisable strategy
for investigating such an exception.
Here, I was using shared libraries, so I have now switched the linking of the 
offending program to be static.
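
The 'NEEDED' check described above can be sketched as a small filter over
the text printed by "objdump -p". The function names are assumptions, and the
sample text is a hypothetical excerpt with placeholder library names, not
output from the program discussed here.

```python
def needed_libraries(objdump_p_output):
    """Return the library names from 'NEEDED' lines of 'objdump -p' text."""
    libraries = []
    for line in objdump_p_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "NEEDED":
            libraries.append(fields[1])
    return libraries


def is_dynamically_linked(objdump_p_output):
    """True if at least one 'NEEDED' entry is present."""
    return bool(needed_libraries(objdump_p_output))


# Hypothetical excerpts; the library names are placeholders.
SAMPLE_DYNAMIC = """\
Dynamic Section:
  NEEDED               libexample_one.so
  NEEDED               libexample_two.so
  INIT                 0x0000000000400a00
"""
SAMPLE_STATIC = "Program Header:\n    LOAD off    0x0000000000000000\n"

print(needed_libraries(SAMPLE_DYNAMIC))
print(is_dynamically_linked(SAMPLE_STATIC))
```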

[Details of the current thread and the return instruction address...]
...
Remember: You are inspecting the region mapper thread which is != the
thread which triggered the exception! Therefore, if you press <space>
at the word marked as 'Return frame: IP', you will see the code for
'enter_kdebug()'. That doesn't help you.
This was certainly very useful advice, saving me quite some potential 
frustration, along with this:
...
Now use the 'lp' view to see the list of present threads in the system. The
cursor is placed at the current thread (the region mapper of your
application). Look around at threads with the same 'sp' value (sp = space,
the address space of the application). See this example:
   id  cpu    name             pr     sp  wait    to state
   20   0     hello             2     1c    1d       ready,rcv_wait
   1d   0     #hello           ff     1c             ready
    d   0     moe              ff      c     -       ready,rcv_wait
    b   0     sigma0            1      a     -       ready,rcv_wait
    9   1     -----             0      1             ready
    8   3     -----             0      1             ready
    7   2     -----             0      1             ready
    6   0     -----             0      1             ready
(this setup emulates 4 CPUs, thus there are 4 idle threads)
Thread '1d' is the region mapper thread of the hello application. 'hello'
has 2 threads, thread 1d and thread 20. Thread 20 is currently waiting
for an IPC from thread 1d. Therefore thread 20 is the one you want to
inspect. Go there and press enter. Then move the TCB stack cursor down
to 'Return frame: IP' as I told you before, see there:
OK, so following these instructions, I think I correctly identify the waiting 
thread in the same "space" corresponding to the region mapper thread. 
Navigating to the return instruction address indeed indicates the reported 
address:

L4Re[rm]: unhandled read page fault at 0x70 pc=0x100491b
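
The thread selection step quoted above (finding the other threads that share
the region mapper's 'sp' value) can be sketched as follows, assuming the
thread list has been captured as text, for instance from a serial console.
The parsing is an assumption based only on the example listing: the first
five columns are taken as id, cpu, name, pr and sp, and the last column as
the state.

```python
def parse_threads(listing):
    """Parse thread-list rows into dictionaries, skipping the header line."""
    threads = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] == "id":
            continue
        threads.append({
            "id": fields[0], "cpu": fields[1], "name": fields[2],
            "pr": fields[3], "sp": fields[4], "state": fields[-1],
        })
    return threads


def threads_sharing_space(listing, thread_id):
    """Return ids of the other threads in the same space as 'thread_id'."""
    threads = parse_threads(listing)
    spaces = [t["sp"] for t in threads if t["id"] == thread_id]
    if not spaces:
        return []
    return [t["id"] for t in threads
            if t["sp"] == spaces[0] and t["id"] != thread_id]


SAMPLE_THREADS = """\
   id  cpu    name             pr     sp  wait    to state
   20   0     hello             2     1c    1d       ready,rcv_wait
   1d   0     #hello           ff     1c             ready
    d   0     moe              ff      c     -       ready,rcv_wait
    b   0     sigma0            1      a     -       ready,rcv_wait
"""

# Starting from the region mapper thread '1d', find its siblings.
candidates = threads_sharing_space(SAMPLE_THREADS, "1d")
print(candidates)
```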

And if I look in the objdump output, at least on some occasions, I can find an 
instruction which would be causing the exception. The code looks like this:

 100490f:       49 8b 04 24             mov    (%r12),%rax
 1004913:       4c 89 ee                mov    %r13,%rsi
 1004916:       31 d2                   xor    %edx,%edx
 1004918:       4c 89 e7                mov    %r12,%rdi
 100491b:       ff 50 70                callq  *0x70(%rax)

It is at this final instruction that the exception occurs, and the offset is 
as reported, too.

The awkward thing here, though, is that the offending instruction is a virtual 
method call within the same instance:

this->flush_flexpage(flexpage);

As I think I noted in my previous message, concurrency issues may be involved 
here, and I rather think I may need to step back and consider whether I am 
doing things well enough.
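
One tentative reading of the listing above: "mov (%r12),%rax" loads the first
word of the object (with the usual C++ ABI on AMD64, the vtable pointer), and
"callq *0x70(%rax)" then reads the function address stored at that pointer
plus 0x70. A read fault at exactly 0x70 would then suggest that the loaded
vtable pointer was zero, while the object address in %r12 was itself still
readable. The arithmetic is sketched below; the function name is an
assumption, as is the 8-byte slot size.

```python
def analyse_indirect_call_fault(fault_address, displacement, pointer_size=8):
    """Work backwards from a fault in 'call *displacement(%reg)'.

    Returns the implied register value and the table slot index,
    assuming slots of 'pointer_size' bytes.
    """
    return {
        "base_register": fault_address - displacement,
        "slot": displacement // pointer_size,
    }


# Values from the fault message and the faulting instruction.
result = analyse_indirect_call_fault(0x70, 0x70)
print(result)
```

A zero vtable pointer in an otherwise accessible object would be consistent
with the object's memory having been cleared or reused while the call was in
progress, which fits the suspicion about concurrency, although it does not
confirm it.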

Paul

Paul Boddie