Frank,
Thank you very much for your very descriptive account of how the exception location might be discovered using the kernel debugger. I think this may be a long exercise, but I wanted to respond to at least acknowledge your message.
On Wednesday, 29 July 2020 09:24:17 CEST Frank Mehnert wrote:
I want to encourage you to take the program counter value seriously. The message says that there was an access to the memory at address 0x38 (sounds like an access to offset 0x38 of an object where the object pointer was not initialized) and the corresponding program counter in userland is 0x3a8bd9. From that value I guess that your host is AMD64.
Yes, that is correct. I also assumed that the error was related to a null reference.
Now the question is of course: Which application triggered this exception? If you know the answer then you should disassemble the corresponding binary with
objdump -ldC <filename> | less
and search for the program counter. If your binary was compiled with debugging information, you will even see the source code around the faulting instruction.
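The search for the program counter can also be scripted. The following is only a minimal sketch, assuming the disassembly has been saved as text and that each instruction line starts with the hexadecimal address followed by a colon, as objdump prints it. The function name `find_instruction` and the `SAMPLE` text (taken from the disassembly quoted later in this message) are illustrative choices, not part of any L4Re tooling.

```python
import re

# Instruction lines in objdump output look like "  100491b:  ff 50 70  callq ...".
# Symbol headers ("0000000001004900 <name>:") have no colon directly after the
# address, so they do not match this pattern.
ADDRESS_LINE = re.compile(r"^\s*([0-9a-fA-F]+):\s")

def find_instruction(disassembly: str, pc: int, context: int = 2):
    """Return the instruction line at address pc preceded by up to
    `context` earlier lines, or an empty list if pc is not found."""
    lines = disassembly.splitlines()
    for i, line in enumerate(lines):
        match = ADDRESS_LINE.match(line)
        if match and int(match.group(1), 16) == pc:
            return lines[max(0, i - context): i + 1]
    return []

# Sample text for illustration only.
SAMPLE = """\
 100490f:  49 8b 04 24   mov    (%r12),%rax
 1004913:  4c 89 ee      mov    %r13,%rsi
 1004916:  31 d2         xor    %edx,%edx
 1004918:  4c 89 e7      mov    %r12,%rdi
 100491b:  ff 50 70      callq  *0x70(%rax)
"""

if __name__ == "__main__":
    for line in find_instruction(SAMPLE, 0x100491b):
        print(line)
```

With output from `objdump -ldC`, the preceding context lines would include the source file and line annotations when debugging information is present.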
If your binary was not compiled with debugging information:
- If the application is compiled within the L4Re tree then use the binary from the package build directory because that one is not stripped, for example
build-x86-64/pkg/hello/server/src/OBJ-amd64_gen-l4f/hello
rather than
build-x86-64/bin/amd64_gen/l4f/hello
because the latter binary is stripped (i.e. contains no debugging information) if CONFIG_BID_STRIP_PROGS is set to 'y'.
This is a useful reminder, but I think I must have experienced difficulties before with the bin subdirectory's contents, so I tend to access the appropriate binaries inside their package directories, anyway. It's probably just good fortune that something in my mind remembers the right kind of location to investigate.
- If you compiled the binary yourself, make sure to add the '-g' flag to the compiler options. For L4Re applications using the L4Re build infrastructure this is done automatically, see [1].
I think that getting programs built outside the L4Re build framework would be too advanced for me.
Next question: Is your binary linked statically or does it use dynamic libraries? You can find this out by doing
objdump -p <filename>
If the output contains at least one line with 'NEEDED' then your binary uses dynamic libraries and looking for the program counter can be more difficult if the fault happens in a dynamic library because the library code is relocated to an unknown address when the library is loaded at program start.
Therefore, for debugging it's always advisable to use statically linked binaries. If your application uses the L4Re build infrastructure, set
MODE = static
in the Makefile. If you use your own Makefile, make sure to add
-static
to the linker flags.
Exploring your application binary is always the advisable first step when investigating such an exception.
Here, I was using shared libraries, so I have now switched the linking of the offending program to be static.
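The 'NEEDED' check described above can be automated. This is a rough sketch, assuming the usual `objdump -p` layout in which each dynamic section entry is a tag followed by a value. The function names and the library name `libexample.so` in the sample are hypothetical and serve only as illustration.

```python
import subprocess

def needed_libraries(objdump_p_output: str):
    """Collect the library names from 'NEEDED' lines in objdump -p output."""
    libraries = []
    for line in objdump_p_output.splitlines():
        fields = line.split()
        # A dynamic section entry looks like "  NEEDED   libname.so".
        if len(fields) == 2 and fields[0] == "NEEDED":
            libraries.append(fields[1])
    return libraries

def is_dynamically_linked(path: str) -> bool:
    """Run objdump -p on the given file and report whether any NEEDED
    entry is present. Requires objdump to be installed."""
    result = subprocess.run(["objdump", "-p", path],
                            capture_output=True, text=True, check=True)
    return bool(needed_libraries(result.stdout))

# Hypothetical sample output, for illustration only.
SAMPLE_DYNAMIC = """\
Dynamic Section:
  NEEDED               libexample.so
  INIT                 0x0000000000001000
"""

SAMPLE_STATIC = """\
Program Header:
    LOAD off    0x0000000000000000 vaddr 0x0000000001000000
"""

if __name__ == "__main__":
    print(needed_libraries(SAMPLE_DYNAMIC))
    print(needed_libraries(SAMPLE_STATIC))
```

An empty list suggests a statically linked binary, in which case the program counter from the fault message can be looked up directly in the disassembly.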
[Details of the current thread and the return instruction address...]
Remember: You are inspecting the region mapper thread which is != the thread which triggered the exception! Therefore, if you press <space> at the word marked as 'Return frame: IP', you will see the code for 'enter_kdebug()'. That doesn't help you.
This was certainly very useful advice, saving me quite some potential frustration, along with this:
Now use the 'lp' view to see the list of present threads in the system. The cursor is placed at the current thread (the region mapper of your application). Look around at threads with the same 'sp' value (sp = space, the address space of the application). See this example:
id  cpu  name    pr  sp  wait  to  state
20  0    hello   2   1c  1d        ready,rcv_wait
1d  0    #hello  ff  1c            ready
d   0    moe     ff  c   -         ready,rcv_wait
b   0    sigma0  1   a   -         ready,rcv_wait
9   1    -----   0   1             ready
8   3    -----   0   1             ready
7   2    -----   0   1             ready
6   0    -----   0   1             ready
(this setup emulates 4 CPUs, thus there are 4 idle threads)
Thread '1d' is the region mapper thread of the hello application. 'hello' has 2 threads, thread 1d and thread 20. Thread 20 is currently waiting for an IPC from thread 1d. Therefore thread 20 is the one you want to inspect. Go there and press enter. Then move the TCB stack cursor down to 'Return frame: IP' as I told you before, see there:
OK, so following these instructions, I think I have correctly identified the waiting thread in the same "space" as the region mapper thread. Navigating to the return instruction address indeed indicates the reported address:
L4Re[rm]: unhandled read page fault at 0x70 pc=0x100491b
And if I look in the objdump output, at least on some occasions, I can find an instruction which would be causing the exception. The code looks like this:
 100490f:  49 8b 04 24   mov    (%r12),%rax
 1004913:  4c 89 ee      mov    %r13,%rsi
 1004916:  31 d2         xor    %edx,%edx
 1004918:  4c 89 e7      mov    %r12,%rdi
 100491b:  ff 50 70      callq  *0x70(%rax)
It is at this final instruction that the exception occurs, and the offset is as reported, too.
The awkward thing here, though, is that the offending instruction is a virtual method call within the same instance:
this->flush_flexpage(flexpage);
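The arithmetic behind the fault address can be made explicit. The sketch below assumes that execution fell straight through the five instructions shown, and that the compiler follows the usual AMD64 C++ convention of storing the vtable pointer in the first word of the object with 8-byte slots. The function names are illustrative only.

```python
POINTER_SIZE = 8  # bytes per pointer on AMD64

def base_register_value(fault_address: int, displacement: int) -> int:
    """For an access of the form disp(%reg), recover the register value
    from the faulting address reported by the region mapper."""
    return fault_address - displacement

def vtable_slot(displacement: int) -> int:
    """Index of the vtable entry addressed by the given displacement,
    counting from zero (assumes 8-byte slots)."""
    return displacement // POINTER_SIZE

if __name__ == "__main__":
    # callq *0x70(%rax) faulted while reading address 0x70.
    rax = base_register_value(0x70, 0x70)
    print(hex(rax), vtable_slot(0x70))
```

Under these assumptions %rax was zero at the call, and %rax had just been loaded from (%r12), the first word of the object. Since that earlier load did not fault, the `this` pointer itself appears to have referred to mapped memory, but the word where the vtable pointer would normally live read as zero. That would be consistent with the object having been cleared, destroyed, or not yet constructed when the call was made, although the disassembly alone cannot confirm which.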
As I think I noted in my previous message, concurrency issues may be involved here, and I rather think I may need to step back and consider whether I am doing things well enough.
Paul