On Thursday 14. June 2018 15.38.33 Paul Boddie wrote:
> Is there some way of interpreting the "PFA" value or getting more information about where the exception really occurs?
Well, I got some off-list help/encouragement (many thanks!) and put in some debugging statements to see what the cause of the exception might be.
In the Region_map::op_exception method definition (found in the file pkg/l4re-core/l4re_kernel/server/src/region.cc), modifying the debugging output yields the following information:
pc=0x800000 (program counter)
gp=0x82dd30 (global pointer)
sp=0x8d7a (stack pointer, called "PFA" in the default output)
ra=0x802f6c (return address)
cause=0x1000002c (exception cause)
Initially, I thought this might just be a stray memory access, not really knowing the significance of 0x800000 and whether it might be valid for the program counter. However, further investigation indicated that it is clearly the base address for the loaded object. Also, the stack pointer is fine.
The clue to the actual cause of the exception is the "cause" register whose lower bits provide an indication of the nature of the exception. This turned out to be a "coprocessor unusable" exception.
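For anyone wanting to repeat the decoding, here is a minimal sketch of pulling the relevant fields out of the cause value. It assumes the standard MIPS32 CP0 Cause layout (ExcCode in bits 6..2, CE in bits 29..28); the function name is just for illustration.

```python
def decode_cause(cause):
    """Return (ExcCode, CE) from a MIPS32 CP0 Cause register value."""
    exc_code = (cause >> 2) & 0x1F   # ExcCode field, bits 6..2
    ce = (cause >> 28) & 0x3         # CE field, bits 29..28 (coprocessor number)
    return exc_code, ce

exc_code, ce = decode_cause(0x1000002C)
# ExcCode 11 is "coprocessor unusable" (CpU); CE names the coprocessor involved.
print(exc_code, ce)
```

With the value reported above, ExcCode comes out as 11 and CE as 1, which fits a "coprocessor unusable" exception for coprocessor 1.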
It was suggested that I output the details of the instruction causing the exception (which I should really have thought of doing myself, but I guess it has been a while since I did this kind of debugging), and this yielded the following value:
464c457f
The significance of this value may be obvious to people here, especially given where it was found, but for the rest of us, I can reveal that it is just the ELF magic number (0x7f 'E' 'L' 'F'). It is a "happy" coincidence that this value looks somewhat like a "coprocessor 1" instruction for the MIPS32 architecture, with the appropriate bits (31..26) indicating a COP1 instruction type, causing the exception on this SoC without such a coprocessor.
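A quick way to check the coincidence, as a sketch: load the four magic bytes as a little-endian word (little-endian is an assumption here, implied by the value printed above) and extract the primary opcode field.

```python
import struct

ELF_MAGIC = b"\x7fELF"

# Interpret the first four bytes of the file as one little-endian 32-bit word.
word = struct.unpack("<I", ELF_MAGIC)[0]

# The MIPS32 primary opcode occupies bits 31..26; 0x11 is the COP1 opcode.
opcode = word >> 26

print(hex(word), hex(opcode))
```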
I dumped and disassembled the calling region of the code which yielded this:
8f998250 # lw $t9, -32176($gp)
24a55fa8 # addiu $a1, $a1, 0x5fa8
0320f809 # jalr $t9
24844ee4 # addiu $a0, $a0, 0x4ee4
8fbc0010 # lw $gp, 16($sp)
With these details, and using objdump to dump all the programs and libraries, I discovered that it comes from the _ftext section of libld-l4.so:
2f5c: 8f998250 lw t9,-32176(gp)
2f60: 24a55fa8 addiu a1,a1,24488
2f64: 0320f809 jalr t9
2f68: 24844ee4 addiu a0,a0,20196
2f6c: 8fbc0010 lw gp,16(sp)
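The match between the run-time return address and the objdump listing can be checked with a little arithmetic. This sketch assumes the object is loaded at 0x800000 (the pc value above) and relies on jalr setting ra to the address after its delay slot, meaning the jalr itself sits 8 bytes before ra.

```python
LOAD_BASE = 0x800000   # assumed load address of the object
RA = 0x802F6C          # return address reported in the exception output

ra_offset = RA - LOAD_BASE     # offset of the return point within the object
jalr_offset = ra_offset - 8    # jalr precedes its delay slot and the return point

print(hex(ra_offset), hex(jalr_offset))
```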
So, what appears to be happening is that the "jalr t9" instruction is using a value for t9 that is 0x800000, which causes a jump to the object header and the subsequent failure. Here, I started to suspect a problem with the gp register initialisation, but this appears completely reasonable:
00002780 <_ftext>:
2780: 3c1c0003 lui gp,0x3
2784: 279cb5b0 addiu gp,gp,-19024
2788: 0399e021 addu gp,gp,t9
This provides a value of 0x30000 - 19024 + 0x2780 == 0x2dd30 for gp, taking t9 to hold the link-time address of _ftext (0x2780). (I guess that it is really 0x82dd30 at run time, with t9 having been adjusted.) The global offset table resides at 0x25d40:
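The prologue arithmetic can be reproduced from the three instruction words. This is only a sketch: it assumes that t9 holds the address of _ftext on entry (the usual MIPS PIC convention) and that the object is loaded at 0x800000; the helper names are made up.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a signed quantity."""
    return value - 0x10000 if value & 0x8000 else value

def gp_from_prologue(lui_word, addiu_word, t9):
    """Evaluate lui gp,hi; addiu gp,gp,lo; addu gp,gp,t9."""
    hi = (lui_word & 0xFFFF) << 16
    lo = sign_extend16(addiu_word & 0xFFFF)
    return (hi + lo + t9) & 0xFFFFFFFF

FTEXT = 0x2780        # link-time address of _ftext from the objdump output
LOAD_BASE = 0x800000  # assumed load address

gp_link = gp_from_prologue(0x3C1C0003, 0x279CB5B0, FTEXT)
gp_run = gp_from_prologue(0x3C1C0003, 0x279CB5B0, LOAD_BASE + FTEXT)
print(hex(gp_link), hex(gp_run))
```

The run-time result agrees with the gp value shown in the exception output.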
00025d40 a _GLOBAL_OFFSET_TABLE_
But the difference between gp and the table (0x2dd30 - 0x25d40 == 0x7ff0) is as expected:
#define OFFSET_GP_GOT 0x7ff0
(See: pkg/l4re-core/uclibc/lib/contrib/uclibc/ldso/ldso/mips/elfinterp.c)
Where things seem to go wrong is with the computation of t9 before the call:
gp - 32176 == 0x2dd30 - 32176 == 0x25f80
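The same figure falls out of decoding the lw instruction word directly. A small sketch, using the standard MIPS32 I-type field layout and the link-time gp value of 0x2dd30; the function names are illustrative.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a signed quantity."""
    return value - 0x10000 if value & 0x8000 else value

def decode_itype(word):
    """Split an I-type instruction into (opcode, rs, rt, signed immediate)."""
    opcode = word >> 26
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    imm = sign_extend16(word & 0xFFFF)
    return opcode, rs, rt, imm

opcode, rs, rt, imm = decode_itype(0x8F998250)  # lw $t9, -32176($gp)
GP_LINK = 0x2DD30
slot = GP_LINK + imm   # address the lw reads from, before relocation

# opcode 0x23 is lw; register 28 is gp; register 25 is t9.
print(hex(opcode), rs, rt, imm, hex(slot))
```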
The next symbol/section after the table is this one:
00025f90 g __dso_handle
So, the location of the address to be used (0x25f80) is within the region of the table. However, dumping the memory from the start of the table until the next section indicates that the address lies within an area that seems to be padding, appearing after all the meaningful entries and featuring only 0x800000 for each such entry.
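To see how close to the end of the table the slot lies, here is a sketch of the index arithmetic. It assumes 4-byte entries (MIPS32) and treats everything up to __dso_handle as belonging to the table, which may overstate the size of the .got section proper.

```python
GOT_START = 0x25D40     # _GLOBAL_OFFSET_TABLE_
NEXT_SYMBOL = 0x25F90   # __dso_handle
SLOT = 0x25F80          # address read by the lw before the jalr
WORD = 4                # assumed size of one table entry

index = (SLOT - GOT_START) // WORD          # zero-based index of the slot
total = (NEXT_SYMBOL - GOT_START) // WORD   # words between the two symbols

print(index, total, total - index)
```

So the slot is the fourth word from the end of that region, which is consistent with it falling in the padding rather than among the meaningful entries.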
I imagine that some code fixes up the table, adding the object base address to each entry, and the padding is also adjusted because the loop just proceeds until it encounters the next section (or the end of the .got section).
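As a purely hypothetical model of that guess (this is not the actual dynamic linker code, and the table contents are made up), a fix-up loop that blindly adds the base to every word would turn zero-filled padding into entries holding exactly the base address:

```python
def relocate_table(entries, load_base):
    """Add the load base to every entry, padding included (hypothetical model)."""
    return [(entry + load_base) & 0xFFFFFFFF for entry in entries]

# Made-up table: two meaningful link-time addresses followed by zero padding.
table = [0x00002F00, 0x00003A10, 0x00000000, 0x00000000]

relocated = relocate_table(table, 0x800000)
print([hex(entry) for entry in relocated])
```

A jump through one of the padding entries would then land on the object header at 0x800000, which matches the pc value seen in the exception.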
What I cannot figure out is where the _ftext code actually comes from. It seems to be some kind of initialisation code, but the only things containing _ftext in the distribution are linker scripts. So I don't know where to find the offending operations.
I did test shared executables successfully on the CI20 which uses a different MIPS32 architecture revision. The code is rather different, perhaps employing different styles of code generation that never got applied to the earlier architecture revision, but I notice that the global offset tables are similarly sized and have the last meaningful entry referring to __dl_runtime_pltresolve.
I suppose that I need to figure out which code is responsible for the failing invocation and why the generated code is trying to access an uninitialised table entry. Although I suspect that some change I have made is responsible, there doesn't seem to be anything really obvious amongst my patches. The patches I needed to fix t9 initialisation are also in use for the CI20, so I doubt that they would have an effect, even if they were relevant here.
Sorry for the long message, but not being particularly familiar with the way dynamic linking works, I feel that reporting my observations might trigger the memories of those who have seen such problems before.
Paul