Building programs with MODE=shared in L4Re
Paul Boddie
paul at boddie.org.uk
Sun Jun 17 01:07:44 CEST 2018
On Thursday 14. June 2018 15.38.33 Paul Boddie wrote:
>
> Is there some way of interpreting the "PFA" value or getting more
> information about where the exception really occurs?
Well, I got some off-list help/encouragement (many thanks!) and put in some
debugging statements to see what the cause of the exception might be.
In the Region_map::op_exception method definition (found in the file
pkg/l4re-core/l4re_kernel/server/src/region.cc), modifying the debugging
output yields the following information:
pc=0x800000 (program counter)
gp=0x82dd30 (global pointer)
sp=0x8d7a (stack pointer, called "PFA" in the default output)
ra=0x802f6c (return address)
cause=0x1000002c (exception cause)
Initially, I thought this might just be a stray memory access, not really
knowing the significance of 0x800000 and whether it might be valid for the
program counter. However, further investigation indicated that it is clearly
the base address for the loaded object. Also, the stack pointer is fine.
The clue to the actual cause of the exception is the "cause" register, whose
ExcCode field (bits 6..2) indicates the nature of the exception. Here the
field holds 11, which is a "coprocessor unusable" exception.
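For anyone wanting to repeat the decoding, here is a minimal sketch of pulling
the relevant fields out of the reported value. The field positions are those
of the MIPS32 Cause register; the function and table names are made up for
illustration and do not come from the L4Re sources.

```python
# Decode the fields of the MIPS32 Cause register that matter here.
CAUSE = 0x1000002c  # value reported by the modified debugging output


def decode_cause(cause):
    """Return (exc_code, ce) from a MIPS32 Cause register value."""
    exc_code = (cause >> 2) & 0x1f  # ExcCode: bits 6..2
    ce = (cause >> 28) & 0x3        # CE: bits 29..28, coprocessor number
    return exc_code, ce


# Illustrative table holding only the code seen in this report.
EXC_NAMES = {11: "CpU (coprocessor unusable)"}

exc_code, ce = decode_cause(CAUSE)
print(exc_code, EXC_NAMES.get(exc_code, "other"), "coprocessor", ce)
```

The CE field names the coprocessor that was referenced, which here is
coprocessor 1.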
It was suggested that I output the details of the instruction causing the
exception (which I should really have thought of doing myself, but I guess it
has been a while since I did this kind of debugging), and this yielded the
following value:
464c457f
The significance of this value may be obvious to people here, especially given
where it was found, but for the rest of us, I can reveal that it is just the
ELF magic number (0x7f 'E' 'L' 'F'). It is a "happy" coincidence that this
value looks somewhat like a "coprocessor 1" instruction for the MIPS32
architecture, with the appropriate bits (31..26) indicating a COP1 instruction
type, causing the exception on this SoC without such a coprocessor.
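The coincidence can be checked with a few lines. This sketch assumes a
little-endian target, which is inferred from the byte order of the dumped
word rather than stated anywhere above.

```python
import struct

ELF_MAGIC = b"\x7fELF"  # first four bytes of any ELF object

# Assumption: little-endian target, so the header bytes form this word.
(word,) = struct.unpack("<I", ELF_MAGIC)

opcode = (word >> 26) & 0x3f  # major opcode: bits 31..26
COP1 = 0x11                   # MIPS32 major opcode for coprocessor 1

print(hex(word), hex(opcode), opcode == COP1)
```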
I dumped and disassembled the calling region of the code which yielded this:
8f998250 # lw $t9, -32176($gp)
24a55fa8 # addiu $a1, $a1, 0x5fa8
0320f809 # jalr $t9
24844ee4 # addiu $a0, $a0, 0x4ee4
8fbc0010 # lw $gp, 16($sp)
With these details, and using objdump to dump all the programs and libraries,
I discovered that it comes from the code that objdump labels _ftext in
libld-l4.so:
2f5c: 8f998250 lw t9,-32176(gp)
2f60: 24a55fa8 addiu a1,a1,24488
2f64: 0320f809 jalr t9
2f68: 24844ee4 addiu a0,a0,20196
2f6c: 8fbc0010 lw gp,16(sp)
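The match between the dumped words and the objdump listing can be verified by
decoding the I-type fields by hand. This is only a sketch: decode_itype is a
hypothetical helper, and it handles just the I-type layout used by lw and
addiu, not the jalr instruction.

```python
def decode_itype(word):
    """Split a MIPS32 I-type word into (opcode, rs, rt, signed imm)."""
    opcode = (word >> 26) & 0x3f
    rs = (word >> 21) & 0x1f
    rt = (word >> 16) & 0x1f
    imm = word & 0xffff
    if imm & 0x8000:  # sign-extend the 16-bit immediate
        imm -= 0x10000
    return opcode, rs, rt, imm


LW, ADDIU = 0x23, 0x09                   # major opcodes
A0, A1, T9, GP, SP = 4, 5, 25, 28, 29    # register numbers

for word in (0x8f998250, 0x24a55fa8, 0x24844ee4, 0x8fbc0010):
    print(hex(word), decode_itype(word))
```

The first word decodes to a load into register 25 (t9) from register 28 (gp)
with an offset of -32176, which is the table access that matters below.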
So, what appears to be happening is that the "jalr t9" instruction is using a
value for t9 that is 0x800000, which causes a jump to the object header and
the subsequent failure. Here, I started to suspect a problem with the gp
register initialisation, but this appears completely reasonable:
00002780 <_ftext>:
2780: 3c1c0003 lui gp,0x3
2784: 279cb5b0 addiu gp,gp,-19024
2788: 0399e021 addu gp,gp,t9
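The lui/addiu pair provides 0x30000 - 19024 == 0x2b5b0, and adding t9 (which
should hold the address of _ftext itself, 0x2780 before relocation) gives
0x2dd30 for gp. (I guess that it is really 0x82dd30 at run time, with t9
having been adjusted.) The global offset table resides at 0x25d40: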
00025d40 a _GLOBAL_OFFSET_TABLE_
But the difference between gp and the table is as expected:
#define OFFSET_GP_GOT 0x7ff0
(See: pkg/l4re-core/uclibc/lib/contrib/uclibc/ldso/ldso/mips/elfinterp.c)
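The gp arithmetic can be replayed as follows. The sketch assumes that t9
holds the address of _ftext on entry, as the usual MIPS position-independent
calling convention requires, and that the object is loaded at 0x800000; the
helper name is invented.

```python
FTEXT = 0x2780          # link-time address of _ftext (objdump listing)
GOT = 0x25d40           # _GLOBAL_OFFSET_TABLE_
OFFSET_GP_GOT = 0x7ff0  # from elfinterp.c
BASE = 0x800000         # load address, as seen in the pc value


def gp_after_prologue(t9):
    """Mirror lui gp,0x3 / addiu gp,gp,-19024 / addu gp,gp,t9."""
    gp = 0x3 << 16
    gp = (gp - 19024) & 0xffffffff
    return (gp + t9) & 0xffffffff


gp_link = gp_after_prologue(FTEXT)        # before relocation
gp_run = gp_after_prologue(BASE + FTEXT)  # assumed run-time value of t9
print(hex(gp_link), hex(gp_run), hex(gp_link - GOT))
```

The run-time result agrees with the gp value in the exception output, and the
distance to the table agrees with OFFSET_GP_GOT.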
Where things seem to go wrong is with the computation of t9 before the call:
gp - 32176 == 0x2dd30 - 32176 == 0x25f80
The next symbol/section after the table is this one:
00025f90 g __dso_handle
So, the location of the address to be used (0x25f80) is within the region of
the table. However, dumping the memory from the start of the table until the
next section indicates that the address lies within an area that seems to be
padding, appearing after all the meaningful entries and featuring only
0x800000 for each such entry.
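To see where the slot sits within the table, the offsets can be turned into
entry indexes. This sketch assumes 4-byte table entries, as is usual for a
32-bit MIPS object, and treats __dso_handle as marking the end of the table
area.

```python
GP = 0x2dd30          # link-time gp value
GOT = 0x25d40         # _GLOBAL_OFFSET_TABLE_
DSO_HANDLE = 0x25f90  # next symbol after the table
ENTRY_SIZE = 4        # assumption: 32-bit table entries

slot = GP - 32176                         # address used by lw t9,-32176(gp)
index = (slot - GOT) // ENTRY_SIZE        # entry number of that slot
total = (DSO_HANDLE - GOT) // ENTRY_SIZE  # entries before __dso_handle

print(hex(slot), index, total, GOT <= slot < DSO_HANDLE)
```

So the load reads entry 144 out of 148, only four entries from the end of the
area.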
I imagine that code fixes up the table, adding the object base address to each
entry, and the padding is also adjusted because the loop just proceeds until
it encounters the next section (or the end of the .got section).
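The suspected behaviour can be modelled like this. It is purely an
illustration of the guess above, not the actual ldso code: relocate_got is a
made-up name and the table contents are invented rather than dumped from the
library.

```python
BASE = 0x800000  # load address of the object


def relocate_got(entries, base):
    """Add the load address to every slot, padding included."""
    return [(entry + base) & 0xffffffff for entry in entries]


# Hypothetical link-time contents: two real entries, then zero padding.
got = [0x00002780, 0x00002f00, 0x00000000, 0x00000000]
relocated = relocate_got(got, BASE)
print([hex(entry) for entry in relocated])
```

A zero slot ends up holding exactly the base address, which would explain a
jump to the start of the object.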
What I cannot figure out is where the _ftext code actually comes from. It
seems to be some kind of initialisation code, but the only things containing
_ftext in the distribution are linker scripts. So I don't know where to find
the offending operations.
I did test shared executables successfully on the CI20 which uses a different
MIPS32 architecture revision. The code is rather different, perhaps employing
different styles of code generation that never got applied to the earlier
architecture revision, but I notice that the global offset tables are
similarly sized and have the last meaningful entry referring to
__dl_runtime_pltresolve.
I suppose that I need to figure out which code is responsible for the failing
invocation and why the generated code is trying to access an uninitialised
table entry. Although I suspect that some change I have made is responsible,
there doesn't seem to be anything really obvious amongst my patches. The
patches I needed to fix t9 initialisation are also in use for the CI20, so I
doubt that they would have an effect, even if they were relevant here.
Sorry for the long message, but not being particularly familiar with the way
dynamic linking works, I feel that reporting my observations might trigger the
memories of those who have seen such problems before.
Paul