Booting L4Re on the CI20: Panic in sigma0


Paul Boddie

13 Jul 2017

7:22 p.m.

Hello,

I've been trying to get Fiasco.OC and L4Re booting on the MIPS Creator CI20, starting out by building the "hello" example. The first obstacle was actually seeing the serial output, where the board appears to be configured to output to UART4 in the bootstrap package, but booting from an SD card produced no output. Although I thought I had messed up the preparation of the image, or that U-Boot was being fussy about the image addresses and failing to execute the payload, switching to UART0 and rewiring my connection got me some output.

For reference, changing the UART involves a couple of modifications to pkg/bootstrap/server/src/platform/ci20.cc as follows:

  -  kuart.base_address = 0x10034000; // UART4
  +  kuart.base_address = 0x10030000; // UART0
  -  kuart.irqno = 34;
  +  kuart.irqno = 51; // UART0: 32 + 19

But now I appear to experience a panic in sigma0 as it starts up, with the message...

  Warning: Sigma0 raised an exception --> HALT

Here are the regions:

  Regions of list 'regions'
  [        0,      1db] {      1dc} Kern   fiasco
  [     1000,     10eb] {       ec} Root   mbi_rt
  [    10000,    9d09f] {    8d0a0} Kern   fiasco
  [   140000,   184773] {    44774} Root   moe
  [   190000,   197f3f] {     7f40} Root   moe
  [   200000,   20be17] {     be18} Sigma0 sigma0
  [   210000,   2161bf] {     61c0} Sigma0 sigma0
  [   2d0000,   2e33df] {    133e0} Boot   bootstrap
  [  1100000,  1164fff] {    65000} Root   Module

And the registers look like this:

  00[ 0]: 00000000 at[ 1]: 80022e50 v0[ 2]: 00000001 v1[ 3]: 80000000
  a0[ 4]: 00010000 a1[ 5]: 002000e0 a2[ 6]: ffffffe7 a3[ 7]: 00000401
  t0[ 8]: 00000000 t1[ 9]: 00000401 t2[10]: 00000413 t3[11]: 82152f38
  t4[12]: 82152000 t5[13]: 801873bc t6[14]: fffffffe t7[15]: 801873bc
  s0[16]: 82152f60 s1[17]: 00000400 s2[18]: 00000001 s3[19]: 00000000
  s4[20]: 80090000 s5[21]: 00000000 s6[22]: 00000fa0 s7[23]: 00000000
  t8[24]: 8008519c t9[25]: 800a0000 k0[26]: ffffffff k1[27]: ffffffff
  gp[28]: 800b7f80 sp[29]: 00000000 fp[30]: 80185000 ra[31]: 80010000
  HI: 00000000 LO: 000003a8
  Status 00000413 Cause 00000010 EPC 002000ec

The EPC indeed appears to reference sigma0, with the Cause indicating an erroneous data or instruction fetch operation. Looking at the disassembly of sigma0...

  002000e0 <__start>:
    2000e0: 3c1c0001 lui gp,0x1
    2000e4: 279c7f80 addiu gp,gp,32640
    2000e8: 0399e021 addu gp,gp,t9
    2000ec: 8f9d8018 lw sp,-32744(gp)
    2000f0: 8f99801c lw t9,-32740(gp)
    2000f4: 27bdfff0 addiu sp,sp,-16
    2000f8: 0320f809 jalr t9
    2000fc: 00000000 nop

...it appears that the problem occurs when the global offset table is accessed. The global pointer gets computed as...

  0x10000 + 32640 + 0x800a0000 = 0x800b7f80

...with the load-relative operation accessing...

  0x800b7f80 - 32744 = 0x800aff98

It is presumably this address that is illegal within the failing thread of execution.

I've been looking at the debugger documentation...

  http://l4re.org/fiasco/doc/jdb.pdf

...but I'm not sure I'm doing the right things to see the state of the machine. Attempting to dump the memory at that address appears to indicate inaccessible memory, but I imagine that this region might not be mapped within whichever "task" is active.

Does anyone have any suggestions about how I can troubleshoot this problem?

Thanks in advance,

Paul
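For anyone wanting to check the address arithmetic in this message, here is a minimal Python sketch (not part of the original report) that replays the first four instructions of __start, using the t9 value from the register dump. The helper name sign16 and the constant MASK32 are purely illustrative.

```python
def sign16(value):
    """Sign-extend a 16-bit immediate, as MIPS does for addiu and lw."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

MASK32 = 0xFFFFFFFF
t9 = 0x800A0000                              # t9 as shown in the register dump

gp = (0x0001 << 16) & MASK32                 # lui   gp,0x1
gp = (gp + sign16(0x7F80)) & MASK32          # addiu gp,gp,32640
gp = (gp + t9) & MASK32                      # addu  gp,gp,t9
load_addr = (gp + sign16(0x8018)) & MASK32   # lw    sp,-32744(gp)

print(hex(gp))         # 0x800b7f80
print(hex(load_addr))  # 0x800aff98
```

The computed gp agrees with gp[28] in the register dump, and the EPC value 002000ec is the address of the lw instruction that performs the load.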


Paul Boddie

14 Jul

1:11 a.m.

Following up to myself...

On Thursday 13. July 2017 19.22.32 Paul Boddie wrote:

...

But now I appear to experience a panic in sigma0 as it starts up, with the message...

Warning: Sigma0 raised an exception --> HALT

[...]

...

The EPC indeed appears to reference sigma0, with the Cause indicating an erroneous data or instruction fetch operation. Looking at the disassembly of sigma0...

  002000e0 <__start>:
    2000e0: 3c1c0001 lui gp,0x1
    2000e4: 279c7f80 addiu gp,gp,32640
    2000e8: 0399e021 addu gp,gp,t9
    2000ec: 8f9d8018 lw sp,-32744(gp)
    2000f0: 8f99801c lw t9,-32740(gp)
    2000f4: 27bdfff0 addiu sp,sp,-16
    2000f8: 0320f809 jalr t9
    2000fc: 00000000 nop

...it appears that the problem occurs when the global offset table is accessed. The global pointer gets computed as...

0x10000 + 32640 + 0x800a0000 = 0x800b7f80

...with the load-relative operation accessing...

0x800b7f80 - 32744 = 0x800aff98

Of course, t9, which was set to 0x800a0000 (if the register dump can be believed), is completely wrong according to the way it is used conventionally (as a reference to the current routine). So I investigated the initialisation code for sigma0 in the following place:

  pkg/l4re-core/sigma0/server/src/ARCH-mips/crt0.S

This is what it looks like:

  __start:
          .cpload $25 /* load GP */
          SETUP_GPX64($25, $0)
          PTR_LA $29, crt0_stack_high
          PTR_LA $25, init
          PTR_SUBU $29, (NARGSAVE * SZREG)
          jalr $25
          nop

The problem here is that the .cpload directive (operating on t9/$25) doesn't have a known value of t9 to work with, it would seem. Consequently, the calculations in the generated code work with a value that isn't initialised.

I tried to add the initialisation of t9 before the .cpload directive, but this actually caused the computed offsets to the gp/$28 register to be eight bytes out, thus causing the accessed table addresses to be eight bytes too low. Here is what I tried:

  __start:
          lui $25, %hi(__start)
          ori $25, $25, %lo(__start)
          .cpload $25 /* load GP */

Clearly, the two extra instructions were confusing the .cpload directive. So I then introduced another label as follows, making the .cpload directive operate at exactly the referenced place in the program:

  __start:
          lui $25, %hi(__realstart)
          ori $25, $25, %lo(__realstart)
  __realstart:
          .cpload $25 /* load GP */

Now, the value of t9 is correct for the gp operations. At this point, I get the following output:

  SIGMA0: Hello!
  KIP @ 10000
  allocated 4KB for maintenance structures
  SIGMA0: Dump of all resource maps
  RAM:------------------------
  [4:1000;1fff]
  [4:140000;184fff]
  [4:190000;197fff]
  [4:1100000;1164fff]
  [0:2155000;3fffffff]
  IOMEM:----------------------
  [0:40000000;ffffffff]

I'm guessing that there should be more output after that, but it either isn't produced on UART0 or something is hanging somewhere.

Looking at similar code in moe, it appears that a similar measure is required to set up t9. Here is where that code lives:

  pkg/l4re-core/moe/server/src/ARCH-mips/crt0.S

Making a similar change in the "_real_start" routine there yields the following additional output:

  MOE: Hello world

And that is then the end of output.

Maybe I've stumbled across a difference in the way the compilers work here. I'm using the mipsel-linux-gnu cross-compilers in Debian unstable, whereas I imagine that the people who ported the code to MIPS used proprietary toolchains with different characteristics.

Anyway, I guess I'll investigate other places where this might be occurring and hopefully get the "hello" program to do its thing.

Paul
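The three situations described in this message (stale t9, t9 pointing eight bytes before the directive, and t9 pointing at the directive itself) can be modelled with a short Python sketch. It assumes the usual expansion of .cpload, in which gp is set to a link-time displacement (measured from the address of the directive's first instruction) plus the register operand. The function name gp_after_cpload is hypothetical, and GP_TARGET is inferred from the lui/addiu pair in the earlier disassembly rather than read from the linker map; it is held fixed across the three cases purely for illustration.

```python
MASK32 = 0xFFFFFFFF

def gp_after_cpload(gp_target, cpload_addr, t9):
    """Model .cpload: gp = (gp_target - cpload_addr) + t9."""
    gp_disp = gp_target - cpload_addr   # displacement fixed at link time
    return (gp_disp + t9) & MASK32

START = 0x002000E0            # __start, from the disassembly
GP_TARGET = START + 0x17F80   # inferred: 0x10000 + 32640 past __start

# Original code: t9 holds whatever value was left in the register.
stale = gp_after_cpload(GP_TARGET, START, 0x800A0000)

# First attempt: t9 = __start, but .cpload now sits 8 bytes later.
off_by_8 = gp_after_cpload(GP_TARGET, START + 8, START)

# Final version: t9 = __realstart, the address of .cpload itself.
fixed = gp_after_cpload(GP_TARGET, START + 8, START + 8)

# Address that "lw sp,-32744(gp)" would read with a correct gp.
first_load = GP_TARGET - 32744

print(hex(stale))            # 0x800b7f80
print(GP_TARGET - off_by_8)  # 8
print(fixed == GP_TARGET)    # True
print(hex(first_load))       # 0x210078
```

Under these assumptions, a correct gp makes the first load read from 0x210078, which lies inside the second sigma0 region [210000, 2161bf] listed in the original report.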

Sarah Hoffmann

5:39 p.m.

Hi Paul,

...

pkg/l4re-core/sigma0/server/src/ARCH-mips/crt0.S

This is what it looks like:

  __start:
          .cpload $25 /* load GP */
          SETUP_GPX64($25, $0)
          PTR_LA $29, crt0_stack_high
          PTR_LA $25, init
          PTR_SUBU $29, (NARGSAVE * SZREG)
          jalr $25
          nop

The problem here is that the .cpload directive (operating on t9/$25) doesn't have a known value of t9 to work with, it would seem. Consequently, the calculations in the generated code work with a value that isn't initialised.

This looks suspiciously like a compiler issue. I would expect .cpload to translate to a nop on mips32r2 without PIC.

We generally use the compilers provided by Imagination(*). The Debian GCC might do things slightly differently, I need to check that again. Making the code work with the official GCC is on our list, although not very high at the moment. If you can switch to the Imgtec GCC for the moment, this might save you some headache.

(*) http://community.imgtec.com/developers/mips/tools/codescape-mips-sdk/downloa...

...

And that is then the end of output. Maybe I've stumbled across a difference in the way the compilers work here. I'm using the mipsel-linux-gnu cross-compilers in Debian unstable, whereas I imagine that the people who ported the code to MIPS used proprietary toolchains with different characteristics.

Anyway, I guess I'll investigate other places where this might be occurring and hopefully get the "hello" program to do its thing.

It won't be that easy, I'm afraid. Last time we tried the board, we found that the CI20 has a crippled instruction set. In particular, it does not support the rdhwr instruction which L4Re user programs use to clean caches. The easiest way to work around this issue is to emulate the instruction in Fiasco. There is already some code for catching rdhwr. Look for "ENTRY reserved_insn" in src/kern/mips/exception.S. Somebody 'just' needs to add support for reading the cache configuration values.

Kind regards

Sarah

--
Sarah Hoffmann, sarah.hoffmann@kernkonzept.com
Kernkonzept GmbH, Dresden, Germany
https://kernkonzept.com/

Paul Boddie

6:45 p.m.

On Friday 14. July 2017 17.39.15 Sarah Hoffmann wrote:

...

...
The problem here is that the .cpload directive (operating on t9/$25) doesn't have a known value of t9 to work with, it would seem. Consequently, the calculations in the generated code work with a value that isn't initialised.

This looks suspiciously like a compiler issue. I would expect .cpload to translate to a nop on mips32r2 without PIC.

I thought sigma0 was compiled as position-independent code, though. Or is it the case that it should not be compiled in that way and that gcc is just pursuing its usual strategy?

...

We generally use the compilers provided by Imagination(*). The Debian GCC might do things slightly differently, I need to check that again. Making the code work with the official GCC is on our list, although not very high at the moment. If you can switch to the Imgtec GCC for the moment, this might save you some headache.

(*) http://community.imgtec.com/developers/mips/tools/codescape-mips-sdk/download-codescape-mips-sdk-essentials/

I would much rather stick with gcc and not have to install vendor-specific tools. Moreover, since gcc appears to be the recommended compiler for L4Re in general, it seems like restoring support for it would do a lot to eliminate such surprises.

So perhaps I can try and get the code to work with gcc, if that would be helpful to your efforts. I imagine that it would also be beneficial to a wider audience.

[CI20 support]

...

It won't be that easy, I'm afraid. Last time we tried the board, we found that the CI20 has a crippled instruction set. In particular, it does not support the rdhwr instruction which L4Re user programs use to clean caches. The easiest way to work around this issue is to emulate the instruction in Fiasco. There is already some code for catching rdhwr. Look for "ENTRY reserved_insn" in src/kern/mips/exception.S. Somebody 'just' needs to add support for reading the cache configuration values.

I'll admit to not being familiar with the different MIPS ISA versions, nor particularly with the MIPS instruction set in general. I have been able to write MIPS assembly language programs doing a variety of things, but I don't profess to be an expert.

From what I've seen of code for other Ingenic SoCs, particularly those which didn't claim MIPS compliance, it wouldn't surprise me if there were some deviations from what people might expect from other MIPS products. That said, such SoCs are supported by other operating system projects, so I imagine it is "just" a matter of filling in the gaps, as you say.

It does also interest me to try and get L4Re working on the jz4720 in the Ben NanoNote, if only because it should be possible, but that would probably also demand the trapping of floating point instructions. But such things are already being done elsewhere, so it's "just" a matter of filling this in, too.

Is there anything else I should be aware of? I saw that Kernkonzept had done work with Imagination on MIPS support, but I get the impression that this ultimately didn't cover the CI20. Is this a reasonable impression?

Thanks for replying to my previous message!

Paul

P.S. On the topic of vendor tools, I have had some recent experiences with Microchip's MIPS-based products, and the culture of reliance on Microchip's developer tools - even if some of them might actually be based on the GNU tools - is a rather irritating and frustrating thing, both for people trying to figure out the operation of the silicon as well as those who have made themselves dependent on those tools.

Paul Boddie

15 Jul

12:11 a.m.

Following up to myself again...

On Friday 14. July 2017 01.11.38 Paul Boddie wrote:

...

[Boot log on the CI20]

...

MOE: Hello world

And that is then the end of output.

So, I found I'd got ahead of myself here. Noting that the CI20 has 1GB RAM, I changed the mk/platforms/ci20.conf file to set PLATFORM_RAM_SIZE_MB to 1024. However, this is a bad idea indeed: only the range from zero to 0x10000000, which is 256MB as originally indicated in the configuration file, is allocated to the RAM in the physical address space map. The full 1GB is actually mapped from 0x20000000 upwards.

(I'm guessing that they've done it this way to maintain some kind of compatibility with other products. At the moment, I don't know whether that memory is directly accessible, but I can obviously do a bit of research on that.)

Changing this setting back from 1024 to 256 solved the problem of the Moe find_memory routine trying to obtain memory regions from 0x10000000 and above, and I now get to the point of loading the "hello" example. However, I'm not managing to actually see the example start. I get the following output:

  MOE: Hello world
  MOE: found 245048 KByte free memory
  MOE: found RAM from 1000 to 10000000
  MOE: allocated 255 KByte for the page array @0x104d000
  MOE: virtual user address space [0-7fffffff]
  MOE: rom name space cap -> [C:103000]
  BOOTFS: [1100000-1131e5c] [C:105000] l4re
  BOOTFS: [1132000-1164b08] [C:107000] hello
  MOE: cmdline: moe --init=rom/hello
  MOE: Starting: rom/hello
  MOE: loading 'rom/hello'

Despite inserting a trace statement into the Moe main program after the elf_loader.start invocations, I don't see anything else appear. So I guess something else manages to go wrong at this point. I suppose I'll have to try and track this down.

Paul
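The effect of the PLATFORM_RAM_SIZE_MB setting can be illustrated with a small Python sketch. It assumes, as described in this message, that RAM is taken to start at physical address zero and that only the window up to 0x10000000 is backed by memory there. The names LOW_RAM_END, ram_end and fits_low_window are hypothetical and do not come from the L4Re build system.

```python
LOW_RAM_BASE = 0x00000000
LOW_RAM_END = 0x10000000    # exclusive upper bound of the low 256MB window

def ram_end(size_mb, base=LOW_RAM_BASE):
    """Exclusive end address of a RAM region of size_mb megabytes."""
    return base + (size_mb << 20)

def fits_low_window(size_mb):
    """True if the configured RAM stays inside the low window."""
    return ram_end(size_mb) <= LOW_RAM_END

for size_mb in (256, 1024):
    print(size_mb, hex(ram_end(size_mb)), fits_low_window(size_mb))
# 256 0x10000000 True
# 1024 0x40000000 False
```

The 1024MB case ends at 0x40000000, which is consistent with the upper bound 3fffffff in the sigma0 RAM resource map shown in the previous message.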

Paul Boddie

16 Jul

2:24 a.m.

On Saturday 15. July 2017 00.11.58 Paul Boddie wrote:

...

However, I'm not managing to actually see the example start. I get the following output:

  MOE: Hello world
  MOE: found 245048 KByte free memory
  MOE: found RAM from 1000 to 10000000
  MOE: allocated 255 KByte for the page array @0x104d000
  MOE: virtual user address space [0-7fffffff]
  MOE: rom name space cap -> [C:103000]
  BOOTFS: [1100000-1131e5c] [C:105000] l4re
  BOOTFS: [1132000-1164b08] [C:107000] hello
  MOE: cmdline: moe --init=rom/hello
  MOE: Starting: rom/hello
  MOE: loading 'rom/hello'

Despite inserting a trace statement into the Moe main program after the elf_loader.start invocations, I don't see anything else appear. So I guess something else manages to go wrong at this point. I suppose I'll have to try and track this down.

So, some more slow but steady progress here. I decided to see where the code was failing by inserting trace statements and looking at the output. The failure path involved the elf_loader.start invocation, through the loader machinery...

  Loader::start
  Loader::exec
  Elf_loader::launch
  Loader<App_model_, Dbg_>::launch
  Elf_loader::load

...invoking the Moe_app_model (init_prog and copy_ds), ultimately calling Dataspace_util::copy and l4_cache_coherent.

Naturally, l4_cache_coherent on MIPS employs the rdhwr (read hardware register) instruction that Sarah mentioned, and although there is code to handle this instruction when accessing the ULR (User Local Register), there isn't code to handle access to the SYNCI_Step register which describes the instruction cache line size. As noted previously, the code for "reserved" instruction handling is here:

  kernel/fiasco/src/kern/mips/exception.S

So, I've made an attempt to implement this support in the reserved_insn handler, accessing the appropriate Config register and reading the value, setting it in the target register. The existing code wasn't too difficult to follow, and I hope I haven't broken anything by changing it.

Now, it appears that Moe does complete the loading of "hello", but I still don't see any output from the example. Switching on all debugging gives the following output from Moe:

  MOE: loading 'rom/hello'
  MOE[ldr]: done...
  MOE: dump of root name space:
   icu -> [C:6000] f=2309
   jdb -> [C:a000] f=2317
   log -> [C:5000] f=2309
   mem -> [C:10d000] o=108edc0 f=2829
   rom -> [C:103000] o=108dfe0 f=2816
     .dirinfo -> [C:109000] o=108ee80 f=2816
     hello -> [C:107000] o=108ef00 f=2816
     l4re -> [C:105000] o=108ef80 f=2816
   sigma0 -> [C:4000] f=2317

But with the --l4re-dbg option set to "all", after this output...

  L4Re: load binary 'rom/hello'
  L4Re: Start server loop

...I notice this continuously recurring message:

  L4Re[svr]: request: tag=0xfffb1026 proto=-5 obj=0x0

So I imagine now that I will need to investigate what this is all about.
Paul
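To make the rdhwr emulation more concrete, here is a rough Python sketch of the decoding step that such a handler has to perform. It is not the code from the patch (which is MIPS assembly in exception.S). The field layout and hardware register numbers follow the standard MIPS32 encoding of rdhwr; deriving the line size from the IL field of Config1 is an assumption about which Config register is meant, and the names decode_rdhwr and icache_line_size are hypothetical.

```python
SPECIAL3 = 0x1F       # major opcode, bits 31..26
RDHWR_FUNCT = 0x3B    # function field, bits 5..0

# Hardware register numbers defined by the MIPS32 architecture.
HWR_NAMES = {0: "CPUNum", 1: "SYNCI_Step", 2: "CC", 3: "CCRes", 29: "ULR"}

def decode_rdhwr(insn):
    """Return (rt, rd) for an rdhwr instruction word, or None otherwise."""
    if (insn >> 26) != SPECIAL3 or (insn & 0x3F) != RDHWR_FUNCT:
        return None
    if (insn >> 21) & 0x1F:    # the rs field must be zero
        return None
    rt = (insn >> 16) & 0x1F   # destination general-purpose register
    rd = (insn >> 11) & 0x1F   # hardware register number
    return rt, rd

def icache_line_size(config1):
    """I-cache line size in bytes from the Config1 IL field (bits 21..19)."""
    il = (config1 >> 19) & 0x7
    return 0 if il == 0 else 2 << il

rt, rd = decode_rdhwr(0x7C03E83B)   # rdhwr v1,$29
print(rt, HWR_NAMES[rd])            # 3 ULR
rt, rd = decode_rdhwr(0x7C02083B)   # rdhwr v0,$1
print(rt, HWR_NAMES[rd])            # 2 SYNCI_Step
print(icache_line_size(4 << 19))    # 32
```

A handler built along these lines would write the computed value into general-purpose register rt of the faulting thread and resume execution after the trapped instruction.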

Paul Boddie

18 Jul

1:43 a.m.

And some more...

On Sunday 16. July 2017 02.24.03 Paul Boddie wrote:

...

But with the --l4re-dbg option set to "all", after this output...

  L4Re: load binary 'rom/hello'
  L4Re: Start server loop

...I notice this continuously recurring message:

L4Re[svr]: request: tag=0xfffb1026 proto=-5 obj=0x0

(Note that this is the only thing that occurs, doing so endlessly and very frequently.)

Well, I haven't really figured this out at all. I thought it might be useful to investigate what the message actually represents. First of all, it originates from the Dispatcher::dispatch method in...

  pkg/l4re-core/l4re_kernel/server/src/dispatcher.cc

I think the "tag" breaks down into something like this:

  0xfffb1026 -> label=0xfffb (-5) -> L4_PROTO_EXCEPTION
                flags=0x1         -> L4_MSGTAG_TRANSFER_FPU
                items=0x00
                words=0x26 (38)

(Reference: pkg/l4re-core/l4sys/include/types.h)

The code performing this logging doesn't indicate what the result of the message dispatch was, so I added a trace statement to see, which yielded this "tag" information:

  0xfc170000 -> label=0xfc17 (-1001) -> L4_EMSGTOOSHORT
                flags=0x0
                items=0x00
                words=0x00

As far as I can tell (four or so invocations are traversed), this might be produced in the handle_svr_obj_call function in...

  pkg/l4re-core/l4sys/include/cxx/ipc_server

I would try and add some debugging statements here as well, but doing so seems to cause a cascade of library requirements across a range of components. So, although I might suspect that the request is malformed in some way, I have no firm idea that this is the case. However, one test of it is the following:

  tag.words() + tag.items() * Item_words > Mr_words

Here, the left hand side of the comparison is 38 whereas Mr_words is apparently 63, so this test does not identify the cause of the problem.

Meanwhile, I looked into the nature of the request and found the UTCB-related functions in...

  pkg/l4re-core/l4sys/include/utcb.h

Attempting to determine the nature of the supposed exception, I managed to discover that...

  l4_utcb_exc_is_pf returns 1 (page fault)
  l4_utcb_exc_pfa returns 0x800d1308 (which is a kernel mode address on MIPS)

The program counter is given as 0x7000049c, with the exception cause being decoded from 0x10 to be interpreted as an "exception code value" of 4 in the CP0_CAUSE register (address error, load or instruction fetch).

I thought that enabling more logging in sigma0 might help, presuming that the page fault would be propagated through the pager hierarchy. But changing debug_ipc to 1 in...

  pkg/l4re-core/sigma0/server/src/globals.h

...indicated that sigma0 is not involved when the above requests are made and dispatched: there is a lot of logging from sigma0, but logging from Moe takes over after a certain point. And I don't see any logging from Moe when these page faults occur: they are described using the "L4Re[svr]" prefix, as shown above.

So, I don't really have much more to go on, here. There's a chance that my rdhwr instruction support introduced a bug, I suppose, even though I've read through that code several times and can't see anything obviously wrong with it. I do wonder whether the initialisation routine for other programs is initialising the t9 register improperly, as noted previously.

But then, I don't understand why the erroneously-initialised program isn't just terminated when its page fault can't be handled.

Although I've seen a fair amount of the L4Re internals now, I don't think I have any productive way of finding the problem here, unfortunately. I guess this exercise has provided some way of getting a "tour" of the framework, and maybe that will be useful in the future, but I had hoped that this board was already supported to the point of already running the example programs.

Paul
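The tag breakdown given in this message can be reproduced with a few lines of Python. This sketch uses the field layout implied by the breakdown (label in the upper 16 bits, then 4 flag bits, 6 item bits and 6 word bits). The function name decode_msgtag is hypothetical, the flags value is shifted down so that it matches the 0x1 shown above, and ITEM_WORDS = 2 is an assumption that has no effect here because no items are present.

```python
def decode_msgtag(raw):
    """Split a 32-bit message tag into label, flags, items and words."""
    label = (raw >> 16) & 0xFFFF
    if label & 0x8000:              # the label is a signed 16-bit value
        label -= 0x10000
    return {
        "label": label,
        "flags": (raw >> 12) & 0xF,
        "items": (raw >> 6) & 0x3F,
        "words": raw & 0x3F,
    }

request = decode_msgtag(0xFFFB1026)
reply = decode_msgtag(0xFC170000)
print(request)  # {'label': -5, 'flags': 1, 'items': 0, 'words': 38}
print(reply)    # {'label': -1001, 'flags': 0, 'items': 0, 'words': 0}

# The size check quoted from ipc_server, with assumed constants.
ITEM_WORDS = 2   # assumption: words occupied by one typed item
MR_WORDS = 63    # value reported in the message
too_long = request["words"] + request["items"] * ITEM_WORDS > MR_WORDS
print(too_long)  # False
```

As the message notes, the size check does not trigger for this request, so the L4_EMSGTOOSHORT result must originate elsewhere.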

Sarah Hoffmann

8:58 a.m.

Hi Paul,

On 07/18/2017 01:43 AM, Paul Boddie wrote:

...

Well, I haven't really figured this out at all. I thought it might be useful to investigate what the message actually represents. First of all, it originates from the Dispatcher::dispatch method in...

pkg/l4re-core/l4re_kernel/server/src/dispatcher.cc

I think the "tag" breaks down into something like this:

  0xfffb1026 -> label=0xfffb (-5) -> L4_PROTO_EXCEPTION
                flags=0x1         -> L4_MSGTAG_TRANSFER_FPU
                items=0x00
                words=0x26 (38)

(Reference: pkg/l4re-core/l4sys/include/types.h)

The code performing this logging doesn't indicate what the result of the message dispatch was, so I added a trace statement to see, which yielded this "tag" information:

  0xfc170000 -> label=0xfc17 (-1001) -> L4_EMSGTOOSHORT
                flags=0x0
                items=0x00
                words=0x00

There is a bug in Fiasco where it sends the wrong message size. Please apply the attached patch to Fiasco. Afterwards you should get more useful error messages in your L4 applications when they throw exceptions.

...

Attempting to determine the nature of the supposed exception, I managed to discover that...

  l4_utcb_exc_is_pf returns 1 (page fault)
  l4_utcb_exc_pfa returns 0x800d1308 (which is a kernel mode address on MIPS)

The program counter is given as 0x7000049c, with the exception cause being decoded from 0x10 to be interpreted as an "exception code value" of 4 in the CP0_CAUSE register (address error, load or instruction fetch).

I thought that enabling more logging in sigma0 might help, presuming that the page fault would be propagated through the pager hierarchy. But changing debug_ipc to 1 in...

An address error generally means that you are trying to access a bad address (which would be the case with the PFA given above). This is different from a normal page fault, which corresponds to TLB exceptions. That is why sigma0 is not involved. Exceptions are directly sent to the exception handler which in a standard L4 application is the thread started first (l4re-kernel thread) or, if that one fails, the launcher (moe in your case).
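The distinction drawn here between address errors and TLB exceptions can be read directly from the ExcCode field of the CP0 Cause register (bits 6..2). The following Python sketch uses the standard MIPS32 exception codes 1 to 5; the function names exc_code and classify are hypothetical and only serve to illustrate the decoding of the Cause value 0x10 seen earlier in the thread.

```python
EXC_NAMES = {
    1: "Mod (TLB modification)",
    2: "TLBL (TLB miss or invalid, load or fetch)",
    3: "TLBS (TLB miss or invalid, store)",
    4: "AdEL (address error, load or fetch)",
    5: "AdES (address error, store)",
}
TLB_EXC_CODES = {1, 2, 3}
ADDRESS_ERROR_CODES = {4, 5}

def exc_code(cause):
    """Extract ExcCode (bits 6..2) from a CP0 Cause value."""
    return (cause >> 2) & 0x1F

def classify(cause):
    """Group a Cause value into the two cases discussed above."""
    code = exc_code(cause)
    if code in TLB_EXC_CODES:
        return "TLB exception"
    if code in ADDRESS_ERROR_CODES:
        return "address error"
    return "other"

print(exc_code(0x10), EXC_NAMES[exc_code(0x10)])
# 4 AdEL (address error, load or fetch)
print(classify(0x10))   # address error
print(classify(0x08))   # TLB exception
```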

...

...indicated that sigma0 is not involved when the above requests are made and dispatched: there is a lot of logging from sigma0, but logging from Moe takes over after a certain point. And I don't see any logging from Moe when these page faults occur: they are described using the "L4Re[svr]" prefix, as shown above.

So, I don't really have much more to go on, here. There's a chance that my rdhwr instruction support introduced a bug, I suppose, even though I've read through that code several times and can't see anything obviously wrong with it. I do wonder whether the initialisation routine for other programs is initialising the t9 register improperly, as noted previously.

The t9 issue is a likely cause. There are a couple of places where .cpload is used.

...

But then, I don't understand why the erroneously-initialised program isn't just terminated when its page fault can't be handled.

That is the standard behaviour and the attached patch hopefully brings it back.

Kind regards

Sarah

...

Although I've seen a fair amount of the L4Re internals now, I don't think I have any productive way of finding the problem here, unfortunately. I guess this exercise has provided some way of getting a "tour" of the framework, and maybe that will be useful in the future, but I had hoped that this board was already supported to the point of already running the example programs.

Paul

_______________________________________________
l4-hackers mailing list
l4-hackers@os.inf.tu-dresden.de
http://os.inf.tu-dresden.de/mailman/listinfo/l4-hackers

--
Sarah Hoffmann, sarah.hoffmann@kernkonzept.com
Kernkonzept GmbH, Dresden, Germany
https://kernkonzept.com/

Paul Boddie

4:56 p.m.

On Tuesday 18. July 2017 08.58.14 Sarah Hoffmann wrote:

...

There is a bug in Fiasco where it sends the wrong message size. Please apply the attached patch to Fiasco. Afterwards you should get more useful error messages in your L4 applications when they throw exceptions.

Yes, this fixes the recurring failure to handle the page fault successfully. Thanks for sending this! [...]

...

An address error generally means that you are trying to access a bad address (which would be the case with the PFA given above). This is different from a normal page fault, which corresponds to TLB exceptions. That is why sigma0 is not involved.

Right. I appreciate the clarification here.

...

Exceptions are directly sent to the exception handler which in a standard L4 application is the thread started first (l4re-kernel thread) or, if that one fails, the launcher (moe in your case).

Yes, this is what I might have expected. [...]

...

The t9 issue is a likely cause. There are a couple of places where .cpload is used.

I found another location where t9 is not initialised:

  pkg/l4re-core/l4re_kernel/server/src/ARCH-mips/loader_mips.S

I had actually seen this, but I guess I was distracted too much with the other problems to realise that it needed changing.

So, I applied the Fiasco patch and then looked at the above file, checking the objdump output for the l4re binary. With the patch applied, the page fault did not occur endlessly, and with a change to the above file similar to those already made for sigma0 and moe, it appears that the page fault is eliminated completely. As a consequence, the "hello" example now runs, finally!

I have attached a few patches required to make this happen in my 32-bit Intel, Debian-based mipsel-linux-gnu development environment:

  ci20-gcc-cpload.diff   Initialises t9 for .cpload
  ci20-gcc-debian.diff   Uses the mipsel-linux-gnu toolchain prefix
  ci20-rdhwr.diff        Implements rdhwr SYNCI_Step handling
  ci20-uart0.diff        Uses the UART0 connection to work around UART4 problems
  no-at.diff             A missing definition (mentioned previously on this list)

The rdhwr patch could probably be improved, and I think there are places in the affected file where common code could be consolidated a bit more.

I appreciate the help and guidance you have provided in getting me to this point, and I hope now to try and test the drivers that I had written in advance of getting the examples working.

Thanks once again,

Paul

Paul Boddie

19 Jul

7:40 p.m.

On Tuesday 18. July 2017 16.56.08 Paul Boddie wrote:

...

So, I applied the Fiasco patch and then looked at the above file, checking the objdump output for the l4re binary. With the patch applied, the page fault did not occur endlessly, and with a change to the above file similar to those already made for sigma0 and moe, it appears that the page fault is eliminated completely. As a consequence, the "hello" example now runs, finally!

Despite this success, I should point out that Ned doesn't manage to function at all. It would appear that upon starting, Ned tries to import various Lua libraries, and the second one causes a page fault:

  Ned says: Hi World!
  L = luaL_newstate()
  luaL_requiref(L, "_G", libs[0].func, 1)
  luaL_requiref(L, "package", libs[1].func, 1)
  createclibstable(L)
  luaL_newlib(L, pk_funcs)
  L4Re[rm]: unhandled read page fault at 0x8 pc=0x10350cc

Here, the first and last lines are normal logging output. The other lines are from my clumsy debugging statements.

The exception always seems to occur in the luaL_checkversion_ function in...

  pkg/l4re-core/lua/lib/contrib/src/lauxlib.c

...coincidentally when performing an operation like this:

  sdc1 $f20,48(sp)

Looking at the registers in the kernel debugger, sp refers to a kernel address. Meanwhile, the t9 and gp addresses are valid for accesses to the global offset table (verified from an objdump on ned):

  t9=0101b2f8 gp=0109adc0 sp=80c41ef8

Trying to check the stack pointer in the offending function is awkward because gcc will happily move any statement printing it out after instructions causing the page fault. Replacing the code with the stack pointer output statements produces a value of 0x8cd8 for sp, which is not a kernel address at least, but I have no idea if it is (or seems) valid.

A side-effect of removing the version-checking code is that this does not then obstruct the rest of the package-importing operations. However, another exception of this nature arises when attempting to invoke the execute_lua_buf function in...

  pkg/l4re-core/ned/server/src/lua.cc

It always seems to involve an address of 0x8, which seems rather bizarre. Again, I think I must be missing something fundamental and must only be seeing the consequences.

Paul
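Whether a register value is a user or kernel address on a 32-bit MIPS system follows from the fixed segment boundaries of the architecture. The Python sketch below classifies the values mentioned in this message. The function name mips32_segment is hypothetical, and the classification says nothing about whether a user address is actually mapped.

```python
def mips32_segment(addr):
    """Name the MIPS32 virtual address segment containing addr."""
    addr &= 0xFFFFFFFF
    if addr < 0x80000000:
        return "kuseg"      # user-accessible, translated by the TLB
    if addr < 0xA0000000:
        return "kseg0"      # kernel only, unmapped, cached
    if addr < 0xC0000000:
        return "kseg1"      # kernel only, unmapped, uncached
    return "kseg2/3"        # kernel only, translated by the TLB

for name, value in [("t9", 0x0101B2F8), ("gp", 0x0109ADC0),
                    ("sp", 0x80C41EF8), ("sp after change", 0x00008CD8)]:
    print(name, hex(value), mips32_segment(value))
# t9 0x101b2f8 kuseg
# gp 0x109adc0 kuseg
# sp 0x80c41ef8 kseg0
# sp after change 0x8cd8 kuseg
```

A user-mode access through an sp value in kseg0 is expected to raise an address error rather than a TLB exception, which fits the observation that sp looks like a kernel address.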

Paul Boddie

20 Jul

10:10 p.m.

On Wednesday 19. July 2017 19.40.23 Paul Boddie wrote:

...

It always seems to involve an address of 0x8, which seems rather bizarre. Again, I think I must be missing something fundamental and must only be seeing the consequences.

So, I adjusted the kernel code, putting back in a commented-out debugging statement found in the Thread::handle_page_fault method which looks like this (having changed some of the details):

  printf("Translation error ? %p\n"
         " is_kmem_page_fault ? %x\n"
         " is_sigma0 ? %x\n"
         " program counter: %p\n"
         " regs->ip(): %p\n"
         " page fault address: %p\n",
         (void *) PF::is_translation_error(error_code),
         !PF::is_translation_error(error_code) && mem_space()->is_sigma0(),
         Kmem::is_kmem_page_fault(pfa, error_code),
         (void *) pc, (void *) regs->ip(), (void *) pfa);

I also introduced a statement in Thread::handle_page_fault_pager as follows:

  printf("handle_page_fault_pager: pfa=" L4_PTR_FMT
         ", errorcode=" L4_PTR_FMT ", pc=%lx, bad_v_addr=%lx\n",
         pfa, error_code, regs()->ip(), regs()->bad_v_addr);

I then observe some strange behaviour:

  Translation error ? 0x1
   is_kmem_page_fault ? 0
   is_sigma0 ? 0
   program counter: 0x80019c8c
   regs->ip(): 0x80019c8c
   page fault address: 0xc
   regs->bad_v_addr: 0xc
  handle_page_fault_pager: pfa=0000000c, errorcode=00000009, pc=103502c, bad_v_addr=8cc4
  L4Re[svr]: request: tag=0xfffe0002 proto=-2 obj=0x0
  L4Re: page fault: 9 pc=103502c
  L4Re[rm]: unhandled read page fault at 0x8 pc=0x103502c

In the above, the last three lines are normal debugging output. The (wrapped) line above those is from my statement in handle_page_fault_pager.

For some reason, the presumably correct bad_v_addr (bad virtual address, 0x8cc4) arising in the apparent initial page fault (at 0x0103502c) does not get propagated back to L4Re alongside the associated program counter value. Instead, 0x8 gets reported in the L4Re logging output.

While handling this page fault, there appears to be another page fault in the kernel (at 0x80019c8c). This latter fault can't be handled (as discussed below) and so the original exception is eventually exposed in L4Re with the confused mix of details noted above.

The unlikely address of 0x8 reported by L4Re may be related to the kernel fault address of 0xc, which according to the above details occurs in the following code (found in Ram_quota::alloc):

  80019c7c <_ZN9Ram_quota5allocEl>:
  80019c7c: 40036000 mfc0 v1,c0_status
  80019c80: 30670001 andi a3,v1,0x1
  80019c84: 41606000 di
  80019c88: 000000c0 ehb
  80019c8c: 8c82000c lw v0,12(a0)

Note how the final, fault-causing instruction involves 12 (0xc), suggesting that a0 is set to zero, which is not an expected value given that it refers to a function/method parameter block and given that a parameter is expected by the method.

Unfortunately, I don't know the invocation chain responsible for this, and it doesn't appear to be very obvious how I might discover it efficiently.

Paul
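The inference that a0 must be zero can be checked by decoding the faulting instruction word and subtracting its offset from the reported fault address. This is a minimal Python sketch using the standard MIPS I-type field layout; the function name decode_itype is hypothetical.

```python
def decode_itype(insn):
    """Split a MIPS I-type instruction into (opcode, rs, rt, offset)."""
    opcode = (insn >> 26) & 0x3F
    rs = (insn >> 21) & 0x1F     # base register for loads and stores
    rt = (insn >> 16) & 0x1F     # target register
    imm = insn & 0xFFFF
    if imm & 0x8000:             # sign-extend the 16-bit offset
        imm -= 0x10000
    return opcode, rs, rt, imm

LW_OPCODE = 0x23
opcode, base_reg, dest_reg, offset = decode_itype(0x8C82000C)
fault_addr = 0xC                 # page fault address from the trace output
base_value = (fault_addr - offset) & 0xFFFFFFFF

print(opcode == LW_OPCODE, base_reg, dest_reg, offset)  # True 4 2 12
print(hex(base_value))                                  # 0x0
```

Register 4 is a0 and register 2 is v0, so the decoded fields agree with the disassembly, and a fault address of 0xc with an offset of 12 implies a base value of zero in a0.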

Adam Lackorzynski

21 Jul

12:06 a.m.

On Thu Jul 20, 2017 at 22:10:48 +0200, Paul Boddie wrote:

...

On Wednesday 19. July 2017 19.40.23 Paul Boddie wrote:

...
It always seems to involve an address of 0x8, which seems rather bizarre. Again, I think I must be missing something fundamental and must only be seeing the consequences.

The unlikely address of 0x8 reported by L4Re may be related to the kernel fault address of 0xc, which according to the above details occurs in the following code (found in Ram_quota::alloc):

That looks like you should use the patch in
http://os.inf.tu-dresden.de/pipermail/l4-hackers/2017/008005.html

Adam

Paul Boddie

12:43 a.m.

On Friday 21. July 2017 00.06.27 Adam Lackorzynski wrote:

...

On Thu Jul 20, 2017 at 22:10:48 +0200, Paul Boddie wrote:

...
While handling this page fault, there appears to be another page fault in the kernel (at 0x80019c8c). This latter fault can't be handled (as discussed below) and so the original exception is eventually exposed in L4Re with the confused mix of details noted above.

The unlikely address of 0x8 reported by L4Re may be related to the kernel fault address of 0xc, which according to the above details occurs in the following code (found in Ram_quota::alloc):

That looks like you should use the patch in http://os.inf.tu-dresden.de/pipermail/l4-hackers/2017/008005.html

Well, that fixed it. Thanks for pointing it out! I'll have to see what that patch actually does, I guess. The explanation from the referenced message seems to be that the compiler outmanoeuvred the authors of the code, so I suppose this is my punishment for using a newer compiler.

Should the patch be something that is available via the Subversion repository or is it a tentative change that hasn't been fully tested yet?

Anyway, I can now hopefully return to the program I thought I'd be testing on Wednesday and see what mistakes I've made in my driver code.

Thanks once again,

Paul

participants (3)

Adam Lackorzynski
Paul Boddie
Sarah Hoffmann