memory problems on Fiasco porting


Tsai, Tung-Chieh

11 Feb 2009

2:44 p.m.

Dear all,

I'm working on porting Fiasco to an ARM (922T) platform.

Currently, I can get into the kernel debugger and successfully get past the `Calibrating timer loop'. But then I get stuck in Thread::init_workload, which fails to create sigma0_thread. The ARM processor raises a data abort exception and then jumps to 0xffff0010.

It seems to fail because it tries to load from an illegal virtual address, 0xc0080004, which happens at:

  Kernel_thread::init_workload
  Thread::create
  Thread::maybe_create
  Thread::thread_lock
  Thread_lock::lock_dirty
  ...
  Switch_lock<Thread_lock_valid>::lock

I guess the problem may be that some part of the architecture port is still broken, or that the sigma0_task created for sigma0_thread sets up the wrong memory space because some of my memory space settings are incorrect.

Honestly, I don't know where the virtual address translation for 0xd0000000 ~ 0xc0000000 (_tcbs_1 to phys_offset) is set up. Is this range used for each thread's TCB?

And where can I get more information about Fiasco's memory layout? For example, what do Mem_layout::Tcbs and Mem_layout::User_max mean? Which ranges of memory are used for specific purposes, for example I/O devices (this seems to be 0xef100000 ~)?

Any advice is appreciated. Thanks.

Best Regards,
Tsai, Tung-Chieh


Adam Lackorzynski

12 Feb

10:32 p.m.

Hi, On Wed Feb 11, 2009 at 21:44:55 +0800, Tsai, Tung-Chieh wrote:

...

I'm working on porting Fiasco to an ARM(922T) platform.

Great!

...

Currently, I can get into kernel debugger, and successfully leave the `Calibrating timer loop'. But then I stuck in Thread::init_workload, fail to create sigma0_thread. ARM processor would raise a data abort exception and then goto 0xffff0010.

It seems failed because trying to load an illegal virtual address at 0xc0080004, which doing this at :

This is the first access to the TCB of the sigma0 thread and actually what should happen in init_workload. The TCBs are pulled in on request.

...

I guess the problem maybe becuase some architecture porting part still has problem, or because the sigma0_task create for sigma0_thread setting wrong memory space, because I've some memory space setting incorrect.

When the data abort happens, it should end up in the page fault handler, which should then make a page visible at 0xc0080000. Does this happen or not?

...

Honestly, I don't know where setting the virtual address translation for 0xd0000000 ~ 0xc0000000(_tcbs_1 to phys_offset), is this range use for each thread's tcb ?

Yes, actually TCBs go from 0xc0000000 - 0xe0000000 usually.
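The arithmetic behind the faulting address can be sketched as follows. This is only an illustration under assumptions that are not stated in the thread: a TCB area starting at 0xc0000000, 2 KiB per TCB slot, and the L4 v2 thread ID layout (version in bits 0-9, lthread in bits 10-16, task in bits 17-27). The names Tcbs_base, Tcb_size, make_id and tcb_address are hypothetical. With these assumptions, sigma0 (thread 2.00 in the jdb output later in this thread) lands at 0xc0080000, and 2^18 possible threads span exactly 0xc0000000 to 0xe0000000.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values for illustration only; check mem_layout-arm.cpp for the real ones.
constexpr std::uint32_t Tcbs_base = 0xc0000000u;  // start of the TCB area
constexpr std::uint32_t Tcb_size  = 0x800u;       // assumed 2 KiB per TCB slot

// Build an L4 v2 style thread ID (version field left at 0).
inline std::uint32_t make_id(std::uint32_t task, std::uint32_t lthread)
{
  assert(task < 2048u);    // 11-bit task number
  assert(lthread < 128u);  // 7-bit local thread number
  return (task << 17) | (lthread << 10);
}

// The 18-bit global thread number selects the TCB slot.
constexpr std::uint32_t tcb_address(std::uint32_t id)
{
  return Tcbs_base + ((id >> 10) & 0x3ffffu) * Tcb_size;
}

// 4 KiB page that the page fault handler would have to map for an address.
constexpr std::uint32_t page_base(std::uint32_t addr)
{
  return addr & ~0xfffu;
}

static_assert(tcb_address(0x00040000u) == 0xc0080000u, "slot of thread 2.00");
static_assert(page_base(0xc0080004u) == 0xc0080000u, "page of the faulting access");
```

Under these assumptions, the abort at 0xc0080004 is simply the first touch of an on-demand TCB page, which matches the explanation above.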

...

And where can I get more information about Fiasco's memory layout? for example, what does Mem_layout::Tcbs and Mem_layout::User_max means? which range of memory is used for specific purpose, for example, I/O device(seems is 0xef100000 ~ ) ?

There's the mem_layout-arm.cpp file which describes the generic layout. The platform specific layout is described in kern/arm/bsp/foo/mem_layout... User_max is the address where user virtual memory ends and Tcbs is the address where the TCBs start.

Adam
--
Adam                 adam@os.inf.tu-dresden.de
  Lackorzynski         http://os.inf.tu-dresden.de/~adam/

Tsai, Tung-Chieh

14 Feb

1:25 p.m.

Dear Adam, On Fri, Feb 13, 2009 at 5:32 AM, Adam Lackorzynski <adam@os.inf.tu-dresden.de> wrote:

...

When the data abort happens, it should end up in the page fault handler, which should then make a page visible at 0xc0080000. Does this happen or not?

Yes, and then it gets stuck between irq_handler() and Timeslice_timeout::expired().

Best Regards,
Tsai, Tung-Chieh

Adam Lackorzynski

16 Feb

11:41 a.m.

Hi, On Sat Feb 14, 2009 at 20:25:19 +0800, Tsai, Tung-Chieh wrote:

...

On Fri, Feb 13, 2009 at 5:32 AM, Adam Lackorzynski <adam@os.inf.tu-dresden.de> wrote:

...
When the data abort happens, it should end up in the page fault handler, which should then make a page visible at 0xc0080000. Does this happen or not?

Yes, and then it would stuck between irq_handler() and Timeslice_timeout::expired().

Hmm, so after the page-fault happened it is properly resolved and after the page fault handling is done there is a page at 0xc0080000? I'm just asking because when the irq_handler is invoked this has not really anything to do with page-fault handling but probably is a timer interrupt. Of course this timer interrupt is executed in a specific context and might also access the same address range.

Might it be any cache related thing? At which instruction is it stuck? Is there anything special?

Adam
--
Adam                 adam@os.inf.tu-dresden.de
  Lackorzynski         http://os.inf.tu-dresden.de/~adam/

Tsai, Tung-Chieh

17 Feb

7:34 p.m.

Dear Adam, On Mon, Feb 16, 2009 at 6:41 PM, Adam Lackorzynski <adam@os.inf.tu-dresden.de> wrote:

...

Hi,

On Sat Feb 14, 2009 at 20:25:19 +0800, Tsai, Tung-Chieh wrote:

...
On Fri, Feb 13, 2009 at 5:32 AM, Adam Lackorzynski <adam@os.inf.tu-dresden.de> wrote:

...
When the data abort happens, it should end up in the page fault handler, which should then make a page visible at 0xc0080000. Does this happen or not?

Yes, and then it would stuck between irq_handler() and Timeslice_timeout::expired().

Hmm, so after the page-fault happened it is properly resolved and after the page fault handling is done there is a page at 0xc0080000? I'm just asking because when the irq_handler is invoked this has not really anything to do with page-fault handling but probably is a timer interrupt. Of course this timer interrupt is executed in a specific context and might also access the same address range. Might it be any cache related thing? At which instruction is it stuck? Is there anything special?

After the page fault handler, there's a page at 0xc0080000 ~ 0xc0081000. Initially, I thought this data abort exception was an error... Now I understand it's not a problem.

But after I get past this exception, it gets stuck between irq_handler() and Timeslice_timeout::expired(). I found that this is because, in the while loop of Timeout_q::do_timeouts(), traversing the queue Timeout_q::_q via Timeout::_next exceeds Timeout_q::Wakeup_queue_count.

I put an assertion on how many times this while loop is executed, and this assertion does not fail when I run it on the QEMU integratorcp platform. After this assertion fails and I enter jdb, jdb shows that sigma0 has been added to the ready/present list.

Since the timer interrupt is enabled before Kernel_thread::init_workload(), I think this error may occur at the first timer interrupt after sigma0 has been added to the ready queue.

I will check the cache and TLB related parts again. Currently, for the cache and TLB part, I have written the following functions for the arm922t:

* Mmu<Flush_area, Ram>::flush_cache()
  clean D cache (write back), invalidate I & D cache, drain write buffer
* Mmu<Flush_area, Ram>::clean_dcache()
  clean D cache, drain write buffer
* Mmu<Flush_area, Ram>::flush_dcache()
  clean and invalidate D cache, drain write buffer
* Mem_unit::tlb_flush( void* va, unsigned long)
  I just flush the whole TLB here, since the arm922t does not seem to have any instruction to flush a specific TLB entry without knowing whether it is in the instruction TLB or the data TLB.
* The other functions of Mmu and Mem_unit that I didn't mention use the arm926/armv5 code.

Did I miss something? Or is there any other possible direction? Any advice is appreciated. Thanks.

Best Regards,
Tsai, Tung-Chieh
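For readers following the cache discussion, here is a hedged sketch of a whole-D-cache "clean and invalidate by index" loop. It is not Fiasco's Mmu code. It assumes the ARM922T data cache is 8 KiB organised as 4 segments of 64 lines, with the line index in bits [31:26] and the segment in bits [6:5] of the operand; these values should be checked against the ARM922T Technical Reference Manual. The helper names are hypothetical. The privileged MCR instructions are only compiled on an ARM target, so the operand generation can be checked on a host machine.

```cpp
#include <cassert>
#include <cstdint>

// Assumed ARM922T D-cache geometry (verify against the TRM).
constexpr unsigned Segments          = 4;
constexpr unsigned Lines_per_segment = 64;

// Operand for the "by index" cache operations: index [31:26], segment [6:5].
inline std::uint32_t index_operand(unsigned index, unsigned segment)
{
  assert(index < Lines_per_segment);
  assert(segment < Segments);
  return (std::uint32_t(index) << 26) | (std::uint32_t(segment) << 5);
}

// Call op(operand) once per cache line; returns the number of lines visited.
template <typename Op>
inline unsigned for_each_dcache_line(Op op)
{
  unsigned n = 0;
  for (unsigned seg = 0; seg < Segments; ++seg)
    for (unsigned idx = 0; idx < Lines_per_segment; ++idx, ++n)
      op(index_operand(idx, seg));
  return n;
}

#if defined(__arm__)
// Privileged code only: clean and invalidate every D-cache line,
// then drain the write buffer.
inline void clean_invalidate_dcache_all()
{
  for_each_dcache_line([](std::uint32_t v) {
    asm volatile("mcr p15, 0, %0, c7, c14, 2" : : "r"(v) : "memory");
  });
  asm volatile("mcr p15, 0, %0, c7, c10, 4" : : "r"(0) : "memory");
}
#endif
```

Flushing everything like this is slower than range-based maintenance, but it removes one source of error while the port is being brought up.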

Adam Lackorzynski

19 Feb

9:01 a.m.

Hi, On Wed Feb 18, 2009 at 02:34:34 +0800, Tsai, Tung-Chieh wrote:

...

On Mon, Feb 16, 2009 at 6:41 PM, Adam Lackorzynski <adam@os.inf.tu-dresden.de> wrote:

...
Hi,

On Sat Feb 14, 2009 at 20:25:19 +0800, Tsai, Tung-Chieh wrote:

...
On Fri, Feb 13, 2009 at 5:32 AM, Adam Lackorzynski <adam@os.inf.tu-dresden.de> wrote:

...
When the data abort happens, it should end up in the page fault handler, which should then make a page visible at 0xc0080000. Does this happen or not?

Yes, and then it would stuck between irq_handler() and Timeslice_timeout::expired().

Hmm, so after the page-fault happened it is properly resolved and after the page fault handling is done there is a page at 0xc0080000? I'm just asking because when the irq_handler is invoked this has not really anything to do with page-fault handling but probably is a timer interrupt. Of course this timer interrupt is executed in a specific context and might also access the same address range. Might it be any cache related thing? At which instruction is it stuck? Is there anything special?

After page fault handler, there's a page at 0xc0080000 ~ 0xc0081000.

Initially, I thought this data abort exception is an error... Now I've understood it's not a problem.

Yes.

...

But after I skip these exception, it would stuck between irq_handler() and Timeslice_timeout::expired(). I found that it is because in the while loop of Timeout_q::do_timeouts(), traveling the queue Timeout_q::_q via Timeout::_next would exceed Timeout_q::Wakeup_queue_count.

I put an assertion on how many time this while loop is executed, and this assertion would not fail when I run it on QEMU integratorcp platform. After this assertion failed, entering jdb, jdb shows that sigma0 had been added to ready/present list.

Since timmer interrupt had been enabled before Kernel_thread::init_workload(), I thought this error may be occur at the first timer interrupt after sigma0 had been added to ready queue.

Ok, if it runs in Qemu it might very well be a cache issue...
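The bounded-walk assertion described in the quoted text can be sketched like this. Node is a hypothetical stand-in, not Fiasco's Timeout class, and the limit is whatever upper bound the port expects for the queue. A walk that exceeds the bound points to a corrupted or cyclic list, which is the kind of symptom stale cache contents could produce.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical queue node; Fiasco's Timeout links via Timeout::_next.
struct Node
{
  Node *next;
};

// Follow at most `limit` links. Returns the number of nodes seen,
// or limit + 1 if the list did not terminate within the bound.
inline std::size_t bounded_length(const Node *head, std::size_t limit)
{
  std::size_t n = 0;
  for (const Node *p = head; p; p = p->next)
    if (++n > limit)
      return limit + 1;
  return n;
}

// True if the list ends within `limit` nodes.
inline bool list_is_sane(const Node *head, std::size_t limit)
{
  return bounded_length(head, limit) <= limit;
}
```

Inside a traversal loop, a check such as assert(list_is_sane(head, limit)) turns a silent endless loop into a visible stop in the debugger.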

...

I would check cache & tlb relative part again. Currently, for cache & tlb part, I write the following function for arm922t:

* Mmu<Flush_area, Ram>::flush_cache()
  clean D cache(write back), invalidate I & D cache, drain write buffer
* Mmu<Flush_area, Ram>::clean_dcache()
  clean D cache, , drain write buffer
* Mmu<Flush_area, Ram>::flush_dcache()
  clean and invalidate D cache, , drain write buffer
* Mem_unit::tlb_flush( void* va, unsigned long)
  I just flush whole tlb here, since arm922t looks not have any instruction to flush a specific tlb entry without knowing it's instruction tlb or data tlb.
* The other functions of Mmu and Men_unit I doesn't mention are using codes of arm926/armv5.

Ok, this is the stuff you need to change, lets hope it's right :)

...

Did I missing something ? Or is there any other possible direction ? Any advice is appreciate. Thanks.

One thing you should try is the following... in src/kern/arm/config-arm.c there's a constant called 'cache_enabled'. It's set to true (obviously). Do things change (and start to work better) if you set it to false?

Adam
--
Adam                 adam@os.inf.tu-dresden.de
  Lackorzynski         http://os.inf.tu-dresden.de/~adam/

Tsai, Tung-Chieh

3:22 p.m.

Dear Adam, On Thu, Feb 19, 2009 at 4:01 PM, Adam Lackorzynski <adam@os.inf.tu-dresden.de> wrote:

...

One thing you should try is the following... in src/kern/arm/config-arm.c there's a constant called 'cache_enabled'. It's set the true (obviously), do things change (and start to work better) if you set it to false?

I've tried this, and the result is strange. It gets stuck in irq_handler while handling the timer interrupt, before printing "Calibrating timer loop... " in Kernel_thread::bootstrap. (More precisely, the return address of the IRQ is at "bl Kernel_thread::bootstrap_arch()", before "bl printf".)

Best Regards,
Tsai, Tung-Chieh

Adam Lackorzynski

10:41 p.m.

Hi, On Thu Feb 19, 2009 at 22:22:02 +0800, Tsai, Tung-Chieh wrote:

...

On Thu, Feb 19, 2009 at 4:01 PM, Adam Lackorzynski <adam@os.inf.tu-dresden.de> wrote:

...
One thing you should try is the following... in src/kern/arm/config-arm.c there's a constant called 'cache_enabled'. It's set the true (obviously), do things change (and start to work better) if you set it to false?

I've tried this, and the result is strange. It would stuck in irq_handler to handle timer interrupt, before printing "Calibrating timer loop... " in Kernel_thread::bootstrap.(More precisely, the return address of irq is at "bl Kernel_thread::bootstrap_arch()", before "bl printf" )

This is the point where interrupts are enabled and also the point where it's likely that the first timer interrupt hits. The cache_enabled constant just influences the cp15-c1 value, so maybe you should also disable the flushing functions; maybe there's something strange? Another simplification is not to do any range-based flushing but always flush everything.

When you say stuck, is it because of a fault, or is it looping, or something else? Do you look at the system with some external debugger?

Adam
--
Adam                 adam@os.inf.tu-dresden.de
  Lackorzynski         http://os.inf.tu-dresden.de/~adam/
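As a rough illustration of what "influences the cp15-c1 value" means, the sketch below builds a control register value with the cache bits switched on or off. The bit positions follow the ARMv4/v5 control register layout (M = bit 0, A = bit 1, C = bit 2, S = bit 8, R = bit 9, I = bit 12, V = bit 13). The constant and function names are hypothetical and do not reproduce Fiasco's Cpu::Cp15_c1_generic.

```cpp
#include <cassert>
#include <cstdint>

// CP15 c1 control register bits (ARMv4/v5 layout); names are illustrative.
enum : std::uint32_t
{
  C1_mmu          = 1u << 0,   // M: MMU enable
  C1_alignment    = 1u << 1,   // A: alignment fault checking
  C1_dcache       = 1u << 2,   // C: data cache enable
  C1_system_prot  = 1u << 8,   // S: system protection
  C1_rom_prot     = 1u << 9,   // R: ROM protection
  C1_icache       = 1u << 12,  // I: instruction cache enable
  C1_high_vectors = 1u << 13,  // V: exception vectors at 0xffff0000
  C1_cache_bits   = C1_dcache | C1_icache
};

inline bool caches_on(std::uint32_t c1)
{
  return (c1 & C1_cache_bits) == C1_cache_bits;
}

// Take a base value and force the cache bits on or off.
inline std::uint32_t control_value(std::uint32_t base, bool cache_enabled)
{
  std::uint32_t v = base & ~std::uint32_t(C1_cache_bits);
  if (cache_enabled)
    v |= C1_cache_bits;
  assert(caches_on(v) == cache_enabled);
  return v;
}
```

Note that clearing these bits only stops the caches from being used; cache maintenance routines that the kernel still calls run regardless, which is why disabling the flush functions as well is a separate experiment.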

Tsai, Tung-Chieh

25 Feb

6:27 p.m.

Dear Adam, On Fri, Feb 20, 2009 at 5:41 AM, Adam Lackorzynski <adam@os.inf.tu-dresden.de> wrote:

...

This is the point where interrupts are enabled and also the points where it's likely that the first timer hits. The cache_enabled constant just influences the cp15-c1 value so maybe you should also disable the flushing functions, maybe there's something strange? Another simplification is also not to do any range based flushing but alsways flush everything. When you mean stuck is it because of a fault or is it looping or something else? Do you look at the system with some external debugger?

I use `stuck' to mean that it falls into an infinite loop. I use the ARM AXD debugger to debug.

I set cache_enabled to false, used empty cache flush/clean functions, and changed Config::scheduler_granularity to 100UL to make the timer interrupt less frequent. Then it gets past "Calibrating timer loop... " and leaves Thread::init_workload() normally. (Without changing Config::scheduler_granularity, it doesn't get past.) But it doesn't show anything further. It seems this is because sigma0 doesn't work. I tried to enter jdb after leaving init_workload(); jdb shows that the status of sigma0 is `cancel,dead' and that of boot_task is `poll,ipc_progr,snd_progr':

----
L4 Bootstrapper
Memory size is 128MB
mod00: 3140d000-3144bb78: fiasco
mod01: 3144c000-3145c174: sigma0
mod02: 3145d000-3147e6ac: roottask
mod03: 3147f000-3148a268: hello
move modules to 32000000 with offset bf3000
move module 4 start 3147f000 -> 32072000
move module 3 start 3145d000 -> 32050000
move module 2 start 3144c000 -> 3203f000
move module 1 start 3140d000 -> 32000000
Scanning fiasco -serial_esc -comport 1 -nowait -nokdb -jdb_cmd=JH
Scanning sigma0
Scanning roottask
Relocated mbi to [0x3004f000-0x3004f0e7]
Loading fiasco
Loading sigma0
Loading roottask
find kernel info page...
found kernel info page at 0x30002000
[ 30001000, 3000197f] Kern fiasco
[ 30002000, 3004efff] Kern fiasco
[ 3004f000, 3004f1e4] Root Multiboot info
[ 30068000, 3007335b] Sigma0 sigma0
[ 30078000, 3013ffff] Root roottask
[ 31400000, 3140cd4f] Boot bootstrap
[ 32072000, 3207d267] Root Modules Memory
API Version: (87) experimental
Sigma0 config ip:30068000 sp:3140c910
Roottask config ip:30078000 sp:00000000
Starting kernel fiasco at 30001000
Hello from Startup::stage2
Initialize page table
Vmem_alloc::init()
Vmem_alloc::TEST
allocate zero-filled page... [0xefc01000] done
free zero-filled page... done
allocate no-zero-filled page... [0xefc02000] done
free no-zero-filled page... done
SERIAL ESC: allocated IRQ 26 for serial uart
Not using serial hack in slow timer handler.
Welcome to Fiasco(arm)!
DD-L4(v2)/arm microkernel (C) 1998-2008 TU Dresden
Rev: rUNKNOWN compiled with gcc 3.4.4 for pacpmp []
--init-----------------------------------------------------PC: f0003244
(0.00) jdb: g
Calibrating timer loop... done.
--before while(runnung)------------------------------------PC: f001be74
(0.00) jdb: kf
KIP @ 0xf0002000
magic: L4µK version: 0x87004444
clock: 0000000000020b70 (134000)
freq_cpu: 0kHz
freq_bus: 0kHz
sigma0_ip: 30068000 sigma0_sp: 3140c910
sigma1_ip: 00000000 sigma1_sp: 00000000
root_ip: 30078000 root_sp: 00000000
Memory (max 20 descriptors):
1:phys [0000000030000000-0000000038000000] Conventional
2:phys [0000000030002000-000000003004f000] Reserved
3:phys [000000003004f000-000000003004f400] Bootloader
4:phys [0000000030068000-0000000030073400] Dedicated
5:phys [0000000030078000-0000000030140000] Bootloader
6:phys [0000000032072000-000000003207d400] Bootloader
7:virt [0000000000000000-00000000c0000000] Conventional
8:phys [0000000037800000-0000000038000000] Reserved
user_ptr: 0x3004f000
vhw_offset: 00000000
vkey_irq: 26
Kernel features: deceit_bit_disables_switch abiver:9 multi_irq exception_ipc exc
eption_ipc pagerexregs utcb kip_syscalls thread_names
(0.00) jdb: t2
thread: 2.00 <00040000> prio: 10 mcp: 00 mode: Con
state: 300 cancel,dead
wait for: *.** polling: rcv descr:
lcked by: ---.-- timeout :
cpu time: 900.000 µ timeslice: 200/1000 µs
pager : 0.00 cap: ---.--
utcb: f0bfb000
preemptr: 0.00 not monitored
ready lnk: ---.-- ---.-- prsent lnk: 4.00 0.00
PC=ffffffd8 USP=300707d0
[0] 3006e4a8 0000000f 0000000f 3006e6c0
[4] 30002000 c00807ec c00807a0 f0048144
[8] c0000000 c0000004 c00006f4 3007304c
[c] 00000006 300707d0 3006c804 ffffffd8
720 f0038e90 f0038a8c f0038a8c f0038a90 f0038a90 f001472c c0000000 c0000000
740 c0080000 600000d3 f0038e90 f0038a8c f0038a8c f0038a90 f0038a90 f00148d8
760 00000000 f0048144 c0080004 60000053 00000001 c0080000 c0000000 c0000004
780 c00006f4 3007304c f0005808 c0080000 c00807ec c00807a0 f0048144 f0007d84
734 30002000 c00807ec c00807a0 ffff0318 ffffffd8 f001472c 3006e4a8 0000000f
7c0 0000000f 3006e6c0 30002000 c00807ec c00807a0 f0048144 c0000000 c0000004
7e0 c00006f4 3007304c 00000006 20000010 300707d0 3006c804 3006cae4 ffffffd8
(0.00) jdb: lp
id name pr wait to state
4.00 10 poll,ipc_progr,snd_progr
2.00 10 cancel,dead
0.00 0 ready
(0.00) jdb: g
----

It seems sigma0 has a problem with its PC, but the value of Kip::k()->sigma0_ip is correct.

Besides, I found that I had not enabled ROM protection (R bit in CP15_c1, bit 9), because if I add it to Cpu::Cp15_c1_generic to enable it, an RDI warning is raised in AXD and the UART gives no output from the Fiasco kernel. Page_table::init(Page_table*) writes 0x0001 as the domain access permission. In the original situation, for example on integratorcp, d1 ~ d15 become read-only, but in my situation, since the R bit cannot be enabled, d1 to d15 become `No access'. I guess this causes my problem, but I'm not sure.

Best Regards,
Tsai, Tung-Chieh

Adam Lackorzynski

26 Feb

7:37 p.m.

Hi, On Thu Feb 26, 2009 at 01:27:24 +0800, Tsai, Tung-Chieh wrote:

...

On Fri, Feb 20, 2009 at 5:41 AM, Adam Lackorzynski <adam@os.inf.tu-dresden.de> wrote:

...
This is the point where interrupts are enabled and also the points where it's likely that the first timer hits. The cache_enabled constant just influences the cp15-c1 value so maybe you should also disable the flushing functions, maybe there's something strange? Another simplification is also not to do any range based flushing but alsways flush everything. When you mean stuck is it because of a fault or is it looping or something else? Do you look at the system with some external debugger?

I use `stuck' to describe that it would falling into a infinite loop. I use ARM AXD debugger to debug.

Ok, and where? Any prominent place?

...

I set cache_enabled to false, using empty cache flush/clean function, and change Config::scheduler_granularity to 100UL, make the timer interrupt less frequently, then it could pass the "Calibrating timer loop... ", leaveingThread::init_workload() normally. ( Without changing Config::scheduler_granularity , it wouldn't pass) But wouldn't show any thing further.It seems because sigma0 wouldn't work. I try to enter jdb after leaving init_workload() , jdb shows that the status of sigma0 is `cancel,dead' and boot_task is `poll,ipc_progr,snd_progr' :

----

(0.00) jdb: t2

thread: 2.00 <00040000> prio: 10 mcp: 00 mode: Con state: 300 cancel,dead wait for: *.** polling: rcv descr: lcked by: ---.-- timeout : cpu time: 900.000 µ timeslice: 200/1000 µs pager : 0.00 cap: ---.-- utcb: f0bfb000 preemptr: 0.00 not monitored ready lnk: ---.-- ---.-- prsent lnk: 4.00 0.00

PC=ffffffd8 USP=300707d0

It seems sigma0 has problem on PC, but the value of Kip::k()->sigma0_ip is correct.

The pc looks like it already did something. The last page should be the syscall page. Did it go through Thread::user_invoke and try to enter user-mode? What's the pc of roottask? In the initial boot-prompt you should also enable page fault logging (P*) to see if anything (unusual) happened (with T). And maybe also IPC logging (I* and IR+).

...

Besides, I found that I didn't enable ROM protection(R bit in CP15_c1, bit 9) , because if I adding it to Cpu::Cp15_c1_generic to enable it, an RDI warning will be raised on AXD and uart wouldn't give any output from fiasco kernel. In Page_table::init(Page_table*), it write the domain access permission to 0x0001. In original situation, for example, in integratorcp, d1 ~ d15 will become read only, but in my situation, since R bit is unable to enable, d1 to d15 will become `No acess' . I guess this cause my problem, but not very sure.

Looking at the table in the manual this only affects kernel-only ro pages which there should not be any. d0 is the only one used so the other should not play a role. Also on integratorcp d1-d15 should be 0, i.e. no access. I think you mean in the AP bits in the page table.

Adam
--
Adam                 adam@os.inf.tu-dresden.de
  Lackorzynski         http://os.inf.tu-dresden.de/~adam/
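The distinction drawn above between the domain access control register and the AP bits can be sketched as follows. The decoding follows the ARMv4/v5 architecture: the domain register holds two bits per domain, and the S and R bits only change the meaning of pages whose AP field is 0b00 in a client domain. The type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Two-bit field per domain in the CP15 c3 domain access control register.
enum class Domain_access { No_access = 0, Client = 1, Reserved = 2, Manager = 3 };

inline Domain_access domain_access(std::uint32_t dacr, unsigned domain)
{
  assert(domain < 16);
  return static_cast<Domain_access>((dacr >> (2 * domain)) & 3u);
}

// Meaning of AP == 0b00 for a page in a client domain, depending on S and R.
enum class Ap0 { No_access, Priv_read_only, Read_only, Unpredictable };

inline Ap0 ap0_meaning(bool s, bool r)
{
  if (s && r)
    return Ap0::Unpredictable;   // S = 1, R = 1 is not a defined combination
  if (s)
    return Ap0::Priv_read_only;  // privileged read-only, no user access
  if (r)
    return Ap0::Read_only;       // read-only for privileged and user mode
  return Ap0::No_access;         // S = 0, R = 0
}
```

With a domain register value of 0x0001, domain 0 decodes as Client and domains 1 to 15 as No_access regardless of the R bit. The R bit only matters for pages mapped with AP == 0b00.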
