捉虫日记 0010: LTP call trace on RMI XLR
来自Jack's Lab
1 Phenomenon
环境:
- RMI XLR732 (8 core, 32 threads)
- Linux 2.6.27-rc9
产品内核版本从 2.6.27.8 升级到 2.6.27.14 后,运行 LTP,内核拼命地打印 call trace:
<<<test_start>>> tag=clone06 stime=1688 cmdline="clone06" contacts="" analysis=exit initiation_status="ok" <<<test_output>>> clone06 1 PASS : Test Passed <<<execution_status>>> duration=0 termination_type=exited termination_id=0 corefile=no cutime=0 cstime=1 <<<test_end>>> <<<test_start>>> tag=clone07 stime=1688 cmdline="clone07" contacts="" analysis=exit initiation_status="ok" <<<test_output>>> clone07 1 PASS : Use of return() in child did not cause SIGSEGV <<<execution_status>>> duration=0 termination_type=exited termination_id=0 corefile=no cutime=0 cstime=1 <<<test_end>>> <<<test_start>>> tag=close01 stimUnhandled kernel unaligned access[#12]: e=1688 cmdline=Cpu 17 "close01" conta$ 0 :cts="" analysis 0000000000000000=exit initiatio 000000001000dce0n_status="ok" < 3c0f00418df97064<<test_output>>> 0000000000000011 close01 1 PASS : file a$ 4 :ppears closed c 0000000000000008lose01 2 PA 00000000000000d0SS : pipe appe 0000000000000000ars closed <<<e ffffffffffffffffxecution_status> >> duration=0 t$ 8 :ermination_type= a800000007af4c38exited terminati a8000001254d07e0 0000000000000000 0000000000000418 $12 : 0000000000000038 ffffffff834056e0 0000000000000004 0000000016ee6396 $16 : 3c0f00418df97064 ffffffff838e0d00 000000001000dce1 00000000000000d0 $20 : ffffffff834305f4 0000000000000100 0000000000417384 0000000000000000 $24 : 0000000000000000 ffffffff835d6860 $28 : a800000123ed4000 a800000123ed7c90 000000000041bb90 ffffffff834cbf54 Hi : 0000000000000000 Lo : 0000000000000f40 epc : ffffffff834cbf7c __kmalloc+0x94/0x198 Tainted: G D ra : ffffffff834cbf54 __kmalloc+0x6c/0x198 Status: 1000dce2 KX SX UX KERNEL EXL Cause : 00800010 BadVA : 3c0f00418df97064 PrId : 000c0b04 (RMI Phoenix) Modules linked in: [last unloaded: hwtimer_test] Process connect01 (pid: 2343, threadinfo=a800000123ed4000, task=a8000001256d2820, tls=000000002aab02a0) Stack : 000000001000dce3 00000000000000e0 0000000000000000 a800000125e66b80 a800000125d4bb00 a800000123ed7eb0 000000000041bb90 ffffffff834305f4 a8000001263de690 00000000004065d4 0000000000400000 0000000000000008 0000000000417384 0000000000417000 000000000000000a 0000000000000007 a8000001263de690 000000fffffffe85 a8000001256d2820 0000000000417384 0000000000417000 00000000004065d4 0000000000400000 0000000000000001 a8000001255edd00 a800000125f85170 0000000000000001 0000000000400000 a800000126a5a0c0 a800000125e66bb4 0000000000000005 000000fffffffe85 0000000000000000 ffffffff838b11a0 ffffffff8342f940 fffffffffffffff8 a800000125d4bb00 ffffffff83973268 ffffffff838b6810 0000000000000002 ... Call Trace: [<ffffffff834cbf7c>] __kmalloc+0x94/0x198 [<ffffffff834305f4>] load_elf_binary+0xcb4/0x1e00 [<ffffffff834d751c>] search_binary_handler+0x294/0x4b8 [<ffffffff8351bb74>] compat_do_execve+0x33c/0x3e0 [<ffffffff8342b17c>] sys32_execve+0x5c/0x88 [<ffffffff8340394c>] handle_sys+0x16c/0x188 Code: 9d020014 000210f8 0202102d <dc430000> fd030000 0240102d 40016000 30420001 3421001f on_id=0 corefile=no cutime=0 cstime=1 <<<test_end>>> <<<test_start>>> tag=close02 stime=1688 cmdline="close02" contacts="" analysis=exit initiation_status="ok" <<<test_output>>> close02 1 PASS : call returned EBADF <<<execUnhandled kernel unaligned access[#13]: Cpu 17status>>> duration=1 term$ 0 :ination_type=exi 0000000000000000ted termination_ 000000001000dce0id=0 corefile=no 3c0f00418df97064 cutime=0 cstim 0000000000000011e=2 <<<test_end >>> <<<test_sta$ 4 :rt>>> tag=close ffffffff838e0d0008 stime=1689 c 00000000000080d0mdline="close08" 000000007fa5c760 contacts="" a a800000123f23eb0 nalysis=exit in$ 8 :itiation_status= a800000007af4c38"ok" <<<test_ou a800000123f9e000tput>>> close08 0000000000000000 1 PASS : 0000000032cff56c $12 : 00000000000003f3 ffffffffc0000008 ffffffff8342b120 a800000127a40000 $16 : 3c0f00418df97064 000000001000dce1 ffffffff838e0d00 ffffffff8351b878 $20 : 00000000000080d0 0000000000000100 000000007fa5b9fb 000000002aaa92a8 $24 : 0000000000000000 0000000032dd2470 $28 : a800000123f20000 a800000123f23e10 000000007fa5ba00 ffffffff834cb738 Hi : 0000000000000000 Lo : 0000000000000d40 epc : ffffffff834cb760 kmem_cache_alloc+0x78/0x118 Tainted: G D ra : ffffffff834cb738 kmem_cache_alloc+0x50/0x118 Status: 1000dce2 KX SX UX KERNEL EXL Cause : 00800010 BadVA : 3c0f00418df97064 PrId : 000c0b04 (RMI Phoenix) Modules linked in: [last unloaded: hwtimer_test] Process pan (pid: 2346, threadinfo=a800000123f20000, task=a8000001256d2ed0, tls=000000002aab02a0) Stack : 000000002aaa92a8 a800000123f9e000 000000007fa5cf48 a800000123f9e000 000000000041b3a0 000000007fa5c760 a800000123f23eb0 ffffffff8351b878 0000000032e91250 a800000123f9e000 000000007fa5cf48 000000007fa5b9f1 0000000032e91250 0000000000000000 000000000041b3a0 ffffffff8342b17c 000000007fa5b9fb 000000007fa5cf3f 000000007fa5cf48 ffffffff8340394c 0000000000000000 000000001000dce1 0000000000000fab 000000007fa5b9fa 000000007fa5b9f1 000000000041b3a0 000000007fa5c760 000000007fa5cf48 000000000000006e 000000006f742f62 000000003a3a3a3a ffffffff81010100 ffffffffffffffff 000000007fa5b8d0 0000000032cf9c24 fffffffff0000000 000000007fa5cf12 000000007fa5cf12 000000000041b390 0000000032e91250 ... Call Trace: [<ffffffff834cb760>] kmem_cache_alloc+0x78/0x118 [<ffffffff8351b878>] compat_do_execve+0x40/0x3e0 [<ffffffff8342b17c>] sys32_execve+0x5c/0x88 [<ffffffff8340394c>] handle_sys+0x16c/0x188 Code: 9d020014 000210f8 0202102d <dc430000> fd030000 0220102d 40016000 30420001 3421001f close(tfile_2341) returned 0 <<<execution_status>>> duration=0 termination_type=exited termination_id=0 corefile=no cutime=0 cstime=1 <<<test_end>>> <<<test_start>>> tag=confstr01 stime=1689 cmdline="confstr01" contacts="" analysis=exit initiation_status="ok" <<<test_output>>> confstr01 1 PASS : Test Passed <<<execution_status>>> duration=0 termination_type=exited termination_id=0 corefile=no cutime=0 cstime=0 <<<test_end>>> ...... ......
能够看到,主要是两种 Call trace:
Unhandled kernel unaligned access[#25]: Cpu 17 $ 0 : 0000000000000000 000000001000dce0 3c0f00418df97064 0000000000000011 $ 4 : ffffffff838e0d00 00000000000080d0 000000007fa5c760 a80000012240feb0 $ 8 : a800000007af4c38 a800000123f9d000 0000000000000000 0000000032cff56c $12 : 00000000000003f3 ffffffffc0000008 ffffffff8342b120 a800000127a40000 $16 : 3c0f00418df97064 000000001000dce1 ffffffff838e0d00 ffffffff8351b878 $20 : 00000000000080d0 0000000000000100 000000007fa5b9fb 000000002aaa92a8 $24 : 0000000000000000 0000000032dd2470 $28 : a80000012240c000 a80000012240fe10 000000007fa5ba00 ffffffff834cb738 Hi : 0000000000000000 Lo : 0000000000000d40 epc : ffffffff834cb760 kmem_cache_alloc+0x78/0x118 Tainted: G D ra : ffffffff834cb738 kmem_cache_alloc+0x50/0x118 Status: 1000dce2 KX SX UX KERNEL EXL Cause : 00800010 BadVA : 3c0f00418df97064 PrId : 000c0b04 (RMI Phoenix) Modules linked in: [last unloaded: hwtimer_test] Process pan (pid: 2405, threadinfo=a80000012240c000, task=a800000123f2d040, tls=000000002aab02a0) Stack : 000000007fa5b9fb a800000123f9d000 000000007fa5cf25 a800000123f9d000 000000000041b3a0 000000007fa5c760 a80000012240feb0 ffffffff8351b878 0000000032e91250 a800000123f9d000 000000007fa5cf25 000000007fa5b9f5 0000000032e91250 0000000000000000 000000000041b3a0 ffffffff8342b17c 000000007fa5b9fb 000000007fa5cf20 000000007fa5cf25 ffffffff8340394c 0000000000000000 000000001000dce1 0000000000000fab 000000007fa5b9fa 000000007fa5b9f5 000000000041b3a0 000000007fa5c760 000000007fa5cf25 000000000000006e 000000007efefeff 000000003a3a3a3a ffffffff81010100 ffffffffffffffff 000000007fa5b8d0 0000000032cf9c24 fffffffff0000000 000000007fa5cf12 000000007fa5cf12 000000000041b390 0000000032e91250 ... Call Trace: [<ffffffff834cb760>] kmem_cache_alloc+0x78/0x118 [<ffffffff8351b878>] compat_do_execve+0x40/0x3e0 [<ffffffff8342b17c>] sys32_execve+0x5c/0x88 [<ffffffff8340394c>] handle_sys+0x16c/0x188 Code: 9d020014 000210f8 0202102d <dc430000> fd030000 0220102d 40016000 30420001 3421001f
和:
Unhandled kernel unaligned access[#29]: Cpu 17 $ 0 : 0000000000000000 000000001000dce0 3c0f00418df97064 0000000000000011 $ 4 : 0000000000000008 00000000000000d0 0000000000000000 ffffffffffffffff $ 8 : a800000007af4c38 a8000001254d1500 0000000000000000 0000000000055575 $12 : 0000000000000028 ffffffff834056e8 0000000000000005 000000004b81435b $16 : 3c0f00418df97064 ffffffff838e0d00 000000001000dce1 00000000000000d0 $20 : ffffffff834305f4 0000000000000100 0000000000024158 0000000055550000 $24 : 0000000000000006 ffffffff835d6860 $28 : a800000123ed4000 a800000123ed7c90 0000000000029800 ffffffff834cbf54 Hi : 0000000000000000 Lo : 0000000000000f40 epc : ffffffff834cbf7c __kmalloc+0x94/0x198 Tainted: G D ra : ffffffff834cbf54 __kmalloc+0x6c/0x198 Status: 1000dce2 KX SX UX KERNEL EXL Cause : 00800010 BadVA : 3c0f00418df97064 PrId : 000c0b04 (RMI Phoenix) Modules linked in: [last unloaded: hwtimer_test] Process mail.local (pid: 2458, threadinfo=a800000123ed4000, task=a800000125606450, tls=000000002aad0840) Stack : 000000001000dce3 00000000000000e0 0000000000000000 a800000125dd6a00 a8000001271b3700 a800000123ed7eb0 0000000055579800 ffffffff834305f4 0000000000020000 000000005556368c 0000000055550000 0000000000000009 0000000055574158 000000005557368c 0000000000000000 ffffffff834056fc 0000000000000000 0000000d00000004 0000000000000000 0000000000024158 000000000002368c 000000000001368c 0000000000000000 000000000000001b a8000001262a3600 a800000125dd7130 0000000000000001 0000000055550000 a800000125104600 a800000125dd6a34 0000000000000003 a800000123ed7dd0 0000000000000000 ffffffff838b11a0 ffffffff8342f940 fffffffffffffff8 a8000001271b3700 ffffffff83973268 ffffffff838b6810 0000000000000002 ... Call Trace: [<ffffffff834cbf7c>] __kmalloc+0x94/0x198 [<ffffffff834305f4>] load_elf_binary+0xcb4/0x1e00 [<ffffffff834d751c>] search_binary_handler+0x294/0x4b8 [<ffffffff8351bb74>] compat_do_execve+0x33c/0x3e0 [<ffffffff8342b17c>] sys32_execve+0x5c/0x88 [<ffffffff8340394c>] handle_sys+0x16c/0x188 Code: 9d020014 000210f8 0202102d <dc430000> fd030000 0240102d 40016000 30420001 3421001f
2 Analysis
可以看到每次引起 Call trace 的 BadVaddr 皆为 3c0f00418df97064,引起异常的代码段皆类似,且引起异常的指令皆为 0xdc430000 :
<kmem_cache_alloc>: ...... ffffffff834cb754: 9d020014 lwu v0,20(a4) ffffffff834cb758: 000210f8 dsll v0,v0,0x3 ffffffff834cb75c: 0202102d daddu v0,s0,v0 ffffffff834cb760: dc430000 ld v1,0(v0) ffffffff834cb764: fd030000 sd v1,0(a4) ffffffff834cb768: 0220102d move v0,s1 ffffffff834cb76c: 40016000 mfc0 at,c0_status ffffffff834cb770: 30420001 andi v0,v0,0x1 ffffffff834cb774: 3421001f ori at,at,0x1f ......
2009.2.12 23:43
MD,一跑 LTP 怎么多 Call trace, 都快崩溃了。现在还好,接受 Tiejun 的建议,把 slub 换成了 slab,在 ATX-II 没出现 Call trace,也许是没加载模块的原因。在 ATX-I 上先 insmod hwtimer_test.ko; rmmod hwtimer_test.ko 后,就出现了:
...... ...... <<<test_start>>> tag=waitpid10 stime=1464 cmdline="waitpid10 5" contacts="" analysis=exit initiation_status="ok" <<<test_output>>> waitpid10 1 *******PASS : Test P****SSED <<<execut********************************** cpu_31 received a bus/cache error ********************************************* Bridge: Phys Addr = 0x0128000000, Device_AERR = 0x00000008 Bridge: The devices reporting AERR are: cpu 3 CPU: (XLR specific) Cache Error log = 0x000000f93e7d7108, Phy Addr = 0x01f27cfae0 CPU: epc = 0x32dd2480, errorepc = 0xffffffff834c7c8c, cacheerr = 0x00000000 Can not handle bus/cache error - Halting cpu BUG: soft lockup - CPU#0 stuck for 61s! [events/0:99] Modules linked in: binfmt_misc ipv6 Cpu 0 $ 0 : 0000000000000000 000000001000dce0 0000000000000003 0000000000000020 $ 4 : 0000000000000020 0000000000000002 0000000000000020 a80000012715bdc0 $ 8 : 0000000000000000 0000000000000000 0000000000000000 0000000000000001 $12 : a80000012715bfe0 000000000000dc00 0000000000000000 ffffffff83884000 $16 : ffffffff838d0000 a800000125584ec0 0000000000000001 0000000000000000 $20 : ffffffff83494690 0000000000000000 fffffffffffffffc ffffffffefffffff $24 : 0000000000000000 ffffffff83419d88 $28 : a800000127158000 a80000012715bdf0 ffffffff83947240 ffffffff8347f010 Hi : 0000000000000000 Lo : 0000000000000000 epc : ffffffff8347ee50 smp_call_function_mask+0x170/0x2f0 Not tainted ra : ffffffff8347f010 smp_call_function+0x40/0x88 Status: 1000dce3 KX SX UX KERNEL EXL IE Cause : 00808000 PrId : 000c0b04 (RMI Phoenix) BUG: soft lockup - CPU#6 stuck for 61s! [syslogd:6427] Modules linked in: binfmt_misc ipv6 Cpu 6 $ 0 : 0000000000000000 000000001000dce0 0000000000000003 0000000000000020 $ 4 : 0000000000000020 0000000000000002 0000000000000020 a80000012351fbf0 $ 8 : 0000000000000000 ffffffff8342c8f8 0000000000000020 0000000000000000 $12 : 0000000000000000 000000000000dc00 0000000000000000 a80000012796c000 $16 : ffffffff838d0000 a800000125467260 0000000000000001 0000000000000000 $20 : ffffffff83434398 000000007ff3f450 0000000055551dc8 0000000000000001 $24 : 0000000000000020 ffffffff83419d88 $28 : a80000012351c000 a80000012351fc20 0000000055569040 ffffffff8347f010 Hi : 0000000000000000 Lo : 0000000000000000 epc : ffffffff8347ee50 smp_call_function_mask+0x170/0x2f0 Not tainted ra : ffffffff8347f010 smp_call_function+0x40/0x88 Status: 1000dce3 KX SX UX KERNEL EXL IE Cause : 00808000 PrId : 000c0b04 (RMI Phoenix) <<soft lockup here>>
2009.2.13 1:05
换用 slab 后,ATX-II 上 insmod ./hwtimer_test.ko; rmmod ./hwtimer_test.ko; ./runltplite.sh 跑了一夜,似乎没什么问题。
ATX-I 上直接 ./runltplite.sh 似乎也没什么问题。
2009.2.13 9:43