捉虫日记 0010: LTP call trace on RMI XLR

来自Jack's Lab
跳转到: 导航, 搜索

1 Phenomenon

环境:

  • RMI XLR732 (8 core, 32 threads)
  • Linux 2.6.27-rc9


产品内核版本从 2.6.27.8 升级到 2.6.27.14 后,运行 LTP,内核拼命地打印 call trace:


<<<test_start>>>
tag=clone06 stime=1688
cmdline="clone06"
contacts=""
analysis=exit
initiation_status="ok"
<<<test_output>>>
clone06     1 PASS : Test Passed
<<<execution_status>>>
duration=0 termination_type=exited termination_id=0 corefile=no
cutime=0 cstime=1
<<<test_end>>>
<<<test_start>>>
tag=clone07 stime=1688
cmdline="clone07"
contacts=""
analysis=exit
initiation_status="ok"
<<<test_output>>>
clone07     1 PASS : Use of return() in child did not cause SIGSEGV
<<<execution_status>>>
duration=0 termination_type=exited termination_id=0 corefile=no
cutime=0 cstime=1
<<<test_end>>>
<<<test_start>>>
tag=close01 stimUnhandled kernel unaligned access[#12]:
e=1688
cmdline=Cpu 17
"close01"
conta$ 0   :cts=""
analysis 0000000000000000=exit
initiatio 000000001000dce0n_status="ok"
< 3c0f00418df97064<<test_output>>> 0000000000000011
close01     1
PASS : file a$ 4   :ppears closed
c 0000000000000008lose01     2 PA 00000000000000d0SS : pipe appe 0000000000000000ars closed
<<<e ffffffffffffffffxecution_status>
>>
duration=0 t$ 8   :ermination_type= a800000007af4c38exited terminati a8000001254d07e0 0000000000000000 0000000000000418
$12   : 0000000000000038 ffffffff834056e0 0000000000000004 0000000016ee6396
$16   : 3c0f00418df97064 ffffffff838e0d00 000000001000dce1 00000000000000d0
$20   : ffffffff834305f4 0000000000000100 0000000000417384 0000000000000000
$24   : 0000000000000000 ffffffff835d6860                                 
$28   : a800000123ed4000 a800000123ed7c90 000000000041bb90 ffffffff834cbf54
Hi    : 0000000000000000
Lo    : 0000000000000f40
epc   : ffffffff834cbf7c __kmalloc+0x94/0x198
    Tainted: G      D
ra    : ffffffff834cbf54 __kmalloc+0x6c/0x198
Status: 1000dce2    KX SX UX KERNEL EXL
Cause : 00800010
BadVA : 3c0f00418df97064
PrId : 000c0b04 (RMI Phoenix)
Modules linked in: [last unloaded: hwtimer_test]
Process connect01 (pid: 2343, threadinfo=a800000123ed4000, task=a8000001256d2820, tls=000000002aab02a0)
Stack : 000000001000dce3 00000000000000e0 0000000000000000 a800000125e66b80
        a800000125d4bb00 a800000123ed7eb0 000000000041bb90 ffffffff834305f4
        a8000001263de690 00000000004065d4 0000000000400000 0000000000000008
        0000000000417384 0000000000417000 000000000000000a 0000000000000007
        a8000001263de690 000000fffffffe85 a8000001256d2820 0000000000417384
        0000000000417000 00000000004065d4 0000000000400000 0000000000000001
        a8000001255edd00 a800000125f85170 0000000000000001 0000000000400000
        a800000126a5a0c0 a800000125e66bb4 0000000000000005 000000fffffffe85
        0000000000000000 ffffffff838b11a0 ffffffff8342f940 fffffffffffffff8
        a800000125d4bb00 ffffffff83973268 ffffffff838b6810 0000000000000002
        ...
Call Trace:
[<ffffffff834cbf7c>] __kmalloc+0x94/0x198
[<ffffffff834305f4>] load_elf_binary+0xcb4/0x1e00
[<ffffffff834d751c>] search_binary_handler+0x294/0x4b8
[<ffffffff8351bb74>] compat_do_execve+0x33c/0x3e0
[<ffffffff8342b17c>] sys32_execve+0x5c/0x88
[<ffffffff8340394c>] handle_sys+0x16c/0x188

Code: 9d020014 000210f8 0202102d <dc430000> fd030000 0240102d 40016000 30420001 3421001f

on_id=0 corefile=no
cutime=0 cstime=1
<<<test_end>>>
<<<test_start>>>
tag=close02 stime=1688
cmdline="close02"
contacts=""
analysis=exit
initiation_status="ok"
<<<test_output>>>
close02     1 PASS : call returned EBADF
<<<execUnhandled kernel unaligned access[#13]:
Cpu 17status>>>

duration=1 term$ 0   :ination_type=exi 0000000000000000ted termination_ 000000001000dce0id=0 corefile=no 3c0f00418df97064
cutime=0 cstim 0000000000000011e=2
<<<test_end
>>>
<<<test_sta$ 4   :rt>>>
tag=close ffffffff838e0d0008 stime=1689
c 00000000000080d0mdline="close08" 000000007fa5c760
contacts=""
a a800000123f23eb0
nalysis=exit
in$ 8   :itiation_status= a800000007af4c38"ok"
<<<test_ou a800000123f9e000tput>>>
close08 0000000000000000     1 PASS : 0000000032cff56c
$12   : 00000000000003f3 ffffffffc0000008 ffffffff8342b120 a800000127a40000
$16   : 3c0f00418df97064 000000001000dce1 ffffffff838e0d00 ffffffff8351b878
$20   : 00000000000080d0 0000000000000100 000000007fa5b9fb 000000002aaa92a8
$24   : 0000000000000000 0000000032dd2470                                 
$28   : a800000123f20000 a800000123f23e10 000000007fa5ba00 ffffffff834cb738
Hi    : 0000000000000000
Lo    : 0000000000000d40
epc   : ffffffff834cb760 kmem_cache_alloc+0x78/0x118
    Tainted: G      D
ra    : ffffffff834cb738 kmem_cache_alloc+0x50/0x118
Status: 1000dce2    KX SX UX KERNEL EXL
Cause : 00800010
BadVA : 3c0f00418df97064
PrId : 000c0b04 (RMI Phoenix)
Modules linked in: [last unloaded: hwtimer_test]
Process pan (pid: 2346, threadinfo=a800000123f20000, task=a8000001256d2ed0, tls=000000002aab02a0)
Stack : 000000002aaa92a8 a800000123f9e000 000000007fa5cf48 a800000123f9e000
        000000000041b3a0 000000007fa5c760 a800000123f23eb0 ffffffff8351b878
        0000000032e91250 a800000123f9e000 000000007fa5cf48 000000007fa5b9f1
        0000000032e91250 0000000000000000 000000000041b3a0 ffffffff8342b17c
        000000007fa5b9fb 000000007fa5cf3f 000000007fa5cf48 ffffffff8340394c
        0000000000000000 000000001000dce1 0000000000000fab 000000007fa5b9fa
        000000007fa5b9f1 000000000041b3a0 000000007fa5c760 000000007fa5cf48
        000000000000006e 000000006f742f62 000000003a3a3a3a ffffffff81010100
        ffffffffffffffff 000000007fa5b8d0 0000000032cf9c24 fffffffff0000000
        000000007fa5cf12 000000007fa5cf12 000000000041b390 0000000032e91250
        ...
Call Trace:
[<ffffffff834cb760>] kmem_cache_alloc+0x78/0x118
[<ffffffff8351b878>] compat_do_execve+0x40/0x3e0
[<ffffffff8342b17c>] sys32_execve+0x5c/0x88
[<ffffffff8340394c>] handle_sys+0x16c/0x188


Code: 9d020014 000210f8 0202102d <dc430000> fd030000 0220102d 40016000 30420001 3421001f
close(tfile_2341) returned 0
<<<execution_status>>>
duration=0 termination_type=exited termination_id=0 corefile=no
cutime=0 cstime=1
<<<test_end>>>
<<<test_start>>>
tag=confstr01 stime=1689
cmdline="confstr01"
contacts=""
analysis=exit
initiation_status="ok"
<<<test_output>>>
confstr01    1 PASS : Test Passed
<<<execution_status>>>
duration=0 termination_type=exited termination_id=0 corefile=no
cutime=0 cstime=0
<<<test_end>>>
......
......


能够看到,主要是两种 Call trace:

Unhandled kernel unaligned access[#25]:
Cpu 17
$ 0   : 0000000000000000 000000001000dce0 3c0f00418df97064 0000000000000011
$ 4   : ffffffff838e0d00 00000000000080d0 000000007fa5c760 a80000012240feb0
$ 8   : a800000007af4c38 a800000123f9d000 0000000000000000 0000000032cff56c
$12   : 00000000000003f3 ffffffffc0000008 ffffffff8342b120 a800000127a40000
$16   : 3c0f00418df97064 000000001000dce1 ffffffff838e0d00 ffffffff8351b878
$20   : 00000000000080d0 0000000000000100 000000007fa5b9fb 000000002aaa92a8
$24   : 0000000000000000 0000000032dd2470                                 
$28   : a80000012240c000 a80000012240fe10 000000007fa5ba00 ffffffff834cb738
Hi    : 0000000000000000
Lo    : 0000000000000d40
epc   : ffffffff834cb760 kmem_cache_alloc+0x78/0x118
    Tainted: G      D
ra    : ffffffff834cb738 kmem_cache_alloc+0x50/0x118
Status: 1000dce2    KX SX UX KERNEL EXL
Cause : 00800010
BadVA : 3c0f00418df97064
PrId : 000c0b04 (RMI Phoenix)
Modules linked in: [last unloaded: hwtimer_test]
Process pan (pid: 2405, threadinfo=a80000012240c000, task=a800000123f2d040, tls=000000002aab02a0)
Stack : 000000007fa5b9fb a800000123f9d000 000000007fa5cf25 a800000123f9d000
        000000000041b3a0 000000007fa5c760 a80000012240feb0 ffffffff8351b878
        0000000032e91250 a800000123f9d000 000000007fa5cf25 000000007fa5b9f5
        0000000032e91250 0000000000000000 000000000041b3a0 ffffffff8342b17c
        000000007fa5b9fb 000000007fa5cf20 000000007fa5cf25 ffffffff8340394c
        0000000000000000 000000001000dce1 0000000000000fab 000000007fa5b9fa
        000000007fa5b9f5 000000000041b3a0 000000007fa5c760 000000007fa5cf25
        000000000000006e 000000007efefeff 000000003a3a3a3a ffffffff81010100
        ffffffffffffffff 000000007fa5b8d0 0000000032cf9c24 fffffffff0000000
        000000007fa5cf12 000000007fa5cf12 000000000041b390 0000000032e91250
        ...
Call Trace:
[<ffffffff834cb760>] kmem_cache_alloc+0x78/0x118
[<ffffffff8351b878>] compat_do_execve+0x40/0x3e0
[<ffffffff8342b17c>] sys32_execve+0x5c/0x88
[<ffffffff8340394c>] handle_sys+0x16c/0x188

Code: 9d020014 000210f8 0202102d <dc430000> fd030000 0220102d 40016000 30420001 3421001f


和:

Unhandled kernel unaligned access[#29]:
Cpu 17
$ 0   : 0000000000000000 000000001000dce0 3c0f00418df97064 0000000000000011
$ 4   : 0000000000000008 00000000000000d0 0000000000000000 ffffffffffffffff
$ 8   : a800000007af4c38 a8000001254d1500 0000000000000000 0000000000055575
$12   : 0000000000000028 ffffffff834056e8 0000000000000005 000000004b81435b
$16   : 3c0f00418df97064 ffffffff838e0d00 000000001000dce1 00000000000000d0
$20   : ffffffff834305f4 0000000000000100 0000000000024158 0000000055550000
$24   : 0000000000000006 ffffffff835d6860                                 
$28   : a800000123ed4000 a800000123ed7c90 0000000000029800 ffffffff834cbf54
Hi    : 0000000000000000
Lo    : 0000000000000f40
epc   : ffffffff834cbf7c __kmalloc+0x94/0x198
    Tainted: G      D
ra    : ffffffff834cbf54 __kmalloc+0x6c/0x198
Status: 1000dce2    KX SX UX KERNEL EXL
Cause : 00800010
BadVA : 3c0f00418df97064
PrId : 000c0b04 (RMI Phoenix)
Modules linked in: [last unloaded: hwtimer_test]
Process mail.local (pid: 2458, threadinfo=a800000123ed4000, task=a800000125606450, tls=000000002aad0840)
Stack : 000000001000dce3 00000000000000e0 0000000000000000 a800000125dd6a00
        a8000001271b3700 a800000123ed7eb0 0000000055579800 ffffffff834305f4
        0000000000020000 000000005556368c 0000000055550000 0000000000000009
        0000000055574158 000000005557368c 0000000000000000 ffffffff834056fc
        0000000000000000 0000000d00000004 0000000000000000 0000000000024158
        000000000002368c 000000000001368c 0000000000000000 000000000000001b
        a8000001262a3600 a800000125dd7130 0000000000000001 0000000055550000
        a800000125104600 a800000125dd6a34 0000000000000003 a800000123ed7dd0
        0000000000000000 ffffffff838b11a0 ffffffff8342f940 fffffffffffffff8
        a8000001271b3700 ffffffff83973268 ffffffff838b6810 0000000000000002
        ...
Call Trace:
[<ffffffff834cbf7c>] __kmalloc+0x94/0x198
[<ffffffff834305f4>] load_elf_binary+0xcb4/0x1e00
[<ffffffff834d751c>] search_binary_handler+0x294/0x4b8
[<ffffffff8351bb74>] compat_do_execve+0x33c/0x3e0
[<ffffffff8342b17c>] sys32_execve+0x5c/0x88
[<ffffffff8340394c>] handle_sys+0x16c/0x188

Code: 9d020014 000210f8 0202102d <dc430000> fd030000 0240102d 40016000 30420001 3421001f



2 Analysis

可以看到每次引起 Call trace 的 BadVaddr 皆为 3c0f00418df97064,引起异常的代码段皆类似,且引起异常的指令皆为 0xdc430000 :

<kmem_cache_alloc>:
......
ffffffff834cb754:   9d020014    lwu v0,20(a4)
ffffffff834cb758:   000210f8    dsll    v0,v0,0x3
ffffffff834cb75c:   0202102d    daddu   v0,s0,v0
ffffffff834cb760:   dc430000    ld v1,0(v0)
ffffffff834cb764:   fd030000    sd v1,0(a4)
ffffffff834cb768:   0220102d    move    v0,s1
ffffffff834cb76c:   40016000    mfc0    at,c0_status
ffffffff834cb770:   30420001    andi    v0,v0,0x1
ffffffff834cb774:   3421001f    ori at,at,0x1f
......


2009.2.12 23:43


MD,一跑 LTP 怎么多 Call trace, 都快崩溃了。现在还好,接受 Tiejun 的建议,把 slub 换成了 slab,在 ATX-II 没出现 Call trace,也许是没加载模块的原因。在 ATX-I 上先 insmod hwtimer_test.ko; rmmod hwtimer_test.ko 后,就出现了:

......
......
<<<test_start>>>
tag=waitpid10 stime=1464
cmdline="waitpid10 5"
contacts=""
analysis=exit
initiation_status="ok"
<<<test_output>>>
waitpid10    1 *******PASS : Test P****SSED
<<<execut**********************************
cpu_31 received a bus/cache error
*********************************************
Bridge: Phys Addr = 0x0128000000, Device_AERR = 0x00000008
Bridge: The devices reporting AERR are:
    cpu 3
CPU: (XLR specific) Cache Error log = 0x000000f93e7d7108, Phy Addr = 0x01f27cfae0
CPU: epc = 0x32dd2480, errorepc = 0xffffffff834c7c8c, cacheerr = 0x00000000
Can not handle bus/cache error - Halting cpu
BUG: soft lockup - CPU#0 stuck for 61s! [events/0:99]
Modules linked in: binfmt_misc ipv6
Cpu 0
$ 0   : 0000000000000000 000000001000dce0 0000000000000003 0000000000000020
$ 4   : 0000000000000020 0000000000000002 0000000000000020 a80000012715bdc0
$ 8   : 0000000000000000 0000000000000000 0000000000000000 0000000000000001
$12   : a80000012715bfe0 000000000000dc00 0000000000000000 ffffffff83884000
$16   : ffffffff838d0000 a800000125584ec0 0000000000000001 0000000000000000
$20   : ffffffff83494690 0000000000000000 fffffffffffffffc ffffffffefffffff
$24   : 0000000000000000 ffffffff83419d88
$28   : a800000127158000 a80000012715bdf0 ffffffff83947240 ffffffff8347f010
Hi    : 0000000000000000
Lo    : 0000000000000000
epc   : ffffffff8347ee50 smp_call_function_mask+0x170/0x2f0
    Not tainted
ra    : ffffffff8347f010 smp_call_function+0x40/0x88
Status: 1000dce3    KX SX UX KERNEL EXL IE
Cause : 00808000
PrId : 000c0b04 (RMI Phoenix)
BUG: soft lockup - CPU#6 stuck for 61s! [syslogd:6427]
Modules linked in: binfmt_misc ipv6
Cpu 6
$ 0   : 0000000000000000 000000001000dce0 0000000000000003 0000000000000020
$ 4   : 0000000000000020 0000000000000002 0000000000000020 a80000012351fbf0
$ 8   : 0000000000000000 ffffffff8342c8f8 0000000000000020 0000000000000000
$12   : 0000000000000000 000000000000dc00 0000000000000000 a80000012796c000
$16   : ffffffff838d0000 a800000125467260 0000000000000001 0000000000000000
$20   : ffffffff83434398 000000007ff3f450 0000000055551dc8 0000000000000001
$24   : 0000000000000020 ffffffff83419d88
$28   : a80000012351c000 a80000012351fc20 0000000055569040 ffffffff8347f010
Hi    : 0000000000000000
Lo    : 0000000000000000
epc   : ffffffff8347ee50 smp_call_function_mask+0x170/0x2f0
    Not tainted
ra    : ffffffff8347f010 smp_call_function+0x40/0x88
Status: 1000dce3    KX SX UX KERNEL EXL IE
Cause : 00808000
PrId : 000c0b04 (RMI Phoenix)
<<soft lockup here>>


2009.2.13 1:05



换用 slab 后,ATX-II 上 insmod ./hwtimer_test.ko; rmmod ./hwtimer_test.ko; ./runltplite.sh 跑了一夜,似乎没什么问题。

ATX-I 上直接 ./runltplite.sh 似乎也没什么问题。

2009.2.13 9:43

















个人工具
名字空间

变换
操作
导航
工具箱