Bug Hunting Diary 0009: Bad pte = 03607114, process = ???, vm_flags = 100177, vaddr = ffffffffed
1 Phenomenon
Environment:
- RMI XLR732 (8 core, 32 threads)
- Linux 2.6.27-rc9
While porting from 2.6.21 to 2.6.27, the kernel booted as far as mounting the filesystem, but running /sbin/init produced a "Bad pte = ..." error. The log:
0:<5>0:Linux version 2.6.27-rc5-00016-g5874155-dirty (comcat@Pek) (gcc version 4.1.2 (Wind River Linux Sourcery G++ 4.1-91)) #132 SMP Thu Oct 16 23:07:24 CST 2008
0:Initializing message ring for cpu_0
0:<6>CPU revision is: 000c0b04 (RMI Phoenix)
0:Checking for the multiply/shift bug... 0:no.
0:Checking for the daddiu bug... 0:no.
0:<6>Determined physical RAM map:
0: memory: 000000000b700000 @ 0000000000100000 0:(usable)
0: memory: 0000000004000000 @ 000000000c000000 0:(usable)
0: memory: 00000000a0000000 @ 0000000020000000 0:(usable)
0: memory: 0000000048000000 @ 00000000e0000000 0:(usable)
0:Wasting 14336 bytes for tracking 256 unused pages
0:<6>Initrd not found or empty0: - disabling initrd
0:Zone PFN ranges:
0: Normal 0x00000100 -> 0x00128000
0:Movable zone start PFN for each node
0:early_node_map[4] active PFN ranges
0: 0: 0x00000100 -> 0x0000b800
0: 0: 0x0000c000 -> 0x00010000
0: 0: 0x00020000 -> 0x000c0000
0: 0: 0x000e0000 -> 0x00128000
0:<7>On node 0 totalpages: 1013504
0:<7>free_area_init_node: node 0, pgdat ffffffff835f9c20, node_mem_map a800000003c8a800
0:<7> Normal zone: 996931 pages, LIFO batch:31
0:(PROM) CPU present map: ffffffff
0:Phys CPU present map: ffffffff, possible map ffffffff
0:Detected 32 Slave CPU(s)
0:Built 1 zonelists in Zone order, mobility grouping on. Total pages: 996931
0:<5>Kernel command line: ip=dhcp root=/dev/ram0 rw init=/bin/busybox console=ttyS0,38400 rdinit=/sbin/init
0:Primary instruction cache 32kB, 8-way, linesize 32 bytes.
0:Primary data cache 32kB 8-way, linesize 32 bytes.
0:Wrote TLB load handler fastpath (55 instructions).
0:Wrote TLB store handler fastpath (55 instructions).
0:Wrote TLB modify handler fastpath (54 instructions).
0:PID hash table entries: 4096 (order: 12, 32768 bytes)
0:<6>Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
0:<6>Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
0:<6>Memory: 3919360k/4054016k available (39k kernel code, 134032k reserved, 2035k data, 6160k init, 0k highmem)
0:<6>Calibrating delay loop... 0:267.26 BogoMIPS (lpj=534528)
0:Mount-cache hash table entries: 256
0:Checking for the daddi bug... 0:no.
0:<6>Brought up 32 CPUs
0:<6>msgmni has been set to 7656
0:<6>io scheduler noop registered
0:<6>io scheduler anticipatory registered
0:<6>io scheduler deadline registered
0:<6>io scheduler cfq registered (default)
0:Registered phoenix msgring driver: major=244
0:<6>Serial: 8250/16550 driver2 ports, IRQ sharing disabled
0:<6>serial8250: ttyS0 at MMIO 0xffffffffbef14000 (irq = 17) is a 16550A
0:<6>brd: module loaded
0:<6>loop: module loaded
0:<6>Freeing unused kernel memory: 6160k freed
1:<3>Bad pte = 03607114, process = ???, vm_flags = 100177, vaddr = ffffffffed
1:Call Trace:
1:[<ffffffff83402a70>] handle_sys64+0xf0/0x10c
1:<4>Failed to execute /sbin/init
2:<3>Bad pte = 0360c114, process = ???, vm_flags = 100177, vaddr = ffffffffeb
2:Call Trace:
2:[<ffffffff83402a70>] handle_sys64+0xf0/0x10c
2:<4>Failed to execute /bin/busybox. Attempting defaults...
3:<3>Bad pte = 03b4b114, process = ???, vm_flags = 100177, vaddr = ffffffffed
3:Call Trace:
3:[<ffffffff83402a70>] handle_sys64+0xf0/0x10c
4:<3>Bad pte = 03c7a114, process = ???, vm_flags = 100177, vaddr = fffffffff0
4:Call Trace:
4:[<ffffffff83402a70>] handle_sys64+0xf0/0x10c
4:<0>Kernel panic - not syncing: No init found. Try passing init= option to kernel.
4:<0>Rebooting in 5 seconds..
2 Analysis
Because the dumped stack was incomplete, the first step was to add a printk in handle_sys64 to print v0 (the syscall number), which turned out to be sys_execve. I then sprinkled printk calls all over do_execve(), and after some tracing found the failing call chain:

do_execve() --> copy_strings_kernel() --> copy_strings() --> get_arg_page() --> get_user_pages() --> handle_mm_fault() --> handle_pte_fault() --> do_nonlinear_fault() --> print_bad_pte()
"Bad pte = ..." 信息即由 print_bad_pte() 输出。
In do_nonlinear_fault(), the condition under which print_bad_pte() is called is:
if (unlikely(!(vma->vm_flags & VM_NONLINEAR) ||
	     !(vma->vm_flags & VM_CAN_NONLINEAR))) {
	/*
	 * Page table corrupted: show pte and kill process.
	 */
	print_bad_pte(vma, orig_pte, address);
	return VM_FAULT_OOM;
}
Naturally, the focus shifted to why vma->vm_flags had neither VM_NONLINEAR nor VM_CAN_NONLINEAR set. As an experiment, I forcibly set both bits in do_execve; after a reboot the pte errors were gone, but do_execve() still failed. So I abandoned that brute-force hack and backtracked up the call chain.
In handle_pte_fault(), the condition for entering do_nonlinear_fault() is:

entry = *pte;
if (!pte_present(entry)) {
	if (pte_none(entry)) {
		if (vma->vm_ops) {
			if (likely(vma->vm_ops->fault))
				return do_linear_fault(mm, vma, address,
						pte, pmd, write_access, entry);
		}
		return do_anonymous_page(mm, vma, address,
					 pte, pmd, write_access);
	}
	if (pte_file(entry))
		return do_nonlinear_fault(mm, vma, address,
					  pte, pmd, write_access, entry);
	return do_swap_page(mm, vma, address,
			    pte, pmd, write_access, entry);
}
That is, do_nonlinear_fault() is entered only when the PTE entry for address does not have _PAGE_PRESENT set (!pte_present(entry)) but does have _PAGE_FILE set (pte_file(entry)).