Bug Hunting Diary 0009: Bad pte = 03607114, process = ???, vm_flags = 100177, vaddr = ffffffffed
1 Phenomenon
Environment:
- RMI XLR732 (8 core, 32 threads)
- Linux 2.6.27-rc9
While porting from 2.6.21 to 2.6.27, the kernel booted as far as mounting the filesystem, but running /sbin/init produced a "Bad pte = ..." error. The log:
0:<5>0:Linux version 2.6.27-rc5-00016-g5874155-dirty (comcat@Pek) (gcc version 4.1.2 (Wind River Linux Sourcery G++ 4.1-91)) #132 SMP Thu Oct 16 23:07:24 CST 2008
0:Initializing message ring for cpu_0
0:<6>CPU revision is: 000c0b04 (RMI Phoenix)
0:Checking for the multiply/shift bug... 0:no.
0:Checking for the daddiu bug... 0:no.
0:<6>Determined physical RAM map:
0: memory: 000000000b700000 @ 0000000000100000 0:(usable)
0: memory: 0000000004000000 @ 000000000c000000 0:(usable)
0: memory: 00000000a0000000 @ 0000000020000000 0:(usable)
0: memory: 0000000048000000 @ 00000000e0000000 0:(usable)
0:Wasting 14336 bytes for tracking 256 unused pages
0:<6>Initrd not found or empty0: - disabling initrd
0:Zone PFN ranges:
0: Normal 0x00000100 -> 0x00128000
0:Movable zone start PFN for each node
0:early_node_map[4] active PFN ranges
0: 0: 0x00000100 -> 0x0000b800
0: 0: 0x0000c000 -> 0x00010000
0: 0: 0x00020000 -> 0x000c0000
0: 0: 0x000e0000 -> 0x00128000
0:<7>On node 0 totalpages: 1013504
0:<7>free_area_init_node: node 0, pgdat ffffffff835f9c20, node_mem_map a800000003c8a800
0:<7> Normal zone: 996931 pages, LIFO batch:31
0:(PROM) CPU present map: ffffffff
0:Phys CPU present map: ffffffff, possible map ffffffff
0:Detected 32 Slave CPU(s)
0:Built 1 zonelists in Zone order, mobility grouping on. Total pages: 996931
0:<5>Kernel command line: ip=dhcp root=/dev/ram0 rw init=/bin/busybox console=ttyS0,38400 rdinit=/sbin/init
0:Primary instruction cache 32kB, 8-way, linesize 32 bytes.
0:Primary data cache 32kB 8-way, linesize 32 bytes.
0:Wrote TLB load handler fastpath (55 instructions).
0:Wrote TLB store handler fastpath (55 instructions).
0:Wrote TLB modify handler fastpath (54 instructions).
0:PID hash table entries: 4096 (order: 12, 32768 bytes)
0:<6>Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
0:<6>Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
0:<6>Memory: 3919360k/4054016k available (39k kernel code, 134032k reserved, 2035k data, 6160k init, 0k highmem)
0:<6>Calibrating delay loop... 0:267.26 BogoMIPS (lpj=534528)
0:Mount-cache hash table entries: 256
0:Checking for the daddi bug... 0:no.
0:<6>Brought up 32 CPUs
0:<6>msgmni has been set to 7656
0:<6>io scheduler noop registered
0:<6>io scheduler anticipatory registered
0:<6>io scheduler deadline registered
0:<6>io scheduler cfq registered (default)
0:Registered phoenix msgring driver: major=244
0:<6>Serial: 8250/16550 driver2 ports, IRQ sharing disabled
0:<6>serial8250: ttyS0 at MMIO 0xffffffffbef14000 (irq = 17) is a 16550A
0:<6>brd: module loaded
0:<6>loop: module loaded
0:<6>Freeing unused kernel memory: 6160k freed
1:<3>Bad pte = 03607114, process = ???, vm_flags = 100177, vaddr = ffffffffed
1:Call Trace:
1:[<ffffffff83402a70>] handle_sys64+0xf0/0x10c
1:<4>Failed to execute /sbin/init
2:<3>Bad pte = 0360c114, process = ???, vm_flags = 100177, vaddr = ffffffffeb
2:Call Trace:
2:[<ffffffff83402a70>] handle_sys64+0xf0/0x10c
2:<4>Failed to execute /bin/busybox. Attempting defaults...
3:<3>Bad pte = 03b4b114, process = ???, vm_flags = 100177, vaddr = ffffffffed
3:Call Trace:
3:[<ffffffff83402a70>] handle_sys64+0xf0/0x10c
4:<3>Bad pte = 03c7a114, process = ???, vm_flags = 100177, vaddr = fffffffff0
4:Call Trace:
4:[<ffffffff83402a70>] handle_sys64+0xf0/0x10c
4:<0>Kernel panic - not syncing: No init found. Try passing init= option to kernel.
4:<0>Rebooting in 5 seconds..
2 Analysis
Because the dumped stack was incomplete, the first step was to add a printk in handle_sys64 to print v0 (the syscall number), which turned out to be sys_execve. I then sprinkled printk calls all over do_execve(), and after some tracing found the failing call chain:

do_execve() --> copy_strings_kernel() --> copy_strings() --> get_arg_page() --> get_user_pages() --> handle_mm_fault() --> handle_pte_fault() --> do_nonlinear_fault() --> print_bad_pte()
"Bad pte = ..." 信息即由 print_bad_pte() 输出。
In do_nonlinear_fault(), the condition under which print_bad_pte() is called is:
if (unlikely(!(vma->vm_flags & VM_NONLINEAR) ||
	     !(vma->vm_flags & VM_CAN_NONLINEAR))) {
	/*
	 * Page table corrupted: show pte and kill process.
	 */
	print_bad_pte(vma, orig_pte, address);
	return VM_FAULT_OOM;
}
Naturally, the focus shifted to why vma->vm_flags had neither VM_NONLINEAR nor VM_CAN_NONLINEAR set. As an experiment, I forcibly set both bits in do_execve; after a reboot the pte errors were gone, but do_execve() still failed. So I abandoned that brute-force hack and backtracked up the call chain.
In handle_pte_fault(), the condition for entering do_nonlinear_fault() is:

entry = *pte;
if (!pte_present(entry)) {
	if (pte_none(entry)) {
		if (vma->vm_ops) {
			if (likely(vma->vm_ops->fault))
				return do_linear_fault(mm, vma, address,
						pte, pmd, write_access, entry);
		}
		return do_anonymous_page(mm, vma, address,
					 pte, pmd, write_access);
	}
	if (pte_file(entry))
		return do_nonlinear_fault(mm, vma, address,
					  pte, pmd, write_access, entry);
	return do_swap_page(mm, vma, address,
			    pte, pmd, write_access, entry);
}
That is, do_nonlinear_fault() is entered only when the PTE entry for address does not have _PAGE_PRESENT set (!pte_present(entry)) but does have _PAGE_FILE set (pte_file(entry)).