SCIENTIFIC-LINUX-USERS Archives

July 2012

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From: David Sommerseth <[log in to unmask]>
Reply To:
Date: Wed, 18 Jul 2012 19:32:24 +0200
Content-Type: text/plain
Parts/Attachments: text/plain (513 lines)

On 18/07/12 18:35, Orion Poplawski wrote:
> On 07/17/2012 11:22 AM, Orion Poplawski wrote:
>> Our SL6.2 KVM and nfs/backup server has been crashing frequently recently
>> (starting around Fri 13th - yikes!) with Kernel panic - Out of memory
>> and no
>> killable processes.  The server has 48GB ram, 2GB swap, only about 15GB
>> dedicated to VM guests.  I've tried bumping up vm.min_free_kbytes to
>> 262144 to
>> no avail.  Nothing strange is getting written to the logs before the
>> crash.
>>
>> Happening with both 2.6.32-220.23.1 and 2.6.32-279.1.1.
>>
>> Anyone else seeing this?  Any other ideas?  I've set a serial console
>> log to
>> try to catch more information the next time it happens.
>>
> 
> here we go, see below.  This makes no sense to me.
> 
> 
> lvm invoked oom-killer: gfp_mask=0x201d0, order=0, oom_adj=0,
> oom_score_adj=0
> lvm cpuset=/ mems_allowed=0
> Pid: 3400, comm: lvm Not tainted 2.6.32-279.1.1.el6.x86_64 #1
> Call Trace:
>  [<ffffffff810c4981>] ? cpuset_print_task_mems_allowed+0x91/0xb0
>  [<ffffffff811170f0>] ? dump_header+0x90/0x1b0
>  [<ffffffff8121470c>] ? security_real_capable_noaudit+0x3c/0x70
>  [<ffffffff81117572>] ? oom_kill_process+0x82/0x2a0
>  [<ffffffff811174b1>] ? select_bad_process+0xe1/0x120
>  [<ffffffff811179b0>] ? out_of_memory+0x220/0x3c0
>  [<ffffffff811b3380>] ? blkdev_get_block+0x0/0x70
>  [<ffffffff811276ce>] ? __alloc_pages_nodemask+0x89e/0x940
>  [<ffffffff8115c1ea>] ? alloc_pages_current+0xaa/0x110
>  [<ffffffff811144f7>] ? __page_cache_alloc+0x87/0x90
>  [<ffffffff81113ede>] ? find_get_page+0x1e/0xa0
>  [<ffffffff8111606b>] ? do_read_cache_page+0x4b/0x180
>  [<ffffffff811b4330>] ? blkdev_readpage+0x0/0x20
>  [<ffffffff811161e9>] ? read_cache_page_async+0x19/0x20
>  [<ffffffff811161fe>] ? read_cache_page+0xe/0x20
>  [<ffffffff811ecaa0>] ? read_dev_sector+0x30/0xa0
>  [<ffffffff811edc5d>] ? amiga_partition+0x6d/0x460
                          ^^^^^^^^^^^^^^^
wtf!?!  What kind of partition tables and file systems do you use?  This
OOM kill seems to be caused by the amiga partition table code in the
kernel.  It looks like it's some LVM command causing this to happen
somehow, though.

(more coming lower down)

>  [<ffffffff811161e9>] ? read_cache_page_async+0x19/0x20
>  [<ffffffff811ecaa0>] ? read_dev_sector+0x30/0xa0
>  [<ffffffff811ef1ac>] ? osf_partition+0x6c/0x120
>  [<ffffffff811ed7d7>] ? rescan_partitions+0x1a7/0x470
>  [<ffffffff811b4ab6>] ? __blkdev_get+0x1b6/0x3c0
>  [<ffffffff811b4ce0>] ? blkdev_open+0x0/0xc0
>  [<ffffffff811b4cd0>] ? blkdev_get+0x10/0x20
>  [<ffffffff811b4d51>] ? blkdev_open+0x71/0xc0
>  [<ffffffff8117889a>] ? __dentry_open+0x10a/0x360
>  [<ffffffff8121c272>] ? selinux_inode_permission+0x72/0xb0
>  [<ffffffff812142af>] ? security_inode_permission+0x1f/0x30
>  [<ffffffff81178c04>] ? nameidata_to_filp+0x54/0x70
>  [<ffffffff8118c110>] ? do_filp_open+0x6c0/0xd60
>  [<ffffffff81198192>] ? alloc_fd+0x92/0x160
>  [<ffffffff81178649>] ? do_sys_open+0x69/0x140
>  [<ffffffff81178760>] ? sys_open+0x20/0x30
>  [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
> Mem-Info:
> Node 0 DMA per-cpu:
> CPU    0: hi:    0, btch:   1 usd:   0
> Node 0 DMA32 per-cpu:
> CPU    0: hi:   42, btch:   7 usd:  23
> active_anon:49 inactive_anon:97 isolated_anon:0
>  active_file:0 inactive_file:0 isolated_file:0
>  unevictable:3846 dirty:0 writeback:0 unstable:0
>  free:412 slab_reclaimable:1194 slab_unreclaimable:5681
>  mapped:356 shmem:0 pagetables:31 bounce:0
> Node 0 DMA free:224kB min:0kB low:0kB high:0kB active_anon:0kB
> inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB present:328kB mlocked:0kB
> dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
> slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> lowmem_reserve[]: 0 125 125 125
> Node 0 DMA32 free:1424kB min:1428kB low:1784kB high:2140kB
> active_anon:196kB inactive_anon:388kB active_file:0kB inactive_file:0kB
> unevictable:15384kB isolated(anon):0kB isolated(file):0kB
> present:128256kB mlocked:0kB dirty:0kB writeback:0kB mapped:1424kB
> shmem:0kB slab_reclaimable:4776kB slab_unreclaimable:22724kB
> kernel_stack:600kB pagetables:124kB unstable:0kB bounce:0kB
> writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 0*4kB 2*8kB 1*16kB 2*32kB 2*64kB 0*128kB 0*256kB 0*512kB
> 0*1024kB 0*2048kB 0*4096kB = 224kB
> Node 0 DMA32: 0*4kB 2*8kB 2*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB
> 1*1024kB 0*2048kB 0*4096kB = 1424kB
> 3846 total pagecache pages
> 0 pages in swap cache
> Swap cache stats: add 0, delete 0, find 0/0
> Free swap  = 0kB
> Total swap = 0kB

Now this is concerning ... there is no swap available at all here
(Total swap = 0kB), unless that's disabled.

> 45035 pages RAM
> 16585 pages reserved
> 359 pages shared
> 23771 pages non-shared
> [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> [ 3400]     0  3400     5650      439   0       0             0 lvm
> Out of memory: Kill process 3400 (lvm) score 1 or sacrifice child
> Killed process 3400, UID 0, (lvm) total-vm:22600kB, anon-rss:528kB,
> file-rss:1228kB
>  unknown partition table
> KILL

Now, here LVM got killed in the middle of doing some partition table
checks (found above) ... and it complains about some unknown partition
tables as well.
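
The partition probes in the first trace can be picked out mechanically.
What follows is only a minimal Python sketch, assuming the trace lines are
pasted in as plain text; the helper name partition_probes is made up for
illustration, and the sample lines are copied from the trace above.

```python
import re

# A few lines copied from the first call trace quoted above.
TRACE = """\
 [<ffffffff811161fe>] ? read_cache_page+0xe/0x20
 [<ffffffff811ecaa0>] ? read_dev_sector+0x30/0xa0
 [<ffffffff811edc5d>] ? amiga_partition+0x6d/0x460
 [<ffffffff811ef1ac>] ? osf_partition+0x6c/0x120
 [<ffffffff811ed7d7>] ? rescan_partitions+0x1a7/0x470
"""

# Matches "[<address>] ? symbol+0xOFFSET/0xSIZE" and captures the symbol.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+\??\s*([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def partition_probes(trace):
    """Return the symbols in the trace whose name ends in '_partition'."""
    symbols = FRAME.findall(trace)
    return [name for name in symbols if name.endswith("_partition")]


print(partition_probes(TRACE))  # ['amiga_partition', 'osf_partition']
```

Note that every frame in the trace is prefixed with "?", which the kernel
uses for stack entries it could not verify, so a list like this is a hint
about where to look rather than proof of which parser was running.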

> Activating logicinit invoked oom-killer: gfp_mask=0x84d0, order=0,
> oom_adj=0, oom_score_adj=0
> al volumes
> init cpuset=/ mems_allowed=0
> Pid: 3401, comm: init Not tainted 2.6.32-279.1.1.el6.x86_64 #1
> Call Trace:
>  [<ffffffff810c4981>] ? cpuset_print_task_mems_allowed+0x91/0xb0
>  [<ffffffff811170f0>] ? dump_header+0x90/0x1b0
>  [<ffffffff8121470c>] ? security_real_capable_noaudit+0x3c/0x70
>  [<ffffffff81117572>] ? oom_kill_process+0x82/0x2a0
>  [<ffffffff811174b1>] ? select_bad_process+0xe1/0x120
>  [<ffffffff811179b0>] ? out_of_memory+0x220/0x3c0
>  [<ffffffff811276ce>] ? __alloc_pages_nodemask+0x89e/0x940
>  [<ffffffff8115c1ea>] ? alloc_pages_current+0xaa/0x110
>  [<ffffffff81048aab>] ? pte_alloc_one+0x1b/0x50
>  [<ffffffff8113af22>] ? __pte_alloc+0x32/0x160
>  [<ffffffff8113fd79>] ? handle_mm_fault+0x149/0x2b0
>  [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
>  [<ffffffff8150327e>] ? do_page_fault+0x3e/0xa0
>  [<ffffffff81500635>] ? page_fault+0x25/0x30

Here /sbin/init fails to allocate memory too.

> Mem-Info:
> Node 0 DMA per-cpu:
> CPU    0: hi:    0, btch:   1 usd:   0
> Node 0 DMA32 per-cpu:
> CPU    0: hi:   42, btch:   7 usd:  11
> active_anon:1 inactive_anon:16 isolated_anon:0
>  active_file:1 inactive_file:0 isolated_file:0
>  unevictable:3846 dirty:0 writeback:0 unstable:0
>  free:410 slab_reclaimable:1192 slab_unreclaimable:5685
>  mapped:50 shmem:1 pagetables:6 bounce:0
> Node 0 DMA free:224kB min:0kB low:0kB high:0kB active_anon:0kB
> inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB present:328kB mlocked:0kB
> dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
> slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> lowmem_reserve[]: 0 125 125 125
> Node 0 DMA32 free:1416kB min:1428kB low:1784kB high:2140kB
> active_anon:4kB inactive_anon:64kB active_file:4kB inactive_file:0kB
> unevictable:15384kB isolated(anon):0kB isolated(file):0kB
> present:128256kB mlocked:0kB dirty:0kB writeback:0kB mapped:200kB
> shmem:4kB slab_reclaimable:4768kB slab_unreclaimable:22740kB
> kernel_stack:592kB pagetables:24kB unstable:0kB bounce:0kB
> writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 0*4kB 2*8kB 1*16kB 2*32kB 2*64kB 0*128kB 0*256kB 0*512kB
> 0*1024kB 0*2048kB 0*4096kB = 224kB
> Node 0 DMA32: 0*4kB 2*8kB 0*16kB 0*32kB 0*64kB 1*128kB 1*256kB 0*512kB
> 1*1024kB 0*2048kB 0*4096kB = 1424kB
> 3848 total pagecache pages
> 0 pages in swap cache
> Swap cache stats: add 0, delete 0, find 0/0
> Free swap  = 0kB
> Total swap = 0kB
> 45035 pages RAM
> 16585 pages reserved
> 65 pages shared
> 24077 pages non-shared
> [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> [ 3401]     0  3401      288       13   0       0             0 init
> Out of memory: Kill process 3401 (init) score 1 or sacrifice child
> Killed process 3401, UID 0, (init) total-vm:1152kB, anon-rss:52kB,
> file-rss:0kB
> KILL
> init invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0,
> oom_score_adj=0

And the OOM killer kills /sbin/init ... that's bad, that's really bad.

> init cpuset=/ mems_allowed=0
> Pid: 1, comm: init Not tainted 2.6.32-279.1.1.el6.x86_64 #1
> Call Trace:
>  [<ffffffff810c4981>] ? cpuset_print_task_mems_allowed+0x91/0xb0
>  [<ffffffff811170f0>] ? dump_header+0x90/0x1b0
>  [<ffffffff8121470c>] ? security_real_capable_noaudit+0x3c/0x70
>  [<ffffffff81117572>] ? oom_kill_process+0x82/0x2a0
>  [<ffffffff811174b1>] ? select_bad_process+0xe1/0x120
>  [<ffffffff811179b0>] ? out_of_memory+0x220/0x3c0
>  [<ffffffff811276ce>] ? __alloc_pages_nodemask+0x89e/0x940
>  [<ffffffff8115c2ea>] ? alloc_pages_vma+0x9a/0x150
>  [<ffffffff8113e3fd>] ? do_wp_page+0xfd/0x8d0
>  [<ffffffff8113f3ad>] ? handle_pte_fault+0x2cd/0xb50
>  [<ffffffff8113fe14>] ? handle_mm_fault+0x1e4/0x2b0
>  [<ffffffff81054a04>] ? check_preempt_wakeup+0x1a4/0x260
>  [<ffffffff810632c4>] ? enqueue_task_fair+0x64/0x100
>  [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
>  [<ffffffff81060a83>] ? wake_up_new_task+0xd3/0x120
>  [<ffffffff8106a873>] ? do_fork+0x133/0x460
>  [<ffffffff81198192>] ? alloc_fd+0x92/0x160
>  [<ffffffff81178407>] ? fd_install+0x47/0x90
>  [<ffffffff8150327e>] ? do_page_fault+0x3e/0xa0
>  [<ffffffff81500635>] ? page_fault+0x25/0x30
> Mem-Info:
> Node 0 DMA per-cpu:
> CPU    0: hi:    0, btch:   1 usd:   0
> Node 0 DMA32 per-cpu:
> CPU    0: hi:   42, btch:   7 usd:  12
> active_anon:4 inactive_anon:10 isolated_anon:0
>  active_file:1 inactive_file:0 isolated_file:0
>  unevictable:3846 dirty:0 writeback:0 unstable:0
>  free:413 slab_reclaimable:1192 slab_unreclaimable:5685
>  mapped:50 shmem:1 pagetables:6 bounce:0
> Node 0 DMA free:224kB min:0kB low:0kB high:0kB active_anon:0kB
> inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB present:328kB mlocked:0kB
> dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
> slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> lowmem_reserve[]: 0 125 125 125
> Node 0 DMA32 free:1428kB min:1428kB low:1784kB high:2140kB
> active_anon:16kB inactive_anon:40kB active_file:4kB inactive_file:0kB
> unevictable:15384kB isolated(anon):0kB isolated(file):0kB
> present:128256kB mlocked:0kB dirty:0kB writeback:0kB mapped:200kB
> shmem:4kB slab_reclaimable:4768kB slab_unreclaimable:22740kB
> kernel_stack:592kB pagetables:24kB unstable:0kB bounce:0kB
> writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 0*4kB 2*8kB 1*16kB 2*32kB 2*64kB 0*128kB 0*256kB 0*512kB
> 0*1024kB 0*2048kB 0*4096kB = 224kB
> Node 0 DMA32: 3*4kB 1*8kB 0*16kB 0*32kB 0*64kB 1*128kB 1*256kB 0*512kB
> 1*1024kB 0*2048kB 0*4096kB = 1428kB
> 3848 total pagecache pages
> 0 pages in swap cache
> Swap cache stats: add 0, delete 0, find 0/0
> Free swap  = 0kB
> Total swap = 0kB
> 45035 pages RAM
> 16585 pages reserved
> 66 pages shared
> 24074 pages non-shared
> [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> [ 3402]     0  3402      288       13   0       0             0 init
> Out of memory: Kill process 3402 (init) score 1 or sacrifice child

And again ...

> Killed process 3402, UID 0, (init) total-vm:1152kB, anon-rss:52kB,
> file-rss:0kB
> init invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0,
> oom_score_adj=0
> init cpuset=/ mems_allowed=0
> Pid: 1, comm: init Not tainted 2.6.32-279.1.1.el6.x86_64 #1
> Call Trace:
>  [<ffffffff810c4981>] ? cpuset_print_task_mems_allowed+0x91/0xb0
>  [<ffffffff811170f0>] ? dump_header+0x90/0x1b0
>  [<ffffffff8111746e>] ? select_bad_process+0x9e/0x120
>  [<ffffffff81117b0a>] ? out_of_memory+0x37a/0x3c0
>  [<ffffffff811276ce>] ? __alloc_pages_nodemask+0x89e/0x940
>  [<ffffffff8115c2ea>] ? alloc_pages_vma+0x9a/0x150
>  [<ffffffff8113e3fd>] ? do_wp_page+0xfd/0x8d0
>  [<ffffffff8113f3ad>] ? handle_pte_fault+0x2cd/0xb50
>  [<ffffffff8113fe14>] ? handle_mm_fault+0x1e4/0x2b0
>  [<ffffffff81054a04>] ? check_preempt_wakeup+0x1a4/0x260
>  [<ffffffff810632c4>] ? enqueue_task_fair+0x64/0x100
>  [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
>  [<ffffffff81060a83>] ? wake_up_new_task+0xd3/0x120
>  [<ffffffff8106a873>] ? do_fork+0x133/0x460
>  [<ffffffff81198192>] ? alloc_fd+0x92/0x160
>  [<ffffffff81178407>] ? fd_install+0x47/0x90
>  [<ffffffff8150327e>] ? do_page_fault+0x3e/0xa0
>  [<ffffffff81500635>] ? page_fault+0x25/0x30
> Mem-Info:
> Node 0 DMA per-cpu:
> CPU    0: hi:    0, btch:   1 usd:   0
> Node 0 DMA32 per-cpu:
> CPU    0: hi:   42, btch:   7 usd:  22
> active_anon:4 inactive_anon:10 isolated_anon:0
>  active_file:1 inactive_file:0 isolated_file:0
>  unevictable:3846 dirty:0 writeback:0 unstable:0
>  free:413 slab_reclaimable:1192 slab_unreclaimable:5685
>  mapped:50 shmem:1 pagetables:6 bounce:0
> Node 0 DMA free:224kB min:0kB low:0kB high:0kB active_anon:0kB
> inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB present:328kB mlocked:0kB
> dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
> slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> lowmem_reserve[]: 0 125 125 125
> Node 0 DMA32 free:1428kB min:1428kB low:1784kB high:2140kB
> active_anon:16kB inactive_anon:40kB active_file:4kB inactive_file:0kB
> unevictable:15384kB isolated(anon):0kB isolated(file):0kB
> present:128256kB mlocked:0kB dirty:0kB writeback:0kB mapped:200kB
> shmem:4kB slab_reclaimable:4768kB slab_unreclaimable:22740kB
> kernel_stack:592kB pagetables:24kB unstable:0kB bounce:0kB
> writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 0*4kB 2*8kB 1*16kB 2*32kB 2*64kB 0*128kB 0*256kB 0*512kB
> 0*1024kB 0*2048kB 0*4096kB = 224kB
> Node 0 DMA32: 3*4kB 2*8kB 0*16kB 0*32kB 0*64kB 1*128kB 1*256kB 0*512kB
> 1*1024kB 0*2048kB 0*4096kB = 1436kB
> 3848 total pagecache pages
> 0 pages in swap cache
> Swap cache stats: add 0, delete 0, find 0/0
> Free swap  = 0kB
> Total swap = 0kB
> 45035 pages RAM
> 16585 pages reserved
> 53 pages shared
> 24075 pages non-shared
> [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> Kernel panic - not syncing: Out of memory and no killable processes...
> 
> Pid: 1, comm: init Not tainted 2.6.32-279.1.1.el6.x86_64 #1
> Call Trace:
>  [<ffffffff814fd12a>] ? panic+0xa0/0x168
>  [<ffffffff8111716d>] ? dump_header+0x10d/0x1b0
>  [<ffffffff81117b1f>] ? out_of_memory+0x38f/0x3c0
>  [<ffffffff811276ce>] ? __alloc_pages_nodemask+0x89e/0x940
>  [<ffffffff8115c2ea>] ? alloc_pages_vma+0x9a/0x150
>  [<ffffffff8113e3fd>] ? do_wp_page+0xfd/0x8d0
>  [<ffffffff8113f3ad>] ? handle_pte_fault+0x2cd/0xb50
>  [<ffffffff8113fe14>] ? handle_mm_fault+0x1e4/0x2b0
>  [<ffffffff81054a04>] ? check_preempt_wakeup+0x1a4/0x260
>  [<ffffffff810632c4>] ? enqueue_task_fair+0x64/0x100
>  [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
>  [<ffffffff81060a83>] ? wake_up_new_task+0xd3/0x120
>  [<ffffffff8106a873>] ? do_fork+0x133/0x460
>  [<ffffffff81198192>] ? alloc_fd+0x92/0x160
>  [<ffffffff81178407>] ? fd_install+0x47/0x90
>  [<ffffffff8150327e>] ? do_page_fault+0x3e/0xa0
>  [<ffffffff81500635>] ? page_fault+0x25/0x30

And finally, the panic()

> Last monitor screens:
> 
> ATOP - saga                2012/07/18  10:09:56                -----P
>         10s elapsed
> PRC | sys    2.86s | user   3.46s  | #proc    889 | #tslpu     1 |
> #zombie 0  | #exit     98 |
> CPU | sys      25% | user     33%  | irq       1% | idle   1535% | wait
> 2%  | curscal  72% |
> cpu | sys       1% | user     16%  | irq       0% | idle     82% |
> cpu009 w 0%  | curscal 100% |
> cpu | sys       1% | user      8%  | irq       0% | idle     89% |
> cpu001 w 0%  | curscal  70% |
> cpu | sys       4% | user      2%  | irq       0% | idle     94% |
> cpu000 w 0%  | curscal  70% |
> cpu | sys       6% | user      1%  | irq       0% | idle     92% |
> cpu003 w 1%  | curscal  70% |
> cpu | sys       2% | user      1%  | irq       0% | idle     96% |
> cpu011 w 1%  | curscal  70% |
> cpu | sys       2% | user      1%  | irq       0% | idle     96% |
> cpu004 w 0%  | curscal  70% |
> cpu | sys       2% | user      1%  | irq       0% | idle     97% |
> cpu012 w 0%  | curscal  70% |
> cpu | sys       1% | user      1%  | irq       0% | idle     97% |
> cpu008 w 0%  | curscal  70% |
> cpu | sys       1% | user      1%  | irq       0% | idle     98% |
> cpu006 w 0%  | curscal  70% |
> cpu | sys       1% | user      1%  | irq       0% | idle     98% |
> cpu014 w 0%  | curscal  70% |
> cpu | sys       1% | user      0%  | irq       0% | idle     98% |
> cpu005 w 0%  | curscal  70% |
> cpu | sys       1% | user      0%  | irq       0% | idle     99% |
> cpu013 w 0%  | curscal  70% |
> cpu | sys       0% | user      0%  | irq       0% | idle     99% |
> cpu002 w 0%  | curscal  70% |
> cpu | sys       0% | user      0%  | irq       0% | idle     99% |
> cpu007 w 0%  | curscal  70% |
> cpu | sys       0% | user      0%  | irq       0% | idle    100% |
> cpu010 w 0%  | curscal  70% |
> cpu | sys       0% | user      0%  | irq       0% | idle    100% |
> cpu015 w 0%  | curscal  70% |
> CPL | avg1    1.02 | avg5    1.16  | avg15   1.53 | csw    67236 | intr
> 45803  | numcpu    16 |
> MEM | tot    47.1G | free    5.5G  | cache  24.6G | dirty   0.2M | buff
> 4.2G  | slab    6.1G |
> SWP | tot     8.0G | free    8.0G  |              |              | vmcom
> 16.6G  | vmlim  31.6G |
> LVM | abbix--disk0 | busy     13%  | read       0 | write    299 | MBw/s
> 0.74  | avio 4.29 ms |
> LVM |  vg_root-var | busy     11%  | read       0 | write    349 | MBw/s
> 0.13  | avio 3.15 ms |
> LVM | pute1--disk0 | busy      0%  | read       0 | write      4 | MBw/s
> 0.00  | avio 0.50 ms |
> MDD |          md1 | busy      0%  | read       0 | write    722 | MBw/s
> 0.87  | avio 0.00 ms |
> DSK |          sdc | busy     34%  | read     393 | write    271 | MBw/s
> 0.26  | avio 5.13 ms |
> DSK |          sdf | busy     34%  | read     396 | write    268 | MBw/s
> 0.22  | avio 5.10 ms |
> DSK |          sde | busy     34%  | read     390 | write    265 | MBw/s
> 0.22  | avio 5.14 ms |
> DSK |          sdd | busy     33%  | read     389 | write    273 | MBw/s
> 0.26  | avio 5.03 ms |
> DSK |          sdh | busy     31%  | read     389 | write    238 | MBw/s
> 0.19  | avio 4.97 ms |
> DSK |          sdb | busy     30%  | read     396 | write    215 | MBw/s
> 0.20  | avio 4.83 ms |
> DSK |          sdg | busy     29%  | read     392 | write    238 | MBw/s
> 0.19  | avio 4.67 ms |
> DSK |          sda | busy     28%  | read     397 | write    214 | MBw/s
> 0.20  | avio 4.54 ms |
> NET | transport    | tcpi    1481  | tcpo     668 | udpi     147 | udpo
> 142  | tcpao      2 |
> NET | network      | ipi     1639  | ipo      820 | ipfrw      0 | deliv
> 1628  | icmpo      6 |
> NET | vnet1     0% | pcki     672  | pcko     701 | si   38 Kbps | so  
> 42 Kbps  | erro       0 |
> NET | eth0      0% | pcki    1923  | pcko    2769 | si  177 Kbps | so
> 2403 Kbps  | erro       0 |
> NET | vnet3     0% | pcki       3  | pcko      78 | si    0 Kbps | so   
> 5 Kbps  | erro       0 |
> NET | vnet0     0% | pcki       0  | pcko      75 | si    0 Kbps | so   
> 5 Kbps  | erro       0 |
> NET | vnet4     0% | pcki       0  | pcko      75 | si    0 Kbps | so   
> 5 Kbps  | erro       0 |
> NET | vnet2     0% | pcki       0  | pcko      75 | si    0 Kbps | so   
> 5 Kbps  | erro       0 |
> NET | eth1      0% | pcki      18  | pcko      12 | si    3 Kbps | so   
> 0 Kbps  | erro       0 |
> NET | br0     ---- | pcki    1681  | pcko     807 | si  140 Kbps | so
> 2293 Kbps  | erro       0 |
> NET | lo      ---- | pcki       6  | pcko       6 | si    0 Kbps | so   
> 0 Kbps  | erro       0 |
> Write failed: Broken pipe
> [root@orca ~]# SCPU  USRCPU   VGROW   RGROW  RDDSK   WRDSK ST  EXC  S
> CPUNR CPU  CMD        1/7
>  4137    4    0.10s   0.40s   1588K    512K     0K   7556K --    -  S   
> 10   5%  qemu-kvm
>  4548    2    0.14s   0.27s      0K      0K     0K      0K --    -  S   
> 12   4%  qemu-kvm
>  4299    9    0.05s   0.03s      0K      0K     0K     16K --    -  S   
> 14   1%  qemu-kvm
> 20921    5    0.01s   0.05s      0K      0K     0K      0K --    - 
> S     9  1%  qemu-kvm
>  4047    2    0.03s   0.02s      0K      0K     0K      0K --    -  S   
> 12   1%  qemu-kvm
> 
> 
> slabtop:
>  Active / Total Slabs (% used)      : 1599178 / 1599216 (100.0%)
>  Active / Total Caches (% used)     : 132 / 204 (64.7%)
>  Active / Total Size (% used)       : 6195130.43K / 6274921.86K (98.7%)
>  Minimum / Average / Maximum Object : 0.02K / 0.28K / 4096.00K
> 
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 7227321 6709793  92%    0.10K 195333       37    781332K buffer_head
> 4242650 4242336  99%    0.07K  80050       53    320200K
> selinux_inode_security
> 4224700 4224638  99%    1.00K 1056175        4   4224700K ext4_inode_cache

This smells a bit bad ... ext4_inode_cache is using a lot of memory
(about 4 GB, going by the CACHE SIZE column) ...

> 3257480 3257186  99%    0.19K 162874       20    651496K dentry
> 1324786 1250981  94%    0.06K  22454       59     89816K size-64
> 484128 484094  99%    0.02K   3362      144     13448K avtab_node
> 347088 342539  98%    0.03K   3099      112     12396K size-32
> 342580 324110  94%    0.55K  48940        7    195760K radix_tree_node
> 236059 235736  99%    0.06K   4001       59     16004K ksm_rmap_item
> 123980 123566  99%    0.19K   6199       20     24796K size-192
> 105630  47803  45%    0.12K   3521       30     14084K size-128
>  24300  24261  99%    0.14K    900       27      3600K sysfs_dir_cache
>  17402  15599  89%    0.05K    226       77       904K anon_vma_chain
>  16055  14874  92%    0.20K    845       19      3380K vm_area_struct
>   9844   8471  86%    0.04K    107       92       428K anon_vma
>   8952   8775  98%    0.58K   1492        6      5968K inode_cache
>   7518   5829  77%    0.62K   1253        6      5012K proc_inode_cache
>   6840   4692  68%    0.19K    342       20      1368K filp
>   5888   5532  93%    0.04K     64       92       256K dm_io
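
To put the slabtop numbers in proportion, here is a rough Python sketch that
converts the largest CACHE SIZE entries to GiB and to a share of the total
slab size. The values are copied from the output quoted above; treating
slabtop's "K" as 1024 bytes is an assumption, and the names CACHE_SIZE_K,
TOTAL_SLAB_K, to_gib and share_of_slab are made up for illustration.

```python
# CACHE SIZE values (in K) copied from the slabtop output quoted above.
CACHE_SIZE_K = {
    "ext4_inode_cache": 4224700,
    "buffer_head": 781332,
    "dentry": 651496,
    "selinux_inode_security": 320200,
    "radix_tree_node": 195760,
}

# Total from the "Active / Total Size" header line, in K.
TOTAL_SLAB_K = 6274921.86


def to_gib(kib):
    """Convert K to GiB, assuming 1 K = 1024 bytes."""
    return kib / (1024 * 1024)


def share_of_slab(name):
    """Fraction of the total slab size used by one cache."""
    return CACHE_SIZE_K[name] / TOTAL_SLAB_K


# Print the caches from largest to smallest.
for name, kib in sorted(CACHE_SIZE_K.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {to_gib(kib):6.2f} GiB  {share_of_slab(name):6.1%}")
```

By this arithmetic ext4_inode_cache alone accounts for roughly two thirds
of the slab total reported in the header.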
> 
> 
> top - 10:10:02 up 22:34,  4 users,  load average: 1.02, 1.15, 1.53
> Tasks: 888 total,   1 running, 887 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.8%us,  1.2%sy,  0.0%ni, 97.9%id,  0.1%wa,  0.0%hi,  0.0%si, 
> 0.0%st
> Mem:  49421492k total, 43619512k used,  5801980k free,  4409144k buffers
> Swap:  8388600k total,    16308k used,  8372292k free, 25837164k cached

Somehow, this doesn't reflect what the kernel complains about when the
OOM killer starts its mission.
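
The mismatch can be put into numbers. This is only a sketch in Python: the
figures are copied from the OOM report and the top output quoted above, and
the 4 kB page size is an assumption (the usual default on x86_64). The
variable and function names are made up for illustration.

```python
PAGE_KB = 4  # assumed page size in kB (usual x86_64 default)

# From the OOM report quoted above.
oom_pages_ram = 45035
oom_total_swap_kb = 0

# From the top output quoted above.
top_mem_total_kb = 49421492
top_swap_total_kb = 8388600


def kb_to_mib(kb):
    """Convert kB to MiB."""
    return kb / 1024


# RAM as seen in the OOM report, converted from pages to kB.
oom_ram_kb = oom_pages_ram * PAGE_KB

print(f"OOM report RAM : {kb_to_mib(oom_ram_kb):10.1f} MiB")
print(f"top RAM total  : {kb_to_mib(top_mem_total_kb):10.1f} MiB")
print(f"ratio          : {top_mem_total_kb / oom_ram_kb:10.1f}x")
print(f"swap (OOM/top) : {oom_total_swap_kb} kB / {top_swap_total_kb} kB")
```

Under that assumption the OOM report describes about 176 MiB of RAM and no
swap, while top shows about 47 GiB of RAM and 8 GiB of swap.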

I see that you're using kernel-2.6.32-279.1.1.el6.x86_64 ... that
smells a bit like an SL 6.3 beta ... is that right?  SL 6.2 is usually
around 2.6.32-220-something.  I would probably recommend trying a
6.2 kernel if you're running something much more bleeding edge.

And it somehow seems to be related to some file system issues ... at
least from what I can see.  It could be a buggy kernel which leaks memory
somewhere in either the partition table code or the ext4 code paths.

Not sure I'm able to provide any better clues right now.


kind regards,

David Sommerseth
