On 07/18/2012 11:32 AM, David Sommerseth wrote:
> On 18/07/12 18:35, Orion Poplawski wrote:
>> On 07/17/2012 11:22 AM, Orion Poplawski wrote:
>>> Our SL6.2 KVM and nfs/backup server has been crashing frequently recently
>>> (starting around Fri 13th - yikes!) with Kernel panic - Out of memory and no
>>> killable processes. The server has 48GB ram, 2GB swap, only about 15GB
>>> dedicated to VM guests. I've tried bumping up vm.min_free_kbytes to 262144 to
>>> no avail. Nothing strange is getting written to the logs before the crash.
>>>
>>> Happening with both 2.6.32-220.23.1 and 2.6.32-279.1.1.
>>>
>>> Anyone else seeing this? Any other ideas? I've set a serial console log to
>>> try to catch more information the next time it happens.
>>>
>>
>> here we go, see below. This makes no sense to me.
>>
>> [<ffffffff811edc5d>] ? amiga_partition+0x6d/0x460
> ^^^^^^^^^^^^^^^
> wtf!?! What kind of partition tables and file systems do you use? This
> OOM kill seems to be caused by the amiga partition table code in the
> kernel. It looks like it's some LVM command causing this to happen
> somehow, though.
Well I bet it's just scanning all partition types and:
/boot/config-2.6.32-220.23.1.el6.x86_64:CONFIG_AMIGA_PARTITION=y
/boot/config-2.6.32-279.1.1.el6.x86_64:CONFIG_AMIGA_PARTITION=y
>> 0 pages in swap cache
>> Swap cache stats: add 0, delete 0, find 0/0
>> Free swap = 0kB
>> Total swap = 0kB
>
> Now this is concerning ... you're out of swap, unless that's disabled.
I'm curious about this too. I have 8GB swap and top showed some being used.
>> 4224700 4224638 99% 1.00K 1056175 4 4224700K ext4_inode_cache
>
> This smells a bit bad ... ext4_inode_cache is using a lot of memory ...
>> 3257480 3257186 99% 0.19K 162874 20 651496K dentry
>> 1324786 1250981 94% 0.06K 22454 59 89816K size-64
>> 484128 484094 99% 0.02K 3362 144 13448K avtab_node
>> 347088 342539 98% 0.03K 3099 112 12396K size-32
>> 342580 324110 94% 0.55K 48940 7 195760K radix_tree_node
>> 236059 235736 99% 0.06K 4001 59 16004K ksm_rmap_item
>> 123980 123566 99% 0.19K 6199 20 24796K size-192
>> 105630 47803 45% 0.12K 3521 30 14084K size-128
>> 24300 24261 99% 0.14K 900 27 3600K sysfs_dir_cache
>> 17402 15599 89% 0.05K 226 77 904K anon_vma_chain
>> 16055 14874 92% 0.20K 845 19 3380K vm_area_struct
>> 9844 8471 86% 0.04K 107 92 428K anon_vma
>> 8952 8775 98% 0.58K 1492 6 5968K inode_cache
>> 7518 5829 77% 0.62K 1253 6 5012K proc_inode_cache
>> 6840 4692 68% 0.19K 342 20 1368K filp
>> 5888 5532 93% 0.04K 64 92 256K dm_io
>>
>>
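For what it's worth, here is a rough sketch for totalling the biggest entries in the slab listing above. It assumes the usual slabtop column order (OBJS, ACTIVE, USE, OBJ SIZE, SLABS, OBJ/SLAB, CACHE SIZE, NAME) and that CACHE SIZE is in KiB; the function name is made up for illustration.

```python
def parse_slabtop_line(line):
    """Return (cache_name, cache_size_kib) for one slabtop row, or None.

    Assumes eight whitespace-separated columns ending in CACHE SIZE
    (with a trailing 'K') and NAME. Leading '>' quote markers are dropped.
    """
    fields = line.lstrip("> ").split()
    if len(fields) != 8 or not fields[6].endswith("K"):
        return None
    return fields[7], int(fields[6][:-1])


# Three of the largest rows, copied from the listing above.
SAMPLE = """\
>> 4224700 4224638 99% 1.00K 1056175 4 4224700K ext4_inode_cache
>> 3257480 3257186 99% 0.19K 162874 20 651496K dentry
>> 342580 324110 94% 0.55K 48940 7 195760K radix_tree_node
"""

caches = dict(
    row for row in map(parse_slabtop_line, SAMPLE.splitlines()) if row
)
total_kib = sum(caches.values())
print(f"ext4_inode_cache: {caches['ext4_inode_cache'] / 1024**2:.1f} GiB")
print(f"top three caches: {total_kib / 1024**2:.1f} GiB")
```

On these numbers ext4_inode_cache alone is about 4 GiB, which is large but nowhere near the 48GB in the box.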
>> top - 10:10:02 up 22:34, 4 users, load average: 1.02, 1.15, 1.53
>> Tasks: 888 total, 1 running, 887 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 0.8%us, 1.2%sy, 0.0%ni, 97.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem: 49421492k total, 43619512k used, 5801980k free, 4409144k buffers
>> Swap: 8388600k total, 16308k used, 8372292k free, 25837164k cached
>
> Somehow, this doesn't reflect what the kernel complains about when the
> OOM killer starts its mission.
That's bugging me too.
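To put a number on it, here is a small sketch of the usual "used minus buffers minus cached" arithmetic on the top output above. The values are copied from the Mem: and Swap: lines; treating top's "k" as KiB is my assumption, and the function name is only for illustration.

```python
def used_excluding_cache_kib(used, buffers, cached):
    """Memory in use once buffers and page cache are subtracted (KiB)."""
    return used - buffers - cached


# Values from the top output quoted above.
MEM_USED = 43619512
BUFFERS = 4409144
CACHED = 25837164

app_kib = used_excluding_cache_kib(MEM_USED, BUFFERS, CACHED)
print(f"{app_kib} KiB = {app_kib / 1024**2:.1f} GiB")
# prints "13373204 KiB = 12.8 GiB"
```

So at that moment only about 13 GiB was in use outside buffers and cache, which is in line with the roughly 15GB given to the guests.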
> I see that you're using kernel-2.6.32-279.1.1.el6.x86_64 ... that
> smells a bit like a SL 6.3 Beta ... is that right? As SL 6.2 is usually
> around 2.6.32-220-something. I would probably recommend you to try a
> 6.2 kernel if you're running something much more bleeding edge.
I was running 220.23.1 and it was crashing, so I tried the newer one to see if
that helped. I think I'll back off now.
> And it somehow seems to be related to some file system issues ... at
> least from what I can see. Could be a buggy kernel which leaks memory,
> somewhere in either the partition table code or ext4 code paths.
One possibility perhaps. The machine comes up doing a md sync:
md1 : active raid10 sdh1[4] sdb2[0] sda2[1] sdd1[2] sde1[7] sdf1[6] sdc1[3] sdg1[5]
      3906203648 blocks 256K chunks 2 near-copies [8/8] [UUUUUUUU]
      [=>...................]  resync =  9.2% (362126208/3906203648) finish=1334.6min speed=44255K/sec
I wonder if, when that completes, some kind of LVM device scan is triggered
which causes the problem. I'm not sure what fires off an LVM process in the
first place.
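As a sanity check on when the resync should finish, the mdstat figures above can be reproduced with a little arithmetic. This is only a sketch: it assumes the block counts are 1 KiB blocks and that the reported speed stays constant, and the function name is made up.

```python
def resync_estimate(done_kib, total_kib, speed_kib_per_s):
    """Return (percent_done, minutes_remaining) at a constant resync speed."""
    percent = 100.0 * done_kib / total_kib
    minutes = (total_kib - done_kib) / speed_kib_per_s / 60.0
    return percent, minutes


# Figures from the md1 resync line above.
percent, minutes = resync_estimate(362126208, 3906203648, 44255)
print(f"{percent:.2f}% done, about {minutes / 60:.1f} hours to go")
```

That comes out at roughly 22 hours, matching the finish=1334.6min that md reports to within rounding, so if the crash is tied to the end of the resync it should show up about a day after boot.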
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder Office FAX: 303-415-9702
3380 Mitchell Lane
Boulder, CO 80301 http://www.nwra.com