LISTSERV - SCIENTIFIC-LINUX-USERS Archives

On 07/18/2012 11:32 AM, David Sommerseth wrote:
> On 18/07/12 18:35, Orion Poplawski wrote:
>> On 07/17/2012 11:22 AM, Orion Poplawski wrote:
>>> Our SL6.2 KVM and nfs/backup server has been crashing frequently recently
>>> (starting around Fri 13th - yikes!) with Kernel panic - Out of memory and
>>> no killable processes.  The server has 48GB ram, 2GB swap, only about 15GB
>>> dedicated to VM guests.  I've tried bumping up vm.min_free_kbytes to 262144
>>> to no avail.  Nothing strange is getting written to the logs before the
>>> crash.
>>>
>>> Happening with both 2.6.32-220.23.1 and 2.6.32-279.1.1.
>>>
>>> Anyone else seeing this?  Any other ideas?  I've set up a serial console
>>> log to try to catch more information the next time it happens.
>>>
>>
>> Here we go, see below.  This makes no sense to me.
>>
>>   [<ffffffff811edc5d>] ? amiga_partition+0x6d/0x460
>                            ^^^^^^^^^^^^^^^
> wtf!?!  What kind of partition tables and file systems do you use?  This
> OOM kill seems to be caused by the amiga partition table code in the
> kernel.  It looks like it's some LVM command causing this to happen
> somehow, though.

Well I bet it's just scanning all partition types and:

/boot/config-2.6.32-220.23.1.el6.x86_64:CONFIG_AMIGA_PARTITION=y
/boot/config-2.6.32-279.1.1.el6.x86_64:CONFIG_AMIGA_PARTITION=y
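
To check which partition-table parsers a given kernel was built with, a minimal Python sketch along these lines could scan the config text.  The function name and the sample text are made up for illustration; only the CONFIG_AMIGA_PARTITION=y line is taken from the configs above.

```python
import re

def enabled_partition_parsers(config_text):
    """Return the CONFIG_*_PARTITION options set to 'y' in kernel config text."""
    pattern = re.compile(r"^(CONFIG_\w+_PARTITION)=y\s*$", re.MULTILINE)
    return sorted(pattern.findall(config_text))

# Illustrative sample only; on a real system, read the text of the
# relevant /boot/config-* file instead.
sample = """\
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
CONFIG_AMIGA_PARTITION=y
CONFIG_MSDOS_PARTITION=y
"""

print(enabled_partition_parsers(sample))
```

Every parser listed this way is compiled in, which would fit with the Amiga code showing up in a trace on a box that has no Amiga disks.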


>> 0 pages in swap cache
>> Swap cache stats: add 0, delete 0, find 0/0
>> Free swap  = 0kB
>> Total swap = 0kB
>
> Now this is concerning ... you're out of swap, unless that's disabled.

I'm curious about this too.  I have 8GB swap and top showed some being used.
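
One way to keep an eye on what the kernel itself reports for swap is to read SwapTotal and SwapFree from /proc/meminfo.  A rough sketch, assuming the usual "Key:   value kB" layout; the function name is hypothetical and the sample values simply mirror the top output further down:

```python
def swap_kb(meminfo_text):
    """Extract SwapTotal and SwapFree (in kB) from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])
    return values

# Sample text for illustration; on a live system use
# open("/proc/meminfo").read() instead.
sample = """\
MemTotal:       49421492 kB
SwapTotal:       8388600 kB
SwapFree:        8372292 kB
"""

info = swap_kb(sample)
used = info["SwapTotal"] - info["SwapFree"]
print(info, "used:", used, "kB")
```

Logging this periodically alongside the serial console output might show whether swap really disappears before the OOM report.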

>> 4224700 4224638  99%    1.00K 1056175        4   4224700K ext4_inode_cache
>
> This smells a bit bad ... ext4_inode_cache is using a lot of memory ...


>> 3257480 3257186  99%    0.19K 162874       20    651496K dentry
>> 1324786 1250981  94%    0.06K  22454       59     89816K size-64
>> 484128 484094  99%    0.02K   3362      144     13448K avtab_node
>> 347088 342539  98%    0.03K   3099      112     12396K size-32
>> 342580 324110  94%    0.55K  48940        7    195760K radix_tree_node
>> 236059 235736  99%    0.06K   4001       59     16004K ksm_rmap_item
>> 123980 123566  99%    0.19K   6199       20     24796K size-192
>> 105630  47803  45%    0.12K   3521       30     14084K size-128
>>   24300  24261  99%    0.14K    900       27      3600K sysfs_dir_cache
>>   17402  15599  89%    0.05K    226       77       904K anon_vma_chain
>>   16055  14874  92%    0.20K    845       19      3380K vm_area_struct
>>    9844   8471  86%    0.04K    107       92       428K anon_vma
>>    8952   8775  98%    0.58K   1492        6      5968K inode_cache
>>    7518   5829  77%    0.62K   1253        6      5012K proc_inode_cache
>>    6840   4692  68%    0.19K    342       20      1368K filp
>>    5888   5532  93%    0.04K     64       92       256K dm_io
>>
>>
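
To put the slab numbers above in perspective, here is a small Python sketch that pulls the cache size column out of slabtop-style rows and ranks the caches.  It assumes the eight-column layout shown above (OBJS, ACTIVE, USE, OBJ SIZE, SLABS, OBJ/SLAB, CACHE SIZE, NAME); the function name is made up, and the sample rows are copied from the listing.

```python
def slab_cache_sizes(text):
    """Map cache name -> cache size in KiB for slabtop-style rows."""
    sizes = {}
    for line in text.splitlines():
        # Drop mail quote markers and leading spaces before splitting.
        fields = line.lstrip("> ").split()
        if (len(fields) == 8 and fields[2].endswith("%")
                and fields[6].endswith("K") and fields[6][:-1].isdigit()):
            sizes[fields[7]] = int(fields[6][:-1])
    return sizes

sample = """\
>> 4224700 4224638  99%    1.00K 1056175        4   4224700K ext4_inode_cache
>> 3257480 3257186  99%    0.19K 162874       20    651496K dentry
>> 342580 324110  94%    0.55K  48940        7    195760K radix_tree_node
"""

sizes = slab_cache_sizes(sample)
ranked = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
total_kib = sum(sizes.values())
for name, kib in ranked:
    print(f"{name:20s} {kib / 1024 / 1024:6.2f} GiB")
```

On these rows ext4_inode_cache alone is roughly 4 GB, large but still well short of the 48GB in the machine.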
>> top - 10:10:02 up 22:34,  4 users,  load average: 1.02, 1.15, 1.53
>> Tasks: 888 total,   1 running, 887 sleeping,   0 stopped,   0 zombie
>> Cpu(s):  0.8%us,  1.2%sy,  0.0%ni, 97.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
>> Mem:  49421492k total, 43619512k used,  5801980k free,  4409144k buffers
>> Swap:  8388600k total,    16308k used,  8372292k free, 25837164k cached
>
> Somehow, this doesn't reflect what the kernel complains about when the
> OOM killer starts its mission.

That's bugging me too.

> I see that you're using kernel-2.6.32-279.1.1.el6.x86_64 ... that
> smells a bit like a SL 6.3 Beta ... is that right?  As SL 6.2 is usually
> around 2.6.32-220-something.  I would probably recommend trying a
> 6.2 kernel if you're running something much more bleeding edge.

I was running 220.23.1 and it was crashing, so I tried the newer one to see if
that helped.  I think I'll back off now.

> And it somehow seems to be related to some file system issues ... at
> least from what I can see.  Could be a buggy kernel which leaks memory,
> somewhere in either the partition table code or ext4 code paths.

One possibility perhaps. The machine comes up doing a md sync:

md1 : active raid10 sdh1[4] sdb2[0] sda2[1] sdd1[2] sde1[7] sdf1[6] sdc1[3] sdg1[5]
       3906203648 blocks 256K chunks 2 near-copies [8/8] [UUUUUUUU]
       [=>...................]  resync =  9.2% (362126208/3906203648) finish=1334.6min speed=44255K/sec

I wonder if, when that completes, some kind of LVM device scan is triggered
which causes the problem.  I'm not sure what fires off an LVM process in the
first place.
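
To see whether a crash lines up with the end of the resync, something like the following Python sketch could pull the progress figures out of /proc/mdstat text and estimate the time remaining.  It assumes the resync line format shown above, with block counts in 1K units; the function name is hypothetical, and the sample is the md1 entry from this machine.

```python
import re

def resync_progress(mdstat_text):
    """Parse a resync progress line; return None if no resync is reported."""
    flat = " ".join(mdstat_text.split())  # undo line wrapping and extra spaces
    m = re.search(
        r"resync = ?([\d.]+)% \((\d+)/(\d+)\) "
        r"finish=([\d.]+)min speed=(\d+)K/sec",
        flat,
    )
    if m is None:
        return None
    done, total, speed = int(m.group(2)), int(m.group(3)), int(m.group(5))
    return {
        "percent": 100.0 * done / total,
        "reported_finish_min": float(m.group(4)),
        # Remaining 1K blocks divided by K/sec gives seconds.
        "estimated_finish_min": (total - done) / speed / 60.0,
    }

sample = """\
md1 : active raid10 sdh1[4] sdb2[0] sda2[1] sdd1[2] sde1[7] sdf1[6] sdc1[3] sdg1[5]
       3906203648 blocks 256K chunks 2 near-copies [8/8] [UUUUUUUU]
       [=>...................]  resync =  9.2% (362126208/3906203648) finish=1334.6min speed=44255K/sec
"""

progress = resync_progress(sample)
print(progress)
```

Logging this with a timestamp every few minutes would give a rough record of how far the resync had got at the time of a crash.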

-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder Office                  FAX: 303-415-9702
3380 Mitchell Lane
Boulder, CO 80301                   http://www.nwra.com