On 07/18/2012 11:32 AM, David Sommerseth wrote:
> On 18/07/12 18:35, Orion Poplawski wrote:
>> On 07/17/2012 11:22 AM, Orion Poplawski wrote:
>>> Our SL6.2 KVM and nfs/backup server has been crashing frequently recently
>>> (starting around Fri 13th - yikes!) with Kernel panic - Out of memory and
>>> no killable processes. The server has 48GB ram, 2GB swap, only about 15GB
>>> dedicated to VM guests. I've tried bumping up vm.min_free_kbytes to 262144
>>> to no avail. Nothing strange is getting written to the logs before the
>>> crash.
>>>
>>> Happening with both 2.6.32-220.23.1 and 2.6.32-279.1.1.
>>>
>>> Anyone else seeing this? Any other ideas? I've set a serial console log to
>>> try to catch more information the next time it happens.
>>>
>>
>> here we go, see below. This makes no sense to me.
>>
>> [<ffffffff811edc5d>] ? amiga_partition+0x6d/0x460
>                         ^^^^^^^^^^^^^^^
> wtf!?! What kind of partition tables and file systems do you use? This
> OOM kill seems to be caused by the amiga partition table code in the
> kernel. It looks like it's some LVM command causing this to happen
> somehow, though.

Well I bet it's just scanning all partition types and:

/boot/config-2.6.32-220.23.1.el6.x86_64:CONFIG_AMIGA_PARTITION=y
/boot/config-2.6.32-279.1.1.el6.x86_64:CONFIG_AMIGA_PARTITION=y

>> 0 pages in swap cache
>> Swap cache stats: add 0, delete 0, find 0/0
>> Free swap = 0kB
>> Total swap = 0kB
>
> Now this is concerning ... you're out of swap, unless that's disabled.

I'm curious about this too. I have 8GB swap and top showed some being used.

>> 4224700 4224638 99% 1.00K 1056175 4 4224700K ext4_inode_cache
>
> This smells a bit bad ... ext4_inode_cache is using a lot of memory ...
>
>> 3257480 3257186 99% 0.19K 162874  20 651496K dentry
>> 1324786 1250981 94% 0.06K  22454  59  89816K size-64
>>  484128  484094 99% 0.02K   3362 144  13448K avtab_node
>>  347088  342539 98% 0.03K   3099 112  12396K size-32
>>  342580  324110 94% 0.55K  48940   7 195760K radix_tree_node
>>  236059  235736 99% 0.06K   4001  59  16004K ksm_rmap_item
>>  123980  123566 99% 0.19K   6199  20  24796K size-192
>>  105630   47803 45% 0.12K   3521  30  14084K size-128
>>   24300   24261 99% 0.14K    900  27   3600K sysfs_dir_cache
>>   17402   15599 89% 0.05K    226  77    904K anon_vma_chain
>>   16055   14874 92% 0.20K    845  19   3380K vm_area_struct
>>    9844    8471 86% 0.04K    107  92    428K anon_vma
>>    8952    8775 98% 0.58K   1492   6   5968K inode_cache
>>    7518    5829 77% 0.62K   1253   6   5012K proc_inode_cache
>>    6840    4692 68% 0.19K    342  20   1368K filp
>>    5888    5532 93% 0.04K     64  92    256K dm_io
>>
>>
>> top - 10:10:02 up 22:34, 4 users, load average: 1.02, 1.15, 1.53
>> Tasks: 888 total, 1 running, 887 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 0.8%us, 1.2%sy, 0.0%ni, 97.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem: 49421492k total, 43619512k used, 5801980k free, 4409144k buffers
>> Swap: 8388600k total, 16308k used, 8372292k free, 25837164k cached
>
> Somehow, this doesn't reflect what the kernel complains about when the
> OOM killer starts its mission.

That's bugging me too.

> I see that you're using kernel-2.6.32-279.1.1.el6.x86_64 ... that
> smells a bit like a SL 6.3 Beta ... is that right? As SL 6.2 is usually
> around 2.6.32-220-something. I would probably recommend you to try a
> 6.2 kernel if you're running something much more bleeding edge.

I was running 2.6.32-220.23.1 and it was crashing, so I tried the newer one to
see if that helped. I think I'll back off now.

> And it somehow seems to be related to some file system issues ... at
> least from what I can see. Could be a buggy kernel which leaks memory,
> somewhere in either the partition table code or ext4 code paths.

One possibility perhaps.
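For anyone who wants to total up a slab dump like the one quoted above rather than eyeball it, here is a rough Python sketch. It assumes the rows are in slabtop's default column order (OBJS, ACTIVE, USE, OBJ SIZE, SLABS, OBJ/SLAB, CACHE SIZE, NAME); the quoted output has no header line, so that layout is an assumption, and `parse_slab_rows` is just a name made up for this illustration. The sample uses only the three largest rows from the post.

```python
def parse_slab_rows(text):
    """Return (cache_name, cache_size_in_K) pairs from slabtop-style rows.

    Assumes 8 whitespace-separated fields per row with the cache size
    (e.g. "651496K") in the 7th field and the cache name in the 8th.
    Mail quote markers such as ">>" are stripped first; any line that
    does not fit this shape is skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.lstrip("> ").split()
        if len(fields) != 8 or not fields[6].endswith("K"):
            continue
        try:
            size_k = int(fields[6][:-1])
        except ValueError:
            continue  # e.g. a header line, not a data row
        rows.append((fields[7], size_k))
    return rows


# Three largest rows copied from the quoted output above.
SAMPLE = """\
>> 4224700 4224638 99% 1.00K 1056175 4 4224700K ext4_inode_cache
>> 3257480 3257186 99% 0.19K 162874 20 651496K dentry
>> 342580 324110 94% 0.55K 48940 7 195760K radix_tree_node
"""

rows = parse_slab_rows(SAMPLE)
total_k = sum(size for _, size in rows)
largest_name, largest_k = max(rows, key=lambda row: row[1])

print("total for sampled caches: %d K" % total_k)
print("largest cache: %s (%d K, %.1f%% of the sampled total)"
      % (largest_name, largest_k, 100.0 * largest_k / total_k))
```

Feeding it the full dump instead of the three-row sample gives the overall slab footprint, which can then be compared with the "used" and "cached" figures that top reports.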

The machine comes up doing a md sync:

md1 : active raid10 sdh1[4] sdb2[0] sda2[1] sdd1[2] sde1[7] sdf1[6] sdc1[3] sdg1[5]
      3906203648 blocks 256K chunks 2 near-copies [8/8] [UUUUUUUU]
      [=>...................] resync = 9.2% (362126208/3906203648) finish=1334.6min speed=44255K/sec

I wonder if when that completes some kind of lvm device scan is triggered
which causes the problem. I'm not sure what fires off a lvm process in the
first place.

--
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder Office                  FAX: 303-415-9702
3380 Mitchell Lane
Boulder, CO 80301                     http://www.nwra.com