Hi All,

We have started seeing a rather worrisome problem with our large XFS fileservers. Basically, it appears that something overwrites the LVM metadata on a physical volume. In every case so far, after a reboot we have been able to repair the physical volume without losing data or corrupting the filesystem. We don't know exactly what triggers this, but it seems related to filling up a filesystem (or physical volume?).

Our RAID servers are running SL305 with the 2.4.21-37.EL.XFSsmp kernel. We use 3ware 9500S-12 RAID controllers with 750GB disks in a RAID5 configuration. The 3ware controller splits this 6.82TB RAID volume into 2TB volumes (sd[c-f]), which we combine into one volume group. We currently have nine identical fileservers (some in operation for over two years), and in the last month we've seen this problem on two of them. We now have a test setup on which we can (more or less) reliably trigger the problem.

Since we don't know exactly where the problem originates, I will describe our test setup and how everything is created.

Logical volumes are created as follows (for example):

[root@lnx113 root]# lvcreate --size 2000G --name test1 vg1
[root@lnx113 root]# lvcreate --size 128M --name test1log vg

And we create XFS filesystems as follows:

[root@lnx113 root]# mkfs.xfs -s size=4096 -l logdev=/dev/vg/test1log /dev/vg1/test1
meta-data=/dev/vg1/test1         isize=256    agcount=32, agsize=16384000 blks
         =                       sectsz=4096
data     =                       bsize=4096   blocks=524288000, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =/dev/vg/test1log       bsize=4096   blocks=32768, version=2
         =                       sectsz=4096  sunit=1 blks
realtime =none                   extsz=65536  blocks=0, rtextents=0

The first logical volume (test) is 1.9TB and resides entirely on sdc. The second logical volume (test1) fills up the rest of sdc and uses part of sdd. Everything then looks like this:

[root@lnx113 root]# pvscan
pvscan -- reading all physical volumes (this may take a while...)
pvscan -- WARNING: physical volume "/dev/sda4" belongs to a meta device
pvscan -- WARNING: physical volume "/dev/sdb4" belongs to a meta device
pvscan -- ACTIVE PV "/dev/sdc" of VG "vg1" [2 TB / 0 free]
pvscan -- ACTIVE PV "/dev/sdd" of VG "vg1" [2 TB / 195.94 GB free]
pvscan -- ACTIVE PV "/dev/sde" of VG "vg1" [2 TB / 2 TB free]
pvscan -- ACTIVE PV "/dev/sdf" of VG "vg1" [840.78 GB / 840.78 GB free]
pvscan -- ACTIVE PV "/dev/md2" of VG "vg" [68.72 GB / 49.17 GB free]
pvscan -- total: 7 [6.89 TB] / in use: 7 [6.89 TB] / in no VG: 0 [0]

[root@lnx113 root]# lvscan
...
lvscan -- ACTIVE "/dev/vg/testlog" [128 MB]
lvscan -- ACTIVE "/dev/vg/test1log" [128 MB]
lvscan -- ACTIVE "/dev/vg1/test" [1.86 TB]
lvscan -- ACTIVE "/dev/vg1/test1" [1.95 TB]
lvscan -- 14 logical volumes with 1.83 TB total in 2 volume groups
lvscan -- 14 active logical volumes

We then fill up the filesystems using dd (for example, `dd if=/dev/zero of=/mnt/test1/zero bs=1M count=200000`). We can fill up /dev/vg1/test without a problem. However, while filling up test1, at some point syslog reports the following error:

Feb 3 15:36:28 lnx113 kernel: attempt to access beyond end of device
Feb 3 15:36:28 lnx113 kernel: 08:20: rw=1, want=0, limit=2147483647

Note that this happens before test1 fills up. For example:

[root@lnx113 root]# df -h |grep test
/dev/vg1/test         1.9T  1.9T   36K 100% /mnt/test
/dev/vg1/test1        2.0T  1.8T  193G  91% /mnt/test1
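For what it's worth, 08:20 is major 8, minor 32, which is /dev/sdc. If we read want and limit as 1KB blocks (which, as far as we can tell, is how the 2.4 block layer reports them), the limit sits right at the 2TB mark. A quick sanity check (the unit interpretation is our assumption):

[root@lnx113 root]# echo $(( 2147483647 / 1024 / 1024 ))  # limit in GB, assuming 1KB units
2047

Under that reading, want=0 would mean the block counter wrapped around to zero, so the write landed at the very start of sdc, which is exactly where the LVM metadata lives. That's only a guess on our part, but it would fit the symptoms below.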
A pvscan and vgscan then show the problem (what happened to sdc?):

[root@lnx113 root]# pvscan
pvscan -- reading all physical volumes (this may take a while...)
pvscan -- WARNING: physical volume "/dev/sda4" belongs to a meta device
pvscan -- WARNING: physical volume "/dev/sdb4" belongs to a meta device
pvscan -- ACTIVE PV "/dev/sdd" is associated to unknown VG "vg1" (run vgscan)
pvscan -- ACTIVE PV "/dev/sde" is associated to unknown VG "vg1" (run vgscan)
pvscan -- ACTIVE PV "/dev/sdf" is associated to unknown VG "vg1" (run vgscan)
pvscan -- ACTIVE PV "/dev/md2" of VG "vg" [68.72 GB / 49.17 GB free]
pvscan -- total: 6 [4.89 TB] / in use: 6 [4.89 TB] / in no VG: 0 [0]

[root@lnx113 root]# vgscan
vgscan -- reading all physical volumes (this may take a while...)
vgscan -- found active volume group "vg"
vgscan -- found active volume group "vg1"
vgscan -- ERROR "vg_read_with_pv_and_lv(): current PV" can't get data of volume group "vg1" from physical volume(s)
vgscan -- "/etc/lvmtab" and "/etc/lvmtab.d" successfully created
vgscan -- WARNING: This program does not do a VGDA backup of your volume groups

After a reboot, we are then able to recreate the physical volume on sdc and restore the LVM metadata:

pvcreate /dev/sdc
vgcfgrestore -n vg1 /dev/sdc
vgchange -a y vg1

After doing this, everything looks fine and fsck reports that the filesystems are clean.

Please let me know if there's any other information we can provide. Any suggestions or help would be greatly appreciated.

Many thanks,
Devin

------
Devin Bougie
Laboratory for Elementary-Particle Physics
Cornell University
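P.S. In case it's relevant, vg1 itself was put together from the four 3ware volumes in the usual LVM1 way, roughly as follows (the 32M extent size here is from memory and may not be exactly what we used):

pvcreate /dev/sdc
pvcreate /dev/sdd
pvcreate /dev/sde
pvcreate /dev/sdf
vgcreate -s 32M vg1 /dev/sd[c-f]

A large physical extent size is needed because LVM1 caps the number of extents per physical volume, and a 2TB PV won't fit under that cap with the default extent size.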