Greetings. We have a dual-Opteron system running Scientific Linux 4.4
(for x86_64):
# cat /etc/redhat-release
Scientific Linux SL release 4.4 (Beryllium)
#
# cat /proc/version
Linux version 2.6.9-42.0.3.ELsmp ([log in to unmask]) (gcc version
3.4.4 20050721 (Red Hat 3.4.4-2)) #1 SMP Thu Oct 5 16:29:37 CDT 2006
The machine has a software RAID-6 array, using 7 SATA disk drives:
# cat /proc/mdstat
Personalities : [raid6]
md0 : active raid6 sdh2[6] sdg2[5] sdf2[4] sde2[3] sdd2[2] sdc2[1]
1454764800 blocks level 6, 256k chunk, algorithm 2 [7/6] [_UUUUUU]
unused devices: <none>
As you can see, one of the drives, sdb2, is missing from the array.
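For anyone who would rather check this mechanically than by eye, here is a rough Python sketch that reads the status line of the mdstat output and reports which slots have no active member. The function name `missing_slots` and the embedded sample are only illustrative, and the parsing assumes the `[total/active] [U_...]` layout shown above.

```python
import re

# Sample copied from the /proc/mdstat output above.
MDSTAT_SAMPLE = """\
Personalities : [raid6]
md0 : active raid6 sdh2[6] sdg2[5] sdf2[4] sde2[3] sdd2[2] sdc2[1]
1454764800 blocks level 6, 256k chunk, algorithm 2 [7/6] [_UUUUUU]
unused devices: <none>
"""

def missing_slots(mdstat_text):
    """Return {array name: [slot numbers marked '_' in the status flags]}."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line carries "[total/active] [flags]", e.g. "[7/6] [_UUUUUU]".
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status:
            flags = status.group(3)
            result[current] = [i for i, f in enumerate(flags) if f == "_"]
            current = None
    return result

print(missing_slots(MDSTAT_SAMPLE))  # {'md0': [0]}
```

Slot 0 is the only one not listed among the active members (sdc2[1] through sdh2[6]), which matches the missing sdb2.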
When we first encountered the problem, we simply put in a spare drive
and let the RAID set rebuild itself, which takes 3-4 hours.
After we remounted the RAID set, everything seemed fine for a short
while (no more than an hour or so), after which the SAME drive
disappeared from the array. We repeated the exercise with a second
spare and got the same behavior. We're assuming that the three drives
can't ALL be bad.
In fact, this is more than an assumption: we placed one of these drives
in the "bad" slot, reformatted it as a standalone (not part of the RAID
set) drive, and exercised it overnight with no obvious problems.
We inspected the guts of the system, and everything looks completely
solid: no loose cables, no kinks, etc.
The vendor's tech-support guys suggested that we run fsck on the array
before trying to remount it. That led to the second interesting aspect
of this situation: the fsck seems to loop forever (meaning 60 hours in
one case), finding illegal blocks, always in the SAME inode, and
restarting itself. I've appended a representative sample.
We found a reference to a similar problem on one of the Red Hat lists,
but the suggested fix was to upgrade to a newer version of e2fsprogs:
https://listman.redhat.com/archives/ext3-users/2005-February/msg00048.html
So far as we can tell, we have the system fully updated, so I don't know
where we would get an upgrade.
It might be of interest that we have both 32-bit and 64-bit e2fsprogs
packages installed (not by conscious choice):
# rpm -qa | grep e2fsprogs
e2fsprogs-1.35-12.4.EL4.i386
e2fsprogs-devel-1.35-12.4.EL4.x86_64
e2fsprogs-1.35-12.4.EL4.x86_64
It looks as if both packages claim to "own" the fsck program:
# rpm -q -f /sbin/fsck
e2fsprogs-1.35-12.4.EL4.x86_64
e2fsprogs-1.35-12.4.EL4.i386
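To see at a glance which packages are installed for more than one architecture, something like the following Python sketch could be run over the `rpm -qa` listing. It assumes every line has the name-version-release.arch form shown above; `arches_by_package` is just an illustrative name.

```python
from collections import defaultdict

# Sample copied from the "rpm -qa | grep e2fsprogs" output above.
RPM_QA_SAMPLE = """\
e2fsprogs-1.35-12.4.EL4.i386
e2fsprogs-devel-1.35-12.4.EL4.x86_64
e2fsprogs-1.35-12.4.EL4.x86_64
"""

def arches_by_package(rpm_listing):
    """Map each package name to the sorted list of architectures installed."""
    arches = defaultdict(set)
    for entry in rpm_listing.split():
        # Assumed layout: <name>-<version>-<release>.<arch>
        nvr, _, arch = entry.rpartition(".")
        name, _version, _release = nvr.rsplit("-", 2)
        arches[name].add(arch)
    return {name: sorted(found) for name, found in arches.items()}

for name, found in arches_by_package(RPM_QA_SAMPLE).items():
    print(name, found)
```

On the listing above this reports e2fsprogs for both i386 and x86_64, and e2fsprogs-devel for x86_64 only.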
We're going to try replacing the SATA cables next, but, as I said above,
we really didn't see any problems when we inspected the cables, and the
machine HAD been working fine for some time before this problem
occurred. Also, the spare drive worked fine as a standalone drive,
using the same cables.
If you have any insight into this, please let me know ASAP.
Thanks.
- Mike
Appendix: sample of output from "looping" e2fsck:
================================================
.
.
.
Restarting e2fsck from the beginning...
/dev/md0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 8 has illegal block(s). Clear? yes
Illegal block #219948 (2475724427) in inode 8. CLEARED.
Illegal block #219949 (4171757466) in inode 8. CLEARED.
Illegal block #219950 (4238416011) in inode 8. CLEARED.
Illegal block #219951 (2339933318) in inode 8. CLEARED.
Illegal block #219952 (428579973) in inode 8. CLEARED.
Illegal block #219953 (2341702547) in inode 8. CLEARED.
Illegal block #219955 (3328508415) in inode 8. CLEARED.
Illegal block #219956 (3276797744) in inode 8. CLEARED.
Illegal block #219957 (2448919297) in inode 8. CLEARED.
Illegal block #219958 (4156880643) in inode 8. CLEARED.
Illegal block #219959 (2342393167) in inode 8. CLEARED.
Too many illegal blocks in inode 8.
Clear inode? yes
Restarting e2fsck from the beginning...
/dev/md0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 8 has illegal block(s). Clear? yes
Illegal block #219960 (2542701456) in inode 8. CLEARED.
Illegal block #219961 (2710671757) in inode 8. CLEARED.
Illegal block #219962 (2533920651) in inode 8. CLEARED.
Illegal block #219963 (511413122) in inode 8. CLEARED.
Illegal block #219964 (1996904443) in inode 8. CLEARED.
Illegal block #219965 (2274068875) in inode 8. CLEARED.
Illegal block #219966 (2256898430) in inode 8. CLEARED.
Illegal block #219968 (2643170174) in inode 8. CLEARED.
Illegal block #219970 (4151319807) in inode 8. CLEARED.
Illegal block #219971 (2332139923) in inode 8. CLEARED.
Illegal block #219972 (4160970488) in inode 8. CLEARED.
Too many illegal blocks in inode 8.
Clear inode? yes
Inode 14529 was part of the orphaned inode list. FIXED.
Extended attribute block 9748 has reference count 2, should be 1. Fix?
yes
Restarting e2fsck from the beginning...
/dev/md0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 8 has illegal block(s). Clear? yes
Illegal block #219974 (2525730202) in inode 8. CLEARED.
.
.
.
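In case it helps anyone reading the log, here is a rough Python sketch that summarizes each restart: which inode was involved and the first and last block index that got cleared. The regular expression is written against the message format in the sample above, the excerpt in the code is trimmed down from that sample, and `summarize_restarts` is only an illustrative name.

```python
import re

BLOCK_RE = re.compile(r"Illegal block #(\d+) \((\d+)\) in inode (\d+)\. CLEARED\.")

# Trimmed excerpt of the e2fsck output shown above.
FSCK_SAMPLE = """\
Restarting e2fsck from the beginning...
Pass 1: Checking inodes, blocks, and sizes
Inode 8 has illegal block(s). Clear? yes
Illegal block #219948 (2475724427) in inode 8. CLEARED.
Illegal block #219959 (2342393167) in inode 8. CLEARED.
Too many illegal blocks in inode 8.
Clear inode? yes
Restarting e2fsck from the beginning...
Pass 1: Checking inodes, blocks, and sizes
Inode 8 has illegal block(s). Clear? yes
Illegal block #219960 (2542701456) in inode 8. CLEARED.
Illegal block #219972 (4160970488) in inode 8. CLEARED.
"""

def summarize_restarts(log_text):
    """Return one (inodes, first_index, last_index, count) tuple per restart."""
    runs = []
    current = None
    for line in log_text.splitlines():
        if line.startswith("Restarting e2fsck"):
            current = {"inodes": set(), "indexes": []}
            runs.append(current)
            continue
        match = BLOCK_RE.search(line)
        if match:
            if current is None:  # log excerpt began in the middle of a run
                current = {"inodes": set(), "indexes": []}
                runs.append(current)
            current["indexes"].append(int(match.group(1)))
            current["inodes"].add(int(match.group(3)))
    return [
        (sorted(r["inodes"]), min(r["indexes"]), max(r["indexes"]), len(r["indexes"]))
        for r in runs
        if r["indexes"]
    ]

for summary in summarize_restarts(FSCK_SAMPLE):
    print(summary)
```

In the sample above, each restart clears a slightly later range of block indexes in inode 8 (219948-219959, then 219960-219972, then 219974 onward), so a summary like this may help show whether the run is inching forward or truly repeating itself.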
--
Michael Hannon mailto:[log in to unmask]
Dept. of Physics 530.752.4966
University of California 530.752.4717 FAX
Davis, CA 95616-8677