SCIENTIFIC-LINUX-USERS Archives

November 2006

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Michael Hannon <[log in to unmask]>
Reply To:
Michael Hannon <[log in to unmask]>
Date:
Wed, 15 Nov 2006 12:21:12 -0800
Content-Type:
text/plain
Parts/Attachments:
text/plain (140 lines)
Greetings.  We have a dual-Opteron system running Scientific Linux 4.4
(for x86_64):

     # cat /etc/redhat-release
     Scientific Linux SL release 4.4 (Beryllium)
     #
     # cat /proc/version
     Linux version 2.6.9-42.0.3.ELsmp ([log in to unmask]) (gcc version
     3.4.4 20050721 (Red Hat 3.4.4-2)) #1 SMP Thu Oct 5 16:29:37 CDT 2006

The machine has a software-RAID-6 array, using 7 SATA disk drives:

     # cat /proc/mdstat
     Personalities : [raid6]
     md0 : active raid6 sdh2[6] sdg2[5] sdf2[4] sde2[3] sdd2[2] sdc2[1]
      1454764800 blocks level 6, 256k chunk, algorithm 2 [7/6] [_UUUUUU]

     unused devices: <none>

As you can see, one of the drives, sdb2, is missing from the array.
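A minimal sketch, in Python, of how the status line above can be read mechanically. The function name missing_slots is hypothetical, and the parser assumes the mdstat layout shown in this message (an "mdN :" header line followed by a line ending in "[total/active] [flags]"); other kernels may format it differently.

```python
import re

def missing_slots(mdstat_text):
    """Return {array_name: [slot indices marked '_']} for each md array found."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header, e.g. "md0 : active raid6 sdh2[6] ..."
        header = re.match(r"\s*(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status line, e.g. "... [7/6] [_UUUUUU]"
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            flags = status.group(3)
            result[current] = [i for i, f in enumerate(flags) if f == "_"]
            current = None
    return result

# Sample text copied from the /proc/mdstat output quoted above.
SAMPLE = """\
Personalities : [raid6]
md0 : active raid6 sdh2[6] sdg2[5] sdf2[4] sde2[3] sdd2[2] sdc2[1]
      1454764800 blocks level 6, 256k chunk, algorithm 2 [7/6] [_UUUUUU]

unused devices: <none>
"""

print(missing_slots(SAMPLE))
```

For the output quoted above this reports slot 0 of md0 as missing, which is consistent with no member carrying the [0] suffix in the device list.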

When we first encountered the problem, we simply put in a spare drive
and let the RAID set rebuild itself, which it does in 3-4 hours.

After we remounted the RAID set, everything seemed fine for a short
while (no more than an hour or so), after which the SAME drive
disappeared from the array.  We repeated the exercise with a second
spare and got the same behavior.  We're assuming that the three drives
can't ALL be bad.

In fact, this is more than an assumption: we placed one of these drives
in the "bad" slot, reformatted it as a standalone (not part of the RAID
set) drive, and exercised it overnight with no obvious problems.

We inspected the guts of the system, and everything looks completely
solid: no loose cables, no kinks, etc.

The vendor's tech-support guys suggested that we run fsck on the array
before trying to remount it.  That led to the second interesting aspect
of this situation: the fsck seems to loop forever (meaning 60 hours in
one case), finding illegal blocks, always in the SAME inode, and
restarting itself.  I've appended a representative sample.

We found a reference to a similar problem on one of the Red Hat lists,
but the suggested fix was to upgrade to a newer version of e2fsprogs:

https://listman.redhat.com/archives/ext3-users/2005-February/msg00048.html

So far as we can tell, we have the system fully updated, so I don't know
where we would get an upgrade.

It might be of interest that we have both 32-bit and 64-bit e2fsprogs
packages installed (not by conscious choice):

     # rpm -qa | grep e2fsprogs
     e2fsprogs-1.35-12.4.EL4.i386
     e2fsprogs-devel-1.35-12.4.EL4.x86_64
     e2fsprogs-1.35-12.4.EL4.x86_64

It looks as if both packages claim to "own" the fsck program:

     # rpm -q -f /sbin/fsck
     e2fsprogs-1.35-12.4.EL4.x86_64
     e2fsprogs-1.35-12.4.EL4.i386
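
A rough Python sketch of how the same check could be made across a whole package list, for anyone who wants to see which packages are installed for more than one architecture. The function name multiarch_packages is hypothetical, and the code assumes each line has the "name-version-release.arch" form shown in the rpm output above.

```python
from collections import defaultdict

def multiarch_packages(rpm_lines):
    """Return {name-version-release: [archs]} for entries seen with more than one arch."""
    archs = defaultdict(set)
    for entry in rpm_lines:
        entry = entry.strip()
        if not entry:
            continue
        # Split on the last dot: everything before it is name-version-release.
        nvr, _, arch = entry.rpartition(".")
        archs[nvr].add(arch)
    return {nvr: sorted(a) for nvr, a in archs.items() if len(a) > 1}

# Sample lines copied from the "rpm -qa | grep e2fsprogs" output above.
SAMPLE = [
    "e2fsprogs-1.35-12.4.EL4.i386",
    "e2fsprogs-devel-1.35-12.4.EL4.x86_64",
    "e2fsprogs-1.35-12.4.EL4.x86_64",
]

print(multiarch_packages(SAMPLE))
```

With the three lines above, only e2fsprogs-1.35-12.4.EL4 is reported, with architectures i386 and x86_64; the -devel package appears for x86_64 alone.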

We're going to try replacing the SATA cables next, but, as I said above,
we really didn't see any problems when we inspected the cables, and the
machine HAD been working fine for some time before this problem
occurred.  Also, the spare drive worked fine as a standalone drive,
using the same cables.

If you have any insight into this, please let me know ASAP.

Thanks.

                                         - Mike

Appendix: sample of output from "looping" e2fsck:
================================================
.
.
.
Restarting e2fsck from the beginning...
/dev/md0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 8 has illegal block(s).  Clear? yes

Illegal block #219948 (2475724427) in inode 8.  CLEARED.
Illegal block #219949 (4171757466) in inode 8.  CLEARED.
Illegal block #219950 (4238416011) in inode 8.  CLEARED.
Illegal block #219951 (2339933318) in inode 8.  CLEARED.
Illegal block #219952 (428579973) in inode 8.  CLEARED.
Illegal block #219953 (2341702547) in inode 8.  CLEARED.
Illegal block #219955 (3328508415) in inode 8.  CLEARED.
Illegal block #219956 (3276797744) in inode 8.  CLEARED.
Illegal block #219957 (2448919297) in inode 8.  CLEARED.
Illegal block #219958 (4156880643) in inode 8.  CLEARED.
Illegal block #219959 (2342393167) in inode 8.  CLEARED.
Too many illegal blocks in inode 8.
Clear inode? yes

Restarting e2fsck from the beginning...
/dev/md0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 8 has illegal block(s).  Clear? yes

Illegal block #219960 (2542701456) in inode 8.  CLEARED.
Illegal block #219961 (2710671757) in inode 8.  CLEARED.
Illegal block #219962 (2533920651) in inode 8.  CLEARED.
Illegal block #219963 (511413122) in inode 8.  CLEARED.
Illegal block #219964 (1996904443) in inode 8.  CLEARED.
Illegal block #219965 (2274068875) in inode 8.  CLEARED.
Illegal block #219966 (2256898430) in inode 8.  CLEARED.
Illegal block #219968 (2643170174) in inode 8.  CLEARED.
Illegal block #219970 (4151319807) in inode 8.  CLEARED.
Illegal block #219971 (2332139923) in inode 8.  CLEARED.
Illegal block #219972 (4160970488) in inode 8.  CLEARED.
Too many illegal blocks in inode 8.
Clear inode? yes

Inode 14529 was part of the orphaned inode list.  FIXED.
Extended attribute block 9748 has reference count 2, should be 1.  Fix?
yes

Restarting e2fsck from the beginning...
/dev/md0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 8 has illegal block(s).  Clear? yes

Illegal block #219974 (2525730202) in inode 8.  CLEARED.
.
.
.
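
A small Python sketch for summarizing a log like the one above: it splits the output at each "Restarting e2fsck" line and reports the lowest and highest logical block index flagged in each pass. The function name summarize is hypothetical, and the regular expression assumes the message wording shown in this excerpt. The sample is a shortened copy of the excerpt (first and last "Illegal block" line of each pass).

```python
import re

# Matches e.g. "Illegal block #219948 (2475724427) in inode 8.  CLEARED."
ILLEGAL = re.compile(r"Illegal block #(\d+) \((\d+)\) in inode (\d+)")

def summarize(log_text):
    """Return one (first_index, last_index, count) tuple per pass that flagged blocks."""
    passes = []
    current = []
    for line in log_text.splitlines():
        if line.startswith("Restarting e2fsck"):
            # A restart closes the previous pass, if it flagged anything.
            if current:
                passes.append(current)
            current = []
            continue
        m = ILLEGAL.search(line)
        if m:
            current.append(int(m.group(1)))
    if current:
        passes.append(current)
    return [(min(p), max(p), len(p)) for p in passes]

SAMPLE = """\
Restarting e2fsck from the beginning...
Illegal block #219948 (2475724427) in inode 8.  CLEARED.
Illegal block #219959 (2342393167) in inode 8.  CLEARED.
Restarting e2fsck from the beginning...
Illegal block #219960 (2542701456) in inode 8.  CLEARED.
Illegal block #219972 (4160970488) in inode 8.  CLEARED.
Restarting e2fsck from the beginning...
Illegal block #219974 (2525730202) in inode 8.  CLEARED.
"""

for first, last, count in summarize(SAMPLE):
    print(first, last, count)
```

In the excerpt the logical indices advance from one pass to the next (219948 to 219959, then 219960 to 219972, then 219974), so each restart appears to pick up at a later block of inode 8 rather than revisiting the same ones. The excerpt alone cannot show whether the run would eventually finish.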

-- 
Michael Hannon            mailto:[log in to unmask]
Dept. of Physics          530.752.4966
University of California  530.752.4717 FAX
Davis, CA 95616-8677
