SCIENTIFIC-LINUX-USERS Archives

February 2012

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject: Re: Degraded array issues with SL 6.1 and SL 6.2
From: Bill Maidment <[log in to unmask]>
Reply-To: Bill Maidment <[log in to unmask]>
Date: Thu, 23 Feb 2012 11:03:43 +1100
Content-Type: text/plain
Parts/Attachments: text/plain (57 lines)
-----Original message-----
From:	Nico Kadel-Garcia <[log in to unmask]>
Sent:	Thu 23-02-2012 10:42
Subject:	Re: Degraded array issues with SL 6.1 and SL 6.2
To:	Bill Maidment <[log in to unmask]>; 
CC:	SL Users <[log in to unmask]>; Tom H <[log in to unmask]>; 
> On Wed, Feb 22, 2012 at 4:38 PM, Bill Maidment <[log in to unmask]> wrote:
> > > In (1) above, are they replying that you can't "--fail", "--remove",
> > > and then "--add" the same disk, or that you can't "--fail" and
> > > "--remove" a disk, replace it, and then can't "--add" it because it's
> > > got the same "X"/"XY" in "sdX"/"sdXY" as the previous, failed disk?
> >
> > Now that I've had my coffee fix, I have got my sanity back.
> > I have used the following sequence of commands to remove and re-add a disk to a
> > running RAID1 array:
> > mdadm /dev/md3 -f /dev/sdc1
> > mdadm /dev/md3 -r /dev/sdc1
> > mdadm --zero-superblock /dev/sdc1
> > mdadm /dev/md3 -a /dev/sdc1
> >
> > It works as expected. I just found the original error message a bit confusing
> > when it referred to making the disk a "spare". It would seem that earlier
> > versions of the kernel did that automatically.
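
As a minimal sketch (not part of the quoted message, and assuming the same /dev/md3 and /dev/sdc1 names used above), the resync after re-adding the member can be confirmed with:

cat /proc/mdstat                # overall array state and resync progress
mdadm --detail /dev/md3         # per-member state: active, rebuilding, or spare
mdadm --examine /dev/sdc1       # superblock on the re-added partition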
> 
>  
> 
> Interesting! I have mixed feelings about RAID, especially for simple RAID1
> setups. I'd rather use the second drive as an rsnapshot-based backup drive,
> usually in read-only mode. That allows me to recover files that I've
> accidentally screwed up or deleted in the recent past, which occurs far more
> often than drive failures. And it puts different wear and tear on the hard
> drive: there's nothing like having all the drives in a RAID set start failing
> at almost the same time, before drive replacement can occur. This has happened
> to me before and is actually pretty well described in a Google white paper at
> http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf
>
> However, in this case, I'd tend to agree with the idea that a RAID1 pair
> should not be automatically re-activated on reboot. If one drive starts
> failing, it should be kept offline until replaced, and trying to outguess this
> process at boot time without intervention seems fraught.
> 
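
A rough sketch of the rsnapshot arrangement described above, assuming a second drive mounted at /mnt/backup; the paths and retention counts are illustrative only, and rsnapshot.conf fields must be TAB-separated:

# /etc/rsnapshot.conf (excerpt)
snapshot_root	/mnt/backup/snapshots/
interval	daily	7              # "retain" in newer rsnapshot versions
interval	weekly	4
backup	/etc/	localhost/
backup	/home/	localhost/

# keep the backup drive read-only except while the snapshot runs
mount -o remount,rw /mnt/backup && /usr/bin/rsnapshot daily; mount -o remount,ro /mnt/backup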

Yes, I agree with your sentiments generally.
We had an issue that required us to change the hard drives before they actually failed.
The firmware on several of our 500GB Seagate drives (firmware SN05) was faulty and liable to cause the drive to fail if the machine was rebooted. We took these drives out of the arrays, re-flashed the firmware to SN06, and then added the drives back into the arrays.
We needed to go this way because of the dramatic shortage of server-grade hard drives (and the consequent price increase) in recent months.
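
For anyone in the same position, a hedged sketch of the swap cycle around the firmware update (device names are examples, smartctl comes from smartmontools, and the actual flash to SN06 is done with Seagate's own update utility, which is not shown here):

smartctl -i /dev/sdc | grep -E 'Device Model|Firmware Version'   # confirm which drives are still on SN05

mdadm /dev/md3 -f /dev/sdc1      # fail the member before touching the drive
mdadm /dev/md3 -r /dev/sdc1      # remove it from the array
#   ...re-flash the drive firmware to SN06 with the vendor utility, then...
mdadm --zero-superblock /dev/sdc1
mdadm /dev/md3 -a /dev/sdc1      # re-add; the array resyncs onto the updated drive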

Cheers
Bill Maidment
IT Consultant to Elgas Ltd
Phone: 02 4294 3649
