SCIENTIFIC-LINUX-USERS Archives

April 2017

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From: Konstantin Olchanski <[log in to unmask]>
Reply-To: Konstantin Olchanski <[log in to unmask]>
Date: Tue, 4 Apr 2017 16:59:34 -0700
Content-Type: text/plain
> > Moving to ZFS...
> > ZFS is also scary...
>
> Heh - another soon to be victim of ZFS on linux :)
> 

No kidding. Former victim of XLV+XFS (remember XLV?), former
victim of LV+EFS, former victim of ext2, ext3, reiserfs, former
victim of LVM, current victim of mdadm/raid5/6/ext4/xfs.

>
> You'll quickly realise that the majority of major features you'd expect
> to work - don't.
>

I am not big on "features". For me the main features are open()/read()/write()/close()
and mkdir()/rmdir()/readdir(), and those seem to work on all filesystems (a minimal
sketch is below, after the list). The next features are:
a) non-scary raid rebuild after a crash or disk failure,
b) "online" fsck.

>
> You can't grow a ZFS 'raid'. You're stuck with the number of disks you first start with.
>

We only have a few hardware configurations, all with a fixed number of disks, so this is not a problem:

a) single 120 GB SSD for the OS (/home on NFS)
b) single SSD for the OS, dual 4/6/8 TB HDDs for data, in a RAID1 configuration to protect against a single disk failure
c) dual SSDs for the OS and /home, dual HDDs for data, both in RAID1 configurations to protect against a single disk failure
d) single SSD for the OS, multiple (usually 8) 6-8 TB HDDs for data, mdadm raid6+xfs and now ZFS raidz2 (protection against a single disk failure plus the failure of a second disk during the raid rebuild); a rough creation sketch follows the list.
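
(For concreteness, a rough sketch in Python of how configurations like (b) and (d)
get created; the device names, array name and pool name here are made up, not our
actual layout:)

import subprocess

# Configuration (b): mdadm RAID1 pair for data, xfs on top.
subprocess.run(["mdadm", "--create", "/dev/md0", "--level=1",
                "--raid-devices=2", "/dev/sdb", "/dev/sdc"], check=True)
subprocess.run(["mkfs.xfs", "/dev/md0"], check=True)

# Configuration (d): 8-disk ZFS raidz2 pool (survives two disk failures).
subprocess.run(["zpool", "create", "tank", "raidz2",
                "sdb", "sdc", "sdd", "sde", "sdf", "sdg", "sdh", "sdi"],
               check=True)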

Storage requirements are always known ahead of time; there is no need to add storage space
until the next major refresh, when we install CentOS N+1 and replace 2 TB disks with 6 TB disks,
4 TB disks with 8 TB disks, and so forth, always in pairs.

>
> You'll find out more as you go down this rabbit hole.
> 

There is no rabbit hole for me. If I end up not liking ZFS, I roll out a new machine
with a new filesystem and a new RAID, rsync the data over, and say bye-bye to ZFS.
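
(The "bye-bye" step is nothing more exotic than roughly the following;
the destination host and path are made up:)

import subprocess

# Copy everything, preserving permissions, hard links, ACLs and xattrs,
# then retire the old filesystem.
subprocess.run(["rsync", "-aHAX", "--numeric-ids",
                "/data/", "newhost:/data/"], check=True)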

>
> > BTRFS is even better (on paper), but not usable in el7.3...
> 
> DO NOT USE RAID5/6 WITHIN BTRFS.
> 

BTRFS is billed as "the open source replacement for ZFS", but after testing it,
my impression is that it is only used by a couple of enthusiasts
in single-disk laptop configurations. In a single-disk system, it is not
clear how btrfs/zfs is better than plain old ext4/xfs. And in any case, a
single-disk system is not appropriate for production use, as there is no
protection against a single-disk failure. (Based on existing failure-rate
data, one can make the argument that SSDs never fail and do not require
protection against single-disk failure.)


> I have tried this before and have the many Gb of lost data when it goes
> wrong. In fact, I discovered several new bugs that I lodged with the
> BTRFS guys - which led to warnings of DO NOT USE PARITY BASED RAID
> LEVELS IN BTRFS becoming the official line.

No lost data here, yet. All home directories are backed up, and for the data directories
the experiments are required to make second copies of their data (usually
on cloud-type storage).


> However, BTRFS is very stable if you use it as a simple filesystem. You
> will get more flexible results in using mdadm with btrfs on top of it.

For a simple filesystem for single-disk use we already have ext4 and xfs;
both work well enough, thank you very much.

For multiple-disk installations (pairs of RAID1 for single-disk-failure
protection and 3-4-8-disk sets of RAID6 (6-8-10 TB HDDs) for large capacity),
md raid does not cut it anymore, because a raid rebuild takes unreasonable
amounts of time (days) and md raid lacks self-healing functions (e.g. no rewriting
of bad sectors, no reasonable handling of disk errors, etc.). If the lights blink
during a rebuild (or you bump the power bar, or ...), you have some exciting
time recovering the raid array without any guarantee of data integrity
(no per-file checksums).
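
(By "per-file checksums" I mean, as a rough sketch, something like the following;
the file name is made up. ZFS does the equivalent per block inside the filesystem,
md raid has no concept of it:)

import hashlib

def file_sha256(path, bufsize=1 << 20):
    # Checksum in chunks so large data files need not fit in RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

# Record the checksum when the file is written...
recorded = file_sha256("/data/run0001.mid")

# ...and verify it later: a mismatch means silent corruption that an md raid
# rebuild would neither notice nor repair.
assert file_sha256("/data/run0001.mid") == recorded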


> mdadm can be a pain to tweak - but almost all problems are well known
> and documented - and unless you really lose all your parity, you'll be
> able to recover with much less data loss than most other concoctions.


mdadm does not cut it for 8x10 TB RAID6: a raid rebuild takes *days*.
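
(Rough arithmetic behind the "days" claim; the throughput figures below are
assumptions, not measurements of our arrays:)

# Rebuilding one failed member of an 8x10 TB RAID6 means streaming ~10 TB
# through every surviving disk.
bytes_total = 10e12        # one 10 TB member

best_case_rate = 180e6     # B/s, assumed sequential speed of a healthy HDD
loaded_rate = 40e6         # B/s, assumed rebuild rate with live I/O competing

print(bytes_total / best_case_rate / 3600)      # ~15 hours, ideal
print(bytes_total / loaded_rate / 3600 / 24)    # ~2.9 days, more realistic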


K.O.



> 
> > 
> > K.O.
> > 
> > 
> > On Tue, Apr 04, 2017 at 04:17:22PM +0200, David Sommerseth wrote:
> >> Hi,
> >>
> >> I just need some help to understand what might be the issue on a SL7.3
> >> server which today decided to disconnect two drives from a RAID 6 setup.
> >>
> >> First some gory details
> >>
> >> - smartctl + mdadm output
> >> <https://paste.fedoraproject.org/paste/wLyz44nipkJ7FgKxWk-1mV5M1UNdIGYhyRLivL9gydE=>
> >>
> >> - kernel log messages
> >> https://paste.fedoraproject.org/paste/mkyjZINKnkD4SQcXTSxyt15M1UNdIGYhyRLivL9gydE=
> >>
> >>
> >> The server is setup with 2x WD RE4 harddrives and 2x Seagate
> >> Constellation ES.3 drives.  All 4TB, all was bought brand new.  They're
> >> installed in a mixed pattern (sda: RE4, sdb: ES3, sdc: RE4, sdd: ES3)
> >> ... and the curious devil in the detail ... there are no /dev/sde
> >> installed on this system - never have been even, at least not on that
> >> controller.  (Later today, I attached a USB drive to make some backups -
> >> which got designated /dev/sde)
> >>
> >> This morning *both* ES.3 drives (sdb, sdd) got disconnected and removed
> >> from the mdraid setup.  With just minutes in between.  On drives which
> >> have been in production for less than 240 days or so.
> >>
> >> lspci details:
> >> 00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset
> >> Family SATA AHCI Controller (rev 05)
> >>
> >> Server: HP ProLiant MicroServer Gen8 (F9A40A)
> >>
> >> <https://www.hpe.com/us/en/product-catalog/servers/proliant-servers/pip.specifications.hpe-proliant-microserver-gen8.5379860.html>
> >>
> >>
> >> Have any one else experienced such issues?  Several places on the net,
> >> the ata kernel error messages have been resolved by checking SATA cables
> >> and their seating.  It just sounds a bit too incredible that two
> >> harddrives of the same brand and type in different HDD slots have the
> >> same issues but not at the exact same time (but close, though).  And I
> >> struggle to believe two identical drives just failing so close in time.
> >>
> >> What am I missing? :)  Going to shut down the server soon (after last
> >> backup round) and will double check all the HDD seating and cabling.
> >> But I'm not convinced that's all just yet.
> >>
> >>
> >> -- 
> >> kind regards,
> >>
> >> David Sommerseth
> > 
> 
> -- 
> Steven Haigh
> 
> Email: [log in to unmask]
> Web: https://www.crc.id.au
> Phone: (03) 9001 6090 - 0412 935 897
> 




-- 
Konstantin Olchanski
Data Acquisition Systems: The Bytes Must Flow!
Email: olchansk-at-triumf-dot-ca
Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada
