SCIENTIFIC-LINUX-USERS Archives

August 2021

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

From: Konstantin Olchanski <[log in to unmask]>
Date: Tue, 10 Aug 2021 15:08:25 -0700
Hi, Larry, thank you for this information, it is always good to see
how other people do things.

I am surprised at how little storage you have, only a handful of TBs.

Here, for each experiment data acquisition station, we now configure
2x1 TB SSDs for the OS, home dirs, apps, etc. and 2x8/10/12 TB HDDs for
recording experiment data. We use "sort by price" NAS CMR HDDs (WD Red, etc).

All disks are doubled up as Linux mdadm RAID1 (mirror) or ZFS mirror. This is
to prevent any disruption of data taking from a single-disk failure.

(It is important to install the boot loader on both SSDs so the machine
can still boot if the other SSD is dead.)
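As a rough sketch of that setup (device and partition names like /dev/sda2
are assumptions; adjust for your hardware, and note these commands are
destructive):

```shell
# Mirror the two OS SSDs with mdadm RAID1. Metadata format 1.0 keeps the
# superblock at the end of the partition, which helps boot compatibility.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      --metadata=1.0 /dev/sda2 /dev/sdb2

# Install GRUB on BOTH drives, so the box still boots if either one dies.
# (On RHEL-family systems like Scientific Linux the command is grub2-install.)
grub2-install /dev/sda
grub2-install /dev/sdb
```

Without the second grub2-install, a mirror protects the data but not the
ability to boot: the surviving SSD has the OS but no boot loader.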

I am surprised you use 1 TB HDDs. We have switched to SSDs of up to 2 TB
(WD Blue SATA SSDs).

For HDD failure rates, the only reliable public data is from Backblaze:
https://www.backblaze.com/b2/hard-drive-test-data.html
and
https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2021/

SSD failure rates seem to be very low; I have only seen 2-3 failed SSDs. One
had an internal short and turned into a space heater; luckily there was no fire.

For backups of the OS and home dirs we use Amanda and rsync+ZFS snapshots.
Backups of experiment data are not our responsibility (many experiments use
USB HDDs).
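A minimal sketch of the rsync-into-ZFS-snapshot approach (host, pool, and
dataset names here are hypothetical):

```shell
# Pull home dirs from the DAQ host onto a ZFS dataset, preserving
# permissions, hard links, ACLs, and xattrs.
rsync -aHAX --delete root@daqhost:/home/ /tank/backups/daqhost/home/

# Freeze the result as a read-only snapshot named by date.
zfs snapshot tank/backups/daqhost@$(date +%Y-%m-%d)

# Old snapshots remain browsable under .zfs/snapshot/ until destroyed.
zfs list -t snapshot -r tank/backups
```

Because ZFS snapshots are copy-on-write, each nightly snapshot only costs
the space of the blocks that changed since the previous one.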


K.O.


On Tue, Aug 10, 2021 at 10:55:35AM -0400, Larry Linder wrote:
> There are 25 systems in our shop, all Linux based, plus a Linux based
> server and a Synology DiskStation running RAID 1.  The DiskStation has
> 12 TB of space, 6 TB on each side of the mirror.
> 
> We buy only one brand of disk with the black label.  They are typically
> 1 TB.
> 
> User boxes have an SSD for the OS, a 2 TB disk for the user's space,
> 32 GB of RAM, and a quad- or six-core AMD processor.  The graphics
> boxes get a video card with lots of RAM: 3D rendering on a slow video
> card wastes a lot of the user's time.
> 
> The server has a SSD for the OS and 6 TB for user apps /
> library /usr/local and /opt.  It also has a mirror disk that keeps a
> copy of the server locally.
> 
> These systems are on 24/7 and accumulate a lot of hours.  No matter
> what the make, mechanical disks have a life span.  For grins I used to
> do a post-mortem on disks that failed.  There were two types of
> failures: the spring that returns the arm holding the heads cracks, and
> the main bearings wear out.  Newer disks seem to have a lower bearing
> failure rate.
> To prevent operational problems we just swap out the disk in each box
> at about 5,000 to 7,000 hours.  The manufacturer says they are good for
> 10,000 hours; see the fine print in the warranty.  You have to remember
> this is a money-making operation and down time is costly.
> 
> Backups run at 12:29 and 0:29 in the AM.  At the end of the morning
> backup a copy is sent to a remote site.
> 
> For security we shut down the network at 6:20 PM, bring it up at
> 0:01 AM, and shut it down again after the backup is complete.  We bring
> it back up at 6:45 AM.
> Ten years ago we had a fixed IP and the Chinese found it by just
> continually pounding on the door.  The return IP was 4 hops to a city
> northeast of Shanghai.  They had installed a root kit on our server and
> disabled cron.  When you changed the password on the server, a few
> milliseconds later it was sent to China.  We got rid of the fixed IP
> and reloaded all the systems.  So when you shut down the network to
> your provider, the next time you start it you get a different IP.
> 
> We don't give the disks away as they contain a lot of design data:
> software, CAD programs, part programs for our mill, etc.  We donate
> them to a charity that drills the disks and recycles the rest.
> 
> Larry Linder
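For context on the replacement schedule quoted above: at 24/7 operation those
power-on-hour figures translate to well under a year per drive. A quick
check (plain arithmetic; 30.44 is the average days per month, 365.25/12):

```shell
# Convert drive power-on hours to calendar time at a 24/7 duty cycle.
for hours in 5000 7000 10000; do
    awk -v h="$hours" \
        'BEGIN { d = h / 24; printf "%d h = %.0f days = %.1f months\n", h, d, d / 30.44 }'
done
```

So the 5,000-7,000 hour swap-out policy means replacing every drive roughly
every 7-10 months, well short of the manufacturer's 10,000-hour (about
14-month) figure.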

-- 
Konstantin Olchanski
Data Acquisition Systems: The Bytes Must Flow!
Email: olchansk-at-triumf-dot-ca
Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada
