SCIENTIFIC-LINUX-USERS Archives

August 2021

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From: Yasha Karant <[log in to unmask]>
Reply-To: Yasha Karant <[log in to unmask]>
Date: Tue, 10 Aug 2021 15:34:00 -0700
Content-Type: text/plain
Parts/Attachments: text/plain (102 lines)
One SSD had an internal short and turned into a space heater; luckily
there was no fire. End excerpt.

Clearly, there is very poor safety engineering and/or quality control at
work (as with certain Li batteries that did similar things in personal
devices being operated by the user).  Presumably that SSD was inside a
rack-mounted disk farm, with fire extinguishers and possibly a
machine-room fire-suppression system at hand.  Had it been inside a
laptop instead, things could have had a much worse outcome, as most
laptops contain combustible materials.

As for the small amount of storage: the commentator is at a reasonably
well-funded (through government sources and possibly tax-deductible or
glamour philanthropy) HEP facility.  Much of the world, including
non-collaboration-funded university research facilities, has rather poor
funding; this holds at most institutions within the USA (not all faculty
members can be at Harvard, Stanford, etc.), while administrative and some
instructional facilities typically can get much more.  Many universities
now outsource to paid "cloud" storage, with all of the issues that may entail.

On 8/10/21 3:08 PM, Konstantin Olchanski wrote:
> Hi, Larry, thank you for this information, it is always good to see
> how other people do things.
> 
> I am surprised at how little storage you have, only a handful of TBs.
> 
> Here, for each experiment data acquisition station, we now configure
> 2x1TB SSD for OS, home dirs, apps, etc. and 2x 8/10/12TB HDD for recording
> experiment data. We use "sort by price" NAS CMR HDDs (WD Red, etc.).
> 
> All disks are doubled up as Linux mdadm RAID1 (mirror) or ZFS mirror. This is
> to prevent any disruption of data taking from a single-disk failure.
> 
> (it is important to configure the boot loader on both SSDs, so the machine
> can still boot even if either SSD is dead).
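
A minimal sketch of that layout, assuming the two SSDs appear as /dev/sda
and /dev/sdb and the data HDDs as /dev/sdc and /dev/sdd (hypothetical
device names; adjust to the actual hardware):

  # Build the two-SSD mdadm RAID1 (mirror) from one partition on each disk.
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

  # Install the boot loader on BOTH SSDs (BIOS/MBR case), so the machine
  # still boots when either one is dead (the point made above).
  grub-install /dev/sda
  grub-install /dev/sdb

  # The equivalent ZFS mirror for the data pair:
  zpool create data mirror /dev/sdc /dev/sdd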
> 
> I am surprised you use 1TB HDDs. We switched to SSDs up to 2TB size (WD Blue SATA SSDs).
> 
> Failure rates of HDDs: the only reliable data is from Backblaze:
> https://www.backblaze.com/b2/hard-drive-test-data.html
> and
> https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2021/
> 
> Failure rates of SSDs seem to be very low; I only have 2-3 failed SSDs. One SSD had an
> internal short and turned into a space heater; luckily there was no fire.
> 
> For backups of OS and home dirs we use amanda and rsync+ZFS snapshots. Backups
> of experiment data are not our responsibility (many experiments use USB HDDs).
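
A rough sketch of that backup style, with made-up dataset and path names:

  # Copy the home dirs into a ZFS dataset, then take a dated snapshot;
  # each snapshot is a cheap, read-only point-in-time copy.
  rsync -a --delete /home/ /backup/homes/
  zfs snapshot backup/homes@$(date +%Y-%m-%d)

  # Review accumulated snapshots before pruning old ones:
  zfs list -t snapshot -r backup/homes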
> 
> 
> K.O.
> 
> 
> On Tue, Aug 10, 2021 at 10:55:35AM -0400, Larry Linder wrote:
>> There are 25 systems in our shop, all Linux based, plus a Linux-based server
>> and a Synology Disk Station running RAID 1.  The Disk Station has 12 TB
>> of space: 6 TB on each side of the RAID 1 mirror.
>>
>> We buy only one brand of disk with the black label.  They are typically
>> 1 TB.
>>
>> User boxes have an SSD for the OS, a 2 TB disk for the user's
>> space, 32 GB RAM, and a quad- or six-core AMD processor.  The graphics
>> boxes get a video card with lots of RAM; 3D rendering on a slow video
>> card wastes a lot of the user's time.
>>
>> The server has an SSD for the OS and 6 TB for user apps and
>> libraries (/usr/local and /opt).  It also has a mirror disk that keeps a
>> copy of the server locally.
>>
>> These systems are on 24/7 and accumulate a lot of hours.  No matter
>> the make, mechanical disks have a finite life span.  For grins I used to do
>> a post-mortem on disks that failed.  There were two types of failure:
>> the spring that returns the arm holding the heads cracks, or the main
>> bearings wear out.  Newer disks seem to have a lower rate of
>> bearing failure.
>> To prevent operational problems we just swap out the disk in each box at
>> about 5,000 to 7,000 hr.  The manufacturer says they are good for 10,000
>> hr; see the fine print in the warranty.  You have to remember this is a
>> money-making operation and downtime is costly.
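
One way to track those hours is SMART attribute 9 (Power_On_Hours), for
example with smartmontools (/dev/sda is an illustrative device name):

  # Print the drive's accumulated power-on hours from its SMART data.
  smartctl -A /dev/sda | grep -i power_on_hours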
>>
>> Backups run at 12:29 and 0:29 in the AM.  At the end of the morning
>> backup, a copy is sent to a remote site.
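
In cron syntax that schedule might look like the following; the script
path and the remote host name are hypothetical:

  # /etc/crontab: local backups at 0:29 and 12:29, then an offsite
  # copy once the morning run has had time to finish.
  29 0  * * *  root  /usr/local/sbin/backup.sh
  29 12 * * *  root  /usr/local/sbin/backup.sh
  29 1  * * *  root  rsync -a /backup/ offsite-host:/backup/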
>>
>> For security we shut down the network at 6:20 PM, bring it up at 0:01 AM,
>> and shut it down again after the backup is complete.  We bring it back up
>> at 6:45 AM.
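
A crude sketch of that timed cutoff as cron jobs, assuming the uplink
interface is eth0 (an illustrative name):

  # Drop the uplink at 6:20 PM; restore it at 6:45 AM.
  20 18 * * *  root  ip link set eth0 down
  45 6  * * *  root  ip link set eth0 up
  # Brief overnight window for the backup run; the backup script itself
  # would take the link back down when it finishes.
  1 0   * * *  root  ip link set eth0 up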
>> Ten years ago we had a fixed IP and the Chinese found it by just
>> continually pounding on the door.  The return IP was 4 hops to a city
>> north east of Shanghai.  They had installed a rootkit on our server and
>> disabled cron.  When you changed the passwd on the server, a few
>> milliseconds later it was sent to China.  We got rid of the fixed IP and
>> reloaded all the systems.  So when you shut down the network to your
>> provider, the next time you start it you get a different IP.
>>
>> We don't give the disks away, as they contain a lot of design data,
>> SW, CAD programs, part programs for our mill, etc.  We donate them to a
>> charity that drills the disks and recycles the rest.
>>
>> Larry Linder
> 
