SCIENTIFIC-LINUX-USERS Archives

February 2015

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Yasha Karant <[log in to unmask]>
Reply To:
Yasha Karant <[log in to unmask]>
Date:
Thu, 12 Feb 2015 13:50:26 -0800
I always run an "enterprise" environment on any server, including our 
GPU compute engine for research applications.  This is not a testbed 
machine per se, although we must load new drivers and new 
concurrent/GPU implementation methodologies as these evolve.  The base 
of the GPU engine is CUDA.  New compute applications, often from other 
problem domain areas, typically are run on (sometimes ported to) this 
compute engine.

We recently started the transition from SL 6 to SL 7; a colleague here 
was doing the work.  He has numerous comments, posted below,
and is now insisting that SL (i.e., RHEL) 7 is not suitable for 
production use in our environment, but that OpenSuSE, Debian, or Mint 
are more suitable environments.
I personally disagree, but I would greatly appreciate commentary, 
particularly from anyone who runs other Linux distributions in a 
production server environment.
We must support CUDA, some variety of MPI, and operational Infiniband 
drivers and services.

Comments (only lightly "cleaned up"):

so i verified that the drive indeed has a bad superblock - open suse did 
not hesitate to mount
because the drive was not in fstab, sl6 had mounted it previously 
because drives only
get fsckd every (usually) 20 reboots

so this drive reached the 20 reboot threshold and fsck failed with bad 
superblock -

so far so good. the problem is - sl6 refused to mount the root drive rw 
in the emergency shell,
but also refused to do anything other than reboot once a drive that is 
known not to be the
system drive failed fsck (and it knew this was not the sys drive because 
it had already mounted
root to get at fstab)

the sane, competent, safe solution to a drive problem is to not mount 
that drive, not refuse
to boot. drive failure with bad superblock - however it is a data drive, 
in no way needed to
boot the system.

over many trials, it became clear that:

1> the drive is in fstab, so system tries fsck which fails into a shell 
- there appears to be no
way to tell the system to continue to load, since manual fsck also fails 
- reboot leads to the same
problem

2> removed the drive - does not help, still tries to fsck and fails, and 
refuses to continue to load

3> tried to edit fstab from shell to rem out drive - could not edit, 
drive was mounted readonly,
could not change

4> tried to boot from the sl7 live/install usb key - did not let me get 
to a shell, did not want to
go ahead and install on top of current system

5> created open suse usb key - this allowed me to boot, mount the raid1 
drives, edit fstab -
whereupon the system was able to boot
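
For reference, the failure in steps 1> through 3> depends on how the data drive is listed in fstab. Per fstab(5), a sixth field (fsck pass number) of 0 means the filesystem is not checked at boot, and the "nofail" option marks a filesystem as not required. Whether either setting would have avoided this particular failure is an assumption, not something verified on this machine. The following is only a minimal sketch that flags non-root entries that are checked at boot and lack "nofail"; the function name, device names, and mount points are hypothetical.

```python
def risky_fstab_entries(fstab_text):
    """Return mount points of non-root, non-swap entries that are
    fsck'd at boot (pass number != 0) and lack the 'nofail' option."""
    risky = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        mount_point = fields[1]
        fs_type = fields[2]
        options = fields[3].split(",")
        # The sixth field is optional; a missing value is treated as 0.
        fsck_pass = int(fields[5]) if len(fields) > 5 else 0
        if mount_point == "/" or fs_type == "swap":
            continue  # the root filesystem and swap are out of scope here
        if fsck_pass != 0 and "nofail" not in options:
            risky.append(mount_point)
    return risky


# Hypothetical fstab contents, for illustration only.
EXAMPLE = """\
/dev/sda1  /         ext4  defaults         1 1
/dev/md0   /data     ext4  defaults         1 2
/dev/sdc1  /scratch  ext4  defaults,nofail  0 0
"""

print(risky_fstab_entries(EXAMPLE))
```

In this made-up example only /data is reported: it is checked at boot and is not marked "nofail", so a failed fsck on it could block startup in the way described above.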

What kind of system is unable to deal gracefully with a failed data drive?

My conclusion - Scientific Linux is too fragile a system for serious use
