SCIENTIFIC-LINUX-USERS Archives

February 2015

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Connie Sieh <[log in to unmask]>
Reply To:
Connie Sieh <[log in to unmask]>
Date:
Thu, 12 Feb 2015 17:24:50 -0600
Content-Type:
text/plain
On Thu, 12 Feb 2015, Yasha Karant wrote:

> I always run an "enterprise" environment on any server, including our
> GPU compute engine for research applications.  This is not a
> testbed machine per se, although we must load new drivers and new
> concurrent/GPU implementation methodologies as these evolve.  The base of the
> GPU engine is CUDA.  New compute applications, often from other problem
> domain areas, typically are run (sometimes ported) to this compute engine.
>
> We recently started the transition from SL 6 to SL 7; a colleague here
> was doing the work.  He has numerous comments, posted below,
> and is now insisting that SL (i.e., RHEL) 7 is not suitable for
> production use in our environment, but that OpenSuSE, Debian, or Mint
> are more suitable environments.
> I personally disagree, but I would greatly appreciate commentary,
> particularly from anyone who runs other Linux distributions in a
> production server environment.
> We must support CUDA, some variety of MPI, and operational Infiniband
> drivers and services.
>
> Comments (only lightly "cleaned up"):
>
> so i verified that the drive indeed has a bad superblock - open suse did
> not hesitate to mount
> because the drive was not in fstab, sl6 had mounted it previously
> because drives only
> get fsckd every (usually) 20 reboots
>
> so this drive reached the 20 reboot threshold and fsck failed with bad
> superblock -
>
> so far so good. the problem is - sl6 refused to mount the root drive rw
> in the emergency shell,
> but also refused to do anything other than reboot once a drive that is
> known not to be the
> system drive failed fsck (and it knew this was not the sys drive because
> it had already mounted
> root to get at fstab)
>
> the sane, competent, safe solution to a drive problem is to not mount
> that drive, not refuse
> to boot. this is a drive failure with bad superblock - however it is a data drive,
> in no way needed to
> boot the system.
>
> over many trials, it became clear that:
>
> 1> the drive is in fstab, so system tries fsck which fails into a shell
> - there appears to be no
> way to tell the system to continue to load, since manual fsck also fails
> - reboot leads to the same
> problem
>
> 2> removed the drive - does not help, still tries to fsck and fails, and
> refuses to continue to load
>
> 3> tried to edit fstab from shell to rem out drive - could not edit,
> drive was mounted readonly,
> could not change
>
> 4> tried to boot from the sl7 live/install usb key - did not let me get
> to a shell, did not want to
> go ahead and install on top of current system

Did you select the "Troubleshooting/Rescue A SL System" boot option?

After boot you get the choice of "Continue/Read_Only/shell".  Continue
will mount your partitions rw if able.  From there you can edit the fstab.
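
As a rough illustration of the kind of fstab edit that keeps a failed data
drive from blocking boot, here is a minimal sketch in Python.  It is not
taken from this thread: the helper name relax_fstab_entry, the device
/dev/sdb1 and the mount point /data are all hypothetical.  It assumes the
standard six-field fstab layout, adds the "nofail" mount option to the
chosen entry and sets its sixth field (fs_passno) to 0 so that the entry is
not checked by fsck at boot.  Whether that is appropriate for a given
system is a judgment call, and the root filesystem must already be mounted
rw before the result can be written back.

```python
def relax_fstab_entry(fstab_text: str, mount_point: str) -> str:
    """Return fstab text in which the entry for mount_point is non-fatal.

    Hypothetical helper: adds 'nofail' to the mount options and sets
    fs_passno (sixth field) to 0 so the entry is skipped by boot-time fsck.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Leave blank lines, comments, malformed lines and other entries alone.
        if (not fields or fields[0].startswith("#")
                or len(fields) < 4 or fields[1] != mount_point):
            out.append(line)
            continue
        # fs_freq and fs_passno are optional and default to 0; pad them in.
        fields += ["0"] * (6 - len(fields))
        opts = fields[3].split(",")
        if "nofail" not in opts:
            opts.append("nofail")
        fields[3] = ",".join(opts)
        fields[5] = "0"  # fs_passno 0: do not fsck this filesystem at boot
        out.append("\t".join(fields[:6]))
    return "\n".join(out) + "\n"


# Example input: /dev/sdb1 and /data are made-up names for the data drive.
sample = (
    "/dev/sda1 / ext4 defaults 1 1\n"
    "/dev/sdb1 /data ext4 defaults 1 2\n"
)
print(relax_fstab_entry(sample, "/data"), end="")
```

The sketch only transforms text and never touches /etc/fstab itself, so it
can be tried on a copy of the file first.  Commenting the entry out with a
leading "#" is the simpler alternative when the drive is not coming back.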

>
> 5> created open suse usb key - this allowed me to boot, mount the raid1
> drives, edit fstab -
> whereupon the system was able to boot
>
> What kind of system is unable to deal gracefully with a failed data drive?
>
> My conclusion - Scientific Linux is too fragile a system for serious use
>

-- 
Connie J. Sieh
Computing Services Specialist III

Fermi National Accelerator Laboratory
630 840 8531 office

http://www.fnal.gov
[log in to unmask]
