SCIENTIFIC-LINUX-USERS Archives

February 2015

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
"Hoffmann, Tony" <[log in to unmask]>
Reply To:
Hoffmann, Tony
Date:
Thu, 12 Feb 2015 14:48:21 -0800
Wouldn't booting with the fastboot option on the grub command line help with this?   That should skip the fsck of the file systems.



Also, the issue of not being able to edit fstab should have been solvable by just doing a mount -o remount,rw /
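
For illustration only, here is a rough sketch of how that could look from the emergency shell. The helper name disable_fstab_entry and the /data mount point are made up for the example, and it assumes cp and awk are usable from that shell; it is not a tested recovery procedure for this particular machine.

```shell
# Hypothetical helper: comment out the fstab entry for one mount point.
# Usage: disable_fstab_entry <fstab file> <mount point>
disable_fstab_entry() {
    fstab_file=$1
    mount_point=$2
    # Keep a backup, then rewrite the file in place from that backup.
    cp -p "$fstab_file" "$fstab_file.bak" || return 1
    # Prefix '#' to non-comment lines whose second field is the mount point;
    # every other line is copied through unchanged.
    awk -v mp="$mount_point" '
        $1 !~ /^#/ && $2 == mp { print "#" $0; next }
        { print }
    ' "$fstab_file.bak" > "$fstab_file"
}

# Assumed sequence in the emergency shell (/data is a placeholder):
#   mount -o remount,rw /
#   disable_fstab_entry /etc/fstab /data
#   reboot
```

Editing the file by hand with vi after the remount does the same job; the point is only that root has to be writable first.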



The comments are all kinda sys admin 101 stuff to my mind.  Maybe the environment is more complicated than what the comments lead me to think.



Tony



-----Original Message-----

From: [log in to unmask] [mailto:[log in to unmask]] On Behalf Of Yasha Karant

Sent: February-12-15 1:50 PM

To: [log in to unmask]

Subject: Issues with SL7



I always run an "enterprise" environment on any server, including our GPU compute engine for research applications.  This is not a testbed machine per se, although we must load new drivers and new concurrent/GPU implementation methodologies as these evolve.  The base of the GPU engine is CUDA.  New compute applications, often from other problem domain areas, typically are run (sometimes ported) to this compute engine.



We recently started the transition from SL 6 to SL 7; a colleague here was doing the work.  He has numerous comments, posted below, and is now insisting that SL (i.e., RHEL) 7 is not suitable for production use in our environment, but that OpenSuSE, Debian, or Mint are more suitable environments.

I personally disagree, but I would greatly appreciate commentary, particularly from anyone who runs other Linux distributions in a production server environment.

We must support CUDA, some variety of MPI, and operational Infiniband drivers and services.



Comments (only lightly "cleaned up")



so i verified that the drive indeed has a bad superblock - open suse did not hesitate to mount because the drive was not in fstab, sl6 had mounted it previously because drives only get fsckd every (usually) 20 reboots



so this drive reached the 20 reboot threshold and fsck failed with bad superblock -



so far so good. the problem is - sl6 refused to mount the root drive rw in the emergency shell, but also refused to do anything other than reboot once a drive that is known not to be the system drive failed fsck (and it knew this was not the sys drive because it had already mounted root to get at fstab)



the sane, competent, safe solution to a drive problem is to not mount that drive, not refuse to boot. this was a drive failure with bad superblock - however it is a data drive, in no way needed to boot the system.



over many trials, it became clear that:



1> the drive is in fstab, so system tries fsck which fails into a shell
   - there appears to be no way to tell the system to continue to load, since manual fsck also fails
   - reboot leads to the same problem



2> removed the drive - does not help, still tries to fsck and fails, and refuses to continue to load



3> tried to edit fstab from shell to rem out drive - could not edit, drive was mounted readonly, could not change



4> tried to boot from the sl7 live/install usb key - did not let me get to a shell, did not want to go ahead and install on top of current system



5> created open suse usb key - this allowed me to boot, mount the raid1 drives, edit fstab - whereupon the system was able to boot



What kind of system is unable to deal gracefully with a failed data drive?



My conclusion - Scientific Linux is too fragile a system for serious use

