SCIENTIFIC-LINUX-USERS Archives

August 2013

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From: Nico Kadel-Garcia <[log in to unmask]>
Reply-To: Nico Kadel-Garcia <[log in to unmask]>
Date: Sat, 3 Aug 2013 22:56:08 -0400
Content-Type: text/plain
On Sat, Aug 3, 2013 at 9:28 PM, John Lauro <[log in to unmask]> wrote:
> ----- Original Message -----
>> From: "Nico Kadel-Garcia" <[log in to unmask]>
>>
>> It's exceedingly dangerous in a production environment. I've helped
>> run, and done OS specifications and installers for a system over
>> 10,000 hosts. and you *never*, *never*, *never* auto-update them
>> without warning or outside the maintenance windows. *Never*. If I
>> caught someone else on the team doing that as a matter of policy, I
>> would have campaigned to have them fired ASAP.
>
>
> If you have to manage 10,000 hosts then you are lucky you never had to learn to deal with no maintenance window and 0 downtime, and so most of your maintenance had to be possible outside of a maintenance window.  That is how many IT shops with thousands of machines have to operate

No, you schedule the updates. A maintenance window is not the same as
scheduled downtime, and in larger environments you can schedule the
windows around a planned, well-defined set of updates.

For example, before allowing system-wide changes, you test them in a
lab against a variety of the services and hardware you use in the
field. And you don't test "whatever the upstream vendor happened to
publish lately, sight unseen, plus whatever they added between the
test and the permitted update". You set up a defined set of updates,
such as a yum mirror snapshot (for Scientific Linux or CentOS) or a
well-defined RHN configuration.
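
As a rough illustration of the snapshot approach (the mirror URL,
date layout, and repo filename are placeholders for whatever your
internal mirror uses, not anything SL ships):

#!/usr/bin/env python
# Rough sketch: point hosts at a dated, internally mirrored snapshot of
# the SL update repos, so each window rolls out exactly the package set
# that passed lab testing. Every URL, date, and path here is a placeholder.

SNAPSHOT = "2013-08-01"                          # the tested, frozen update set
MIRROR = "http://yum.example.com/sl-snapshots"   # hypothetical internal mirror

REPO = """[sl-updates-snapshot]
name=SL updates, frozen {date}
baseurl={mirror}/{date}/6x/$basearch/updates/security
enabled=1
gpgcheck=1
"""

with open("/etc/yum.repos.d/sl-updates-snapshot.repo", "w") as repofile:
    repofile.write(REPO.format(date=SNAPSHOT, mirror=MIRROR))

Hosts then install only from that frozen repo until the next tested
snapshot is promoted; RHN channels can serve the same purpose on the
Red Hat side.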

> these days.  You might even want to read up on Netflix's thoughts on chaos monkey.

I'm familiar with the concept, and it has uses. However, having a
chaos monkey in place does not reduce the risk that *network wide*
auto-updates will corrupt every operating system configuration at
once. I'm afraid I've had that happen: a kernel update introduced a
regression and took down over 1000 systems the same night. (A vendor
had changed hardware without notifying us, and the new kernel didn't
have the right drivers for it; the old kernel did.) Fortunately, it
happened during a well-defined maintenance window. And also
fortunately, I'd taken advantage of the old LILO "default" and "boot
once with a different setting" tools to boot new kernels in test mode
and retain the old kernel as the default after a power cycle if the
new kernel failed to boot.
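
Roughly, that safety net amounted to something like this, here driven
from a small Python wrapper (the "test" label is hypothetical; your
lilo.conf labels will differ):

#!/usr/bin/env python
# Sketch of the LILO "boot once" safety net. The "test" label for the new
# kernel is hypothetical; the known-good kernel stays the default= entry
# in /etc/lilo.conf.
import subprocess

subprocess.check_call(["lilo"])                # rebuild the map; old kernel stays default
subprocess.check_call(["lilo", "-R", "test"])  # boot the "test" label on the next reboot only
subprocess.check_call(["reboot"])
# If the new kernel hangs, a power cycle falls straight back to the old
# default; only after a clean boot do you promote the new kernel in lilo.conf.

GRUB has comparable one-shot mechanisms these days, but the principle
is the same: never make an untested kernel the permanent default.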

> Autoupgrades are just another form of random outage you might have to deal with.  As long as

And you deal with them by *turning them off* and scheduling the
updates, with a chance to assess them first. Leaving them enabled by
default is not "random": it's scheduled arbitrarily by the upstream
vendor, and even *they* publish release notes and provide an entire
system of their own (RHN, or Spacewalk if you use the free version)
to schedule them.

> you have different hosts upgrading on different days and times, and you have automated routines that test and take servers out of service automatically if things fail, then autoupgrades are perfectly fine. If things break from the autoupgrades, it becomes real obvious based on the update history which machines broke from it.

Gee, you mean you don't let systems update automatically without
planning, and you update different members at different scheduled
days and times? Why didn't I think of something like that? You must
be smart!
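
For what it's worth, making that staggering deterministic is trivial;
a minimal sketch (the wave count and hostnames are arbitrary
examples):

#!/usr/bin/env python
# Minimal sketch of deterministic update waves: hash each hostname into a
# wave so a bad update can only hit one wave per night. The wave count and
# host list are arbitrary examples.
import hashlib

WAVES = 4  # e.g. one wave per night across the maintenance week

def wave_for(hostname):
    return int(hashlib.md5(hostname.encode()).hexdigest(), 16) % WAVES

for host in ["web01.example.com", "web02.example.com", "db01.example.com"]:
    print("%-20s -> wave %d" % (host, wave_for(host)))

The same wave number can decide which night a host is allowed to pull
the frozen snapshot described earlier.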

> Campaigning to have someone fired without even hearing their reason for upgrading, or even warning them first that at your location it is standard practice never to autoupgrade because you have a separate QA process that even critical security patches must go through, is a very bad practice on your part.

Oh, he or she would get a chance to talk. But if they spouted the
"auto-updating is safe" mantra and refused to budge, I'd be on them
like white on rice. Touching production servers unannounced is a
serious no-no in any large network.

> I am not going to state what patch policy I use, only that different policies work for different environments.  Based on your statement, it sounds like you could be losing some valuable co-workers by lobbying to get people fired who have a different opinion from you, instead of trying to educate and/or learn from each other.  If you feel you can not learn from your peers, you have already proven you are correct in that respect, but you have also shown there is much you don't know by being incapable of learning new things.

Oh, if they're *trainable*, they might get a shot. But leaving out the
"schedule the updates so they don't all occur at once" part, as you
did at first, is pretty dangerous.

> (Personally I would hate to use Nagios for 10,000 hosts.  It didn't really scale that well IMHO, but to be honest I haven't bothered looking at it in over 4 years, and maybe it's improved.  Not familiar with Icinga, but I have had good luck with Zabbix for large scale)

Oh, you split it for a network that big! It handles a thousand hosts
reasonably well, and did even 10 years ago, if you don't go overboard
with overly frequent sampling and computationally expensive checks.
And for "nagios-plugin-check-updates", you only really need to run it
daily. (And maybe re-run it across a set of hosts after the updates
are installed, to catch any that got missed.)
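
Something like this rough sweep works for the re-run; the host list,
ssh access, and plugin path are assumptions for illustration, not
anything the plugin package guarantees:

#!/usr/bin/env python
# Sketch of a post-update sweep: run the check-updates plugin once more
# over each host and flag stragglers that still have pending packages.
import subprocess

HOSTS = ["web01.example.com", "web02.example.com"]
CHECK = "/usr/lib64/nagios/plugins/check_updates"  # path varies by packaging

for host in HOSTS:
    rc = subprocess.call(["ssh", host, CHECK])
    if rc != 0:  # Nagios plugins exit non-zero for WARNING/CRITICAL/UNKNOWN
        print("%s still reports pending updates (exit %d)" % (host, rc))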
