On Mon, 15 Jun 2009, Dr Andrew C Aitchison wrote:
> What do other groups do about updating applications and machines
> with long running processes ?
>
> My users run two sorts of long running processes, with different
> problems when it comes to updates.
>
> First, I have users who never log off. Thus applications like
> firefox and pdf viewers will be running when they are updated.
> Some time later these applications may try to load and run plugins
> which have been removed/updated.
>
> Second, I have users with long running calculations (often weeks
> or more) which would be interrupted if the machine were rebooted into an
> updated kernel. User-written code often check-points, so the actual
> calculation time lost is not significant, but calculations in
> commercial packages such as Mathematica and Maple are often less good about
> check-pointing.
>
> How do people balance the disruption of killing user processes
> against the need to update to the latest versions of software ?
>
> Thanks,

For security updates to things like firefox, where the user account might
be compromised by viewing an evil page, I tend to err on the side of doing
the updates asap and sorting out the complaints later. This probably
applies to most stuff where the errata mention 'critical' or where there is
a risk of arbitrary code execution or similar.

For stuff which I think won't cause problems for users, I will also
typically do the updates fairly quickly (this assumes I get it right, of
course).

Updates which would be disruptive but which I don't think affect us (e.g. a
security fix for a feature we don't use) I tend to accumulate until
something more important turns up, and then we apply them all together.

For updates which are disruptive (replacing important parts or requiring
reboots etc), we generally announce them to users about a week in advance,
and give them a chance to have the updates applied *sooner* if the day we
have picked would be bad for them. Typically we do reboots only on
Wednesday mornings unless we think it is sufficiently urgent to justify
doing it sooner.
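
A minimal sketch of the triage described above, written in Python 3 purely
as an illustration. The `Update` fields, the category labels, and the exact
seven-day notice are assumptions made for this sketch; only the
Wednesday-morning rule and the "about a week" notice come from the
description itself.

```python
from dataclasses import dataclass
from datetime import date, timedelta

NOTICE_DAYS = 7      # "about a week in advance"; the exact figure is assumed
REBOOT_WEEKDAY = 2   # Wednesday; date.weekday() counts Monday as 0


@dataclass
class Update:
    """Hypothetical description of a pending update."""
    critical: bool      # errata say 'critical' or arbitrary code execution
    disruptive: bool    # replaces important parts or requires a reboot
    affects_us: bool    # touches a feature that is actually in use


def triage(update: Update) -> str:
    """Return one of 'asap', 'soon', 'batch' or 'announce'."""
    if update.critical:
        return "asap"        # apply now, sort out complaints later
    if not update.disruptive:
        return "soon"        # harmless to users, apply fairly quickly
    if not update.affects_us:
        return "batch"       # hold until something more important turns up
    return "announce"        # warn users, then apply on a reboot day


def reboot_date(announced: date) -> date:
    """First Wednesday at least NOTICE_DAYS after the announcement."""
    earliest = announced + timedelta(days=NOTICE_DAYS)
    days_ahead = (REBOOT_WEEKDAY - earliest.weekday()) % 7
    return earliest + timedelta(days=days_ahead)
```

An update announced on Monday 15 Jun 2009 would, under these assumptions,
be scheduled for Wednesday 24 Jun 2009. The sketch leaves out the option for
a user to ask for an earlier date.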

In recent years the number of people who are upset by the announced
reboots has gone down, though a few people clearly don't read our news
items (and are hence surprised/upset), so we plan to also have an opt-in
mailing list for 'important' items.

BTW the default reboot/shutdown procedures in el5/sl5 don't give user
processes very long to checkpoint themselves, and I *think* that
networking may have been turned off by the time they get signalled. We
ended up adding an extra shutdown script which runs fairly early, sends
SIGTERM to all user processes, and gives them a short time to save state
before carrying on with the shutdown/reboot.

I'm not sure if it was any different in earlier versions, but we got more
complaints after the update to sl5...
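
The early-shutdown step described above could look roughly like the
following Python 3 sketch. It is not the script referred to in this post:
the UID threshold of 500 for "user" accounts, the 30-second grace period,
and all function names are assumptions chosen for illustration. It scans
/proc for processes owned by ordinary users, sends each one SIGTERM, and
then waits a bounded time for them to exit.

```python
import os
import signal
import time

MIN_USER_UID = 500   # assumption: ordinary accounts start at UID 500
GRACE_SECONDS = 30   # assumption: the "short time" is not given in the post


def user_pids(min_uid=MIN_USER_UID):
    """Return PIDs of processes owned by ordinary (non-system) users."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            # /proc/<pid> is normally owned by the process's effective UID.
            uid = os.stat(os.path.join("/proc", entry)).st_uid
        except OSError:          # process exited while we were scanning
            continue
        if uid >= min_uid and int(entry) != os.getpid():
            pids.append(int(entry))
    return pids


def alive(pid):
    """True if the process exists and is not a zombie."""
    try:
        with open("/proc/%d/stat" % pid) as fh:
            stat = fh.read()
    except OSError:
        return False
    # The state letter follows the parenthesised command name.
    state = stat.rsplit(")", 1)[1].split()[0]
    return state != "Z"


def terminate_and_wait(pids, grace=GRACE_SECONDS, poll=0.5):
    """Send SIGTERM to each PID, then wait up to `grace` seconds.

    Returns the PIDs that are still running when the grace period ends.
    """
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except OSError:          # already gone, or not ours to signal
            pass
    deadline = time.monotonic() + grace
    remaining = [p for p in pids if alive(p)]
    while remaining and time.monotonic() < deadline:
        time.sleep(poll)
        remaining = [p for p in remaining if alive(p)]
    return remaining
```

Run as root early in the shutdown sequence, `terminate_and_wait(user_pids())`
would return as soon as every signalled process has exited, or after the
grace period at the latest, so the rest of the shutdown is delayed by a
bounded amount. Jobs only benefit if they actually handle SIGTERM by saving
their state.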
--
/--------------------------------------------------------------------\
| "Computers are different from telephones. Computers do not ring." |
| -- A. Tanenbaum, "Computer Networks", p. 32 |
|--------------------------------------------------------------------|
| Jon Peatfield, _Computer_ Officer, DAMTP, University of Cambridge |
| Mail: [log in to unmask] Web: http://www.damtp.cam.ac.uk/ |
\--------------------------------------------------------------------/