SCIENTIFIC-LINUX-USERS Archives

July 2011

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Kelsey Cummings <[log in to unmask]>
Reply To:
Kelsey Cummings <[log in to unmask]>
Date:
Tue, 19 Jul 2011 22:04:35 -0700
This isn't exactly a Scientific Linux issue, but I hope that folks here
may be more likely to be using IPMI than those on some of the other lists.

We have a series of Supermicro systems with IPMI running RHEL 5.5.  We're
using IPMI primarily to monitor PSU status and for the hardware watchdog
support, teamed with the watchdog service.  Three or four of eight
identical systems have exhibited resets triggered by the hardware
watchdog for no apparent reason.

As best we can tell, the OS and hardware are perfectly healthy (no
other errors, and the systems work fine after the watchdog is
disabled), yet the hardware watchdog is triggering a reset on its own
and, worse, the boxes do not appear to come back from it.

Has anyone else seen a similar issue, or have any input?

So far Supermicro has suggested we disable the hardware watchdog...

/etc/watchdog.conf contains the minimal config:
interval = 10
realtime = yes
priority = 1
watchdog-device = /dev/watchdog
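
One thing worth checking with a config like this is how much slack
there is between the daemon's write interval and the hardware timer's
countdown. The following is only a minimal sketch: the helper names
parse_watchdog_conf and keepalive_margin are hypothetical, and the
timeout of 60 seconds is a made-up value for illustration, not
something read from these systems.

```python
def parse_watchdog_conf(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def keepalive_margin(settings, hardware_timeout_s):
    """Seconds of slack between daemon writes and the hardware countdown."""
    return hardware_timeout_s - int(settings["interval"])


# The config from this post, verbatim.
CONF = """\
interval = 10
realtime = yes
priority = 1
watchdog-device = /dev/watchdog
"""

settings = parse_watchdog_conf(CONF)

# ASSUMPTION: 60 is an illustrative timeout, not a value from these boxes.
margin = keepalive_margin(settings, hardware_timeout_s=60)
print("interval:", settings["interval"], "margin (s):", margin)
```

A margin at or near zero would mean the daemon has little or no room
to be late with a write; the real timeout has to come from the driver
or the BMC on the system in question.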

The SEL follows.  The Power Supply events are due to us remotely
resetting the power and, of course, all of these systems are at remote
POPs.

   3c | 07/02/2011 | 19:58:40 | Watchdog 2 #0xfe | Hard reset | Asserted
   3d | 07/02/2011 | 21:05:27 | Power Supply #0x14 | State Asserted
   3e | Pre-Init Time-stamp   | Physical Security #0x44 | General Chassis intrusion | Asserted
   3f | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-critical going low
   40 | Pre-Init Time-stamp   | Fan #0x0f | Lower Critical going low
   41 | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-recoverable going low
   42 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-critical going low
   43 | Pre-Init Time-stamp   | Fan #0x10 | Lower Critical going low
   44 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-recoverable going low
   45 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-critical going low
   46 | Pre-Init Time-stamp   | Fan #0x11 | Lower Critical going low
   47 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-recoverable going low
   48 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-critical going low
   49 | Pre-Init Time-stamp   | Fan #0x12 | Lower Critical going low
   4a | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-recoverable going low
   4b | 07/02/2011 | 21:11:30 | Watchdog 2 #0xfe | Hard reset | Asserted
   4c | 07/02/2011 | 21:13:39 | Power Supply #0x14 | State Asserted
   4d | Pre-Init Time-stamp   | Physical Security #0x44 | General Chassis intrusion | Asserted
   4e | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-critical going low
   4f | Pre-Init Time-stamp   | Fan #0x0f | Lower Critical going low
   50 | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-recoverable going low
   51 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-critical going low
   52 | Pre-Init Time-stamp   | Fan #0x10 | Lower Critical going low
   53 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-recoverable going low
   54 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-critical going low
   55 | Pre-Init Time-stamp   | Fan #0x11 | Lower Critical going low
   56 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-recoverable going low
   57 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-critical going low
   58 | Pre-Init Time-stamp   | Fan #0x12 | Lower Critical going low
   59 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-recoverable going low
   5a | 07/02/2011 | 21:18:53 | Watchdog 2 #0xfe | Hard reset | Asserted
   5b | 07/02/2011 | 21:26:55 | Power Supply #0x14 | State Asserted
   5c | Pre-Init Time-stamp   | Physical Security #0x44 | General Chassis intrusion | Asserted
   5d | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-critical going low
   5e | Pre-Init Time-stamp   | Fan #0x0f | Lower Critical going low
   5f | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-recoverable going low
   60 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-critical going low
   61 | Pre-Init Time-stamp   | Fan #0x10 | Lower Critical going low
   62 | Pre-Init Time-stamp   | Fan #0x10 | Lower Non-recoverable going low
   63 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-critical going low
   64 | Pre-Init Time-stamp   | Fan #0x11 | Lower Critical going low
   65 | Pre-Init Time-stamp   | Fan #0x11 | Lower Non-recoverable going low
   66 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-critical going low
   67 | Pre-Init Time-stamp   | Fan #0x12 | Lower Critical going low
   68 | Pre-Init Time-stamp   | Fan #0x12 | Lower Non-recoverable going low
   69 | 07/02/2011 | 21:35:02 | Watchdog 2 #0xfe | Timer expired | Asserted
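
For anyone who wants to pull the watchdog events out of a log like the
one above and see how far apart they are, here is a minimal sketch. It
assumes the pipe-separated layout shown in this post and that the date
column is MM/DD/YYYY; the function names are hypothetical, and records
marked "Pre-Init Time-stamp" are skipped because they carry no usable
time.

```python
from datetime import datetime


def parse_sel_line(line):
    """Return a dict for a timestamped SEL record, or None otherwise."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 4:
        return None
    try:
        # ASSUMPTION: dates are MM/DD/YYYY, as the log above suggests.
        when = datetime.strptime(fields[1] + " " + fields[2],
                                 "%m/%d/%Y %H:%M:%S")
    except ValueError:
        # "Pre-Init Time-stamp" records end up here.
        return None
    event = fields[4] if len(fields) > 4 else ""
    return {"id": fields[0], "time": when,
            "sensor": fields[3], "event": event}


def watchdog_gaps(lines):
    """Seconds between consecutive Watchdog records, in log order."""
    times = []
    for line in lines:
        record = parse_sel_line(line)
        if record and record["sensor"].startswith("Watchdog"):
            times.append(record["time"])
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


# A few records copied from the log above.
SAMPLE = """\
   3c | 07/02/2011 | 19:58:40 | Watchdog 2 #0xfe | Hard reset | Asserted
   3d | 07/02/2011 | 21:05:27 | Power Supply #0x14 | State Asserted
   3f | Pre-Init Time-stamp   | Fan #0x0f | Lower Non-critical going low
   4b | 07/02/2011 | 21:11:30 | Watchdog 2 #0xfe | Hard reset | Asserted
   5a | 07/02/2011 | 21:18:53 | Watchdog 2 #0xfe | Hard reset | Asserted
   69 | 07/02/2011 | 21:35:02 | Watchdog 2 #0xfe | Timer expired | Asserted
""".splitlines()

gaps = watchdog_gaps(SAMPLE)
print(gaps)
```

The gaps only measure time between logged watchdog events; they say
nothing about when the OS came back up in between, so they are a
starting point for correlation rather than a diagnosis.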


-- 
Kelsey Cummings - [log in to unmask]      sonic.net, inc.
System Architect                          2260 Apollo Way
707.522.1000                              Santa Rosa, CA 95407
