Date: Tue, 19 Jul 2011 22:04:35 -0700
Organization: Sonic.net, Inc

This isn't exactly a Scientific Linux issue, but I hope that folks here
may be more likely to be using IPMI than those on some of the other lists.
We have a series of Supermicro systems w/IPMI running RHEL 5.5. We're
using IPMI primarily to monitor PSU status and for the hardware watchdog
support teamed with the watchdog service. Three or four of our eight
identical systems have exhibited hardware-watchdog-triggered resets for
no apparent reason.
As best we can tell, despite the OS and hardware being perfectly healthy
(no other errors, and the systems work fine after the watchdog is
disabled), the hardware watchdog is triggering a reset on its own and,
worse, the boxes do not appear to come back from it.
Has anyone else seen a similar issue, or does anyone have any input?
So far Supermicro has suggested we disable the hardware watchdog...
/etc/watchdog.conf contains the minimal config:
interval = 10
realtime = yes
priority = 1
watchdog-device = /dev/watchdog
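
As a sanity check on a config like this, the rough Python sketch below parses the `key = value` lines and compares `interval` against a hardware timeout supplied by the caller. The daemon has to write to the watchdog device before the hardware timer expires, so the interval should sit comfortably below that timeout. The function names are made up for illustration, and the 60-second timeout in the usage lines is an assumed placeholder, not a value read from these systems.

```python
def parse_watchdog_conf(text):
    """Parse 'key = value' lines from a watchdog.conf-style string."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def interval_margin(settings, hw_timeout_s):
    """Seconds of slack between the write interval and the hardware timeout.

    hw_timeout_s must come from the caller; this sketch does not query
    the BMC or the kernel driver for the real value.
    """
    return hw_timeout_s - int(settings["interval"])


CONF = """\
interval = 10
realtime = yes
priority = 1
watchdog-device = /dev/watchdog
"""

settings = parse_watchdog_conf(CONF)
# 60 is an assumed timeout for illustration only.
print(interval_margin(settings, hw_timeout_s=60))
```

A small or negative margin would mean the daemon's writes cannot reliably arrive before the timer fires.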
The SEL follows. The Power Supply events are due to us remotely resetting
the power, and, of course, all of these systems are at remote POPs.
3c | 07/02/2011 | 19:58:40 | Watchdog 2 #0xfe | Hard reset | Asserted
3d | 07/02/2011 | 21:05:27 | Power Supply #0x14 | State Asserted
3e | Pre-Init Time-stamp | Physical Security #0x44 | General Chassis intrusion | Asserted
3f | Pre-Init Time-stamp | Fan #0x0f | Lower Non-critical going low
40 | Pre-Init Time-stamp | Fan #0x0f | Lower Critical going low
41 | Pre-Init Time-stamp | Fan #0x0f | Lower Non-recoverable going low
42 | Pre-Init Time-stamp | Fan #0x10 | Lower Non-critical going low
43 | Pre-Init Time-stamp | Fan #0x10 | Lower Critical going low
44 | Pre-Init Time-stamp | Fan #0x10 | Lower Non-recoverable going low
45 | Pre-Init Time-stamp | Fan #0x11 | Lower Non-critical going low
46 | Pre-Init Time-stamp | Fan #0x11 | Lower Critical going low
47 | Pre-Init Time-stamp | Fan #0x11 | Lower Non-recoverable going low
48 | Pre-Init Time-stamp | Fan #0x12 | Lower Non-critical going low
49 | Pre-Init Time-stamp | Fan #0x12 | Lower Critical going low
4a | Pre-Init Time-stamp | Fan #0x12 | Lower Non-recoverable going low
4b | 07/02/2011 | 21:11:30 | Watchdog 2 #0xfe | Hard reset | Asserted
4c | 07/02/2011 | 21:13:39 | Power Supply #0x14 | State Asserted
4d | Pre-Init Time-stamp | Physical Security #0x44 | General Chassis intrusion | Asserted
4e | Pre-Init Time-stamp | Fan #0x0f | Lower Non-critical going low
4f | Pre-Init Time-stamp | Fan #0x0f | Lower Critical going low
50 | Pre-Init Time-stamp | Fan #0x0f | Lower Non-recoverable going low
51 | Pre-Init Time-stamp | Fan #0x10 | Lower Non-critical going low
52 | Pre-Init Time-stamp | Fan #0x10 | Lower Critical going low
53 | Pre-Init Time-stamp | Fan #0x10 | Lower Non-recoverable going low
54 | Pre-Init Time-stamp | Fan #0x11 | Lower Non-critical going low
55 | Pre-Init Time-stamp | Fan #0x11 | Lower Critical going low
56 | Pre-Init Time-stamp | Fan #0x11 | Lower Non-recoverable going low
57 | Pre-Init Time-stamp | Fan #0x12 | Lower Non-critical going low
58 | Pre-Init Time-stamp | Fan #0x12 | Lower Critical going low
59 | Pre-Init Time-stamp | Fan #0x12 | Lower Non-recoverable going low
5a | 07/02/2011 | 21:18:53 | Watchdog 2 #0xfe | Hard reset | Asserted
5b | 07/02/2011 | 21:26:55 | Power Supply #0x14 | State Asserted
5c | Pre-Init Time-stamp | Physical Security #0x44 | General Chassis intrusion | Asserted
5d | Pre-Init Time-stamp | Fan #0x0f | Lower Non-critical going low
5e | Pre-Init Time-stamp | Fan #0x0f | Lower Critical going low
5f | Pre-Init Time-stamp | Fan #0x0f | Lower Non-recoverable going low
60 | Pre-Init Time-stamp | Fan #0x10 | Lower Non-critical going low
61 | Pre-Init Time-stamp | Fan #0x10 | Lower Critical going low
62 | Pre-Init Time-stamp | Fan #0x10 | Lower Non-recoverable going low
63 | Pre-Init Time-stamp | Fan #0x11 | Lower Non-critical going low
64 | Pre-Init Time-stamp | Fan #0x11 | Lower Critical going low
65 | Pre-Init Time-stamp | Fan #0x11 | Lower Non-recoverable going low
66 | Pre-Init Time-stamp | Fan #0x12 | Lower Non-critical going low
67 | Pre-Init Time-stamp | Fan #0x12 | Lower Critical going low
68 | Pre-Init Time-stamp | Fan #0x12 | Lower Non-recoverable going low
69 | 07/02/2011 | 21:35:02 | Watchdog 2 #0xfe | Timer expired | Asserted
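
To pull just the watchdog entries out of a listing like the one above and see how far apart they are, something along these lines may help. It is a minimal sketch that assumes the pipe-separated layout and MM/DD/YYYY dates shown here; the function names `watchdog_events` and `gaps_seconds` are hypothetical, and the sample text is a shortened copy of the log above.

```python
from datetime import datetime


def watchdog_events(sel_text):
    """Return (timestamp, description) for each timestamped Watchdog entry."""
    events = []
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Timestamped entries look like: id, date, time, sensor, description, ...
        if len(fields) < 5 or not fields[3].startswith("Watchdog"):
            continue
        try:
            when = datetime.strptime(fields[1] + " " + fields[2],
                                     "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue  # entry without a usable date and time
        events.append((when, fields[4]))
    return events


def gaps_seconds(events):
    """Whole seconds between consecutive events."""
    times = [when for when, _ in events]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


SAMPLE = """\
3c | 07/02/2011 | 19:58:40 | Watchdog 2 #0xfe | Hard reset | Asserted
3d | 07/02/2011 | 21:05:27 | Power Supply #0x14 | State Asserted
3f | Pre-Init Time-stamp | Fan #0x0f | Lower Non-critical going low
4b | 07/02/2011 | 21:11:30 | Watchdog 2 #0xfe | Hard reset | Asserted
5a | 07/02/2011 | 21:18:53 | Watchdog 2 #0xfe | Hard reset | Asserted
69 | 07/02/2011 | 21:35:02 | Watchdog 2 #0xfe | Timer expired | Asserted
"""

events = watchdog_events(SAMPLE)
for when, what in events:
    print(when, what)
print(gaps_seconds(events))  # [4370, 443, 969]
```

On this excerpt the watchdog entries are roughly 73, 7, and 16 minutes apart.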
--
Kelsey Cummings - sonic.net, inc.
System Architect 2260 Apollo Way
707.522.1000 Santa Rosa, CA 95407