SCIENTIFIC-LINUX-USERS Archives

December 2013

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
David Sommerseth <[log in to unmask]>
Reply To:
Date:
Thu, 5 Dec 2013 00:35:52 +0100
On 04/12/13 14:21, ~Stack~ wrote:
 > Greetings,
 >
 > I have a test system I use for testing deployments and when I am not
 > using it, it runs Boinc. It is a Scientific Linux 6.4 fully updated box.
 > Recently (last ~3 weeks) I have started getting the same kernel panic.
 > Sometimes it will be multiple times in a single day and other times it
 > will be days before the next one (it just had a 5 day uptime). But the
 > kernel panic looks pretty much the same. It is a complaint about a hung
 > task plus information about the ext4 file system. I have run the
 > smartmon tool against both drives (2 drives setup in a hardware RAID
 > mirror) and both drives checkout fine. I ran a fsck against the /
 > partition and everything looked fine (on this text box there is only /
 > and swap partitions). I even took out a drive at a time and had the same
 > crashes (though this could be an indicator that both drives are bad). I
 > am wondering if my RAID card is going bad.
 >
 > When the crash happens I still have the SSH prompt, however, I can only
 > do basic things like navigating directories and sometimes reading files.
 > Writing to a file seems to hang, using tab-autocomplete will frequently
 > hang, running most programs (even `init 6` or `top`) will hang.
 >
 > It crashed again last night, and I am kind of stumped. I would greatly
 > appreciate others thoughts and input on what the problem might be.
 >
 > Thanks!
 > ~Stack~
 >
 > Dec  4 02:25:09 testbox kernel: INFO: task jbd2/cciss!c0d0:273 blocked
 > for more than 120 seconds.
 > Dec  4 02:25:09 testbox kernel: "echo 0 >
 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 > Dec  4 02:25:09 testbox kernel: jbd2/cciss!c0 D 0000000000000000     0
 >   273      2 0x00000000
 > Dec  4 02:25:09 testbox kernel: ffff8802142cfb30 0000000000000046
 > ffff8802138b5800 0000000000001000
 > Dec  4 02:25:09 testbox kernel: ffff8802142cfaa0 ffffffff81012c59
 > ffff8802142cfae0 ffffffff810a2431
 > Dec  4 02:25:09 testbox kernel: ffff880214157058 ffff8802142cffd8
 > 000000000000fb88 ffff880214157058
 > Dec  4 02:25:09 testbox kernel: Call Trace:
 > Dec  4 02:25:09 testbox kernel: [<ffffffff81012c59>] ? read_tsc+0x9/0x20

This looks like some locking issue to me, triggered by something around the 
TSC timer.

This is either a buggy driver (most likely the cciss driver) or its related 
firmware (read the complete boot log carefully and look for firmware warnings). 
Or it's a really unstable TSC clock source.  Try switching from TSC to HPET 
(or, as a last resort, acpi_pm).  See this KB for some related info: 
<https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html>
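
To check which clock source is active before changing anything, something 
like the following rough sketch might help.  It assumes the usual sysfs 
location for the clocksource controls; the helper names (read_clocksources, 
pick_fallback, switch_clocksource) are made up for this example, and the 
hpet-then-acpi_pm preference simply mirrors the suggestion above.

```python
from pathlib import Path

# Assumed sysfs location of the kernel clocksource controls.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clock sources as reported by the kernel."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def pick_fallback(current, available, preferred=("hpet", "acpi_pm")):
    """Pick the first preferred source that is available and not already active."""
    for name in preferred:
        if name in available and name != current:
            return name
    return None


def switch_clocksource(name, base=CLOCKSOURCE_DIR):
    """Write the new source name; needs root and lasts only until reboot."""
    (Path(base) / "current_clocksource").write_text(name + "\n")


if __name__ == "__main__":
    # Only report; do not switch anything automatically.
    if CLOCKSOURCE_DIR.exists():
        current, available = read_clocksources()
        print("current:", current)
        print("available:", " ".join(available))
        print("suggested fallback:", pick_fallback(current, available))
    else:
        print("no clocksource directory found at", CLOCKSOURCE_DIR)
```

Calling switch_clocksource("hpet") as root only changes the running system.  
To keep the setting across reboots, the clocksource=hpet kernel boot 
parameter should do the same thing.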

But my hunch tells me it's a driver-related issue, with some bad locking. 
There seem to be several filesystem operations happening on two or more CPU 
cores in a certain order, which seems to trigger a deadlock.


--
kind regards,

David Sommerseth
