Date: Thu, 5 Dec 2013 00:35:52 +0100
Content-Type: text/plain
On 04/12/13 14:21, ~Stack~ wrote:
> Greetings,
>
> I have a test system I use for testing deployments and when I am not
> using it, it runs Boinc. It is a Scientific Linux 6.4 fully updated box.
> Recently (last ~3 weeks) I have started getting the same kernel panic.
> Sometimes it will be multiple times in a single day and other times it
> will be days before the next one (it just had a 5 day uptime). But the
> kernel panic looks pretty much the same. It is a complaint about a hung
> task plus information about the ext4 filesystem. I have run
> smartmontools against both drives (two drives set up in a hardware RAID
> mirror) and both drives check out fine. I ran fsck against the /
> partition and everything looked fine (on this test box there are only /
> and swap partitions). I even took out one drive at a time and had the same
> crashes (though this could be an indicator that both drives are bad). I
> am wondering if my RAID card is going bad.
>
> When the crash happens I still have the SSH prompt, however, I can only
> do basic things like navigating directories and sometimes reading files.
> Writing to a file seems to hang, using tab-autocomplete will frequently
> hang, running most programs (even `init 6` or `top`) will hang.
>
> It crashed again last night, and I am kind of stumped. I would greatly
> appreciate others' thoughts and input on what the problem might be.
>
> Thanks!
> ~Stack~
>
> Dec 4 02:25:09 testbox kernel: INFO: task jbd2/cciss!c0d0:273 blocked
> for more than 120 seconds.
> Dec 4 02:25:09 testbox kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 4 02:25:09 testbox kernel: jbd2/cciss!c0 D 0000000000000000 0
> 273 2 0x00000000
> Dec 4 02:25:09 testbox kernel: ffff8802142cfb30 0000000000000046
> ffff8802138b5800 0000000000001000
> Dec 4 02:25:09 testbox kernel: ffff8802142cfaa0 ffffffff81012c59
> ffff8802142cfae0 ffffffff810a2431
> Dec 4 02:25:09 testbox kernel: ffff880214157058 ffff8802142cffd8
> 000000000000fb88 ffff880214157058
> Dec 4 02:25:09 testbox kernel: Call Trace:
> Dec 4 02:25:09 testbox kernel: [<ffffffff81012c59>] ? read_tsc+0x9/0x20
This looks like a locking issue to me, triggered by something around the
TSC timer.
This is either a buggy driver (most likely the cciss driver) or related
firmware (read the complete boot log carefully and look for firmware warnings).
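For example, a quick way to scan the logs for firmware- or controller-related messages; the keyword list here is only my guess at what is relevant for this box:

```shell
# Hypothetical sketch: scan the kernel ring buffer and syslog for firmware
# or cciss controller complaints. Adjust the keywords to your hardware.
dmesg | grep -iE 'firmware|cciss' || echo "no matches in dmesg"
# On SL6, earlier boot messages also end up in /var/log/messages:
grep -iE 'firmware|cciss' /var/log/messages 2>/dev/null || echo "no matches in messages"
```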
Or it's a really unstable TSC clock source. Try switching from TSC to HPET
(or, in the worst case, acpi_pm). See this KB article for some related info:
<https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html>
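To illustrate, a sketch of checking and switching the clock source, assuming the standard sysfs clocksource interface present on RHEL/SL 6 kernels:

```shell
# Sysfs location of the clocksource controls on RHEL/SL 6 kernels:
CS=/sys/devices/system/clocksource/clocksource0

# Which clock sources does this kernel consider usable?
cat "$CS/available_clocksource"

# Which one is active right now (probably "tsc" on this box)?
cat "$CS/current_clocksource"

# To switch to HPET for the running session (as root):
#   echo hpet > "$CS/current_clocksource"
# To make it stick across reboots, add "clocksource=hpet" to the kernel
# line in /boot/grub/grub.conf.
```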
But my hunch tells me it's a driver-related issue with some bad locking.
There seem to be several filesystem operations happening on two or more CPU
cores in a certain order, which triggers a deadlock.
--
kind regards,
David Sommerseth