SCIENTIFIC-LINUX-USERS Archives

March 2008

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From: Stephan Wiesand <[log in to unmask]>
Reply To:
Date: Thu, 27 Mar 2008 16:35:46 +0100
Content-Type: TEXT/PLAIN

On Wed, 26 Mar 2008, Jon Peatfield wrote:

> On Wed, 26 Mar 2008, Jan Schulze wrote:
>
>> Hi all,
>> 
>> I have a disk array with about 4.5 TB and would like to use it as one large 
>> logical volume with an ext3 file system. When mounting the logical volume, 
>> I get an "Input/Output error: can't read superblock".
>
> Do you get any interesting kernel messages in the output of dmesg (or 
> /var/log/messages etc)?  Which exact kernel is this (uname -r) and what arch 
> (i386/x86_64 etc; uname -m)?

And what driver/hardware?
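
For example, something along these lines (assuming a SCSI/RAID controller; 
adjust the patterns to your setup) should show which controller and driver 
are involved:

   lspci | grep -i -e raid -e scsi
   cat /proc/scsi/scsi
   dmesg | grep -i -e scsi -e sd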

>> I'm using SL 4.2 with kernel 2.6 and this is what I did so far:
>> 
>> - used parted to create a gpt disk label (mklabel gpt) and one large
>>   partition (mkpart primary ext3 0s -1s)
>> 
>> - used parted to enable LVM flag on device (set 1 LVM on)
>
> I know it would be slow, but can you test that you can read/write to all of 
> /dev/sda1?

Using dd's "seek" parameter, such a test should not take too much time. But 
if creating the GPT label & partition was successful, chances are the whole 
device is accessible.
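
Roughly (the block counts are just an example and need adjusting to the real 
size of /dev/sda1; the second command overwrites data, so only run it before 
creating the filesystem):

   # read 1 GiB a few TB into the partition
   dd if=/dev/sda1 of=/dev/null bs=1M skip=4000000 count=1024
   # write 1 GiB at the same offset (destructive!)
   dd if=/dev/zero of=/dev/sda1 bs=1M seek=4000000 count=1024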

>> - created one physical volume, one volume group and one logical volume
>>   (pvcreate /dev/sda1, vgcreate raid6 /dev/sda1, lvcreate -l 1189706 -n
>>   vol1 raid6)
>> 
>> - created an ext3 filesystem and explicitly specified a 4K blocksize, as
>>   this should allow a filesystem size of up to 16 TB (mkfs.ext3 -m 0 -b
>>   4096 /dev/raid6/vol1)
>
> For some reason my EL4 notes tell me that we also specify -N (number of 
> inodes), as well as -E (set RAID stride), -J size= (set journal size) and -O 
> sparse_super,dir_index,filetype, though most of that is probably the default 
> these days...

Specifying the stripe width is also supposed to be a good idea, as is 
aligning the start of the partition to a stripe boundary (although that's 
more likely to be useful without LVM on top).
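
A rough sketch of such an mkfs call (the numbers assume a hypothetical RAID6 
with 8 data disks and 64 KiB chunks, i.e. 16 blocks per chunk and 128 blocks 
per stripe; adjust to the real array, and note that stripe-width needs a 
reasonably recent e2fsprogs):

   mkfs.ext3 -m 0 -b 4096 -E stride=16,stripe-width=128 /dev/raid6/vol1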

>> However, mounting (mount /dev/raid6/vol1 /raid) gives the superblock error, 
>> mentioned above.
>> 
>> Everything works as expected when using an ext2 filesystem (with LVM) or an 
>> ext3 filesystem (without LVM). Using a smaller volume (< 2 TB) works with 
>> ext3+LVM as well. Only the combination of > 2 TB + ext3 + LVM gives me 
>> trouble.
>> 
>> Any ideas or suggestions?
>
> We found that, in at least some combinations of kernel/hardware (really 
> drivers, I expect), support for >2TB block devices was still rather flaky 
> (at least in the early versions of EL4).
>
> We ended up getting our RAID boxes to present multiple LUNs, each under 2TB, 
> which we then set up as PVs and join back together into a single VG, so we 
> still end up with an LV bigger than 2TB.  I'm rather conservative in such 
> things, so we still avoid big block devices at the moment.
>
> [ Obviously, with single-disk sizes growing at the rate they are, the >2TB 
> block-device code is going to get a LOT more testing! ]
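
For reference, the multi-LUN setup Jon describes looks roughly like this 
(assume, purely as an example, three ~1.5 TB LUNs seen as sda/sdb/sdc, used 
as whole-disk PVs; older LVM2 versions may not accept the %FREE syntax, in 
which case give an explicit extent count instead):

   pvcreate /dev/sda /dev/sdb /dev/sdc
   vgcreate raid6 /dev/sda /dev/sdb /dev/sdc
   lvcreate -l 100%FREE -n vol1 raid6
   mkfs.ext3 -m 0 -b 4096 /dev/raid6/vol1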

We're successfully using devices up to 7 TB with a single XFS 
filesystem on them, under SL4/5 (but I think we started doing this with 
4.3, not 4.2). I have no hope of being able to check (xfs_repair) those 
should this ever become necessary, though: from what I've read, it would 
require more RAM than fits into a server today.

> However, some of the tools (e.g. ext2/3 fsck) still seemed to fail at about 
> 3.5TB, so we ended up needing to build the 'very latest' tools to be able to 
> run fsck properly (the ones included in EL4 - and EL5 I think - get into an 
> infinite loop at some point while scanning the inode tables).
>
> Currently we try to avoid 'big' ext3 LVs; the one where we discovered the 
> fsck problems was originally ~6.8TB, but we ended up splitting that into 
> several smaller LVs since even with working tools it still took ~2 days to 
> fsck... (and longer to dump/copy/restore it all!)
>
> Some of my co-workers swear by XFS for 'big' volumes, but then we do have SGI 
> boxes where XFS (well, CXFS) is the expected default fs.  I've not done much 
> testing with XFS on SL, mainly because TUV don't like XFS much...

I think it's still the best choice for large (> 2 TB) filesystems. The xfs 
available in SL4 contrib has done very well here. There are some 
interesting effects when such a filesystem runs full: you have to 
remount it with the "inode64" option in order to be able to create new 
files, and then you discover that quite a few applications are not ready 
for 64-bit inode numbers. But that aside, no other headaches. We're now 
beginning to deploy large (>10TB) XFS filesystems under SL5.
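
For the record, enabling 64-bit inodes on an existing filesystem looks 
roughly like this (device and mount point are just examples; depending on 
the kernel, a plain remount may not pick up inode64, so an unmount/mount 
cycle is the safe way):

   umount /data
   mount -o inode64 /dev/vg_data/xfs1 /data

   # or permanently, in /etc/fstab:
   # /dev/vg_data/xfs1  /data  xfs  defaults,inode64  0 2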

That said, we now also have Lustre OSTs (using a modified ext3) of 
7.5 TB in size. No problems so far, but then none of them has run full or 
required an fsck yet.

-- 
Stephan Wiesand
    DESY - DV -
    Platanenallee 6
    15738 Zeuthen, Germany
