On Thu, 5 Mar 2009, Miles O'Neal wrote:
> recap: new 64 bit Intel quadcore server with Adaptec SATA RAID
> controller, 16x1TB drives. 1 drive JBOD for OS. The rest are
> set up as RAID6 with 1 spare. We've tried EL5.1 + all yum updates,
> and EL5.2 stock. We can't get /dev/sdb1 (12TB) stable with ext2
> or xfs (ext3 blows up in the journal setup).
>
> So I decided to carve /dev/sdb up into a dozen partitions and
> use LVM. Initially I want to use one partition per LV and make
> each of those one xfs FS. Then as things grow I can add a PV
> (one partition per PV) into the appropriate VG and grow the LV/FS.
> Between typos and missteps, I've had to build up and tear down the
> LV pieces several times. And now I get messages such as
>
> Aborting - please provide new pathname for what used to be /dev/disk/by-path/pci-0000:01:00.0-scsi-0:0:1:0-part6
> or
> Device /dev/sdb6 not found (or ignored by filtering).
>
> I clean it all up, wipe out all the files in /etc/lvm/*/*
> (including cache/.cache), and try again, still broken.
>
> I tried rebooting. Still broken.
>
> How can I fix this short of a full reinstall?
>
> The whole LVM system feels really kludgy. I suppose there's
> not a better alternative at this time?
Just a little history for you to mull over... A while ago, block
devices over 2TB (1TB on some systems!) were not supported properly:
with the default block size (512 bytes), block numbers went over 2^32
(2^31 was the limit on some systems, including earlier Linux kernels,
because the type had been left signed by accident).
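For reference, the arithmetic behind those limits (just a sanity
check, not from any particular system):

$ echo $(( 2**32 * 512 ))   # 32-bit sector numbers * 512-byte sectors
2199023255552

i.e. 2 TiB, and with a signed 32-bit type you only get half of that,
1 TiB.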
While 64 bit block offsets have since been added, each device driver
needs to implement them properly, so even if people say 'it works for
me with device xxx', you can't be sure it will work for you unless you
also have device xxx and exactly the same (or newer, I guess) drivers.
To work around this, most RAID devices have long supported a 'hack'
where a larger volume is exported to the host as a number of slices -
typically each under 2TB - as new LUNs on the same bus,target (or
whatever). The host sees these as individual 'disks', but you can
fairly easily join things back together by putting the PVs into the
same VG and then slicing things up into the LVs of your choice, as
sketched below.
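The joining-up step is roughly this (a minimal sketch - the slice
names /dev/sdb1 and /dev/sdc1 and the names 'bigvg'/'data' are made
up for illustration):

$ pvcreate /dev/sdb1 /dev/sdc1        # label each <2TB slice as an LVM PV
$ vgcreate bigvg /dev/sdb1 /dev/sdc1  # pool the PVs into one VG
$ lvcreate -L 500G -n data bigvg      # carve out LVs of whatever size
$ mkfs.xfs /dev/bigvg/data            # and make a filesystem on each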
Note that splitting your large device into smaller partitions isn't the
same since the kernel will still need to use large block offsets to access
the parts more than 2TB into the device.
Some people think I'm really silly/old-fashioned for still doing
things this way (using large devices works for them, and it might work
for me with some hardware), but there is little extra overhead and my
sanity is better preserved by not tempting fate.
e.g. one box I'm using atm has a ~4TB RAID (a Dell PERC/6i in fact),
but it is configured to present that as three devices, which we then
partition and set up as PVs as if they were different disks. e.g.
$ grep sd /proc/partitions
   8     0 1843200000 sda
   8     1     104391 sda1
   8     2 1843089255 sda2
   8    16 1843200000 sdb
   8    17 1843193646 sdb1
   8    32  705822720 sdc
   8    33  705815743 sdc1
The actual numbers for the 'disk' slice sizes don't really matter but
happened to be based on what Dell suggested for a different box we have
with a PERC/6i controller (it wouldn't fit as 2 '<2TB devices' so we need
at least 3...)
$ pvscan
  PV /dev/sda2   VG TempLobeSys00   lvm2 [1.72 TB / 0 free]
  PV /dev/sdb1   VG TempLobeSys00   lvm2 [1.72 TB / 78.00 GB free]
  PV /dev/sdc1   VG TempLobeSys00   lvm2 [673.09 GB / 0 free]
  Total: 3 [4.09 TB] / in use: 3 [4.09 TB] / in no VG: 0 [0 ]
$ vgdisplay
  --- Volume group ---
  VG Name               TempLobeSys00
  System ID
  Format                lvm2
  Metadata Areas        3
  Metadata Sequence No  11
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                7
  Open LV               7
  Max PV                0
  Cur PV                3
  Act PV                3
  VG Size               4.09 TB
  PE Size               32.00 MB
  Total PE              134034
  Alloc PE / Size       131538 / 4.01 TB
  Free PE / Size        2496 / 78.00 GB
  VG UUID               h393E0-mHmv-s4gf-Py2x-mJuU-OWUc-VoWtuP
$ df -hlP
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/TempLobeSys00-root     2.0G  644M  1.3G  35% /
/dev/mapper/TempLobeSys00-scratch   50G  9.2G   38G  20% /local
/dev/mapper/TempLobeSys00-var      3.9G  1.5G  2.2G  41% /var
/dev/mapper/TempLobeSys00-tmp      9.7G  152M  9.1G   2% /tmp
/dev/mapper/TempLobeSys00-usr       12G  5.3G  5.6G  49% /usr
/dev/sda1                           99M   25M   70M  26% /boot
tmpfs                               12G     0   12G   0% /dev/shm
/dev/mapper/TempLobeSys00-tardis   4.0T  9.3G  3.9T   1% /local/tardis
This happens to be using XFS for the larger fs, but we had it working
with ext3 too - though I realise this isn't as big as your example.
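If you then grow things the way you described (add a PV, grow the
LV/FS), the sequence is something like this sketch - the new slice
/dev/sdd1, the names 'bigvg'/'data' and the mount point are all
illustrative:

$ pvcreate /dev/sdd1                 # label the newly-exported slice
$ vgextend bigvg /dev/sdd1           # add it to the existing VG
$ lvextend -L +500G /dev/bigvg/data  # grow the LV into the new space
$ xfs_growfs /mnt/data               # XFS grows online, given its mount point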
Re-configuring your RAID controllers to export as <2TB slices isn't fun,
but it should be possible without a re-install (if a bit fiddly).
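And to clear out the half-built LVM state before starting again,
something like the following normally does it (a sketch with made-up
VG/LV names - double-check the devices, since this destroys the LVM
metadata on them):

$ lvremove /dev/oldvg/oldlv   # remove any LVs that still exist
$ vgremove oldvg              # then the (now empty) VG
$ pvremove /dev/sdb6          # then the PV label on each partition
$ vgscan                      # rescan so LVM forgets the old layout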
--
/--------------------------------------------------------------------\
| "Computers are different from telephones. Computers do not ring." |
| -- A. Tanenbaum, "Computer Networks", p. 32 |
|--------------------------------------------------------------------|
| Jon Peatfield, _Computer_ Officer, DAMTP, University of Cambridge |
| Mail: [log in to unmask] Web: http://www.damtp.cam.ac.uk/ |
\--------------------------------------------------------------------/