SCIENTIFIC-LINUX-USERS Archives

October 2008

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
dave peck <[log in to unmask]>
Reply To:
Date:
Sun, 19 Oct 2008 09:18:26 -0600
On Fri, 2008-10-03 at 18:51 -0600, dave peck wrote: 
> Hi Jon,
<snip> 
> 
> So. I can see what has gone awry with the mkinitrd script, but I'm not
> sure I know how to fix it, and specifically the findstoragedriverinsys
> function--I've built several unbootable initrd images trying to sort
> this out but it seems a bit of a tangled mess at this point.
> 
<snip>
> ==> dave
> 
I know it's bad form to reply to one's own posting, but I have finally
got things working and thought this reply might help someone else faced
with a similar problem, or anyone working on the mkinitrd script
itself.

I was finally able to get back to this problem yesterday and thought
I'd document what I ended up doing. I checked the errata for the
updated mkinitrd-5.1.19.6-28 package; nothing seemed obviously relevant
except the notes concerning booting from LVM volumes:

        * when booting from LVM volumes, initrd attempts to activate all
        logical volumes. If there was a full snapshot, initrd failed to
        recover, resulting in a failure to boot.

and,

        * in certain situations, an infinite loop occurred when creating
        a new initrd for an installed kernel on IBM System z
        architectures.

We don't run IBM System z mainframes (which apparently show behaviour
similar to what I was seeing), but we do boot from LVM volumes built on
software (md) raid1 devices. In this case, though, the LVM snapshot
issue reported in the errata seems to be a red herring. In any event,
changes were made to the mkinitrd script since our last 'known good'
kernel images were made.

Comparing the mkinitrd script in 5.1.19.6-28 against the 5.1.19.6-19
version shows that the problematic code segments I reported previously
were new additions in the 6-28 script:

        while [ ! -L device ]; do
            if [ -L subsystem ]; then
                cd slaves
                for x in *;do
                    if [ -L $x ]; then
                        cd $x;
                        break
                    fi;
                done
            fi
        done
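
For anyone working on the script itself: as quoted, the loop has no way
out when the current directory has neither a "device" link nor a usable
entry under "slaves", which may be why it spins. Below is a minimal
sketch of a bounded version of the same walk. The function name
walk_to_device and the depth cap are hypothetical and not part of
mkinitrd; this is an illustration under those assumptions, not a tested
patch.

```shell
# Hypothetical helper (not from mkinitrd): follow "slaves" links from a
# sysfs-style block device directory until one has a "device" symlink.
# Gives up after MAX_DEPTH hops instead of looping forever.
walk_to_device() {
    local dir=$1 max=${2:-16} depth=0 x next
    while [ ! -L "$dir/device" ]; do
        depth=$((depth + 1))
        [ "$depth" -gt "$max" ] && return 1
        next=
        if [ -d "$dir/slaves" ]; then
            for x in "$dir"/slaves/*; do
                if [ -L "$x" ]; then
                    # Resolve the first slave link to its real directory.
                    next=$(cd "$x" 2>/dev/null && pwd -P)
                    break
                fi
            done
        fi
        # No slave to descend into: stop rather than retry the same dir.
        [ -n "$next" ] || return 1
        dir=$next
    done
    printf '%s\n' "$dir"
}
```

On success it prints the directory that holds the "device" link; on a
dead end or a cycle it returns non-zero so the caller can fall back.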

I downgraded the "new and improved" SL 5.2 mkinitrd and nash packages to
the latest (but older) SL 5.1 packages, downloaded from
<ftp://ftp.scientificlinux.org/linux/scientific/51/>, e.g.:

        - Allow write access to /boot device:

                ~]$ sudo mount -o remount,rw /boot

        - Install or update packages with the previous version:

                ~]$ sudo rpm -Uhv --oldpackage \
                        mkinitrd-5.1.19.6-19.i386.rpm \
                        mkinitrd-5.1.19.6-19.x86_64.rpm \
                        nash-5.1.19.6-19.x86_64.rpm

          o for the older PAE kernel systems I used:

                ~]$ sudo rpm -Uhv --oldpackage \
                        mkinitrd-5.1.19.6-19.i386.rpm \
                        nash-5.1.19.6-19.i386.rpm
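
To confirm the downgrade took, the installed versions can be queried
afterwards. A small sketch follows; the helper name check_pkg_version is
hypothetical, and it assumes every installed architecture of a package
should report the same version-release string.

```shell
# Hypothetical helper: succeed only if every installed instance of PKG
# reports exactly the expected VERSION-RELEASE string.
check_pkg_version() {
    local pkg=$1 want=$2 got rc=0
    # rpm prints one line per installed instance (e.g. i386 and x86_64).
    while IFS= read -r got; do
        if [ "$got" != "$want" ]; then
            echo "$pkg: found $got, expected $want" >&2
            rc=1
        fi
    done <<EOF
$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' "$pkg")
EOF
    return $rc
}
```

For example:

        ~]$ check_pkg_version mkinitrd 5.1.19.6-19 &&
            check_pkg_version nash 5.1.19.6-19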

Installing the previous mkinitrd and nash packages allowed the SL
2.6.18-92.1.13.el5 kernel upgrade to run to completion on all the test
systems.

        ~]$ sudo yum update kernel kernel-devel kernel-doc \
                kernel-headers
        
All of our desktop/workstation systems are running on LVM volumes but
those that did NOT have md raid devices underneath LVM came up properly
with just this downgrade of the mkinitrd and nash packages.

However, on the more serious systems we run LVM on top of software md
raid1 devices, and there the resulting initrd images were unbootable.
On the test system the new initrd file also looked suspiciously small
in a directory listing:

        ~]$ sudo ls -sk /boot | grep initrd
        3206 initrd-2.6.18-92.1.10.el5.img
        2537 initrd-2.6.18-92.1.13.el5.img   <-- not good
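
A size comparison like this is easy to automate as a rough sanity check
after a kernel update. The sketch below is only a heuristic: the helper
name initrd_shrunk and the 10% default threshold are arbitrary
assumptions, and a smaller image is not necessarily a broken one.

```shell
# Hypothetical helper: return 0 (true) if NEW is more than PCT percent
# smaller than OLD, measured in bytes. PCT defaults to 10.
initrd_shrunk() {
    local old=$1 new=$2 pct=${3:-10} old_sz new_sz
    old_sz=$(wc -c < "$old") || return 2
    new_sz=$(wc -c < "$new") || return 2
    # Integer arithmetic: true when new < old * (100 - pct) / 100.
    [ $((new_sz * 100)) -lt $((old_sz * (100 - pct))) ]
}
```

It needs read access to the images, so it may have to run as root:

        ~]# initrd_shrunk /boot/initrd-2.6.18-92.1.10.el5.img \
                /boot/initrd-2.6.18-92.1.13.el5.img &&
            echo "new initrd is suspiciously small"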

Given the final error message being emitted when attempting to load this
new kernel on the system:

        Switching to new root
        <snip> 
        Kernel panic - not syncing: Attempted to kill init!

it seemed clear that the newly generated 2.6.18-92.1.13.el5 ram disk,
built using the older mkinitrd, was incomplete. The script had still
created a mostly intact initrd image, though, so we should only need to
add the specific bits that were missed during the SL mkinitrd run to
get things operational.

The first step was to determine what was working in the currently
running initrd configuration compared to what was being generated by the
new 92.1.13 build. I took the currently running 2.6.18-92.1.10.el5
initrd image and unpacked it into a temporary directory:

        ~]$ mkdir fix-initrd
        ~]$ cd fix-initrd
        
        ~]$ sudo cp -ar /boot/initrd-2.6.18-92.1.10.el5.img \
                initrd-2.6.18-92.1.10.el5.gz
        ~]$ gunzip initrd-2.6.18-92.1.10.el5.gz
        ~]$ mkdir 10
        ~]$ cd 10
        ~]$ cpio -di <../initrd-2.6.18-92.1.10.el5

and did the same for the newly generated 92.1.13 image so that the two
could be compared:

        ~]$ cd ..
        ~]$ sudo cp -ar /boot/initrd-2.6.18-92.1.13.el5.img \
                initrd-2.6.18-92.1.13.el5.gz
        ~]$ gunzip initrd-2.6.18-92.1.13.el5.gz
        ~]$ mkdir 13
        ~]$ cd 13
        ~]$ cpio -di <../initrd-2.6.18-92.1.13.el5
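
The two unpacking sequences are identical apart from the version, so
they can be wrapped in one function. A sketch, with the hypothetical
name unpack_initrd; it assumes the image is a gzip-compressed cpio
archive, as it was here, and that the caller can read the image.

```shell
# Hypothetical helper: unpack a gzip-compressed cpio initrd IMAGE into
# directory DEST, leaving the original image untouched.
unpack_initrd() {
    local img=$1 dest=$2
    mkdir -p "$dest" || return 1
    # -i extract, -d create leading directories, -m keep modification times.
    gunzip -c "$img" | (cd "$dest" && cpio -idm)
}
```

For example, "unpack_initrd /boot/initrd-2.6.18-92.1.13.el5.img 13"
would replace the cp, gunzip, mkdir, cd and cpio steps for one image.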
        
Comparing the directory structures and files between the working
92.1.10 tree and the broken 92.1.13 tree, we find that 92.1.13 is
missing a few files: specifically lvm and the raid1 kernel module. This
seems odd, since the handful of desktop/workstation systems that
successfully updated using this older mkinitrd have lvm configurations
identical to the server systems but do not run md raid1. (So lvm was
picked up on those systems, but only if there were no raid devices
involved?)

        ~]$ ls -al ../1?/bin
        ../10/bin:
        total 6128
        drwx------ 2 root root    4096 Oct 18 10:29 .
        drwxr-xr-x 9 root root    4096 Oct 18 10:38 ..
        -rwx------ 1 root root 1024504 Oct 18 10:29 dmraid
        -rwx------ 1 root root  521304 Oct 18 10:29 insmod
        -rwx------ 1 root root  899576 Oct 18 10:29 kpartx
        -r-x------ 1 root root 1461864 Oct 18 10:29 lvm
        lrwxrwxrwx 1 root root      10 Oct 18 10:29 modprobe -> /sbin/nash
        -rwx------ 1 root root 2329792 Oct 18 10:29 nash
        
        ../13/bin:
        total 4688
        drwx------ 2 root root    4096 Oct 18 10:33 .
        drwxr-xr-x 9 root root    4096 Oct 18 10:33 ..
        -rwx------ 1 root root 1024504 Oct 18 10:33 dmraid
        -rwx------ 1 root root  521304 Oct 18 10:33 insmod
        -rwx------ 1 root root  899576 Oct 18 10:33 kpartx
        lrwxrwxrwx 1 root root      10 Oct 18 10:33 modprobe -> /sbin/nash
        -rwx------ 1 root root 2320672 Oct 18 10:33 nash

so we copy the missing lvm executable (static) into the unpacked
2.6.18-92.1.13 ram disk image:

        ~]$ pwd
        /home/peckd/fix-initrd/13
        ~]$ cp -ar /sbin/lvm.static bin/lvm

in addition we seem to be missing the lvm configuration file in the
2.6.18-92.1.13 initrd image:

        ~]$ ls -al  ../1?/etc/
        ../10/etc/:
        total 4
        drwx------ 2 root root 4096 Oct 18 10:29 lvm
        
        ../13/etc/:
        total 0

so we copy that as well.
        
        ~]$ pwd
        /home/peckd/fix-initrd/13
        ~]$ mkdir etc/lvm
        ~]$ cp -ar /etc/lvm/lvm.conf etc/lvm

We also seem to be missing the raid1.ko module from the image's lib
directory:

        ~]$ ll ../1?/lib
        ../10/lib:
        total 2052
        -rw------- 1 root root  66920 Oct 18 10:29 ata_piix.ko
        -rw------- 1 root root  73112 Oct 18 10:29 dm-mirror.ko
        -rw------- 1 root root 131536 Oct 18 10:29 dm-mod.ko
        -rw------- 1 root root  59512 Oct 18 10:29 dm-snapshot.ko
        -rw------- 1 root root  37448 Oct 18 10:29 dm-zero.ko
        -rw------- 1 root root  77088 Oct 18 10:29 ehci-hcd.ko
        -rw------- 1 root root 227896 Oct 18 10:29 ext3.ko
        drwx------ 2 root root   4096 Oct 18 10:29 firmware
        -rw------- 1 root root 135928 Oct 18 10:29 jbd.ko
        -rw------- 1 root root 258736 Oct 18 10:29 libata.ko
        -rw------- 1 root root 152712 Oct 18 10:29 mptbase.ko
        -rw------- 1 root root  84512 Oct 18 10:29 mptscsih.ko
        -rw------- 1 root root  63456 Oct 18 10:29 mptspi.ko
        -rw------- 1 root root  63344 Oct 18 10:29 ohci-hcd.ko
        -rw------- 1 root root  65624 Oct 18 10:29 raid1.ko  <-- missing from 13
        -rw------- 1 root root 281096 Oct 18 10:29 scsi_mod.ko
        -rw------- 1 root root  72296 Oct 18 10:29 scsi_transport_spi.ko
        -rw------- 1 root root  68872 Oct 18 10:29 sd_mod.ko
        -rw------- 1 root root  67040 Oct 18 10:29 uhci-hcd.ko
        
        ../13/lib:
        total 1980
        -rw------- 1 root root  66920 Oct 18 10:33 ata_piix.ko
        -rw------- 1 root root  73112 Oct 18 10:33 dm-mirror.ko
        -rw------- 1 root root 131544 Oct 18 10:33 dm-mod.ko
        -rw------- 1 root root  59512 Oct 18 10:33 dm-snapshot.ko
        -rw------- 1 root root  37448 Oct 18 10:33 dm-zero.ko
        -rw------- 1 root root  77088 Oct 18 10:33 ehci-hcd.ko
        -rw------- 1 root root 227896 Oct 18 10:33 ext3.ko
        drwx------ 2 root root   4096 Oct 18 10:33 firmware
        -rw------- 1 root root 135936 Oct 18 10:33 jbd.ko
        -rw------- 1 root root 258736 Oct 18 10:33 libata.ko
        -rw------- 1 root root 152712 Oct 18 10:33 mptbase.ko
        -rw------- 1 root root  84512 Oct 18 10:33 mptscsih.ko
        -rw------- 1 root root  63456 Oct 18 10:33 mptspi.ko
        -rw------- 1 root root  63344 Oct 18 10:33 ohci-hcd.ko
        -rw------- 1 root root 281096 Oct 18 10:33 scsi_mod.ko
        -rw------- 1 root root  72296 Oct 18 10:33 scsi_transport_spi.ko
        -rw------- 1 root root  68872 Oct 18 10:33 sd_mod.ko
        -rw------- 1 root root  67040 Oct 18 10:33 uhci-hcd.ko

and copy that as well:

        ~]$ pwd
        /home/peckd/fix-initrd/13
        ~]$ ls /lib/modules/*/kernel/drivers/md/raid1.ko
        /lib/modules/2.6.18-92.1.10.el5/kernel/drivers/md/raid1.ko
        /lib/modules/2.6.18-92.1.13.el5/kernel/drivers/md/raid1.ko
        ~]$ sudo cp \
            /lib/modules/2.6.18-92.1.13.el5/kernel/drivers/md/raid1.ko \
            ../13/lib
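
The three directory comparisons above (bin, etc, lib) can also be done
in one pass over the unpacked trees, before anything is copied. A
sketch, using a hypothetical helper missing_from that prints every path
present in the first tree but absent from the second:

```shell
# Hypothetical helper: list paths that exist under REF but not under NEW.
# Only names are compared, not file contents or sizes.
missing_from() {
    local ref=$1 new=$2 f
    (cd "$ref" && find . ! -name . | sort) | while IFS= read -r f; do
        # -e follows symlinks, so test -L as well to catch dangling links.
        if [ ! -e "$new/$f" ] && [ ! -L "$new/$f" ]; then
            printf '%s\n' "${f#./}"
        fi
    done
}
```

Run from the fix-initrd directory as "missing_from 10 13", it would be
expected to report the lvm binary, the etc/lvm entries and raid1.ko on
a system like the one described here.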

I then took a look at the init script itself to find the differences:

        ~]$ ls -al ../1?/init
        -rwx------ 1 root root 2603 Oct 18 10:29 ../10/init
        -rwx------ 1 root root 2391 Oct 18 10:33 ../13/init

and merged the missing pieces from the working init config into the new
2.6.18-92.1.13 init:

        ~]$ diff -Naur ../13/init ../10/init
        --- ../13/init  2008-10-18 18:00:01.000000000 +0000
        +++ ../10/init  2008-10-18 16:29:36.000000000 +0000
        @@ -40,12 +40,12 @@
        hotplug
        echo Creating block device nodes.
        mkblkdevs
        -echo "Loading uhci-hcd.ko module"
        -insmod /lib/uhci-hcd.ko 
        -echo "Loading ohci-hcd.ko module"
        -insmod /lib/ohci-hcd.ko 
        echo "Loading ehci-hcd.ko module"
        insmod /lib/ehci-hcd.ko 
        +echo "Loading ohci-hcd.ko module"
        +insmod /lib/ohci-hcd.ko 
        +echo "Loading uhci-hcd.ko module"
        +insmod /lib/uhci-hcd.ko 
        mount -t usbfs /proc/bus/usb /proc/bus/usb
        echo "Loading jbd.ko module"
        insmod /lib/jbd.ko 
        @@ -88,7 +88,7 @@
        lvm vgchange -ay --ignorelockingfailure  vg01
        resume LABEL=/v1l0-swap
        echo Creating root device.
        -mkrootdev -t ext3 -o user_xattr,acl,noatime,ro /dev/vg01/lv01
        +mkrootdev -t ext3 -o defaults,ro /dev/vg01/lv01
        echo Mounting root filesystem.
        mount /sysroot
        echo Setting up other filesystems.
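
Since init loads modules by explicit path, a module that init names but
that is absent from the image is worth catching before repacking. A
sketch of such a check follows; check_insmod_targets is a hypothetical
helper, and it assumes init contains lines of the form
"insmod /lib/NAME.ko" as in the diff above.

```shell
# Hypothetical helper: verify that every module init tries to insmod
# exists inside the unpacked initrd tree rooted at ROOT.
check_insmod_targets() {
    local root=$1 mod rc=0
    # Pull the module path out of each "insmod /lib/xxx.ko" line.
    for mod in $(sed -n 's|^[[:space:]]*insmod[[:space:]]\{1,\}\(/[^[:space:]]*\.ko\).*|\1|p' "$root/init"); do
        if [ ! -f "$root$mod" ]; then
            echo "init loads $mod but it is not in the image" >&2
            rc=1
        fi
    done
    return $rc
}
```

From inside the unpacked 13 directory this would be run as
"check_insmod_targets ."; it checks only one direction and does not
notice modules present in lib but never loaded.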

After applying those differences that seemed significant to the
2.6.18-92.1.13 init, I rebuilt the initrd image for the system(s):

        ~]$ pwd
        /home/peckd/fix-initrd/13
        ~]$ find ./ | cpio -H newc -o > ../new-initrd
        ~]$ gzip ../new-initrd
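
Before moving the image into /boot, it can be listed to confirm that
the added files actually made it in. A sketch; initrd_has is a
hypothetical helper, and it assumes a gzip-compressed newc cpio archive
like the one just built.

```shell
# Hypothetical helper: check that each named path is a member of the
# gzip-compressed cpio archive IMAGE. Usage: initrd_has IMAGE PATH...
initrd_has() {
    local img=$1 listing want rc=0
    shift
    # List member names, dropping any leading "./" so both forms compare equal.
    listing=$(gunzip -c "$img" | cpio -it 2>/dev/null | sed 's|^\./*||')
    for want in "$@"; do
        if ! printf '%s\n' "$listing" | grep -qxF -- "$want"; then
            echo "missing from $img: $want" >&2
            rc=1
        fi
    done
    return $rc
}
```

For the image built here (gzip leaves it as ../new-initrd.gz):

        ~]$ initrd_has ../new-initrd.gz bin/lvm etc/lvm/lvm.conf \
                lib/raid1.ko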

and moved the new ram disk image to /boot.

        ~]$ sudo mv ../new-initrd.gz /boot/dp-initrd-2.6.18-92.1.13.el5.img

I then created a new entry in /boot/grub/grub.conf pointing to the
existing vmlinuz-2.6.18-92.1.13.el5 kernel and referencing this new
initrd image:

        title Scientific Linux SL (2.6.18-92.1.13.el5)
                root (hd0,0)
                kernel /vmlinuz-2.6.18-92.1.13.el5 ro root=LABEL=/ 
                initrd /dp-initrd-2.6.18-92.1.13.el5.img
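
A typo in the kernel or initrd path of a new stanza only shows up at
boot, so it may be worth checking that every file grub.conf references
exists. A sketch with a hypothetical helper check_grub_files; it
assumes /boot is a separate partition, so that paths in grub.conf are
relative to it, as in the entry above.

```shell
# Hypothetical helper: for each "kernel" and "initrd" line in a GRUB
# legacy config, check that the named file exists under BOOTDIR.
check_grub_files() {
    local conf=$1 bootdir=${2:-/boot} kw path rest rc=0
    # The "|| [ -n ... ]" also handles a final line without a newline.
    while read -r kw path rest || [ -n "$kw" ]; do
        case $kw in
            kernel|initrd)
                if [ ! -f "$bootdir$path" ]; then
                    echo "$kw file not found: $bootdir$path" >&2
                    rc=1
                fi
                ;;
        esac
    done < "$conf"
    return $rc
}
```

grub.conf is normally readable only by root, so this would be run as
root: "check_grub_files /boot/grub/grub.conf".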
        
I then rebooted using this entry and so far, everything appears to be
working.

        ~]$ uname -rsimpv
        Linux 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 16:39:16 EDT 2008
        x86_64 x86_64 x86_64

I would really have preferred to devise a patch to the mkinitrd script
itself to handle the configurations we have; I can't believe they're
that unusual, and I may look at this in more detail later, but getting
the boxes updated and running with the latest kernel security fixes
seemed more pressing.

Anyway, I hope this note will be useful to someone.

Thank you and my very best regards,

    ==> dave

                
