On Fri, 2008-10-03 at 18:51 -0600, dave peck wrote:
> Hi Jon,
<snip>
>
> So. I can see what has gone awry with the mkinitrd script, but I'm not
> sure I know how to fix it, and specifically the findstoragedriverinsys
> function--I've built several unbootable initrd images trying to sort
> this out but it seems a bit of a tangled mess at this point.
>
<snip>
> ==> dave
>

I know it's bad form to reply to one's own posting, but I have finally
got things working and thought this reply might help someone else faced
with a similar problem, or anyone working on the mkinitrd script itself.

I was finally able to get back onto this problem yesterday and thought
I'd document what I ended up doing.

I checked the errata for the updated mkinitrd-5.1.19.6-28 package;
nothing seemed obvious except the notes concerning booting from LVM
volumes:

  * when booting from LVM volumes, initrd attempts to active all
    logical volumes. If there was a full snapshot, initrd failed to
    recover, resulting in a failure to boot.

and,

  * in certain situations, an infinite loop occurred when creating a
    new initrd for an installed kernel on IBM System z architectures.

We don't run on IBM System z mainframes (which apparently show behaviour
similar to what I was seeing), but we do boot from LVM volumes built on
software (md) raid1 devices. In this case, though, the LVM snapshot
issue reported in the errata seems to be a red herring.

In any event, changes were made to the mkinitrd script since our last
'known good' kernel images were built. Comparing the mkinitrd
5.1.19.6-28 script against the 5.1.19.6-19 version shows that the
problematic code segments I reported previously are new additions in
the 6-28 script:

    while [ ! -L device ]; do
        if [ -L subsystem ]; then
            cd slaves
            for x in *;do
                if [ -L $x ]; then
                    cd $x;
                    break
                fi;
            done
        fi
    done

I downgraded the new and improved SL 5.2 mkinitrd and nash packages to
the latest (but older) SL 5.1 packages, downloaded from
<ftp://ftp.scientificlinux.org/linux/scientific/51/>, e.g.:

  - Allow write access to the /boot device:

    ~]$ sudo mount -o remount,rw /boot

  - Install or update the packages with the previous version:

    ~]$ sudo rpm -Uhv --oldpackage mkinitrd-5.1.19.6-19.i386.rpm mkinitrd-5.1.19.6-19.x86_64.rpm nash-5.1.19.6-19.x86_64.rpm

    or, for the older PAE kernel systems, I used:

    ~]$ sudo rpm -Uhv --oldpackage mkinitrd-5.1.19.6-19.i386.rpm nash-5.1.19.6-19.i386.rpm

Installing the previous set of mkinitrd and nash packages allowed the
SL 2.6.18-92.1.13.el5 kernel upgrade to run to completion on all the
test systems.

    ~]$ sudo yum update kernel kernel-devel kernel-doc kernel-headers

All of our desktop/workstation systems run on LVM volumes, and those
that did NOT have md raid devices underneath LVM came up properly with
just this downgrade of the mkinitrd and nash packages.

However, on the more serious systems we run LVM on top of software md
raid1 devices, and the resulting initrd images were unusable
(unbootable) on those systems. The resulting initrd file on the test
system also looked suspiciously short in a directory listing:

    ~]$ sudo ls -sk /boot | grep initrd
    3206 initrd-2.6.18-92.1.10.el5.img
    2537 initrd-2.6.18-92.1.13.el5.img   <-- not good

Given the final error message being emitted when attempting to load
this new kernel on the system:

    Switching to new root
    <snip>
    Kernel panic - not syncing: Attempted to kill init!

it seemed clear that the newly generated 2.6.18-92.1.13.el5 ram disk,
built using the older mkinitrd, was incomplete. The script did, however,
create a mostly intact initrd image, so we should now only need to add
the specific bits that were missed during the SL mkinitrd run to get
things operational.

The first step was to determine what was working in the currently
running initrd configuration compared to what was being generated by
the new 92.1.13 build.

I took the currently running 2.6.18-92.1.10.el5 initrd disk image and
unpacked it into a temporary directory:

    ~]$ mkdir fix-initrd
    ~]$ cd fix-initrd
    ~]$ sudo cp -ar /boot/initrd-2.6.18-92.1.10.el5.img initrd-2.6.18-92.1.10.el5.gz
    ~]$ gunzip initrd-2.6.18-92.1.10.el5.gz
    ~]$ mkdir 10
    ~]$ cd 10
    ~]$ cpio -di <../initrd-2.6.18-92.1.10.el5

and compared it with the newly generated 92.1.13 disk image:

    ~]$ cd ..
    ~]$ sudo cp -ar /boot/initrd-2.6.18-92.1.13.el5.img initrd-2.6.18-92.1.13.el5.gz
    ~]$ gunzip initrd-2.6.18-92.1.13.el5.gz
    ~]$ mkdir 13
    ~]$ cd 13
    ~]$ cpio -di <../initrd-2.6.18-92.1.13.el5

Comparing the directory structures and files between the working
92.1.10 tree and the broken 92.1.13 tree, we find that 92.1.13 is
missing a few files; specifically lvm and the raid kernel module.

This seems odd, since the handful of desktop/workstation systems that
updated successfully using this older mkinitrd have lvm configurations
identical to the server systems, but were not running md raid1... (so
lvm was recognised on the systems, but only if there were no raid
devices involved?)

    ~]$ ls -al ../1?/bin
    ../10/bin:
    total 6128
    drwx------ 2 root root    4096 Oct 18 10:29 .
    drwxr-xr-x 9 root root    4096 Oct 18 10:38 ..
    -rwx------ 1 root root 1024504 Oct 18 10:29 dmraid
    -rwx------ 1 root root  521304 Oct 18 10:29 insmod
    -rwx------ 1 root root  899576 Oct 18 10:29 kpartx
    -r-x------ 1 root root 1461864 Oct 18 10:29 lvm
    lrwxrwxrwx 1 root root      10 Oct 18 10:29 modprobe -> /sbin/nash
    -rwx------ 1 root root 2329792 Oct 18 10:29 nash

    ../13/bin:
    total 4688
    drwx------ 2 root root    4096 Oct 18 10:33 .
    drwxr-xr-x 9 root root    4096 Oct 18 10:33 ..
    -rwx------ 1 root root 1024504 Oct 18 10:33 dmraid
    -rwx------ 1 root root  521304 Oct 18 10:33 insmod
    -rwx------ 1 root root  899576 Oct 18 10:33 kpartx
    lrwxrwxrwx 1 root root      10 Oct 18 10:33 modprobe -> /sbin/nash
    -rwx------ 1 root root 2320672 Oct 18 10:33 nash

so we copy the missing (static) lvm executable into the unpacked
2.6.18-92.1.13 ram disk image:

    ~]$ pwd
    /home/peckd/fix-initrd/13
    ~]$ cp -ar /sbin/lvm.static bin/lvm

In addition, we seem to be missing the lvm configuration file in the
2.6.18-92.1.13 initrd image:

    ~]$ ls -al ../1?/etc/
    ../10/etc/:
    total 4
    drwx------ 2 root root 4096 Oct 18 10:29 lvm

    ../13/etc/:
    total 0

so we copy that as well.

    ~]$ pwd
    /home/peckd/fix-initrd/13
    ~]$ mkdir etc/lvm
    ~]$ cp -ar /etc/lvm/lvm.conf etc/lvm

We also seem to be missing the raid1.ko module from /lib:

    ~]$ ll ../1?/lib
    ../10/lib:
    total 2052
    -rw------- 1 root root  66920 Oct 18 10:29 ata_piix.ko
    -rw------- 1 root root  73112 Oct 18 10:29 dm-mirror.ko
    -rw------- 1 root root 131536 Oct 18 10:29 dm-mod.ko
    -rw------- 1 root root  59512 Oct 18 10:29 dm-snapshot.ko
    -rw------- 1 root root  37448 Oct 18 10:29 dm-zero.ko
    -rw------- 1 root root  77088 Oct 18 10:29 ehci-hcd.ko
    -rw------- 1 root root 227896 Oct 18 10:29 ext3.ko
    drwx------ 2 root root   4096 Oct 18 10:29 firmware
    -rw------- 1 root root 135928 Oct 18 10:29 jbd.ko
    -rw------- 1 root root 258736 Oct 18 10:29 libata.ko
    -rw------- 1 root root 152712 Oct 18 10:29 mptbase.ko
    -rw------- 1 root root  84512 Oct 18 10:29 mptscsih.ko
    -rw------- 1 root root  63456 Oct 18 10:29 mptspi.ko
    -rw------- 1 root root  63344 Oct 18 10:29 ohci-hcd.ko
    -rw------- 1 root root  65624 Oct 18 10:29 raid1.ko   <-- missing
    -rw------- 1 root root 281096 Oct 18 10:29 scsi_mod.ko
    -rw------- 1 root root  72296 Oct 18 10:29 scsi_transport_spi.ko
    -rw------- 1 root root  68872 Oct 18 10:29 sd_mod.ko
    -rw------- 1 root root  67040 Oct 18 10:29 uhci-hcd.ko

    ../13/lib:
    total 1980
    -rw------- 1 root root  66920 Oct 18 10:33 ata_piix.ko
    -rw------- 1 root root  73112 Oct 18 10:33 dm-mirror.ko
    -rw------- 1 root root 131544 Oct 18 10:33 dm-mod.ko
    -rw------- 1 root root  59512 Oct 18 10:33 dm-snapshot.ko
    -rw------- 1 root root  37448 Oct 18 10:33 dm-zero.ko
    -rw------- 1 root root  77088 Oct 18 10:33 ehci-hcd.ko
    -rw------- 1 root root 227896 Oct 18 10:33 ext3.ko
    drwx------ 2 root root   4096 Oct 18 10:33 firmware
    -rw------- 1 root root 135936 Oct 18 10:33 jbd.ko
    -rw------- 1 root root 258736 Oct 18 10:33 libata.ko
    -rw------- 1 root root 152712 Oct 18 10:33 mptbase.ko
    -rw------- 1 root root  84512 Oct 18 10:33 mptscsih.ko
    -rw------- 1 root root  63456 Oct 18 10:33 mptspi.ko
    -rw------- 1 root root  63344 Oct 18 10:33 ohci-hcd.ko
    -rw------- 1 root root 281096 Oct 18 10:33 scsi_mod.ko
    -rw------- 1 root root  72296 Oct 18 10:33 scsi_transport_spi.ko
    -rw------- 1 root root  68872 Oct 18 10:33 sd_mod.ko
    -rw------- 1 root root  67040 Oct 18 10:33 uhci-hcd.ko

and copy that as well:

    ~]$ pwd
    /home/peckd/fix-initrd/13
    ~]$ ls /lib/modules/*/kernel/drivers/md/raid1.ko
    /lib/modules/2.6.18-92.1.10.el5/kernel/drivers/md/raid1.ko
    /lib/modules/2.6.18-92.1.13.el5/kernel/drivers/md/raid1.ko
    ~]$ sudo cp /lib/modules/2.6.18-92.1.13.el5/kernel/drivers/md/raid1.ko ../13/lib

I then took a look at the init script itself to find the differences:

    ~]$ ls -al ../1?/init
    -rwx------ 1 root root 2603 Oct 18 10:29 ../10/init
    -rwx------ 1 root root 2391 Oct 18 10:33 ../13/init

and merged the missing pieces from the working init config into the new
2.6.18-92.1.13 init:

    ~]$ diff -Naur ../13/init ../10/init
    --- ../13/init 2008-10-18 18:00:01.000000000 +0000
    +++ ../10/init 2008-10-18 16:29:36.000000000 +0000
    @@ -40,12 +40,12 @@
     hotplug
     echo Creating block device nodes.
     mkblkdevs
    -echo "Loading uhci-hcd.ko module"
    -insmod /lib/uhci-hcd.ko
    -echo "Loading ohci-hcd.ko module"
    -insmod /lib/ohci-hcd.ko
     echo "Loading ehci-hcd.ko module"
     insmod /lib/ehci-hcd.ko
    +echo "Loading ohci-hcd.ko module"
    +insmod /lib/ohci-hcd.ko
    +echo "Loading uhci-hcd.ko module"
    +insmod /lib/uhci-hcd.ko
     mount -t usbfs /proc/bus/usb /proc/bus/usb
     echo "Loading jbd.ko module"
     insmod /lib/jbd.ko
    @@ -88,7 +88,7 @@
     lvm vgchange -ay --ignorelockingfailure vg01
     resume LABEL=/v1l0-swap
     echo Creating root device.
    -mkrootdev -t ext3 -o user_xattr,acl,noatime,ro /dev/vg01/lv01
    +mkrootdev -t ext3 -o defaults,ro /dev/vg01/lv01
     echo Mounting root filesystem.
     mount /sysroot
     echo Setting up other filesystems.

After applying those differences that seemed significant to the
2.6.18-92.1.13 init, I rebuilt the initrd image for the system(s):

    ~]$ pwd
    /home/peckd/fix-initrd/13
    ~]$ find ./ | cpio -H newc -o > ../new-initrd
    ~]$ gzip ../new-initrd

and moved the new ram disk image to /boot.

    ~]$ sudo mv ../new-initrd.gz /boot/dp-initrd-2.6.18-92.1.13.el5.img

I then created a new entry in /boot/grub/grub.conf that points to the
existing vmlinuz-2.6.18-92.1.13.el5 kernel and references this new
initrd image:

    title Scientific Linux SL (2.6.18-92.1.13.el5)
            root (hd0,0)
            kernel /vmlinuz-2.6.18-92.1.13.el5 ro root=LABEL=/
            initrd /dp-initrd-2.6.18-92.1.13.el5.img

I then rebooted using this entry and, so far, everything appears to be
working.

    ~]$ uname -rsimpv
    Linux 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 16:39:16 EDT 2008 x86_64 x86_64 x86_64

I would really have preferred devising a patch to the mkinitrd script
itself to handle the configurations we have; I can't believe they're
that unusual, and I may look at this in more detail later, but getting
the boxes updated and running with the latest kernel security fixes
seemed more pressing.

Anyway, I hope this note will be useful to someone.

Thank you and my very best regards,

==> dave
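
P.S. For anyone repeating the comparison step, finding what is missing
from the new tree can be scripted instead of eyeballing ls output. The
following is only a minimal sketch, not anything taken from mkinitrd:
the function name missing_from_new is made up for illustration, and it
assumes both initrd images have already been unpacked into separate
directories (such as 10 and 13 above). It compares path names only, not
file contents or sizes.

```shell
#!/bin/sh
# Hypothetical helper: list paths that exist in the known-good unpacked
# initrd tree but are absent from the newly generated one.
# Usage: missing_from_new GOOD_DIR NEW_DIR
missing_from_new() {
    good=$1
    new=$2
    good_list=$(mktemp) || return 1
    new_list=$(mktemp) || { rm -f "$good_list"; return 1; }

    # Record paths relative to each tree root so the lists are comparable.
    (cd "$good" && find . | LC_ALL=C sort) > "$good_list"
    (cd "$new" && find . | LC_ALL=C sort) > "$new_list"

    # comm -23 prints only the lines unique to the first (known-good) list.
    LC_ALL=C comm -23 "$good_list" "$new_list"

    rm -f "$good_list" "$new_list"
}
```

Run from the fix-initrd directory as "missing_from_new 10 13"; every
path it prints is a candidate for copying into the new tree before
repacking. A content check (for example diff on the two init scripts)
is still needed separately, since files present in both trees are not
reported even when they differ.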