Hi,
I look after a small Beowulf cluster whose head node got rebooted last night.
Today the compute nodes have got their NFS mounts from the head node mixed
up. On the head node I have this extract in /etc/fstab:
/dev/mapper/3600d02300000000000ed100540e4a000p1 /data01 ext3 defaults 1 2
/dev/mapper/3600d02300000000000ed100540e4a001p1 /data02 ext3 defaults 1 2
/dev/mapper/3600d02300000000000ed105947002a00p1 /data03 ext3 defaults 1 2
/dev/mapper/3600d02300000000000ed105947002a01p1 /data04 ext3 defaults 1 2
/dev/mapper/3600d02300000000000ed105a1384ec00p1 /data05 ext3 defaults 1 2
/dev/mapper/3600d02300000000000ed105a1384ec01p1 /data06 ext3 defaults 1 2
and /etc/exports extract:
/data01 10.0.0.0/24(rw,sync) 130.88.15.0/24(rw,sync) 130.88.67.0/24(rw,sync) 130.88.16.0/24(rw,sync)
/data02 10.0.0.0/24(rw,sync) 130.88.15.0/24(rw,sync) 130.88.67.0/24(rw,sync) 130.88.16.0/24(rw,sync)
/data03 10.0.0.0/24(rw,sync) 130.88.15.0/24(rw,sync) 130.88.67.0/24(rw,sync) 130.88.16.0/24(rw,sync)
/data04 10.0.0.0/24(rw,sync) 130.88.15.0/24(rw,sync) 130.88.67.0/24(rw,sync) 130.88.16.0/24(rw,sync)
/data05 10.0.0.0/24(rw,sync) 130.88.15.0/24(rw,sync) 130.88.67.0/24(rw,sync) 130.88.16.0/24(rw,sync)
/data06 10.0.0.0/24(rw,sync) 130.88.15.0/24(rw,sync) 130.88.67.0/24(rw,sync) 130.88.16.0/24(rw,sync)
To demonstrate what is wrong, I created a file on each /dataXX partition with the
same name as the partition, so on the head node I see this:
# for n in 1 2 3 4 5 6
> do
> ls /data0${n}/data??
> done
/data01/data01
/data02/data02
/data03/data03
/data04/data04
/data05/data05
/data06/data06
On the compute nodes (which at the moment I can only access by submitting a
Sun Grid Engine job) this is in /etc/fstab:
10.0.0.254:/data01 /data01 nfs defaults 0 0
10.0.0.254:/data02 /data02 nfs defaults 0 0
10.0.0.254:/data03 /data03 nfs defaults 0 0
10.0.0.254:/data04 /data04 nfs defaults 0 0
10.0.0.254:/data05 /data05 nfs defaults 0 0
10.0.0.254:/data06 /data06 nfs defaults 0 0
The mount command shows this:
/dev/ram0 on / type ext2 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda2 on /tmp type ext3 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10.0.0.254:/home on /home type nfs (rw,addr=10.0.0.254)
10.0.0.254:/usr on /usr type nfs (rw,addr=10.0.0.254)
10.0.0.254:/opt on /opt type nfs (rw,addr=10.0.0.254)
10.0.0.254:/data01 on /data01 type nfs (rw,addr=10.0.0.254)
10.0.0.254:/data02 on /data02 type nfs (rw,addr=10.0.0.254)
10.0.0.254:/data03 on /data03 type nfs (rw,addr=10.0.0.254)
10.0.0.254:/data04 on /data04 type nfs (rw,addr=10.0.0.254)
10.0.0.254:/data05 on /data05 type nfs (rw,addr=10.0.0.254)
10.0.0.254:/data06 on /data06 type nfs (rw,addr=10.0.0.254)
but when I run the same loop as above on a compute node, I see this:
/data01/data02
/data02/data03
/data03/data01
/data04/data04
/data05/data05
/data06/data06
The first three mounts are plainly wrong and it is the same on all four compute
nodes. I am absolutely confused as to what has happened - any ideas?
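For anyone who wants to repeat the check, the loop above can be wrapped in a small script and submitted as a Grid Engine job on each node. This is only a sketch: the function name check_markers and its optional root-prefix argument are made up for illustration, and it assumes the marker files described above (a file named dataNN at the top of each /dataNN file system).

```shell
#!/bin/sh
# Hypothetical helper: verify that each /dataNN mount holds its own marker file.
# The optional first argument is a root prefix (empty by default), so the
# function can also be tried against a scratch directory tree.
check_markers() {
    root="${1:-}"
    bad=0
    for n in 1 2 3 4 5 6; do
        dir="$root/data0$n"
        want="data0$n"
        if [ -e "$dir/$want" ]; then
            echo "OK       $dir"
        else
            # Report whichever marker files are actually present, if any.
            found=$(ls "$dir" 2>/dev/null | grep '^data0[0-9]$' | tr '\n' ' ')
            found=${found% }
            echo "MISMATCH $dir: expected $want, found: ${found:-none}"
            bad=$((bad + 1))
        fi
    done
    # The exit status is the number of mounts that failed the check.
    return "$bad"
}

# Check the real mounts on whichever node runs this script.
check_markers || echo "some mounts do not hold their own marker file"
```

Run on the head node, every line should start with OK. On the compute nodes, the first three mounts would be reported as mismatches along with the marker file that was found instead.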
Sorry for the length of this but I've tried to be as concise as possible.
It's probably not even SL-specific (the cluster is running SL 5.0), but
I value the knowledge and wisdom of the people on this list.
--
Mark Whidby
Infrastructure Coordinator (Unix) - Physics/Chemistry/EAES/Mathematics Team
Information Systems
Faculty of Engineering and Physical Sciences