LISTSERV - SCIENTIFIC-LINUX-USERS Archives

We're getting "do_ypcall: clnt_call: RPC: Timed out"
errors.

We're in the process of upgrading to 4.4,
starting with some new 64 bit Supermicros,
some with a single Xeon dual core and some with
a single Core 2 Duo.  Both have Intel e1000
ethernet chipsets.

We use NIS for user passwd and group entries,
as well as netgroups, services and automounts.
This has worked for us on 32 bit systems from
Red Hat 5.2 up through SL30{4,7} (including some
64 bit Athlons running a 32 bit OS).  We can
reproduce this on the 32 bit SL3 systems, but
they're a lot slower, and it takes some effort
to do it.

We first saw problems with torque (we've used
PBS Pro in the past), but narrowed it down to
rsh (and even a bare bones program running
rcmd()).  A single, random rsh call is fairly
safe, but if we do one every second or two,
we quickly start getting hangs and the error:

   do_ypcall: clnt_call: RPC: Timed out

So it can happen at any time, but when we fire
off lots of jobs in quick succession via torque,
it's guaranteed to happen.  We have also seen
this, less often, in some home grown tools.

We've stripped down NIS to bare essentials (using
only netgroup for testing), we've tried adding in
a 3Com ethernet card to use instead of the built-in
cards, we've upgraded to the latest EL4 ypbind,
ypserv and glibc (which we found in a CERN repo
after looking through TUV's bug list), we've tried
adding more, faster NIS servers, and we've tried
isolating three machines on a 100Mb network (no
spare 1Gb switches).  We also tried running the
non-SMP kernel.  None of this made any difference.

Bizarrely, we also get whining in the SL3 ypservers'
message logs about failed NIS host lookups.  We don't
use NIS for host lookups; nsswitch.conf has:

   hosts:   files dns

We had only used Solaris servers in the past,
and their ypserv daemons were not logging these errors.
Presumably they still got the requests, but we
don't know that.
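
One way to double-check the claim above on every client is
to list which nsswitch.conf databases still name a NIS
source.  The following is only a rough sketch: the function
name nis_databases is made up for illustration, it needs a
Python 3 interpreter, and treating "compat" as a NIS source
is an assumption (compat can pull in NIS through +/- entries).

```python
import re

# Sources treated as "uses NIS".  Including "compat" is an assumption.
NIS_SOURCES = ("nis", "nisplus", "compat")

def nis_databases(nsswitch_text):
    """Return {database: [sources]} for entries that name a NIS source."""
    hits = {}
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        database, sep, rest = line.partition(":")
        if not sep:
            continue                             # blank or malformed line
        # Remove action items such as [NOTFOUND=return]
        rest = re.sub(r"\[[^\]]*\]", " ", rest)
        sources = rest.split()
        if any(src in NIS_SOURCES for src in sources):
            hits[database.strip()] = sources
    return hits
```

Feeding it the contents of /etc/nsswitch.conf, for example
nis_databases(open("/etc/nsswitch.conf").read()), should
return no "hosts" key if host lookups really bypass NIS on
that machine.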

We ran ypserv in debug mode for a while, and nothing
jumped out at us.

We started running nscd for passwd and group on all
the Linux systems after this started.  No change.

The switches are Cisco Gb switches and HP ProCurve
Gb switches (the isolated test network was a 3Com
100Mb switch).

Any ideas on either problem?

Thanks,
Miles

TEST SCRIPT (fails every time, in fewer than 10 rsh calls,
on our faster boxes on the Gb network):

	#!/bin/csh
	
	# Set LIST_OF_HOSTNAMES to a valid list of hosts
	# to try, the more the merrier.  We use a command
	# to generate these from a file of valid names.
	# (hostA and hostB below are placeholders.)
	set LIST_OF_HOSTNAMES = ( hostA hostB )
	
	while ( 1 )
		foreach i ( $LIST_OF_HOSTNAMES )
			rsh $i uname -a # or any command you like
		end
	end
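
A possible variation takes rsh out of the picture and times
the NIS lookups directly.  This is a minimal sketch, not
something we have run: it assumes the stall is in the
netgroup lookup (unconfirmed), that a Python 3 interpreter
is available on a client, and the names probe and
getent_netgroup are made up for illustration.  "getent
netgroup" is the standard glibc tool and goes through the
same nsswitch path as other programs.

```python
import subprocess
import time

def getent_netgroup(name):
    """Run one netgroup lookup through NSS; return getent's exit status."""
    result = subprocess.run(["getent", "netgroup", name],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode

def probe(lookup, key, count=100, pause=0.0, slow=1.0):
    """Call lookup(key) count times.

    Returns (latencies, slow_calls): the time of every call in
    seconds, and the indexes of calls that took at least `slow`.
    """
    latencies, slow_calls = [], []
    for n in range(count):
        start = time.monotonic()
        lookup(key)
        elapsed = time.monotonic() - start
        latencies.append(elapsed)
        if elapsed >= slow:
            slow_calls.append(n)
        time.sleep(pause)
    return latencies, slow_calls
```

For example, probe(getent_netgroup, "somegroup", count=500)
with a real netgroup name in place of the placeholder
"somegroup" would show whether back-to-back lookups alone
produce multi-second stalls.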