LISTSERV - SCIENTIFIC-LINUX-USERS Archives

May 2007

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:	Re: YP/NIS weirdness on 4.4
From:	Jon Peatfield <[log in to unmask]>
Reply To:	Jon Peatfield <[log in to unmask]>
Date:	Fri, 18 May 2007 15:20:28 +0100

On Fri, 18 May 2007, Miles O'Neal wrote:

> We're getting "do_ypcall: clnt_call: RPC: Timed out"
> errors.
>
> We're in the process of upgrading to 4.4,
> starting with some new 64 bit Supermicros,
> some with a single Xeon dual core and some with
> a single Core 2 Duo.  Both have Intel e1000
> ethernet chipsets.
>
> We use NIS for user passwd and group entries,
> as well as netgroups, services and automounts.
> This has worked for us on 32 bit systems from
> Redhat5.2 up through SL30{4,7} (including some
> 64 bit Athlons running a 32 bit OS).  We can
> reproduce this on the 32 bit SL3 systems, but
> they're a lot slower, and it takes some effort
> to do it.
>
> We first saw problems with torque (we've used
> PBS Pro in the past), but narrowed it down to
> rsh (and even a bare bones program running
> rcmd()).  A single, random rsh call is fairly
> safe, but if we do one every second or two,
> we quickly start getting hangs and the error:
>
>   do_ypcall: clnt_call: RPC: Timed out

The glibc code for doing NIS calls will retry about 4 times (well, it did
the last time I checked), and if the server hasn't answered by then it
returns an error.
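
To make the retry behaviour concrete, here is a minimal Python sketch of
"try a few times, then give up".  It is not the glibc code: the function
name, the `send_request` callable and the default of 4 attempts are all
assumptions for illustration only.

```python
def call_with_retries(send_request, attempts=4):
    """Try a request up to `attempts` times, then give up.

    `send_request` is a hypothetical callable that returns a reply, or
    raises TimeoutError when the server does not answer in time.
    """
    for _ in range(attempts):
        try:
            return send_request()
        except TimeoutError:
            # No answer this time; go round the loop and try again.
            continue
    # Every attempt timed out, so surface the error to the caller.
    raise TimeoutError(f"no reply after {attempts} attempts")
```

A request that is silently dropped on every attempt ends up in the final
`raise`, which is roughly where a "Timed out" message would come from.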

If you manage to send sufficiently many requests to the server that *it*
can't cope then you will see these messages.  Some ypserv implementations 
cope better with load than others...

Now glibc sends the yp requests from a privileged port and lets the system
pick, so it ends up cycling through the available range.

Now we have some servers with Intel mboards with braindead BMC chipsets
which eat all traffic to the IPMI ports.  When anything happens to pick
one of those ports it never gets an answer, so it times out.  We saw *lots*
of this, especially doing things which caused lots of yp requests -- until
we tracked it down and caused things to avoid the IPMI ports.
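
As a rough illustration of "cycle through the range but avoid some ports",
the Python sketch below walks a reserved-port range once and skips an
avoid list.  It is not how glibc or our machines actually do it; the
function name and the 512-1023 default range are assumptions, and the two
ports in the usage line are the IPMI ones named further down this message.

```python
def reserved_ports(start, avoid, low=512, high=1023):
    """Yield each port in [low, high] once, beginning at `start` and
    wrapping around, skipping any port listed in `avoid`."""
    span = high - low + 1
    for offset in range(span):
        port = low + (start - low + offset) % span
        if port not in avoid:
            yield port


# Usage: the first candidate at or after 623 that is not on the avoid list.
first = next(reserved_ports(623, {623, 664}))
```

The point of the design is only that a skipped port is never offered to the
caller, so nothing can "happen to pick" it.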

Can you just do the sanity check and see if there is any correlation 
between the errors and ports in use at the time?  In our case tcpdump 
would show a packet being sent but no reply and it was pretty obvious from 
those logs that anything using ports 623 and 664 (tcp and udp) was 
broken...
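
One way to do that correlation, sketched in Python under the assumption
that you have already pulled (source port, got a reply?) pairs out of a
packet capture by whatever means suits you.  The function name and the
sample data are made up for illustration; they are not real capture output.

```python
from collections import Counter


def unanswered_by_port(observations):
    """Count requests that got no reply, keyed by local source port.

    `observations` is a hypothetical list of (source_port, got_reply)
    pairs, e.g. extracted by hand from a packet capture.
    """
    return Counter(port for port, got_reply in observations if not got_reply)


# Made-up sample: two unanswered requests from 623 and one from 664.
sample = [(623, False), (700, True), (664, False), (623, False), (801, True)]
tally = unanswered_by_port(sample)
```

If the failures pile up on a small number of ports rather than being spread
across the range, that points at something eating traffic to those ports
rather than at an overloaded server.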

> So it can happen at any time, but when we fire
> off lots of jobs in quick succession via torque,
> it's guaranteed to happen.  We have also seen
> this with less frequency in some home grown tools.
>
> We've stripped down NIS to bare essentials (using
> only netgroup for testing), we've tried adding in
> a 3Com ethernet card to use instead of the built
> on cards, we've upgraded to the latest EL4 ypbind,
> ypserv and glibc (which we found in a CERN repo
> after looking through TUV's bug list), we've tried
> adding more, faster NIS servers, and we've tried
> isolating three machines on a 100Mb network (no
> spare 1Gb switches).  And tried running the non-SMP
> kernel.  No difference.

I assume that you also checked for firewall issues at both ends...

> Bizarrely, we also get whining in the SL3 ypservers'
> message logs about failed NIS host lookups.  We don't
> use NIS for host lookups; nsswitch.conf has
>
>   hosts:   files dns
>
> We had only used Solaris servers in the past,
> and their ypservs were not logging these errors.
> Presumably they still got the requests, but we
> don't know that.

Do you have any libc5 code perhaps?

> We ran ypserv in debug mode for a while, and nothing
> jumped out at us.
>
> We started running nscd for passwd and group on all
> the Linux systems after this started.  No change.
>
> The switches are Cisco Gb switches and HP ProCurve
> Gb switches (the isolated test network was a 3Com
> 100Mb switch).
>
> Any ideas on either problem?
>
> Thanks,
> Miles
>
> TEST SCRIPT (works every time with failure in less than
> 10 rsh calls on our faster boxes on the Gb network):
>
> 	#!/bin/csh
>
> 	# set LIST_OF_HOSTNAMES to a valid list of hosts
> 	# to try, the more the merrier.  We use a command
> 	# to generate these from a file of valid names.
>
> 	while ( 1 )
> 		foreach i ( $LIST_OF_HOSTNAMES )
> 			rsh $i uname -a # or any command you like
> 		end
> 	end

Do you also see it with ssh connections?  I ask 'cos rsh also picks a
privileged (tcp) port...

-- 
Jon Peatfield,  Computer Officer,  DAMTP,  University of Cambridge
Mail:  [log in to unmask]     Web:  http://www.damtp.cam.ac.uk/
