SCIENTIFIC-LINUX-USERS Archives

August 2007

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Steve Traylen <[log in to unmask]>
Reply To:
Steve Traylen <[log in to unmask]>
Date:
Tue, 7 Aug 2007 09:59:06 +0200
Content-Type:
multipart/signed
Parts/Attachments:
text/plain (1268 bytes) , smime.p7s (1609 bytes)

On Aug 7, 2007, at 4:56 AM, Miles O'Neal wrote:

> We recently migrated from PBS to Torque, and most of our
> systems are now running 4.4.  The Torque server (a Core2
> Duo at 2.4GHz) is only handling about 3x the jobs our 300MHz
> Sun Ultra 5 could handle before bogging down horribly.  This
> seems a bit odd.
>

How many nodes and jobs?

> Watching the server logs, it seems there's a lot of time
> spent waiting for replies on sockets, though it's not clear
> whether it's on the same system between the scheduler and
> batch server, or between the batch server and client node
> processes (pbs_moms).
>

Do consider changing the values as described here.
http://www.clusterresources.com/torquedocs21/a.flargeclusters.shtml

In particular, for large farms you really need to set poll_jobs to
true and increase job_stat_rate.
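
As a rough sketch, both are server attributes set through qmgr on the
pbs_server host. The value 300 (seconds) is only an illustrative
assumption, not a figure from this thread; pick whatever suits your
farm after reading the page above.

```shell
# Run as a Torque manager on the pbs_server host.
# Let pbs_server poll the pbs_moms for job info in the background
# instead of blocking on them while servicing requests.
qmgr -c "set server poll_jobs = True"

# Accept job data from the moms that is up to 300 s old before
# asking again (300 is an assumed example value).
qmgr -c "set server job_stat_rate = 300"

# Check what the server now has configured.
qmgr -c "print server" | grep -E "poll_jobs|job_stat_rate"
```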

> We're beginning to wonder if it's OS-related.  Torque uses
> a lot of sockets, and sets them up and tears them down at a
> hefty rate.  We have the number set to 16K for the scheduler
> and server processes via ulimit, but we aren't getting much
> above 1400 between the two processes.
>
> Is anyone aware of an issue in 4.4 that might affect this?
>
> Thanks,
> Miles
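
To see how close the two daemons really get to their descriptor limit,
something like the following may help. It is only a sketch: it assumes
a Linux /proc filesystem and that the processes are named pbs_server
and pbs_sched, and it needs to run as root (or as the daemons' owner)
to read their fd directories. The helper name count_open_fds is made
up for this example.

```shell
# Count open file descriptors (sockets included) for a given PID.
count_open_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Report usage for the Torque daemons, if they are running on this host.
for name in pbs_server pbs_sched; do
    for pid in $(pgrep -x "$name" 2>/dev/null); do
        echo "$name (pid $pid): $(count_open_fds "$pid") open descriptors"
    done
done
```

Note that "ulimit -n" in an interactive shell only reports that
shell's limit; the daemons keep whatever limit was in effect when they
were started.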

-- 
Steve Traylen
Work Calendar: http://tinyurl.com/22lw9o
[log in to unmask]
CERN, IT-GD-OPS.




