Date: Tue, 7 Aug 2007 09:59:06 +0200
On Aug 7, 2007, at 4:56 AM, Miles O'Neal wrote:
> We recently migrated from PBS to torque, and most of our
> systems are now running 4.4. The torque server (a Core2
> Duo at 2.4GHz) is only handling about 3x the jobs our 300MHz
> Sun Ultra 5 could handle before bogging down horribly. This
> seems a bit odd.
>
How many nodes and jobs?
> Watching the server logs, it seems there's a lot of time
> spent waiting for replies on sockets, though it's not clear
> whether it's on the same system between the scheduler and
> batch server, or between the batch server and client node
> processes (pbs_moms).
>
Do consider changing the values as described here:
http://www.clusterresources.com/torquedocs21/a.flargeclusters.shtml
In particular, for large farms you really need to have poll_jobs set
to true and to increase the job_stat_rate.
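For reference, a minimal sketch of how these two server parameters can
be set with qmgr, assuming it is run by a Torque manager on the server
host. The value of 300 seconds for job_stat_rate is only an illustrative
assumption, not a figure from this thread; choose one that suits the
size of the farm.

```shell
# Have pbs_server poll the pbs_moms for job status in the background
# rather than querying them when a status request arrives.
qmgr -c "set server poll_jobs = True"

# Maximum age (in seconds) of job data from the moms before the server
# asks for an update. 300 is an assumed example value.
qmgr -c "set server job_stat_rate = 300"

# Print the server configuration to confirm the new settings.
qmgr -c "print server"
```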
> We're beginning to wonder if it's OS-related. Torque uses
> a lot of sockets, and sets them up and tears them down at a
> hefty rate. We have the number set to 16K for the scheduler
> and server processes via ulimit, but we aren't getting much
> above 1400 between the two processes.
>
> Is anyone aware of an issue in 4.4 that might affect this?
>
> Thanks,
> Miles
--
Steve Traylen
Work Calendar: http://tinyurl.com/22lw9o
CERN, IT-GD-OPS.