Hi David,
I am certainly no expert but this looks to me like the classic NFS
symptoms when the server gets overloaded, or a disk or the network gets
flaky.
If it were me, I'd try to get the class to do more local i/o (if
possible). Perhaps a scratch area on the local disk would solve the
problem.
I think you could reproduce the problem by writing a test script that
does heavy i/o to the network folders, then running it on more and more
machines and watching the i/o throughput approach zero as the machines
hang waiting for NFS.
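Something like the following might serve as a starting point for such a
script. It is only a sketch under a few assumptions of mine: Python 3 is
available on the clients, the directory you pass in lives on the NFS
mount, and the function name io_round and the default sizes are made up
for illustration. The read figure may mostly reflect the client's page
cache rather than the server, so the write figure is the one to watch.

```python
#!/usr/bin/env python3
"""Rough NFS load generator: write and re-read a file, report throughput."""
import os
import sys
import tempfile
import time


def io_round(directory, size_mb=16, block_kb=64):
    """Write size_mb of data to a temp file in directory, fsync it,
    read it back, and return (write_MBps, read_MBps)."""
    block = os.urandom(block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory, prefix="nfs-load-")
    try:
        start = time.monotonic()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push the data out to the server
        write_secs = time.monotonic() - start

        start = time.monotonic()
        total = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(block_kb * 1024)
                if not chunk:
                    break
                total += len(chunk)
        read_secs = time.monotonic() - start
    finally:
        os.unlink(path)  # always clean up the scratch file
    mb = total / (1024 * 1024)
    return mb / max(write_secs, 1e-9), mb / max(read_secs, 1e-9)


def main(argv):
    # argv[1]: directory to hammer (assumed to be on the NFS mount)
    # argv[2]: number of rounds to run
    directory = argv[1] if len(argv) > 1 else "."
    rounds = int(argv[2]) if len(argv) > 2 else 10
    for i in range(rounds):
        w, r = io_round(directory)
        print(f"round {i + 1}: write {w:.1f} MB/s, "
              f"read {r:.1f} MB/s", flush=True)


if __name__ == "__main__":
    main(sys.argv)
```

Saved under some name of your choosing and pointed at a directory in an
automounted home, it could be started on one workstation first and then
on a few more at a time while you watch 'sar -u' on the server.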
Again, I'm no expert, so feel free to ignore me.
Joe
On 11/29/2012 10:49 AM, David Fitzgerald wrote:
> Last night during class time I had a chance to check some of the machines with the frozen displays, and I am not sure what to make of what I found. Running 'lsof -p $PID' (with PID being 5044) on one of the affected machines gave this, which doesn't tell me much:
>
> 10.10.10 5044 root cwd DIR 8,7 4096 2 /
> 10.10.10 5044 root rtd DIR 8,7 4096 2 /
> 10.10.10 5044 root txt unknown /proc/5044/exe
>
>
> I also ran pstree and I will put that output below, but I think I may be barking up the wrong tree. While some of my clients were freezing up, I saw that my NFS server was getting very high 'top' loads. Fortunately I have sysstat running on the server and after class 'sar -u' showed that %iowait went from less than 1 before class to a high of 53 after class began, and stayed high until class ended. Here is the relevant 'chunk' of the sar -u output:
>
> (columns: time, CPU, %user, %nice, %system, %iowait, %steal, %idle)
> 05:20:01 PM all 0.03 0.00 0.07 0.17 0.00 99.73
> 05:30:01 PM all 0.03 0.00 0.03 0.11 0.00 99.83
> 05:40:01 PM all 0.18 0.00 0.50 1.88 0.00 97.44
> 05:50:01 PM all 0.16 0.00 1.12 6.93 0.00 91.78
> 06:00:01 PM all 0.73 0.00 5.23 32.61 0.00 61.43
> 06:10:01 PM all 0.77 0.00 6.55 53.67 0.00 39.01
> 06:20:01 PM all 0.13 0.00 4.81 27.81 0.00 67.25
> 06:30:01 PM all 0.13 0.00 6.69 21.71 0.00 71.47
> 06:40:01 PM all 0.11 0.00 3.47 33.34 0.00 63.08
> 06:50:01 PM all 0.11 0.00 3.20 31.02 0.00 65.67
> 07:00:01 PM all 0.24 0.00 3.93 30.79 0.00 65.05
> 07:10:01 PM all 0.16 0.00 3.63 20.51 0.00 75.71
> 07:20:01 PM all 0.18 0.00 5.23 1.45 0.00 93.13
> 07:30:01 PM all 0.10 0.00 5.72 0.70 0.00 93.48
> Average: all 0.06 0.01 0.46 2.13 0.00 97.34
>
>
> The NFS server is a virtual machine running in ESXi 4.1, and VMware Tools IS installed. Could this be slow disk access, and thus a VMware misconfiguration? I hate to admit it, but I am at a loss.
>
> I can run other sar reports on yesterday's (Wednesday's) data if anyone thinks there may be something in there to help.
>
> For what it's worth, here is the output from pstree from one of the affected clients, and I do NOT see the PID that I was looking for:
>
> init(1)-+-NetworkManager(1782)-+-dhclient(1808)
> | `-{NetworkManager}(1809)
> |-abrtd(2341)
> |-acpid(2039)
> |-anacron(3615)
> |-atd(2413)
> |-atieventsd(2421)---authatieventsd.(4134)
> |-auditd(1547)-+-audispd(1549)-+-sedispatch(1550)
> | | `-{audispd}(1551)
> | `-{auditd}(1548)
> |-automount(2134)-+-{automount}(2135)
> | |-{automount}(2136)
> | |-{automount}(2139)
> | |-{automount}(2142)
> | |-{automount}(2143)
> | `-{automount}(2144)
> |-avahi-daemon(1794)---avahi-daemon(1795)
> |-bonobo-activati(4549)---{bonobo-activat}(4550)
> |-cachefilesd(1597)
> |-certmonger(2435)
> |-clock-applet(4644)
> |-console-kit-dae(2521)-+-{console-kit-da}(2522)
> | |-{console-kit-da}(2523)
> | |-{console-kit-da}(2524)
> | |-{console-kit-da}(2525)
> | |-{console-kit-da}(2526)
> | |-{console-kit-da}(2527)
> | |-{console-kit-da}(2528)
> | |-{console-kit-da}(2529)
> | |-{console-kit-da}(2530)
> | |-{console-kit-da}(2531)
> | |-{console-kit-da}(2532)
> | |-{console-kit-da}(2533)
> | |-{console-kit-da}(2534)
> | |-{console-kit-da}(2535)
> | |-{console-kit-da}(2536)
> | |-{console-kit-da}(2537)
> | |-{console-kit-da}(2538)
> | |-{console-kit-da}(2539)
> | |-{console-kit-da}(2540)
> | |-{console-kit-da}(2541)
> | |-{console-kit-da}(2542)
> | |-{console-kit-da}(2543)
> | |-{console-kit-da}(2544)
> | |-{console-kit-da}(2545)
> | |-{console-kit-da}(2546)
> | |-{console-kit-da}(2547)
> | |-{console-kit-da}(2548)
> | |-{console-kit-da}(2549)
> | |-{console-kit-da}(2550)
> | |-{console-kit-da}(2551)
> | |-{console-kit-da}(2552)
> | |-{console-kit-da}(2553)
> | |-{console-kit-da}(2554)
> | |-{console-kit-da}(2555)
> | |-{console-kit-da}(2556)
> | |-{console-kit-da}(2557)
> | |-{console-kit-da}(2558)
> | |-{console-kit-da}(2559)
> | |-{console-kit-da}(2560)
> | |-{console-kit-da}(2561)
> | |-{console-kit-da}(2562)
> | |-{console-kit-da}(2563)
> | |-{console-kit-da}(2564)
> | |-{console-kit-da}(2565)
> | |-{console-kit-da}(2566)
> | |-{console-kit-da}(2567)
> | |-{console-kit-da}(2568)
> | |-{console-kit-da}(2569)
> | |-{console-kit-da}(2570)
> | |-{console-kit-da}(2571)
> | |-{console-kit-da}(2572)
> | |-{console-kit-da}(2573)
> | |-{console-kit-da}(2574)
> | |-{console-kit-da}(2575)
> | |-{console-kit-da}(2576)
> | |-{console-kit-da}(2577)
> | |-{console-kit-da}(2578)
> | |-{console-kit-da}(2579)
> | |-{console-kit-da}(2580)
> | |-{console-kit-da}(2581)
> | |-{console-kit-da}(2582)
> | |-{console-kit-da}(2583)
> | `-{console-kit-da}(2585)
> |-crond(2402)
> |-cupsd(1955)
> |-dbus-daemon(1772)
> |-dbus-daemon(2883)
> |-dbus-launch(2591)
> |-dbus-launch(2882)
> |-devkit-power-da(2602)
> |-fcoemon(1760)
> |-firefox(4968)
> |-gconf-im-settin(4534)
> |-gconfd-2(3175)
> |-gdm-binary(2449)---gdm-simple-slav(2490)-+-Xorg(2492)
> | `-gdm-session-wor(2671)---tcsh(2849)---gnome-session(4148)-+-bluetooth-apple(436+
> | |-gdu-notificatio(432+
> | |-gnome-panel(4253)
> | |-gnome-power-man(434+
> | |-gnome-volume-co(432+
> | |-gpk-update-icon(430+
> | |-krb5-auth-dialo(435+
> | |-metacity(4244)
> | |-nautilus(4276)
> | |-nm-applet(4342)
> | |-polkit-gnome-au(432+
> | |-python(4294)
> | `-{gnome-session}(422+
> |-gdm-user-switch(4640)
> |-gedit(4779)-+-{gedit}(4894)
> | |-{gedit}(5037)
> | |-{gedit}(5038)
> | `-{gedit}(5039)
> |-gnome-keyring-d(2831)-+-{gnome-keyring-}(2832)
> | `-{gnome-keyring-}(4237)
> |-gnome-screensav(4665)
> |-gnome-settings-(4235)---{gnome-settings}(4248)
> |-gnote(4635)
> |-gvfs-afc-volume(4573)---{gvfs-afc-volum}(4574)
> |-gvfs-gdu-volume(4569)
> |-gvfs-gphoto2-vo(4571)
> |-gvfsd(3168)
> |-gvfsd-burn(4754)
> |-gvfsd-metadata(4794)
> |-gvfsd-trash(4656)
> |-hald(2048)---hald-runner(2049)-+-hald-addon-acpi(2096)
> | |-hald-addon-inpu(2088)
> | `-hald-addon-stor(2097)
> |-im-settings-dae(4371)
> |-lldpad(1734)
> |-master(2332)-+-pickup(2347)
> | `-qmgr(2348)
> |-mingetty(2454)
> |-mingetty(2456)
> |-mingetty(2458)
> |-mingetty(2460)
> |-mingetty(2462)
> |-modem-manager(1789)
> |-notification-ar(4642)
> |-ntpd(2249)
> |-pcscd(2114)---{pcscd}(2129)
> |-polkitd(2647)
> |-pulseaudio(4331)-+-gconf-helper(4563)
> | |-{pulseaudio}(4535)
> | `-{pulseaudio}(4539)
> |-qpidd(2356)-+-{qpidd}(2357)
> | |-{qpidd}(2358)
> | `-{qpidd}(2359)
> |-rpc.idmapd(1864)
> |-rpc.mountd(2190)
> |-rpc.rquotad(2175)
> |-rpc.statd(1818)
> |-rpcbind(1648)
> |-rsyslogd(1574)-+-{rsyslogd}(1575)
> | |-{rsyslogd}(1576)
> | `-{rsyslogd}(1578)
> |-rtkit-daemon(2661)-+-{rtkit-daemon}(2662)
> | `-{rtkit-daemon}(2663)
> |-seahorse-agent(3155)
> |-seahorse-daemon(4243)
> |-sshd(2233)---sshd(5003)---bash(5005)---pstree(5057)
> |-sssd(2216)-+-sssd_be(2281)
> | |-sssd_nss(2286)
> | `-sssd_pam(2287)
> |-stap-serverd(1927)---{stap-serverd}(1932)
> |-udevd(542)-+-udevd(1166)
> | `-udevd(1745)
> |-udisks-daemon(4373)---udisks-daemon(4374)
> |-wpa_supplicant(1813)
> `-xinetd(2241)
>
>
>
>
> ________________________________________
> From: Christopher Tooley [[log in to unmask]]
> Sent: Wednesday, November 28, 2012 1:00 PM
> To: David Fitzgerald
> Cc: [log in to unmask]
> Subject: Re: clients slow down due to unknown process
>
> If/when you find out what it is, would you kindly report back to the list what you find? This has got me really curious now. :D
>
> -Chris
>
> On 2012-11-28, at 5:51 AM, David Fitzgerald<[log in to unmask]> wrote:
>
>> Thank you everyone for all the good ideas. I have class this evening and will be able to use your suggestions. I'll let you know what I find.
>>
>> Dave
>>
>> -----Original Message-----
>> From: Robert Blair [mailto:[log in to unmask]]
>> Sent: Tuesday, November 27, 2012 11:56 AM
>> To: Sergio Ballestrero
>> Cc: David Fitzgerald; [log in to unmask]
>> Subject: Re: clients slow down due to unknown process
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> "/usr/sbin/lsof -p $PID" will also list all of the resources it uses which is often a big help in figuring out wtf it is all about.
>>
>> On 11/27/2012 10:52 AM, Sergio Ballestrero wrote:
>>> Hello David,
>>> I'm not familiar with freeIPA, but anyway you can start by better
>>> identifying the process.
>>> In top, get the PID and look under /proc/$PID - in particular exe
>>> will be a link to the binary, like lrwxrwxrwx 1 root root 0 Nov 27
>>> 01:41 /proc/1/exe -> /sbin/init
>>>
>>> pstree -p -H $PID
>>> will help you identify the parent process, if there's one.
>>>
>>> Cheers,
>>> Sergio
>>>
>>> On 27 Nov 2012, at 16:21, David Fitzgerald wrote:
>>>
>>>> Hello,
>>>>
>>>> Sorry for the length of this post, but I want to make sure I give all
>>>> the information needed for someone to help.
>>>>
>>>> I have a lab of 25 workstations running Scientific Linux 6.2. User
>>>> accounts are authenticated via freeIPA, and auto mounted to an NFS
>>>> server and the users use Gnome 2.8. The NFS and freeIPA servers are
>>>> located on the same server (IP 10.10.10.10) which is also running
>>>> Scientific Linux 6.2 and is a virtual guest in VMware ESXI 4.1.
>>>>
>>>> During class when the workstations are most heavily in use, the
>>>> students are writing Fortran programs with gedit and usually have
>>>> firefox up as well. Here is my predicament. During class some of
>>>> the workstation screens will freeze with no mouse or keyboard input.
>>>> This can last for varying lengths of time, sometimes a few minutes,
>>>> some other times for the full length of the class. I can ssh in to
>>>> the frozen machines and top will show load averages of up to 4 or more.
>>>> The process taking up the most CPU is one I don't recognize named
>>>> 10.10.10.10-ma. The 10.10.10.10 being the IP address of my server.
>>>> I have no idea what that process is related to, whether it's freeIPA,
>>>> NFS, Gnome or something else. Killing the process doesn't help as it
>>>> simply restarts with a new PID. Note that the freezing does NOT
>>>> happen when only a few people are using the lab, so reproducing the
>>>> problem outside of class time is difficult.
>>>>
>>>> Can anyone help me track down this problem and fix it?
>>>>
>>>> I appreciate any help you can give.
>>>>
>>>> Thanks!
>>>>
>>>> Dave
>>>>
>>>>
>>>> +++++++++++++++++++++++
>>>> David Fitzgerald
>>>> Department of Earth Sciences
>>>> Millersville University
>>>> Millersville, PA 17551
>>>>
>>>> Phone: 717-871-2394
>>>>
>>> --
>>> Sergio Ballestrero - http://physics.uj.ac.za/psiwiki/Ballestrero
>>> University of Johannesburg, Physics Department ATLAS TDAQ sysadmin
>>> team - Office:75282 OnCall:164851
>>>
>>>
>>>
>>>
>>>
>>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.5 (GNU/Linux)
>>
>> iQEUAwUBULTwmfQM1KNWz8QaAQLU0Qf2JXa29RVDhJALq2TD72Nis4wAmxlqFIYP
>> rIo5sHBUI+o/bebsDit9qoC+hWuCK3+xDai9fzF2jUQqXfhRZiPHjdQRpCViMurY
>> Wp+aVZWCD1U3KusuWMSWlv6Xdx0QmaMQr8Nh8JRRWUi8cNEgAO2Th1txwdu3auJb
>> LssTFmwUjLUEC0mKhgx6520hisirfOHNTnF3rQCN5ilZGEYEZ2vMm/lcm5yI0Sqc
>> wdqWUXVYGNsBepFf4bRWaWPX0Hbf6sbLgoJNUHJOJ2pGpc3MUp3SiGsIIUGkZwPW
>> xT6kS523J+nItY/odmvdl+ibHRVa7TgDx0xhuqISarr39g00yvvx
>> =RQky
>> -----END PGP SIGNATURE-----