Hi David,

I am certainly no expert, but this looks to me like the classic NFS symptoms when the server gets overloaded, or a disk or the network gets flaky.

If it were me, I'd try to get the class to do more local I/O (if possible). Perhaps a scratch area on the local disk would solve the problem.

I think you could reproduce the problem by writing a test script that does heavy I/O to the network folders, then running it on more and more machines and watching the I/O throughput approach zero while the machines hang waiting for NFS.

Again, I'm no expert, so feel free to ignore me.

Joe

On 11/29/2012 10:49 AM, David Fitzgerald wrote:
> Last night during class time I had a chance to check some of the machines with the frozen displays, and I am not sure what to make of what I found. Running 'lsof -p $PID' (with PID being 5044) on one of the affected machines gave this, which doesn't tell me much:
>
> 10.10.10 5044 root cwd DIR     8,7 4096 2 /
> 10.10.10 5044 root rtd DIR     8,7 4096 2 /
> 10.10.10 5044 root txt unknown            /proc/5044/exe
>
> I also ran pstree and I will put that output below, but I think I may be barking up the wrong tree. While some of my clients were freezing up, I saw that my NFS server was getting very high 'top' loads. Fortunately I have sysstat running on the server, and after class 'sar -u' showed that %iowait went from less than 1 before class to a high of 53 after class began, and stayed high until class ended.
> Here is the relevant 'chunk' of the sar -u output:
>
> 05:20:01 PM  all   0.03   0.00   0.07   0.17   0.00  99.73
> 05:30:01 PM  all   0.03   0.00   0.03   0.11   0.00  99.83
> 05:40:01 PM  all   0.18   0.00   0.50   1.88   0.00  97.44
> 05:50:01 PM  all   0.16   0.00   1.12   6.93   0.00  91.78
> 06:00:01 PM  all   0.73   0.00   5.23  32.61   0.00  61.43
> 06:10:01 PM  all   0.77   0.00   6.55  53.67   0.00  39.01
> 06:20:01 PM  all   0.13   0.00   4.81  27.81   0.00  67.25
> 06:30:01 PM  all   0.13   0.00   6.69  21.71   0.00  71.47
> 06:40:01 PM  all   0.11   0.00   3.47  33.34   0.00  63.08
> 06:50:01 PM  all   0.11   0.00   3.20  31.02   0.00  65.67
> 07:00:01 PM  all   0.24   0.00   3.93  30.79   0.00  65.05
> 07:10:01 PM  all   0.16   0.00   3.63  20.51   0.00  75.71
> 07:20:01 PM  all   0.18   0.00   5.23   1.45   0.00  93.13
> 07:30:01 PM  all   0.10   0.00   5.72   0.70   0.00  93.48
> Average:     all   0.06   0.01   0.46   2.13   0.00  97.34
>
> The NFS server is a virtual machine running in ESXI 4.1 and VMware tools IS installed. Could this be slow disk access, and thus a VMware misconfiguration? I hate to admit it, but I am at a loss.
>
> I can run other sar reports on yesterday's (Wednesday's) data if anyone thinks there may be something in there to help.
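
The %iowait jump in the quoted sar chunk can be picked out mechanically. What follows is only a minimal sketch: it assumes the usual `sar -u` column order (CPU, %user, %nice, %system, %iowait, %steal, %idle), which the post does not show because the header line was not pasted, and the 12-hour timestamps seen in the quoted rows. The name `high_iowait` and the 20% cut-off are made up for illustration; the sample rows are copied from the quoted output.

```python
# Rows copied from the quoted 'sar -u' chunk. The column order
# (CPU %user %nice %system %iowait %steal %idle) is an assumption,
# because the header line was not included in the post.
SAR_CHUNK = """\
05:20:01 PM all 0.03 0.00 0.07 0.17 0.00 99.73
05:30:01 PM all 0.03 0.00 0.03 0.11 0.00 99.83
05:40:01 PM all 0.18 0.00 0.50 1.88 0.00 97.44
05:50:01 PM all 0.16 0.00 1.12 6.93 0.00 91.78
06:00:01 PM all 0.73 0.00 5.23 32.61 0.00 61.43
06:10:01 PM all 0.77 0.00 6.55 53.67 0.00 39.01
06:20:01 PM all 0.13 0.00 4.81 27.81 0.00 67.25
06:30:01 PM all 0.13 0.00 6.69 21.71 0.00 71.47
06:40:01 PM all 0.11 0.00 3.47 33.34 0.00 63.08
06:50:01 PM all 0.11 0.00 3.20 31.02 0.00 65.67
07:00:01 PM all 0.24 0.00 3.93 30.79 0.00 65.05
07:10:01 PM all 0.16 0.00 3.63 20.51 0.00 75.71
07:20:01 PM all 0.18 0.00 5.23 1.45 0.00 93.13
07:30:01 PM all 0.10 0.00 5.72 0.70 0.00 93.48
Average: all 0.06 0.01 0.46 2.13 0.00 97.34
"""


def high_iowait(text, threshold=20.0):
    """Return (timestamp, %iowait) pairs for intervals above threshold.

    The threshold is an arbitrary illustrative cut-off, not a sysstat default.
    """
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Interval rows have 9 fields: time, AM/PM, CPU and six percentages.
        # The 'Average:' summary row has only 8 fields and is skipped.
        if len(fields) != 9 or fields[1] not in ("AM", "PM"):
            continue
        iowait = float(fields[6])
        if iowait > threshold:
            hits.append((fields[0] + " " + fields[1], iowait))
    return hits


if __name__ == "__main__":
    for stamp, iowait in high_iowait(SAR_CHUNK):
        print(stamp, iowait)
```

Run against the quoted rows, this flags the intervals from 06:00 PM through 07:10 PM, which lines up with class time as described.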
>
> For what it's worth, here is the output from pstree from one of the affected clients, and I do NOT see the PID that I was looking for:
>
> init(1)-+-NetworkManager(1782)-+-dhclient(1808)
>         |                      `-{NetworkManager}(1809)
>         |-abrtd(2341)
>         |-acpid(2039)
>         |-anacron(3615)
>         |-atd(2413)
>         |-atieventsd(2421)---authatieventsd.(4134)
>         |-auditd(1547)-+-audispd(1549)-+-sedispatch(1550)
>         |              |               `-{audispd}(1551)
>         |              `-{auditd}(1548)
>         |-automount(2134)-+-{automount}(2135)
>         |                 |-{automount}(2136)
>         |                 |-{automount}(2139)
>         |                 |-{automount}(2142)
>         |                 |-{automount}(2143)
>         |                 `-{automount}(2144)
>         |-avahi-daemon(1794)---avahi-daemon(1795)
>         |-bonobo-activati(4549)---{bonobo-activat}(4550)
>         |-cachefilesd(1597)
>         |-certmonger(2435)
>         |-clock-applet(4644)
>         |-console-kit-dae(2521)-+-{console-kit-da}(2522)
>         |                       |-{console-kit-da}(2523)
>         |                       |-{console-kit-da}(2524)
>         |                       |-{console-kit-da}(2525)
>         |                       |-{console-kit-da}(2526)
>         |                       |-{console-kit-da}(2527)
>         |                       |-{console-kit-da}(2528)
>         |                       |-{console-kit-da}(2529)
>         |                       |-{console-kit-da}(2530)
>         |                       |-{console-kit-da}(2531)
>         |                       |-{console-kit-da}(2532)
>         |                       |-{console-kit-da}(2533)
>         |                       |-{console-kit-da}(2534)
>         |                       |-{console-kit-da}(2535)
>         |                       |-{console-kit-da}(2536)
>         |                       |-{console-kit-da}(2537)
>         |                       |-{console-kit-da}(2538)
>         |                       |-{console-kit-da}(2539)
>         |                       |-{console-kit-da}(2540)
>         |                       |-{console-kit-da}(2541)
>         |                       |-{console-kit-da}(2542)
>         |                       |-{console-kit-da}(2543)
>         |                       |-{console-kit-da}(2544)
>         |                       |-{console-kit-da}(2545)
>         |                       |-{console-kit-da}(2546)
>         |                       |-{console-kit-da}(2547)
>         |                       |-{console-kit-da}(2548)
>         |                       |-{console-kit-da}(2549)
>         |                       |-{console-kit-da}(2550)
>         |                       |-{console-kit-da}(2551)
>         |                       |-{console-kit-da}(2552)
>         |                       |-{console-kit-da}(2553)
>         |                       |-{console-kit-da}(2554)
>         |                       |-{console-kit-da}(2555)
>         |                       |-{console-kit-da}(2556)
>         |                       |-{console-kit-da}(2557)
>         |                       |-{console-kit-da}(2558)
>         |                       |-{console-kit-da}(2559)
>         |                       |-{console-kit-da}(2560)
>         |                       |-{console-kit-da}(2561)
>         |                       |-{console-kit-da}(2562)
>         |                       |-{console-kit-da}(2563)
>         |                       |-{console-kit-da}(2564)
>         |                       |-{console-kit-da}(2565)
>         |                       |-{console-kit-da}(2566)
>         |                       |-{console-kit-da}(2567)
>         |                       |-{console-kit-da}(2568)
>         |                       |-{console-kit-da}(2569)
>         |                       |-{console-kit-da}(2570)
>         |                       |-{console-kit-da}(2571)
>         |                       |-{console-kit-da}(2572)
>         |                       |-{console-kit-da}(2573)
>         |                       |-{console-kit-da}(2574)
>         |                       |-{console-kit-da}(2575)
>         |                       |-{console-kit-da}(2576)
>         |                       |-{console-kit-da}(2577)
>         |                       |-{console-kit-da}(2578)
>         |                       |-{console-kit-da}(2579)
>         |                       |-{console-kit-da}(2580)
>         |                       |-{console-kit-da}(2581)
>         |                       |-{console-kit-da}(2582)
>         |                       |-{console-kit-da}(2583)
>         |                       `-{console-kit-da}(2585)
>         |-crond(2402)
>         |-cupsd(1955)
>         |-dbus-daemon(1772)
>         |-dbus-daemon(2883)
>         |-dbus-launch(2591)
>         |-dbus-launch(2882)
>         |-devkit-power-da(2602)
>         |-fcoemon(1760)
>         |-firefox(4968)
>         |-gconf-im-settin(4534)
>         |-gconfd-2(3175)
>         |-gdm-binary(2449)---gdm-simple-slav(2490)-+-Xorg(2492)
>         |                                          `-gdm-session-wor(2671)---tcsh(2849)---gnome-session(4148)-+-bluetooth-apple(436+
>         |                                                                                                     |-gdu-notificatio(432+
>         |                                                                                                     |-gnome-panel(4253)
>         |                                                                                                     |-gnome-power-man(434+
>         |                                                                                                     |-gnome-volume-co(432+
>         |                                                                                                     |-gpk-update-icon(430+
>         |                                                                                                     |-krb5-auth-dialo(435+
>         |                                                                                                     |-metacity(4244)
>         |                                                                                                     |-nautilus(4276)
>         |                                                                                                     |-nm-applet(4342)
>         |                                                                                                     |-polkit-gnome-au(432+
>         |                                                                                                     |-python(4294)
>         |                                                                                                     `-{gnome-session}(422+
>         |-gdm-user-switch(4640)
>         |-gedit(4779)-+-{gedit}(4894)
>         |             |-{gedit}(5037)
>         |             |-{gedit}(5038)
>         |             `-{gedit}(5039)
>         |-gnome-keyring-d(2831)-+-{gnome-keyring-}(2832)
>         |                       `-{gnome-keyring-}(4237)
>         |-gnome-screensav(4665)
>         |-gnome-settings-(4235)---{gnome-settings}(4248)
>         |-gnote(4635)
>         |-gvfs-afc-volume(4573)---{gvfs-afc-volum}(4574)
>         |-gvfs-gdu-volume(4569)
>         |-gvfs-gphoto2-vo(4571)
>         |-gvfsd(3168)
>         |-gvfsd-burn(4754)
>         |-gvfsd-metadata(4794)
>         |-gvfsd-trash(4656)
>         |-hald(2048)---hald-runner(2049)-+-hald-addon-acpi(2096)
>         |                                |-hald-addon-inpu(2088)
>         |                                `-hald-addon-stor(2097)
>         |-im-settings-dae(4371)
>         |-lldpad(1734)
>         |-master(2332)-+-pickup(2347)
>         |              `-qmgr(2348)
>         |-mingetty(2454)
>         |-mingetty(2456)
>         |-mingetty(2458)
>         |-mingetty(2460)
>         |-mingetty(2462)
>         |-modem-manager(1789)
>         |-notification-ar(4642)
>         |-ntpd(2249)
>         |-pcscd(2114)---{pcscd}(2129)
>         |-polkitd(2647)
>         |-pulseaudio(4331)-+-gconf-helper(4563)
>         |                  |-{pulseaudio}(4535)
>         |                  `-{pulseaudio}(4539)
>         |-qpidd(2356)-+-{qpidd}(2357)
>         |             |-{qpidd}(2358)
>         |             `-{qpidd}(2359)
>         |-rpc.idmapd(1864)
>         |-rpc.mountd(2190)
>         |-rpc.rquotad(2175)
>         |-rpc.statd(1818)
>         |-rpcbind(1648)
>         |-rsyslogd(1574)-+-{rsyslogd}(1575)
>         |                |-{rsyslogd}(1576)
>         |                `-{rsyslogd}(1578)
>         |-rtkit-daemon(2661)-+-{rtkit-daemon}(2662)
>         |                    `-{rtkit-daemon}(2663)
>         |-seahorse-agent(3155)
>         |-seahorse-daemon(4243)
>         |-sshd(2233)---sshd(5003)---bash(5005)---pstree(5057)
>         |-sssd(2216)-+-sssd_be(2281)
>         |            |-sssd_nss(2286)
>         |            `-sssd_pam(2287)
>         |-stap-serverd(1927)---{stap-serverd}(1932)
>         |-udevd(542)-+-udevd(1166)
>         |            `-udevd(1745)
>         |-udisks-daemon(4373)---udisks-daemon(4374)
>         |-wpa_supplicant(1813)
>         `-xinetd(2241)
>
>
> ________________________________________
> From: Christopher Tooley [[log in to unmask]]
> Sent: Wednesday, November 28, 2012 1:00 PM
> To: David Fitzgerald
> Cc: [log in to unmask]
> Subject: Re: clients slow down due to unknown process
>
> If/when you find out what it is, would you kindly report back to the list what you find? This has got me really curious now. :D
>
> -Chris
>
> On 2012-11-28, at 5:51 AM, David Fitzgerald <[log in to unmask]> wrote:
>
>> Thank you everyone for all the good ideas. I have class this evening and will be able to use your suggestions. I'll let you know what I find.
>>
>> Dave
>>
>> -----Original Message-----
>> From: Robert Blair [mailto:[log in to unmask]]
>> Sent: Tuesday, November 27, 2012 11:56 AM
>> To: Sergio Ballestrero
>> Cc: David Fitzgerald; [log in to unmask]
>> Subject: Re: clients slow down due to unknown process
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> "/usr/sbin/lsof -p $PID" will also list all of the resources it uses which is often a big help in figuring out wtf it is all about.
>>
>> On 11/27/2012 10:52 AM, Sergio Ballestrero wrote:
>>> Hello David,
>>> I'm not familiar with freeIPA, but anyway you can start by better
>>> identifying the process.
>>> In top, get the PID and look under /proc/$PID - in particular exe
>>> will be a link to the binary, like
>>> lrwxrwxrwx 1 root root 0 Nov 27 01:41 /proc/1/exe -> /sbin/init
>>>
>>> pstree -p -H $PID
>>> will help you identify the parent process, if there's one.
>>>
>>> Cheers,
>>> Sergio
>>>
>>> On 27 Nov 2012, at 16:21, David Fitzgerald wrote:
>>>
>>>> Hello,
>>>>
>>>> Sorry for the length of this post, but I want to make sure I give all
>>>> the information needed for someone to help.
>>>>
>>>> I have a lab of 25 workstations running Scientific Linux 6.2. User
>>>> accounts are authenticated via freeIPA, and auto mounted to an NFS
>>>> server and the users use Gnome 2.8. The NFS and freeIPA servers are
>>>> located on the same server (IP 10.10.10.10) which is also running
>>>> Scientific Linux 6.2 and is a virtual guest in VMware ESXI 4.1.
>>>>
>>>> During class when the workstations are most heavily in use, the
>>>> students are writing Fortran programs with gedit and usually have
>>>> firefox up as well. Here is my predicament. During class some of
>>>> the workstation screens will freeze with no mouse or keyboard input.
>>>> This can last for varying lengths of time, sometimes a few minutes,
>>>> some other times for the full length of the class. I can ssh in to
>>>> the frozen machines and top will show load averages of up to 4 or more.
>>>> The process taking up the most CPU is one I don't recognize named
>>>> 10.10.10.10-ma. The 10.10.10.10 being the IP address of my server.
>>>> I have no idea what that process is related to, whether it's freeIPA,
>>>> NFS, Gnome or something else. Killing the process doesn't help as it
>>>> simply restarts with a new PID. Note that the freezing does NOT
>>>> happen when only a few people are using the lab, so reproducing the
>>>> problem outside of class time is difficult.
>>>>
>>>> Can anyone help me track down this problem and fix it?
>>>>
>>>> I appreciate any help you can give.
>>>>
>>>> Thanks!
>>>>
>>>> Dave
>>>>
>>>>
>>>> +++++++++++++++++++++++
>>>> David Fitzgerald
>>>> Department of Earth Sciences
>>>> Millersville University
>>>> Millersville, PA 17551
>>>>
>>>> Phone: 717-871-2394
>>>>
>>> --
>>> Sergio Ballestrero - http://physics.uj.ac.za/psiwiki/Ballestrero
>>> University of Johannesburg, Physics Department ATLAS TDAQ sysadmin
>>> team - Office:75282 OnCall:164851
>>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.5 (GNU/Linux)
>>
>> iQEUAwUBULTwmfQM1KNWz8QaAQLU0Qf2JXa29RVDhJALq2TD72Nis4wAmxlqFIYP
>> rIo5sHBUI+o/bebsDit9qoC+hWuCK3+xDai9fzF2jUQqXfhRZiPHjdQRpCViMurY
>> Wp+aVZWCD1U3KusuWMSWlv6Xdx0QmaMQr8Nh8JRRWUi8cNEgAO2Th1txwdu3auJb
>> LssTFmwUjLUEC0mKhgx6520hisirfOHNTnF3rQCN5ilZGEYEZ2vMm/lcm5yI0Sqc
>> wdqWUXVYGNsBepFf4bRWaWPX0Hbf6sbLgoJNUHJOJ2pGpc3MUp3SiGsIIUGkZwPW
>> xT6kS523J+nItY/odmvdl+ibHRVa7TgDx0xhuqISarr39g00yvvx
>> =RQky
>> -----END PGP SIGNATURE-----
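
The reproduction idea from the top of this thread (a test script doing heavy I/O to the network folders, started on more and more machines) might look roughly like the sketch below. It is only an illustration under stated assumptions: the function names `timed_write` and `run_rounds`, the 64 MB default size and the round count are all made up here, and the directory you pass in would have to be one of the NFS-mounted folders on the client, which nobody in the thread has named.

```python
import os
import tempfile
import time


def timed_write(directory, total_mb=64, chunk_mb=1):
    """Write about total_mb of zeros into `directory` and return MB/s."""
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    chunks = max(1, total_mb // chunk_mb)
    # mkstemp gives every run (and every machine) its own file name.
    fd, path = tempfile.mkstemp(dir=directory, prefix="iotest-")
    try:
        start = time.monotonic()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(chunks):
                handle.write(chunk)
            handle.flush()
            # fsync so the timing includes pushing the data to the server,
            # not just filling the client's page cache.
            os.fsync(handle.fileno())
        elapsed = time.monotonic() - start
    finally:
        os.remove(path)
    return (chunks * chunk_mb) / max(elapsed, 1e-9)


def run_rounds(directory, rounds=5, total_mb=64):
    """Repeat the timed write and return the throughput of each round."""
    return [timed_write(directory, total_mb) for _ in range(rounds)]
```

Calling `run_rounds()` with an NFS-mounted directory on one client, then on several at once, and comparing the MB/s figures would show whether throughput collapses as clients are added. Watching `sar -u` on the server during the runs would tie that back to the %iowait numbers reported earlier.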