SCIENTIFIC-LINUX-USERS Archives

December 2012

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

From: David Fitzgerald <[log in to unmask]>
Reply To: David Fitzgerald <[log in to unmask]>
Date: Fri, 7 Dec 2012 11:18:52 -0500

I think I have narrowed the problem down to either VMware or my network.  When I run a script that puts a heavy load on the NFS-mounted home directory, the workstation freezes up.  However, I then created a virtual workstation on the VMware server, attached to the same virtual switch as my virtual NFS server.  Running the same script there, the workstation does not slow down and the files are written just fine: the client does not freeze, and the server handles everything without any problem.  Time to get networking involved, along with someone more familiar with VMware than I am.



-----Original Message-----
From: Joseph Areeda [mailto:[log in to unmask]]
Sent: Saturday, December 01, 2012 9:19 AM
To: David Fitzgerald
Cc: [log in to unmask]
Subject: Re: clients slow down due to unknown process

Hi David,

I am certainly no expert but this looks to me like the classic NFS symptoms when the server gets overloaded, or a disk or the network gets flaky.

If it were me, I'd try to get the class to do more local i/o (if possible).  Perhaps a scratch area on the local disk would solve the problem.

I think you could reproduce the problem by writing a test script that does heavy I/O to the network folders, then running it on more and more machines while watching the I/O throughput approach zero as the machines hang waiting for NFS.
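
No test script was posted in this thread, so the following is only a minimal sketch of the kind of load generator described above. The function name write_load, the LOAD_DIR environment variable, and the file counts and sizes are all assumptions chosen for illustration. It writes a batch of files, forces each to storage with fsync, and reports the throughput achieved.

```python
import os
import tempfile
import time


def write_load(target_dir, n_files=20, file_size=1 << 20, chunk=64 * 1024):
    """Write n_files files of file_size bytes into target_dir, fsync each,
    delete them again, and return the achieved throughput in MiB/s."""
    payload = os.urandom(chunk)
    start = time.monotonic()
    written = 0
    for i in range(n_files):
        # Hypothetical file naming; the PID keeps concurrent runs from colliding.
        path = os.path.join(target_dir, f"nfs_load_{os.getpid()}_{i}.dat")
        with open(path, "wb") as fh:
            remaining = file_size
            while remaining > 0:
                n = fh.write(payload[:min(chunk, remaining)])
                remaining -= n
                written += n
            fh.flush()
            os.fsync(fh.fileno())  # push the data out of the local page cache
        os.remove(path)
    elapsed = max(time.monotonic() - start, 1e-9)
    return written / (1024 * 1024) / elapsed


if __name__ == "__main__":
    # Assumption: LOAD_DIR points at a directory on the NFS mount.
    # Without it, the script falls back to the local temp directory.
    target = os.environ.get("LOAD_DIR", tempfile.gettempdir())
    print(f"{write_load(target):.1f} MiB/s to {target}")
```

Started on one machine and then on several at once, the reported MiB/s should give a rough picture of where throughput collapses. It requires Python 3.6 or later, which is newer than the system Python on SL 6.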

Again, I'm no expert, so feel free to ignore me.

Joe

On 11/29/2012 10:49 AM, David Fitzgerald wrote:
> Last night during class time I had a chance to check some of the machines with the frozen displays, and I am not sure what to make of what I found.  Running 'lsof -p $PID' (with PID being 5044) on one of the affected machines gave this, which doesn't tell me much:
>
> 10.10.10 5044 root  cwd       DIR    8,7     4096    2 /
> 10.10.10 5044 root  rtd       DIR    8,7     4096    2 /
> 10.10.10 5044 root  txt   unknown                      /proc/5044/exe
>
>
> I also ran pstree and I will put that output below, but I think I may be barking up the wrong tree.  While some of my clients were freezing up, I saw that my NFS server was getting very high 'top' loads.  Fortunately I  have sysstat running on the server and after class 'sar -u' showed that %iowait went from less than 1 before class to a high of 53 after class began, and stayed high until class ended.  Here is the relevant 'chunk' of the sar -u  output:
>
>                 CPU     %user     %nice   %system   %iowait    %steal     %idle
> 05:20:01 PM     all      0.03      0.00      0.07      0.17      0.00     99.73
> 05:30:01 PM     all      0.03      0.00      0.03      0.11      0.00     99.83
> 05:40:01 PM     all      0.18      0.00      0.50      1.88      0.00     97.44
> 05:50:01 PM     all      0.16      0.00      1.12      6.93      0.00     91.78
> 06:00:01 PM     all      0.73      0.00      5.23     32.61      0.00     61.43
> 06:10:01 PM     all      0.77      0.00      6.55     53.67      0.00     39.01
> 06:20:01 PM     all      0.13      0.00      4.81     27.81      0.00     67.25
> 06:30:01 PM     all      0.13      0.00      6.69     21.71      0.00     71.47
> 06:40:01 PM     all      0.11      0.00      3.47     33.34      0.00     63.08
> 06:50:01 PM     all      0.11      0.00      3.20     31.02      0.00     65.67
> 07:00:01 PM     all      0.24      0.00      3.93     30.79      0.00     65.05
> 07:10:01 PM     all      0.16      0.00      3.63     20.51      0.00     75.71
> 07:20:01 PM     all      0.18      0.00      5.23      1.45      0.00     93.13
> 07:30:01 PM     all      0.10      0.00      5.72      0.70      0.00     93.48
> Average:        all      0.06      0.01      0.46      2.13      0.00     97.34
>
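
For anyone who wants to pull the worst interval out of a longer 'sar -u' report, here is a small sketch. It assumes the usual sysstat column order for 'sar -u' with a 12-hour clock (time, AM/PM, CPU, %user, %nice, %system, %iowait, %steal, %idle). The function name peak_iowait is made up for this example, and the sample rows are copied from the output above.

```python
def peak_iowait(sar_lines):
    """Return (timestamp, %iowait) for the interval with the highest %iowait.

    Assumes `sar -u` rows of the form:
    HH:MM:SS AM|PM  all  %user  %nice  %system  %iowait  %steal  %idle
    """
    best = None
    for line in sar_lines:
        # Tolerate mail-style "> " quoting in front of each row.
        fields = line.lstrip("> ").split()
        # Skip header rows, blank lines and the trailing "Average:" row.
        if len(fields) != 9 or fields[2] != "all":
            continue
        stamp = f"{fields[0]} {fields[1]}"
        iowait = float(fields[6])
        if best is None or iowait > best[1]:
            best = (stamp, iowait)
    return best


sample = [
    "05:50:01 PM     all      0.16      0.00      1.12      6.93      0.00     91.78",
    "06:00:01 PM     all      0.73      0.00      5.23     32.61      0.00     61.43",
    "06:10:01 PM     all      0.77      0.00      6.55     53.67      0.00     39.01",
    "Average:        all      0.06      0.01      0.46      2.13      0.00     97.34",
]
print(peak_iowait(sample))  # -> ('06:10:01 PM', 53.67)
```

A report produced with a 24-hour clock has one field fewer per row, so the field indices would need adjusting.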
>
>   The NFS server is a virtual machine running on ESXi 4.1, and VMware Tools is installed.  Could this be slow disk access, and thus a VMware misconfiguration?  I hate to admit it, but I am at a loss.
>
> I can run other sar reports on yesterday's (Wednesday's) data if anyone thinks there may be something in there to help.
>
> For what it's worth, here is the output from pstree from one of the affected clients, and I do NOT see the PID that I was looking for:
>
> init(1)-+-NetworkManager(1782)-+-dhclient(1808)
>          |                      `-{NetworkManager}(1809)
>          |-abrtd(2341)
>          |-acpid(2039)
>          |-anacron(3615)
>          |-atd(2413)
>          |-atieventsd(2421)---authatieventsd.(4134)
>          |-auditd(1547)-+-audispd(1549)-+-sedispatch(1550)
>          |              |               `-{audispd}(1551)
>          |              `-{auditd}(1548)
>          |-automount(2134)-+-{automount}(2135)
>          |                 |-{automount}(2136)
>          |                 |-{automount}(2139)
>          |                 |-{automount}(2142)
>          |                 |-{automount}(2143)
>          |                 `-{automount}(2144)
>          |-avahi-daemon(1794)---avahi-daemon(1795)
>          |-bonobo-activati(4549)---{bonobo-activat}(4550)
>          |-cachefilesd(1597)
>          |-certmonger(2435)
>          |-clock-applet(4644)
>          |-console-kit-dae(2521)-+-{console-kit-da}(2522)
>          |                       |-{console-kit-da}(2523)
>          |                       |-{console-kit-da}(2524)
>          |                       |-{console-kit-da}(2525)
>          |                       |-{console-kit-da}(2526)
>          |                       |-{console-kit-da}(2527)
>          |                       |-{console-kit-da}(2528)
>          |                       |-{console-kit-da}(2529)
>          |                       |-{console-kit-da}(2530)
>          |                       |-{console-kit-da}(2531)
>          |                       |-{console-kit-da}(2532)
>          |                       |-{console-kit-da}(2533)
>          |                       |-{console-kit-da}(2534)
>          |                       |-{console-kit-da}(2535)
>          |                       |-{console-kit-da}(2536)
>          |                       |-{console-kit-da}(2537)
>          |                       |-{console-kit-da}(2538)
>          |                       |-{console-kit-da}(2539)
>          |                       |-{console-kit-da}(2540)
>          |                       |-{console-kit-da}(2541)
>          |                       |-{console-kit-da}(2542)
>          |                       |-{console-kit-da}(2543)
>          |                       |-{console-kit-da}(2544)
>          |                       |-{console-kit-da}(2545)
>          |                       |-{console-kit-da}(2546)
>          |                       |-{console-kit-da}(2547)
>          |                       |-{console-kit-da}(2548)
>          |                       |-{console-kit-da}(2549)
>          |                       |-{console-kit-da}(2550)
>          |                       |-{console-kit-da}(2551)
>          |                       |-{console-kit-da}(2552)
>          |                       |-{console-kit-da}(2553)
>          |                       |-{console-kit-da}(2554)
>          |                       |-{console-kit-da}(2555)
>          |                       |-{console-kit-da}(2556)
>          |                       |-{console-kit-da}(2557)
>          |                       |-{console-kit-da}(2558)
>          |                       |-{console-kit-da}(2559)
>          |                       |-{console-kit-da}(2560)
>          |                       |-{console-kit-da}(2561)
>          |                       |-{console-kit-da}(2562)
>          |                       |-{console-kit-da}(2563)
>          |                       |-{console-kit-da}(2564)
>          |                       |-{console-kit-da}(2565)
>          |                       |-{console-kit-da}(2566)
>          |                       |-{console-kit-da}(2567)
>          |                       |-{console-kit-da}(2568)
>          |                       |-{console-kit-da}(2569)
>          |                       |-{console-kit-da}(2570)
>          |                       |-{console-kit-da}(2571)
>          |                       |-{console-kit-da}(2572)
>          |                       |-{console-kit-da}(2573)
>          |                       |-{console-kit-da}(2574)
>          |                       |-{console-kit-da}(2575)
>          |                       |-{console-kit-da}(2576)
>          |                       |-{console-kit-da}(2577)
>          |                       |-{console-kit-da}(2578)
>          |                       |-{console-kit-da}(2579)
>          |                       |-{console-kit-da}(2580)
>          |                       |-{console-kit-da}(2581)
>          |                       |-{console-kit-da}(2582)
>          |                       |-{console-kit-da}(2583)
>          |                       `-{console-kit-da}(2585)
>          |-crond(2402)
>          |-cupsd(1955)
>          |-dbus-daemon(1772)
>          |-dbus-daemon(2883)
>          |-dbus-launch(2591)
>          |-dbus-launch(2882)
>          |-devkit-power-da(2602)
>          |-fcoemon(1760)
>          |-firefox(4968)
>          |-gconf-im-settin(4534)
>          |-gconfd-2(3175)
>          |-gdm-binary(2449)---gdm-simple-slav(2490)-+-Xorg(2492)
>          |                                          `-gdm-session-wor(2671)---tcsh(2849)---gnome-session(4148)-+-bluetooth-apple(436+
>          |                                                                                                     |-gdu-notificatio(432+
>          |                                                                                                     |-gnome-panel(4253)
>          |                                                                                                     |-gnome-power-man(434+
>          |                                                                                                     |-gnome-volume-co(432+
>          |                                                                                                     |-gpk-update-icon(430+
>          |                                                                                                     |-krb5-auth-dialo(435+
>          |                                                                                                     |-metacity(4244)
>          |                                                                                                     |-nautilus(4276)
>          |                                                                                                     |-nm-applet(4342)
>          |                                                                                                     |-polkit-gnome-au(432+
>          |                                                                                                     |-python(4294)
>          |                                                                                                     `-{gnome-session}(422+
>          |-gdm-user-switch(4640)
>          |-gedit(4779)-+-{gedit}(4894)
>          |             |-{gedit}(5037)
>          |             |-{gedit}(5038)
>          |             `-{gedit}(5039)
>          |-gnome-keyring-d(2831)-+-{gnome-keyring-}(2832)
>          |                       `-{gnome-keyring-}(4237)
>          |-gnome-screensav(4665)
>          |-gnome-settings-(4235)---{gnome-settings}(4248)
>          |-gnote(4635)
>          |-gvfs-afc-volume(4573)---{gvfs-afc-volum}(4574)
>          |-gvfs-gdu-volume(4569)
>          |-gvfs-gphoto2-vo(4571)
>          |-gvfsd(3168)
>          |-gvfsd-burn(4754)
>          |-gvfsd-metadata(4794)
>          |-gvfsd-trash(4656)
>          |-hald(2048)---hald-runner(2049)-+-hald-addon-acpi(2096)
>          |                                |-hald-addon-inpu(2088)
>          |                                `-hald-addon-stor(2097)
>          |-im-settings-dae(4371)
>          |-lldpad(1734)
>          |-master(2332)-+-pickup(2347)
>          |              `-qmgr(2348)
>          |-mingetty(2454)
>          |-mingetty(2456)
>          |-mingetty(2458)
>          |-mingetty(2460)
>          |-mingetty(2462)
>          |-modem-manager(1789)
>          |-notification-ar(4642)
>          |-ntpd(2249)
>          |-pcscd(2114)---{pcscd}(2129)
>          |-polkitd(2647)
>          |-pulseaudio(4331)-+-gconf-helper(4563)
>          |                  |-{pulseaudio}(4535)
>          |                  `-{pulseaudio}(4539)
>          |-qpidd(2356)-+-{qpidd}(2357)
>          |             |-{qpidd}(2358)
>          |             `-{qpidd}(2359)
>          |-rpc.idmapd(1864)
>          |-rpc.mountd(2190)
>          |-rpc.rquotad(2175)
>          |-rpc.statd(1818)
>          |-rpcbind(1648)
>          |-rsyslogd(1574)-+-{rsyslogd}(1575)
>          |                |-{rsyslogd}(1576)
>          |                `-{rsyslogd}(1578)
>          |-rtkit-daemon(2661)-+-{rtkit-daemon}(2662)
>          |                    `-{rtkit-daemon}(2663)
>          |-seahorse-agent(3155)
>          |-seahorse-daemon(4243)
>          |-sshd(2233)---sshd(5003)---bash(5005)---pstree(5057)
>          |-sssd(2216)-+-sssd_be(2281)
>          |            |-sssd_nss(2286)
>          |            `-sssd_pam(2287)
>          |-stap-serverd(1927)---{stap-serverd}(1932)
>          |-udevd(542)-+-udevd(1166)
>          |            `-udevd(1745)
>          |-udisks-daemon(4373)---udisks-daemon(4374)
>          |-wpa_supplicant(1813)
>          `-xinetd(2241)
>
>
>
>
> ________________________________________
> From: Christopher Tooley [[log in to unmask]]
> Sent: Wednesday, November 28, 2012 1:00 PM
> To: David Fitzgerald
> Cc: [log in to unmask]
> Subject: Re: clients slow down due to unknown process
>
> If/when you find out what it is, would you kindly report back to the
> list what you find? This has got me really curious now. :D
>
> -Chris
>
> On 2012-11-28, at 5:51 AM, David Fitzgerald<[log in to unmask]>  wrote:
>
>> Thank you everyone for all the good ideas.  I have class this evening and will be able to use your suggestions.  I'll let you know what I find.
>>
>> Dave
>>
>> -----Original Message-----
>> From: Robert Blair [mailto:[log in to unmask]]
>> Sent: Tuesday, November 27, 2012 11:56 AM
>> To: Sergio Ballestrero
>> Cc: David Fitzgerald; [log in to unmask]
>> Subject: Re: clients slow down due to unknown process
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> "/usr/sbin/lsof -p $PID" will also list all of the resources it uses which is often a big help in figuring out wtf it is all about.
>>
>> On 11/27/2012 10:52 AM, Sergio Ballestrero wrote:
>>> Hello David,
>>> I'm not familiar with freeIPA, but anyway you can start by better
>>> identifying the process.
>>> In top, get the PID and look under /proc/$PID - in particular  exe
>>> will be a link to the binary, like lrwxrwxrwx 1 root root 0 Nov 27
>>> 01:41 /proc/1/exe ->  /sbin/init
>>>
>>> pstree -p -H $PID
>>> will help you identify the parent process, if there's one.
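
The /proc lookup suggested here can also be scripted. This is a rough sketch only: the function name describe_pid is invented for this example. It reads the process name from /proc/$PID/status and tries to resolve the exe link. When the link cannot be read, one common reason is that the PID belongs to a kernel thread, which has no executable on disk. A permissions problem produces the same error, so treat that reading as an assumption to verify.

```python
import os


def describe_pid(pid):
    """Return (name, exe_path) for a PID; exe_path is None if unreadable."""
    name = None
    with open(f"/proc/{pid}/status") as fh:
        for line in fh:
            if line.startswith("Name:"):
                name = line.split(":", 1)[1].strip()
                break
    try:
        exe = os.readlink(f"/proc/{pid}/exe")
    except OSError:
        # Kernel threads have no backing executable, so the link cannot be
        # resolved. Lack of permission raises the same kind of error.
        exe = None
    return name, exe


if __name__ == "__main__":
    # Inspect this script's own process as a harmless demonstration.
    print(describe_pid(os.getpid()))
```

Run as root with the PID shown in top, a result with exe set to None would match the "txt unknown" line that lsof printed for /proc/5044/exe elsewhere in this thread.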
>>>
>>> Cheers,
>>>   Sergio
>>>
>>> On 27 Nov 2012, at 16:21, David Fitzgerald wrote:
>>>
>>>> Hello,
>>>>
>>>> Sorry for the length of this post, but I want to make sure I give
>>>> all the information needed for someone to help.
>>>>
>>>> I have a lab of 25 workstations running Scientific Linux 6.2.  User
>>>> accounts are authenticated via freeIPA, and auto mounted to an NFS
>>>> server and the users use Gnome 2.8.  The NFS and freeIPA servers
>>>> are located on the same server (IP 10.10.10.10) which is also
>>>> running Scientific Linux 6.2 and is a virtual guest in VMware ESXI 4.1.
>>>>
>>>> During class when the workstations are most heavily in use, the
>>>> students are writing Fortran programs with gedit and usually have
>>>> firefox up as well.  Here is my predicament.  During class some of
>>>> the workstation screens will freeze with no mouse or keyboard input.
>>>> This can last for varying lengths of time, sometimes a few minutes,
>>>> some other times for the full length of the class.  I can ssh  in
>>>> to the frozen machines and top will show load averages of up to 4 or more.
>>>> The process taking up the most CPU is one I don't recognize named
>>>> 10.10.10.10-ma.  The 10.10.10.10 being the IP address of my server.
>>>> I have no idea what that process is related to, whether it's
>>>> freeIPA, NFS, Gnome or something else.  Killing the process doesn't help as it
>>>> simply restarts with a new PID.   Note that the freezing does NOT
>>>> happen when only a few people are using the lab, so reproducing the
>>>> problem outside of class time is difficult.
>>>>
>>>> Can anyone help me track down this problem and fix it?
>>>>
>>>> I appreciate any help you can give.
>>>>
>>>> Thanks!
>>>>
>>>> Dave
>>>>
>>>>
>>>> +++++++++++++++++++++++
>>>> David Fitzgerald
>>>> Department of Earth Sciences
>>>> Millersville University
>>>> Millersville, PA 17551
>>>>
>>>> Phone: 717-871-2394
>>>>
>>> --
>>> Sergio Ballestrero  - http://physics.uj.ac.za/psiwiki/Ballestrero
>>> University of Johannesburg, Physics Department  ATLAS TDAQ sysadmin
>>> team - Office:75282 OnCall:164851
>>>
>>>
>>>
>>>
>>>
>>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.5 (GNU/Linux)
>>
>> iQEUAwUBULTwmfQM1KNWz8QaAQLU0Qf2JXa29RVDhJALq2TD72Nis4wAmxlqFIYP
>> rIo5sHBUI+o/bebsDit9qoC+hWuCK3+xDai9fzF2jUQqXfhRZiPHjdQRpCViMurY
>> Wp+aVZWCD1U3KusuWMSWlv6Xdx0QmaMQr8Nh8JRRWUi8cNEgAO2Th1txwdu3auJb
>> LssTFmwUjLUEC0mKhgx6520hisirfOHNTnF3rQCN5ilZGEYEZ2vMm/lcm5yI0Sqc
>> wdqWUXVYGNsBepFf4bRWaWPX0Hbf6sbLgoJNUHJOJ2pGpc3MUp3SiGsIIUGkZwPW
>> xT6kS523J+nItY/odmvdl+ibHRVa7TgDx0xhuqISarr39g00yvvx
>> =RQky
>> -----END PGP SIGNATURE-----
