SCIENTIFIC-LINUX-USERS Archives

February 2010

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Sergio Ballestrero <[log in to unmask]>
Reply To:
Sergio Ballestrero <[log in to unmask]>
Date:
Sat, 20 Feb 2010 13:57:03 +0100
 Dear Linuxers,
we are using SL CERN 5.4 for the ATLAS Control Room at CERN, and we are experiencing a problem with the Xorg server that is proving very hard to track down. I'm hoping someone in the SL community will have the patience to read all this and offer some suggestions.

 The desktop systems show a very slow (but not uniformly slow) memory leak in Xorg, which grows to 3GB, sometimes even 5GB, and eventually brings the systems to some kind of crash. Usually just the GUI freezes, but sometimes the OOM killer gets badly in the way, and the whole system is left in a bad state and needs to be rebooted. Sometimes we can spot the problem before it becomes critical and ask the users to restart X11, but this is not a welcome procedure.
 Simply closing the applications (either gently or by killing) does not let Xorg release the occupied memory. Even logging out (without restarting X11) does not free the memory allocated by Xorg. 

 It takes anywhere between one week and more than 4 weeks for this to happen (depending on how heavily the specific desk is used and which applications are run on it), so it's very hard to correlate with a specific application or usage pattern, and we have not found a way to reproduce it in a shorter time in order to be able to debug it.
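 A minimal sketch of how such slow growth could be logged over time, so that jumps in Xorg's memory use can later be matched against what was running on the desk. It only reads the standard VmSize and VmRSS fields from /proc/<pid>/status; the function names, the CSV layout and the 5-minute interval are assumptions made for illustration, not something taken from the setup described here.

```python
import re
import sys
import time

# Matches lines such as "VmRSS:    2097152 kB" in /proc/<pid>/status.
VM_FIELD = re.compile(r"^(Vm\w+):\s+(\d+)\s+kB", re.MULTILINE)


def parse_status(text):
    """Return the Vm* fields of a /proc/<pid>/status dump as {name: kB}."""
    return {name: int(kb) for name, kb in VM_FIELD.findall(text)}


def format_sample(timestamp, fields):
    """Build one CSV line: local timestamp, VmSize (kB), VmRSS (kB)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    return "%s,%d,%d" % (stamp, fields.get("VmSize", 0), fields.get("VmRSS", 0))


def monitor(pid, interval=300):
    """Print one sample every `interval` seconds until interrupted."""
    path = "/proc/%s/status" % pid
    while True:
        with open(path) as fh:
            fields = parse_status(fh.read())
        print(format_sample(time.time(), fields))
        sys.stdout.flush()
        time.sleep(interval)


# Usage (hypothetical file name): python xorg_memlog.py <Xorg pid> >> xorg_mem.csv
if __name__ == "__main__" and len(sys.argv) > 1:
    monitor(sys.argv[1])
```

 Logging the list of running client processes next to each sample would be a natural extension, since the aim is to see which application activity coincides with the steps in memory use.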

 xrestop only shows <20 entries with 10~20 MB pixmaps allocated, nowhere near enough to account for the 3GB or more. The memory map from /proc/<Xorg pid>/smaps does show a heap of >650MB (not very different from a freshly started Xorg) and many allocated memory blocks, some as large as 800MB, but these are unlabeled and I don't see a way to correlate them with anything useful. As you can imagine, running Xorg under Valgrind on a production system is basically out of the question, and doing it on a test system without knowing what to test seems quite pointless.
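 For reference, a short sketch of how the smaps output could be summarised so that the largest unlabeled blocks can be compared between snapshots. It assumes the usual smaps layout (a header line with the address range, permissions and optional path, followed by "Name: value kB" lines); the function names and the "[anon]" label for mappings without a path are choices made here for illustration.

```python
import re
import sys

# Header line: "start-end perms offset dev inode [path]".
HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+\S+\s+\S+\s+\d+\s*(.*)$")
# Detail line: "Rss:   1234 kB".
FIELD = re.compile(r"^(\w+):\s+(\d+)\s+kB")


def parse_smaps(text):
    """Parse smaps text into a list of dicts, one per mapping (sizes in kB)."""
    mappings = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = {
                "start": header.group(1),
                "end": header.group(2),
                "perms": header.group(3),
                # Mappings without a path are the "unlabeled" blocks.
                "name": header.group(4).strip() or "[anon]",
                "Size": 0,
                "Rss": 0,
            }
            mappings.append(current)
            continue
        field = FIELD.match(line)
        if field and current is not None:
            current[field.group(1)] = int(field.group(2))
    return mappings


def largest(mappings, key="Rss", n=10):
    """Return the n mappings with the highest value for `key`."""
    return sorted(mappings, key=lambda m: m.get(key, 0), reverse=True)[:n]


def anon_total(mappings, key="Rss"):
    """Sum `key` over all mappings that have no path."""
    return sum(m.get(key, 0) for m in mappings if m["name"] == "[anon]")


# Usage (hypothetical file name): python smaps_summary.py <Xorg pid>
if __name__ == "__main__" and len(sys.argv) > 1:
    with open("/proc/%s/smaps" % sys.argv[1]) as fh:
        maps = parse_smaps(fh.read())
    print("anonymous Rss total: %d kB" % anon_total(maps))
    for m in largest(maps):
        print("%10d kB  %s-%s  %s" % (m["Rss"], m["start"], m["end"], m["name"]))
```

 Comparing the address ranges of the largest anonymous mappings across snapshots taken days apart would at least show whether the growth comes from a few blocks growing or from many new ones appearing.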

 The systems are dual quad-core Xeon machines with 8 to 12GB RAM, 4GB swap, dual nVidia cards (NVS285 or FX370), quad screens, and 4 to 12 virtual desktops. They now run KDE 3.5.10 (from kde-redhat.sourceforge.net) on SLC 5.4, x86_64, kernel 2.6.18-164.11.1, with nVidia drivers packaged by CERN IT (kernel-module-nvidia-2.6.18-164.11.1.el5-185.18.36-1.slc5.x86_64). We had been seeing the same behavior with SLC 5.3 and standard KDE 3.5.6. The most used applications are Konqueror, PVSS (detector control system), plus a variety of CERN- or ATLAS-specific applications, mostly Java or Python. The issue also appears on desks where no 3D/OpenGL application is used.

 While this must be, at bottom, a bug in Xorg, we would already be happy to identify one or more specific applications that trigger it, and to add workarounds / mitigations in those applications if the Xorg bug can't be pinned down or can't be fixed.

 Any help or suggestion of tools or procedures that may help us debug this issue would be most welcome.

 Thanks, and cheers,
   Sergio

-- 
 Sergio Ballestrero  - http://physics.uj.ac.za/psiwiki/Ballestrero
 University of Johannesburg, Physics Department
 ATLAS TDAQ sysadmin group 
