LISTSERV - SCIENTIFIC-LINUX-USERS Archives

February 2010

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:	Re: debugging a slow memory leak of Xorg
From:	Stephan Wiesand
Date:	Sun, 21 Feb 2010 13:56:41 +0100

Hi Sergio,

On Feb 20, 2010, at 13:57 , Sergio Ballestrero wrote:

> The desktop systems show a very slow (but not uniformly slow) memory leak in Xorg, growing up to 3GB, sometimes even 5GB, and finally bringing the systems to some kind of crash - usually just the GUI freezes, but sometimes OOM Killer gets badly in the way and the whole system is left in a bad state and needs to be rebooted.

You should be able to avoid the OOM Killer (and the other weird effects typical of true OOM situations) if you "echo 2 > /proc/sys/vm/overcommit_memory". See /usr/share/doc/kernel-doc-2.6.18/Documentation/vm/overcommit-accounting. This requires sufficient swap space though: in this mode the kernel refuses allocations once the total committed address space exceeds swap plus overcommit_ratio percent of RAM (50 by default), so "sufficient" means roughly "enough that this limit covers the sum of VSZ for all processes on the system". Which can be a lot. But the next time you actually run out of (virtual) memory, you should get a meaningful error message instead of a machine that needs to be rebooted.
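
Whether the current swap would be sufficient in that mode can be estimated from /proc/meminfo before flipping the switch. The following is a minimal Python sketch, not a tested tool: the function names are made up for illustration, and the formula (swap plus overcommit_ratio percent of RAM) is a simplification that ignores huge pages and the small reserve the kernel keeps for root.

```python
def parse_meminfo(text):
    """Turn the contents of /proc/meminfo into a dict of integer values (mostly kB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def commit_headroom_kb(meminfo, overcommit_ratio=50):
    """Approximate commit limit minus Committed_AS, in kB.

    Assumed formula for overcommit mode 2 (huge pages ignored):
        limit = SwapTotal + MemTotal * overcommit_ratio / 100
    A negative result suggests allocations would start failing right away.
    """
    limit = meminfo["SwapTotal"] + meminfo["MemTotal"] * overcommit_ratio // 100
    return limit - meminfo["Committed_AS"]


def read_live_headroom_kb():
    """Read the live values from /proc (Linux only) and return the headroom in kB."""
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    with open("/proc/sys/vm/overcommit_ratio") as f:
        ratio = int(f.read())
    return commit_headroom_kb(info, ratio)
```

Calling read_live_headroom_kb() on one of the affected desks while Xorg is already bloated should give an idea of how much swap would have to be added before mode 2 is safe to enable. The CommitLimit line that the kernel itself prints in /proc/meminfo can serve as a cross-check of the estimate.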

> Sometimes we can see the problem before it becomes critical and request the users to restart X11, but this is not a welcome procedure.

If it's a true leak (memory being allocated and later simply forgotten and never accessed again), simply adding more swap space may allow normal operation for periods long enough to not make this an actual problem. You could even do this on the fly without interruption if you have some LVM capacity left (or swapfiles still work). And then routinely reboot the PCs during machine interventions or on some other occasion when shift crews are bored anyway ;-)
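
For the swapfile variant, the steps are the usual dd / chmod / mkswap / swapon sequence. A small Python sketch that only builds the command list and runs it on request could look like this; the path /var/swapfile.extra and the function names are hypothetical, and on an LVM setup a new logical volume would take the place of the dd step.

```python
import subprocess


def swapfile_commands(path, size_mb):
    """Return the commands that create and enable a swap file of size_mb megabytes."""
    return [
        # Allocate the file with real blocks (swap files must not be sparse).
        ["dd", "if=/dev/zero", "of=" + path, "bs=1M", "count=%d" % size_mb],
        # Swap contents must not be readable by ordinary users.
        ["chmod", "600", path],
        # Write the swap signature, then enable the area without a reboot.
        ["mkswap", path],
        ["swapon", path],
    ]


def add_swapfile(path, size_mb, dry_run=True):
    """Print the commands, and execute them as well unless dry_run is set (needs root)."""
    for cmd in swapfile_commands(path, size_mb):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)
```

For example, add_swapfile("/var/swapfile.extra", 4096) just prints the four commands for a 4 GB file; passing dry_run=False would execute them. An /etc/fstab entry would still be needed for the swap to survive a reboot.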

> Simply closing the applications (either gently or by killing) does not let Xorg release the occupied memory. Even logging out (without restarting X11) does not free the memory allocated by Xorg. 
> 
> It takes anything between one week and more than 4 weeks for this to happen (depending on how heavily the specific desk is used and which applications are run on it), so it's very hard to correlate to a specific application or usage pattern, and we are not finding a way to reproduce it in a shorter time, to be able to debug it. 
> 
> xrestop only shows <20 entries with 10~20 MB pixmaps allocated, nowhere near justifying the 3GB or more. The memory map from /proc/<Xorg pid>/smaps does show a heap of >650MB (not very different from a freshly started Xorg) and many allocated memory blocks, some as large as 800MB, but these are unlabeled and I don't see a way to correlate them with something useful. As you can imagine, running Xorg under Valgrind on a production system is basically out of the question, and doing it on a test system without knowing what to try and test seems quite pointless.
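
The smaps output can at least be condensed so that snapshots taken days apart are comparable and show which of the unlabeled blocks are actually growing. A minimal Python sketch, assuming the usual smaps layout (a header line per mapping followed by "Name: value kB" lines); the function names are invented for this example:

```python
import re

# Header line of one mapping: "start-end perms offset dev inode [pathname]"
HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+\S+\s+\S+\s+\d+\s*(.*)$")


def parse_smaps(text):
    """Parse the contents of /proc/<pid>/smaps into a list of mapping dicts."""
    mappings = []
    current = None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            current = {
                "start": m.group(1),
                "perms": m.group(3),
                # Mappings without a pathname are the "unlabeled" anonymous blocks.
                "name": m.group(4).strip() or "[anon]",
                "kB": {},
            }
            mappings.append(current)
        elif current is not None and ":" in line:
            key, _, rest = line.partition(":")
            parts = rest.split()
            if parts and parts[0].isdigit():
                current["kB"][key.strip()] = int(parts[0])
    return mappings


def largest(mappings, field="Rss", n=10):
    """Return the n mappings with the largest value of the given field (in kB)."""
    return sorted(mappings, key=lambda m: m["kB"].get(field, 0), reverse=True)[:n]
```

Comparing Size against Rss per mapping distinguishes address space that is merely reserved from memory that has really been touched, and the start addresses make it possible to follow one block across snapshots. That will not say who allocated it, but it may narrow down when it grows.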

I have never used MEMWATCH ( http://www.linkdata.se/sourcecode/memwatch ) myself, but it's high on my list of things to try should I ever get into a situation like the one you're in. This will require tinkering with the Xorg source and rebuilding the packages though.

> The systems are dual quad-core Xeon systems, with 8 to 12GB RAM, 4GB swap, dual nVidia cards (NVS285 or FX370), quad screens, from 4 to 12 virtual desktops, and they now run KDE 3.5.10 (from kde-redhat.sourceforge.net) on SLC 5.4, x86_64, kernel 2.6.18-164.11.1, with nVidia drivers packaged by CERN IT (kernel-module-nvidia-2.6.18-164.11.1.el5-185.18.36-1.slc5.x86_64) . We had been seeing the same behavior with SLC 5.3 and standard KDE 3.5.6. The most used applications are Konqueror, PVSS (detector control system), plus a variety of CERN or ATLAS specific applications, mostly Java or Python. The issue appears also on desks where no 3D/OpenGL app is used.

You could try to run some stations without KDE, or with a much older nvidia driver, just to narrow down the problem space. Sadly, using the vesa or nv drivers is not an option with quad screens. Another option could be to separate the applications and the X servers they render to from the actual display devices, by running the apps in a VNC session on some server and the VNC client on the desks (this may be awkward to do with your quad screen setup though).

> While this must be, at the bottom, a bug in Xorg, we could already be happy with identifying one or more specific applications which trigger this, and try to add workarounds / mitigations in the applications, if the Xorg bug can't be pinned down or is untreatable.
> 
> Any help or suggestion of tools or procedures that may help us debug this issue would be most welcome.

Good luck. Let us know how it goes.

Cheers,
	Stephan

-- 
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany
