[brandz-discuss] Folding at Home, part 2
Nils Nieuwejaar
nils.nieuwejaar at sun.com
Wed Nov 1 13:10:02 PST 2006
The first thing I would do is to run "pstack <pid>" on the hanging process.
You don't have to restart anything, and it won't generate nearly as much
output. It's less likely to give us an answer, but it's a lower-profile
way to start the investigation.
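
For anyone who wants to script that first step, here is a minimal Python
sketch. It assumes pstack is on the PATH and that you already know the pid
of the hung process. The helper names (pstack_command, capture_stack) are
made up for illustration and are not part of any BrandZ tooling.

```python
import shutil
import subprocess


def pstack_command(pid):
    """Build the argv for `pstack <pid>`; rejects non-positive pids."""
    pid = int(pid)
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return ["pstack", str(pid)]


def capture_stack(pid, timeout=30):
    """Run pstack against a live process and return its output as text.

    Returns None when pstack cannot be found on the PATH (assumption:
    that is where it is installed on the machine in question).
    """
    if shutil.which("pstack") is None:
        return None
    result = subprocess.run(
        pstack_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Calling capture_stack(pid) a few times, some seconds apart, and comparing
the results should show whether the process is stuck in the same place.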
Nils
On Wed 11/01/06 at 11:58 AM, edward.pilatowicz at sun.com wrote:
> there are a few debugging methods you could try.
>
> strace is a good start.
>
> another idea is to set the following environment variables before
> running the application:
> LX_DEBUG=1
> LX_DEBUG_FILE=/tmp/<some_file_name>
>
> this will cause our emulation library to log large amounts of debugging
> information to the file specified, which could help us determine where
> things are going wrong.
>
> lastly, you could also try setting LX_STRICT=1 in the environment before
> running the application. this will cause our emulation library to kill
> the application (and hopefully generate a core dump) if it tries
> to use invalid or currently unsupported functionality.
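
A minimal Python sketch of the settings Ed describes, for launching the
application with the debug variables already in place. The variable names
(LX_DEBUG, LX_DEBUG_FILE, LX_STRICT) come from his message; the helper
names, the log path, and the command in the usage note are placeholders.

```python
import os
import subprocess


def lx_debug_env(log_file, strict=False, base_env=None):
    """Return a copy of the environment with the debug variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["LX_DEBUG"] = "1"
    env["LX_DEBUG_FILE"] = log_file
    if strict:
        # Per the message above: kill the application on unsupported functionality.
        env["LX_STRICT"] = "1"
    return env


def run_with_lx_debug(argv, log_file, strict=False):
    """Start the application with the debug variables and wait for it to exit."""
    return subprocess.run(argv, env=lx_debug_env(log_file, strict)).returncode
```

For example, run_with_lx_debug(["./client"], "/tmp/lx_debug.log", strict=True),
where ./client stands in for however you normally start the application.
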
>
> ed
>
> On Wed, Nov 01, 2006 at 10:43:25AM -0800, Scott L. Burson wrote:
> > Hi,
> >
> > So, the machine has been up running Folding at Home for a couple of days
> > now, which is not definitive but certainly is encouraging. Hopefully
> > the bad DIMM will turn out to have been the problem all along.
> >
> > But, there's another problem. The Folding at Home processes hang for
> > some reason at the point where they're supposed to send their results
> > back to the server. This may be a bit of a pain to debug since
> > Stanford doesn't give out the source (they're worried, reasonably
> > enough I suppose, about people circulating hacked versions as Trojan
> > horses). I don't think it's a network connectivity problem, as
> > there's no problem downloading new work units. What do you think I
> > should try next? Running one of these processes under `strace'?
> >
> > -- Scott
> >
> > On Oct 30, 2006, at 9:06, Scott L. Burson wrote:
> >
> > >I don't think it's thermal. I was previously running Linux on the
> > >machine, and according to its `lm_sensors' package, CPU core temps
> > >are reliably under 60 C -- well below the operating limit of 68 --
> > >and yes, that's with four copies of Folding at Home running
> > >continuously. (When I built the machine, I put in a custom water-
> > >cooling system.) (Is there some way to read Opteron core temps
> > >under Solaris? I wouldn't mind keeping an eye on this.)
> > >
> > >What it could be, though, as I've now discovered, is a marginal
> > >DIMM. When I installed Nevada build 49, I noticed this process
> > >`fmd' using up a substantial amount of CPU time (10% of the whole
> > >machine). I eventually looked into it, and discovered that there's
> > >now this thing called the Fault Manager that was telling me that
> > >one of my DIMMs was getting too many single-bit errors, and I
> > >should replace it.
> > >
> > >The machine was admittedly not perfectly stable under Linux. It
> > >would crash occasionally for no obvious reason; but its uptime was
> > >normally measured in weeks or months, not days. Running
> > >Folding at Home under BrandZ, it crashed twice, each time in about two
> > >days. At first I thought that this much greater instability must
> > >have a different cause; but eventually it occurred to me that
> > >Solaris undoubtedly lays out its hardware memory space differently
> > >from Linux, and it could just be making much heavier use of that DIMM.
> > >
> > >Anyway, I've now pulled the DIMM, and I'm starting Folding at Home
> > >again. Let's see what happens this time.
> > >
> > >-- Scott
> > >
> > >On Oct 30, 2006, at 3:03, William Kucharski wrote:
> > >
> > >>Have you had any other issues with this machine?
> > >>
> > >>Given the amount of computation Folding at Home does, I'm wondering
> > >>if it isn't an actual hardware issue (for example, the machine
> > >>overheating).
> > >>
> > >>Is there any type of hardware monitoring on this machine, either
> > >>in BIOS or elsewhere you might be able to check when this occurs?
> > >>
> > >> William Kucharski
> > >> william.kucharski at sun.com
> > >
> > >_______________________________________________
> > >brandz-discuss mailing list
> > >brandz-discuss at opensolaris.org
> >