[brandz-discuss] Folding at Home, part 2
Edward Pilatowicz
edward.pilatowicz at sun.com
Wed Nov 1 11:58:38 PST 2006
there are a few debugging methods you could try.
strace is a good start.
another idea is to set the following environment variables before
running the application:
LX_DEBUG=1
LX_DEBUG_FILE=/tmp/<some_file_name>
this will cause our emulation library to log a large amount of debugging
information to the specified file, which could help us determine where
things are going wrong.
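
A minimal sketch of setting these variables for a single run, assuming a
POSIX shell inside the zone. The function name run_with_lx_debug, the log
path, and ./fah_client are hypothetical placeholders, not names from the
original message:

```shell
# Run one command with the lx emulation library's debug logging enabled.
# Usage: run_with_lx_debug <log_file> <command> [args...]
run_with_lx_debug() {
    log_file="$1"
    shift
    # env(1) sets the variables only for this one command, so the
    # rest of the shell session is left untouched.
    env LX_DEBUG=1 LX_DEBUG_FILE="$log_file" "$@"
}

# Example (hypothetical binary name and log path):
# run_with_lx_debug /tmp/fah_debug.log ./fah_client
```

Scoping the variables to one command keeps other programs started from the
same shell from writing to the log as well.
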
lastly, you could also try setting LX_STRICT=1 in the environment before
running the application. this will cause our emulation library to kill
the application (and hopefully generate a core dump) if it tries
to use functionality that is invalid or not yet supported.
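
A similar sketch for the strict mode, again assuming a POSIX shell. Raising
the core file size limit with ulimit is an assumption added here so that a
core dump has a chance of being written; run_lx_strict and ./fah_client are
hypothetical names:

```shell
# Run one command with LX_STRICT=1 so that unsupported functionality
# kills the process instead of being silently tolerated.
# Usage: run_lx_strict <command> [args...]
run_lx_strict() {
    (
        # Try to allow core files in this subshell only; warn on
        # stderr if the limit cannot be raised, then carry on anyway.
        ulimit -c unlimited 2>/dev/null ||
            echo "warning: could not raise core size limit" >&2
        exec env LX_STRICT=1 "$@"
    )
}

# Example (hypothetical binary name):
# run_lx_strict ./fah_client
```

The subshell keeps both the raised limit and the variable from leaking into
the calling shell.
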
ed
On Wed, Nov 01, 2006 at 10:43:25AM -0800, Scott L. Burson wrote:
> Hi,
>
> So, the machine has been up running Folding at Home for a couple of days
> now, which is not definitive but certainly is encouraging. Hopefully
> the bad DIMM will turn out to have been the problem all along.
>
> But, there's another problem. The Folding at Home processes hang for
> some reason at the point where they're supposed to send their results
> back to the server. This may be a bit of a pain to debug since
> Stanford doesn't give out the source (they're worried, reasonably
> enough I suppose, about people circulating hacked versions as Trojan
> horses). I don't think it's a network connectivity problem, as
> there's no problem downloading new work units. What do you think I
> should try next? Running one of these processes under `strace'?
>
> -- Scott
>
> On Oct 30, 2006, at 9:06, Scott L. Burson wrote:
>
> >I don't think it's thermal. I was previously running Linux on the
> >machine, and according to its `lm_sensors' package, CPU core temps
> >are reliably under 60 C -- well below the operating limit of 68 --
> >and yes, that's with four copies of Folding at Home running
> >continuously. (When I built the machine, I put in a custom water-
> >cooling system.) (Is there some way to read Opteron core temps
> >under Solaris? I wouldn't mind keeping an eye on this.)
> >
> >What it could be, though, as I've now discovered, is a marginal
> >DIMM. When I installed Nevada build 49, I noticed this process
> >`fmd' using up a substantial amount of CPU time (10% of the whole
> >machine). I eventually looked into it, and discovered that there's
> >now this thing called the Fault Manager that was telling me that
> >one of my DIMMs was getting too many single-bit errors, and I
> >should replace it.
> >
> >The machine was admittedly not perfectly stable under Linux. It
> >would crash occasionally for no obvious reason; but its uptime was
> >normally measured in weeks or months, not days. Running
> >Folding at Home under BrandZ, it crashed twice, each time in about two
> >days. At first I thought that this much greater instability must
> >have a different cause; but eventually it occurred to me that
> >Solaris undoubtedly lays out its hardware memory space differently
> >from Linux, and it could just be making much heavier use of that DIMM.
> >
> >Anyway, I've now pulled the DIMM, and I'm starting Folding at Home
> >again. Let's see what happens this time.
> >
> >-- Scott
> >
> >On Oct 30, 2006, at 3:03, William Kucharski wrote:
> >
> >>Have you had any other issues with this machine?
> >>
> >>Given the amount of computation Folding at Home does, I'm wondering
> >>if it isn't an actual hardware issue (for example, the machine
> >>overheating).
> >>
> >>Is there any type of hardware monitoring on this machine, either
> >>in BIOS or elsewhere you might be able to check when this occurs?
> >>
> >> William Kucharski
> >> william.kucharski at sun.com
> >
> >_______________________________________________
> >brandz-discuss mailing list
> >brandz-discuss at opensolaris.org