[brandz-discuss] Folding at Home, part 2
William Kucharski
William.Kucharski at Sun.COM
Wed Nov 1 14:54:32 PST 2006
Scott L. Burson wrote:
> Thanks for the replies. Here's what I get from `pstack':
>
> 24489: ./FahCore_78.exe -dir work/ -suffix 05 -checkpoint 15 -lifeline 24473
> ----------------- lwp# 1 / thread# 1 --------------------
> 00000000 ????????(), exit value = 0x00000064
> ** zombie (exited, not detached, not yet joined) **
> ----------------- lwp# 4 / thread# 4 --------------------
> fef04a27 nanosleep (fcfff940, fcfff940)
> fefcdcde lx_emulate (fcfff8c4) + 1f6
> fefdf7fb ???????? (8243583, fcfff940, fcfff940, fcfffad4, 8243573, 2)
> 082435e1 ???????? (384, 0, cdbcf43, 0, fcfffb30, 838f5c8)
> 08206327 ???????? (dbba0, fcfffb30, 5fb5, fcfffd28)
> 0804d6f9 ???????? (0, 0, fcfffd28, 821bfa3, 821bf90, fd1fdda4)
> 0821c06a ???????? (fcfffc00, fcfffc00, 0, 0, 0, 0)
> 08245d8a ???????? ()
>
> Hmm, a zombie. Maybe we have some subtle divergence in `wait'
> semantics? Or maybe a problem with signal delivery?
>
> Here's the result of doing `pstack' on its parent process:
>
> 24473: ./FAH502-Linux
> ----------------- lwp# 1 / thread# 1 --------------------
> feee4a27 nanosleep (8047498, 8047498)
> fefcdcde lx_emulate (8047444) + 1f6
> fefdf7fb ???????? (0, fe56e90f, 8047498, 8047498, 0, 8047530)
> fe56eaec nanosleep (0, 0, 8047704, 80516a9, 8047668, 8047670) + 3c
> 08060722 ???????? (fffffff, fe5faa78, fe400bb0, 0, 80476d0, fe4da79a)
> 080516dc ???????? (1, 80476fc, 8047704, 0, fe5faa78, fef85020)
> fe4da79a __libc_start_main (80515c0, 1, 80476fc, 8049564, 807aa6c, fef7ccc0) + da
> 08049ca1 __moddi3 () + 1cd
> ----------------- lwp# 2 / thread# 2 --------------------
> feee5707 waitid (0, 5fa9, fe400030, 3)
> fefdd4c8 lx_wait4 (5fa9, fe400204, 0, 0) + 33c
> fefdd57f lx_waitpid (5fa9, fe400204, 0, fe400204, 100011, fe400348) + 23
> fefcdcde lx_emulate (fe4001ac) + 1f6
> fefdf7fb ???????? (fe5faa78, 0, fe5031d4, 5fa9, fe400204, 0)
> fe56e511 waitpid (fe400664, fe4003c0, fe400664, fe4005e4, fe400978, 804a3b1) + 41
> fe50304c system (fe400664, 8085b38, fe400bb0, fe400ac0, 8085abf, 17) + 4c
> 0804a3b1 ???????? (fe400a24, 8082160, fe4009d0, 0, fe94bc9c, fe400bb0)
> 0804acff ???????? (80855e8, 0, 0, 0)
> fe942dd8 start_thread (fe400bb0, 0, 0, 0, 0, 0) + 98
> fe5a1d1a clone () + 5a
> ----------------- lwp# 3 / thread# 3 --------------------
> feee4a27 nanosleep (fd8008b8, fd8008b8)
> fefcdcde lx_emulate (fd800864) + 1f6
> fefdf7fb ???????? (0, fe56e90f, fd8008b8, fd8008b8, 0, fd800950)
> fe56eaec nanosleep (0, 8085b38, fd800bb0, ffffffff, 0, fd800a88) + 3c
> 08060722 ???????? (1499700)
> 08052734 ???????? (8085b38, 0, 0, 0)
> fe942dd8 start_thread (fd800bb0, 0, 0, 0, 0, 0) + 98
> fe5a1d1a clone () + 5a
> ----------------- lwp# 4 / thread# 4 --------------------
> feee4a27 nanosleep (fcc00664, 0)
> fefcdcde lx_emulate (fcc00620) + 1f6
> fefdf7fb ???????? (0, fe59b94c, fcc00664, 0, 0, 1dcd6500)
> fe56eaec nanosleep (7a120, fcc006a0, 8087740, fcc00a88, 804fecb, fcc00a88) + 3c
> 0806074d ???????? (1f4, fe94bc9c, fcc00bb0, fcc00ac0, 646c6f46, 40676e69)
> 0804fedc ???????? (fe4003b4, 0, 0, 0)
> fe942dd8 start_thread (fcc00bb0, 0, 0, 0, 0, 0) + 98
> fe5a1d1a clone () + 5a
>
>
> It does look like it's waiting. According to `psig', it hasn't blocked
> or ignored `SIGCLD'. I sent it another `SIGCLD' for good measure, but
> nothing happened.
Well, thread 2 of the parent process (24473) is doing an explicit waitpid() on
PID 24489 (0x5fa9 in the trace), the first process you listed, and it won't
proceed until it can reap it.
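For reference, a minimal sketch in C of that blocking behaviour, assuming plain POSIX fork()/waitpid() semantics. The helper name wait_for_child is hypothetical and is not taken from FAH502-Linux; the exit code is simply passed in by the caller.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with exit_code, then block in
 * waitpid() on that specific PID, much as thread 2 of the parent does in the
 * trace. Returns the child's exit status, or -1 on error. */
int wait_for_child(int exit_code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(exit_code);           /* child: terminate immediately */

    int status = 0;
    /* options == 0 means no WNOHANG, so the caller cannot make progress
     * until this particular child has terminated and been reaped. */
    pid_t reaped = waitpid(pid, &status, 0);
    if (reaped < 0)
        return -1;
    assert(reaped == pid);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling wait_for_child(100) returns only once the child is gone. If the child never fully exits, as appears to be the case with 24489, the caller stays parked in waitpid().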
PID 24489 has the zombied thread, but the only remaining thread that could reap
it is in nanosleep(), so either it never received the SIGCLD or for some reason
it isn't set up to reap the zombied thread when the signal arrives.
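As an illustration of the "exited, not detached, not yet joined" state, here is a small C sketch assuming ordinary POSIX threads. The names worker and join_exited_thread are hypothetical, and this says nothing about how FahCore itself manages its threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Thread body: exits at once, handing its argument back as the exit value. */
static void *worker(void *arg)
{
    return arg;
}

/* Hypothetical helper: start a joinable thread that exits with exit_value,
 * then reap it with pthread_join(). Until the join happens, the exited
 * thread's exit value has to be kept around, which is the state pstack
 * describes as a zombie. Returns the exit value, or -1 on error. */
long join_exited_thread(long exit_value)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, worker, (void *)(intptr_t)exit_value) != 0)
        return -1;

    void *result = NULL;
    if (pthread_join(tid, &result) != 0)    /* this is the "reap" step */
        return -1;
    return (long)(intptr_t)result;
}
```

If nothing ever calls pthread_join() on the thread, and it was never detached, it stays in that zombie state for as long as the process lives.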
Could you try sending the hanging child (in this case 24489) a SIGCLD and see
whether it wakes up and reaps the zombied thread? Also, what does psig report
as PID 24489's disposition for SIGCLD?
Thanks,
William Kucharski
william.kucharski at sun.com