From: Paul H. Hargrove (PHHargrove_at_lbl.gov)
Date: Tue Oct 22 2002 - 19:55:10 PDT
Tonight after our conference call I was able to trace the problem while on the subway. It seems this is another manifestation of the parentage problem.

When you invoke execle() or one of its relatives, libpthread must terminate all the other threads. To do this the main thread write()s to the pthread manager, which then sends the pthread cancellation signal to all the other threads and waits for them before exiting. Meanwhile the main thread waits for the manager thread to exit.

The problem arises because we are not yet rebuilding the proper parent-child relationships among the threads, so one or more of the waits is failing. I found that sometimes I would get lucky and the exec() would work, while other times it would not.

This is not a problem we can fix from user-space. It must be fixed in the kernel. This is among the things that Eric is working to fix before SC2002.

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998