OK, over the weekend coda hung again. On two machines with two different problems (a signal 11 and an assert failure in realms.cc), but one of the machines we'll ignore for now since I haven't updated it with all the newest RPMs.

Unfortunately, it seems that the debugging symbols are still not in the venus binary, though I don't have the slightest idea why not. I'll look into this yet again.

In the meantime, I've noticed a trend. These signal 11's seem to happen almost always at 00:30:01. I've tried running all our cronjobs at once to force the crash and I've analyzed what cron jobs would be running around that time, including cron.daily type stuff, and I've come to the conclusion that the problem is probably being generated from somewhere outside of cron. There is nothing in the syslog or cronlog to indicate anything happening at that time. Is that a special time in Coda?

Here's the latest batch of logs and backtraces:

18:30:00 Coda token for user 0 has been discarded
19:00:00 Coda token for user 0 has been discarded
20:00:01 Coda token for user 0 has been discarded

Date: Sat 05/28/2005

00:30:01 Fatal Signal (11); pid 8041 becoming a zombie...
00:30:01 You may use gdb to attach to 8041

[ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA (dir130)
[ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ D(829) : 0000 : 00:30:01 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA (dir129)
[ D(829) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ D(829) : 0000 : 00:30:01 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:30:01 ] ***** FATAL SIGNAL (11) *****

(gdb) bt
#0  0xb73f79d6 in __sigsuspend (set=0x1560b0dc)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#1  0x080ab4b1 in strcpy () at ../sysdeps/generic/strcpy.c:31
#2  <signal handler called>
#3  0x080c8cae in strcpy () at ../sysdeps/generic/strcpy.c:31
#4  0x0817e3c0 in ?? ()
#5  0x0804e55d in strcpy () at ../sysdeps/generic/strcpy.c:31
#6  0x080ac414 in strcpy () at ../sysdeps/generic/strcpy.c:31
#7  0x080ac0e8 in strcpy () at ../sysdeps/generic/strcpy.c:31
#8  0x080abd66 in strcpy () at ../sysdeps/generic/strcpy.c:31
#9  0x080a4b57 in strcpy () at ../sysdeps/generic/strcpy.c:31
#10 0x080a5d39 in strcpy () at ../sysdeps/generic/strcpy.c:31
#11 0x080aa35e in strcpy () at ../sysdeps/generic/strcpy.c:31
#12 0x080a13b6 in strcpy () at ../sysdeps/generic/strcpy.c:31
#13 0xb741a8c4 in __makecontext () from /lib/libc.so.6
#14 0x08128bb8 in ?? ()
Cannot access memory at address 0x30303a30
(gdb)

..Patrick

On Fri, 2005-05-27 at 13:00 -0400, Jan Harkes wrote:
> On Fri, May 27, 2005 at 08:29:03AM -0600, Patrick Walsh wrote:
> > Another day, another Signal 11. On the machine running the latest
> > patches and versions, we have these logs:
> >
> > [ W(327) : 0000 : 00:30:01 ] ***** FATAL SIGNAL (11) *****
> > 00:30:01 Fatal Signal (11); pid 25080 becoming a zombie...
> > 00:30:01 You may use gdb to attach to 25080
> >
> > And this gdb trace:
> >
> > 0xb73f79d6 in __sigsuspend (set=0x159250fc)
> >     at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
> > 45      ../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or
> > directory.
> > ---Type <return> to continue, or q <return> to quit---
> >         in ../sysdeps/unix/sysv/linux/sigsuspend.c
> > (gdb) bt
> > #0  0xb73f79d6 in __sigsuspend (set=0x159250fc)
> >     at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
> > #1  0x080ab4b1 in strcpy () at ../sysdeps/generic/strcpy.c:31
> > #2  <signal handler called>
> > #3  0x0804e53c in strcpy () at ../sysdeps/generic/strcpy.c:31
> > #4  0x0c319ef8 in ?? ()
> > #5  0x080ac414 in strcpy () at ../sysdeps/generic/strcpy.c:31
> > #6  0x080ac0e8 in strcpy () at ../sysdeps/generic/strcpy.c:31
>
> Still no debug symbols in the binary, that's kind of annoying. Also,
> this trace looks suspiciously similar to the previous ones so we're
> probably looking at the same bug. Which means that the one I found
> wasn't actually triggered in your case.
>
> That jump from 0x08 to 0x0c and then back to 0x08 looks a lot like we
> called a library function which then called a callback in the main
> program.
>
> Jan
>

-- 
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-05-31 11:10:33
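
A side note on the last gdb line in the newer trace, "Cannot access memory at address 0x30303a30": every byte of that value is a printable ASCII character. The sketch below decodes it the way it would sit in memory on a little-endian x86 machine. The helper name address_as_ascii is hypothetical and used only for illustration. Whether this means a string was written over a saved frame pointer or return address is an assumption; nothing in the thread confirms it.

```python
import struct


def address_as_ascii(addr: int) -> str:
    """Render a 32-bit value as the 4 bytes it occupies in little-endian memory.

    Hypothetical helper for illustration only; bytes outside the ASCII
    range are shown as the Unicode replacement character.
    """
    raw = struct.pack("<I", addr)  # least significant byte first, as on x86
    return raw.decode("ascii", errors="replace")


if __name__ == "__main__":
    # Address reported by gdb when it gave up unwinding the stack.
    print(repr(address_as_ascii(0x30303A30)))
```

The four bytes are 0x30, 0x3a, 0x30, 0x30, which read as the characters 0, colon, 0, 0. That resembles a fragment of a time string such as the timestamps in the logs, so it may be worth checking code that formats or copies timestamps once a binary with debug symbols is available.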