(Illustration by Gaich Muramatsu)
On Tue, Jun 01, 2004 at 11:51:39AM +0200, Jim Page - emailsystems.com wrote: > I believe I have a similar issue to that described by Steve in the post > http://www.coda.cs.cmu.edu/maillists/codalist/codalist-2003/5878.html - > which I don't think was ever resolved, so I'd like to take up the torch. Well, I did find and fix what I believed to be the problem, an exponential growth in the RPC2 lookup path. But that only affects servers that are contacted by clients behind a masquerading firewall. > The failure mode is always the same. Here is a typical entry the SrvLog: > > 07:34:34 ****** WARNING entry at 0x8122320 already has deqing set! I remember seeing this message in my logs as well, but it never seemed fatal. Just checked the source to see what is triggering this. It looks like the server receives a COP2 (phase 2 commit) message for some operation that is already being committed. My quick guess is that the open() from the MTA was given a piggybacked COP2 message to commit a recently performed operation. But we get stuck trying to obtain a lock which we need to bump the version vector. At some point some thread in the client checks for as yet uncommitted operations and sends a ViceCOP2 message for the same operation which triggers the deqing message we see. It looks like a lot of the related debug messages are at least level 10 or higher which means that there will be a lot of noise, and since a server slows down when logging it might affect the reproducability. But I would be interested in your current log, and if it is reproducable a log at level 10 (volutil setdebug 10), and level 30. Just bump that loglevel somewhere after the onslaught has started and hopefully before the lockup. > here is where codasrv is at: > > (gdb) where > #0 0x4024b44e in select () from /lib/libc.so.6 > #1 0x400812fc in __JCR_LIST__ () from /usr/lib/liblwp.so.2 > #2 0x4007d130 in IOMGR (dummy=0x0) at iomgr.c:354 > #3 0x4007ef16 in Create_Process_Part2 () at lwp.c:796 Because we have userspace threads the gdb backtrace isn't useful in this case. There is a way to get traces from the threads although it tends to crash the server halfway through. Running it a second time sometimes manages to dump all the related threadstacks. (gdb) p LWP_Print_Processes() gdb intercepts a segfault or bus error (gdb) p LWP_Print_Processes() The information should all end up in the SrvErr file, but it only shows return addresses and no arguments. I slightly modified a script that came with dmalloc to do the trick of decoding (attached it). Jan