Coda File System

Re: replicated servers freezing under load

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Tue, 1 Jun 2004 14:14:29 -0400
On Tue, Jun 01, 2004 at 11:51:39AM +0200, Jim Page - emailsystems.com wrote:
> I believe I have a similar issue to that described by Steve in the post
> http://www.coda.cs.cmu.edu/maillists/codalist/codalist-2003/5878.html -
> which I don't think was ever resolved, so I'd like to take up the torch.

Well, I did find and fix what I believed to be the problem, an
exponential growth in the RPC2 lookup path. But that only affects
servers that are contacted by clients behind a masquerading firewall.

> The failure mode is always the same. Here is a typical entry the SrvLog:
> 
> 07:34:34 ****** WARNING entry at 0x8122320 already has deqing set!

I remember seeing this message in my logs as well, but it never seemed
fatal. Just checked the source to see what is triggering this. It looks
like the server receives a COP2 (phase 2 commit) message for some
operation that is already being committed.

My quick guess is that the open() from the MTA was given a piggybacked
COP2 message to commit a recently performed operation. But we get stuck
trying to obtain a lock which we need to bump the version vector. At
some point some thread in the client checks for as yet uncommitted
operations and sends a ViceCOP2 message for the same operation which
triggers the deqing message we see.

It looks like a lot of the related debug messages are at least level 10
or higher which means that there will be a lot of noise, and since a
server slows down when logging it might affect the reproducability.

But I would be interested in your current log, and if it is reproducable
a log at level 10 (volutil setdebug 10), and level 30. Just bump that
loglevel somewhere after the onslaught has started and hopefully before
the lockup.

> here is where codasrv is at:
> 
> (gdb) where
> #0  0x4024b44e in select () from /lib/libc.so.6
> #1  0x400812fc in __JCR_LIST__ () from /usr/lib/liblwp.so.2
> #2  0x4007d130 in IOMGR (dummy=0x0) at iomgr.c:354
> #3  0x4007ef16 in Create_Process_Part2 () at lwp.c:796

Because we have userspace threads the gdb backtrace isn't useful in this
case. There is a way to get traces from the threads although it tends to
crash the server halfway through. Running it a second time sometimes
manages to dump all the related threadstacks.

    (gdb) p LWP_Print_Processes()
    gdb intercepts a segfault or bus error
    (gdb) p LWP_Print_Processes()

The information should all end up in the SrvErr file, but it only shows
return addresses and no arguments. I slightly modified a script that
came with dmalloc to do the trick of decoding (attached it).

Jan




Received on 2004-06-01 14:16:54