Coda File System

Re: coda client hangs

From: Patrick Walsh <pwalsh_at_esoft.com>
Date: Wed, 25 May 2005 14:51:04 -0600
> This could mean that your client cache is too small, or that the file
> was deleted on the server between the status and data validation phases.
> As there were 131 interrupts, this would indicate that this is happening
> quite often.

	These log entries were referring to a log file, so I'm sure it was
being opened and closed pretty frequently.  As a workaround we've moved
the log files to /tmp and just use cron to copy them to coda.
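
	For anyone wanting to do the same, here is a minimal sketch of the copy step, assuming the cron job runs a small Python script. The function name publish_log and both paths are hypothetical, not what we actually run. The point is that the log is appended to on local disk and the Coda copy is opened, written and closed exactly once per cron run.

```python
import os
import shutil


def publish_log(local_path, coda_dir):
    """Copy a locally written log into a Coda directory in one pass.

    Both arguments are hypothetical examples, e.g. a file under /tmp
    and a directory under /coda. The destination is opened, written
    and closed once per call.
    """
    target = os.path.join(coda_dir, os.path.basename(local_path))
    # copyfile opens the target, writes the whole content, and closes it,
    # so the Coda side sees one close per run instead of one per log line.
    shutil.copyfile(local_path, target)
    return target
```

	A crontab entry would then call this script at whatever interval is acceptable for how stale the copy on coda may be.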

> So we fetch the new version (which is 653 bytes larger), and again it
> already changed on the server before we were done. I guess some process
> is opening/writing/closing the file a lot, and every close results in
> a store.

	True to the second part, but there is only one client writing to the
file and no conflicts ever occurred, so I'm not sure the version vectors
differed.

> > [ W(1783) : 0000 : 17:17:18 ] Cachefile::SetLength 3243845
> > [ W(1783) : 0000 : 17:17:19 ] *** Long Running (Multi)Store: code =
> > -2001, elapsed = 1252.4 ***
> > [ W(1783) : 0000 : 17:17:19 ] fsobj::StatusEq: (606e1fc8.7f000002.e.28),
> > VVs differ
> 
> Interesting, this makes it look like it isn't fetches, but stores. I
> don't see how/why the VV's could differ without causing a conflict.

	Me either.

> Argh, the file is getting real big here, so each store is taking longer
> and longer. By now other threads are getting blocked waiting for the
> store to complete (the store is taking 3.6 seconds).

	OK, that makes sense.
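
	To put rough numbers on it, here is a back-of-the-envelope sketch, assuming every close stores the whole file rather than just the appended part. The 653-byte append comes from the log above; the count of 1000 closes and the function name are made up for illustration.

```python
def total_bytes_stored(initial_size, append_size, closes):
    """Total bytes sent to the server if each close stores the whole file."""
    total = 0
    size = initial_size
    for _ in range(closes):
        size += append_size   # the writer appends one record
        total += size         # the close then stores the full file
    return total


# 1000 appends of 653 bytes leave a 653000-byte file,
# but the stores add up to 326826500 bytes in total.
print(total_bytes_stored(0, 653, 1000))
```

	The total grows with the square of the number of closes, which fits each store taking longer than the one before.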

> > 12:55:01 root acquiring Coda tokens!
> > 12:55:01 Coda token for user 0 has been discarded
> > 15:55:00 root acquiring Coda tokens!
> > 21:55:00 Fatal Signal (11); pid 1708 becoming a zombie...
> 
> Ah, when tokens are refreshed all old connections are first destroyed.

	Why is this?  We have cron jobs that refresh tokens three times a day,
plus for various operations, like our cron jobs that download updates,
the tokens are refreshed at the beginning of the operation to ensure
successful writes.  So are you saying that if something is trying to
write a file and something else refreshes the tokens for that user (even
though the coda user <-> local user mapping is the same) that the write
operation will be broken off?  (And potentially cause a crash?)  If so,
can this behavior be changed?

	Also, I'm not sure why there's that "Coda token for user 0 has been
discarded" line since we never log out and I'm pretty sure the coda user
and local user mapping doesn't change.

> I've seen crashes in this area, maybe not everyone correctly increments
> the reference count on the connection. So the token refresh kills the
> existing connection and when the RPC returns the thread gets a NULL ptr
> dereference as soon as it tries to use the connection. Or if we don't
> look at the connection anymore, links the destroyed connection back into
> the list of available connections and some other thread hits the
> segfault.

	Yikes.
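
	If I follow the description, the safe pattern would look something like the sketch below. This is only my illustration in Python, not Coda's actual code; the class and method names are hypothetical. Every user of a connection takes a reference, and a destroy request only marks the connection as dying, with the real teardown deferred until the last reference is dropped.

```python
import threading


class Connection:
    """Hypothetical reference-counted connection (not Coda's real type)."""

    def __init__(self):
        self.refcount = 0
        self.dying = False
        self.closed = False
        self._lock = threading.Lock()

    def acquire(self):
        # Each thread must take a reference before starting an RPC.
        with self._lock:
            if self.dying:
                raise RuntimeError("connection is being torn down")
            self.refcount += 1
        return self

    def release(self):
        # Dropping the last reference on a dying connection closes it.
        with self._lock:
            self.refcount -= 1
            if self.dying and self.refcount == 0:
                self.closed = True

    def destroy(self):
        # A token refresh would call this: mark as dying, and close
        # immediately only if no RPC is still using the connection.
        with self._lock:
            self.dying = True
            if self.refcount == 0:
                self.closed = True
```

	With this, a thread that is in the middle of a store still holds a live connection when its RPC returns, and a dying connection is never handed out again.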

> Fix all remaining bugs and race conditions in Coda.

	Gulp.


> > (gdb) bt
> > #0  0xb73f99d6 in __sigsuspend (set=0x1532f0bc)
> >     at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
> > #1  0x080ab4f5 in strcpy () at ../sysdeps/generic/strcpy.c:31
> > #2  <signal handler called>
> > #3  0x080c99a2 in strcpy () at ../sysdeps/generic/strcpy.c:31
> > #4  0x1532f480 in ?? ()
> > #5  0x0804e559 in strcpy () at ../sysdeps/generic/strcpy.c:31
> 
> Argh, dpkg seems to have stripped the binaries. I thought I told it
> _not_ to do that in the build scripts. At least the program counters
> are still there, so I hopefully will be able to rebuild a clean copy
> and use addr2line to figure out where we crashed.

	This is a Red Hat 2.4 kernel, not Debian, and since I used the coda
scripts to make the RPMs, the symbol stripping probably happens
somewhere in there.  I can investigate this and try to find new
information.  And maybe upgrade from 6.0.8 to 6.0.10 at the same time.
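
	In case it saves a step, here is a small sketch that pulls the program counters out of a pasted backtrace and builds the matching addr2line command line. The binary path passed in is a placeholder for wherever an unstripped build of the same version ends up. Note that frame #0 is inside libc and frame #4 looks like a stack address, so those two will probably not resolve against the client binary.

```python
import re

# Matches gdb frame lines such as "#1  0x080ab4f5 in strcpy () at ...",
# including ones that still carry "> > " quoting from the mail.
FRAME_RE = re.compile(r"#(\d+)\s+(0x[0-9a-fA-F]+) in ")


def frame_addresses(backtrace):
    """Return the program counter of every frame that has one."""
    addrs = []
    for line in backtrace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            addrs.append(match.group(2))
    return addrs


def addr2line_command(binary, addrs):
    """Build an argv list for addr2line; 'binary' is a placeholder path.

    -f asks for function names, -e names the executable to look in.
    """
    return ["addr2line", "-f", "-e", binary] + list(addrs)
```

	The resulting list can be passed to subprocess.run, or joined with spaces and pasted into a shell.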

	So at this point our servers have basically zero load and without the
logs on coda, they haven't crashed.  We'll stress test them in a bit,
assuming we can fix whatever is causing these problems.

-- 
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-05-25 16:51:43