Coda File System

Re: coda client hangs

From: Patrick Walsh <pwalsh_at_esoft.com>
Date: Tue, 31 May 2005 13:37:25 -0600
On Tue, 2005-05-31 at 14:15 -0400, Jan Harkes wrote: 
> I've been working on the iterators over the weekend. The 'safe deletion'
> parts are removed from the core iterator functions and I think I got all
> places where we were doing unsafe deletion.

Is there a patch I should apply to 6.0.10?  

> One thing I couldn't figure out until just now is why you actually have
> conn entries, which are used for non-replicated or backup volumes and
> during weak-reintegration, instead of mgrp entries which are used for
> replicated volumes.
> 
> That turned out to be quite simple, there is a special 'V_UID' that is
> used for some operation when we don't know the actual user who triggered
> the operation or for certain background operations. Some of the obvious
> ones are when we get volume information and during the periodic server
> probes/backprobes. And V_UID is set to '0', which means that the root
> user has a bunch of connections associated with it. Now most people
> probably aren't giving Coda tokens to root, so this explains why you
> seem to be seeing these problems a lot more than most people.

Hmmmm.  It never occurred to me that giving tokens to root might cause a
problem... 

> reauthentications and read and write operations. But then again, your
> 'test-environment' might be hitting it harder than I ever can.

At this point, the test environment isn't abusing the system.  We plan
stress tests, but haven't done them yet.  The read/write load is
actually pretty light for the most part.  We do download rpm, virus
signature, and other updates to coda regularly, though.

> I think rpmbuild actually strips the binaries before it packs them in
> the RPM, the unstripped versions seem to be placed in an associated
> 'debuginfo' RPM, although that might only be for libraries.

I've solved this.  I use a custom Makefile to build coda and its
dependencies.  I had added a --debug flag to the build-coda-rpm.pl
script call, but I didn't realize that it produces a different spec file
(with -debug in its name), so I was still building the RPM from the
non-debug spec file.  Anyway, I've got coda-debug clients running on
each of the client machines now.  Hopefully any future crashes will be
easier to trace.

> > 	In the meantime, I've noticed a trend.  These signal 11's seem to
> > happen almost always at 00:30:01.  I've tried running all our cronjobs
> 
> Not really, there are periodic hoard walks and server probes, but those
> are relative to when venus starts, so unless you always start your
> client at exactly the same time I wouldn't expect them to coincide so
> nicely on 00:30 real-time intervals.

Hmmmm.  Very strange then.  I'll keep looking to see what might happen
at 12:30:01 a.m.  It has been the exact time of the signal 11 on at
least two occasions.

> If the venus.log contains
> 
>     userent::Connect: Authenticated bind failure, uid = 0
> 
> Then this is a result of a server rejecting the token, otherwise it is
> because of an explicit call to cunlog.

We are still getting some of these messages.  The last crash, which took
place Friday night (Saturday morning) at 12:30, was preceded by several
Authenticated bind failures for uid 0, the last one at 8:00:01pm Friday
night.

Since I restarted venus this morning, there have been two of these
messages on the machine that had the signal 11.  It looks like all of
our clients have these messages in their logs, though.  The only thing
going on on some of these machines is a clog operation once every three
hours.

> 'D' is the probe-daemon thread, and 'W' is a worker thread.
> ...
> My guess is that 'D' is sending out periodic server probes (using the
> V_UID = 0 connections), while at the same time the user calls 'cunlog'
> for the root user which triggers the worker thread to destroy any
> connections owned by root, which to me indicates that I'm fixing the
> correct bugs in the iterators.

	I hope so.  Just for your information, I grepped through the log just
for the worker thread to see what it was doing before the crash and this
is what I got:


[ W(190) : 0000 : 00:00:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:00:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:10:00 ] userent::Connect: ViceGetAttrPlusSHA(dir129)
[ W(190) : 0000 : 00:10:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:10:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:10:00 ] userent::Connect: ViceGetAttrPlusSHA(dir129)
[ W(190) : 0000 : 00:10:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:10:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:15:00 ] userent::Connect: ViceGetAttrPlusSHA(dir130)
[ W(190) : 0000 : 00:15:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:15:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:15:00 ] userent::Connect: ViceGetAttrPlusSHA(dir129)
[ W(190) : 0000 : 00:15:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:15:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:20:00 ] userent::Connect: ViceGetAttrPlusSHA(dir129)
[ W(190) : 0000 : 00:20:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:20:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:20:00 ] userent::Connect: ViceGetAttrPlusSHA(dir130)
[ W(190) : 0000 : 00:20:00 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:20:00 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA(dir129)
[ W(190) : 0000 : 00:30:01 ] userent::Connect: ViceGetAttrPlusSHA() -> 22
[ W(190) : 0000 : 00:30:01 ] userent::Connect: VGAPlusSHA_Supported -> 1
[ W(190) : 0000 : 00:30:01 ] *****  FATAL SIGNAL (11) *****


	Is it odd that, though the servers are supposed to be checked every
five minutes, there was no check at 00:05 or 00:25?

	That's the only thing I can think of that might be unusual there.

	Thanks, Jan, for all your help so far.

-- 
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-05-31 15:40:34