Coda File System

New troubles in coda land

From: Patrick Walsh <pwalsh_at_esoft.com>
Date: Fri, 01 Jul 2005 11:44:58 -0600
	I left for a week on vacation and left tests running on our servers
only to come back and find a series of problems and probably coda bugs.
I think most of these should be reproducible.  I'll do my best to make
them easy to track down.


1) cfs getpath fid@realm

	This command works fine on consistent objects, but not at all on
inconsistent objects.  So when you get a log entry that looks like
this:  

08:58:41 xsp (606e1fc8.7f000003.7d.25861) inconsistent!

and you attempt to get the path by doing this:

cfs getpath 7f000003.7d.25861@realm

you get this error:

VIOC_GETPATH: No such file or directory
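
For anyone repeating this test, a minimal sketch of pulling the fid out of such a log line follows. It assumes the log format shown above and drops the first dotted field, as the getpath command above does. The name fid_from_log is a hypothetical helper, not part of the Coda tools.

```shell
#!/bin/sh
# fid_from_log: hypothetical helper, not a Coda command.
# Input:  a log line such as
#   08:58:41 xsp (606e1fc8.7f000003.7d.25861) inconsistent!
# Output: the last three dotted hex fields inside the parentheses,
#         i.e. the form passed to "cfs getpath" above.
fid_from_log() {
    printf '%s\n' "$1" |
        sed -n 's/^.*(\([0-9a-fA-F]*\)\.\([0-9a-fA-F]*\.[0-9a-fA-F]*\.[0-9a-fA-F]*\)).*$/\2/p'
}
```

Lines that do not contain a parenthesised four-field identifier produce no output, so the function can be fed a whole log file one line at a time.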

2) cfs getmountpoint volid

	This command doesn't seem to be working anymore.  I'm sure I used to be
able to use it without problems.  Here's a line in our VRList file:

/log 7f000002 2 1000003 2000003 0 0 0 0 0 0 0

I've tried each of the following without success.  If I'm doing this
wrong, please let me know:

cfs getmountpoint /log 
cfs getmountpoint 7f000002
cfs getmountpoint 2130706434  (this is the decimal version of above)
cfs getmountpoint 1000003
cfs getmountpoint 0x7f000002
cfs getmountpoint 0x1000003
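
To rule out a conversion slip in the variants above, the hex and decimal forms of a volume id can be generated mechanically. This is only a sketch: volid_forms is a hypothetical helper, and it says nothing about which form cfs getmountpoint actually expects.

```shell
#!/bin/sh
# volid_forms: hypothetical helper, not a Coda command.
# Given a hex volume id without the 0x prefix (as it appears in VRList),
# print the three spellings tried above, one per line.
volid_forms() {
    hex=$1
    printf '%s\n' "$hex"            # bare hex, as in VRList
    printf '0x%s\n' "$hex"          # hex with 0x prefix
    printf '%s\n' "$((0x$hex))"     # decimal, via shell arithmetic
}
```

For 7f000002 the last line printed is 2130706434, which matches the decimal value used above.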

3) One of the reasons for all of the new problems is that we made a
watchdog script that checks for inconsistencies every five minutes by
doing a find command.  This happens on every client.  Ironically, this
seems to be causing inconsistencies.  I expect what is happening is that
the find commands, which look like this: 

find /coda/realm -noleaf -lname '@*'

are blocking the server so that normal write operations are timing out
on the client side and version mismatches are happening.  I expect this
is partially due to our very fast connections between clients and
servers and I think much of this might be avoidable if we could manually
increase the client timeouts.  RPC2_timeout and RPC2_retries in
venus.conf are commented out, so they should be at 60 seconds and 5 retries.
The find command takes nowhere near that long to run (it's pretty darn
quick, actually).  Any ideas?

	(Note: I know the -noleaf is no longer necessary.  I'm removing it
from our scripts, but don't expect this to make any difference.)
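
One way to check whether the scans line up with the client timeouts is to have the watchdog log how long each scan takes and how many hits it finds. The sketch below assumes a find that supports -lname (GNU find does). The names list_conflicts and timed_scan are hypothetical, and the path passed in is whatever the watchdog already scans.

```shell
#!/bin/sh
# list_conflicts: hypothetical wrapper around the find command quoted above.
# Prints every symlink under $1 whose target starts with '@'.
list_conflicts() {
    find "$1" -lname '@*'
}

# timed_scan: hypothetical helper. Runs one scan, prints the hits on stdout
# and a one-line summary (elapsed whole seconds, hit count) on stderr.
timed_scan() {
    start=$(date +%s)
    hits=$(list_conflicts "$1")
    end=$(date +%s)
    # grep -c exits non-zero when nothing matches, hence the "|| true".
    count=$(printf '%s\n' "$hits" | grep -c . || true)
    [ -n "$hits" ] && printf '%s\n' "$hits"
    echo "scan of $1: $((end - start))s, $count hit(s)" >&2
}
```

Comparing the timestamps of slow scans against the times of the version mismatches would show whether the two are actually correlated.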

4) We're back to having issues with clog (or so I believe).  To
reproduce this, you need to log in to coda (as the same user) over and
over again every two seconds or so, while in another window
copying/moving files back and forth between two directories.  We created
a login script that pipes in the password to log in the user over and
over and used the watch command to make it happen repeatedly:

watch gettokens.sh

where gettokens.sh basically just does this:

clog codauser@realm < passwordfile

Then in another window we create two directories and put some files in
the first one.  We then move those files back and forth repeatedly, like
this:

while [ 1 ] ; do mv WORK/s/* WORK/t; mv WORK/t/* WORK/s; done
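
A bounded variant of this loop that stops at the first failure may make the point of breakage easier to see. This is a sketch only: shuffle_files is a hypothetical name, and the directory arguments and round count are whatever the tester chooses.

```shell
#!/bin/sh
# shuffle_files: hypothetical bounded version of the mv loop above.
# Usage: shuffle_files SRC DST ROUNDS
# Moves everything from SRC to DST and back, ROUNDS times, and returns
# non-zero as soon as an mv fails or the file count in SRC changes.
shuffle_files() {
    src=$1
    dst=$2
    rounds=$3
    expected=$(($(ls "$src" | wc -l)))
    i=0
    while [ "$i" -lt "$rounds" ]; do
        mv "$src"/* "$dst" || { echo "round $i: mv to $dst failed" >&2; return 1; }
        mv "$dst"/* "$src" || { echo "round $i: mv to $src failed" >&2; return 1; }
        actual=$(($(ls "$src" | wc -l)))
        if [ "$actual" -ne "$expected" ]; then
            echo "round $i: expected $expected files, found $actual" >&2
            return 1
        fi
        i=$((i + 1))
    done
}
```

Run as, for example, shuffle_files WORK/s WORK/t 1000 while the login loop is active in the other window. The round number printed on failure can then be matched against the SrvLog timestamps.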

Note that we do all this simply to try to force errors that we've been
seeing on occasion.

So while this is happening, we get errors in the SrvLog like this:

15:30:05 Worker5: Unbinding RPC connection 12416
15:30:05 Deleting client entry for user backend at 192.168.30.224.2435
rpcid 12416

This will eventually kill coda and require a server restart.

	Sorry to thrust all of this on you right before the Fourth of July
weekend.  Enjoy the weekend and when you get back if you have a chance
please let us know what we can do to mitigate some of these issues.

Thanks,

..Patrick


-- 
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-07-01 12:22:48