I left for a week of vacation with tests running on our servers, and came back to find a series of problems and probable Coda bugs. I think most of these should be reproducible, and I'll do my best to make them easy to track down.

1) cfs getpath fid_at_realm

This command works fine on consistent objects, but not at all on inconsistent objects. So when you get a log entry that looks like this:

    08:58:41 xsp (606e1fc8.7f000003.7d.25861) inconsistent!

and you attempt to get the path by doing this:

    cfs getpath 7f000003.7d.25861_at_realm

you get this error:

    VIOC_GETPATH: No such file or directory

2) cfs getmountpoint volid

This command doesn't seem to be working anymore. I'm sure I used to be able to use it without problems. Here's a line in our VRList file:

    /log 7f000002 2 1000003 2000003 0 0 0 0 0 0 0

I've tried each of the following without success. If I'm doing this wrong, please let me know:

    cfs getmountpoint /log
    cfs getmountpoint 7f000002
    cfs getmountpoint 2130706434    (the decimal form of 7f000002)
    cfs getmountpoint 1000003
    cfs getmountpoint 0x7f000002
    cfs getmountpoint 0x1000003

3) One of the reasons for all of the new problems is that we made a watchdog script that checks for inconsistencies every five minutes by running a find command. This happens on every client. Ironically, this seems to be causing inconsistencies. I suspect that the find commands, which look like this:

    find /coda/realm -noleaf -lname '@*'

are blocking the server, so that normal write operations time out on the client side and version mismatches occur. I expect this is partly due to our very fast connections between clients and servers, and I think much of it might be avoidable if we could manually increase the client timeouts. RPC2_timeout and RPC2_retries are commented out in venus.conf, so they should be at 60 seconds and 5 retries. The find command takes nowhere near that long to run (it's pretty darn quick, actually). Any ideas?
(Note: I know the -noleaf is no longer necessary. I'm removing it from our scripts, but I don't expect this to make any difference.)

4) We're back to having issues with clog (or so I believe). To reproduce this, you need to log in to Coda as the same user over and over again, every two seconds or so, while copying/moving files back and forth between two directories in another window. We created a login script that pipes in the password, and used the watch command to run it repeatedly:

    watch gettokens.sh

where gettokens.sh basically just does this:

    clog codauser_at_realm < passwordfile

Then in another window we create two directories and put some files in the first one. We then move those files back and forth repeatedly, like this:

    while [ 1 ] ; do mv WORK/s/* WORK/t; mv WORK/t/* WORK/s; done

Note that we do all this simply to try to force errors that we've been seeing on occasion. While this is happening, we get errors in the SrvLog like this:

    15:30:05 Worker5: Unbinding RPC connection 12416
    15:30:05 Deleting client entry for user backend at 192.168.30.224.2435 rpcid 12416

This will eventually kill Coda and require a server restart.

Sorry to thrust all of this on you right before the Fourth of July weekend. Enjoy the weekend, and when you get back, if you have a chance, please let us know what we can do to mitigate some of these issues.

Thanks,
..Patrick

-- 
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-07-01 12:22:48
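
For anyone trying to reproduce item 3, the watchdog check could be sketched roughly as below. This is not the actual script from our clients, only a minimal illustration built around the same find expression. It assumes, as that expression does, that inconsistent objects appear as symlinks whose target starts with '@'. The function names list_inconsistent and check_tree, and the logger call in the closing comment, are made up for the example.

```shell
#!/bin/sh
# Minimal sketch of the watchdog check from item 3 (not the original script).
# Assumption: inconsistent objects appear as symlinks whose target starts
# with '@', which is what the find expression in the post matches.

# list_inconsistent DIR: print every symlink under DIR whose target begins with '@'.
list_inconsistent() {
    # -lname is a GNU find extension; -noleaf is left out, per the note on item 3.
    find "$1" -lname '@*' 2>/dev/null
}

# check_tree DIR: return 0 if nothing matched; otherwise print the matches and return 1.
check_tree() {
    hits=$(list_inconsistent "$1")
    if [ -n "$hits" ]; then
        printf 'inconsistent objects under %s:\n%s\n' "$1" "$hits"
        return 1
    fi
    return 0
}

# Hypothetical invocation, e.g. from cron every five minutes:
#   check_tree /coda/realm || logger -t coda-watchdog "conflicts found"
```

The sketch only reports what the find turns up. It does nothing about the suspected client-side timeouts, so running it on every client would presumably put the same load on the server as the original watchdog.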