Coda File System

Re: Unable to do beginrepair...

From: Greg Troxel <gdt_at_ir.bbn.com>
Date: 11 Dec 2002 09:31:35 -0500

I am having similar problems.  It has seemed to me that coda
reliability, specifically repair and conflicts arising when they
shouldn't, has been worse recently (than say a year ago).  It could
also be that I'm just stressing it more.

I just mv'd a file from my non-coda homedir (NetBSD/i386 1.6) into
/coda/home/gdt, and ended up with a conflict shortly thereafter.  I
saw the dreaded 'local inconsistent object' message, and tried to do a
repair.  I got a failure to allocate (something - fuzzy now), and
found the normal directory-instead-of-object.  local had the right
contents, and global was a symlink to a volume id.  I was able to do
'cfs er' and 'cfs br' ok, but not to actually repair.  I then shut
down venus and tried to run norton, but it seems to be only a server
thing.  I then tried 'cfs fl .' in the directory, hoping to just
remove the CML entry, since I can easily regenerate the file.  Now the
file 'wi0.dump' is present in the directory, but everything says
connection timed out.  I even tried to flush the whole directory, but
I get "can't flush active file".

poblano gdt 239 ~/%co/HARDWARE/POBLANO > l
ls: wi0.dump: Connection timed out
total 18
-rw-r--r--  1 gdt  65534  1316 Dec  4 13:56 bios
-rw-r--r--  1 gdt  65534    44 Dec  2 15:05 disk
-rw-r--r--  1 gdt  65534  5228 Dec  9 14:50 dmesg.new-memory
-rw-r--r--  1 gdt  65534  5628 Dec  6 20:21 dmesg.sound,bluetooth
-rw-r--r--  1 gdt  65534  1316 Nov  8 20:36 shopping
-rw-r--r--  1 gdt  65534   263 Dec  4 09:52 wi0
poblano gdt 247 ~/%co/HARDWARE/POBLANO > cfs gf wi0.dump 
VIOC_GETFID: Connection timed out

[ W(19) : 0000 : 09:15:58 ] fsdb::Get: Locally created fid (0x7f000001.0xffffffff.0x80001) not found!
[ W(19) : 0000 : 09:16:00 ] fsdb::Get: Locally created fid (0x7f000001.0xffffffff.0x80001) not found!
[ W(19) : 0000 : 09:16:01 ] fsdb::Get: Locally created fid (0x7f000001.0xffffffff.0x80001) not found!

[ L(18) : 0000 : 09:16:02 ] LocalInconsistentObj: objFid=7f000001.4830.988c

The cml file from 'cfs ck':
Create  ???/local
Chown   ???/local (owner = 0)
Store   ???/local (length = 2836)
and the tar file is empty.
The cml.old file:
Create  ???/wi0.dump
Chown   ???/wi0.dump (owner = 0)
Store   ???/wi0.dump (length = 2836)
and the tar file has
-rw-r--r-- [uid/gid redacted]  2836 Dec 11 08:48 2002 ???/wi0.dump

in server log:
08:56:56 Callback failed RPC2_DEAD (F) for ws [client-in-question]:65516
09:01:19 Callback failed RPC2_DEAD (F) for ws [client-in-question]:64967

So, it may be that I can wait to reinit until tomorrow, when my client
will be on the same ethernet as the server.  [pause] Nope, I get
EROFS, due to pending conflicts.

So, I hate to sound cranky, but coda has become less usable for me.
Having to reinit venus is a real problem when I don't have a
fast link (28.8kb/s right now).  And that's when it breaks, usually.
I realize my connectivity is lame by 2002 standards, but it seems
supporting this situation is one of the design goals for coda.

I know Jan has been working towards a totally new repair scheme in
terms of the local representation of in-process repairs.  The usual
caution towards putting such drastic changes in the mainline may not
be in order now, given that from my point of view, repair essentially
does not work at the moment.   (I do not have the usual bona fide
conflicts due to my work style; I only get pseudo-conflicts, I think
due to a reintegration failing on the client but succeeding on the
server.   And the latter kind can never be fixed, in my experience.)

In my view there are two big problems, and perhaps more lurking:

* repair doesn't work in some situations, and there is apparently no
  way to recover.  If there were a tool to just remove the problematic
  LocalInconsistentObj entries, that would help a lot.

* I am not 100% certain, but having timeouts on reintegration
  operations seems to lead to declared conflicts due to the server
  completing an operation and the client not getting the ack.  Such
  operations should be idempotent even across disconnections and long
  time intervals (days).  This requires keeping state on the server, I
  think, since the underlying operations are not idempotent.  But, if I
  could just say 'repair, discardlocal, discardlocal, end' and not
  have to reinit, I wouldn't mind as much.
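
A minimal sketch of the server-side state mentioned in the second
point, in Python.  This is not Coda's reintegration protocol: the class
`Server`, the method `create`, and the `(client_id, op_id)` key are
hypothetical names chosen for illustration.  The assumption is that the
client tags each logged operation with an id that survives retries, and
that the server remembers the result it returned for that id.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy server that remembers which client operations it has applied."""

    files: dict = field(default_factory=dict)
    # Hypothetical dedup table: (client_id, op_id) -> result returned earlier.
    applied: dict = field(default_factory=dict)

    def create(self, client_id, op_id, name, data):
        key = (client_id, op_id)
        if key in self.applied:
            # Retry of an operation that already ran: replay the old result
            # instead of re-executing a non-idempotent create.
            return self.applied[key]
        if name in self.files:
            # Someone else created the name first: a genuine conflict.
            result = "conflict"
        else:
            self.files[name] = data
            result = "ok"
        self.applied[key] = result
        return result


server = Server()

# First attempt succeeds on the server, but assume the ack is lost.
first = server.create("client-a", 1, "wi0.dump", b"dump contents")

# The client times out and retries the same logged operation later.
retry = server.create("client-a", 1, "wi0.dump", b"dump contents")

# A different client creating the same name is still flagged.
other = server.create("client-b", 1, "wi0.dump", b"something else")

print(first, retry, other)
```

With the table in place, the retried create gets the original result
back rather than being declared a conflict, while a create of the same
name from a different client is still reported.  A real server would
also have to persist the table and eventually expire old entries.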

I am running code from 2002-10-14; I'll upgrade to -current CVS and
see if things are better.

        Greg Troxel <gdt_at_ir.bbn.com>
Received on 2002-12-11 09:37:34