Coda File System

Re: still losing with unresolvable conflicts, and Coda/IPsec HOWTO

From: Greg Troxel <gdt_at_ir.bbn.com>
Date: Mon, 07 Oct 2002 15:39:29 -0400

A long-delayed response:

  Actually local-global conflicts are never automatically resolved,
  resolution only occurs between servers. But I know what you mean.

Well, my complaints are:
1) there was a conflict, when only 1 client was operating.  This
   should never happen.
2) More seriously, I couldn't fix the conflict, because there was
   something really screwed up going on.
It may be that working on the 'cells' branch and unkludging
local-global repair should be done before this is addressed.


  The problem is probably related to the 'store-id'. Every operation is
  tagged with a unique store-id so that the server can detect when an
  operation is retried and simply returns EALREADY. What you are seeing
  seems to be caused by the client timing out on the connected store
  operation, then switching back to write-disconnected, logging the store
  in the CML.

I concur (storeid confusion across transition to disconnected), and if
so this seems like a serious bug well worth fixing.  It would seem not
burdensome for clients to keep track of operations sent to the server
for which an ack has not been received, to make sure retries are
treated as retries rather than conflicting operations.  But I am
unclear on how hard this is.
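
A rough sketch of the bookkeeping meant here, in C. Everything in it is an assumption made for illustration: the store_id layout, the fixed-size table, and the pending_* names are hypothetical and are not Venus's actual data structures. The idea is only that the client remembers each store-id it has sent until an ack arrives, so that an operation logged in the CML after a timeout can be recognized as a retry of the same operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical store-id; the real layout in Coda may differ. */
struct store_id {
    uint32_t host;
    uint32_t uniq;
};

enum { MAX_PENDING = 16 };

/* Operations sent to the server for which no ack has been received. */
struct pending_table {
    struct store_id ids[MAX_PENDING];
    bool used[MAX_PENDING];
};

static bool sid_equal(struct store_id a, struct store_id b)
{
    return a.host == b.host && a.uniq == b.uniq;
}

/* Record an operation just before sending it. Returns false if the table is full. */
bool pending_add(struct pending_table *t, struct store_id sid)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (!t->used[i]) {
            t->ids[i] = sid;
            t->used[i] = true;
            return true;
        }
    }
    return false;
}

/* Forget an operation once its ack arrives. Returns false if it was not pending. */
bool pending_ack(struct pending_table *t, struct store_id sid)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (t->used[i] && sid_equal(t->ids[i], sid)) {
            t->used[i] = false;
            return true;
        }
    }
    return false;
}

/* True if this store-id was sent but never acked, i.e. resending it is a retry. */
bool pending_is_retry(const struct pending_table *t, struct store_id sid)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (t->used[i] && sid_equal(t->ids[i], sid))
            return true;
    }
    return false;
}
```

On the transition to write-disconnected, the client would consult pending_is_retry() before logging the store, and reuse the original store-id instead of minting a new one.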

  The second problem that I see is that obviously the reply packets are
  getting lost somehow, either that or the server is not yielding to other
  LWPs for more than 15-30 seconds, so that it is unable to send back
  RPC2_BUSY's when the client sends its retries. The ACK's and BUSY's
  should be pretty small, so this cannot be fragmentation related.

This is during high load, so I think that the lossage can happen even
if the under-load behavior could be improved.

  The remove before the rename isn't that strange. It happens when an
  object is renamed over an existing object. Any existing target is then
  automatically removed. The remove is added whenever the client knows it
  is renaming over an existing object, and the rename reintegration
  explicitly does _not_ overwrite existing target objects.

So is there some way that we can be sure that the server will apply
the remove and rename atomically?  This is the whole point of atomic
rename from rcs's point of view.  Perhaps this behavior should be
changed and the rename CML entry should take both the name/storeid of
the source and destination.  I have not thought much about this, and
could be way off base.
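
For reference, the application-side pattern that depends on this is roughly the following: write the new contents to a temporary file, then rename it over the target. This is a minimal sketch, not rcs's actual code; the atomic_replace name and its arguments are made up for illustration, and it assumes POSIX rename() semantics (ISO C alone leaves renaming over an existing file implementation-defined).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Replace `target` with `data` by writing `tmp` and renaming it over
 * `target`. Under POSIX, a reader of `target` sees either the old file
 * or the new one, never a missing or partial file.
 * Returns 0 on success, -1 on failure.
 */
int atomic_replace(const char *tmp, const char *target, const char *data)
{
    assert(tmp != NULL && target != NULL && data != NULL);

    FILE *f = fopen(tmp, "w");
    if (f == NULL)
        return -1;

    size_t len = strlen(data);
    if (fwrite(data, 1, len, f) != len) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    if (fclose(f) != 0) {
        remove(tmp);
        return -1;
    }

    /* The single step that must not be split into remove + rename. */
    if (rename(tmp, target) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

If the server applies the Remove and the Rename as two separate steps, a failure between them leaves no target at all, which is exactly the state this pattern is meant to rule out.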

    fd = open("path-c2", O_CREAT | O_WRONLY, 0);   -> Create
    write(fd, data);
    fchmod(fd, 0644);                              -> Chmod
    rename("path-c2", "path-c3");                  -> Remove/Rename

    unlink(".pvect_path-c3");                      -> Remove
    symlink("e9c222dc", ".pvect_path-c3");         -> Symlink

This is cfs storing the IV for the new file.

    close(fd);                                     -> Store

This close delay is due to cfs, I think, which closes a fd after a
timeout or when it needs to open another file.  This is an
optimization which may be causing trouble.  But without it, each write
and fchmod would have an open/close cycle.  Perhaps I can make rename
force a close; that's almost always what one would want.
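
A minimal sketch of what "rename forces a close" could look like, in C. This is not cfs's code: the one-entry fd_cache and the function names are hypothetical, and the sketch only assumes that the delayed close() is what triggers the Store, as the trace above suggests.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical one-entry cache of a lazily closed descriptor. */
struct fd_cache {
    char path[256];
    int fd;            /* -1 when nothing is cached */
};

/* Open `path` for writing and keep the descriptor cached. */
int cache_open(struct fd_cache *c, const char *path)
{
    assert(c != NULL && path != NULL);
    c->fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (c->fd < 0)
        return -1;
    snprintf(c->path, sizeof c->path, "%s", path);
    return 0;
}

/* Close the cached descriptor if it refers to `path`; no-op otherwise. */
int cache_flush_path(struct fd_cache *c, const char *path)
{
    assert(c != NULL && path != NULL);
    if (c->fd < 0 || strcmp(c->path, path) != 0)
        return 0;
    int rc = close(c->fd);
    c->fd = -1;
    c->path[0] = '\0';
    return rc;
}

/* Rename that forces the delayed close first, so the close precedes the rename. */
int rename_with_flush(struct fd_cache *c, const char *from, const char *to)
{
    if (cache_flush_path(c, from) != 0)
        return -1;
    return rename(from, to);
}
```

With this ordering the sequence seen by Coda would be Create, Chmod, Store, Remove/Rename, rather than the Store trailing the Rename.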

  The Bob solution (*BSD) was to block everything for a period of 1
  minute and then automatically return to the user with an error.

So it seems that the blocking works, and the real problem is that the
Venus operation had not yet finished, for no apparent reason.
It seems to me that the bug is the Venus operation's failure to
complete, not the blocking itself.
Received on 2002-10-07 15:45:28