Coda File System

Re: odd repair problem

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Wed, 15 Aug 2001 10:10:10 -0400
On Wed, Aug 15, 2001 at 09:35:21AM -0400, Greg Troxel wrote:
> Store   /coda/home/gdt/ELF-4 (length = 202597)
> Rename  /coda/home/gdt/ELF-4 (to: /coda/home/gdt/FreeBSD/ELF-4/bpdksh-5.2.14.tgz)
> Rename  /coda/home/gdt/FreeBSD/ELF-4/bpdksh-5.2.14.tgz (to: /coda/home/gdt/FreeBSD/ELF-4/pdksh-5.2.14.tgz)
...
> I would expect this to be a local/global conflict (I only have one server)
...

Ok, there are several known problems here. One is that there is a
server-server conflict. The resolution code doesn't know how to handle
cross-directory renames and always marks all involved objects
'in-conflict'.

Second problem is that due to the low bandwidth connection, your client
did a 'weak-reintegration'. This involves sending the updates to one
server, and then triggering a resolve to update the others. Normally, a
simple optimization cuts out the resolve RPC call, but maybe it is
occasionally triggered anyway for unknown reasons. This should be
harmless except for the fact that the rename resolution code dumbly
marks all related objects as a conflict, although there are no other
replicas to conflict with.

Third problem, the server-server conflict blocks reintegration of
subsequent operations and triggers a local-global conflict. This
local-global conflict is not repairable until the server-server conflict
is resolved from another client.

> As a minor issue, I am running with the following local patch, to get
> around lossage with the 28.8 link where the bw estimate gets too big
> and then I get timeouts.

Ehh, that is funny, this patch doesn't change anything in RPC2, which is
where the timeouts occur. The only thing this patch does is clamp the
bandwidth estimate down for venus, which results in reintegration
sending fewer CML entries and chopping up large files into smaller
pieces.
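
To make the effect of the clamp concrete, here is a minimal standalone
sketch. It is not venus code: the helper names, the ceiling and the
per-transfer time budget are all assumptions picked for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: cap a bandwidth estimate (bytes/sec) at a ceiling.
std::uint64_t ClampBandwidth(std::uint64_t estimate, std::uint64_t ceiling) {
    return std::min(estimate, ceiling);
}

// Hypothetical helper: number of pieces needed to send file_len bytes when
// each piece may carry at most bandwidth * budget_secs bytes.
std::uint64_t PiecesNeeded(std::uint64_t file_len, std::uint64_t bandwidth,
                           std::uint64_t budget_secs) {
    assert(bandwidth > 0 && budget_secs > 0);
    const std::uint64_t piece = bandwidth * budget_secs;
    return (file_len + piece - 1) / piece;  // ceiling division
}
```

With an assumed ceiling of 3600 bytes/sec (a 28.8 kbit/s link) and an
assumed 15 second budget, PiecesNeeded(202597, 3600, 15) splits the
202597-byte file from the log above into 4 pieces, while an inflated
estimate of 50000 bytes/sec would send it as a single piece.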

I think I know what is causing those timeouts: weak-reintegration of a
large batch of CML entries over a slow link results in a lot of resolve
operations. (Are you sure there is only a single replica??) I happened
to notice last night that when an uncached object is looked up and the
volume is in disconnected _or resolving_ state, the Get fails with
ETIMEDOUT. The following patch should simply release the volume and
reacquire it, allowing the resolve to go through. I'm not sure if it
will loop like crazy, but we'll notice that quickly enough.

Jan

Index: fso0.cc
===================================================================
RCS file: /afs/cs/project/coda-src/cvs/coda/coda-src/venus/fso0.cc,v
retrieving revision 4.50
diff -u -u -r4.50 fso0.cc
--- fso0.cc	2001/05/17 21:26:52	4.50
+++ fso0.cc	2001/08/15 13:56:59
@@ -609,6 +609,7 @@
 
         /* Must ensure that the volume is cached. */
         volent *v = 0;
+retry_vdbget:
         if (VDB->Get(&v, key->Volume) != 0) {
             LOG(100, ("Volume not cached and we couldn't get it...\n"));
             return(ETIMEDOUT);
@@ -620,8 +621,14 @@
 	    goto RestartFind;
 	}
 
+        if (v->state == Resolving) {
+            LOG(0, ("Volume resolving and file not cached, retrying VDB->Get!\n"));
+            VDB->Put(&v);
+	    goto retry_vdbget;
+        }
+
         /* Cut-out early if volume is disconnected! */
-        if (v->state != Hoarding && v->state != Logging) {
+        if (v->state == Emulating) {
             LOG(100, ("Volume disconnected and file not cached!\n"));
             VDB->Put(&v);
             return(ETIMEDOUT);
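
If the retry does turn out to loop like crazy, one option would be to
bound it. The following is a minimal standalone sketch of that idea, not
part of the patch: the enum, LookupUncached, get_state and max_retries
are all hypothetical stand-ins, and only the state names are taken from
the code above.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Volume states named in the patch; the enum itself is an assumption.
enum class VolState { Hoarding, Emulating, Logging, Resolving };

// Hypothetical stand-in for the lookup path of an uncached object.
// Each call to get_state models a VDB->Get() followed by reading v->state,
// i.e. one release and reacquire of the volume.
// Returns 0 when the fetch may proceed, ETIMEDOUT otherwise.
int LookupUncached(const std::function<VolState()> &get_state, int max_retries) {
    assert(max_retries >= 0);
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        const VolState s = get_state();
        if (s == VolState::Resolving)
            continue;          // release and reacquire so the resolve can run
        if (s == VolState::Emulating)
            return ETIMEDOUT;  // disconnected and file not cached
        return 0;              // Hoarding or Logging: go fetch the object
    }
    return ETIMEDOUT;          // still resolving after max_retries retries
}
```

A volume that reports Resolving twice and then Hoarding succeeds on the
third Get, while one that never leaves Resolving gives up with ETIMEDOUT
instead of spinning forever.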
Received on 2001-08-15 10:10:12