On Wed, Aug 15, 2001 at 09:35:21AM -0400, Greg Troxel wrote:
> Store /coda/home/gdt/ELF-4 (length = 202597)
> Rename /coda/home/gdt/ELF-4 (to: /coda/home/gdt/FreeBSD/ELF-4/bpdksh-5.2.14.tgz)
> Rename /coda/home/gdt/FreeBSD/ELF-4/bpdksh-5.2.14.tgz (to: /coda/home/gdt/FreeBSD/ELF-4/pdksh-5.2.14.tgz)
...
> I would expect this to be a local/global conflict (I only have one server)
...

Ok, there are several known problems here.

One is that there is a server-server conflict. The resolution code doesn't know how to handle cross-directory renames and always marks all involved objects 'in-conflict'.

The second problem is that, due to the low-bandwidth connection, your client did a 'weak-reintegration'. This involves sending the updates to one server and then triggering a resolve to update the others. Normally a simple optimization cuts out the resolve RPC call, but maybe resolves are occasionally triggered for unknown reasons. This should be harmless, except that the rename resolution code dumbly marks all related objects as in conflict, although there are no other replicas to conflict with.

The third problem is that the server-server conflict blocks reintegration of subsequent operations and triggers a local-global conflict. This local-global conflict is not repairable until the server-server conflict is resolved from another client.

> As a minor issue, I am running with the following local patch, to get
> around lossage with the 28.8 link where the bw estimate gets too big
> and then I get timeouts.

Ehh, that is funny, this patch doesn't change anything in RPC2, which is where the timeouts occur. The only thing this patch does is clamp the bandwidth estimate down for venus, which results in reintegration sending fewer CML entries and chopping up large files into smaller bits.

I think I know what is causing those timeouts: weak-reintegration of a large batch of CML entries over a slow link results in a lot of resolve operations. (Are you sure there is only a single replica??)
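To illustrate why clamping the bandwidth estimate shrinks each reintegration batch, here is a rough model in C++. It is not Venus code: `CmlEntry`, `clamp_bandwidth`, `entries_in_batch`, the byte-per-second units, and the idea of a fixed time budget per batch are all assumptions made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one pending CML entry; only its payload size matters here.
struct CmlEntry {
    std::size_t bytes;
};

// Clamp a bandwidth estimate (bytes per second) to a ceiling.
inline unsigned long clamp_bandwidth(unsigned long estimate_bps,
                                     unsigned long ceiling_bps) {
    return estimate_bps > ceiling_bps ? ceiling_bps : estimate_bps;
}

// Count how many leading log entries fit into one batch, assuming the batch
// may carry at most bandwidth * seconds bytes. The first entry is always
// taken so that an oversized entry still makes progress (in this model it
// would be the one that gets sent in smaller pieces).
inline std::size_t entries_in_batch(const std::vector<CmlEntry>& log,
                                    unsigned long bandwidth_bps,
                                    unsigned long seconds) {
    assert(seconds > 0);
    const unsigned long long budget =
        static_cast<unsigned long long>(bandwidth_bps) * seconds;
    unsigned long long used = 0;
    std::size_t n = 0;
    for (const CmlEntry& e : log) {
        if (n > 0 && used + e.bytes > budget) {
            break;  // the next entry would exceed the budget
        }
        used += e.bytes;
        ++n;
    }
    return n;
}
```

In this model, an inflated estimate lets the whole log go out in one batch, while an estimate clamped to roughly what a 28.8 kbit/s link can carry (about 3600 bytes per second) cuts the batch short. Smaller batches would also mean fewer resolves triggered per reintegration, which fits the guess above about where the timeouts come from.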
I happened to notice last night that when an uncached object is looked up and the volume is in disconnected _or resolving_ state, the Get fails with ETIMEDOUT. The following patch should simply release the volume and reacquire it, allowing the resolve to go through. I'm not sure if it will loop like crazy, but we'll notice that quickly enough.

Jan

Index: fso0.cc
===================================================================
RCS file: /afs/cs/project/coda-src/cvs/coda/coda-src/venus/fso0.cc,v
retrieving revision 4.50
diff -u -u -r4.50 fso0.cc
--- fso0.cc	2001/05/17 21:26:52	4.50
+++ fso0.cc	2001/08/15 13:56:59
@@ -609,6 +609,7 @@
 
     /* Must ensure that the volume is cached. */
     volent *v = 0;
+retry_vdbget:
     if (VDB->Get(&v, key->Volume) != 0) {
         LOG(100, ("Volume not cached and we couldn't get it...\n"));
         return(ETIMEDOUT);
@@ -620,8 +621,14 @@
         goto RestartFind;
     }
 
+    if (v->state == Resolving) {
+        LOG(0, ("Volume resolvin and file not cached, retrying VDB->Get!\n"));
+        VDB->Put(&v);
+        goto retry_vdbget;
+    }
+
     /* Cut-out early if volume is disconnected! */
-    if (v->state != Hoarding && v->state != Logging) {
+    if (v->state == Emulating) {
         LOG(100, ("Volume disconnected and file not cached!\n"));
         VDB->Put(&v);
         return(ETIMEDOUT);

Received on 2001-08-15 10:10:12
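The control flow of the patch can be sketched in isolation as follows. This is a standalone model, not the Venus source: the state names come from the patch, but `VolState`, `check_volume`, the `get_state` callback (standing in for the `VDB->Get` / `VDB->Put` pair), and especially the `max_retries` cap are assumptions added for illustration. The patch itself retries without a limit; the cap only shows one way to guard against the "loop like crazy" case.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Volume states named in the patch; the enum itself is a hypothetical stand-in.
enum class VolState { Hoarding, Logging, Emulating, Resolving };

// Decide whether an uncached object may be fetched.
// get_state models "acquire the volume and read its state"; calling it again
// models releasing and reacquiring the volume.
// Returns 0 if the fetch may proceed, ETIMEDOUT otherwise.
inline int check_volume(const std::function<VolState()>& get_state,
                        int max_retries) {
    assert(max_retries >= 0);
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        const VolState s = get_state();
        if (s == VolState::Resolving) {
            continue;  // let the resolve go through, then look again
        }
        if (s == VolState::Emulating) {
            return ETIMEDOUT;  // disconnected and the object is not cached
        }
        return 0;  // Hoarding or Logging: connected, go fetch the object
    }
    return ETIMEDOUT;  // still resolving after max_retries extra attempts
}
```

Note that the second hunk also changes the disconnected test from "neither Hoarding nor Logging" to "Emulating". With the Resolving case handled just before it, the two tests agree for the four states modelled here.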