Coda File System

Re: server-server conflict doesn't seem resolvable

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Fri, 6 May 2005 10:37:51 -0400
On Thu, May 05, 2005 at 01:47:47PM -0600, Patrick Walsh wrote:
> 	We've hit our first server-server conflict and I've tried everything,
> but I can't seem to resolve it.  It's possible my attempts to resolve it
> have backfired and I need to reinitialize the volume (and I'd appreciate
> a pointer to the reference material on this if anyone has it).

Is this on a volume that was 'grown' from singly replicated to doubly
replicated? I bumped into a problem the other day when one of my servers
died as it tried to resolve a directory conflict in such a volume. It turns
out that even when resolution is reenabled, the directories in the
original replica that do not have any resolution log entries trigger a
null-pointer dereference in the log-based resolution path.

However, since you're already at the 'repair' stage, the servers have
already given up on automatic resolution, so you didn't get hit by this
problem.

> 	Here's the quickest example:
> 
> # repair /coda/director/httpd/html /tmp/fix -owner 500 -mode 755
> Server-server directory repair session started.
> Available commands:
>         comparedirs
>         removeinc
>         dorepair
> sophos.last was removed at some sites; should it be REMOVED at ALL
> sites? [N]: y
> The fix file may be empty but ....
> You still need a dorepair because the Version state is different
> VIOC_REPAIR /coda/director/httpd/html: Resource temporarily unavailable
> Repair failed.
> Repair session completed.

Is this the first repair you tried, or did you have a failed or aborted
repair on the same object before? You could try flushing the cached
replica objects. During the first repair the client pulls in the objects
from the underlying replicas, creates the fix-file, and sends it off
to the servers. The servers apply the operations and bump the version
vectors. Finally there is a check to see whether the directories are now
identical; if that fails, the object is marked in conflict again.

However, it seems that we either don't send callbacks when the version
vectors get bumped, or the client ignores them. So the next time the client
tries to repair, it is still looking at the (stale) directories in its
local cache, and the repair will always be rejected because of the
version-vector mismatch. I have seen this more with files than with
directories. To check whether this is the case:

    cfs br /coda/director/httpd/html		# expand the conflict
    cfs getfid /coda/director/httpd/html/*	# show version vectors
    cfs fl /coda/director/httpd/html/*		# flush cached replicas
    cfs getfid /coda/director/httpd/html/*	# refetch and show vvs
    cfs er /coda/director/httpd/html		# collapse the conflict

If the version vectors before and after the flush differ, repair should
now be able to fix the conflict. Otherwise, check the server logs; maybe
there is an ACL difference, or some object doesn't exist on all servers.
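
If you would rather not compare the getfid output by eye, the same check
can be wrapped in a small shell function. This is only a rough sketch: the
names snapshot_vvs and check_stale_cache are made up for illustration, and
it assumes nothing about the getfid output format beyond "the text changes
when the version vector changes", so it compares the output verbatim
instead of parsing it.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Coda.

# Print whatever `cfs getfid` reports for every replica under the
# expanded conflict directory given as $1.
snapshot_vvs() {
    cfs getfid "$1"/*
}

# Expand the conflict, record the getfid output, flush the cached
# replicas, record the output again, collapse the conflict, and report
# whether anything changed.
check_stale_cache() {
    dir=$1
    cfs br "$dir"                      # expand the conflict
    before=$(snapshot_vvs "$dir")      # what the client has cached
    cfs fl "$dir"/*                    # flush cached replicas
    after=$(snapshot_vvs "$dir")       # refetched from the servers
    cfs er "$dir"                      # collapse the conflict
    if [ "$before" != "$after" ]; then
        echo "stale"                   # cache was stale; retry the repair
    else
        echo "unchanged"               # look at the server logs instead
    fi
}
```

Running `check_stale_cache /coda/director/httpd/html` would print "stale"
if the flush changed what the client sees, in which case a fresh repair
is worth trying.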

Jan
Received on 2005-05-06 10:41:57