On Thu, May 05, 2005 at 01:47:47PM -0600, Patrick Walsh wrote:
> We've hit our first server-server conflict and I've tried everything,
> but I can't seem to resolve it. It's possible my attempts to resolve it
> have backfired and I need to reinitialize the volume (and I'd appreciate
> a pointer to the reference material on this if anyone has it).

Is this on a volume that was 'grown' from singly replicated to doubly
replicated? I bumped into a problem the other day when one of my servers
died as it tried to resolve a directory conflict in such a volume. It
turns out that even when resolution is reenabled, the directories in the
original replica that do not have any resolution log entries trigger a
null-pointer dereference in the log-based resolution path.

However, since you're already at the 'repair' stage, the servers have
already given up on automatic resolution, so you didn't get hit by this
problem.

> Here's the quickest example:
>
> # repair /coda/director/httpd/html /tmp/fix -owner 500 -mode 755
> Server-server directory repair session started.
> Available commands:
>         comparedirs
>         removeinc
>         dorepair
> sophos.last was removed at some sites; should it be REMOVED at ALL
> sites? [N]: y
> The fix file may be empty but ....
> You still need a dorepair because the Version state is different
> VIOC_REPAIR /coda/director/httpd/html: Resource temporarily unavailable
> Repair failed.
> Repair session completed.

Is this the first repair you tried, or did you have a failed or aborted
repair on the same object before?

You could try flushing the cached replica objects. During the first
repair the client pulls in the objects from the underlying replicas,
creates the fix-file, and sends it off to the servers. The servers apply
the operations and bump the version vectors. Finally there is a check to
see whether the directories are now identical; if that fails, the object
is marked in conflict again.
However, it seems that either we didn't send callbacks, or they get
ignored when the version vectors get bumped. So the next time the client
tries to repair, it is still looking at the (stale) directories in the
local cache, and the repair will always be rejected because of the
version-vector mismatch. I have seen this more with files than with
directories.

To check if this is the case,

    cfs br /coda/director/httpd/html        # expand the conflict
    cfs getfid /coda/director/httpd/html/*  # show version vectors
    cfs fl /coda/director/httpd/html/*      # flush cached replicas
    cfs getfid /coda/director/httpd/html/*  # refetch and show vvs
    cfs er /coda/director/httpd/html        # collapse the conflict

If the version vectors before and after the flush differ, repair should
now be able to fix the conflict. Otherwise, check the server logs; maybe
there is an ACL difference, or some object doesn't exist on all servers.

Jan

Received on 2005-05-06 10:41:57
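
The before/after comparison described above can be scripted, so the two getfid listings don't have to be compared by eye. This is only a rough sketch: it runs the same five cfs commands in the same order and compares the two getfid outputs byte for byte. The function name and the CFS override variable are made up for illustration, and it assumes that getfid prints identical text when nothing has changed.

```shell
#!/bin/sh
# Sketch only: wraps the five cfs commands from the message above.
# CFS is a hypothetical override so the function can be tried without a
# Coda client; by default the real cfs binary is used.
CFS="${CFS:-cfs}"

# Prints "changed" if the getfid output differs across the flush,
# "unchanged" otherwise. The name is illustrative, not part of Coda.
vv_changed_after_flush() {
    dir=$1
    before=$(mktemp) || return 2
    after=$(mktemp) || { rm -f "$before"; return 2; }

    "$CFS" br "$dir" >&2                  # expand the conflict
    "$CFS" getfid "$dir"/* > "$before"    # version vectors as cached
    "$CFS" fl "$dir"/* >&2                # flush cached replicas
    "$CFS" getfid "$dir"/* > "$after"     # refetch and show vvs
    "$CFS" er "$dir" >&2                  # collapse the conflict

    # Assumption: getfid output is byte-identical when nothing changed.
    if cmp -s "$before" "$after"; then
        echo unchanged
    else
        echo changed
    fi
    rm -f "$before" "$after"
}
```

After sourcing the above, `vv_changed_after_flush /coda/director/httpd/html` printing "changed" would correspond to the stale-cache case, where a fresh repair attempt is worth trying. "unchanged" points back at the server logs.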