On Thu, Jul 04, 2013 at 02:35:41AM -0400, Jan Harkes wrote:
[...]
> Odd, I don't think I have seen such a crash before, the usual cases that
> I see involve the server crashing because it ran out of available
> resolution log entries and the next mutating operation sent to the
> server triggers an assertion.

I have a great talent for bringing down such things :(. I promise to pay more attention next time, so maybe we will be able to reproduce this.

> > After restarting everything I still have the conflict in the same
> > node or its parent node depending on the situation.
>
> Was this directory by any chance moved from one directory to another?

No, it was just a copy -> crash -> repair sequence. The conflict started deep down the directory tree. I tried to repair it with removedir, but that crashed non-SCM, repaired to SCM, and the conflict just propagated up the path. I copied with cp -a, so maybe there was an attribute inconsistency between the two replicas, but for sure there were no renames or moves.

> With files I've seen rename related conflicts where the default repair
> suggestion when the source directory is resolved is to recreate the
> renamed object but repair then fails because the server already has that
> same object in the not-yet resolved destination directory. But this is
> different since it is a directory, and it is a remove.

I tried to remove the conflicting directory on both replicas, but that also crashed non-SCM.

> [...]
> Instead of removing the directory, recreate it on the other replica.

That worked up to a point. In the end I had a situation where the two replicas contained the same files and comparedirs generated an empty fix, but complained about the vectors being different.

> If that doesn't work, and it is reliably only one server that crashes,
> you can try to repair the conflict with only the other server running.
> If that works you can bring the crashed server back up, extract all the
> volume replica information with volutil info volumename.0 or .1 and then
> remove and recreate the corrupted replica and then repopulate the volume
> through runt resolution by doing a 'find /coda/path/to/volume -noleaf'.

That was the way to solve the problem.

> Good luck,
>
> Jan

Thank you, Jan. I very much appreciate your help.

Received on 2013-07-05 00:51:09