Coda File System

From: Jan Harkes <jaharkes_at_cs.cmu.edu> Date: Thu, 28 Feb 2008 16:01:29 -0500

On Thu, Feb 28, 2008 at 09:02:02PM +0100, u+codalist-p4pg_at_chalmers.se wrote:
> On Thu, Feb 28, 2008 at 02:18:57PM -0500, Jan Harkes wrote:
> > I guess in the single server case we are really being too pessimistic
> > and don't really need to interrupt reintegration to resolve because
> > there can't be a conflict between several servers, but that still
> > doesn't solve the more general case when there is replication.
> 
> Why not just retry?

Long story, but the summary is that reintegration happens in one thread,
which signals the cfs forcereintegrate worker when it completes. This
worker thread doesn't really know why reintegration completed; the
assumption is that we either pushed everything back, or we hit a
conflict or disconnection.

In the conflict/disconnection case it is not useful to retry because we
wouldn't be able to make progress either way. And we cannot really check
whether the number of CML entries remains the same, because it may be
that we are adding entries at the same rate as we are reintegrating. The
signal that is sent should probably include a status, but it is sort of
broadcast to any thread that happens to have run cfs fr and not to some
specific one.
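
To illustrate what "include a status" in a broadcast signal could look
like, here is a minimal Python sketch. It is not Coda's code: the class
name ReintegrationNotifier, its methods, and the status strings are all
hypothetical. A waiter takes a ticket before it starts waiting, so it
cannot miss a completion that happens in between, and every waiter sees
the same status.

```python
import threading


class ReintegrationNotifier:
    """Hypothetical broadcast signal that carries a completion status."""

    def __init__(self):
        self._cond = threading.Condition()
        self._generation = 0   # incremented on every completed pass
        self._status = None    # status of the most recent pass

    def subscribe(self):
        # Remember which pass was last seen before starting to wait.
        with self._cond:
            return self._generation

    def complete(self, status):
        # Called by the reintegration thread; wakes every waiter.
        with self._cond:
            self._generation += 1
            self._status = status
            self._cond.notify_all()

    def wait(self, ticket, timeout=None):
        # Block until a pass newer than `ticket` completes, then return
        # its status. Returns None if the timeout expires first.
        with self._cond:
            done = self._cond.wait_for(
                lambda: self._generation > ticket, timeout)
            return self._status if done else None
```

With something like this, a waiter could retry only when the status
says progress is still possible, and give up on a conflict or
disconnection.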

> It seems I have to refresh my knowledge of resolution.
> Isn't the create operation present in the server volume modification logs
> which they are to exchange? (You don't have to explain resolution
> once again here - unless you feel inclined so. I'll go and read elsewhere).

No problem. Resolution happens in several steps, and the first one is to
lock the object. After the store we trigger resolution on the file and
the server tries to lock it on all replicas. But the "disconnected"
server has never heard of the object, so the lock fails and we declare a
conflict.

The logs really only come into play for directory resolution. Assume we
didn't have the store conflict, but just some file/directory/symlink
created on only one replica. Then at some later point a client notices
that the directory versions are different between replicas and triggers
resolution, which roughly follows these steps:

- The servers lock the directory everywhere, collect the current version
  vectors and attribute information.
- If the version vectors are identical we're done.
- If the last store identifier on all servers is identical we only missed
  the COP2 message, so the version vectors are brought back in sync and
  we're done.
- If one server's version is strictly lower in all places, that server
  missed one or more updates; we copy the directory data from another
  server and are (most likely) done.
- If the version vectors indicate conflicting updates (e.g. <1 2> vs.
  <2 1>), both sites had an update that the other site missed. We then
  use log replay to apply any missed updates, check if the resulting
  directories are identical and we're done.
- If we get here, all attempts have failed and we declare a conflict.
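
The decision order in the steps above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the
server's code: the function names and returned labels are made up, a
version vector is modelled as a tuple with one counter per replica, and
"lower in all places" is approximated by componentwise comparison.
Locking, the actual log replay, and the final conflict case are not
modelled.

```python
def dominates(a, b):
    """True if vector a is at least b in every position."""
    return all(x >= y for x, y in zip(a, b))


def classify(vectors, store_ids):
    """Pick the resolution step for one directory.

    vectors:   one version vector per replica, e.g. [(1, 2), (2, 1)]
    store_ids: the last store identifier reported by each replica
    """
    vs = [tuple(v) for v in vectors]
    if len(set(vs)) == 1:
        return "in_sync"             # identical vectors, nothing to do
    if len(set(store_ids)) == 1:
        return "missed_cop2"         # same last store, only COP2 was lost
    if any(all(dominates(v, w) for w in vs) for v in vs):
        return "copy_from_dominant"  # some replica has seen every update
    return "log_replay"              # concurrent updates on both sides


print(classify([(1, 2), (2, 1)], ["a", "b"]))
```

The <1 2> vs. <2 1> example lands in the last branch because neither
vector dominates the other.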

When the servers hit the locking issue because one replica is missing,
they should probably recurse up to the parent directory and try to
resolve that first. That is what we do in most other situations, for
instance when we have a problem resolving a rename.

Jan

Re: what about adding such cfs subcommand?