On Thu, Feb 28, 2008 at 09:02:02PM +0100, u+codalist-p4pg_at_chalmers.se wrote:
> On Thu, Feb 28, 2008 at 02:18:57PM -0500, Jan Harkes wrote:
> > I guess in the single server case we are really being too pessimistic
> > and don't really need to interrupt reintegration to resolve because
> > there can't be a conflict between several servers, but that still
> > doesn't solve the more general case when there is replication.
>
> Why not just retry?

Long story, but the summary is that reintegration happens in one thread
which signals the cfs forcereintegrate worker when it completes. This
worker thread doesn't really know why reintegration completed, the
assumption is that we either pushed everything back, or we hit a
conflict or disconnection. In the conflict/disconnection case it is not
useful to retry because we wouldn't be able to make progress either way.
And we cannot really check if the number of CML entries remains the same
because it may be that we are adding entries at the same rate as we are
reintegrating.

The signal that is sent should probably include a status, but it is sort
of broadcast to any thread who happens to have run cfs fr and not some
specific one.

> It seems I have to refresh my knowledge of resolution.
> Isn't the create operation present in the server volume modification logs
> which they are to exchange? (You don't have to explain resolution
> once again here - unless you feel inclined so. I'll go and read elsewhere).

No problem, resolution happens in several steps and the first one is to
lock the object. After the store we trigger resolution on the file and
the server tries to lock it on all replicas. But the "disconnected"
server has never heard of the object so it fails and we declare a
conflict.

The logs really only come into play for directory resolution. Assume we
didn't have the store conflict, but just some file/directory/symlink
created on only one replica.
Then at some later point a client notices that the directory versions
are different between replicas and triggers resolution, which roughly
follows these steps:

- The servers lock the directory everywhere, collect the current version
  vectors and attribute information.
- If the version vectors are identical we're done.
- If the last store identifier on all servers is identical we only
  missed the COP2 message, so the version vectors are brought back in
  sync and we're done.
- If any of the versions is strictly lower in all places, that server
  missed one or more updates; we copy the directory data from another
  server and are (most likely) done.
- If the version vectors indicate conflicting updates (e.g. <1 2> vs.
  <2 1>), meaning both sites had an update that the other site missed,
  then we use log replay to apply any missed updates, check if the
  resulting directories are identical and we're done.
- If we get here, all attempts have failed and we declare a conflict.

When the servers hit the locking issue because one replica is missing
they should probably recurse up to the parent directory and try to
resolve that first. That is what we do in most other situations, for
instance when we have a problem resolving a rename.

Jan

Received on 2008-02-28 16:02:33
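
As a rough illustration of the directory resolution steps described in
the message, here is a minimal Python sketch of how the version vector
checks could be ordered. This is not the Coda server code: the names
classify, dominates and Outcome are hypothetical, version vectors are
assumed to be plain tuples with one counter per replica, and "strictly
lower in all places" is interpreted here as one replica's vector being
at least as new as every other vector in every slot.

```python
from enum import Enum


class Outcome(Enum):
    IN_SYNC = "version vectors identical"
    MISSED_COP2 = "only COP2 was missed, resync version vectors"
    COPY_FROM_DOMINANT = "copy directory data from the up-to-date replica"
    LOG_REPLAY = "concurrent updates, try log replay"


def dominates(a, b):
    """True if vector a is at least as new as vector b in every slot."""
    return all(x >= y for x, y in zip(a, b))


def classify(vectors, store_ids):
    """Pick the resolution step for one directory (hypothetical helper).

    vectors   -- one version vector (tuple of ints) per replica
    store_ids -- the last store identifier seen by each replica
    """
    if all(v == vectors[0] for v in vectors):
        return Outcome.IN_SYNC
    if all(s == store_ids[0] for s in store_ids):
        return Outcome.MISSED_COP2
    # Some replica has seen everything the others have seen.
    if any(all(dominates(v, w) for w in vectors) for v in vectors):
        return Outcome.COPY_FROM_DOMINANT
    # No replica dominates, e.g. <1 2> vs. <2 1>.
    return Outcome.LOG_REPLAY


print(classify([(1, 2), (2, 1)], ["a", "b"]).name)
```

A conflict is not one of the outcomes here because, per the steps
above, it is only declared after log replay has been tried and the
resulting directories still differ.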