A long-delayed response: actually, local-global conflicts are never automatically resolved; resolution only occurs between servers. But I know what you mean.

Well, my complaints are:

1) There was a conflict when only one client was operating. This should never happen.

2) More seriously, I couldn't fix the conflict, because something really screwed up was going on.

It may be that working on the 'cells' branch and unkludging local-global repair should be done before this is addressed.

The problem is probably related to the 'store-id'. Every operation is tagged with a unique store-id so that the server can detect when an operation is retried and simply return EALREADY. What you are seeing seems to be caused by the client timing out on the connected store operation, then switching back to write-disconnected and logging the store in the CML.

I concur (store-id confusion across the transition to disconnected operation), and if so this seems like a serious bug well worth fixing. It would not seem burdensome for clients to keep track of operations sent to the server for which no ack has been received, to make sure retries are treated as retries rather than as conflicting operations (a minimal sketch of such bookkeeping appears at the end of this message). But I am unclear on how hard this is.

The second problem I see is that the reply packets are obviously getting lost somehow; either that, or the server is not yielding to other LWPs for more than 15-30 seconds, so that it is unable to send back RPC2_BUSY responses when the client sends its retries. The ACKs and BUSYs should be pretty small, so this cannot be fragmentation related. This happens during high load, so I think the loss can occur even if the behavior under load could be improved.

The remove before the rename isn't that strange. It happens when an object is renamed over an existing object; any existing target is then automatically removed. The remove is added whenever the client knows it is renaming over an existing object, and the rename reintegration explicitly does _not_ overwrite existing target objects.

So is there some way that we can be sure that the server will apply the remove and rename atomically? This is the whole point of atomic rename from rcs's point of view. Perhaps this behavior should be changed so that the rename CML entry takes both the name/store-id of the source and of the destination (a possible shape for such a record is sketched at the end of this message). I have not thought much about this, and could be way off base.

    fd = open("path-c2", O_CREAT | O_WRONLY, 0);   -> Create
    write(fd, data);
    fchmod(fd, 0644);                              -> Chmod
    rename("path-c2", "path-c3");                  -> Remove/Rename
    unlink(".pvect_path-c3");                      -> Remove
    symlink("e9c222dc", ".pvect_path-c3");         -> Symlink
    close(fd);                                     -> Store

The symlink is cfs storing the IV for the new file. The close delay is due to cfs, I think, which closes an fd after a timeout or when it needs to open another file. This is an optimization which may be causing trouble, but without it, each write and fchmod would have an open/close cycle. Perhaps I can make rename force a close; that's almost always what one would want.

The Bob solution (*BSD) was to block everything for a period of 1 minute and then automatically return to the user with an error. So it seems that the block works, and the real problem is that the venus operation was not yet finished, for no apparent reason. It seems to me that the bug is the venus op failure, not the blocking state of things.
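As promised above, here is a minimal sketch (in C, with made-up names and sizes; this is not Coda's actual code) of the retry bookkeeping I had in mind: the client remembers every store-id it has sent but not yet seen acknowledged, so that after a timeout and a transition to write-disconnected mode a resend can be recognized as a retry of an earlier send rather than logged as a new, conflicting operation.

    /* Hypothetical retry table; a real implementation would grow it
     * or back off when full. */
    #define MAX_PENDING 64

    struct pending_op {
        unsigned long store_id;   /* unique tag sent with the operation */
        int           in_use;
    };

    static struct pending_op pending[MAX_PENDING];

    /* Remember a store-id when the RPC is first sent. */
    void pending_record(unsigned long store_id)
    {
        for (int i = 0; i < MAX_PENDING; i++) {
            if (!pending[i].in_use) {
                pending[i].store_id = store_id;
                pending[i].in_use = 1;
                return;
            }
        }
    }

    /* Forget a store-id once the server's ack arrives. */
    void pending_ack(unsigned long store_id)
    {
        for (int i = 0; i < MAX_PENDING; i++)
            if (pending[i].in_use && pending[i].store_id == store_id)
                pending[i].in_use = 0;
    }

    /* Before logging an operation to the CML after a timeout, check
     * whether it is really an unacknowledged earlier send; if so, keep
     * the old store-id so the server can answer EALREADY on retry. */
    int pending_is_retry(unsigned long store_id)
    {
        for (int i = 0; i < MAX_PENDING; i++)
            if (pending[i].in_use && pending[i].store_id == store_id)
                return 1;
        return 0;
    }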
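And here is a possible shape for the combined rename record suggested earlier; this is not the actual CML format, and every name below is made up. Carrying the store-ids of both the source and any existing target would let the server validate both objects and perform the implicit remove plus the rename as a single atomic step, flagging a genuine conflict only when the target has really changed.

    /* Stand-ins for real server routines. */
    extern void begin_transaction(void);
    extern void abort_transaction(void);
    extern void commit_transaction(void);
    extern int  target_matches(const char *name, unsigned long store_id);
    extern void remove_object(const char *name);
    extern void do_rename(const char *from, const char *to);

    struct cml_rename {
        unsigned long op_store_id;     /* store-id tagging this entry   */
        char          from_name[256];  /* source name                   */
        unsigned long from_store_id;   /* last known store-id of source */
        char          to_name[256];    /* destination name              */
        int           target_exists;   /* nonzero if a target existed   */
        unsigned long target_store_id; /* last known store-id of target */
    };

    /* Server side: apply the remove and the rename in one transaction,
     * refusing (and leaving the conflict for repair) if the target no
     * longer matches what the client saw. */
    int apply_rename(const struct cml_rename *r)
    {
        begin_transaction();
        if (r->target_exists) {
            if (!target_matches(r->to_name, r->target_store_id)) {
                abort_transaction();
                return -1;   /* genuine conflict */
            }
            remove_object(r->to_name);
        }
        do_rename(r->from_name, r->to_name);
        commit_transaction();
        return 0;
    }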