On Wed, Apr 16, 2003 at 12:16:50PM -0400, Samir Patel wrote:
> On Tue, 15 Apr 2003, Jan Harkes wrote:
> > to a local and a global directory. However the enumeration step failed,
> > maybe we got completely disconnected from the servers and global is a
> > dangling link instead of a proper directory (as I just noticed is the
> > case further down in your email).
>
> If the client goes into disconnected mode, why would it then get
> completely disconnected from the server? This is a desktop sitting
> on a LAN.

It is not completely disconnected here; the global object simply doesn't
exist on the servers.

A client can go write-disconnected or completely disconnected when the
server is considered too slow. This can be caused by network
latency/packet loss, or by a server getting overloaded.

Because clients automatically disconnect when networks become congested
or servers have too much load, everything nicely balances out (network
and server load go down). Clients retest the 'water' about once every
5 minutes and reconnect. Reconnections and reintegrations typically use
a lot less server CPU because clients optimize away unnecessary
operations (create tmpfile, store tmpfile, remove tmpfile) and we process
up to 100 operations in a single server transaction.

> > > If I do an 'ls -l blah' now, I get:
> > >
> > > lrw-r--r-- 1 root nfsnobod 27 Apr 15 13:35 global ->
> > > @7f000000.000018f2.000007ef
> > > -rw-rw-r-- 1 samir nfsnobod 14 Apr 15 12:19 local
> >
> > This is easily explained. The object does not exist on the servers, so
> > ofcourse we cannot show the global object. I believe this conflict
> > 'should' be propagated to a directory conflict in the containing
> > directory. Perhaps it fails to do so because your shell process has
> > 'pinned' down the directory and therefore we fail to turn the directory
> > into a conflict.
> > Or maybe there is something wrong in the local-global
> > repair handling of store conflicts, this is an update/remove type
> > conflict that hasn't been tested as frequently as update/update
> > conflicts (when both clients concurrently try to update the file).
>
> What are the plans [if any] on fixing these sort of things?

Repair was fixed to allow repairs to complete when not all replicas are
available. However, in this case it might just claim there is nothing to
repair, which is in fact correct: the real conflict is on the parent
directory, because the removal of the object is a directory operation.

This is a clear example where Coda semantics are very different from the
Unix semantics that we all know and love. Unix semantics for this case
would be that the stored data is simply dropped. However, because of
Coda's disconnected (and lockless) operation, local changes to that file
are considered more precious than the global state.

If you look at how most text editors work with files this does make some
sense. An editor often moves the original file to a backup copy and then
writes out a new copy of the file. If the write was successful and no
backup copies are kept, the original is then removed.

Imagine I'm disconnected for a week or two and do a major rewrite of the
Coda webpages. In the meantime someone at CMU fixes up a small typo in
one of the pages. Should my significant rewrite, and possibly 2 weeks of
work, simply get dropped? Right now I believe we should definitely keep
Coda semantics for these situations, just because they are more likely
to occur as a result of disconnected operation.

However, things are not perfect even with Coda semantics. If the editor
didn't remove the original copy of the file, the store would have gone
to file~ (or file.orig/bak) instead of the intended file and no conflict
is declared.
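
To make the editor behaviour above concrete, here is a rough sketch of
that save sequence in Python. It is only an illustration: the function
name save_with_backup and the keep_backup flag are made up for this
example, and real editors differ in the details. The point is that one
"save" is really a rename, a create, a store and (optionally) a remove,
so the object that ends up under the original name is a new one, and
the original lives on (or dies) under the backup name.

```python
import os
import tempfile


def save_with_backup(path, new_text, keep_backup=False):
    """Hypothetical editor save: move the original aside, write a new file."""
    backup = path + "~"
    had_original = os.path.exists(path)
    if had_original:
        # The original object now lives under the backup name.
        os.replace(path, backup)
    # A brand new object is created under the original name and stored.
    with open(path, "w") as f:
        f.write(new_text)
    if had_original and not keep_backup:
        # Write succeeded and no backups are kept: the original is removed.
        os.remove(backup)
    return backup


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    page = os.path.join(workdir, "index.html")
    with open(page, "w") as f:
        f.write("original page")
    save_with_backup(page, "major rewrite", keep_backup=True)
    print(sorted(os.listdir(workdir)))
```

With keep_backup=True the original object survives as index.html~,
which is the case where a store from another client could end up in the
backup file rather than the intended one.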
So in the long term we might just want to minimize what types of
conflicting updates result in a local-global conflict and consider
something like a per-volume lost+found directory that catches lost store
operations.

Jan

Received on 2003-04-16 13:38:55
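
As a footnote to the remark earlier in this message about clients
optimizing away unnecessary operations before reintegration, here is a
toy model of the idea in Python. It is a sketch under assumptions, not
the actual client code: the (kind, name) log format and the functions
optimize and batches are invented for illustration, and the real
modification log has more operation types and rules. Only the
create/store/remove tmpfile case and the limit of 100 operations per
transaction come from the message itself.

```python
from typing import List, Optional, Tuple

Op = Tuple[str, str]  # (kind, filename); kind is "create", "store" or "remove"
BATCH_SIZE = 100      # "up to 100 operations in a single server transaction"


def optimize(log: List[Op]) -> List[Op]:
    """Drop operations on files that were created and removed again
    while disconnected; the server never needs to hear about them."""
    out: List[Optional[Op]] = list(log)
    pending = {}  # filename -> index of a create that is not yet cancelled
    for i, (kind, name) in enumerate(log):
        if kind == "create":
            pending[name] = i
        elif kind == "remove" and name in pending:
            start = pending.pop(name)
            # Cancel the create, the remove, and everything in between
            # that touched the same file.
            for j in range(start, i + 1):
                entry = out[j]
                if entry is not None and entry[1] == name:
                    out[j] = None
    return [op for op in out if op is not None]


def batches(ops: List[Op], size: int = BATCH_SIZE) -> List[List[Op]]:
    """Split the remaining operations into per-transaction chunks."""
    return [ops[i:i + size] for i in range(0, len(ops), size)]


if __name__ == "__main__":
    log = [("create", "tmpfile"), ("store", "tmpfile"),
           ("store", "paper.tex"), ("remove", "tmpfile")]
    print(optimize(log))
```

A remove of a file that already existed before the disconnection has no
pending create, so it is kept and still reaches the server.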