On Mon, May 02, 2005 at 03:22:13PM -0600, Patrick Walsh wrote:
> We have cron jobs that run as local user root and coda uid 502. Apache
> runs as user www-data with coda uid 501. But there's a catch. We
> forgot that apache needs to start as user root in order to listen on
> ports 80 and 443. It also does its logging as user root. Only its
> children listener processes run as user www-data. So apache was
> starting with proper coda permissions, then a cron job was essentially
> logging user root in to coda with new tokens, thus disconnecting apache
> from its log directory and causing conflicts.

The kernel module remembers the user-id in effect when the file was
opened and uses the same user-id when it is closed. This is done not
only for cases like these, but also in case a file descriptor is passed
from one process to another.

Effectively you would allow root (Coda userid 502) to write to the
apache logs directory. If you associate root with Coda user 501 and
later switch it to Coda user 502, then it would lose the permissions it
had when the file was opened for writing.

> Unfortunately, the conflict resolution process was frustrating because
> it would show identical files in "local" and "global" down to the file
> size and time stamp.

The checklocal command inside the repair utility, or cfs getfid, would
probably show a different version-vector/store-id for the local and
global files. Coda never uses things like size or timestamp information
to detect conflicts.

> This is disappointing because it makes me worry that we could get
> hard-to-reproduce occasional conflicts in coda on our backend. And in a
> server that is basically automated, the problem could go unnoticed for
> some time. I wonder if venus could detect when the server is on the
> localhost and then adjust itself to be more patient? I'm sure it's

Even when venus is more patient we cannot guarantee that there will not
be a conflict. That is because everything is done optimistically.
You'd need a (distributed) lock manager, or to obtain a lock from a
quorum of (N/2)+1 servers, to guarantee that an operation will not
result in a conflict. The way we perform any operation, we assume that
the server is available and the version vectors are as expected. We
cannot rely on callbacks for conflict avoidance, since the operation has
already completed before the callback is sent, so callbacks are only
usable to invalidate cached entries for reading. Any mutating operation
would first have to check that the objects in question haven't changed,
and lock them to prevent concurrent updates.

> common to want the server to be able to mount the coda files. Intuition
> says that that should be the most reliable setup instead of the
> least-reliable setup. It might be slower, but it shouldn't be less
> reliable. We're putting `cfs strong`'s all over the place now to try to
> keep things from getting write-disconnected. Is there anything else we
> can do to enforce this? Slow writes are not a problem so much as
> conflicts are.

No, cfs strong only prevents the switches between connected and
weakly-connected mode that result from bandwidth estimates. It doesn't
make us lock objects we intend to modify. It doesn't prevent us from
switching to disconnected mode because of an rpc2 timeout. It doesn't
let us know that another client modified a locally cached object until
after the fact, etc.

I strongly believe that connected mode (and cfs strong) mostly provide
the user with the perception that he won't get conflicts, and in 99% of
the cases this perception is probably true. But the remaining 1% will
still cause conflicts or disconnections and reintegrations.

So I'd rather work on making (write-)disconnected mode, reintegration
and repair reliable enough that people don't really have a need for
connected mode anymore.
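
To make the version-vector and quorum points above a bit more concrete,
here is a minimal Python sketch. It is not Coda source code: the names
compare_vv, quorum_size and Order, and the tuple-of-counters
representation, are made up for illustration, and the sketch ignores
store-ids entirely. It only shows why two copies with identical size and
timestamp can still be in conflict, and what (N/2)+1 works out to.

```python
from enum import Enum


class Order(Enum):
    EQUAL = "equal"
    LOCAL_NEWER = "local dominates"
    GLOBAL_NEWER = "global dominates"
    CONFLICT = "concurrent updates (conflict)"


def compare_vv(local, remote):
    """Compare two version vectors (hypothetical: one counter per replica)."""
    if len(local) != len(remote):
        raise ValueError("version vectors must have the same length")
    local_ahead = any(l > r for l, r in zip(local, remote))
    remote_ahead = any(r > l for l, r in zip(local, remote))
    if local_ahead and remote_ahead:
        # Each side has seen an update the other has not: neither copy
        # dominates, regardless of file size or timestamp.
        return Order.CONFLICT
    if local_ahead:
        return Order.LOCAL_NEWER
    if remote_ahead:
        return Order.GLOBAL_NEWER
    return Order.EQUAL


def quorum_size(n_servers):
    """Smallest majority of n_servers, i.e. (N/2)+1 with integer division."""
    return n_servers // 2 + 1
```

For example, compare_vv((2, 1), (1, 2)) reports a conflict even if the
two files are byte-for-byte the same length, and quorum_size(3) is 2.
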
Ideally we'd always be write-disconnected and use the 'application
specific resolvers' (ASRs) to make sure that even if there is a conflict
it will be repaired automatically. The simplest ASRs would just force
the local copy to overwrite the server's version (AFS semantics), or
pick a random copy and create backups of the other conflicting copies.
But reliable conflict resolution with ASRs requires repair to work in
all possible situations, and right now there are still far too many
cases of unnecessary or unrepairable conflicts.

The only other reason for connected mode that I see is that people want
their updates to be visible on other clients as soon as they 'save a
file', but that can be done with a synchronous mode where we force a
reintegration before returning to the application.

Jan

Received on 2005-05-03 18:48:27