Coda File System

From: Jan Harkes <jaharkes_at_cs.cmu.edu> Date: Wed, 16 Apr 2003 13:32:50 -0400

On Wed, Apr 16, 2003 at 12:16:50PM -0400, Samir Patel wrote:
> On Tue, 15 Apr 2003, Jan Harkes wrote:
> > to a local and a global directory. However the enumeration step failed,
> > maybe we got completely disconnected from the servers and global is a
> > dangling link instead of a proper directory (as I just noticed is the
> > case further down in your email).
> 
> If the client goes into disconnected mode, why would it then get
> completely disconnected from the server?  This is a desktop sitting
> on a LAN.

It is not completely disconnected here; the global object simply doesn't
exist on the servers. A client can go write-disconnected or completely
disconnected when the server is considered too slow. This can be caused
by network latency or packet loss, or by an overloaded server.

Because clients automatically disconnect when networks become congested
or servers have too much load, everything nicely balances out (network
and server load go down). Clients retest the 'water' about once every
5 minutes and reconnect. Reconnections and reintegrations typically use
a lot less server CPU because clients optimize away unnecessary
operations (create tmpfile, store tmpfile, remove tmpfile) and we
process up to 100 operations in a single server transaction.
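
As a rough illustration of the log optimization and batching described
above, here is a toy sketch in Python. It is not the actual client code:
the Op record and the optimize_log and batches helpers are made-up names,
and only the batch size of 100 comes from the text.

```python
from collections import namedtuple

# Hypothetical log record; kind is "create", "store" or "remove".
Op = namedtuple("Op", ["kind", "path"])

BATCH_SIZE = 100  # "up to 100 operations in a single server transaction"


def optimize_log(log):
    """Drop all operations on files that were created and later removed
    within the same log; the server never needs to see them."""
    live = {}        # path -> indices of ops since its create in this log
    dropped = set()  # indices of ops that cancel out
    for i, op in enumerate(log):
        if op.kind == "create":
            live[op.path] = [i]
        elif op.path in live:
            live[op.path].append(i)
            if op.kind == "remove":
                dropped.update(live.pop(op.path))
    return [op for i, op in enumerate(log) if i not in dropped]


def batches(ops, size=BATCH_SIZE):
    """Yield the remaining operations in chunks of at most `size`."""
    for start in range(0, len(ops), size):
        yield ops[start:start + size]
```

With a log of create tmpfile, store tmpfile, store page.html, remove
tmpfile, only the store to page.html would survive and be sent.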

> > > If I do an 'ls -l blah' now, I get:
> > >
> > > lrw-r--r--    1 root     nfsnobod       27 Apr 15 13:35 global -> @7f000000.000018f2.000007ef
> > > -rw-rw-r--    1 samir    nfsnobod       14 Apr 15 12:19 local
> >
> > This is easily explained. The object does not exist on the servers, so
> > of course we cannot show the global object. I believe this conflict
> > 'should' be propagated to a directory conflict in the containing
> > directory. Perhaps it fails to do so because your shell process has
> > 'pinned' down the directory and therefore we fail to turn the directory
> > into a conflict. Or maybe there is something wrong in the local-global
> > repair handling of store conflicts, this is an update/remove type
> > conflict that hasn't been tested as frequently as update/update
> > conflicts (when both clients concurrently try to update the file).
> 
> What are the plans [if any] on fixing these sort of things?

Repair was fixed to allow repairs to complete when not all replicas are
available. However, in this case it might just claim there is nothing to
repair, which is in fact correct: the real conflict is on the parent
directory, because the removal of the object is a directory operation.

This is a clear example where Coda semantics are very different from
Unix semantics that we all know and love. Unix semantics for this case
would be that the stored data is simply dropped. However, because of
Coda's disconnected (and lockless) operation, local changes to that file
are considered more precious than the global state.

If you look at how most text editors work with files this does make some
sense. An editor often moves the original file to a backup copy and then
writes out a new copy of the file. If the write was successful and no
backup copies are kept, the original is then removed. Imagine I'm
disconnected for a week or two and do a major rewrite of the Coda
webpages. In the meantime someone at CMU fixes a small typo in one of
the pages. Should my significant rewrite, possibly 2 weeks of work,
simply get dropped?
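
The editor save sequence described above looks roughly like this. It is a
minimal sketch only: save_with_backup and the "~" suffix are illustrative
choices, and real editors differ in the details.

```python
import os


def save_with_backup(path, new_text, keep_backup=False):
    """Save new_text to path the way many editors do."""
    backup = path + "~"
    if os.path.exists(path):
        os.replace(path, backup)   # move the original to a backup copy
    with open(path, "w") as f:     # write out a new copy of the file
        f.write(new_text)
    if not keep_backup and os.path.exists(backup):
        os.remove(backup)          # write succeeded, drop the original
```

From the file system's point of view the original object is removed and a
new one is stored under the same name, which is why a save made while
disconnected shows up as a remove plus a store rather than a plain update.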

Right now I believe we should definitely keep Coda semantics for these
situations, just because they are more likely to occur as a result of
disconnected operation. However, things are not perfect even with Coda
semantics. If the editor hadn't removed the original copy of the file,
the store would have gone to file~ (or file.orig/bak) instead of the
intended file and no conflict would have been declared. So in the long
term we might just want to minimize the types of conflicting updates
that result in a local-global conflict and consider something like a
per-volume lost+found directory that catches lost store operations.
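
To make the lost+found idea a bit more concrete, here is a toy sketch.
Nothing like this exists in Coda today; replay_store and the two
dictionaries standing in for a volume and its lost+found directory are
purely hypothetical, and version checks for update/update conflicts are
left out.

```python
def replay_store(server_files, lost_found, path, data):
    """Replay a disconnected store against the server's state.

    server_files: dict mapping path -> contents (stand-in for a volume)
    lost_found:   dict that catches stores whose target is gone
    Returns "stored" or "lost+found".
    """
    if path in server_files:
        server_files[path] = data  # target still exists, normal store
        return "stored"
    lost_found[path] = data        # target was removed, keep the data
    return "lost+found"
```

The point is only that the stored data is kept somewhere reachable
instead of either being dropped or forcing a local-global conflict.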

Jan

Re: Simple problem... I think