Coda File System

Re: coda very slow on roadwarrior

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Tue, 27 Feb 2007 16:20:52 -0500
On Tue, Feb 27, 2007 at 07:45:49PM +0100, Enrico Weigelt wrote:
> > I'm not too concerned with the amount of data on the wire, adding
> > attribute information of all children of a small directory along with
> > the directory data is possible. However in the Coda cache those
> > attributes are not stored as directory data, but as file-objects with
> > only attribute information. And the cache has a limited number of
> > file-objects so we would have to throw out existing objects to make
> > room for this (possibly never used) attribute information
> 
> So we could either increase the size of the filetable (maybe such 
> partial information could be stored a little bit more efficiently)
> or additionally store these replies as file-data.

Since the file system doesn't know which attributes the application is
interested in when it calls stat, we would still need to fetch all of
them.

Also Coda clients use the getattr result to detect conflicting versions,
while file and directory contents are only fetched from a single server.
So if we piggyback file attributes with the directory data the client
would not see differences between replicas until we try to revalidate
the cached attributes, or fetch the data from another replica and notice
the version mismatch.
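
To make the conflict detection angle concrete, here is a very
simplified sketch (idealized version vectors and helper names, not the
actual venus logic) of why attributes need to come from every replica:

    # Each replica reports a version vector for the object.  If one
    # vector dominates the other, one replica merely lags behind; if
    # neither dominates, the replicas have diverged (a conflict).
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b))

    def check_replicas(version_vectors):
        latest = version_vectors[0]
        for vv in version_vectors[1:]:
            if dominates(vv, latest):
                latest = vv
            elif not dominates(latest, vv):
                return 'conflict'   # diverged replicas, needs repair
        return 'ok'

    # Attributes piggybacked from a single server would skip this
    # comparison, so the divergence would only show up later.
    check_replicas([[2, 1, 1], [1, 2, 1], [1, 1, 1]])   # 'conflict'
    check_replicas([[2, 1, 1], [1, 1, 1], [1, 1, 1]])   # 'ok'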

> > No, we do need the directory contents to look up the object identifier
> > of the next path component so we do need 4x dir_get().
> 
> hmmpf. best would be if the kernel could directly open the subdir
> (aka pass this call to the fs) ... I imagine such things would get
> tricky with certain mount situations. 

Once the subdirectory has been opened, the in-kernel directory cache
will contain all the necessary information and we don't need the
lookups anymore. But if nothing is cached, the path is resolved by
doing repeated lookups,

    d1 = lookup(root, 'coda')
    d2 = lookup(d1, 'coda.cs.cmu.edu')
    d3 = lookup(d2, 'usr')
    ...

And to do these lookups we need to get the directory contents, because
only directory data can tell us how to map the name 'coda' in the root
directory to the next-level object.
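
Spelled out a bit more, the walk looks roughly like the toy sketch
below (made-up helper names, not the actual kernel or venus
interfaces; dir_get() stands in for fetching the directory contents):

    # Toy sketch of path resolution by repeated lookups: each step
    # needs the directory's contents just to find the identifier of
    # the next component.
    tree = {'coda': {'coda.cs.cmu.edu': {'usr': {}}}}

    def dir_get(directory):
        # stands in for fetching the directory data from cache/server
        return directory

    def resolve(root, path):
        obj = root
        for name in path.strip('/').split('/'):
            entries = dir_get(obj)    # directory contents...
            obj = entries[name]       # ...map the name to the next object
        return obj

    resolve(tree, '/coda/coda.cs.cmu.edu/usr')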

> >     # tell venus where the lookaside index is located
> >     cfs lka +/usr/coda/venus.lka
> 
> In other words: use cached data from an old cache ?

Correct. You could also have a backup of your data somewhere on your
system and use that. If some parts of the backup are out of date it
doesn't matter; the checksum won't match and we'll fetch the latest
copy from the server.
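
The mechanism behind the lookaside index is roughly the following
(a simplified sketch with made-up helper names, not the actual venus
implementation; I'm assuming a SHA-1 style content checksum here):

    import hashlib, os

    # Build an index mapping content checksums to local files, e.g.
    # over a backup tree or an old cache.
    def build_index(backup_root):
        index = {}
        for dirpath, _, files in os.walk(backup_root):
            for name in files:
                path = os.path.join(dirpath, name)
                with open(path, 'rb') as f:
                    index[hashlib.sha1(f.read()).hexdigest()] = path
        return index

    def fetch_from_server(checksum):
        # placeholder for the normal fetch over the network
        raise NotImplementedError

    def fetch(wanted_checksum, index):
        local = index.get(wanted_checksum)
        if local:        # checksum matches, use the local copy
            return open(local, 'rb').read()
        # stale or missing locally, fall back to the server
        return fetch_from_server(wanted_checksum)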

> > to fully utilize the available bandwidth. Of course if your latencies
> > are in the order of 2 seconds we will never be able to push more than
> > about 128Kbit/s, and on a faster link that would definitely make it look
> > like SFTP is doing a stop and wait data transfer.
> 
> Okay. Is it possible to increase this window ? 

Probably, but I haven't really used that option much and it may just
cause other problems. Many routers assume smooth window scaling, and
if for instance we increase the number of packets sent at a time from
8 to 16 or so, we would only worsen the problem if there is a queueing
issue in an intermediate router.

If we allow more data on the wire, we need more memory to keep around
packets that may need to be retransmitted, which would hurt on servers
that are dealing with a lot of clients, etc.
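
For reference, the throughput numbers work out as simple window/RTT
arithmetic. The ~4 KB per packet below is inferred from the 8 packet
window and 128 Kbit/s figure earlier in this thread, so treat it as a
back-of-the-envelope sketch rather than the exact RPC2/SFTP sizes:

    # At most one window of data can be in flight per round trip.
    def throughput_kbit(window_packets, packet_bytes, rtt_seconds):
        return window_packets * packet_bytes * 8 / 1000.0 / rtt_seconds

    throughput_kbit(8, 4096, 2.0)     # ~131 Kbit/s at a 2 s round trip
    throughput_kbit(16, 4096, 2.0)    # ~262 Kbit/s, but only if the
                                      # larger burst doesn't aggravate
                                      # queueing in an intermediate router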

> BTW: why isn't just TCP used for longer transmissions ? TCP seems to 
> be quite advanced and can be tweaked for special needs.

Back in the day when AFS2 was built, select() would only accept 16
file descriptors and TCP used too many resources. AFS was designed to
scale to a campus-wide deployment (100's to 1000's of clients) and TCP
simply wasn't up to the job. Coda also focused a lot on wireless
(lossy) networks and had selective retransmissions pretty much from
the start, something that has only recently been added to TCP.

Nowadays servers have plenty of memory-efficient ways to poll many
sockets, so TCP may have become a feasible alternative. But there are
still some properties of the existing UDP-based protocol that are
useful to us, such as predictable detection of dead servers, the
ability to have 1000's of mostly inactive connections between clients
and servers, etc., and the fact that we can query our networking layer
for observed and estimated latency and bandwidth values to various
servers.

A client typically has many (logical) connections to a server, one
for each internal thread for each user, so a client with 2 or 3 users
can easily have 40-60 open connections to a server. Multiply that by a
hundred clients or more and you'd start hitting fd limits on many
systems. And most connections are not used that much, so TCP
connections would probably be somewhat sluggish (slow start) whenever
the first RPC is sent.
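
Back-of-the-envelope, with the numbers from the paragraph above (the
~20 threads per user is inferred from the 40-60 figure, not an exact
count):

    threads_per_user = 20     # inferred, see above
    users_per_client = 3
    clients = 100

    conns_per_client = threads_per_user * users_per_client   # ~60
    conns_per_server = conns_per_client * clients             # ~6000
    # With one TCP socket (and file descriptor) per connection, a
    # server would be well past the historical select()/fd limits.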

Jan
Received on 2007-02-27 16:24:31