Coda File System

Re: venus-kernel interface

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Sat, 2 Oct 2004 11:40:53 -0400
On Thu, Sep 30, 2004 at 04:02:01PM -0500, Troy Benjegerdes wrote:
> If you added linux & solaris cachefs support to the coda kernel module,
> you'd get around the objections people have to podfuk.   ;)

Don't get too stuck on the idea of using cachefs. It isn't really what
you expect it to be, and should really be named 'fs cache helper'.

Network filesystems like NFS, SMBFS and the in-kernel AFS implementation
do not have a persistent on-disk cache. So any page of data that is
fetched from the server will only live in VM, and it will be simply
dropped when there is memory pressure (and re-fetched across the network
on the next access).

Cachefs simply provides a form of swapping to local disk for VM pages
whose backing store is not local. It doesn't even do this in a generic
way; the filesystem needs to be modified so that every readpage checks
the cache and, if the page is not cached or is stale, refetches and
repopulates it, while writepage invalidates or updates the cached page.
On a local GigE network with a fileserver that has plenty of memory,
refetching across the network is in many cases faster than re-reading
from local disk, so cachefs would probably slow down the client.

The thing is, Coda already has a local backing store for files, so when
VM pages are dropped they are written back to the local container file
and don't have to be refetched across the network. When we fill the
container files we never actively sync them, and we pass the open fd to
the kernel; as a result we don't really touch the disk for most small
files.

There are other problems with cachefs. All (currently potential) cachefs
users live in kernel space, so they don't have to care about deadlocks:
kernel processes never get swapped or paged out and can allocate memory
without putting pressure on the VM. Also, the cachefs logic hooks into
the readpage and writepage operations, which Coda doesn't even provide;
our readpage and writepage functions are provided by the inode of the
container file. Because people could put their venus.cache on any kind
of filesystem, we can't even assume those functions resemble the
generic_readpage implementation or anything like that.
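For comparison, the Coda read path in the Linux kernel module is closer
to this (paraphrased from memory and simplified, not a verbatim copy of
fs/coda/file.c):

    /* coda read simply delegates to whatever filesystem the container
     * file lives on -- there is no coda readpage at all */
    static ssize_t coda_file_read(struct file *coda_file, char __user *buf,
                                  size_t count, loff_t *ppos)
    {
        struct coda_file_info *cfi = CODA_FTOC(coda_file);
        struct file *host_file = cfi->cfi_container;

        if (!host_file->f_op || !host_file->f_op->read)
            return -EINVAL;

        return host_file->f_op->read(host_file, buf, count, ppos);
    }

So whichever readpage ends up being used is the one belonging to the
filesystem that holds venus.cache.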

Another thing is that we don't fetch parts of a file on a page-by-page
basis. With fat pipes that have high latency it is more efficient to
stream a whole file than to fetch individual blocks. Of course the SFTP
implementation doesn't necessarily stream very well and has a lot of
overhead since it deals with 1KB packets in userspace, so we want to
replace it with something that uses TCP and sendfile, or SCTP.
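A replacement along those lines doesn't have to be complicated; the
server-side send loop could be as simple as this sketch (plain Linux
sendfile(2) over an already-connected TCP socket, illustration only,
not actual Coda code):

    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* stream a whole container file over a connected TCP socket */
    static int send_whole_file(int sock, const char *path)
    {
        struct stat st;
        off_t offset = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }
        while (offset < st.st_size) {
            ssize_t n = sendfile(sock, fd, &offset, st.st_size - offset);
            if (n <= 0)
                break;   /* error or peer went away */
        }
        close(fd);
        return offset == st.st_size ? 0 : -1;
    }

The kernel then pushes the data straight from the page cache into the
socket in whatever chunks TCP finds convenient, instead of copying 1KB
packets through userspace.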

If we have a 100KB/s link with a 200ms RTT, we could have sent another
20KB of data in the time it takes us to make a single RPC for the next
block. The kernel does some readahead which fills in the gaps, but with
a 4KB pagesize we would need 5 concurrent read requests on the wire to
fill the pipe when we make RPCs on a block-by-block basis. And networks
are getting faster, while geographical distances stay the same. So the
100ms latency between here and Europe will probably stay around that
value, while the available bandwidth will steadily increase. In the end,
when we have Gb/s across the Atlantic, it will be much more efficient to
fetch a whole ISO image at once than to request 100,000 individual 8KB
blocks, which would take several hours.
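The arithmetic is easy to check (the numbers are the ones from the
example above):

    #include <stdio.h>

    int main(void)
    {
        double bandwidth = 100.0 * 1024;  /* bytes/s             */
        double rtt       = 0.200;         /* seconds, round trip */
        double pagesize  = 4096;          /* bytes per request   */
        double bdp       = bandwidth * rtt;

        printf("in flight per RTT: %.0f bytes (~%.0fKB)\n",
               bdp, bdp / 1024);
        printf("concurrent requests to fill the pipe: %.0f\n",
               bdp / pagesize);
        return 0;
    }

which prints 20480 bytes (~20KB) and 5 concurrent requests, and only
gets worse as the bandwidth goes up.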

There are only a few optimizations that I consider valuable. For
instance, we shouldn't bother fetching a file if the process set the
O_TRUNC flag on open. Another is to give early access before we have
fetched the whole file, allowing access to the part that has already
arrived. But early access does add a lot of complexity: what to do when
the application expects non-blocking operation on reads and writes, what
to do if some application opens a lot of files just to look at the first
couple of bytes of each file, etc.
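The O_TRUNC case at least is trivial; something like this in the cache
manager's open path would do (hypothetical sketch, the struct and the
helpers are invented and don't correspond to the real venus code):

    #include <fcntl.h>
    #include <stdio.h>

    struct cached_file {
        int  data_valid;   /* full contents present locally? */
        long length;       /* length of the cached copy      */
    };

    static void fetch_whole_file(struct cached_file *f)
    {
        /* stand-in for the real fetch-from-the-server path */
        printf("fetching whole file...\n");
        f->data_valid = 1;
    }

    static int handle_open(struct cached_file *f, int flags)
    {
        if (flags & O_TRUNC) {
            /* contents are about to be thrown away anyway,
             * so just present an empty, valid file */
            f->length = 0;
            f->data_valid = 1;
        } else if (!f->data_valid) {
            fetch_whole_file(f);
        }
        return 0;
    }

    int main(void)
    {
        struct cached_file f = { 0, 0 };
        handle_open(&f, O_WRONLY | O_TRUNC);  /* no fetch needed */
        handle_open(&f, O_RDONLY);            /* already valid   */
        return 0;
    }

Early access is the interesting but hairy one.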

Jan
Received on 2004-10-02 11:42:42