Coda File System

Whole file caching

Coda uses whole file caching and AFS semantics: we block the open call until the complete file has been fetched, and we only write changes back when the file is closed.
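
In rough terms the model looks like the following sketch (illustrative userspace C, not the actual venus code; fetch_whole_file and store_whole_file are hypothetical helpers standing in for the real client-server transfers):

    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical helpers, stubbed out so the sketch stands alone. */
    static void fetch_whole_file(const char *path) { (void)path; }
    static void store_whole_file(const char *path) { (void)path; }

    /* open blocks until the complete file is cached locally */
    int cache_open(const char *path, int flags)
    {
        fetch_whole_file(path);
        return open(path, flags);
    }

    /* changes are only written back when the file is closed */
    int cache_close(int fd, const char *path, int was_written)
    {
        int ret = close(fd);
        if (was_written)
            store_whole_file(path);
        return ret;
    }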

Why block until everything is cached locally?

The main reason is that Coda's kernel module does not tell us about individual read and write operations. Once we return the handle to the cached file, the kernel expects to be able to do anything it wants with it.
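
To illustrate the handoff, the reply to an open upcall essentially just identifies the locally cached container file; something along these lines (a simplified sketch with made-up field layout, the real kernel-venus protocol has more fields and versioning):

    #include <sys/types.h>

    /* Simplified sketch of an open upcall exchange. Once the kernel
     * knows the container file named in the reply, it performs all
     * subsequent reads, writes and mmaps on that file directly,
     * with no further upcalls to venus. */
    struct open_upcall {
        int opcode;   /* the open operation */
        int unique;   /* matches request to reply */
        int flags;    /* open flags from the application */
    };

    struct open_reply {
        int unique;
        dev_t dev;    /* device and inode of the locally */
        ino_t inode;  /* cached container file */
    };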

Of course, with some effort it would be possible to change the kernel-venus protocol, but there are other reasons as well...

Overhead and possible full-system deadlocks

Intercepting each individual read and write operation would cause significant overhead. Not only would we need a context switch between the application and the Coda cache manager, we would also end up copying the same data several times. In addition, simply hooking into the read and write operations doesn't catch accesses to an mmap'ed file. For those we would have to intercept accesses on a page-by-page basis (readpage/writepage), which introduces even more overhead, because we would need an upcall for every 4KB of data, and it opens up the possibility of a pretty serious deadlock in the kernel.

A deadlock? Yes, consider a system that is quite low on memory, where most of the Coda userspace has been swapped or paged out. Now some process needs a page, but nothing is readily available, so the kernel tries to flush dirty pages to disk by calling writepage on them. But if such a writepage happens to be for a Coda file, there are several places along the way that may also need memory pages: we allocate some memory for the upcall message; when venus wakes up, it may need to page in some of its code pages or pull its data structures back from swap; and finally Coda needs to write the page to the actual container file, which again requires allocation. Each of these places where we may need a page will then block waiting for the dirty page writeout to complete. But that writeout is waiting for an answer from us.
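
The chain of dependencies can be marked out as follows (illustrative userspace C with made-up names, just to point at where allocations would happen in a hypothetical page-by-page writeback path; this is not the actual kernel module code):

    #include <stdlib.h>

    /* In the kernel, any allocation may enter memory reclaim, and
     * reclaim may end up waiting for dirty page writeout -- possibly
     * the very writepage we are in the middle of servicing. */
    static void *alloc_may_block_on_reclaim(size_t n)
    {
        return malloc(n);
    }

    static int coda_writepage_sketch(const char *page, size_t len)
    {
        /* (1) building the upcall message needs memory */
        void *msg = alloc_may_block_on_reclaim(len);

        /* (2) waking venus may require paging its code and data
         *     back in: more allocations in the same tight spot */

        /* (3) venus writing to the container file allocates page
         *     cache pages for that file as well */

        /* if any of these waits on our own writeout, we deadlock */
        (void)page;
        free(msg);
        return 0;
    }

    int main(void)
    {
        char page[4096] = { 0 };
        return coda_writepage_sketch(page, sizeof(page));
    }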

Of course, we can limit Coda's functionality and avoid such a deadlock by simply denying shared/writeable mmaps; this is what the FUSE developers have done.

Why we don't fetch parts of a file on a page-by-page basis

With fat pipes that have a high latency it is more efficient to stream a whole file than to fetch individual blocks. Of course, the existing SFTP implementation doesn't necessarily stream very well and has a lot of overhead, since it deals with 1KB packets in userspace; we eventually want to replace it with something that uses either TCP and sendfile, or a reliable datagram protocol like SCTP.

If we have a 100KB/s link with a 200ms RTT, we could have moved another 20KB of data in the time it takes to make a single rpc for the next block. The kernel does some readahead which fills in the gaps, but with a 4KB pagesize we would need 5 concurrent read requests on the wire to fill this particular pipe when making rpcs on a block-by-block basis. And networks are getting faster while geographical distances stay the same, so the 100ms latency between here and Europe will most likely stay around that value, while the available bandwidth steadily increases. In the end, when we have Gb/s across the Atlantic, it will be much more efficient to fetch a whole ISO image at once than to request the thousands of individual blocks, which would probably still take several hours.
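
The numbers above are just the bandwidth-delay product divided by the block size, which is easy to check (plain C, using the figures from this example):

    #include <stdio.h>

    int main(void)
    {
        double bandwidth = 100 * 1024;   /* link speed: 100KB/s */
        double rtt       = 0.200;        /* round trip time: 200ms */
        double blocksize = 4 * 1024;     /* page size: 4KB */

        /* bytes that fit "in the pipe" during one round trip */
        double in_flight = bandwidth * rtt;        /* = 20KB */

        /* concurrent block requests needed to keep the pipe full */
        double requests  = in_flight / blocksize;  /* = 5 */

        printf("in flight: %.0f KB, concurrent requests: %.0f\n",
               in_flight / 1024, requests);
        return 0;
    }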

So there really is no way of convincing you, is there?

There are a few possible optimizations that I actually do consider valuable. For instance, we shouldn't bother fetching a file when the process sets the O_TRUNC flag on open. We could possibly also give early access to the parts that have already arrived while the file is still being fetched. But such early access adds a lot of complexity to the kernel module and requires some thought about what to do when the application expects non-blocking behaviour from reads and writes, although I guess the kernel already doesn't give non-blocking guarantees for local disk files anyway. Or what to do when some application opens a lot of files just to look at the first couple of bytes of each, etc.
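
The O_TRUNC case is simple enough to sketch (again illustrative, reusing the hypothetical fetch_whole_file helper from the first sketch):

    #include <fcntl.h>

    /* fetch_whole_file() is the hypothetical helper from the first
     * sketch; stubbed out here so the fragment stands alone. */
    static void fetch_whole_file(const char *path) { (void)path; }

    int cache_open_trunc(const char *path, int flags)
    {
        /* When the application truncates on open, the cached
         * contents would be discarded immediately, so we can
         * skip the fetch entirely. */
        if (!(flags & O_TRUNC))
            fetch_whole_file(path);
        return open(path, flags);
    }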