On Thu, Sep 30, 2004 at 04:02:01PM -0500, Troy Benjegerdes wrote:
> If you added linux & solaris cachefs support to the coda kernel module,
> you'd get around the objections people have to podfuk. ;)

Don't get too stuck on the idea of using cachefs. It isn't really what you expect it to be, and should really be named 'fs cache helper'. Network filesystems like NFS, SMBFS and the in-kernel AFS implementation do not have a persistent on-disk cache, so any page of data that is fetched from the server only lives in VM; it is simply dropped when there is memory pressure (and re-fetched across the network on the next access). Cachefs simply provides a form of swapping to local disk for VM pages whose backing store is not local. It doesn't even do this in a generic way: the filesystem needs to be modified so that every readpage checks the cache and, if the page is not cached or is stale, refetches and repopulates it, while writepage invalidates or updates the cached page (a rough sketch of such a hook is further down). On a local GigE network with a fileserver that has plenty of memory, refetching across the network is in many cases faster than re-reading from local disk, so cachefs would probably slow down the client.

The thing is, Coda already has a local backing store for files, so when VM pages are dropped they are written back to the local container file and don't have to be refetched across the network. When we fill the container files we never actively sync them, and we pass the open fd to the kernel, so as a result we don't really touch the disk for most small files.

There are other problems with cachefs. All (currently potential) cachefs users are in kernel space: they don't have to care about the same deadlocks, kernel processes never get swapped or paged out, and they can allocate memory without putting pressure on the VM. Also, the cachefs logic hooks into the readpage and writepage operations, which Coda doesn't even provide; our readpage and writepage functions are provided by the inode of the container file. Because people could put their venus.cache on anything, we can't even assume it is similar to the generic_readpage implementation or something like that.

Another thing is that we don't fetch parts of a file on a page-by-page basis. With fat pipes that have high latency it is more efficient to stream a whole file than to fetch individual blocks. Of course the SFTP implementation doesn't necessarily stream very well and has a lot of overhead, since it deals with 1KB packets in userspace; we want to replace it with something that uses TCP and sendfile, or SCTP. If we have a 100KB/s link with a 200ms RTT, we could have sent another 20KB of data in the time it takes us to make a single RPC for the next block. The kernel does some readahead which fills in the gaps, but with a 4KB pagesize we would need 5 concurrent read requests on the wire to fill the pipe when we make RPCs on a block-by-block basis (the arithmetic is spelled out in a small example further down).

And networks are getting faster, while geographical distances stay the same. So the 100ms latency between here and Europe will probably stay around that value, while the available bandwidth will steadily increase. In the end, when we have Gb/s across the Atlantic, it will be much more efficient to fetch a whole ISO image at once than to request 100,000 individual 8KB blocks, which would take several hours.

There are only a few optimizations that I consider valuable. For instance, we shouldn't bother fetching a file if the process set the O_TRUNC flag on open (sketched below).
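To make the cachefs hook from the top of this mail a bit more concrete, here is roughly the shape a filesystem's page read/write paths would have to take. Everything in it (the page struct, the cache_* and net_* helpers) is a made-up stand-in for illustration, not the real Linux or cachefs interface, and the helpers are only declared to show the control flow:

    #include <stdbool.h>
    #include <stddef.h>

    struct page { void *data; size_t index; };   /* stand-in for a VM page */

    /* hypothetical helpers, assumed to be provided elsewhere */
    bool cache_lookup(struct page *pg);    /* fill pg from the on-disk cache    */
    bool cache_is_stale(struct page *pg);  /* compare against the server's copy */
    int  net_fetch(struct page *pg);       /* read the block from the server    */
    void cache_store(struct page *pg);     /* write the block to local disk     */
    void cache_update(struct page *pg);    /* update or drop the cached copy    */
    int  net_write(struct page *pg);       /* send the dirty block to the server */

    /* every readpage has to consult the cache first and repopulate it */
    int readpage_with_cache(struct page *pg)
    {
        if (cache_lookup(pg) && !cache_is_stale(pg))
            return 0;                      /* served from local disk */

        if (net_fetch(pg) != 0)
            return -1;
        cache_store(pg);                   /* a dropped VM page can now be
                                              refilled without the network */
        return 0;
    }

    /* every writepage has to keep the cached copy consistent */
    int writepage_with_cache(struct page *pg)
    {
        if (net_write(pg) != 0)
            return -1;
        cache_update(pg);
        return 0;
    }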
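And to spell out the bandwidth/latency arithmetic, here is a little throwaway program; the numbers in it are just the figures used in this mail, not measurements:

    #include <stdio.h>

    int main(void)
    {
        double bandwidth = 100e3;   /* 100 KB/s link          */
        double rtt       = 0.2;     /* 200 ms round-trip time */
        double pagesize  = 4096;    /* 4 KB kernel page size  */

        /* how much data could have been in flight during one round trip */
        double in_flight = bandwidth * rtt;
        printf("bandwidth * RTT            = %.0f KB\n", in_flight / 1e3);
        printf("concurrent 4KB reads needed = %.0f\n", in_flight / pagesize);

        /* fetching 100,000 8KB blocks one RPC at a time, counting only
         * the ~100 ms transatlantic latency per request as a lower bound */
        double blocks = 100000, latency = 0.1;
        printf("serial block-by-block fetch >= %.1f hours\n",
               blocks * latency / 3600);
        return 0;
    }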
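The O_TRUNC optimization amounts to a check along these lines; the helper name and the cached-data flag are made up for illustration, this is not actual venus code:

    #include <fcntl.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* hypothetical helper: does venus need to fetch the file contents from
     * the server before it can hand the open back to the kernel? */
    static bool need_fetch(int open_flags, bool data_cached)
    {
        if (open_flags & O_TRUNC)
            return false;            /* old contents are discarded anyway */

        return !data_cached;         /* only fetch when the container file
                                        doesn't already hold valid data */
    }

    int main(void)
    {
        printf("open(O_WRONLY|O_TRUNC), nothing cached -> fetch? %d\n",
               need_fetch(O_WRONLY | O_TRUNC, false));   /* 0, skip the fetch */
        printf("open(O_RDONLY), nothing cached -> fetch? %d\n",
               need_fetch(O_RDONLY, false));             /* 1, must fetch */
        return 0;
    }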
The other optimization would be early access: allowing an application to get at the part of a file that has already arrived, before we have fetched the whole thing. But early access does add a lot of complexity: what to do when the application expects non-blocking operation on reads and writes, what to do if some application opens a lot of files just to look at the first couple of bytes of each one, etc.

Jan
Received on 2004-10-02 11:42:42