On Wed, May 02, 2007 at 09:24:41AM -0400, shivers_at_ccs.neu.edu wrote:
> > 10Gb is small, if you use the proper unit of measurement: it's $4 worth of
> > storage. I have a little over a terabyte of disk on my personal office
> > computer, so dedicating 10Gb to my cache is trivial.
> >
> > In fact, if you think about it, you can make an argument that *most* of your
> > disk-drive capacity should be given over to a coda cache.

I totally agree with all of the points you mention.

> Hey, just my music collection is 300Gb -- I abandoned lossy compression around
> 1998. If I owned a TV, we'd be talking serious storage -- most of my
> technically proficient friends buffer Netflix movies on their hard drives and
> use MythTV to do Tivo-like capture of their favorite tv series off of cable.
> Coda *should* be perfect for a large media store, because that data is so
> typically read-only.

Right, and this is where the existing design makes flawed assumptions. We mostly get to choose the amount of disk blocks used for the Coda cache, but as far as both the client and the server are concerned, what really matters is not the amount of data but the number of cached objects. Coda uses a magic ratio which is probably based on the average file size on various desktop and server machines back in 1988 or so; maybe it was adjusted slightly later on. In any case, it assumes an average file is 24KB. So if we ask for a 1TB client cache, this formula assumes we're going to be storing a little over 44 million objects.
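As a rough sketch of that ratio (this is not Coda's actual code; the function name is made up, only the 24KB figure is from the description above), the fixed average-file-size assumption turns a disk-space budget into an object count like this:

```python
# Illustrative only: how a fixed "average file size" assumption turns a
# disk-space budget into a cache-object count. The 24KB figure is the
# magic ratio described above; the function name is invented for this sketch.
AVG_FILE_SIZE = 24 * 1024  # assumed average file size, ~1988 vintage


def implied_object_count(cache_bytes):
    """Number of cache entries a byte budget implies at 24KB per file."""
    return cache_bytes // AVG_FILE_SIZE


print(implied_object_count(1 << 40))  # 1TB cache -> 44739242, ~44.7 million
```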
But consider what people who use or need a lot of diskspace tend to store:

- losslessly compressed whole-CD images (300-400MB)
- captured video (600MB to a few GB; I have a 3-hour HD recording at 20GB)
- digital camera images (6MB per photo)
- lossy compressed mp3 songs (3MB a song)
- vmware/qemu images (several GB each)

Clearly the average file size is considerably larger, and we are far more likely to see reasonable numbers of cached files. With a 1TB cache we may see something on the order of 200K digital photos, 3000 whole-CD flacs, 1000 TV recordings, or a couple of hundred VM images. Those are reasonable and realistic numbers. Some large centralized file server may have to handle millions of files (and have petabytes of storage), but even if it took only 1ms to access or check each file, it would take over 12 hours to go through 44 million objects. That seems unreasonable for clients, unless we really start to look at storing, indexing, and accessing the data differently.

So really we shouldn't be telling Coda to use some limited amount of diskspace for the cache, but tell it to cache at most N objects, and possibly that it should keep some safety margin of local diskspace that always remains unused.

Jan

Received on 2007-05-02 12:14:47
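As a postscript, the back-of-the-envelope numbers in the post can be checked in a few lines (the file sizes are the post's ballpark figures, not measurements):

```python
# Sanity-check the estimates: object counts for a 1TB cache at realistic
# media file sizes, and the time to walk ~44.7 million objects at 1ms each.
TB = 1 << 40
MB = 1 << 20

print(TB // (6 * MB))    # 6MB digital photos:  174762, i.e. roughly 175-200K
print(TB // (350 * MB))  # 350MB whole-CD flacs: 2995, i.e. ~3000
print(TB // (1024 * MB)) # 1GB TV recordings:    1024, i.e. ~1000

objects = TB // (24 * 1024)    # ~44.7 million objects at the 24KB ratio
print(objects * 0.001 / 3600)  # ~12.4 hours at 1ms per object
```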