Coda File System

Re: coda & reasonable cache sizes

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Wed, 2 May 2007 12:13:22 -0400
On Wed, May 02, 2007 at 09:24:41AM -0400, shivers_at_ccs.neu.edu wrote:
> > 10Gb is small, if you use the proper unit of measurement: it's $4 worth of
> > storage. I have a little over a terabyte of disk on my personal office
> > computer, so dedicating 10Gb to my cache is trivial.
> 
> In fact, if you think about it, you can make an argument that *most* of your
> disk-drive capacity should be given over to a coda cache.

I totally agree with all of the points you mention.

> Hey, just my music collection is 300Gb -- I abandoned lossy compression around
> 1998. If I owned a TV, we'd be talking serious storage -- most of my
> technically proficient friends buffer Netflix movies on their hard drives and
> use MythTV to do Tivo-like capture of their favorite tv series off of cable.
> Coda *should* be perfect for a large media store, because that data is so
> typically read-only.

Right, and here is where the existing design makes flawed assumptions.

Basically, we get to choose the number of disk blocks used for the
Coda cache, but as far as both the client and the server are concerned,
what really matters is not the amount of data but the number of cached
objects.

So Coda uses a magic ratio, which is probably based on the average file
size on various desktop and server machines back in 1988 or so; maybe
it was adjusted slightly later on. In any case, it assumes that an
average file is 24KB, so if we ask for a 1TB client cache this formula
assumes we're going to be storing a little over 44 million objects.
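
Just to illustrate that ratio (the 24KB figure is the one quoted above;
the function and variable names below are hypothetical, not Coda's
actual code), a back-of-the-envelope sketch looks like this:

    # Hypothetical sketch of the cache-size-to-object-count ratio
    # described above; not Coda's actual implementation.
    AVG_FILE_SIZE = 24 * 1024            # assumed average file size: 24KB

    def estimated_cache_files(cache_bytes):
        # number of cache container files reserved for a cache of
        # the given size, under the 24KB-per-object assumption
        return cache_bytes // AVG_FILE_SIZE

    one_tb = 1 << 40                     # 1TB in bytes
    print(estimated_cache_files(one_tb)) # ~44.7 million objects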

But if we're talking about what people who use/need a lot of disk space
tend to store,

    losslessly compressed whole-CD images (300-400MB),
    captured video (600MB to a few GB; I have a 3-hour HD recording at 20GB),
    digital camera images (6MB per photo),
    lossy compressed mp3 songs (3MB a song),
    vmware/qemu images (several GB each)

Clearly the average file size is considerably larger, and we end up
with a far more reasonable number of cached files. If we have a 1TB
cache we may see something on the order of 200K digital photos, 3000
whole-CD flacs, 1000 TV recordings, or a couple of hundred VM images.

And those are reasonable, realistic numbers. A large centralized file
server may have to handle millions of files (and have petabytes of
storage), but even if it took only 1ms to access or check each file, it
would take over 12 hours to walk through 44 million objects. So that
many cached objects seems unreasonable for a client unless we really
start to look at storing, indexing and accessing the data differently.
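
A quick sanity check of those numbers (the average file sizes are just
the rough figures mentioned above, nothing measured):

    # Rough check of the object counts and scan time for a 1TB cache;
    # file sizes are the ballpark figures from this message.
    TB = 1 << 40
    sizes = {
        "digital photos": 6 * 1024**2,    # ~6MB each
        "whole-CD flacs": 350 * 1024**2,  # ~300-400MB each
        "TV recordings":  1 * 1024**3,    # ~1GB each
        "VM images":      4 * 1024**3,    # several GB each
    }
    for kind, avg in sizes.items():
        print(kind, TB // avg)            # objects that fit in 1TB

    # Walking 44 million objects at 1ms per object:
    print(44_000_000 * 0.001 / 3600, "hours")   # ~12.2 hours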

So really we shouldn't be telling Coda to use some limited amount of
disk space for the cache, but rather tell it to cache at most N objects,
and possibly to make sure that some safety margin of local disk space
always remains unused.
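
Conceptually, such a policy could look something like the sketch below.
This is purely illustrative; the names and thresholds are made up and
this is not how venus actually decides when to evict:

    # Illustrative sketch of an object-count limit combined with a
    # free-space safety margin; not Coda's actual cache manager logic.
    import shutil

    MAX_OBJECTS = 100_000              # hypothetical object limit N
    MIN_FREE_BYTES = 10 * 1024**3      # hypothetical 10GB safety margin

    def need_eviction(cache_dir, cached_object_count):
        # evict when we hold too many objects, or when the local disk
        # holding the cache is running low on free space
        free = shutil.disk_usage(cache_dir).free
        return cached_object_count >= MAX_OBJECTS or free < MIN_FREE_BYTES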

Jan
Received on 2007-05-02 12:14:47