Coda File System

Re: Coda spinning

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Tue, 15 May 2007 23:35:15 -0400
On Tue, May 15, 2007 at 10:56:03AM -0400, shivers_at_ccs.neu.edu wrote:
> Are we concluding that this kind of use is simply not on for coda, or is
> it that there is a bug, which, if fixed, would make this kind of use OK?
> 
> I don't have a good understanding of how critical these O(n) context-checking
> ops are -- do they have to happen, are they optional and performed when
> connectivity is good, or what. 

The best explanation of the whole hoard walking process is in J.J.
Kistler's thesis 'Disconnected Operation in a Distributed File System'.
Chapter 5 covers hoarding. I just re-read the whole chapter and my head
is still reeling a bit; it covers pretty much all of the details.

Although it may not seem that way, with the 100% CPU usage and all, the
hoard expansion mechanism tries to minimize the amount of work required
to re-associate cached objects with their intended hoard priority. And
when we stay connected and get few or no callbacks it actually does
perform about as little work as possible.

The problem we are seeing here is caused by a large demotion event
(reconnection to the servers), which flags all objects as suspicious.
The cached status is quickly revalidated; according to the codacon
output a single ValidateVols rpc call was sufficient. However, the
demotion also placed all hoard related bindings in a suspect state. To
revalidate these, the client iterates through every object (actually
every hoard entry <-> file system object binding) and has to check
whether the path name of the object is still valid. This is somewhat
expensive as we walk the tree starting at the root for every object.
Also, if the object is a directory we iterate through all entries and
check them against a list of previously known children to see if any
entries were added or removed.
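
To give a feel for why the per-object walk from the root adds up, here
is a toy sketch in Python. It is not Coda's code; Node, resolve,
revalidate_all and make_tree are made-up names, and the "file system"
is just nested dicts. It only counts how many name lookups it takes to
re-resolve every bound path from the root.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy file system object (hypothetical); children maps name -> Node."""
    children: dict = field(default_factory=dict)


def resolve(root, path, counter):
    """Walk from the root one path component at a time, counting lookups."""
    node = root
    for name in path.strip("/").split("/"):
        counter[0] += 1
        node = node.children.get(name)
        if node is None:
            return None  # path no longer valid
    return node


def revalidate_all(root, bound_paths):
    """Re-check every suspect binding by re-resolving its full path."""
    counter = [0]
    valid = [p for p in bound_paths if resolve(root, p, counter) is not None]
    return valid, counter[0]


def make_tree(depth, nfiles):
    """Build a chain of `depth` directories with `nfiles` files at the bottom."""
    root = Node()
    node = root
    parts = []
    for i in range(depth):
        name = f"d{i}"
        node.children[name] = Node()
        node = node.children[name]
        parts.append(name)
    prefix = "/" + "/".join(parts)
    paths = []
    for j in range(nfiles):
        node.children[f"f{j}"] = Node()
        paths.append(f"{prefix}/f{j}")
    return root, paths


root, paths = make_tree(depth=3, nfiles=4)
valid, lookups = revalidate_all(root, paths)
print(len(valid), "bindings still valid after", lookups, "lookups")
```

With 4 files three directories deep this takes 4 * (3 + 1) = 16
lookups, so the cost grows with both the number of bindings and their
depth, even when nothing has changed.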

Now these directories are internally hashed, but the list of previous
entries is a simple list, so reversing this and doing a directory lookup
for each entry in the list would already be somewhat more efficient.
Second, unless I misunderstood the code, it seems like we're
revalidating every object against the hoard database twice: once based
on the path, and then again based on whether the object is still a
valid child of a hoarded parent directory. Finally, since the status
revalidation clearly
discovered that nothing had actually changed on the servers, all of
these hoard bindings probably shouldn't have been demoted in the first
place. So I would say these are definitely some things that can be
improved.
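
As a rough illustration of the first point, here is a toy comparison in
Python. Again this is not the actual venus code; the function names are
made up, a dict stands in for the hashed directory, a plain list stands
in for the list of previously known children, and the list is assumed
to contain no duplicate names.

```python
def changed_scan_list(directory, known):
    """For each directory entry, search the plain list of known children."""
    comparisons = 0
    for name in directory:
        for k in known:  # linear scan of the list
            comparisons += 1
            if k == name:
                break
        else:
            return True, comparisons  # entry was added
    # Every entry is known; a size mismatch means something was removed.
    return len(directory) != len(known), comparisons


def changed_hashed_lookup(directory, known):
    """Reversed: for each known child, do one hashed directory lookup."""
    lookups = 0
    for name in known:
        lookups += 1
        if name not in directory:  # hashed lookup in the dict
            return True, lookups  # entry was removed
    # Every known child is present; a size mismatch means something was added.
    return len(directory) != len(known), lookups


names = [f"f{i}" for i in range(100)]
directory = dict.fromkeys(names)  # stand-in for the hashed directory
known = list(names)               # stand-in for the simple list

print(changed_scan_list(directory, known))
print(changed_hashed_lookup(directory, known))
```

For an unchanged directory with 100 entries the first version does
5050 comparisons and the second does 100 lookups; both report that
nothing changed.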

Another approach is to replace the existing hoard mechanism. The current
implementation is flexible but also quite complex. It combines a
cache-replacement policy, in the form of hoard priority, with a demand
fetch preference which indicates which files need to be automatically
refetched when they are changed on the server.

Jan
Received on 2007-05-15 23:37:16