On Tue, May 15, 2007 at 10:56:03AM -0400, shivers_at_ccs.neu.edu wrote:
> Are we concluding that this kind of use is simply not on for coda, or is
> it that there is a bug, which, if fixed, would make this kind of use OK?
>
> I don't have a good understanding of how critical these O(n) context-checking
> ops are -- do they have to happen, are they optional and performed when
> connectivity is good, or what.

The best explanation of the whole hoard walking process is in J.J. Kistler's
thesis 'Disconnected Operation in a Distributed File System'; chapter 5 covers
hoarding. I just re-read the whole chapter and my head is still reeling a bit,
it pretty much covers all of the details.

Although it may not seem that way, with the 100% CPU usage and all, the hoard
expansion mechanism tries to minimize the amount of work required to
re-associate cached objects with their intended hoard priority. As long as we
stay connected and receive few or no callbacks, it actually does perform about
as little work as possible.

The problem we are seeing here is caused by a large demotion event
(reconnection to the servers) which flags all objects as suspicious. The
cached status is quickly revalidated; according to the codacon output a
single ValidateVols rpc call was sufficient. However, the demotion also placed
all hoard-related bindings in a suspect state. To revalidate these, the client
iterates through every object (actually every hoard entry <-> file system
object binding) and has to check whether the object's path name is still
valid. This is somewhat expensive because we walk the tree starting at the
root for every object. Also, if the object is a directory we iterate through
all of its entries and check them against a list of previously known children
to see whether any entries were added or removed.

Now these directories are internally hashed, but the list of previous entries
is a simple list, so reversing this and doing a directory lookup for each
entry in the list would already be somewhat more efficient (see the sketch at
the end of this message). Second, unless I misunderstood the code, it seems
like we are revalidating every object against the hoard database twice: once
based on the path, and then on whether the object is still a valid child of a
hoarded parent directory. Finally, since the status revalidation clearly
discovered that nothing had actually changed on the servers, all of these
hoard bindings probably shouldn't have been demoted in the first place. So I
would say these are definitely some things that can be improved.

Another approach is to replace the existing hoard mechanism; the current
implementation is flexible but also quite complex. It combines a
cache-replacement policy, in the form of hoard priority, with a demand-fetch
preference which indicates which files need to be automatically refetched
when they are changed on the server.

Jan
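
A minimal sketch of the reversed directory check mentioned above, assuming a
hypothetical CachedDir structure with a hashed set of current entries and a
flat list of previously known children. This is not the actual venus code,
just an illustration of the idea:

    // Hypothetical sketch: validate a cached directory's children by probing
    // the hashed entry table once per previously known child, instead of
    // scanning the previous-children list for every current entry.
    #include <string>
    #include <unordered_set>
    #include <vector>

    struct CachedDir {
        // current entries, hashed by name for O(1) expected lookups
        std::unordered_set<std::string> entries_by_name;
        // children recorded at the previous hoard walk
        std::vector<std::string> prev_children;

        bool children_changed() const {
            // different counts mean something was added or removed
            if (prev_children.size() != entries_by_name.size())
                return true;
            // equal counts: if every previous child is still present,
            // nothing was added or removed either
            for (const std::string &name : prev_children)
                if (!entries_by_name.count(name))
                    return true;
            return false;
        }
    };

With equal sizes and every previous child still present, the two sets of names
must be identical, so a single pass over the (short) previous-children list
replaces a list scan per directory entry.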