Hi Rune,

Thank you for the valuable feedback! Most importantly, thank you for taking the time to write about your experience in a constructive manner and in sufficient detail to help us improve Coda. It is gratifying that the system was helpful to you under real-world stress, except at the very end when it collapsed. Real-world experience and thoughtful feedback of this kind are priceless.

The hoarding/caching area of Venus is definitely one that we have had an eye on for some time now for a complete redesign and rewrite. What's difficult is to strike the right balance between precisely tracking changes on servers and presenting a frozen snapshot view. How automated versus manual the synchronization process should be is another delicate question, as is how "atomic" resync should be if failures happen partway through (as is very possible with flaky wireless networks). The current code is biased towards the fully automated end of the spectrum, with no atomicity. So the cache management policy is roughly "as fresh a state as possible, without user interaction or atomicity guarantees." The resulting caching/hoarding code is complex and buggy, and can also have counterintuitive behavior, as you experienced. But the obvious alternatives have their own problems. Any redesign is going to have to make some hard choices on very important corner cases. It would help to hear thoughts from you and other experienced Coda users on the points below.

One very early design choice we considered (but rejected) was to simply pin objects in the cache via hoarding. Hoard priority is then not a useful concept, but explicit hoard walks are still important (that's the resync step). Hoarded objects are "sticky" --- they never get thrown out, but new versions of them get fetched on hoard walks. One reason for rejecting the "sticky" approach was that we didn't have a good answer to the question of what to do if the resync step would cause a pinned subtree to expand greatly (beyond cache size limits). E.g. you disconnect after hoarding a 1-byte subtree; a later hoard walk discovers that the subtree has grown to 10 GB, which is bigger than the cache. What does Venus do now? Currently, Coda tries to use the hoard priority information to figure out what to throw out. A different approach would be to ask for user help at this point, or simply to give an error message at the hoard walk.

User interaction at this point is questionable under the Unix design philosophy (unlike Windows or Mac), because there may not be a GUI or a user to interact with. That design philosophy is the reason why conflicts are represented as dangling sym links --- they are out-of-band communication that reaches even non-interactive programs. The ASR (application-specific resolution) mechanism can pop up a dialog box, but Coda views that as an application-specific resolver and not as part of Venus. Should we do something similar here, i.e. an upcall to application-specific code for exception handling in cache management? Is there an analog of the dangling sym link for the rock-bottom fallback?

The deeper issue is static partitioning of the cache versus dynamic partitioning. Even without growth of hoarded subtrees, there could be cache pressure to throw things out. E.g. you hoard critical objects, then start crawling some big tree while still connected. The cache misses during the crawl will eventually force a hard decision: whether or not to throw out a hoarded object. The "sticky" approach would never throw out a hoarded object to relieve cache pressure.
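To make the tradeoff concrete, here is a minimal sketch of the two eviction policies under discussion. The types, names, and priority formula below are invented for illustration (they are not Venus's actual data structures); the point is only that a combined-priority policy can always make room by throwing out hoarded objects, whereas a sticky policy must sometimes fail and hand the decision to something outside the cache manager.

    // Sketch only: illustrative types and formula, not Venus code.
    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct CacheEntry {
        std::string   name;
        std::uint64_t bytes;
        int           hoard_priority;   // 0 = not hoarded, > 0 = hoarded
        std::uint64_t last_use;         // larger = more recently referenced
    };

    // Current-style policy: hoard priority and recency fold into one score,
    // so under enough pressure even hoarded objects become eviction victims.
    static long score(const CacheEntry& e) {
        return 1000L * e.hoard_priority + static_cast<long>(e.last_use);
    }

    // Try to free 'needed' bytes.  If 'sticky' is true, hoarded objects are
    // pinned and the call fails rather than evicting them; that failure is
    // the point where Venus would have to punt to a user or an upcall handler.
    bool make_room(std::vector<CacheEntry>& cache, std::uint64_t needed, bool sticky) {
        std::sort(cache.begin(), cache.end(),
                  [](const CacheEntry& a, const CacheEntry& b) {
                      return score(a) < score(b);   // cheapest victims first
                  });
        std::uint64_t freed = 0;
        std::vector<CacheEntry> kept;
        for (const CacheEntry& e : cache) {
            bool pinned = sticky && e.hoard_priority > 0;
            if (freed < needed && !pinned)
                freed += e.bytes;                   // evict this entry
            else
                kept.push_back(e);                  // keep it
        }
        if (freed < needed)
            return false;                           // could not make room
        cache.swap(kept);
        return true;
    }

    int main() {
        std::vector<CacheEntry> cache = {
            {"hoarded/thesis.tex", 4096,     600, 10},
            {"crawl/big-tree.iso", 1u << 30, 0,   50},
            {"crawl/photo.jpg",    2u << 20, 0,   60},
        };
        // Under pressure from the crawl, the sticky policy refuses to touch
        // the hoarded file and reports failure instead of silently evicting it.
        return make_room(cache, 3u << 30, /*sticky=*/true) ? 0 : 1;
    }

In the sticky case, that failure return is exactly where the question above arises: pop up a dialog, fail the hoard walk with an error, or upcall to application-specific code.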
But never throwing out hoarded objects makes the apparent cache size smaller for non-hoarded objects. This is similar to the problem faced by a VM system that dynamically balances physical memory between VM pages and the I/O buffer cache. The difference is that in our case we don't just face a performance penalty. We face the much more difficult problem of failure semantics and user distraction, not just for planned failures (voluntary disconnections) but also for unplanned failures (involuntary disconnections, such as those caused by RF signal loss when mobile). Usage-based insights and ideas from the Coda user community on these issues would be very helpful --- please contribute.

-- Satya

Received on 2006-06-27 08:49:39