On Wed, May 04, 2016 at 07:43:46PM -0400, Greg Troxel wrote:
> The last big thing is to make conflict repair more robust, so that
> normal people rarely lose. It's quite good already, but the last
> little bit is likely real work.

I've been thinking long and hard about this. I've pretty much been trying to get repair working more reliably ever since I joined CMU. We've had several undergraduate and master's students do project work on trying to make repair and application-specific resolvers work reliably.

During the recent break from working on Coda, I've played around with building applications using what you could call 'web technology': simple RESTful web apps, sharing state through sql/redis/nosql databases, even handling things like push updates to client 'programs' consisting of the javascript running in browsers. And there are a lot of ways to scale, by load balancing multiple web apps behind nginx or haproxy, sharding data across databases, etc.

Anyhow, it allowed me to take a step back and look at Coda from a different perspective, and the one thing that adds a lot of complexity and ultimately causes every single server-server conflict is replication.

Now optimistic replication is totally awesome, and Coda has proven that it can be made to work well enough to have a usable system for users who are not afraid to repair the occasional conflict. But my word processor doesn't want to deal with that conflict, neither does my web server, nor do I, many times, when I'm working on a deadline.

So let's look at the pros and cons of (optimistic) replication for a bit.

pro:

- Awesome concept to prove it can work, you can write papers about it.
- When every server but one crashes you can still continue working.
- When network routing messes up and different clients see a different subset of the servers, you can continue on both sides of the split.
- When something gets corrupted in a client's volume replica data structure, you can remove the replica and rebuild it by walking the tree.

con:

- Somewhat tricky fallbacks when operations cross directories (rename), or when deciding how to deal with a resolution log when a directory is removed, especially when it contains a mix of source and target of rename operations. These are still the most common reasons for manual repair failures; it is a real pain when you need to repair

    mkdir foo ; touch foo/bar ; mv foo/bar . ; rmdir foo

- Extra metadata to track what we know of other replicas: version vectors, store identifiers.
- Extra protocol messages (COP2) to distribute said knowledge.
- A special set of heuristics for the most common replica differences: a missed COP2 detected by looking for identical storeids, a missed update on one or more replicas detected by comparing version vectors, runt resolution to rebuild a replica, etc. (a rough sketch of these checks follows after this list).
- Keeping track of reintegration (operation) logs so we can merge directory operations from diverged replicas.
- A protocol that goes through 6 phases for each conflicting object to handle server resolution. This makes placing servers in geographically separate locations work poorly.
- As a result, the need to have all servers basically co-located, so we still can't handle datacenter outages or multi-site replication.
- A manual repair fallback when the heuristics and logs fail, which requires very different steps for directories (fixing log operations) and files (overwriting with the new/preferred version), and which of the two applies isn't obvious to the user when repair starts.
- The need to have the concept of a high-level 'replicated volume' and a low-level 'volume replica' on the client.
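
To make the version vector and storeid bookkeeping a bit more concrete, here is a rough sketch of the kind of per-object state and comparisons involved. This is not the actual Coda code; the names, layout and fixed replica count are simplified for illustration.

    /* Simplified illustration of replicated-volume bookkeeping; NOT the
     * real Coda structures, just a sketch of what has to be tracked per
     * object per replica and how the resolution heuristics use it. */

    #include <stdint.h>

    #define NREPLICAS 3            /* assume a triply replicated volume */

    struct storeid {
        uint32_t host;             /* client that performed the store */
        uint32_t uniquifier;       /* per-client update counter */
    };

    struct version_vector {
        uint32_t updates[NREPLICAS];  /* updates each replica knows about */
        struct storeid sid;           /* id of the last store seen here */
    };

    enum vv_relation { VV_EQUAL, VV_DOMINATES, VV_SUBMISSIVE, VV_INCONSISTENT };

    /* Heuristic: identical storeids mean both replicas saw the same last
     * update and only the COP2 message distributing the final version
     * vector was lost; the vectors can simply be merged. */
    int same_last_store(const struct version_vector *a,
                        const struct version_vector *b)
    {
        return a->sid.host == b->sid.host &&
               a->sid.uniquifier == b->sid.uniquifier;
    }

    /* Heuristic: compare version vectors to tell a replica that simply
     * missed updates (strictly dominated) from a genuine conflict. */
    enum vv_relation compare_vv(const struct version_vector *a,
                                const struct version_vector *b)
    {
        int a_greater = 0, b_greater = 0;
        for (int i = 0; i < NREPLICAS; i++) {
            if (a->updates[i] > b->updates[i]) a_greater = 1;
            if (a->updates[i] < b->updates[i]) b_greater = 1;
        }
        if (a_greater && b_greater) return VV_INCONSISTENT; /* log/manual repair */
        if (a_greater)              return VV_DOMINATES;    /* b missed updates */
        if (b_greater)              return VV_SUBMISSIVE;    /* a missed updates */
        return VV_EQUAL;
    }

All of that state, plus the COP2 traffic needed to keep it in sync, only exists because there is more than one replica; with a single replica the whole comparison degenerates to checking a single version number.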
I probably should stop here. Now, when there is no optimistic replication:

- We just need read-write and read-only volumes on clients, which probably could even be represented by a single 'readonly' flag on a single volume data structure. Turning a readonly backup snapshot back into a read-write volume may become just toggling that flag.
- If a server crashes or otherwise becomes unreachable, we still have disconnected operation on a (hoarded) cache of files.
- We only have to deal with reintegration conflicts; however, unlike server-server conflicts, they only block the local user who performed the operations, and the server-side implementation is pretty much (or should be close to) the normal reintegration path. There is also a very cheap fix if the user doesn't care about the local changes, in the form of a 'cfs purgeml'. On a headless server, automated conflict resolution could be a 'cfs checkpointml' / 'cfs purgeml' and sending the checkpointed reintegration log with the conflict to the appropriate user.
- Reduce the size of a version vector to 2 values, a single data version and the store identifier. On the wire it can be even more compact, because the client identifier part of the storeid does not change for the lifetime of a client-server connection, so it could be passed at connection setup. This in turn makes operations like ValidateAttrs more efficient because we can pack a lot more results into the response (a rough sketch follows below).
- The server can lose the functionality related to directory operation logging, resolution, and (server-server) repair.
- On the server side, things like RAID can help deal with drive failures, and everyone is making backups, right?
- More exciting: back the server data with something like Ceph's RADOS object store, which gives replication, self-healing and a lot of goodies, and have Coda provide disconnected operation/hoarding/etc.
- As servers become simpler, store their data in a replicated backend store (rados/s3), and mostly deal with applying reintegration logs and breaking callbacks, they become much easier to scale out: clients can be scattered across multiple frontends, and grabbing a volume lock/lease and callback breaks can be distributed between frontends with publish/subscribe queues.

I probably should stop here, because it is getting quite a bit pie in the sky, and with the available manpower it might just hit that magic 10-year window that Rune was talking about.

Jan
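
For concreteness, here is a rough sketch of the slimmed-down per-object version data described above. The struct and field names are made up for illustration; they are not the actual Coda or RPC2 definitions.

    /* Hypothetical version data for a non-replicated volume; names are
     * illustrative, not the actual Coda on-the-wire format. */

    #include <stdint.h>

    struct storeid {
        uint32_t host;             /* client that performed the last store */
        uint32_t uniquifier;       /* its per-connection update counter */
    };

    /* Per-object version info kept on server and client.  Compare with a
     * replicated volume, which needs one counter per replica plus the
     * storeid (see the earlier sketch). */
    struct version_info {
        uint32_t data_version;     /* bumped on every successful store */
        struct storeid last_sid;   /* detects "we already saw this store" */
    };

    /* If the host part of the storeid is agreed on at connection setup,
     * validating a cached object only needs the object identifier, the
     * cached data version and the 32-bit uniquifier, so many more
     * entries fit in a single ValidateAttrs request. */
    struct validate_entry {
        uint32_t vnode;            /* object id within the volume */
        uint32_t unique;
        uint32_t data_version;
        uint32_t store_uniquifier;
    };

Under those assumptions each validation entry shrinks from one counter per replica plus the storeid in the earlier sketch to a couple of 32-bit values, which is why far more of them can be packed into a single ValidateAttrs exchange.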